BatchMapper

Batchmapper is a command-line utility for translating large numbers of identifiers automatically. It is not very user-friendly, but it is well suited to fast, large-scale identifier translation. It can translate microarray, gene, protein or metabolite identifiers from one type to another. For example, you can translate KEGG Compound IDs to CAS numbers, or Agilent reporters to Entrez Gene IDs.

Tutorial

Here follows a step-by-step tutorial for running batchmapper. To keep things simple, I'll assume you are using Windows, and that all your files are in a folder named "batchmapper" on your desktop. The tutorial uses an example input file named myinputfile.xls, which you can download if you want to follow along.

In this example we want to translate metabolite identifiers in an Excel file from HMDB to KEGG Compound. Our input file is in Excel format and looks like this:

This is just an example. You can have more than two columns, and you don't need to provide identifier names. In principle you can use any number of columns, as long as there is at least one column containing identifiers that are recognized by batchmapper.

Step 1: Prepare your input file

Batchmapper accepts either a tab-delimited table or a simple list of identifiers in text format. It does not accept our Excel file, so we have to export it to tab-delimited text format. To do this, just click File->Save As... and choose "Tab delimited text" in the drop down box. We save it to a file named "myinputfile.txt" in the batchmapper folder on our desktop.
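As a quick sanity check on the exported file, a short Python sketch like the one below can confirm that every row really is split on tabs into the same number of columns. The helper name `count_columns` and the sample rows are illustrative assumptions; they are not part of batchmapper.

```python
import csv
import io


def count_columns(lines):
    """Return the set of column counts found in tab-delimited lines."""
    reader = csv.reader(lines, delimiter="\t")
    # Skip completely empty lines; collect the width of every other row.
    return {len(row) for row in reader if row}


# Hypothetical sample rows standing in for the contents of myinputfile.txt.
sample = io.StringIO("name A\tid-1\nname B\tid-2\n")
print(count_columns(sample))
```

A single value in the result means every row has the same number of columns. To check the real file, pass an open file instead of the sample, for example `open("myinputfile.txt", newline="")`.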

Step 2: Download and unzip bridgedb

Go to DownloadBridgeDb and download the latest version of the BridgeDb library. It includes the batchmapper script. Unzip it, then move all the unzipped files to the batchmapper folder on your desktop as well.

Step 3: Download and unzip a mapping library

The same DownloadBridgeDb page also links to single-file databases. You have to choose a database depending on the type of identifier you want to use. If you want to map microarray reporters, genes or proteins, you have to pick the database that matches the species you work with. If you want to map metabolites, there is a separate database just for that. Our case is the latter, so we choose metabolites_081205.pgdb and place it in the batchmapper folder as well.

After this last step, the contents of the batchmapper folder look like this:

Step 4: Figure out the input and output system codes

You need to tell batchmapper the type of identifier you want to map from and the type you want to map to. Batchmapper uses so-called system codes: short codes of 1 to 4 letters that each identify a certain database. For example, the system code for Entrez Gene is 'L' and for Affymetrix is 'X'.

You can get a complete list by running batchmapper with the -ls option.

Click Start->Run...

Type "cmd" <Enter>

Go to the directory where you unzipped batchmapper by typing the following commands:

cd %USERPROFILE%\Desktop
cd batchmapper

Now type

batchmapper -ls

You should see a list of supported data types and their system codes, like this:

From this list we find out that the system code for HMDB is "Ch" and for KEGG Compound (our goal) is "Ck".

Step 5: Run batchmapper

Now we are ready to run the actual batchmapper command. The command is

batchmapper
   -i <input file>
   -c <column in input file that contains identifiers>
   -is <system code used in input file>
   -o <name of output file that we want> 
   -os <the system code to map to, the goal>
   -g <name of gene or metabolite database>

Columns are counted from zero, so the first column is -c 0, the second column is -c 1, etc. The column option can be left out if you want the first column, or if there is only one column.
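To make the zero-based counting concrete, here is a minimal Python sketch that picks one column out of tab-delimited text by its index. The function name `read_column` and the sample rows are hypothetical, and the sketch only mimics the counting convention; it is not how batchmapper itself reads the file.

```python
import csv
import io


def read_column(lines, column=0):
    """Return the values in one zero-based column of tab-delimited lines."""
    reader = csv.reader(lines, delimiter="\t")
    # Rows that are too short to have the requested column are skipped.
    return [row[column] for row in reader if len(row) > column]


# Hypothetical two-column sample: names first, identifiers second.
sample = io.StringIO("name A\tid-1\nname B\tid-2\n")
print(read_column(sample, column=1))  # prints ['id-1', 'id-2']
```

With index 1 the sketch returns the second column, which corresponds to -c 1 on the command line.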

In our example, it looks like

batchmapper -i myinputfile.txt -c 1 -is Ch -o myoutputfile.txt -os Ck -g metabolites_081205.pgdb
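If you need to run many mappings, it can help to assemble the command in a script rather than typing it each time. The Python sketch below builds the same argument list from the options listed above. The helper names `build_command` and `run_batchmapper` are assumptions made for illustration, and how the batchmapper script must be launched from Python may differ per platform.

```python
import subprocess


def build_command(input_file, in_code, output_file, out_code, database, column=None):
    """Assemble the batchmapper argument list from the documented options."""
    args = ["batchmapper", "-i", input_file]
    # -c is optional: leave it out to use the first column.
    if column is not None:
        args += ["-c", str(column)]
    args += ["-is", in_code, "-o", output_file, "-os", out_code, "-g", database]
    return args


def run_batchmapper(args):
    """Run the command; assumes the batchmapper script can be found from the current folder."""
    subprocess.run(args, check=True)


command = build_command(
    "myinputfile.txt", "Ch", "myoutputfile.txt", "Ck",
    "metabolites_081205.pgdb", column=1,
)
print(" ".join(command))
```

The printed line can be compared against the command shown above before calling `run_batchmapper(command)` from inside the batchmapper folder.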

Tip: An easy way to enter the names of the input file and the metabolite database is to type the first two or three letters and then press <tab>.

After pressing enter, the command runs, generating output like this:

The result is written to a new file named myoutputfile.txt. After opening it in Excel, you can see the result looks like this.
