DATABASES & PATHWAYS 2008
· DNA, protein, biochemical, gene expression, and metabolic pathway databases and associated analysis systems are valuable tools for investigating metabolism.
· These genomics-driven approaches (‘database mining’) complement classical biochemical approaches to the metabolism of all organisms, including plants.
DNA sequence and expression information (from genomes, cDNAs, expressed sequence tags (ESTs) and their contigs, and microarray experiments) can be used to complement biochemical information in several ways:
1. Finding DNA sequences for plant enzymes. Because many enzymes are highly conserved, homology searches (e.g., with BLAST) using bacterial, yeast, or animal sequences can identify the corresponding proteins in plants, and show whether they are encoded by single genes or gene families. Searching plant genomes in this way can lead to the discovery of enzymes and pathways that plants were not known to have. Conversely, such searches can also indicate that an enzyme may not be present in plants.
The plant sequences can then be expressed heterologously (e.g., in E. coli or yeast, with or without a tag to facilitate purification), and the recombinant proteins can be characterized. This is especially useful for low-abundance or unstable enzymes, which are hard to isolate from plants in sufficient amounts for detailed study.
2. Discovering new enzymes and pathways by comparative genomics. By looking for functional linkages among genes in bacteria (gene fusions, conserved gene clusters, and co-occurrence patterns) it is possible to find enzymes and transporters that are ‘missing’ from known pathways, or to discover entirely new enzymes or pathways. Once a new bacterial enzyme has been discovered by this approach, its counterpart can be sought in plants via homology searches. Conversely, if an unknown plant enzyme has bacterial homologs, comparative genomic analysis of the latter can help predict the function of the enzyme in both groups. This is a powerful approach because bacteria share many pathways with plants.
3. Inferring organellar targeting and membrane localization. ESTs, cDNAs and genomic sequences can give information about the organellar targeting of enzymes, via their characteristic signal sequences, and about whether proteins have membrane-spanning domains and hence are likely to be located in membranes.
4. Inferring gene expression patterns from digital gene expression profiles (electronic Northerns). Differentially expressed genes can be detected from variation in the count of their cognate ESTs or from microarray experiments, and such digital gene expression profiles can provide valuable clues about the relative levels at which pathways may be operating in a tissue, a species, or in plants in general.
5. Deducing biochemical function from microarray expression data. Similar patterns of gene expression (‘co-expression’) in relation to environmental change, genetic change (e.g., knocking out or overexpressing genes), and to development can in principle point to related function.
This part of the course (‘Resourcement’) introduces web resources needed to extract the above types of information, and illustrates how this is done.
ABIM On-Line Analysis Tools http://www.up.univ-mrs.fr/~wabim/english/logligne.html An up-to-date and comprehensive links site.
NCBI http://www.ncbi.nlm.nih.gov/ Entrez nucleotide and protein databases; BLAST similarity search programs.
TIGR http://www.tigr.org/db.shtml. Annotated Arabidopsis and rice genomes. TIGR Gene Indices are an analysis of the transcribed sequences represented in the world's public EST data (contig assembly, analysis of expression patterns).
TAIR http://www.arabidopsis.org/ The Arabidopsis information resource.
ExPASy Translate Tool http://www.expasy.ch/tools/dna.html Translates a DNA sequence in all 6 frames
Multalin Sequence Alignment http://prodes.toulouse.inra.fr/multalin/multalin.html Aligns sequences, output in color. Also makes simple phylogenetic analyses.
ClustalW and Phylogenetic Trees http://saier-144-37.ucsd.edu/clustalw.html Aligns protein sequences and makes phylogenetic trees.
TMHMM http://www.cbs.dtu.dk/services/TMHMM/ Prediction of transmembrane helices in proteins
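Four sites for predicting protein localization in cells (chloroplast, mitochondrion, peroxisome, etc.) and targeting peptide cleavage sites:
TargetP http://www.cbs.dtu.dk/services/TargetP/
Predotar http://urgi.versailles.inra.fr/predotar/predotar.html
iPSORT http://hc.ims.u-tokyo.ac.jp/iPSORT/
WoLF PSORT http://wolfpsort.org/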
METABOLIC PATHWAY RESOURCES
Swiss-Prot Enzyme http://ca.expasy.org/enzyme/ Enzyme nomenclature data base (linked to SWISS-PROT protein database, BRENDA, KEGG, etc)
BRENDA http://www.brenda-enzymes.info/ Comprehensive enzyme database.
KEGG http://www.genome.ad.jp/kegg/ The Kyoto Encyclopedia of Genes and Genomes. Includes metabolic pathways, and compound structures that can be captured.
IUBMB http://www.chem.qmul.ac.uk/iubmb/ and the subsection on Reaction schemes http://www.chem.qmul.ac.uk/iubmb/enzyme/reaction/ The website of the International Union of Biochemistry and Molecular Biology – Searchable database on enzymes and enzyme nomenclature; some high-quality information on pathways, etc.
EcoSal http://www.ecosal.org/ecosal/index.jsp EcoSal, a new, continually updated Web resource based on the ASM Press publication Escherichia coli and Salmonella: Cellular and Molecular Biology. EcoSal is a comprehensive archive of knowledge on the enteric bacterial cell and an excellent source of the latest knowledge of metabolic pathways.
BioCyc, EcoCyc & MetaCyc http://BioCyc.org/ EcoCyc - Encyclopedia of E. coli Genes and Metabolism; MetaCyc - Metabolic Encyclopedia. Also computationally-derived pathway/genome databases.
AraCyc http://www.arabidopsis.org/biocyc/index.jsp Similar to BioCyc, for Arabidopsis. Software allows querying, graphical representation of pathways, and overlay of expression data on the biochemical pathway overview diagram.
KEGG, EcoCyc, MetaCyc, AraCyc, ERGO Plants, etc. have overlapping aims but each has features the others lack.
Beware! These metabolic pathway databases have several weaknesses:
· Even the best of them have omissions and errors in their pathways – so you need to check them against the literature.
· They may present a composite picture of metabolism, i.e. not all the reactions shown in a scheme necessarily occur in any one organism.
· Proteins are very often assigned functions solely based on homology - but it is not clear from the database that this is what has been done.
· To reach firm conclusions you therefore need to go to the literature to find whether a putative function has been authenticated biochemically or genetically.
· If there is no literature on the protein of interest, check that you agree with the assigned function (e.g., by doing a BLASTp search and verifying that the protein is similar to proteins whose function has been authenticated).
COMPARATIVE GENOMICS (‘PHYLOGENOMICS’) RESOURCES
SEED http://theseed.uchicago.edu/FIG/index.cgi An annotation and analysis tool, and database containing hundreds of genomes. Very useful for gene cluster analysis.
STRING http://string.embl.de/ STRING is a database of known and predicted protein-protein relationships, derived from genomic context (fusions, conserved gene clusters, co-occurrence), high throughput experiments (co-expression), and the literature. STRING quantitatively integrates data from bacteria and other organisms.
PHYDBAC http://igs-server.cnrs-mrs.fr/phydbac/ PHYDBAC displays phylogenomic profiles (fusions, co-occurrence, co-localization in genome) of bacterial protein sequences. Analyzing the annotation of a protein’s phylogenomic neighbors helps generate hypothetical functions for the query protein(s).
FusionDB http://igs-server.cnrs-mrs.fr/FusionDB/main.html (Linked to PHYDBAC) FusionDB is a database of bacterial and archaeal gene fusion events.
COGs http://www.ncbi.nlm.nih.gov/COG/ A system of gene families from complete genomes. Clusters of Orthologous Groups (COGs) were delineated by comparing protein sequences encoded in many complete genomes representing 30 major phylogenetic lineages. Each COG consists of proteins from at least 3 lineages and thus corresponds to an ancient conserved domain. Other databases use the COGs database.
MICROARRAY DATABASES AND ANALYSIS RESOURCES (mainly Arabidopsis)
Golm Transcriptome database http://csbdb.mpimp-golm.mpg.de/csbdb/dbxp/ath/ath_xpmgq.html Good tools for getting an overview of expression, and for finding co-responses.
ATTED http://www.atted.bio.titech.ac.jp/ A simple site to use to look for co-expression patterns; it shows gene networks, not just lists of correlated genes.
Botany Array Resource http://bbc.botany.utoronto.ca/ Tools for finding co-responses, electronic Northerns, and identifying elements in promoters of individual or co-regulated genes
PRIMe http://prime.psc.riken.jp/ A web-based RIKEN service for metabolomics and transcriptomics; has unique tools that are useful for metabolomics, transcriptomics and integrated analysis of different omics data.
MicrobesOnline http://www.microbesonline.org/ A comprehensive database that includes correlated gene expression in E. coli and other bacteria
EcoGene http://ecogene.org/ A rich resource on E. coli that includes Microarray data on the major changes in gene expression observed in various experiments
MapMan http://gabi.rzpd.de/projects/MapMan/#mapman_overview Displays large datasets (e.g. gene expression data from arrays and metabolome data, alone or together) onto diagrams of metabolic pathways
USING METABOLIC PATHWAY RESOURCES
• SWISS-PROT ENZYME Enzyme nomenclature database http://ca.expasy.org/enzyme/
ENZYME is a repository of information on enzyme nomenclature, with links to other databases. It describes enzymes that have been given an EC (Enzyme Commission) number, and the reactions they catalyze. It can be searched in various ways, e.g. by EC number, by common name, by substrate or product.
Example: alcohol dehydrogenase = EC 1.1.1.1 * NiceZyme view → Links to:
Biochemical pathways (E6)
BRENDA (convenient entry point)
KEGG (Kyoto University Ligand Chemical Database; maps, e.g. glycolysis)
Cloned enzymes in SwissProt (not exhaustive but curated, i.e. high quality)
• BRENDA Enzyme database http://www.brenda-enzymes.info/ BRENDA is an extensively referenced enzyme data information system; it includes data on substrate specificity, physical and kinetic characteristics, inhibitors, sources, cloning, purification etc.
Example: alcohol dehydrogenase EC 1.1.1.1
• KEGG Kyoto Encyclopedia of Genes and Genomes http://www.genome.ad.jp/kegg/
KEGG computerizes knowledge of molecular and cell biology in terms of pathways that consist of interacting molecules or genes and provides links from gene catalogs produced by genome sequencing. It covers regulatory pathways and molecular assemblies as well as metabolic pathways. Its metabolic pathway maps have links to the enzymes and compounds.
Example: KEGG PATHWAY * 1.9 Metabolism of Cofactors and Vitamins - Folate biosynthesis * Note that all enzymes (EC numbers) and intermediates are clickable, e.g. * 220.127.116.11 and its * product (structure can be captured). Note that this is a case of a composite metabolic scheme. It includes methanopterin biosynthesis (found only in methane-producing microbes) and tetrahydrobiopterin synthesis (found in animals). Note the pulldown table (top left) of folate biosynthesis enzymes in different organisms.
• IUBMB Reaction Schemes http://www.chem.qmul.ac.uk/iubmb/enzyme/reaction/
This page on the International Union of Biochemistry and Molecular Biology website has some of the highest-quality pathway information available on the web.
Example: Miscellaneous * Folate biosynthesis (early stages) * Note that enzymes (EC numbers) and connecting metabolic pathways are clickable * Compare this site with KEGG – it is clearer which reactions are in folate synthesis, which are in connecting pathways, and what is not known (e.g., the enzymes of DHNTP hydrolysis).
• EcoCyc and MetaCyc http://BioCyc.org/
EcoCyc- Encyclopedia of E. coli Genes and Metabolism: Describes the genome and biochemical machinery of E. coli. Contains annotations of all E. coli genes, and their DNA sequences, and describes all known pathways of E. coli small-molecule metabolism. Each pathway and its component reactions and enzymes have detailed annotations, and are extensively referenced.
MetaCyc - Metabolic Encyclopedia: A metabolic-pathway database that describes pathways, reactions, and enzymes of various organisms, especially microbes. MetaCyc contains the E. coli pathways of EcoCyc, plus other pathways from the literature and on-line sources, with citations to the sources of pathways.
Example: MetaCyc * Browse Pathways * Betaine biosynthesis I-V * Note that all elements in pathway are clickable.
• Thermodynamics of Enzyme-Catalyzed Reactions http://xpdb.nist.gov/enzyme_thermodynamics/
Compilation of data on the thermodynamics of enzyme-catalyzed reactions, principally the thermodynamic information needed to determine the position of equilibrium of a reaction. Entries give the literature reference; the reaction studied; the enzyme name and EC number; the method of measurement; the conditions of measurement (temperature, pH, buffer, etc.); the data; and a quality evaluation (from A = very good to D = poor).
Example: Homoserine dehydrogenase (EC 1.1.1.3) * Search With User Defined Values * enter 1.1.1.3 and homoserine * scroll down, click on reference_id links:
L-homoserine(aq) + NADP(aq) = L-aspartate 4-semialdehyde(aq) + NADPH(aq) * Keq = ([L-aspartate 4-semialdehyde] × [NADPH]) / ([L-homoserine] × [NADP]) = 6.3 × 10^-4 (i.e. homoserine formation strongly favored) * Other details * Evaluation = C.
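An equilibrium constant such as the one above can be converted into a standard free energy change with ΔG = -RT ln(Keq). The sketch below does this for Keq = 6.3 × 10^-4. The temperature of 298.15 K is an assumption made for illustration; the actual measurement conditions should be taken from the database entry.

```python
import math

R = 8.314462618  # gas constant, J/(mol*K)


def standard_free_energy_kj(keq, temperature_k=298.15):
    """Return dG = -RT ln(Keq) in kJ/mol for the reaction as written.

    temperature_k defaults to 298.15 K, which is an assumed value,
    not the condition reported for any particular database entry.
    """
    return -R * temperature_k * math.log(keq) / 1000.0


keq = 6.3e-4  # value quoted in the example above
dg = standard_free_energy_kj(keq)
print(f"dG for homoserine oxidation as written: {dg:.1f} kJ/mol")
```

The result is positive (about +18 kJ/mol under the assumed temperature), which is another way of saying that the reaction as written is unfavorable and that homoserine formation is favored.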
• ORGANELLAR TARGETING
Example: 10-Formyltetrahydrofolate deformylase (PurU) is an enzyme found in E. coli and other bacteria that hydrolyzes 10-formyltetrahydrofolate, releasing formate. The Arabidopsis genome encodes two homologs of E. coli PurU (At5g47435 and At4g17360).
>PurU gi|548645|sp|P37051|PURU_ECOLI FORMYLTETRAHYDROFOLATE DEFORMYLASE (FORMYL-FH(4) HYDROLASE)
>At5g47435 gi|18422794|ref|NP_568682.1| formyltetrahydrofolate deformylase, putative [Arabidopsis thaliana]
>At4g17360 gi|15236046|ref|NP_193467.1| formyltetrahydrofolate deformylase, putative [Arabidopsis thaliana]
Targeting predictions for the At5g47435 and At4g17360 proteins using:
TargetP: http://www.cbs.dtu.dk/services/TargetP/ Paste in both Arabidopsis sequences * Check ‘Plant’, ‘Perform cleavage site predictions’
Predotar: http://urgi.versailles.inra.fr/predotar/predotar.html Paste in both Arabidopsis sequences
iPSORT: http://hc.ims.u-tokyo.ac.jp/iPSORT/ Paste in individual Arabidopsis sequences, without header line
WoLF PSORT http://wolfpsort.org/ Paste in both Arabidopsis sequences * Check ‘Plant’, ‘Submit query’
The consensus of the prediction algorithms is that both proteins are mitochondrial. To check this, align them with the E. coli PurU sequence using Multalin http://prodes.toulouse.inra.fr/multalin/multalin.html * Alignment shows that both Arabidopsis proteins have N-terminal extensions of about 35 residues. This is a typical size for a mitochondrial targeting peptide.
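The length of an N-terminal extension can also be read off an alignment programmatically. The sketch below counts the residues of a query protein that lie upstream of the first aligned residue of a reference protein. The two aligned sequences are invented toy strings, not the real PurU or Arabidopsis sequences, and the function name is hypothetical.

```python
def n_terminal_extension(aligned_reference, aligned_query):
    """Count query residues that precede the first reference residue.

    Both arguments are rows of the same alignment, with '-' as the gap
    character, so they must have equal length.
    """
    if len(aligned_reference) != len(aligned_query):
        raise ValueError("aligned sequences must have the same length")
    # Column index of the first non-gap character in the reference row.
    first_ref_col = next(
        (i for i, c in enumerate(aligned_reference) if c != "-"),
        len(aligned_reference),
    )
    # Residues (not gaps) of the query that sit before that column.
    return sum(1 for c in aligned_query[:first_ref_col] if c != "-")


# Toy alignment rows (invented for illustration only).
bacterial_row = "------MKTLILG"
plant_row = "MAARSLMKSLVLG"
extension = n_terminal_extension(bacterial_row, plant_row)
print(f"N-terminal extension: {extension} residues")
```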
• DIGITAL GENE EXPRESSION PROFILES (ELECTRONIC or DIGITAL NORTHERNS)
There are two types of digital Northerns, based either on abundance of ESTs in libraries or on microarray data.
ESTs: In cDNA libraries from which many randomly selected clones have been sequenced, the relative abundance of cDNAs reflects the relative abundance of mRNAs, so that differentially expressed genes can be detected from variations in the counts of their cognate ESTs. See Audic, S., and Claverie, J.-M. (1997) The significance of digital gene expression profiles. Genome Res. 7: 986-995.
Microarrays: The 22K Affymetrix chip contains most Arabidopsis genes, so in principle it can be used to monitor the expression of almost all metabolic genes. However, many metabolic genes have low expression levels, and so cannot be monitored with confidence. Genes with low average expression levels tend to give large numbers of spurious co-expression matches.
mRNA abundance in general correlates broadly with protein abundance and with in-vivo metabolic fluxes. Therefore digital gene expression data can indicate which organs have a pathway and which do not, and whether a pathway is likely to be a major or minor one. Note also that primary metabolic pathways are expressed everywhere and always, and that secondary pathways by definition are not. Unexpected differences in expression may provide clues about genetic control of pathways, e.g. an enzyme whose transcript level varies more than those of others (i.e. is highly regulated, not constitutive) may be an important control point in the pathway.
Constructing an EST-based gene expression profile: Capture the protein sequence of interest, e.g., ‘histidine decarboxylase’ from tomato:
>Histidine decarboxylase – Lycopersicon esculentum (tomato)
Cognate ESTs can then be sought in dbEST or in the TIGR Gene Indices (which are much more convenient, although not all species are in this database).
1. dbEST Go to NCBI BLAST http://www.ncbi.nlm.nih.gov/BLAST/ * Select tblastn, est, species of interest (tomato – Lycopersicon esculentum) * Change number of descriptions and alignments to 1000 * Result page shows >300 hits * The hits need to be examined individually to determine whether they are exact matches to the search sequence or are hits on related proteins, and to determine the organ used to make the cDNA library * Nevertheless, it is clear that most hits come from fruit libraries.
2. TIGR Gene Indices Go to TIGR Gene Indices http://www.tigr.org/tdb/tgi/ * Select Plant Gene Indices, Tomato, BLAST * On BLAST page select tblastn, tomato * Hits several contigs (assemblies of ESTs from the same gene) – showing that there is a small gene family. Best hit (identical to search sequence) = tomato|TC153887 * Click on tomato|TC153887, scroll down, click on Expression summary button * A pre-computed Electronic Northern is available.
Total ESTs found in TC153887: 88
(Table of source libraries, with columns ‘# of ESTs’ and ‘% of library total’. Libraries listed: tomato breaker fruit; tomato red ripe fruit; tomato fruit library; normalized cDNA library from ripening tomato pericarp.)
This result shows that this protein is fruit-specific.
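Whether a difference in EST counts between two libraries is statistically meaningful can be judged with the test of Audic and Claverie (1997) cited above. Given x ESTs for a gene in a library of N1 clones, the probability of seeing y ESTs in a second library of N2 clones is p(y|x) = (N2/N1)^y (x+y)! / [x! y! (1 + N2/N1)^(x+y+1)]. The sketch below implements this formula; the EST counts and library sizes are hypothetical and are not taken from the TIGR output.

```python
import math


def audic_claverie_p(x, y, n1, n2):
    """p(y | x) for EST counts x, y drawn from libraries of size n1, n2."""
    ratio = n2 / n1
    # Work in log space so that large counts do not overflow factorials.
    log_p = (
        y * math.log(ratio)
        + math.lgamma(x + y + 1)
        - math.lgamma(x + 1)
        - math.lgamma(y + 1)
        - (x + y + 1) * math.log(1.0 + ratio)
    )
    return math.exp(log_p)


def upper_tail(x, y, n1, n2):
    """P(Y >= y | x): chance of a count at least as high as y."""
    below = sum(audic_claverie_p(x, k, n1, n2) for k in range(y))
    return max(0.0, 1.0 - below)


# Hypothetical example: 2 ESTs in a leaf library, 20 in a fruit library,
# both libraries containing 5000 sequenced clones.
p_value = upper_tail(x=2, y=20, n1=5000, n2=5000)
print(f"P(Y >= 20 | x = 2) = {p_value:.2e}")
```

A small tail probability indicates that the higher count in the second library is unlikely to be a sampling accident.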
Microarray-based gene expression profiling using the Golm Transcriptome database http://csbdb.mpimp-golm.mpg.de/csbdb/dbxp/ath/ath_xpmgq.html
For an overview of expression in different organs and in different environmental conditions: On face page, paste in one or several AGI numbers, e.g. At3g12930, At5g47190, and At2g39800
(At3g12930 is the plastid Iojap protein; At5g47190 is chloroplast ribosomal protein L19; At2g39800 is the first enzyme of proline biosynthesis)
* scroll down to graphs. Note positive correlation between At3g12930 and At5g47190. Note induction of At2g39800 by stresses.
To search for positively and negatively correlated genes, go to Transcript Co-Response, Single Gene Query, paste in At5g47190 * Select a dataset (‘Matrix’), e.g. developmental series * select an output, e.g. positive, top 100 of co-responding genes * Scroll down list of hits – note many strong correlations with other chloroplast ribosomal proteins, which associate together to form the protein complexes of the ribosome * At bottom of page note pie-chart showing predominance of metabolic enzymes related to tetrapyrrole (= chlorophyll) biosynthesis * Repeat, changing output to negative, top 100 of co-responding genes * Note disparate nature of correlated genes (which fall into many different categories in the pie chart).
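The idea behind a co-response search can be sketched as a correlation ranking: compute the correlation between the query gene's expression profile and that of every other gene, then sort. The example below uses Pearson correlation on invented expression values with placeholder gene names; the statistic and data matrix actually used by the Golm database may differ.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def co_responders(query, profiles):
    """Rank all other genes by correlation with the query, highest first."""
    scores = [
        (gene, pearson(profiles[query], values))
        for gene, values in profiles.items()
        if gene != query
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Invented expression values across four conditions (illustration only).
profiles = {
    "query_gene": [1.0, 2.0, 4.0, 8.0],
    "gene_a": [1.1, 2.1, 3.9, 8.2],
    "gene_b": [8.0, 4.0, 2.0, 1.0],
    "gene_c": [3.0, 1.0, 4.0, 2.0],
}
ranked = co_responders("query_gene", profiles)
for gene, r in ranked:
    print(f"{gene}\t{r:+.2f}")
```

Reading the list from the top gives the positively correlated genes, and from the bottom the negatively correlated ones.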
Microarray-based gene expression profiling using ATTED http://www.atted.bio.titech.ac.jp/
On face page, paste in one or several AGI numbers, e.g. At5g47190 * In the box ‘Making coexpressed gene list from coexpressed query genes’ click Execute * Click Submit * Note the targeting predictions (from TargetP and WoLF PSORT) * Click on locus At5g47190 * Note that the gene expression network and list of most highly correlated genes contain various ribosomal proteins.
• OVERLAYING (PAINTING) LARGE DATA SETS ON METABOLIC PATHWAY DIAGRAMS (THE FUTURE)
Using e.g. AraCyc or MapMan, it is becoming possible in plants to overlay microarray expression data (and proteomics and metabolomics data) directly on pathway schemes, and so to see at a glance which genes, proteins, and metabolites ‘belong’ to which pathways. For a glimpse of this future:
Go to AraCyc http://www.arabidopsis.org/biocyc/index.jsp * Click on Omics Viewer * Click on More information about the Omics Viewer * Scroll down, click on Time series gene expression animation, Sample display * Click Play button or advance frame-by-frame * Shows color-coded changes in E. coli pathway gene expression during the timecourse of a diauxic growth experiment (initial growth on glucose, then on lactose, accompanied by major changes in gene expression – whole metabolic pathways are activated or deactivated) * Note that the genes of the Krebs cycle are turned on at intermediate time points.
USING COMPARATIVE GENOMICS RESOURCES
• STRING http://string.embl.de/
STRING is a precomputed database for exploring functional relationships between proteins (clustering, fusions, co-occurrence, etc.). STRING gives an integrated confidence score for the associations it predicts. It is often the best database with which to begin a comparative genomics project.
Example 1: Enter via a gene name, view associations among proteins with that name. * FolE (GTP cyclohydrolase I, EC 3.5.4.16, the first enzyme of pterin and folate synthesis) * E. coli K12 button (results are similar but not identical using other species) * Press continue * Displays ‘Evidence View’ - different line colors represent the types of association (proximity on the chromosome, co-occurrence, co-expression, protein-protein interactions identified experimentally, etc.) * Click ‘Confidence View’ - stronger associations are represented by thicker lines * Note strong associations with FolK (an enzyme of folate synthesis) and YgcM (an enzyme of pterin synthesis) * To see more interactions, expand list in pull-down menu, e.g. to 50 interactors * In ‘Evidence View’ screen, click on bullets in the table for more information, e.g. FolK bullet in ‘Neighborhood’ - shows linkage between FolE and FolK in various diverse genomes (the more diverse the genomes, the more probable it is that the linkage represents a functional relationship) * Click to expand display to all genomes * Click on FolK ‘Gene Fusion’ – shows a fusion in Lactococcus lactis.
Example 2: Enter via a protein sequence, view associations of COG(s) to which the protein belongs. Note that operation in the COG mode tends to give more hits but of lower specificity because COGs can contain similar proteins with different biochemical functions. * Click on COGS button * Query sequence = Plant plastidial Iojap protein
* Click on Arabidopsis protein (COG0799), expand to 20 interactors * Note strong associations with COG1057 (NAD synthesis enzyme nicotinic acid mononucleotide adenylyltransferase) and with ribosomal proteins L21 and L27 (compare with the co-expression in plants with chloroplast ribosomal proteins, see above).
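To get a feel for how several weak lines of evidence can add up to a strong association, the sketch below combines per-channel scores under the assumption that the channels are independent: the combined score is 1 minus the product of (1 - score). This is a simplified illustration, not a description of STRING's actual scoring procedure, and the channel scores are invented.

```python
def combine_evidence(scores):
    """Combine independent evidence scores (each between 0 and 1).

    The combined value is the probability that at least one line of
    evidence is correct, assuming the lines are independent.
    """
    remaining = 1.0
    for score in scores:
        if not 0.0 <= score <= 1.0:
            raise ValueError("scores must lie between 0 and 1")
        remaining *= 1.0 - score
    return 1.0 - remaining


# Invented scores for one protein pair (illustration only).
evidence = {"neighborhood": 0.6, "fusion": 0.3, "co-occurrence": 0.4}
combined = combine_evidence(evidence.values())
print(f"combined score: {combined:.3f}")
```

Under this assumption the combined score is never lower than the best single channel, which is why agreement among several genomic-context signals is informative.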
• PHYDBAC http://igs-server.cnrs-mrs.fr/phydbac/
PHYDBAC is quite similar to STRING but can be accessed by keywords or KEGG pathways as well as by gene names and protein sequences.
Example 1: Enter via gene name. * Choose a bacterium – E. coli K12 * Gene name folE * Results page shows occurrence (zoom to see species names) * Click on P to see phylogenomic profile, i.e. genes with similar occurrence patterns. Note that one such gene is HPPK (folK) * Click on C to see co-localized genes, i.e. genes that are consistently near folE in different prokaryotes. These include various other fol (folate synthesis) genes * Click on F to see gene fusions – note fusion in Lactococcus lactis with HPPK.
Example 2: Enter via pathway. * Choose a bacterium – E. coli K12 * KEGG pathway – enter folate * Choose ‘folate biosynthesis’ * Results page shows all folate pathway genes (and some that may be associated with the folate pathway). * Scroll down – KEGG metabolic map * To compare a subset of these genes check them and add to cart, e.g., folE, folB, and folP * Click on ‘visualize the profiles of the genes in your cart’. Note that some organisms lack all three genes, and therefore presumably cannot synthesize folates de novo (folates are taken up from the environment in these cases). Note also that some organisms lack just folE or folB although they have folP – i.e. folE and folB are ‘missing genes’ (‘pathway holes’). In many of these cases it has now been shown that completely different enzymes mediate the missing steps in the pathway * To add another gene to the profile, e.g. glyA (serine hydroxymethyltransferase, a major folate-dependent enzyme that produces glycine from serine), enter ‘glyA’ in ‘Search for genes’, select serine hydroxymethyltransferase, and click add. Note that some bacteria lack this key gene.
This result is a simple case of ‘METABOLIC RECONSTRUCTION’ – using genomic information to deduce which metabolic reactions an organism can – or cannot – do.
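The reasoning used in the folate example can be sketched as a small presence/absence classification. For each genome, the set of pathway genes found is compared with the pathway definition: all present suggests a complete pathway, none present suggests the pathway is absent, and a partial set points to ‘pathway holes’. The organism names and gene contents below are hypothetical.

```python
PATHWAY = ("folE", "folB", "folP")  # genes compared in the example above


def classify(genes_present, pathway=PATHWAY):
    """Return (status, missing_genes) for one genome."""
    missing = [gene for gene in pathway if gene not in genes_present]
    if not missing:
        return "complete", missing
    if len(missing) == len(pathway):
        return "absent", missing
    return "pathway hole", missing


# Hypothetical genomes (illustration only).
genomes = {
    "organism_1": {"folE", "folB", "folP", "glyA"},
    "organism_2": {"glyA"},
    "organism_3": {"folP", "glyA"},
}
report = {name: classify(genes) for name, genes in genomes.items()}
for name, (status, missing) in report.items():
    print(f"{name}: {status}; missing: {', '.join(missing) or 'none'}")
```

Genomes classified as ‘pathway hole’ are the interesting ones: they are candidates for having an unrelated enzyme that carries out the missing step.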
Functional relationships between proteins can often be inferred from genomic associations between the genes that encode them: groups of genes involved in the same pathway tend to be close together (clustered) in prokaryote genomes (often in operons), to be involved in gene-fusion events, and to show similar species coverage. SEED is a versatile tool for investigating these relationships. Unlike STRING and PHYDBAC it is not rigidly precomputed; the user has more control.
To enter SEED as a user: Enter master:JaneDoe (substitute your name, and make sure that ‘master’ is all lower case), click ‘Work on subsystems’ * Now that you have entered the system as a user, click on Fig search to return to the face page
Typical starting points for a comparative genomics project are a protein sequence or a gene name. The following notes walk you through both routes.
1. Enter SEED starting from a protein sequence, e.g. the plant plastidial Iojap protein
Paste the sequence into the box for Searching DNA or Protein Sequences (in a selected organism). Select the organism whose genome you wish to search for homologs from the list near the top of the page, e.g. Escherichia coli K12 (the standard laboratory E. coli strain) * Click ‘Search for matches’ * There is one hit, peg638 (PEG= Protein Encoding Gene) * Clicking on this takes you to the Protein Page for this gene (this page is the heart of SEED) * It shows the flanking genes and their annotations * To see homologs of this gene in other bacteria, and how they cluster with other genes, scroll down to Compared Regions in SeedViewer and Click * A small set of similar genomes is displayed. Genes whose relative position is conserved in at least four other species have gray boxes (these are good candidates for genes that may be functionally related to the query gene) * Click on ‘Advanced’, expand number of regions to 400, relax both the cutoffs to 1e-02, check ‘collapse close genomes’, click ‘Update graphic’ * Regions around homologs of the query gene are displayed from hundreds of genomes. The genes are numbered in order of decreasing frequency of occurrence, 1 being the query gene, 2 being the most often clustered, 3 being the next most often etc. Again, gray boxes flag genes whose relative position is conserved in at least four species. * Note that gene 2 is annotated nicotinate-nucleotide adenylyltransferase, NadD (this is the same gene that we found above with STRING).
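The numbering of genes by how often they cluster with the query can be mimicked with a simple count. The sketch below takes, for each genome, the functional roles of the genes found near the query homolog, counts in how many genomes each role appears, and numbers the roles from 2 upward (1 being the query gene). Roles found in at least four genomes are flagged, following the threshold mentioned above. The genome names and neighborhood contents are invented, and this is not how SEED itself is implemented.

```python
from collections import Counter


def rank_neighbors(regions, min_genomes=4):
    """Rank neighboring roles by the number of genomes they occur in.

    Returns tuples of (number, role, genome_count, conserved_flag),
    numbered from 2 because number 1 is reserved for the query gene.
    """
    counts = Counter()
    for roles in regions.values():
        counts.update(set(roles))  # count each role once per genome
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [
        (number, role, n, n >= min_genomes)
        for number, (role, n) in enumerate(ordered, start=2)
    ]


# Invented neighborhoods around the query homolog (illustration only).
regions = {
    "genome_1": ["NadD", "ProA", "ProB"],
    "genome_2": ["NadD", "ProA"],
    "genome_3": ["NadD", "ProA", "hypothetical protein"],
    "genome_4": ["NadD"],
    "genome_5": ["NadD", "ProB"],
}
ranking = rank_neighbors(regions)
for number, role, n, conserved in ranking:
    flag = "conserved" if conserved else ""
    print(number, role, n, flag)
```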
2. Enter SEED starting from a gene name, e.g. folE, encoding the first enzyme of folate and pterin biosynthesis
* Go to the face page by clicking on Fig Search * Enter folE in the Searching for Genes or Functional Roles Using Text Search Pattern box, select Escherichia coli K12, click ‘Search genome selected below’ * Clicking on peg2128 takes you to the protein page. Note that the annotation for FolE has an EC number (EC 3.5.4.16) * Clicking on this takes you to a version of KEGG map 00790 (folate biosynthesis and related reactions) in which all the enzymes present in E. coli are tabulated and linked, and colored in red on the map itself * You can thus see at a glance that E. coli has all the genes for de novo folate biosynthesis.
3. Annotating genes. Many genes in SEED have already been annotated by experts, and are included in metabolic pathways (or other sets of related genes) called ‘subsystems’. However, many have not yet been annotated. As a user, you can annotate genes yourself. However, you may only do this for genes that (a) have not already been annotated by someone else (i.e. are not in an existing subsystem), and (b) are not closely related to genes that are in an existing subsystem. It is easy to tell whether a gene is in a subsystem – this information is given in the section of the protein page ‘Subsystems in Which This Protein Plays a Role’ and in summary form in the SS column of the table. To find whether a gene is closely related to one that is already in a subsystem:
* Scroll down to Similarities, set ‘Max sims’ and ‘Max expand’ to 500, select ‘Just FIG IDs (all)’, check ‘Hide aliases’ and click Similarities * A table of similar genes is displayed; the table column In Sub tells you whether each gene in the list is already in a subsystem. If genes in the list are in subsystems, you should leave everything alone.
If and only if the gene is not annotated and is not similar to genes that are annotated, you can make your own annotation from the protein page as follows:
* As an example, the gene next to folE, currently annotated ‘Putative esterase (EC 3.1.1.-)’, is neither in a subsystem nor similar to any gene that is in a subsystem, so we can annotate it * Go to its protein page from the FolE page by clicking on the link to peg 2129 * Click ‘Annotate’ underneath the table, and enter an annotation, e.g. ‘esterase, putative’, in both boxes of the form * Click ‘add annotation’ and close the box * Refresh the protein page to verify that the annotation has been changed.
There are many other ways to annotate genes in SEED, but this is the basic operation.
4. Building a subsystem. As an example, we will build a small subsystem using the Iojap gene and the genes that are associated with it, starting with nicotinate-nucleotide adenylyltransferase.
Begin by repeating the above BlastP search of the E. coli K12 genome with the plant Iojap sequence. This shows that the Iojap protein is annotated as ‘COG0799: Uncharacterized homolog of plant Iojap protein’ and that the nearby NadD gene is annotated ‘Nicotinate-nucleotide adenylyltransferase (EC 2.7.7.18)’.
Start the subsystem with these two genes. The sequence of operations is: Fig Search → Work on subsystems → Manage your subsystems → Insert name of subsystem (name it Iojap_XX where XX are the initials of your name) → Start new subsystem → copy-paste exact annotations above into ‘Functional role’ column of table, and add an abbreviation (e.g. Iojap and NadD) → In ‘Pick Genomes to Extend With’ use shift key to select all bacterial, archaeal, and eukaryote genomes → Click ‘fill’ (or ‘refill spreadsheet from scratch’) and ‘update spreadsheet’. When the subsystem screen appears, click ‘show clusters’ and ‘update spreadsheet’ and inspect the result. This confirms that Iojap is clustered with NadD in many diverse bacteria (the taxonomy of the organisms in the spreadsheet can be found by going to any protein page and clicking on ‘NCBI Taxonomy Id’).
To investigate other genes that may be associated with Iojap, go again to Compare Regions, click on ‘Compare Regions in SeedViewer’, and modify the settings as before (i.e. click on ‘Advanced’, expand number of regions to 400, relax both the cutoffs to 1e-02, check ‘collapse close genomes’, click ‘Update graphic’). Two highly ranked genes are annotated ‘Gamma-glutamyl phosphate reductase (EC 1.2.1.41)’ and ‘Glutamate 5-kinase (EC 2.7.2.11)’ – these are proA and proB, the first genes of proline synthesis; add them to your subsystem as above (get the exact annotations to copy-paste by going to Fig search and entering the EC numbers into the Searching for Genes or Functional Roles Using Text ‘Search Pattern’ box).
Finally, to ‘publish’ your spreadsheet to the SEED server, go to Fig search → Work on Subsystems → manage your subsystems → Check ‘Publish’ → Click ‘Publish marked subsystems’.
5. Using the subsystem. The subsystem spreadsheet shows that proA and proB quite often cluster with iojap and nadD in a wide range of genomes. This is the starting point for developing hypotheses about what the functional links between Iojap (of unknown function), NAD biosynthesis, and proline biosynthesis may be. Various tools, accessible from the subsystem, help in developing hypotheses. These include:
● Essential genes tool. High-throughput studies of various bacteria have systematically knocked out most or all genes to determine whether they are essential. The results have been incorporated in SEED. To determine if genes in your subsystem are essential: Underneath the spreadsheet, uncheck ‘show clusters’, in ‘color columns by each PEGs attribute’ highlight ‘Essential_Gene_Sets_Bacterial’, click ‘update spreadsheet’. Genes for which there are essentiality data show up as colored highlights (‘essential’, ‘non-essential’, and ‘undetermined’).
● Organisms’ attributes tool. Underneath the spreadsheet, the ‘color rows by each organism's attribute’ tool (which works the same way as the Essential Genes tool) provides immediate access to information on the lifestyle of the organisms in the subsystem, e.g. aerobe/anaerobe (‘Oxygen_Requirement’), pathogen/non-pathogen (‘Pathogenic’). This information can provide clues to gene function.
● Links to KEGG pathways. For enzymes that include an EC number in their annotation, the protein page is linked to one or more KEGG pathways, e.g. ProA and ProB are both linked to KEGG map 00220 (Urea cycle and metabolism of amino groups). This allows rapid exploration of metabolic relationships between genes.
● Links to other subsystems. In the protein page, ‘Subsystems in Which This Protein Plays a Role’ provides links to all the subsystems that contain that protein, and the SS column in the table shows the subsystems that contain the neighboring genes. Since some subsystems represent research hypotheses from expert SEED annotators, these subsystem links can provide up-to-the-minute ideas about gene function.
● Links to GenBank. The protein page always contains the protein sequence (click on ‘Show protein sequence’) and often has links to GenBank Sequences (which in turn are linked to the Conserved Domain database).