Vol. 15 no. 12 1999
Pages 965–973
BIOINFORMATICS
IMAGEne I: clustering and ranking of I.M.A.G.E.
cDNA clones corresponding to known genes
M. Cariaso 1,2 , P. Folta 1,∗ , M. Wagner 1 , T. Kuczmarski 1 and
G. Lennon 1,2
1 Biology
and Biotechnology Program, Lawrence Livermore National Laboratory,
Livermore, CA 94550, USA
Received on September 25, 1998; revised on May 21, 1999; accepted on June 17, 1999
Abstract
Motivation: To enhance the usefulness of the I.M.A.G.E.
Consortium (Lennon et al., 1996, Genomics, 33, 151–152)
cDNA clone collection by directed analysis and organization of their associated Expressed Sequence Tags (ESTs),
thus enabling effective mining of the immense amounts of
public cDNA information.
Results: This paper introduces the IMAGEne suite of
tools, which clusters ESTs around known genes, then
ranks each clone within a cluster. IMAGEne filters data
from known gene sequence databases and GenBank’s
EST database (Boguski and Schuler, 1995, Nature Genet.,
10, 369–371). It applies biological criteria in connection
with judicious use of the BLAST (Altschul et al., 1990,
J. Mol. Biol., 215), FASTA (Pearson and Lipman, 1988,
Proc. Natl Acad. Sci. USA, 85, 2444–2448; Pearson,
1990, Methods Enzymol., 183, 63–98; Gusfield, 1997,
Algorithms on Strings, Trees, and Sequences, Cambridge
University Press), and SIM (Huang et al., 1990, Comput.
Appl. Biosci., 6, 373–381) tools to form known gene
clusters. It then applies criteria derived from experienced
biologists to select the best representative I.M.A.G.E.
clone for a gene. The tool provides an intuitive Java
interface for query and display of the gene and its
associated clones, thus directing researchers in selecting
a clone that will best enhance their research. An important
product is a listing of clones that best represent all known
genes. The listing will be used for re-arraying clones into
minimally redundant Master Arrays. Both the listings and
Master Arrays will be made available to the public, which
will be a valuable resource to the genomic community in
furthering discovery in the area of gene function.
Availability: IMAGEne can be accessed free of charge
through the I.M.A.G.E. Consortium web page at
http://bbrp.llnl.gov/image/image.html
Contact: folta2@llnl.gov
∗ To whom correspondence should be addressed.
2 Present address: Gene Logic, Gaithersburg, MD 20878, USA.
© Oxford University Press 1999
Introduction
One goal of the I.M.A.G.E. Consortium is to make cDNA
clones publicly available to the worldwide community,
thus advancing gene discovery and functional knowledge.
To that end the Consortium manages and distributes the
world’s largest, publicly available collection of cDNA
clones. Currently the collection contains over 2.3 million
clones from over 350 cDNA libraries. Thanks largely
to the efforts of Washington University’s Genome Sequencing Center (Hillier et al., 1996), over 1.5 million
sequences have been deposited into GenBank’s dbEST
database from this collection. The IMAGEne tool analyzes these EST sequences, produces clone clusters
around known genes, and then ranks each clone within its
cluster. A sophisticated Java-based display allows the user
to mine this database for their cluster of interest. The user
can then view the alignment of the clones within a given
cluster, along with associated clone and gene information.
The highest ranked clones in each cluster will be used
to produce a non-redundant Master Array containing
the best cDNA clone that represents each known gene.
I.M.A.G.E. will make this Master Array publicly available
through their distributors. Automated procedures enable
the cluster database to be updated periodically to keep
up with both the growing number of EST sequences in
GenBank and discovered genes.
IMAGEne I is focused on clustering only I.M.A.G.E.
clones to known genes. The conclusion section will
briefly discuss the next generation of the product that will
cluster all clones within the collection, regardless of their
relationship to known genes. Other high quality
EST information may also be used.
The I.M.A.G.E. Master Array will benefit large-scale
functional discovery by providing the clones for both gene
expression and proteomic research. The IMAGEne tool
can also benefit the genomic community on a smaller scale
by allowing gene specific researchers to determine which
clones best meet their own specific needs.
Systems and methods
IMAGEne has a modular design that permits the use of
existing public methods during its initial development
and allows for later incorporation of updated or custom
approaches. This reduces the initial development time
and permits IMAGEne to stay current. Modularity is
most evident in its five principal stages: data preparation,
clustering, alignment, sorting, and display. Interaction
between stages is minimized through a pipeline approach.
Each stage takes input, processes the data, and produces
an output. Usually the input and output are collections of
files. The output from each stage is maintained to provide
a history mechanism that avoids regeneration of data when
only a later module has been modified. Figure 1 illustrates
the process flow of stages 1 and 2.
IMAGEne was implemented in Perl and Java under
Solaris 2.5. Consult the documentation for each respective program for specific information. Sources for the
programs IMAGEne uses are available as follows:
• BLAST v1.4: ftp://blast.wustl.edu/blast/executables/ (current efforts are underway to update to BLAST v2.0.8)
• FASTA v2.0u63: ftp://ftp.virginia.edu/pub/fasta/
• SIM: http://globin.cse.psu.edu
• Java v1.0.2: http://java.sun.com/
• Perl v5.002: http://www.perl.com/
Stages 1–4 utilized a 14-processor, Sun Ultra Enterprise 4000, with 14 GB main memory and 122 GB
available disk space.
Algorithms and implementation
Stage 1: data preparation
IMAGEne begins with two data sets extracted from
the National Center for Biotechnology Information (NCBI). ESTs are taken from dbEST, the
flat file EST database, available on the FTP site
(ftp://ncbi.nlm.nih.gov/genbank/). Human genes are
taken from mRNAs in Genbank. Completely redundant
entries were identified and removed (Boguski and Schuler,
1995), forming the humannr collection. These two data
sets are re-generated with each re-build. The data is
reformatted for stage 2 processing.
The EST data is known to be quite noisy (Wolfsberg and
Landsman, 1997). Identical records can be found under
multiple names, features are often mislabeled, and naming conventions vary by institution and individual. Even
spelling errors contribute to the confusion. This necessitates a standardization of annotation and formatting.
A Perl script scans the full Genbank record and performs standardization for each EST. Only human ESTs
derived from the I.M.A.G.E. Consortium clones are used,
which currently comprise over 75% of all human dbEST
sequences. Quality controls remove sequences with poor
text annotation and trim low quality portions. Each EST’s
key features (clone id, library, end, sequence) are identified, formatted into an annotated FASTA format and the
EST is accepted into IMAGEne.
The annotated FASTA format offers the compactness
of FASTA format while including additional features of
a Genbank entry. FASTA formatting has two sections,
comment and sequence. The sequence can be multi-lined
and consists of nucleotide or protein sequence separated
with new lines. Annotated FASTA uses the comment field
to pass additional information. For example IMAGEne
would include an EST’s clone id and orientation by adding
/clone=12345 and /end=5’ to the comment line. This
simple extension to the FASTA structure is very flexible.
When an application is capable of making use of the extra
data it can extract it from the comment, but when the data
is not necessary it is ignored.
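As a minimal sketch of how a downstream tool might read such records: the original scripts were written in Perl, so this Python version, the function name parse_annotated_fasta, the /library key and the sample record are all illustrative assumptions rather than IMAGEne code.

```python
import re

# Hypothetical pattern for "/key=value" annotations on a FASTA comment line.
ANNOTATION = re.compile(r"/(\w+)=(\S+)")


def parse_annotated_fasta(text):
    """Return a list of (identifier, annotations, sequence) tuples."""
    records = []
    ident, annotations, chunks = None, {}, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new comment line closes the previous record, if any.
            if ident is not None:
                records.append((ident, annotations, "".join(chunks)))
            fields = line[1:].split()
            ident = fields[0] if fields else ""
            annotations = dict(ANNOTATION.findall(line))
            chunks = []
        else:
            # Sequence may span several lines; join them later.
            chunks.append(line)
    if ident is not None:
        records.append((ident, annotations, "".join(chunks)))
    return records


# Made-up record for illustration only.
example = """>EST0001 /clone=12345 /end=5' /library=demo
ACGTACGTAC
GGTTAACC
"""

for ident, notes, seq in parse_annotated_fasta(example):
    print(ident, notes.get("clone"), notes.get("end"), len(seq))
```

A tool that does not need the annotations can simply ignore the dictionary; a plain FASTA record without any /key=value fields yields an empty one.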
The resulting file of ESTs is indexed by Genbank accession number and I.M.A.G.E. clone id. With BLAST’s
‘pressdb’ command a BLAST formatted copy is generated
as well. These multiple views of the data are the primary
result of the preparation stage.
Stage 2: clustering
Each known gene forms the basis of a cluster. IMAGEne
compares each gene to all the ESTs. ESTs that show a high
degree of similarity are noted. The clones from which they
originated are identified, and all ESTs from those clones
are provisionally accepted as members of a gene’s cluster.
In this way clusters are composed of entire clones rather
than individual ESTs.
The method of comparison is a hybrid approach that
combines the best features of BLAST and FASTA. Both
are popular public tools that use heuristic methods to
search for local similarity between sequences. For our
purposes FASTA appears to be a better indicator of
the agreement of an EST with a gene; however, it is
approximately an order of magnitude slower than BLAST.
To balance speed with quality IMAGEne uses both
techniques. A known gene is compared with BLAST to the
BLAST formatted EST database. Matches are noted and
are extracted from the indexed FASTA database created
during Stage 1. These candidates are copied into a temporary database that is examined by FASTA. ESTs that
continue to match well are accepted. The clones from
which they were derived are determined and all ESTs from
these clones are accepted into the cluster. This process is
illustrated in Figure 1.
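The two-pass idea (a fast screen over all ESTs, a slower check on the survivors, then expansion to whole clones) can be sketched as follows. This is only an illustration: the two scoring functions are toy stand-ins for BLAST and FASTA, and the names, cutoffs and sequences are assumptions, not values from IMAGEne.

```python
from collections import defaultdict


def shared_kmers(gene, est, k=4):
    """Toy stand-in for the fast screen: count EST k-mers also found in the gene."""
    kmers = {gene[i:i + k] for i in range(len(gene) - k + 1)}
    return sum(1 for i in range(len(est) - k + 1) if est[i:i + k] in kmers)


def longest_common_substring(a, b):
    """Toy stand-in for the slower, stricter comparison."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0] * (len(b) + 1)
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def cluster_gene(gene_seq, ests, fast_score, slow_score, fast_cutoff, slow_cutoff):
    """Return the clone ids provisionally placed in one gene's cluster.

    ests is a list of (est_id, clone_id, sequence) tuples.
    """
    # Pass 1: cheap screen over every EST.
    candidates = [e for e in ests if fast_score(gene_seq, e[2]) >= fast_cutoff]
    # Pass 2: stricter check, run only on the candidates.
    accepted = [e for e in candidates if slow_score(gene_seq, e[2]) >= slow_cutoff]
    # Clusters are built from whole clones, not individual ESTs.
    return {clone_id for _, clone_id, _ in accepted}


def ests_for_clones(ests, clone_ids):
    """Collect every EST belonging to the accepted clones."""
    by_clone = defaultdict(list)
    for est_id, clone_id, _ in ests:
        if clone_id in clone_ids:
            by_clone[clone_id].append(est_id)
    return dict(by_clone)


# Made-up data: e2 does not match the gene but shares clone c1 with e1.
gene = "ACGTACGGTTAACCGGATCCA"
ests = [
    ("e1", "c1", "ACGTACGGTTAA"),
    ("e2", "c1", "TTTTTTTTTTTT"),
    ("e3", "c2", "GGGGCCCCGGGG"),
    ("e4", "c3", "CCGGATCCA"),
]
clones = cluster_gene(gene, ests, shared_kmers, longest_common_substring, 3, 8)
print(sorted(clones), ests_for_clones(ests, clones))
```

In the toy data, EST e2 fails both comparisons yet still enters the cluster because its clone was accepted through e1, which mirrors the clone-level membership described above.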
The clustering method requires two cutoff scores. The
first filters which BLAST matches become candidates, the
second determines which FASTA matches are accepted
into the cluster. For the BLAST score we found no benefit
from being selective. Almost regardless of the BLAST
sensitivity used, a relatively small database is defined.
Therefore we use the default BLAST cutoff score (limit
of 50 high scoring segment pairs matches) to reduce the
number of missed ESTs.
Fig. 1. IMAGEne process flow: data preparation and clustering stages.
The FASTA cutoff is quite significant and directly
affects which ESTs remain in the cluster. We had initially
assumed that when looking at enough scores there would
be some natural separation between those that should
and should not be clustered. Instead we found an almost
perfect linear relationship between the cutoff score and the
average cluster size. After careful inspection of numerous
alignments we selected a FASTA opt score of 1300 as the
most reliable indicator of the dividing line between good
and poor matches. The FASTA analysis averaged less than
one second per cluster.
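The relationship between cutoff and cluster size can be tabulated with a short sweep like the one below. It is a sketch only: the scores are invented for illustration and are not IMAGEne data, and average_cluster_size is a hypothetical helper name.

```python
def average_cluster_size(scores_by_gene, cutoff):
    """Mean number of ESTs per gene whose score meets the cutoff."""
    if not scores_by_gene:
        return 0.0
    sizes = [
        sum(1 for score in scores if score >= cutoff)
        for scores in scores_by_gene.values()
    ]
    return sum(sizes) / len(sizes)


# Invented opt-style scores for three genes (illustration only).
toy_scores = {
    "geneA": [2100, 1500, 1350, 900, 400],
    "geneB": [1800, 1250, 700],
    "geneC": [300],
}

for cutoff in (500, 900, 1300, 1700):
    print(cutoff, round(average_cluster_size(toy_scores, cutoff), 2))
```

Plotting such a table over many genes is one way to look for a natural break in the scores; when none appears, as reported above, the cutoff has to be chosen by inspecting alignments.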
Stage 3: alignment
Since the previous phase uses FASTA, which is capable
of generating alignments, some people may question the
use of a distinct alignment phase. However, FASTA uses
a heuristic to locate regions of high similarity. While
this improves speed, it may not find the optimal overall
match. Since IMAGEne generates alignments only once,
it is worth spending extra time to locate the best match.
In the future IMAGEne may use a different clustering
method. If that method doesn’t create alignments, an
explicit alignment stage would be required.
To create alignments IMAGEne uses SIM. This program
locally aligns two sequences and returns the coordinates of
the N regions that match well. SIM’s output is in a format
that provides seamless integration into IMAGEne.
A Perl script uses SIM to compare each EST to its
associated gene. Where necessary the matching regions
are extended to ensure full coverage of the EST (see
Figure 2). These alignments are then constructed into a
multiple alignment table, in which the known gene serves
as a consensus sequence.
Fig. 2. IMAGEne utilizes SIM during the alignment phase to extend
regions to overlap or gap as necessary. Gaps are padded with
hyphens and the resulting alignment is shown in the Java window.
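The hyphen padding can be illustrated with a deliberately simplified sketch. It assumes a single ungapped matching region per EST, given as 0-based start positions in the EST and in the gene; SIM's real output and the extension rules used by IMAGEne are richer than this, and padded_row is a hypothetical name.

```python
def padded_row(gene_len, est_seq, est_start, gene_start):
    """Lay an EST onto gene coordinates, padding with hyphens.

    est_start and gene_start are the 0-based starts of one ungapped
    matching region. The region is extended in both directions so the
    whole EST is shown, clipped to the ends of the gene.
    """
    offset = gene_start - est_start  # gene position of the first EST base
    row = ["-"] * gene_len
    for i, base in enumerate(est_seq):
        pos = offset + i
        if 0 <= pos < gene_len:
            row[pos] = base
    return "".join(row)


# Made-up gene and EST; the matching core starts at EST base 1, gene base 6.
gene = "ACGTACGGTTAACC"
print(gene)
print(padded_row(len(gene), "CGGTTA", 1, 6))
```

Stacking one such row per EST under the gene sequence gives a table in which the gene acts as the reference line.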
Stage 4: sorting
Since IMAGEne is intended as a tool for re-arraying, its
ability to pick the best clone is crucial. Rather than pick
an individual clone we sort all clones within a cluster by
preference; the highest one is our tentative candidate for
the Master Array.
The sorting criteria, in order of importance, are as
follows:
• Full coding region coverage was defined as having the
5’ EST precede or overlap the 5’ UTR/CDS junction
and having the 3’ EST overlap or be downstream of
the 3’ CDS/UTR junction. Clones with full coverage
of the coding segment are preferred over those with
only partial coverage.
• Clones from more reliable libraries are favored. We
have divided all libraries into four categories (1–4,
least to most reliable). Each category is ranked, but
libraries within a category are considered equal. New
libraries are ranked in category 2 until enough sequence has been obtained to determine an appropriate
ranking. A list of the ranked libraries is available
from the IMAGEne web page, which will be updated
as new libraries are produced. Preference is based
on statistics collected from the Merck Gene Index
(Aaronson et al., 1996) and our own lab experience.
• When the above criteria fail to resolve the ordering,
clone length is the deciding factor. When possible,
clone length is calculated by reference to the known
sequence, otherwise the EST length is used.
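The three criteria above amount to a lexicographic sort. A minimal sketch follows, assuming gene-oriented coordinates for the EST spans; the Clone fields, function names and sample values are hypothetical, and the coverage test is one reading of the definition given in the first criterion.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Clone:
    clone_id: str
    library_category: int                    # 1 (least) to 4 (most reliable)
    length: int                              # calculated clone length or EST length
    est5: Optional[Tuple[int, int]] = None   # span of the 5' EST on the gene
    est3: Optional[Tuple[int, int]] = None   # span of the 3' EST on the gene


def covers_coding_region(clone, cds_start, cds_end):
    """True if the 5' EST starts at or before the CDS start and the
    3' EST ends at or after the CDS end."""
    if clone.est5 is None or clone.est3 is None:
        return False
    return clone.est5[0] <= cds_start and clone.est3[1] >= cds_end


def rank_clones(clones, cds_start, cds_end):
    """Sort by coverage, then library category, then length (best first)."""
    return sorted(
        clones,
        key=lambda c: (
            covers_coding_region(c, cds_start, cds_end),
            c.library_category,
            c.length,
        ),
        reverse=True,
    )


# Made-up cluster for a gene whose coding region spans 100..900.
cluster = [
    Clone("A", 4, 800, est5=(150, 500), est3=(600, 1000)),
    Clone("B", 2, 1200, est5=(20, 400), est3=(700, 1100)),
    Clone("C", 3, 1100, est5=(50, 450), est3=(650, 950)),
    Clone("D", 4, 900, est3=(500, 1000)),
]
print([c.clone_id for c in rank_clones(cluster, 100, 900)])
```

Because the key is a tuple, a later criterion only matters when all earlier ones tie, which matches the stated order of importance.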
A master listing and a candidate gold listing are generated for public use. The master listing contains the top
ranked clone for each populated, known gene cluster. The
candidate gold listing is a subset of the master list, containing only clones that cover the coding region. These
text-based listings can be obtained from the /pub/image
directory via our anonymous ftp site at imagex.llnl.gov.
The IMAGEne version number from which it originated
designates each file.
Stage 5: display
A web-based user interface was developed in HTML and
Java. It is accessible through the I.M.A.G.E. Consortium’s
web page or directly at http://bbrp.llnl.gov/imagene/bin/
search. From the IMAGEne initial page, searches for
clusters can be initiated based on the Genbank accession
number of the gene or its keywords, an I.M.A.G.E. clone
ID, a Genbank accession number of an EST, or a sequence
comparison. An example of a query on ‘interleukin’ as a
keyword is displayed in Figure 3.
All initial queries return a table containing information
on each cluster that matches the search criteria. When
searching by sequence this table will be ranked by
similarity. Each row of the results table contains the
gene id for that cluster, a description of the gene, and
the number of full coding and partial length clones
contained within that cluster. The gene id is linked to its
detailed cluster display. The clusters returned from the
‘interleukin’ query are provided in Figure 4.
The detailed cluster display provides a tabular description of each clone on the top of the page and the
alignments of the clones or ESTs with the gene on the
bottom. Figure 5 provides the detailed cluster display
for gene M15330. The header of the table contains a
description of the known gene and a link to its Genbank
record. The clones within the table are ordered by their
ranking, with the top ranked clone listed first. Each row
contains the I.M.A.G.E. clone identifier that is linked to
all associated Genbank entries; an indicator of full or
partial coverage of the coding region; the library from
which the clone was derived; the calculated clone length;
and the number of other genes with which the clone is associated.
This last column is used to alert users when a clone occurs
in more than one cluster. A non-zero entry in this field will
be hyper-linked to a results table, formatted as the table
described above, which contains all clusters of which this
clone is a member.
A Java applet is used to display the alignments at the
bottom of the page. The gene sequence is displayed at
the top in red text, with the coding sequence underlined
in blue. When an alignment is viewed ‘by clone’, the
I.M.A.G.E. clone ids are given on the left in ranked order
as the table above. The clones that span the coding region
are highlighted in both the table and the alignment for
ease of identification. These features allow the user to
easily link the tabular information with the alignments.
Scrollbars are available to view the alignment of all
clones or ESTs. A special feature of the scrollbars is the
transparent ability to slide column or row information
under their respective column (i.e. gene sequence) header
or row (i.e. clone id) header.
Documentation on how to use IMAGEne, frequently
asked questions (FAQ), release notes, and help are
available from links at the top of the page. Release notes
include a version identifier (IMAGEne and GenBank),
release date, a description of all modifications since the
previous version, and a hot link to a limited number of
older generations of IMAGEne.
Discussion
Results
Results in this section reflect IMAGEne version 1.3,
which is based on Genbank release 108 and the NCBI’s
humannr collection developed on 10/1/98. Note that new
versions are released approximately every 2 months, so
these statistics will not be current at time of print. Of the
830 724 human ESTs derived from I.M.A.G.E. cDNAs,
49 203 were of questionable quality and were discarded.
As a result, Stage 1 used 781 521 ESTs.
Data preparation required approximately 1 h of elapsed
time, clustering required about 18 h, and alignment and
sorting required 104 h. One-time post processing on the
web server required about 1 h. Display of the results
is performed interactively via the web. Currently the
dbEST database is updated every 2–3 months. Complete
IMAGEne database rebuilds will be performed with each
of these major releases.
Fig. 3. IMAGEne web display: query page.
The humannr collection contained 7368 known genes,
each of which is the basis for a cluster. 2132 clusters
contain clones covering the full coding region. Of the
remaining clusters, 4645 have partial length clone representatives, i.e. clones that cover only a part of the coding
segment. The remaining 591 clusters are empty, meaning
that no ESTs are sufficiently similar to the gene, and thus
no I.M.A.G.E. clone is available.
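The counts reported above are internally consistent, which a few lines of arithmetic confirm. The snippet uses only figures quoted in this section; the variable names are illustrative.

```python
# Figures quoted in the Results text for IMAGEne version 1.3.
total_ests = 830_724
discarded = 49_203
used_in_stage1 = total_ests - discarded

known_genes = 7368
clusters = {"full coding": 2132, "partial length": 4645, "empty": 591}

print("ESTs used in Stage 1:", used_in_stage1)
print("clusters accounted for:", sum(clusters.values()), "of", known_genes)
for label, count in clusters.items():
    print(f"{label}: {count} ({100 * count / known_genes:.1f}%)")
```

Roughly 29% of the known genes have a full coding clone, about 63% have only partial length clones, and about 8% have no I.M.A.G.E. clone at all.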
Analysis
Analysis of these results has found that gene clusters
with full clones covering the entire coding region average
1580 bases in length, while genes with only partial
length clones average 3063 bases in length. This strongly
suggests that the current methods for cDNA clone
construction are insufficient to reliably produce clones
long enough to fully represent many genes. Resolution
of this size bias is now a major goal of the I.M.A.G.E.
Consortium.
The known gene listing used to generate the IMAGEne
cluster database, NCBI’s humannr, was discovered to
be missing many genes of interest and contained some
redundant genes. The collection also contained pseudogenes and fusion proteins, which are not relevant to our
goals. Consistency was also an issue, with the same gene
disappearing from the list from build to build. In the
future, this collection will be modified or replaced to better
represent a complete and consistent, non-redundant set
of true genes. NCBI’s RefSeq project may help to build
this foundation (http://www.ncbi.nlm.nih.gov/LocusLink/refseq.html).
Fig. 4. IMAGEne web display: partial listing of all clusters matching the query on keyword/gene ‘interleukin’.
Repeats were not masked from this version of IMAGEne. As a result ESTs with significant repeats would get
clustered to several genes. A gene with significant repeat
regions would attract numerous clones. Efforts are underway to mask the repeats before sequence comparison.
Figure 6 summarizes the sizes of the IMAGEne I clusters; few clusters are empty or very large. The
empty clusters appear to accurately reflect the I.M.A.G.E.
collection’s coverage of known genes. It is believed
that empty clusters may be derived primarily from
low-abundance and/or highly tissue-specific transcripts.
Methods are being targeted to obtain these clones. The
largest clusters often have many clones in common and
seem to reflect a small number of well-conserved gene
families.
Fig. 5. IMAGEne web display: listing and display of all I.M.A.G.E. clones from D11086 gene.
The IMAGEne method allows the same clone to be a
member of different clusters (see Figure 7), which can
occur for a number of reasons. First, if an EST is taken
from a highly conserved gene family it is possible for
that clone to be placed in clusters for all the genes that
contain the conserved sequence. As an example, clones
taken from one of the highly conserved MHC genes are
likely to cluster with other MHC genes. Secondly, since
humannr contains some redundant entries, a
clone can be placed in all corresponding clusters. Also,
chimeric clones created by either biological or sequencing
artifacts will have 5’ ends differing from 3’ ends, and
our approach may place such clones in both clusters.
Alternative splicing could also create two clusters with
many clones in common. Repeat sequences, such as Alu
repeats, also cause this phenomenon. The next version of
IMAGEne will screen out such repeat sequences.
While the ranking criteria resulted in a ranking that fit a
general need, it was determined during beta testing that
not every researcher would choose to use the same set
of criteria. In future enhancements to this product, the
user will be given the opportunity to weight the criteria
according to their own needs and re-rank a cluster on the
fly.
Currently calculated clone length is a major ranking
criterion. As stated before, clone length is calculated in
reference to the known gene if both ends have been
sequenced, and by EST size alone if only one end has been
sequenced. Over the last few years sequencing of the 5’
end of the cDNA clones has greatly diminished, resulting
in a higher percentage of clones with only a single end
being sequenced. Also, when IMAGEne is extended to
cluster ESTs that are not associated with known genes, a
reference sequence will be unavailable for use. When no
other information is known, precise clone size estimation
is not possible. Washington University does provide an
approximated clone size for the clones they sequence. This
may be useful for clones that are not associated with a
known gene and not sequenced from both ends. Other
ranking criteria are also being considered.
It is worthwhile to consider the clones that did not
fit into any cluster. Of the 593 515 clones accepted
into IMAGEne, only 199 793 were placed into clusters.
The remaining 66% are either yet to be characterized,
significantly different alternate forms of known genes, or
chimeric clones.
Figure 8 tracks three builds over a nine-month period.
As the number of known genes increases,
so does their I.M.A.G.E. clone representation by both full
coding and partial length clones. Yet the number of known
genes without an I.M.A.G.E. clone representation remains
consistent, resulting in a decrease in percentage. It is the
highest priority of the I.M.A.G.E. consortium to provide
representative clones for each gene. Working closely with
the library providers, this goal is considered to be within
reach.
Fig. 6. IMAGEne I: number of clones per cluster.
Fig. 7. IMAGEne I: overlap in clusters.
Conclusion
The I.M.A.G.E. cDNA collection is an extremely valuable
public resource. Enriching the collection by eliminating
redundancy and providing a current ‘best’ clone for each
known gene is the main benefit of IMAGEne. IMAGEne
mines the dbEST databases and humannr collections for
relevant information and adds internal knowledge of the
I.M.A.G.E. cDNA collection to aid in the direction of
gene research. With minimal effort researchers can make
their experiments more effective by better utilizing this
wonderful resource. It is an obligation of the I.M.A.G.E.
consortium to maximize the usefulness of this collection.
The two tangible products produced by this effort
include:
• a web-based tool that aids public selection of appropriate I.M.A.G.E. cDNA clones, and
• listings (master and candidate gold) of the best
I.M.A.G.E. clones for each known gene.
These resources are in use by many in the community,
and since I.M.A.G.E. is the main public source of cDNA
clones, usage is expected to grow significantly.
The master and candidate gold listings are directing I.M.A.G.E. re-arraying at Lawrence Livermore
National Laboratory to produce Master and Gold
Arrays. It is expected that these arrays will be
used for micro-array and chip expression studies.
These arrays will be, as all I.M.A.G.E. clones are
now, available through I.M.A.G.E. distributors at
http://www-bio.llnl.gov/image/idistributors.html.
There are some basic differences between the clustering method
of IMAGEne and other clustering products, such as
NCBI’s Unigene and the TIGR Gene Index. IMAGEne I:
− is based solely on I.M.A.G.E. clones, thus allowing
for internal knowledge of the clones to be used in the
algorithm, providing assurance of clone availability,
and increasing usefulness of the collection;
− puts looser constraints on membership into clusters,
thus providing identification of alternatively spliced
members;
− allows a clone to be in more than one cluster, thus
providing links to gene families;
− limits the clustering to known genes only.
Work has already begun on IMAGEne II, the next
generation clustering product that will expand these
capabilities to cluster all clones within the I.M.A.G.E.
collection. The IMAGEne method can also easily be
used to develop similar information to support projects
involving genes of other species, such as the WashU-HHMI Mouse EST Project (http://genome.wustl.edu/est/mouse_esthmpg.html) or the University of Iowa Rat EST
Project (http://ratest.uiowa.edu/).
Fig. 8. IMAGEne I: Summary of cluster evolution.
Acknowledgments
Thanks to Christa Prange for tireless hours of product testing and analysis on IMAGEne, Matt Torres for careful review of this manuscript, Bill Ladd for patient discussions
on the statistical analysis of the clusters, and Greg Schuler
for creation of the humannr database.
This work was performed by Lawrence Livermore
National Laboratory (LLNL) under the auspices of U.S.
Department Of Energy, Contract No. W-7405-Eng-48.
Reviewed by LLNL Technical Information Department,
report UCRL-JC-131909.
References
Aaronson,J.S., Eckman,B. et al. (1996) Toward the development of
a gene index to the human genome: an assessment of the nature of
high-throughput EST sequence data. Genome Res., 6, 829–845.
Altschul,S.F., Gish,W. et al. (1990) A basic local alignment search
tool. J. Mol. Biol., 215.
Boguski,M.S. and Schuler,G.D. (1995) Establishing a human transcript map. Nature Genet., 10, 369–371.
Boguski,M.S., Lowe,T.M. et al. (1993) dbEST-database for ‘expressed sequence tags’. Nature Genet., 4, 332–333.
Gusfield,D. (1997) Algorithms on Strings, Trees, and Sequences.
Cambridge University Press, Cambridge.
Hillier,L., Lennon,G. et al. (1996) Generation and analysis of
280000 human expressed sequence tags. Genome Res., 6, 807–
828.
Huang,X., Hardison,R.C. et al. (1990) A space-efficient algorithm
for local similarities. Comput. Appl. Biosci., 6, 373–381.
Lennon,G., Auffray,C. et al. (1996) The I.M.A.G.E. Consortium: an
integrated molecular analysis of genomes and their expression.
Genomics, 33, 151–152.
Pearson,W.R. (1990) Rapid and sensitive sequence comparison with
FASTP and FASTA. Methods Enzymol., 183, 63–98.
Pearson,W.R. and Lipman,D. (1988) Improved tools for biological
sequence comparison. Proc. Natl Acad. Sci. USA, 85, 2444–
2448.
Wolfsberg,T.G. and Landsman,D. (1997) A comparison of expressed sequence tags (ESTs) to human genomic sequences. Nucleic Acids Res., 25, 1626–1632.