European Patent Office
Principal Directorate Tools/Documentation
Trilateral Technical conference
Washington May 17-20, 2004
Subject:
Study on Single Nucleotide Polymorphism (SNP) / Haplotype
Databases and Search Tools for Examiners (Status Report)
Project:
DR2
Author:
EPO
EPO Responsible: Gérard Giroud
Prepared by:
Ana Richart de la Torre & Stéphane Nauche
1. Introduction
2. Patent applications disclosing SNPs/Haplotypes
2.1 Filing figures
2.2 Concealed workload
2.3 Origin of Patent applications
2.4 Source of claimed SNPs
3. Provision of SNP information in Patent Applications: Representation of SNPs in
Sequence Listings
4. Representation of SNPs and Haplotypes in General Sequence Databases
4.1 Example of variation release
4.2 Example of haplotype release
5. Specific Databases
Conclusion
1. Introduction
Over the last few years pharmacogenetics, the study of how variation in human genes leads to
variation in our response to drugs, has gained great importance. This, together with the
"Human Genome Project", which focuses on determining the sequence of the chemical base pairs
that make up human DNA, has led to the discovery of multiple variations of genomic DNA, such
as single nucleotide polymorphisms (SNPs) (1) and haplotypes (2), which are combinations of
SNPs. These discoveries have sparked great interest in patent protection, resulting in an
increase in the number of patent applications claiming SNPs and haplotypes as disease markers,
as well as corresponding methods of use (figure 1).
Furthermore, patent applications disclosing SNPs and haplotypes claim hundreds or thousands of
related nucleic acid molecules. Claims to SNPs and haplotypes therefore present special search
and examination challenges.
Currently, examiners do not have optimized tools to perform exhaustive searches. The analysis
of claimed SNPs or haplotypes therefore comprises multiple searches, in which a plurality of
sequences and keywords needs to be entered for a search to be complete. The search for prior
art relating to the specific use of SNPs or haplotypes should instead be done using automated
tools, whereby a limited number of sequences and keywords suffices for the whole search
(Trilateral Project DR2).
The EPO presented a report on SNP/Haplotype databases at the last trilateral working
group. The Trilateral Offices agreed to study Single Nucleotide Polymorphism
(SNP)/Haplotype Databases and Search Tools for Examiners further.
The EPO provided a questionnaire to its partners to identify the most relevant SNP/Haplotype
databases. This document provides a further status report.
(1) SNPs are single base pair positions in genomic DNA at which different sequence alternatives
(alleles) exist in normal individuals in some population(s), wherein the least frequent allele has
an abundance of 1% or greater. "SNPs what they are & what might they tell us", Anthony
Brookes Research Group, available at http://www.cgr.ki.se/cgb/groups/brookes/snps.htm.
(2) The term 'haplotype' refers to a combination of SNPs on a chromosome, usually within the
context of a particular gene. "Haplotype identification" at
http://www.variagenics.com/articles/haplotypeid.html.
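The two definitions in notes (1) and (2) can be illustrated with a minimal Python sketch. It assumes that alleles observed at one position are available as a simple list of letters and that a chromosome is described by a mapping from SNP identifier to allele; the function names and the identifiers snp1 to snp3 are hypothetical and are not taken from any database discussed in this report.

```python
from collections import Counter


def allele_frequencies(alleles):
    """Return {allele: frequency} for the alleles observed at one position."""
    counts = Counter(alleles)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


def is_snp(alleles, threshold=0.01):
    """Apply the definition in note (1): at least two alleles exist and the
    least frequent allele has an abundance of 1% or greater."""
    freqs = allele_frequencies(alleles)
    return len(freqs) >= 2 and min(freqs.values()) >= threshold


def haplotype(chromosome_alleles, snp_ids):
    """Apply the definition in note (2): the combination of alleles carried
    by one chromosome at the listed SNP positions."""
    return tuple(chromosome_alleles[snp_id] for snp_id in snp_ids)


if __name__ == "__main__":
    # Hypothetical sample: 98 chromosomes carry A, 2 carry G at this position.
    print(is_snp(["A"] * 98 + ["G"] * 2))
    # Hypothetical chromosome with three typed positions.
    print(haplotype({"snp1": "A", "snp2": "T", "snp3": "C"}, ["snp1", "snp3"]))
```

The 1% threshold is kept as a parameter because it is part of the definition quoted in note (1), not a property of the data.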
2. Patent applications disclosing SNPs/Haplotypes
2.1 Filing figures
As a consequence of the rapid development of research in this field, the number of patent
applications has also increased during the last few years. This is reflected in the analysis of
published and incoming applications disclosing SNPs and haplotypes recorded in the WPI,
EPODOC and DOSYS databases, as shown in figure 1.
Figure 1. Filing SNPs at trilateral Offices (relative growth; vertical scale 0 to 1400)
250 journals.
Also contains links to unpublished mutation data available in online public
locus-specific mutation databases (LSDBs).
LINKS (Xref)
The records are cross-referenced to different LSDBs. Other useful links to
core database home pages are provided.
SRS-EMBL (Availability)
NO
QUERY SEQUENCES
NO
dbSNP
CONTENTS
NCBI's SNP database, reporting data from several species.
URL
http://www.ncbi.nih.gov/SNP/
SPECIES COVERAGE
Covers human, rodents, pig, cattle species and plant species, amongst many
others.
ENTRIES
11,805,698 RefSNPs
REDUNDANCY
no answer
UPDATES
every 4-8 weeks
SOURCES
SNPs derived from ~300 sources. The major contributors to the database
are laboratories associated with the National Human Genome Research
Institute (NHGRI) grants program.
LINKS (Xref)
GenBank eventually
SRS-EMBL (Availability)
NO
QUERY SEQUENCES
NO
Human Genome Variation Database (HGVbase)
CONTENTS
Summarizes all known variations in the human genome, facilitating
genotype-phenotype association analyses that explore how SNPs and other
sequence variations may influence phenotypes.
URL
http://hgvbase.cgb.ki.se/
SPECIES COVERAGE
Human
ENTRIES
2,859,130 records (99% reporting SNPs)
REDUNDANCY
non redundant
UPDATES
Last update on 23-July-2003.
This activity has ceased for the moment due to limitations of funding.
HGVbase is presently more focused on a phenotype-genotype project.
SOURCES
SNPs derived from nearly 800 sources. HGVbase data is harvested (with
permission) or submitted from all major public genome databases and
extracted from published literature. Individual or bulk submissions from
research groups are also received.
LINKS (Xref)
EMBL, Ensembl, GenBank, dbSNP, OMIM, PubMed, PolyPhen.
SRS-EMBL (Availability)
YES
QUERY SEQUENCES
Direct DNA sequence searches use the BLAST program. It is possible to
enter a query sequence up to 25,000 bases in raw format as well as DNA
sequences of the allelic variations plus their flanking domains.
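As an illustration of the query format described for HGVbase, the following Python sketch builds one raw query sequence per allele by joining the allele to its flanking sequences. It is only a sketch under stated assumptions: the function name, the flank lengths and the example sequences are hypothetical, and the 25,000-base limit is simply the figure quoted in the record above. It does not call BLAST or any HGVbase interface.

```python
MAX_QUERY_LENGTH = 25_000  # upper limit quoted in the HGVbase record above


def allele_queries(left_flank, alleles, right_flank):
    """Return {allele: raw query sequence}, one entry per allelic variation.

    Each query is the left flank, the allele and the right flank joined
    together, in upper case and restricted to the letters A, C, G and T.
    """
    queries = {}
    for allele in alleles:
        sequence = (left_flank + allele + right_flank).upper()
        if set(sequence) - set("ACGT"):
            raise ValueError(f"unexpected characters in query for allele {allele!r}")
        if len(sequence) > MAX_QUERY_LENGTH:
            raise ValueError(f"query for allele {allele!r} exceeds {MAX_QUERY_LENGTH} bases")
        queries[allele.upper()] = sequence
    return queries


if __name__ == "__main__":
    # Hypothetical A/G polymorphism with short, made-up flanking sequences.
    for allele, sequence in allele_queries("ACGT", ["A", "G"], "TTGA").items():
        print(allele, sequence)
```

Generating both allelic variants from one set of flanks is one way an examiner's tool could reduce the number of sequences that have to be entered by hand.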
ALFRED
CONTENTS
Focused on allele frequencies, comprising DNA polymorphisms and other
sequence variations, sufficiently defined and studied in at least 6 human
populations.
URL
http://alfred.med.yale.edu/alfred/index.asp
SPECIES COVERAGE
Human
ENTRIES
932 polymorphisms, including SNPs, STRPs, VNTRs, INDELs, and
Haplotypes.
It's estimated that at least half of them are SNPs.
REDUNDANCY
Some redundancy
UPDATES
Daily
SOURCES
Data from the published literature, and directly submitted from researchers.
LINKS (Xref)
PubMed, GenBank, dbSNP, OMIM, GDB, CHLC, CEPH, and LSDBs.
SRS-EMBL (Availability)
NO
QUERY SEQUENCES
NO
The SNP Consortium Ltd
CONTENTS
Database reporting more than 1.8 million single nucleotide
polymorphisms.
URL
http://snp.cshl.org/
SPECIES COVERAGE
Human
ENTRIES
~1.8 million SNPs.
REDUNDANCY
no answer
UPDATES
Last update on October 24th, 2002: final ~400,000 previously unreleased
TSC SNPs made available via website. These SNPs have also been
submitted to dbSNP.
SOURCES
TSC allele frequency/genotype project member laboratories (Celera,
Motorola, Sanger, WICGR)
LINKS (Xref)
no answer
SRS-EMBL (Availability)
NO
QUERY SEQUENCES
NO
JSNP database
CONTENTS
A database of common gene variations in the Japanese population.
URL
http://snp.ims.u-tokyo.ac.jp/
SPECIES COVERAGE
Human
ENTRIES
195,059 SNPs; 84,566 SNPs with allele frequency
REDUNDANCY
non-redundant
UPDATES
bimonthly
SOURCES
NCBI, Laboratory for Genotyping, the SNP Research Center, the Institute of
Physical and Chemical Research (RIKEN), JBIC, dbSNP, UniSTS,
UniGene, Model mRNA, RefSeq.
LINKS (Xref)
Search through HOWDY (human organized whole genome database)
linked to GenBank, OMIM, dbSNP, GDB.
Search by Blast SNP, linked to dbSNP, GenBank, and HGVbase.
SRS-EMBL (Availability)
NO
QUERY SEQUENCES
YES, they use BLAT (BLAST-Like Alignment Tool) search against NCBI
build 34 of the human genome.
HAP™ database
CONTENTS
Database reporting both SNPs and gene-based haplotypes, discovered in
pharmaceutically relevant genes. Includes information for ~9,000 genes
reporting frequency values in each population group.
It's a subscription-based database.
URL
http://www.dna.com/products_services/hapdatabase.html
SPECIES COVERAGE
Human
ENTRIES
Approx. 180,000 unique SNPs.
Approx. 180,000 unique haplotypes.
REDUNDANCY
non-redundant
UPDATES
Varies according to the needs of internal research programmes.
SOURCES
Data generated by re-sequencing of 93 individuals from various ethnic
backgrounds (Genaissance Pharmaceuticals research groups).
Data from the two major public databases: dbSNP and HGVbase.
LINKS (Xref)
Public databases for comparison purposes.
SRS-EMBL (Availability)
NO
QUERY SEQUENCES
Possibly; the database uses the BLAST algorithm.
OMIM
CONTENTS
OMIM, Online Mendelian Inheritance in Man. This database is a catalog of
human genes and genetic disorders. The database contains textual
information and references. It also contains links to MEDLINE and other
databases.
URL
http://www3.ncbi.nlm.nih.gov/omim/
SPECIES COVERAGE
Human
ENTRIES
15 325
REDUNDANCY
No
UPDATES
once per month
SOURCES
NCBI
LINKS (Xref)
LSDBs, MEDLINE, Entrez, related resources at NCBI.
SRS-EMBL (Availability)
Yes
QUERY SEQUENCES
NO
Celera Human SNP reference database
CONTENTS
Reports mapped genetic variations to help further the understanding of the
genetic basis of disease. The polymorphisms in the Celera database are
correlated with genes, gene structure, conserved and regulatory regions,
protein changes and disease.
It's a subscription-based database.
URL
http://www.celeradiscoverysystem.com/index.cfm
SPECIES COVERAGE
Human
ENTRIES
3.5 million mapped genetic variations
REDUNDANCY
non-redundant
UPDATES
4 times a year.
SOURCES
Celera discovery system, OMIM, TSC, dbSNP, HGVbase, HGMDB
LINKS (Xref)
OMIM, TSC, dbSNP, HGVbase, HGMDB
SRS-EMBL (Availability)
NO
QUERY SEQUENCES
YES, NCBI BLAST version 2.2.5.
Celera Mouse SNP reference database
CONTENTS
It reports over 3 million non-redundantly mapped variations distributed
throughout the mouse genome. These variations are correlated with gene
structure and protein changes to aid in the identification or validation of
potential disease genes.
It's a subscription-based database.
URL
http://www.celeradiscoverysystem.com/index.cfm
SPECIES COVERAGE
Mouse strains (129X1/SvJ, DBA/2J, A/J, C57BL/6J, 129S1/SvImJ).
Human-mouse syntenic gene regions?
ENTRIES
Over 3.1 million mouse SNPs
REDUNDANCY
non-redundant
UPDATES
Annotation corrections every 4 months; new SNPs are registered
every 1-2 years.
SOURCES
Celera discovery system for the following four strains of mouse:
129S1/SvImJ, 129X1/SvJ, A/J and DBA/2J.
C57BL/6J data imported from the publicly available sequence.
LINKS (Xref)
They do not cross-reference to other sources.
SRS-EMBL (Availability)
NO
QUERY SEQUENCES
It's possible to BLAST a query sequence and retrieve the corresponding
SNP IDs.
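The records above share the same fields, so they lend themselves to a small structured catalogue that a search tool could filter. The Python sketch below encodes only the SRS-EMBL and QUERY SEQUENCES answers as read from the complete records in this section; the class name, field names and the use of None for an answer that is not clearly stated (the HAP entry) are assumptions made for illustration, not part of any office's tooling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DatabaseRecord:
    name: str
    srs_embl: bool                    # SRS-EMBL (Availability)
    query_sequences: Optional[bool]   # None = answer not clearly stated
    search_tool: str = ""             # tool named in the record, if any


# Values transcribed from the records in this section (assumed reading).
CATALOGUE = [
    DatabaseRecord("dbSNP", False, False),
    DatabaseRecord("HGVbase", True, True, "BLAST"),
    DatabaseRecord("ALFRED", False, False),
    DatabaseRecord("The SNP Consortium", False, False),
    DatabaseRecord("JSNP", False, True, "BLAT"),
    DatabaseRecord("HAP", False, None, "BLAST"),
    DatabaseRecord("OMIM", True, False),
    DatabaseRecord("Celera Human SNP", False, True, "BLAST"),
    DatabaseRecord("Celera Mouse SNP", False, True, "BLAST"),
]


def sequence_searchable(records):
    """Names of databases whose record clearly states sequence queries are possible."""
    return [r.name for r in records if r.query_sequences is True]


def available_in_srs(records):
    """Names of databases reported as available through SRS-EMBL."""
    return [r.name for r in records if r.srs_embl]


if __name__ == "__main__":
    print(sequence_searchable(CATALOGUE))
    print(available_in_srs(CATALOGUE))
```

Keeping the uncertain HAP answer as None rather than True or False avoids overstating what the questionnaire response actually says.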
Integrated Information Databases
GeneCards
URL
http://bioinfo.weizmann.ac.il/cards/
CONTENTS
GeneCards is a database of human genes, their products and their involvement
in diseases.
SPECIES COVERAGE
Human
SOURCES (SNPs information)
dbSNP
LINKS (SNPs repositories)
dbSNP
HOWDY™
URL
http://gdb.jst.go.jp/HOWDY/
CONTENTS
HOWDY is a database system to retrieve human genome information from most of
the important public data sources in the world.
SPECIES COVERAGE
Human
SOURCES (SNPs information)
dbSNP, JSNP
LINKS (SNPs repositories)
dbSNP, JSNP
Minor SNP repositories
GeneSNPs
URL
http://www.genome.utah.edu/genesnps/
CONTENTS
Integrates gene, sequence and polymorphism data into individually annotated
gene models. The human genes included are related to DNA repair and cell
cycle pathways; these genes are thought to play a role in susceptibility to
environmental exposure.
COVERAGE
Human
Leelab SNP database
URL
http://www.bioinformatics.ucla.edu/snp/
CONTENTS
Leelab has developed programs such as PHRAP, BRO and POA to identify SNPs in
coding regions (cSNPs) from publicly available expressed sequence tag (EST)
databases. All data has been deposited in dbSNP.
COVERAGE
Human EST data
rSNP guide
URL
http://util.bionet.nsc.ru/databases/rsnp.html
CONTENTS
SNPs in regulatory gene regions and their interaction with nuclear proteins.
COVERAGE
Human
topoSNP
URL
http://gila.bioengr.uic.edu/snp/toposnp/
CONTENTS
Visualization of non-synonymous SNPs (nsSNPs). Online resource for analyzing
nsSNPs that can be mapped onto known 3D structures of proteins.
COVERAGE
Human