Information Discovery and Retrieval Tools
Michael T. Frame
U.S. Geological Survey
Center for Biological Informatics
Mail Stop No. 302
12201 Sunrise Valley Drive
Reston, VA 22092, USA
mike_frame@usgs.gov
Abstract
Due to the rapid growth of electronically accessible content from the Internet, there is a corresponding
increase in demand for information of all types from a number of diverse users. Although the World-Wide Web
presents tremendous opportunities to users for access to this wealth of information, the quantity of that
information can be overwhelming. The user who attempts to find information can become confounded by the
sheer volume of data and information returned as “pertinent” to his/her need. In addition, current awareness
becomes an obstacle, as variations in search engine crawls of the Web, as well as the user’s own ability to
keep up with frequent queries to multiple search tools, can prevent timely access to and knowledge of pertinent
information. This session will focus on the various Internet search engines and directories, and on how to
improve the user experience through techniques such as metadata, meta-search engines, subject-specific
search tools, and other developing technologies.
1.0. Background
Ever since the World-Wide Web’s beginnings in the early 1990s, the amount of information available on it
has steadily increased.
It is estimated that over 1 billion web pages exist on the World-Wide Web
today. As expected, this number continues to grow, although at a much slower and, some say, more
controlled rate. This growth has also led the community of casual and advanced users to consider
alternative means of finding information.
As the information content has grown on the World-Wide Web, so too has the need for improved tools and
products to aid users in this discovery of information. Several tools perform basically the same function, but
may differ slightly in their methods and results. These differences stem primarily from vendor-specific
interpretations of World-Wide Web concepts such as spam, spider/crawler configuration, and collection size. All of this leads to
industry estimates that less than 20% of the entire content of the World-Wide Web is available to the typical
user (World-Wide Web Consortium 2002). This paper investigates various terminologies and provides simple
techniques users can perform to improve their search experiences on the World-Wide Web.
Paper presented at the RTO IMC Lecture Series on “Electronic Information Management for PfP Nations”,
held in Vilnius, Lithuania, 24-26 September 2002, and published in RTO-EN-026.
2. Basic Terminology
2.1. What do Internet search engines really see?
From a user’s perspective, as shown in Figure 1, users often simply enter a term in a simple search box and
wait for results. They are oblivious to what the computer or system is doing. This is the way it should be.
If users have to worry about how an Internet search engine is configured or what it expects, then most likely
the search engine user interface needs to be redesigned or another product selected. Users have too many
other things to do, whether at work or home, to concern themselves with learning the various idiosyncrasies of
each Internet search engine.
Figure 1. Typical User Search
However, what the user often does not realize is that Internet search engines primarily read the underlying
document codes or “metatags” within a web document. Metatags are document tags or properties that are
often stored within the Header of an HTML document. Figure 2 below describes a typical view that an
Internet search engine would see.
Figure 2. Typical Internet document as viewed by search engines
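As a rough illustration of this "search engine view", the sketch below reads only the Title and the named metatags from the Header of an HTML document and ignores the visible page body. It uses Python's standard html.parser module. The sample page, the class name MetaTagReader, and the function read_metatags are invented for this example and are not taken from any particular search engine.

```python
from html.parser import HTMLParser


class MetaTagReader(HTMLParser):
    """Collects the <title> text and <meta name=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = attrs.get("name")
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and name:
            # Metatag names are case-insensitive, so store them in lower case.
            self.meta[name.lower()] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


# Hypothetical page, invented for illustration only.
SAMPLE_PAGE = """<html><head>
<title>West Nile Virus: Wildlife Impacts</title>
<meta name="Keywords" content="west nile virus, birds, wildlife">
<meta name="Description" content="Effects of West Nile Virus on wildlife.">
</head><body><p>Visible page text.</p></body></html>"""


def read_metatags(html_text):
    """Return (title, {metatag name: content}) for one HTML document."""
    reader = MetaTagReader()
    reader.feed(html_text)
    return reader.title.strip(), reader.meta
```

Calling read_metatags(SAMPLE_PAGE) returns the title together with a dictionary holding the keywords and description entries, which is approximately the summary a harvester would have to work with if it relied on the Header alone.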
2.2. What is Spam?
“Spam” is a term you often hear thrown about on the World-Wide Web today. Spam is not just a popular
Hawaiian luncheon meat anymore. Understanding what spam is and is not is very important in understanding
how search engines on the WWW discover and display information to users. Spam is considered to be
anything that a software developer or HTML creator does to try to misrepresent his or her content to a search engine.
In today’s web environment content creators jockey for position on Internet search engines results/hits lists
and often resort to categorizing their sites in ways that may not truly represent the content or overall purpose.
This is considered spamming a search engine crawler or data harvester. Tricks commonly employed by web
content creators include placing keywords within the Header section of an HTML document that have
nothing to do with their site, or creating HTML pages that appear blank because their text is white, so that
users don’t see the content but a search engine can. Internet search engines are wise to these tricks, which is
why it is often difficult for content producers and developers who have truthful content, and who are trying to
do a good job of making it available, to understand what an Internet search engine expects and gives
preference to.
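The first trick, keywords that have nothing to do with the site, can be illustrated with a simple consistency check. The sketch below measures what fraction of the declared keywords actually occur in the visible page text. This is only an assumed heuristic written for this example; the paper does not describe how any real search engine detects spam, and the function name keyword_overlap is hypothetical.

```python
import re


def _normalise(text):
    """Lower-case the text and reduce it to words separated by single spaces."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_overlap(keywords, body_text):
    """Return the fraction of comma-separated keywords found in the body text.

    A value near 0.0 suggests the keywords do not describe the page.
    """
    terms = [_normalise(k) for k in keywords.split(",")]
    terms = [t for t in terms if t]
    if not terms:
        return 0.0
    # Pad with spaces so that only whole words (or whole phrases) match.
    body = f" {_normalise(body_text)} "
    found = sum(1 for t in terms if f" {t} " in body)
    return found / len(terms)
```

A page about West Nile Virus whose keywords are "free, lottery, birds" would score one third under this measure, while a page whose keywords all appear in its text would score 1.0.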
2.3. The basic Internet search engine model
Internet search engines on the WWW “harvest” data from publicly available web sites. This harvesting or
gathering of summary information (usually items such as URL, keywords, X number of characters from the
full-text of the site) to a central point is done with spiders and/or crawlers. Spiders and crawlers are simply
automated jobs or processes that run from an Internet search engine provider’s server and scour the WWW for
content. This content is then made available through the Internet search engine providers’ central index.
Figure 3 below demonstrates this process.
Figure 3. Basic Internet search engine harvesting model
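The harvesting model can be sketched in a few lines. In the example below a small dictionary stands in for the publicly available web sites, so that the code runs without network access; a real spider would fetch each page over HTTP. The URLs, the page contents, the 40-character summary length, and the function names harvest and search are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical "web": URL -> (keywords, full text of the page).
PAGES = {
    "http://example.org/birds": (
        ["birds", "wildlife"],
        "Field notes on bird populations and their habitats across the region.",
    ),
    "http://example.org/virus": (
        ["west nile virus", "wildlife"],
        "Overview of West Nile Virus and its impacts on wildlife.",
    ),
}

SUMMARY_CHARS = 40  # the "X number of characters" kept from the full text


def harvest(pages, summary_chars=SUMMARY_CHARS):
    """Visit every page and build summary records plus a central keyword index."""
    records = {}
    index = defaultdict(set)
    for url, (keywords, text) in pages.items():
        records[url] = {
            "url": url,
            "keywords": keywords,
            "summary": text[:summary_chars],
        }
        for keyword in keywords:
            index[keyword.lower()].add(url)
    return records, index


def search(index, term):
    """Look a term up in the central index; the pages themselves are not re-read."""
    return sorted(index.get(term.lower(), set()))
```

The point of the design is that a user query touches only the central index, not the original sites, so whatever the spider failed to harvest cannot be found.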
2.4. What are Metatags and why are they important?
Embedding metatags within the HTML of your Web site not only promotes higher ranking, and thus
better retrieval, of your site by many of the major search engines, but also provides a foundation for future
information retrieval and discovery on the Web. The algorithms used by search engines constantly change;
however, the presence of metatags on your pages can often make a dramatic difference in enabling users to
find your information. Remember, too, that as more sites make use of metatags, an integrated system develops
in which users who locate your site through a search engine are likely to explore other related sites within the
WWW.
The tables below describe both standard metatags and discipline-specific metatags (in this example, for
biological information), all of which can be implemented on Web sites. Some tags are required by search engines,
while others are optional, depending upon the scope and context of the page(s) under development. Additional
metatag requirements may be added as retrieval tools become more sophisticated. Fortunately, the creation
and editing of metatags is a quick and simple process, thanks to the development of metatag software which
can rapidly generate tags selected by a content provider across designated pages, directories, or an entire site.
The metatags in Table 1 below are all supported in HTML 3.0 or above. If a web site is dynamically
created, the metatags described below can simply be generated automatically from a
database dump or export.
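For a dynamically created site, generating the tags from exported data might look like the following minimal sketch. The record layout (a dictionary with title, author, keywords, description, and language fields) is an assumption made for this example and does not reflect any specific database schema; the function name render_metatags is likewise hypothetical.

```python
from html import escape

# Field names assumed for this example; a real export may use different ones.
META_FIELDS = ("author", "keywords", "description", "language")


def render_metatags(record):
    """Render one exported record as the lines of an HTML Header section."""
    lines = [f"<title>{escape(record['title'])}</title>"]
    for name in META_FIELDS:
        value = record.get(name)
        if value:  # skip fields that are missing or empty in the export
            # escape() also converts quotation marks, keeping the attribute valid.
            lines.append(f'<meta name="{name}" content="{escape(value)}">')
    return "\n".join(lines)


# Hypothetical exported row, invented for illustration.
ROW = {
    "title": "West Nile Virus: Wildlife Impacts",
    "author": "Example Author",
    "keywords": "west nile virus, birds",
    "description": 'Effects of "WNV" on wildlife.',
    "language": "",
}
```

Passing ROW to render_metatags yields a title line followed by one meta line per non-empty field, ready to be placed in the Header of the generated page.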
Table 1. Recommended Metatags

Metatag: Author
Definition: The Author tag contains the name of the content provider (not the Webmaster/programmer).

Metatag: Title
Definition: Even though the Title tag is not considered a true metatag, it is critical in search engines’
ranking algorithms, and provides users with general information about your page. Search engines’
results/hit lists also display the Title tag. Up to 80 characters can be contained within this tag.
Format & Sample Value: West Nile Virus: Wildlife Impacts NBII
** please maintain this format when naming your pages **

Metatag: Keywords
Definition: Keywords are probably the most important metatag that a Web site manager can include. Up to
1000 characters can be contained within this tag. Your keyword contents should include the basic tags at
left, plus all terms relevant to your site and particular sub-sections. Include several generic terms that
apply to your entire node, plus terms specific to various sub-directories and pages. Try to think of as many
synonyms for your terms as you can. Note that you need to include term variations (e.g. bird, birds, birding,
birdwatcher), as the search engines do not employ stemming when parsing keywords. Spelling counts! Use
terms found within the page contents to boost relevancy rankings.
Format & Sample Value: place these standard keywords AFTER your page-specific keywords

Metatag: Page Description
Definition: The Description tag is used by search engines to display information about your page and to
index its contents. Up to 200 characters can be contained within this tag. The description often determines
whether the searcher will choose to view your page. Make the description relevant to the particular
sub-section or page; don’t rely on one generic description for all pages on your site. Use keyword tag terms
in your description to boost term relevancy rankings.

Metatag: Language
Definition: Even though most content on the web is in English, the Language tag adds value to your Web
site, helping users limit search engine retrieval to a particular language.

Metatag: Classification
Definition: The Classification tag is often used by a number of the Web search engines when you register
your site and/or when your site is indexed so that your site can be classified with other similar sites.
Typical values include: “Government, Science, Education, etc.”.

Metatag: Ratings/PICS
Definition: The Ratings and PICS tags are used by Internet providers and search engines to limit access to a
particular page. Often this is used to restrict access to “Mature Audience Only” pages for children using the
Internet. Typical values include: “General, Restricted, Mature, Safe for Kids”, etc. Because filters are
becoming more common within retrieval tools and browsers, or as added software, these tools may
arbitrarily block your site if the tag is not implemented.
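Two of the rules stated in Table 1 lend themselves to an automated check: the character limits (80 for Title, 1000 for Keywords, 200 for Description) and the instruction to place standard keywords after page-specific ones. The sketch below applies both. The function names and the sample terms are hypothetical, and the limits are simply those quoted in the table, not values confirmed against any particular search engine.

```python
# Character limits as quoted in Table 1.
LIMITS = {"title": 80, "keywords": 1000, "description": 200}


def check_lengths(tags, limits=LIMITS):
    """Return {tag name: number of excess characters} for tags over their limit."""
    return {
        name: len(tags[name]) - limit
        for name, limit in limits.items()
        if len(tags.get(name, "")) > limit
    }


def build_keywords(page_terms, standard_terms):
    """Join page-specific terms first, then standard terms, dropping repeats.

    Variations such as "bird" and "birds" are kept as separate terms, because
    the table notes that search engines do not apply stemming to keywords.
    """
    seen = set()
    ordered = []
    for term in list(page_terms) + list(standard_terms):
        cleaned = term.strip()
        key = cleaned.lower()
        if key and key not in seen:
            seen.add(key)
            ordered.append(cleaned)
    return ", ".join(ordered)
```

For example, build_keywords(["bird", "birds", "birding"], ["biology", "Birds"]) keeps the three page-specific variations in front and appends only "biology", since "Birds" repeats an earlier term apart from capitalisation.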
Table 2 below describes the unique or custom metatags for a domain-specific organization. In this case, these
custom metatags are relevant to categorizing, displaying, and delivering biological data and information.
Table 2. Domain Specific Metatags (Custom tags)

Metatag: Species Scientific Name
Definition: The Scientific Name of a particular Species on the web page being classified. NBII Partners are
strongly encouraged to utilize the Integrated Taxonomic Information System (ITIS)
(http://www.itis.usda.gov/plantproj/itis/index.html) as its basis for completing this information.

Metatag: Species Common Name
Definition: The Common Name of a particular Species on the Web page being classified. The Common Name is
extremely important to both expert and novice users for finding information about a particular species.
ITIS is a source for completing this meta-tag.

Metatag: Organization