
Deep Web: The Database Challenges

The Deep Web Library Guide discusses why the Deep Web exists and what it contains. The Guide also provides tools for searching the Deep Web, along with resources for further information.

Web Crawlers

Recall that the Deep Web consists of all the information on the World Wide Web that cannot be located by general-purpose search engines (Google, Yahoo, Bing, etc.). Why can't search engines locate some web pages? Read on:

When you perform a Web search, the results you get back are only those web pages the search engine "knows about." How does the engine know about these pages? It learns of them in two main ways:

Sometimes the page's creator submits its web address directly to the engine. Far more often, the engine's web crawler has crawled the page.

Each general-purpose engine has its own web crawler. These crawlers visit pages on the World Wide Web, automatically following one link after another.

After a crawler visits a page, it submits the text on that page to an indexing program. The resulting index of words is stored in a database.  Each search engine has its own indexing program and index.
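
To make the idea concrete, here is a minimal sketch in Python of a crawler feeding an indexing program. Everything in it (the page texts, the addresses) is invented for illustration; real crawlers fetch live pages over HTTP and index billions of documents.

    from collections import defaultdict

    # A stand-in for the live Web: each "page" has some text and links.
    pages = {
        "a.example.com": ("painting sculpture museum", ["b.example.com"]),
        "b.example.com": ("museum hours tickets", []),
    }

    index = defaultdict(set)           # word -> set of pages containing it
    to_visit, seen = ["a.example.com"], set()

    while to_visit:                    # follow one link after another
        url = to_visit.pop()
        if url in seen:
            continue
        seen.add(url)
        text, links = pages[url]
        for word in text.split():      # submit the page text to the indexer
            index[word].add(url)
        to_visit.extend(links)

    print(sorted(index["museum"]))     # ['a.example.com', 'b.example.com']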

When you use a general-purpose search engine, you are searching not the Web itself but this huge index of words. Your search terms are compared against the index, and the engine returns the crawled websites that match your requirements. For instance, you may want sites that contain all of your search terms, or that contain certain terms but exclude others.
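
Continuing the sketch, a search runs against that stored index rather than the Web itself. The small index below is hand-built so the example stands alone; the include/exclude logic mirrors the matching just described, minus the ranking a real engine would add.

    # word -> pages containing it (a tiny hand-built index for illustration)
    index = {
        "museum":  {"a.example.com", "b.example.com"},
        "tickets": {"b.example.com"},
    }

    def search(required, excluded=()):
        # Pages containing every required term...
        results = set.intersection(*(index[w] for w in required))
        # ...minus pages containing any excluded term.
        for w in excluded:
            results -= index.get(w, set())
        return results

    print(search(["museum"], excluded=["tickets"]))   # {'a.example.com'}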

If for some reason the crawler for a particular search engine hasn't visited and indexed a web page, that page will not show up in any of the search engine's results.

The Database Challenges

Challenge 1

If you've ever looked for books in a library catalog or shopped for items on eBay, you've searched a database on the Web. The Web includes a huge number of such databases.

To search a database, the user has to input keywords into a search form.

Doing so is a problem for web crawlers, because to extract and index the information in a database, a crawler has to "know" what types of keywords will work best. For instance, it needs to input art-related keywords (and not, say, engineering-related terms) to get results from an art database. How does a crawler determine what kinds of keywords it should input into a given database? This question has proven to be a major challenge.
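
A small simulation may make the difficulty clearer. Below, a Python dictionary stands in for an art database hidden behind a search form (all titles and terms are invented). A crawler that guesses an engineering term gets nothing back, so it has nothing to index.

    # Simulated art database behind a search form; data invented.
    art_db = {
        "impressionism": ["Monet: Water Lilies", "Renoir: Luncheon of the Boating Party"],
        "sculpture":     ["Rodin: The Thinker"],
    }

    def search_form(keyword):
        # A bad guess returns an empty result page.
        return art_db.get(keyword, [])

    for guess in ["impressionism", "turbine"]:
        print(guess, "->", search_form(guess))
    # impressionism -> ['Monet: Water Lilies', 'Renoir: Luncheon of the Boating Party']
    # turbine -> []   (an engineering term surfaces nothing here)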

Towards a Solution

Google has started to overcome this problem. Through advanced programming, Googlebot, the Google crawler, can now figure out which search terms will work best in a given database. Rather than inputting terms "blindly," Googlebot tailors its terms to the database it is currently searching. The content can then be indexed.*
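
The cited papers describe this as "surfacing" form content. Very roughly, and greatly simplified here, the crawler probes a form with candidate terms, keeps the ones that actually return results, and then indexes those result pages like any other.

    # Greatly simplified sketch of probing a form with candidate terms.
    # The database and vocabulary are invented; see the cited papers
    # for how Googlebot actually selects its probes.
    art_db = {
        "impressionism": ["Monet: Water Lilies"],
        "sculpture":     ["Rodin: The Thinker"],
    }

    candidates = ["impressionism", "turbine", "sculpture", "gearbox"]
    productive = [w for w in candidates if art_db.get(w)]
    print(productive)   # ['impressionism', 'sculpture']

    # Result pages returned for the productive terms can then be fed to
    # the same indexing program used for ordinary web pages.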

Challenge 2

Web crawlers also cannot enter fee-based or password-protected websites. Among these are the STLCC Libraries' subscription-based databases.

 

* For more information on crawlers, forms, and databases, see:

Madhavan, Jayant, and Alon Halevy. "Crawling Through HTML Forms." Google Webmaster Central Blog. Google, 11 Apr. 2008. Web. 13 Nov. 2009.

Madhavan, Jayant, et al. "Google's Deep Web Crawl." Proceedings of the VLDB Endowment 1.1 (2008): 1243-1252. Web. 13 Nov. 2009.
