Remember that the Deep Web consists of all information on the World Wide Web which cannot be located by general-purpose search engines (Google, Yahoo, Bing, etc.). Why are there some web pages that search engines can't locate?
When you perform a Web search, the results you get back are only those web pages which the search engine "knows about." How does the engine know about these pages? Primarily through programs called crawlers (also known as spiders), which discover pages by following links from one page to the next.
After a crawler visits a page, it submits the text on that page to an indexing program. The resulting index of words is stored in a database. Each search engine has its own indexing program and index.
When you use a general-purpose search engine, you are searching not the Web itself, but this huge index of words. Your search terms are compared against the index, which returns the crawled web pages that match your search requirements. For instance, you may want pages which contain all of your search terms, or which contain certain terms but exclude others.
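The index-then-search process described above can be sketched in a few lines of Python. This is a toy model only; real search engine indexes handle stemming, ranking, and billions of pages, and the page data and function names here are hypothetical.

```python
# Toy sketch of an inverted index: the word-to-pages lookup table
# that a search engine builds from crawled pages. Illustrative only.

def build_index(pages):
    """Map each word to the set of page URLs containing it."""
    index = {}
    for url, text in pages.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(url)
    return index

def search(index, include, exclude=()):
    """Return pages containing all `include` terms and none of `exclude`."""
    results = set.intersection(*(index.get(w, set()) for w in include))
    for w in exclude:
        results -= index.get(w, set())
    return results

# Two pretend crawled pages (hypothetical URLs and text).
pages = {
    "a.example/art": "impressionist painting history",
    "b.example/eng": "bridge engineering history",
}
index = build_index(pages)
print(search(index, ["history"], exclude=["engineering"]))  # → {'a.example/art'}
```

Note that `search` never touches the pages themselves, only the prebuilt index: this is why a page the crawler never visited can never appear in the results.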
If for some reason the crawler for a particular search engine hasn't visited and indexed a web page, that page will not show up in any of the search engine's results.
Challenge 1: Search words
If you've ever looked for books in a library catalog or shopped for items on eBay, you've searched a database on the web. The web includes a huge number of such databases. To search a database, the user has to enter keywords into a search form.
Doing so is a problem for web crawlers, because to extract and index the information in a database, a crawler has to "know" what types of keywords will work best. For instance, it needs to input art-related keywords (and not, say, engineering-related terms) to get results from an art database. How does a crawler determine what kinds of keywords it should input into a given database? This question has proven to be a major challenge.
Google has started to overcome this problem. Through advanced programming, Googlebot, the Google crawler, can now figure out which search terms will work best in a given database. Rather than inputting terms "blindly," Googlebot tailors its terms to the database it is currently searching. The content can then be indexed.*
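The core idea, probing a search form with candidate keywords and keeping the ones that return results, can be sketched very roughly in Python. This toy sketch is only loosely inspired by the approach in the papers cited below; all functions and data here are hypothetical stand-ins, and the real system is far more sophisticated.

```python
# Toy sketch of form probing: try candidate keywords against a
# database's search form and keep the ones that return records.
# The "database" and submit function are made-up stand-ins.

def probe_form(submit, candidate_terms, min_results=1):
    """Return the candidate terms that produce useful result pages."""
    productive = []
    for term in candidate_terms:
        results = submit(term)          # simulate submitting the form
        if len(results) >= min_results:
            productive.append(term)     # this term "works" for this database
    return productive

# A pretend art database: only art-related terms return records.
ART_DB = {"monet": ["page1", "page2"], "cubism": ["page3"]}
submit = lambda term: ART_DB.get(term, [])

print(probe_form(submit, ["monet", "gearbox", "cubism"]))  # → ['monet', 'cubism']
```

The engineering term returns nothing, so the crawler learns that art-related vocabulary is what unlocks this particular database.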
Challenge 2: Logins
Web crawlers cannot enter fee-based or password-protected websites. Among these are the STLCC Libraries' subscription-based databases.
* For more information on crawlers, forms and databases, see:
Madhavan, Jayant, and Alon Halevy. "Crawling Through HTML Forms." Google Webmaster Central Blog. Google, 11 Apr. 2008. Web. 13 Nov. 2009.
Madhavan, Jayant, et al. "Google's Deep Web Crawl." Proceedings of the VLDB Endowment 1.1 (2008): 1243-1252. Web. 13 Nov. 2009.
Here are some other kinds of challenges facing web crawlers:
Pages deliberately excluded from crawls
Pages not yet crawled
Dynamically generated pages
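The first of these challenges, deliberate exclusion, usually works through a site's robots.txt file, which tells well-behaved crawlers which paths not to fetch. Python's standard library can read such rules; the rules below are a made-up example, not any real site's policy.

```python
# Sketch of how a polite crawler honors robots.txt exclusions,
# using Python's standard urllib.robotparser. Hypothetical rules.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("MyCrawler", "https://example.com/private/page.html"))  # False
print(parser.can_fetch("MyCrawler", "https://example.com/public/page.html"))   # True
```

Pages a crawler declines to fetch this way are never indexed, so they end up in the Deep Web even though a person with the URL could view them normally.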
The Deep Web, as traditionally defined, results primarily from limitations in search engine technology. However, there's another kind of Deep Web, one resulting from some human limitations:
1. People often grow comfortable with a particular search engine and come to rely on it exclusively, even when using more than one search engine could result in an increased number of relevant search results.
2. The sheer number of results which a general search engine such as Google or Bing returns may overwhelm searchers, meaning that they only look at the first few results and ignore possibly higher quality or more relevant websites appearing further down in the results list.
3. The relevancy ranking features of search engines may lull searchers into a false sense of security: they may feel there's no need to look beyond the first few search results, even though better websites, located further down in the list, are perhaps available to them.
4. Many people do not search thoughtfully. For instance, they may not take the time to choose the best keywords, or to refine their initial keywords based on their search results.
These factors may result in searchers not seeing or obtaining useful search engine results, even though those results are available to them. Searchers in these instances are, in effect, creating their own personal deep or invisible webs, which are every bit as limiting as the "real" deep web -- the one consisting of websites that search engine spiders cannot index.
Note: The above ideas are discussed in Jane Devine and Francine Egger-Sider's 2014 book entitled Going Beyond Google Again: Strategies for Using and Teaching the Invisible Web (available at STLCC libraries).
St. Louis Community College Libraries