
Deep Web

The Deep Web Library Guide discusses why the Deep Web exists and what it contains. The Guide also provides tools for searching the Deep Web, along with resources for further information.

Web Crawlers

Remember that the Deep Web consists of all information on the World Wide Web which cannot be located by general-purpose search engines (Google, Yahoo, Bing, etc.).  Why are some web pages impossible for these search engines to locate?

When you perform a Web search, the results you get back are only those web pages which the search engine "knows about."  How does the engine know about these pages?  It does so in two main ways:

  • Sometimes the web page creator submits the web address of the page directly to the engine. 
  • Or, much more commonly, the engine's web crawler has crawled the page. 

After a crawler visits a page, it submits the text on that page to an indexing program. The resulting index of words is stored in a database.  Each search engine has its own indexing program and index.

When you use a general-purpose search engine, you are searching not the Web itself, but this huge index of words.  Your search terms are compared against the index, and the engine returns the crawled websites matching your requirements.  For instance, you may want sites which contain all of your search terms, or which contain certain terms but exclude others.

If for some reason the crawler for a particular search engine hasn't visited and indexed a web page, that page will not show up in any of the search engine's results.
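To make the indexing idea concrete, here is a minimal sketch, in Python, of how an index of words might be built from crawled pages and then searched. The page addresses and text below are invented for illustration; real search engines work at an enormously larger scale, but the principle is the same.

    # Minimal sketch of an inverted index: each word maps to the set of
    # pages that contain it. All page addresses and content are invented.
    crawled_pages = {
        "example.edu/history": "deep web history and background",
        "example.edu/tools":   "tools for searching the deep web",
        "example.edu/news":    "campus news and events",
    }

    # Build the index: word -> set of page addresses containing that word.
    index = {}
    for address, text in crawled_pages.items():
        for word in text.split():
            index.setdefault(word, set()).add(address)

    # A search returns pages containing ALL of the query terms,
    # by intersecting the sets of pages for each term.
    def search(query):
        results = None
        for term in query.lower().split():
            pages = index.get(term, set())
            results = pages if results is None else results & pages
        return results or set()

    print(search("deep web"))   # both pages mentioning the deep web
    print(search("tools web"))  # only the tools page

Notice that the search runs against the index alone; the crawler's work has already been done, which is why uncrawled pages can never appear in the results.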

Challenges for Web Crawlers

Here are some of the other challenges facing web crawlers:

Logins and paywalls

  • Web crawlers cannot enter fee-based or password-protected websites. 

Social Media

  • Google only indexes a small percentage of pages and posts on social media sites like Facebook and X, and only when not blocked by privacy settings. Much of social media is in the Deep Web.

Pages deliberately excluded from crawls

  • A web designer can prevent crawlers from visiting a web page by using a special piece of computer code. This is useful when the content of a page is for private or restricted use only.
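The guide does not name the mechanism, but one common example is a robots.txt file (or a "noindex" meta tag within the page itself), which tells crawlers which parts of a site they may visit. The sketch below, using Python's standard urllib.robotparser module, shows how a well-behaved crawler checks such rules; the site and paths are hypothetical.

    # Sketch of how a well-behaved crawler checks a site's robots.txt
    # rules before visiting a page. The rules and paths are hypothetical.
    from urllib.robotparser import RobotFileParser

    robots_rules = """\
    User-agent: *
    Disallow: /private/
    """

    parser = RobotFileParser()
    parser.parse(robots_rules.splitlines())

    print(parser.can_fetch("*", "https://example.edu/private/report.html"))  # False
    print(parser.can_fetch("*", "https://example.edu/welcome.html"))         # True

A page excluded this way is never crawled or indexed, so it remains part of the Deep Web even though anyone who knows its address can view it.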

Isolated pages

  • Isolated pages are web pages to which no other pages link.  Unless the web address of an isolated page is submitted directly to a search engine, no crawler will find it, making such a page part of the Deep Web.

Pages not yet crawled

  • No matter how efficient, web crawlers take time to reach pages.  This is because the Web is vast: in 2008 Google announced that it had processed 1 trillion unique web addresses.*  Even though Google isn't indexing all of these pages, it takes time for Googlebot, its crawler, or any other crawler simply to visit that many sites. Google says that it may take several days to a month or longer for Googlebot to reach a page.**
  • Until a new web page is visited, it's part of the Deep Web.

Crawl depth

  • A website contains an opening page, which in turn links to sub-pages. Each of these pages may link to sub-sub-pages, and so on. 
  • These pages are nested, just as a folder on your computer can contain other folders, which themselves can contain further folders.
  • Crawlers have maximum crawl depths, meaning that starting from an initial web address, they will delve only so far into sub-pages.  
  • The pages that are not visited and indexed remain part of the Deep Web.
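As a rough illustration of crawl depth, here is a minimal Python sketch of a breadth-first crawl that stops after a set number of link "hops." The link structure is invented; a real crawler would fetch pages over the network and extract their links from the HTML.

    # Sketch of a breadth-first crawl limited to a maximum depth.
    # The link graph below is invented for illustration.
    links = {
        "home":       ["about", "catalog"],
        "about":      ["staff"],
        "catalog":    ["item-1", "item-2"],
        "staff":      ["staff-bios"],   # "staff-bios" sits three links down
        "item-1":     [],
        "item-2":     [],
        "staff-bios": [],
    }

    def crawl(start, max_depth):
        visited = set()
        frontier = [(start, 0)]
        while frontier:
            page, depth = frontier.pop(0)
            if page in visited or depth > max_depth:
                continue
            visited.add(page)
            for linked_page in links.get(page, []):
                frontier.append((linked_page, depth + 1))
        return visited

    # With a maximum depth of 2, "staff-bios" (three links down) is never
    # visited, so it stays in the Deep Web for this crawler.
    print(crawl("home", max_depth=2))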

Format

  • At one time web crawlers could only crawl "standard" web pages, i.e. those written in HTML format. 
  • Thanks to improved crawler technology, these programs can now visit pages in a variety of non-HTML formats as well, including PDFs and Microsoft Excel and Word documents.
  • As new file formats become available, crawlers may be unable to handle pages in those formats.  Such pages are relegated to the Deep Web until crawler technology improves again.

Dynamically Generated Pages

  • When you search a database, whether subscription-based or freely available on the Web, the database typically assembles your results into a web page created right at the moment of searching. Unlike static web pages, which exist as files stored on a web server, dynamically generated pages only come into being as the result of a search.  Web addresses for dynamic pages usually contain symbols such as question marks or equal signs.  
  • Until recently, web crawlers had difficulty crawling dynamic addresses.  However, crawlers can now read these addresses, provided they are not above a certain level of complexity.
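For illustration, the hypothetical address below shows the question mark and equal signs typical of a dynamically generated page; Python's standard urllib.parse module can pull out the parameters the database would use to assemble the page.

    # Sketch showing the query-string symbols (? and =) typical of
    # dynamically generated pages. The address is hypothetical.
    from urllib.parse import urlparse, parse_qs

    dynamic_address = "https://catalog.example.edu/search?subject=deep+web&format=ebook"

    parts = urlparse(dynamic_address)
    print(parts.path)             # /search  (the script that builds the page)
    print(parse_qs(parts.query))  # {'subject': ['deep web'], 'format': ['ebook']}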

* Alpert, Jesse, and Nissan Hajaj. "We Knew the Web Was Big..." The Official Google Blog. Google, 25 Jul. 2008. Web. 16 Nov. 2009.
** "Help Forum." Google Webmaster Central. Google, 06 Mar. 2009. Web. 13 Nov. 2009.

 

Other Challenges for Search Engines

Human Factors and the Deep Web

The Deep Web, as traditionally defined, results primarily from limitations in search engine technology. However, there's another kind of Deep Web, one resulting from some human limitations:

1. People often grow comfortable with a particular search engine and come to rely on it exclusively, even when using more than one search engine could result in an increased number of relevant search results.

2. The sheer number of results which a general search engine such as Google or Bing returns may overwhelm searchers, meaning that they only look at the first few results and ignore possibly higher quality or more relevant websites appearing further down in the results list.

3. The relevancy ranking features of search engines may lull searchers into a false sense of security: they may feel there's no need to look beyond the first few search results, even though better websites may be available further down in the list.

4. Many people do not search thoughtfully. For instance, they may not take the time to choose the best keywords, or to refine their initial keywords based on their search results. 

These factors may result in searchers not seeing or obtaining useful search engine results, even though those results are available to them. Searchers in these instances are, in effect, creating their own personal deep or invisible webs, which are every bit as limiting as the "real" Deep Web -- the one consisting of websites which search engine crawlers cannot index.

Note: The above ideas are discussed in Jane Devine and Francine Egger-Sider's 2014 book entitled Going Beyond Google Again: Strategies for Using and Teaching the Invisible Web (available at STLCC libraries).
