LibGuides: Deep Web: Web Crawlers

Web Crawlers

Remember that the Deep Web consists of all information on the World Wide Web which cannot be located by general-purpose search engines (Google, Yahoo, Bing etc.). Why are there some web pages which search engines can't locate?

When you perform a Web search, the results you get back are only those web pages which the search engine "knows about." How does the engine know about these pages? It does so in two main ways:

Sometimes the web page creator submits the web address of the page directly to the engine.
Or, much more commonly, the engine's web crawler has crawled the page.

After a crawler visits a page, it submits the text on that page to an indexing program. The resulting index of words is stored in a database. Each search engine has its own indexing program and index.

When you use a general-purpose search engine, you are searching not the Web itself, but this huge index of words. Your search terms are compared to the index, which returns crawled web sites matching your search requirements. For instance, you may want sites which contain all of your search terms, or which contain certain terms but exclude others.

If for some reason the crawler for a particular search engine hasn't visited and indexed a web page, that page will not show up in any of the search engine's results.

The Database Challenge

Challenge 1: Search words

If you've ever looked for books in a library catalog or shopped for items on Ebay, you've searched a database on the web. The web includes a huge number of databases. To search a database, the user has to input keywords into a search form.

Doing so is a problem for web crawlers, because to extract and index the information in a database, a crawler has to "know" what types of keywords will work best. For instance, it needs to input art-related keywords (and not, say, engineering-related terms) to get results from an art database. How does a crawler determine what kinds of keywords it should input into a given database? This question has proven to be a major challenge.

Google has started to overcome this problem. Through advanced programming, Googlebot, the Google crawler, can now figure out which search terms will work best in a given database. Rather than inputting terms "blindly," Googlebot tailors its terms to the database it is currently searching. The content can than be indexed.*

Challenge 2: Logins

Webcrawlers cannot enter fee-based or password-protected websites. Among these are the STLCC Libraries' subscription-based databases.

* For more information on crawlers, forms and databases, see:
Madhavan, Jayant , and Alon Halevy. "Crawling Through HTML Forms." Google Webmaster Central Blog. Google, 11 Apr. 2008. Web. 13 Nov. 2009.
Madhavan, Jayant, et al. "Google's Deep Web Crawl." Proceedings of the VLDB Endowment 1.1 (2008): 1243-1252. Web. 13 Nov. 2009.

Other Challenges for Search Engines

Here are some other kinds of challenges facing webcrawlers:

Social Media

A huge amount of web content now exists on Facebook. However, privacy settings block crawlers from indexing much of this content, meaning a great deal of what's on Facebook is part of the Deep Web.
Google only indexes a small percentange of messages (tweets) on Twitter, so much of Twitter's content is also part of the Deep Web. (For an interesting article on this topic, see https://blogs.perficientdigital.com/2016/11/30/twitter-indexing-decline/.)

Pages deliberately excluded from crawls

A web designer can prevent crawlers from visiting a web page by using a special piece of computer code.
This is useful when the content of a page is for private or restricted use only.

Isolated pages

Isolated pages are web pages to which no other pages link. Unless the web address of an isolated page is submitted directly to a search engine, no crawler will find it, making such a page part of the Deep Web.

Pages not yet crawled

No matter how efficient, web crawlers take time to reach pages. This is because the Web is vast: in 2008 Google announced that it had processed 1 trillion unique web addresses.* Even though Google isn't indexing all of these pages, it takes time for Googlebot, its crawler, or any other crawler simply to visit that many sites. Google says that it may take several days to a month or longer for Googlebot to reach a page.**
Until a new web page is visited, it's part of the Deep Web.

Crawl depth

A website contains an opening page, which in turns links to sub-pages. Each of these pages may link to sub-sub-pages, and so on. For instance, the College's website has the following opening page: http://www.stlcc.edu. This page links to https://www.stlcc.edu/libraries, which in turn links to https://www.stlcc.edu/libraries/department-librarians.html.
These pages are nested, just as a folder on your computer can contain other folders, which themselves can contain further folders.
Crawlers have maximum crawl depths, meaning that starting from an initial web address, they will delve only so far into sub-pages. So in this example, a hypothetical crawler might get as far as the Libraries page and stop there.
Consequently the Department Librarians page wouldn't be visited and indexed, thus remaining part of the Deep Web.

Format

At one time web crawlers could only crawl "standard" web pages, i.e. those written in HTML format.
Thanks to improved crawler technology, these programs can now visit pages in a variety of non-HTML formats as well, including Microsoft Excel and Word, PDF, and Adobe Flash.
As new file formats become available, however, crawlers may be unable to handle pages written in these formats. Such pages will therefore be relegated to the Deep Web, until crawler technology improves again.

Dynamically Generated Pages

When you search a database, whether subscription-based or freely available on the Web, the database typically assembles your results into a web page created right at the moment of searching. Unlike static web pages, which exist as files stored on a web server, dynamically generated pages only come into being as the result of a search. Web addresses for dynamic pages usually contain symbols such as question marks or equal signs. Example:
http://www.amazon.com/s/ref=nb_ss?url=search-alias%3Dstripbooks&field-keywords=cows&x=0&y=0
Until recently, web crawlers had difficulty crawling dynamic addresses. However crawlers can now read these addresses provided they are not above a certain level of complexity.

* Alpert, Jesse and Nissan Hajaj. "We Knew the Web Was Big..." The Official Google Blog. Google, 25 Jul. 2008. Web 16 Nov. 2009.
** "Help Forum." Google Webmaster Central. Google, 06 Mar. 2009. Web. 13 Nov. 2009

Human Factors and the Deep Web

The Deep Web, as traditionally defined, results primarily from limitations in search engine technology. However, there's another kind of Deep Web, one resulting from some human limitations:

1. People often grow comfortable with a particular search engine and come to rely on it exclusively, even when using more than one search engine could result in an increased number of relevant search results.

2. The sheer number of results which a general search engine such as Google or Bing returns may overwhelm searchers, meaning that they only look at the first few results and ignore possibly higher quality or more relevant websites appearing further down in the results list.

3. The relevancy ranking features of search engines may lull searchers into a fall sense of security: they may feel there's no need to look beyond the first few search results, even though better websites, located further down in the list, are perhaps available to them.

4. Many people do not search thoughtfully. For instance, they may not take the time to choose the best keywords, or to refine their initial keywords based on their search results.

These factors may result in searchers not seeing or obtaining useful search engine results, even though those results are available to them. Searchers in these instances are, in effect, creating their own personal deep or invisible webs, which are every bit as limiting as the "real" deep web -- the one consisting of web sites which search engine spiders cannot index.

Note: The above ideas are discussed in Jane Devine and Francine Egger-Sider's 2014 book entitled Going Beyond Google Again: Strategies for Using and Teaching the Invisible Web (available at STLCC libraries).