Here are some other kinds of challenges facing webcrawlers:
- A huge amount of web content now exists on Facebook. However, privacy settings block crawlers from indexing much of this content, meaning a great deal of what's on Facebook is part of the Deep Web.
- Google only indexes a small percentage of messages (tweets) on Twitter, so much of Twitter's content is also part of the Deep Web. (For an interesting article on this topic, see http://www.stonetemple.com/how-does-google-index-tweets.)
Pages deliberately excluded from crawls
- A web designer can prevent crawlers from visiting a web page by using a special piece of computer code, most commonly a robots.txt file or a "noindex" instruction in the page itself.
- This is useful when the content of a page is for private or restricted use only.
- Isolated pages are web pages to which no other pages link. Unless the web address of an isolated page is submitted directly to a search engine, no crawler will find it, making such a page part of the Deep Web.
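The "special piece of computer code" that excludes pages from crawls is usually a robots.txt file, part of the long-standing Robots Exclusion Protocol that well-behaved crawlers check before visiting a site. A minimal sketch (the domain and directory name here are placeholders, not real addresses):

```
# robots.txt — placed at the top level of a site, e.g. https://www.example.com/robots.txt
# Ask all crawlers (User-agent: *) to stay out of a restricted directory:
User-agent: *
Disallow: /private/
```

Any page under /private/ that crawlers honor this request for remains unindexed, and therefore part of the Deep Web.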
Pages not yet crawled
- No matter how efficient, web crawlers take time to reach pages. This is because the Web is vast: in 2008 Google announced that it had processed 1 trillion unique web addresses.* Even though Google isn't indexing all of these pages, it takes time for Googlebot, its crawler, or any other crawler simply to visit that many sites. Google says that it may take several days to a month or longer for Googlebot to reach a page.**
- Until a new web page is visited, it's part of the Deep Web.
- A website contains an opening page, which in turn links to sub-pages. Each of these pages may link to sub-sub-pages, and so on. For instance, the College's website has the following opening page: http://www.stlcc.edu. This page links to https://www.stlcc.edu/libraries, which in turn links to https://www.stlcc.edu/libraries/department-librarians.html.
- These pages are nested, just as a folder on your computer can contain other folders, which themselves can contain further folders.
- Crawlers have maximum crawl depths, meaning that starting from an initial web address, they will delve only so far into sub-pages. So in this example, a hypothetical crawler might get as far as the Libraries page and stop there.
- Consequently the Department Librarians page wouldn't be visited and indexed, thus remaining part of the Deep Web.
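The idea of a maximum crawl depth can be sketched in a few lines of Python. This is a toy illustration, not a real crawler: the `LINKS` dictionary stands in for the live Web (using the College pages above as the link graph), and `crawl` and `max_depth` are hypothetical names chosen for this example.

```python
from collections import deque

# Toy link graph standing in for real web pages.
LINKS = {
    "http://www.stlcc.edu": ["https://www.stlcc.edu/libraries"],
    "https://www.stlcc.edu/libraries": [
        "https://www.stlcc.edu/libraries/department-librarians.html"
    ],
    "https://www.stlcc.edu/libraries/department-librarians.html": [],
}

def crawl(start, max_depth):
    """Breadth-first crawl that follows links at most max_depth hops from start."""
    visited = set()
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        if url in visited or depth > max_depth:
            continue  # skip pages already seen or beyond the depth limit
        visited.add(url)
        for link in LINKS.get(url, []):
            queue.append((link, depth + 1))
    return visited

# With max_depth=1, the crawler reaches the Libraries page but never the
# Department Librarians page — which therefore stays in the Deep Web.
```

Raising `max_depth` to 2 brings the Department Librarians page into the crawl, which is exactly the trade-off real crawlers make between coverage and crawl time.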
Pages in non-HTML formats
- At one time web crawlers could only crawl "standard" web pages, i.e., those written in HTML format.
- Thanks to improved crawler technology, these programs can now visit pages in a variety of non-HTML formats as well, including Microsoft Excel and Word, PDF, and Adobe Flash.
- As new file formats become available, however, crawlers may be unable to handle pages written in these formats. Such pages will therefore be relegated to the Deep Web, until crawler technology improves again.
Dynamically generated pages
- When you search a database, whether subscription-based or freely available on the Web, the database typically assembles your results into a web page created right at the moment of searching. Unlike static web pages, which exist as files stored on a web server, dynamically generated pages only come into being as the result of a search. Web addresses for dynamic pages usually contain symbols such as question marks or equal signs, for example: https://www.example.com/search?subject=deep+web
- Until recently, web crawlers had difficulty crawling dynamic addresses. However, crawlers can now read these addresses provided they are not above a certain level of complexity.
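The question marks and equal signs in a dynamic address make up a query string that the server reads to build the page on the fly. A short Python sketch of how such an address breaks apart (the URL is a made-up placeholder using the reserved example.com domain):

```python
from urllib.parse import urlparse, parse_qs

# A hypothetical dynamically generated address: everything after the "?"
# is a query string of name=value pairs joined by "&".
url = "https://www.example.com/search?subject=deep+web&page=2"

parts = urlparse(url)
params = parse_qs(parts.query)
# parse_qs decodes the "+" back into a space:
# params -> {'subject': ['deep web'], 'page': ['2']}
```

The more of these name=value pairs an address carries, the more "complex" it is in the sense described above, and the harder it historically was for crawlers to handle.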
* Alpert, Jesse, and Nissan Hajaj. "We Knew the Web Was Big..." The Official Google Blog. Google, 25 Jul. 2008. Web. 16 Nov. 2009.
** "Help Forum." Google Webmaster Central. Google, 6 Mar. 2009. Web. 13 Nov. 2009.