A web crawler (also called a spider or searchbot) is a computer program that automatically browses the web and analyzes web pages. Web crawlers are mainly used by search engines; other applications include collecting RSS feeds, e-mail addresses, or other information.
Web crawlers are a special kind of bot, i.e. computer programs that carry out repetitive tasks largely autonomously.
The first web crawler, the World Wide Web Wanderer, appeared in 1993 and was intended to measure the growth of the Internet. In 1994, WebCrawler followed as the first publicly accessible web search engine with a full-text index; the name "crawler" for such programs derives from it. As the number of search engines grew rapidly, a large variety of web crawlers exists today; they generate up to 40% of all Internet traffic.
Structure of crawlers
As when surfing the Internet, a crawler reaches other URLs from one web page via hyperlinks. All addresses found are stored and visited one after the other, and the hyperlinks discovered on each page are added to the list of all URLs. In this way, in theory, every linked page of the web that is not blocked for web crawlers can be found.
In practice, however, a selection is often made, and at some point the process is stopped and restarted. Depending on the task of the web crawler, the content of the pages found is, for example, evaluated and stored by means of indexing, in order to make later searches in the collected data easier.
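The crawl loop described above can be sketched as a simple breadth-first traversal. The link graph here is a hypothetical in-memory stub; a real crawler would fetch each URL over HTTP and extract the hyperlinks from the HTML.

```python
from collections import deque

# Hypothetical pages and the hyperlinks found on each of them.
LINKS = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://c.example/"],
    "http://c.example/": ["http://a.example/"],
}

def crawl(seed):
    frontier = deque([seed])      # addresses still to be visited
    seen = {seed}                 # all addresses found so far
    order = []                    # order in which pages are visited
    while frontier:
        url = frontier.popleft()  # visit stored addresses sequentially
        order.append(url)
        for link in LINKS.get(url, []):
            if link not in seen:  # add newly found hyperlinks to the list
                seen.add(link)
                frontier.append(link)
    return order

print(crawl("http://a.example/"))
# ['http://a.example/', 'http://b.example/', 'http://c.example/']
```

Each reachable page is visited exactly once; in practice the loop is additionally bounded by politeness delays and a selection policy, as noted above.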
Using the Robots Exclusion Standard, a webmaster can tell a crawler, in the robots.txt file and in certain meta tags in the HTML header, which pages it should index and which not, provided the crawler observes the protocol. To combat unwanted crawlers there are also special websites, so-called tar pits, which feed false information to crawlers and additionally slow them down considerably.
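A well-behaved crawler checks the robots.txt rules before fetching a page. Python's standard library ships a parser for this; the robots.txt content is inlined below for illustration, whereas a real crawler would first download it from the site's `/robots.txt` URL.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt: all user agents are asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A crawler that keeps the protocol consults can_fetch() before each request.
print(rp.can_fetch("MyCrawler", "http://example.com/index.html"))   # True
print(rp.can_fetch("MyCrawler", "http://example.com/private/x"))    # False
```

Note that the standard is purely advisory: nothing technically prevents a crawler that ignores it, which is why tar pits exist as a countermeasure.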
A large part of the Internet is not covered by web crawlers, and thus not by public search engines, because much of its content is not reachable via simple links but only, for example, through search forms or access-restricted portals.
There are also thematically focused web crawlers. The focus of the web search is achieved by classifying web pages and classifying individual hyperlinks; in this way the focused crawler finds the best path through the web and indexes only those areas of the web relevant to a topic or domain. The main obstacles to the practical implementation of such crawlers are unlinked regions of the web and the training of the classifiers.
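A focused crawler can be sketched as a best-first traversal whose frontier is ordered by estimated relevance. The keyword score below is a deliberately trivial stand-in for the trained classifier mentioned above, and the pages are hypothetical in-memory stubs rather than real HTTP fetches.

```python
import heapq

# Hypothetical pages: url -> (page text, outgoing hyperlinks).
PAGES = {
    "http://a.example/": ("sports news", ["http://b.example/", "http://c.example/"]),
    "http://b.example/": ("cooking recipes", []),
    "http://c.example/": ("sports scores and sports results", []),
}

def relevance(text, topic="sports"):
    # Stand-in classifier: fraction of words equal to the topic keyword.
    words = text.split()
    return sum(w == topic for w in words) / len(words)

def focused_crawl(seed, threshold=0.1):
    frontier = [(-1.0, seed)]      # max-heap via negated scores
    seen = {seed}
    indexed = []
    while frontier:
        _, url = heapq.heappop(frontier)
        text, links = PAGES[url]
        r = relevance(text)
        if r >= threshold:         # index only relevant pages and
            indexed.append(url)    # follow only their hyperlinks
            for link in links:
                if link not in seen:
                    seen.add(link)
                    heapq.heappush(frontier, (-r, link))
    return indexed

print(focused_crawl("http://a.example/"))
# ['http://a.example/', 'http://c.example/']
```

The off-topic page is visited but neither indexed nor expanded, which is how a focused crawler restricts itself to the relevant areas of the web.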
Web crawlers are also used for data mining and for the analysis of the Internet (webometrics), and are not necessarily limited to the WWW.
A special form of web crawler is the harvester. The term refers to software that scans the Internet (WWW, Usenet, etc.) for e-mail addresses and "harvests" them: electronic addresses are collected and can afterwards be sold.
The result is usually, and especially in combination with spambots, promotional e-mail (spam). The formerly common practice of publishing contact e-mail addresses on websites as mailto: links is therefore increasingly abandoned; sometimes addresses are made unreadable for bots by inserting spaces or words, so that firstname.lastname@example.org becomes firstname.lastname (at) example (dot) org. Most bots can nevertheless recognize such addresses.
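Why the obfuscation mentioned above offers little protection can be shown with a short sketch, assuming the common "(at)" / "(dot)" spellings: a harvester only needs a slightly more tolerant pattern to recognize both plain and disguised addresses.

```python
import re

# Matches user@domain.tld where "@" may be written "(at)" and the last
# "." may be written "(dot)", with optional spaces around them.
OBFUSCATED = re.compile(
    r"([\w.]+)\s*(?:@|\(at\))\s*([\w-]+)\s*(?:\.|\(dot\))\s*([a-z]{2,})",
    re.IGNORECASE,
)

def deobfuscate(text):
    return [f"{user}@{dom}.{tld}" for user, dom, tld in OBFUSCATED.findall(text)]

print(deobfuscate("Contact: firstname.lastname (at) example (dot) org"))
# ['firstname.lastname@example.org']
print(deobfuscate("info@example.com"))
# ['info@example.com']
```

This is why mere textual disguises only deter the simplest bots, motivating the graphic-based approach described next.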
Another popular method is to embed the e-mail address in a graphic. The address is then not present as a character string in the source code of the website and therefore cannot be found by a bot as text information.
This, however, has the disadvantage that the user cannot transfer the e-mail address into his e-mail program conveniently by clicking on it, but has to type it out. More serious still, the page is then no longer accessible, and visually impaired people are excluded along with the bots. Another use of web crawlers is finding copyrighted content on the Internet.