Before a user runs a search, web crawlers gather information from millions of web pages and organize it in a search index.
The crawl process begins with a list of web addresses from past crawls and sitemaps provided by website owners. Crawlers visit these sites and follow the links on them to discover other pages. They pay particular attention to new sites, changes to existing sites, and broken links. Computer programs determine which sites to crawl, how often to crawl them, and how many pages to fetch from each site.
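The crawl loop described above (start from known addresses, follow links, skip pages already seen) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Google's crawler is implemented: `TOY_WEB`, `toy_fetch_links`, and `crawl` are hypothetical names, and the "web" is an in-memory dictionary rather than real HTTP requests.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
TOY_WEB = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/b", "https://example.com/c"],
    "https://example.com/b": [],
    "https://example.com/c": ["https://example.com/"],
}


def toy_fetch_links(url):
    """Stand-in for downloading a page and extracting its links."""
    return TOY_WEB.get(url, [])


def crawl(seeds, fetch_links, max_pages=100):
    """Breadth-first crawl: visit seed URLs, then follow links to unseen pages."""
    frontier = deque(dict.fromkeys(seeds))  # URLs waiting to be visited, de-duplicated
    seen = set(frontier)                    # URLs already queued or visited
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


order = crawl(["https://example.com/"], toy_fetch_links)
print(order)
```

The `max_pages` cap plays the role of the per-site page budget mentioned above. A real crawler would also schedule revisits and rate-limit its requests, which this sketch leaves out.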
Google offers Search Console to give site owners more control over how Google crawls their sites. For example, site owners can specify how the pages on their site should be processed, request a recrawl, or opt out of crawling altogether using a file called robots.txt. Google does not accept payment to crawl a site more frequently. We provide the same tools to all websites so that we can deliver the best possible search results for our users.
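To make the robots.txt mechanism concrete, the sketch below uses Python's standard `urllib.robotparser` module to check whether a crawler may fetch a URL. The rules, the `ExampleBot` user agent, and the URLs are made up for illustration; this shows how a well-behaved crawler might honor the file, not how Googlebot itself evaluates it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: every crawler is asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse rules from text instead of fetching them

# "ExampleBot" is an invented crawler name used only for this example.
allowed = parser.can_fetch("ExampleBot", "https://example.com/index.html")
blocked = parser.can_fetch("ExampleBot", "https://example.com/private/report.html")

print(allowed, blocked)
```

A site owner who wanted to opt out of crawling entirely could, under the same convention, use `Disallow: /` for the relevant user agent.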
Discovering information by crawling:
The web is like a library of books that keeps growing and has no central filing system. Google uses software called web crawlers to discover publicly available web pages. Crawlers look at web pages and follow the links on them, much as users do when they browse content on the web. They move from link to link and bring data about those pages back to our servers.
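Following links means first finding them in a page's HTML. The sketch below shows one way to do that with Python's standard `html.parser` and `urllib.parse.urljoin`; the `LinkExtractor` class, the `extract_links` helper, and the sample HTML are assumptions made for this example, and production crawlers handle many more cases (redirects, `nofollow` hints, script-generated links).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links such as "/about" into absolute URLs.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    extractor = LinkExtractor(base_url)
    extractor.feed(html)
    return extractor.links


# Hypothetical page content used only for illustration.
SAMPLE_HTML = '<p><a href="/about">About</a> <a href="https://other.example/x">X</a></p>'
links = extract_links(SAMPLE_HTML, "https://example.com/docs/")
print(links)
```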
Organizing information by index:
When a crawler finds a web page, our systems render the content of the page, much as a browser does. We take note of key signals, such as keywords and website freshness, and record all of that information in the Search index.
The Google Search index contains hundreds of billions of web pages. Like the index at the back of a book, it has an entry for every word that appears on every indexed page. When a web page is indexed, it is added to the entries for all of the words it contains. We also use the Knowledge Graph to go beyond keyword matching and better understand people, places, things, and more.
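The book-index analogy corresponds to a data structure commonly called an inverted index: a map from each word to the pages that contain it. The following is a minimal sketch of that idea, assuming a naive tokenizer and a handful of invented pages; `tokenize`, `build_index`, `search`, and `PAGES` are hypothetical names, and Google's actual index records far more than word membership.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words (a naive tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map every word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))  # copy so the index is not modified
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical pages used only for illustration.
PAGES = {
    "https://example.com/cats": "Cats are small domesticated animals",
    "https://example.com/dogs": "Dogs are loyal domesticated animals",
    "https://example.com/fish": "Fish live in water",
}

index = build_index(PAGES)
print(search(index, "domesticated animals"))
```

Looking up a word is then a single dictionary access, which is why adding a page to the entries for all of its words at indexing time makes later searches fast.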