Googlebot is the web crawler Google uses to collect the information it needs to build a searchable index of the web. Googlebot has mobile and desktop crawlers, as well as specialized crawlers for news, images, and videos.
Google uses additional crawlers for specific tasks, and each crawler identifies itself with a different string of text called a “user agent.” Googlebot is evergreen, meaning it sees websites as users would in the latest version of the Chrome browser.
Googlebot runs on thousands of machines. Those machines determine how fast to crawl and what to crawl on websites, but they will slow down their crawling so as not to overwhelm sites.
Let’s look at the process Googlebot uses to build an index of the web.
How Googlebot crawls and indexes the web
Google has shared a few versions of its crawling and indexing pipeline in the past. The following is the most recent.
Google starts with a list of URLs it collects from various sources, such as pages, sitemaps, RSS feeds, and URLs submitted in Google Search Console or through the Indexing API. It prioritizes what to crawl, fetches the pages, and stores copies of them.
These pages are processed to find more links, including links to resources such as API requests, JavaScript, and CSS that Google needs to render a page. All of these additional requests are crawled and cached (stored). Google then uses a rendering service that draws on these cached resources to render pages much as a user’s browser would.
It processes the rendered page again, looking for changes to the page or new links. The content of the rendered pages is what gets stored and made searchable in Google’s index. Any newly found links go back into the URL bucket so they can be crawled.
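To make that loop concrete, here is a minimal Python sketch of the crawl-and-discover cycle described above: fetch a page, store a copy, extract links, and push new URLs back into the queue. It is an illustration only, not Google’s actual pipeline; it skips rendering entirely, and the seed URL and helper names are made up for the example.

```python
# Minimal sketch of a crawl -> store -> extract links -> queue loop.
# Illustrative only; this is not Google's implementation.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags (the 'find more links' step)."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, max_pages=10):
    queue = deque(seed_urls)   # the "URL bucket"
    seen = set(seed_urls)
    store = {}                 # stored copies of fetched pages
    while queue and len(store) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
        except Exception:
            continue           # skip pages that fail to fetch
        store[url] = html      # keep a copy, as Google caches fetched pages
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)       # newly found links go back to the bucket
                queue.append(absolute)
    return store

if __name__ == "__main__":
    pages = crawl(["https://example.com/"])
    print(f"Fetched {len(pages)} pages")
```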
For more details on this process, see our article on how search engines work.
How to control Googlebot
Google gives you a few ways to control what is crawled and indexed.
Ways to control crawling
- Robots.txt – This file on your website allows you to control what gets crawled (see the sketch after this list).
- Nofollow – Nofollow is a link attribute or meta robots tag that suggests a link should not be followed. It is only treated as a hint, so it may be ignored.
- Change your crawl rate – The crawl rate settings in Google Search Console let you slow down Google’s crawling.
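As a quick illustration of the robots.txt control above, the following Python sketch uses the standard library’s robots.txt parser to ask whether the Googlebot user agent is allowed to fetch a given URL. The site and paths are placeholders, not real rules.

```python
# Sketch: check whether robots.txt allows Googlebot to crawl a URL.
# example.com and the paths below are placeholders.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")
rp.read()  # fetch and parse the live robots.txt

# Ask whether the "Googlebot" user agent may fetch a given URL
print(rp.can_fetch("Googlebot", "https://example.com/private/page"))
print(rp.can_fetch("Googlebot", "https://example.com/blog/post"))
```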
Ways to control indexing
- Delete your content – If you delete a page, there is nothing to index. The downside is that no one else can access it either.
- Restrict access to the content – Google doesn’t log into websites, so any kind of password protection or authentication prevents it from seeing the content.
- Noindex – A noindex directive in the meta robots tag (or in an X-Robots-Tag header) tells search engines not to index your page; see the sketch after this list for one way to check for it.
- URL removal tool – The name of this Google tool is slightly misleading: it works by temporarily hiding the content. Google will still see and crawl this content, but the pages will not appear in search results.
- Robots.txt (images only) – Blocking Googlebot-Image from crawling your images means they will not be indexed.
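For the noindex control mentioned above, here is a rough Python sketch that checks a URL for a noindex directive, either in an X-Robots-Tag response header or in a meta robots tag in the HTML. It is a simplified check for illustration only; a real audit tool would use a proper HTML parser rather than a regex.

```python
# Rough sketch: detect a noindex directive in the response header or page source.
import re
import urllib.request

def has_noindex(url):
    with urllib.request.urlopen(url, timeout=10) as resp:
        header = resp.headers.get("X-Robots-Tag", "")
        html = resp.read().decode("utf-8", errors="replace")
    if "noindex" in header.lower():
        return True
    # Look for <meta name="robots" content="...noindex..."> in the page source
    meta = re.search(
        r'<meta[^>]+name=["\']robots["\'][^>]+content=["\']([^"\']*)["\']',
        html, re.IGNORECASE)
    return bool(meta and "noindex" in meta.group(1).lower())

print(has_noindex("https://example.com/"))
```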
If you’re not sure which indexing control to use, check out our flowchart in our post on removing URLs from Google Search.
Is it really Googlebot?
Many SEO tools and some malicious bots pretend to be Googlebot. This may allow them to access websites that try to block them.
In the past, you had to perform a DNS lookup to verify Googlebot. More recently, Google made it even easier by publishing a list of public IP addresses you can use to verify that requests come from Google. You can compare these against the data in your server logs.
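Here is a Python sketch of that IP check: it downloads Google’s published Googlebot ranges and tests whether an IP pulled from your server logs falls inside one of them. The JSON URL and its structure reflect Google’s published list at the time of writing, so confirm the current location in Google’s documentation before relying on it.

```python
# Sketch: verify a request IP against Google's published Googlebot ranges.
# The URL and JSON structure reflect Google's published list at the time of
# writing; check Google's documentation for the current location.
import ipaddress
import json
import urllib.request

GOOGLEBOT_RANGES_URL = (
    "https://developers.google.com/static/search/apis/ipranges/googlebot.json"
)

def load_googlebot_networks():
    with urllib.request.urlopen(GOOGLEBOT_RANGES_URL, timeout=10) as resp:
        data = json.load(resp)
    networks = []
    for entry in data.get("prefixes", []):
        prefix = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if prefix:
            networks.append(ipaddress.ip_network(prefix))
    return networks

def is_googlebot_ip(ip, networks):
    address = ipaddress.ip_address(ip)
    return any(address in network for network in networks)

if __name__ == "__main__":
    networks = load_googlebot_networks()
    # Replace with an IP address taken from your server logs
    print(is_googlebot_ip("66.249.66.1", networks))
```

A match tells you the request came from Google’s ranges; if an IP claiming to be Googlebot is not in the list, it is likely a tool or bot spoofing the user agent.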
You also have access to a “Crawl stats” report in Google Search Console. Under Settings > Crawl stats, the report contains a lot of information about how Google crawls your website. You can see which Googlebot is crawling which files and when it accessed them.
Final Thoughts
The web is a big and chaotic place. Googlebot has to navigate all the different setups, along with downtime and restrictions, to collect the data Google needs to make its search engine work.
A fun fact to wrap up with: Googlebot is usually portrayed as a robot and is aptly named “Googlebot.” Google also has a spider mascot named “Crawley.”
Do you have any more questions? Let me know on Twitter.