Robots.txt, crawl budgets, indexing

Posted on 19 Jun 2019

Dear Doctor Digital - crawling spiders with budgets and robots in my txt. I’m told these are living in my website - should I be scared?

Doctor Digital Says

Oh that world wide web, of course it’s full of spiders. Unless you are a web developer, you may be unfamiliar with some of the weird and wonderful terms used to describe what goes on in the back end of your website. If you’ve been part of the Digital Ready program for any time, you’ve probably talked with your coach or gone to a seminar about SEO, or Search Engine Optimisation. SEO is important because it is how your customers and soon-to-be customers find you. Optimising for search is made up of a whole bunch of actions that move you up the rankings in Google, meaning you are more easily found.

As Google is still by far the biggest search engine on the planet, we still talk about things in terms of how Google operates, so if you are a Bing user, this still relates to you. Let me try and break down the process that happens to the pages on your website so you have a more granular understanding. Why do you need to know? Well, knowledge is power as they say, and certainly in discussions with your developer, or trying to up your SEO game, all of this will help you stay informed and ahead.

While you might think of your website as a single entity, it’s not. It’s an amalgamation of pages, each one individually recognised by Google and stored in a massive database of information called a web index. Search engines like Google and Bing use these databases to store billions of pages of information, and that index is what you are searching when you are ‘Googling’ something.

(Arachnophobe trigger warning) ‘Spiders’ (which are actually clever automated programs, also called bots or crawlers) ‘crawl’ new pages on the web and store them in an index based on their topics, relevance, authority, etc. Indexing is how the search engine gathers and processes all the data the spider collects from pages and sites during its crawl around online. Frequent crawling keeps your search results current, as the spider notes new documents and changes, which are then added to the searchable index Google maintains and ranks.

Your pages are far more likely to be added if they contain quality content and don’t trigger any alarms by doing shady things like keyword stuffing or building a bunch of links from disreputable sources. When the spider sees a change on your website, it processes the content (text) on the page, the locations on the page where search terms are placed, and any images that have been appropriately tagged. That content is then added to, or “indexed” in, Google.

What makes your content easy for the hungry hungry spiders to index is a fast-loading site and fresh, relevant content. Having Google Search Console set up lets you see how Google is crawling and indexing your pages, and Google Analytics shows you how visitors use them once they arrive. The better the bot’s experience of your site across all the measures of the Google algorithm, the better your ranking in the search engine is likely to be.

The robots.txt file, which follows what is known as the robots exclusion protocol or standard, is a text file you create that tells web robots (most often search engines) which pages on your site to crawl. It also tells web robots which pages not to crawl. Without one, a search engine that crawls your site will try to crawl every single page it can find. And if you have a lot of pages, it will take the search engine bot a while to get through them, which can mean your most important pages are crawled less often.

By using your robots.txt to keep bots away from pages that don’t need to appear in search, you can tell search engine bots to spend their crawl budgets wisely by only crawling your key pages. The bots that crawl your pages have priorities and limits in their crawling (there are a lot of pages to crawl on the web, and those bots are busy). If you don’t make your pages appealing for them, they won’t spend their crawl budgets on your site, which means less indexing and less visibility in search.
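To make the idea a little more concrete, here is a minimal sketch in Python of how a well-behaved bot reads robots.txt rules before deciding whether to crawl a page. The sample rules, the www.example.com address and the /cart/ and /search/ paths are made up for illustration; your own file will list different pages.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents: every bot ("*") is asked to skip
# the shopping cart and internal search pages. Paths are illustrative only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /cart/
Disallow: /search/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Load robots.txt rules from a string instead of fetching them online."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# Ask the same question a polite crawler asks: "may I fetch this page?"
for path in ("/", "/services/", "/cart/checkout", "/search/results"):
    url = "https://www.example.com" + path
    allowed = parser.can_fetch("Googlebot", url)
    print(f"{path}: {'crawl' if allowed else 'skip'}")
```

With these sample rules, the home page and the services page are open to crawling, while anything under /cart/ or /search/ is skipped, leaving more of the bot's budget for the pages that matter.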

Most website builders like Wix, Weebly and Squarespace will create a robots.txt file for you automatically. If you want to find out if your site has one, add /robots.txt to the end of your site address, e.g. digitalready.tas.gov.au/robots.txt, and you will see which pages are excluded. If nothing comes up, you can have a conversation with your web builder about getting a robots.txt file set up.
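If you (or your developer) would rather check this with a script than a browser, the same lookup can be sketched in Python. The helper names robots_txt_url and fetch_robots_txt are made up for this example, and the sketch assumes the site is served over https.

```python
from urllib.parse import urlsplit
from urllib.request import urlopen


def robots_txt_url(site: str) -> str:
    """Build the robots.txt address for a site (hypothetical helper)."""
    # Assumption: if no scheme is given, the site uses https.
    if "://" not in site:
        site = "https://" + site
    parts = urlsplit(site)
    # robots.txt always sits at the top level of the site, whatever page you start from.
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


def fetch_robots_txt(site: str, timeout: float = 10.0) -> str:
    """Download the file as text. Raises an error if the site has no robots.txt."""
    with urlopen(robots_txt_url(site), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


print(robots_txt_url("digitalready.tas.gov.au"))
```

Calling fetch_robots_txt("digitalready.tas.gov.au") would then return the contents of the file, which you can read through to see which pages bots are asked to skip.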

While this might seem a little technical and mysterious, it really comes down to the fundamentals of good web design and relevant, up-to-date content. And what THAT really comes down to is a website that is not only great for Google bots and spiders to crawl all over, but also great for your customers to navigate, as you will have taken the time to make sure it is accessible and engaging. If any of this is keeping you up at night, holler for a Digital Ready coach who can help you get your website in tippy top condition.
