This blog covers the common mistakes made with robots.txt; knowing them will help you resolve errors that keep your website from being crawled properly. The robots.txt file is a text file created by the webmaster to direct search engine robots in crawling the web pages. It is one of the crucial elements that you should not ignore from an SEO perspective, and it is one of the first things crawlers check on your website.
It is essentially the standard document that robots follow to crawl web pages, index them, and present them to users. From it, web crawlers learn which parts of the website are allowed to be crawled and which are not. Any mistake in a directive will result in improper crawling of the web pages, which in turn affects the website's ranking.
The basic syntax of this file is:
User-agent: [user-agent name]
Disallow: [URL path that is not to be crawled]
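For example, a minimal complete file (using the placeholder domain example.com and a hypothetical /admin/ folder) might look like this:
User-agent: *
Disallow: /admin/
Sitemap: https://www.example.com/sitemap.xml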
Quick Must-Know Points About Robots.txt
- The robots.txt file should be placed in the top-level (root) directory of the website.
- The contents of robots.txt are case sensitive, so be careful while typing. The correct name of the file itself is “robots.txt”, not Robots.txt, robots.TXT, and so on.
- Some robots may simply ignore your robots.txt file; this is most common with nefarious crawlers.
- It is publicly available. Simply append /robots.txt to the root domain of the website, and you will be able to see the website's robots.txt directives.
- Every subdomain should have a distinct robots.txt file.
- It is good practice to add the location of the website's sitemap to the file.
Some Of The Common Mistakes
Mistakes are a common part of any work, and the robots.txt file is no exception. Here we describe some of the most common mistakes; have a look at them, and fix any that you find on your own site.
Not Placing The File In The Root Directory
This is the most common mistake individuals make: not placing the robots.txt file in the directory where it needs to be. It should always sit in the root directory of the website. Do not place it inside any other directory, because crawlers will not discover it there when they crawl the site.
Correct format: https://marketingsweet.com.au/robots.txt
Incorrect format: https://marketingsweet.com.au/assets/robots.txt
Wrong Use Of Wildcards
Some of you may not be familiar with the word wildcard. A wildcard is a special character used in robots.txt directives to give crawlers pattern-based instructions. The two wildcards used in robots.txt are $ and *, and each has its own meaning: the symbol $ matches the end of a URL, while * matches any sequence of characters, including none. Let us explain this with an example.
The correct syntax:
User-agent: * (* is used to represent all types of user agents)
Disallow: /group* (here the rule blocks any URL whose path begins with /group)
Disallow: /*.audio$ (this directive denotes that any URL ending with the .audio extension should be blocked)
Careless Use Of The Trailing Slash
Another common mistake is adding an unnecessary trailing slash to a URL that you want to allow or block in robots.txt. For example, suppose you are trying to block the URL https://www.example.com/about-us.
So what is wrong with the directive below? Let us explain.
User-agent: *
Disallow: /about-us/
This code tells Googlebot not to crawl any URLs inside the “/about-us/” folder. However, it will not block the URL “/about-us” itself, because that URL has no trailing slash.
Now, what is the right way to block the URL?
User-agent: *
Disallow: /about-us
Using The Noindex Directive In Robots.txt
By now, most individuals are familiar with the fact that there is no benefit to using Noindex in the robots.txt file: Google officially announced that it stopped supporting the Noindex directive in robots.txt from September 1, 2019, onwards.
Example of Noindex in robots.txt:
User-agent: *
Noindex: /privacy-policy/
Noindex: /disclaimer/
There is an alternative: you can use the meta robots tag instead of the Noindex directive. You can easily add this tag to the page code of the URLs that you don't want indexed.
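For instance, placing the following tag between the <head> tags of a page tells search engines not to index it:
<meta name="robots" content="noindex">
Note that, unlike a robots.txt rule, this tag only works if crawlers are allowed to fetch the page, so make sure the URL is not also disallowed in robots.txt.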
Not Having The Sitemap URL In Robots.txt
Many individuals do not mention the location of the sitemap in the robots.txt file, which is a missed opportunity. Mentioning it helps crawlers discover the website's sitemap easily, saving Googlebot time, and we all know that this makes crawling of the website easier. So how do you include the sitemap in robots.txt?
In your robots.txt file, include the following line.
Sitemap: https://www.example.com/sitemap.xml
Blocking CSS And JS Files
People block JS and CSS files in robots.txt, thinking these files do not need to be crawled. But SEO experts advise against blocking CSS and JS files, because Googlebot needs to crawl them in order to render pages properly and evaluate how they perform.
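For example, directives like the ones below (the folder names /css/ and /js/ are placeholders) are the kind you should avoid or remove:
User-agent: *
Disallow: /css/
Disallow: /js/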
Not Creating A Robots.txt File For Each Domain And Subdomain
It is good practice for every domain, including subdomains and staging subdomains, to have its own robots.txt file. If you skip this, undesired domains may get crawled and indexed, while crawling of the important domains and subdomains becomes less efficient. Therefore, you should have a dedicated robots.txt file for each domain and subdomain for better crawling and indexing, as illustrated below.
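For example (using the placeholder domain example.com), each host serves its own file from its own root, and each file controls crawling only for the host it is served from:
https://www.example.com/robots.txt
https://blog.example.com/robots.txt
https://staging.example.com/robots.txt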
Allowing Crawlers To Access The Staging Site
Most developers experiment and run tests on a staging or test website, and once everything is verified, it gets deployed to the main website. But they often forget that Googlebot treats staging websites like any other website: it will discover and crawl those pages too. If you do not block crawlers from accessing such sites, they will get indexed and might appear in some search queries, and no one wants that.
Also, you should not reuse the main website's robots.txt file on the staging website; that is completely wrong, because the two sites need different rules, as shown below.
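A minimal sketch of a staging-site robots.txt that asks all compliant crawlers to stay away from the entire site (remember from earlier that nefarious crawlers may ignore it):
User-agent: *
Disallow: /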
Overlooking Case Sensitivity
You should be aware that URLs are case sensitive for crawlers. A lowercase “c” and an uppercase “C” produce two different URLs. For example:
https://marketingsweet.com.au/category
and https://marketingsweet.com.au/Category are two different URLs. Thus you need to be very careful while writing out a URL address.
Let us make this clear with an example.
Suppose you want to block the URL https://marketingsweet.com.au/news; then you should type it in the correct form.
Like this:
User-agent: *
Disallow: /news
And the incorrect form is:
User-agent: *
Disallow: /News
Conclusion
Using the latest digital marketing trends and strategies is very effective in improving the position of a business. However, mistakes in robots.txt can adversely affect your SEO. The file is small, but it is a little tricky to set up correctly, so you should pay the utmost attention and care while setting it up.
If you are unsure about any of this, you can contact Marketing Sweet's experts to set up the right robots.txt file for your online business. We keep exploring the best strategies for your business so that you can experience growth, and we understand our customers' requirements and work accordingly.
You can give us a call at 08 8337 4340 to get better and detailed information.