Understanding Robots.txt and Web Crawlers

Another very useful tip for bloggers and developers is the "robots.txt" file. It is important to know what this file does and how to use it. As I said, it is very useful: it tells an internet "bot" whether it may access the contents of your website. Using this file correctly helps the bots index your site properly, which makes it more visible in search engines. More visibility means more visitors, and visitors are fuel for any website.

So by now, everybody should have an idea of what we are talking about: the robots.txt file. To start with, the syntax looks like this:

User-agent: Googlebot
Disallow:
Sitemap: http://www.example.com/sitemap.xml

The syntax contains two parts: a rule group and a sitemap line. This might look like Greek and Latin at first, but once you understand it, it is very simple and easy. Let me start with what each word means.
  • User-agent - The user agent names the bot that the rules which follow apply to.
What is Googlebot? A bot is nothing but a crawler. Every search engine has a crawler that scans your website and finds out what is on it. Googlebot is Google's crawler.
  • Disallow - The disallow directive tells the bot not to access a resource in the website directory. After the Disallow keyword, you specify the path of the resource.
    Disallow: /
    "Disallow: /" is easy to understand: it disallows everything from the root directory down, which is the entire site. However, "Disallow:" without the "/" disallows nothing, so the bot can access everything.
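As a small sketch, using a hypothetical /private/ directory: this robots.txt blocks every bot from that one directory while leaving the rest of the site open.

```
User-agent: *
Disallow: /private/
```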
    Allow: /
    The "Allow" directive, on the other hand, permits a specific resource even when other resources are restricted; "Allow: /" allows everything. Allow is more useful with a narrower path:
    Allow: /images/pets/*.jpg
    This means that only certain resources, here the .jpg files under /images/pets/, will be crawled by the bots even if their parent directory is disallowed. (The * wildcard is an extension honored by major crawlers such as Googlebot, not part of the original standard.)
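Putting the two directives together, a hypothetical robots.txt that blocks an /images/ directory but still exposes the pet photos could look like this:

```
User-agent: *
Disallow: /images/
Allow: /images/pets/*.jpg
```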
  • The last part of the robots.txt file is the location of your sitemap. A bot does not simply know what you have on your website; it crawls the links on your pages and follows where your website points. The sitemap feeds it this information: the sitemap.xml file holds links to all the content on your blog.
    Sitemap: http://www.example.com/sitemap.xml
  • The sitemap location should be a full URL on your website's directory. There is no disadvantage to having more than one sitemap; a site can have many sitemaps.
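You can check how a crawler would read these rules with Python's standard-library robots.txt parser. A minimal sketch, using hypothetical rules and placeholder example.com URLs:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules: block /private/ for Googlebot,
# but explicitly allow one file inside it.
rules = """\
User-agent: Googlebot
Allow: /private/public-note.html
Disallow: /private/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)  # feed the rules directly instead of fetching them

# can_fetch() answers: may this user agent crawl this URL?
print(rp.can_fetch("Googlebot", "http://www.example.com/index.html"))                # True
print(rp.can_fetch("Googlebot", "http://www.example.com/private/secret.html"))       # False
print(rp.can_fetch("Googlebot", "http://www.example.com/private/public-note.html"))  # True
```

Note that Python's parser applies the first matching rule, which is why the Allow line is listed before the broader Disallow.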
    That's all about the robots.txt file. Should you have any more questions, feel free to reach out.
