If you use WordPress as the content management system for your website or blog, you can tell the good bots of various search engines (such as Google, Yahoo! and Bing) and services (such as Alexa and Archive.org) which areas of your site they should not process or scan. In this post, let's check out a good basic format of the robots.txt file for a self-hosted WordPress site.



Basic format for robots.txt file for WordPress

Feel free to use the following robots.txt file for your WordPress site (you can also download the file from here):

sitemap: http://www.yourdomain.com/sitemap.xml

User-Agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/
Disallow: /?
Disallow: /feed/
Disallow: /comments/feed/
Disallow: /trackback/

User-agent: Googlebot-Image
Allow: /wp-content/uploads/

User-agent: msnbot-media
Allow: /wp-content/uploads/

User-agent: MSNBOT_Mobile
Allow: /

User-agent: Googlebot-Mobile
Allow: /

User-agent: MediaPartners-Google
Allow: /

Note: Replace yourdomain.com with your actual domain name, and leave out the www if your site doesn't use it. Once you have downloaded the file to your computer and made the necessary changes, upload it to the root directory of your domain so that it can be reached at http://www.yourdomain.com/robots.txt.
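Since bots only look for the file at the root of the host, here is a minimal Python sketch that works out where robots.txt should be reachable for any page of a site. The function name robots_txt_url and the sample page URL are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(site_url: str) -> str:
    """Return the robots.txt URL for the host that serves site_url."""
    parts = urlsplit(site_url)
    # robots.txt sits at the root of the host, whatever the page's own path is
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://www.yourdomain.com/blog/hello-world/?p=1"))
```

If the printed URL does not open your uploaded file in a browser, the file is most likely sitting in a subfolder instead of the root directory.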

What do the above lines do?

The first line, i.e. sitemap: http://www.yourdomain.com/sitemap.xml, tells search engine bots the exact location of your site's XML sitemap. If your site has multiple sitemaps (such as an image sitemap, a video sitemap or a mobile sitemap), you can list their respective URLs in the robots.txt file as follows:

sitemap: http://www.yourdomain.com/sitemap.xml
sitemap: http://www.yourdomain.com/sitemap-image.xml
sitemap: http://www.yourdomain.com/sitemap-video.xml
sitemap: http://www.yourdomain.com/sitemap-mobile.xml

User-Agent: *

Important note: Depending on the sitemap plugin you use on your site, your sitemap file may have a completely different name. Instead of sitemap.xml, it can be sitemapindex.xml, sitemap-index.xml or something else. Double-check the exact name of your sitemap file before entering its URL in your robots.txt file.
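One way to double-check which sitemap URLs a robots.txt file actually declares is to run it through Python's standard urllib.robotparser module. This is only a sketch: the file content below is a shortened copy of the example from this post, and site_maps() needs Python 3.8 or newer.

```python
from urllib.robotparser import RobotFileParser

# Shortened sample file; replace the URLs with your own sitemap locations.
ROBOTS_TXT = """\
sitemap: http://www.yourdomain.com/sitemap.xml
sitemap: http://www.yourdomain.com/sitemap-image.xml

User-Agent: *
Disallow: /wp-admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns the declared sitemap URLs in file order, or None if there are none
sitemaps = parser.site_maps() or []
for url in sitemaps:
    print(url)
```

Each printed URL can then be opened in a browser to confirm that the sitemap file really exists under that name.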

The second line, i.e. User-Agent: *, tells ALL good bots to follow the Disallow: and Allow: rules written after it. Custom Disallow:/Allow: rules for a single web bot (such as the Google AdSense bot or the Google Image bot) are discussed later in this post.

The third line, i.e. Disallow: /wp-admin/, instructs bots not to access or scan any page in your site's admin area, such as /wp-admin/, /wp-admin/plugins.php or /wp-admin/themes.php. These pages are meant for your site's administrators, not for third-party bots.

The fourth line, i.e. Disallow: /wp-includes/, instructs bots not to access or scan your /wp-includes/ folder. This folder contains the core files of your installation, which WordPress needs in order to run. It is meant only for your site's administrators and developers.

The fifth line, i.e. Disallow: /wp-content/, instructs bots not to access or scan any part of your /wp-content/ folder, such as /wp-content/plugins/ or /wp-content/themes/. This folder contains important files of all the plugins and themes installed on your site, and third-party web bots shouldn't access them.
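To see the third, fourth and fifth lines in action, here is a small sketch using Python's standard urllib.robotparser module. The bot name ExampleBot and the sample paths (including the mytheme theme folder) are made up for illustration, and Python's parser is only one interpretation of the rules, so real crawlers may differ in the details.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-Agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

BASE = "http://www.yourdomain.com"
paths = [
    "/wp-admin/plugins.php",
    "/wp-includes/js/jquery/jquery.js",
    "/wp-content/themes/mytheme/style.css",
    "/hello-world/",  # an ordinary post, not covered by any rule
]

# can_fetch() answers: may this user agent request this URL?
results = {p: parser.can_fetch("ExampleBot", BASE + p) for p in paths}
for path, allowed in results.items():
    print(path, "->", "allowed" if allowed else "blocked")
```

Everything under the three disallowed folders comes back as blocked, while the ordinary post URL stays allowed.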

The sixth line, i.e. Disallow: /?, instructs bots not to access or scan any URL that begins with a slash followed by a question mark, such as /?s= (WordPress's default search page). These types of pages don't provide any unique content to search engine bots and should be disallowed in the robots.txt file. Sometimes these types of URLs also result in duplicate content problems.

The seventh line, i.e. Disallow: /feed/, instructs bots not to crawl your site's RSS feed page, as this may lead to duplicate content problems.

The eighth line, i.e. Disallow: /comments/feed/, tells bots not to crawl the RSS feed page of your site's comments. This page generally consists of unoptimised content and may also lead to duplicate content issues.

The ninth line, i.e. Disallow: /trackback/, instructs bots not to access WordPress trackback URLs.
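The sixth to ninth lines can be tried out with a deliberately simplified matcher. It is a sketch, not a full robots.txt implementation: it assumes a rule applies whenever the URL (path plus query string) starts with the rule's value, and it ignores the wildcard characters that some crawlers support. The function name is_blocked and the sample URLs are made up for illustration.

```python
DISALLOWED_PREFIXES = ["/?", "/feed/", "/comments/feed/", "/trackback/"]


def is_blocked(path_and_query: str, prefixes=DISALLOWED_PREFIXES) -> bool:
    """Simplified check: a rule applies when the URL starts with its value."""
    return any(path_and_query.startswith(prefix) for prefix in prefixes)


samples = [
    "/?s=wordpress",               # site search on the home page
    "/feed/",                      # main RSS feed
    "/comments/feed/",             # comments RSS feed
    "/trackback/",                 # site-level trackback URL
    "/hello-world/?replytocom=5",  # the "?" does not follow the first slash
    "/hello-world/feed/",          # feed of a single post
    "/hello-world/trackback/",     # trackback URL of a single post
]
for url in samples:
    print(url, "->", "blocked" if is_blocked(url) else "not blocked")
```

Under this reading, only URLs that begin with one of the four values are covered, so the first four samples are blocked and the per-post URLs are not.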

Custom Rules for Specific Web bots

After the ninth line of the above robots.txt file, you will see the following lines:

User-agent: Googlebot-Image
Allow: /wp-content/uploads/

User-agent: msnbot-media
Allow: /wp-content/uploads/

User-agent: MSNBOT_Mobile
Allow: /

User-agent: Googlebot-Mobile
Allow: /

User-agent: MediaPartners-Google
Allow: /

These lines are custom rules for specific web bots, viz. the Google Image bot, the MSN/Bing Media (image) bot, the MSN Mobile bot, the Google Mobile bot and the AdSense bot. These rules instruct the bots as follows:

  • For Google Image bot and MSN Media Bot: Allow: /wp-content/uploads/ instructs the image bots of both Google and Bing to crawl the /wp-content/uploads/ folder of your installation without any restriction. This folder contains all the files that you have uploaded to your site, such as image files (.jpg, .png, .gif), video files (.mp4, .flv) and documents (.pdf, .doc, .docx). Other bots can't access this folder because of the fifth line, i.e. Disallow: /wp-content/, under User-Agent: *.
  • For MSN Mobile Bot, Google Mobile Bot and AdSense bot: Allow: / instructs the mobile and AdSense bots to access ALL parts of your site without any restriction. Other bots will follow the rules specified under User-Agent: *.
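Here is a short sketch of how the custom groups play out, again using Python's standard urllib.robotparser module on a shortened copy of the file. The image path and the bot name ExampleBot are made up for illustration, and Python's parser is only one interpretation of the rules.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-Agent: *
Disallow: /wp-content/

User-agent: Googlebot-Image
Allow: /wp-content/uploads/

User-agent: MediaPartners-Google
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

image_url = "http://www.yourdomain.com/wp-content/uploads/2012/01/photo.jpg"

# A bot with its own group is judged by that group; every other bot falls back to "*"
checks = {
    bot: parser.can_fetch(bot, image_url)
    for bot in ("Googlebot-Image", "MediaPartners-Google", "ExampleBot")
}
for bot, allowed in checks.items():
    print(bot, "->", "allowed" if allowed else "blocked")
```

The two named bots may fetch the uploaded image, while ExampleBot is stopped by Disallow: /wp-content/. Keep in mind that a bot which has its own group follows only that group, so any Disallow: line from User-Agent: * that should also bind such a bot has to be repeated in its group.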

Note:

  • You can always add more user agents and custom rules to your robots.txt file as your site requires. Feel free to check the user agents of all Google bots on this page.
  • You can always add more directories to your site's robots.txt file as per your requirements. For example, suppose you have uploaded pictures of your dog to the /pets/ folder of your site using FTP (assuming this folder is completely outside your WordPress installation) and don't want search engines to crawl this particular folder. All you need to do is enter Disallow: /pets/ under User-Agent: *.
  • Good bots are web bots that follow the Robots Exclusion Standard, while bad/rogue bots don't follow anything! If you'd like to stop bad bots from accessing certain parts of your site, you need to block them using your site's .htaccess file or contact your web host.
  • If you need any help with your WordPress site’s robots.txt file, then feel free to post your query in the comments section below.
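For the /pets/ example in the notes above, the new rule has to land inside the User-Agent: * group and not under one of the bot-specific groups. The following Python sketch shows one way to do that edit on the file's text. The helper name add_disallow is made up for illustration, and it assumes the simple layout used in this post (one User-agent line per group).

```python
def add_disallow(robots_txt: str, path: str) -> str:
    """Append 'Disallow: <path>' after the last rule of the 'User-Agent: *' group."""
    lines = robots_txt.splitlines()
    in_star_group = False
    insert_at = None
    for i, line in enumerate(lines):
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            # a new group starts here; remember whether it is the "*" group
            in_star_group = value == "*"
        elif in_star_group and field in ("allow", "disallow"):
            insert_at = i + 1  # position right after the latest "*" rule
    if insert_at is None:
        raise ValueError("no rules found under User-Agent: *")
    lines.insert(insert_at, f"Disallow: {path}")
    return "\n".join(lines) + "\n"


original = """\
User-Agent: *
Disallow: /wp-admin/
Disallow: /trackback/

User-agent: Googlebot-Image
Allow: /wp-content/uploads/
"""

updated = add_disallow(original, "/pets/")
print(updated)
```

The new line ends up directly after Disallow: /trackback/, before the blank line that separates the Googlebot-Image group.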