The robots meta tag and the robots.txt file are two different and independent mechanisms to pass information to search engine robots. Both allow you to specify which parts of your website should be indexed by search engines and which should not. Both are very powerful, but they should also be used with care, since small mistakes can have serious consequences!
Robots.txt is used to block system folders, like the /plugins folder that ships with a Joomla installation by default. The robots meta tag is usually used more specifically to block certain pages. As an example, Google does not want your internal search pages in its index (see www.seroundtable.com/google-block-search-results-pages-24279.html), and you should use the robots meta tag to block these. So, in short: robots.txt tells Google "do not go here", while the robots meta tag tells Google "do not index me". These are two really different things!
Neither solution replaces the other; each has its specific purpose. Do not use them at the same time for the same URL: if a page is blocked in robots.txt, Google never crawls it and therefore never sees its robots meta tag. I will discuss both solutions in depth.
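To make the difference concrete, here is a minimal sketch of both mechanisms (the paths are just examples): the robots.txt rule stops Google from crawling a folder at all, while the meta tag lets Google crawl the page but keeps it out of the index.
# In robots.txt: "do not go here"
User-agent: *
Disallow: /administrator/
<!-- In the <head> of an individual page: "do not index me" -->
<meta name="robots" content="noindex">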
The configuration of the robots.txt file takes place outside the Joomla administrator: you simply open and edit the actual file. The robots.txt file basically contains information about which parts of the site should be made available for crawling. It is there especially for the search engine bots that crawl your website to determine which pages should become part of the index. By default, search engines are allowed to crawl everything, so if parts of the site need to be blocked, you need to specify them explicitly.
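As a minimal sketch of the format (this is not the Joomla default, just an illustration): the file below applies to all bots and explicitly allows everything, because an empty Disallow value blocks nothing.
# Applies to all bots; an empty Disallow blocks nothing
User-agent: *
Disallow: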
Note that blocking URLs in robots.txt does not prevent Google from indexing the page; it simply stops crawling it. Just check this result for the Raven Tools SEO software, which is actually high up in the rankings:
So if you want to be absolutely sure a page is not indexed, you should use the robots meta tag, as described further down this page.
Back to the robots.txt file: Joomla ships with a standard robots.txt file which should work fine for most sites. Older sites are the exception: in older Joomla versions, the file blocked the /images, /media and /templates folders. This prevents the images and resource files of your site from being indexed, which of course you do not want. Therefore, either comment out these lines or remove them completely:
User-agent: *
Disallow: /administrator/
Disallow: /cache/
Disallow: /cli/
Disallow: /components/
# Disallow: /images/ <-------- Commented out using # or remove them
Disallow: /includes/
Disallow: /installation/
Disallow: /language/
Disallow: /libraries/
Disallow: /logs/
# Disallow: /media/ <-------- Commented out using # or remove them
Disallow: /modules/
Disallow: /plugins/
# Disallow: /templates/ <-------- Commented out using # or remove them
Disallow: /tmp/
As you can see, the file is mainly used to block system folders. Next to this, you can also use the file to block specific pages, like login or 404 pages, but this is better done using the robots meta tag.
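If you do want to block a specific page in robots.txt anyway, the rule is simply a longer Disallow path; the URL below is purely hypothetical and depends on your own menu structure:
User-agent: *
Disallow: /component/users/login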
You can also check whether your robots.txt file works well using the Blocked URLs section of your Google Webmaster Tools.
Advanced users can use the robots.txt file to block pages from being indexed using pattern matching. You could, for example, block any URL containing a '?' to prevent duplicate content from non-SEF URLs:
User-agent: *
Disallow: /*?*
Needless to say, you need to be cautious with this. More examples can be found on searchengineland.com.
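Another pattern that can be useful (again, only if you are absolutely sure you want these files out of Google) is the $ sign, which Google treats as the end of a URL. The hypothetical example below would block all PDF files:
User-agent: *
Disallow: /*.pdf$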
A remark that Google recently made regarding mobile sites (see this video of Google's Matt Cutts) is the following:
Don't block CSS, Javascript and other resource files by default. This prevents Googlebot from properly rendering the page and understanding that it's optimized for mobile.
This is why the /templates and /media folders are no longer blocked in Joomla installs since July 2014. Make sure, though, that none of your resource files are blocked. If you use a plugin like JCH-Optimize, which combines multiple CSS and Javascript files into single files, you may need to specify an Allow rule for that, like this:
Allow: /plugins/system/jch_optimize/assets2/
Allow: /plugins/system/jch_optimize/assets/
Google has become more strict regarding robots.txt in 2014. It is more picky about blocked resources (CSS and JS), but it also introduced some tooling in your webmaster account to help you troubleshoot issues. First of all, this concerns the robots.txt Tester that you can find under the Crawl options:
In this case, there are no errors or warnings, but if there are any, you will be notified. Mind that this is just a basic check on the validity of the lines you typed in; it does not check whether the blocked resources are crucial for the display of the site.
This is where the Fetch as Google tool comes in handy. I really advise you to check your site with this tool, you may find astonishing results! The tool tries to render your site through the eyes of the Google bots crawling it. Now let's see how our site looks through this tool.
The result may be a green tick box, but if the result is 'Partial', you are not done yet! Click on the result, and a new page will open up. Your site could then look like this:
This could be the display you get when Google encounters a block for your /templates folder, where all your CSS and JS resides. Which resources are blocked is easy to find: Google tells you right here. An advanced blog article on this topic is right here. Make sure you check this, as it could actually have an impact on search rankings due to Google not being able to correctly render your site. Specifically, it cannot tell whether your site is responsive or not!
Something else: robots.txt can be used to point to your XML sitemap files, especially if they are not located in the root of your website, which is often the case if your sitemap is created by Joomla extensions like PWT Sitemap, OSmap, Jsitemap, etc. Simply look up the sitemap location in the configuration of the extension, and then point to it at the bottom of your robots.txt file, like this:
Sitemap: index.php?option=com_osmap&view=xml&tmpl=component&id=1
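Note that search engines expect a full URL in the Sitemap line, so in practice you would include your domain, something like this (example.com is obviously a placeholder, and the exact query string depends on your sitemap extension):
Sitemap: https://www.example.com/index.php?option=com_osmap&view=xml&tmpl=component&id=1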
Every now and then the Joomla project releases updates to the robots.txt file, for example to stop blocking certain folders. When they do, they do not simply distribute a new robots.txt file, because that would overwrite any customizations you have made. Instead, they distribute a file called robots.txt.dist. If you never made any customizations, you can simply delete your existing robots.txt file and rename robots.txt.dist to robots.txt.
If you did customize it, simply check what has changed and copy that change to your customized file. Usually, you will be notified of changes like this in the post-installation messages in your Joomla dashboard. By the way, the same routine applies to .htaccess changes.
The robots meta tag is a better method to block content from being indexed, but you can only use it for URLs, not for system folders. It is a very effective method to keep stuff out of the Google index. In Joomla, you can specify the tag in a number of locations, basically parallel to other SEO settings like the meta descriptions. On a global level, most sites should leave the default as set in the Global Configuration screen, under the Metadata Settings. As you can see, you can set 4 combinations of settings:
Unless you want to hide your site from search engines (useful for development), leave the default Index, Follow. For specific pages, you can override this, either from the article or from the menu item. For example: search result pages should not be indexed, but you would like the links on them to be followed, so set the tag to No index, Follow. You can find more info on this in the e-book.
When you use the tag, you effectively create the following code in your HTML, so you can easily check whether your configuration is correct:
<meta name="robots" content="NOINDEX, NOFOLLOW">
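And for the search result pages mentioned earlier, where you want the page kept out of the index but its links followed, the generated tag would look something like this (the exact casing may vary):
<meta name="robots" content="noindex, follow">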
One warning: if you use Noindex, Nofollow to hide your development site, make sure to change this once the site goes live (it happened to me...), otherwise your SEO scores will be very bad. For some further reading on this subject, check out this post on Moz.com.
Joomlaseo.com is fully built and written by Simon Kloostra, Joomla SEO Specialist and Webdesigner from the Netherlands. I have also published the Joomla 3 SEO & Performance book. Next to that, I sometimes blog for companies like OStraining, TemplateMonster, SEMrush and others. I have also published a few articles in the monthly Joomla Community Magazine.