Robots.txt + meta tag
The robots meta tag and the robots.txt file are 2 different, independent mechanisms for passing instructions to search engine robots. Between them, they let you control which parts of your website search engines crawl and index, and which they do not. Both are very powerful, but they should be used with care, since small mistakes can have serious consequences!
Differences between robots.txt & robots meta
Robots.txt is used to block system folders, like the /plugins folder that ships with a Joomla installation by default. It tells Googlebot to bugger off when it tries to crawl the file or folder, which makes it blind to any content contained within.
The robots meta tag is used more specifically to block the indexing of certain pages. As an example, Google does not want your internal search pages in its index (see www.seroundtable.com/google-block-search-results-pages-24279.html), and you should use the robots meta tag to block these. So, in short: robots.txt tells Google "do not go here", while the robots meta tag tells Google "do not index me". These are 2 really different things!
The two solutions do not replace one another; each has its own specific purpose. Do not use them both on the same URL: if robots.txt blocks a page, Google never crawls it and therefore never sees the robots meta tag on it. I will discuss both solutions in depth.
The configuration of the robots.txt file takes place outside of the Joomla administrator area: you simply open and edit the actual file. The robots.txt file contains instructions about which parts of the site robots may crawl. It is there especially for the search engine bots that crawl websites to determine which pages should be made part of the index. By default, engines are allowed to crawl everything, so if parts of the site need to be blocked, you have to list them explicitly.
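As a minimal sketch of the "everything is allowed unless you block it" behaviour described above, Python's standard `urllib.robotparser` module can evaluate a robots.txt file roughly the way a well-behaved crawler would. The rules and the `www.example.com` URLs are illustrative assumptions, not the contents of a real Joomla file, and this parser only does simple prefix matching.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: one system folder blocked for all bots.
ROBOTS_TXT = """\
User-agent: *
Disallow: /plugins/
"""


def is_crawl_allowed(robots_txt: str, url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the given robots.txt text lets user_agent crawl url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://www.example.com/plugins/system/",   # inside the blocked folder
        "https://www.example.com/blog/my-article",   # not mentioned, so allowed
    ):
        print(url, is_crawl_allowed(ROBOTS_TXT, url))
```

Anything not matched by a Disallow line stays crawlable, which is why only the folders you want to block need to be listed.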
Note that blocking URLs in robots.txt does not prevent Google from indexing the page. Google will simply stop crawling it. A search result for the Raven Tools SEO software illustrates this: it ranked high, and its URL was blocked in robots.txt, yet it was still being indexed.
So if you want to be absolutely sure not to be indexed, you should use the robots meta tag, see further down this page.
Back to the robots.txt file: Joomla ships with a standard robots.txt file that should work fine for most sites, except for older sites. In older Joomla versions, it blocked the /images, /media and /templates folders. This prevents the images and CSS of your site from being crawled, which of course you do not want. Therefore, should you still see these lines in your robots.txt file, remove them completely:
# Disallow: /images/ <-------- Remove
# Disallow: /media/ <-------- Remove
# Disallow: /templates/ <-------- Remove
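To check an existing file for these leftover rules, something like the following sketch could be used. The function name is hypothetical, and for simplicity it ignores which User-agent group a rule belongs to; it only looks for active Disallow lines on the three folders named above.

```python
# Folders that older Joomla robots.txt files blocked, per the text above.
ASSET_FOLDERS = ("/images/", "/media/", "/templates/")


def find_blocked_asset_folders(robots_txt: str) -> list[str]:
    """Return the asset folders that still have an active Disallow rule."""
    blocked = []
    for raw_line in robots_txt.splitlines():
        # Everything after '#' is a comment in robots.txt, so drop it.
        line = raw_line.split("#", 1)[0].strip()
        if not line.lower().startswith("disallow:"):
            continue
        path = line.split(":", 1)[1].strip()
        if path in ASSET_FOLDERS:
            blocked.append(path)
    return blocked


if __name__ == "__main__":
    old_style = "User-agent: *\nDisallow: /administrator/\nDisallow: /images/\n"
    print(find_blocked_asset_folders(old_style))
```

An empty result means none of the three folders is blocked any more.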
As you can see, the file is mainly used to block system folders. You can also use it to keep crawlers away from specific pages, like login or 404 pages, but this is better done using the robots meta tag.
You can also check whether your robots.txt file works well in Google Search Console (formerly Google Webmaster Tools), which reports the URLs that are blocked by robots.txt.
Advanced tweaking with robots.txt
Advanced users can use the robots.txt file to block groups of URLs from being crawled using pattern matching. You could, for example, block any URL containing a '?' to prevent duplicate content from non-SEF URLs (which is not an advised practice nowadays).
Needless to say, you need to be cautious with this. More examples can be found on searchengineland.com.
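Under Google's pattern-matching rules, `*` matches any run of characters and a trailing `$` anchors the end of the URL, so a rule of the kind described above is commonly written as `Disallow: /*?`. Python's built-in `urllib.robotparser` does not implement these wildcards, so the sketch below uses a small hand-written matcher (hypothetical helper names) to show which URLs such a rule would catch. It is an approximation for illustration, not Google's actual implementation.

```python
import re


def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern ('*' and trailing '$') to a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces, then let each '*' match any run of characters.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_blocked(pattern: str, path_and_query: str) -> bool:
    """Return True if a Disallow rule with this pattern matches the URL path."""
    return robots_pattern_to_regex(pattern).match(path_and_query) is not None


if __name__ == "__main__":
    rule = "/*?"  # i.e. "Disallow: /*?"
    print(is_blocked(rule, "/index.php?option=com_content&view=article&id=1"))
    print(is_blocked(rule, "/blog/my-article"))
```

Testing a rule against a handful of real URLs from your site before deploying it is a cheap way to avoid blocking more than intended.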
A remark that Google recently made regarding mobile sites (see this video with Google's Matt Cutts talking) is the following:
Point to your sitemap
Something else: robots.txt can be used to point to your XML sitemap files, especially if they are not located in the root of your website, which is often the case if your sitemap is created by Joomla extensions like PWT Sitemap, OSmap, Jsitemap, etc. Look up the sitemap location in the configuration of the extension, and then point to it with a Sitemap: line containing the full sitemap URL at the bottom of your robots.txt file.
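A sitemap reference of this kind can be sketched and verified as follows. The sitemap URL is a made-up placeholder (the real one depends on your extension's settings), and `site_maps()` is assumed to be available, which requires Python 3.8 or newer.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with a Sitemap line at the bottom.
SITEMAP_ROBOTS_TXT = """\
User-agent: *
Disallow: /plugins/

# Placeholder location; use the URL shown in your sitemap extension.
Sitemap: https://www.example.com/sitemaps/sitemap.xml
"""


def list_sitemaps(robots_txt: str) -> list[str]:
    """Return all sitemap URLs declared in the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return parser.site_maps() or []


if __name__ == "__main__":
    print(list_sitemaps(SITEMAP_ROBOTS_TXT))
```

The Sitemap line takes a full, absolute URL and is independent of the User-agent groups above it.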
Joomla updates and changes to robots.txt
Every now and then the Joomla project releases updates to the robots.txt file, for example to stop blocking certain folders. When they do, they do not simply distribute a new robots.txt file, because it would overwrite any customizations you made. Instead, they distribute a file called robots.txt.dist. If you never made any customizations, you can simply delete your existing robots.txt file and rename robots.txt.dist to robots.txt.
If you did customize it, though, check what has changed and copy those changes to your customized file. Usually, you will be notified of changes like this in the post-installation messages in your Joomla dashboard. By the way, the same routine applies to .htaccess changes.
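Spotting what changed is easier with a diff. As a rough sketch, Python's standard `difflib` can compare your live robots.txt with the shipped robots.txt.dist; the file contents below are invented for illustration, and in practice you would read both files from your site root.

```python
import difflib


def robots_changes(current: str, dist: str) -> list[str]:
    """Return unified-diff lines going from robots.txt to robots.txt.dist."""
    return list(
        difflib.unified_diff(
            current.splitlines(),
            dist.splitlines(),
            fromfile="robots.txt",
            tofile="robots.txt.dist",
            lineterm="",
        )
    )


if __name__ == "__main__":
    # Hypothetical contents: a customized live file and a newer .dist file.
    current = "User-agent: *\nDisallow: /images/\nDisallow: /plugins/\nDisallow: /private/\n"
    dist = "User-agent: *\nDisallow: /plugins/\n"
    print("\n".join(robots_changes(current, dist)))
```

Lines starting with `-` exist only in your file (either your own customization or a rule Joomla has since dropped), and lines starting with `+` exist only in robots.txt.dist, so each one still needs a manual decision.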
Robots meta tag
The robots meta tag is a better method to block content from being indexed, but you can only use it for pages (URLs), not for system folders. It is a very effective method to keep stuff out of the Google index. In Joomla, you can specify the tag in a number of places, basically parallel to other SEO settings like the meta descriptions. On a global level, most sites should leave the default as set in the Global Configuration screen, under the Metadata Settings. There, you can set 4 combinations of settings: Index, Follow; No index, Follow; Index, No follow; and No index, No follow.
Unless you want to hide your site from search engines (useful for development), leave the default Index, Follow. For specific pages, you can override this, either from the article or from the menu item. For example: internal search pages should not be indexed, but you would like the links to be followed: set the tag to No index, follow. You can find more info on this in the E-book.
When you use the tag, you effectively create the following code in your HTML, so you can easily check if your configuration is correct:
<meta name="robots" content="NOINDEX, NOFOLLOW">
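To check a page without reading the HTML source by hand, a small script can pull out the robots directives. This is a minimal sketch using Python's standard `html.parser`; the class and function names are hypothetical, and fetching the page itself is left out.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        # Directives are comma-separated and case-insensitive.
        self.directives += [d.strip().lower() for d in content.split(",") if d.strip()]


def robots_directives(html: str) -> list[str]:
    """Return the lower-cased robots directives found in the given HTML."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="NOINDEX, NOFOLLOW"></head></html>'
    print(robots_directives(page))
```

Running this against a live page before launch is a quick way to confirm that a development-only Noindex setting is no longer present.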
One warning: if you use Noindex, Nofollow to hide your development sites, make sure to change this once the site goes live (it happened to me...), otherwise search engines will keep the live site out of their index and your SEO scores will be very bad... For some further reading on this subject, check out this post on Moz.com.
In 2019, Google introduced advanced settings for the robots meta tag that control how your snippets are shown. There are many combinations possible, but for most sites, the following combination of settings is advised for optimal visibility in the search results:
<meta name="robots" content="max-snippet:-1, max-image-preview:large, max-video-preview:-1" class="4SEO_robots_tag">
If you want to read more, you can start at this blogpost: searchengineland.com/google-adds-new-snippet-controls-to-enable-control-over-how-your-search-listings-are-displayed-322456.
Achieving this requires an extension that supports these settings. Some SEO extensions do; 4SEO is one of them.