The robots meta tag and the robots.txt file are two different, independent mechanisms for passing information to search engine robots. They allow you to specify which parts of your website should be indexed by search engines and which should not. Both are very powerful, but for that reason they should also be used with care, since small mistakes can have serious consequences!
Neither solution replaces the other; each has its own specific purpose. I will discuss both in depth.
The configuration of the robots.txt file takes place outside the Joomla administrator: you simply open and edit the actual file. The robots.txt file basically contains information about which parts of the site should be made publicly available. It is there especially for the search engine bots that crawl websites to determine which pages should become part of the index. By default, search engines are allowed to crawl everything, so if parts of the site need to be blocked, you have to specify them explicitly.
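For reference, the file is built from simple User-agent and Disallow lines. The sketch below uses a couple of system folders that Joomla's default file actually blocks:

```
# Rules below apply to all crawlers
User-agent: *
# Keep bots out of these system folders
Disallow: /administrator/
Disallow: /cache/
```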
Note that blocking URLs in robots.txt does not prevent Google from indexing the page. It simply stops Google from crawling the page; the URL itself can still show up in the search results. Just check this result for the Raven Tools SEO software, which is actually high up in the rankings:
So if you want to be absolutely sure a page is not indexed, you should normally use the robots meta tag, discussed further down this page. However, a recent blog post by Deepcrawl mentioned that you can also use a Noindex setting in robots.txt. It is not a generally advised solution though; usage is experimental. I am currently testing whether this actually works.
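For the record, such a Noindex line follows the same syntax as a Disallow rule. The path below is a made-up example, and keep in mind this directive is experimental and not officially supported:

```
User-agent: Googlebot
# Experimental: ask Google not to index this (hypothetical) page
Noindex: /my-private-page/
```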
Back to the robots.txt file: Joomla ships with a standard robots.txt file that should work fine for most sites, except for one quite important issue: it blocks the /images folder. This prevents the images on your site from being indexed, which you usually do not want. Therefore, either comment out these lines using #, or remove them completely:
# Disallow: /images/
# Disallow: /media/
# Disallow: /templates/
As you can see, the file is mainly used to block system folders. You can also use it to prevent specific pages from being indexed, such as login or 404 pages, but this is better done with the robots meta tag.
You can also check whether your robots.txt file works well using the Blocked URLs section of your Google Webmaster Tools.
Advanced tweaking with robots.txt
Advanced users can use the robots.txt file to block pages from being indexed using pattern matching. You could, for example, block any URL containing a '?' to prevent duplicate content from non-SEF URLs:
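A sketch of such a rule, using the * wildcard that Google supports, could look like this:

```
User-agent: *
# Block every URL that contains a question mark (non-SEF URLs)
Disallow: /*?
```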
Needless to say, you need to be cautious with this. More examples can be found on searchengineland.com.
A remark that Google recently made regarding mobile sites (see this video with Google's Matt Cutts) is the following:
Test robots.txt in Google Webmaster Tools
Google has become more strict about robots.txt since 2014. It is pickier about blocked resources (CSS and JS), but it also introduced tooling in your webmaster account to help you troubleshoot issues. First of all, there is the robots.txt Tester, which you can find under the Crawl options:
In this case there are no errors or warnings, but if there are, you will be notified. Note that this is just a basic check of the validity of the lines you typed in; it does not check whether the blocked resources are crucial for the display of the site.
This is where the Fetch as Google tool comes in handy. I really advise you to check your site with this tool; you may find astonishing results! It renders your site through the eyes of the Google bots crawling it. Now let's see how our site looks through this tool.
The result may be a green tick box, but if the result is Partial, you are not done yet! Click on the result, and a new page will open up. Your site could look like this:
This could be the display you get when Google encounters a block on your /templates folder, where all your CSS and JS resides. Which resources are blocked is easy to find out: Google tells you right here. An advanced blog article on this topic is right here. Make sure you check this, as it can actually affect your search rankings if Google is unable to render your site correctly. Specifically, it cannot tell whether your site is responsive or not!
Point to your sitemap
Something else robots.txt can be used for is pointing to your XML sitemap files, especially if they are not located in the root of your website, which is often the case if your sitemap is created by Joomla extensions like Xmap, OSMap, JSitemap, etc. Look up the sitemap location in the configuration of the extension, and then simply point to it at the bottom of your robots.txt file, like this:
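For example (the exact URL depends on your extension and your site, so this one is just a placeholder):

```
Sitemap: https://www.example.com/sitemap.xml
```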
Joomla updates and changes to robots.txt
Every now and then the Joomla project releases updates to the robots.txt file, for example to stop blocking certain folders. When they do, they do not simply distribute a new robots.txt file, because it would overwrite any customizations you have made. Instead, they distribute a file called robots.txt.dist. If you never made any customizations, you can simply delete your existing robots.txt file and rename robots.txt.dist to robots.txt.
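If you have shell access, the swap is quick. The demo below runs in a scratch directory with stand-in file contents, so adapt the paths to your own Joomla root:

```shell
# Work in a scratch directory for this demo; on a real site, use your Joomla root.
mkdir -p /tmp/joomla-robots-demo && cd /tmp/joomla-robots-demo

# Stand-ins for the old file and the newly shipped robots.txt.dist
printf 'User-agent: *\nDisallow: /images/\n' > robots.txt
printf 'User-agent: *\nDisallow: /administrator/\n' > robots.txt.dist

# No customizations were made, so drop the old file and promote the new one
rm robots.txt
mv robots.txt.dist robots.txt
cat robots.txt
```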
If you did customize it, simply check what has changed and copy the change into your customized file. Usually you will be notified of changes like this in the post-installation messages in your Joomla dashboard. By the way, the same routine applies to .htaccess changes.
Robots meta tag
The robots meta tag is a better method to block content from being indexed, but you can only use it for URLs, not for system folders. It is a very effective way to keep stuff out of the Google index. In Joomla, you can set the tag in a number of locations, basically parallel to other SEO settings like the meta descriptions. On a global level, most sites should leave the default as set in the Global Configuration screen, under the Metadata Settings. As you can see, you can choose between four combinations of settings:
Unless you want to hide your site from search engines (useful for development), leave the default Index, Follow. For specific pages you can override this, either from the article or from the menu item. For example: search result pages should not be indexed, but you would like the links on them to be followed, so set the tag to No index, follow. You can find more info on this in the E-book.
When you use the tag, you effectively create the following code in your HTML, so you can easily check if your configuration is correct:
<meta name="robots" content="NOINDEX, NOFOLLOW">
One warning: if you use Noindex, Nofollow to hide your development sites, make sure to change this once the site goes live (it happened to me...), otherwise your SEO scores will suffer badly. For some further reading on this subject, check out this post on Moz.com.