WordPress Duplicate Content Prevention with robots.txt

To prevent the duplicate content issues in Google that can arise with any version of WordPress, here is the typical content of the robots.txt file:


User-agent: *
Disallow: /comments/feed/
Disallow: /feed/
Disallow: /feed/atom/
Disallow: /feed/rss/
Disallow: /rss/
Disallow: /trackback/
Disallow: /wp/
Disallow: /*/comments/feed/$
Disallow: /*/feed/$
Disallow: /*/feed/atom/$
Disallow: /*/feed/rss/$
Disallow: /*/rss/$
Disallow: /*/trackback/$

The above code is based on the following assumptions:

  • The WordPress address (URL) is in the form http://www.yoursite.com/wp
  • The robots.txt file is located at http://www.yoursite.com/robots.txt
  • The above robots.txt file content is to be used with the AdFlex WordPress theme
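
A quick way to sanity-check these rules before relying on them is to emulate the wildcard matching that Googlebot applies, where * matches any sequence of characters and a trailing $ anchors the end of the URL path. The Python 3 sketch below is not from the original article and the sample paths are made up; simple prefix-matching robots.txt parsers may not understand these wildcard extensions, which is why the patterns are translated to regular expressions by hand.

import re

# The Disallow patterns from the robots.txt above.
DISALLOW_RULES = [
    "/comments/feed/", "/feed/", "/feed/atom/", "/feed/rss/", "/rss/",
    "/trackback/", "/wp/",
    "/*/comments/feed/$", "/*/feed/$", "/*/feed/atom/$", "/*/feed/rss/$",
    "/*/rss/$", "/*/trackback/$",
]

def rule_to_regex(rule):
    """Translate one Disallow pattern into an anchored regular expression."""
    anchored = rule.endswith("$")
    body = rule[:-1] if anchored else rule
    # "*" matches any run of characters; everything else is matched literally.
    pattern = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile("^" + pattern + ("$" if anchored else ""))

REGEXES = [rule_to_regex(r) for r in DISALLOW_RULES]

def is_blocked(path):
    """Return True if the URL path matches any Disallow rule."""
    return any(rx.search(path) for rx in REGEXES)

# Hypothetical paths: feeds and trackbacks should be blocked,
# normal posts and pages should stay crawlable.
for path in ["/feed/", "/my-first-post/trackback/", "/my-first-post/feed/",
             "/my-first-post/", "/about/"]:
    print(path, "->", "blocked" if is_blocked(path) else "allowed")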
Posted on Apr 2nd, 2007 by VK

19 thoughts on “WordPress Duplicate Content Prevention with robots.txt”

  1. Mike

    Hey VK,

    Another question. I added the lines to my robots.txt a few days ago but I haven’t seen any changes yet.

    How long does it take to be effective?

    Do I need to add other things somewhere else?

    Mike

  2. VK Post author

    Hi Mike,

    “I added the lines to my robots.txt a few days ago but I haven’t seen any changes yet. How long does it take to be effective? Do I need to add other things somewhere else?”

    It’s taken into account by Google and the other search engines during their next crawl of your site.

    Now, if there are already items in your SERPs that you want to remove, you’ll have to use the URL removal tools provided by Google, Yahoo and MSN.

    I suppose you’re mainly interested in Google, as it’s currently the only search engine that may flag duplicate content when similar items appear in your SERPs.

    For Google, you need to use the Google URL Remover tool (also referred to as the Google Automatic URL Removal System). It’s located at: http://services.google.com/urlconsole/controller

    This tool is not as robust as other Google tools you may know, but it does the job in most cases. It is down from time to time. It’s also not as secure as other Google tools, since the Remove URL tool is accessible over HTTP rather than HTTPS.

    The process is simple:
    1) After you’ve added the content to exclude in your robots.txt file, connect to the URL:
    http://services.google.com/urlconsole/controller
    2) There, just create an account to use the tool. You’ll get an e-mail that you’ll need to confirm.
    3) After confirming the e-mail for the account creation, just connect to the Google URL Remover service.
    4) Select the option “Remove pages, subdirectories or images using a robots.txt file.”
    5) In the field “URL to your robots.txt”, type the URL to your robots.txt. For instance: http://www.yoursite.com/robots.txt
    6) Click on the button “Remove Pages”
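
    Before submitting, you can sanity-check that the robots.txt URL you type in step 5 is publicly reachable and actually serves the Disallow rules you expect. A minimal Python 3 sketch (not part of the original steps; replace the example URL with your own):

    import urllib.request

    ROBOTS_URL = "http://www.yoursite.com/robots.txt"  # replace with your site

    # Fetch the file much as a search engine would and list its Disallow rules.
    with urllib.request.urlopen(ROBOTS_URL, timeout=10) as response:
        status = response.status
        body = response.read().decode("utf-8", errors="replace")

    print("HTTP status:", status)  # should be 200
    disallows = [line.strip() for line in body.splitlines()
                 if line.strip().lower().startswith("disallow:")]
    print("Disallow rules found:", len(disallows))
    for rule in disallows:
        print(" ", rule)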

    In about 2 days, you can check whether Google has already processed the URL removal request by doing a Google site: command, i.e. a Google search for the following string:
    site:www.yoursite.com
    Of course, replace www.yoursite.com with your own domain name, and don’t put http:// there.

    You’ll see that the unwanted URLs that were in your SERPs have been removed, and you’ll get much cleaner SERPs when doing the Google site: command.

    Now some notes:
    a) Be really sure to put the right content in the robots.txt file, because URLs removed with the Google Automatic URL Removal System are kept out of the Google index for 6 months!

    b) As said previously, the Google Automatic URL Removal System at http://services.google.com/urlconsole/controller is often down. If that’s the case, you can also use the alternative URL http://services.google.com:8882/urlconsole/controller, which is up more often.
    The second URL uses port 8882, which means you may not be able to reach it from behind a company firewall that blocks such a non-standard port number.
    Of course, a direct connection to the Internet will work fine most of the time.
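
    If you are unsure whether your network blocks that port, here is a quick connectivity check (a Python 3 sketch, not part of the original instructions; a timeout usually indicates a blocking firewall):

    import socket

    # Check whether outbound connections to the non-standard port 8882 are
    # allowed from this network; a blocked port typically shows up as a
    # timeout or "connection refused" error here.
    try:
        with socket.create_connection(("services.google.com", 8882), timeout=5):
            print("Port 8882 is reachable from this network.")
    except OSError as exc:
        print("Could not reach port 8882:", exc)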

    Hope it helps.

    Cheers,
    VK

  3. Pingback: My Internet Marketing Page » Blog Archive » Wordpress Duplicate Content Prevention with robots.txt

  4. Jameson

    He is spot on. His html is spot on. Follow this article exactly! You will have to wait for google spiders to re-crawl/update your website to see the changes. I give 65 thumbs up to this article and the others. Great content blog here. Add it to your favs if you want to make it in the “Internet” industry.

  5. Ric Raftis

    Great post. I was actually looking for info on the Joomla robots.txt, but this really gave me some ideas about a few things. As a result, I switched off the PDF and print options on the site. I doubt anyone uses them anyway.

    Cheers,

  6. Internet Marketing Company

    Wow, what a find. This post is definitely one of the best WP tips that I have seen. I also write a different excerpt for each post because it shows on the category page. This prevents duplicate content on the category pages. Keep up the good work!

  7. Keith Davis

    Hi VK
    Still trying to get my head around this duplicate content thing with WordPress, posts, archives, categories etc.
    I can handle a static site, you just don’t repeat yourself, but with a dynamic site!
    This certainly helps.

  8. Keith Davis

    Hi VK
    I’m trying to put together a robots.txt file to prevent duplicate content and I notice that Archives and Categories are not excluded in your robots.txt file.

    If they are not excluded, won’t they produce duplicate content?

  9. VK Post author

    In fact, the assumption is that the Archive and Category listings show only the excerpt and not the full post content. The full post content is displayed only when you view an individual page or post.

    If you need even more control over the anti-duplicate SEO settings, you’ll have to not only use the robots.txt file but also change things in WordPress itself, for instance by using an SEO plug-in or an SEO WordPress theme.

    I have created an SEO WordPress theme that comes with about 200 options/settings, many of them SEO-related, to give the user better control over SEO and duplicate content. If you want to look at my SEO WordPress theme, just go to:
    http://www.vklabx.com/wordpress/

    Then just read the PHP code provided with my SEO WordPress theme and, if needed, copy the parts that are of interest to you and include them in your WordPress site.

  10. Joe Jones

    Askvk,

    I have a WordPress site with only two pages. I used all the disallows you listed, but Google’s keyword list for the site shows “disallow” as the most frequently used keyword. How do I stop Google from using words in the robots.txt file as a source of keywords?

    Joe Jones

  11. Al Sefati

    Forgive me if I disagree. I don’t really subscribe to blocking pages with robots.txt for the purpose of duplicate content prevention. That is not what robots.txt was made for.

    Robots.txt was made to prevent search engines from accessing internal pages such as login and admin pages. Any other usage of robots.txt is incorrect.

    I recommend using a rel canonical link for the purpose you mentioned.
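
    (For illustration, with a hypothetical post URL: a canonical link is a tag placed in the page’s <head> that points to the preferred URL for the content, e.g. <link rel="canonical" href="http://www.yoursite.com/sample-post/" />, so that feed, category and other duplicate views of a post all defer to that single URL.)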

  12. Droid Tv

    @ Al Sefati: Where did you hear that Robots.txt was made for, and ONLY for, preventing access to “inside pages such as login and admin”?

    What I have always heard is that robots.txt was made to prevent access to any pages the webmaster does not want indexed.

    @jon: this won’t “help” your rankings, but it will keep you from losing rank due to Google finding duplicate content on your site and assuming your site is stealing or replicating content.
