Oftentimes in SEO discussion communities you come across questions from webmasters asking, ‘If I do XYZ, will it cause a duplicate content penalty?’ The common misconception, ever since Google released its Panda update, is that a duplicate content penalty exists and that you risk having your site removed from Google’s index if the same content appears on different pages of your site. At some point during your website’s content creation you have probably thought about duplicate content: using the same images multiple times across the site; worrying, on an e-commerce site, about category pages appearing at more than one URL with the same product and description; or about your articles being syndicated word-for-word on other sites. So, how much do you really need to worry about duplicate content, and about which kinds? Let’s start with the basics.
What is Duplicate Content?
Any content that is identical to other content that exists either on the same website or a different one.
Examples:
- Your blog content syndicated (copied) onto another website.
- If your home page has multiple URLs serving the same content, for example: http://yoursite.com, http://www.yoursite.com and http://www.yoursite.com/index.htm.
- Pages that have been duplicated due to session ids and URL parameters, such as http://yoursite.com/product and http://yoursite.com/product?sessionid=5486481.
- Pages with sorting options based on time, date, color or other criteria, which can produce duplicate pages, such as http://yoursite.com/category and http://yoursite.com/category?sort=medium.
- Pages with tracking codes and affiliate codes, such as http://yoursite.com/product and http://yoursite.com/product?ref=name.
- Printer-friendly pages created by your CMS that have exactly the same content as your web pages.
- Pages that are served via http before login and https after, creating two URLs for the same content.
What is Not Duplicate Content?
Examples:
- Quotes from other sites, used in moderation on your page inside quotation marks, preferably with a link back to the source.
- Images from other sites or images repeated on your own site(s). (Images are not considered duplicate content, as search engines cannot read the content of an image.)
- Infographics shared via embed codes.
There is no such thing as a duplicate content penalty. You have proof straight from the horse’s mouth at Google here and here. But that does not mean you should take the issue of duplicate content lightly. The repercussion of having duplicate content on your web pages is a loss of traffic, simply because you are “omitted from search results”. That’s right: you are not de-indexed or penalized, but the duplicate content is simply not shown to users in search results. On Google, you may find a message similar to the one shown below:
Google Message at the End of Search Results
If a user clicks the link to repeat the search, they will come across these omitted, duplicate-content pages. The chance of a user actually clicking this link, however, is basically nil, as the message is shown on the last page of results – yes, page 8042, or however many pages a search might return. Besides, if you already have one version of the content, why would you need a repeat of it? This is one way Google refines the user experience of its search engine, and rightly so. So, how does this affect you? There are several ways Google’s handling of duplicate content can affect your site:
- Lose Your Original Content to Omitted Results: If your original blog has been syndicated onto many third-party websites without a link back to your content, there is a good chance that your original content will be omitted and replaced by their content. This is especially true if the third-party site has a higher PageRank, higher influence and/or higher-quality backlinks than your site.
- Waste of Indexing Time for Bots: While crawling your site, search engine bots treat every URL as unique and index the content at each of them. If you have duplicate URLs due to session ids or any of the other reasons mentioned above, the bots waste time indexing repeated content rather than indexing the unique content on your site.
- Multiple Duplicate Links Means Diluted Link Juice: If you build links pointing to a page that has multiple URLs, the passing link juice is distributed among them. If all the pages are consolidated into one, the link juice will also be consolidated which could increase the search rankings of the web page. For more information, see our blog on The Flow of Link Juice.
- Traffic Loss: It is obvious that if your content is not the version Google chooses to show in search results, you will lose valuable traffic to your site.
How Can You Detect Duplicate Content on Your Site?
The simplest and most logical method is to copy and paste a snippet of your content into Google search and see if any other page shows up with exactly the same content. There are other ways as well, and they are as follows:
1.) Google Webmaster Tools:
Check for Duplicate Content on Google Webmaster Tools
Duplicate content is not limited to content present on a web page but can also be content seen in search snippets, such as meta titles and meta descriptions. The duplication of such content can be detected easily via Google Webmaster Tools under Optimization > HTML Improvements, as shown in the screenshot above.
2.) External Tools:
Copyscape.com is an excellent tool for checking for duplicate content on your site. It is web-based and free for basic checks, so it works on any platform.
3.) “Site:” Search Operator:
Search Google using the site: search operator along with part of the content copied from the page, as follows:
site:www.yoursite.com [a part of the content copied from your site here]
If you see a message from Google about omitted results (as shown in the first screenshot in this blog), it is an indication that duplicate content is present, either on your website or outside of it.
So, the final question is…
How Can You Get Rid of Duplicate Content? Here are 7 ways:
Removing duplicate content from your own site is possible, and it is worth the time and effort to make your site as search-engine friendly as possible. Duplicate content on other sites that syndicate your original content should be handled however you prefer: either by sending them a polite email, or by leaving a comment on their blog asking for credit and a link to your original content.
The following are ways to cope with duplicate content generated on your own site:
1. Rel=“canonical”:
When you have multiple URLs serving the same content, choose the URL you would prefer to be displayed in search results. This will be your canonical URL. You must then add a rel="canonical" tag in the <head> section of any other pages carrying the duplicate content. So, for instance, if your preferred page is A and its duplicate is B, the markup of page B should contain the following line of code:
<link rel="canonical" href="Page A URL"/>
Adding this code to the duplicate page suggests to the search bots, quite transparently, that it is a duplicate of the canonical URL mentioned. The bot then knows which page to show in search results and where to point all the incoming link juice.
2. 301 Redirects:
You can use 301 redirects for duplicate pages that are generated automatically and that users never need to see. Whereas a rel="canonical" tag keeps the duplicate page visible to users, a 301 redirect points both search engine bots and users to the preferred page only. This is particularly useful for home page URLs: redirect the WWW URL to the non-WWW URL or vice versa, depending on which URL is used most. Similarly, if you have duplicate content on multiple websites with different domain names, you can use 301 redirects to consolidate the pages onto one URL. NOTE: 301 redirects are permanent, so please be careful when you choose your preferred URL.
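On an Apache server, for example, the www-to-non-www redirect described above can be set up in your site's .htaccess file. This is a minimal sketch, assuming mod_rewrite is enabled and using yoursite.com as a placeholder domain:

```apache
# Permanently (301) redirect www.yoursite.com to yoursite.com
RewriteEngine On
RewriteCond %{HTTP_HOST} ^www\.yoursite\.com$ [NC]
RewriteRule ^(.*)$ http://yoursite.com/$1 [R=301,L]
```

If your site runs on a different server (Nginx, IIS, etc.), the equivalent redirect is configured differently; check your server's documentation.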
3. Meta Robots Tag
You can use the meta robots tag with the noindex value (optionally combined with nofollow) if you need to keep a duplicate page out of a search engine's index. Simply add the following code to the <head> of the duplicate page:
<meta name="robots" content="noindex">
There is another way of excluding duplicate pages from search engine indexes: disallowing the parameterized URLs in the robots.txt file. Note: Google has advised against blocking duplicate content with robots.txt, because if a URL is completely blocked, search engine bots might still find it through links outside your website and treat it as a unique page. Search engines may then even choose this blocked page as the preferred version among the duplicates, even though that was not your intention.
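If you do decide to block parameterized duplicates despite that caveat, the robots.txt rules might look like this (a sketch; the sessionid and ref parameter names are placeholders taken from the example URLs earlier in this post):

```
User-agent: *
# Block URLs carrying session ids or affiliate/tracking parameters
Disallow: /*?sessionid=
Disallow: /*?ref=
```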
4. Google Webmaster Tools:
You can set your preferred URL in your Google Webmaster Tools account under Configuration > Settings > Preferred Domain. Going one step further, you can set URL parameters to drop duplicate pages from Googlebot indexing. This option is also available under Configuration, in the URL Parameters sub-section. However, using this option may cause de-indexing of important pages if it is not properly configured, so it is not recommended unless you are entirely sure how to use it. Learn more about URL parameters in our blog on Clean URLs for SEO and Usability.
Set URL Parameters on Google Webmaster Tools
5. Hash Tag Tracking:
Instead of using tracking parameters in URLs (which creates duplicate pages with the same content), try the hash tag tracking method. Tracking parameters are used to track visits to your site from specific sites, for example, from an affiliate marketer’s site. These parameters usually appear after a question mark (?) in the URL. With the hash tag method, we replace the question mark with a hash tag (#). Why? Well, Google bots tend to ignore anything after a hash tag. So, for example, you might have duplicate URLs like http://yoursite.com/product/ and http://yoursite.com/product/#utm_source=xyz. When you use the hash tag, Google sees both links as http://yoursite.com/product/. To do this, use the _setAllowAnchor method, as illustrated here.
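With the classic Google Analytics (ga.js) snippet, _setAllowAnchor tells Analytics to read campaign parameters that appear after a # instead of a ?. A minimal sketch (the UA-XXXXX-Y property ID is a placeholder):

```html
<script type="text/javascript">
  var _gaq = _gaq || [];
  _gaq.push(['_setAccount', 'UA-XXXXX-Y']);  // replace with your property ID
  _gaq.push(['_setAllowAnchor', true]);      // read campaign params after '#'
  _gaq.push(['_trackPageview']);
</script>
```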
6. Content on Country-Specific Top-Level-Domains:
When your business is spread all over the world, it is natural to have a separate domain for each location, and it is often not possible to create unique content for each of these sites when the product/service is the same. So how do you handle content duplication across your country-specific domains? To start with, go to Google Webmaster Tools > Configuration > Settings for each country-specific domain and choose the country of the target audience for each site, as shown below:
Choose Target Audience on Google Webmaster Tools
- If possible, use a local server for each country-specific domain.
- Enter local addresses and phone numbers on each of the country-specific sites.
- Use geo meta tags. These tags may not be used by Google, as you have already set the target users option in Google Webmaster Tools, but they may come in handy to let secondary search engines, such as Bing, know that your site targets a specific country.
- Use rel=“alternate” hreflang=“x” to let Google bots know more about your foreign pages with the same content and to show which page should be returned for which audience in search results.
Some SEOs may suggest using rel="canonical" to cope with cross-domain duplicates, but it is not yet clear whether this is the right solution for multi-domain pages, since geo-targeted sites need to show up in search results for their respective country-specific searches. For now, we recommend clarifying that your content is geo-targeted so that search engines know which content to show to which audience, avoiding confusion.
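The rel="alternate" hreflang annotation mentioned above goes in the <head> of each page version. A sketch, assuming hypothetical US, UK and German versions of the same page:

```html
<link rel="alternate" hreflang="en-us" href="http://yoursite.com/" />
<link rel="alternate" hreflang="en-gb" href="http://yoursite.co.uk/" />
<link rel="alternate" hreflang="de" href="http://yoursite.de/" />
```

Each version of the page should carry the full set of annotations, including one pointing to itself.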
7. Paginated Content:
When you have content with cohesive components spread across multiple pages and you want to send users to specific pages via search results, use rel="next" and rel="prev" to let search engines know that these pages are part of a sequence. Learn more about implementing these rel attributes in the Google Webmaster Central blog post on Pagination with rel="next" and rel="prev". There is another sort of pagination when it comes to blog comments: disable comments pagination in your CMS, otherwise (on most sites) different URLs for the same content will be created.
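For example, page 2 of a paginated category (using a hypothetical ?page= parameter) would declare its neighbors in its <head> like this:

```html
<!-- In the <head> of http://yoursite.com/category?page=2 -->
<link rel="prev" href="http://yoursite.com/category?page=1" />
<link rel="next" href="http://yoursite.com/category?page=3" />
```

The first page of the series carries only rel="next", and the last page only rel="prev".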
Note: Once you have used these strategies to get rid of duplicate content, remember to update your XML Sitemap by removing duplicate URLs and leaving only the canonical URLs, then re-submit the Sitemap to Google Webmaster Tools. Read our blog on All About XML Sitemaps for more information.
There are also a few things you can do regularly to fight duplicate content on your site. For example, improve your internal linking and always link to your preferred URLs: the more links search engines find pointing to a preferred URL, the easier it is for them to judge which page is the preferred one. Also, on e-commerce sites where products are filtered by color, size or other criteria, every click on a size or color changes the URL via a sorting parameter, creating duplicate content. In such cases, let users choose their selection criteria on the same page, so that the URL does not change.
Let us know in the comments if you have any questions about duplicate content on your site or if you have any suggestions for coping with duplicate content that have not been mentioned in this blog.