What is Duplicate Content?
According to Google, duplicate content refers to text that is repeated on more than one web page either on your site or across different sites. Google takes the issue of duplicate content very seriously. In fact, sites with duplicate content were a key target of Google’s Panda algorithm update.
The more URLs you build for the same page, the more it appears that you are trying to game the system. The bottom line is that every different URL the search engines see on your site should display substantially different content.
According to Google, it takes only a few pages of poor-quality or duplicated content to damage traffic on an otherwise solid site. Google recommends that such pages be removed, blocked from being indexed by the search engine, or rewritten.
However, Matt Cutts also warns that rewriting duplicate content so that it is original may not be enough to recover from Panda. The rewrites must be of “sufficiently high quality, as such content brings additional value to the web.”
He went on to say that content that is general, non-specific, and not substantially different from what is already out there should not be expected to rank well: “Those other sites are not bringing additional value. While they’re not duplicates, they bring nothing new to the table.”
Google puts duplicate content into the same category as low quality content. It’s fair to say that if you don’t have lots of original, remarkable and unique content, you will not rank high on Google. If you have content that exactly matches the content on a high authority site, your webpage will be filtered out of the search results even if it is relevant to the search query.
Issues that duplicate content can create include:
- If the search engine discovers two or more versions of the same content, which does it filter out? Which one does it show?
- Links to duplicate content pages represent a waste of link juice. E.g. if the homepage has 100 natural inbound links spread across several duplicate URLs, the major search engines credit each URL separately instead of consolidating all 100 links on one page.
- The search engines will dilute the metrics of the webpage such as TrustRank, PageRank, authority, etc. They will split them between multiple versions of the same page.
- Search ranking can be impacted negatively.
Note that the search engines may consider two pages duplicates even if only part of the page is duplicated. For instance, if you use the same HTML headings, page title, web copy and Meta description tags on multiple pages, the search engine spider could interpret them as duplicate pages, even if they are not exactly the same.
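To get a feel for how partial duplication can be measured, here is a minimal Python sketch. It assumes a simple word-shingle comparison (Jaccard similarity), which is a common textbook technique and not Google’s actual method. The function names and the sample page text are made up for illustration.

```python
import re


def shingles(text: str, size: int = 3) -> set[tuple[str, ...]]:
    """Return the set of overlapping word n-grams ("shingles") in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def similarity(text_a: str, text_b: str, size: int = 3) -> float:
    """Jaccard similarity: shared shingles divided by all distinct shingles."""
    a, b = shingles(text_a, size), shingles(text_b, size)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical page copy: only one word differs between the two pages.
page_a = "Handmade leather shoes for men. Free shipping on all orders over $50."
page_b = "Handmade leather shoes for women. Free shipping on all orders over $50."
print(f"Similarity: {similarity(page_a, page_b):.2f}")
```

A score of 1.0 means the two texts share every shingle, and 0.0 means they share none. Pages that differ only in a word or two land in between, which is why they can still be treated as near-duplicates.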
Google has confirmed that it doesn’t have a penalty for duplicate content unless the intent of the duplicate content is to be deceptive and manipulate search engine results. Even so, it is hard to imagine that duplicate webpages would be able to rank high on Google.
Even though the page might not be explicitly penalized, the site will not rank high in these circumstances. This means that the site might as well be penalized. To find out whether you have duplicate content on your site, sign up to Copyscape, which is the best known online duplicate content checker tool.
Common Causes of Duplicate Content
Printer-friendly pages are separate pages specifically designed for printing and they are a common cause of duplicate content. Some sites offer two versions of the same page for the convenience of their readers: an HTML version to read online and a text-only version that can be printed.
The printer-friendly page has its own URL, even though it might not have some of the bells and whistles of the HTML version. To the search engines it’s an exact duplicate of the HTML page because it is the content that really counts.
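The sketch below illustrates why the two versions look identical once the markup is stripped away. It uses Python’s standard html.parser module. The class name, the helper function and both sample pages are assumptions made up for this example, and a real crawler does far more than this.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def page_words(html: str) -> list[str]:
    """Return the lowercase words a reader would actually see on the page."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.findall(r"\w+", " ".join(parser.parts).lower())


# Hypothetical pages: a styled HTML version and a plain printer-friendly one.
html_version = (
    "<html><head><style>p{color:red}</style></head><body>"
    '<h1>Shoe care</h1><p class="lead">Polish leather <b>weekly</b>.</p>'
    "<script>track()</script></body></html>"
)
print_version = "<html><body><h1>Shoe care</h1><p>Polish leather weekly.</p></body></html>"

print(page_words(html_version) == page_words(print_version))
```

The styling and scripts differ, but the words on the page are the same, and that is what counts.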
Dynamic Web Pages
A dynamic web page is a web page that the server generates on request, so it can display different content each time it is created. Dynamic content changes frequently, based on the environment or situation. For example, the page may change with the time of day, the user that accesses the webpage, or the type of user interaction.
Dynamic sites create pages which are similar (or very similar) in content with just different URLs. It’s like having one page with a lot of different URLs pointing to it. Session IDs in URLs are a common example of dynamic page generation that causes duplicate content. A session ID is a unique number that a website’s server assigns to a specific user for the duration of that user’s visit (session).
Examples of Session IDs
The session ID can be stored in a cookie, a form field, or the URL. When it is stored in the URL, a new session ID is appended to each page’s URL as the user traverses the site. Closing a browser and then reopening and visiting the site again generates a new session ID.
This generates a different URL for the same page, creating what appears to be duplicate content. Even though the page itself hasn’t changed, the varying parameters at the end of each URL cause search engines to think they’re separate pages. In addition, some URLs with session IDs expire, leaving dead links. So if a search engine indexes such a link, the next time someone clicks on the indexed link, the user will get an error page.
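As a rough illustration, the following Python sketch removes session ID parameters from URLs so that the variants collapse back into one address. It uses the standard urllib.parse module. The parameter names in SESSION_PARAMS and the example.com URLs are assumptions, so check which names your own platform actually uses.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Illustrative parameter names only; your platform may use different ones.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid"}


def strip_session_id(url: str) -> str:
    """Return url with any session-ID query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Hypothetical URLs: the same page reached in three different sessions.
urls = [
    "https://www.example.com/shoes?sid=8f3a2c",
    "https://www.example.com/shoes?sid=b71e90",
    "https://www.example.com/shoes?color=black&sid=44d0aa",
]
for url in urls:
    print(strip_session_id(url))
```

The first two URLs reduce to the same address, while the third keeps its color parameter because that parameter changes what the page shows.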
If your web site uses a slogan, you probably want it to be displayed on every web page throughout your site. However, repeating your company slogan in HTML text throughout your site can cause duplicate content issues. This is because the same lines of text would be repeated on various pages throughout the site.
Duplicate Content in eCommerce Systems
Duplicate content occurs in eCommerce systems when several different URLs are generated for the same product. For example, retail sites may divide the list of items in a large product category into multiple pages. The same product can appear on a special seasonal sales page in addition to its regular listing. Also, if you have separate categories for things like “all shoes” and “men’s shoes”, the same product may be displayed on different URLs.
“Men’s shoes” would fit into both categories and could be reached via two separate URLs, effectively creating duplicate content. Since your site features the same content on different URLs, this can be interpreted by the search engines as a deliberate attempt to boost your SEO through illegitimate means.
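If you have a list of your product URLs and the text each one displays, a short script can show which URLs serve the same content. The Python sketch below groups URLs by a hash of their normalized text. The URL paths and product copy are invented for the example, and exact-match hashing only catches identical text, not near-duplicates.

```python
import hashlib
from collections import defaultdict


def fingerprint(text: str) -> str:
    """Hash the text after lowercasing it and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def group_duplicates(pages: dict[str, str]) -> list[list[str]]:
    """Return groups of URLs whose text is identical after normalizing."""
    groups: dict[str, list[str]] = defaultdict(list)
    for url, text in pages.items():
        groups[fingerprint(text)].append(url)
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]


# Hypothetical catalog: one product listed under three category URLs.
pages = {
    "/all-shoes/oxford-brown": "Brown leather Oxford. Sizes 7 to 12.",
    "/mens-shoes/oxford-brown": "Brown leather  Oxford. Sizes 7 to 12.",
    "/sale/oxford-brown": "Brown leather Oxford. Sizes 7 to 12.",
    "/mens-shoes/loafer-black": "Black suede loafer. Sizes 8 to 11.",
}
for group in group_duplicates(pages):
    print(group)
```

Each printed group is a set of URLs that a search engine would see as the same page.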
For instance, an e-commerce site featuring fashionwear might have its content listed in various ways. A pair of jeans might be listed under ‘summer wear’, ‘casual wear’, ‘menswear’, and so on, and be available for on-site searches with those options. A search engine wants to list only unique pages in its index. Some search engines may decide to combat this issue by cutting off the URLs after a specific number of variable string characters (e.g. ?, & and =).
For example, consider the following URLs:
All three of these URLs point to three different pages. But if the search engine purges the information after the first offending character, the question mark (?), now all three pages look the same:
Now, you don’t have unique pages, and consequently, the duplicate URLs won’t be indexed. If you use dynamic content, it is imperative that you address the issues caused by dynamic content if you want your site to be indexed.
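The effect of this kind of truncation is easy to reproduce. In the Python sketch below, the three example.com URLs are hypothetical stand-ins for three different dynamic pages, and truncate_at_query is an assumed, deliberately crude model of a crawler that discards everything from the first question mark onward.

```python
def truncate_at_query(url: str) -> str:
    """Drop everything from the first "?" onward, as a crude crawler might."""
    return url.split("?", 1)[0]


# Hypothetical URLs standing in for three different dynamic pages.
urls = [
    "https://www.example.com/products.php?id=101",
    "https://www.example.com/products.php?id=102",
    "https://www.example.com/products.php?id=103",
]
seen = {truncate_at_query(url) for url in urls}
print(len(urls), "URLs collapse to", len(seen), "after truncation")
```

Because all three addresses differ only after the question mark, only one distinct URL is left to index.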
Pagination on e-commerce systems can also cause duplicate content issues because the paginated pages can display the same title tag and meta description but have different URLs. For example, suppose your website visitors can search for your products by category (e.g. men’s shoes). The men’s shoes category is well stocked, and a search returns 5 pages. The first few might look like this:
First page displayed in search: www.coolshoes.com/mens-shoes/black/3
Second page: www.coolshoes.com/mens-shoes/mens-shoes/black/3?page=1
Third page: www.mycoolsite.com/category/mens-shoes/black/3?page=2
and so on…
Each of those pages has the same title tag and meta description tag, even though each one lists different products.
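One possible way to avoid this is to build the page number into the title and description when the page is generated. The Python sketch below shows the idea. The page_meta function, the “Cool Shoes” site name and the wording of the tags are all assumptions for illustration, not a recommendation of specific tag text.

```python
def page_meta(category: str, page: int, total_pages: int) -> dict[str, str]:
    """Build a title and meta description that differ on every results page."""
    suffix = "" if page == 1 else f" - Page {page} of {total_pages}"
    return {
        "title": f"{category}{suffix} | Cool Shoes",
        "description": f"Browse {category.lower()}, page {page} of {total_pages}.",
    }


# Hypothetical category with 5 pages of results.
for page in range(1, 4):
    print(page_meta("Men's Black Shoes", page, 5)["title"])
```

The first page keeps the plain category title, and every later page gets its own variation, so no two paginated URLs share the same tags.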
International organizations or companies with multiple websites will repurpose the same content for various locations. For example, a large property company with a number of brokerage offices may offer the same website template to all of their brokers. Such templates are usually customized only with a different city name and local property listings that match their locality.
Similarly, a national product manufacturer may give local representatives and affiliates their “own” sites, but all of them would have the same standard template and content. This leads to some of these companies producing thousands of duplicate websites all promoting the same thing and, according to Google, offering nothing of value to the Internet community.
Press release sites were also heavily impacted by the Panda 4 update with sites like PRWeb, PR Newswire, Business Wire and PRLog losing 60% to 85% of their search visibility overnight. The problem is, press releases are published on multiple press release and news websites. They are also syndicated across thousands of partner sites. This makes them a prime target for the Panda update.
The big upside of syndicating your content is the exposure and increased traffic that you receive as a result. The potential downside is that your content is duplicated across the web. An RSS feed is a typical example of content syndication. RSS is a method of distributing links to new content in your website, and the recipients are people that have subscribed to your RSS feed.
In addition, you can syndicate your articles to other websites or article directories such as ezine.com. However, the very real problem here is that you’ll now have multiple URLs for the same article, which makes your site vulnerable to the Panda algorithmic penalty.
Note that having the same content on different top level domains which represent different countries is not considered to be duplicate content that could get penalized by Google. Matt Cutts has addressed this issue in a video.
How Many Duplicate Pages Does Google Think You Have?
You can find out how many of your web pages are currently indexed, versus how many pages Google considers to be duplicates:
In the Google search query box, type site:yourdomain.com and then click Search.
On the results page, scroll to the bottom and click the highest page number that shows (usually 10). Note that doing this can sometimes cause the total number of pages to recalculate at the top of the page. Notice the total number of pages shown in “Results 1 – 10 of about ###” at the top of the page. The “of about ###” number represents the approximate total number of indexed pages in the site.
Navigate to the very last page of the results. The count shown there represents the filtered results. The difference between these two numbers most likely represents the number of duplicates.
Note that Google doesn’t actually display all of the indexed pages and omits duplicates. To see all of the indexed listings for your site, navigate to the very last results page of your site: query and click the option to repeat the search with the omitted results included. Note, however, that Google will only show up to a maximum of 1,000 listings.
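The arithmetic behind the steps above is a simple subtraction, sketched here in Python. The function name and the sample figures (480 and 312) are made up for illustration. Both numbers are Google’s approximations, so treat the result as a rough estimate, and remember that the filtered count can never exceed the 1,000-listing limit.

```python
def estimated_duplicates(indexed_total: int, filtered_count: int) -> int:
    """Estimate duplicates as indexed pages minus pages shown after filtering."""
    return max(indexed_total - filtered_count, 0)


# Hypothetical figures: "of about 480" on the first results page,
# and 312 results counted on the very last results page.
print(estimated_duplicates(480, 312))
```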
You can use the free Search Engine Saturation tool available from Acxiom Digital to discover the number of indexed pages in Yahoo! and Microsoft Bing.
Identifying Duplicate Content
To check whether you have any duplicate content issues on your site, start with the page title and Meta description tags on your site using the Screaming Frog SEO Spider tool. At the top, where it says enter URL to spider, enter your home page URL and click start. In the results, click on the page titles tab. Then, under the filter, click on duplicate. This automatically organizes your pages so that you can see whether there are any errors or duplicate title tags.
If the tool has identified any duplicate tags, you’ll want to look at the address that it’s saying the duplicate title tags are on. Check whether these addresses are either content pages or product or category pages. If they are, you’ll want to go back and make sure each title is unique.
Repeat the same process for the Meta description tag. Click on Meta description, filter by duplicate, and you can see whether or not you have duplicate Meta descriptions on your product category or blog post pages. And if that’s the case, you want to go back and fix them. Again, you definitely want to check out the actual web page to check that you do have duplicate content issues.
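If you have exported your URLs with their titles and descriptions, the same check can be scripted. The Python sketch below is not part of Screaming Frog. It is a stand-alone illustration in which the find_duplicates function, the field names and the sample pages are all assumed.

```python
from collections import defaultdict


def find_duplicates(pages: list[dict[str, str]], field: str) -> dict[str, list[str]]:
    """Map each repeated value of field to the URLs that share it."""
    by_value: dict[str, list[str]] = defaultdict(list)
    for page in pages:
        # Ignore case and extra whitespace when comparing tag text.
        value = " ".join(page.get(field, "").lower().split())
        if value:
            by_value[value].append(page["url"])
    return {value: urls for value, urls in by_value.items() if len(urls) > 1}


# Hypothetical crawl export.
pages = [
    {"url": "/mens-shoes/black", "title": "Cool Shoes", "description": "Black shoes for men."},
    {"url": "/mens-shoes/brown", "title": "Cool Shoes ", "description": "Brown shoes for men."},
    {"url": "/about", "title": "About Us", "description": "Who we are."},
]
print(find_duplicates(pages, "title"))
print(find_duplicates(pages, "description"))
```

Running it once per field mirrors the two passes described above: one for page titles and one for Meta descriptions.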
To check for content that is duplicated on other sites, start by signing up for a premium subscription at www.copyscape.com and do a Batch Search. You need to enter your sitemap URL for Copyscape to crawl each URL in your sitemap and analyze it against the rest of the web. It will identify all cross-domain duplicate content and rank it by “risk factor.” You will be able to export the report to a CSV file and sort the columns on the basis of how you want to analyze the data.
The service is pretty affordable for small to medium sized websites. However, for larger sites, it can get a little pricey.
Tip: Be sure to remove any links to external websites, or pages that you know do not have duplicate content, after Copyscape crawls your sitemap but before you pay for the service. You don’t want to pay to have pages checked when those pages do not have a problem.
The SiteLiner Tool
The SiteLiner tool is the quickest way to identify and fix internal duplicate content issues on your site. If you have a site under 250 pages the service is free. Anything over 250 will cost a penny per URI. To use the tool, start by entering your homepage URL into the box provided.
Once you’ve entered in your domain name, you’ll be taken to a list of pages hosted on your site in a summary tab. You’ll want to click on the Duplicate Content tab on the left (where you’ll see your overall percentage of duplicate content).
On the duplicate content screen you’ll see the following columns:
- Match Words – This shows the number of duplicated words that are matched on this page.
- Match Percentage – The overall percentage of matched words versus the total words on the page. You’ll want this number to be very low. If you notice the page has a match percentage of 90 percent, this means that 90 percent of the content on that page matches other content on your site. Not good.
- Match Pages – The total number of pages that have matched duplicate content.
- Page Power – An estimate of page importance on a scale of 1-100 (with 100 being the most important.)
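The Match Percentage column follows directly from the definition above: matched words divided by total words. Here is a minimal Python sketch of that calculation. The function name and the sample word counts are assumptions, and SiteLiner’s own rounding may differ.

```python
def match_percentage(match_words: int, total_words: int) -> float:
    """Return matched words as a percentage of all words on the page."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return 100.0 * match_words / total_words


# Hypothetical page: 450 of its 500 words also appear elsewhere on the site.
print(match_percentage(450, 500))
```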
You’ll then be able to sort, filter and export the data.
When you click on a page in the list, you’ll be taken to an overlay of that page. On the right-hand side, you’ll see the matched content highlighted on the page. This is the content on the page in question that matches content elsewhere on your site.