How Search Engines Handle Duplicate Content

Duplicate Content and Multiple Site Issues

Greg Grothaus says, “One of the interesting issues in this problem is that Google doesn’t know the first time the content appears. At best, we only know the first page we crawled the content on. It is entirely plausible that we crawl the original content after the copycat.”

Joachim Kupke, Sr. Software Engineer on Google’s Indexing Team, reiterated much of what Grothaus said. He also said that Google has a lot of infrastructure for duplicate-content elimination:

– redirects
– detection of recurring URL patterns (the ability to ‘learn’ URL patterns that repeatedly lead to duplicated content)
– the actual contents of pages
– the most recently crawled version of a URL
– earlier crawled content
– the contents minus the elements that don’t change across a site (see the sketch after this list)
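
That last item, comparing page contents after stripping site-wide boilerplate, can be illustrated with a toy sketch. The Python below is purely illustrative and not Google's actual pipeline: it treats lines that appear on every sampled page of a site as boilerplate, strips them, and fingerprints what remains, so a syndicated copy on a different site (with different navigation and footer) still hashes to the same value as the original.

```python
import hashlib

def boilerplate_lines(pages):
    """Lines that appear on every sampled page are treated as site-wide boilerplate."""
    common = None
    for html in pages:
        lines = {line.strip() for line in html.splitlines() if line.strip()}
        common = lines if common is None else common & lines
    return common or set()

def content_fingerprint(html, boilerplate):
    """Hash the page contents minus the boilerplate; equal hashes suggest duplicates."""
    core = [line.strip() for line in html.splitlines()
            if line.strip() and line.strip() not in boilerplate]
    return hashlib.sha1("\n".join(core).encode("utf-8")).hexdigest()

# Toy usage: the same article body appears on two hypothetical sites with different chrome.
site_a = [
    "<nav>Site A</nav>\n<p>Blue widgets are great.</p>\n<footer>© A</footer>",
    "<nav>Site A</nav>\n<p>Contact us.</p>\n<footer>© A</footer>",
]
site_b = [
    "<nav>Site B</nav>\n<p>Blue widgets are great.</p>\n<footer>© B</footer>",  # syndicated copy
    "<nav>Site B</nav>\n<p>About us.</p>\n<footer>© B</footer>",
]
fp_a = content_fingerprint(site_a[0], boilerplate_lines(site_a))
fp_b = content_fingerprint(site_b[0], boilerplate_lines(site_b))
print(fp_a == fp_b)  # True: the core content matches once each site's boilerplate is removed
```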

Kupke said to avoid dynamic URLs where possible (although Google is “rather good” at eliminating dupes). If all else fails, use the canonical link element, which Kupke calls a “Swiss Army Knife” for duplicate-content issues.
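
For reference, the canonical link element is a single <link> tag in the <head> of the duplicate page pointing at the preferred URL. Here is a minimal sketch, using only Python's standard library, of what the tag looks like and how you might extract it when auditing a page; the example.com URL is a placeholder:

```python
from html.parser import HTMLParser

class CanonicalExtractor(HTMLParser):
    """Collects the href of a <link rel="canonical"> tag in a page's <head>."""
    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and attrs.get("rel", "").lower() == "canonical":
            self.canonical = attrs.get("href")

# The duplicate page declares the preferred URL in its <head>:
page = """
<html><head>
  <link rel="canonical" href="https://www.example.com/widgets/" />
</head><body>Duplicate listing of widgets</body></html>
"""
parser = CanonicalExtractor()
parser.feed(page)
print(parser.canonical)  # https://www.example.com/widgets/
```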

Also, do not use Disallow directives in robots.txt to mark duplicate content. Blocking the duplicates makes it harder for Google to detect them, and disallowed 404s are a nuisance. There is an exception, however: interstitial login pages may be a good candidate to “robot out,” according to Kupke.
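
The reason blocking duplicates backfires is that a disallowed URL can never be fetched, so its contents (and any rel="canonical" tag on it) can never be compared against the original. A minimal sketch using Python's standard urllib.robotparser; the domain, paths, and robots.txt rules are hypothetical:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks a "duplicate" printer-friendly section,
# plus a login interstitial (the one case Kupke says is reasonable to robot out).
robots_txt = """\
User-agent: *
Disallow: /print/
Disallow: /login-interstitial/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Googlebot cannot fetch the blocked duplicate, so it cannot compare its
# contents (or read a rel="canonical" tag on it) against the original page.
print(rp.can_fetch("Googlebot", "https://www.example.com/print/widgets"))  # False
print(rp.can_fetch("Googlebot", "https://www.example.com/widgets/"))       # True
```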

Impressions & Clicks
Joachim repeatedly used the terms ‘impressions’ and ‘clicks’ in the context of a URL in Google’s index. He mentioned that if they see a URL with very few impressions (or none), it will likely take a long time to be updated in the index (no surprise there). However, URLs with many impressions and clicks (or on domains that are important and crawled frequently) will be updated quickly.

Infrastructure for Handling Duplicate Content
1. Redirects
2. Detection of recurring URL patterns
3. The contents of a page
4. The canonical link element (if all else fails)

Historical Record of URLs
Google compares the recently-crawled version with a previously crawled version of the URL

Google + rel=canonical = <3
Two out of three times, the rel=canonical hint alters the decision Google would otherwise have made organically

302s work for Canonical Targets
* Because of an internal method for handling the trailing slash on URLs, Google needs to have (and recommends all web developers deploy) a trailing slash on canonical targets and internal links. Without the trailing slash, Google will actually add the slash and update the URL in its index. Now, I’ve found multiple examples of pages where this doesn’t happen, but Joachim was pretty firm that it’s a general web problem Google is forced to work around.
* The takeaway is that you should always add the trailing slash to the absolute URL in the canonical target. If you don’t, Google will add it anyway, but adding it proactively should speed up server response times (which may have an impact on very large sites). A quick sketch of this normalization follows the list.
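
Here is a minimal sketch of that trailing-slash normalization, using Python's standard urllib.parse. The check for "looks like a file" (a dot in the last path segment) is an assumption for illustration, not anything Google has described, and the example.com URLs are placeholders:

```python
from urllib.parse import urlsplit, urlunsplit

def with_trailing_slash(url):
    """Append a trailing slash to the URL path if it is missing and the last
    path segment does not look like a file name."""
    parts = urlsplit(url)
    path = parts.path or "/"
    last_segment = path.rsplit("/", 1)[-1]
    if not path.endswith("/") and "." not in last_segment:  # crude file-name check
        path += "/"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))

print(with_trailing_slash("https://www.example.com/widgets"))
# https://www.example.com/widgets/
print(with_trailing_slash("https://www.example.com/widgets/page.html"))
# https://www.example.com/widgets/page.html (unchanged)
```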

Google says a common mistake is designating a 404 as the canonical, and this is typically caused by unnecessary relative links. So use absolute URLs in the canonical href, avoid changing rel="canonical" designations, and avoid pointing the canonical at a URL that permanently redirects.
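
To see how a relative link produces a 404 canonical, here is a small sketch using Python's standard urljoin; the URLs are hypothetical. The relative href is resolved against the page's own URL, so if the author misjudges the directory depth, the canonical ends up pointing at a URL that does not exist:

```python
from urllib.parse import urljoin

# The page lives one directory deeper than the author assumed, so the relative
# href resolves to a non-existent URL, while an absolute href is unambiguous.
page_url = "https://www.example.com/blog/2011/duplicate-content/"

relative_href = "duplicate-content/"  # intended to mean "this page"
absolute_href = "https://www.example.com/blog/2011/duplicate-content/"

print(urljoin(page_url, relative_href))
# https://www.example.com/blog/2011/duplicate-content/duplicate-content/  <- likely a 404
print(urljoin(page_url, absolute_href))
# https://www.example.com/blog/2011/duplicate-content/
```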
