<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Boston Media Domain &#187; robot Search Engine Crawl</title>
	<atom:link href="http://www.bostonmediadomain.com/tag/robot-search-engine-crawl/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.bostonmediadomain.com</link>
	<description>Search, Social and Online Media for Domains</description>
	<lastBuildDate>Mon, 25 Apr 2011 15:18:01 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>Search Engine Ranking Factors and Sensors</title>
		<link>http://www.bostonmediadomain.com/search-engine-ranking-factors-sensors/</link>
		<comments>http://www.bostonmediadomain.com/search-engine-ranking-factors-sensors/#comments</comments>
		<pubDate>Sun, 06 Dec 2009 20:20:40 +0000</pubDate>
		<dc:creator>jeff selig</dc:creator>
				<category><![CDATA[SEO Analysis]]></category>
		<category><![CDATA[google]]></category>
		<category><![CDATA[robot Search Engine Crawl]]></category>
		<category><![CDATA[search engine optimization]]></category>
		<category><![CDATA[SEO]]></category>

		<guid isPermaLink="false">http://bostonmediadomain.com/?p=1319</guid>
		<description><![CDATA[According to Matt Cutts at Google, there are some 200 variables in the Google Algorithm.  Each factor unto itself is less than the sum of the parts. The key idea is to efficiently compute for each sensor an estimate of the probability that it matches the query and process sensors in the order of decreasing probability, such that effort is first spent on sensors that are very likely to actually match the query. Another class of techniques and sensors, known as black hat SEO or spamdexing, use methods such as link farms, keyword stuffing and article spinning that degrade both the relevance of search results and the user-experience of search engines. Search engines look for sites that employ these techniques in order to remove them from their indices. What should be noted, techniques considered acceptable today may be viewed upon with disdain going forward. Google makes note in its Quality Guidelines, using off-the-shelf software can lead to detrimental results in your rankings. Increasing prominence A variety of other methods are employed to get a web page shown at the top of search results. These include: Cross linking between pages of the same website. Giving more links to main pages of [...]]]></description>
			<content:encoded><![CDATA[
<p><img class="alignright size-medium wp-image-1370" title="google-ranking-factors" src="http://bostonmediadomain.com/wp-content/uploads/2009/12/google-ranking-factors-300x168.jpg" alt="google-ranking-factors" width="316" height="176" />According to <a href="http://twitter.com/mattcutts">Matt Cutts</a> at Google, there are some 200 variables in the Google Algorithm.  Each factor unto itself is less than the sum of the parts. The key idea is to efficiently compute for each sensor an estimate of the probability that it matches the query and process sensors in the order of decreasing probability, such that effort is first spent on sensors that are very likely to actually match the query.</p>
<p>Another class of techniques and sensors, known as black hat SEO or spamdexing, uses methods such as link farms, keyword stuffing and article spinning that degrade both the relevance of search results and the user experience of search engines. Search engines look for sites that employ these techniques in order to remove them from their indices. Note that techniques considered acceptable today may be viewed with disdain in the future.</p>
<p>Google notes in its <a title="quality guidelines" href="http://www.google.com/support/webmasters/bin/answer.py?hl=en&amp;answer=35769#quality" target="_blank">Quality Guidelines</a> that using off-the-shelf software can lead to detrimental results in your rankings.</p>
<p><strong>Increasing prominence</strong><br />
A variety of other methods are employed to get a web page shown at the top of search results. These include:<br />
- Cross linking between pages of the same website.<br />
- Giving more links to the main pages of the website, to increase the PageRank used by search engines.<br />
- Linking from other websites, including link farming and comment spam.<br />
- Writing content that includes frequently searched keyword phrases, so as to be relevant to a wide variety of search queries.<br />
- Adding relevant keywords to a web page&#8217;s meta tags, including keyword stuffing.<br />
- URL normalization of web pages accessible via multiple URLs, using the &#8220;canonical&#8221; link tag.</p>
<h2>Domain</h2>
<p>- Age of Domain<br />
- History of domain<br />
- KWs in domain name<br />
- Sub domain or root domain?<br />
- TLD of Domain<br />
- IP address of domain<br />
- Location of IP address / Server</p>
<h2>Architecture</h2>
<p>- HTML structure<br />
- Use of header tags<br />
- URL path<br />
- Use of external CSS / JS files</p>
<h2>Content</h2>
<p>- Keyword density of page<br />
- Keyword in Title Tag<br />
- Keyword in Meta Description (Not Meta Keywords)<br />
- Keyword in header tags (H1, H2, etc.)<br />
- Keyword in body text<br />
- Freshness of Content</p>
<h2>Per Inbound Link</h2>
<p>- Quality of website linking in<br />
- Quality of web page linking in<br />
- Age of website<br />
- Age of web page<br />
- Relevancy of page’s content<br />
- Location of link (Footer, Navigation, Body text)<br />
- Anchor text of link<br />
- Title attribute of link<br />
- Alt attribute of linking images<br />
- Country specific TLD domain<br />
- Authority TLD (.edu, .gov)<br />
- Location of server<br />
- Authority Link (CNN, BBC, etc)</p>
<h2>Cluster of Links</h2>
<p>- Uniqueness of Class C address.</p>
<h2>Internal Cross Linking</h2>
<p>- Number of internal links to page<br />
- Location of link on page<br />
- Anchor text of FIRST text link</p>
<h2>Penalties</h2>
<p>- Over Optimization<br />
- Purchasing Links<br />
- Selling Links<br />
- Links from web spam sites<br />
- Frequent server downtime<br />
- Comment Spamming<br />
- Cloaking<br />
- Cloaking by cookie detection<br />
- Cloaking by JavaScript<br />
- Cloaking by rich media support<br />
- Cloaking by IP address<br />
- Cloaking by User Agent</p>
<p>- Hidden Text<br />
- Duplicate Content<br />
- Keyword stuffing<br />
- Manual penalties<br />
- Sandbox effect</p>
<h2>Miscellaneous</h2>
<p>- JavaScript Links<br />
- No Follow Links</p>
<h2>Pending</h2>
<p>- Performance / Load of a website<br />
- Speed of JavaScript</p>
<h2>Misconceptions</h2>
<p>- XML Sitemap (Aids the crawler but doesn’t help rankings)<br />
- PageRank (General Indicator of page’s performance)</p>
]]></content:encoded>
			<wfw:commentRss>http://www.bostonmediadomain.com/search-engine-ranking-factors-sensors/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Search Engines, keyword stuffing and spam</title>
		<link>http://www.bostonmediadomain.com/search-engines-keyword-stuffing-spam/</link>
		<comments>http://www.bostonmediadomain.com/search-engines-keyword-stuffing-spam/#comments</comments>
		<pubDate>Sun, 11 Oct 2009 00:23:23 +0000</pubDate>
		<dc:creator>jeff selig</dc:creator>
				<category><![CDATA[Commentary & Analysis]]></category>
		<category><![CDATA[google]]></category>
		<category><![CDATA[robot Search Engine Crawl]]></category>
		<category><![CDATA[SEO]]></category>

		<guid isPermaLink="false">http://bostonmediadomain.com/?p=913</guid>
		<description><![CDATA[Photo by pyratorKeyword stuffing is a technique used &#8211; and overused &#8211; by SEOs to recharge keywords on a site so that search engines find it more relevant to those particular keywords. This technique is based on one factor that search engines weighed to determine the relevance of a document X for a keyword. This practice is strongly discouraged by the rules of search engines as it is considered a hoax, destroys the relevance of the web, the end user experience and the readability of the article gets polluted and diluted. For example, say this post is relevant to the search engines keyword &#8216;overload&#8217;. &#8220;I say the keyword overload is a technique referred to SEO keywords and overload overload which consists of keywords, keywords and keywords, and more stress from them, and that the results of this technique for keywords on search engines based on your keywords overload&#8221; Clearly I&#8217;m doing keyword stuffing (and it reads pretty stupid too). The reality is the world of search engines today are less naive and sorting algorithms are just no longer sensitive to keyword stuffing, with much of the burden of relevance moved to off-page factors and calculations of proximity. Keyword stuffing is [...]]]></description>
			<content:encoded><![CDATA[
<p><span class="wp-decoratr-image"><img src="http://farm3.static.flickr.com/2078/2168226355_4939da2d4b_m.jpg" alt="Kirlian ring 2" /><br />
<a rel="external nofollow" href="http://www.flickr.com/photos/15706617@N00/2168226355">Photo by pyrator</a></span>Keyword stuffing is a technique used &#8211; and overused &#8211; by SEOs to overload a site with keywords so that search engines find it more relevant to those particular keywords. This technique is based on one factor that search engines have weighed to determine the relevance of a document for a keyword: how often that keyword appears in it.</p>
<p>Search engine guidelines strongly discourage this practice: it is considered deceptive, it degrades the relevance of search results and the end-user experience, and it makes the article itself harder to read.</p>
<p>For example, suppose I want search engines to treat this post as relevant to the keyword &#8216;overload&#8217;.</p>
<p>&#8220;I say the keyword overload is a technique referred to SEO keywords and overload overload which consists of keywords, keywords and keywords, and more stress from them, and that the results of this technique for keywords on search engines based on your keywords overload&#8221;</p>
<p>Clearly I&#8217;m doing keyword stuffing (and it reads pretty stupid too).</p>
<p>The reality is that today&#8217;s search engines are less naive, and ranking algorithms are no longer sensitive to keyword stuffing; much of the burden of relevance has moved to off-page factors and calculations of proximity. Keyword stuffing is a primitive practice that demonstrates little knowledge and a certain contempt for the reader.</p>
<p>This technique can still get some results on MSN and, to a lesser extent, Yahoo, but on Google it does little or nothing for rankings in the SERPs.</p>
<p>There are several ways to do keyword stuffing. A slightly more elaborate method is repeating a keyword in a hidden form input field or in the alt text of an image. The user does not see the form field but the search engines do. Others stuff pages with hidden text full of keywords, concealing it with the font color or CSS.</p>
<p>Keyword stuffing is also called loading or spamdexing.</p>
<p>According to various mathematical formulas, keyword stuffing (a high density of keywords) in the body text of a website does nothing for the site&#8217;s relevance in search engines. The reason is that, during indexing, the weight of each keyword is based on its position relative to the other keywords. This is what determines the prominence of each keyword. In other words, if you fill your text with keywords, each keyword&#8217;s prominence is reduced significantly, thereby reducing its impact in relation to other keywords.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.bostonmediadomain.com/search-engines-keyword-stuffing-spam/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Control your robot searches: Search Engine Crawl</title>
		<link>http://www.bostonmediadomain.com/control-your-robot-searches-search-engine-crawl/</link>
		<comments>http://www.bostonmediadomain.com/control-your-robot-searches-search-engine-crawl/#comments</comments>
		<pubDate>Tue, 10 Jun 2008 21:17:59 +0000</pubDate>
		<dc:creator>jeff selig</dc:creator>
				<category><![CDATA[Commentary & Analysis]]></category>
		<category><![CDATA[robot Search Engine Crawl]]></category>

		<guid isPermaLink="false">http://bostonmediadomain.com/2008/06/10/control-your-robot-searches-search-engine-crawl/</guid>
		<description><![CDATA[Controlling what content is blocked from being found in search engines is crucial for many websites. Fortunately, the major search engines and other well-behaved robots observe the Robots Exclusion Protocol (REP), which has evolved organically since the early 1990&#8242;s to provide a set of controls over what parts of a web site search engines robots can crawl and index. Article Sections: * Capabilities of REP * Deciding what should be Public vs. Private * Implementing the REP * Common implementation mistakes * Testing your implementation * Verifying robot identity * Removing content from search engine indices * Additional resources Capabilities of the REP The Robots Exclusion Protocol provides controls that can be applied at the site level (robots.txt), at the page level (META tag, or X-Robots-Tag), or at the HTML element level to control both the crawl of your site and the way it&#8217;s listed in the search engine results pages (SERPs). Below is a table listing the common scenarios, directives, and which search engines support them. Use Case  Robots.txt  META/ X-Robots-Tag  Other  Supported By Allow access to your content  Allow  FOLLOW INDEX    Google Yahoo Microsoft Disallow access to your content  Disallow NOINDEX NOFOLLOW    Google Yahoo Microsoft Disallow access to [...]]]></description>
			<content:encoded><![CDATA[
<p>Controlling what content is blocked from being found in search engines is crucial for many websites. Fortunately, the major search engines and other well-behaved robots observe the Robots Exclusion Protocol (REP), which has evolved organically since the early 1990s to provide a set of controls over what parts of a web site search engine robots can crawl and index.</p>
<p>Article Sections:</p>
<p>* Capabilities of REP<br />
* Deciding what should be Public vs. Private<br />
* Implementing the REP<br />
* Common implementation mistakes<br />
* Testing your implementation<br />
* Verifying robot identity<br />
* Removing content from search engine indices<br />
* Additional resources</p>
<p>Capabilities of the REP</p>
<p>The Robots Exclusion Protocol provides controls that can be applied at the site level (robots.txt), at the page level (META tag, or X-Robots-Tag), or at the HTML element level to control both the crawl of your site and the way it&#8217;s listed in the search engine results pages (SERPs). Below is a table listing the common scenarios, directives, and which search engines support them.</p>
<table>
<tr><th>Use Case</th><th>Robots.txt</th><th>META / X-Robots-Tag</th><th>Other</th><th>Supported By</th></tr>
<tr><td>Allow access to your content</td><td>Allow</td><td>FOLLOW, INDEX</td><td></td><td>Google, Yahoo, Microsoft</td></tr>
<tr><td>Disallow access to your content</td><td>Disallow</td><td>NOINDEX, NOFOLLOW</td><td></td><td>Google, Yahoo, Microsoft</td></tr>
<tr><td>Disallow access to index images on the page</td><td></td><td>NOIMAGEINDEX</td><td></td><td>Google</td></tr>
<tr><td>Disallow the display of a cached version of your content in the SERP</td><td></td><td>NOARCHIVE</td><td></td><td>Google, Yahoo, Microsoft</td></tr>
<tr><td>Disallow the creation of a description for this content in the SERP</td><td></td><td>NOSNIPPET</td><td></td><td>Google, Yahoo, Microsoft</td></tr>
<tr><td>Disallow the translation of your content into other languages</td><td></td><td>NOTRANSLATE</td><td></td><td>Google</td></tr>
<tr><td>Do not follow or give weight to links within this content</td><td></td><td>NOFOLLOW</td><td>a href attribute: rel=NOFOLLOW</td><td>Google, Yahoo, Microsoft</td></tr>
<tr><td>Do not use the Open Directory Project (ODP) to create descriptions for your content in the SERP</td><td></td><td>NOODP</td><td></td><td>Google, Yahoo, Microsoft</td></tr>
<tr><td>Do not use the Yahoo Directory to create descriptions for your content in the SERP</td><td></td><td>NOYDIR</td><td></td><td>Yahoo</td></tr>
<tr><td>Do not index this specific element within an HTML page</td><td></td><td></td><td>class=robots-nocontent</td><td>Yahoo</td></tr>
<tr><td>Stop indexing this content after a specific date</td><td></td><td>UNAVAILABLE_AFTER</td><td></td><td>Google</td></tr>
<tr><td>Specify a sitemap file or a sitemap index file</td><td>Sitemap</td><td></td><td></td><td>Google, Yahoo, Microsoft</td></tr>
<tr><td>Specify how frequently a crawler may access your website</td><td>Crawl-Delay</td><td></td><td>Google WMT</td><td>Yahoo, Microsoft</td></tr>
<tr><td>Authenticate the identity of the crawler</td><td></td><td></td><td>Reverse DNS Lookup</td><td>Google, Yahoo, Microsoft</td></tr>
<tr><td>Request removal of your content from the engine&#8217;s index</td><td></td><td></td><td>Google WMT, Yahoo SE, Microsoft WMT</td><td>Google, Yahoo, Microsoft</td></tr>
</table>
<p>Deciding what should be Public vs. Private</p>
<p>One of the first steps in managing the robots is knowing what type of content should be public vs. private. Start with the assumption that by default, everything is public, then explicitly identify the items that are private.</p>
<p>If you want search engines to access all the content on your site, you don&#8217;t need a robots.txt file at all. When a search engine tries to access the robots.txt file on your site and the server can&#8217;t return one (ideally by returning a 404 HTTP status code), the search engine treats this the same as a robots.txt file that allows access to everything.</p>
<p>Every website and every business has a different set of needs, so there&#8217;s no blanket rule for what to make private, but some common elements may apply.</p>
<p>* Private data &#8211; You may have content on your site that you don&#8217;t want to be searchable in search engines. For instance, you may have private user information (such as addresses) that you don&#8217;t want surfaced. For this type of content, you may want to use a more secure approach that keeps all visitors from the pages (such as password protection). However, some types of content are fine for visitor access, but not search engine access. For instance, you may run a discussion forum that is open for public viewing, but you may not want individual posts to appear in search results for forum member names.<br />
* Non-content content &#8211; Some content, like images used for navigation, provides little value to searchers. It&#8217;s not harmful to include these items in search engine indices, but since search engines allocate limited bandwidth to crawl each site and limited space to store content from each site, it may make sense to block these items to help direct the bots to the content on your site that you do want indexed.<br />
* Printer-friendly pages &#8211; If you have specific pages (URLs) that are formatted for printing, you may want to block them to avoid duplicate content issues. The drawback to allowing the printer-friendly page to be indexed is that it could potentially be listed in the search results instead of the default version of the page, which wouldn&#8217;t provide an ideal user experience for a visitor coming to the site through search.<br />
* Affiliate links and advertising &#8211; If you include advertising on your site, you can keep search engine robots from following the links by redirecting them to a blocked page, then on to the destination page. (There are other methods for implementing advertising-based links as well.)<br />
* Landing pages &#8211; Your site may include multiple variations of entry pages used for advertising purposes. For instance, you may run AdWords campaigns that link to a particular version of a page based on the ad, or you may print different URLs for different print ad campaigns (either for tracking purposes or to provide a custom experience related to the ad). Since these pages are meant to be an extension of the ad, and are generally near duplicates of the default version of the page, you may want to block these landing pages from being indexed.<br />
* Experimental pages &#8211; As you try new ideas on your site (for instance, using A/B testing), you likely want to block all but the original page from being indexed during the experiment.</p>
<p>Implementing the REP</p>
<p>REP is flexible and can be implemented in a number of ways. This flexibility lets you easily specify some policies for your entire site (or subdomain) and then refine them at the page or link level as needed.</p>
<p>Site Level Implementation (Robots.txt)</p>
<p>Site wide directives are stored in a robots.txt file, which must be located in the root directory of each domain or sub-domain (e.g. http://janeandrobot.com/robots.txt). Robots.txt files located in subfolders are ignored.</p>
<p>A robots.txt file is a UTF-8 encoded file that contains entries that consist of a user-agent line (that tells the search engine robot if the entry is directed at it) and one or more directives that specify content that the search engine robot is blocked from crawling or indexing. A simple robots.txt file is shown below.</p>
<p>User-agent: *</p>
<p>Disallow: /private</p>
<p>user-agent: &#8211; Specifies which robots the entry applies to.</p>
<p>* Set this to * to specify that this entry applies to all search engine robots.<br />
* Set this to a specific robot name to provide instructions for just that robot. You can find a complete list of robot names at robotstxt.org.<br />
* If you direct an entry at a particular robot, then it obeys that entry instead of any entries defined for user-agent: * (rather than in addition to those entries).</p>
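<p>As a rough illustration of that last point, the sketch below uses Python&#8217;s standard urllib.robotparser module. The robots.txt content, the /private and /tmp paths, and the SomeOtherBot name are invented for this example, and real search engine robots may resolve entries differently from this parser.</p>
<pre>
```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one entry for every robot, one for Googlebot only.
ROBOTS_TXT = """
User-agent: *
Disallow: /private

User-agent: Googlebot
Disallow: /tmp
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Googlebot follows only its own entry, so /private is not blocked for it.
googlebot_private = parser.can_fetch("Googlebot", "http://www.example.com/private")
googlebot_tmp = parser.can_fetch("Googlebot", "http://www.example.com/tmp")

# Any other robot falls back to the user-agent: * entry.
other_private = parser.can_fetch("SomeOtherBot", "http://www.example.com/private")

print(googlebot_private, googlebot_tmp, other_private)
```
</pre>
<p>If Googlebot should also stay out of /private, that Disallow line would have to be repeated inside the Googlebot entry.</p>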
<p>The major search engines have multiple robots that crawl the web for different types of content (such as images or mobile). They generally begin all robots with the same name so that if you block the major robot, all robots for that search engine are blocked as well. However, if you want to block only the more specific robot, you can block it directly and still allow web crawl access.</p>
<p>* Google &#8211; The primary search engine robot is Googlebot.<br />
* Yahoo! &#8211; The primary search engine robot is Slurp.<br />
* Live Search &#8211; The primary search engine robot is MSNbot.</p>
<p>Disallow: &#8211; Specifies what content is blocked</p>
<p>* Must begin with a slash (/).<br />
* Blocks access to any URLs that begin with the characters after the /. For instance, Disallow: /images blocks access to /images/, /images/image1.jpg, and /images10.</p>
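<p>The prefix behaviour described above can be checked with a short sketch, again using Python&#8217;s standard urllib.robotparser module. The example.com URLs and the blocked() helper are assumptions made for illustration; this parser compares paths case-sensitively, so /Images is not treated as /images.</p>
<pre>
```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse(["User-agent: *", "Disallow: /images"])

def blocked(path):
    # True when the rule above stops a robot from fetching this path.
    return not parser.can_fetch("*", "http://www.example.com" + path)

# Every path that starts with /images is blocked, including /images10.
for path in ["/images/", "/images/image1.jpg", "/images10", "/Images", "/articles/"]:
    print(path, "blocked" if blocked(path) else "allowed")
```
</pre>
<p>To block only the folder, end the rule with a slash (Disallow: /images/), which no longer matches /images10.</p>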
<p>In addition to the standard instructions that block access to content, you can specify other rules for search engine robots, as described under Other robot instructions.</p>
<p>Some things to note about robots.txt implementation:</p>
<p>* The major search engines support pattern matching using the asterisk character (*) for wildcard match and the dollar sign ($) for end of sequence matching, as described under Pattern Matching Examples.<br />
* The robots.txt file is case sensitive, so Disallow: /images would block http://www.example.com/images but not http://www.example.com/Images.<br />
* If conflicts exist in the file, the robot obeys the longest (and therefore generally more specific) line.</p>
<p>Basic Samples</p>
<p>Block all robots &#8211; Useful when your site is in pre-launch development and isn&#8217;t ready for search traffic.</p>
<p># This keeps out all well-behaved robots.</p>
<p># Disallow: * is not valid.</p>
<p>User-agent: *</p>
<p>Disallow: /</p>
<p>Keep out all bots by default &#8211; Blocks all pages except those specified. Not recommended as it is difficult to maintain and diagnose.</p>
<p># Stay out unless otherwise stated</p>
<p>User-agent: *</p>
<p>Disallow: /</p>
<p>Allow: /Public/</p>
<p>Allow: /articles/</p>
<p>Allow: /images/</p>
<p>Block specific content &#8211; The most common usage of robots.txt.</p>
<p># Block access to the images folder</p>
<p>User-agent: *</p>
<p>Disallow: /images/</p>
<p>Allow specific content &#8211; Block a folder, but allow access to selected pages in that folder.</p>
<p># Block everything in the images folder</p>
<p># Except allow images/image1.jpg</p>
<p>User-agent: *</p>
<p>Disallow: /images/</p>
<p>Allow: /images/image1.jpg</p>
<p>Allow specific robot &#8211; Block a class of robots (for instance, Googlebot), but allow a specific bot in that class (for instance, Googlebot-Mobile).</p>
<p># Block Googlebot access</p>
<p># Allow Googlebot-Mobile access</p>
<p>User-agent: Googlebot</p>
<p>Disallow: /</p>
<p>User-agent: Googlebot-Mobile</p>
<p>Allow: /</p>
<p>Pattern Matching Examples</p>
<p>The major engines support two types of pattern matching.</p>
<p>* * matches any sequence of characters<br />
* $ matches the end of URL.</p>
<p>Block access to URLs that contain a set of characters &#8211; Use the asterisk (*) to specify a wildcard.</p>
<p># Block access to all URLs that include an ampersand</p>
<p>User-agent: *</p>
<p>Disallow: /*&amp;</p>
<p>This directive would block search engines from crawling http://www.example.com/page1.asp?id=5&amp;sessionid=xyz.</p>
<p>Block access to URLs that end with a set of characters &#8211; Use the dollar sign ($) to match the end of the URL.</p>
<p># Block access to all URLs that end in .cgi</p>
<p>User-agent: *</p>
<p>Disallow: /*.cgi$</p>
<p>This directive would block search engines from crawling http://www.example.com/script1.cgi but not from crawling http://www.example.com/script1.cgi?value=1.</p>
<p>Selectively allow access to a URL that matches a blocked pattern &#8211; Use the Allow directive in conjunction with pattern matching for more complex implementations.</p>
<p># Block access to URLs that contain ?</p>
<p># Allow access to URLs that end in ?</p>
<p>User-agent: *</p>
<p>Disallow: /*?</p>
<p>Allow: /*?$</p>
<p>That directive blocks all URLs that contain ? except those that end in ?. In this example, the default version of the page will be indexable:</p>
<p>* http://www.example.com/productlisting.aspx?</p>
<p>Variations of the page will be blocked:</p>
<p>* http://www.example.com/productlisting.aspx?nav=price<br />
* http://www.example.com/productlisting.aspx?sort=alpha</p>
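<p>The matching rules in this section can be approximated in a few lines of Python. This is a minimal sketch, not how any search engine actually implements it: pattern_to_regex() and is_allowed() are hypothetical helpers, and the longest-pattern-wins tie-break follows the conflict rule stated earlier in this article.</p>
<pre>
```python
import re

def pattern_to_regex(pattern):
    # A trailing $ anchors the pattern to the end of the URL.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Each * matches any sequence of characters; everything else is literal.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_allowed(rules, path):
    # rules is a list of (directive, pattern) pairs from one robots.txt entry.
    matching = [
        (len(pattern), directive)
        for directive, pattern in rules
        if pattern_to_regex(pattern).match(path)
    ]
    if not matching:
        return True  # no rule applies, so the path may be crawled
    # The longest matching pattern decides the outcome.
    longest = max(matching, key=lambda item: item[0])
    return longest[1] == "Allow"

query_rules = [("Disallow", "/*?"), ("Allow", "/*?$")]
cgi_rules = [("Disallow", "/*.cgi$")]

print(is_allowed(query_rules, "/productlisting.aspx?"))
print(is_allowed(query_rules, "/productlisting.aspx?nav=price"))
print(is_allowed(cgi_rules, "/script1.cgi"))
print(is_allowed(cgi_rules, "/script1.cgi?value=1"))
```
</pre>
<p>Only the path and query string are passed in, since robots.txt rules are matched against the part of the URL after the host name.</p>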
<p>Other robot instructions</p>
<p>Specify a Sitemap or Sitemap index file &#8211; If you&#8217;d like to provide search engines with a comprehensive list of your best URLs, you can provide one or more Sitemap autodiscovery directives. Note, user-agent does not apply to this directive so you cannot use this to specify a Sitemap to some but not all search engines.</p>
<p># Please take my sitemap and index everything!</p>
<p>Sitemap: http://janeandrobot.com/sitemap.axd</p>
<p>Reduce the crawling load &#8211; This only works with Microsoft and Yahoo. For Google you&#8217;ll need to specify a slower crawling speed through their Webmaster Tools. Be careful when implementing this because if you slow down the crawl too much, robots won&#8217;t be able to get to all of your site and you may lose pages from the index.</p>
<p># MSNBot, please wait 5 seconds in between visits</p>
<p>User-agent: msnbot</p>
<p>Crawl-delay: 5</p>
<p># Yahoo&#8217;s Slurp, please wait 12 seconds in between visits</p>
<p>User-agent: slurp</p>
<p>Crawl-delay: 12</p>
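<p>To check what a robots.txt file like the ones above declares, Python&#8217;s standard urllib.robotparser module can read both directives. This is a sketch under assumptions: site_maps() needs Python 3.8 or later, SomeOtherBot is a made-up robot name, and the parser only reports the values; it does not tell you how a given search engine will honour them.</p>
<pre>
```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """
Sitemap: http://janeandrobot.com/sitemap.axd

User-agent: msnbot
Crawl-delay: 5

User-agent: slurp
Crawl-delay: 12
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Sitemap lines are global, so they are reported independent of user-agent.
sitemaps = parser.site_maps()

# Crawl-delay is read from the entry that applies to each robot.
msnbot_delay = parser.crawl_delay("msnbot")
slurp_delay = parser.crawl_delay("slurp")
other_delay = parser.crawl_delay("SomeOtherBot")  # hypothetical robot, no entry

print(sitemaps, msnbot_delay, slurp_delay, other_delay)
```
</pre>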
<p>Page Level Implementation (META Tags)</p>
<p>The REP page-level directives allow you to refine the site wide policies on a page-by-page basis.</p>
<p>Placing a meta tag on the page &#8211; Place the meta tag in the head tag. Multiple directives should be comma delimited inside the tag, e.g. &lt;meta name="ROBOTS" content="Directive1, Directive2"&gt;.</p>
<p>&lt;html&gt;</p>
<p>&lt;head&gt;</p>
<p>&lt;title&gt;Your title here&lt;/title&gt;</p>
<p>&lt;meta name="ROBOTS" content="NOINDEX"&gt;</p>
<p>&lt;/head&gt;</p>
<p>&lt;body&gt;Your page here&lt;/body&gt;</p>
<p>&lt;/html&gt;</p>
<p>Targeting a specific search engine &#8211; Within the meta tag you can specify which search engine you would like to target, or you can target them all.</p>
<p>&lt;!-- Applies to All Robots --&gt;</p>
<p>&lt;meta name="ROBOTS" content="NOINDEX"&gt;</p>
<p>&lt;!-- ONLY GoogleBot --&gt;</p>
<p>&lt;meta name="Googlebot" content="NOINDEX"&gt;</p>
<p>&lt;!-- ONLY Slurp (Yahoo) --&gt;</p>
<p>&lt;meta name="Slurp" content="NOINDEX"&gt;</p>
<p>&lt;!-- ONLY MSNBot (Microsoft) --&gt;</p>
<p>&lt;meta name="MSNBot" content="NOINDEX"&gt;</p>
<p>Control how your listings appear &#8211; There is a set of options you can use to determine how your site will show up on the SERP. You can exert some control over how the description is created, and remove the &#8220;Cached page&#8221; link.</p>
<p>Example search engine results page (SERP)</p>
<p>&lt;!-- Do not show a description for this page --&gt;</p>
<p>&lt;meta name="ROBOTS" content="NOSNIPPET"&gt;</p>
<p>&lt;!-- Do not use http://dmoz.org to create a description --&gt;</p>
<p>&lt;meta name="ROBOTS" content="NOODP"&gt;</p>
<p>&lt;!-- Do not present a cached version of the document in a search result --&gt;</p>
<p>&lt;meta name="ROBOTS" content="NOARCHIVE"&gt;</p>
<p>Using other directives &#8211; Other meta robots directives are shown below.</p>
<p>&lt;!-- Do not trust links on this page; it could contain user-generated content (UGC) --&gt;</p>
<p>&lt;meta name="ROBOTS" content="NOFOLLOW"&gt;</p>
<p>&lt;!-- Do not index this page --&gt;</p>
<p>&lt;meta name="ROBOTS" content="NOINDEX"&gt;</p>
<p>&lt;!-- Do not index any images on this page (will still index them if they are linked</p>
<p>elsewhere). Better to use robots.txt if you really want them safe.</p>
<p>This is a Google-only tag. --&gt;</p>
<p>&lt;meta name="GOOGLEBOT" content="NOIMAGEINDEX"&gt;</p>
<p>&lt;!-- Do not translate this page into other languages --&gt;</p>
<p>&lt;meta name="ROBOTS" content="NOTRANSLATE"&gt;</p>
<p>&lt;!-- NOT RECOMMENDED, there really isn't much point in using these --&gt;</p>
<p>&lt;meta name="ROBOTS" content="FOLLOW"&gt;</p>
<p>&lt;meta name="ROBOTS" content="UNAVAILABLE_AFTER"&gt;</p>
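<p>To audit which of the meta robots directives above a page actually carries, a short script can help. The following is a minimal sketch using only the Python standard library; the names RobotsMetaParser, robots_directives and KNOWN_ROBOTS are hypothetical, and the set of robot names is an assumption taken from the examples in this article.</p>

```python
from html.parser import HTMLParser

# Robot names taken from the examples in this article (assumption: extend
# this set for any other crawler you want to audit).
KNOWN_ROBOTS = {"robots", "googlebot", "slurp", "msnbot"}


class RobotsMetaParser(HTMLParser):
    """Collects meta robots directives, keyed by lower-cased robot name."""

    def __init__(self):
        super().__init__()
        self.directives = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values may be None when an attribute has no value.
        attr_map = {key: (value or "") for key, value in attrs}
        name = attr_map.get("name", "").strip().lower()
        if name not in KNOWN_ROBOTS:
            return
        content = attr_map.get("content", "")
        values = {part.strip().lower() for part in content.split(",") if part.strip()}
        self.directives.setdefault(name, set()).update(values)


def robots_directives(html_text):
    """Return {robot_name: set_of_directives} for one HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return parser.directives
```

<p>Calling robots_directives on the source of a page returns one entry per robot name found, so a page that targets all robots with NOINDEX and Googlebot with NOIMAGEINDEX yields two entries.</p>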
<p>HTTP Header Implementation (X-Robots-Tag)</p>
<p>The X-Robots-Tag HTTP header allows developers to specify page-level REP directives for content types other than text/html, such as PDF, DOC, PPT, or dynamically generated images.</p>
<p>Using the X-Robots-Tag &#8211; to use the X-Robots-Tag, add it to your HTTP response headers as shown below. To specify multiple directives, you can either comma-delimit them or add them as separate header items.</p>
<p>HTTP/1.x 200 OK</p>
<p>Cache-Control: private</p>
<p>Content-Length: 2199552</p>
<p>Content-Type: application/octet-stream</p>
<p>Server: Microsoft-IIS/7.0</p>
<p>content-disposition: inline; filename=01 - The truth about SEO.ppt</p>
<p>X-Robots-Tag: noindex, nosnippet</p>
<p>X-Powered-By: ASP.NET</p>
<p>Date: Sun, 01 Jun 2008 19:25:47 GMT</p>
<p>The X-Robots-Tag directive supports most of the same directives as the meta tag. The only limitation with this method over the meta tag implementation is that there is no way to target a specific robot &#8211; though that probably isn&#8217;t a big deal for most use cases.</p>
<p>* X-Robots-Tag: noindex<br />
* X-Robots-Tag: nosnippet<br />
* X-Robots-Tag: notranslate<br />
* X-Robots-Tag: noarchive<br />
* X-Robots-Tag: unavailable_after: 7 Jul 2007 16:30:00 GMT</p>
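<p>How the header gets added depends on your server. As a rough illustration only, here is a minimal sketch of a Python WSGI application that attaches X-Robots-Tag to responses for certain file types. The extension list, the function names and the placeholder response body are all assumptions made for the example, not part of any particular server setup.</p>

```python
from wsgiref.simple_server import make_server

# Hypothetical file extensions that should stay out of the index.
NOINDEX_EXTENSIONS = (".ppt", ".pdf", ".doc")


def robots_headers(path):
    """Return the extra response headers for the requested path."""
    if path.lower().endswith(NOINDEX_EXTENSIONS):
        # One comma-delimited header; sending two separate X-Robots-Tag
        # headers is the alternative described in the text.
        return [("X-Robots-Tag", "noindex, nosnippet")]
    return []


def app(environ, start_response):
    """Minimal WSGI app that tags matching documents with X-Robots-Tag."""
    path = environ.get("PATH_INFO", "/")
    # Placeholder content type and body; a real app would serve the file.
    headers = [("Content-Type", "application/octet-stream")]
    headers.extend(robots_headers(path))
    start_response("200 OK", headers)
    return [b"placeholder body"]


if __name__ == "__main__":
    # Serves on localhost:8000 until interrupted.
    with make_server("127.0.0.1", 8000, app) as server:
        server.serve_forever()
```

<p>Most web servers can also add the header through configuration alone, which is usually simpler than application code when the rule is based on file type.</p>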
<p>Content Level Implementation</p>
<p>You can further refine your site level and page level directives within several content tags.</p>
<p>Each anchor tag (link) can be modified to tell search engines that you do not vouch for the URL it points to. This is typically used for links within user-generated content (UGC) like wikis, blog comments, reviews and other community sites.</p>
<p>&lt;a href="#" rel="NOFOLLOW"&gt;My Hyperlink&lt;/a&gt;</p>
<p>Also, in Yahoo Search you can specify which &lt;div&gt; elements on a page you would not like indexed using the class=robots-nocontent attribute. However, we don&#8217;t recommend relying on this attribute, because no other engine supports it.</p>
<p>&lt;div class="robots-nocontent"&gt;</p>
<p>No content for you! (or at least Yahoo!)</p>
<p>&lt;/div&gt;</p>
<p>Common implementation mistakes</p>
<p>While implementing the REP is generally straightforward, there are a few common mistakes.</p>
<p>* Googlebot follows only the most specific user-agent section, ignoring all others. In the robots.txt file, if you specify a section for all user-agents (user-agent: *) and also declare a section for Googlebot (user-agent: Googlebot), Google will disregard all sections in the robots.txt file except the Googlebot section. This could leave you exposing much more content to Google than you might have thought.</p>
<p># This keeps out all well-behaved robots</p>
<p>User-agent: *</p>
<p>Disallow: /</p>
<p># This looks like it is giving Google access to only this directory, but since it is a</p>
<p># GoogleBot specific section, Google will disregard the previous section</p>
<p># and access the whole site.</p>
<p>User-agent: Googlebot</p>
<p>Allow: /Content_For_Google/</p>
<p>* NOFOLLOW will most likely not prevent indexing &#8211; if you use NOFOLLOW at either the page or the link level, the linked pages can still be indexed, because the search engine may have found a reference to them from another source. Also note that search engines still treat rel="NOFOLLOW" on an anchor tag as a recommendation, not a command.</p>
<p>To ensure that content is not indexed, either use the Disallow directive at the site level, or use NOINDEX at the page level.</p>
<p>* Directives that are not recommended &#8211; the directives in the REP are all about exceptions; by default, robots assume they can crawl your whole site. Therefore, you do not need to explicitly use the FOLLOW and INDEX directives, as the search engines will not take them into account. It sounds silly, but I&#8217;ve seen a few sites that have implemented these on every page and every link.</p>
<p>Another directive that is not recommended is the NOCACHE directive. This was created by Microsoft, and is synonymous with NOARCHIVE. While they will most likely always continue to support the directive, it is better to use NOARCHIVE so it will work on all the search engines.</p>
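<p>The first mistake above, where a Googlebot-specific section replaces the catch-all section, can be reproduced offline. The sketch below uses Python&#8217;s standard urllib.robotparser module as a stand-in; it is not Google&#8217;s parser, but it selects the user-agent section in the same way. The FIXED variant and the sample URLs are assumptions added for illustration. Note that this parser applies rules in file order, so the Allow line is listed before the Disallow line.</p>

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt):
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# The mistake from the text: the Googlebot section has no Disallow rule.
MISTAKE = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /Content_For_Google/
"""

# Repeating the Disallow rule inside the Googlebot section restores the intent.
FIXED = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /Content_For_Google/
Disallow: /
"""

if __name__ == "__main__":
    for label, text in (("mistake", MISTAKE), ("fixed", FIXED)):
        parser = build_parser(text)
        for url in ("/Content_For_Google/page.html", "/private/report.html"):
            print(label, url, parser.can_fetch("Googlebot", url))
```

<p>With the MISTAKE rules, the parser reports that Googlebot may fetch /private/report.html, while any other robot is kept out. With the FIXED rules, Googlebot is limited to the /Content_For_Google/ directory.</p>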
<p>Testing your implementation</p>
<p>As you&#8217;re implementing your REP design, you may want to test it to see how it is working out. The easiest way is to use the robots validator in either Google&#8217;s or Microsoft&#8217;s Webmaster Tools. These tools are good enough test beds for most folks. However, advanced developers (or paranoid ones with critical business requirements) will want to know definitively what the robots are doing, not simply rely on what the robots say they are doing. These folks will want to look at the tools as well as their server logs, and possibly even set up some specific tests.</p>
<p>The Easy Way</p>
<p>Both Google and Microsoft provide some tools as part of their Webmaster Centers to help you verify if you&#8217;ve configured your REP the way you expect. Let&#8217;s start with Google&#8217;s tools:</p>
<p>The first thing you should check is the list of URLs that Google has seen from your website and not indexed due to the REP. Note you can also download the list and filter, sort, and have-your-way-with-it in Excel.</p>
<p>Google Webmaster Tools: Blocked URLs</p>
<p>The next step is to use their interactive robots.txt tool to analyze your rules and test specific URLs for blockage. When you pull up the tool they already should have it pre-populated with the robots.txt file they have on file from the last time they crawled. You can input a list of URLs you&#8217;d like to check below, select the user-agent you&#8217;d like to check against and the tool will tell you if they are blocked or not. You can also use the tool to test changes to your robots.txt file to see how Google would interpret things.</p>
<p>Google Webmaster Tools: robots.txt analysis</p>
<p>Microsoft has a similar tool in their Webmaster Center that will validate a robots.txt file against the standard that MSNBot supports. To use the tool, log in, copy and paste your robots.txt file into the top field, and select Validate. A list of all detectable issues is displayed in the bottom box.</p>
<p>Microsoft Live Search Webmaster Tools: robots.txt validator</p>
<p>The Hard Way (More Accurate)</p>
<p>If you have a specific business need to ensure that the robots are following your rules (or you&#8217;re just paranoid), then you should not rely solely on the tools they provide to test compliance. You&#8217;re going to need to go straight to the horse&#8217;s mouth and analyze your web server logs to see exactly what the robots are doing. There is no one easy tool for doing this; you&#8217;ll likely have to use an existing tool (such as Microsoft HTTP Log Parser) or write your own. It isn&#8217;t difficult; it will simply take some time to implement. Useful references for this are the list of all robot user agents and the more complete lists of bots from Google and Microsoft.</p>
<p>When Blocked Content Appears to be Indexed</p>
<p>If search engines are blocked from crawling pages, they may still index the URL if the robot finds a link to that URL on a page that isn&#8217;t blocked. The listing may display the URL only, such as shown below.</p>
<p>Google partially indexed results</p>
<p>Or, it may include a title and in some instances, a description. This makes it appear as though the search engine robot is disregarding the directive that blocks access to the page, but the search engine is in fact obeying the directive not to crawl the page and is using anchor text from the link to that page and descriptive details from either the page that contains the link or a source such as the Open Directory Project.</p>
<p>For more details, see:</p>
<p>* Google: partially indexed page<br />
* Yahoo!: thin documents</p>
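<p>When a blocked URL shows up in results, the server logs can confirm whether the robot ever fetched it. The following is a minimal sketch of the log analysis described under The Hard Way, assuming logs in the common Combined Log Format and using Python&#8217;s urllib.robotparser as a stand-in for each engine&#8217;s own rule matching. The function name blocked_fetches and the regular expression are hypothetical and would need adjusting for other log formats.</p>

```python
import re
from urllib.robotparser import RobotFileParser

# Combined Log Format (assumed): ip, identity, user, [time], "request",
# status, size, "referer", "user agent".
# Captured groups: client ip, requested path, user agent string.
LOG_LINE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]*\] "\S+ (\S+)[^"]*" \d{3} \S+ "[^"]*" "([^"]*)"'
)


def blocked_fetches(log_lines, robots_txt, robot_token):
    """Return (ip, path) pairs where robot_token requested a disallowed path."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    hits = []
    for line in log_lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines in an unexpected format
        ip, path, agent = match.groups()
        if robot_token.lower() not in agent.lower():
            continue  # request came from some other client
        if path == "/robots.txt":
            continue  # robots are expected to fetch the rules file itself
        if not rules.can_fetch(robot_token, path):
            hits.append((ip, path))
    return hits
```

<p>An empty result for a given robot suggests it is obeying the rules, and that any listing for a blocked URL was built from links and anchor text alone. The user agent string can be faked, so pair this check with the identity verification described in the next section.</p>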
<p>Verifying Robot Identity</p>
<p>Another thing you&#8217;ll likely want to do in this endeavor is validate that the robots are who they say they are. Google, Yahoo and Microsoft all support reverse DNS authentication of their robots. The process is pretty simple and is described by Google, Yahoo and Microsoft: essentially, you find out which domain each robot&#8217;s hostnames resolve to, and check for that domain in your tool. This way, if the IP address changes (which it will), you don&#8217;t need to update your code.</p>
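<p>The reverse DNS check can be sketched in a few lines of Python. This is a minimal illustration, not an official procedure: the function name verify_robot is hypothetical, and the hostname suffixes in CRAWLER_SUFFIXES are assumptions that should be confirmed against each engine&#8217;s current documentation. The lookup functions are passed in as parameters so the logic can be exercised without network access.</p>

```python
import socket

# Hostname suffixes assumed for each crawler; confirm against each engine's
# current documentation before relying on them.
CRAWLER_SUFFIXES = {
    "googlebot": (".googlebot.com", ".google.com"),
    "slurp": (".crawl.yahoo.net",),
    "msnbot": (".search.msn.com",),
}


def verify_robot(ip, robot, reverse=socket.gethostbyaddr,
                 forward=socket.gethostbyname_ex):
    """Reverse-resolve ip, check the domain, then forward-resolve to confirm."""
    suffixes = CRAWLER_SUFFIXES.get(robot.lower())
    if suffixes is None:
        return False  # no known domain for this robot name
    try:
        hostname = reverse(ip)[0]
        if not hostname.lower().endswith(suffixes):
            return False
        # The forward lookup must map back to the same address; otherwise
        # anyone controlling their own reverse DNS could spoof the name.
        return ip in forward(hostname)[2]
    except OSError:
        # Covers failed lookups (socket.herror and socket.gaierror).
        return False
```

<p>Because the check is keyed on the domain and not on a list of IP addresses, it keeps working when the engines move their crawlers to new addresses.</p>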
<p>Should you find any issues, where one of the robots is not minding the REP or is misbehaving in some other way, you can always communicate directly with each engine through one of their forums:</p>
<p>* Google Crawling, Indexing and Ranking Forum<br />
* Yahoo Crawler Feedback Form<br />
* Microsoft Crawler Error and Feedback Forum</p>
<p>Removing Content From Search Engine Indices</p>
<p>If you find that you haven&#8217;t implemented the techniques described here correctly and private content from your site is indexed, each of the major search engines has methods available for requesting that it be removed. For more information, see:</p>
<p>* Google: Requesting removal of content from our index<br />
* Yahoo!: Deleting URLs</p>
<p>Additional Resources:</p>
<p>* Google<br />
o How to create a robots.txt file<br />
o Descriptions of each user-agent that Google uses<br />
o How to use pattern matching<br />
o How often we recrawl your robots.txt file<br />
o All about Googlebot<br />
* Yahoo!<br />
o Wild card support<br />
o X-Robots tag directive support<br />
* Microsoft Live Search<br />
o Search robots in disguise<br />
* Other resources<br />
o Search Engine Land: Meta Robots Tag 101<br />
o Search Engine Land: Yahoo!, Microsoft, Google Clarify Robots.txt Support<br />
o Search Engine Land: URL Removal Options<br />
o robotstxt.org<br />
o Wikipedia: Robots Exclusion Standard</p>
]]></content:encoded>
			<wfw:commentRss>http://www.bostonmediadomain.com/control-your-robot-searches-search-engine-crawl/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

