We’ve all seen the little “cached” link in Google results. That link gives an SEO some useful information, such as the most recent cache date, which is worth tracking so you can trend how often Google caches your site. If you know how often Google caches your site, you can judge the effectiveness of new campaigns and new content based on caching frequency. The cache also lets you check the text version of your site that Google has saved. Looking at the text cache can help you troubleshoot content presentation problems and see what content is not visible to Google because of frames, Flash, etc. You can also see how Google and other search engines read your site navigation, internal linking structure, and outbound links. Basically, there is a wealth of information here for a skilled SEO.
If there’s so much good info, why would anyone want to block caching? One reason: content ownership. If your site offers a service that is required to use specific phrasing in disclaimers, disclosures, etc., the Google cache can be dangerous because it can retain stale information after your terms change. The real danger here is quite limited, but I once worked with a client for whom cached legal info was a problem, and their attorneys needed to know how to get rid of it.
Another problem can arise from content scrapers. There isn’t much you can do to prevent a content scraper from grabbing text from your site, but there are a few tricks you can use. The client I mentioned above had scrapers stealing and reproducing entire websites and impersonating the company. This led to legal confusion, and the lawyers wanted a way to protect the legal info against future scraping. I came up with a system in which all of the client’s sites and pages made a call to another server that dynamically populated the legal disclaimers on each page. We configured the server that provided the text to accept requests only from certain IPs (the website IPs) so the include would not work on other websites. Most content scrapers just pull the HTML from a site, and with this call the legal info lived in something that didn’t present scrapeable content. Not only did this ensure every one of the sites had up-to-date legal disclaimers, it also prevented scrapers from getting the content.
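To make the IP restriction described above a little more concrete, here is a minimal sketch in Python. It is not the system I built for the client, just an illustration of the idea: the addresses are placeholders from the documentation ranges, and the names `ALLOWED_NETWORKS`, `DISCLAIMER_TEXT`, `is_allowed`, and `serve_disclaimer` are all made up for this example.

```python
import ipaddress
from typing import Tuple

# Hypothetical allowlist: the addresses of your own web servers.
# These are documentation-range placeholders, not real values.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("203.0.113.10/32"),
    ipaddress.ip_network("198.51.100.0/28"),
]

# Placeholder for the centrally maintained legal text.
DISCLAIMER_TEXT = "Placeholder legal disclaimer text."


def is_allowed(remote_ip: str) -> bool:
    """Return True if the requesting address is on the allowlist."""
    try:
        address = ipaddress.ip_address(remote_ip)
    except ValueError:
        # Anything that is not a valid IP address is rejected.
        return False
    return any(address in network for network in ALLOWED_NETWORKS)


def serve_disclaimer(remote_ip: str) -> Tuple[int, str]:
    """Return an (HTTP status, body) pair for a disclaimer request."""
    if is_allowed(remote_ip):
        return 200, DISCLAIMER_TEXT
    # Requests from anywhere else get no legal text at all.
    return 403, ""
```

In a real setup this check would sit in front of whatever actually serves the disclaimer text (a web server rule or a small application), and the allowlist would need updating whenever a site moves to a new IP.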
Another related challenge is a site called the Internet Archive, which keeps a record of your site’s changes. This site also contains a wealth of information and content that could be used against you. A skilled SEO can look through the history of your site and reverse-engineer all the improvements you made to increase user conversion. It’s really not all that difficult. I could go to a competitor’s site in the Internet Archive and look at design changes they made to improve user conversion. If I know the competitor and their target client well, I can find out all kinds of valuable information about user conversion improvement that the competitor probably spent a lot of money and time learning. When I conduct User Funnel Improvement Research and Conversion Improvement Studies, I usually start by digging through the competition to see how they drive traffic into their high-value conversion pages. I look at their current site design as well as their change history in the Internet Archive. Starting from scratch on a multivariate testing (MVT) campaign can be quite expensive, so I take what I learn from competitors, improve on it, and start my MVT from there. By this time in my career, most conversion improvements are intuitive for me, but it’s still nice to look at where other companies are focusing their efforts.
The Internet Archive is especially useful for keeping tabs on other SEO firms that like to brag about their big clients and their latest client acquisitions. If I know XYZ company has a big new client, I can watch the changes they make to see if there’s anything I can learn about their methods. A tip for you: don’t advertise your new clients until AFTER you have your SEO improvements in place and the Internet Archive blocked. This will make it more difficult for other SEO firms to track your changes. Better still, keep under the radar and don’t talk about clients until you’re way into their campaign.
So, how do you stop Google and the Internet Archive from caching your critical and expensive information? Simple. To stop the Internet Archive from keeping a record of your site, block its user-agent in your robots.txt file. This will also remove any previous records for your site from the archive. The fix for Google is also pretty easy and can be controlled at the page level with a robots “noarchive” meta tag in the HEAD section of your page. This should really only be used on pages that contain legally sensitive information that you don’t want cached, such as a “terms & conditions” page.
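Here is a rough sketch of what those two pieces can look like, wrapped in a little Python so the robots.txt rules can be checked with the standard library’s `urllib.robotparser`. I am assuming `ia_archiver` as the user-agent token for the Internet Archive, which is the one historically associated with it; confirm the current token and policy before relying on it. The helper name `is_blocked` and the example.com URL are made up for this example.

```python
from urllib.robotparser import RobotFileParser

# Assumption: "ia_archiver" is the user-agent token the Internet Archive
# has historically honored. Verify this before relying on it.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow:
"""

# Page-level tag for the HEAD section of pages you don't want cached.
NOARCHIVE_META = '<meta name="robots" content="noarchive">'


def is_blocked(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules disallow user_agent on url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/terms"  # hypothetical page
    print("ia_archiver blocked:", is_blocked(ROBOTS_TXT, "ia_archiver", url))
    print("Googlebot blocked:", is_blocked(ROBOTS_TXT, "Googlebot", url))
    print("Tag for sensitive pages:", NOARCHIVE_META)
```

The empty `Disallow:` under `User-agent: *` leaves every other crawler unrestricted, so only the archive crawler is shut out site-wide while the meta tag handles caching one page at a time.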
It’s important to protect your site from legal problems by taking every measure you can against scrapers and stale cached copies. This helps ensure that only the most up-to-date version of your information is available online. Blocking the Internet Archive helps erase the breadcrumb trail of improvements you have spent so much time and money implementing.