Pushing Bad Data- Google’s Latest Black Eye

Google stopped counting, or at least publicly displaying, the number of pages it indexed in September 2005, after a schoolyard “measuring contest” with rival Yahoo. That count topped out at around 8 billion pages before it was removed from the home page. News broke recently through various SEO forums that Google had suddenly, over the past few weeks, added another few billion pages to the index. This might sound like a reason for celebration, but this “accomplishment” does not reflect well on the search engine that achieved it.

What had the SEO community buzzing was the nature of those fresh, new few billion pages. They were blatant spam, containing Pay-Per-Click (PPC) advertisements and scraped content, and they were, in many cases, ranking well in the search results. They pushed out far older, more established sites in doing so. A Google representative responded via forums by calling the problem a “bad data push,” something that was met with groans throughout the SEO community.

Just how did someone manage to dupe Google into indexing so many pages of spam in such a short period of time? I’ll provide a high-level overview of the process, but don’t get too excited. Just as a diagram of a nuclear explosive isn’t going to teach you how to make the real thing, you’re not going to be able to run off and do it yourself after reading this article. Still, it makes for a fascinating tale, one that illustrates the ugly problems cropping up with ever-increasing frequency in the world’s most popular search engine.

A Dark and Stormy Night

Our story begins deep in the heart of Moldova, sandwiched scenically between Romania and Ukraine. In between fending off local vampire attacks, an enterprising local had a brilliant idea and ran with it, presumably away from the vampires… His idea was to exploit how Google handled subdomains, and not just a little bit, but in a big way.

The heart of the issue is that, currently, Google treats subdomains much the same way it treats full domains: as unique entities. This means it will add the homepage of a subdomain to the index and return at some point later to perform a “deep crawl.” A deep crawl is simply the spider following links from the domain’s homepage deeper into the site until it finds everything, or gives up and comes back later for more.
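To make the “deep crawl” idea concrete, here is a minimal sketch of a spider following links outward from a homepage until it runs out of pages or hits a limit. It is only an illustration, not Google’s actual crawler: the `SITE` link map, the `deep_crawl` function, and the `page_budget` parameter are all hypothetical, and the “site” is an in-memory dictionary rather than real web pages.

```python
from collections import deque


def deep_crawl(links, homepage, page_budget):
    """Breadth-first walk from the homepage.

    links: dict mapping a page to the pages it links to (hypothetical data).
    page_budget: how many pages the spider fetches before giving up for now.
    Returns (pages visited in order, pages discovered but left for later).
    """
    visited = []
    seen = {homepage}          # pages already queued, so none is fetched twice
    queue = deque([homepage])
    while queue and len(visited) < page_budget:
        page = queue.popleft()
        visited.append(page)
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return visited, list(queue)


# A made-up five-page site, keyed by path.
SITE = {
    "/": ["/about", "/articles"],
    "/articles": ["/articles/1", "/articles/2"],
    "/articles/1": ["/"],
}

visited, pending = deep_crawl(SITE, "/", page_budget=3)
print(visited)  # pages fetched on this visit
print(pending)  # pages left for a later visit
```

The budget stands in for the spider “giving up and coming back later”: whatever is still queued when the budget runs out is what a return visit would pick up.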

Briefly, a subdomain is a “third-level domain name.” You’ve probably seen them before; they look something like this: subdomain.domain.com. Wikipedia, for instance, uses them for languages: the English version is “en.wikipedia.org” and the Dutch version is “nl.wikipedia.org.” Subdomains are one way to organize large sites, as opposed to multiple directories or even separate domain names entirely.
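For readers who like to see the levels spelled out, the sketch below splits a hostname into its dot-separated labels, reading from the right. The `split_hostname` helper is hypothetical, and it assumes a simple single-label top-level domain such as “com” or “org”; suffixes like “co.uk” would need a public-suffix list to handle correctly.

```python
def split_hostname(hostname):
    """Split a hostname into top-level, second-level, and subdomain parts.

    Assumption: the top-level domain is a single label (e.g. "org").
    """
    labels = hostname.lower().rstrip(".").split(".")
    if len(labels) < 2:
        raise ValueError(f"not a full hostname: {hostname!r}")
    return {
        "top_level": labels[-1],
        "second_level": labels[-2],
        # Everything left of the second-level label is the subdomain part.
        "subdomain": ".".join(labels[:-2]) or None,
    }


for host in ("en.wikipedia.org", "nl.wikipedia.org", "wikipedia.org"):
    print(host, split_hostname(host))
```

Under that assumption, “en” and “nl” come out as the third-level labels, while the bare “wikipedia.org” has no subdomain at all.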

So, we have a kind of page Google will index nearly “no questions asked.” It’s a wonder no one exploited this situation sooner. Some believe the reason may be that this “quirk” was introduced after the recent “Big Daddy” update. Our Eastern European friend got together some servers, content scrapers, spambots, PPC accounts, and some all-important, very inspired scripts, and mixed them all together thusly…

Five Billion Served- And Counting…

First, our hero crafted scripts for his servers that would, when GoogleBot dropped by, start generating an essentially endless number of subdomains, each with a single page containing keyword-rich scraped content, keyworded links, and PPC ads for those keywords. Spambots were sent out to put GoogleBot on the scent via referral and comment spam on thousands of websites around the world. The spambots provide the broad setup, and it doesn’t take much to get the dominoes to fall.

GoogleBot finds the spammed links and, as is its purpose in life, follows them into the network. Once GoogleBot is inside, the scripts running the servers simply keep generating pages: page after page, each with a unique subdomain, all with keywords, scraped content, and PPC ads. These pages get indexed, and suddenly you’ve got yourself a Google index roughly five billion pages heavier in under three weeks.

Reports indicate that, at first, the PPC ads on these pages were from AdSense, Google’s own PPC service. The ultimate irony, then, is that Google benefits financially from all the impressions being charged to AdSense users as their ads appear across these billions of spam pages. The AdSense revenue from this endeavor was the point, after all: cram in so many pages that, by sheer force of numbers, people would find and click on the ads in those pages, making the spammer a nice profit in a very short time.

© My Info Blog