This site examines the phenomenon of Wikipedia. We are interested in it because it has a massive, unearned influence on what passes for reliable information. Search engines rank its pages near the top of their results. While Wikipedia itself does not run ads, it is the most-scraped site on the web. Scrapers need content — any content will do — in order to carry ads from Google and other advertisers. The net effect is to turn Wikipedia into a generator of spam. This is primarily Google’s fault, since Wikipedia might find it difficult to address the scraping even if it wanted to. Google doesn’t care; its ad money comes right off the top.
Another problem is that most of the administrators at Wikipedia prefer to exercise their police functions anonymously. The process itself is open, but the identities of the administrators are usually cloaked behind a username and a Gmail address. (Gmail does not show an originating IP address in the email headers, which means that you cannot geolocate the sender, or even know whether two administrators are in fact the same person.) If an admin has a political or personal agenda, he can do a fair amount of damage with the special editing tools available to him. The victim may not even find out that this is happening until it’s too late. From Wikipedia, the material is spread like a virus by search engines and other scrapers, and the damage is amplified by orders of magnitude. There is no recourse for the victim, and no one can be held accountable. Once it’s all over the web, no one has the power to put the genie back in the bottle.
Plagiarism by Wikipedia editors
by Daniel Brandt
Here are the results of the first study of plagiarism in Wikipedia ever undertaken. I started with a list of 16,750 Wikipedia articles, drawn from a partial list of Wikipedia biographies of persons born before 1890. There was no particular reason for choosing this list, other than that it was available and its size was manageable.
The next task was to download the XML version of each Wikipedia article and try to extract between one and five clean sentences. The sentences have to be clean (i.e., “de-wikified”) or there is no hope of using them in a search engine, because each entire sentence is searched inside quotation marks. Each search looks like this:
-wikipedia -wiki "this is a complete sentence from an article"
When the sentence is inside quotation marks, any little variation within it can mean the difference between a hit and no hit. The cleaner the sentence, the better. The reason for the exclusion terms in front is that I don’t want mirrors of Wikipedia. I’m looking for plagiarism that goes the other direction — in other words, for articles that existed prior to the Wikipedia article.
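The cleaning-and-querying step described above can be sketched in a few lines of Python. This is a minimal illustration, not the script actually used for the study: the helper names (dewikify, build_query) are hypothetical, and real wikitext has far more markup than the handful of patterns handled here.

```python
import re

def dewikify(text):
    """Strip a few common kinds of wiki markup to get a plain sentence.
    (Simplified sketch; real wikitext has templates, refs, etc.)"""
    # [[target|label]] or [[target]] -> visible label/target
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]+)\]\]", r"\1", text)
    # '''bold''' and ''italic'' quote runs
    text = re.sub(r"'{2,}", "", text)
    # leftover HTML tags such as <ref>
    text = re.sub(r"<[^>]+>", "", text)
    # collapse whitespace
    return re.sub(r"\s+", " ", text).strip()

def build_query(sentence):
    """Wrap the cleaned sentence in quotes and prepend the exclusion terms
    that filter out Wikipedia itself and its mirrors."""
    return f'-wikipedia -wiki "{dewikify(sentence)}"'

print(build_query("He was born in [[Paris|the city of Paris]] in '''1852'''."))
# prints: -wikipedia -wiki "He was born in the city of Paris in 1852."
```

The exact-phrase quoting is why the cleaning matters: a single stray apostrophe or bracket left in the sentence would change the quoted string and turn a real match into a miss.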