How Common is Common Language?
Google Book Search (and, realistically, Google itself, which includes book results) has been touted as a fast, easy way to check for plagiarism. Just do a phrase search for a distinctive sentence or long phrase, and if you get a match you should check further for likely plagiarism.
Some writers seem to take it a little further. Plug in a distinctive five-word phrase; if there’s a match you’ve got a plagiarist. That’s nonsense. Few five-word phrases are all that distinctive, and I’m not sure it’s reasonable to call five borrowed words plagiarism.
Still, the general approach seems sound. With several million digitized books and billions of other text sources, and with phrase searching that appears able to accept fairly long phrases, it’s a good first step (if only a first step).
But what’s distinctive? How do you identify sentences that are good candidates for checking? Let’s turn that around: Aren’t most sentences common enough that they’re useless for detecting plagiarism?
I can’t take credit for that idea, although in one of those “accidental plagiarism” situations I’d nearly forgotten the seed of it. Fortunately, Google comes to the rescue. Paul Collins wrote “Dead plagiarists society” on November 21, 2006 at Slate (www.slate.com/id/2153313). Collins focuses on Google Book Search, noting the extent to which it had already aided folks in uncovering published plagiarism. He offers some examples and suggests that we may see a lot of newly detected plagiarism in the future.
But wait, you might ask, don't people accidentally repeat each other's sentences all the time? It seems to me that this should not be unusual. Yet try plugging that last sentence word by word into Google Book Search, and watch what happens.
It: Rejected—too many hits to count
It seems: 11,160,000 matches
It seems to: 3,050,000
It seems to me: 1,580,000
It seems to me that: 844,000
It seems to me that this: 29,700
It seems to me that this should: 237
It seems to me that this should not: 20
It seems to me that this should not be: 9
It seems to me that this should not be unusual: 0
It seems to me that this should not be unusual is itself ... unusual.
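For anyone who wants to repeat the word-by-word test, here is a minimal Python sketch that builds the growing list of phrases to search. It only generates the quoted queries; it does not contact Google or any other search service, and the function names are hypothetical helpers invented for this illustration.

```python
def growing_prefixes(sentence: str) -> list[str]:
    """Return the 1-word, 2-word, ... n-word prefixes of a sentence."""
    words = sentence.split()
    return [" ".join(words[:n]) for n in range(1, len(words) + 1)]


def as_phrase_query(phrase: str) -> str:
    """Wrap a phrase in double quotes, the usual exact-phrase search syntax."""
    return f'"{phrase}"'


sentence = "It seems to me that this should not be unusual"
prefixes = growing_prefixes(sentence)

# Print one quoted query per line, shortest first, ready to paste into a search box.
for prefix in prefixes:
    print(as_phrase_query(prefix))
```

Each printed line corresponds to one row of the list above; the hit counts themselves still have to be read off the search results by hand.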
There’s much more to Collins’ thoroughly entertaining article—and I already discussed the article in January (Cites & Insights 8.1). How do I know that? Because I tried Collins’ 10-word sentence in Google…and Cites & Insights 8.1 comes up third in a list of seven hits. (GBS still shows zero hits.)
When I discussed this in January 2008, I didn’t entirely buy the notion:
I would note that this is probably not the case for descriptive nonfiction sentences, at least taken one at a time: After all, there are only so many ways to state a fact. (That sentence, not including “after all,” appears twice in a Google search—both discussions of plagiarism—but not in Google Book Search.)
The first hit for Collins’ phrase is now a post at Althouse, a blog by Ann Althouse, a law professor. She titles the post “Hasn’t it all been said before?” and follows that with “No. Everything is actually amazingly new:” and the same stuff I quoted—properly attributed and linked, of course. The second commenter on the post argues against this proposition, saying in part:
The argument that a similar string of words necessarily proves plagiarism is a statistically naive argument.
There are relatively few commonly used words in English; it is highly unlikely that one can come up with a phrase or sentence that has not been used before, even, possibly, for the same subject matter. There are, for example, only so many ways that one can discuss Hamlet's ambivalence…
The next commenter quoted the phrase beginning “it is highly…” and responded:
Actually it is highly likely that any given sentence you speak has never been used before, unless the sentence is short and about a common subject. It just seems like the same sentences get reused a lot because our brains are amazingly efficient at distilling sentences down to their core meanings, which do get reused regularly.
The next commenter took the direct approach, searching “Ann has too much time on her hands.” No match. Another commenter searched the three phrases in the long sentence above (“There are…English,” “it is highly…sentence” and “that has not…subject matter”)—and didn’t find matches for any of them.
A later commenter points out the truth behind the statistics. Even if you assume a truly tiny vocabulary, the number of combinations in a sentence gets very big very fast. If you assume a mere 1,000 words (I’ve seen 2,000 to 6,000 words cited as the smallest plausible vocabularies for people to communicate effectively in English), you can construct one billion three-word sentences, one trillion four-word sentences and, shall we say, an exceedingly large number of nine-word sentences. (One billion billion billion, or 10 to the 27th power—an octillion different sentences using American wording.) As the commenter notes, “Not all sequences of valid English words are valid English sentences, but what you lose for that reason is peanuts, relatively speaking.” True: If 99.9% of combinations are invalid, that would still leave 10 to the 24th nine-word sentences…for an unrealistically small subset of English. Have a 6,000-word vocabulary? The numbers mount up a lot faster: Allowing nonsense combinations, you could have 10 million times as many nine-word sentences. I write a lot, but I won’t write even a billion sentences in my lifetime (almost certainly not even ten million)—and most of my sentences are a lot more than four words long.
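The arithmetic in that paragraph is easy to check. The following Python sketch simply raises the vocabulary size to the sentence length, counting every word sequence (nonsense included) exactly as the commenter's argument does. The 1,000-word and 6,000-word vocabularies and the 99.9% rejection rate come from the paragraph above; the function name is a hypothetical helper.

```python
def sequence_count(vocabulary_size: int, length: int) -> int:
    """Number of distinct word sequences of a given length, nonsense included."""
    return vocabulary_size ** length


SMALL_VOCABULARY = 1_000
LARGER_VOCABULARY = 6_000

three_word = sequence_count(SMALL_VOCABULARY, 3)   # one billion
four_word = sequence_count(SMALL_VOCABULARY, 4)    # one trillion
nine_word = sequence_count(SMALL_VOCABULARY, 9)    # 10 to the 27th power

# If 99.9% of the nine-word sequences are invalid, one in a thousand survives.
surviving_nine_word = nine_word // 1_000

# How many times more nine-word sequences a 6,000-word vocabulary allows.
ratio = sequence_count(LARGER_VOCABULARY, 9) // nine_word

print(f"{three_word:,} three-word sequences")
print(f"{four_word:,} four-word sequences")
print(f"{nine_word:.0e} nine-word sequences")
print(f"{surviving_nine_word:.0e} left after discarding 99.9%")
print(f"{ratio:,} times as many with 6,000 words")
```

The last figure is 6 to the ninth power, or 10,077,696, which is where "10 million times as many" comes from.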
Somehow, my English background trumped my math background—and I found it hard to believe most non-literary sentences would be all that unique. So I thought I’d run a slightly larger experiment, using random sentences from a body of writing by an author who doesn’t strive for clever phrasing—in other words, pretty ordinary sentences.
Fortunately, I could locate a writer who doesn’t use many fancy words, doesn’t strive for literary effect, could provide a bunch of paragraphs in machine-readable form—and wouldn’t take offense at being called a writer of ordinary sentences. All I had to do was look in the mirror.
The process was simple enough. I set up a simple spreadsheet, opened Google, and started copying in the first sentence of each paragraph from an issue of Cites & Insights heavy on essays, where odd proper names and the like would be less likely to skew the results. I expected to find matches for at least 25% of the sentences—after all, this is commonplace nonfiction writing. (How common is that last five-word phrase? Apparently not as common as I’d expect: A Google phrase search yields zero results.)
Almost as soon as I began the process (I’d planned to search 100 sentences in all, taking up to 18 words at a time and checking Google results to see how many different authors were involved, counting up to five) I ran into trouble.
I was consistently coming up with one author: Me. Even on shorter sentences. This made no sense. I deleted a couple of sentences that used the word “liblogs.” No help—and although some sentences used “blogs,” surely there have been millions of sentences written using that word by now.
The ones and zeros (portions of Cites & Insights don’t seem to be indexed by Google, although most of it is) kept on coming. After 20 or so, I started deliberately skewing the research toward “indistinct” sentences. I omitted sentences with proper nouns and sentences with nouns much more unusual than “librarians.” I started selecting smaller portions of sentences. And I set up a parallel column, taking the first eight words of sentences and retesting those: Surely I’d get lots of matches then!
I also moved beyond that issue of Cites & Insights to some unedited copy for this issue (being unedited, it was even more likely to be humdrum) and unedited drafts for an early “Crawford Files” column and a “disContent” column. All the while, I was avoiding distinctive nouns and what I thought of as distinctive writing.
And still there were few matches—a few more in eight-word subsets, but even there not all that many.
After 130 sentences or so (I was fascinated enough to enlarge the sample) I decided to broaden the range of authorship a bit. That had actually happened already: A few of the sentences were quotations from blog posts.
I took a handful of well-known liblogs and tried ten sentences from each—again, avoiding sentences that were inherently distinctive because of the terminology. In doing so, I noted that Dorothea Salo’s writing is distinctive even in short bursts, as are the posts of a few of the other bloggers I sampled.
I went outside the library field for one essay from New York (on the death of traditional publishing)—and then I tried something different: Wikipedia, frequently faulted for plagiarism. I took one essay that seemed like a good candidate (and included distinctive words in this case), “Jeremy Bentham,” and another essay on a fairly obscure topic, “Theosophy.” I did find cases that had the feel of plagiarism—but, except for the definition of theosophy (where an edit war seems to keep inserting a plagiarized definition), the cases of possible plagiarism seemed to be the other way around: Other websites using Wikipedia text without attribution. I’m not saying Wikipedia’s free of plagiarism; I’m saying I didn’t find obvious instances in the two articles (out of several million) and 18 sentences (out of hundreds) that I tried, except for the one definition. In the end, I removed the Wikipedia tests from the overall sample, replacing them with 18 others from my own unedited drafts in order to maintain overall coherence.
I tested 300 full or partial sentences, most of them twice. Forty-four test phrases were eight words or fewer; for the other 256, I also tested the first eight words.
Ten phrases out of 300—just over three percent—showed up more than once in Google, not counting attributed quotations. Here’s the full list—after all, it’s short!
· A funny thing happened on the way to this column
· On the other hand, it’s a lot of work
· I swear I don’t do this on purpose
· Let’s go a little further
· Times change, and change again.
· We learn in many ways
· That time came and went
· They’re free to express their opinions within reason
· That does not mean print is dead
· Improved technology cuts both ways
The first seven phrases were used by at least five different writers. The last three were used by two writers each.
Only two of these phrases are longer than eight words, and the first is an awfully convenient way to start an offbeat column (I believe it always appears as the first sentence in an article). I’ll take credit for six of the ten phrases so ordinary they couldn’t possibly represent plagiarism (the first, fourth through sixth, ninth and tenth).
Half of the non-unique phrases are only five words long—and, as it happens, none of the five-word phrases I tested turned out to be unique. Remember that I skewed selections toward non-uniqueness—I mean, “we learn in many ways” and “let’s go a little further” (both my sparkling prose) are so ordinary they come close to cliché status.
But there were also three four-word phrases—and all three of them tested as unique:
· Delayed commentary makes sense
· This is a scattered essay
· Is blogging scholarly communication
So did all but one of the six six-word sentences:
· Formal language does not grant authority
· Personal attacks undermine reasoned arguments
· But who cares about my conclusions
· Blind posts can damage honest discussion
· I’m ignoring all sorts of context
And all eight of the seven-word sentences and initial phrases:
· Was this genuine controversy or incited controversy?
· Citations are tricky, so many different formats
· She plays us a few of the clips
· It’s time common sense prevailed in Washington
· I’m a reasonably well-read, well-informed, well-educated person
· Collegiality and professionalism are perfectly fine qualities
· Blogging does have a real intellectual value
· That seems unlikely as a general situation
I wrote all six of the six-word sentences—but only one of the seven-word sentences, which came from seven different sources.
In all, 22 of the 300 test cases (7%) were sentences shorter than eight words, with another 22 (7%) exactly eight words. Two of the eight-word sentences showed up more than once, but 91% were unique. It’s certainly true that most non-unique test cases were eight words or fewer (eight of ten), but also true that most short sentences were still unique (81%).
Here’s the distribution for the 256 test phrases longer than eight words:
· Nine words: 31 cases
· Ten words: 45 cases
· Eleven words: 30 cases
· Twelve words: 51 cases
· Thirteen words: 32 cases
· Fourteen and fifteen words: 25 cases each
· Sixteen words: eight cases
· Seventeen words: six cases
· Eighteen words: three cases.
Here are a dozen of the phrases and sentences so unusual they don’t show up anywhere in Google’s corpus—or if they do, it’s only in the source from which I quoted and in sources properly quoting that one:
· Maybe not, but it turns out that this one works for me
· The task force has recently completed an initial draft report with recommendations
· The easiest thing to do would have been to skip the whole discussion
· He offers a few sentences that speak to what I’m getting at
· Partnerships are where you find them and what you make of them
· The other day I was walking from a meeting with a valued colleague
· It occurred to me that I’d probably be quite natural in a similar role
· People are angry and confused, searching for meaning and otherwise unclear how to respond
· There is much of interest in the specific results
· If you haven’t witnessed this type of behavior in person
· I think the answer is still yes, at least some of the time
· We disagree on a number of issues—and do so agreeably
I’m reluctant to label any of these as ordinary text, since some of them come from other people. They’re all clear and straightforward (which may not be ordinary at all). I’m guessing the authors would not suspect plagiarism if they happened to see these sentences in someone else’s writing. (I included sentences from the following blogs in this test—and, with few exceptions, it would be easy to find the original: The aardvark speaks, Blogwithoutalibrary, Catalogablog, Caveat lector, Free range librarian, Librarian.net, Mamamusings, LibrarianInBlack, Off the Mark, Lorcan Dempsey’s blog, Open stacks, The travelin’ librarian, The medium is the message, The shifted librarian, Tame the web and Walking paper.)
Common language isn’t nearly as common as you might think—or, rather, “ordinary” sentences seem to be unique a great deal more often than I would have anticipated.
Is it reasonable to suggest that nine of ten sentences (nine words or longer) are unique? I have no idea. That was the case in this small sampling, deliberately excluding most sentences I thought likely to be unique—except that it was more than 19 of 20.
Sure, if you look at the statistics, that seems likely. But it feels wrong, at least to me.
Is it a matter of vocabulary? I couldn’t resist that question. The 300 text samples total 3,392 words—and include 1,219 different words, with no normalization except for capitalization. That’s a modest vocabulary.
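A count like that can be reproduced with a few lines of Python. The sketch below assumes the same rule described above: words are split on whitespace and lowercased, with no other normalization, so punctuation stays attached to words. The three sample phrases are taken from the lists in this essay purely as a demonstration; the function name is hypothetical.

```python
def vocabulary_stats(samples: list[str]) -> tuple[int, int]:
    """Return (total words, distinct words), ignoring only capitalization."""
    words = [word.lower() for sample in samples for word in sample.split()]
    return len(words), len(set(words))


samples = [
    "We learn in many ways",
    "That time came and went",
    "We disagree on a number of issues",
]
total_words, distinct_words = vocabulary_stats(samples)
print(total_words, distinct_words)

# The reported figures for the full 300-sample run, for comparison.
REPORTED_TOTAL = 3_392
REPORTED_DISTINCT = 1_219
distinct_share = REPORTED_DISTINCT / REPORTED_TOTAL
print(f"{distinct_share:.1%} of the words in the full sample are distinct")
```

On the reported totals, a little more than a third of the words are distinct, and the average sample runs a bit over 11 words.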
Roughly five-sixths of these sentences are at least nine words long. So, I suspect, are most sentences in everyday written and spoken English. In Collins’ single test, uniqueness occurred at the ninth word. What happens with the 300 samples in this run if they’re truncated to eight words?
More of them show up from more than one author—35 more, for a total of 45 out of 300 tests, or 15%. (Naturally, the ten tests that weren’t unique at all weren’t unique at an eight-word limit.)
Of those commonly occurring briefer phrases, 28 have at least five different occurrences. I kept looking through the first hundred results (if there were that many) before giving up. Three more had four sources, four had three each, and ten had only two sources.
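The tallies for the eight-word run can be cross-checked with a short Python sketch. All the counts below come straight from the two preceding paragraphs; only the variable names are invented.

```python
TOTAL_TESTS = 300
NON_UNIQUE_AT_FULL_LENGTH = 10

# Number of eight-word test phrases, grouped by how many different sources used them.
sources_at_eight_words = {
    "five or more": 28,
    "four": 3,
    "three": 4,
    "two": 10,
}

non_unique = sum(sources_at_eight_words.values())
newly_non_unique = non_unique - NON_UNIQUE_AT_FULL_LENGTH
unique = TOTAL_TESTS - non_unique
non_unique_share = non_unique / TOTAL_TESTS

print(f"{non_unique} non-unique, {newly_non_unique} of them new at eight words")
print(f"{unique} still unique ({1 - non_unique_share:.0%})")
```

The four groups add up to the 45 non-unique phrases, leaving 255 phrases (85%) that were unique or unmatched at eight words.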
Some of the briefer phrases that weren’t unique:
· But if I want to go back to something
· I also found it interesting that there were
· I can’t tell you how exciting it is
· I come down strongly on the side of
· I don’t think I would have said that
· I have to say I had no idea
· It really got me thinking about how
· Oh, and for those of you unfamiliar with
· We need to treat each other with dignity
· What is very interesting to me is that
· We have a small but excellent group of
· Fall is always a hectic time of year
· You can read the fine print yourself, but
· If you see something that is not just
· It occurred to me that I’d probably be
· So a few weeks ago we started the
· We have been having some internal discussion about
A dozen of the 255 test phrases that still showed up once (or not at all) when limited to eight words or fewer:
· A sort of sewing kit for my life
· Absurd and even dangerous as it may be
· All the while he has argued for a
· And, later, the clear suggestion that increased plagiarism
· As a profession how do we find, identify
· At first, I was just thrilled to see
· Copyright helps maintain a balance between the needs
· He makes the interesting point that although a
· Humans may be flawed, but we have discovered
· Mixing the old and the young, the established
· That won’t reassure those who prefer to worry
The same as before, I think—although at a lower level. Even relatively short sentences seem to be unusual most of the time. On the order of 85% in this sample, and I suspect that percentage would be higher in a truly random sample.
While this is only an anecdotal study, I find it mildly convincing. Our sentences are much more varied than I expected, even when we’re not striving for literary excellence. (To those whom I’ve quoted here—always without attribution—who do strive for literary excellence and distinctive phrasing in each and every blog post: My apologies. It’s all good writing, or at least I’ll grant that for the 55% of these samples that other people wrote.)
What this little study does not show: That one duplicated sentence is evidence of plagiarism. There are, indeed, only so many ways to discuss Hamlet’s ambivalence. Or are there? Searching the words “Hamlet” and “ambivalence” in Google yields a claimed 57,000 results—and I didn’t spot any obvious duplications of phrasing in the first hundred.
What I believe may be true: If you’re suspicious that a clumsy plagiarist has cut-and-pasted without paraphrasing, almost any medium-length sentence may suggest you should check further. It may be entirely innocent. But it seems surprisingly uncommon for the same, say, 11-word string to show up more than once.
Cites & Insights is sponsored by YBP Library Services, http://www.ybp.com.
Opinions herein may not represent those of PALINET or YBP Library Services.
Comments should be sent to firstname.lastname@example.org. Cites & Insights: Crawford at Large is copyright © 2008 by Walt Crawford: Some rights reserved.
All original material in this work is licensed under the Creative Commons Attribution-NonCommercial License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/1.0 or send a letter to Creative Commons, 559 Nathan Abbott Way, Stanford, California 94305, USA.