Jeffrey Friedl's Blog » Doubling Up on Grammar Checks

by Jeffrey Friedl Mon, April 23rd, 2007 1:02pm JST (18 years, 2 months ago)

When I write for public consumption (book, magazine article, blog post...), I try to be a bit careful with how I present myself. I have the most difficult time with misspellings because they could bite me on the nose and I still wouldn't sense them. I tend to be okay with grammar, and I pick up most typos because I usually read and reread many times before publishing. Some still sneak through anyway.

Part of this carefulness is evident in the first example I give in my book on regular expressions. On the first page (First edition, Chapter 1, Page 1) I describe how regular expressions can be used to identify doubled words:

Here's the scenario: you're given the job of checking the pages on a web
server for doubled words (such as 'this this'), a common problem with
documents subject to heavy editing. Your job is to create a solution that
will:

* Accept any number of files to check, report each line of each file
that has doubled words, highlight (using standard ANSI escape
sequences) each doubled word, and ensure that the source filename
appears with each line in the report.

* Work across lines, even finding situations where a word at the end of
one line is repeated at the beginning of the next.

* Find doubled words despite capitalization differences, such as with 'The
the', as well as allow differing amounts of whitespace (spaces, tabs,
newlines, and the like) to lie between the words.

* Find doubled words even when separated by HTML tags. HTML tags are for
marking up text on World Wide Web pages, for example, to make a word bold:
...it is <B>very</B> very important....

That's certainly a tall order! But, it's a real problem that needs to be
solved. At one point while working on the manuscript for this book, I ran
such a tool on what I'd written so far and was surprised at the way
numerous doubled words had crept in. There are many programming languages
one could use to solve the problem, but one with regular expression support
can make the job substantially easier.

I first wrote that in 1995 or 1996, and it has survived through to the third edition, where I go on to present solutions in Perl, Java, and even Emacs.
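To make the requirements above concrete, here is a minimal Python sketch. It is not one of the solutions from the book; the regular expression, the function names (`find_doubled_words`, `highlight_doubled_words`, `report`), and the choice of reverse video for highlighting are all assumptions made for illustration. It treats a "word" as a run of ASCII letters, so it will miss doubled words containing digits, apostrophes, or accented characters.

```python
import re

# A word, then any mix of whitespace and HTML tags, then the same word again.
# re.IGNORECASE lets "The the" match; \s includes newlines, so a word at the
# end of one line followed by the same word on the next line is also caught.
DOUBLED = re.compile(
    r"\b([a-z]+)((?:\s|<[^>]+>)+)(\1)\b",
    re.IGNORECASE,
)

HIGHLIGHT = "\033[7m"  # ANSI escape: reverse video (an arbitrary choice)
RESET = "\033[m"       # ANSI escape: back to normal


def find_doubled_words(text):
    """Return (line_number, word) for each doubled word in text."""
    hits = []
    for m in DOUBLED.finditer(text):
        # Line number of the first word of the pair, counting from 1.
        line_number = text.count("\n", 0, m.start()) + 1
        hits.append((line_number, m.group(1)))
    return hits


def highlight_doubled_words(text):
    """Wrap both words of every doubled pair in ANSI highlight codes."""
    return DOUBLED.sub(
        lambda m: (
            f"{HIGHLIGHT}{m.group(1)}{RESET}"
            f"{m.group(2)}"
            f"{HIGHLIGHT}{m.group(3)}{RESET}"
        ),
        text,
    )


def report(filename, text):
    """Return 'filename: line' for each line that holds a highlighted word."""
    marked = highlight_doubled_words(text)
    return [
        f"{filename}: {line}"
        for line in marked.splitlines()
        if HIGHLIGHT in line
    ]
```

To check several files, one could read each file's full contents and print the lines returned by `report(filename, contents)`. Reading the whole file at once, rather than line by line, is what lets the pattern see a pair that straddles a line break.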

Running the program has long been part of the book's build process, to vet the prose for doubled words, but for some reason I never thought to apply it to my blog posts. I thought of it yesterday, and was shocked at the dozens of doubled-word typos I found in the 430ish posts I currently have on my blog. Dozens. I guess it's much easier to read what you believe to be there than what's actually there.

It's mildly interesting to realize that such errors would never make it past a first re-reading if the text were Japanese, because my reading of Japanese does not “flow” the way my reading of English does. When I read Japanese, I read individual words (or at least try to), then put them together to understand the sentence (or at least try to). Thus, I'd immediately notice a misplaced word... at least to the extent that my Japanese abilities allowed me to realize that it was indeed misplaced. 🙂

Since I was in tidy-up mode, I thought to also run a little utility I'd developed in 1996 but had since forgotten about. It checks for the misuse of “a” and “an” (e.g. “a apple”). This kind of error also crops up often in text subject to a lot of editing.

For example, in my post yesterday about the river in Kibune having a cooling effect, I originally wrote that it had the effect of “an air conditioner,” but at one point during composition I decided to add “natural” in there, and so ended up with “an natural air conditioner.” I didn't notice the glaring grammar error before posting, and only caught it when I ran my little utility, which pointed out about a dozen such errors spread across my posts over the last few years. Doh!
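A check like this can be approximated in a few lines of Python. The sketch below is not the 1996 utility (whose details aren't given here); it is a crude heuristic, and every name in it, including the short exception lists, is hypothetical. It guesses the first sound of a word from its first letter, with a handful of special cases, so it will produce both false alarms and misses on real prose.

```python
import re

# "a" or "an" as a whole word, followed by whitespace and the next word.
ARTICLE = re.compile(r"\b(an?)\s+([a-z]+)", re.IGNORECASE)

# Hypothetical and far from complete: prefixes whose first sound disagrees
# with their first letter.
VOWEL_SOUND_PREFIXES = ("hour", "honest", "honor", "heir")
CONSONANT_SOUND_PREFIXES = ("univers", "unit", "uniq", "use", "usu", "one", "eu")

# Letters whose *names* start with a vowel sound ("aitch", "em", "ess", ...).
VOWEL_SOUND_LETTERS = "AEFHILMNORSX"


def starts_with_vowel_sound(word):
    """Guess whether word begins with a vowel sound (a rough heuristic)."""
    if len(word) > 1 and word.isupper():
        # Assume an all-caps word is an abbreviation read letter by letter,
        # e.g. "an HTML tag" but "a URL".
        return word[0] in VOWEL_SOUND_LETTERS
    w = word.lower()
    if w.startswith(VOWEL_SOUND_PREFIXES):
        return True
    if w.startswith(CONSONANT_SOUND_PREFIXES):
        return False
    return w[0] in "aeiou"


def find_article_errors(text):
    """Return (article, word) pairs where "a"/"an" looks wrong."""
    errors = []
    for m in ARTICLE.finditer(text):
        article, word = m.group(1), m.group(2)
        wants_an = starts_with_vowel_sound(word)
        if wants_an != (article.lower() == "an"):
            errors.append((article, word))
    return errors
```

Run on the sentence from the Kibune post, `find_article_errors("it had the effect of an natural air conditioner")` flags the pair `("an", "natural")`. Because the guess is letter-based, the output is best treated as a list of places to look rather than a list of definite errors.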

The two checks — a vs. an and doubled words — are now part of my pre-post checkup. Hopefully, no more such errors will creep in (leaving room for different errors, no doubt).

July 21st, 2006: Mastering Regular Expressions, Third Edition
April 22nd, 2007: Getting Ready for Summer, in Kibune

Posted under: General, Tech

All 3 comments so far, oldest first...

Was it intentional, in an article about making mistakes, to misspell “grammar” in the title? 🙂

If I said “yes,” would you believe me and compliment me on my cleverness? 🙂

Sigh, it seems that the built-in spell checker in Firefox does not apply to the form fields in the WordPress new-post page. I guess they use funky JavaScript to simulate such a field, or something, which is why I never seem to catch the misspellings in my titles. I don’t have that problem (or excuse) with the post bodies because I write them in Emacs, and use my file-based-posts plugin to hook them into WordPress.

In any case, thanks Derek… fixed. —Jeffrey

comment by Derek on April 23rd, 2007 at 7:27pm JST (18 years, 2 months ago)

You can turn on spell checking of the title field by right clicking it and choosing “Spell check this field.”

Ah, I see, thanks Ben. Do you know how/why it’s different from normal form fields (for which spellchecking is always on)? —Jeffrey

comment by Ben Pharr on April 24th, 2007 at 12:01am JST (18 years, 2 months ago)

You know, *I’ve* never seen a spelling mistake in your blogs…

comment by Marcina on April 24th, 2007 at 1:39am JST (18 years, 2 months ago)