{"id":432,"date":"2007-04-23T13:02:57","date_gmt":"2007-04-23T04:02:57","guid":{"rendered":"https:\/\/regex.info\/blog\/2007-04-23\/432"},"modified":"2007-04-23T20:01:25","modified_gmt":"2007-04-23T11:01:25","slug":"doubling-up-on-grammer-checks","status":"publish","type":"post","link":"https:\/\/regex.info\/blog\/2007-04-23\/432","title":{"rendered":"Doubling Up on Grammar Checks"},"content":{"rendered":"\n\n<p>When I write for public consumption (book, magazine article, blog\npost....), <span class='nobr'>I try to<\/span> be <span class='nobr'>a bit<\/span> careful with how <span class='nobr'>I present<\/span> myself. <span class='nobr'>I have the<\/span>\nmost difficult time with misspellings because they could bite me on the\nnose and <span class='nobr'>I still<\/span> wouldn't sense them. <span class='nobr'>I tend to<\/span> be okay with grammar, and <span class='nobr'>I\npick<\/span> up most typos because usually <span class='nobr'>I read<\/span> and reread many times before\npublishing. Some often sneak through anyway.<\/p>\n\n<p>Part of this carefulness is evident in the first example <span class='nobr'>I give<\/span> in <a\nhref=\"http:\/\/regex.info\">my book on regular expressions<\/a>. <span class='nobr'>On the first<\/span>\npage (<span style='white-space:nowrap'>First edition,<\/span> <span\nstyle='white-space:nowrap'><span class='nobr'>Chapter 1,<\/span><\/span> <span\nstyle='white-space:nowrap'>Page 1<\/span>) <span class='nobr'>I describe<\/span> how regular\nexpressions can be used to identify doubled words:<\/p>\n\n<div class=\"ic\">\n<img loading=\"lazy\" decoding=\"async\" src=\"\/i\/MRE-DoubledWords.gif\" width=\"545\" height=\"430\"\nalt=\"\nHere's the scenario: you're given the job of checking the pages on a web\nserver for doubled words (such as 'this this'), a common problem with\ndocuments subject to heavy editing. 
Your job is to create a solution that\nwill:\n\n* Accept any number of files to check, report each line of each file\nthat has doubled words, highlight (using standard ANSI escape\nsequences) each doubled word, and ensure that the source filename\nappears with each line in the report.\n\n* Work across lines, even finding situations where a word at the end of\none line is repeated at the beginning of the next.\n\n* Find doubled words despite capitalization differences, such as with 'The\nthe', as well as allow differing amounts of whitespace (spaces, tabs,\nnewlines, and the like) to lie between the words.\n\n* Find doubled words even when separated by HTML tags. HTML tags are for\nmarking up text on World Wide Web pages, for example, to make a word bold:\n...it is &lt;B&gt;very&lt;\/B&gt; very important....\n\nThat's certainly a tall order! But, it's a real problem that needs to be\nsolved. At one point while working on the manuscript for this book, I ran\nsuch a tool on what I'd written so far and was surprised at the way\nnumerous doubled words had crept in. There are many programming languages\none could use to solve the problem, but one with regular expression support\ncan make the job substantially easier.\n\"\nclass=\"old_floating_img\"\nid=\"iMRE_DoubledWords\"\nindexhint=\"right\"\nstyle=\"margin: 20px auto; display:block\"\ntitle=\"\nHere's the scenario: you're given the job of checking the pages on a web\nserver for doubled words (such as 'this this'), a common problem with\ndocuments subject to heavy editing. 
Your job is to create a solution that\nwill:\n\n* Accept any number of files to check, report each line of each file\nthat has doubled words, highlight (using standard ANSI escape\nsequences) each doubled word, and ensure that the source filename\nappears with each line in the report.\n\n* Work across lines, even finding situations where a word at the end of\none line is repeated at the beginning of the next.\n\n* Find doubled words despite capitalization differences, such as with 'The\nthe', as well as allow differing amounts of whitespace (spaces, tabs,\nnewlines, and the like) to lie between the words.\n\n* Find doubled words even when separated by HTML tags. HTML tags are for\nmarking up text on World Wide Web pages, for example, to make a word bold:\n...it is &lt;B&gt;very&lt;\/B&gt; very important....\n\nThat's certainly a tall order! But, it's a real problem that needs to be\nsolved. At one point while working on the manuscript for this book, I ran\nsuch a tool on what I'd written so far and was surprised at the way\nnumerous doubled words had crept in. There are many programming languages\none could use to solve the problem, but one with regular expression support\ncan make the job substantially easier.\n\"\/><\/div>\n\n<p>I first wrote that in 1995 or 1996, and it's survived through to the <a\nhref=\"\/blog\/2006-07-21\/218\">third edition<\/a>, where <span class='nobr'>I go<\/span>\non to present solutions in <a\nhref=\"\/listing.cgi?ed=3&amp;p=78\">Perl<\/a>, <a\nhref=\"\/listing.cgi?ed=3&amp;p=81\">Java<\/a>, and even <a\nhref=\"\/listing.cgi?ed=3&amp;p=101\">emacs<\/a>.<\/p>\n\n<p>Running the program has long been part of the book's build process, to\nvet the prose for doubled words, but for some reason <span class='nobr'>I never<\/span> thought to\napply it to my blog posts. 
<span class='nobr'>I thought<\/span> of it yesterday, and was\n<i>shocked<\/i> at the dozens of doubled-word typos <span class='nobr'>I found<\/span> in the 430ish\nposts <span class='nobr'>I currently<\/span> have on my blog. <i>Dozens<\/i>. <span class='nobr'>I guess<\/span> it's so much easier to\nread what you believe to be there than what's actually there.<\/p>\n\n<p>It's mildly interesting to realize that such errors would never make it\npast <span class='nobr'>a first<\/span> re-reading if the text were Japanese, because my reading of\nJapanese does not &#8220;flow&#8221; the way it does when <span class='nobr'>I read<\/span> English. When <span class='nobr'>I read<\/span>\nJapanese, <span class='nobr'>I read individual<\/span> words (or at least try to), then put them\ntogether to understand the sentence (or at least try to). Thus, <span class='nobr'>I'd immediately<\/span> notice <span class='nobr'>a misplaced<\/span> word.... at least to the extent that my\nJapanese abilities allowed me to realize that it was indeed misplaced.\n\ud83d\ude42<\/p>\n\n<p>Since I was in tidy-up mode, <span class='nobr'>I thought<\/span> to also run <span class='nobr'>a little<\/span> utility I'd\ndeveloped in 1996 but had since forgotten about. <span class='nobr'>It checks<\/span> for the misuse\nof &#8220;a&#8221; and &#8220;an&#8221; (e.g. &#8220;<span class='nobr'>a apple<\/span>&#8221;). 
This\nkind of error also crops up often in text subject to <span class='nobr'>a lot<\/span> of editing.<\/p>\n\n<p>For example, in my post yesterday about the <a\nhref=\"\/blog\/2007-04-22\/431\">river in Kibune<\/a> having <span class='nobr'>a\ncooling<\/span> effect, <span class='nobr'>I originally<\/span> wrote that it had the effect of &#8220;an air\nconditioner,&#8221; but at one point during composition <span class='nobr'>I decided<\/span> to add\n&#8220;natural&#8221; in there, and so ended up with &#8220;an natural air\nconditioner.&#8221; <span class='nobr'>I didn't<\/span> notice the glaring grammar error before\nposting, nor until <span class='nobr'>I ran<\/span> my little utility that pointed out about <span class='nobr'>a dozen<\/span>\nsuch errors spread across my posts over the last few years. Doh!<\/p>\n\n<p>The two checks &mdash; <b>a <i>vs.<\/i> an<\/b> and <b>doubled words<\/b>\n&mdash; are now part of my pre-post checkup. Hopefully, no more such errors\nwill creep in (leaving room for different errors, no doubt).<\/p>\n\n","protected":false},"excerpt":{"rendered":"<p>When I write for public consumption (book, magazine article, blog post....), I try to be a bit careful with how I present myself. I have the most difficult time with misspellings because they could bite me on the nose and I still wouldn't sense them. I tend to be okay with grammar, and I pick up most typos because usually I read and reread many times before publishing. Some often sneak through anyway.<\/p> <p>Part of this carefulness is evident in the first example I give in my book on regular expressions. 
On the first page (First edition, Chapter 1, Page [...]","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1,4],"tags":[],"_links":{"self":[{"href":"https:\/\/regex.info\/blog\/wp-json\/wp\/v2\/posts\/432"}],"collection":[{"href":"https:\/\/regex.info\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/regex.info\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/regex.info\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/regex.info\/blog\/wp-json\/wp\/v2\/comments?post=432"}],"version-history":[{"count":0,"href":"https:\/\/regex.info\/blog\/wp-json\/wp\/v2\/posts\/432\/revisions"}],"wp:attachment":[{"href":"https:\/\/regex.info\/blog\/wp-json\/wp\/v2\/media?parent=432"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/regex.info\/blog\/wp-json\/wp\/v2\/categories?post=432"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/regex.info\/blog\/wp-json\/wp\/v2\/tags?post=432"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}