Abstract
In this paper, I describe several approaches to the automatic or semi-automatic creation of symbolic rules for grammar checkers and propose a purely corpus-based approach.
Traditional a priori approaches reuse existing positive or negative knowledge that is not based on empirical corpus research, such as usage dictionaries, spelling dictionaries, or formalized grammars. Mixed approaches apply linguistic knowledge to corpora in order to refine the intuitive prescriptions that dictionaries describe for human readers. For example, it is relatively easy to use machine-learning methods such as transformation-based learning (TBL) to create error-matching rules from real corpus material. TBL algorithms can start with dictionary knowledge (Mangu & Brill 1997) or with errors artificially introduced into corpora known to be relatively free of errors (Sjöberg & Knuttson 2005). Approaches based on reusing error corpora have often been dismissed as unrealistic because creating such corpora is costly. Yet there are ways to automate the building of such corpora by observing the frequency of user revisions to a text (Miłkowski 2008).
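The idea of observing the frequency of user revisions can be illustrated with a minimal sketch. It is not the procedure of the cited work: the function names, the whitespace tokenization, and the `max_len` and `min_count` thresholds are assumptions made only for this example. The sketch diffs pairs of consecutive revisions and keeps the short replacements that recur across edits.

```python
from collections import Counter
from difflib import SequenceMatcher


def replacement_pairs(old_text, new_text, max_len=3):
    """Yield (before, after) token spans that one revision replaced.

    Hypothetical helper: tokens are split on whitespace, and only short
    replacements (at most `max_len` tokens on each side) are kept.
    """
    old, new = old_text.split(), new_text.split()
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace" and i2 - i1 <= max_len and j2 - j1 <= max_len:
            yield " ".join(old[i1:i2]), " ".join(new[j1:j2])


def frequent_corrections(revision_pairs, min_count=2):
    """Count replacements over many revision pairs and keep the recurring ones."""
    counts = Counter()
    for old_text, new_text in revision_pairs:
        counts.update(replacement_pairs(old_text, new_text))
    return {pair: n for pair, n in counts.items() if n >= min_count}


revisions = [
    ("I could of gone home", "I could have gone home"),
    ("She could of won", "She could have won"),
    ("The red car", "The blue car"),
]
candidates = frequent_corrections(revisions)
```

A replacement made only once, such as a change of content words, falls below the threshold, while a correction repeated by several editors survives as a candidate entry for the error corpus.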
I show how an error corpus generated from Wikipedia revision history can be used to automatically generate error-matching symbolic rules for grammar checkers. Although no error corpus can be considered complete, TBL algorithms deal with small corpora sufficiently well. Automated rule building can also improve the precision of grammar checkers’ rules.
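To make the learning step concrete, here is a toy sketch of a greedy TBL loop. It assumes a single rule template ("rewrite token A as B when the previous token is P") and an error corpus of equal-length token sequences, so only substitution errors are covered. The names `apply_rule` and `learn_rules` are hypothetical, and real TBL implementations use richer templates.

```python
from collections import Counter


def apply_rule(tokens, rule):
    """Rewrite `wrong` as `right` wherever the preceding token is `prev`."""
    prev, wrong, right = rule
    out = list(tokens)
    for i in range(1, len(tokens)):
        if tokens[i] == wrong and tokens[i - 1] == prev:
            out[i] = right
    return out


def learn_rules(corpus, max_rules=10):
    """Greedy TBL loop over (erroneous_tokens, corrected_tokens) pairs.

    Assumption: both sequences in a pair have the same length.
    """
    current = [list(src) for src, _ in corpus]
    gold = [list(tgt) for _, tgt in corpus]
    learned = []
    for _ in range(max_rules):
        score = Counter()
        # Propose one candidate rule per remaining error; +1 for each fix.
        for cur, tgt in zip(current, gold):
            for i in range(1, len(cur)):
                if cur[i] != tgt[i]:
                    score[(cur[i - 1], cur[i], tgt[i])] += 1
        # Subtract 1 for every correct token a candidate would damage.
        for cur, tgt in zip(current, gold):
            for i in range(1, len(cur)):
                if cur[i] == tgt[i]:
                    for rule in list(score):
                        if rule[0] == cur[i - 1] and rule[1] == cur[i]:
                            score[rule] -= 1
        if not score:
            break
        best, gain = max(score.items(), key=lambda item: item[1])
        if gain <= 0:
            break
        learned.append(best)
        current = [apply_rule(cur, best) for cur in current]
    return learned


corpus = [
    ("I could of gone".split(), "I could have gone".split()),
    ("she could of won".split(), "she could have won".split()),
    ("a piece of cake".split(), "a piece of cake".split()),
]
rules = learn_rules(corpus)
```

Each learned rule is a plain (previous token, wrong token, replacement) triple. Subtracting the damage a rule would cause from the errors it would fix is what keeps the learner from rewriting the correct "piece of" in the third sentence.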
I present some of the automatically generated rules for Polish and English. Because they were learned using TBL, they have a symbolic form and are easily translated into the notation required by LanguageTool, an open-source general-purpose proofreading tool. As will be shown, some of the automatically generated rules tend to have higher recall than manually crafted ones. TBL rules do not allow the level of generality offered by LanguageTool (they have no regular expressions, let alone mechanisms such as feature unification), so human intervention is useful for joining the resulting rules into a single LanguageTool rule.
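The translation and joining step can be sketched as follows. The `ContextRule` class and the `to_languagetool_rule` function are hypothetical, and the emitted XML is a simplified fragment in the style of LanguageTool's pattern rules (`rule`, `pattern`, `token`, `marker`, `message`, `suggestion`); it has not been validated against any particular LanguageTool schema version. Several learned rules that share one correction are merged into a single rule whose context token is a regular-expression alternation.

```python
import re
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextRule:
    """Hypothetical TBL-style rule: replace `wrong` with `right` after `prev_word`."""
    prev_word: str
    wrong: str
    right: str


def to_languagetool_rule(rules, rule_id, name):
    """Merge rules that share one correction into a single <rule> element."""
    corrections = {(r.wrong, r.right) for r in rules}
    if len(corrections) != 1:
        raise ValueError("rules must share the same wrong/right pair")
    wrong, right = next(iter(corrections))
    contexts = sorted({r.prev_word for r in rules})

    rule = ET.Element("rule", id=rule_id, name=name)
    pattern = ET.SubElement(rule, "pattern")
    context = ET.SubElement(pattern, "token")
    if len(contexts) > 1:
        # Join the alternative contexts into one regular-expression token.
        context.set("regexp", "yes")
        context.text = "|".join(re.escape(word) for word in contexts)
    else:
        context.text = contexts[0]
    # The marker limits the reported error (and the suggestion) to the wrong token.
    marker = ET.SubElement(pattern, "marker")
    ET.SubElement(marker, "token").text = wrong

    message = ET.SubElement(rule, "message")
    message.text = "Did you mean "
    suggestion = ET.SubElement(message, "suggestion")
    suggestion.text = right
    suggestion.tail = "?"
    return ET.tostring(rule, encoding="unicode")


learned = [
    ContextRule("could", "of", "have"),
    ContextRule("should", "of", "have"),
    ContextRule("would", "of", "have"),
]
xml_rule = to_languagetool_rule(learned, "MODAL_OF", "modal verb + of")
print(xml_rule)
```

In this sketch the merge is purely mechanical. Deciding which learned rules belong together, and whether a broader pattern is still precise enough, is the part that benefits from human review.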
See the full paper draft here.