Automatically correcting all misspellings in text data in R

Time: 2017-07-29 07:29:28

Tags: r text-mining hunspell soundex spelling

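I have been looking for a way to correct misspellings in text in R without having to add or replace words manually. My data is free text: the complaints of patients presenting to the emergency room. After running a simple random forest to select the top 100 important features, this is what I got: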

> predictors(results1)

  [1] "back"       "refil"      "pain"       "med"        "cough"      "sob"        "day"        "chronic"    "deni"      
 [10] "right"      "brought"    "hit"        "request"    "injuri"     "hemorrhoid" "hour"       "clot"       "depress"   
 [19] "nausea"     "congest"    "clinic"     "headach"    "chest"      "sore"       "month"      "elev"       "dizzi"     
 [28] "toothach"   "week"       "throat"     "head"       "also"       "small"      "vomit"      "famili"     "seen"      
 [37] "burn"       "last"       "report"     "hematuria"  "per"        "walter"     "abdomin"    "ear"        "side"      
 [46] "low"        "nasal"      "intermitt"  "night"      "drh"        "dri"        "eye"        "obtain"     "patient"   
 [55] "pressur"    "product"    "take"       "vet"        "fever"      "blood"      "ago"        "due"        "extrem"    
 [64] "feel"       "note"       "triag"      "weak"       "aaa"        "aand"       "aarm"       "aava"       "abcess"    
 [73] "abcsess"    "abd"        "abdimin"    "abdnorm"    "abdomen"    "abdomi"     "abdominal"  "abdominla"  "abdonin"   
 [82] "abdpain"    "abil"       "abilifi"    "abl"        "ablat"      "abliat"     "abnd"       "abnorm"     "abouthi"   
 [91] "abraid"     "abraison"   "abras"      "abscess"    "absent"     "abul"       "abus"       "abuterol"   "abx"       
[100] "abxno"

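The rows starting with [73] and [82] show how misspellings affect my results: "abcsess" sits next to "abscess", and "abdominla" and "abdonin" sit next to "abdominal", so one word is split across several features.

I have read about and tried Hunspell, Aspell and Soundex, as well as the vwr and RecordLinkage packages. The problem with Aspell is that I could not get it to work on my laptop; as far as I can tell, it requires installing old software on Windows, and that software is very hard to use. With the other packages, my problem is that I do not want to go through 6k words one by one and add them to a list, put them together in pairs, or arrange them in a suitable form for comparison. That would take a very long time.

Do you have any suggestions for how to write code in R that automatically finds the closest correctly spelled word and substitutes it for each misspelled word in my dataset? Or is there a way to make the packages named above do the same job?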
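As a rough illustration of the kind of automatic replacement being asked about, here is a minimal sketch that uses only base R's `adist()` (Levenshtein distance). It rests on an assumption the question does not state: that tokens occurring often in the corpus are spelled correctly, so each rare token can be mapped to its nearest frequent token. The helper name `correct_tokens`, the thresholds `min_count` and `max_dist`, and the counts in the example are all hypothetical; only the tokens themselves come from the output above.

```r
# Hypothetical helper: map each rare token to its closest frequent token.
# Assumes `tokens` holds unique strings and `counts` their corpus frequencies.
correct_tokens <- function(tokens, counts, min_count = 5, max_dist = 2) {
  mapping <- setNames(tokens, tokens)          # default: keep every token as-is
  trusted <- tokens[counts >= min_count]       # assumed to be spelled correctly
  suspect <- tokens[counts <  min_count]       # candidates for correction
  if (length(trusted) == 0 || length(suspect) == 0) return(mapping)

  # Edit-distance matrix: one row per suspect token, one column per trusted token.
  d <- utils::adist(suspect, trusted)
  best      <- apply(d, 1, which.min)          # index of nearest trusted token
  best_dist <- d[cbind(seq_along(suspect), best)]

  # Replace only when the nearest trusted token is close enough.
  mapping[suspect] <- ifelse(best_dist <= max_dist, trusted[best], suspect)
  mapping
}

# Tokens taken from the feature list above; the counts are made up for illustration.
tokens <- c("abscess", "abdominal", "pain", "abcess", "abcsess", "abdominla")
counts <- c(40, 120, 300, 2, 1, 1)

mapping <- correct_tokens(tokens, counts)
print(mapping)
```

With these made-up counts, "abcess" and "abcsess" map to "abscess" and "abdominla" maps to "abdominal", while the frequent tokens are left unchanged. The distance cap matters: set too high, it would merge genuinely different short words (for example "abd" and "abx"), so the resulting mapping should still be spot-checked before it is applied to the data.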

Thanks.

0 answers:

No answers