Question

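I have a database of sentences written entirely in uppercase letters. The database is technical and contains medical terms, and I want to normalize it so that the capitalization is (close to) what a user would expect. What is the best way to achieve this? Is there a freely available dataset I can use to help with the process?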

Answer 1

One way would be to infer capitalization from POS tags, for example using the Python Natural Language Toolkit (NLTK):

import re

import nltk

def truecase(text):
    truecased_sents = []  # list of truecased sentences
    for sent in nltk.sent_tokenize(text):
        # apply POS-tagging to the lowercased tokens
        tagged_sent = nltk.pos_tag([word.lower() for word in nltk.word_tokenize(sent)])
        # infer capitalization from POS-tags: capitalize tokens tagged as nouns
        normalized_sent = [w.capitalize() if t in ["NN", "NNS"] else w for (w, t) in tagged_sent]
        # capitalize first word in sentence
        if normalized_sent:
            normalized_sent[0] = normalized_sent[0].capitalize()
        # use regular expression to remove the space join() leaves before punctuation
        truecased_sents.append(re.sub(r" (?=[.,'!?:;])", "", " ".join(normalized_sent)))
    return " ".join(truecased_sents)

This won't be perfect, especially since I don't know what your data looks like, but it should give you the idea:

>>> text = "Clonazepam Has Been Approved As An Anticonvulsant To Be Manufactured In 0.5mg, 1mg And 2mg Tablets. It Is The Generic Equivalent Of Roche Laboratories' Klonopin."
>>> truecase(text)
"Clonazepam has been approved as an anticonvulsant to be manufactured in 0.5mg, 1mg and 2mg Tablets. It is the generic Equivalent of Roche Laboratories' Klonopin."

Answer 2

Search for existing work on truecasing: http://en.wikipedia.org/wiki/Truecasing

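If you have access to similar medical data with normal capitalization, generating your own dataset would be very easy. Uppercase everything, and use the mapping back to the original text to train and test your algorithm.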
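
A minimal sketch of that idea, assuming you have some correctly cased in-domain text (the `cased_corpus` string below is a made-up stand-in for it). It learns the most frequent casing of each word and applies it to all-caps input. The names `train_casing_model` and `restore_case` are hypothetical, and the sentence-boundary check is deliberately naive:

```python
import re
from collections import Counter, defaultdict

WORD_RE = re.compile(r"[A-Za-z]+")  # assumption: words are runs of ASCII letters

def _starts_sentence(text, pos):
    """True if the word at `pos` opens the text or follows '.', '!' or '?' plus whitespace."""
    prefix = text[:pos]
    before = prefix.rstrip()
    return before == "" or (before[-1] in ".!?" and before != prefix)

def train_casing_model(cased_text):
    """Map each lowercased word to its most frequent cased form."""
    counts = defaultdict(Counter)
    for match in WORD_RE.finditer(cased_text):
        # Skip sentence-initial words: their capital letter is positional, not lexical.
        if _starts_sentence(cased_text, match.start()):
            continue
        counts[match.group().lower()][match.group()] += 1
    return {word: forms.most_common(1)[0][0] for word, forms in counts.items()}

def restore_case(upper_text, model):
    """Re-case all-caps text; words unseen in training fall back to lowercase."""
    def recase(match):
        word = model.get(match.group().lower(), match.group().lower())
        if _starts_sentence(upper_text, match.start()):
            word = word[0].upper() + word[1:]
        return word
    # Only the matched words are rewritten; digits, spaces and punctuation stay in place.
    return WORD_RE.sub(recase, upper_text)

cased_corpus = (
    "The patient takes Klonopin daily. "
    "Clonazepam is sold by Roche as Klonopin. "
    "The dose of Klonopin was reduced."
)
model = train_casing_model(cased_corpus)
print(restore_case("ROCHE SELLS KLONOPIN. THE DOSE WAS REDUCED.", model))
# prints: Roche sells Klonopin. The dose was reduced.
```

To test such a model, hold out some cased sentences, uppercase them with `str.upper()`, and compare the restored output against the originals.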

Answer 3

The simplest way would be to use a spelling-correction algorithm based on n-grams.

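You could use, for example, the LingPipe SpellChecker. Source code is available for predicting spaces between words, and something similar could be used to predict case.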
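
To illustrate the n-gram idea only (this is not LingPipe's API, which is a Java library), here is a rough Python sketch of a word-bigram caser: it picks the cased form of each word most often seen after the previous word, and backs off to the word's overall most frequent form. The function names, the `<s>` start marker, and the toy training sentences are all assumptions made for the example, and the input is assumed to be already split into tokens without attached punctuation:

```python
from collections import Counter, defaultdict

START = "<s>"  # hypothetical sentence-start marker

def train_bigram_caser(cased_sentences):
    """Count cased forms per word, and per (previous word, word) pair."""
    unigrams = defaultdict(Counter)  # lowercased word -> cased forms
    bigrams = defaultdict(Counter)   # (previous lowercased word, lowercased word) -> cased forms
    for sentence in cased_sentences:
        prev = START
        for token in sentence.split():
            low = token.lower()
            unigrams[low][token] += 1
            bigrams[(prev, low)][token] += 1
            prev = low
    return unigrams, bigrams

def truecase_tokens(tokens, unigrams, bigrams):
    """Choose a casing for each token from bigram counts, backing off to unigram counts."""
    out, prev = [], START
    for token in tokens:
        low = token.lower()
        context_forms = bigrams.get((prev, low))
        forms = context_forms or unigrams.get(low)
        word = forms.most_common(1)[0][0] if forms else low
        # With no evidence for this word at sentence start, capitalize it by rule.
        if prev == START and not context_forms:
            word = word[:1].upper() + word[1:]
        out.append(word)
        prev = low
    return out

training = [
    "The patient was seen at Massachusetts General Hospital",
    "The patient received general anesthesia",
    "A general practitioner referred the patient",
]
unigrams, bigrams = train_bigram_caser(training)
tokens = "THE PATIENT WAS REFERRED TO MASSACHUSETTS GENERAL HOSPITAL".split()
print(" ".join(truecase_tokens(tokens, unigrams, bigrams)))
```

The context is what lets the same word come out as "General" after "Massachusetts" but as "general" elsewhere, which a per-word lookup cannot do.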

How can I best determine the correct capitalization of a word?
