Figuring out whether an apostrophe is a quote or a contraction

Time: 2016-04-25 17:21:39

Tags: arrays ruby nlp enumeration

I'm looking for a way to look through a sentence and tell whether an apostrophe marks a quote or a contraction, so that I can remove the punctuation from the string and then normalize all the words.

My test sentence is: don't frazzel the horses. 'she said wow'.

In my attempt I split the sentence into word parts, so that words and non-words are separated out, like this:

contractionEndings = ["d", "l", "ll", "m", "re", "s", "t", "ve"]

sentence = "don't frazzel the horses. 'she said wow'.".split(/(\w+)|(\W+)/i).reject! { |word| word.empty? }

This returns ["don", "'", "t", " ", "frazzel", " ", "the", " ", "horses", ". '", "she", " ", "said", " ", "wow", "'."]

Next I want to be able to iterate over the sentence looking for apostrophes ', and when one is found, compare the next element to see whether it is included in the contractionEndings array. If it is included, I want to join the prefix, the apostrophe ', and the suffix into a single element; otherwise the apostrophe should be deleted.

In this example, don't would be joined back into the single element don't, but . ' and '. would be deleted.
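
Roughly, the pass I have in mind looks like this sketch (my own untested illustration, run against the split output above):

contractionEndings = ["d", "l", "ll", "m", "re", "s", "t", "ve"]
parts = "don't frazzel the horses. 'she said wow'.".split(/(\w+)|(\W+)/).reject { |p| p.empty? }

words = []
i = 0
while i < parts.length
  if parts[i] =~ /\A\w+\z/ && parts[i + 1] == "'" && contractionEndings.include?(parts[i + 2])
    words << parts[i] + "'" + parts[i + 2]   # rejoin e.g. "don" + "'" + "t" => "don't"
    i += 3
  else
    words << parts[i] if parts[i] =~ /\A\w+\z/   # keep plain words
    i += 1                                       # drop spaces and stray punctuation
  end
end

words.join(" ")   #=> "don't frazzel the horses she said wow"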

After that I can run a regex to remove the remaining punctuation from the sentence, so I can pass it to my stemmer to normalize the input.

The final output I'm after is don't frazzel the horses she said wow, where all punctuation except the apostrophe has been removed.
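
For the cleanup step, I am picturing something like this regex sketch, which keeps only apostrophes flanked by word characters on both sides:

s = "don't frazzel the horses. 'she said wow'."
s.gsub(/[^\w\s']/, "")              # drop periods and the rest of the punctuation
 .gsub(/(?<!\w)'|'(?!\w)/, "")      # drop apostrophes not flanked by word characters
 .squeeze(" ").strip
   #=> "don't frazzel the horses she said wow"

(A possessive like Chris' would lose its apostrophe under this rule, which may or may not matter.)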

If anyone has suggestions for making this work, or a better idea of how to approach the problem, I'd like to hear it.

Overall, I want to remove all punctuation from the sentence except for contractions.

Thanks

4 answers:

Answer 0 (score: 1)

How about this?

irb:0> s = "don't frazzel the horses. 'she said wow'."
irb:0> contractionEndings = ["d", "l", "ll", "m", "re", "s", "t", "ve"]
irb:0> s.scan(/\w+(?:'(?:#{contractionEndings.join('|')}))?/)
=> ["don't", "frazzel", "the", "horses", "she", "said", "wow"]

The regex scans for some "word" characters, then optionally (with ?) an apostrophe plus a contraction ending. You can interpolate Ruby expressions just like in a double-quoted string, so we can join the endings with the regex alternation operator |. The last thing is to mark the groups (the parts in parentheses) as non-capturing with ?:, so that scan doesn't return a bunch of nils, just the whole match on each iteration.
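
To see why the ?: matters: with capturing groups, scan returns arrays of the group captures (nils included) instead of the whole match. A quick illustration:

irb:0> "don't do it".scan(/(\w+)('(?:t|s))?/)
=> [["don", "'t"], ["do", nil], ["it", nil]]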

Or perhaps you don't need an explicit list of contraction endings at all with this approach. Thanks to Cary, I've also fixed some other problematic constructs:

irb:0> "don't -frazzel's the jack-o'-lantern's handle, ma'am- 'she said hey-ho'.".scan(/\w[-'\w]*\w(?:'\w+)?/)
=> ["don't", "frazzel's", "the", "jack-o'-lantern's", "handle", "ma'am", "she", "said", "hey-ho"]

Answer 1 (score: 1)

As I mentioned in a comment, I think trying to list all possible contraction endings is futile. In fact, some contractions, such as "couldn't've", contain more than one apostrophe.

Another option is to match the single quotes. My first thought was to remove the character "'" if it is at the beginning of the sentence or after a space, or if it is followed by a space or is at the end of the sentence. Unfortunately, that approach is foiled by possessives ending in "s": "Chris' cat has fleas". Worse, how would we interpret "Where is Chris' car?" or "'Twas the 'night before Christmas'."?

Here is a way to remove single quotes when no word begins or ends with an apostrophe (admittedly, of questionable value).

r = /
    (?<=\A|\s) # match the beginning of the string or a whitespace char in a
               # positive lookbehind
    \'         # match a single quote
    |          # or 
    \'         # match a single quote
    (?=\s|\z)  # match a whitespace char or the end of the string in a
               # positive lookahead
    /x         # free-spacing regex definition mode

"don't frazzel the horses. 'she said wow'".gsub(r,'')
  #=> "don't frazzel the horses. she said wow" 

I think the best solution would be for English to use different symbols for apostrophes and single quotes.
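
For what it's worth, if the input did make that distinction (say, U+2019 ’ for apostrophes and the straight ' only for quotes), the whole problem would collapse to something like this sketch:

s = "don\u2019t frazzel the horses. 'she said wow'."   # hypothetical input: curly apostrophe, straight quotes
s.delete("'").scan(/[\w\u2019]+/).join(" ")
  #=> "don’t frazzel the horses she said wow"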

Answer 2 (score: 1)

You can use the Pragmatic Tokenizer gem. It can detect English contractions:

s = "don't frazzel the horses. 'she said wow'."
PragmaticTokenizer::Tokenizer.new(punctuation: :none).tokenize(s)
=> ["don't", "frazzel", "the", "horses", "she", "said", "wow"]

s = "'Twas the 'night before Christmas'."
PragmaticTokenizer::Tokenizer.new(punctuation: :none).tokenize(s)
=> ["'twas", "the", "night", "before", "christmas"]

s = "He couldn’t’ve been right."
PragmaticTokenizer::Tokenizer.new(punctuation: :none).tokenize(s)
=> ["he", "couldn’t’ve", "been", "right"]

Answer 3 (score: 0)

Usually the apostrophe stays with the contraction after tokenization.

Try an ordinary NLP tokenizer, e.g. nltk in Python:

>>> from nltk import word_tokenize
>>> word_tokenize("don't frazzel the horses")
['do', "n't", 'frazzel', 'the', 'horses']

For multiple sentences:

>>> from string import punctuation
>>> from nltk import sent_tokenize, word_tokenize
>>> text = "don't frazzel the horses. 'she said wow'."
>>> sents = sent_tokenize(text)
>>> sents
["don't frazzel the horses.", "'she said wow'."]
>>> [word for word in word_tokenize(sents[0]) if word not in punctuation]
['do', "n't", 'frazzel', 'the', 'horses']
>>> [word for word in word_tokenize(sents[1]) if word not in punctuation]
["'she", 'said', 'wow']

To flatten the per-sentence token lists after word_tokenize:

>>> from itertools import chain
>>> sents
["don't frazzel the horses.", "'she said wow'."]
>>> [word_tokenize(sent) for sent in sents]
[['do', "n't", 'frazzel', 'the', 'horses', '.'], ["'she", 'said', 'wow', "'", '.']]
>>> list(chain(*[word_tokenize(sent) for sent in sents]))
['do', "n't", 'frazzel', 'the', 'horses', '.', "'she", 'said', 'wow', "'", '.']
>>> [word for word in list(chain(*[word_tokenize(sent) for sent in sents])) if word not in punctuation]
['do', "n't", 'frazzel', 'the', 'horses', "'she", 'said', 'wow']

Note that the single quote stays attached to 'she. Sadly, amid all the hype around sophisticated (deep) machine learning methods these days, simple tokenization still has its weak spots =(

It gets things wrong even with formally grammatical text:

>>> text = "Don't frazzel the horses. 'She said wow'."
>>> sents = sent_tokenize(text)
>>> sents
["Don't frazzel the horses.", "'She said wow'."]
>>> [word_tokenize(sent) for sent in sents]
[['Do', "n't", 'frazzel', 'the', 'horses', '.'], ["'She", 'said', 'wow', "'", '.']]