In a file I have text like this, broken across lines at arbitrary points:
Spencer J. Volk, president and CEO of this company, was elected a director.
Mr. Volk, 55 years old, succeeds Duncan Dwight,
who retired in September.
I am using nltk's sentence tokenizer to find the sentences, and then POS-tagging the words within those sentences. For example, after tagging I get output like this (a list of (word, tag) tuples, one per word in the sentence):
[('Spencer', u'NNP'), ('J.', u'NNP'), ('Volk', u'NNP'), ('president', u'NN'), ('and', u'CC'), ('CEO', u'NN'), ('of', u'IN'), ('this', u'DT'), ('company', u'NN'), ('was', u'VBD'), ('elected', u'VBN'), ('a', u'DT'), ('director', u'NN')]
[('Mr.', u'NNP'), ('Volk', u'NNP'), ('55', u'CD'), ('years', u'NNS'), ('old', u'JJ'), ('succeeds', u'VBZ'), ('Duncan', u'NNP'), ('Dwight', u'NNP'), ('who', u'WP'), ('retired', u'VBD'), ('in', u'IN'), ('September', u'NNP')]
But now I want to write the tags to another file with the same line breaks as in the original file I read the text from. For the example above it would look something like:
NNP NNP NNP NN CC NN IN DT NN VBD VBN DT NN
NNP NNP CD NNS JJ VBZ NNP NNP
WP VBD IN NNP
I can get the tags and everything in this form, but how do I correlate the original line breaks with breaks in the list of tags?
One way to do this would be to split each sentence, find the indices of the \n characters, hope that each split corresponds to one word of the sentence (which may not always be true), and then break the tag list at those indices. But this is more of a hack and fails in many cases. What would be a more robust way to achieve this?
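A minimal sketch of that hack (split_tags_by_newline is a hypothetical helper; the one-token-per-split assumption is exactly where it breaks):

def split_tags_by_newline(sentence, tags):
    # Break the tag list wherever the original sentence had a newline,
    # assuming each whitespace-separated chunk became exactly one token
    # (not always true: word_tokenize splits e.g. "Dwight," into two tokens).
    chunks, i = [], 0
    for line in sentence.split('\n'):
        n = len(line.split())
        chunks.append(tags[i:i + n])
        i += n
    return chunks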
Answer 0 (score: 0)
Ignoring the newlines and using sent_tokenize:
>>> from nltk import word_tokenize, pos_tag, sent_tokenize
>>> text = """Spencer J. Volk, president and CEO of this company, was elected a director.
... Mr. Volk, 55 years old, succeeds Duncan Dwight,
... who retired in September. """
>>>
>>> text = " ".join(i for i in text.split('\n'))
>>> tagged_text = [pos_tag(word_tokenize(sent)) for sent in sent_tokenize(text)]
>>> for sent in tagged_text:
... poses = " ".join(pos for word, pos in sent)
... print poses
...
NN NNP NNP , NN CC NNP IN DT NN , VBD VBN DT NN .
NNP NNP , CD NNS JJ , NNS NNP NNP , WP VBN IN NNP .
Taking the newlines into account:
>>> from nltk import word_tokenize, pos_tag
>>> text = """Spencer J. Volk, president and CEO of this company, was elected a director.
... Mr. Volk, 55 years old, succeeds Duncan Dwight,
... who retired in September. """
>>>
>>> tagged_text = [pos_tag(word_tokenize(sent)) for sent in text.split('\n')]
>>> for sent in tagged_text:
... poses = " ".join(pos for word, pos in sent)
... print poses
...
NN NNP NNP , NN CC NNP IN DT NN , VBD VBN DT NN .
NNP NNP , CD NNS JJ , NNS NNP NNP ,
WP VBN IN NNP .
You will realize that the tagger behaves just the same whether or not it is given properly segmented sentences. That is because the contextual information the POS tagger uses is weaker than the words' default tags, so whether you sent_tokenize first or split again at non-sentence boundaries makes little difference.
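A quick way to check that for yourself (a sketch; the exact tags depend on your NLTK version, so no output is shown):

from nltk import word_tokenize, pos_tag

full = pos_tag(word_tokenize("Mr. Volk, 55 years old, succeeds Duncan Dwight, who retired in September."))
frag = pos_tag(word_tokenize("who retired in September."))
# Compare the tags the trailing clause gets in context vs. on its own:
print([tag for word, tag in full][-5:])
print([tag for word, tag in frag])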
If you want to sent_tokenize and then split the tags according to the \n of the original file:
>>> from itertools import chain
>>> from nltk import sent_tokenize, word_tokenize, pos_tag
>>> text = """Spencer J. Volk, president and CEO of this company, was elected a director.
... Mr. Volk, 55 years old, succeeds Duncan Dwight,
... who retired in September. """
>>> sent_lens = [len(word_tokenize(line)) for line in text.split('\n')]
>>> sent_lens
[16, 11, 5]
>>> tagged_text = [[pos for word,pos in pos_tag(word_tokenize(line))] for line in sent_tokenize(text)]
>>> all_tags = list(chain(*tagged_text))
>>> offset = 0
>>> for l in sent_lens:
...     print(" ".join(all_tags[offset:offset+l]))
...     offset += l
...
NN NNP NNP , NN CC NNP IN DT NN , VBD VBN DT NN .
NNP NNP , CD NNS JJ , NNS NNP NNP ,
WP VBN IN NNP .
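The same split can also be written with an iterator over the flattened tag list, which avoids the offset bookkeeping (a sketch under the same assumption that tokenizing line by line yields the same tokens as tokenizing sentence by sentence):

>>> tag_iter = iter(all_tags)
>>> for l in sent_lens:
...     print(" ".join(next(tag_iter) for _ in range(l)))
...
NN NNP NNP , NN CC NNP IN DT NN , VBD VBN DT NN .
NNP NNP , CD NNS JJ , NNS NNP NNP ,
WP VBN IN NNP .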
Answer 1 (score: 0)
An interesting puzzle. First, note that nltk.sent_tokenize() will retain the newlines that occur inside a sentence:
import nltk

sents = nltk.sent_tokenize(text)
for s in sents:
    print(repr(s))
So, to interleave the POS tags with newlines, you can step through each sentence one word at a time and check for a newline in between:
def process_sent(sent):
    tagged = nltk.pos_tag(nltk.word_tokenize(sent))
    for word, tag in tagged:
        pre, _, post = sent.partition(word)
        if "\n" in pre:
            print("\n", end="")
        print(tag, end=" ")
        sent = post  # advance to the next word
    if "\n" in post:  # newline(s) after the last word of the sentence
        print("\n", end="")
I am not quite sure why, but nltk.sent_tokenize() discards newlines that occur at sentence boundaries, so we need to look out for those as well. Fortunately, we can use exactly the same algorithm: look at the full text one sentence at a time, and check for newlines in between.
sents = nltk.sent_tokenize(text)
for s in sents:
    pre, _, post = text.partition(s)
    if "\n" in pre:
        print("\n", end="")
    process_sent(s)
    text = post  # Advance to the next sentence -- munges `text`, so use another var if it matters.
if "\n" in post:  # newline(s) after the final sentence
    print("\n", end="")
PS. That should do it, except that you will only ever output a single newline where several adjacent ones occur. If you care about that, replace the if "\n" in pre: print("\n", end="") checks with a call to this:
def nlretain(txt):
    """Output as many newlines as there are in `txt`."""
    print("\n" * txt.count("\n"), end="")