Mapping line breaks in sentences onto another list

Time: 2014-11-25 02:40:41

Tags: python nltk

I have text like this in a file, with line breaks at arbitrary places:

Spencer J. Volk, president and CEO of this company, was elected a director. 
Mr. Volk, 55 years old, succeeds Duncan Dwight, 
who retired in September. 

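I am using nltk's sentence tokenizer to find the sentences, and then a part-of-speech tagger to tag the words in those sentences. For example, after tagging I get output like this (one list per sentence, holding a (word, tag) tuple for each word):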

[('Spencer', u'NNP'), ('J.', u'NNP'), ('Volk', u'NNP'), ('president', u'NN'), ('and', u'CC'), ('CEO', u'NN'), ('of', u'IN'), ('this', u'DT'), ('company', u'NN'), ('was', u'VBD'), ('elected', u'VBN'), ('a', u'DT'), ('director', u'NN')]

[('Mr.', u'NNP'), ('Volk', u'NNP'), ('55', u'CD'), ('years', u'NNS'), ('old', u'JJ'), ('succeeds', u'VBZ'), ('Duncan', u'NNP'), ('Dwight', u'NNP'), ('who', u'WP'), ('retired', u'VBD'), ('in', u'IN'), ('September', u'NNP')]

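But now I want to write the tags to another file with the same line breaks as the text I read from the original file. For the example above, it would look like this: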

NNP NNP NNP NN CC NN IN DT NN VBD VBN DT NN
NNP NNP CD NNS JJ VBZ NNP NNP
WP VBD IN NNP

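I can get the tags and everything else into this form, but how do I relate the original line breaks to breaks in the tag lists?

One way to do this would be to split each sentence, find the index of \n, hope that each split corresponds to one word in the sentence (which may not always be true), and then break the tag list at that index. That is more of a hack and fails in many cases. What is a more robust way to achieve this?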

2 answers:

Answer 0: (score: 0)

Ignoring the line breaks and using sent_tokenize:

>>> from nltk import word_tokenize, pos_tag, sent_tokenize
>>> text = """Spencer J. Volk, president and CEO of this company, was elected a director. 
... Mr. Volk, 55 years old, succeeds Duncan Dwight, 
... who retired in September. """
>>> text = " ".join(i for i in text.split('\n'))
>>> tagged_text = [pos_tag(word_tokenize(sent)) for sent in sent_tokenize(text)]
>>> for sent in tagged_text:
...     poses = " ".join(pos for word, pos in sent)
...     print(poses)
... 
NN NNP NNP , NN CC NNP IN DT NN , VBD VBN DT NN .
NNP NNP , CD NNS JJ , NNS NNP NNP , WP VBN IN NNP .

Paying attention to the line breaks:

>>> from nltk import word_tokenize, pos_tag
>>> text = """Spencer J. Volk, president and CEO of this company, was elected a director. 
... Mr. Volk, 55 years old, succeeds Duncan Dwight, 
... who retired in September. """
>>> 
>>> tagged_text = [pos_tag(word_tokenize(sent)) for sent in text.split('\n')]
>>> for sent in tagged_text:
...     poses = " ".join(pos for word, pos in sent)
...     print(poses)
... 
NN NNP NNP , NN CC NNP IN DT NN , VBD VBN DT NN .
NNP NNP , CD NNS JJ , NNS NNP NNP ,
WP VBN IN NNP .

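As you can see, tokenizing into proper sentences first makes no difference to the tagger here. The contextual information the POS tagger uses carries less weight than each word's default tag, so there is little to gain from running sent_tokenize and then splitting the result back into non-sentence lines.


If you do want to use sent_tokenize and then split the tags at the \n positions of the original file: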

>>> from itertools import chain
>>> from nltk import sent_tokenize, word_tokenize, pos_tag
>>> text = """Spencer J. Volk, president and CEO of this company, was elected a director. 
... Mr. Volk, 55 years old, succeeds Duncan Dwight, 
... who retired in September. """

>>> sent_lens = [len(word_tokenize(line)) for line in text.split('\n')]
>>> sent_lens
[16, 11, 5]
>>> tagged_text = [[pos for word,pos in pos_tag(word_tokenize(line))] for line in sent_tokenize(text)]
>>> tags = list(chain(*tagged_text))  # flatten the per-sentence tag lists
>>> start = 0  # index of the first tag on the current line
>>> for n in sent_lens:
...     print(" ".join(tags[start:start + n]))
...     start += n
... 
NN NNP NNP , NN CC NNP IN DT NN , VBD VBN DT NN .
NNP NNP , CD NNS JJ , NNS NNP NNP ,
WP VBN IN NNP .
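
Cutting a flat tag list into lines by per-line token counts does not need NLTK, so it can be isolated in a small helper. The following is a minimal sketch: `split_by_lengths` is a hypothetical name (not an NLTK function), and the tag list and the counts [16, 11, 5] are copied from the sessions above rather than recomputed.

```python
from itertools import islice


def split_by_lengths(items, lengths):
    """Cut a flat sequence into consecutive chunks of the given lengths."""
    it = iter(items)
    # Each islice call consumes the next n items from the shared iterator.
    return [list(islice(it, n)) for n in lengths]


# Tags of both sentences as one flat list (copied from the output above).
tags = ("NN NNP NNP , NN CC NNP IN DT NN , VBD VBN DT NN . "
        "NNP NNP , CD NNS JJ , NNS NNP NNP , WP VBN IN NNP .").split()

# Token counts per original line, as computed by sent_lens above.
lines = [" ".join(chunk) for chunk in split_by_lengths(tags, [16, 11, 5])]
for line in lines:
    print(line)
```

This only lines up if tokenizing line by line yields the same number of tokens as tokenizing sentence by sentence. That holds for this example, but it is an assumption and may fail when a line break falls where the tokenizer treats a fragment differently from the full sentence.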

Answer 1: (score: 0)

Interesting puzzle. First, note that nltk.sent_tokenize() preserves the line breaks that occur inside a sentence:

import nltk

sents = nltk.sent_tokenize(text)
for s in sents:
    print(repr(s))

So, to interleave the POS tags with line breaks, you can step through the sentence one word at a time and check for a newline between words:

def process_sent(sent):
    tagged = nltk.pos_tag(nltk.word_tokenize(sent))

    for word, tag in tagged:
        pre, _, post = sent.partition(word)
        if "\n" in pre:
            print("\n", end="")
        print(tag, end=" ")
        sent = post # advance to the next word
    if "\n" in sent:  # `sent` now holds whatever follows the last word
        print("\n", end="")

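I'm not quite sure why, but nltk.sent_tokenize() discards the newlines that occur between sentences. So we need to look for those as well. Fortunately, we can use exactly the same algorithm: step through the full text one sentence at a time, and check for newlines between the sentences.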

sents = nltk.sent_tokenize(text)
for s in sents:
    pre, _, post = text.partition(s)
    if "\n" in pre:
        print("\n", end="")
    process_sent(s)
    text = post  # Advance to the next sentence -- munges `text` so use another var if it matters.

if "\n" in post:
    print("\n", end="")

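PS. That should do it, except that you will only output a single newline where there are several adjacent ones. If you care about that, replace if "\n" in pre: print("\n", end="") with a call to this:

def nlretain(txt):
    """Output as many newlines as there are in `txt`"""
    print("\n"*txt.count("\n"), end="")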
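
The same idea can be written so that it returns the result as a string instead of printing it, which makes it easier to write to a file. The following is a minimal sketch, not part of the answer's code: `tags_with_linebreaks` is a hypothetical helper, it searches the full text for each token from a running offset, and the `tagged` list is copied by hand from the question so the example runs without NLTK. Runs of adjacent newlines are collapsed into one.

```python
def tags_with_linebreaks(text, tagged):
    """Join tags with spaces, starting a new output line wherever the
    original text has a newline between two consecutive tokens."""
    lines, current, pos = [], [], 0
    for word, tag in tagged:
        found = text.find(word, pos)
        if found == -1:
            # Assumption: the tokenizer rewrote this token (for example
            # quotes), so keep the tag and leave the offset unchanged.
            current.append(tag)
            continue
        if "\n" in text[pos:found] and current:
            # A newline separates this token from the previous one.
            lines.append(" ".join(current))
            current = []
        current.append(tag)
        pos = found + len(word)
    if current:
        lines.append(" ".join(current))
    return "\n".join(lines)


text = ("Spencer J. Volk, president and CEO of this company, was elected a director. \n"
        "Mr. Volk, 55 years old, succeeds Duncan Dwight, \n"
        "who retired in September. ")

# (word, tag) pairs as shown in the question, both sentences concatenated.
tagged = [('Spencer', 'NNP'), ('J.', 'NNP'), ('Volk', 'NNP'), ('president', 'NN'),
          ('and', 'CC'), ('CEO', 'NN'), ('of', 'IN'), ('this', 'DT'),
          ('company', 'NN'), ('was', 'VBD'), ('elected', 'VBN'), ('a', 'DT'),
          ('director', 'NN'), ('Mr.', 'NNP'), ('Volk', 'NNP'), ('55', 'CD'),
          ('years', 'NNS'), ('old', 'JJ'), ('succeeds', 'VBZ'), ('Duncan', 'NNP'),
          ('Dwight', 'NNP'), ('who', 'WP'), ('retired', 'VBD'), ('in', 'IN'),
          ('September', 'NNP')]

print(tags_with_linebreaks(text, tagged))
```

Because the lookup is a plain substring search, a very short token can in principle match inside a longer word when the tokens and the text drift apart, so treat this as a starting point rather than a general solution.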