In a file I have text like this, broken across lines at arbitrary points:
Spencer J. Volk, president and CEO of this company, was elected a director.
Mr. Volk, 55 years old, succeeds Duncan Dwight,
who retired in September.
I am using nltk's sentence tokenizer to find the sentences, and then POS-tagging the words within those sentences. For example, after tagging I get output like this (a list of (word, tag) tuples, one per word in the sentence):
[('Spencer', u'NNP'), ('J.', u'NNP'), ('Volk', u'NNP'), ('president', u'NN'), ('and', u'CC'), ('CEO', u'NN'), ('of', u'IN'), ('this', u'DT'), ('company', u'NN'), ('was', u'VBD'), ('elected', u'VBN'), ('a', u'DT'), ('director', u'NN')]
[('Mr.', u'NNP'), ('Volk', u'NNP'), ('55', u'CD'), ('years', u'NNS'), ('old', u'JJ'), ('succeeds', u'VBZ'), ('Duncan', u'NNP'), ('Dwight', u'NNP'), ('who', u'WP'), ('retired', u'VBD'), ('in', u'IN'), ('September', u'NNP')]
But now I want to write the tags to another file with the same line breaks as in the original file I read the text from. For the example above it would look something like:
NNP NNP NNP NN CC NN IN DT NN VBD VBN DT NN
NNP NNP CD NNS JJ VBZ NNP NNP
WP VBD IN NNP
I can get the tags and everything in this form, but how do I correlate the original line breaks with breaks in the list of tags?
One way to do this would be to split each sentence, find the indices of the \n characters, hope that each split corresponds to one word of the sentence (which may not always be true), and then break the tag list at those indices. But this is more of a hack and fails in many cases. What would be a more robust way to achieve this?
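A minimal sketch of that hack (split_tags_by_newline is a hypothetical helper; the one-token-per-split assumption is exactly where it breaks):

def split_tags_by_newline(sentence, tags):
    # Break the tag list wherever the original sentence had a newline,
    # assuming each whitespace-separated chunk became exactly one token
    # (not always true: word_tokenize splits e.g. "Dwight," into two tokens).
    chunks, i = [], 0
    for line in sentence.split('\n'):
        n = len(line.split())
        chunks.append(tags[i:i + n])
        i += n
    return chunks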
Answer 0 (score: 0)
Ignoring the newlines and using sent_tokenize:
>>> from nltk import word_tokenize, pos_tag, sent_tokenize
>>> text = """Spencer J. Volk, president and CEO of this company, was elected a director.
... Mr. Volk, 55 years old, succeeds Duncan Dwight,
... who retired in September. """
>>>
>>> text = " ".join(i for i in text.split('\n'))
>>> tagged_text = [pos_tag(word_tokenize(sent)) for sent in sent_tokenize(text)]
>>> for sent in tagged_text:
... poses = " ".join(pos for word, pos in sent)
... print poses
...
NN NNP NNP , NN CC NNP IN DT NN , VBD VBN DT NN .
NNP NNP , CD NNS JJ , NNS NNP NNP , WP VBN IN NNP .
Taking the newlines into account:
>>> from nltk import word_tokenize, pos_tag
>>> text = """Spencer J. Volk, president and CEO of this company, was elected a director.
... Mr. Volk, 55 years old, succeeds Duncan Dwight,
... who retired in September. """
>>>
>>> tagged_text = [pos_tag(word_tokenize(sent)) for sent in text.split('\n')]
>>> for sent in tagged_text:
... poses = " ".join(pos for word, pos in sent)
... print poses
...
NN NNP NNP , NN CC NNP IN DT NN , VBD VBN DT NN .
NNP NNP , CD NNS JJ , NNS NNP NNP ,
WP VBN IN NNP .
You will realize that the tagger behaves just the same whether or not it is given properly segmented sentences. That is because the contextual information the POS tagger uses is weaker than the words' default tags, so whether you sent_tokenize first or split again at non-sentence boundaries makes little difference.
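A quick way to check that for yourself (a sketch; the exact tags depend on your NLTK version, so no output is shown):

from nltk import word_tokenize, pos_tag

full = pos_tag(word_tokenize("Mr. Volk, 55 years old, succeeds Duncan Dwight, who retired in September."))
frag = pos_tag(word_tokenize("who retired in September."))
# Compare the tags the trailing clause gets in context vs. on its own:
print([tag for word, tag in full][-5:])
print([tag for word, tag in frag])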
If you want to sent_tokenize and then split the tags according to the \n of the original file:
>>> from itertools import chain
>>> from nltk import sent_tokenize, word_tokenize, pos_tag
>>> text = """Spencer J. Volk, president and CEO of this company, was elected a director.
... Mr. Volk, 55 years old, succeeds Duncan Dwight,
... who retired in September. """
>>> sent_lens = [len(word_tokenize(line)) for line in text.split('\n')]
>>> sent_lens
[16, 11, 5]
>>> tagged_text = [[pos for word,pos in pos_tag(word_tokenize(line))] for line in sent_tokenize(text)]
>>> all_tags = list(chain(*tagged_text))
>>> offset = 0
>>> for l in sent_lens:
...     print(" ".join(all_tags[offset:offset+l]))
...     offset += l
...
NN NNP NNP , NN CC NNP IN DT NN , VBD VBN DT NN .
NNP NNP , CD NNS JJ , NNS NNP NNP ,
WP VBN IN NNP .
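The same split can also be written with an iterator over the flattened tag list, which avoids the offset bookkeeping (a sketch under the same assumption that tokenizing line by line yields the same tokens as tokenizing sentence by sentence):

>>> tag_iter = iter(all_tags)
>>> for l in sent_lens:
...     print(" ".join(next(tag_iter) for _ in range(l)))
...
NN NNP NNP , NN CC NNP IN DT NN , VBD VBN DT NN .
NNP NNP , CD NNS JJ , NNS NNP NNP ,
WP VBN IN NNP .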
Answer 1 (score: 0)
An interesting puzzle. First, note that nltk.sent_tokenize() will retain the newlines that occur inside a sentence:
import nltk

sents = nltk.sent_tokenize(text)
for s in sents:
    print(repr(s))
So, to interleave the POS tags with newlines, you can step through each sentence one word at a time and check for a newline in between:
def process_sent(sent):
    tagged = nltk.pos_tag(nltk.word_tokenize(sent))
    for word, tag in tagged:
        pre, _, post = sent.partition(word)
        if "\n" in pre:
            print("\n", end="")
        print(tag, end=" ")
        sent = post  # advance to the next word
    if "\n" in post:  # newline(s) after the last word of the sentence
        print("\n", end="")
I am not quite sure why, but nltk.sent_tokenize() discards newlines that occur at sentence boundaries, so we need to look out for those as well. Fortunately, we can use exactly the same algorithm: look at the full text one sentence at a time, and check for newlines in between.
sents = nltk.sent_tokenize(text)
for s in sents:
    pre, _, post = text.partition(s)
    if "\n" in pre:
        print("\n", end="")
    process_sent(s)
    text = post  # Advance to the next sentence -- munges `text`, so use another var if it matters.
if "\n" in post:  # newline(s) after the final sentence
    print("\n", end="")
PS. That should do it, except that you will only ever output a single newline where several adjacent ones occur. If you care about that, replace the if "\n" in pre: print("\n", end="") checks with a call to this:
def nlretain(txt):
    """Output as many newlines as there are in `txt`."""
    print("\n" * txt.count("\n"), end="")