Question

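I am banging my head against Python's TextBlob package, trying to:

identify sentences in paragraphs
identify words in sentences
determine POS (part-of-speech) tags for those words, and so on

Everything was going well until I ran into a possible issue, if I am not mistaken. It is illustrated below with a sample code snippet.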

from textblob import TextBlob
sample = '''This is greater than that by 5%.''' #Sample Sentence
blob = TextBlob(sample)                         #Passing it to TextBlob package.
Words = blob.words                              #Splitting the Sentence into words.
Tags = blob.tags                                #Determining POS tag for each word in the sentence

print(Tags)
[('This', 'DT'), ('is', 'VBZ'), ('greater', 'JJR'), ('than', 'IN'), ('that', 'DT'), ('by', 'IN'), ('5', 'CD'), ('%', 'NN')]

print(Words)
['This', 'is', 'greater', 'than', 'that', 'by', '5']

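As shown above, blob.tags treats the '%' symbol as a separate word and assigns it a POS tag.

blob.words, on the other hand, does not return the '%' symbol at all, either on its own or attached to the preceding word.

I am building a data frame from the output of both functions, so it cannot be created because of the length mismatch.

Here are my questions. Could this be an issue in the TextBlob package? And is there a way to get '%' recognized in the list of words?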

Answer 1

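Stripping punctuation at tokenization appears to be a conscious decision by the TextBlob developers: https://github.com/sloria/TextBlob/blob/dev/textblob/blob.py#L624

They rely on NLTK's tokenizers, which take an include_punct parameter, but I don't see a way to pass include_punct=True through TextBlob to the NLTK tokenizer.

When faced with a similar problem, I replaced the punctuation of interest with a non-dictionary text constant meant to represent it, i.e. I replaced '%' with ' PUNCTPERCENT ' before tokenizing. That way, the information that there was a percent sign is not lost.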
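
The placeholder trick can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the helper names `protect` and `restore`, the `PLACEHOLDERS` table, and the regex function standing in for a punctuation-stripping tokenizer are all hypothetical and are not part of TextBlob or NLTK.

```python
import re

# Hypothetical placeholder table: symbol -> non-dictionary constant.
PLACEHOLDERS = {"%": "PUNCTPERCENT"}


def protect(text):
    """Replace each symbol with its placeholder, padded with spaces."""
    for symbol, name in PLACEHOLDERS.items():
        text = text.replace(symbol, f" {name} ")
    return text


def restore(tokens):
    """Map placeholder tokens back to the original symbols."""
    reverse = {name: symbol for symbol, name in PLACEHOLDERS.items()}
    return [reverse.get(token, token) for token in tokens]


def strip_punct_tokenize(text):
    """Stand-in for a tokenizer that drops punctuation (an assumption)."""
    return re.findall(r"\w+", text)


sample = "This is greater than that by 5%."
tokens = restore(strip_punct_tokenize(protect(sample)))
print(tokens)
```

In practice the stand-in tokenizer would be swapped for the real one, run on the protected text, with `restore` applied to whatever word list it returns.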

EDIT: I stand corrected. When initializing TextBlob you can set a tokenizer via the tokenizer argument of its __init__ method: https://github.com/sloria/TextBlob/blob/dev/textblob/blob.py#L328

So you can easily pass TextBlob a tokenizer that respects punctuation.

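EDIT2: While looking through TextBlob's source I came across this: https://github.com/sloria/TextBlob/blob/dev/textblob/blob.py#L372 Note the docstring of the words method: it says you should access the tokens property instead of the words property if you want to include punctuation.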

Answer 2

In the end I found that NLTK recognizes the symbol correctly. The code snippet is given below for reference:

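from nltk import word_tokenize
from nltk import pos_tag

# `sample` was not defined in the snippet; this sentence is inferred from the output below.
sample = '''This is better than that by 5%'''
Words = word_tokenize(sample)   #Tokenizing; symbols such as '%' are kept
Tags = pos_tag(Words)           #Determining POS tag for each token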

print(Words)
['This', 'is', 'better', 'than', 'that', 'by', '5', '%']

print(Tags)
[('This', 'DT'), ('is', 'VBZ'), ('better', 'JJR'), ('than', 'IN'), ('that', 'DT'), ('by', 'IN'), ('5', 'CD'), ('%', 'NN')]
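
Since the goal in the question is a data frame with one column of words and one of tags, one way to guarantee equal lengths is to derive both columns from the tagged pairs instead of from two separate calls. A minimal sketch, where `split_tagged` is a hypothetical helper and the pairs are copied from the output above:

```python
def split_tagged(tagged):
    """Split (word, tag) pairs into two lists of equal length."""
    if not tagged:
        return [], []
    words, tags = zip(*tagged)
    return list(words), list(tags)


# Tagged pairs as printed above.
tagged = [('This', 'DT'), ('is', 'VBZ'), ('better', 'JJR'), ('than', 'IN'),
          ('that', 'DT'), ('by', 'IN'), ('5', 'CD'), ('%', 'NN')]

words, tags = split_tagged(tagged)

# Equal-length columns, suitable for e.g. pandas.DataFrame(columns).
columns = {"word": words, "tag": tags}
print(len(words) == len(tags))
```

The same helper should work on any list of (word, tag) pairs, so the word column can never drift out of step with the tag column.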

Python TextBlob Package - Determines POS tag for '%' symbol but does not print it as a word
