Question

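I am working on aspect-based sentiment analysis. In this project we collect data from Twitter. After collecting the data, we apply text-cleaning steps and build a corpus. We then run TextBlob's noun_phrases on this corpus in Python to find aspects, which gives a list of noun phrases. From this list I want to select only the aspects that contain exactly two words. How can I do that?

Here is my code and the output it produces: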

from textblob import TextBlob, Word
comments = TextBlob(' '.join(corpus))
comments.noun_phrases
cleaned = list()
for phrase in comments.noun_phrases:
    count = 0
    for w in phrase.split():
        # Count the number of small words and words without an English definition
        if len(w) <= 2 or (not Word(w).definitions):
            count += 1
    # Only if the 'nonsensical' or short words DO NOT make up more than 40% (arbitrary) of the phrase add
    # it to the cleaned list, effectively pruning the ones not added.
    if count < len(phrase.split())*0.4:
        cleaned.append(phrase)       
print("After compactness pruning:\nFeature Size:")
print(cleaned)

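Output: ['deserve free food k retweet request', 'specific waiter job', 'red blend', 'old idea sudden', 'global focus', 'local release', 'african food', 'food truck', 'space avail netbal woman footbal amp squash', 'week world cup', 'minor sign confess', 'french fri coupl day', 'great stuff ban plastic straw serv localcac x x x b b b food food home school school', 'stale croissant', 'stuff time', 'great time save', 'clean dish', 'fake news unit alreadi', 'sure food amp', 'long food', 'dog china usa', 'trade china till', 'warm color', 'yellow orangutan', 'fast food restaurant', 'yellow orangutan', 'emerging food package', 'junk food label parti', 'share water check system', 'soil food', 'care chihuahua yappi need food sleep', 'new cloth', 'dose idiot', 'fear poor rise', 'friend feed', 'wrong shit', 'good people', 'good bad guy', 'food pension livelihood', 'food fur babi fun stay']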

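From this list we want to select only the noun phrases that contain exactly two words, such as 'red blend', 'food truck', 'stale croissant', and so on. How can I do this?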

Answer 1

Look for the items in the list that contain exactly one space.

Edit: updated to use a list comprehension, for brevity and speed:

word_list = [phrase for phrase in comments.noun_phrases if phrase.count(' ') == 1]
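To try the one-space filter without TextBlob installed, here is a minimal sketch that runs on a small hard-coded list. The function name `two_word_phrases` and the sample phrases are made up for illustration; in the real project the input would be the noun-phrase list.

```python
def two_word_phrases(phrases):
    """Keep only the phrases that contain exactly one space."""
    return [p for p in phrases if p.count(' ') == 1]


# Hypothetical sample data standing in for the noun-phrase list.
sample = ['food truck', 'week world cup', 'red blend', 'food']
result = two_word_phrases(sample)
print(result)
```

This relies on words being separated by single spaces, so a phrase with one space has two words.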

Timing comparison:

import time

startTime = time.time()

for i in range(1000000):
    word_list = []
    for phrase in comments.noun_phrases:
        if phrase.count(' ') == 1:
            word_list.append(phrase)

print(time.time() - startTime)
9.743234395980835

startTime = time.time()
for i in range(1000000):
    word_list = [phrase for phrase in comments.noun_phrases if len(phrase.split(" ")) == 2]

print(time.time() - startTime)
14.307061433792114

startTime = time.time()
for i in range(1000000):
    word_list = [phrase for phrase in comments.noun_phrases if phrase.count(' ') == 1]

print(time.time() - startTime)
7.5759406089782715
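The same comparison can also be sketched with the standard library's `timeit` module, which handles the repetition loop itself. This is only an illustration under assumptions: the sample list and the helper names `by_count` and `by_split` are hypothetical, and absolute timings will differ from the numbers above depending on the machine and the data.

```python
import timeit

# Hypothetical sample data standing in for the noun-phrase list.
sample = ['food truck', 'week world cup', 'red blend', 'french fri coupl day'] * 10


def by_count():
    # Two-word phrases contain exactly one space.
    return [p for p in sample if p.count(' ') == 1]


def by_split():
    # Two-word phrases split into exactly two pieces.
    return [p for p in sample if len(p.split(' ')) == 2]


count_time = timeit.timeit(by_count, number=10000)
split_time = timeit.timeit(by_split, number=10000)
print(f"count: {count_time:.4f}s  split: {split_time:.4f}s")
```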

Answer 2

Assuming you have a list, given by comments.noun_phrases, and you are trying to find the phrases that have exactly 2 words:

word_list = [phrase for phrase in comments.noun_phrases if len(phrase.split(" ")) == 2]

If you want speed, you may prefer the count approach in the if condition.

word_list = [phrase for phrase in comments.noun_phrases if phrase.count(" ") == 1]

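Given a list of noun phrases, this returns a list containing only the two-word phrases. It does not do any cleaning, since, as you state in your question, you already have a list of cleaned phrases.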
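One caveat worth sketching: both versions above assume words are separated by exactly one space. If a phrase could contain repeated, leading, or trailing spaces (an assumption; the output in the question does not show any), calling `split()` with no argument is more forgiving, because it splits on any run of whitespace. The helper name `has_two_words` and the sample list are hypothetical.

```python
def has_two_words(phrase):
    # split() with no argument splits on runs of whitespace and
    # ignores leading and trailing whitespace.
    return len(phrase.split()) == 2


# Hypothetical sample data with irregular spacing.
sample = ['food truck', 'red  blend', ' stale croissant ', 'week world cup', 'food']

robust = [p for p in sample if has_two_words(p)]
strict = [p for p in sample if p.count(' ') == 1]

print(robust)
print(strict)
```

The whitespace-tolerant check trades a little speed for robustness, so the count approach remains a reasonable choice when the phrases are already normalized.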

Aspect-based sentiment analysis using Python