Question

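Below are two examples of the many lines I need to analyze and extract specific words from.

file2.txt

[40.748330000000003, -73.878609999999995] 6 2011-08-28 19:52:47 Sometimes I wish my life was a movie; #unreal I hate the fact I feel lonely surrounded by so many ppl

[37.786221300000001, -122.1965002] 6 2011-08-28 19:55:26 I wish I could lay up with the love of my life And watch cartoons all day.

Ignore the coordinates and numbers.

The task is to find how many words in each tweet line appear in this keyword list:

['hate', 1]
['hurt', 1]
['hurting', 1]
['like', 5]
['lonely', 1]
['love', 10]

In addition, I need the sum of the values of the keywords found in each tweet (for example, ['love', 10]).

For example, for the sentence

'I hate to feel lonely at times'

the sentiment values for hate = 1 and lonely = 1 sum to 2, and the number of words in the line is 7.

I have tried a list-within-a-list approach, and even tried going through each sentence and keyword one by one, but since there are many tweets and keywords, I need a loop-based way to find the values.

Thanks in advance for your insights!! :)

My code: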

Answer 1

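You can use a simple regular expression to strip the numbers, a tokenizer to split the remaining text into words, and collections.Counter to count how often each word occurs in the sample string.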

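# word_tokenize requires NLTK's Punkt tokenizer data to be downloaded
from nltk.tokenize import word_tokenize
import collections
import re

# Sample line; named `text` to avoid shadowing the built-in `str`
text = '[40.748330000000003, -73.878609999999995] 6 2011-08-28 19:52:47 Sometimes I wish my life was a movie; #unreal I hate the fact I feel lonely surrounded by so many ppl'
# Strip integers and decimals (coordinates, the count, date and time digits)
num_regex = re.compile(r"[+-]?\d+(?:\.\d+)?")
text = num_regex.sub('', text)
# Split into tokens; leftover punctuation such as '[' or ':' becomes tokens too
words = word_tokenize(text)
# Count how often each token occurs
final_list = collections.Counter(words)
print(final_list)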
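
The snippet above only counts tokens. A minimal sketch of the per-tweet keyword match and value sum that the question asks for could look like the following. It uses only the standard library instead of NLTK. The names KEYWORDS, PREFIX, and score_line are hypothetical, and the prefix pattern assumes every line starts with "[lat, lon] number date time" exactly as in the two sample lines.

```python
import re

# Keyword values copied from the question's list
KEYWORDS = {'hate': 1, 'hurt': 1, 'hurting': 1, 'like': 5, 'lonely': 1, 'love': 10}

# Assumed line prefix: "[lat, lon] number YYYY-MM-DD HH:MM:SS "
PREFIX = re.compile(r"^\[[^\]]*\]\s+\d+\s+\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2}\s*")

def score_line(line, keywords=KEYWORDS):
    """Return (word count, matched keywords, sum of their values) for one line."""
    text = PREFIX.sub('', line)                        # drop coordinates and numbers
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    matched = [w for w in words if w in keywords]      # keep keyword hits in order
    total = sum(keywords[w] for w in matched)
    return len(words), matched, total

lines = [
    "[40.748330000000003, -73.878609999999995] 6 2011-08-28 19:52:47 Sometimes I wish my life was a movie; #unreal I hate the fact I feel lonely surrounded by so many ppl",
    "[37.786221300000001, -122.1965002] 6 2011-08-28 19:55:26 I wish I could lay up with the love of my life And watch cartoons all day.",
]

# Loop over every tweet line instead of handling each one by hand
for line in lines:
    print(score_line(line))

print(score_line('I hate to feel lonely at times'))  # (7, ['hate', 'lonely'], 2)
```

Keeping the keywords in a dict makes each lookup a single membership test, so the same loop works for any number of tweets and keywords. To process the real file, the lines list could be replaced by iterating over open('file2.txt').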

Matching keywords in a list to a line of words in Python
