Below are two examples from the many lines I need to analyze in order to extract specific words.
[40.748330000000003, -73.878609999999995] 6 2011-08-28 19:52:47 Sometimes I wish my life was a movie; #unreal I hate the fact I feel lonely surrounded by so many ppl
[37.786221300000001, -122.1965002] 6 2011-08-28 19:55:26 I wish I could lay up with the love of my life And watch cartoons all day.
Ignore the coordinates and the numbers.
The task is to find how many words from the following keyword list appear in each tweet line:
['hate', 1]
['hurt', 1]
['hurting', 1]
['like', 5]
['lonely', 1]
['love', 10]
In addition, find the sum of the values of the keywords found in each tweet (e.g. ['love', 10]).
For example, for the sentence
'I hate to feel lonely at times'
the sum of the sentiment values, hate = 1 and lonely = 1, equals 2, and the number of words in the line is 7.
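In code terms, the per-line result I am after would look something like this sketch (keyword values hard-coded here just to show the idea):

keywords = {'hate': 1, 'hurt': 1, 'hurting': 1, 'like': 5, 'lonely': 1, 'love': 10}
sentence = 'I hate to feel lonely at times'
words = sentence.split()
sentiment = sum(keywords[w] for w in words if w in keywords)
print(len(words), sentiment)   # 7 2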
I tried a list-of-lists approach, and even tried walking through each sentence and each keyword, but neither worked. There are many tweets and keywords, so I need a loop-based approach to look up the values.
What I want to know is the sum of the sentiment values of the keywords found in each line, and how many words each line contains.
Thanks in advance for your insights!! :)
My code:
try:
    KeywordFileName = input('Input keyword file name: ')
    KeywordFile = open(KeywordFileName, 'r')
except FileNotFoundError:
    print('The file you entered does not exist or is not in the directory')
    exit()
KeyLine = KeywordFile.readline()
while KeyLine != '':
    if list != []:  # note: 'list' here shadows the built-in list type
        KeyLine = KeywordFile.readline()
        KeyLine = KeyLine.rstrip()
        list = KeyLine.split(',')
        list[1] = int(list[1])
        print(list)
    else:
        break
try:
    TweetFileName = input('Input Tweet file name: ')
    TweetFile = open(TweetFileName, 'r')
except FileNotFoundError:
    print('The file you entered does not exist or is not in the directory')
    exit()
TweetLine = TweetFile.readline()
while TweetLine != '':
    TweetLine = TweetFile.readline()
    TweetLine = TweetLine.rstrip()
    # this is where I am stuck: scoring each tweet line against the keywords
Answer 0 (score: 1)
If your tweets are in a .txt like this file, and the tweet lines follow the same pattern you described in the question, then you can try this approach:
import re
import json

# capture the tweet text that follows the HH:MM:SS timestamp
pattern = r'\d{2}:\d{2}:\d{2}\s([a-zA-Z].+)'
sentiment_dict = {'hate': 1, 'hurt': 1, 'hurting': 1, 'like': 5, 'lonely': 1, 'love': 10}

final = []
with open('senti.txt', 'r+') as f:
    for line in f:
        match = re.finditer(pattern, line)
        for find in match:
            if find.group(1).split():
                final.append(find.group(1).split())  # one list of words per tweet

rows = []
for item in final:
    final_dict = {}
    for sub_item in item:
        if sub_item in sentiment_dict:
            if sub_item not in final_dict:
                final_dict[sub_item] = [sentiment_dict.get(sub_item)]
            else:
                final_dict[sub_item].append(sentiment_dict.get(sub_item))
    # (words, word count, summed sentiment value per keyword)
    rows.append((item, len(item), {key: sum(value) for key, value in final_dict.items()}))

result = json.dumps(rows, indent=2)
print(result)
Output:
[
  [
    [
      "Sometimes",   # all the words in the tweet line
      "I",
      "wish",
      "my",
      "life",
      "was",
      "a",
      "movie;",
      "#unreal",
      "I",
      "hate",
      "the",
      "fact",
      "I",
      "feel",
      "lonely",
      "surrounded",
      "by",
      "so",
      "many",
      "ppl"
    ],
    21,   # word count for this tweet
    {
      "lonely": 1,   # summed sentiment value per keyword
      "hate": 1
    }
  ],
  [
    [
      "I",
      "wish",
      "I",
      "could",
      "lay",
      "up",
      "with",
      "the",
      "love",
      "of",
      "my",
      "life",
      "And",
      "watch",
      "cartoons",
      "all",
      "day."
    ],
    17,
    {
      "love": 10
    }
  ],
  [
    [
      "I",
      "hate",
      "to",
      "feel",
      "lonely",
      "at",
      "times"
    ],
    7,
    {
      "lonely": 1,
      "hate": 1
    }
  ]
]
Alternative regex patterns, in case one of them does not work for your file:
r'[a-zA-Z].+'                         # if you use this, change find.group(1) to find.group()
r'(?<=\d.\s)[a-zA-Z].+'               # if you use this, change find.group(1) to find.group()
r'\d{2}:\d{2}:\d{2}\s([a-zA-Z].+)'    # group(1)
r'\b\d{2}:\d{2}:\d{2} (.+)'           # group(1)
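For instance, a quick way to sanity-check the last pattern against a single line (a minimal sketch, separate from the script above):

import re

test = '[40.748330000000003, -73.878609999999995] 6 2011-08-28 19:52:47 Sometimes I wish my life was a movie; #unreal I hate the fact I feel lonely surrounded by so many ppl'
m = re.search(r'\b\d{2}:\d{2}:\d{2} (.+)', test)
if m:
    print(m.group(1))  # Sometimes I wish my life was a movie; #unreal ... ppl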
Answer 1 (score: 0)
The easiest way is to use word_tokenize from the nltk library on a per-tweet basis.
from nltk.tokenize import word_tokenize
import collections
import re

# Sample text from above
s = '[40.748330000000003, -73.878609999999995] 6 2011-08-28 19:52:47 Sometimes I wish my life was a movie; #unreal I hate the fact I feel lonely surrounded by so many ppl'

num_regex = re.compile(r"[+-]?\d+(?:\.\d+)?")
# Remove the numbers from the text
s = num_regex.sub('', s)

# Tokenization
tokens = word_tokenize(s)

# Counting the words
fdist = collections.Counter(tokens)
print(fdist)
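From the Counter it is then straightforward to get what the question asks for; a possible last step (a sketch using the keyword values from the question; note that word_tokenize also emits punctuation tokens, so the word count can differ slightly from a plain split):

sentiment_dict = {'hate': 1, 'hurt': 1, 'hurting': 1, 'like': 5, 'lonely': 1, 'love': 10}
word_count = sum(fdist.values())  # total tokens in the tweet (punctuation tokens included)
sentiment_sum = sum(sentiment_dict[w] * c for w, c in fdist.items() if w in sentiment_dict)
print(word_count, sentiment_sum)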