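I have two files: one contains more than 200 tweets, and the other contains keywords and their values. A typical tweet looks like this (my code is further down):
[41.923916200000001, -88.777469199999999] 6 2011-08-28 19:24:18 My life is a moviee. (only the numbers in brackets and the words after the time are relevant)
and the keywords look like this: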
love,10
like,5
best,10
hate,1
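I use the two numbers at the start of each tweet to determine the tweet's region (as shown in my code below). For each tweet (each line in the file), depending on the number of keywords in the tweet, I add them up and divide by the total of the values associated with them (per tweet), which gives me the score.
My question is: how can I divide the sum of the scores of all tweets in a region by the number of tweets in that region? Below, where I put happynessTweetScore is how I calculate the score of each individual tweet (each line) in the file that actually contains keywords.
For this part, I'm not sure how to add up all the values by region and divide by the number of tweets in that region. Should I append them to a list per region and then add them up? I don't know. I started like this: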
def score(tweet):
    total_value = 0   # sum of the values of the keywords found
    total_count = 0   # number of keywords found
    for word in tweet:
        if word in sentiments:
            total_value += sentiments[word]
            total_count += 1
    return total_value, total_count
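But I don't know how to use something like this so that the scores of all tweets in each region are added together and divided by the number of tweets in that region.
I split the tweets into four regions by (latitude, longitude) using these values; the rectangles themselves are defined all the way at the bottom of the code: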
p1 = (49.189787, -67.444574)
p2 = (24.660845, -67.444574)
p3 = (49.189787, -87.518395)
p4 = (24.660845, -87.518395)
p5 = (49.189787, -101.998892)
p6 = (24.660845, -101.998892)
p7 = (49.189787, -115.236428)
p8 = (24.660845, -115.236428)
p9 = (49.189787, -125.242264)
p10 = (24.660845, -125.242264)
from collections import Counter

try:
    keyW_Path = input("Enter file named keywords: ")
    keyFile = open(keyW_Path, "r")
except IOError:
    print("Error: file not found.")
    exit()

# Read the keywords into a list
keywords = {}
wordFile = open('keywords.txt', 'r')
for line in wordFile.readlines():
    word = line.replace('\n', '')
    if not(word in keywords.keys()): #Checks that the word doesn't already exist.
        keywords[word] = 0 # Adds the word to the DB.
wordFile.close()

# Read the file name from the user and open the file.
try:
    tweet_path = input("Enter file named tweets: ")
    tweetFile = open(tweet_path, "r")
except IOError:
    print("Error: file not found.")
    exit()
#Calculating Sentiment Values
with open('keywords.txt') as f:
    sentiments = {word: int(value) for word, value in (line.split(",") for line in f)}

with open('tweets.txt') as f:
    for line in f:
        values = Counter(word for word in line.split() if word in sentiments)
        if not values:
            continue

keyW = ["love", "like", "best", "hate", "lol", "better", "worst", "good", "happy", "haha", "please", "great", "bad", "save", "saved", "pretty", "greatest", 'excited', 'tired', 'thanks', 'amazing', 'glad', 'ruined', 'negative', 'loving', 'sorry', 'hurt', 'alone', 'sad', 'positive', 'regrets', 'God']
with open('tweets.txt') as oldfile, open('newfile.txt', 'w') as newfile:
    for line in oldfile:
        if any(word in line for word in keyW):
            newfile.write(line)
def score(tweet):
    total = 0
    for word in tweet:
        if word in sentiments:
            total += 1
    return total

def total(score):
    sum = 0
    for number in score:
        if number in values:
            sum += 1
#Classifying the regions
class Region:
    def __init__(self, lat_range, long_range):
        self.lat_range = lat_range
        self.long_range = long_range

    def contains(self, lat, long):
        return self.lat_range[0] <= lat and lat < self.lat_range[1] and\
            self.long_range[0] <= long and long < self.long_range[1]

eastern = Region((24.660845, 49.189787), (-87.518395, -67.444574))
central = Region((24.660845, 49.189787), (-101.998892, -87.518395))
mountain = Region((24.660845, 49.189787), (-115.236428, -101.998892))
pacific = Region((24.660845, 49.189787), (-125.242264, -115.236428))
eastScore = 0
centralScore = 0
pacificScore = 0
mountainScore = 0
happyScoreE = 0
for line in open('newfile.txt'):
    line = line.split(" ")
    lat = float(line[0][1:-1]) #Stripping the [ and the ,
    long = float(line[1][:-1]) #Stripping the ]
    if eastern.contains(lat, long):
        eastScore += score(line)
    elif central.contains(lat, long):
        centralScore += score(line)
    elif mountain.contains(lat, long):
        mountainScore += score(line)
    elif pacific.contains(lat, long):
        pacificScore += score(line)
    else:
        continue
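Answer 0 (score: 0)
You could try putting the scores in a dictionary, where the key is the region and the value is a list of that region's scores. That way the data is readily available and easy to work with.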
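A minimal sketch of the dictionary idea, assuming each tweet has already been reduced to a (region name, score) pair. The region names and the sample scores are made up for illustration; in the real script they would come from the loop that classifies each tweet and scores it.

```python
from collections import defaultdict

# Hypothetical (region, tweet_score) pairs used only for illustration.
scored_tweets = [
    ("eastern", 10.0),
    ("eastern", 5.0),
    ("central", 1.0),
    ("pacific", 7.5),
    ("pacific", 2.5),
]

# Key: region name, value: list of the scores of the tweets in that region.
scores_by_region = defaultdict(list)
for region_name, tweet_score in scored_tweets:
    scores_by_region[region_name].append(tweet_score)

# Average = sum of the tweet scores / number of tweets, per region.
# Only regions that received at least one score appear here, so len() is never 0.
averages = {
    region_name: sum(scores) / len(scores)
    for region_name, scores in scores_by_region.items()
}

for region_name, average in averages.items():
    print(region_name, average)
```

Keeping the full list per region costs a little memory, but it leaves the individual scores available if you later want a median or a count as well.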
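Edit: You could actually make it part of the object, which would make your code somewhat cleaner. I haven't had a chance to test it, but it should give you a basis to work from.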
class Region:
    def __init__(self, lat_range, long_range):
        self.lat_range = lat_range
        self.long_range = long_range
        # One list per instance; a class-level list would be shared by all regions.
        self.score = []

    def contains(self, lat, long):
        return self.lat_range[0] <= lat and lat < self.lat_range[1] and\
            self.long_range[0] <= long and long < self.long_range[1]

    def averageScore(self):
        # Guard against regions that received no tweets.
        return sum(self.score)/len(self.score) if self.score else 0

for line in open('newfile.txt'):
    line = line.split(" ")
    lat = float(line[0][1:-1]) #Stripping the [ and the ,
    long = float(line[1][:-1]) #Stripping the ]
    if eastern.contains(lat, long):
        eastern.score.append(score(line))
    elif central.contains(lat, long):
        central.score.append(score(line))
    elif mountain.contains(lat, long):
        mountain.score.append(score(line))
    elif pacific.contains(lat, long):
        pacific.score.append(score(line))
Answer 1 (score: 0)
Let's say, as you described, that our file contains data like this:
love,10
movie,5
First, build a dictionary from the file.
kw_to_score = {}
kw_file = 'keywords.txt'
with open(kw_file, 'r') as kwf:
    for line in kwf.readlines():
        # "value" rather than "score", so the score() function below is not shadowed
        word, value = line.strip().split(',')
        kw_to_score[word] = int(value)
With that done, we need to create the scoring function:
def score(tweet, keywords):
    total = 0  # sum of the values of the matched keywords
    count = 0  # number of matched keywords
    for word in tweet.split(): # split words by spaces
        if word in keywords:
            total += keywords[word]
            count += 1
    return total, count
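Splitting only on spaces means that a word followed by punctuation, such as `moviee.` in the sample tweet, or a capitalised `Love`, will not match a keyword. Here is a hedged sketch of one way to normalise words before the lookup. `normalize_words` and `score_normalized` are hypothetical helpers, and lowercasing plus stripping punctuation is an assumption about the matching you want, not something the question specifies.

```python
import string


def normalize_words(text):
    """Lowercase the text and strip leading/trailing punctuation from each word."""
    words = []
    for raw in text.split():
        word = raw.strip(string.punctuation).lower()
        if word:  # skip tokens that consisted only of punctuation
            words.append(word)
    return words


def score_normalized(tweet, keywords):
    """Return (sum of keyword values, number of keywords matched) for one tweet."""
    total = 0
    count = 0
    for word in normalize_words(tweet):
        if word in keywords:
            total += keywords[word]
            count += 1
    return total, count


# Made-up keyword values, for illustration only.
sample_keywords = {"love": 10, "hate": 1}
print(score_normalized("I LOVE this, but I hate that.", sample_keywords))
```

If you go this way, the keys of the keyword dictionary would need to be lowercased as well, otherwise an entry such as `God` could never match.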
After that, continue:
class Region:
    def __init__(self, lat_range, long_range):
        self.lat_range = lat_range
        self.long_range = long_range
        self.score = 0 # add new field
        self.quantity = 0 # add new field

    def contains(self, lat, long):
        return self.lat_range[0] <= lat and lat < self.lat_range[1] and\
            self.long_range[0] <= long and long < self.long_range[1]

eastern = Region((24.660845, 49.189787), (-87.518395, -67.444574))
central = Region((24.660845, 49.189787), (-101.998892, -87.518395))
mountain = Region((24.660845, 49.189787), (-115.236428, -101.998892))
pacific = Region((24.660845, 49.189787), (-125.242264, -115.236428))
for line in open('newfile.txt'):
    parts = line.split(" ")
    lat = float(parts[0][1:-1]) #Stripping the [ and the ,
    long = float(parts[1][:-1]) #Stripping the ]
    for region in (eastern, central, mountain, pacific):
        if region.contains(lat, long):
            # pass the extra dict with keywords mapped to score;
            # score() calls .split() itself, so it gets the raw line, not the list
            region_score, count = score(line, kw_to_score)
            region.score += region_score
            region.quantity += count
Then all you need to do is:
print(eastern.score / eastern.quantity) # That would give you the avg.
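Note that `region.quantity` accumulates the number of matched keywords, so the division above gives an average per keyword and fails if a region has no matches. If what you want is the sum of the per-tweet scores divided by the number of tweets, as the question asks, a sketch could look like the following. It assumes a tweet's score is its keyword total divided by its keyword count, and that tweets without any keyword are skipped. `RegionStats`, `parse_coordinates`, `tweet_score` and the sample lines are hypothetical names and data for illustration.

```python
class RegionStats:
    """Rectangle of (lat, long) bounds plus a running sum of tweet scores."""

    def __init__(self, lat_range, long_range):
        self.lat_range = lat_range
        self.long_range = long_range
        self.score_sum = 0.0
        self.tweet_count = 0

    def contains(self, lat, long):
        return (self.lat_range[0] <= lat < self.lat_range[1]
                and self.long_range[0] <= long < self.long_range[1])

    def add(self, tweet_score):
        self.score_sum += tweet_score
        self.tweet_count += 1

    def average(self):
        # None signals "no scored tweets" instead of dividing by zero.
        if self.tweet_count == 0:
            return None
        return self.score_sum / self.tweet_count


def parse_coordinates(line):
    """Extract (lat, long) from a line that starts with '[lat, long]'."""
    inside = line[line.index("[") + 1:line.index("]")]
    lat_text, long_text = inside.split(",")
    return float(lat_text), float(long_text)


def tweet_score(line, keywords):
    """Keyword total divided by keyword count, or None if nothing matches."""
    matched = [keywords[word] for word in line.split() if word in keywords]
    if not matched:
        return None
    return sum(matched) / len(matched)


keywords = {"love": 10, "like": 5, "best": 10, "hate": 1}
regions = {
    "eastern": RegionStats((24.660845, 49.189787), (-87.518395, -67.444574)),
    "central": RegionStats((24.660845, 49.189787), (-101.998892, -87.518395)),
}
# Made-up tweets in the same layout as the sample line in the question.
sample_lines = [
    "[41.92, -88.77] 6 2011-08-28 19:24:18 I love and hate mondays",
    "[40.00, -90.00] 6 2011-08-28 19:25:00 you are the best",
    "[40.71, -74.00] 6 2011-08-28 19:26:00 I like this",
]

for line in sample_lines:
    value = tweet_score(line, keywords)
    if value is None:
        continue  # tweet contains no keyword, so it is not counted
    lat, long = parse_coordinates(line)
    for region in regions.values():
        if region.contains(lat, long):
            region.add(value)
            break  # the rectangles do not overlap, so stop at the first hit

for name, region in regions.items():
    print(name, region.average())
```

Storing only a running sum and a count keeps memory use constant per region; keep a list of scores instead if you need the individual values later.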