I have a formatted text file in which the movie title, rating, and country of origin are separated by tabs on each line:
"3:0 für die Bärte" (1971) 6.8 West Germany
"3K Check In" (2002) 4.3 Federal Republic of Yugoslavia
"3MW: Rivers of Blood" (2008) 7.9 UK
"3Way" (2008) 8.2 USA
"3rd Rock from the Sun" (1996) 7.8 USA
"3rd and Bird" (2008) 7.8 UK
"3satfestival" (2000) 6.7 Germany
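My goal is to compute the average rating for each country, and the code below does that. However, I also want to rename some countries, for example 'West Germany' to 'Germany', so that their ratings are pooled together, and that part of my code does not work: the ratings for 'West Germany' and 'Germany' are still computed separately. What should I change?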
import collections
MovieRating = collections.namedtuple('MovieRating', ['countryorigin', 'ratingscore'])
ratings = {}
movie = open("movieRatingscore.txt", "r")  # open the country rating data file
for line in movie.readlines():
    line.rstrip()
    (moviename, ratingscore, countryorigin) = line.split('\t')
    if countryorigin == 'West Germany':
        countryorigin = 'Germany'
    ratingscore = float(ratingscore)
    if countryorigin in ratings:
        ratings[countryorigin].append(ratingscore)
    else:
        ratings[countryorigin] = [ratingscore]
average = lambda alist: sum(alist)/len(alist)
average_ratings = [MovieRating(countryorigin, average(ratings[countryorigin])) for countryorigin in ratings]
print "\nCountries with the highest average movie rating\n------------------------------"
sorted_ratings = sorted(average_ratings, key=lambda countryorigin: countryorigin.ratingscore, reverse=True)
for i, j in enumerate(sorted_ratings):
    print '%i. %s \t%g' % (i + 1, j.countryorigin, j.ratingscore)
Answer 0 (score: 2)
The simplest way is to use a dictionary to replace the words. See the sample code:
dt = {'West Germany': 'Germany', 'another': 'Replaced'}
for line in movie.readlines():
    for item in dt:
        line = line.replace(item, dt[item])
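One caveat with replacing on the whole line is that a country name appearing inside a movie title would be rewritten as well. Below is a minimal sketch that applies the mapping to the country field only. The names `COUNTRY_ALIASES`, `canonical_country`, and `parse_line` are illustrative and not from the question, and the sample title is made up to show the difference.

```python
# Hypothetical alias table; extend it with whatever renames you need.
COUNTRY_ALIASES = {'West Germany': 'Germany'}

def canonical_country(name):
    """Return the canonical name for a country, or the name itself if no alias exists."""
    name = name.strip()  # drop the trailing newline and stray whitespace
    return COUNTRY_ALIASES.get(name, name)

def parse_line(line):
    """Split one tab-separated record into (title, score, country)."""
    moviename, ratingscore, countryorigin = line.rstrip('\n').split('\t')
    return moviename, float(ratingscore), canonical_country(countryorigin)

# Made-up record: the title contains the alias, but only the country is renamed.
record = parse_line('"West Germany Diaries" (1990)\t6.8\tWest Germany\n')
```

Using `dict.get(name, name)` keeps every country that has no alias unchanged, so the table only needs entries for the names you want to merge.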
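Answer 1 (score: 1)
As a general rule, any text comparison should be made between stripped and lowercased strings. This keeps you from being tripped up by files that use multiple whitespace separators.
Also, a more general check for converting West Germany to Germany is to test whether the string contains the substring germany. Thus: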
for line in map(str.strip, movie.readlines()):
    (moviename, ratingscore, countryorigin) = map(str.strip, line.split('\t'))
    if "germany" in countryorigin.lower():
        countryorigin = 'Germany'
    # ...
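The strip-and-lowercase rule can be sketched as a pair of small helpers. The names `normalize` and `merge_germany` are assumptions made for illustration. Note the trade-off of the substring test: it also folds "East Germany" into "Germany", which may or may not be what you want.

```python
def normalize(text):
    """Strip surrounding whitespace and lowercase; used for comparison only."""
    return text.strip().lower()

def merge_germany(country):
    """Map any country name containing 'germany' to 'Germany'."""
    country = country.strip()
    if 'germany' in normalize(country):
        return 'Germany'
    return country  # other names keep their original capitalisation

merged = [merge_germany(c) for c in ('West Germany\n', ' germany', 'UK ')]
```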
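Answer 2 (score: 0)
Your file is not formatted correctly as posted (you need tabs instead of spaces). After putting the file contents into a list of strings with literal tab characters: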
movie =["\"3:0 für die Bärte\" (1971)\t6.8\tWest Germany",
"\"3K Check In\" (2002)\t4.3\tFederal Republic of Yugoslavia",
"\"3MW: Rivers of Blood\" (2008)\t7.9\tUK",
"\"3Way\" (2008)\t8.2\tUSA",
"\"3rd Rock from the Sun\" (1996)\t7.8\tUSA",
"\"3rd and Bird\" (2008)\t7.8\tUK",
"\"3satfestival\" (2000)\t6.7\tGermany"]
for line in movie:
    ....
I got the output:
Countries with the highest average movie rating
------------------------------
1. USA 8
2. UK 7.85
3. Germany 6.75
4. Federal Republic of Yugoslavia 4.3
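Answer 3 (score: 0)
Using the expression
print repr(countryorigin)
should show you the problem. The string is 'West Germany\n' rather than 'West Germany', which is why the equality check fails. From the Python docs:
str.rstrip([chars]): Return a copy of the string with trailing characters removed.
You are calling rstrip, but the result is never saved back into line. You can fix the problem by adding line = line.rstrip(), but I think @blz has the best syntax: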
for line in map(str.strip, movie.readlines()):
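Because Python strings are immutable, `rstrip` cannot modify `line` in place; it only returns a new string. A small sketch of the difference, using one record from the question and assuming real tab separators:

```python
line = '"3Way" (2008)\t8.2\tUSA\n'

line.rstrip()                 # the returned copy is discarded; line is unchanged
country_before = line.split('\t')[2]

line = line.rstrip()          # keep the returned copy
country_after = line.split('\t')[2]
```

Here `country_before` still carries the trailing newline, so comparing it with a plain country name fails, while `country_after` compares as expected.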
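Answer 4 (score: 0)
The error you describe seems to come from your csv file. Your code looks fine and logically sound.
But you should use the tools provided by the Python Standard Library; they can do a lot of the heavy lifting for you.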
import csv
from collections import defaultdict, namedtuple
from operator import attrgetter, itemgetter
from itertools import imap
MovieRating = namedtuple('MovieRating', 'countryorigin ratingscore')
# Three tab-separated columns: title (the year is part of it), score, country
fieldnames = 'name', 'score', 'country'
score_and_country = itemgetter('score', 'country')
ratings = defaultdict(list)
with open("movieRatingscore.txt", "r") as moviefile:
    movies = csv.DictReader(moviefile, fieldnames=fieldnames, delimiter='\t')
    for score, country in imap(score_and_country, movies):
        if country == 'West Germany':
            country = 'Germany'
        ratings[country].append(float(score))
average = lambda alist: sum(alist) / len(alist)
average_ratings = [MovieRating(country, average(scores))
                   for country, scores in ratings.iteritems()]
print
print "Countries with the highest average movie rating"
print "------------------------------"
sorted_ratings = sorted(average_ratings, key=attrgetter('ratingscore'),
                        reverse=True)
for i, j in enumerate(sorted_ratings):
    print '%i. %s \t%g' % (i + 1, j.countryorigin, j.ratingscore)
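The code above targets Python 2 (`imap`, `iteritems`, the `print` statement). For reference, here is a rough Python 3 sketch of the same idea. It assumes three tab-separated columns (title, score, country), reads the question's sample rows from an in-memory string instead of the real file, and uses a hypothetical helper name, `average_by_country`.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Sample rows from the question, written with real tab separators (assumed layout).
SAMPLE = (
    '"3:0 für die Bärte" (1971)\t6.8\tWest Germany\n'
    '"3K Check In" (2002)\t4.3\tFederal Republic of Yugoslavia\n'
    '"3MW: Rivers of Blood" (2008)\t7.9\tUK\n'
    '"3Way" (2008)\t8.2\tUSA\n'
    '"3rd Rock from the Sun" (1996)\t7.8\tUSA\n'
    '"3rd and Bird" (2008)\t7.8\tUK\n'
    '"3satfestival" (2000)\t6.7\tGermany\n'
)

def average_by_country(lines, aliases=None):
    """Return (country, average score) pairs, highest average first."""
    aliases = aliases or {}
    ratings = defaultdict(list)
    # QUOTE_NONE keeps the double quotes in the titles as ordinary characters.
    reader = csv.reader(lines, delimiter='\t', quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 3:
            continue  # skip blank or malformed lines
        _title, score, country = row
        country = country.strip()
        ratings[aliases.get(country, country)].append(float(score))
    averages = {country: mean(scores) for country, scores in ratings.items()}
    return sorted(averages.items(), key=lambda pair: pair[1], reverse=True)

ranking = average_by_country(io.StringIO(SAMPLE), {'West Germany': 'Germany'})
```

With a real file, something like `open("movieRatingscore.txt", newline="")` would replace the `io.StringIO` object; the rest of the function stays the same.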