Question

I have a formatted text file in which the movie name, rating, and country of origin are separated by tabs on each line:

"3:0 f¸r die B‰rte" (1971)  6.8 West Germany
"3K Check In" (2002)    4.3 Federal Republic of Yugoslavia
"3MW: Rivers of Blood" (2008)   7.9 UK
"3Way" (2008)   8.2 USA
"3rd Rock from the Sun" (1996)  7.8 USA
"3rd and Bird" (2008)   7.8 UK
"3satfestival" (2000)   6.7 Germany

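My goal is to compute the average rating for each country, which the code below does. However, I also want to rename some countries, for example 'West Germany' to 'Germany', so that their ratings are combined. The code I have does not do that: the ratings for 'West Germany' and 'Germany' are still computed separately. What can I change?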

import collections

MovieRating = collections.namedtuple('MovieRating', ['countryorigin', 'ratingscore'])

ratings = {}

movie = open("movieRatingscore.txt", "r") #open the country rating data file

for line in movie.readlines():
    line.rstrip()
    (moviename, ratingscore, countryorigin) = line.split('\t')
    if countryorigin == 'West Germany':
        countryorigin = 'Germany'
    ratingscore = float(ratingscore)
    if countryorigin in ratings:
        ratings[countryorigin].append(ratingscore)
    else:
        ratings[countryorigin] = [ratingscore]

average = lambda alist: sum(alist)/len(alist)
average_ratings = [MovieRating(countryorigin, average(ratings[countryorigin])) for countryorigin in ratings]

print "\nCountries with the highest average movie rating\n------------------------------"
sorted_ratings = sorted(average_ratings, key=lambda countryorigin: countryorigin.ratingscore, reverse=True)
for i, j in enumerate(sorted_ratings):
    print '%i. %s \t%g' % (i + 1, j.countryorigin, j.ratingscore)

Answer 1

The simplest way is to use a dictionary to replace the words. See the sample code:

dt = {'West Germany': 'Germany', 'another': 'Replaced'}
for line in movie.readlines():
    for item in dt:
        line = line.replace(item, dt[item])
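
Note that calling replace() on the whole line would also rewrite a movie title that happens to contain one of the keys. A minimal sketch of applying the same kind of dictionary to the country field only, written in Python 3 syntax; the helper name normalize_country, the RENAMES mapping, and the second sample title are made up for illustration:

```python
# Hypothetical mapping from old country names to the names they should merge into.
RENAMES = {'West Germany': 'Germany'}

def normalize_country(country, renames=RENAMES):
    """Return the replacement for a country name, or the name itself if unmapped."""
    country = country.strip()  # drop the trailing newline and stray spaces
    return renames.get(country, country)

# Sample lines standing in for the file contents (tab-separated, newline-terminated).
lines = ['"3satfestival" (2000)\t6.7\tGermany\n',
         '"West Germany Stories" (1971)\t6.8\tWest Germany\n']  # made-up title

for line in lines:
    moviename, ratingscore, countryorigin = line.split('\t')
    # Only the country column is rewritten; the title is left untouched.
    print(moviename, normalize_country(countryorigin))
```

Looking up the stripped field with dict.get keeps the comparison exact, so only whole country names are renamed.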

Answer 2

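As a general rule, any text comparison should be done between stripped and lowercased strings. This avoids being tripped up by files that use several whitespace characters as separators.

In addition, a more general way to convert West Germany to Germany is to check whether the string contains the substring germany. Thus: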

for line in map(str.strip, movie.readlines()):
    (moviename, ratingscore, countryorigin) = map(str.strip, line.split('\t'))
    if "germany" in countryorigin.lower():
        countryorigin = 'Germany'
    # ...
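
A quick way to see what the substring test matches. The sample values below are illustrative and not taken from the asker's file, and is_germany is a hypothetical helper name. Any name containing "germany" is merged, including 'East Germany', which may or may not be what is wanted:

```python
def is_germany(country):
    """Case-insensitive substring test, mirroring the check in the loop above."""
    return "germany" in country.strip().lower()

for name in ['West Germany\n', ' germany ', 'East Germany', 'UK']:
    print(repr(name), is_germany(name))
```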

Answer 3

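I put the file contents into a list of strings with literal tab characters, because the data as posted is not formatted correctly (it has spaces where the tabs should be):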

movie =["\"3:0 f¸r die B‰rte\" (1971)\t6.8\tWest Germany",
"\"3K Check In\" (2002)\t4.3\tFederal Republic of Yugoslavia",
"\"3MW: Rivers of Blood\" (2008)\t7.9\tUK",
"\"3Way\" (2008)\t8.2\tUSA",
"\"3rd Rock from the Sun\" (1996)\t7.8\tUSA",
"\"3rd and Bird\" (2008)\t7.8\tUK",
"\"3satfestival\" (2000)\t6.7\tGermany"]

for line in movie:
  ....

I got the output:

Countries with the highest average movie rating
------------------------------
1. USA  8
2. UK   7.85
3. Germany  6.75
4. Federal Republic of Yugoslavia   4.3

Answer 4

Using the statement

print repr(countryorigin)

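should show you the problem. The string is "West Germany\n" rather than "West Germany", which is why the equality check fails. From the Python docs:

str.rstrip([chars]): Return a copy of the string with trailing characters removed.

You are calling rstrip, but the result is not saved back to line. You can fix the problem by writing line = line.rstrip(), but I think @blz has the best syntax: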

for line in map(str.strip, movie.readlines()):
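
The difference between discarding and keeping the result of rstrip() can be checked with a short sketch (Python 3 syntax, using a placeholder line with a made-up title instead of the file):

```python
line = '"Some Title" (1971)\t6.8\tWest Germany\n'  # placeholder sample line

line.rstrip()                    # returns a new string; the result is discarded
unchanged = line.split('\t')[2]
print(repr(unchanged))           # the country field still carries the newline

line = line.rstrip()             # rebind the name to the stripped copy
stripped = line.split('\t')[2]
print(repr(stripped))

# Only the stripped field compares equal to the plain country name.
print(unchanged == 'West Germany', stripped == 'West Germany')
```

Strings are immutable in Python, so every string method returns a new object and the original is never modified in place.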

Answer 5

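The error you describe appears to come from your csv file. Your code seems fine and is logically sound.

That said, you should use the tools provided by the Python Standard Library; they can do a lot of the heavy lifting for you.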

import csv
from collections import defaultdict, namedtuple
from operator import attrgetter, itemgetter
from itertools import imap

MovieRating = namedtuple('MovieRating', 'countryorigin ratingscore')

# Three tab-separated columns; the title column includes the year.
fieldnames = 'name', 'score', 'country'
score_and_country = itemgetter('score', 'country')
ratings = defaultdict(list)

with open("movieRatingscore.txt", "r") as moviefile:
    movies = csv.DictReader(moviefile, fieldnames=fieldnames, delimiter='\t')
    for score, country in imap(score_and_country, movies):
        if country == 'West Germany':
            country = 'Germany'
        ratings[country].append(float(score))

average = lambda alist: sum(alist) / len(alist)
average_ratings = [MovieRating(country, average(scores))
                   for country, scores in ratings.iteritems()]

print
print "Countries with the highest average movie rating"
print "------------------------------"
sorted_ratings = sorted(average_ratings, key=attrgetter('ratingscore'),
                        reverse=True)
for i, j in enumerate(sorted_ratings):
    print '%i. %s \t%g' % (i + 1, j.countryorigin, j.ratingscore)
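
The code above is Python 2 (itertools.imap, dict.iteritems, print statements). A rough Python 3 sketch of the same approach follows, assuming three tab-separated columns and reading from an in-memory sample rather than movieRatingscore.txt. The function name average_by_country, the RENAMES mapping, and the "Some Title" row are introduced here for illustration only:

```python
import csv
import io
from collections import defaultdict

RENAMES = {'West Germany': 'Germany'}  # illustrative mapping

def average_by_country(rows, renames=RENAMES):
    """Group scores by (renamed) country; return (country, mean) pairs, best first."""
    ratings = defaultdict(list)
    for name, score, country in rows:
        country = country.strip()
        ratings[renames.get(country, country)].append(float(score))
    averages = [(country, sum(scores) / len(scores))
                for country, scores in ratings.items()]
    return sorted(averages, key=lambda pair: pair[1], reverse=True)

# In-memory stand-in for the data file; the last row uses a placeholder title.
sample = ('"3Way" (2008)\t8.2\tUSA\n'
          '"3rd and Bird" (2008)\t7.8\tUK\n'
          '"3satfestival" (2000)\t6.7\tGermany\n'
          '"Some Title" (1971)\t6.8\tWest Germany\n')

# QUOTE_NONE keeps the quotation marks in titles as literal characters.
reader = csv.reader(io.StringIO(sample), delimiter='\t', quoting=csv.QUOTE_NONE)
result = average_by_country(reader)

for i, (country, mean) in enumerate(result, start=1):
    print('%i. %s \t%g' % (i, country, mean))
```

For a real file, the io.StringIO object would be replaced by a file opened with open(path, newline=''), as the csv module documentation recommends.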

Renaming countries in a list
