I have three files: users.dat, ratings.dat and movies.dat.
users.dat
1::F::1::10::48067
1::F::1::10::48067
1::F::1::10::48067
1::F::1::10::48067
1::F::1::10::48067
1::F::1::10::48067
1::F::1::10::48067
1::F::1::10::48067
ratings.dat
1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968
1::3408::4::978300275
1::2355::5::978824291
1::1197::3::978302268
1::1287::5::978302039
1::2804::5::978300719
movies.dat
1193::One Flew Over the Cuckoo's Nest (1975)::Drama
661::James and the Giant Peach (1996)::Animation|Children's|Musical
914::My Fair Lady (1964)::Musical|Romance
3408::Erin Brockovich (2000)::Drama
2355::Bug's Life, A (1998)::Animation|Children's|Comedy
1197::Princess Bride, The (1987)::Action|Adventure|Comedy|Romance
1287::Ben-Hur (1959)::Action|Adventure|Drama
2804::Christmas Story, A (1983)::Comedy|Drama
I am trying to merge these files without using pandas. I created three dictionaries, with the user ID as the common key, and then tried to merge the three files on that key. However, the result is not exactly what I want. Any suggestions would be appreciated.
My expected output
1::1193::5::978300760::F::1::10::48067::One Flew Over the Cuckoo's Nest::Drama::1975
1::661::3::978302109::F::1::10::48067::James and the Giant Peach::Animation|Children's|Musical::1996
1::914::3::978301968::F::1::10::48067::My Fair Lady ::Musical|Romance::1964
1::3408::4::978300275::F::1::10::48067::Erin Brockovich ::Drama::2000
1::2355::5::978824291::F::1::10::48067::Bug's Life, A ::Animation|Children's|Comedy::1998
My code
import json

file = open("users.dat","r",encoding = 'utf-8')
users={}
for line in file:
    x = line.split('::')
    user_id=x[0]
    gender=x[1]
    age=x[2]
    occupation=x[3]
    i_zip=x[4]
    users[user_id]=gender,age,occupation,i_zip.strip()

file = open("movies.dat","r",encoding='latin-1')
movies={}
for line in file:
    x = line.split('::')
    movie_id=x[0]
    title=x[1]
    genre=x[2]
    movies[movie_id]=title,genre.strip()

file = open("ratings.dat","r")
ratings={}
for line in file:
    x = line.split('::')
    a=x[0]
    b=x[1]
    c=x[2]
    d=x[3]
    ratings[a]=b,c,d.strip()

newdict = {}
newdict.update(users)
newdict.update(movies)
newdict.update(ratings)
for i in users.keys():
    addition = users[i] + movies[i]+ratings[i]
    newdict[i] = addition

with open('data.txt', 'w') as outfile:
    json.dump(newdict, outfile)
Answer 0 (score: 0)
The first error in your code (aside from the messy indentation) is that you build the ratings dictionary with the user ID as the key:
ratings[a]=b,c,d.strip()
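For your data set, the ratings dictionary ends up as { '1': ('2804', '5', '978300719') }, because every assignment to the same key overwrites the previous one. As a result, all but one rating per user are lost.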
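The overwriting can be reproduced in isolation. The following is a minimal sketch that uses three sample rows from the question as an in-memory list (the names `lines`, `ratings_dict` and `ratings_list` are illustrative, not taken from the original code) and contrasts a dictionary keyed by user ID with a plain list:

```python
# Sample rows in ratings.dat format: user_id::movie_id::score::timestamp
lines = [
    "1::1193::5::978300760",
    "1::661::3::978302109",
    "1::2804::5::978300719",
]

ratings_dict = {}
ratings_list = []
for line in lines:
    user_id, movie_id, score, dt = line.rstrip().split("::")
    # Same key every time, so each assignment replaces the previous value
    ratings_dict[user_id] = (movie_id, score, dt)
    # A list keeps one entry per input row
    ratings_list.append((user_id, movie_id, score, dt))

print(ratings_dict)       # {'1': ('2804', '5', '978300719')}
print(len(ratings_list))  # 3
```

Only the last rating of user 1 survives in the dictionary, while the list keeps all three rows.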
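What you want to do is treat the ratings data as a list, not a dictionary. The result you are trying to produce is essentially an expanded version of the ratings: it has exactly as many lines as there are ratings.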
Second, you do not need the json module, because your desired output is not in JSON format.
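To illustrate the difference, here is a small sketch comparing what `json.dumps` produces for one merged row with what a plain `'::'.join` produces. The variable names are illustrative only:

```python
import json

# One rating row as a tuple of strings
row = ("1", "1193", "5", "978300760")

# JSON serialisation adds braces, brackets and quotes
as_json = json.dumps({"1": row})

# Joining with the '::' delimiter matches the format of the .dat files
as_text = "::".join(row)

print(as_json)  # {"1": ["1", "1193", "5", "978300760"]}
print(as_text)  # 1::1193::5::978300760
```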
Here is code that does the job:
#!/usr/bin/env python3

# Part 1: collect data from the files
users = {}
with open("users.dat", "r", encoding='utf-8') as file:
    for line in file:
        user_id, gender, age, occupation, i_zip = line.rstrip().split('::')
        users[user_id] = (gender, age, occupation, i_zip)

movies = {}
with open("movies.dat", "r", encoding='latin-1') as file:
    for line in file:
        movie_id, title, genre = line.rstrip().split('::')
        # Parse year from title
        title = title.rstrip()
        year = 'N/A'
        if title.endswith(')') and '(' in title:
            short_title, in_parenthesis = title.rsplit('(', 1)
            in_parenthesis = in_parenthesis.rstrip(')').rstrip()
            if in_parenthesis.isdigit() and len(in_parenthesis) == 4:
                # Text in parentheses has four digits - it must be the year
                title = short_title.rstrip()
                year = in_parenthesis
        movies[movie_id] = (title, genre, year)

ratings = []
with open("ratings.dat", "r") as file:
    for line in file:
        user_id, movie_id, score, dt = line.rstrip().split('::')
        ratings.append((user_id, movie_id, score, dt))

# Part 2: save the output
with open('output.dat', 'w', encoding='utf-8') as file:
    for user_id, movie_id, score, dt in ratings:
        # Get user data from dictionary
        gender, age, occupation, i_zip = users[user_id]
        # Get movie data from dictionary
        title, genre, year = movies[movie_id]
        # Merge data into a single string
        row = '::'.join([user_id, movie_id, score, dt,
                         gender, age, occupation, i_zip,
                         title, genre, year])
        # Write to the file
        file.write(row + '\n')
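Part 1 is based on your code. The main differences are that I store the ratings in a list (rather than a dictionary) and that I added parsing of the year from the title.
Part 2 is where the output is saved.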
Contents of the output.dat file after running the script:
1::1193::5::978300760::F::1::10::48067::One Flew Over the Cuckoo's Nest::Drama::1975
1::661::3::978302109::F::1::10::48067::James and the Giant Peach::Animation|Children's|Musical::1996
1::914::3::978301968::F::1::10::48067::My Fair Lady::Musical|Romance::1964
1::3408::4::978300275::F::1::10::48067::Erin Brockovich::Drama::2000
1::2355::5::978824291::F::1::10::48067::Bug's Life, A::Animation|Children's|Comedy::1998
1::1197::3::978302268::F::1::10::48067::Princess Bride, The::Action|Adventure|Comedy|Romance::1987
1::1287::5::978302039::F::1::10::48067::Ben-Hur::Action|Adventure|Drama::1959
1::2804::5::978300719::F::1::10::48067::Christmas Story, A::Comedy|Drama::1983
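As a side note, the year parsing in Part 1 could also be sketched with a regular expression. The helper below, `split_title_year`, is a hypothetical name and not part of the script above. It assumes the year always appears as exactly four digits in parentheses at the very end of the title, so unlike the script it does not tolerate whitespace inside the parentheses:

```python
import re

# Hypothetical helper, assuming titles end with "(YYYY)" when a year is present
_YEAR_RE = re.compile(r"^(.*)\((\d{4})\)\s*$")

def split_title_year(title):
    """Return (title_without_year, year), or (title, 'N/A') if no year is found."""
    match = _YEAR_RE.match(title)
    if match is None:
        return title.strip(), "N/A"
    return match.group(1).strip(), match.group(2)

print(split_title_year("Bug's Life, A (1998)"))  # ("Bug's Life, A", '1998')
print(split_title_year("Untitled"))              # ('Untitled', 'N/A')
```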