我正在努力处理我已导入的2个csv文件
csv文件如下:
csv1
planet,diameter,discovered,color
sceptri,33.41685587,28-11-1611 05:15, black
...
csv2
planet,diameter,discovered,color
sceptri,33.41685587,28-11-1611 05:15, blue
...
在两个csv文件中,行星都是相同的,但是顺序不同,有时值也不同(不匹配)
每个行星的数据(直径,发现和颜色)已独立输入。我想交叉检查两张纸,找到所有不匹配的字段。然后,我想生成一个新文件,其中每个错误包含一行内容,并附有错误说明。
例如: sceptri:不匹配(黑色/蓝色)
到目前为止,这是我的代码
with open('planets1.csv') as csvfile:
a = csv.reader(csvfile, delimiter=',')
data_a= list(a)
for row in a:
print(row)
with open('planets2.csv') as csvfile:
b = csv.reader(csvfile, delimiter=',')
data_b= list(b)
for row in b:
print(row)
print(data_a)
print(data_b)
c= [data_a]
d= [data_b]```
thank you in advance for your help!
答案 0 :(得分:0)
假设两个文件中行星的名称正确,这是我的建议
# Working with list of list, which could be get csv file reading:
csv1 = [["sceptri",33.41685587,"28-11-1611 05:15", "black"],
["foo",35.41685587,"29-11-1611 05:15", "black"],
["bar",38.7,"29-11-1611 05:15", "black"],]
csv2 = [["foo",35.41685587,"29-11-1611 05:15", "black"],
["bar",38.17,"29-11-1611 05:15", "black"],
["sceptri",33.41685587,"28-11-1611 05:15", "blue"]]
# A list to contain the errors:
new_file = []
# A dict to check if a planet has already been processed:
a_dict ={}
# Let's read all planet data:
for planet in csv1+csv2:
# Check if planet is already as a key in a_dict:
if planet[0] in a_dict:
# Yes, sir, need to check discrepancies.
if a_dict[planet[0]] != planet[1:]:
# we have some differences in some values.
# Put both set of values in python sets to differences:
error = set(planet[1:]) ^ set(a_dict[planet[0]])
# Append [planet_name, diff.param1, diff_param2] to new_file:
new_file.append([planet[0]]+list(error))
else:
# the planet name becomes a dict key, other param are key value:
a_dict[planet[0]] = planet[1:]
print(new_file)
# [['bar', 38.17, 38.7], ['sceptri', 'black', 'blue']]
列表new_file
可能另存为新文件,请参见Writing a list to file
答案 1 :(得分:0)
我建议对这样的任务使用熊猫。
首先,您需要将csv内容读取到dataframe对象中。可以按照以下步骤进行操作:
import pandas as pd
# make a dataframe from each csv file
df1 = pd.read_csv('planets1.csv')
df2 = pd.read_csv('planets2.csv')
如果您的CSV文件中没有这些名称,则可能需要为每列声明名称。
colnames = ['col1', 'col2', ..., 'coln']
df1 = pd.read_csv('planets1.csv', names=colnames, index_col=0)
df2 = pd.read_csv('planets2.csv', names=colnames, index_col=0)
# use index_col=0 if csv already has an index column
import pandas as pd
# example column names
colnames = ['A','B','C']
# example dataframes
df1 = pd.DataFrame([[0,3,6], [4,5,6], [3,2,5]], columns=colnames)
df2 = pd.DataFrame([[1,3,1], [4,3,6], [3,6,5]], columns=colnames)
请注意,df1如下所示:
A B C
---------------
0 0 3 6
1 4 5 6
2 3 2 5
df2看起来像这样:
A B C
---------------
0 1 3 1
1 4 3 6
2 3 6 5
以下代码比较数据帧,将比较连接到新数据帧,然后将结果保存到CSV:
# define the condition you want to check for (i.e., mismatches)
mask = (df1 != df2)
# df1[mask], df2[mask] will replace matched values with NaN (Not a Number), and leave mismatches
# dropna(how='all') will remove rows filled entirely with NaNs
errors_1 = df1[mask].dropna(how='all')
errors_2 = df2[mask].dropna(how='all')
# add labels to column names
errors_1.columns += '_1' # for planets 1
errors_2.columns += '_2' # for planets 2
# you can now combine horizontally into one big dataframe
errors = pd.concat([errors_1,errors_2],axis=1)
# if you want, reorder the columns of `errors` so compared columns are next to each other
errors = errors.reindex(sorted(errors.columns), axis=1)
# if you don't like the clutter of NaN values, you can replace them with fillna()
errors = errors.fillna('_')
# save to a csv
errors.to_csv('mismatches.csv')
最终结果看起来像这样:
A_1 A_2 B_1 B_2 C_1 C_2
-----------------------------
0 0 1 _ _ 6 1
1 _ _ 5 3 _ _
2 _ _ 2 6 _ _
希望这会有所帮助。
答案 2 :(得分:0)
可以通过对csv文件中的行进行排序,然后比较相应的行以查看是否存在差异来解决这种问题。
这种方法使用一种功能样式执行比较,并将比较任意数量的csv文件。
它假定csvs包含相同数量的记录,并且列的顺序相同。
import contextlib
import csv
def compare_files(readers):
colnames = [next(reader) for reader in readers][0]
sorted_readers = [sorted(r) for r in readers]
for gen in [compare_rows(colnames, rows) for rows in zip(*sorted_readers)]:
yield from gen
def compare_rows(colnames, rows):
col_iter = zip(*rows)
# Be sure we're comparing the same planets.
planets = set(next(col_iter))
assert len(planets) == 1, planets
planet = planets.pop()
for (colname, *vals) in zip(colnames, col_iter):
if len(set(*vals)) > 1:
yield f"{planet} mismatch {colname} ({'/'.join(*vals)})"
def main(outfile, *infiles):
with contextlib.ExitStack() as stack:
csvs = [stack.enter_context(open(fname)) for fname in infiles]
readers = [csv.reader(f) for f in csvs]
with open(outfile, 'w') as out:
for result in compare_files(readers):
out.write(result + '\n')
if __name__ == "__main__":
main('mismatches.txt', 'planets1.csv', 'planets2.csv')