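I need to compare two CSV files and write the differences to a third CSV file. In my case, the first CSV is an old list of hashes named old.csv, and the second CSV is the new list of hashes, which contains both the old and the new hashes.
Here is my code: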
import csv
t1 = open('old.csv', 'r')
t2 = open('new.csv', 'r')
fileone = t1.readlines()
filetwo = t2.readlines()
t1.close()
t2.close()
outFile = open('update.csv', 'w')
x = 0
for i in fileone:
    if i != filetwo[x]:
        outFile.write(filetwo[x])
    x += 1
outFile.close()
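The third file ends up as a copy of the old file rather than containing the updates. What is wrong? I hope you can help me, many thanks!
PS: I don't want to use diff.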
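Answer 0 (score: 13)
The problem is that you are comparing each line in fileone with the line at the same position in filetwo. As soon as one of the files contains an extra line, the lines after it will never be equal again. Try this: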
with open('old.csv', 'r') as t1, open('new.csv', 'r') as t2:
    fileone = t1.readlines()
    filetwo = t2.readlines()

with open('update.csv', 'w') as outFile:
    for line in filetwo:
        if line not in fileone:
            outFile.write(line)
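For long hash lists, the list membership test above rescans fileone for every line of filetwo. A minimal sketch of the same idea with a set lookup follows. The function name lines_only_in_new and the sample hashes are hypothetical, and it assumes lines are compared as exact strings, trailing newline included.

```python
def lines_only_in_new(old_lines, new_lines):
    """Return the lines of new_lines that do not occur in old_lines, in order."""
    seen = set(old_lines)  # set membership is O(1) on average
    return [line for line in new_lines if line not in seen]


# Hypothetical sample data standing in for old.csv and new.csv
old_lines = ["aaa111\n", "bbb222\n"]
new_lines = ["aaa111\n", "ccc333\n", "bbb222\n", "ddd444\n"]

added = lines_only_in_new(old_lines, new_lines)
print(added)
```

The result keeps the order of the new file, so it can be passed straight to outFile.writelines.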
Answer 1 (score: 6)
Answer 2 (score: 5)
Using sets feels like a natural way to detect differences.
#!/usr/bin/env python3
import sys
import argparse
import csv


def get_dataset(f):
    return set(map(tuple, csv.reader(f)))


def main(f1, f2, outfile, sorting_column):
    set1 = get_dataset(f1)
    set2 = get_dataset(f2)
    different = set1 ^ set2
    output = csv.writer(outfile)
    for row in sorted(different, key=lambda x: x[sorting_column], reverse=True):
        output.writerow(row)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('infile', nargs=2, type=argparse.FileType('r'))
    parser.add_argument('outfile', nargs='?', type=argparse.FileType('w'), default=sys.stdout)
    parser.add_argument('-sc', '--sorting-column', nargs='?', type=int, default=0)
    args = parser.parse_args()
    main(*args.infile, args.outfile, args.sorting_column)
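Note that set1 ^ set2 is the symmetric difference, so it also reports rows that exist only in the old file. If only the rows added in the new file are wanted, a plain set difference may be closer to what the question asks for. A small sketch, using in-memory files via io.StringIO and a hypothetical helper named rows_added (the sample rows are made up):

```python
import csv
import io


def get_dataset(f):
    # Each CSV row becomes a hashable tuple so it can be stored in a set
    return set(map(tuple, csv.reader(f)))


def rows_added(old_file, new_file):
    """Rows present in new_file but not in old_file (order is not preserved)."""
    return get_dataset(new_file) - get_dataset(old_file)


# Hypothetical sample data standing in for old.csv and new.csv
old_file = io.StringIO("h1,a.txt\nh2,b.txt\n")
new_file = io.StringIO("h1,a.txt\nh3,c.txt\n")

added = rows_added(old_file, new_file)
print(sorted(added))
```

Here the row for h2, which exists only in the old data, is left out, whereas the symmetric difference would include it.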
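Answer 3 (score: 3)
I assume your new file is just like your old one, except that some lines were added in between the old ones. The old lines are stored in the same order in both files.
Try this: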
with open('old.csv', 'r') as t1:
    old_csv = t1.readlines()
with open('new.csv', 'r') as t2:
    new_csv = t2.readlines()

with open('update.csv', 'w') as out_file:
    line_in_new = 0
    line_in_old = 0
    while line_in_new < len(new_csv) and line_in_old < len(old_csv):
        if old_csv[line_in_old] != new_csv[line_in_new]:
            # This line exists only in the new file
            out_file.write(new_csv[line_in_new])
        else:
            line_in_old += 1
        line_in_new += 1
    # Lines left in the new file after the last old line are all new
    out_file.writelines(new_csv[line_in_new:])
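Using with blocks and some meaningful variable names makes the code easier to understand. You also don't need the csv package, because you are not using any of its functionality.
Update: this solution is not as pretty as Chris Mueller's one, which is perfect and very Pythonic for small files, but it walks through each file only once (keeping the idea of your original algorithm), so it may be better if you have larger files.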
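The single-pass idea can also be written as a generator, so that neither file has to be loaded into memory in full. This is only a sketch under the same assumption as above (every old line also appears in the new file, in the same relative order). The name added_lines and the sample data are hypothetical.

```python
def added_lines(old_lines, new_lines):
    """Yield the lines of new_lines that are missing from old_lines.

    Assumes every old line also appears in new_lines, in the same relative order.
    """
    old_iter = iter(old_lines)
    current_old = next(old_iter, None)  # None once the old lines are exhausted
    for line in new_lines:
        if line == current_old:
            # Matched an old line: advance in the old file
            current_old = next(old_iter, None)
        else:
            # No match at this position: the line was added in the new file
            yield line


# Hypothetical sample data; open file objects can be passed instead of lists
old_lines = ["h1\n", "h2\n", "h3\n"]
new_lines = ["h0\n", "h1\n", "h2\n", "hX\n", "h3\n", "h4\n"]

result = list(added_lines(old_lines, new_lines))
print(result)
```

Because iterating over a file object yields its lines one at a time, calling out_file.writelines(added_lines(t1, t2)) with two open files would stream through both of them once.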
Answer 4 (score: 0)
with open('first_test_pipe.csv', 'r') as t1, open('validation.csv', 'r') as t2:
    filecoming = t1.readlines()
    filevalidation = t2.readlines()

# Compare the two files row by row, treating each row as a set of fields
for i in range(0,len(filevalidation)):
    coming_set = set(filecoming[i].replace("\n","").split(","))
    validation_set = set(filevalidation[i].replace("\n","").split(","))
    # Fields present in both rows
    ReceivedDataList=list(validation_set.intersection(coming_set))
    # Fields present in only one of the two rows
    NotReceivedDataList=list(coming_set.union(validation_set)-
                             coming_set.intersection(validation_set))
    print(NotReceivedDataList)
Answer 5 (score: 0)
import pandas as pd
import sys


def dataframe_difference(df1: pd.DataFrame, df2: pd.DataFrame, csvfile, which=None):
    """Find rows which are different between two DataFrames."""
    comparison_df = df1.merge(
        df2,
        indicator=True,
        how='outer'
    )
    if which is None:
        diff_df = comparison_df[comparison_df['_merge'] != 'both']
    else:
        diff_df = comparison_df[comparison_df['_merge'] == which]
    diff_df.to_csv(csvfile)
    return diff_df


if __name__ == '__main__':
    df1 = pd.read_csv(sys.argv[1], sep=',')
    df2 = pd.read_csv(sys.argv[2], sep=',')
    # sort_values returns a new DataFrame, so the result has to be reassigned
    df1 = df1.sort_values(sys.argv[3])
    df2 = df2.sort_values(sys.argv[3])
    #df1.drop(df1.columns[list(map(int, sys.argv[4].split()))], axis = 1, inplace = True)
    #df2.drop(df2.columns[list(map(int, sys.argv[4].split()))], axis = 1, inplace = True)
    # The output file is always the last argument, with or without the column list
    print(dataframe_difference(df1, df2, sys.argv[-1]))
Run it with:
python3 script.py file1.csv file2.csv some_common_header_to_sort_each_file output_file.csv
If you want to exclude any columns from the comparison, uncomment the df.drop lines and run:
python3 script.py file1.csv file2.csv some_common_header_to_sort_each_file "x y z..." output_file.csv
where x, y, z are the numbers of the columns to drop, with indexing starting at 0.
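To see what the _merge indicator column does, here is a minimal sketch on two small in-memory DataFrames. The column name hash and the sample values are hypothetical; only DataFrame.merge with how='outer' and indicator=True is taken from the script above.

```python
import pandas as pd

# Hypothetical sample data standing in for the two CSV files
old_df = pd.DataFrame({"hash": ["h1", "h2"]})
new_df = pd.DataFrame({"hash": ["h1", "h2", "h3"]})

# indicator=True adds a "_merge" column holding "left_only", "right_only" or "both"
merged = old_df.merge(new_df, how="outer", indicator=True)

# Keep only the rows that exist in the new data but not in the old data
added = merged.loc[merged["_merge"] == "right_only", "hash"].tolist()
print(added)
```

With the script above, the same selection corresponds to passing which='right_only' to dataframe_difference.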