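Situation: I have two CSVs, each 10,000 rows by 140 columns. They are largely identical and I need to identify the differences. The headers are exactly the same and the rows are almost the same (perhaps 100 of the 10K have changed).
Example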
File1.csv
ID,FirstName,LastName,Phone1,Phone2,Phone3
1,Bob,Jones,5555555555,4444444444,3333333333
2,Jim,Hill,2222222222,1111111111,0000000000
File2.csv
ID,FirstName,LastName,Phone1,Phone2,Phone3
1,Bob,Jones,5555555555,4444455444,3333333333
2,Jim,Hill,2222222222,1155111111,0005500000
3,Kim,Grant,2173659851,3214569874,3698521471
Outputfile.csv
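ID,FirstName,LastName,Phone1,Phone2,Phone3
1,Bob,Jones,5555555555,**4444444444**,3333333333
2,Jim,Hill,2222222222,**1111111111**,**0005500000**
**3,Kim,Grant,2173659851,3214569874,3698521471**
(The highlighted values were bold in the original post; they are marked here with ** because plain text cannot carry bold.)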
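I think I want the output to be File2.csv, with the changes from File1.csv highlighted in some way. I'm new to Python and pandas and can't figure out where to start. I did my best to search Google for scripts similar to what I need, but the ones I found all seemed to target one specific situation.
If anyone knows an easier or different way, I'm all ears. I don't care how it gets done, as long as I don't have to check record by record.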
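Answer 0 (score: 0)
You can do this with Python's built-in csv library alone. If you also care about the order of the entries, you can use an OrderedDict to keep the order of the original file.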
import csv
from collections import OrderedDict

# For the first file, add every row, keyed by its ID (first column) in file order
with open('file1.csv', newline='') as f1:
    reader1 = csv.reader(f1, delimiter=",")
    header = next(reader1)
    rows = OrderedDict((row[0], row) for row in reader1)

with open('file2.csv', newline='') as f2:
    reader2 = csv.reader(f2, delimiter=",")
    next(reader2)  # skip the header, which is identical to file1's
    for line in reader2:
        e = rows.get(line[0])
        if e is None:
            # For the second file, only add rows whose ID is not present already
            rows[line[0]] = line
        else:
            # Same ID in both files: append 'c' to every value that changed
            for i, (old, new) in enumerate(zip(e, line)):
                if old != new:
                    e[i] = old + 'c'

# Write the merged rows into another CSV
with open('results.csv', 'w', newline='') as f3:
    c3 = csv.writer(f3, quoting=csv.QUOTE_ALL)
    c3.writerow(header)
    for row in rows.values():
        c3.writerow(row)
As for the bold: you can't do that in a CSV, because a CSV file contains only data and no formatting information.
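Since a CSV cannot carry formatting, one alternative is to record the changes as data. The following is a minimal sketch of that idea, not part of the answer above: it assumes the `ID` column is a unique key, uses the question's sample rows held in memory, and the helper names `load_rows` and `cell_changes` and the `<new row>` marker are hypothetical.

```python
import csv
import io

FILE1 = """ID,FirstName,LastName,Phone1,Phone2,Phone3
1,Bob,Jones,5555555555,4444444444,3333333333
2,Jim,Hill,2222222222,1111111111,0000000000
"""
FILE2 = """ID,FirstName,LastName,Phone1,Phone2,Phone3
1,Bob,Jones,5555555555,4444455444,3333333333
2,Jim,Hill,2222222222,1155111111,0005500000
3,Kim,Grant,2173659851,3214569874,3698521471
"""

def load_rows(fp, key="ID"):
    """Read a CSV into {key value: row dict} and return it with the column order."""
    reader = csv.DictReader(fp)
    rows = {row[key]: row for row in reader}
    return rows, reader.fieldnames

def cell_changes(rows1, rows2, columns):
    """Yield (id, column, old, new) for each cell of file 2 that differs from file 1."""
    for row_id, new_row in rows2.items():
        old_row = rows1.get(row_id)
        if old_row is None:
            # The ID exists only in file 2, so report the whole row as new
            yield (row_id, "<new row>", "", "")
            continue
        for col in columns:
            if old_row[col] != new_row[col]:
                yield (row_id, col, old_row[col], new_row[col])

rows1, columns = load_rows(io.StringIO(FILE1))
rows2, _ = load_rows(io.StringIO(FILE2))
changes = list(cell_changes(rows1, rows2, columns))

# Write the change report as its own CSV (to a string here; use open(...) for a real file)
report = io.StringIO()
writer = csv.writer(report)
writer.writerow(["ID", "column", "old", "new"])
writer.writerows(changes)
print(report.getvalue())
```

With real files, replace the `io.StringIO(...)` objects with `open('file1.csv', newline='')` and `open('file2.csv', newline='')`. The report lists one line per changed cell, so 100 changed rows out of 10K stay easy to review.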
Answer 1 (score: 0)
A second way:
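import numpy as np
import pandas as pd

# assumes df1 and df2 are DataFrames with identical row and column labels
# get indices of differences
difference_locations = np.where(df1 != df2)
# define reference
changed_from = df1.values[difference_locations]
changed_to = df2.values[difference_locations]
# label each difference with its (row label, column name)
rows, cols = difference_locations
index = pd.MultiIndex.from_arrays([df1.index[rows], df1.columns[cols]])
df_differences = pd.DataFrame({'from': changed_from, 'to': changed_to}, index=index)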
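The snippet above only works when both frames have the same labels, and in the question File2.csv has an extra row. Here is a hedged sketch of one way to handle that with pandas: it assumes `ID` is a unique key, reads everything as strings so leading zeros in the phone numbers survive, and compares only the IDs present in both files. The function name `cell_differences` is hypothetical, and the sample data is the question's.

```python
import io
import pandas as pd

FILE1 = """ID,FirstName,LastName,Phone1,Phone2,Phone3
1,Bob,Jones,5555555555,4444444444,3333333333
2,Jim,Hill,2222222222,1111111111,0000000000
"""
FILE2 = """ID,FirstName,LastName,Phone1,Phone2,Phone3
1,Bob,Jones,5555555555,4444455444,3333333333
2,Jim,Hill,2222222222,1155111111,0005500000
3,Kim,Grant,2173659851,3214569874,3698521471
"""

def cell_differences(csv1, csv2, key="ID"):
    """Return (per-cell differences for shared IDs, list of IDs only in the second file)."""
    # dtype=str keeps values like 0000000000 intact; keep_default_na=False avoids NaN cells
    df1 = pd.read_csv(csv1, dtype=str, keep_default_na=False).set_index(key)
    df2 = pd.read_csv(csv2, dtype=str, keep_default_na=False).set_index(key)

    # Restrict both frames to the shared IDs and the same column order
    common = df1.index.intersection(df2.index)
    old = df1.loc[common].stack()
    new = df2.loc[common, df1.columns].stack()

    # Both Series now share one (ID, column) index, so they compare element-wise
    changed = old != new
    diffs = pd.DataFrame({"from": old[changed], "to": new[changed]})
    added = list(df2.index.difference(df1.index))
    return diffs, added

diffs, added = cell_differences(io.StringIO(FILE1), io.StringIO(FILE2))
print(diffs)
print("IDs only in the second file:", added)
```

For real files, pass the paths `'file1.csv'` and `'file2.csv'` in place of the `io.StringIO` objects. The result is indexed by (ID, column name), so each row of `diffs` identifies exactly one changed cell.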
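Answer 2 (score: 0)
CSV generally doesn't support different fonts, but here is a solution that prints to the console using bold and colors (note: I only tested it on a Mac). If you are on Python 3.7+ (where dicts keep insertion order), you don't need the OrderedDict or the column list.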
from collections import OrderedDict
from csv import DictReader

class Color(object):
    GREEN = '\033[92m'
    RED = '\033[91m'
    BOLD = '\033[1m'
    END = '\033[0m'

def load_csv(file):
    # Index by ID in order, and keep track of the original column order
    with open(file, 'r') as fp:
        reader = DictReader(fp, delimiter=',')
        rows = OrderedDict((r['ID'], r) for r in reader)
        return rows, reader.fieldnames

def print_row(row, cols, color, prefix):
    print(Color.BOLD + color + prefix + ','.join(row[c] for c in cols) + Color.END)

def print_diff(row1, row2, cols):
    row = []
    for col in cols:
        value1 = row1[col]
        if row2[col] != value1:
            row.append(Color.BOLD + Color.GREEN + value1 + Color.END)
        else:
            row.append(value1)
    print(','.join(row))

def diff_csv(file1, file2):
    rows1, cols = load_csv(file1)
    rows2, _ = load_csv(file2)

    for row_id, row1 in rows1.items():
        # Pop the matching ID row
        row2 = rows2.pop(row_id, None)

        # If not in file2, then it was added
        if not row2:
            print_row(row1, cols, Color.GREEN, '+')
        # In both files, print the diff
        else:
            print_diff(row1, row2, cols)

    # Anything remaining from file2 was removed in file1
    for row in rows2.values():
        print_row(row, cols, Color.RED, '-')
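To illustrate the note about Python 3.7+, here is a small sketch of the loader written with a plain `dict` instead of `OrderedDict`. It assumes Python 3.7 or later and an `ID` column, and the sample rows are deliberately out of numeric order to show that file order, not sorted order, is what gets kept.

```python
import csv
import io

def load_csv(fp, key="ID"):
    """Index rows by key using a plain dict; insertion order is file order on Python 3.7+."""
    reader = csv.DictReader(fp)
    rows = {row[key]: row for row in reader}
    return rows, reader.fieldnames

# In-memory sample with IDs out of numeric order
sample = io.StringIO("ID,FirstName,LastName\n2,Jim,Hill\n1,Bob,Jones\n3,Kim,Grant\n")
rows, cols = load_csv(sample)

print(list(rows))       # IDs in the order they appear in the file
print(list(rows["1"]))  # each row's keys follow the header's column order
```

Because each row dict already lists its keys in header order, the column list can be derived from any row, which is why a separate list is optional.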