Question

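Situation: I have 2 CSVs of 10k rows by 140 columns. They are largely identical, and I need to identify the differences. The headers are exactly the same and the rows are almost the same (maybe 100 of the 10k have changed).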

Example

File1.csv

ID,FirstName,LastName,Phone1,Phone2,Phone3
1,Bob,Jones,5555555555,4444444444,3333333333
2,Jim,Hill,2222222222,1111111111,0000000000

File2.csv

ID,FirstName,LastName,Phone1,Phone2,Phone3
1,Bob,Jones,5555555555,4444455444,3333333333
2,Jim,Hill,2222222222,1155111111,0005500000
3,Kim,Grant,2173659851,3214569874,3698521471


Outputfile.csv
ID,FirstName,LastName,Phone1,Phone2,Phone3
1,Bob,Jones,5555555555,4444444444,3333333333
2,Jim,Hill,2222222222,1111111111,0005500000
3,Kim,Grant,2173659851,3214569874,3698521471

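I think I want the output to be File2.csv, with the changes from File1.csv highlighted in some way. I'm new to Python and pandas and can't figure out where to start. I've done my best to search Google for scripts similar to what I need, but the ones I find all seem to be written for specific situations.

If someone knows of an easier or different way, I'm all ears. I don't care how this happens, as long as I don't have to check record by record.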

Answer 1

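You can do this with just Python's built-in csv library. If you also care about the order of the entries, you can use an OrderedDict to preserve the order of the original file.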

import csv

with open('file1.csv', newline='') as f1, open('file2.csv', newline='') as f2:
    reader1 = csv.reader(f1, delimiter=",")
    reader2 = csv.reader(f2, delimiter=",")
    header = next(reader1)
    next(reader2)                            # The headers are identical, so skip the second copy
    # For the first file, add them all, keyed by ID (the first column).
    # A plain dict keeps insertion order on Python 3.7+; use OrderedDict on older versions.
    rows = {line[0]: line for line in reader1}
    for line in reader2:
        if line[0] not in rows:              # For the second file, only add rows whose ID is not there yet
            rows[line[0]] = line
            continue
        e = rows[line[0]]
        # Indexes of the columns whose values differ between the two files
        changedindexes = [k for k, (i, j) in enumerate(zip(e, line)) if i != j]
        for val in changedindexes:
            e[val] = e[val] + 'c'            # Flag the changed value with a trailing 'c'

with open('results.csv', 'w', newline='') as f3:
    c3 = csv.writer(f3, quoting=csv.QUOTE_ALL)
    c3.writerow(header)
    for line in rows.values():               # Write the merged rows into another csv
        c3.writerow(line)

As for the bolding, there is no way to do that in a CSV, because CSV files contain data and no formatting information.
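If the highlighting itself matters, one option is to write a format that can carry it. Below is a minimal sketch, assuming HTML output is acceptable and that the first column is a unique ID. The function name `diff_to_html` and the choice of `<strong>` are illustrative, not part of the answer above.

```python
import csv
import html
import io

def diff_to_html(old_csv, new_csv):
    """Render new_csv as an HTML table, wrapping changed or new cells in <strong>."""
    old_rows = list(csv.reader(io.StringIO(old_csv)))
    new_rows = list(csv.reader(io.StringIO(new_csv)))
    # Assumption: column 0 is a unique ID shared by both files
    old_by_id = {row[0]: row for row in old_rows[1:]}

    lines = ["<table>"]
    header_cells = "".join(f"<th>{html.escape(h)}</th>" for h in new_rows[0])
    lines.append(f"<tr>{header_cells}</tr>")
    for row in new_rows[1:]:
        old = old_by_id.get(row[0])
        cells = []
        for i, value in enumerate(row):
            text = html.escape(value)
            # Highlight when the whole row is new or this value differs
            if old is None or i >= len(old) or old[i] != value:
                text = f"<strong>{text}</strong>"
            cells.append(f"<td>{text}</td>")
        lines.append("<tr>" + "".join(cells) + "</tr>")
    lines.append("</table>")
    return "\n".join(lines)
```

The function takes the CSV contents as text, so it could be called as `diff_to_html(open('File1.csv').read(), open('File2.csv').read())` and the result written to an `.html` file.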

Answer 2

A second way:

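import numpy as np
import pandas as pd
# Read as text (keeps leading zeros) and index both frames by ID
df1 = pd.read_csv('file1.csv', dtype=str, index_col='ID')
df2 = pd.read_csv('file2.csv', dtype=str, index_col='ID')
# != needs identically labelled frames, so compare only the IDs present in both
common = df1.index.intersection(df2.index)
df1, df2 = df1.loc[common], df2.loc[common]
# get indices of differences
rows, cols = difference_locations = np.where(df1 != df2)
# define reference
changed_from = df1.values[difference_locations]
changed_to = df2.values[difference_locations]
changed = pd.MultiIndex.from_arrays([df1.index[rows], df1.columns[cols]], names=['ID', 'column'])
df_differences = pd.DataFrame({'from': changed_from, 'to': changed_to}, index=changed)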
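
An element-wise comparison like the one above only works when both frames hold the same rows, and the sample File2.csv has an extra ID. Here is a sketch of one way to separate added rows from changed cells, assuming pandas 1.1+ (for `DataFrame.compare`) and a unique `ID` column. The inline data is the question's sample, with the header names `FirstName` and `LastName` assumed.

```python
import io
import pandas as pd

FILE1 = """ID,FirstName,LastName,Phone1,Phone2,Phone3
1,Bob,Jones,5555555555,4444444444,3333333333
2,Jim,Hill,2222222222,1111111111,0000000000
"""
FILE2 = """ID,FirstName,LastName,Phone1,Phone2,Phone3
1,Bob,Jones,5555555555,4444455444,3333333333
2,Jim,Hill,2222222222,1155111111,0005500000
3,Kim,Grant,2173659851,3214569874,3698521471
"""

def load(text):
    # dtype=str keeps phone numbers as text, so leading zeros survive
    return pd.read_csv(io.StringIO(text), dtype=str).set_index("ID")

df1, df2 = load(FILE1), load(FILE2)

added_ids = df2.index.difference(df1.index)      # IDs only in the second file
removed_ids = df1.index.difference(df2.index)    # IDs only in the first file
common_ids = df1.index.intersection(df2.index)   # IDs present in both

# compare() keeps only the rows and columns that contain a difference;
# each kept column is split into 'self' (df1) and 'other' (df2)
changes = df1.loc[common_ids].compare(df2.loc[common_ids])
print(changes)
print("added:", list(added_ids), "removed:", list(removed_ids))
```

With real files, `io.StringIO(text)` would be replaced by the file path.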

Answer 3

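CSV generally doesn't support different fonts, but here is a solution that prints to the console using bold and colors (note: I have only tested it on a Mac). If you are using Python 3.7+ (where dicts keep insertion order), you don't need the OrderedDict or the list of columns.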

from collections import OrderedDict
from csv import DictReader

class Color(object):
    GREEN = '\033[92m'
    RED = '\033[91m'
    BOLD = '\033[1m'
    END = '\033[0m'

def load_csv(file):
    # Index by ID in order, and keep track of the original column order
    with open(file, 'r') as fp:
        reader = DictReader(fp, delimiter=',')
        rows = OrderedDict((r['ID'], r) for r in reader)
        return rows, reader.fieldnames

def print_row(row, cols, color, prefix):
    print(Color.BOLD + color + prefix + ','.join(row[c] for c in cols) + Color.END)

def print_diff(row1, row2, cols):
    row = []
    for col in cols:
        value1 = row1[col]

        if row2[col] != value1:
            row.append(Color.BOLD + Color.GREEN + value1 + Color.END)
        else:
            row.append(value1)

    print(','.join(row))

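def diff_csv(file1, file2):
    # file1 is treated as the newer version: rows only in file1 count as added,
    # rows only in file2 count as removed, and changed cells show file1's value
    rows1, cols = load_csv(file1)
    rows2, _ = load_csv(file2)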

    for row_id, row1 in rows1.items():

        # Pop the matching ID row
        row2 = rows2.pop(row_id, None)

        # If not in file2, then it was added
        if not row2:
            print_row(row1, cols, Color.GREEN, '+')

        # In both files, print the diff
        else:
            print_diff(row1, row2, cols)

    # Anything remaining from file2 was removed in file1
    for row in rows2.values():
        print_row(row, cols, Color.RED, '-')
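
On Python 3.7+ the loader can use a plain dict, as the note at the top of this answer suggests. This is a sketch under the same assumption of a unique ID column. `load_csv_text` is a hypothetical variant that takes CSV text rather than a file path, so it runs without any files on disk.

```python
import csv
import io

def load_csv_text(text, key="ID"):
    """Index CSV rows by the key column; a plain dict keeps file order on Python 3.7+."""
    reader = csv.DictReader(io.StringIO(text))
    rows = {r[key]: r for r in reader}
    return rows, reader.fieldnames

rows, cols = load_csv_text("ID,FirstName\n2,Jim\n1,Bob\n")
print(list(rows))   # IDs in the order they appear in the file
print(cols)         # column names in header order
```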

pandas - How to compare 2 CSV files and output the changes
