Pandas - How to compare 2 CSV files and output the changes

Time: 2018-11-06 23:16:03

Tags: python pandas csv

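Situation: I have 2 CSVs that are 10K rows by 140 columns. They are largely identical, and I need to identify the differences. The headers are exactly the same and the rows are almost the same (perhaps 100 of the 10K have changed).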

Example

  

File1.csv

ID,First Name,Last Name,Phone1,Phone2,Phone3
1,Bob,Jones,5555555555,4444444444,3333333333
2,Jim,Hill,2222222222,1111111111,0000000000

File2.csv

ID,First Name,Last Name,Phone1,Phone2,Phone3
1,Bob,Jones,5555555555,4444455444,3333333333
2,Jim,Hill,2222222222,1155111111,0005500000
3,Kim,Grant,2173659851,3214569874,3698521471

Outputfile.csv

ID,First Name,Last Name,Phone1,Phone2,Phone3
1,Bob,Jones,5555555555,4444444444,3333333333
2,Jim,Hill,2222222222,1111111111,0005500000
3,Kim,Grant,2173659851,3214569874,3698521471

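I think I want the output to be File2.csv, with the changes from File1.csv highlighted in some way. I'm new to Python and pandas and can't figure out where to start. I've done my best to search Google for a script that fits my needs, but the scripts I found all seem to target other specific situations.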

If anyone knows an easier or different way, I'm all ears. I don't care how this gets done, as long as I don't have to check it record by record.

3 Answers:

Answer 0: (score: 0)

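You can do this with just Python's built-in csv library. If you also care about the order of the entries, you can use an OrderedDict to keep the order of the original file.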

import csv

rows = {}   # ID -> row; a plain dict keeps file order on Python 3.7+ (use OrderedDict on older versions)
with open('file1.csv', newline='') as f1, open('file2.csv', newline='') as f2:
    reader1 = csv.reader(f1, delimiter=",")
    reader2 = csv.reader(f2, delimiter=",")
    header = next(reader1)
    next(reader2)                                 # The headers are identical, so skip the second one
    for line in reader1:
        rows[line[0]] = line                      # For the first file, add them all
    for line in reader2:
        if line[0] not in rows:                   # For the second file, only add them if there is not an entry with the same ID already
            rows[line[0]] = line
            continue
        e = rows[line[0]]
        # Indexes of the columns whose values differ between the two files
        changedindexes = [k for k, (i, j) in enumerate(zip(e, line)) if i != j]
        for val in changedindexes:
            e[val] = e[val] + 'c'                 # Mark the changed cell (the file1 value is kept) with a trailing 'c'

with open('results.csv', 'w', newline='') as f3:  # Write the new merged rows into another csv
    c3 = csv.writer(f3, quoting=csv.QUOTE_ALL)
    c3.writerow(header)
    for line in rows.values():
        c3.writerow(line)

As for the bolding, a CSV file contains only data and no formatting information, so there is no way to do that in a CSV.
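Since the formatting cannot live in the CSV itself, one option is to write the comparison to a format that does carry it, such as HTML. The following is only a rough sketch under some assumptions: `diff_to_html` is a hypothetical helper (not part of the csv module), the `ID` column is assumed to be a unique key, and the two short strings stand in for the real files.

```python
import csv
import html
import io


def diff_to_html(old_text, new_text, key="ID"):
    """Render the new CSV as an HTML table, wrapping changed cells in <strong>."""
    old_rows = {r[key]: r for r in csv.DictReader(io.StringIO(old_text))}
    reader = csv.DictReader(io.StringIO(new_text))
    cols = reader.fieldnames
    lines = ["<table>",
             "<tr>" + "".join(f"<th>{html.escape(c)}</th>" for c in cols) + "</tr>"]
    for row in reader:
        old = old_rows.get(row[key])
        cells = []
        for c in cols:
            text = html.escape(row[c])
            # Highlight the cell if the whole row is new or its value differs from the old file.
            if old is None or old.get(c) != row[c]:
                text = f"<strong>{text}</strong>"
            cells.append(f"<td>{text}</td>")
        lines.append("<tr>" + "".join(cells) + "</tr>")
    lines.append("</table>")
    return "\n".join(lines)


# Assumed sample data, shortened from the question's example.
OLD = "ID,Phone1,Phone2\n1,5555555555,4444444444\n"
NEW = "ID,Phone1,Phone2\n1,5555555555,4444455444\n3,2173659851,3214569874\n"
report = diff_to_html(OLD, NEW)
print(report)
```

The resulting table opens in any browser, and rows that exist only in the second file come out fully bold.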

Answer 1: (score: 0)

A second way:

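import numpy as np
import pandas as pd

# read as strings (keeps leading zeros); align file2 to file1's IDs so != can compare them
df1 = pd.read_csv('file1.csv', dtype=str, index_col='ID')
df2 = pd.read_csv('file2.csv', dtype=str, index_col='ID').reindex(df1.index)
# get indices of differences
difference_locations = np.where(df1 != df2)
# define reference
changed_from = df1.values[difference_locations]
changed_to = df2.values[difference_locations]
# (ID, column) labels of the changed cells, in the same row-major order as np.where
ne_stacked = (df1 != df2).stack()
changed = ne_stacked[ne_stacked]

df_differences = pd.DataFrame({'from': changed_from, 'to': changed_to}, index=changed.index)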

Answer 2: (score: 0)

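CSV generally doesn't support different fonts, but here is a solution that prints to the console using bold and colors (note: I have only tested it on a Mac). If you are on Python 3.7+ (where dicts preserve insertion order), the explicit dict ordering and the column list aren't necessary.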

from collections import OrderedDict
from csv import DictReader

class Color(object):
    GREEN = '\033[92m'
    RED = '\033[91m'
    BOLD = '\033[1m'
    END = '\033[0m'

def load_csv(file):
    # Index by ID in order, and keep track of the original column order
    with open(file, 'r') as fp:
        reader = DictReader(fp, delimiter=',')
        rows = OrderedDict((r['ID'], r) for r in reader)
        return rows, reader.fieldnames

def print_row(row, cols, color, prefix):
    print(Color.BOLD + color + prefix + ','.join(row[c] for c in cols) + Color.END)

def print_diff(row1, row2, cols):
    row = []
    for col in cols:
        value1 = row1[col]

        if row2[col] != value1:
            row.append(Color.BOLD + Color.GREEN + value1 + Color.END)
        else:
            row.append(value1)

    print(','.join(row))

def diff_csv(file1, file2):

    rows1, cols = load_csv(file1)
    rows2, _ = load_csv(file2)

    for row_id, row1 in rows1.items():

        # Pop the matching ID row
        row2 = rows2.pop(row_id, None)

        # If not in file2, then it was added
        if not row2:
            print_row(row1, cols, Color.GREEN, '+')

        # In both files, print the diff
        else:
            print_diff(row1, row2, cols)

    # Anything remaining from file2 was removed in file1
    for row in rows2.values():
        print_row(row, cols, Color.RED, '-')
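The remark about Python 3.7+ can be illustrated with a short sketch. It assumes an in-memory sample instead of a real file, and `load_csv_plain` is a hypothetical variant of the loader, not the answer's own function:

```python
import csv
import io

# Hypothetical in-memory sample; the IDs are deliberately out of numeric order.
SAMPLE = "ID,First Name,Last Name\n2,Jim,Hill\n1,Bob,Jones\n"


def load_csv_plain(fp):
    # A plain dict keeps insertion order on Python 3.7+, so rows stay in file order.
    return {row["ID"]: row for row in csv.DictReader(fp)}


rows = load_csv_plain(io.StringIO(SAMPLE))
for row_id, row in rows.items():
    # Each row's keys come back in header order, so no separate column list is needed.
    print(row_id, ",".join(row.values()))
```

On older interpreters the OrderedDict and the explicit `fieldnames` list remain the safer choice.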