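I'm fairly new to programming, and I'm trying to write a Python program that compares two .csv files on specific columns and checks for additions, deletions, and modifications. The .csv files are formatted as follows, contain the same number of columns, and use BillingNumber as the key: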
BillingNumber,CustomerName,IsActive,IsCreditHold,IsPayScan,City,State
"2","CHARLIE RYAN","Yes","No","Yes","Reading","PA"
"3","INSURANCE BILLS","","","","",""
"4","AAA","","","","",""
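I only need to compare columns 0, 1, 2, and 4. I have tried many different ways to accomplish this, but I've had no luck. I know I can use csv.DictReader or csv.reader to load the files into dictionaries, but after that I get stuck. Once they are loaded into memory, I'm not sure where or how to start.
I tried this earlier: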
import time
old_lines = set((line.strip() for line in open(r'Old/file1.csv', 'r+')))
file_new = open(r'New/file2.csv', 'r+')
choice = 0
choice = int( input('\nPlease choose your result format.\nEnter 1 for .txt, 2 for .csv or 3 for .json\n') )
time.sleep(1)
print(".")
time.sleep(1)
print("..")
time.sleep(1)
print("...")
time.sleep(1)
print("....")
time.sleep(1)
print('Done! Check "Different" folder for results.\n')
if choice == 1:
    file_diff = open(r'Different/diff.txt', 'w')
elif choice == 2:
    file_diff = open(r'Different/diff.csv', 'w')
elif choice == 3:
    file_diff = open(r'Different/diff.json', "w")
else:
    print ("You MUST enter 1, 2 or 3")
    exit()
for line in file_new:
    if line.strip() not in old_lines:
        file_diff.write("** ERROR! Entry "+ line + "** Does not match previous file\n\n")
file_new.close()
file_diff.close()
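It didn't work properly, because if there was an extra line or a missing line, it logged everything after that line as different. It also compared entire lines, which is not what I want to do. This was basically just a starting point, and although it kind of worked, it isn't specific enough for what I need. I'm really just looking for a good place to start. Thanks!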
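Answer 0 (score: 1)
I think you're on the right track using the csv module. Since 'BillingNumber' is a unique key, I would create one dictionary for the "old" billing file and another for the "new" billing file: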
import csv

def make_billing_dict(csv_dict_reader):
    bdict = {}
    for entry in csv_dict_reader:
        key = entry['BillingNumber']
        bdict[key] = entry
    return bdict

with open('old.csv') as csv_file:
    old = csv.DictReader(csv_file)
    old_bills = make_billing_dict(old)
This produces the following data structure for old_bills:
{'2': {'BillingNumber': '2',
       'City': 'Reading',
       'CustomerName': 'CHARLIE RYAN',
       'IsActive': 'Yes',
       'IsCreditHold': 'No',
       'IsPayScan': 'Yes',
       'State': 'PA'},
 '3': {'BillingNumber': '3',
       'City': '',
       'CustomerName': 'INSURANCE BILLS',
       'IsActive': '',
       'IsCreditHold': '',
       'IsPayScan': '',
       'State': ''},
 '4': {'BillingNumber': '4',
       'City': '',
       'CustomerName': 'AAA',
       'IsActive': '',
       'IsCreditHold': '',
       'IsPayScan': '',
       'State': ''}}
After creating the same data structure for the "new" billing file (new_bills), you can easily find the differences:
# Keys that are in old_bills, but not new_bills
print(set(old_bills.keys()) - set(new_bills.keys()))
# Keys that are in new_bills, but not old_bills
print(set(new_bills.keys()) - set(old_bills.keys()))
# Compare columns for same billing records
# Will print True or False
print(old_bills['2']['CustomerName'] == new_bills['2']['CustomerName'])
print(old_bills['2']['IsActive'] == new_bills['2']['IsActive'])
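Obviously, you wouldn't write a separate print statement for every comparison. I'm just demonstrating how to use the data structures to find differences. Next, you should write a function that loops over all possible BillingNumbers and checks for differences between old and new... but I'll leave that part to you.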
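For anyone wondering what that last step might look like, here is a minimal sketch of such a loop. It is not the answerer's code: the helper names load_bills and diff_bills, the COMPARE_COLUMNS list, and the inline sample data are all made up for illustration. It assumes Python 3 and that columns 1, 2 and 4 (CustomerName, IsActive, IsPayScan) are the ones to compare, with column 0 (BillingNumber) as the key.

```python
import csv
import io

# Hypothetical sample data standing in for the old and new billing files.
OLD_CSV = '''BillingNumber,CustomerName,IsActive,IsCreditHold,IsPayScan,City,State
"2","CHARLIE RYAN","Yes","No","Yes","Reading","PA"
"3","INSURANCE BILLS","","","","",""
"4","AAA","","","","",""
'''
NEW_CSV = '''BillingNumber,CustomerName,IsActive,IsCreditHold,IsPayScan,City,State
"2","CHARLIE RYAN","No","No","Yes","Reading","PA"
"4","AAA","","","","",""
"5","NEW CUSTOMER","Yes","","","",""
'''

# Assumption: columns 1, 2 and 4 of the file; column 0 is the key.
COMPARE_COLUMNS = ["CustomerName", "IsActive", "IsPayScan"]


def load_bills(csv_file):
    """Index every row of an open CSV file by its BillingNumber."""
    return {row["BillingNumber"]: row for row in csv.DictReader(csv_file)}


def diff_bills(old_bills, new_bills, columns=COMPARE_COLUMNS):
    """Return (added, removed, modified) between two billing dictionaries."""
    added = sorted(set(new_bills) - set(old_bills))
    removed = sorted(set(old_bills) - set(new_bills))
    modified = {}
    # Only keys present in both files can be "modified".
    for key in set(old_bills) & set(new_bills):
        changes = {
            col: (old_bills[key][col], new_bills[key][col])
            for col in columns
            if old_bills[key][col] != new_bills[key][col]
        }
        if changes:
            modified[key] = changes
    return added, removed, modified


old_bills = load_bills(io.StringIO(OLD_CSV))
new_bills = load_bills(io.StringIO(NEW_CSV))
added, removed, modified = diff_bills(old_bills, new_bills)
print("Added:", added)
print("Removed:", removed)
print("Modified:", modified)
```

With real files you would pass the objects returned by open('old.csv', newline='') and open('new.csv', newline='') to load_bills instead of the io.StringIO wrappers. Because everything is keyed on BillingNumber, an inserted or missing row no longer throws off the rows that follow it.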
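Answer 1 (score: 0)
Are you set on writing this yourself? If this is a programming exercise, all power to you. Otherwise, look for a tool called "diff", which probably exists in some form you already have access to. It is built into many other tools, such as text editors like vim, emacs, and Notepad++, and version control systems like Subversion, Mercurial, and Git.
I would recommend that you use an established workhorse rather than reinventing the wheel. git diff is a beast.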
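Answer 2 (score: 0)
Reading your comment:
This is just something I'm trying to figure out. They gave the new technicians at work a list of tasks, and the people they hire have to solve this one.
They are most likely looking for some command-line fu. Something like
diff <(awk -F "\"*,\"*" '{print $1,$2,$3,$5}' csv1.csv) <(awk -F "\"*,\"*" '{print $1,$2,$3,$5}' csv2.csv)
will work in bash. The command uses the diff tool to compare only the columns selected using awk.
This is obviously not a Python-based solution. However, it does show the power of simple Unix-based tools.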
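If you do want to stay in Python, roughly the same idea (keep only the fields of interest, then run a line diff) can be sketched with the standard library's csv and difflib modules. This is only an illustration under some assumptions: the project helper and the inline sample data are hypothetical, and unlike the awk version, csv.reader strips the surrounding quotes from every field.

```python
import csv
import difflib
import io

# Hypothetical stand-ins for csv1.csv and csv2.csv.
CSV1 = '''BillingNumber,CustomerName,IsActive,IsCreditHold,IsPayScan,City,State
"2","CHARLIE RYAN","Yes","No","Yes","Reading","PA"
"4","AAA","","","","",""
'''
CSV2 = '''BillingNumber,CustomerName,IsActive,IsCreditHold,IsPayScan,City,State
"2","CHARLIE RYAN","No","No","Yes","Reading","PA"
"4","AAA","","","","Boston",""
'''

COLUMNS = (0, 1, 2, 4)  # the same fields as awk's $1, $2, $3, $5


def project(csv_file, columns=COLUMNS):
    """Return one space-joined line per row, keeping only the chosen columns."""
    return [" ".join(row[i] for i in columns) for row in csv.reader(csv_file)]


left = project(io.StringIO(CSV1))
right = project(io.StringIO(CSV2))

# unified_diff yields header, context, "-" and "+" lines, like `diff -u`.
diff_lines = list(
    difflib.unified_diff(left, right, fromfile="csv1.csv", tofile="csv2.csv", lineterm="")
)
print("\n".join(diff_lines))
```

In this sample the change to City on row 4 is not reported, because column 5 is not among the projected fields, while the IsActive change on row 2 shows up as a removed line and an added line.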
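Answer 3 (score: 0)
Since the requirements for this kind of thing tend to spiral, I think it's worth putting the data into an SQLite database.
That is because the logic for detecting whether a row was deleted or is simply new can be hard to get right.
In the code below I assume that BillingNumber is the id and does not change.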
import csv
import sqlite3

con = sqlite3.connect(":memory:")
cursor = con.cursor()
columns = "BillingNumber,CustomerName,IsActive,IsCreditHold,IsPayScan,City,State"
cursor.execute("CREATE TABLE left (%s);" % columns)
cursor.execute("CREATE TABLE right (%s);" % columns)
placeholders = ",".join("?" * len(columns.split(',')))

def reader(filename):
    # Yield every line of the file except the header row.
    with open(filename) as f:
        for (lineno, line) in enumerate(f):
            if lineno > 0:  # skip header
                yield line

def load_table(tablename, filename):
    for row in csv.reader(reader(filename)):
        cursor.execute("INSERT INTO %s VALUES(%s);" % (tablename, placeholders), row)

load_table("left", "left.csv")
load_table("right", "right.csv")

if False:  # change to True to dump both tables while debugging
    print("LEFT")
    for row in cursor.execute("SELECT * from left;"):
        print(row[0])
    print("RIGHT")
    for row in cursor.execute("SELECT * from right;"):
        print(row)

def dataset(tablename, columns):
    for row in cursor.execute("SELECT * from %s;" % tablename):
        yield tuple(row[x] for x in columns)

# To use if raw data required.
#left_dataset = dataset("left", [0,1,2,4])
#right_dataset = dataset("right", [0,1,2,4])

# COMPARE functions.
def different_rows():
    # Rows present in both tables whose compared columns differ.
    q = """SELECT left.*, right.*
    FROM left, right
    WHERE left.BillingNumber = right.BillingNumber
    AND ( left.CustomerName != right.CustomerName OR
          left.IsActive != right.IsActive OR
          left.IsPayScan != right.IsPayScan )
    ;
    """
    for row in cursor.execute(q):
        print("DIFFERENCE", row)

def new_rows():
    # Rows whose BillingNumber exists only in the right-hand table.
    q = """SELECT right.*
    FROM right
    WHERE right.BillingNumber NOT IN ( SELECT BillingNumber FROM left)
    ;
    """
    for row in cursor.execute(q):
        print("NEW", row)

different_rows()
new_rows()
The OP will have to write different functions to compare the data, but overall I think using SQL may be easier.
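As one example of such an extra function, deleted rows can be found by reversing the NOT IN test used for new rows. The snippet below is a self-contained sketch rather than part of the answer's code: the table names old_bills and new_bills (playing the roles of the "left" and "right" tables), the deleted_rows helper, and the inline sample rows are assumptions made for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
columns = "BillingNumber,CustomerName,IsActive,IsCreditHold,IsPayScan,City,State"

# Hypothetical table names: old_bills is the earlier file, new_bills the later one.
con.execute("CREATE TABLE old_bills (%s)" % columns)
con.execute("CREATE TABLE new_bills (%s)" % columns)

# Hypothetical sample rows; billing number 3 is missing from the new data.
old_data = [
    ("2", "CHARLIE RYAN", "Yes", "No", "Yes", "Reading", "PA"),
    ("3", "INSURANCE BILLS", "", "", "", "", ""),
    ("4", "AAA", "", "", "", "", ""),
]
new_data = [
    ("2", "CHARLIE RYAN", "Yes", "No", "Yes", "Reading", "PA"),
    ("4", "AAA", "", "", "", "", ""),
]
con.executemany("INSERT INTO old_bills VALUES (?,?,?,?,?,?,?)", old_data)
con.executemany("INSERT INTO new_bills VALUES (?,?,?,?,?,?,?)", new_data)


def deleted_rows(connection):
    """Return (BillingNumber, CustomerName) for rows only in old_bills."""
    q = """SELECT old_bills.BillingNumber, old_bills.CustomerName
           FROM old_bills
           WHERE old_bills.BillingNumber NOT IN (SELECT BillingNumber FROM new_bills)
           ORDER BY old_bills.BillingNumber;"""
    return connection.execute(q).fetchall()


deleted = deleted_rows(con)
for row in deleted:
    print("DELETED", row)
```

Together with queries for changed and new rows, this covers the three cases from the question: additions, deletions, and modifications.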