I have two CSV files containing all of the products in a database, and I'm currently comparing them with Excel formulas, which is a very slow process. (Roughly 130,000 rows in each file.)
I've written a Python script that handles a small sample of the data fine, but it isn't practical on the real-world files.
The CSV layout is:
ID, Product Title, Cost, Price 1, Price 2, Price 3, Status
import csv
data_old = []
data_new = []
# Load last week's data, skipping the header row
with open(file_path_old) as f1:
    data = csv.reader(f1, delimiter=",")
    next(data)  # skip header
    for row in data:
        data_old.append(row)

# Load this week's data
with open(file_path_new) as f2:
    data = csv.reader(f2, delimiter=",")
    next(data)  # skip header
    for row in data:
        data_new.append(row)

# Compare every new row against every old row by ID
for d1 in data_new:
    for d2 in data_old:
        if d2[0] == d1[0]:
            # IDs match, so check the rest of the data in the same row
            if d2[1] != d1[1]:
                ...
            if d2[2] != d1[2]:
                ...
The problem with the above is the nested for loop: it walks every row of the second dataset 130,000 times (calling it slow is an understatement).
What I'm trying to achieve is a list of all products where any of the title, cost, three prices, or status has changed compared with the previous week's data, with a boolean flag showing which fields changed.
Desired CSV output format:
ID,Old Title,New Title,Changed,Old Cost,New Cost,Changed,...
123,ABC,ABC,False,£12,£13,True,...
Solution:
import pandas as pd
# Read CSVs
old = pd.read_csv(old_file, sep=",")
new = pd.read_csv(new_file, sep=",")
# Join the data into a single table, keyed on part number,
# with an 'Old'/'New' level added to the columns
df_join = pd.concat([old.set_index('PARTNO'), new.set_index('PARTNO')],
                    axis='columns', keys=['Old', 'New'])
# Swap the column levels so the old and new values of each field sit side by side
df_swap = df_join.swaplevel(axis='columns')[old.columns[1:]]
# Output to CSV
df_swap.to_csv(output_file)
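The join above only lays the old and new values side by side; it does not yet produce the boolean "Changed" flags described in the question. A minimal sketch of one way to add them, reusing df_join and old from above (the 'Flags' level name and the filtering to changed rows are illustrative additions, not part of the original solution):

# Add a boolean "<field> Changed" column for every compared field.
# Rows present in only one file show NaN on the missing side and will flag as changed.
for col in old.columns[1:]:
    df_join[('Flags', col + ' Changed')] = df_join[('Old', col)] != df_join[('New', col)]

# Keep only the rows where at least one field changed, then write them out
flag_cols = [c for c in df_join.columns if c[0] == 'Flags']
changed_only = df_join[df_join[flag_cols].any(axis='columns')]
changed_only.to_csv(output_file)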
Answer 0 (score: 1)
Just use pandas:
import pandas as pd
old = pd.read_csv(file_path_old, sep=',')
new = pd.read_csv(file_path_new, sep=',')
Then you can do whatever you need (just read the docs). For example, to compare the titles:
old['Title'] == new['Title']
gives you a boolean array with one entry per row of the files.
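Note that comparing columns directly like this assumes both files list the same products in the same row order. If they might not, one option (a sketch, not part of this answer, assuming the ID column is literally named 'ID' and the title column 'Title' as above) is to merge on the ID first:

# Align the two files on ID so row order no longer matters, then compare field by field
merged = old.merge(new, on='ID', suffixes=('_old', '_new'))
changed_title = merged['Title_old'] != merged['Title_new']
print(merged.loc[changed_title, ['ID', 'Title_old', 'Title_new']])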
Answer 1 (score: 0)
Do you care about added and removed products? If not, you can use a dictionary to get O(n) performance.
Pick one of the CSV files and push it into a dictionary keyed by id. Then use dictionary lookups to find the products that have changed.
Note that for brevity I've reduced your data to a single column.
data_old = [
    (1, 'alpha'),
    (2, 'bravo'),
    (3, 'delta'),
    (5, 'echo')
]
data_new = [
    (1, 'alpha'),
    (2, 'zulu'),
    (4, 'foxtrot'),
    (6, 'mike'),
    (7, 'lima'),
]
changed_products = []
new_product_map = {id: product for (id, product) in data_new}
for id, old_product in data_old:
    if id in new_product_map and new_product_map[id] != old_product:
        changed_products.append(id)

print('Changed products: ', changed_products)
You can shorten this further with a list comprehension:
new_product_map = {id: product for (id, product) in data_new}
changed_products = [id for (id, old_product) in data_old if id in new_product_map and new_product_map[id] != old_product]
print('Changed products: ', changed_products)
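The same dictionary idea extends to the full rows from the question (id, title, cost, three prices, status). A rough sketch, assuming data_old and data_new are lists of whole CSV rows with the id in column 0 (the flags list mirrors the per-field "Changed" columns asked for):

# Map each new row by its id for O(1) lookups
new_by_id = {row[0]: row for row in data_new}

changed = []
for old_row in data_old:
    new_row = new_by_id.get(old_row[0])
    if new_row is None:
        continue  # product no longer present in the new file
    # One boolean per compared field (title, cost, price 1-3, status)
    flags = [o != n for o, n in zip(old_row[1:], new_row[1:])]
    if any(flags):
        changed.append((old_row[0], flags))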
The diff algorithm below also tracks insertions and deletions. You can use it if your CSV files are sorted by id.
If the CSV files are in no sensible order, you can sort the data in O(n*lg(n)) time after loading it, then carry on with the comparison.
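For instance, a minimal sort by id (assuming the (id, product) tuples used above):

# Sort both lists by id so the merge-style diff below can walk them in step
data_old.sort(key=lambda pair: pair[0])
data_new.sort(key=lambda pair: pair[0])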
Either way, this will be faster than the O(n^2) loop in the original post:
data_old = ...  # same setup as before
data_new = ...  # ditto

old_index = 0
new_index = 0

new_products = []
deleted_products = []
changed_products = []

# Walk both sorted lists in step, comparing ids
while old_index < len(data_old) and new_index < len(data_new):
    (old_id, old_product) = data_old[old_index]
    (new_id, new_product) = data_new[new_index]

    if old_id < new_id:
        print('Product removed : %d' % old_id)
        deleted_products.append(old_id)
        old_index += 1
    elif new_id < old_id:
        print('Product added : %d' % new_id)
        new_products.append(new_id)
        new_index += 1
    else:
        if old_product != new_product:
            print('Product %d changed from %s to %s' % (old_id, old_product, new_product))
            changed_products.append(old_id)
        else:
            print('Product %d did not change' % old_id)
        old_index += 1
        new_index += 1

# Anything left over in one list was deleted from, or added to, the other
if old_index != len(data_old):
    num_deleted = len(data_old) - old_index
    print('The last %d old items were deleted' % num_deleted)
    deleted_products += [id for (id, _) in data_old[old_index:]]
elif new_index != len(data_new):
    num_added = len(data_new) - new_index
    print('The last %d new items were completely new' % num_added)
    new_products += [id for (id, _) in data_new[new_index:]]

print('New products: ', new_products)
print('Changed products: ', changed_products)
print('Deleted products: ', deleted_products)
PS: The suggestion to use pandas is a good one. Use it if you can.