I have 2 files,
orders.csv:
OrderNo,OrderDate,LineNo,ShipToAddressNo,ItemCode,QtyOrdered,QtyShipped
528758,1/3/2017,1,1358538,111931,70,70
528791,1/3/2017,10,1254798,110441,300,300
528791,1/3/2017,1,1254798,1029071,10,10
528791,1/3/2017,2,1254798,1033341,10,10
canceled.csv:
OrderNo,OrderDate,LineNo,ShipToAddressNo,ItemCode,QtyOrdered,QtyShipped
529027,1/4/2017,6,43823775,1029070,1,1
529027,1/4/2017,5,43823775,1029071,1,1
529027,1/4/2017,12,43823775,1038324,1,1
529027,1/4/2017,13,43823775,1039306,1,1
Some of the OrderNo values on the canceled sheet never appear on the orders sheet. In addition, some rows have an OrderNo that does appear on the orders sheet but an ItemCode that does not.
I have imported both into pandas DataFrames. I am trying to find a good way to check canceled.csv against orders.csv on OrderNo and ItemCode, and then write the matching rows (including all the other fields) to a new CSV, checked.csv. Alternatively, writing all rows to a new file with an extra column indicating whether each row matched would also work. Any suggestions or tips to steer me in the right direction would be greatly appreciated!
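A minimal sketch of one possible approach (not from the post) is a key-based merge in pandas. It assumes the two files have been read into DataFrames as below, that the key columns parse with the same dtype in both files, and the file name checked_all.csv is only illustrative:

import pandas as pd

orders_df = pd.read_csv('orders.csv')
canceled_df = pd.read_csv('canceled.csv')

# Unique (OrderNo, ItemCode) pairs from orders.csv; deduplicating keeps the
# merge from multiplying canceled rows that match several order lines.
keys = orders_df[['OrderNo', 'ItemCode']].drop_duplicates()

# A left merge with indicator=True adds a '_merge' column whose value is
# 'both' when the pair also exists in orders.csv.
flagged = canceled_df.merge(keys, on=['OrderNo', 'ItemCode'],
                            how='left', indicator=True)
flagged['Matched'] = flagged['_merge'] == 'both'

# Variant 1: only the matching rows, written to checked.csv.
flagged.loc[flagged['Matched']].drop(columns=['_merge', 'Matched']) \
       .to_csv('checked.csv', index=False)

# Variant 2: all rows plus the Matched flag column.
flagged.drop(columns='_merge').to_csv('checked_all.csv', index=False)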
Update:
As @Matt L. pointed out, using iterrows and df.loc() with a double condition should get me what I want. I tested it successfully on a small test file, but when I ran it against the real file (3,600 rows, roughly 1,900 expected matches), the result contained only 34 rows. Here is the output:
,OrderNo,OrderDate,LineNo,ShipToAddressNo,ItemCode,QtyOrdered,QtyShipped
11,528980,1/4/2017,1,1912593,1039823,1,1
29,529222,1/4/2017,2,1254693,1038323,1,1
30,529285,1/4/2017,3,1254692,1041108,1,1
516,532202,1/18/2017,9,2203715,10135131,8,8
651,532699,1/19/2017,1,2060310,10098739,1,1
652,532699,1/19/2017,2,2060310,110441,1,1
726,533083,1/19/2017,7,43824548,10098739,10,10
762,533207,1/19/2017,1,43824564,10098739,234,234
767,533228,1/19/2017,2,1254707,10098739,11,11
779,533248,1/19/2017,1,1642075,10098739,1,1
780,533250,1/19/2017,1,1254733,10098739,9,9
781,533252,1/19/2017,1,1254706,10098739,1,1
782,533254,1/19/2017,1,1751514,10098739,10,10
783,533258,1/19/2017,3,1254711,10098739,7,7
784,533260,1/19/2017,1,1254723,10098739,12,12
786,533320,1/20/2017,4,1254612,10098739,35,35
899,534785,1/26/2017,6,2203715,10135358,19,19
1005,535540,1/30/2017,7,1254612,1040774,5,5
1011,535549,1/30/2017,5,1254612,10135131,3,3
1016,535563,1/30/2017,12,43823870,1040765,4,4
1020,535591,1/30/2017,13,43824564,10135132,30,30
1375,536840,2/3/2017,6,43823585,1041105,5,5
1376,536840,2/3/2017,7,43823585,1041107,3,3
1444,537013,2/3/2017,6,1255628,10137993,1,1
1455,537075,2/3/2017,9,1255617,10135364,2,2
1657,537570,2/6/2017,1,1254612,10135139,1,1
1658,537570,2/6/2017,2,1254612,10135138,3,3
1659,537570,2/6/2017,3,1254612,10135140,1,1
1660,537570,2/6/2017,4,1254612,10135131,1,1
1808,537667,2/6/2017,12,43823870,10137992,2,2
1847,537771,2/7/2017,5,1276705,1041106,4,4
2760,539524,2/13/2017,6,1254798,1038323,10,10
3575,542362,2/23/2017,11,1254612,1041108,2,2
3579,542835,2/23/2017,13,1255235,10137993,5,5
The indices in the result show that it is iterating over all of canceled.csv, but it only finds 34 rows with a matching OrderNo and ItemCode. That is not correct.
Answer 0 (score: 0)
You can do this with iterrows, df.loc(), and a conditional expression. Note that my .loc() query has two conditions, so a result has to match on both columns. If you are working with larger files, you can get better efficiency by replacing the conditional with some other kind of logic (for example try/except).
import pandas as pd

orders_df = pd.read_csv('orders.csv')
canceled_df = pd.read_csv('canceled.csv')

matched = []
for _, row in canceled_df.iterrows():
    # Keep the canceled row only if its (OrderNo, ItemCode) pair
    # also appears in orders.csv.
    match1 = orders_df.loc[(orders_df['OrderNo'] == row['OrderNo']) &
                           (orders_df['ItemCode'] == row['ItemCode'])]
    if len(match1) > 0:
        matched.append(row)

result = pd.DataFrame(matched)
result.to_csv('result.csv')
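For the larger-file case mentioned above, one possible variant (my own sketch, not the answerer's code) is to build the set of key pairs once and test membership per row, instead of scanning orders_df with .loc on every iteration:

import pandas as pd

orders_df = pd.read_csv('orders.csv')
canceled_df = pd.read_csv('canceled.csv')

# Build the lookup set of (OrderNo, ItemCode) pairs once.
order_keys = set(zip(orders_df['OrderNo'], orders_df['ItemCode']))

# One membership test per canceled row; no repeated DataFrame scans.
mask = [(o, i) in order_keys
        for o, i in zip(canceled_df['OrderNo'], canceled_df['ItemCode'])]
canceled_df[mask].to_csv('result.csv', index=False)

Note that this, like the .loc comparison above, only matches when OrderNo and ItemCode were parsed with the same dtype in both files; a value read as the integer 110441 in one file will not match the string '110441' in the other.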