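I'm relatively new to Python, but I've been working through Wes McKinney's "Python for Data Analysis" for the past few weeks. I've spent hours trying to solve my current problem, and I could use some help. I have a dataset with details of the shipments sent during this calendar year; because I receive new data every month, some of those details may have changed. I've already worked out how to identify the shipments that have changed and what the changes might be.
So, suppose I've determined that these shipments (in the DataFrame original) have changed: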
ID Code Mode Amount From To Weight Cube Service_Date
MNO123 BBB Air 50 M1234 M9876 60 6 1-1-2013
GHI123 AAA Air 50 M1234 M9876 80 8 1-1-2013
JKL123 AAA Ship 50 M1234 M9876 70 7 1-1-2013
And I've identified the potential changes (in the DataFrame changes):
ID Code Mode Amount From To Weight Cube Service_Date
MNO123 BBB Air 50 M1234 M9876 60 6 2-2-2013
MNO123 BBB Air 60 M1234 M9876 60 6 2-2-2013
MNO123 BBB Air 70 M1234 M1111 60 6 2-2-2013
GHI123 AAA Air 65 M1234 M9876 80 8 1-1-2013
JKL123 AAA Ship 65 M1234 M9876 70 7 1-1-2013
JKL123 AAA Ship 65 M1234 M9876 70 8 1-1-2013
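What I want to do is add a Count column to the changes DataFrame that totals the number of values matching the corresponding values in the original DataFrame. So, because Code, Mode, Amount, From, To, Weight, and Cube all match, the Count column would get a value of 7 for the first observation. Likewise, with one fewer matching value, the second observation would get a count of 6, and the third a count of 5.
The result I'm looking for is as follows: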
ID Code Mode Amount From To Weight Cube Service_Date Count
MNO123 BBB Air 50 M1234 M9876 60 6 2-2-2013 7
MNO123 BBB Air 60 M1234 M9876 60 6 2-2-2013 6
MNO123 BBB Air 70 M1234 M1111 60 6 2-2-2013 5
GHI123 AAA Air 65 M1234 M9876 80 8 1-1-2013 7
JKL123 AAA Ship 65 M1234 M9876 70 7 1-1-2013 7
JKL123 AAA Ship 65 M1234 M9876 70 8 1-1-2013 6
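From Wes's book and many somewhat similar posts on this site, I believe I need to use df.iterrows(), but I'm struggling to iterate over two DataFrames while checking for and counting matching values.
Here is my most recent attempt: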
for i in changes.iterrows():
    for i in original.iterrows():
        changes['count'] = 0
        if changes(i) == original(i):
            changes['count'] +=1
Thanks in advance for your time and effort!
Answer 0 (score: 1)
Here's one way:
Make sure the index of both the original and changes DataFrames is set to ID:
In [11]: original.set_index('ID', inplace=True)
In [12]: original
Out[12]:
Code Mode Amount From To Weight Cube Service_Date
ID
MNO123 BBB Air 50 M1234 M9876 60 6 1-1-2013
GHI123 AAA Air 50 M1234 M9876 80 8 1-1-2013
JKL123 AAA Ship 50 M1234 M9876 70 7 1-1-2013
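You also need a small hack here so that the DataFrame eq method can be used: unfortunately, this means sorting changes by its index (alternatively, you could keep track of the original unique index).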
In [13]: changes = changes.set_index('ID').sort_index()
Select the columns you're interested in (alternatively, you could drop the Service_Date column):
In [14]: count_columns = ['Code', 'Mode', 'Amount', 'From', 'To', 'Weight', 'Cube']
Then you can use the DataFrame eq method:
In [15]: changes.eq(original)[count_columns]
Out[15]:
Code Mode Amount From To Weight Cube
ID
GHI123 True True False True True True True
JKL123 True True False True True True True
JKL123 True True False True True True False
MNO123 True True True True True True True
MNO123 True True False True True True True
MNO123 True True False True False True True
And sum across each row:
In [16]: changes.eq(original)[count_columns].sum(1)
Out[16]:
ID
GHI123 6
JKL123 6
JKL123 5
MNO123 7
MNO123 6
MNO123 5
dtype: int64
In [17]: changes['match'] = changes.eq(original)[count_columns].sum(1).values
In [18]: changes
Out[18]:
Code Mode Amount From To Weight Cube Service_Date match
ID
GHI123 AAA Air 65 M1234 M9876 80 8 1-1-2013 6
JKL123 AAA Ship 65 M1234 M9876 70 7 1-1-2013 6
JKL123 AAA Ship 65 M1234 M9876 70 8 1-1-2013 5
MNO123 BBB Air 50 M1234 M9876 60 6 2-2-2013 7
MNO123 BBB Air 60 M1234 M9876 60 6 2-2-2013 6
MNO123 BBB Air 70 M1234 M1111 60 6 2-2-2013 5
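Note: these counts differ slightly from the ones in your expected output. For example, the GHI123 row matches in only 6 columns, because Amount changed from 50 to 65.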
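The steps above can be condensed into a self-contained sketch. It rebuilds the sample data from the question and, as an alternative to sorting, looks up the matching original row for every change with .loc, so the rows of changes keep their order. The helper name add_match_count is hypothetical, and the sketch assumes each ID occurs exactly once in original.

```python
import pandas as pd

columns = ['ID', 'Code', 'Mode', 'Amount', 'From', 'To',
           'Weight', 'Cube', 'Service_Date']

original = pd.DataFrame([
    ['MNO123', 'BBB', 'Air', 50, 'M1234', 'M9876', 60, 6, '1-1-2013'],
    ['GHI123', 'AAA', 'Air', 50, 'M1234', 'M9876', 80, 8, '1-1-2013'],
    ['JKL123', 'AAA', 'Ship', 50, 'M1234', 'M9876', 70, 7, '1-1-2013'],
], columns=columns)

changes = pd.DataFrame([
    ['MNO123', 'BBB', 'Air', 50, 'M1234', 'M9876', 60, 6, '2-2-2013'],
    ['MNO123', 'BBB', 'Air', 60, 'M1234', 'M9876', 60, 6, '2-2-2013'],
    ['MNO123', 'BBB', 'Air', 70, 'M1234', 'M1111', 60, 6, '2-2-2013'],
    ['GHI123', 'AAA', 'Air', 65, 'M1234', 'M9876', 80, 8, '1-1-2013'],
    ['JKL123', 'AAA', 'Ship', 65, 'M1234', 'M9876', 70, 7, '1-1-2013'],
    ['JKL123', 'AAA', 'Ship', 65, 'M1234', 'M9876', 70, 8, '1-1-2013'],
], columns=columns)

count_columns = ['Code', 'Mode', 'Amount', 'From', 'To', 'Weight', 'Cube']


def add_match_count(changes, original, count_columns):
    """Return a copy of changes with a 'match' column (hypothetical helper)."""
    # Assumption: every ID appears exactly once in original.
    lookup = original.set_index('ID')[count_columns]
    # One original row per change row, in the order of changes.
    aligned = lookup.loc[changes['ID']]
    # Compare position by position; to_numpy() sidesteps index alignment.
    equal = changes[count_columns].to_numpy() == aligned.to_numpy()
    result = changes.copy()
    result['match'] = equal.sum(axis=1)
    return result


result = add_match_count(changes, original, count_columns)
print(result[['ID', 'match']])
```

Because the comparison runs on plain arrays, the duplicate IDs in changes cause no alignment issues, and the resulting counts sit next to the rows in the order given in the question.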
Answer 1 (score: 0)
You don't need to iterate over the rows:
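def count_equal(row, original):
    """Counts the number of equal elements between row and the original row with the same ID."""
    # Assumes ID is still a regular column and is unique in original.
    original_row = original[original['ID'] == row['ID']].iloc[0]
    # Every column is compared, so an unchanged Service_Date counts as a match too.
    equal_values = (row == original_row).values
    return equal_values.sum() - 1  # subtract 1 because ID doesn't count

changes['count'] = changes.apply(count_equal, args=(original,), axis=1)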
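For larger frames, a merge-based variant avoids calling a Python function once per row. The following is a minimal sketch of that idea rather than part of the answer above. It assumes ID is a regular column and occurs once in original, it uses the hypothetical suffix _orig, and it works on a small stand-in sample with only two compared columns.

```python
import pandas as pd

# Reduced stand-in data (assumption: only Amount and To are compared here).
original = pd.DataFrame({
    'ID': ['MNO123', 'GHI123'],
    'Amount': [50, 50],
    'To': ['M9876', 'M9876'],
})
changes = pd.DataFrame({
    'ID': ['MNO123', 'MNO123', 'MNO123', 'GHI123'],
    'Amount': [50, 60, 70, 65],
    'To': ['M9876', 'M9876', 'M1111', 'M9876'],
})
count_columns = ['Amount', 'To']

# A left merge keeps one row per change, in the order of changes.
# validate='many_to_one' raises if an ID is duplicated in original.
merged = changes.merge(original, on='ID', how='left',
                       suffixes=('', '_orig'), validate='many_to_one')

# Compare each column with its '_orig' counterpart and add up the matches.
changes['count'] = sum(
    (merged[col] == merged[col + '_orig']).astype(int) for col in count_columns
).to_numpy()
print(changes)
```

The merge places each original value beside its changed counterpart, so the count reduces to a handful of column-wise comparisons.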