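I apologize in advance for the wall of code and the poor formatting. I have tried every approach I could think of to find out why these data frames return False when I apply DataFrame.equals() and, later, df1 == df2. I cannot find any difference between them.
I obtained the second data frame (dftest) by applying groupby to the first (bdf) on all columns except ORDER_QTY. Since the two data frames have the same number of rows, I assumed nothing had changed (which did not surprise me). To make sure, though, I compared them with bdf.equals(dftest), and it returned False. This was after I made sure the columns were in the same order. The only other thing I noticed is that the data frames differ in size (memory usage). Beyond that I am lost...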
In:
dftest = bdf.groupby(['SITE', 'CUST', 'ORDER_NUMBER', 'ORDER_DATE', 'PURCHASE_ORDER', 'CHANNEL', 'SHIP_TO', 'PROD_LINE', 'GROUP_NUMBER', 'DESCRIPTION', 'ITEM', 'FW_END_DT', 'BPS_INCLUDE']).sum().reset_index()
dftest = dftest[['SITE', 'CUST', 'ORDER_NUMBER', 'ORDER_DATE', 'PURCHASE_ORDER', 'CHANNEL', 'SHIP_TO', 'PROD_LINE', 'GROUP_NUMBER', 'DESCRIPTION', 'ITEM', 'ORDER_QTY', 'FW_END_DT', 'BPS_INCLUDE']]
print(bdf.equals(dftest))
print(bdf.columns)
print(dftest.columns)
Out:
False
Index(['SITE', 'CUST', 'ORDER_NUMBER', 'ORDER_DATE', 'PURCHASE_ORDER',
'CHANNEL', 'SHIP_TO', 'PROD_LINE', 'GROUP_NUMBER', 'DESCRIPTION',
'ITEM', 'ORDER_QTY', 'FW_END_DT', 'BPS_INCLUDE'],
dtype='object')
Index(['SITE', 'CUST', 'ORDER_NUMBER', 'ORDER_DATE', 'PURCHASE_ORDER',
'CHANNEL', 'SHIP_TO', 'PROD_LINE', 'GROUP_NUMBER', 'DESCRIPTION',
'ITEM', 'ORDER_QTY', 'FW_END_DT', 'BPS_INCLUDE'],
dtype='object')
^ The columns appear to be identical, yet bdf.equals(dftest) returns False.
In:
bdf.info()
dftest.info()
Out:
<class 'pandas.core.frame.DataFrame'>
Index: 53025 entries, 0 to 53024
Data columns (total 14 columns):
SITE 53025 non-null object
CUST 53025 non-null object
ORDER_NUMBER 53025 non-null object
ORDER_DATE 53025 non-null datetime64[ns]
PURCHASE_ORDER 53025 non-null object
CHANNEL 53025 non-null object
SHIP_TO 53025 non-null object
PROD_LINE 53025 non-null object
GROUP_NUMBER 53025 non-null object
DESCRIPTION 53025 non-null object
ITEM 53025 non-null object
ORDER_QTY 53025 non-null int64
FW_END_DT 53025 non-null datetime64[ns]
BPS_INCLUDE 53025 non-null int64
dtypes: datetime64[ns](2), int64(2), object(10)
memory usage: 6.1+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53025 entries, 0 to 53024
Data columns (total 14 columns):
SITE 53025 non-null object
CUST 53025 non-null object
ORDER_NUMBER 53025 non-null object
ORDER_DATE 53025 non-null datetime64[ns]
PURCHASE_ORDER 53025 non-null object
CHANNEL 53025 non-null object
SHIP_TO 53025 non-null object
PROD_LINE 53025 non-null object
GROUP_NUMBER 53025 non-null object
DESCRIPTION 53025 non-null object
ITEM 53025 non-null object
ORDER_QTY 53025 non-null int64
FW_END_DT 53025 non-null datetime64[ns]
BPS_INCLUDE 53025 non-null int64
dtypes: datetime64[ns](2), int64(2), object(10)
memory usage: 5.7+ MB
^ As I said, everything is the same except the size (memory usage).
In:
common = bdf.merge(dftest,on=['SITE', 'CUST', 'ORDER_NUMBER', 'ORDER_DATE', 'PURCHASE_ORDER', 'CHANNEL', 'SHIP_TO', 'PROD_LINE', 'GROUP_NUMBER', 'DESCRIPTION', 'ITEM', 'ORDER_QTY', 'FW_END_DT', 'BPS_INCLUDE'], how='outer', indicator=True)
print(common[common['_merge'] != 'both'])
Out:
Empty DataFrame
Columns: [SITE, CUST, ORDER_NUMBER, ORDER_DATE, PURCHASE_ORDER, CHANNEL, SHIP_TO, PROD_LINE, GROUP_NUMBER, DESCRIPTION, ITEM, ORDER_QTY, FW_END_DT, BPS_INCLUDE, _merge]
Index: []
^ I tried merging and then selecting the rows that are not present in both dfs.
In:
bdf[(~bdf.SITE.isin(common.SITE))&(~bdf.CUST.isin(common.CUST))&(~bdf.ORDER_NUMBER.isin(common.ORDER_NUMBER))&(~bdf.ORDER_DATE.isin(common.ORDER_DATE))&(~bdf.PURCHASE_ORDER.isin(common.PURCHASE_ORDER))&(~bdf.CHANNEL.isin(common.CHANNEL))&(~bdf.SHIP_TO.isin(common.SHIP_TO))&(~bdf.PROD_LINE.isin(common.PROD_LINE))&(~bdf.GROUP_NUMBER.isin(common.GROUP_NUMBER))&(~bdf.DESCRIPTION.isin(common.DESCRIPTION))&(~bdf.ITEM.isin(common.ITEM))&(~bdf.ORDER_QTY.isin(common.ORDER_QTY))&(~bdf.FW_END_DT.isin(common.FW_END_DT))&(~bdf.BPS_INCLUDE.isin(common.BPS_INCLUDE))]
Out:
SITE CUST ORDER_NUMBER ORDER_DATE PURCHASE_ORDER CHANNEL SHIP_TO PROD_LINE GROUP_NUMBER DESCRIPTION ITEM ORDER_QTY FW_END_DT BPS_INCLUDE
^ Nothing there either.
In:
(bdf == dftest).all().all()
Out:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-34-6c2f52f55e60> in <module>()
----> 1 (bdf == dftest).all().all()
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\ops.py in f(self, other)
1611 # Another DataFrame
1612 if not self._indexed_same(other):
-> 1613 raise ValueError('Can only compare identically-labeled '
1614 'DataFrame objects')
1615 return self._compare_frame(other, func, str_rep)
ValueError: Can only compare identically-labeled DataFrame objects
Aren't they labeled identically?
When I searched for this error, I found a suggestion to try the following:
In:
bdf.eq(dftest)
Out:
SITE CUST ORDER_NUMBER ORDER_DATE PURCHASE_ORDER CHANNEL SHIP_TO PROD_LINE GROUP_NUMBER DESCRIPTION ITEM ORDER_QTY FW_END_DT BPS_INCLUDE
0 False False False False False False False False False False False False False False
1 False False False False False False False False False False False False False False
2 False False False False False False False False False False False False False False
3 False False False False False False False False False False False False False False
4 False False False False False False False False False False False False False False
5 False False False False False False False False False False False False False False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
52995 False False False False False False False False False False False False False False
106050 rows × 14 columns
In this case, it looks like no pair of cells matches at all... :(
Am I missing something completely obvious?
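Two details in the output above may be worth a closer look: bdf.info() reports a plain Index while dftest.info() reports a RangeIndex, and bdf.eq(dftest) returns 106050 rows, which is exactly twice 53025. That pattern is consistent with the two frames carrying different row labels, so that eq() aligns on the union of the labels. The actual labels of bdf are not shown in the post, so the following is only a minimal sketch with made-up frames (`left` and `right` are hypothetical stand-ins for bdf and dftest, and the integer labels 100..102 are an assumption):

```python
import pandas as pd

# Hypothetical stand-ins: identical values, different row labels.
left = pd.DataFrame(
    {"ITEM": ["a", "b", "c"], "ORDER_QTY": [1, 2, 3]},
    index=[100, 101, 102],  # assumed labels that do not match 0..2
)
right = pd.DataFrame({"ITEM": ["a", "b", "c"], "ORDER_QTY": [1, 2, 3]})  # RangeIndex 0..2

# equals() compares the row labels as well as the values.
labels_differ = not left.index.equals(right.index)
equal_as_is = left.equals(right)

# eq() aligns on the union of row labels; no label overlaps here,
# so the result has len(left) + len(right) rows and every cell is False.
aligned = left.eq(right)

# To compare values only, discard the row labels on both sides first.
equal_ignoring_index = left.reset_index(drop=True).equals(
    right.reset_index(drop=True)
)

print(labels_differ, equal_as_is, len(aligned), equal_ignoring_index)
```

If the same thing happens with the real data, comparing bdf.index with dftest.index directly should show where the labels diverge.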
Answer 0: (score: 0)
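Are there any NaN / null / missing values in your data?
If so, groupby.sum() may replace them with, for example, 0 in columns with numeric dtypes.
If that is the cause, the result of groupby.first() should be the same as the original input.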
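The difference between the two aggregations can be seen on a toy frame. This is only a sketch: `df` and its values are made up, with column names borrowed from the question. The info() output in the question already lists non-null counts per column, which is one way to check for missing values; isna().sum() is another.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; only the column names come from the question.
df = pd.DataFrame({
    "ITEM": ["a", "b", "c"],
    "ORDER_QTY": [1.0, np.nan, 3.0],  # one missing quantity
})

# Count missing values per column.
missing_per_column = df.isna().sum()

# sum() skips NaN, so a group holding only NaN sums to 0.0.
summed = df.groupby("ITEM")["ORDER_QTY"].sum().reset_index()

# first() returns the first non-null value, or NaN if the group has none.
firsts = df.groupby("ITEM")["ORDER_QTY"].first().reset_index()

print(missing_per_column)
print(df.equals(summed))  # the NaN was replaced, so the frames differ
print(df.equals(firsts))  # equals() treats NaNs in the same position as equal
```

If isna().sum() is zero for every column, missing values are probably not what makes equals() return False.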