应用groupby后,DataFrames不相等,但我不知道为什么

时间:2019-02-06 21:05:30

标签: python pandas dataframe

我为随后的代码墙以及格式不正确表示歉意。我尝试了尽可能多的方法来查找导致在应用DataFrame.equals()或更高版本的df1 == df2时导致这些数据帧返回False的原因。我找不到它们之间的任何差异。

我通过将groupby应用于除ORDER_QTY以外的所有列的第一个(bdf),获得了第二个数据帧(dftest)。由于这两个数据帧的行数相同,因此我认为没有任何变化(这并不令我感到惊讶。)但是,请确保我使用bdf.equals(dftest)进行了比较,并返回false。这是在我确保列的顺序正确之后。我注意到的唯一另一件事是数据帧的大小不同。否则我会迷路...

In:
dftest = bdf.groupby(['SITE', 'CUST', 'ORDER_NUMBER', 'ORDER_DATE', 'PURCHASE_ORDER', 'CHANNEL', 'SHIP_TO', 'PROD_LINE', 'GROUP_NUMBER', 'DESCRIPTION', 'ITEM', 'FW_END_DT', 'BPS_INCLUDE']).sum().reset_index()
dftest = dftest[['SITE', 'CUST', 'ORDER_NUMBER', 'ORDER_DATE', 'PURCHASE_ORDER', 'CHANNEL', 'SHIP_TO', 'PROD_LINE', 'GROUP_NUMBER', 'DESCRIPTION', 'ITEM', 'ORDER_QTY', 'FW_END_DT', 'BPS_INCLUDE']]

print(bdf.equals(dftest))
print(bdf.columns)
print(dftest.columns)


Out:
False
Index(['SITE', 'CUST', 'ORDER_NUMBER', 'ORDER_DATE', 'PURCHASE_ORDER',
       'CHANNEL', 'SHIP_TO', 'PROD_LINE', 'GROUP_NUMBER', 'DESCRIPTION',
       'ITEM', 'ORDER_QTY', 'FW_END_DT', 'BPS_INCLUDE'],
      dtype='object')
Index(['SITE', 'CUST', 'ORDER_NUMBER', 'ORDER_DATE', 'PURCHASE_ORDER',
       'CHANNEL', 'SHIP_TO', 'PROD_LINE', 'GROUP_NUMBER', 'DESCRIPTION',
       'ITEM', 'ORDER_QTY', 'FW_END_DT', 'BPS_INCLUDE'],
      dtype='object')

^列似乎相同,但是bdf.equals(dftest)得出False

In:
bdf.info()
dftest.info()


Out:
<class 'pandas.core.frame.DataFrame'>
Index: 53025 entries, 0 to 53024
Data columns (total 14 columns):
SITE              53025 non-null object
CUST              53025 non-null object
ORDER_NUMBER      53025 non-null object
ORDER_DATE        53025 non-null datetime64[ns]
PURCHASE_ORDER    53025 non-null object
CHANNEL           53025 non-null object
SHIP_TO           53025 non-null object
PROD_LINE         53025 non-null object
GROUP_NUMBER      53025 non-null object
DESCRIPTION       53025 non-null object
ITEM              53025 non-null object
ORDER_QTY         53025 non-null int64
FW_END_DT     53025 non-null datetime64[ns]
BPS_INCLUDE       53025 non-null int64
dtypes: datetime64[ns](2), int64(2), object(10)
memory usage: 6.1+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53025 entries, 0 to 53024
Data columns (total 14 columns):
SITE              53025 non-null object
CUST              53025 non-null object
ORDER_NUMBER      53025 non-null object
ORDER_DATE        53025 non-null datetime64[ns]
PURCHASE_ORDER    53025 non-null object
CHANNEL           53025 non-null object
SHIP_TO           53025 non-null object
PROD_LINE         53025 non-null object
GROUP_NUMBER      53025 non-null object
DESCRIPTION       53025 non-null object
ITEM              53025 non-null object
ORDER_QTY         53025 non-null int64
FW_END_DT     53025 non-null datetime64[ns]
BPS_INCLUDE       53025 non-null int64
dtypes: datetime64[ns](2), int64(2), object(10)
memory usage: 5.7+ MB

^我说过,除了大小,其他所有内容都一样。

In:
common = bdf.merge(dftest,on=['SITE', 'CUST', 'ORDER_NUMBER', 'ORDER_DATE', 'PURCHASE_ORDER', 'CHANNEL', 'SHIP_TO', 'PROD_LINE', 'GROUP_NUMBER', 'DESCRIPTION', 'ITEM', 'ORDER_QTY', 'FW_END_DT', 'BPS_INCLUDE'], how='outer', indicator=True)
print(common[common['_merge'] != 'both'])



Out:
Empty DataFrame
Columns: [SITE, CUST, ORDER_NUMBER, ORDER_DATE, PURCHASE_ORDER, CHANNEL, SHIP_TO, PROD_LINE, GROUP_NUMBER, DESCRIPTION, ITEM, ORDER_QTY, FW_END_DT, BPS_INCLUDE, _merge]
Index: []

试图合并和选择不在两个df中的行

In:
bdf[(~bdf.SITE.isin(common.SITE))&(~bdf.CUST.isin(common.CUST))&(~bdf.ORDER_NUMBER.isin(common.ORDER_NUMBER))&(~bdf.ORDER_DATE.isin(common.ORDER_DATE))&(~bdf.PURCHASE_ORDER.isin(common.PURCHASE_ORDER))&(~bdf.CHANNEL.isin(common.CHANNEL))&(~bdf.SHIP_TO.isin(common.SHIP_TO))&(~bdf.PROD_LINE.isin(common.PROD_LINE))&(~bdf.GROUP_NUMBER.isin(common.GROUP_NUMBER))&(~bdf.DESCRIPTION.isin(common.DESCRIPTION))&(~bdf.ITEM.isin(common.ITEM))&(~bdf.ORDER_QTY.isin(common.ORDER_QTY))&(~bdf.FW_END_DT.isin(common.FW_END_DT))&(~bdf.BPS_INCLUDE.isin(common.BPS_INCLUDE))]



Out:
SITE    CUST    ORDER_NUMBER    ORDER_DATE  PURCHASE_ORDER  CHANNEL SHIP_TO PROD_LINE   GROUP_NUMBER    DESCRIPTION ITEM    ORDER_QTY   FW_END_DT   BPS_INCLUDE

无所事事

In:
(bdf == dftest).all().all()




Out:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-34-6c2f52f55e60> in <module>()
----> 1 (bdf == dftest).all().all()

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\ops.py in f(self, other)
   1611             # Another DataFrame
   1612             if not self._indexed_same(other):
-> 1613                 raise ValueError('Can only compare identically-labeled '
   1614                                  'DataFrame objects')
   1615             return self._compare_frame(other, func, str_rep)

ValueError: Can only compare identically-labeled DataFrame objects

它们的标签不一样吗?

当我尝试搜索以下内容时,建议尝试:

In:
bdf.eq(dftest)


Out:
    SITE    CUST    ORDER_NUMBER    ORDER_DATE  PURCHASE_ORDER  CHANNEL SHIP_TO PROD_LINE   GROUP_NUMBER    DESCRIPTION ITEM    ORDER_QTY   FW_END_DT   BPS_INCLUDE
0   False   False   False   False   False   False   False   False   False   False   False   False   False   False
1   False   False   False   False   False   False   False   False   False   False   False   False   False   False
2   False   False   False   False   False   False   False   False   False   False   False   False   False   False
3   False   False   False   False   False   False   False   False   False   False   False   False   False   False
4   False   False   False   False   False   False   False   False   False   False   False   False   False   False
5   False   False   False   False   False   False   False   False   False   False   False   False   False   False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
52995   False   False   False   False   False   False   False   False   False   False   False   False   False   False
106050 rows × 14 columns

在这种情况下,看起来每对单元格都不匹配...:(

我错过了完全显而易见的东西吗?

1 个答案:

答案 0 :(得分:0)

您的数据中是否有nan / null /缺失值?

如果这样的话,groupby.sum()可以用例如0.如果是数字dtypes

如果是上述原因,groupby.first()结果将与原始输入相同