Python DataFrames that look equal are not equal

Date: 2019-08-20 19:20:54

Tags: python dataframe

I am taking an online Python course that is currently covering DataFrames.

I downloaded this CSV file and imported it into a DataFrame:

import os
import pandas as pd
# Change into the directory that contains brics.csv (chdir needs a directory, not the file itself)
os.chdir('C:/cygwin64/home/User.Name/path/to')
pd.read_csv( os.getcwd() + '/brics.csv' )
myBrics = pd.read_csv( 'brics.csv' )
myBrics

      Unnamed: 0       country    capital    area  population
    0         BR        Brazil   Brasilia   8.516      200.40
    1         RU        Russia     Moscow  17.100      143.50
    2         IN         India  New Delhi   3.286     1252.00
    3         CH         China    Beijing   9.597     1357.00
    4         SA  South Africa   Pretoria   1.221       52.98

Then I created the same DataFrame using the code provided in the course demo:

dict = {
   "country":["Brazil", "Russia", "India", "China", "South Africa"],
   "capital":["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
   "area":[8.516, 17.10, 3.286, 9.597, 1.221],
   "population":[200.4, 143.5, 1252, 1357, 52.98] }
brics = pd.DataFrame(dict)
brics

            country    capital    area  population
    0        Brazil   Brasilia   8.516      200.40
    1        Russia     Moscow  17.100      143.50
    2         India  New Delhi   3.286     1252.00
    3         China    Beijing   9.597     1357.00
    4  South Africa   Pretoria   1.221       52.98

Apart from the first column in myBrics, they appear identical. Some web searching showed that I can drop the first column:

myBrics.drop( myBrics.columns[[0]] , axis=1 )

            country    capital    area  population
    0        Brazil   Brasilia   8.516      200.40
    1        Russia     Moscow  17.100      143.50
    2         India  New Delhi   3.286     1252.00
    3         China    Beijing   9.597     1357.00
    4  South Africa   Pretoria   1.221       52.98

However, the identical-looking DataFrames are still not equal:

myBrics.drop( myBrics.columns[[0]] , axis=1 ).equals( brics )

    False

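Can anyone explain what is going on? Thanks.

I am using Python 3.7 with Spyder, installed via Anaconda (by someone with administrator privileges). The operating system is 64-bit Windows 7.

4 answers: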

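Answer 0: (score: 2)

You are relying on floating-point values comparing exactly equal; there are many resources explaining why this does not work as expected.

I suggest importing numpy and using the isclose function on the float columns.

Add this to your imports: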

import numpy as np

and then use the following:

eq = np.isclose(myBrics['area'], brics['area'])

If you want more detail on floating-point behavior, see this answer.
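To extend the isclose suggestion to whole frames, here is a minimal sketch. The helper name `frames_close`, its tolerances, and the two small sample frames are assumptions made for illustration, not part of the original answer. Float columns are compared with a tolerance, everything else exactly.

```python
import numpy as np
import pandas as pd


def frames_close(left, right, rtol=1e-9, atol=0.0):
    """Return True if two frames match, allowing tiny float differences.

    Hypothetical helper for illustration; the tolerances are arbitrary choices.
    """
    # Shapes and column labels have to agree before values are compared.
    if left.shape != right.shape or list(left.columns) != list(right.columns):
        return False
    for name in left.columns:
        a, b = left[name], right[name]
        if pd.api.types.is_float_dtype(a) and pd.api.types.is_float_dtype(b):
            # Tolerant comparison for floating-point columns.
            close = np.isclose(a.to_numpy(), b.to_numpy(),
                               rtol=rtol, atol=atol, equal_nan=True)
            if not close.all():
                return False
        elif not a.equals(b):
            # Exact comparison for everything else (strings, integers, ...).
            return False
    return True


exact = pd.DataFrame({"country": ["China", "South Africa"],
                      "area": [9.597, 1.221]})
# Simulate a parser that is off by one unit in the last place.
nudged = exact.copy()
nudged["area"] = np.nextafter(nudged["area"].to_numpy(), np.inf)

print(exact.equals(nudged))         # False: exact comparison
print(frames_close(exact, nudged))  # True: tolerant comparison
```

The relative tolerance is deliberately tight so that a genuinely different value still fails the check.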

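Answer 1: (score: 1)

I suspect it is the dtype of your columns. As the documentation states:

"The column headers do not need to have the same type, but the elements within the columns must be the same dtype."

You can use: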

dataframe.dtypes

to see the data type of each column.
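As a small illustration of the dtype rule, the sketch below builds two frames that print the same numbers but store them as integers and floats. The helper `dtype_mismatches` and the sample values are hypothetical, chosen only to show the check.

```python
import pandas as pd


def dtype_mismatches(left, right):
    """Map each shared column to its (left, right) dtypes when they differ."""
    shared = [c for c in left.columns if c in right.columns]
    return {c: (left[c].dtype, right[c].dtype)
            for c in shared if left[c].dtype != right[c].dtype}


ints = pd.DataFrame({"population": [1252, 1357]})
floats = pd.DataFrame({"population": [1252.0, 1357.0]})

print(dtype_mismatches(ints, floats))          # lists the 'population' column
print(ints.equals(floats))                     # False: same values, different dtype
print(ints.astype("float64").equals(floats))   # True once the dtypes agree
```

If this helper returns an empty dict, as it would for the frames in the question, the dtypes are not the cause and the values themselves need a closer look.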

Answer 2: (score: 1)

Allan Elder's answer is correct. I ran this code:

import os
import pandas as pd
myBrics = pd.read_csv( 'brics.csv' )
dict = {
     "country":["Brazil", "Russia", "India", "China", "South Africa"],
     "capital":["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
     "area":[8.516, 17.10, 3.286, 9.597, 1.221],
     "population":[200.4, 143.5, 1252, 1357, 52.98] }
brics = pd.DataFrame(dict)
myBrics = myBrics.drop( myBrics.columns[[0]] , axis=1 )
print (myBrics['area'].equals(brics['area']))

The result is

False
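To find out which rows make the comparison fail, one option is to compare element-wise and print the offending values with 17 significant digits. This is a minimal sketch using made-up values (0.3 versus 0.1 + 0.2) rather than the CSV data.

```python
import pandas as pd

left = pd.Series([0.3, 0.5], name="area")
right = pd.Series([0.1 + 0.2, 0.5], name="area")

print(left.equals(right))  # False

# Boolean mask of the rows whose stored values differ.
mask = left != right

# 17 significant digits are enough to tell any two distinct doubles apart.
print(left[mask].map("{:.17g}".format))
print(right[mask].map("{:.17g}".format))
```

The default display rounds to a handful of digits, which is why two frames can print identically while holding different bit patterns.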

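Answer 3: (score: 1)

Quantization error is the cause of the difference. Here is the sequence of troubleshooting steps suggested by the respondents: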

import os
import pandas as pd
# Change into the directory that contains brics.csv (chdir needs a directory, not the file itself)
os.chdir('C:/cygwin64/home/User.Name/path/to')
pd.read_csv( os.getcwd() + '/brics.csv' )
myBrics = pd.read_csv( 'brics.csv' )
myBrics

     Unnamed: 0       country    capital    area  population
   0         BR        Brazil   Brasilia   8.516      200.40
   1         RU        Russia     Moscow  17.100      143.50
   2         IN         India  New Delhi   3.286     1252.00
   3         CH         China    Beijing   9.597     1357.00
   4         SA  South Africa   Pretoria   1.221       52.98

dict = {
 "country":["Brazil", "Russia", "India", "China", "South Africa"],
 "capital":["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
 "area":[8.516, 17.10, 3.286, 9.597, 1.221],
 "population":[200.4, 143.5, 1252, 1357, 52.98] }
brics = pd.DataFrame(dict)
brics

           country    capital    area  population
   0        Brazil   Brasilia   8.516      200.40
   1        Russia     Moscow  17.100      143.50
   2         India  New Delhi   3.286     1252.00
   3         China    Beijing   9.597     1357.00
   4  South Africa   Pretoria   1.221       52.98

alex067 suggested checking the dtypes, which shows that they are the same:

brics.dtypes

   Out[14]:
   country        object
   capital        object
   area          float64
   population    float64
   dtype: object

myBrics.dtypes

   Out[15]:
   Unnamed: 0     object
   country        object
   capital        object
   area          float64
   population    float64
   dtype: object

HS-nebula suggested using assert_frame_equal to see where the difference lies:

from pandas.testing import assert_frame_equal
assert_frame_equal(myBrics.drop( myBrics.columns[[0]] , axis=1 ), brics)
    # Reports no differences (floats are compared approximately by default)
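assert_frame_equal reports nothing here because it compares floats approximately unless told otherwise. A minimal sketch, using two hypothetical frames that differ by one unit in the last place, shows how check_exact=True turns that tolerance off:

```python
import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal

exact = pd.DataFrame({"area": [9.597, 1.221]})
# Same values nudged to the next representable double.
nudged = pd.DataFrame({"area": np.nextafter(exact["area"].to_numpy(), np.inf)})

# Passes silently: the default float comparison is tolerant.
assert_frame_equal(exact, nudged)

# With check_exact=True the one-bit difference raises AssertionError.
try:
    assert_frame_equal(exact, nudged, check_exact=True)
    strict_passed = True
except AssertionError as err:
    strict_passed = False
    print(err)

print(strict_passed)  # False
```

The AssertionError message names the column and the differing values, which makes it a quick way to locate the mismatch.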

Josh Allan Elder said that the difference is due to quantization error:

import numpy as np
np.isclose(myBrics['area'], brics['area'])

   array([ True,  True,  True,  True,  True])

brics['area'] - myBrics['area']

   0    0.000000e+00
   1    0.000000e+00
   2    0.000000e+00
   3   -1.776357e-15
   4    2.220446e-16
   Name: area, dtype: float64

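This means that pd.read_csv quantizes the text representation of numeric data differently from the combination of dict and pd.DataFrame. It may be dict that does the quantizing. I find this inconsistency somewhat unsettling, but so be it.

Thanks, everyone!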
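One way to sidestep the discrepancy is to ask read_csv for round-trip float parsing, which converts text to floats the same way Python does for literals. The sketch below is an assumption-laden illustration: it uses an inline CSV string instead of the real brics.csv, and whether the default parser differs at all depends on the pandas version.

```python
import io
import pandas as pd

# Inline stand-in for the CSV file (hypothetical subset of the data).
csv_text = "country,area\nChina,9.597\nSouth Africa,1.221\n"

literal = pd.DataFrame({"country": ["China", "South Africa"],
                        "area": [9.597, 1.221]})

# float_precision="round_trip" selects the slower but exact converter
# of the C parsing engine.
round_trip = pd.read_csv(io.StringIO(csv_text), float_precision="round_trip")

print(round_trip.equals(literal))
```

With this option the parsed values should be bit-for-bit identical to the ones built from Python literals, so equals no longer depends on which route the data took.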