I'm taking an online Python course that is currently covering data frames.
I downloaded this CSV file and imported it into a data frame:
import os
import pandas as pd
os.chdir('C:/cygwin64/home/User.Name/path/to/brics.csv')
pd.read_csv( os.getcwd() + '/brics.csv' )
myBrics = pd.read_csv( 'brics.csv' )
myBrics
   Unnamed: 0       country    capital    area  population
0          BR        Brazil   Brasilia   8.516      200.40
1          RU        Russia     Moscow  17.100      143.50
2          IN         India  New Delhi   3.286     1252.00
3          CH         China    Beijing   9.597     1357.00
4          SA  South Africa   Pretoria   1.221       52.98
I then created the same data frame using the code provided in the course demonstration:
dict = {
"country":["Brazil", "Russia", "India", "China", "South Africa"],
"capital":["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
"area":[8.516, 17.10, 3.286, 9.597, 1.221],
"population":[200.4, 143.5, 1252, 1357, 52.98] }
brics = pd.DataFrame(dict)
brics
        country    capital    area  population
0        Brazil   Brasilia   8.516      200.40
1        Russia     Moscow  17.100      143.50
2         India  New Delhi   3.286     1252.00
3         China    Beijing   9.597     1357.00
4  South Africa   Pretoria   1.221       52.98
Apart from the first column of myBrics, they appear to be identical. Some web searching showed that I can get rid of that first column:
myBrics.drop( myBrics.columns[[0]] , axis=1 )
        country    capital    area  population
0        Brazil   Brasilia   8.516      200.40
1        Russia     Moscow  17.100      143.50
2         India  New Delhi   3.286     1252.00
3         China    Beijing   9.597     1357.00
4  South Africa   Pretoria   1.221       52.98
However, the seemingly identical data frames are still not equal:
myBrics.drop( myBrics.columns[[0]] , axis=1 ).equals( brics )
False
Can anyone explain what is happening? Thanks.
I'm using Python 3.7 in Spyder, installed via Anaconda (by someone with administrator rights). The operating system is Windows 7 64-bit.
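Answer 0: (score: 2)
You are relying on exact equality of floating-point values returning True; there are many resources that explain why this does not work as expected.
I suggest importing numpy and using its isclose function on the float columns.
Add this to your imports: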
import numpy as np
and then use the following:
eq = np.isclose(myBrics['area'], brics['area'])
If you want more detail on how floating-point numbers behave, see this answer.
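To extend that idea from one column to a whole frame, something like the following might work. It is only a sketch: `frames_close` is a hypothetical helper name, the two small frames are made-up stand-ins for myBrics and brics, and the `1e-15` offset merely simulates a last-digit parsing difference.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for the two frames in the question.
left = pd.DataFrame({"country": ["China", "South Africa"],
                     "area": [9.597, 1.221]})
right = left.copy()
right["area"] = right["area"] + 1e-15  # simulate a last-digit difference


def frames_close(a, b):
    """Hypothetical helper: exact match on non-float columns,
    np.isclose (default tolerances) on float columns."""
    if list(a.columns) != list(b.columns) or len(a) != len(b):
        return False
    for col in a.columns:
        both_float = (pd.api.types.is_float_dtype(a[col])
                      and pd.api.types.is_float_dtype(b[col]))
        if both_float:
            if not np.isclose(a[col], b[col], equal_nan=True).all():
                return False
        elif not a[col].equals(b[col]):
            return False
    return True


exact = left.equals(right)        # bit-for-bit comparison
close = frames_close(left, right)  # tolerance-based comparison
print(exact, close)
```

The default tolerances of np.isclose are a judgment call; pass rtol and atol explicitly if your data needs something tighter or looser.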
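Answer 1: (score: 1)
I suspect it's the dtype of your columns. As the documentation states:
The column headers do not need to have the same type, but the elements within the columns must be of the same dtype.
You can use: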
dataframe.dtypes
to see the data type of each column.
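If reading two dtype listings by eye is error-prone, they can be lined up side by side. A minimal sketch, assuming two frames with the same column names; the frames `a` and `b` here are invented for illustration and are not the ones from the question:

```python
import pandas as pd

# Invented example frames: same columns, but "population" is int in b.
a = pd.DataFrame({"area": [8.516, 17.1], "population": [200.4, 143.5]})
b = pd.DataFrame({"area": [8.516, 17.1], "population": [200, 143]})

# One row per column, one column per frame.
dtype_report = pd.concat([a.dtypes, b.dtypes], axis=1, keys=["a", "b"])
print(dtype_report)

# Names of the columns whose dtypes disagree.
mismatched = [col for col in a.columns if a[col].dtype != b[col].dtype]
print(mismatched)
```

An empty `mismatched` list would rule out dtypes as the cause and point toward the values themselves.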
Answer 2: (score: 1)
Allan Elder's answer is correct. I ran this code:
import os
import pandas as pd
myBrics = pd.read_csv( 'brics.csv' )
dict = {
"country":["Brazil", "Russia", "India", "China", "South Africa"],
"capital":["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
"area":[8.516, 17.10, 3.286, 9.597, 1.221],
"population":[200.4, 143.5, 1252, 1357, 52.98] }
brics = pd.DataFrame(dict)
myBrics = myBrics.drop( myBrics.columns[[0]] , axis=1 )
print (myBrics['area'].equals(brics['area']))
The result is
False
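The same check can be run over every column to see which ones fail, and then over the rows of a failing column. This is just a sketch on made-up data (the `2e-15` perturbation is illustrative), not the actual brics.csv:

```python
import pandas as pd

a = pd.DataFrame({"capital": ["Beijing", "Pretoria"], "area": [9.597, 1.221]})
b = a.copy()
b.loc[0, "area"] = 9.597 + 2e-15  # illustrative tiny perturbation

# Exact comparison, column by column.
per_column = {col: bool(a[col].equals(b[col])) for col in a.columns}
print(per_column)

# Row labels where the float column is not bit-for-bit identical.
row_mask = (a["area"] != b["area"]).to_numpy()
differing_rows = list(a.index[row_mask])
print(differing_rows)
```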
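Answer 3: (score: 1)
Quantization error is the cause of the difference. Here is the series of troubleshooting steps suggested by the respondents: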
import os
import pandas as pd
os.chdir('C:/cygwin64/home/User.Name/path/to/brics.csv')
pd.read_csv( os.getcwd() + '/brics.csv' )
myBrics = pd.read_csv( 'brics.csv' )
myBrics
   Unnamed: 0       country    capital    area  population
0          BR        Brazil   Brasilia   8.516      200.40
1          RU        Russia     Moscow  17.100      143.50
2          IN         India  New Delhi   3.286     1252.00
3          CH         China    Beijing   9.597     1357.00
4          SA  South Africa   Pretoria   1.221       52.98
dict = {
"country":["Brazil", "Russia", "India", "China", "South Africa"],
"capital":["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
"area":[8.516, 17.10, 3.286, 9.597, 1.221],
"population":[200.4, 143.5, 1252, 1357, 52.98] }
brics = pd.DataFrame(dict)
brics
        country    capital    area  population
0        Brazil   Brasilia   8.516      200.40
1        Russia     Moscow  17.100      143.50
2         India  New Delhi   3.286     1252.00
3         China    Beijing   9.597     1357.00
4  South Africa   Pretoria   1.221       52.98
alex067 suggested checking the data types, which shows that they are the same:
brics.dtypes
Out[14]:
country        object
capital        object
area          float64
population    float64
dtype: object
myBrics.dtypes
Out[15]:
Unnamed: 0     object
country        object
capital        object
area          float64
population    float64
dtype: object
HS-nebula suggested using assert_frame_equal to see where the differences are:
from pandas.testing import assert_frame_equal  # pandas.util.testing is deprecated
assert_frame_equal(myBrics.drop( myBrics.columns[[0]] , axis=1 ), brics)
# Reports no differences
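A likely reason it reports nothing: by default assert_frame_equal compares floats within a tolerance, and its check_exact parameter switches to exact comparison. A small sketch on invented data, where np.nextafter stands in for a one-step parsing difference:

```python
import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal

a = pd.DataFrame({"area": [9.597, 1.221]})
# Move one value to the adjacent representable float (illustrative).
b = pd.DataFrame({"area": [np.nextafter(9.597, 10.0), 1.221]})

# Default settings compare floats approximately, so this passes silently.
assert_frame_equal(a, b)

# With check_exact=True the one-step difference is reported.
try:
    assert_frame_equal(a, b, check_exact=True)
    exact_match = True
except AssertionError as err:
    exact_match = False
    print(err)
```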
Josh and Allan Elder said that the difference is caused by quantization error:
import numpy as np
np.isclose(myBrics['area'], brics['area'])
array([ True, True, True, True, True])
brics['area'] - myBrics['area']
0    0.000000e+00
1    0.000000e+00
2    0.000000e+00
3   -1.776357e-15
4    2.220446e-16
Name: area, dtype: float64
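This means that pd.read_csv quantizes the text representation of numeric data differently from the combination of dict and pd.DataFrame. On the dict side, it is presumably Python itself that does the quantization when it parses the float literals. I find this inconsistency somewhat disturbing, but so be it.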
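One way to probe where the rounding comes from, offered as a sketch rather than a definitive diagnosis: read_csv accepts a float_precision argument for its C parser, and "round_trip" selects the round-trip converter. The CSV text below is a made-up two-row stand-in for brics.csv, not the real file, and the default parsing behavior has varied between pandas versions, so a default read may or may not differ on a given install.

```python
import io
import pandas as pd

# Made-up stand-in for brics.csv (not the real file).
csv_text = (
    "country,area,population\n"
    "China,9.597,1357\n"
    "South Africa,1.221,52.98\n"
)

# Same values written as Python literals, as in the course code.
reference = pd.DataFrame({"country": ["China", "South Africa"],
                          "area": [9.597, 1.221],
                          "population": [1357, 52.98]})

# Ask the C parser for its round-trip float converter.
round_trip = pd.read_csv(io.StringIO(csv_text), float_precision="round_trip")

same = round_trip.equals(reference)
print(same)
```

If the frames compare equal with this setting but not with a plain read_csv call, that would suggest the difference arises in the CSV parser's float conversion.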
Thanks, everyone!