pd.DataFrame.update() doesn't carry over dtypes. Why?

Asked: 2018-01-17 19:35:26

Tags: python pandas types

I have a DataFrame df. Two of its columns ('neighborhood' and 'price') contain strings, and each of those strings contains a number.

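My goal is to build two arrays containing only the numbers from those strings, then overwrite the old columns in df with them so that pandas' .corr() recognizes the columns and can operate on them.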

Here is my code so far:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Import data
df = pd.read_csv('data.csv',sep='|')

# Clean data (throw away room_id, remove html from neighborhood, make price float )
df = df.drop('room_id', axis=1)
neighborhood = np.array([float(n[3:-4]) for n in df['neighborhood']]).astype(np.float64)
price = np.array([float(p[7:-1]) for p in df['price']]).astype(np.float64)
df_updates = pd.DataFrame({'neighborhood' : neighborhood, 'price' : price})
df.update(df_updates)

# Print first row of dataframe and the output of df.corr()
print(df.iloc[0])
print(df.corr())

# Print types
print(type(neighborhood[0]))
print(type(price[0]))
print(type(df['neighborhood'][0]))
print(type(df['price'][0]))

As shown below, .corr() does not recognize the new 'neighborhood' and 'price' columns as something it can operate on.

Out []: 

room_type               Entire home/apt
neighborhood                          5
reviews                               0
satisfaction                          0
acc.                                  6
bedrooms                              3
price                                80
Name: 0, dtype: object

                      reviews               satisfaction  acc.      bedrooms
reviews               1.000000              0.520951     -0.037194 -0.064366
overall_satisfaction  0.520951              1.000000     -0.019771 -0.052900
accommodates         -0.037194             -0.019771      1.000000  0.720229
bedrooms             -0.064366             -0.052900      0.720229  1.000000

<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'float'>
<class 'float'>
<class 'numpy.int64'>
<class 'numpy.float64'>

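I suspect this happens because the entries of 'neighborhood' and 'price' in the DataFrame are plain Python floats (rather than np.float64), even though the corresponding ndarrays contained np.float64 values when they were passed to .update().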

Q: Why does this happen, and how can I fix it?

1 Answer:

Answer 0: (score: 0)

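Thanks to @MattR for the right hint! As df.dtypes shows, the two problematic columns have dtype object, which .corr() cannot handle. The solution (found in this answer) is to define a list prob_cols containing the names of the problematic columns and then do the following: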

df[prob_cols] = df[prob_cols].apply(pd.to_numeric)
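
For anyone who wants to reproduce this, here is a minimal sketch of the behavior and the fix. The sample strings and values are made up for illustration (the real data.csv is not shown in the question), and the columns are created with dtype object on purpose, to match what df.dtypes reported. As far as I can tell, .update() writes the new values into the existing columns without changing their dtype, which is why the float64 arrays end up stored as plain Python floats inside object columns.

```python
import pandas as pd

# Hypothetical stand-in for the question's data: two string columns,
# explicitly stored with dtype object as df.dtypes showed.
df = pd.DataFrame({
    "neighborhood": pd.Series(["<b>5</b>", "<b>7</b>"], dtype=object),
    "price": pd.Series(["Price: 80$", "Price: 120$"], dtype=object),
})

# The cleaned values are proper float64 columns on their own.
updates = pd.DataFrame({"neighborhood": [5.0, 7.0], "price": [80.0, 120.0]})

# update() modifies df in place; the values change, the column dtype does not.
df.update(updates)
dtypes_after_update = df.dtypes.copy()

# The fix from the answer: convert the problematic columns explicitly.
prob_cols = ["neighborhood", "price"]
df[prob_cols] = df[prob_cols].apply(pd.to_numeric)
dtypes_after_fix = df.dtypes.copy()

# Both columns are numeric now, so corr() includes them.
corr = df.corr()

print(dtypes_after_update)
print(dtypes_after_fix)
print(corr)
```

If the cleaned arrays are already at hand, assigning them directly (for example df['price'] = price) replaces the column outright and should keep the float64 dtype, so .update() is not really needed for this use case.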