Question

I have a pandas DataFrame compiled from some web data (for tennis games) that exhibits strange behaviour when summing across selected rows.


The DataFrame:

In [178]: tdf.shape
Out[178]: (47028, 57)

In [201]: cols
Out[201]: ['L1', 'L2', 'L3', 'L4', 'L5', 'W1', 'W2', 'W3', 'W4', 'W5']

In [177]: tdf[cols].head()
Out[177]:
    L1 L2   L3   L4   L5   W1 W2   W3   W4   W5
0  4.0  2  NaN  NaN  NaN  6.0  6  NaN  NaN  NaN
1  3.0  3  NaN  NaN  NaN  6.0  6  NaN  NaN  NaN
2  7.0  5    3  NaN  NaN  6.0  7    6  NaN  NaN
3  1.0  4  NaN  NaN  NaN  6.0  6  NaN  NaN  NaN
4  6.0  7    4  NaN  NaN  7.0  5    6  NaN  NaN

I then try to compute the row sums with tdf[cols].sum(axis=1). From the table above, the sum for the first row should be 18.0, but it is reported as 10.0, as shown below:

In [180]: tdf[cols].sum(axis=1).head()
Out[180]:
0    10.0
1     9.0
2    13.0
3     7.0
4    13.0
dtype: float64

The problem seems to be caused by one specific record (row 13771), because when I exclude this row the sums are computed correctly:

In [182]: tdf.iloc[:13771][cols].sum(axis=1).head()
Out[182]:
0    18.0
1    18.0
2    34.0
3    17.0
4    35.0
dtype: float64

Whereas including it:

In [183]: tdf.iloc[:13772][cols].sum(axis=1).head()
Out[183]:
0    10.0
1     9.0
2    13.0
3     7.0
4    13.0
dtype: float64

gives the wrong result for the entire column.

The offending record is the following:

In [196]: tdf[cols].iloc[13771]
Out[196]:
L1      1
L2      1
L3    NaN
L4    NaN
L5    NaN
W1      6
W2      0
W3
W4    NaN
W5    NaN
Name: 13771, dtype: object

In [197]: tdf[cols].iloc[13771].W3
Out[197]: ' '

In [198]: type(tdf[cols].iloc[13771].W3)
Out[198]: str

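Surely a single poorly formatted record should not affect the sums of the other records? Is this a bug, or am I doing something wrong?

Many thanks!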

Answer 1

The problem is the empty string: because of it, the dtype of column W3 is object (evidently it holds strings), so sum omits it.
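To illustrate the point above, here is a minimal sketch on a small hypothetical frame (the name `demo` and its values are made up for illustration and are not the asker's data). It shows that a single string is enough to make a column `object` dtype, and one possible way to locate such cells by comparing the frame against its `pd.to_numeric` conversion.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the data: W3 holds one stray ' ' string.
demo = pd.DataFrame({
    "L1": [4.0, 1.0],
    "W1": [6.0, 6.0],
    "W3": [6, " "],
})

# One string among the numbers makes the whole column object dtype.
print(demo.dtypes)

# Cells that are present but cannot be parsed as numbers.
converted = demo.apply(lambda s: pd.to_numeric(s, errors="coerce"))
bad_cells = demo.notna() & converted.isna()

# Rows containing at least one such cell.
bad_rows = bad_cells.any(axis=1)
print(demo[bad_rows])
```

On the real data, the same mask applied to `tdf[cols]` should point at records like row 13771 without having to bisect the frame by hand.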

Solution:

Try replacing the problematic empty string value with NaN, then cast the column to float:

import numpy as np

tdf.loc[13771, 'W3'] = np.nan
tdf.W3 = tdf.W3.astype(float)

Or try replacing all empty strings with NaN in the cols subset of columns:

tdf[cols] = tdf[cols].replace({'': np.nan})
# if necessary
tdf[cols] = tdf[cols].astype(float)
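Note that the value shown in the question is `' '` (a single space), which an exact match on `''` would not catch. A hedged variant, assuming the bad cells are empty or whitespace-only strings, is to match them with a regular expression instead. The frame `demo` below is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: W3 contains a whitespace-only string.
demo = pd.DataFrame({
    "L1": [4.0, 1.0],
    "W2": [6, 0],
    "W3": [6, " "],
})
cols = ["L1", "W2", "W3"]

# Replace empty or whitespace-only strings with NaN, then cast to float.
cleaned = demo[cols].replace(r"^\s*$", np.nan, regex=True).astype(float)
print(cleaned.dtypes)
print(cleaned)
```

If the columns may contain other non-numeric text, the cast to float would still fail, in which case coercing with `pd.to_numeric` is the more forgiving route.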

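Another solution is to use to_numeric on the problematic column, which replaces all non-numeric values with NaN:

tdf.W3 = pd.to_numeric(tdf.W3, errors='coerce')

Or, more generally, apply it to all the columns in cols: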

tdf[cols] = tdf[cols].apply(lambda x: pd.to_numeric(x, errors='coerce'))

Strange pandas.DataFrame.sum(axis=1) behaviour