Strange behavior of Pandas DataFrame.sum when a column contains string values

Time: 2018-10-26 09:06:21

Tags: python pandas dataframe types coercion


import pandas as pd

df1 = pd.DataFrame([[1,2,3],[4,5,'hey'],[7,8,9]])

df2 = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]])
df2.loc[1,2] = 'hey'

df3 = pd.DataFrame(index=range(3), columns=range(3))
for i in range(3):
    for j in range(3):
        if (i,j) != (1,2):
            df3.loc[i,j] = i*3 + j + 1
        else:
            df3.loc[i,j] = 'hey'

# df1, df2, df3 look the same as below
   0  1    2
0  1  2    3
1  4  5  hey
2  7  8    9


sumcol1 = df1.sum()
sumcol2 = df2.sum()
sumcol3 = df3.sum()

# sumcol1, sumcol2, sumcol3 look the same as below
0    12
1    15
dtype: int64


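Moreover, it seems that with axis=0 the sums of columns that contain strings are not computed, whereas with axis=1 all row sums are computed, with the elements that belong to the string-containing column skipped.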

sumrow1 = df1.sum(axis=1)
sumrow2 = df2.sum(axis=1)
sumrow3 = df3.sum(axis=1)

# sumrow1
0     3
1     9
2    15
dtype: int64

# sumrow2
0     3
1     9
2    15
dtype: int64

# sumrow3
0    0.0
1    0.0
2    0.0
dtype: float64


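  1. What causes the different behavior between sumcol1 and sumrow1?

  2. What causes the different behavior between sumrow1 and sumrow3?

  3. Is there a proper way to get the same result as sumrow1 with df3?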


  1. Is there a clever way to add up only the numeric values while keeping the strings?

    • My current workaround (thanks to jpp's answer):

      df = pd.DataFrame([[1,2,3],[4,5,'hey'],[7,8,9]])
      df_c = df.copy()
      for col in df.select_dtypes(['object']).columns:
          df_c[col] = pd.to_numeric(df_c[col], errors='coerce')
      df['sum'] = df_c.sum(axis=1)
         0  1    2   sum
      0  1  2    3   6.0
      1  4  5  hey   9.0
      2  7  8    9  24.0

I am using Python 3.6.6 and pandas 0.23.4.

2 answers:

Answer 0 (score: 3)


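  • The main issue is that df3, as you constructed it, has dtype object for all three series, whereas df1 and df2 have dtype=int for the first two series.
  • Data in a Pandas dataframe is organized and stored by series (columns). Type casting is therefore performed per series. As a result, the logic for summing across rows and across columns is necessarily different, and it is not necessarily consistent when types are mixed.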



print({'df1': df1.dtypes, 'df2': df2.dtypes, 'df3': df3.dtypes})

{'df1': 0     int64
        1     int64
        2    object
      dtype: object,

 'df2': 0     int64
        1     int64
        2    object
      dtype: object,

 'df3': 0    object
        1    object
        2    object
      dtype: object}
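
As a hedged illustration of how these dtypes drive the results: one reading of the question's output is that pandas 0.23 dropped the columns it could not sum and then summed the rest. The sketch below requests that behavior explicitly with `numeric_only=True`. It should also work on newer pandas releases, where summing a mixed object column without that argument may raise a `TypeError` instead. The names `mixed` and `all_object` are illustrative stand-ins for `df1` and `df3`, not part of the original answer.

```python
import pandas as pd

# Stand-in for df1: two int64 columns and one object column.
mixed = pd.DataFrame([[1, 2, 3], [4, 5, 'hey'], [7, 8, 9]])
# Stand-in for df3: same values, but every column stored as object.
all_object = mixed.astype(object)

# Only the numeric (int64) columns take part in the sums.
col_sums = mixed.sum(numeric_only=True)
row_sums = mixed.sum(axis=1, numeric_only=True)
print(col_sums.tolist())  # [12, 15]
print(row_sums.tolist())  # [3, 9, 15]

# With no numeric columns at all, nothing is left to add up per row.
empty_row_sums = all_object.sum(axis=1, numeric_only=True)
print(empty_row_sums.tolist())
```

When every column is object, no column counts as numeric, which is consistent with the all-zero sumrow3 in the question.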


for col in df3.select_dtypes(['object']).columns:
    col_num = pd.to_numeric(df3[col], errors='coerce')
    if not col_num.isnull().any():  # check if any null values
        df3[col] = col_num          # assign numeric series


0     int64
1     int64
2    object
dtype: object

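You will then see consistent treatment. At this point it is worth discarding the original df3: nowhere is it documented that series dtypes can or should be re-checked continually after every operation.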
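
A shorter alternative to the conversion loop above, offered as a sketch rather than as this answer's method: `DataFrame.infer_objects()` attempts a soft conversion of object columns to a more specific dtype and leaves columns it cannot convert as object. The frame `obj_df` is a hypothetical stand-in for `df3`.

```python
import pandas as pd

# Every column starts out as object, as in df3.
obj_df = pd.DataFrame([[1, 2, 3], [4, 5, 'hey'], [7, 8, 9]], dtype=object)

# Columns holding only integers become integer columns;
# the column containing 'hey' stays object.
converted = obj_df.infer_objects()
print(converted.dtypes)
```

Unlike `pd.to_numeric(..., errors='coerce')`, this does not turn 'hey' into NaN, so the mixed column keeps its string.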


df = pd.DataFrame([[1,2,3],[4,5,'hey'],[7,8,9]])

col_sum = df.apply(pd.to_numeric, errors='coerce').sum()
row_sum = df.apply(pd.to_numeric, errors='coerce').sum(1)


0    12.0
1    15.0
2    12.0
dtype: float64


0     6.0
1     9.0
2    24.0
dtype: float64
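
To see why these sums come back as float64, it may help to look at a single column. The sketch below relies only on `pd.to_numeric` and on the default NaN-skipping of `Series.sum`; the name `col` is illustrative.

```python
import pandas as pd

# One mixed column, like column 2 of the frame above.
col = pd.Series([3, 'hey', 9], dtype=object)

# 'hey' cannot be parsed, so errors='coerce' replaces it with NaN,
# and the presence of NaN makes the result a float64 series.
numeric = pd.to_numeric(col, errors='coerce')
print(numeric.tolist())  # [3.0, nan, 9.0]
print(numeric.dtype)     # float64

# sum() skips NaN by default (skipna=True).
print(numeric.sum())     # 12.0
```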

Answer 1 (score: 2)



>>> df1.equals(df3)
False # not so useful, doesn't tell you why they differ


>>> import pandas.testing
>>> pandas.testing.assert_frame_equal(df1, df3)

AssertionError: Attributes are different

Attribute "dtype" are different
[left]:  int64
[right]: object

pandas.testing.assert_frame_equal() has the following useful args, which let you customize exactly what is checked:

check_dtype : bool, default True    
Whether to check the DataFrame dtype is identical.

check_index_type : bool / string {‘equiv’}, default False    
Whether to check the Index class, dtype and inferred_type are identical.

check_column_type : bool / string {‘equiv’}, default False    
Whether to check the columns class, dtype and inferred_type are identical.

check_frame_type : bool, default False    
Whether to check the DataFrame class is identical.

check_less_precise : bool or int, default False    
Specify comparison precision. Only used when check_exact is False. 5 digits (False) or 3 digits (True) after decimal points are compared. If int, then specify the digits to compare

check_names : bool, default True    
Whether to check the Index names attribute.

by_blocks : bool, default False    
Specify how to compare internal data. If False, compare by columns. If True, compare by blocks.

check_exact : bool, default False    
Whether to compare number exactly.

check_datetimelike_compat : bool, default False    
Compare datetime-like which is comparable ignoring dtype.

check_categorical : bool, default True    
Whether to compare internal Categorical exactly.

check_like : bool, default False    
If true, ignore the order of rows & columns
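
As a small usage sketch of these arguments (the frames `left` and `right` are hypothetical stand-ins for `df1` and `df3`), the dtype difference can be captured as a message, while `check_dtype=False` compares the values only:

```python
import pandas as pd
from pandas.testing import assert_frame_equal

left = pd.DataFrame({'a': [1, 2, 3]})   # int64 column
right = left.astype(object)             # same values, object dtype

# The default comparison checks dtypes and reports the mismatch.
try:
    assert_frame_equal(left, right)
    message = None
except AssertionError as err:
    message = str(err)
print(message)

# Ignoring dtype, the two frames hold the same values, so this passes.
assert_frame_equal(left, right, check_dtype=False)
```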