Pandas improvement

Time: 2017-02-14 09:00:09

Tags: python pandas numpy

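I currently have a pandas DataFrame in which I compare columns. I found a case where the columns being compared are empty, and for some reason the comparison then returns the else value. I added an extra statement to reset the result to empty. I would like to know whether I can simplify this and use only one statement.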

df['doc_type'].loc[(df['a_id'].isnull() & df['b_id'].isnull())] = ''

Code

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        'a_id': ['A', 'B', 'C', 'D', '', 'F', ''],
        'a_score': [1, 2, 3, 4, '', 6, ''],
        'b_id': ['a', 'b', 'c', 'd', 'e', 'f', ''],
        'b_score': [0.1, 0.2, 3.1, 4.1, 5, 5.99, ''],
    })
    print(df)
    # Replace empty string with NaN
    df = df.apply(lambda x: x.str.strip() if isinstance(x, str) else x).replace('', np.nan)

    # Calculate higher score
    df['doc_id'] = df.apply(lambda df: df['a_id'] if df['a_score'] >= df['b_score'] else df['b_id'], axis=1)

    # Select type based on higher score
    df['doc_type'] = df.apply(lambda df: 'a' if df['a_score'] >= df['b_score'] else 'b', axis=1)
    print(df)
    # Update type when both ids are empty
    df['doc_type'].loc[(df['a_id'].isnull() & df['b_id'].isnull())] = ''
    print(df)

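1 answer:

Answer 0 (score: 2)

You can use numpy.where instead of apply. For the assignment, it is better to select the rows by boolean indexing and the column(s) in a single loc call: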

    df.loc[mask, 'colname'] = val

    # Replace empty string with NaN
    df = df.apply(lambda x: x.str.strip() if isinstance(x, str) else x).replace('', np.nan)

    # Calculate higher score
    df['doc_id'] = np.where(df['a_score'] >= df['b_score'], df['a_id'], df['b_id'])
    # Select type based on higher score
    df['doc_type'] = np.where(df['a_score'] >= df['b_score'], 'a', 'b')
    print(df)
    # Update type when is empty
    df.loc[(df['a_id'].isnull() & df['b_id'].isnull()), 'doc_type'] = ''
    print(df)

      a_id  a_score b_id  b_score doc_id doc_type
    0    A      1.0    a     0.10      A        a
    1    B      2.0    b     0.20      B        a
    2    C      3.0    c     3.10      c        b
    3    D      4.0    d     4.10      d        b
    4  NaN      NaN    e     5.00      e        b
    5    F      6.0    f     5.99      F        a
    6  NaN      NaN  NaN      NaN    NaN
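
The extra assignment for the last row is needed because a comparison involving NaN evaluates to False, so numpy.where falls through to the else value. A minimal sketch of that behaviour, assuming a small hypothetical frame named `scores` that reuses the column names but is not the original data:

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame: one normal row, one row with both scores missing.
scores = pd.DataFrame({
    'a_score': [1.0, np.nan],
    'b_score': [0.1, np.nan],
})

# Any comparison with NaN is False, so the missing row does not satisfy a >= b.
mask = scores['a_score'] >= scores['b_score']
print(mask.tolist())

# np.where therefore picks the else value ('b') for the missing row.
doc_type = np.where(mask, 'a', 'b')
print(doc_type.tolist())
```

The same applies to row 4 of the data above, where only the a columns are missing: the comparison is False and the b values are chosen, which is the desired result there.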

An alternative for the mask is DataFrame.all with axis=1, which checks whether all values in each row are True:

    print(df[['a_id', 'b_id']].isnull())

        a_id   b_id
    0  False  False
    1  False  False
    2  False  False
    3  False  False
    4   True  False
    5  False  False
    6   True   True

    print(df[['a_id', 'b_id']].isnull().all(axis=1))

    0    False
    1    False
    2    False
    3    False
    4    False
    5    False
    6     True
    dtype: bool

    df.loc[df[['a_id', 'b_id']].isnull().all(axis=1), 'doc_type'] = ''
    print(df)

      a_id  a_score b_id  b_score doc_id doc_type
    0    A      1.0    a     0.10      A        a
    1    B      2.0    b     0.20      B        a
    2    C      3.0    c     3.10      c        b
    3    D      4.0    d     4.10      d        b
    4  NaN      NaN    e     5.00      e        b
    5    F      6.0    f     5.99      F        a
    6  NaN      NaN  NaN      NaN    NaN
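
The two ways of building the mask should select the same rows. The sketch below checks this on a small hypothetical frame (`ids`, `mask_and` and `mask_all` are illustrative names, not from the original post):

```python
import numpy as np
import pandas as pd

# Hypothetical ids: row 0 complete, row 1 missing only a_id, row 2 missing both.
ids = pd.DataFrame({
    'a_id': ['A', np.nan, np.nan],
    'b_id': ['a', 'e', np.nan],
})

# Explicit form: one isnull() check per column, combined with &.
mask_and = ids['a_id'].isnull() & ids['b_id'].isnull()

# Row-wise form: True only where every selected column is null.
mask_all = ids[['a_id', 'b_id']].isnull().all(axis=1)

print(mask_and.tolist())
print(mask_all.tolist())
```

The all(axis=1) form is mainly convenient when more than two columns have to be checked, since only the column list grows.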

But better still is to use a nested numpy.where:

    # Replace empty string with NaN
    df = df.apply(lambda x: x.str.strip() if isinstance(x, str) else x).replace('', np.nan)

    # Store the masks in variables so the comparison is not evaluated twice
    mask = df['a_score'] >= df['b_score']
    mask1 = (df['a_id'].isnull() & df['b_id'].isnull())
    # Alternative solution for mask1
    # mask1 = df[['a_id', 'b_id']].isnull().all(axis=1)
    # Calculate higher score
    df['doc_id'] = np.where(mask, df['a_id'], df['b_id'])
    # Select type based on higher score
    df['doc_type'] = np.where(mask, 'a', np.where(mask1, '', 'b'))
    print(df)

      a_id  a_score b_id  b_score doc_id doc_type
    0    A      1.0    a     0.10      A        a
    1    B      2.0    b     0.20      B        a
    2    C      3.0    c     3.10      c        b
    3    D      4.0    d     4.10      d        b
    4  NaN      NaN    e     5.00      e        b
    5    F      6.0    f     5.99      F        a
    6  NaN      NaN  NaN      NaN    NaN
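
If the nesting becomes hard to read, numpy.select may be an option. This is a different function from the nested numpy.where used above, shown here only as a possible variant on hypothetical data; it takes a list of conditions, a list of choices and a default:

```python
import numpy as np
import pandas as pd

# Hypothetical data: a wins, b wins, a missing, both missing.
df = pd.DataFrame({
    'a_id': ['A', 'C', np.nan, np.nan],
    'a_score': [1.0, 3.0, np.nan, np.nan],
    'b_id': ['a', 'c', 'e', np.nan],
    'b_score': [0.1, 3.1, 5.0, np.nan],
})

mask = df['a_score'] >= df['b_score']
mask1 = df[['a_id', 'b_id']].isnull().all(axis=1)

# Conditions are tested in order; the first one that is True selects the
# matching choice, and rows matching no condition get the default.
df['doc_type'] = np.select([mask, mask1], ['a', ''], default='b')
print(df['doc_type'].tolist())
```

The order of the conditions mirrors the nested call: `mask` is checked first, then `mask1`, and everything else gets 'b'.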