Conditionally change values in Pandas DF with multilevel columns

Time: 2016-04-18 17:09:18

Tags: python pandas dataframe

Given the following DF with multilevel columns:

import numpy as np
import pandas as pd

# Two column levels: top level (foo/bar) and sub-columns (A..D)
arrays = [['foo', 'foo', 'bar', 'bar'],
          ['A', 'B', 'C', 'D']]
tuples = list(zip(*arrays))
columnValues = pd.MultiIndex.from_tuples(tuples)
df = pd.DataFrame(np.random.rand(6,4), columns = columnValues)
df['txt'] = 'aaa'
print(df)

which yields:

        foo                 bar            txt
          A         B         C         D
0  0.080029  0.710943  0.157265  0.774827  aaa
1  0.276949  0.923369  0.550799  0.758707  aaa
2  0.416714  0.440659  0.835736  0.130818  aaa
3  0.935763  0.908967  0.502363  0.677957  aaa
4  0.191245  0.291017  0.014355  0.762976  aaa
5  0.365464  0.286350  0.450263  0.509556  aaa

Question: how can I efficiently change the values to 100 if the values in the foo sub-columns are < 0.5, for a huge DF?

The following works:

In [41]: df.foo < 0.5
Out[41]:
       A      B
0   True  False
1   True  False
2   True   True
3  False  False
4   True   True
5   True   True

In [42]: df.foo[df.foo < 0.5]
Out[42]:
          A         B
0  0.080029       NaN
1  0.276949       NaN
2  0.416714  0.440659
3       NaN       NaN
4  0.191245  0.291017
5  0.365464  0.286350

But if I try to change the values, I get:

In [45]: df.foo[df.foo < 0.5] = 100
C:\Users\USER\AppData\Local\Programs\Python35\Scripts\ipython:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

If I try to use locators:

In [46]: df.foo.loc[df.foo < 0.5] = 100
...
ValueError: cannot copy sequence with size 2 to array axis with dimension 6

The same error for:

df.foo.loc[df.foo < 0.5, 'foo'] = 100

If I try:

df.loc[df.foo < 0.5, 'foo']

I get:

KeyError: 'None of [       A      B\n0   True  False\n1   True  False\n2   True   True\n3  False  False\n4   True   True\n5   True   True] are in the [index]'
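
For context, `.loc` expects a one-dimensional row indexer, so a two-dimensional boolean DataFrame such as `df.foo < 0.5` cannot serve as the row selector. A minimal sketch of a `.loc`-based alternative, assuming it is acceptable to loop over the sub-columns of `foo` and address each one by its full tuple key (the seed and the `sub`/`col` names are illustrative and not part of the question):

```python
import numpy as np
import pandas as pd

# Rebuild a frame shaped like the one in the question (seed is arbitrary).
rng = np.random.default_rng(0)
columns = pd.MultiIndex.from_tuples(
    [("foo", "A"), ("foo", "B"), ("bar", "C"), ("bar", "D")]
)
df = pd.DataFrame(rng.random((6, 4)), columns=columns)
df["txt"] = "aaa"

bar_before = df["bar"].copy()

# One .loc assignment per sub-column: the row indexer is a 1-D boolean
# Series and the column indexer is the full ("foo", sub) tuple.
for sub in df["foo"].columns:
    col = ("foo", sub)
    df.loc[df[col] < 0.5, col] = 100

print(df)
```

This assigns directly on `df` rather than on `df.foo`, so no chained indexing is involved; whether it is fast enough for a very large frame would need to be measured.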

SOLUTION - timing comparison against a DF with 10M rows:

In [19]: %timeit df.foo.applymap(lambda x: x if x >= 0.5 else 100)
1 loop, best of 3: 29.4 s per loop

In [20]: %timeit df.foo[df.foo >= 0.5].fillna(100)
1 loop, best of 3: 1.55 s per loop

John Galt:

In [21]: %timeit df.foo.where(df.foo < 0.5, 100)
1 loop, best of 3: 1.12 s per loop

B. M.:

In [5]: %timeit u=df['foo'].values;u[u<.5]=100
1 loop, best of 3: 628 ms per loop
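
These timings depend on the machine and the pandas version. A rough sketch for rerunning a comparison on a smaller frame, assuming only the standard `timeit` module; `N_ROWS`, the seed, and the helper names `with_where` and `with_numpy` are placeholders rather than the original benchmark setup. Since it is not guaranteed that `.values` returns a view of the frame, the NumPy variant here works on an explicit copy and returns a new DataFrame instead of relying on in-place mutation.

```python
import timeit

import numpy as np
import pandas as pd

N_ROWS = 100_000  # placeholder size; the question used 10M rows

rng = np.random.default_rng(0)
columns = pd.MultiIndex.from_tuples(
    [("foo", "A"), ("foo", "B"), ("bar", "C"), ("bar", "D")]
)
df = pd.DataFrame(rng.random((N_ROWS, 4)), columns=columns)


def with_where(frame):
    # Keep values >= 0.5, replace everything else with 100.
    foo = frame["foo"]
    return foo.where(foo >= 0.5, 100)


def with_numpy(frame):
    # Same replacement on a NumPy copy, rewrapped with the original labels.
    foo = frame["foo"]
    values = foo.to_numpy(copy=True)
    values[values < 0.5] = 100
    return pd.DataFrame(values, index=foo.index, columns=foo.columns)


# Both variants should agree before their timings are compared.
assert with_where(df).equals(with_numpy(df))

for func in (with_where, with_numpy):
    seconds = timeit.timeit(lambda: func(df), number=5) / 5
    print(f"{func.__name__}: {seconds:.4f} s per call")
```

Either result can then be written back with `df['foo'] = ...`, as the answer below does.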


1 answer:

Answer 0 (score: 3)

Here's one way, using where: df['foo'] = df['foo'].where(df['foo'] < 0.5, 100)

In [96]: df
Out[96]:
        foo                 bar            txt
          A         B         C         D
0  0.255309  0.237892  0.491065  0.930555  aaa
1  0.859998  0.008269  0.376213  0.984806  aaa
2  0.479928  0.761266  0.993970  0.266486  aaa
3  0.078284  0.009748  0.461687  0.653085  aaa
4  0.923293  0.642398  0.629140  0.561777  aaa
5  0.936824  0.526626  0.413250  0.732074  aaa

In [97]: df['foo'] = df['foo'].where(df['foo'] < 0.5, 100)

In [98]: df
Out[98]:
          foo                   bar            txt
            A           B         C         D
0    0.255309    0.237892  0.491065  0.930555  aaa
1  100.000000    0.008269  0.376213  0.984806  aaa
2    0.479928  100.000000  0.993970  0.266486  aaa
3    0.078284    0.009748  0.461687  0.653085  aaa
4  100.000000  100.000000  0.629140  0.561777  aaa
5  100.000000  100.000000  0.413250  0.732074  aaa
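
Note that `where` keeps the values for which the condition is true and replaces the rest, so the session above replaces the values that are not below 0.5. To replace the values that are below 0.5, as the question asks, one option is `DataFrame.mask`, which replaces entries where the condition holds. A minimal sketch on a small hand-written frame (the numbers are illustrative, not taken from the session above):

```python
import pandas as pd

columns = pd.MultiIndex.from_tuples(
    [("foo", "A"), ("foo", "B"), ("bar", "C"), ("bar", "D")]
)
df = pd.DataFrame(
    [[0.1, 0.9, 0.2, 0.8],
     [0.7, 0.3, 0.6, 0.4]],
    columns=columns,
)

# mask() replaces entries where the condition is True, so values
# below 0.5 become 100 and the rest are kept.
df["foo"] = df["foo"].mask(df["foo"] < 0.5, 100)

# Same result with where() when there are no NaN values:
# keep entries >= 0.5, replace the others.
# df["foo"] = df["foo"].where(df["foo"] >= 0.5, 100)

print(df)
```

Only the `foo` sub-columns are touched; the `bar` columns keep their original values.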