Question

Suppose I have a DataFrame like this:

import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame({'A':np.random.randn(5), 'B': np.zeros(5), 'C': np.zeros(5)})
df
>>>
          A    B    C
0  0.496714  0.0  0.0
1 -0.138264  0.0  0.0
2  0.647689  0.0  0.0
3  1.523030  0.0  0.0
4 -0.234153  0.0  0.0

I have a list of columns in which a value of 1 should be filled wherever A is negative.

idx = df.A < 0
cols = ['B', 'C']

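So in this case I want to set the cells at [1, 'B'] and [4, 'C'] to 1.

What I've tried:

However, df.loc[idx, cols] = 1 sets every column in cols to 1 for the matching rows, not just a single column per row. I also tried df.loc[idx, cols] = pd.get_dummies(cols), which gives: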

          A    B    C
0  0.496714  0.0  0.0
1 -0.138264  0.0  1.0
2  0.647689  0.0  0.0
3  1.523030  0.0  0.0
4 -0.234153  NaN  NaN

I think this is because the index of the get_dummies result is not aligned with the index of the DataFrame.
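The misalignment can be checked directly. The following is a minimal sketch, not part of the original question: it assumes the same df, idx and cols as above, and the names dummies and fixed are introduced here only for illustration. pd.get_dummies(cols) is indexed 0 and 1, while the rows selected by idx are labelled 1 and 4, so only label 1 matches during assignment. Relabelling the dummy frame with the selected row labels should make the assignment line up.

```python
import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame({'A': np.random.randn(5), 'B': np.zeros(5), 'C': np.zeros(5)})
idx = df.A < 0
cols = ['B', 'C']

# One indicator row per entry of cols; the default index is 0, 1
dummies = pd.get_dummies(cols, dtype=float)
default_labels = dummies.index.tolist()

# Relabel the indicator rows with the labels of the rows being assigned (1 and 4)
dummies.index = df.index[idx]

fixed = df.copy()
# .loc aligns on both index and column labels, so each row now gets its own indicator
fixed.loc[idx, cols] = dummies
print(fixed)
```

This only works when the number of matching rows equals the number of entries in cols, and it overwrites the other listed columns in those rows with 0, which is harmless here because they start as zeros.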

Expected output:

          A    B    C
0  0.496714  0.0  0.0
1 -0.138264  1.0  0.0
2  0.647689  0.0  0.0
3  1.523030  0.0  0.0
4 -0.234153  0.0  1.0

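So what is the best (that is, fastest) way to do this? In my case there are a thousand rows and five columns.

Timing results:

TL;DR: editing the values directly is faster.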

%%timeit
df.values[idx, df.columns.get_indexer(cols)] = 1

123 µs ± 2.5 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%%timeit
df.iloc[idx.array,df.columns.get_indexer(cols)]=1

266 µs ± 7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
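One caveat when comparing the two timed statements: they do not appear to assign the same cells. The sketch below is an illustration added here, not part of the original timings, and the names paired and crossed are hypothetical. NumPy pairs the i-th selected row with the i-th column position, whereas iloc with two list-like indexers selects every combination of the selected rows and columns.

```python
import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame({'A': np.random.randn(5), 'B': np.zeros(5), 'C': np.zeros(5)})
mask = (df.A < 0).to_numpy()
cols = ['B', 'C']
pos = df.columns.get_indexer(cols)  # integer positions of B and C

# NumPy fancy indexing: row 1 is paired with B, row 4 with C (two cells)
arr = df.to_numpy(copy=True)
arr[mask, pos] = 1
paired = pd.DataFrame(arr, index=df.index, columns=df.columns)

# iloc: rows {1, 4} crossed with columns {B, C} (four cells)
crossed = df.copy()
crossed.iloc[mask, pos] = 1

print(paired)
print(crossed)
```

If this reading is right, only the NumPy form yields the expected output shown in the question, so the timing comparison is between two different operations.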

Answer 1

Use NumPy indexing for better performance:

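idx = df.A < 0
res = ['B', 'C']
# NumPy array holding the frame's values
arr = df.values
# Pair each matching row with the column at the same position in res
arr[idx, df.columns.get_indexer(res)] = 1
print(arr)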
[[ 0.49671415  0.          0.        ]
 [-0.1382643   1.          0.        ]
 [ 0.64768854  0.          0.        ]
 [ 1.52302986  0.          0.        ]
 [-0.23415337  0.          1.        ]]

Alternatively, build a new DataFrame from the array:

df = pd.DataFrame(arr, columns=df.columns, index=df.index)
print(df)
          A    B    C
0  0.496714  0.0  0.0
1 -0.138264  1.0  0.0
2  0.647689  0.0  0.0
3  1.523030  0.0  0.0
4 -0.234153  0.0  1.0
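A variant of the same idea that does not depend on whether df.values returns a writable view is sketched below. This is an assumption-laden illustration rather than the answerer's code: in pandas versions with Copy-on-Write enabled, the array returned by df.values may be read-only, so the sketch takes an explicit copy with to_numpy(copy=True) and rebuilds the frame. The names rows, col_pos and out are introduced here for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame({'A': np.random.randn(5), 'B': np.zeros(5), 'C': np.zeros(5)})
cols = ['B', 'C']

# Integer positions of the rows where A is negative
rows = np.flatnonzero((df['A'] < 0).to_numpy())
# Integer positions of the target columns
col_pos = df.columns.get_indexer(cols)

# Pairwise assignment needs exactly one column per matching row
if len(rows) != len(col_pos):
    raise ValueError("number of matching rows must equal number of columns")

# Work on an explicit copy so the original frame is never modified
arr = df.to_numpy(copy=True)
arr[rows, col_pos] = 1

out = pd.DataFrame(arr, index=df.index, columns=df.columns)
print(out)
```

The explicit copy costs some time compared with writing into df.values directly, so it trades part of the speed advantage for predictable behaviour.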

Answer 2

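# Labels of the rows where A is negative
ind = df.index[idx]
# Pair each matching row label with the column at the same position in res
for row, col in zip(ind, res):
    df.at[row, col] = 1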

In [7]: df
Out[7]:
          A    B    C
0  0.496714  0.0  0.0
1 -0.138264  1.0  0.0
2  0.647689  0.0  0.0
3  1.523030  0.0  0.0
4 -0.234153  0.0  1.0
