Edit data frame entries using a groupby object -- pandas

Date: 2015-05-17 18:14:44

Tags: python pandas group-by dataframe

Consider the following DataFrame:

     index      count     signal
       1          1         1
       2          1        NaN
       3          1        NaN
       4          1        -1
       5          1        NaN
       6          2        NaN
       7          2        -1
       8          2        NaN
       9          3        NaN
       10         3        NaN
       11         3        NaN
       12         4        1
       13         4        NaN
       14         4        NaN

I need to "fill" the NaNs in "signal", and rows with different "count" values must not affect each other, so that I end up with the following DataFrame:

     index      count     signal
       1          1         1
       2          1         1
       3          1         1
       4          1        -1
       5          1        -1
       6          2        NaN
       7          2        -1
       8          2        -1
       9          3        NaN
       10         3        NaN
       11         3        NaN
       12         4        1
       13         4        1
       14         4        1

Right now I loop over the groups one by one, fill the NaN values, and then copy each group into a new DataFrame:

import numpy as np
import pandas as pd

new_table = np.array([])
for key, group in df.groupby('count'):
    # Forward-fill 'signal' within this group only
    group['signal'] = group['signal'].fillna(method='ffill')
    group1 = group.copy()
    if new_table.shape[0] == 0:
        new_table = group1
    else:
        new_table = pd.concat([new_table, group1])

This works, but it is really slow given how large the DataFrame is. I am wondering whether there is another way to do this, with or without the groupby method. Thanks!

Edit:

Thanks to Alexander and jwilner for the alternative approaches. However, both methods are still very slow on my large DataFrame with 800,000 rows of data.

4 answers:

Answer 0 (score: 2)

Use the apply method.

In [56]: df = pd.DataFrame({"count": [1] * 4 + [2] * 5 + [3] * 2 , "signal": [1] + [None] * 4 + [-1] + [None] * 5})

In [57]: df
Out[57]:
    count  signal
0       1       1
1       1     NaN
2       1     NaN
3       1     NaN
4       2     NaN
5       2      -1
6       2     NaN
7       2     NaN
8       2     NaN
9       3     NaN
10      3     NaN

[11 rows x 2 columns]

In [58]: def ffill_signal(df):
   ....:     df["signal"] = df["signal"].ffill()
   ....:     return df
   ....:

In [59]: df.groupby("count").apply(ffill_signal)
Out[59]:
    count  signal
0       1       1
1       1       1
2       1       1
3       1       1
4       2     NaN
5       2      -1
6       2      -1
7       2      -1
8       2      -1
9       3     NaN
10      3     NaN

[11 rows x 2 columns]

However, be aware that groupby reorders things. If the count column does not always stay the same or increase, but can instead repeat earlier values, groupby may be a problem. That is, for a count series like [1, 1, 2, 2, 1], groupby will group it as [1, 1, 1], [2, 2], which could have an undesirable effect on your forward fill. If that is unwanted, you will have to create a new series to use with groupby, one that stays the same or increases whenever the count series changes -- probably using pd.Series.diff and pd.Series.cumsum, as sketched below.
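A minimal sketch of that diff/cumsum idea (the column names follow the example above; the exact construction is an assumption, not code from the answer):

import pandas as pd

df = pd.DataFrame({"count": [1, 1, 2, 2, 1],
                   "signal": [1, None, -1, None, None]})

# Start a new group id whenever 'count' changes, so non-adjacent repeats
# of the same count value land in separate groups.
group_id = df["count"].diff().ne(0).cumsum()

# Forward-fill within each of these monotone groups
df["signal"] = df.groupby(group_id)["signal"].ffill()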

Answer 1 (score: 1)

Another solution is to create a pivot table, forward-fill the values, and then map them back onto the original DataFrame:


df2 = df.pivot(columns='count', values='signal', index='index').ffill()
df['signal'] = [df2.at[i, c] 
                for i, c in zip(df2.index, df['count'].tolist())]

>>> df
    count  index  signal
0       1      1       1
1       1      2       1
2       1      3       1
3       1      4      -1
4       1      5      -1
5       2      6     NaN
6       2      7      -1
7       2      8      -1
8       3      9     NaN
9       3     10     NaN
10      3     11     NaN
11      4     12       1
12      4     13       1
13      4     14       1

With 800,000 rows of data, how well this approach works depends on how many unique values there are in 'count'.

Compared with my previous answer:

%%timeit
for c in df['count'].unique():
    df.loc[df['count'] == c, 'signal'] = df[df['count'] == c].ffill()

100 loops, best of 3: 4.1 ms per loop

%%timeit
df2 = df.pivot(columns='count', values='signal', index='index').ffill()
df['signal'] = [df2.at[i, c] for i, c in zip(df2.index, df['count'].tolist())]

1000 loops, best of 3: 1.32 ms per loop

Finally, you can simply use groupby, although it is slower than the previous method.
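A minimal sketch of that groupby variant, assuming the same columns as in the example above (an illustration, not the answer's exact code):

# Forward-fill 'signal' within each 'count' group in one line
df['signal'] = df.groupby('count')['signal'].ffill()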

Answer 2 (score: 1)

I know it is quite late, but I found a solution that is much faster than the ones proposed: collect the updated DataFrames in a list and concatenate them only at the end. Using the example above:

new_table = []
for key, group in df.groupby('count'):
    group = group.copy()
    # Forward-fill within the group, then stash it for a single concat at the end
    group['signal'] = group['signal'].fillna(method='ffill')
    new_table.append(group)

new_table = pd.concat(new_table).reset_index(drop=True)
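Appending each group to a plain list and calling pd.concat once at the end avoids re-copying the accumulated DataFrame on every iteration, which is what makes the pd.concat-inside-the-loop version in the question slow.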

Answer 3 (score: 0)

Assuming the data is pre-sorted on df['index'], try using loc instead:

for c in df['count'].unique():
    df.loc[df['count'] == c, 'signal'] = df[df['count'] == c].ffill()

>>> df
    index  count signal
0       1      1      1
1       2      1      1
2       3      1      1
3       4      1     -1
4       5      1     -1
5       6      2    NaN
6       7      2     -1
7       8      2     -1
8       9      3    NaN
9      10      3    NaN
10     11      3    NaN
11     12      4      1
12     13      4      1
13     14      4      1