Consider the following data frame:
index count signal
1 1 1
2 1 NAN
3 1 NAN
4 1 -1
5 1 NAN
6 2 NAN
7 2 -1
8 2 NAN
9 3 NAN
10 3 NAN
11 3 NAN
12 4 1
13 4 NAN
14 4 NAN
I need to "forward fill" the NaN values in "signal", and values with different "count" values should not affect each other, so that I end up with the following data frame:
index count signal
1 1 1
2 1 1
3 1 1
4 1 -1
5 1 -1
6 2 NAN
7 2 -1
8 2 -1
9 3 NAN
10 3 NAN
11 3 NAN
12 4 1
13 4 1
14 4 1
Right now I go through the data frame group by group, fill the NaN values, and then copy each group into a new data frame:
new_table = np.array([])
for key, group in df.groupby('count'):
    group['signal'] = group['signal'].fillna(method='ffill')
    group1 = group.copy()
    if new_table.shape[0] == 0:
        new_table = group1
    else:
        new_table = pd.concat([new_table, group1])
This works, but it is really slow given that the data frame is large. I was wondering whether there is any other way to do this, with or without groupby. Thanks!
Edit:
Thanks to Alexander and jwilner for providing alternative methods. However, both approaches are still very slow for a large data frame with 800,000 rows of data.
Answer 0 (score: 2)
Use the apply method.
In [56]: df = pd.DataFrame({"count": [1] * 4 + [2] * 5 + [3] * 2 , "signal": [1] + [None] * 4 + [-1] + [None] * 5})
In [57]: df
Out[57]:
    count  signal
0       1       1
1       1     NaN
2       1     NaN
3       1     NaN
4       2     NaN
5       2      -1
6       2     NaN
7       2     NaN
8       2     NaN
9       3     NaN
10      3     NaN

[11 rows x 2 columns]
In [58]: def ffill_signal(df):
   ....:     df["signal"] = df["signal"].ffill()
   ....:     return df
   ....:
In [59]: df.groupby("count").apply(ffill_signal)
Out[59]:
    count  signal
0       1       1
1       1       1
2       1       1
3       1       1
4       2     NaN
5       2      -1
6       2      -1
7       2      -1
8       2      -1
9       3     NaN
10      3     NaN

[11 rows x 2 columns]
However, be aware that groupby reorders things. If the count column does not always stay the same or increase, but can instead repeat earlier values, groupby may be problematic. That is, given a count series like [1, 1, 2, 2, 1], groupby will group it as [1, 1, 1], [2, 2], which could have undesirable implications for your forward filling. If that is not what you want, you will have to create a new series to use with groupby, one that always stays the same or increases based on changes in the count series -- possibly using pd.Series.diff and a cumulative sum.
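For illustration, here is a minimal sketch of one way to build such a monotone group key. This snippet is not part of the original answer; the diff/cumsum idiom is assumed:

import pandas as pd

count = pd.Series([1, 1, 2, 2, 1])
# Every change in 'count' starts a new group, so the trailing 1 gets its own
# group instead of being merged with the leading run of 1s.
group_key = (count.diff() != 0).cumsum()
print(group_key.tolist())  # [1, 1, 2, 2, 3]

Grouping on a key like this instead of on count itself keeps the forward fill confined to contiguous runs.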
Answer 1 (score: 1)
Another solution is to create a pivot table, forward fill the values, and then map them back onto the original DataFrame.
df2 = df.pivot(columns='count', values='signal', index='index').ffill()
df['signal'] = [df2.at[i, c]
                for i, c in zip(df2.index, df['count'].tolist())]

>>> df
    count  index  signal
0       1      1       1
1       1      2       1
2       1      3       1
3       1      4      -1
4       1      5      -1
5       2      6     NaN
6       2      7      -1
7       2      8      -1
8       3      9     NaN
9       3     10     NaN
10      3     11     NaN
11      4     12       1
12      4     13       1
13      4     14       1

With 800,000 rows of data, the efficacy of this approach depends on how many unique values there are in 'count'.

Compared to my prior answer:

%%timeit
for c in df['count'].unique():
    df.loc[df['count'] == c, 'signal'] = df[df['count'] == c].ffill()
100 loops, best of 3: 4.1 ms per loop

%%timeit
df2 = df.pivot(columns='count', values='signal', index='index').ffill()
df['signal'] = [df2.at[i, c] for i, c in zip(df2.index, df['count'].tolist())]
1000 loops, best of 3: 1.32 ms per loop

Finally, you can simply use groupby, although it is slower than the previous method.
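A minimal sketch of that groupby variant (assumed; the exact snippet the timing refers to is not shown above):

# Assumed groupby form (not necessarily the author's exact timed code):
# forward-fill 'signal' within each 'count' group only.
df['signal'] = df.groupby('count')['signal'].ffill()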
Answer 2 (score: 1)
I know this is quite late, but I found a solution that is considerably faster than the ones proposed: collect the updated data frames in a list and concatenate them only at the end. To take the example above:
new_table = []
for key, group in df.groupby('count'):
    group['signal'] = group['signal'].fillna(method='ffill')
    group1 = group.copy()
    if len(new_table) == 0:
        new_table = [group1]
    else:
        new_table.append(group1)
new_table = pd.concat(new_table).reset_index(drop=True)
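Since new_table starts out as an empty list, the emptiness check is not strictly needed; a simplified sketch of the same collect-then-concat pattern (an assumed equivalent, not the answerer's exact code):

parts = []
for _, group in df.groupby('count'):
    g = group.copy()                    # work on a copy of the group
    g['signal'] = g['signal'].ffill()   # forward-fill within this group only
    parts.append(g)                     # collect; concatenate once at the end
new_table = pd.concat(parts).reset_index(drop=True)

The single pd.concat at the end is what makes this faster than growing a DataFrame inside the loop.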
Answer 3 (score: 0)
Assuming the data is pre-sorted on df['index'], try using loc instead:
for c in df['count'].unique():
    df.loc[df['count'] == c, 'signal'] = df[df['count'] == c].ffill()
>>> df
    index  count  signal
0       1      1       1
1       2      1       1
2       3      1       1
3       4      1      -1
4       5      1      -1
5       6      2     NaN
6       7      2      -1
7       8      2      -1
8       9      3     NaN
9      10      3     NaN
10     11      3     NaN
11     12      4       1
12     13      4       1
13     14      4       1