I have a pandas DataFrame df1 that looks like this:
Serial N  year  current
B         10    14
B         10    16
B         11    10
B         11
B         11    15
C         12    11
C               9
C         12    13
C         12
D         3     4
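I want to count the number of occurrences of each unique Serial N. If a serial number occurs fewer than 2 times, I want to replace the year and current values in that row with nan. I would like to end up with something like this: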
Serial N  year  current
B         10    14
B         10    16
B         11    10
B         11
B         11    15
C         12    11
C               9
C         12    13
C         12
D         nan   nan
Answer 0: (score: 1)
You can combine value_counts, lt, and reindex to get a Boolean array marking the rows whose values should be changed to nan, then use loc to make the change.
import numpy as np
# True for every row whose Serial N occurs fewer than 2 times
serial_filter = df1['Serial N'].value_counts().lt(2).reindex(df1['Serial N'])
df1.loc[serial_filter.values, ['year', 'current']] = np.nan
Resulting output:
Serial N year current
0 B 10.0 14.0
1 B 10.0 16.0
2 B 11.0 10.0
3 B 11.0 NaN
4 B 11.0 15.0
5 C 12.0 11.0
6 C NaN 9.0
7 C 12.0 13.0
8 C 12.0 NaN
9 D NaN NaN
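For reference, here is a self-contained sketch of the approach above that can be run end to end. The way the frame is built (a dict with np.nan standing in for the blank cells) and the intermediate name counts are assumptions made for illustration; the masking logic itself follows the two lines shown in the answer.

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the question; blank cells are assumed to be NaN.
df1 = pd.DataFrame({
    'Serial N': list('BBBBBCCCCD'),
    'year': [10, 10, 11, 11, 11, 12, np.nan, 12, 12, 3],
    'current': [14, 16, 10, np.nan, 15, 11, 9, 13, np.nan, 4],
})

# Count rows per serial and flag the serials seen fewer than 2 times.
counts = df1['Serial N'].value_counts()  # "counts" is an illustrative name

# Repeat each flag once per row of df1, in row order.
serial_filter = counts.lt(2).reindex(df1['Serial N'])

# Blank out year/current on the flagged rows only.
df1.loc[serial_filter.values, ['year', 'current']] = np.nan
print(df1)
```

Passing serial_filter.values rather than serial_filter itself matters here: after reindex the Series is labelled by serial, not by the row labels of df1, so loc needs the plain Boolean array.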
Answer 1: (score: 0)
import pandas as pd
from io import StringIO  # Python 3; the old StringIO module is Python 2 only

text = """Serial_N year current
B 10 14
B 10 16
B 11 10
B 11 nan
B 11 15
C 12 11
C nan 9
C 12 13
C 12 nan
D 3 4"""
# Columns are whitespace-separated; "nan" is read as a missing value
df1 = pd.read_csv(StringIO(text), sep=r'\s+')
df1.columns = ['Serial N', 'year', 'current']
Now df1 is what I showed above.
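# Number of rows for each unique Serial N
serial_filter = df1.groupby('Serial N').size()
# Keep only the serials that occur more than once
serial_filter = serial_filter[serial_filter > 1]
# True for rows whose serial survived the filter (.ix no longer exists in pandas)
mask = df1['Serial N'].isin(serial_filter.index)
df1 = df1[mask]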
serial_filter = df1.groupby('Serial N').size()
print(serial_filter)
Serial N
B 5
C 4
D 1
dtype: int64
This produces the count for each unique Serial N.
serial_filter = serial_filter[serial_filter > 1]
print(serial_filter)
Serial N
B 5
C 4
dtype: int64
Redefine it so that it contains only the Serial N values with a count greater than 1.
mask = df1['Serial N'].isin(serial_filter.index)
print(mask)
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 True
8 True
9 False
dtype: bool
Create the mask to apply to df1.
df1 = df1[mask]
print(df1)
Serial N year current
0 B 10.0 14.0
1 B 10.0 16.0
2 B 11.0 10.0
3 B 11.0 NaN
4 B 11.0 15.0
5 C 12.0 11.0
6 C NaN 9.0
7 C 12.0 13.0
8 C 12.0 NaN
Update df1.
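The steps above drop the D row entirely, whereas the question asks to keep that row and set its year and current values to nan. Below is a minimal sketch of that variant, reusing the same counting idea. The frame construction and the name rare are assumptions introduced for illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the question; blank cells are assumed to be NaN.
df1 = pd.DataFrame({
    'Serial N': list('BBBBBCCCCD'),
    'year': [10, 10, 11, 11, 11, 12, np.nan, 12, 12, 3],
    'current': [14, 16, 10, np.nan, 15, 11, 9, 13, np.nan, 4],
})

# Number of rows for each unique Serial N.
serial_filter = df1.groupby('Serial N').size()
# Keep only the serials that occur more than once.
serial_filter = serial_filter[serial_filter > 1]

# Rows whose serial is NOT among the frequent ones ("rare" is a hypothetical name).
rare = ~df1['Serial N'].isin(serial_filter.index)

# Keep every row, but blank out year/current on the rare ones.
df1.loc[rare, ['year', 'current']] = np.nan
print(df1)
```

Because rare is built from df1['Serial N'] directly, it shares the row labels of df1 and can be handed to loc as is.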