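I have a DataFrame with the following data. Each row represents one word occurring in an episode of a TV series: if a word appears 3 times in an episode, the pandas DataFrame has 3 rows for it. I now need to filter the list of words so that I keep only the words that occur 2 or more times. I can do this with groupby, but if a word occurs 2 (or say 3, 4 or 5) times, I need 2 (or 3, 4 or 5) rows for it.
With groupby I only get the unique entries and their counts, but I need each entry repeated as many times as it occurs in the dialogue. Is there a one-liner for this?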
dialogue episode
0 music 1
1 corrections 1
2 somnath 1
3 yadav 5
4 join 2
5 instagram 1
6 wind 2
7 music 1
8 whimpering 2
9 music 1
10 wind 3
So ideally I should get:
dialogue episode
0 music 1
6 wind 2
7 music 1
9 music 1
10 wind 3
because these are the only words that occur 2 or more times.
Answer 0 (score: 5)
You can use groupby's filter:
In [11]: df.groupby("dialogue").filter(lambda x: len(x) > 1)
Out[11]:
dialogue episode
0 music 1
6 wind 2
7 music 1
9 music 1
10 wind 3
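As a minimal, self-contained sketch of the filter approach above, the snippet below rebuilds the question's data and wraps the call in a helper. The name `keep_frequent` and its `min_count` parameter are hypothetical and only there for illustration; they are not part of pandas.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "dialogue": ["music", "corrections", "somnath", "yadav", "join", "instagram",
                 "wind", "music", "whimpering", "music", "wind"],
    "episode": [1, 1, 1, 5, 2, 1, 2, 1, 2, 1, 3],
})


def keep_frequent(frame, min_count=2):
    """Hypothetical helper: keep every row whose word occurs at least min_count times."""
    # filter() calls the lambda once per group and keeps or drops the whole group.
    return frame.groupby("dialogue").filter(lambda g: len(g) >= min_count)


result = keep_frequent(df, min_count=2)
print(result)
```

Because filter keeps or drops whole groups, every occurrence of a surviving word stays in the result with its original index label.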
Answer 1 (score: 4)
Answer to the updated question:
In [208]: df.groupby('dialogue')['episode'].transform('size') >= 3
Out[208]:
0 True
1 False
2 False
3 False
4 False
5 False
6 False
7 True
8 False
9 True
10 False
dtype: bool
In [209]: df[df.groupby('dialogue')['episode'].transform('size') >= 3]
Out[209]:
dialogue episode
0 music 1
7 music 1
9 music 1
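To make the mask above easier to inspect, here is a small sketch that stores the per-row group size in a variable first. The names `counts`, `min_count` and the `occurrences` column are assumptions introduced for this example.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "dialogue": ["music", "corrections", "somnath", "yadav", "join", "instagram",
                 "wind", "music", "whimpering", "music", "wind"],
    "episode": [1, 1, 1, 5, 2, 1, 2, 1, 2, 1, 3],
})

# transform('size') broadcasts each group's size back onto its rows,
# so the result has the same index as df and can be used as a mask.
counts = df.groupby("dialogue")["episode"].transform("size")

min_count = 3  # assumed threshold, matching the updated question
frequent = df[counts >= min_count]

print(df.assign(occurrences=counts))  # each row next to its word's count
print(frequent)
```

Changing `min_count` is all that is needed to move between the "2 or more" and "3 or more" variants of the question.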
Answer to the original question:
You can use the duplicated() method:
In [202]: df[df.duplicated(subset=['dialogue'], keep=False)]
Out[202]:
dialogue episode
0 music 1
6 wind 2
7 music 1
9 music 1
10 wind 3
If you want to sort the result:
In [203]: df[df.duplicated(subset=['dialogue'], keep=False)].sort_values('dialogue')
Out[203]:
dialogue episode
0 music 1
7 music 1
9 music 1
6 wind 2
10 wind 3
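One point worth spelling out: duplicated(keep=False) marks every row whose word occurs more than once, so it only covers the "2 or more" case and has no threshold to tune. The sketch below, using variable names chosen for this example, checks that it agrees with a group-size mask at a threshold of 2.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "dialogue": ["music", "corrections", "somnath", "yadav", "join", "instagram",
                 "wind", "music", "whimpering", "music", "wind"],
    "episode": [1, 1, 1, 5, 2, 1, 2, 1, 2, 1, 3],
})

# True for every row that has at least one other row with the same word.
mask_dup = df.duplicated(subset=["dialogue"], keep=False)

# Equivalent formulation through group sizes, fixed at a threshold of 2.
mask_cnt = df.groupby("dialogue")["episode"].transform("size") >= 2

same = bool((mask_dup == mask_cnt).all())
print(same)
```

For a higher threshold such as 3, the group-size mask would be the one to adapt.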
Answer 2 (score: 1)
I use value_counts
vc = df.dialogue.value_counts() >= 2
vc = vc[vc]
df[df.dialogue.isin(vc.index)]
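A possible variation on the same idea, shown only as a sketch: map each row's word to its count and compare directly, which avoids the intermediate boolean Series. The variable names here are assumptions for illustration.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "dialogue": ["music", "corrections", "somnath", "yadav", "join", "instagram",
                 "wind", "music", "whimpering", "music", "wind"],
    "episode": [1, 1, 1, 5, 2, 1, 2, 1, 2, 1, 3],
})

# value_counts() returns a Series indexed by word; map() looks each row's word up in it.
word_counts = df["dialogue"].value_counts()
result = df[df["dialogue"].map(word_counts) >= 2]
print(result)
```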
Keep in mind, this is completely over the top. However, I'm sharpening my timing skills.
Code
from string import ascii_lowercase  # Python 3 replacement for the old `lowercase`
from timeit import timeit

import numpy as np
import pandas as pd

def pirsquared(df):
    # value_counts, then keep rows whose word is among the repeated ones
    vc = df.dialogue.value_counts() > 1
    vc = vc[vc]
    return df[df.dialogue.isin(vc.index)]

def maxu(df):
    # per-row group size used as a boolean mask
    return df[df.groupby('dialogue')['episode'].transform('size') > 1]

def andyhayden(df):
    # groupby filter, keeping whole groups with more than one row
    return df.groupby("dialogue").filter(lambda x: len(x) > 1)

rows = ['pirsquared', 'maxu', 'andyhayden']
cols = ['OP_Given', '10000_3_letters']
summary = pd.DataFrame(index=rows, columns=cols)
iterations = 10

# Data set 1: the frame given in the question
df = pd.DataFrame({'dialogue': {0: 'music', 1: 'corrections', 2: 'somnath', 3: 'yadav', 4: 'join', 5: 'instagram', 6: 'wind', 7: 'music', 8: 'whimpering', 9: 'music', 10: 'wind'}, 'episode': {0: 1, 1: 1, 2: 1, 3: 5, 4: 2, 5: 1, 6: 2, 7: 1, 8: 2, 9: 1, 10: 3}})
summary.loc['pirsquared', 'OP_Given'] = timeit(lambda: pirsquared(df), number=iterations)
summary.loc['maxu', 'OP_Given'] = timeit(lambda: maxu(df), number=iterations)
summary.loc['andyhayden', 'OP_Given'] = timeit(lambda: andyhayden(df), number=iterations)

# Data set 2: 10000 random 3-letter words
letters = np.random.choice(list(ascii_lowercase), (10000, 3))
df = pd.DataFrame({'dialogue': [''.join(row) for row in letters]})
df['episode'] = 1
summary.loc['pirsquared', '10000_3_letters'] = timeit(lambda: pirsquared(df), number=iterations)
summary.loc['maxu', '10000_3_letters'] = timeit(lambda: maxu(df), number=iterations)
summary.loc['andyhayden', '10000_3_letters'] = timeit(lambda: andyhayden(df), number=iterations)
summary
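Before comparing timings, it can be worth confirming that the three functions actually return the same rows. The check below is a sketch that redefines the functions from the timing code so it runs on its own; `pd.testing.assert_frame_equal` raises an AssertionError if two frames differ.

```python
import pandas as pd


def pirsquared(df):
    vc = df.dialogue.value_counts() > 1
    vc = vc[vc]
    return df[df.dialogue.isin(vc.index)]


def maxu(df):
    return df[df.groupby('dialogue')['episode'].transform('size') > 1]


def andyhayden(df):
    return df.groupby("dialogue").filter(lambda x: len(x) > 1)


# Sample data copied from the question.
df = pd.DataFrame({
    "dialogue": ["music", "corrections", "somnath", "yadav", "join", "instagram",
                 "wind", "music", "whimpering", "music", "wind"],
    "episode": [1, 1, 1, 5, 2, 1, 2, 1, 2, 1, 3],
})

reference = pirsquared(df)
# Each call raises if the frames differ in values, index, or dtypes.
pd.testing.assert_frame_equal(reference, maxu(df))
pd.testing.assert_frame_equal(reference, andyhayden(df))
print("all three approaches agree")
```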