Filtering data with groupby in pandas

Time: 2016-07-23 16:59:14

Tags: python pandas dataframe

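I have a DataFrame with the following data. Each row represents one word that occurs in an episode of a TV series. If a word occurs 3 times in an episode, the DataFrame has 3 rows for it. Now I need to filter the list of words so that I keep only the words that occur 2 or more times. I can do this with groupby, but if a word occurs 2 (or, say, 3, 4, or 5) times, I need to keep all 2 (or 3, 4, or 5) of its rows.

With groupby I only get the unique entries and their counts, but I need each entry repeated as many times as it appears in the dialogue. Is there a one-liner for this?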

       dialogue  episode
0         music        1
1   corrections        1
2       somnath        1
3         yadav        5
4          join        2
5     instagram        1
6          wind        2
7         music        1
8    whimpering        2
9         music        1
10         wind        3

So ideally I should get:

   dialogue  episode
0     music        1
6      wind        2
7     music        1
9     music        1
10     wind        3

since these are the only words that occur 2 or more times.

3 Answers:

Answer 0: (score: 5)

You can use groupby's filter:

In [11]: df.groupby("dialogue").filter(lambda x: len(x) > 1)
Out[11]:
   dialogue  episode
0     music        1
6      wind        2
7     music        1
9     music        1
10     wind        3
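For anyone who wants to run this outside an IPython session, here is a minimal self-contained sketch. The frame is rebuilt from the sample in the question, and `min_count` is a name I made up for the threshold; it is not part of the pandas API.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "dialogue": ["music", "corrections", "somnath", "yadav", "join",
                 "instagram", "wind", "music", "whimpering", "music", "wind"],
    "episode": [1, 1, 1, 5, 2, 1, 2, 1, 2, 1, 3],
})

min_count = 2  # hypothetical threshold name, chosen for this example

# filter() calls the function once per group and keeps all rows of every
# group for which it returns True; the original row order is preserved.
result = df.groupby("dialogue").filter(lambda g: len(g) >= min_count)
print(result)
```

Changing `min_count` to 3 should leave only the three `music` rows.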

Answer 1: (score: 4)

Answer to the updated question:

In [208]: df.groupby('dialogue')['episode'].transform('size') >= 3
Out[208]:
0      True
1     False
2     False
3     False
4     False
5     False
6     False
7      True
8     False
9      True
10    False
dtype: bool

In [209]: df[df.groupby('dialogue')['episode'].transform('size') >= 3]
Out[209]:
  dialogue  episode
0    music        1
7    music        1
9    music        1
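To make it clearer what `transform('size')` returns, the sketch below attaches the per-row group size as an extra column before filtering. The column name `n` and the variable names are my own choices for illustration; the data is the sample from the question.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "dialogue": ["music", "corrections", "somnath", "yadav", "join",
                 "instagram", "wind", "music", "whimpering", "music", "wind"],
    "episode": [1, 1, 1, 5, 2, 1, 2, 1, 2, 1, 3],
})

# transform('size') broadcasts each group's row count back onto every row,
# so the result is aligned with df's index and can be used as a mask.
counts = df.groupby("dialogue")["episode"].transform("size")

# Illustration only: show the count next to each row.
print(df.assign(n=counts))

# Keep the rows whose word occurs at least 3 times.
at_least_3 = df[counts >= 3]
print(at_least_3)
```

Because the counts are computed once and compared as a vector, the threshold can be changed without regrouping.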

Answer to the original question:

You can use the duplicated() method:

In [202]: df[df.duplicated(subset=['dialogue'], keep=False)]
Out[202]:
   dialogue  episode
0     music        1
6      wind        2
7     music        1
9     music        1
10     wind        3

If you want to sort the result:

In [203]: df[df.duplicated(subset=['dialogue'], keep=False)].sort_values('dialogue')
Out[203]:
   dialogue  episode
0     music        1
7     music        1
9     music        1
6      wind        2
10     wind        3
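One thing worth spelling out: `duplicated()` only answers "does this value occur more than once", so it covers the "2 or more" case but not higher thresholds. The sketch below, using the sample data from the question and variable names of my own, contrasts `keep=False` with the default `keep='first'`.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "dialogue": ["music", "corrections", "somnath", "yadav", "join",
                 "instagram", "wind", "music", "whimpering", "music", "wind"],
    "episode": [1, 1, 1, 5, 2, 1, 2, 1, 2, 1, 3],
})

# keep=False flags every row whose word occurs more than once,
# including the first occurrence.
mask_all = df.duplicated(subset=["dialogue"], keep=False)

# The default keep='first' leaves the first occurrence unflagged,
# which would drop rows 0 and 6 from the result.
mask_first = df.duplicated(subset=["dialogue"])

print(df[mask_all])
print(df[mask_first])
```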

Answer 2: (score: 1)

I would use value_counts:

vc = df.dialogue.value_counts() >= 2
vc = vc[vc]
df[df.dialogue.isin(vc.index)]
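A possible variation on the same idea, offered as a sketch rather than as the approach above: map each word to its count and compare directly, which avoids the intermediate boolean Series. The sample data is from the question; `word_counts` and `result` are names I picked.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "dialogue": ["music", "corrections", "somnath", "yadav", "join",
                 "instagram", "wind", "music", "whimpering", "music", "wind"],
    "episode": [1, 1, 1, 5, 2, 1, 2, 1, 2, 1, 3],
})

# value_counts() returns a Series indexed by word with its number of rows.
word_counts = df["dialogue"].value_counts()

# map() looks up each row's word in that Series, giving a per-row count
# aligned with df, which is then compared against the threshold.
result = df[df["dialogue"].map(word_counts) >= 2]
print(result)
```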


Timing

Keep in mind that this is completely over the top. However, I am working on my timing skills.

Code

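import numpy as np
import pandas as pd
from string import ascii_lowercase as lowercase  # Python 3 name for Python 2's string.lowercase
from timeit import timeit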

def pirsquared(df):
    vc = df.dialogue.value_counts() > 1
    vc = vc[vc]
    return df[df.dialogue.isin(vc.index)]

def maxu(df):
    return df[df.groupby('dialogue')['episode'].transform('size') > 1]

def andyhayden(df):
    return df.groupby("dialogue").filter(lambda x: len(x) > 1)

rows = ['pirsquared', 'maxu', 'andyhayden']
cols = ['OP_Given', '10000_3_letters']

summary = pd.DataFrame([], rows, cols)
iterations = 10

df = pd.DataFrame({'dialogue': {0: 'music', 1: 'corrections', 2: 'somnath', 3: 'yadav', 4: 'join', 5: 'instagram', 6: 'wind', 7: 'music', 8: 'whimpering', 9: 'music', 10: 'wind'}, 'episode': {0: 1, 1: 1, 2: 1, 3: 5, 4: 2, 5: 1, 6: 2, 7: 1, 8: 2, 9: 1, 10: 3}})

summary.loc['pirsquared', 'OP_Given'] = timeit(lambda: pirsquared(df), number=iterations)
summary.loc['maxu', 'OP_Given'] = timeit(lambda: maxu(df), number=iterations)
summary.loc['andyhayden', 'OP_Given'] = timeit(lambda: andyhayden(df), number=iterations)


df = pd.DataFrame(
    pd.DataFrame(np.random.choice(list(lowercase), (10000, 3))).sum(1),
    columns=['dialogue'])
df['episode'] = 1

summary.loc['pirsquared', '10000_3_letters'] = timeit(lambda: pirsquared(df), number=iterations)
summary.loc['maxu', '10000_3_letters'] = timeit(lambda: maxu(df), number=iterations)
summary.loc['andyhayden', '10000_3_letters'] = timeit(lambda: andyhayden(df), number=iterations)


summary
