Return a single boolean if a value is repeated in a pandas Series?

Time: 2015-09-03 23:08:39

Tags: python python-2.7 pandas

Given the following pandas DataFrame:

import pandas as pd

mydf = pd.DataFrame([{'Campaign': 'Campaign X', 'Date': '24-09-2014', 'Spend': 1.34, 'Clicks': 241}, {'Campaign': 'Campaign Y', 'Date': '24-08-2014', 'Spend': 2.89, 'Clicks': 12}, {'Campaign': 'Campaign X', 'Date': '24-08-2014', 'Spend': 1.20, 'Clicks': 1}, {'Campaign': 'Campaign Z2', 'Date': '24-08-2014', 'Spend': 4.56, 'Clicks': 13}])

I only want to check whether a given campaign appears more than once, and get back a single boolean.

I can do:

True in mydf['Campaign'].duplicated().get_values()

or:

True if len(mydf.drop_duplicates('Campaign')) < len(mydf['Campaign']) else False

Is there a better or more efficient way? If not, which of the two above is preferable?

1 answer:

Answer 0 (score: 1)

It looks like the first method you proposed is the fastest on a small DataFrame.

%timeit mydf.Campaign.duplicated().any()
The slowest run took 4.08 times longer than the fastest. This could mean that an intermediate result is being cached 
10000 loops, best of 3: 39.9 µs per loop

%timeit True in mydf['Campaign'].duplicated().get_values()
The slowest run took 4.23 times longer than the fastest. This could mean that an intermediate result is being cached 
10000 loops, best of 3: 34 µs per loop

%timeit True if len(mydf.drop_duplicates('Campaign')) < len(mydf['Campaign']) else False
1000 loops, best of 3: 311 µs per loop
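Outside IPython, the same kind of comparison can be reproduced with the standard `timeit` module. The following is only a rough sketch under a few assumptions: the `candidates` and `results` names are illustrative, the DataFrame is reduced to the `Campaign` column, and `to_numpy()` is swapped in for `get_values()`, which was removed in pandas 1.0. Absolute numbers will differ from those quoted above depending on machine and pandas version.

```python
import timeit

import pandas as pd

mydf = pd.DataFrame({'Campaign': ['Campaign X', 'Campaign Y',
                                  'Campaign X', 'Campaign Z2']})

# Illustrative labels mapped to the expressions being compared.
candidates = {
    'duplicated().any()':
        lambda: mydf['Campaign'].duplicated().any(),
    'True in duplicated().to_numpy()':  # stand-in for get_values()
        lambda: True in mydf['Campaign'].duplicated().to_numpy(),
    'drop_duplicates length check':
        lambda: len(mydf.drop_duplicates(subset='Campaign')) < len(mydf['Campaign']),
}

results = {}
for label, fn in candidates.items():
    # Best of 3 repeats of 1000 calls each, converted to microseconds per call.
    best = min(timeit.repeat(fn, number=1000, repeat=3))
    results[label] = best / 1000 * 1e6
    print('{}: {:.1f} us per loop'.format(label, results[label]))
```

Taking the minimum of the repeats mirrors the "best of 3" figure that `%timeit` reports.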

However, on a larger DataFrame, my method (the first one below) is slightly faster.

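import numpy as np
n = 10**6  # sizes and periods must be integers; freq='min' keeps 1e6 timestamps inside the nanosecond Timestamp range
mydf = pd.DataFrame({'Campaign': np.random.choice(list('ABCDEFGHIJKLMNOPQRSTUVWXYZ'), n, replace=True), 'Date': pd.date_range('2015-1-1', periods=n, freq='min'), 'Spend': np.random.randn(n), 'Clicks': np.random.rand(n)})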

%timeit mydf.Campaign.duplicated().any()
100 loops, best of 3: 11.2 ms per loop

%timeit True in mydf['Campaign'].duplicated().get_values()
100 loops, best of 3: 12.3 ms per loop

%timeit True if len(mydf.drop_duplicates('Campaign')) < len(mydf['Campaign']) else False
10 loops, best of 3: 138 ms per loop
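For reference, the check timed above can be wrapped in small helpers. This is a minimal sketch, not part of the original answer: the function names `has_duplicates` and `appears_more_than_once` are hypothetical, and `Series.is_unique` is shown as an additional alternative that the timings above do not cover.

```python
import pandas as pd

mydf = pd.DataFrame([
    {'Campaign': 'Campaign X', 'Date': '24-09-2014', 'Spend': 1.34, 'Clicks': 241},
    {'Campaign': 'Campaign Y', 'Date': '24-08-2014', 'Spend': 2.89, 'Clicks': 12},
    {'Campaign': 'Campaign X', 'Date': '24-08-2014', 'Spend': 1.20, 'Clicks': 1},
    {'Campaign': 'Campaign Z2', 'Date': '24-08-2014', 'Spend': 4.56, 'Clicks': 13},
])


def has_duplicates(series):
    """Return True if any value in the Series occurs more than once."""
    # duplicated() flags every repeat after the first occurrence;
    # any() collapses those flags into a single boolean.
    return bool(series.duplicated().any())


def appears_more_than_once(series, value):
    """Return True if one specific value occurs at least twice."""
    return int((series == value).sum()) > 1


print(has_duplicates(mydf['Campaign']))                        # True
print(not mydf['Campaign'].is_unique)                          # True
print(appears_more_than_once(mydf['Campaign'], 'Campaign X'))  # True
print(appears_more_than_once(mydf['Campaign'], 'Campaign Y'))  # False
```

The first helper answers "is anything repeated?", while the second covers the narrower reading of the question, where one particular campaign is of interest.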