Given the following pandas DataFrame:
mydf = pd.DataFrame([{'Campaign': 'Campaign X', 'Date': '24-09-2014', 'Spend': 1.34, 'Clicks': 241}, {'Campaign': 'Campaign Y', 'Date': '24-08-2014', 'Spend': 2.89, 'Clicks': 12}, {'Campaign': 'Campaign X', 'Date': '24-08-2014', 'Spend': 1.20, 'Clicks': 1}, {'Campaign': 'Campaign Z2', 'Date': '24-08-2014', 'Spend': 4.56, 'Clicks': 13}] )
I simply want to check (and return a single boolean) whether any campaign appears more than once.
I can do:
True in mydf['Campaign'].duplicated().get_values()
Or:
True if len(mydf.drop_duplicates('Campaign')) < len(mydf['Campaign']) else False
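Is there a better or more efficient way? If not, which of the above is preferable?
Answer 0 (score: 1)
It looks like the first method you proposed is the fastest on a small DataFrame.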
%timeit mydf.Campaign.duplicated().any()
The slowest run took 4.08 times longer than the fastest. This could mean that an intermediate result is being cached
10000 loops, best of 3: 39.9 µs per loop
%timeit True in mydf['Campaign'].duplicated().get_values()
The slowest run took 4.23 times longer than the fastest. This could mean that an intermediate result is being cached
10000 loops, best of 3: 34 µs per loop
%timeit True if len(mydf.drop_duplicates('Campaign')) < len(mydf['Campaign']) else False
1000 loops, best of 3: 311 µs per loop
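A note for readers on current pandas: Series.get_values() was deprecated in pandas 0.25 and removed in 1.0, so the second timed expression no longer runs as written. The following is a minimal sketch, assuming a recent pandas, that checks the three spellings agree on the small frame from the question. Using to_numpy() in place of get_values() is my substitution, not part of the original answer.

```python
import pandas as pd

# The small frame from the question.
mydf = pd.DataFrame([
    {'Campaign': 'Campaign X', 'Date': '24-09-2014', 'Spend': 1.34, 'Clicks': 241},
    {'Campaign': 'Campaign Y', 'Date': '24-08-2014', 'Spend': 2.89, 'Clicks': 12},
    {'Campaign': 'Campaign X', 'Date': '24-08-2014', 'Spend': 1.20, 'Clicks': 1},
    {'Campaign': 'Campaign Z2', 'Date': '24-08-2014', 'Spend': 4.56, 'Clicks': 13},
])

# Approach from the answer: boolean mask of repeats, reduced with any().
has_dupes_any = bool(mydf['Campaign'].duplicated().any())

# First approach from the question, with to_numpy() standing in
# for the removed get_values() (assumption: pandas >= 0.24).
has_dupes_in = True in mydf['Campaign'].duplicated().to_numpy()

# Second approach from the question; the comparison already yields a bool,
# so the "True if ... else False" wrapper is not needed.
has_dupes_len = len(mydf.drop_duplicates(subset='Campaign')) < len(mydf)

print(has_dupes_any, has_dupes_in, has_dupes_len)
```

All three should report that a duplicate exists, because 'Campaign X' occurs twice.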
On a larger DataFrame, however, my approach, duplicated().any() (the first timing below), is slightly faster.
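n = 10**6  # sizes must be integers; the float 1e6 is rejected by current NumPy and pandas
mydf = pd.DataFrame({'Campaign': np.random.choice(list('ABCDEFGHIJKLMNOPQRSTUVWXYZ'), n, replace=True), 'Date': pd.date_range('2015-1-1', periods=n, freq='min'), 'Spend': np.random.randn(n), 'Clicks': np.random.rand(n)})  # freq='min' keeps 10**6 timestamps inside pandas' datetime bounds; the Date column does not affect the timings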
%timeit mydf.Campaign.duplicated().any()
100 loops, best of 3: 11.2 ms per loop
%timeit True in mydf['Campaign'].duplicated().get_values()
100 loops, best of 3: 12.3 ms per loop
%timeit True if len(mydf.drop_duplicates('Campaign')) < len(mydf['Campaign']) else False
10 loops, best of 3: 138 ms per loop
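Outside IPython, the %timeit comparison can be approximated with the standard timeit module. The sketch below is only an assumption about how one might rerun it on a current pandas; it also tries two candidates that are not in the original answer (not Series.is_unique and nunique() < len). No timings are claimed here, since they depend on the machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd

n = 10**6
rng = np.random.default_rng(0)  # seeded so the run is repeatable
campaigns = pd.Series(
    rng.choice(list('ABCDEFGHIJKLMNOPQRSTUVWXYZ'), size=n),
    name='Campaign',
)

# Each candidate returns True when at least one value repeats.
candidates = {
    'duplicated().any()': lambda: bool(campaigns.duplicated().any()),
    'not is_unique': lambda: not campaigns.is_unique,
    'nunique() < len': lambda: campaigns.nunique() < len(campaigns),
}

# Record each answer once, then time 10 calls and report the mean.
results = {name: fn() for name, fn in candidates.items()}
for name, fn in candidates.items():
    seconds = timeit.timeit(fn, number=10) / 10
    print(f'{name}: {results[name]} ({seconds * 1e3:.2f} ms per call)')
```

With 26 possible values and a million rows, every candidate should report True; which one is fastest is best measured on your own data.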