I have the following:
import pandas as pd
import numpy as np
documents = [['Human', 'machine', 'interface'],
['A', 'survey', 'of', 'user'],
['The', 'EPS', 'user'],
['System', 'and', 'human'],
['Relation', 'of', 'user'],
['The', 'generation'],
['The', 'intersection'],
['Graph', 'minors'],
['Graph', 'minors', 'a']]
df = pd.DataFrame({'date': np.array(['2014-05-01', '2014-05-02', '2014-05-10', '2014-05-10', '2014-05-15', '2014-05-15', '2014-05-20', '2014-05-20', '2014-05-20'], dtype=np.datetime64), 'text': documents})
There are only 5 unique dates. I want to group by day and concatenate the lists to get the following result:
documents2 = [['Human', 'machine', 'interface'],
['A', 'survey', 'of', 'user'],
['The', 'EPS', 'user', 'System', 'and', 'human'],
['Relation', 'of', 'user', 'The', 'generation'],
['The', 'intersection', 'Graph', 'minors', 'Graph', 'minors', 'a']]
df2 = pd.DataFrame({'date': np.array(['2014-05-01', '2014-05-02', '2014-05-10', '2014-05-15', '2014-05-20'], dtype=np.datetime64), 'text': documents2})
Answer 0 (score: 5)
IIUC, you can aggregate with sum:
df.groupby('date').text.sum() # or .agg(sum)
date
2014-05-01 [Human, machine, interface]
2014-05-02 [A, survey, of, user]
2014-05-10 [The, EPS, user, System, and, human]
2014-05-15 [Relation, of, user, The, generation]
2014-05-20 [The, intersection, Graph, minors, Graph, mino...
Name: text, dtype: object
Or flatten the lists with a list comprehension, which has the same time complexity as chain.from_iterable but does not require an additional import:
df.groupby('date').text.agg(lambda x: [item for z in x for item in z])
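As a quick sanity check, the following self-contained sketch rebuilds the question's data and confirms that both the list comprehension and chain.from_iterable reproduce the expected documents2. The variable names dates, flattened, and chained are only illustrative, and pd.to_datetime is used here in place of the question's np.array(..., dtype=np.datetime64) as an assumption that the two are interchangeable for this purpose.

```python
from itertools import chain

import pandas as pd

documents = [['Human', 'machine', 'interface'],
             ['A', 'survey', 'of', 'user'],
             ['The', 'EPS', 'user'],
             ['System', 'and', 'human'],
             ['Relation', 'of', 'user'],
             ['The', 'generation'],
             ['The', 'intersection'],
             ['Graph', 'minors'],
             ['Graph', 'minors', 'a']]

# Expected result from the question, one flattened list per unique date.
documents2 = [['Human', 'machine', 'interface'],
              ['A', 'survey', 'of', 'user'],
              ['The', 'EPS', 'user', 'System', 'and', 'human'],
              ['Relation', 'of', 'user', 'The', 'generation'],
              ['The', 'intersection', 'Graph', 'minors', 'Graph', 'minors', 'a']]

# Assumption: pd.to_datetime stands in for the question's np.datetime64 array.
dates = pd.to_datetime(['2014-05-01', '2014-05-02', '2014-05-10', '2014-05-10',
                        '2014-05-15', '2014-05-15', '2014-05-20', '2014-05-20',
                        '2014-05-20'])
df = pd.DataFrame({'date': dates, 'text': documents})

# Flatten each group's lists with a nested comprehension.
flattened = df.groupby('date')['text'].agg(lambda x: [item for z in x for item in z])

# Same result using itertools.chain.from_iterable.
chained = df.groupby('date')['text'].agg(lambda x: list(chain.from_iterable(x)))

print(flattened.tolist() == documents2)
print(chained.tolist() == documents2)
```

groupby sorts the group keys by default, so the rows come out in date order and can be compared directly with documents2.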
Answer 1 (score: 5)
sum has already been shown in the other answer, so let me suggest a faster (more efficient) solution using chain.from_iterable:
from itertools import chain
df.groupby('date').text.agg(lambda x: list(chain.from_iterable(x)))
date
2014-05-01 [Human, machine, interface]
2014-05-02 [A, survey, of, user]
2014-05-10 [The, EPS, user, System, and, human]
2014-05-15 [Relation, of, user, The, generation]
2014-05-20 [The, intersection, Graph, minors, Graph, mino...
Name: text, dtype: object
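The problem with sum is that every time two lists are added, a new intermediate list is created and all elements accumulated so far are copied into it. The operation is therefore O(N^2) in the total number of elements. Using chain reduces this to linear time.
The performance difference is noticeable even with a relatively small DataFrame.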
df = pd.concat([df] * 1000)
%timeit df.groupby('date').text.sum()
%timeit df.groupby('date').text.agg('sum')
%timeit df.groupby('date').text.agg(lambda x: [item for z in x for item in z])
%timeit df.groupby('date').text.agg(lambda x: list(chain.from_iterable(x)))
71.8 ms ± 5.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
68.9 ms ± 2.96 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
2.67 ms ± 199 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.25 ms ± 184 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The problem becomes more pronounced as the groups get larger, especially because sum is not vectorized for object columns.
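To make the quadratic-versus-linear argument concrete, here is a minimal sketch in plain Python that counts element copies under a simplified cost model rather than measuring real time. The helper names copies_with_sum and copies_with_chain are hypothetical, and the model assumes that each list + list creates a new list containing every element of both operands, which is how list concatenation behaves in CPython.

```python
from itertools import chain


def copies_with_sum(lists):
    """Concatenate pairwise, as sum does, and count the elements copied."""
    copied = 0
    acc = []
    for lst in lists:
        acc = acc + lst     # builds a brand-new list on every step
        copied += len(acc)  # every element of the new list was copied into it
    return acc, copied


def copies_with_chain(lists):
    """Flatten in one pass; each element is placed into the result once."""
    result = list(chain.from_iterable(lists))
    return result, len(result)


# 100 hypothetical two-element lists, standing in for one large group.
lists = [[i, i] for i in range(100)]

sum_result, sum_copies = copies_with_sum(lists)
chain_result, chain_copies = copies_with_chain(lists)

print(sum_result == chain_result)
print(sum_copies, chain_copies)
```

Under this model the pairwise approach copies 2 + 4 + ... + 200 elements, which grows with the square of the number of lists, while the single-pass approach copies each of the 200 elements once.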