Python Pandas: aggregating whitespace-separated values in a text field

Time: 2016-11-28 10:17:11

Tags: python pandas whitespace text-processing

I have a DataFrame like this:

0      A\nA\nA
1      na\nB|D|E|F|G|H\nB|D|E|F|G|H
2      B\nB|C\nB
3      na\nna\nna

I want to reduce each row to the value with the highest count:

0      A
1      B|D|E|F|G|H
2      B
3      na

I assume I should first split the column on '\n', so I am using

df = pd.DataFrame([ x.split('\n') for x in df.tolist()])

which gives me:

       0            1               2
0      A            A               A
1      na           B|D|E|F|G|H     B|D|E|F|G|H
2      B            B|C             B
3      na           na              na
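
For reference, the same split can be sketched with the vectorized string accessor. This is a minimal sketch that assumes the data is a pandas Series of newline-joined strings (the use of `df.tolist()` above suggests so); the names `s` and `wide` are made up for illustration.

```python
import pandas as pd

# Sample data rebuilt from the question; each cell holds newline-separated values.
s = pd.Series([
    "A\nA\nA",
    "na\nB|D|E|F|G|H\nB|D|E|F|G|H",
    "B\nB|C\nB",
    "na\nna\nna",
])

# expand=True spreads the pieces of each cell into separate columns.
wide = s.str.split("\n", expand=True)
print(wide)
```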

How can I combine these columns to get the desired output?

Thanks.

2 answers:

Answer 0: (score: 1)

pd.DataFrame.mode, applied with axis=1, gives the expected output:

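import pandas as pd

# Read the already-split table (columns 0, 1, 2) from the clipboard
df = pd.read_clipboard()
# Row-wise mode: the most frequent value in each row
df.mode(axis=1)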

This returns:

0
0   A
1   B|D|E|F|G|H
2   B
3   na
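
A self-contained version of this approach, without the clipboard, might look like the sketch below. The frame `wide` is rebuilt by hand from the split table in the question and is only an illustrative name. When a row has a tie, `mode` returns every tied value in additional columns, so the sketch keeps just column 0; whether that is the right tie-break for your data is an assumption.

```python
import pandas as pd

# The split table from the question, rebuilt by hand.
wide = pd.DataFrame([
    ["A", "A", "A"],
    ["na", "B|D|E|F|G|H", "B|D|E|F|G|H"],
    ["B", "B|C", "B"],
    ["na", "na", "na"],
])

# Row-wise mode; column 0 holds the first mode of each row.
modes = wide.mode(axis=1)
result = modes[0]
print(result)
```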

Answer 1: (score: 1)

You can use Counter with most_common:

from collections import Counter
import pandas as pd

# df is the original Series; keep the most frequent value of each cell
df = pd.DataFrame([Counter(x.split('\n')).most_common(1)[0][0] for x in df.tolist()])
print (df)
             0
0            A
1  B|D|E|F|G|H
2            B
3           na
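
The same idea can be wrapped in a small helper and applied with `Series.map`, which keeps the result as a Series. This is only a sketch: `most_common_token` and `s` are hypothetical names, and the input is assumed to be a Series of newline-joined strings. On Python 3.7 and later, `most_common` orders tied values by first appearance, so a tie resolves to the value that occurs first in the cell.

```python
from collections import Counter

import pandas as pd


def most_common_token(cell, sep="\n"):
    """Return the most frequent piece of a separator-joined string."""
    return Counter(cell.split(sep)).most_common(1)[0][0]


s = pd.Series([
    "A\nA\nA",
    "na\nB|D|E|F|G|H\nB|D|E|F|G|H",
    "B\nB|C\nB",
    "na\nna\nna",
])

result = s.map(most_common_token)
print(result)
```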

Another solution, using str.split and then applying value_counts to each row:

df = df.str.split('\n', expand=True).apply(lambda x: pd.value_counts(x).index[0],axis=1)
print (df)
0              A
1    B|D|E|F|G|H
2              B
3             na
dtype: object
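
A variant of this solution that calls the `Series.value_counts` method on each row instead of the top-level `pd.value_counts` function is sketched below. As far as I know, the top-level function is deprecated in recent pandas releases, so the method form may be the safer choice; treat that as an assumption and check it against your installed version. The name `s` is illustrative.

```python
import pandas as pd

s = pd.Series([
    "A\nA\nA",
    "na\nB|D|E|F|G|H\nB|D|E|F|G|H",
    "B\nB|C\nB",
    "na\nna\nna",
])

# value_counts sorts by count in descending order, so index[0] is the most frequent value.
result = (
    s.str.split("\n", expand=True)
     .apply(lambda row: row.value_counts().index[0], axis=1)
)
print(result)
```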

Timings:

In [238]: %timeit (pd.DataFrame([Counter(x.split('\n')).most_common(1)[0][0] for x in df.tolist()]))
1000 loops, best of 3: 197 µs per loop

In [239]: %timeit (df.str.split('\n', expand=True).apply(lambda x: pd.value_counts(x).index[0],axis=1))
100 loops, best of 3: 2.33 ms per loop


In [241]: %timeit (pd.DataFrame([ x.split('\n') for x in df.tolist()]).mode(1))
100 loops, best of 3: 2.32 ms per loop

With a larger DataFrame:

# len(df) = 40k

from collections import Counter
df = pd.Series(['A\nA\nA','na\nB|D|E|F|G|H\nB|D|E|F|G|H','B\nB|c\nB','na\nna\nna'])
#print (df)
df = pd.concat([df]*10000).reset_index(drop=True)
In [331]: %timeit (pd.DataFrame([Counter(x.split('\n')).most_common(1)[0][0] for x in df.tolist()]))
1 loop, best of 3: 257 ms per loop

In [332]: %timeit (df.apply(lambda x: Counter(x.split('\n')).most_common()[0][:][0]))
1 loop, best of 3: 282 ms per loop

In [333]: %timeit (pd.DataFrame([ x.split('\n') for x in df.tolist()]).mode(1))
1 loop, best of 3: 9.18 s per loop

In [334]: %timeit (df.str.split('\n', expand=True).apply(lambda x: pd.value_counts(x).index[0],axis=1))
1 loop, best of 3: 15.7 s per loop
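
Outside IPython, a comparison like this can be roughly reproduced with the standard `timeit` module. The sketch below is not the benchmark used above: it runs on only 1,000 rows so that it finishes quickly, covers just the fastest and one of the slower approaches, and the names `by_counter`, `by_mode` and `big` are made up for illustration. It first checks that both approaches agree, since a timing comparison is only meaningful when the results match.

```python
import timeit
from collections import Counter

import pandas as pd

base = pd.Series([
    "A\nA\nA",
    "na\nB|D|E|F|G|H\nB|D|E|F|G|H",
    "B\nB|C\nB",
    "na\nna\nna",
])
# 1,000 rows instead of 40k to keep the run short.
big = pd.concat([base] * 250).reset_index(drop=True)


def by_counter(s):
    # Pure-Python counting, one Counter per cell.
    return pd.Series([Counter(x.split("\n")).most_common(1)[0][0] for x in s.tolist()])


def by_mode(s):
    # Split into columns, then take the row-wise mode.
    return pd.DataFrame([x.split("\n") for x in s.tolist()]).mode(axis=1)[0]


# Both approaches should produce the same values before timing them.
assert by_counter(big).tolist() == by_mode(big).tolist()

for fn in (by_counter, by_mode):
    seconds = timeit.timeit(lambda: fn(big), number=3)
    print(f"{fn.__name__}: {seconds / 3:.4f} s per call")
```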