I have a word list like the one below.
wordlist = ['p1','p2','p3','p4','p5','p6','p7']
The dataframe looks like this.
import pandas as pd
df = pd.DataFrame({'id' : [1,2,3,4],
                   'path' : ["p1,p2,p3,p4","p1,p2,p1","p1,p5,p5,p7","p1,p2,p3,p3"]})
Output:
id path
1 p1,p2,p3,p4
2 p1,p2,p1
3 p1,p5,p5,p7
4 p1,p2,p3,p3
I want to count the entries in the path column to get the following output. Is this transformation possible?
id p1 p2 p3 p4 p5 p6 p7
1 1 1 1 1 0 0 0
2 2 1 0 0 0 0 0
3 1 0 0 0 2 0 1
4 1 1 2 0 0 0 0
Answer 0 (score: 5)
You can use the vectorized string method str.count() (see the docs and reference) and build a new dataframe with one column per element of wordlist:
In [4]: pd.DataFrame({name : df["path"].str.count(name) for name in wordlist})
Out[4]:
p1 p2 p3 p4 p5 p6 p7
id
1 1 1 1 1 0 0 0
2 2 1 0 0 0 0 0
3 1 0 0 0 2 0 1
4 1 1 2 0 0 0 0
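Update: some answers to the comments. Indeed, this will not work if the strings can be substrings of one another (but the OP should clarify that). If that is the case, the following will work (and is also faster):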
splitted = df["path"].str.split(",")
pd.DataFrame({name : splitted.apply(lambda x: x.count(name)) for name in wordlist})
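To make the substring caveat concrete, here is a minimal sketch on hypothetical data in which 'p1' is a substring of 'p10'. The two-element wordlist and the two-row frame are assumptions made up for illustration; they are not from the question.

```python
import pandas as pd

# Hypothetical data: 'p1' is a substring of 'p10'.
wordlist = ["p1", "p10"]
df = pd.DataFrame({"id": [1, 2], "path": ["p1,p10", "p10,p10"]}).set_index("id")

# Substring counting: every 'p10' is also counted as a 'p1'.
by_substring = pd.DataFrame({name: df["path"].str.count(name) for name in wordlist})

# Token counting: split on commas first, then count exact matches in each list.
splitted = df["path"].str.split(",")
by_token = pd.DataFrame({name: splitted.apply(lambda x: x.count(name)) for name in wordlist})

print(by_substring)
print(by_token)
```

Note that str.count() treats its argument as a regular expression, so names containing regex metacharacters would additionally need escaping (for example with re.escape) in the substring variant. The split-based variant compares whole tokens and is not affected.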
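And some tests to back up my claim that it is faster :-)
Of course, I don't know what the real use case is, but I made the dataframe somewhat larger (just repeating it 1000 times already makes the difference noticeable):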
In [37]: %%timeit
....: splitted = df["path"].str.split(",")
....: pd.DataFrame({name : splitted.apply(lambda x: x.count(name)) for name in wordlist})
....:
100 loops, best of 3: 17.9 ms per loop
In [38]: %%timeit
....: pd.DataFrame({name:df["path"].str.count(name) for name in wordlist})
....:
10 loops, best of 3: 23.6 ms per loop
In [39]: %%timeit
....: c = df["path"].str.split(',').apply(Counter)
....: pd.DataFrame({n: c.apply(lambda x: x.get(n, 0)) for n in wordlist})
....:
10 loops, best of 3: 42.3 ms per loop
In [40]: %%timeit
....: dfN = df["path"].str.split(',').apply(lambda x: pd.Series(Counter(x)))
....: pd.DataFrame(dfN, columns=wordlist).fillna(0)
....:
1 loops, best of 3: 715 ms per loop
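I also tested with more elements in wordlist, and the conclusion is: if you have a larger dataframe and a relatively small number of elements in wordlist, my approach is faster; if you have a large wordlist, the Counter approach from @RomanPekar can be faster (but only the last one).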
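For anyone who wants to rerun the comparison outside IPython, a rough sketch with the standard timeit module might look like this. It assumes that "repeated 1000 times" means concatenating the frame with itself; the function names list_count, str_count and counter_get are hypothetical labels for the three variants. Absolute numbers will differ by machine and by pandas and Python version.

```python
import timeit
from collections import Counter

import pandas as pd

wordlist = ["p1", "p2", "p3", "p4", "p5", "p6", "p7"]
df = pd.DataFrame({"id": [1, 2, 3, 4],
                   "path": ["p1,p2,p3,p4", "p1,p2,p1", "p1,p5,p5,p7", "p1,p2,p3,p3"]})

# Assumption: the enlarged frame is the original repeated 1000 times.
big = pd.concat([df] * 1000, ignore_index=True)

def list_count():
    # Split once, then count exact tokens in each list.
    splitted = big["path"].str.split(",")
    return pd.DataFrame({name: splitted.apply(lambda x: x.count(name)) for name in wordlist})

def str_count():
    # Vectorized substring (regex) counting on the raw strings.
    return pd.DataFrame({name: big["path"].str.count(name) for name in wordlist})

def counter_get():
    # One Counter per row, then one lookup per word.
    c = big["path"].str.split(",").apply(Counter)
    return pd.DataFrame({n: c.apply(lambda x: x.get(n, 0)) for n in wordlist})

for func in (list_count, str_count, counter_get):
    seconds = timeit.timeit(func, number=10)
    print(f"{func.__name__}: {seconds / 10 * 1000:.1f} ms per call")
```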
Answer 1 (score: 5)
I think this would be efficient:
# create Series with dictionaries
>>> from collections import Counter
>>> c = df["path"].str.split(',').apply(Counter)
>>> c
0 {u'p2': 1, u'p3': 1, u'p1': 1, u'p4': 1}
1 {u'p2': 1, u'p1': 2}
2 {u'p1': 1, u'p7': 1, u'p5': 2}
3 {u'p2': 1, u'p3': 2, u'p1': 1}
# create DataFrame
>>> pd.DataFrame({n: c.apply(lambda x: x.get(n, 0)) for n in wordlist})
p1 p2 p3 p4 p5 p6 p7
0 1 1 1 1 0 0 0
1 2 1 0 0 0 0 0
2 1 0 0 0 2 0 1
3 1 1 2 0 0 0 0
Another approach:
>>> dfN = df["path"].str.split(',').apply(lambda x: pd.Series(Counter(x)))
>>> pd.DataFrame(dfN, columns=wordlist).fillna(0)
p1 p2 p3 p4 p5 p6 p7
0 1 1 1 1 0 0 0
1 2 1 0 0 0 0 0
2 1 0 0 0 2 0 1
3 1 1 2 0 0 0 0
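One detail of this second variant: words missing from a row come back as NaN, so the columns become floating point after fillna(0). A self-contained sketch that restores integer counts could look like the following. Using reindex and astype here is my own choice for illustration, not part of the original answer.

```python
from collections import Counter

import pandas as pd

wordlist = ["p1", "p2", "p3", "p4", "p5", "p6", "p7"]
df = pd.DataFrame({"id": [1, 2, 3, 4],
                   "path": ["p1,p2,p3,p4", "p1,p2,p1", "p1,p5,p5,p7", "p1,p2,p3,p3"]})

# One Series of counts per row; pandas aligns them into a DataFrame,
# leaving NaN where a word does not occur in that row.
dfN = df["path"].str.split(",").apply(lambda x: pd.Series(Counter(x)))

# reindex adds columns for words that never occur (here 'p6');
# fillna and astype turn the NaN-padded columns back into integer counts.
counts = dfN.reindex(columns=wordlist).fillna(0).astype(int)
print(counts)
```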
Some rough performance tests:
>>> from timeit import timeit
>>> dfL = pd.concat([df]*100)
>>> timeit('c = dfL["path"].str.split(",").apply(Counter); d = pd.DataFrame({n: c.apply(lambda x: x.get(n, 0)) for n in wordlist})', 'from __main__ import dfL, wordlist; import pandas as pd; from collections import Counter', number=100)
0.7363274283027295
>>> timeit('splitted = dfL["path"].str.split(","); d = pd.DataFrame({name : splitted.apply(lambda x: x.count(name)) for name in wordlist})', 'from __main__ import dfL, wordlist; import pandas as pd', number=100)
0.5305424618886718
# now let's make wordlist larger
>>> from string import ascii_lowercase, ascii_uppercase
>>> wordlist = wordlist + list(ascii_lowercase) + list(ascii_uppercase)
>>> timeit('c = dfL["path"].str.split(",").apply(Counter); d = pd.DataFrame({n: c.apply(lambda x: x.get(n, 0)) for n in wordlist})', 'from __main__ import dfL, wordlist; import pandas as pd; from collections import Counter', number=100)
1.765344003293876
>>> timeit('splitted = dfL["path"].str.split(","); d = pd.DataFrame({name : splitted.apply(lambda x: x.count(name)) for name in wordlist})', 'from __main__ import dfL, wordlist; import pandas as pd', number=100)
2.33328927599905
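After reading this topic, I found that Counter is really slow (at least on the Python 2 used here). You can use defaultdict instead:
>>> from collections import defaultdict
>>> def create_dict(x):
...     d = defaultdict(int)
...     for c in x:
...         d[c] += 1
...     return d
...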
>>> c = df["path"].str.split(",").apply(create_dict)
>>> pd.DataFrame({n: c.apply(lambda x: x[n]) for n in wordlist})
p1 p2 p3 p4 p5 p6 p7
0 1 1 1 1 0 0 0
1 2 1 0 0 0 0 0
2 1 0 0 0 2 0 1
3 1 1 2 0 0 0 0
And some tests:
>>> timeit('c = dfL["path"].str.split(",").apply(create_dict); d = pd.DataFrame({n: c.apply(lambda x: x[n]) for n in wordlist})', 'from __main__ import dfL, wordlist, create_dict; import pandas as pd; from collections import defaultdict', number=100)
0.45942801555111146
# now let's make wordlist larger
>>> from string import ascii_lowercase, ascii_uppercase
>>> wordlist = wordlist + list(ascii_lowercase) + list(ascii_uppercase)
>>> timeit('c = dfL["path"].str.split(",").apply(create_dict); d = pd.DataFrame({n: c.apply(lambda x: x[n]) for n in wordlist})', 'from __main__ import dfL, wordlist, create_dict; import pandas as pd; from collections import defaultdict', number=100)
1.5798653213942089
Answer 2 (score: 0)
Something like this:
df1 = pd.DataFrame([[path.count(p) for p in wordlist] for path in df['path']], columns=wordlist)
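Because path.count(p) counts substrings of the raw string, this one-liner shares the substring caveat discussed in the first answer. A possible variant, assuming entries are always separated by plain commas, splits each path first and counts whole tokens. Keeping id as the index is my own addition to match the layout requested in the question.

```python
import pandas as pd

wordlist = ["p1", "p2", "p3", "p4", "p5", "p6", "p7"]
df = pd.DataFrame({"id": [1, 2, 3, 4],
                   "path": ["p1,p2,p3,p4", "p1,p2,p1", "p1,p5,p5,p7", "p1,p2,p3,p3"]})

# Split each path into tokens, then count exact matches for every word.
rows = [[tokens.count(p) for p in wordlist]
        for tokens in (path.split(",") for path in df["path"])]

# Reuse wordlist for the column names and keep 'id' as the index.
df1 = pd.DataFrame(rows, columns=wordlist, index=df["id"])
print(df1)
```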