I have a Series taken from a pandas DataFrame:
19607 uhmm i guess i start wit my name.. trung<br />...
6205 you could say my interests revolve around tech...
57858 i always find it difficult to sum myself up wi...
29471 loyal, witty, silly, understanding, dedicated,...
47277 so basically, i hate these "fill in your own w...
25535 i am ending a relationship with a woman right ...
51731 i work and live in san francisco. i enjoy what...
19106 i love being outside when the sun is out. i <a...
18594 i've met someone and am in a long-term relatio...
7326 humanitarian, teamplayer, great work ethic, re...
I want to compute the average word length for each row. How can I do that?
Answer 0 (score: 1)
Use str.split to break each sentence into words, then explode and str.len:
s.str.split().explode().str.len().groupby(level=0, sort=False).mean()
You will get something like this:
0
19607 4.000000
6205 5.250000
57858 4.000000
29471 9.000000
47277 4.000000
25535 4.000000
51731 4.000000
19106 3.545455
18594 4.555556
7326 7.333333
Name: 1, dtype: float64
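A minimal, self-contained sketch of the chain above. The three-row Series is made-up sample data, not the question's, and its index labels only mimic the question's non-sequential index. Because `Series.mean(level=0)` was removed in pandas 2.0, the sketch groups on the index with `groupby(level=0)` instead; `sort=False` keeps the rows in their original order.

```python
import pandas as pd

# Hypothetical sample data; the labels imitate the question's non-sequential index.
s = pd.Series(
    ["i love being outside", "loyal witty silly", "so basically i hate these"],
    index=[19106, 29471, 47277],
)

# One row per word; each word keeps the index label of the row it came from.
words = s.str.split().explode()

# Length of each word, then the mean per original row (grouped by index label).
avg_len = words.str.len().groupby(level=0, sort=False).mean()
print(avg_len)
```

Note that `str.split()` with no arguments splits on whitespace only, so punctuation attached to a word counts toward that word's length.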
Answer 1 (score: 1)
My answer:
import re
import numpy as np
import pandas as pd
# s = pd.Series(d[1])  # "s" is the pandas Series from the question; if yours has a different name, replace s.apply with your_series.apply
# Keep only lowercase letters and whitespace, split on runs of whitespace, then average the word lengths per row
s1 = (s.apply(lambda x: re.sub(r'[^a-z\s]', '', x))
       .str.split(r'\s+')
       .apply(lambda x: np.mean([len(y) for y in x])))
df = pd.concat([s, s1], axis=1)
df
Out[1]:
1 1
0
19607 uhmm i guess i start wit my name.. trung<br />... 3.200000
6205 you could say my interests revolve around tech... 4.875000
57858 i always find it difficult to sum myself up wi... 3.700000
29471 loyal, witty, silly, understanding, dedicated,... 7.400000
47277 so basically, i hate these "fill in your own w... 3.500000
25535 i am ending a relationship with a woman right ... 3.700000
51731 i work and live in san francisco. i enjoy what... 3.600000
19106 i love being outside when the sun is out. i <a... 3.090909
18594 i've met someone and am in a long-term relatio... 4.000000
7326 humanitarian, teamplayer, great work ethic, re... 6.333333
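The two answers report different averages because the first splits on whitespace and leaves punctuation attached to each word, while the second strips everything except lowercase letters and whitespace before measuring. The sketch below puts both side by side on made-up sample data. The helper name `avg_word_len` and the column names are illustrative assumptions, and the helper uses plain `str.split()` so that leading or trailing whitespace does not produce empty "words".

```python
import re

import numpy as np
import pandas as pd

# Hypothetical sample data, not the rows from the question.
s = pd.Series(["loyal, witty, silly.", "i work and live in sf!"], index=[29471, 51731])


def avg_word_len(text: str) -> float:
    """Mean word length after dropping everything but lowercase letters and whitespace."""
    words = re.sub(r"[^a-z\s]", "", text).split()
    # Guard against rows that contain no letters at all.
    return float(np.mean([len(w) for w in words])) if words else float("nan")


# Approach 1: whitespace split only, so punctuation counts toward word length.
with_punct = s.str.split().explode().str.len().groupby(level=0, sort=False).mean()

# Approach 2: strip non-letters first, then average.
letters_only = s.apply(avg_word_len)

df = pd.concat(
    [s, with_punct, letters_only],
    axis=1,
    keys=["text", "with_punct", "letters_only"],
)
print(df)
```

Which variant is appropriate depends on the data: the letters-only pattern also discards digits and uppercase letters, so mixed-case text would need to be lowercased first or the pattern widened.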