How to compute the average word length of long strings in a pandas Series

Time: 2020-10-18 04:14:46

Tags: python pandas lambda

I have a Series taken from a pandas DataFrame:

19607    uhmm i guess i start wit my name.. trung<br />...
6205     you could say my interests revolve around tech...
57858    i always find it difficult to sum myself up wi...
29471    loyal, witty, silly, understanding, dedicated,...
47277    so basically, i hate these "fill in your own w...
25535    i am ending a relationship with a woman right ...
51731    i work and live in san francisco. i enjoy what...
19106    i love being outside when the sun is out. i <a...
18594    i've met someone and am in a long-term relatio...
7326     humanitarian, teamplayer, great work ethic, re...

I want to compute the average word length for each row. How can I do that?

2 answers:

Answer 0: (score: 1)

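Use str.split to break each string into words, then explode to give every word its own row, and str.len to measure each word. Averaging those lengths per original index label gives the mean word length for each row.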

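# groupby(level=0) averages the word lengths within each original row; sort=False keeps the row order
s.str.split().explode().str.len().groupby(level=0, sort=False).mean()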

You will get something like this:

0
19607    4.000000
6205     5.250000
57858    4.000000
29471    9.000000
47277    4.000000
25535    4.000000
51731    4.000000
19106    3.545455
18594    4.555556
7326     7.333333
Name: 1, dtype: float64
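
For readers who want to try the chain end to end, here is a minimal self-contained sketch. The sample strings and index labels are made up for illustration and are not the data from the question. It groups on the index with groupby(level=0), which should work on current pandas releases. Note that punctuation attached to a word still counts toward its length here.

```python
import pandas as pd

# Made-up sample data; the index labels are arbitrary, as in the question.
s = pd.Series(
    ["i love the sun", "loyal, witty, silly", "hello world"],
    index=[19607, 6205, 57858],
)

# One row per word; each word keeps the index label of its source string.
words = s.str.split().explode()

# Length of every word, then the mean within each original row.
# sort=False keeps the rows in their original order.
avg_len = words.str.len().groupby(level=0, sort=False).mean()

print(avg_len)
```

In this sample, "loyal," and "witty," are each counted as six characters because the trailing comma is part of the token.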

Answer 1: (score: 1)

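My answer:

  1. Removes punctuation (but keeps whitespace), since punctuation should not count toward word length
  2. Splits on whitespace
  3. Computes the mean with a list comprehension
  4. Concatenates the result with the original Series so you can view both side by side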

import re
import numpy as np
import pandas as pd

# `s` is the Series from the question. If yours has another name, use your_series.apply instead of s.apply.
# s = pd.Series(d[1])
s1 = (s.apply(lambda x: re.sub(r'[^a-z\s]', '', x))    # keep only lowercase letters and whitespace
      .str.split(r'\s+')                               # split on runs of whitespace
      .apply(lambda x: np.mean([len(y) for y in x])))  # mean word length per row
df = pd.concat([s, s1], axis=1)  # original text and its average side by side
df
Out[1]: 
                                                       1         1
0                                                                 
19607  uhmm i guess i start wit my name.. trung<br />...  3.200000
6205   you could say my interests revolve around tech...  4.875000
57858  i always find it difficult to sum myself up wi...  3.700000
29471  loyal, witty, silly, understanding, dedicated,...  7.400000
47277  so basically, i hate these "fill in your own w...  3.500000
25535  i am ending a relationship with a woman right ...  3.700000
51731  i work and live in san francisco. i enjoy what...  3.600000
19106  i love being outside when the sun is out. i <a...  3.090909
18594  i've met someone and am in a long-term relatio...  4.000000
7326   humanitarian, teamplayer, great work ethic, re...  6.333333
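
The four steps above can also be sketched with pandas string methods instead of re.sub inside apply. This is an illustrative variant, not the answer's exact code: the sample strings, the Series name "text", and the column name "avg_word_len" are all invented here. It calls str.split() with no argument, which ignores leading and trailing whitespace, so results can differ slightly from splitting on r'\s+' when a string starts or ends with a space.

```python
import pandas as pd

# Made-up sample data; names and index labels are hypothetical.
s = pd.Series(
    ["loyal, witty, silly", "i've met someone.. great", "  so basically, i hate these  "],
    index=[29471, 18594, 47277],
    name="text",
)

# 1. Drop everything except lowercase letters and whitespace.
cleaned = s.str.replace(r"[^a-z\s]", "", regex=True)

# 2. Split on runs of whitespace (no empty strings at the ends).
words = cleaned.str.split()

# 3. Mean word length per row; an empty word list yields NaN.
avg_len = words.apply(
    lambda ws: sum(len(w) for w in ws) / len(ws) if ws else float("nan")
)

# 4. Put the original text and the average side by side.
df = pd.concat([s, avg_len.rename("avg_word_len")], axis=1)

print(df)
```

Because the character class keeps only a-z and whitespace, digits and uppercase letters are removed as well. That suits the all-lowercase text in the question, but the pattern would need widening for mixed-case input.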