Question

Given the following (simplified) DataFrame

import pandas as pd
texts = pd.DataFrame({"description":["This is one text","and this is another one"]})
print(texts)
               description
0         This is one text
1  and this is another one

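I want to create a Series containing the word frequencies of all the words in the description column.

The expected result should look like this: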

           counts
this       2
is         2    
one        2
text       1
and        1
another    1

I tried

print(pd.Series('  '.join(str(texts.description)).split(' ')).value_counts())

but got

      139
e       8
t       7
i       6
n       5
o       5
s       5
d       3
a       3
h       3
p       2
:       2
c       2
r       2
\n      2
T       1
0       1
j       1
x       1
1       1
N       1
m       1
,       1
y       1
b       1
dtype: int64

Answer 1

Your code fails because str(texts.description) gives

'0           This is one text\n1    and this is another one\nName: description, dtype: object'

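that is, the string representation of the Series, nearly identical to what print(texts.description) shows. Then, when you call join(str(texts.description)), that single string is iterated character by character, so the separator ends up between individual characters instead of between words. That is why value_counts() reports letters rather than words.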
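
The failure can be reproduced in isolation. The following is a minimal sketch, assuming the same texts DataFrame as in the question; the names as_text and joined are only illustrative:

```python
import pandas as pd

texts = pd.DataFrame({"description": ["This is one text", "and this is another one"]})

# str() on a Series returns its printed representation as ONE string,
# including the index labels and the "Name: ..." footer.
as_text = str(texts.description)
print(repr(as_text))

# str.join iterates over its argument. Iterating a string yields single
# characters, so the separator is placed between every character.
joined = "  ".join("abc")
print(joined)  # a  b  c

# Joining the Series itself iterates over its values (the two sentences).
print(" ".join(texts.description))
```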

Try:

(texts.description
      .str.lower()
      .str.split(expand=True)
      .stack().value_counts()
)

Output:

this       2
one        2
is         2
another    1
and        1
text       1
dtype: int64
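
To see what each step of that chain contributes, the same calls can be split into intermediate variables. This is only a sketch of the chain above; the variable names lowered, words, stacked and counts are assumptions made for illustration:

```python
import pandas as pd

texts = pd.DataFrame({"description": ["This is one text", "and this is another one"]})

# Lowercase first so that "This" and "this" count as the same word.
lowered = texts.description.str.lower()

# expand=True returns a DataFrame with one column per word position;
# shorter rows are padded with missing values.
words = lowered.str.split(expand=True)

# stack() reshapes the wide table into one long Series of words.
stacked = words.stack()

# value_counts() ignores missing values by default, so any padding is not counted.
counts = stacked.value_counts()
print(counts)
```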

Answer 2

from collections import Counter
# lowercase each description and split it into a list of words
l = texts['description'].apply(lambda x: x.lower().split())
Counter([item for sublist in l for item in sublist])
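
The Counter is a dict subclass, so it can be passed to pd.Series if a Series is wanted as in the question. A self-contained sketch, assuming the texts DataFrame from the question; the names word_lists, word_counts and counts are illustrative:

```python
from collections import Counter

import pandas as pd

texts = pd.DataFrame({"description": ["This is one text", "and this is another one"]})

# One list of lowercase words per row.
word_lists = texts["description"].apply(lambda s: s.lower().split())

# Flatten the lists and count every word.
word_counts = Counter(word for words in word_lists for word in words)

# Build a Series from the counts, largest first.
counts = pd.Series(word_counts, name="counts").sort_values(ascending=False)
print(counts)
```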

Answer 3

Remove the str call from print(pd.Series('  '.join(str(texts.description)).split(' ')).value_counts())

This is because str(texts.description) returns '0 This is one text\n1 and this is another one\nName: description, dtype: object', which is not what you want.

It should look like this:

print(pd.Series('  '.join(texts.description).split(' ')).value_counts())

which gives you:

is         2
one        2
This       1
and        1
this       1
another    1
text       1
           1
dtype: int64
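
Note that this output still counts "This" and "this" separately and contains one empty string, because the rows are joined with two spaces but split on a single space. A possible variation, sketched here under the assumption that case should be ignored as in the expected result, lowercases the text and uses split() without an argument:

```python
import pandas as pd

texts = pd.DataFrame({"description": ["This is one text", "and this is another one"]})

# split() with no argument splits on any run of whitespace,
# so it never produces empty tokens.
tokens = " ".join(texts.description).lower().split()

counts = pd.Series(tokens).value_counts()
print(counts)
```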

Answer 4

If you want to convert the column's values to strings, use Series.astype:

print(pd.Series(' '.join(texts.description.astype(str)).split(' ')).value_counts())

But if all values in the column are already strings, you can omit it and the code works the same:

print(pd.Series(' '.join(texts.description).split(' ')).value_counts())
one        2
is         2
This       1
text       1
this       1
and        1
another    1
dtype: int64

Get word frequencies across all rows of a column containing text
