Question

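I'm new to Python, have started learning to work with data, and have run into some trouble.

I have a dataset (pandas) with one sentence per row. I want to create a new column that holds the word counts for the sentence in each row.

If the sentence is "Hello World Hello dogs", the word counter would be:

{'Hello': 2, 'World': 1, 'dogs': 1}

I usually use graphlab, where this is done as follows: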

dataset['new_column'] = graphlab.text_analytics.count_words(..)

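I have seen many similar solutions, but none that add the result to the dataset as a new column, and I have never programmed in Python before.

I would appreciate some guidance.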

Answer 1

I would advise against storing dictionaries in the cells of your DataFrame, but if there is no way around it, you can use Counter:

import pandas as pd
from collections import Counter

dataset = pd.DataFrame([['Hello world dogs'], ['this is another sentence']], columns=['column_of_interest'])

# Count the words in each sentence and store the Counter in a new column
dataset['new_column'] = dataset.column_of_interest.apply(lambda x: Counter(x.split(' ')))
dataset

    column_of_interest  new_column
0   Hello world dogs    {'dogs': 1, 'world': 1, 'Hello': 1}
1   this is another sentence    {'is': 1, 'sentence': 1, 'this': 1, 'another': 1}
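
If the advice about not storing dictionaries in cells applies to your case, one possible alternative is a "long" table with one row per (sentence, word) pair. This is only a sketch: the name long_counts and the column names row, word and count are my own choices for illustration, not something from the question.

```python
from collections import Counter

import pandas as pd

dataset = pd.DataFrame(
    {"column_of_interest": ["Hello World Hello dogs", "this is another sentence"]}
)

# Build one (row index, word, count) tuple per distinct word in each sentence.
rows = [
    (idx, word, n)
    for idx, sentence in dataset["column_of_interest"].items()
    for word, n in Counter(sentence.split()).items()
]

# Hypothetical column names, chosen only for this example.
long_counts = pd.DataFrame(rows, columns=["row", "word", "count"])
print(long_counts)
```

A table like this can be filtered, grouped and joined with ordinary pandas operations, which is harder to do when each cell holds a dictionary.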

Edit: Following the comments below, if some cells do not contain strings, you may need to convert them to str before splitting: lambda x: Counter(str(x).split(' '))
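
One caveat with the str(x) conversion: a missing value (NaN) becomes the string 'nan' and is counted as a word. If that is not what you want, a small helper can return an empty Counter for anything that is not a string. This is a sketch; count_words is a hypothetical name, and treating non-strings as empty is an assumption about the desired behaviour.

```python
from collections import Counter


def count_words(cell):
    """Count whitespace-separated words; non-string cells give an empty Counter."""
    if not isinstance(cell, str):
        # Assumption: NaN, None, numbers, etc. should contribute no words.
        return Counter()
    # split() with no argument also collapses repeated spaces,
    # whereas split(' ') would yield empty strings between them.
    return Counter(cell.split())


print(count_words("Hello World Hello dogs"))
print(count_words(float("nan")))
```

It can be applied the same way as the lambda, for example dataset['new_column'] = dataset.column_of_interest.apply(count_words).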

Answer 2

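The accepted answer does the job.

In case anyone wants it, here is an answer whose counting function does not depend on pandas: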

def word_count(text):
    # Map each word to the number of times it occurs in text
    counts = {}
    for word in text.split():
        if word not in counts:
            counts[word] = 1
        else:
            counts[word] += 1
    return counts

# data is assumed to be a DataFrame with a 'sentences' column
data['word_count'] = data['sentences'].apply(word_count)

Test:

print(word_count("Hello Hello world"))

Output:

{'Hello': 2, 'world': 1}
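
As a quick cross-check, the hand-written function and collections.Counter should agree on the sentence from the question. The sketch below repeats a compact version of word_count so it runs on its own; the variable names manual and with_counter are just for illustration.

```python
from collections import Counter


def word_count(text):
    # Same idea as the function above, written with dict.get
    counts = {}
    for word in text.split():
        counts[word] = counts.get(word, 0) + 1
    return counts


sentence = "Hello World Hello dogs"
manual = word_count(sentence)
with_counter = Counter(sentence.split())

print(manual == with_counter)       # True
print(with_counter.most_common(1))  # [('Hello', 2)]
```

Counter is a dict subclass, so the two results compare equal, and it adds conveniences such as most_common.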

Python: count the words in each row and save them in a new column
