Counting words in a DataFrame with Python

Time: 2019-05-14 16:23:50

Tags: python python-3.x pandas nltk stop-words

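I have imported a CSV file into Python using pandas. The file has 3 columns and 498 rows. I only need word counts for one column, named "Description". I have cleaned that column by converting it to lowercase, removing English stop words, and splitting it into words.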

IN

    import pandas as pd

    df = pd.read_csv("capex_motscles.csv")

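    from nltk.corpus import stopwords
    stop = stopwords.words('english')  # needs nltk.download('stopwords') once

    # Lowercase, drop English stop words, and rejoin the remaining words with spaces
    Description3 = df['Description'].str.lower().apply(
        lambda x: ' '.join(word for word in str(x).split() if word not in stop))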

    print(Description3)

OUT

    0      crazy mind california medical service data base...
    1      california licensed producer recreational & medic...
    2      silicon valley data clients live beyond status...
    3      mycrazynotes inc. announces $144.6 million expans...
    4      leading provider sustainable energy company prod ...
    5      livefreecompany founded 2005, listed new york stock...

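I have shown the first few rows of the output of "print(Description3)". I have 498 rows in total and, as mentioned above, I need to calculate the word frequencies. Any help would be greatly appreciated, and thank you for your time!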

1 answer:

Answer 0: (score: 1)

Do you mean something like this?

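    # Store the cleaned text as a new column, joining the kept words with spaces
    df['Description3'] = df['Description'].str.lower().apply(
        lambda x: ' '.join(word for word in str(x).split() if word not in stop))

    # Split into one word per cell, stack into a single Series, then count each word
    df['Description3'].str.split(expand=True).stack().value_counts()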
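
For comparison, the same frequency count can be sketched with only the standard library, which makes the individual steps (lowercase, split, filter, count) easier to follow. This is a minimal sketch and not the code from the question: the `STOP` set is a small hypothetical stand-in for NLTK's English stop word list, `word_frequencies` is an illustrative helper name, and the sample rows are made up.

```python
from collections import Counter

# Hypothetical stand-in for NLTK's English stop word list, so the sketch
# runs without downloading the corpus.
STOP = {"the", "a", "of", "and", "in", "is"}


def word_frequencies(texts, stop_words=STOP):
    """Count how often each non-stop word occurs across all texts."""
    counts = Counter()
    for text in texts:
        # Lowercase, split on whitespace, and drop stop words.
        words = [w for w in str(text).lower().split() if w not in stop_words]
        counts.update(words)
    return counts


# Made-up sample rows, not the data from the question.
sample = [
    "The leading provider of data services",
    "A provider of sustainable energy and data",
    "Data is the new energy",
]

freq = word_frequencies(sample)
for word, count in freq.most_common(3):
    print(word, count)
```

Since iterating over a pandas Series yields its values, passing `df['Description']` as `texts` should work the same way, with `stop_words=set(stop)` to make the membership test faster than a list lookup.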