Text analysis: finding the most common words in a column using Python

Date: 2019-09-26 16:20:27

Tags: python pandas

I created a dataframe containing only the subject line column.

df = activities.filter(['Subject'],axis=1)
df.shape

This returned the following dataframe:

    Subject
0   Call Out: Quadria Capital - May Lo, VP
1   Call Out: Revelstoke - Anthony Hayes (Sr Assoc...
2   Columbia Partners: WW Worked (Not Sure Will Ev...
3   Meeting, Sophie, CFO, CDC Investment
4   Prospecting

I then tried to analyze the text with the following code:

import nltk
top_N = 50
txt = df.Subject.str.lower().str.replace(r'\|', ' ')
words = nltk.tokenize.word_tokenize(txt)
word_dist = nltk.FreqDist(words)

stopwords = nltk.corpus.stopwords.words('english')
words_except_stop_dist = nltk.FreqDist(w for w in words if w not in stopwords) 

rslt = pd.DataFrame(word_dist.most_common(top_N), columns=['Word', 'Frequency'])
print(rslt)

The error message I receive is: 'Series' object has no attribute 'Subject'

2 answers:

Answer 0 (score: 1)

The error is raised because df is converted to a Series on this line:

df = activities.filter(['Subject'],axis=1)

So when you write:

txt = df.Subject.str.lower().str.replace(r'\|', ' ')

df is a Series, which has no attribute Subject. Try replacing it with:

txt = df.str.lower().str.replace(r'\|', ' ')

Alternatively, don't filter the DataFrame down to a single column first; then

txt = df.Subject.str.lower().str.replace(r'\|', ' ')

should work.

[Update]

What I said above is wrong: as was pointed out, filter does not return a Series, but a DataFrame with a single column.
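The distinction this update describes can be checked directly. Below is a minimal sketch, using a small made-up `activities` frame (the `Owner` column and its values are hypothetical stand-ins, not from the question):

```python
import pandas as pd

# Hypothetical stand-in for the question's "activities" frame.
activities = pd.DataFrame({
    "Subject": ["Call Out: Quadria Capital - May Lo, VP", "Prospecting"],
    "Owner": ["A", "B"],
})

# filter() keeps the DataFrame type: the result has one column but is NOT a Series,
# so df.Subject is still valid.
df = activities.filter(["Subject"], axis=1)
print(type(df))   # pandas.core.frame.DataFrame

# Bracket selection with a single label is what actually yields a Series,
# which has no .Subject attribute.
s = activities["Subject"]
print(type(s))    # pandas.core.series.Series
```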

Answer 1 (score: 1)

Data:

    Subject
    "Call Out: Quadria Capital - May Lo, VP"
    Call Out: Revelstoke - Anthony Hayes (Sr Assoc...
    Columbia Partners: WW Worked (Not Sure Will Ev...
    "Meeting, Sophie, CFO, CDC Investment"
    Prospecting

# read in the data
df = pd.read_clipboard(sep=',')

Updated code:

  • txt = df.Subject.str.lower().str.replace(r'\|', ' ') creates a pandas.core.series.Series and will be replaced
  • words = nltk.tokenize.word_tokenize(txt) throws a TypeError, because txt is a Series, not a string
  • The following code tokenizes each row of the dataframe instead:
  • Convert all the words to lowercase and remove all non-alphanumeric characters
  • Tokenize the words, splitting each string into a list. In this example, looking at df will show a tok column where each row is a list

import nltk
import pandas as pd

top_N = 50

# replace all non-alphanumeric characters
df['sub_rep'] = df.Subject.str.lower().str.replace('\W', ' ')

# tokenize
df['tok'] = df.sub_rep.apply(nltk.tokenize.word_tokenize)

  • To analyze all the words in the column, combine the individual row lists into a single list named words.

# all tokenized words to a list
words = df.tok.tolist()  # this is a list of lists
words = [word for list_ in words for word in list_]

# frequency distribution
word_dist = nltk.FreqDist(words)

# remove stopwords
stopwords = nltk.corpus.stopwords.words('english')
words_except_stop_dist = nltk.FreqDist(w for w in words if w not in stopwords)

# output the results
rslt = pd.DataFrame(word_dist.most_common(top_N), columns=['Word', 'Frequency'])

Output:

rslt
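The same pipeline can be sketched end-to-end without nltk (whose tokenizer and stopword list require separate data downloads), using pandas string methods plus collections.Counter as stand-ins. This is a hedged sketch: the inline data is copied from the question, and the stopword set here is a tiny illustrative subset, not nltk's English list.

```python
from collections import Counter

import pandas as pd

# Inline copy of the question's data, so the sketch is self-contained.
df = pd.DataFrame({"Subject": [
    "Call Out: Quadria Capital - May Lo, VP",
    "Call Out: Revelstoke - Anthony Hayes (Sr Assoc)",
    "Columbia Partners: WW Worked (Not Sure Will Ever)",
    "Meeting, Sophie, CFO, CDC Investment",
    "Prospecting",
]})

top_N = 50

# lowercase, drop non-alphanumeric characters, then split each row into tokens
df["tok"] = (df.Subject.str.lower()
                       .str.replace(r"\W", " ", regex=True)
                       .str.split())

# flatten the per-row token lists into one list of words
words = [word for row in df.tok for word in row]

# frequency distribution (Counter plays the role of nltk.FreqDist here)
word_dist = Counter(words)

# remove stopwords -- a tiny illustrative subset, not nltk's full list
stopwords = {"the", "a", "an", "of", "not", "will", "out"}
words_except_stop = Counter(w for w in words if w not in stopwords)

rslt = pd.DataFrame(word_dist.most_common(top_N), columns=["Word", "Frequency"])
print(rslt.head())
```

Swapping Counter back for nltk.FreqDist (and the subset for nltk.corpus.stopwords.words('english')) recovers the answer's original code.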