I created a dataframe containing only the Subject column:
df = activities.filter(['Subject'],axis=1)
df.shape
This returned the following dataframe:
Subject
0 Call Out: Quadria Capital - May Lo, VP
1 Call Out: Revelstoke - Anthony Hayes (Sr Assoc...
2 Columbia Partners: WW Worked (Not Sure Will Ev...
3 Meeting, Sophie, CFO, CDC Investment
4 Prospecting
I then tried to analyze the text with the following code:
import nltk
top_N = 50
txt = df.Subject.str.lower().str.replace(r'\|', ' ')
words = nltk.tokenize.word_tokenize(txt)
word_dist = nltk.FreqDist(words)
stopwords = nltk.corpus.stopwords.words('english')
words_except_stop_dist = nltk.FreqDist(w for w in words if w not in stopwords)
rslt = pd.DataFrame(word_dist.most_common(top_N), columns=['Word', 'Frequency'])
print(rslt)
The error message I get is: 'Series' object has no attribute 'Subject'
Answer 0 (score: 1)
The error is raised because this line converts df into a Series:
df = activities.filter(['Subject'],axis=1)
So when you write:
txt = df.Subject.str.lower().str.replace(r'\|', ' ')
df is a Series, which has no attribute Subject. Try replacing it with:
txt = df.str.lower().str.replace(r'\|', ' ')
Alternatively, don't filter the DataFrame down to a single column first; then
txt = df.Subject.str.lower().str.replace(r'\|', ' ')
should work.
[Update]
What I said above is wrong: as was pointed out, filter does not return a Series but a DataFrame with a single column.
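To illustrate the corrected point, here is a small sketch (using a made-up activities frame, not the asker's data) showing that filter keeps the DataFrame type, while column access yields the Series:

```python
import pandas as pd

# Hypothetical stand-in for the asker's `activities` DataFrame.
activities = pd.DataFrame({
    "Subject": ["Call Out: Quadria Capital - May Lo, VP", "Prospecting"],
    "Owner": ["alice", "bob"],
})

df = activities.filter(["Subject"], axis=1)
print(type(df).__name__)          # DataFrame: filter does not squeeze to a Series
print(type(df.Subject).__name__)  # Series: attribute access on the column works
```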
Answer 1 (score: 1)
Start from the sample Subject data in the question:
Subject
"Call Out: Quadria Capital - May Lo, VP"
Call Out: Revelstoke - Anthony Hayes (Sr Assoc...
Columbia Partners: WW Worked (Not Sure Will Ev...
"Meeting, Sophie, CFO, CDC Investment"
Prospecting
# read in the data
df = pd.read_clipboard(sep=',')
txt = df.Subject.str.lower().str.replace(r'\|', ' ') creates a pandas.core.series.Series, and words = nltk.tokenize.word_tokenize(txt) then throws a TypeError, because word_tokenize expects a str, not a Series. Tokenize row by row instead:
import nltk
import pandas as pd
top_N = 50
# replace all non-alphanumeric characters
df['sub_rep'] = df.Subject.str.lower().str.replace(r'\W', ' ', regex=True)
# tokenize
df['tok'] = df.sub_rep.apply(nltk.tokenize.word_tokenize)
Looking at df now shows a tok column, where each row is a list of tokenized words.
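A minimal sketch of that shape, using str.split as a stand-in for nltk.tokenize.word_tokenize so it runs without the NLTK punkt data:

```python
import pandas as pd

df = pd.DataFrame({"Subject": ["Meeting, Sophie, CFO, CDC Investment"]})
# same cleanup as above: lowercase, replace non-alphanumerics with spaces
df["sub_rep"] = df.Subject.str.lower().str.replace(r"\W", " ", regex=True)
# str.split stands in for word_tokenize here
df["tok"] = df.sub_rep.apply(str.split)
print(df.tok[0])  # ['meeting', 'sophie', 'cfo', 'cdc', 'investment']
```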
# all tokenized words to a list
words = df.tok.tolist() # this is a list of lists
words = [word for list_ in words for word in list_]
# frequency distribution
word_dist = nltk.FreqDist(words)
# remove stopwords
stopwords = nltk.corpus.stopwords.words('english')
words_except_stop_dist = nltk.FreqDist(w for w in words if w not in stopwords)
# output the results
rslt = pd.DataFrame(words_except_stop_dist.most_common(top_N), columns=['Word', 'Frequency'])
Displaying rslt shows the most common words and their frequencies.
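The flattening and counting steps don't depend on NLTK at all; a minimal sketch using collections.Counter in place of nltk.FreqDist, with a tiny hand-written stopword set standing in for the NLTK corpus (both assumptions, for illustration only):

```python
from collections import Counter

import pandas as pd

# Tiny stand-in for nltk.corpus.stopwords.words('english')
stop = {"the", "a", "not", "will", "out"}

# A list of token lists, as produced by the df.tok column above
tok_lists = [["call", "out", "quadria"], ["not", "sure", "will", "call"]]
words = [w for list_ in tok_lists for w in list_]  # flatten the list of lists
words = [w for w in words if w not in stop]        # drop stopwords
rslt = pd.DataFrame(Counter(words).most_common(3),
                    columns=["Word", "Frequency"])
print(rslt.Word.tolist())  # ['call', 'quadria', 'sure']
```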