I'm having trouble removing some stopwords (the default stopwords plus other manually added words) from my plot. This question is related to two other questions:
Original data:
Date Sentences
0 02/06/2020 That's the word some researcher us...
1 02/06/2020 A top official with the World Wide...
2 02/06/2020 Asymptomatic spread is the trans...
3 02/07/2020 "I don't want anyone to get con...
4 02/07/2020 And, separately, how many of th...
... ... ...
65 02/09/2020 its 'very rare' comment on asymp...
66 02/09/2020 The rapid spread of the virus t...
This is an exercise in text mining and analysis. What I've been trying to do is collect the most frequent words for each date. To do that, I tokenized the sentences and saved the tokens in a new column named 'Clean'. I used two functions: one to remove stopwords and another to clean the text.
Code:
import nltk
from nltk.corpus import stopwords

def remove_stopwords(text):
    # NLTK corpus names are lowercase: 'english', not 'English'
    # extra_stops are words that may not be useful for the analysis,
    # e.g. 'spread' in the example above
    stop_words = stopwords.words('english') + extra_stops
    c_text = []
    for i in text.lower().split():
        if i not in stop_words:
            c_text.append(i)
    return ' '.join(c_text)
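As a quick sanity check, the function can be tried on a single string (a sketch; extra_stops here is a hypothetical placeholder for the manually added words):

extra_stops = ['spread']  # hypothetical extra stopwords
print(remove_stopwords("Asymptomatic spread is the transmission"))
# prints: asymptomatic transmission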
import string

def clean_text(file):
    # remove punctuation, but keep apostrophes
    punct = string.punctuation.replace("'", '')
    punc = r'[{}]'.format(punct)
    remove_words = list(stopwords.words('english')) + list(my_stop) + list(extra_stops)
    # clean text
    file.Clean = file.Clean.str.replace(r'\d+', '', regex=True)  # remove all numbers
    file.Clean = file.Clean.str.replace(punc, ' ', regex=True)
    file.Clean = file.Clean.str.strip()
    file.Clean = file.Clean.str.lower().str.split()
    file.dropna(inplace=True)
    file.Clean = file.Clean.apply(lambda x: [word for word in x if word not in remove_words])
    return file.Clean
where 'Clean' is defined as:
df4['Sentences'] = df4['Sentences'].astype(str)
df4['Clean'] = df4['Sentences']
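and the cleaning function is then applied to the DataFrame (a sketch of the call, with my_stop and extra_stops as hypothetical placeholders that would need to be defined beforehand):

my_stop = []      # hypothetical project-specific stopwords
extra_stops = []  # hypothetical manually added words
df4['Clean'] = clean_text(df4)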
After cleaning the text, I tried grouping the words by date to select the top words (the dataset is large, so I kept only the top 4 words):
df4_ex = df4.explode('Clean')
df4_ex.dropna(inplace=True)
df4_ex = df4_ex.groupby(['Date', 'Clean']).agg({'Clean': 'count'}).groupby('Date').head(4)
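Note that head(4) returns the first 4 rows of each date group in their existing order, not the 4 largest counts. To pick the actual top 4 per date, the counts need to be sorted first; a sketch that could replace the last line above (it starts from the exploded frame, before it is overwritten):

df4_top = (df4_ex.groupby(['Date', 'Clean'])
                 .size()
                 .reset_index(name='count')
                 .sort_values(['Date', 'count'], ascending=[True, False])
                 .groupby('Date')
                 .head(4))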
Then I applied some code to plot a stacked bar chart of the most frequent words, as shown below (I found the code on Stack Overflow; since I didn't build it from scratch, it's possible I'm missing some part before the plotting):
import matplotlib.pyplot as plt

# create list of words of appropriate length; all words repeat for each date
cols = [x[1] for x in df_gb.columns for _ in range(len(df_gb))]

# plot df_gb
ax = df_gb.plot.bar(stacked=True)

# annotate the bars
for i, rect in enumerate(ax.patches):
    # Find where everything is located
    height = rect.get_height()
    width = rect.get_width()
    x = rect.get_x()
    y = rect.get_y()
    # The height of the bar is the count value and can be used as the label
    label_text = f'{height:.0f}: {cols[i]}'
    label_x = x + width / 2
    label_y = y + height / 2
    # don't include label if it's effectively 0
    if height > 0.001:
        ax.text(label_x, label_y, label_text, ha='center', va='center', fontsize=8)

# rename xtick labels; remove time
ticks = ax.get_xticks()
labels = [label.get_text()[:10] for label in ax.get_xticklabels()]
plt.xticks(ticks=ticks, labels=labels)
ax.get_legend().remove()
plt.show()
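The snippet never defines df_gb; for the cols line to work, df_gb would need MultiIndex columns of the form ('Clean', word). My guess at the missing step, built from the grouped counts above (an assumption, since the original answer didn't show it):

df_gb = df4_ex.unstack(fill_value=0)  # index: Date; columns: ('Clean', word)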
However, even after adding some new words to the exclusion list, I still get the same words in the plot, which means they are not being removed correctly.
Since I can't figure out where the error is, I'd be grateful for any help. Thanks in advance for your time.
Answer 0 (score: 1)
This might help:
import pandas, string
from nltk.corpus import stopwords

extra = ['der', 'die', 'das']
STOPWORDS = {token.lower() for token in stopwords.words('english') + extra}
PUNCTUATION = string.punctuation

df = pandas.DataFrame({
    'Date': ['02/06/2020', '02/06/2020', '03/06/2020', '03/06/2020'],
    'Sentences': ["That's the word some tor researcher", 'A top official with the World Wide', 'The rapid spread of the virus', 'Asymptomatic spread is the transmition']
})

#### ----------- Preprocessing --------------
def remove_punctuation(input_string):
    # replace every punctuation character with a space
    for char in PUNCTUATION:
        input_string = input_string.replace(char, ' ')
    return input_string

def remove_stopwords(input_string):
    # lowercase, split, and keep only tokens that are not stopwords
    return ' '.join([word for word in input_string.lower().split() if word not in STOPWORDS])

def preprocess(input_string):
    no_punctuation = remove_punctuation(input_string)
    no_stopwords = remove_stopwords(no_punctuation)
    return no_stopwords

df['clean'] = df['Sentences'].apply(preprocess)

### ------------- Token Count -----------------
group_counters = dict()
for date, group in df.groupby('Date'):
    # count tokens per row, then sum the counts across the group
    group_counters[date] = group['clean'].apply(lambda x: pandas.Series(x.split()).value_counts()).sum(axis=0)

counter_df = pandas.concat(group_counters)
Output:
02/06/2020 researcher 1.0
word 1.0
tor 1.0
world 1.0
wide 1.0
official 1.0
top 1.0
03/06/2020 spread 2.0
rapid 1.0
virus 1.0
transmition 1.0
asymptomatic 1.0
dtype: float64
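From here, getting back to your original goal (the top 4 words per date as a stacked bar chart) could look something like this sketch (the sort/head and unstack steps are my suggestion, not part of your code):

import matplotlib.pyplot as plt

# keep the 4 largest counts within each date
top4 = counter_df.sort_values(ascending=False).groupby(level=0).head(4)

# pivot: dates as rows, words as columns, missing counts as 0
plot_df = top4.unstack(fill_value=0)
plot_df.plot.bar(stacked=True)
plt.show()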