I have a 3-column CSV file that looks like this:
Comment                 Comment Author  Location
As for the requirement  David           ON
The sky is blue         Martin          SK
As for the assignment   David           ON
As for the request      Eric            QC
As for the request      Eric            QC
Based on this CSV, I wrote code that splits the Comment column into bigrams and counts how often each one occurs. However, it does not group the counts by the "Comment Author" and "Location" columns.
My current code produces a CSV output like this:
Word            Frequency  Comment Author  Location
As for          4          David           ON
the request     2          Martin          SK
the assignment  1          David           ON
the sky         1          Eric            QC
is blue         1          Eric            QC
The output CSV I want should look like this:
Word             Frequency  Comment Author  Location
As for           2          David           ON
As for           2          Eric            QC
the request      2          Eric            QC
the requirement  1          David           ON
the sky          1          Martin          SK
is blue          1          Martin          SK
I tried using df.groupby, but it did not give me the desired output. I have imported stopwords in my code, but for the sake of the example above I have kept the stopwords in. My code is as follows:
import nltk
import csv
import string
import re
from nltk.util import everygrams
import pandas as pd
from collections import Counter
from itertools import combinations
df = pd.read_csv('modified.csv', encoding="utf8", index_col=False, header=None, delimiter=",",
                 names=['comment', 'Comment Author', 'Location'])
top_N = 100000
stopwords = nltk.corpus.stopwords.words('english')
# RegEx that matches any stopword (built here but not used below)
RE_stopwords = r'\b(?:{})\b'.format('|'.join(stopwords))
txt = df.comment.str.lower().str.replace(r'\|', ' ', regex=True).str.cat(sep=' ')
words = nltk.tokenize.word_tokenize(txt)
words = [w for w in words if w not in stopwords]
bigrm = list(nltk.bigrams(words))
word_dist = nltk.FreqDist([' '.join(x) for x in bigrm])
rslt = pd.DataFrame(word_dist.most_common(top_N),
                    columns=['Word', 'Frequency'])
rslt['Comment Author'] = df['Comment Author']
rslt['Location'] = df['Location']
print(rslt)
rslt.to_csv('bigram3.csv',index=False)
Thanks!
Answer 0 (score: 0)
import pandas as pd
from flashtext import KeywordProcessor
import nltk
from collections import Counter
# recreate the example dataframe
df = pd.DataFrame(
    [['As per the requirement', 'ON', 'David'],
     ['The sky is blue', 'SK', 'Martin'],
     ['As per the assignment', 'ON', 'David'],
     ['As per the request', 'QC', 'Eric'],
     ['As per the request', 'QC', 'Eric']],
    columns=['comments', 'location', 'Author'])
#creating a bigram token
txt = df.comments.str.lower().str.replace(r'\|', ' ', regex=True).str.cat(sep=' ')
words = nltk.tokenize.word_tokenize(txt)
bigram = list(nltk.bigrams(words))
bigram_token = [' '.join(x) for x in bigram]
# now use flashtext to extract bigram tokens from the comments
kp = KeywordProcessor()
kp.add_keywords_from_list(bigram_token)
# group by author and location, concatenating each group's comments into one string
data = []
for (author, location), group in df.groupby(['Author', 'location']):
    text = group['comments'].str.lower().str.replace(r'\|', ' ', regex=True).str.cat(sep=' ')
    data.append((author, location, text))
# build the grouped dataframe and count bigram tokens per group
groupby_df = pd.DataFrame(data, columns=['Author', 'location', 'text'])
groupby_df['bigram_token_count'] = groupby_df['text'].apply(lambda x: Counter(kp.extract_keywords(x)))
# output:
Author location text bigram_token_count
0 David ON as per the requirement as per the assignment {'as per': 2, 'the requirement': 1, 'the assig...
1 Eric QC as per the request as per the request {'as per': 2, 'the request': 2}
2 Martin SK the sky is blue {'the sky': 1, 'is blue': 1}
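To get from these per-group Counters to the long CSV layout requested in the question (one row per bigram with its frequency, author, and location), a minimal sketch that expands each Counter into rows could look like this, building on the groupby_df above:

rows = []
for _, row in groupby_df.iterrows():
    # each bigram_token_count is a Counter mapping bigram -> frequency
    for word, freq in row['bigram_token_count'].items():
        rows.append((word, freq, row['Author'], row['location']))
rslt = pd.DataFrame(rows, columns=['Word', 'Frequency', 'Comment Author', 'Location'])
rslt = rslt.sort_values('Frequency', ascending=False)
rslt.to_csv('bigram3.csv', index=False)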
You can also use CountVectorizer:
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(ngram_range=(2, 2))
# one count column per bigram, one row per (Author, location) group
bigram_df = pd.DataFrame(vect.fit_transform(groupby_df['text']).toarray(),
                         columns=vect.get_feature_names_out())
final_df = pd.concat([groupby_df[['Author', 'location']], bigram_df], axis=1)
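This gives a wide table with one count column per bigram. If the goal is the long format from the question, a sketch that melts the final_df built above back down might be:

long_df = final_df.melt(id_vars=['Author', 'location'],
                        var_name='Word', value_name='Frequency')
long_df = long_df[long_df['Frequency'] > 0]  # drop bigrams that never occur in a group
long_df = long_df.sort_values('Frequency', ascending=False)
long_df.to_csv('bigram3.csv', index=False)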