Calculating Bigram Frequencies

Time: 2019-10-16 02:38:23

Tags: python python-3.x csv nltk export-to-csv

I have a CSV file with 3 columns, as shown below:

Comment                   Comment Author       Location
As for the requirement    David                ON
The sky is blue           Martin               SK
As for the assignment     David                ON
As for the request        Eric                 QC 
As for the request        Eric                 QC

Based on this CSV, I wrote code that splits the Comment column into bigrams and counts how often each one occurs. However, it does not group them by the "Comment Author" and "Location" columns.

My current code produces a CSV output that looks like this:

Word             Frequency    Comment Author    Location
As for           4            David             ON
the request      2            Martin            SK
the assignment   1            David             ON
the sky          1            Eric              QC
is blue          1            Eric              QC

The output CSV I want should look like this:

Word             Frequency    Comment Author    Location
As for           2            David             ON
As for           2            Eric              QC
the request      2            Eric              QC
the requirement  1            David             ON
the sky          1            Martin            SK
is blue          1            Martin            SK

I tried using df.groupby, but it did not give me the output I want. I have imported stopwords in my code, but for the sake of the example above I kept the stopwords in. My code is as follows:

import nltk
import pandas as pd

# read the 3-column CSV (the original passed 'r' as the sep argument by mistake)
df = pd.read_csv('modified.csv', encoding="utf8", index_col=False, header=None,
                 sep=",", names=['comment', 'Comment Author', 'Location'])

top_N = 100000
stopwords = nltk.corpus.stopwords.words('english')

# flatten the whole comment column into one lowercase string
txt = df.comment.str.lower().str.replace(r'\|', ' ', regex=True).str.cat(sep=' ')

words = nltk.tokenize.word_tokenize(txt)
words = [w for w in words if w not in stopwords]

bigrm = list(nltk.bigrams(words))

# count bigram frequencies across ALL comments at once (no grouping)
word_dist = nltk.FreqDist([' '.join(x) for x in bigrm])
rslt = pd.DataFrame(word_dist.most_common(top_N),
                    columns=['Word', 'Frequency'])
# this pastes the author/location columns back in by row position,
# not by bigram, which is why the pairing in the output above is wrong
rslt['Comment Author'] = df['Comment Author']
rslt['Location'] = df['Location']
print(rslt)
rslt.to_csv('bigram3.csv', index=False)


Thanks!

1 answer:

Answer 0 (score: 0):

import pandas as pd
from flashtext import KeywordProcessor
import nltk
from collections import Counter

# creating the dataframe:
df = pd.DataFrame([['As per the requirement', 'ON', 'David'],
                   ['The sky is blue', 'SK', 'Martin'],
                   ['As per the assignment', 'ON', 'David'],
                   ['As per the request', 'QC', 'Eric'],
                   ['As per the request', 'QC', 'Eric']],
                  columns=['comments', 'location', 'Author'])



# creating the bigram tokens
txt = df.comments.str.lower().str.replace(r'\|', ' ', regex=True).str.cat(sep=' ')
words = nltk.tokenize.word_tokenize(txt)
bigram = list(nltk.bigrams(words))
bigram_token = [' '.join(x) for x in bigram]

# now use flashtext to extract the bigram tokens from each group's comments
kp = KeywordProcessor()
kp.add_keywords_from_list(bigram_token)

# group by author and location, concatenating each group's comments into one string
data = []
for (author, location), group in df.groupby(['Author', 'location']):
    text = group['comments'].str.lower().str.replace(r'\|', ' ', regex=True).str.cat(sep=' ')
    data.append((author, location, text))

# grouped dataframe (the original indexed a nonexistent 'comment' column here)
groupby_df = pd.DataFrame(data, columns=['Author', 'location', 'text'])
groupby_df['bigram_token_count'] = groupby_df['text'].apply(lambda x: Counter(kp.extract_keywords(x)))

# output:
   Author location                                           text                                 bigram_token_count
0   David       ON  as per the requirement as per the assignment  {'as per': 2, 'the requirement': 1, 'the assig...
1    Eric       QC         as per the request as per the request                    {'as per': 2, 'the request': 2}
2  Martin       SK                               the sky is blue                       {'the sky': 1, 'is blue': 1}
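
If you need the grouped counts back in the long Word/Frequency layout from the question, a minimal follow-up sketch (assuming the groupby_df built above) expands each Counter into one row per bigram:

# expand each group's Counter into one (Word, Frequency) row per bigram,
# keeping the Author/location of the group it came from
rows = []
for _, r in groupby_df.iterrows():
    for word, freq in r['bigram_token_count'].items():
        rows.append((word, freq, r['Author'], r['location']))

rslt = pd.DataFrame(rows, columns=['Word', 'Frequency', 'Comment Author', 'Location'])
rslt = rslt.sort_values('Frequency', ascending=False)
rslt.to_csv('bigram3.csv', index=False)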

You can also use CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(ngram_range=(2, 2))
bigram_df = pd.DataFrame(vect.fit_transform(groupby_df['text']).todense(),
                         columns=vect.get_feature_names_out())  # get_feature_names() on scikit-learn < 1.0

final_df = pd.concat([groupby_df[['Author', 'location']], bigram_df], axis=1)
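
This gives one wide row per (Author, location) pair with a column per bigram. To melt that back into the long Word/Frequency layout, a small sketch (again assuming the final_df above) drops the zero counts:

# melt the wide bigram matrix into long format and keep non-zero counts only
long_df = final_df.melt(id_vars=['Author', 'location'],
                        var_name='Word', value_name='Frequency')
long_df = long_df[long_df['Frequency'] > 0].sort_values('Frequency', ascending=False)
print(long_df)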
