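I have two data frames. The first has one column of unique cluster IDs (550 clusters) and a second column with the tokens that belong to each cluster. The second data frame contains 1000 documents, one paragraph (document) per row. How can I compare the cluster tokens against the documents in the second data frame to get the token frequencies with respect to each cluster? This is what I have so far: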
import nltk
from nltk.tokenize import RegexpTokenizer

# Split on runs of word characters (letters, digits, underscore)
tokenizer = RegexpTokenizer(r'\w+')

def token_text(text):
    """Return the list of tokens in text."""
    sent = []
    for each in tokenizer.tokenize(text):
        # if not each.isdigit():
        sent.append(each)
    return sent

def count_word(word, sent):
    """Count how many times word occurs in the token list sent."""
    count = 0
    for item in sent:
        if item == word:
            count = count + 1
    return count

def frequency_match(word, text):
    """Return how many times word occurs in text."""
    sent = token_text(text)
    count = count_word(word, sent)
    return count
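
One possible way to extend the helpers above to whole clusters is sketched below. It is only a sketch under assumptions: the cluster data is taken to be a mapping from cluster ID to a list of tokens, the documents a plain list of strings, and the names `tokenize`, `cluster_token_frequencies`, `cluster_tokens` and `documents` are hypothetical, as is the toy data. It uses `re.findall` with the same `\w+` pattern in place of the NLTK tokenizer so that it runs with the standard library alone, and it tokenizes each document once instead of once per word.

```python
import re
from collections import Counter

def tokenize(text):
    # Same \w+ pattern as the RegexpTokenizer above, using only the stdlib
    return re.findall(r"\w+", text)

def cluster_token_frequencies(cluster_tokens, documents):
    """Return {cluster_id: {token: total count over all documents}}."""
    # Tokenize every document once and pool the counts
    totals = Counter()
    for doc in documents:
        totals.update(tokenize(doc))
    # Look up each cluster's tokens; a Counter gives 0 for unseen tokens
    return {
        cluster_id: {tok: totals[tok] for tok in tokens}
        for cluster_id, tokens in cluster_tokens.items()
    }

# Hypothetical toy data standing in for the two data frames
cluster_tokens = {1: ["apple", "banana"], 2: ["car"]}
documents = ["apple pie and apple juice", "a red car", "banana split"]

freqs = cluster_token_frequencies(cluster_tokens, documents)
print(freqs)
```

If the data lives in pandas, the inputs could probably be built with something like `dict(zip(clusters_df["cluster"], clusters_df["tokens"]))` and `docs_df["text"].tolist()`, assuming the tokens column holds lists; those frame and column names are made up here. Matching is case-sensitive, like the original `count_word`. If per-document counts are needed instead of totals, keeping one `Counter` per document rather than a pooled one should work the same way.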