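I have an Amazon reviews dataset with three variables, as follows: [user_id, product_id, review_text].
How many words in the reviews contain "rec" (say recommend, receive, etc., including their tenses), and how many reviews contain a "rec" word followed by "product" (such as recommend product, received product, etc.)?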
import nltk
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize  # needs NLTK's Punkt tokenizer models to be downloaded
import csv
import numpy as np
import pandas as pd
df = pd.read_csv('Reviews.csv', sep=',')
print(df.head())
Reviews = df[['user_id', 'product_id', 'review_text']] #user_id is unique, while there are 5 products of each type
# tokenize the raw file line by line
with open('Reviews.csv') as fin:
    for line in fin:
        tokens = word_tokenize(line)
        print(tokens)
I have tokenized all of the text. How do I proceed from here?
Answer 0: (score: 0)
If you only want to count the "rec" words and the "rec + product" phrases, you can use regular expressions:
import re
found_0 = 'say recommend, receive etc including their tenses'
found_1 = 'recommend product, received product etc'
print(re.findall('.*rec.*', found_0))
# this will print a list holding the whole string, because .* also matches everything around "rec":
# ['say recommend, receive etc including their tenses']
print(re.findall('rec[^ ]* product', found_1))
# this will print the list:
# ['recommend product', 'received product']
Then you just get the size of each list with len(insert_array_here).
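To count individual words rather than whole lines, the pattern can be anchored at a word boundary. The following is only a minimal sketch under some assumptions: the `reviews` list is made-up sample data standing in for the `review_text` column, the helper names `count_rec_words` and `count_rec_product_reviews` are hypothetical, and "rec word" is taken to mean any word that starts with "rec", which will also pick up unrelated words such as "record" or "recent".

```python
import re

# Hypothetical sample data standing in for the review_text column.
reviews = [
    "I recommend product X to everyone.",
    "Received product late, but I would still recommend it.",
    "Great value, works as described.",
]

# Any word starting with "rec" (recommend, received, ...), case-insensitive.
REC_WORD = re.compile(r"\brec\w*", re.IGNORECASE)
# A "rec" word directly followed by "product" or "products".
REC_PRODUCT = re.compile(r"\brec\w*\s+products?\b", re.IGNORECASE)

def count_rec_words(texts):
    """Total number of words starting with "rec" across all texts."""
    return sum(len(REC_WORD.findall(t)) for t in texts)

def count_rec_product_reviews(texts):
    """Number of texts containing at least one "rec* product" phrase."""
    return sum(1 for t in texts if REC_PRODUCT.search(t))

rec_words = count_rec_words(reviews)
rec_product_reviews = count_rec_product_reviews(reviews)
print(rec_words, rec_product_reviews)
```

With the DataFrame from the question, the same helpers could presumably be called on `df['review_text'].dropna().astype(str)` instead of the sample list. If only specific verbs such as "recommend" and "receive" should count, listing their forms explicitly in the pattern would be stricter than the "rec" prefix used here.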