Question

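I have an Amazon review dataset with three variables: [user_id, product_id, review_text].

How many words in the reviews contain "rec" (for example recommend, receive, and so on, including their tenses), and how many reviews contain a "rec" word followed by "product" (for example "recommend product", "received product")?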

import nltk
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize  # needs the NLTK tokenizer data, e.g. nltk.download('punkt')
import csv
import numpy as np
import pandas as pd

df = pd.read_csv('Reviews.csv', sep=',')
print(df.head())  # head is a method, so it has to be called
Reviews = df[['user_id', 'product_id', 'review_text']] #user_id is unique, while there are 5 products of each type

# Tokenize the review text itself, not the raw CSV lines
for text in Reviews['review_text'].astype(str):
    tokens = word_tokenize(text)
    print(tokens)

I have tokenized all the text. How do I proceed from here?

Answer 1

If you only want to count the "rec" words and the "rec word + product" phrases, you can use regular expressions:

import re

found_0 = 'say recommend, receive etc including their tenses' 
found_1 = 'recommend product, received product etc'

print(re.findall(r'\w*rec\w*', found_0))
# this will print the array: 
# ['recommend', 'receive']

print(re.findall('rec[^ ]* product', found_1))
# this will print the array:
# ['recommend product', 'received product']

Then you just compute the size of the array with len(insert_array_here).
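To apply the same idea across many reviews, here is a minimal sketch. The sample reviews and the helper names `count_rec_words` and `count_rec_product_reviews` are made up for illustration and are not part of the original question; the list simply stands in for the review_text column. Note that the first pattern matches any word containing "rec" (so "record" or "directly" also count), which may be broader than you want.

```python
import re

# Hypothetical sample reviews standing in for the review_text column.
reviews = [
    "Would recommend product to anyone.",
    "Received product late, but I would still recommend it.",
    "Great record player, no complaints.",
    "Nothing to say here.",
]

# Any word that contains "rec", ignoring case.
REC_WORD = re.compile(r"\w*rec\w*", re.IGNORECASE)
# A word starting with "rec" immediately followed by the word "product".
REC_PRODUCT = re.compile(r"\brec\w*\s+product\b", re.IGNORECASE)


def count_rec_words(texts):
    """Total number of words containing 'rec' across all texts."""
    return sum(len(REC_WORD.findall(t)) for t in texts)


def count_rec_product_reviews(texts):
    """Number of texts with at least one 'rec... product' phrase."""
    return sum(1 for t in texts if REC_PRODUCT.search(t))


total_rec_words = count_rec_words(reviews)
reviews_with_phrase = count_rec_product_reviews(reviews)
print(total_rec_words, reviews_with_phrase)
```

With a DataFrame you could pass `df['review_text'].astype(str)` to these helpers instead of the list. If only words that start with "rec" should count, the first pattern could be tightened to `\brec\w*`.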
