I'm trying to get a list of every word, 2-word, and 3-word phrase used across a set of product reviews (200K+ reviews). The reviews are provided to me as JSON objects. I've tried to keep as much data out of memory as possible by using generators, but I'm still running out of memory and don't know where to go next. I've reviewed the use of generators/iterators here and in a very similar question: repeated phrases in the text Python, but I still can't get it to work for the large dataset (my code works fine if I take a subset of the reviews).

The format of my code (or at least the intended format) is as follows:

- read the text file containing the JSON objects line by line
- parse the current line into a JSON object and pull out the review text (the dict holds other data I don't need)
- break the review into its component words, clean the words, and then either add each one to my master list or increment that word/phrase's counter if it already exists

Any help is greatly appreciated!
import json
import nltk
import collections

#define set of "stopwords", those that are removed
s_words = set(nltk.corpus.stopwords.words('english')).union(set(["it's", "us", " "]))

#load tokenizer, which will split text into sentences, and stemmer - which stems words
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
stemmer = nltk.SnowballStemmer('english')

master_wordlist = collections.defaultdict(int)

#open the raw data and read it in by line
allReviews = open('sample_reviews.json')
lines = allReviews.readlines()
allReviews.close()

#Get all of the words, 2 and 3 word phrases, in one review
def getAllWords(jsonObject):
    all_words = []
    phrase2 = []
    phrase3 = []

    sentences = tokenizer.tokenize(jsonObject['text'])
    for sentence in sentences:
        #split up the words and clean each word
        words = sentence.split()
        for word in words:
            adj_word = str(word).translate(None, '"""#$&*@.,!()- +?/[]1234567890\'').lower()
            #filter out stop words
            if adj_word not in s_words:
                all_words.append(str(stemmer.stem(adj_word)))

            #add all 2 word combos to list
            phrase2.append(str(word))
            if len(phrase2) > 2:
                phrase2.remove(phrase2[0])
            if len(phrase2) == 2:
                all_words.append(tuple(phrase2))

            #add all 3 word combos to list
            phrase3.append(str(word))
            if len(phrase3) > 3:
                phrase3.remove(phrase3[0])
            if len(phrase3) == 3:
                all_words.append(tuple(phrase3))

    return all_words
#end of getAllWords

#parse each line from the txt file to a json object
for c in lines:
    review = json.loads(c)
    #count instances of each unique word or phrase in the wordlist
    for phrase in getAllWords(review):
        master_wordlist[phrase] += 1
Answer (score: 1):
I believe calling readlines() loads the entire file into memory. You can reduce that overhead by simply iterating over the file object line by line:
#parse each line from the txt file to a json object
with open('sample_reviews.json') as f:
    for line in f:
        review = json.loads(line)
        #count instances of each unique word or phrase in the wordlist
        for phrase in getAllWords(review):
            master_wordlist[phrase] += 1
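If that still isn't enough, the other list you can eliminate is the one getAllWords builds per review. Below is a minimal sketch of a generator version, assuming Python 2 (to match the str.translate(None, ...) call above) and reusing the tokenizer, stemmer, s_words and master_wordlist already defined in the question; iterAllWords is just an illustrative name. It yields each word/phrase as it is produced, so only the counts dictionary grows with the data:

#generator variant: yields words/phrases one at a time instead of building a per-review list
def iterAllWords(jsonObject):
    phrase2 = []
    phrase3 = []
    for sentence in tokenizer.tokenize(jsonObject['text']):
        for word in sentence.split():
            #clean the word and skip stopwords, as in the original
            adj_word = str(word).translate(None, '"""#$&*@.,!()- +?/[]1234567890\'').lower()
            if adj_word not in s_words:
                yield str(stemmer.stem(adj_word))
            #maintain a sliding window of the last 2 raw words
            phrase2.append(str(word))
            if len(phrase2) > 2:
                phrase2.pop(0)
            if len(phrase2) == 2:
                yield tuple(phrase2)
            #maintain a sliding window of the last 3 raw words
            phrase3.append(str(word))
            if len(phrase3) > 3:
                phrase3.pop(0)
            if len(phrase3) == 3:
                yield tuple(phrase3)

#stream the file and consume the generator lazily
with open('sample_reviews.json') as f:
    for line in f:
        for phrase in iterAllWords(json.loads(line)):
            master_wordlist[phrase] += 1

collections.Counter would also work in place of the defaultdict; either way, the counts dictionary is then the only structure that has to stay resident in memory.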