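I'm reading a very large text file. When I run my code I get a "list index out of range" error. I noticed that certain lines in my data need to be ignored. Each set of 9 lines should look like the first example below. Some sets contain stray lines (see the second set). How can I ignore or delete those lines so that my line count isn't thrown off? I need all of the data to be in sets of 9 lines. Could I require that line 1 starts with "product", lines 2-8 start with "review", and line 9 is blank?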
product/productId: B001E4KFG0
review/userId: A3SGXH7AUHU8GW
review/profileName: delmartian
review/helpfulness: 1/1
review/score: 5.0
review/time: 1303862400
review/summary: Good Quality Dog Food
review/text: I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than most.

product/productId: B001E4KFG0
review/userId: A3SGXH7AUHU8GW
review/profileName: delmartian
review/helpfulness: 1/1
error error error
review/score: 5.0
review/time: 1303862400
review/summary: Good Quality Dog Food
review/text: I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than most.

Code
import pandas as pd
import numpy as np
import collections

%time
with open('foods.txt', encoding='ISO-8859-1') as food_file:
    dict_list = []
    column_names = ('Product ID', 'Number of people who voted this review helpful', 'Total number of people who rated this review', 'Rating of product', 'Text of the review')
    line_num = 1
    while line_num < 20000000:
        # Read lines
        line1 = food_file.readline()
        line2 = food_file.readline()
        line3 = food_file.readline()
        line4 = food_file.readline()
        line5 = food_file.readline()
        line6 = food_file.readline()
        line7 = food_file.readline()
        line8 = food_file.readline()
        line9 = food_file.readline()

        # Break out of the loop if we hit the end of the file
        if not line1:
            break

        # This code, when in use, tells me the last successful line. I then searched the text file to make corrections.
        # Manual process - not desirable
        #if len(line9) > 1:
        #    print(line9)
        #    break

        # Split lines for DataFrame
        prod = line1.split(':')[1].strip()
        helpful = line4.split(':')[1].strip()
        helpful = helpful.split('/')[0]  # More efficient approach?
        review_total = "/".join(line4.split("/", 2)[2:]).strip()
        rating = line5.split(':')[1].strip()
        review_text = line8.split(':')[1].strip()

        dict_list.append(collections.OrderedDict(zip(column_names, [prod, helpful, review_total, rating, review_text])))
        line_num += 9

amazon_df = pd.DataFrame(dict_list)
amazon_df
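
For reference, here is a rough sketch of the filtering idea from the question: keep only lines whose prefix is one of the expected fields, and start a new record at every "product/productId" line instead of counting lines. The names EXPECTED_KEYS, iter_records and to_row are hypothetical helpers made up for this illustration, and the field list is assumed from the sample above. Note that lines with no recognised prefix are simply dropped, so a review text wrapped across several physical lines would be cut off after its first line.

```python
# Field prefixes assumed from the sample data in the question.
EXPECTED_KEYS = (
    "product/productId",
    "review/userId",
    "review/profileName",
    "review/helpfulness",
    "review/score",
    "review/time",
    "review/summary",
    "review/text",
)


def is_complete(record):
    """Return True when every expected field was seen for this record."""
    return all(key in record for key in EXPECTED_KEYS)


def iter_records(lines):
    """Yield one dict per complete record, ignoring unrecognised lines."""
    record = {}
    for raw in lines:
        # Split at the first colon only, so colons inside the text survive.
        key, sep, value = raw.partition(":")
        key = key.strip()
        if sep and key == "product/productId":
            # A product line starts a new record; emit the previous one
            # only if it had all of its fields.
            if is_complete(record):
                yield record
            record = {}
        if sep and key in EXPECTED_KEYS:
            record[key] = value.strip()
        # Blank lines and stray lines such as "error error error" fall through.
    if is_complete(record):
        yield record


def to_row(record):
    """Map a parsed record onto the column names used in the question."""
    helpful, _, total = record["review/helpfulness"].partition("/")
    return {
        "Product ID": record["product/productId"],
        "Number of people who voted this review helpful": helpful,
        "Total number of people who rated this review": total,
        "Rating of product": record["review/score"],
        "Text of the review": record["review/text"],
    }
```

With the file opened as in the code above, something like pd.DataFrame([to_row(r) for r in iter_records(food_file)]) should build the same table without depending on every record being exactly 9 lines long. Records that are missing a field are skipped silently here; counting them instead would show how much data is being dropped.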