I have text files from Amazon that contain the following information:

# user item time rating review text (the header is added by me for explanation, not in the text file)
disjiad123 TYh23hs9 13160032 5 I love this phone as it is easy to use
hjf2329ccc TGjsk123 14423321 3 Suck restaurant

As you can see, the data is separated by spaces and each row has a different number of columns because of the free-text review. Here is the code I have tried:
pd.read_csv(filename, sep = " ", header = None, names = ["user","item","time","rating", "review"], usecols = ["user", "item", "rating"])#I'd like to skip the text review part
This error occurred:

ValueError: Passed header names mismatches usecols

When I tried to read all the columns:
pd.read_csv(filename, sep = " ", header = None)
this time the error was:
Error tokenizing data. C error: Expected 229 fields in line 3, saw 320
Since the review text is very long in many rows, the approach of adding a header name for every column, as in this question, does not work. I would like to know how to read the csv file if I want to keep the review text, and how to skip it, respectively. Thanks in advance!
EDIT:

Martin Evans completely solved the problem. But now I am playing with another dataset in a similar but different format. Now the order of the data is reversed:

# review text user item time rating (the header is added by me for explanation, not in the text file)
I love this phone as it is easy to used isjiad123 TYh23hs9 13160032 5
Suck restaurant hjf2329ccc TGjsk123 14423321 3
Do you have any idea how to read it correctly? Any help would be appreciated!
Answer 0 (score: 13)
As suggested, DictReader can also be used as follows to create a list of rows, which can then be imported into a pandas frame:
import pandas as pd
import csv

rows = []
csv_header = ['user', 'item', 'time', 'rating', 'review']
frame_header = ['user', 'item', 'rating', 'review']

with open('input.csv', 'rb') as f_input:
    for row in csv.DictReader(f_input, delimiter=' ', fieldnames=csv_header[:-1],
                              restkey=csv_header[-1], skipinitialspace=True):
        try:
            rows.append([row['user'], row['item'], row['rating'], ' '.join(row['review'])])
        except KeyError, e:
            rows.append([row['user'], row['item'], row['rating'], ' '])

frame = pd.DataFrame(rows, columns=frame_header)
print frame
This would display the following:

         user      item rating                                  review
0  disjiad123  TYh23hs9      5  I love this phone as it is easy to use
1  hjf2329ccc  TGjsk123      3                         Suck restaurant
If the review appears at the start of the line, then one approach would be to parse the line in reverse, as follows:
import pandas as pd
import csv

rows = []
frame_header = ['rating', 'time', 'item', 'user', 'review']

with open('input.csv', 'rb') as f_input:
    for row in f_input:
        cols = [col[::-1] for col in row[::-1][2:].split(' ') if len(col)]
        rows.append(cols[:4] + [' '.join(cols[4:][::-1])])

frame = pd.DataFrame(rows, columns=frame_header)
print frame
This would display:
rating time item user \
0 5 13160032 TYh23hs9 isjiad123
1 3 14423321 TGjsk123 hjf2329ccc
review
0 I love this phone as it is easy to used
1 Suck restaurant
row[::-1] is used to reverse the text of the whole line, and [2:] skips the line ending, which now sits at the start of the reversed line. Each line is then split on spaces, and a list comprehension re-reverses each split entry. Finally, rows is appended to: first the fixed 5 column entries are taken (they are now at the start), then the remaining entries are joined back together with spaces and added as the last column.

The benefit of this approach is that it does not rely on your input data being in an exactly fixed-width format, and you do not have to worry about column widths changing over time.
Answer 1 (score: 6)
It looks like this is a fixed-width file. Pandas provides read_fwf for exactly this purpose. The following code reads the file correctly for me. If it does not work quite right, you may want to loosen the widths a little.
pandas.read_fwf('test.fwf',
                widths=[13, 12, 13, 5, 100],
                names=['user', 'item', 'time', 'rating', 'review'])
If the columns still line up in the edited version (where the review comes first), you only need to pass the right specification. A guide line like the following helps to do this quickly:
0         1         2         3         4         5         6         7         8
123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
I love this phone as it is easy to used     isjiad123    TYh23hs9     13160032  5
Suck restaurant                             hjf2329ccc   TGjsk123     14423321  3
So the new command becomes:
pandas.read_fwf('test.fwf',
                colspecs=[[0, 43], [44, 56], [57, 69], [70, 79], [80, 84]],
                names=['review', 'user', 'item', 'time', 'rating'])
Answer 2 (score: 3)
Since the first four (now the last four) fields will never contain spaces or need quoting, let's forget the csv library and use python's awesome string handling directly. Here's a one-liner that splits each line into five columns, courtesy of the maxsplit argument of rsplit():

with open("myfile.dat") as data:
    frame = pd.DataFrame(line.strip().rsplit(maxsplit=4) for line in data)

The above should solve your problem, but I prefer to unpack it into a generator function that is easier to understand and can be extended if necessary:

def splitfields(data):
    """Generator that parses the data correctly into fields"""
    for line in data:
        fields = line.rsplit(maxsplit=4)
        fields[0] = fields[0].strip()  # trim line-initial spaces
        yield fields

with open("myfile.dat") as data:
    frame = pd.DataFrame(splitfields(data))

Both versions avoid building a large ordinary array in memory only to hand it to the DataFrame constructor. As each line of input is read from the file, it is parsed and immediately added to the dataframe.

The above is for the format in the updated question, with the free text on the left. (For the original format, use line.split instead of line.rsplit and drop the last field instead of the first.)

Depending on what your data is really like, you may be able to do more: if the fields are separated by exactly four spaces (as your example suggests), you could split on '    ' instead of splitting on all whitespace. That would also work correctly if some of the other fields can contain spaces. In general, pre-parsing like this is flexible and extensible; I keep the code simple since your question gives no evidence that more is needed.
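A minimal sketch of splitting on a literal four-space separator, as suggested above (the sample rows and the exact separator width are assumptions for illustration):

```python
import pandas as pd

# sample rows in the updated layout, fields separated by exactly four spaces
sample = [
    "I love this phone as it is easy to used    isjiad123    TYh23hs9    13160032    5",
    "Suck restaurant    hjf2329ccc    TGjsk123    14423321    3",
]

# split on the literal four-space separator so the review keeps its single spaces
frame = pd.DataFrame([line.split('    ') for line in sample],
                     columns=['review', 'user', 'item', 'time', 'rating'])
print(frame)
```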
Answer 3 (score: 2)
usecols refers to the names of the columns in the input file. If your file does not have columns named (user, item, rating), it does not know which columns you are referring to. Instead, you should pass indices like usecols=[0,1,2].

Additionally, names refers to what you want to call the columns you import, so I do not think you can have four names when importing only 3 columns. Does this work?
pd.read_csv(filename, sep = " ",
            header = None,
            names = ["user","item","rating"],
            usecols = [0,1,2])
The tokenizing error looks like a problem with the delimiter. It may be trying to parse your review text column as many columns, because "I" "love" "this" ... are all separated by spaces. Hopefully, if you only read the first three columns, you can avoid the error being thrown, but if not, you could consider parsing line-by-line (e.g., here: http://cmdlinetips.com/2011/08/three-ways-to-read-a-text-file-line-by-line-in-python/) and writing to a DataFrame from there.
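A minimal sketch of that line-by-line fallback, assuming the original layout with the review text at the end (the in-memory sample stands in for the real file):

```python
import io

import pandas as pd

# sample rows in the original layout: user item time rating review...
data = io.StringIO(
    "disjiad123 TYh23hs9 13160032 5 I love this phone as it is easy to use\n"
    "hjf2329ccc TGjsk123 14423321 3 Suck restaurant\n"
)

rows = []
for line in data:
    # keep only the first four fields; the fifth element would be the review
    user, item, time, rating = line.split(maxsplit=4)[:4]
    rows.append((user, item, rating))

frame = pd.DataFrame(rows, columns=["user", "item", "rating"])
print(frame)
```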
Answer 4 (score: 2)
I think the best way is to use pandas read_csv:
import pandas as pd
import io

temp=u""" disjiad123 TYh23hs9 13160032 5 I love this phone as it is easy to use
hjf2329ccc TGjsk123 14423321 3 Suck restaurant so I love cooking pizza with onion ham garlic tomatoes """

#estimated max length of columns
N = 20

#after testing replace io.StringIO(temp) to filename
df = pd.read_csv(io.StringIO(temp),
                 sep = "\s+",   #separator is arbitrary whitespace
                 header = None, #first row is not header, read all data to df
                 names = range(N))
print df
0 1 2 3 4 5 6 7 8 \
0 disjiad123 TYh23hs9 13160032 5 I love this phone as
1 hjf2329ccc TGjsk123 14423321 3 Suck restaurant so I love
9 10 11 12 13 14 15 16 17 18 19
0 it is easy to use NaN NaN NaN NaN NaN NaN
1 cooking pizza with onion ham garlic tomatoes NaN NaN NaN NaN
#get order of wanted columns
df = df.iloc[:, [0,1,2]]
#rename columns
df.columns = ['user','item','time']
print df
user item time
0 disjiad123 TYh23hs9 13160032
1 hjf2329ccc TGjsk123 14423321
If you need all the columns, you need preprocessing to find the maximum number of columns for the parameter usecols, and then postprocessing to join the trailing columns into one:
import pandas as pd
import csv

#preprocessing
def get_max_len():
    with open('file1.csv', 'r') as csvfile:
        reader = csv.reader(csvfile)
        num = []
        for i, row in enumerate(reader):
            num.append(len(''.join(row).split()))
        m = max(num)
        #print m
        return m

df = pd.read_csv('file1.csv',
                 sep = "\s+",   #separator is arbitrary whitespace
                 header = None, #first row is not header, read all data to df
                 usecols = range(get_max_len())) #read as many columns as preprocessing found
print df
0 1 2 3 4 5 6 7 8 \
0 disjiad123 TYh23hs9 13160032 5 I love this phone as
1 hjf2329ccc TGjsk123 14423321 3 Suck restaurant NaN NaN NaN
9 10 11 12 13
0 it is easy to use
1 NaN NaN NaN NaN NaN
#df from 4 col to last
print df.ix[:, 4:]
4 5 6 7 8 9 10 11 12 13
0 I love this phone as it is easy to use
1 Suck restaurant NaN NaN NaN NaN NaN NaN NaN NaN
#concanecate columns to one review text
df['review text'] = df.ix[:, 4:].apply(lambda x: ' '.join([e for e in x if isinstance(e, basestring)]), axis=1)
df = df.rename(columns={0:'user', 1:'item', 2:'time',3:'rating'})
#get string columns
cols = [x for x in df.columns if isinstance(x, basestring)]
#filter only string columns
print df[cols]
user item time rating \
0 disjiad123 TYh23hs9 13160032 5
1 hjf2329ccc TGjsk123 14423321 3
review text
0 I love this phone as it is easy to use
1 Suck restaurant
Answer 5 (score: 1)
I would iterate over each line and replace consecutive spaces with a semicolon, then call str.split() with the semicolon as the separator. It could look like this:
data = [["user","item","rating", "review"]]

with open("your.csv") as f:
    for line in f.readlines():
        for i in range(10, 1, -1):
            line = line.replace(' '*i, ';')
        data += [line.split(';')]
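The same idea can be sketched with a regular expression instead of the replace loop; this assumes, as the approach above does, that fields are separated by runs of two or more spaces while the review text itself uses single spaces (the sample is hypothetical):

```python
import io
import re

import pandas as pd

# hypothetical multi-space-separated sample; the real file may differ
sample = io.StringIO(
    "disjiad123  TYh23hs9  13160032  5  I love this phone as it is easy to use\n"
    "hjf2329ccc  TGjsk123  14423321  3  Suck restaurant\n"
)

# collapse every run of 2+ spaces into a single field boundary
rows = [re.split(r" {2,}", line.strip()) for line in sample]
frame = pd.DataFrame(rows, columns=["user", "item", "time", "rating", "review"])
print(frame)
```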
Answer 6 (score: 0)
I think the OP is using Amazon's review data; if so, I also found this input file difficult to read. I am not 100% sure, but I think the reason pandas.read_csv has difficulty is that the review_body column has tabs substituted for newlines (for whatever reason).

I tried a few of the solutions and ended up building a new one based on the solution proposed by @alexis. The solutions here do not work as-is, because the file at the link I gave has the following column names (note that 'review_body' is neither at the start nor at the end of the list):

['marketplace', 'customer_id', 'review_id', 'product_id', 'product_parent', 'product_title', 'product_category', 'star_rating', 'helpful_votes', 'total_votes', 'vine', 'verified_purchase', 'review_headline', 'review_body', 'review_date']

My apologies for the similarity of the variable names. For example, there is a stopCol and a stopCols. I know... bad form.
import pandas as pd

# declare dictionary to contain columns from left-to-right search
forwCols = {}
# declare dictionary to contain "review_body" column
stopCols = {}
# declare dictionary to contain columns from right-to-left search
revrCols = {}

with open(filstr,'r') as TSVfile:
    lines = TSVfile.readlines()

# The header should have the maximum num of cols
numCols = len(lines[0].split())
# Find which column index corresponds to 'review_body' col
stopCol = lines[0].split().index('review_body')
colNames = lines[0].split()

for lineInt in range(1,len(lines)):
    # populate dict with cols up to the column with tabs
    forwCols[lineInt] = lines[lineInt].split('\t',maxsplit=14)[:stopCol]
    # populate dict with the cols after the column with tabs
    revrCols[lineInt] = lines[lineInt].rsplit('\t',maxsplit=2)[2:]
    forwLine = '\t'.join(forwCols[lineInt])
    revrLine = '\t'.join(revrCols[lineInt])
    # this next line removes from the line the contents that already
    # exist in the dicts created above, leaving only review_body
    stopCols[lineInt] = lines[lineInt].replace(forwLine,'').replace(revrLine,'')

# Create three DFs using the three dicts just created
revDF = pd.DataFrame.from_dict(forwCols, orient='index',
                               columns=colNames[:stopCol])
dateDF = pd.DataFrame.from_dict(revrCols, orient='index',
                                columns=['review_date'])
revbodyDF = pd.DataFrame.from_dict(stopCols, orient='index',
                                   columns=['review_body'])

# join the three DFs together on their indices
combineDF1 = revbodyDF.merge(right=dateDF, how='outer',
                             left_index=True, right_index=True)
combineDF = revDF.merge(right=combineDF1, how='outer',
                        left_index=True, right_index=True)
The solution above is a brute-force approach, but it is the only way I could see to handle the case where the column containing tabs is neither the first nor the last column.
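The split-from-both-ends idea above can be illustrated compactly on an in-memory sample (the shortened column names and the tab-for-newline substitution inside review_body are made up for the demo):

```python
import pandas as pd

# hypothetical header and rows; review_body itself may contain tabs
header = "user\tproduct\treview_body\treview_date"
lines = [
    "u1\tp1\tgreat\tphone\tlove it\t2015-08-31",
    "u2\tp2\tbad\t2015-09-01",
]

cols = header.split('\t')
stop = cols.index('review_body')

rows = []
for line in lines:
    parts = line.split('\t')
    left = parts[:stop]               # clean fields before review_body
    right = parts[-1:]                # clean field after it (the date)
    body = '\t'.join(parts[stop:-1])  # everything in between is the review
    rows.append(left + [body] + right)

frame = pd.DataFrame(rows, columns=cols)
print(frame)
```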