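How do I read a CSV file correctly when each row contains a different number of fields (and the number is quite large)?

Time: 2016-02-11 16:07:59

Tags: python csv pandas

I have a text file from Amazon containing the following information: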

 #      user        item     time   rating     review text (the header is added by me for explanation, not in the text file
  disjiad123    TYh23hs9     13160032    5     I love this phone as it is easy to use
  hjf2329ccc    TGjsk123     14423321    3     Suck restaurant

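As you can see, the data is separated by spaces, and the number of columns differs from row to row because the last field is free-form review text. Here is the code I tried: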

pd.read_csv(filename, sep = " ", header = None, names = ["user","item","time","rating", "review"], usecols = ["user", "item", "rating"])#I'd like to skip the text review part

This raised the following error:

ValueError: Passed header names mismatches usecols

When I tried to read all the columns:

pd.read_csv(filename, sep = " ", header = None)

This time the error was:

Error tokenizing data. C error: Expected 229 fields in line 3, saw 320

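Because the review text is very long in many rows, the approach from this question of adding a header name for every column does not work.

I would like to know how to read the CSV file in both cases: when I want to keep the review text and when I want to skip it. Thanks in advance!

EDIT:

Martin Evans has solved the problem completely. But now I am playing with another dataset that has a similar but different format. The order of the data is now reversed: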

     # review text                          user        item     time   rating      (the header is added by me for explanation, not in the text file
   I love this phone as it is easy to used  isjiad123    TYh23hs9     13160032    5    
  Suck restaurant                           hjf2329ccc    TGjsk123     14423321    3     

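Do you have any idea how to read it correctly? Any help would be appreciated!

7 answers:

Answer 0 (score: 13)

As suggested, DictReader can also be used as follows to create a list of rows. This can then be imported into pandas as a frame: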

import pandas as pd
import csv

rows = []
csv_header = ['user', 'item', 'time', 'rating', 'review']
frame_header = ['user', 'item', 'rating', 'review']

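# newline='' is what the csv module expects for files opened in text mode (Python 3)
with open('input.csv', newline='') as f_input:
    # The first four fields get names; any extra fields are collected as a list under restkey
    for row in csv.DictReader(f_input, delimiter=' ', fieldnames=csv_header[:-1], restkey=csv_header[-1], skipinitialspace=True):
        try:
            rows.append([row['user'], row['item'], row['rating'], ' '.join(row['review'])])
        except KeyError:
            # No review text on this line
            rows.append([row['user'], row['item'], row['rating'], ' '])

frame = pd.DataFrame(rows, columns=frame_header)
print(frame)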

This displays the following:

         user      item rating                                  review
0  disjiad123  TYh23hs9      5  I love this phone as it is easy to use
1  hjf2329ccc  TGjsk123      3                         Suck restaurant
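
To make the role of restkey more concrete, here is a minimal, standard-library-only sketch that runs on an in-memory sample line rather than a file. The sample string and the variable names are assumptions made for illustration; the only point is that fields beyond the named ones come back as a list under the restkey key.

```python
import csv
import io

# Hypothetical one-line sample in the original (review-last) layout
sample = "  disjiad123    TYh23hs9     13160032    5     I love this phone\n"

reader = csv.DictReader(
    io.StringIO(sample),
    delimiter=' ',
    fieldnames=['user', 'item', 'time', 'rating'],
    restkey='review',          # extra fields are collected here as a list
    skipinitialspace=True,     # ignore the repeated spaces between fields
)
row = next(reader)

# Join the leftover tokens back into one string; empty if there was no review
review_text = ' '.join(row.get('review', []))
```

Using row.get with a default avoids the try/except when a line has no review text at all.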

If the review appears at the beginning of the row, one approach is to parse the line in reverse, as follows:

import pandas as pd
import csv


rows = []
frame_header = ['rating', 'time', 'item', 'user', 'review']

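# newline='' keeps the original line endings; the [2:] below assumes they are \r\n
with open('input.csv', newline='') as f_input:
    for row in f_input:
        # Reverse the line, drop the line ending, split on spaces, then un-reverse each token
        cols = [col[::-1] for col in row[::-1][2:].split(' ') if len(col)]
        # The first four tokens are rating, time, item, user; the rest is the review
        rows.append(cols[:4] + [' '.join(cols[4:][::-1])])

frame = pd.DataFrame(rows, columns=frame_header)
print(frame)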

This displays:

  rating      time      item        user  \
0      5  13160032  TYh23hs9   isjiad123   
1      3  14423321  TGjsk123  hjf2329ccc   

                                    review  
0  I love this phone as it is easy to used  
1                          Suck restaurant  

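row[::-1] reverses the text of the whole line, and [2:] skips the line ending, which is now at the start of the line (this assumes a two-character \r\n ending). The line is then split on spaces, and the list comprehension re-reverses each split entry. Finally, rows is appended to by first taking the four fixed column entries (now at the start); the remaining entries are joined back together with a space and added as the final column.

The benefit of this approach is that it does not rely on your input data being in an exactly fixed-width format, and you do not have to worry if the column widths change over time.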
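
As a rough illustration of the same reverse-parsing idea on a single string, the sketch below wraps it in a function. The function name parse_reversed is hypothetical, and it uses rstrip() instead of [2:] so that it does not depend on the kind of line ending; treat it as one possible variant, not as the answer's exact method.

```python
def parse_reversed(line):
    """Split a review-first line into [rating, time, item, user, review]."""
    # Drop trailing whitespace and the line ending, then reverse the text
    reversed_line = line.rstrip()[::-1]
    # Split on single spaces, discard empty pieces, and un-reverse each token
    tokens = [tok[::-1] for tok in reversed_line.split(' ') if tok]
    fixed = tokens[:4]                        # rating, time, item, user
    review = ' '.join(tokens[4:][::-1])       # remaining words, back in order
    return fixed + [review]


sample = "   I love this phone as it is easy to used  isjiad123    TYh23hs9     13160032    5    \r\n"
parsed = parse_reversed(sample)
```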

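Answer 1 (score: 6)

This looks like a fixed-width file. pandas provides read_fwf for exactly this purpose. The following code reads the file correctly for me. If it does not work well for you, you may want to adjust the widths a little.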

pandas.read_fwf('test.fwf', 
                 widths=[13, 12, 13, 5, 100], 
                 names=['user', 'item', 'time', 'rating', 'review'])

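If the columns in the edited version (where the review comes first) still line up, you only need to supply the right specification. A guide line like the following helps to do this quickly: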

0        1         2         3         4         5         6         7         8
123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
  I love this phone as it is easy to used  isjiad123    TYh23hs9     13160032    5    
  Suck restaurant                          hjf2329ccc   TGjsk123     14423321    3     

So the new command becomes:

pandas.read_fwf('test.fwf', 
                colspecs=[[0, 43], [44, 56], [57, 69], [70, 79], [80, 84]], 
                names=['review', 'user', 'item', 'time', 'rating'])
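
The two calls use widths and colspecs respectively. For contiguous columns the two are related by a running sum, since colspecs are half-open (start, end) intervals. The sketch below shows that relationship and checks the positions on one line before handing them to read_fwf. The helper names and the padded sample line are assumptions for illustration only.

```python
from itertools import accumulate


def widths_to_colspecs(widths):
    """Turn contiguous field widths into half-open (start, end) pairs."""
    ends = list(accumulate(widths))
    starts = [0] + ends[:-1]
    return list(zip(starts, ends))


def slice_line(line, colspecs):
    """Cut one line at the given positions and trim the padding."""
    return [line[start:end].strip() for start, end in colspecs]


colspecs = widths_to_colspecs([13, 12, 13, 5, 100])

# Hypothetical line padded to exactly those widths
line = f"{'disjiad123':<13}{'TYh23hs9':<12}{'13160032':<13}{'5':<5}I love this phone"
fields = slice_line(line, colspecs)
```

Slicing a few real lines this way is a quick check that the chosen positions do not cut a value in half.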

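Answer 2 (score: 3)

Since the first four (now last four) fields are never going to contain spaces or need to be enclosed in quotes, let's forget about the csv library and use Python's awesome string handling directly. Here is a one-liner that splits each line into exactly five columns, courtesy of the maxsplit argument of rsplit():

with open("myfile.dat") as data:
    frame = pd.DataFrame(line.strip().rsplit(maxsplit=4) for line in data)

The above should solve your problem, but I prefer to unpack it into a generator function that is easier to understand and can be extended if necessary:

def splitfields(data):
    """Generator that parses the data correctly into fields"""
    for line in data:
        fields = line.rsplit(maxsplit=4)
        fields[0] = fields[0].strip()   # trim line-initial spaces
        yield fields

with open("myfile.dat") as data:
    frame = pd.DataFrame(splitfields(data))

Both versions avoid having to build a large ordinary array in memory only to hand it over to the DataFrame constructor. As each line of input is read from the file, it is parsed and immediately added to the dataframe.

The above is for the format in the updated question, which has the free text on the left. (For the original format, use line.split instead of line.rsplit and strip the last field rather than the first.)

   I love this phone as it is easy to used  isjiad123    TYh23hs9     13160032    5    
  Suck restaurant                           hjf2329ccc    TGjsk123     14423321    3     

There is a lot more you could do, depending on what the data actually looks like: if the fields are separated by exactly four spaces (as it seems from your example), you could split on "    " instead of splitting on all whitespace. That would also work correctly if some other fields can contain spaces. In general, pre-parsing like this is flexible and extensible; I keep the code simple because nothing in your question suggests that more is needed.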
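
For the original layout, with the review at the end of the line, the mirror-image version might look like the sketch below. The function name split_original is hypothetical; it relies only on str.split with maxsplit, which leaves everything after the fourth split (including inner spaces) in the last element.

```python
def split_original(line):
    """Split a review-last line into [user, item, time, rating, review]."""
    # Whitespace splitting skips leading spaces; the fifth element keeps the rest
    fields = line.split(maxsplit=4)
    # The remainder still carries the trailing newline, so trim it
    fields[-1] = fields[-1].strip()
    return fields


sample = "  disjiad123    TYh23hs9     13160032    5     I love this phone as it is easy to use\n"
parsed = split_original(sample)
```

A line with no review text at all would come back with only four elements, so a caller may want to pad it before building a DataFrame.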

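Answer 3 (score: 2)

usecols refers to the names of the columns in the input file. If your file does not have columns named (user, item, rating), it will not know which columns you are referring to. Instead, you should pass indices, such as usecols=[0,1,2].

Also, names refers to what you call the imported columns. So I think you cannot have four names when importing 3 columns. Does this work?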

pd.read_csv(filename, sep = " ", 
                      header = None, 
                      names = ["user","item","rating"], 
                      usecols = [0,1,2])

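The tokenizing error looks like a problem with the delimiter. It may be trying to parse your review text column as many columns, because "I" "love" "this" ... are all separated by spaces. Hopefully, if you only read the first three columns, you can avoid the error; if not, you could consider parsing line by line (for example, here: http://cmdlinetips.com/2011/08/three-ways-to-read-a-text-file-line-by-line-in-python/) and writing to a DataFrame from there.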
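
A minimal sketch of that line-by-line fallback, assuming the original layout (user, item, time, rating, then review) and using an in-memory sample instead of a real file. The function name is hypothetical. Note that user, item and rating sit at positions 0, 1 and 3, because position 2 holds the time.

```python
import io


def read_user_item_rating(lines):
    """Collect (user, item, rating) from whitespace-separated lines."""
    rows = []
    for line in lines:
        parts = line.split()          # split on any run of whitespace
        if len(parts) < 4:
            continue                  # skip blank or malformed lines
        user, item, _time, rating = parts[:4]
        rows.append((user, item, int(rating)))
    return rows


sample = io.StringIO(
    "  disjiad123    TYh23hs9     13160032    5     I love this phone as it is easy to use\n"
    "  hjf2329ccc    TGjsk123     14423321    3     Suck restaurant\n"
)
rows = read_user_item_rating(sample)
```

The resulting list of tuples can then be passed to pd.DataFrame(rows, columns=['user', 'item', 'rating']).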

Answer 4 (score: 2)

I think the best approach is to use pandas read_csv:

import pandas as pd
import io

temp=u"""  disjiad123    TYh23hs9     13160032    5     I love this phone as it is easy to use
  hjf2329ccc    TGjsk123     14423321    3     Suck restaurant so I love cooking pizza with onion ham garlic tomatoes """


#estimated max length of columns 
N = 20

#after testing replace io.StringIO(temp) to filename
df = pd.read_csv(io.StringIO(temp), 
                 sep = r"\s+", #separator is arbitrary whitespace 
                 header = None, #first row is not header, read all data to df
                 names=range(N)) 
print(df)
           0         1         2   3     4           5     6      7     8   \
0  disjiad123  TYh23hs9  13160032   5     I        love  this  phone    as   
1  hjf2329ccc  TGjsk123  14423321   3  Suck  restaurant    so      I  love   

        9      10    11     12   13      14        15  16  17  18  19  
0       it     is  easy     to  use     NaN       NaN NaN NaN NaN NaN  
1  cooking  pizza  with  onion  ham  garlic  tomatoes NaN NaN NaN NaN

#get order of wanted columns
df = df.iloc[:, [0,1,2]]
#rename columns
df.columns = ['user','item','time']
print(df)
         user      item      time
0  disjiad123  TYh23hs9  13160032
1  hjf2329ccc  TGjsk123  14423321

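If you need all the columns, you need a preprocessing step to find the maximum number of columns for the usecols parameter, and then a postprocessing step to join the trailing columns into one: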

import pandas as pd
import csv

#preprocessing
def get_max_len():
    with open('file1.csv', 'r') as csvfile:
        reader = csv.reader(csvfile)
        num = []
        for i, row in enumerate(reader):
            num.append(len(''.join(row).split()))
        m = max(num)
        #print m
        return m


df = pd.read_csv('file1.csv', 
                         sep = r"\s+", #separator is arbitrary whitespace 
                         header = None, #first row is not header, read all data to df
                         usecols = range(get_max_len())) #keep every column up to the widest row
print(df)
           0         1         2   3     4           5     6      7    8   \
0  disjiad123  TYh23hs9  13160032   5     I        love  this  phone   as   
1  hjf2329ccc  TGjsk123  14423321   3  Suck  restaurant   NaN    NaN  NaN   

    9    10    11   12   13  
0   it   is  easy   to  use  
1  NaN  NaN   NaN  NaN  NaN 
#df from 4 col to last
print(df.iloc[:, 4:])
     4           5     6      7    8    9    10    11   12   13
0     I        love  this  phone   as   it   is  easy   to  use
1  Suck  restaurant   NaN    NaN  NaN  NaN  NaN   NaN  NaN  NaN

#concatenate columns to one review text
df['review text'] = df.iloc[:, 4:].apply(lambda x: ' '.join([e for e in x if isinstance(e, str)]), axis=1)
df = df.rename(columns={0:'user', 1:'item', 2:'time',3:'rating'})

#get string columns
cols = [x for x in df.columns if isinstance(x, str)]

#filter only string columns
print(df[cols])
         user      item      time  rating  \
0  disjiad123  TYh23hs9  13160032       5   
1  hjf2329ccc  TGjsk123  14423321       3   

                              review text  
0  I love this phone as it is easy to use  
1                         Suck restaurant  
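
The preprocessing step only needs the largest number of whitespace-separated tokens on any line. A minimal sketch that computes this directly with str.split, without going through csv.reader, is shown below; the helper name max_field_count is hypothetical and the sample is held in memory for illustration.

```python
import io


def max_field_count(lines):
    """Return the largest number of whitespace-separated tokens on any line."""
    return max((len(line.split()) for line in lines), default=0)


sample = io.StringIO(
    "  disjiad123    TYh23hs9     13160032    5     I love this phone as it is easy to use\n"
    "  hjf2329ccc    TGjsk123     14423321    3     Suck restaurant\n"
)
widest = max_field_count(sample)
```

The result could be used in place of get_max_len() when building the range of column indices.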

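Answer 5 (score: 1)

I would iterate over each line and replace consecutive spaces with a semicolon. Then call str.split() with the semicolon as the delimiter. It could look like this: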

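data = [["user", "item", "time", "rating", "review"]]
with open("your.csv") as f:
    for line in f:
        line = line.strip()  # drop leading spaces and the trailing newline
        # collapse runs of 10 down to 2 spaces into a single semicolon
        for i in range(10, 1, -1):
            line = line.replace(' '*i, ';')
        data.append(line.split(';'))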
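
The same idea can be written with a regular expression instead of the replacement loop, which is a swapped-in technique rather than what this answer uses. The sketch assumes that fields are always separated by at least two spaces while words inside the review are separated by exactly one; the function name is hypothetical.

```python
import re


def split_on_space_runs(line):
    """Split a line wherever two or more consecutive spaces occur."""
    return re.split(r' {2,}', line.strip())


sample = "  disjiad123    TYh23hs9     13160032    5     I love this phone as it is easy to use\n"
fields = split_on_space_runs(sample)
```

Unlike the semicolon approach, this keeps working when the review text itself contains a semicolon.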

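Answer 6 (score: 0)

I think the OP is using Amazon's review data, and if so, I also found this input file hard to read. I am not 100% sure, but I think the reason pandas.read_csv has difficulty is that the review_body column has tabs substituted for newlines (for whatever reason).

I tried a few of the solutions and ended up building a new one based on the solution proposed by @alexis. The solutions here did not work because the file in the link I provided has the following column names (note that "review_body" is neither at the beginning nor at the end of the list):

['marketplace', 'customer_id', 'review_id', 'product_id', 'product_parent', 'product_title', 'product_category', 'star_rating', 'helpful_votes', 'total_votes', 'vine', 'verified_purchase', 'review_headline', 'review_body', 'review_date']

I apologize for the similarity of the variable names. For example, there is a stopCol and a stopCols. I know... bad form.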

    # declare dictionary to contain columns from left-to-right search
    forwCols = {}
    # declare dictionary to contain "review_body" column
    stopCols = {}
    # declare dictionary to contain columns from right-to-left search
    revrCols = {}

    with open(filstr,'r') as TSVfile:
        lines    = TSVfile.readlines()
        # The header should have the maximum num of cols
        numCols  = len(lines[0].split())
        # Find which column index corresponds to 'review body' col
        stopCol  = lines[0].split().index('review_body')
        colNames = lines[0].split()

    for lineInt in range(1,len(lines)):
        # populate dict with cols until the column with tabs
        forwCols[lineInt] = lines[lineInt].\
                            split('\t',maxsplit=14)[:stopCol]
        # reverse list
        revrCols[lineInt] = lines[lineInt].rsplit('\t',maxsplit=2)[2:]
        forwLine = '\t'.join(forwCols[lineInt])
        revrLine = '\t'.join(revrCols[lineInt])
        # this next line removes the contents of the line that exists in
        # the dicts that are created already
        stopCols[lineInt] = \
                lines[lineInt].replace(forwLine,'').replace(revrLine,'')

    # Create three DFs using the three dicts just created
    revDF  = pd.DataFrame.from_dict(forwCols,orient='index',\
                            columns=colNames[:stopCol])
    dateDF = pd.DataFrame.from_dict(revrCols,orient='index',columns=['review_date'])
    revbodyDF = pd.DataFrame.from_dict(stopCols,orient='index',\
                                       columns=['review_body'])

    # join the three DFs together on indices
    combineDF1 = revbodyDF.merge(right=dateDF,how='outer',left_index=True,\
                                 right_index=True)
    combineDF = revDF.merge(right=combineDF1,how='outer',\
                                 left_index=True,right_index=True)

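The solution above is a brute-force approach, but it is the only way I can see that works when the column containing tabs is neither the first nor the last column.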
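
The core of the idea, splitting a fixed number of fields from the left and from the right and treating whatever remains as the free-text column, can be sketched on a single line as follows. This is an illustrative simplification, not the code above: the function name, the sample line and the column counts are all assumptions.

```python
def split_with_free_middle(line, n_left, n_right):
    """Split a tab-separated line whose one free-text column may contain tabs.

    n_left  -- number of well-behaved columns before the free-text column
    n_right -- number of well-behaved columns after it
    Assumes the line has at least n_left + 1 tab-separated pieces.
    """
    parts = line.rstrip('\n').split('\t', n_left)
    left, rest = parts[:n_left], parts[n_left]
    # Peel the trailing columns off the right; what remains is the free text
    tail = rest.rsplit('\t', n_right)
    middle, right = tail[0], tail[1:]
    return left + [middle] + right


# Hypothetical line: three leading columns, a body containing a tab, one trailing date
sample = "US\t123\tgreat\tbody with\ttabs\t2015-08-31\n"
fields = split_with_free_middle(sample, n_left=3, n_right=1)
```

For the column list shown earlier, review_body has 13 columns before it and 1 after it, so the corresponding call would use n_left=13 and n_right=1.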