Reading a CSV into a pandas DataFrame when a row may be split across multiple lines

Date: 2017-06-11 15:03:18

Tags: python csv pandas

I want to read this CSV file into a pandas.DataFrame:

Id,Name,Shape Library,Page Name,Line Connection Start,Line Connection End,Text Area 1,Text Area 2,Text Area 3,Text Area 4
1,Page,,0:Page 1,,,,,,
2,Table,Tables,0:Page 1,,,Openingsuren gemeentehuis,Action,"Is het gemeentehuis open?
Wat zijn de openingsuren van het gemeentehuis
Wanneer is het gemeentehuis open","webhook
De webserver staat niet op denk ik, gelieve ... te contacteren"
3,easy,Tables,0:Page 1,,,Openignsuren andere dag,Action,"En morgen?",
4,easy,Tables,0:Page 1,,,Openingsuren,,,

However, some rows span multiple lines (see Id 2).

Is there a way to read this correctly into a pandas DataFrame?

1 Answer:

Answer 0 (score: 1)

You can write your own parser with the csv module and then feed pandas from a generator, like this:

Code:

import csv
import pandas as pd

def read_my_csv(file_handle):
    # build csv reader
    reader = csv.reader(file_handle)

    # get and yield the header
    header = next(reader)
    yield header

    # for each row, get enough data and then yield the row
    for row in reader:
        while len(row) < len(header):
            row += next(reader)
        yield row

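# newline='' lets the csv module handle line endings itself; the old 'rU'
# mode was removed in Python 3.11
with open('file1', newline='') as f: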
    generator = read_my_csv(f)
    columns = next(generator)
    df = pd.DataFrame(generator, columns=columns)

print(df)

Result:

  Id   Name Shape Library Page Name Line Connection Start Line Connection End  \
0  1   Page                0:Page 1                                             
1  2  Table        Tables  0:Page 1                                             
2  3   easy        Tables  0:Page 1                                             
3  4   easy        Tables  0:Page 1                                             

                 Text Area 1 Text Area 2  \
0                                          
1  Openingsuren gemeentehuis      Action   
2    Openignsuren andere dag      Action   
3               Openingsuren               

                                         Text Area 3  \
0                                                      
1  Is het gemeentehuis open?\nWat zijn de opening...   
2                                         En morgen?   
3                                                      

                                         Text Area 4  
0                                                     
1  webhook\nDe webserver staat niet op denk ik, g...  
2                                                     
3
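
As a side note, the line breaks in the sample above all fall inside double-quoted fields, which is valid CSV, so `pd.read_csv` may already be enough for this particular file. The sketch below is only an illustration under that assumption: `SAMPLE` is a hypothetical in-memory copy of the question's data, and `dtype=str` with `keep_default_na=False` are choices made here so that the frame holds strings and empty cells like the output above. The generator approach is still the one to reach for when a record is broken outside of quotes.

```python
import io

import pandas as pd

# Hypothetical in-memory copy of the sample from the question,
# so the sketch runs without a file on disk.
SAMPLE = '''Id,Name,Shape Library,Page Name,Line Connection Start,Line Connection End,Text Area 1,Text Area 2,Text Area 3,Text Area 4
1,Page,,0:Page 1,,,,,,
2,Table,Tables,0:Page 1,,,Openingsuren gemeentehuis,Action,"Is het gemeentehuis open?
Wat zijn de openingsuren van het gemeentehuis
Wanneer is het gemeentehuis open","webhook
De webserver staat niet op denk ik, gelieve ... te contacteren"
3,easy,Tables,0:Page 1,,,Openignsuren andere dag,Action,"En morgen?",
4,easy,Tables,0:Page 1,,,Openingsuren,,,
'''

# Newlines inside double quotes are part of the field, so the default
# parser keeps each record on a single DataFrame row.
# dtype=str and keep_default_na=False (assumptions made for this sketch)
# keep every value as a string and leave empty cells as ''.
df = pd.read_csv(io.StringIO(SAMPLE), dtype=str, keep_default_na=False)

print(df.shape)
print(repr(df.loc[1, 'Text Area 3']))
```

Without `dtype=str` and `keep_default_na=False`, pandas would infer column types and show the empty cells as NaN, which may or may not be what you want.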