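How do I parse an oddly formatted data file?

Time: 2017-06-05 23:56:31

Tags: csv

How do I read a weirdly formatted data file?

For example, what if several different delimiters (, : |) are all used together?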

See the sample data frame (originally attached as an image).

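2 answers:

Answer 0: (score: 1)

Weird data calls for an elaborate response. First, split each column that contains k:v pairs and convert the pieces to pandas Series. Combine all three "Other" columns into one data frame: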

others = pd.concat(data[x].str.split(':').apply(pd.Series) 
                   for x in ('Other1', 'Other2', 'Other3')).dropna(how='all')

#                  0                  1
#0          Hospital   Awesome Hospital
#1           Hobbies            Cooking
#2          Hospital   Awesome Hospital
#0       Maiden Name              Rubin
#1  Hobby Experience           10 years
#2       Maiden Name            Simpson
#0               DOB         2015/04/09
#2               DOB         2015/04/16

Do some index manipulation (we want the keys to become the column names):

others = others.reset_index().set_index(['index',0]).unstack()
#                 1                                                          
#0              DOB   Hobbies Hobby Experience           Hospital Maiden Name
#index                                                                       
#0       2015/04/09      None             None   Awesome Hospital       Rubin
#1             None   Cooking         10 years               None        None
#2       2015/04/16      None             None   Awesome Hospital     Simpson

Remove the hierarchical column index produced by unstack():

others.columns = others.columns.get_level_values(0)
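
The line above may look as if it selects the outer level, so a small illustration of why it returns the keys could help. This is a sketch on a hypothetical miniature table (`long_df` and its values are assumptions made up for the demonstration): pandas resolves the argument of `get_level_values` as a level name first, and after `unstack()` the inner column level is named `0`.

```python
import pandas as pd

# Hypothetical miniature of the long table: column 0 holds keys, column 1 holds values.
long_df = pd.DataFrame(
    {0: ["DOB", "Hobbies", "DOB"], 1: ["2015/04/09", "Cooking", "2015/04/16"]},
    index=[0, 1, 2],
)
wide = long_df.reset_index().set_index(["index", 0]).unstack()

# The outer column level is unnamed; the inner one inherits the name 0
# from the index level that unstack() moved into the columns.
level_names = list(wide.columns.names)

# 0 matches a level *name*, so the inner level (the keys) is returned,
# not the level at position 0.
keys = list(wide.columns.get_level_values(0))
print(level_names, keys)
```

If the key column had any other name, `get_level_values(0)` would return the outer level instead, so passing the position of the inner level explicitly would be needed in that case.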

Put it all back together:

pd.concat([data[["Full Name","Town"]], others], axis=1)
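
For anyone who wants to run these steps end to end, here is a minimal sketch. The `data` frame is an assumption: its "Other" cells are reconstructed from the outputs printed above, and the names and towns are placeholders. It also deviates slightly from the code above: it uses `str.split(..., expand=True)` instead of `apply(pd.Series)`, strips the whitespace around keys and values, and selects the value column before `unstack()`, so no hierarchical column index is created.

```python
import pandas as pd

# Hypothetical sample: "Other" cells rebuilt from the printed outputs,
# names and towns are placeholders.
data = pd.DataFrame({
    "Full Name": ["Person A", "Person B", "Person C"],
    "Town": ["Town A", "Town B", "Town C"],
    "Other1": ["Hospital: Awesome Hospital", "Hobbies: Cooking",
               "Hospital: Awesome Hospital"],
    "Other2": ["Maiden Name: Rubin", "Hobby Experience: 10 years",
               "Maiden Name: Simpson"],
    "Other3": ["DOB: 2015/04/09", None, "DOB: 2015/04/16"],
})

# Split every "key: value" cell once; column 0 is the key, column 1 the value.
pieces = [data[col].str.split(":", n=1, expand=True)
          for col in ("Other1", "Other2", "Other3")]

# Stack the pieces, drop the empty cells, and keep the original row number
# as a regular column named "index".
others = pd.concat(pieces).dropna(how="all").reset_index()
others[0] = others[0].str.strip()
others[1] = others[1].str.strip()

# Pivot the keys into column names, one row per original record.
wide = others.set_index(["index", 0])[1].unstack()

# Glue the untouched columns and the new ones together by row index.
result = pd.concat([data[["Full Name", "Town"]], wide], axis=1)
print(result)
```

Note that `unstack()` raises an error if the same key appears twice in one row, so this sketch only fits data where each key occurs at most once per record.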

Answer 1: (score: 1)

parse has a nice interface and may be a good choice for pulling out data like this:

>>> import parse
>>> format_spec='{}: {}' 
>>> string='Hobbies: Cooking'
>>> parse.parse(format_spec, string).fixed
('Hobbies', 'Cooking')

If you are going to parse the same spec repeatedly, use compile:

>>> other_parser = parse.compile(format_spec)
>>> other_parser.parse(string).fixed
('Hobbies', 'Cooking')
>>> other_parser.parse('Maiden Name: Rubin').fixed
('Maiden Name', 'Rubin')

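The fixed attribute returns the parsed arguments as a tuple. Using these tuples, we can build a bunch of dictionaries, feed them to pd.DataFrame, and merge the result with the first df: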

import parse
import pandas as pd

# slice first two columns from original dataframe
first_df = pd.read_csv(filepath, sep='\t').iloc[:, 0:2]

# make the parser
other_parser = parse.compile('{}: {}')

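# parse remaining columns to a new dataframe
with open(filepath) as f:
    next(f)  # skip the header row, which read_csv already consumed
    # one dict per line is fed into DataFrame; the dict keys become column names
    # cells that do not match 'key: value' (e.g. empty ones) are skipped
    others_df = pd.DataFrame([
        dict(r.fixed for r in map(other_parser.parse, line.rstrip('\n').split('\t')[2:])
             if r is not None)
        for line in f
    ])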

# merge on the indexes
df = pd.merge(first_df, others_df, left_index=True, right_index=True)
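
If installing parse is not an option, the same "one dict per line" idea can be sketched with the standard library alone. This is a substitute technique, not the parse API: it uses `str.partition` in place of the `'{}: {}'` spec. The sample text, the header names, and the helper `parse_pairs` are all assumptions made up for the illustration.

```python
import io

# Hypothetical tab-separated input with a header row; the last cell of the
# second record is empty on purpose.
sample = (
    "Full Name\tTown\tOther1\tOther2\tOther3\n"
    "Person A\tTown A\tHospital: Awesome Hospital\tMaiden Name: Rubin\tDOB: 2015/04/09\n"
    "Person B\tTown B\tHobbies: Cooking\tHobby Experience: 10 years\t\n"
)


def parse_pairs(cells):
    """Turn 'key: value' cells into a dict, ignoring cells without a colon."""
    pairs = {}
    for cell in cells:
        key, sep, value = cell.partition(":")
        if sep:  # sep is empty when the cell contains no colon
            pairs[key.strip()] = value.strip()
    return pairs


with io.StringIO(sample) as f:
    next(f)  # skip the header row
    rows = [parse_pairs(line.rstrip("\n").split("\t")[2:]) for line in f]

print(rows)
```

The resulting list of dicts can be passed to pd.DataFrame in the same way as the dictionaries built from the parse results; keys missing from a record become NaN in that record's row.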