Answer 0 (score: 1)
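pandas copes well with oddly shaped data like this. First, split each column that contains k:v pairs and convert the pieces to pandas Series. Then combine all three "Other" columns into a single dataframe: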
others = pd.concat(data[x].str.split(':').apply(pd.Series)
                   for x in ('Other1', 'Other2', 'Other3')).dropna(how='all')
# 0 1
#0 Hospital Awesome Hospital
#1 Hobbies Cooking
#2 Hospital Awesome Hospital
#0 Maiden Name Rubin
#1 Hobby Experience 10 years
#2 Maiden Name Simpson
#0 DOB 2015/04/09
#2 DOB 2015/04/16
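The split step can be tried on a small frame. The following is a minimal sketch, not the answer's exact code: the `Full Name` and `Town` values are invented for illustration, `split_pairs` is a hypothetical helper, and it uses `str.split(..., n=1, expand=True)` in place of `.apply(pd.Series)` so that a value containing a colon stays in one piece and surrounding whitespace is stripped.

```python
import pandas as pd

# Hypothetical sample shaped like the output above; names and towns are invented.
data = pd.DataFrame({
    'Full Name': ['Jane Doe', 'John Roe', 'Mary Major'],
    'Town': ['Springfield', 'Shelbyville', 'Springfield'],
    'Other1': ['Hospital: Awesome Hospital', 'Hobbies: Cooking',
               'Hospital: Awesome Hospital'],
    'Other2': ['Maiden Name: Rubin', 'Hobby Experience: 10 years',
               'Maiden Name: Simpson'],
    'Other3': ['DOB: 2015/04/09', None, 'DOB: 2015/04/16'],
})


def split_pairs(column):
    """Split 'key: value' strings into a two-column frame (0 = key, 1 = value)."""
    # n=1 splits on the first colon only; expand=True returns a DataFrame.
    parts = column.str.split(':', n=1, expand=True)
    # Strip the whitespace left around keys and values.
    return parts.apply(lambda s: s.str.strip())


# Stack the three key/value frames and drop rows that came from empty cells.
others = pd.concat(
    [split_pairs(data[x]) for x in ('Other1', 'Other2', 'Other3')]
).dropna(how='all')
print(others)
```

The original row labels (0, 1, 2) repeat in the result, which is what the later index manipulation relies on.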
Next, do some index manipulation (we want the keys to become column names):
others = others.reset_index().set_index(['index',0]).unstack()
# 1
#0 DOB Hobbies Hobby Experience Hospital Maiden Name
#index
#0 2015/04/09 None None Awesome Hospital Rubin
#1 None Cooking 10 years None None
#2 2015/04/16 None None Awesome Hospital Simpson
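Drop the hierarchical column index produced by unstack(). The keys live in the second level (level 1); the first level only holds the old column label 1:
others.columns = others.columns.get_level_values(1)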
Finally, put the pieces back together:
pd.concat([data[["Full Name","Town"]], others], axis=1)
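The reshape and the final concat can be checked in isolation. This is a sketch under assumptions: `others` is typed in by hand to match the long-form output shown earlier, and the `Full Name` and `Town` values are invented. It selects level 1 of the column index, since that is where `unstack()` puts the keys.

```python
import pandas as pd

# Hypothetical name/town values; the key/value rows mirror the output above.
data = pd.DataFrame({
    'Full Name': ['Jane Doe', 'John Roe', 'Mary Major'],
    'Town': ['Springfield', 'Shelbyville', 'Springfield'],
})
others = pd.DataFrame(
    [['Hospital', 'Awesome Hospital'], ['Hobbies', 'Cooking'],
     ['Hospital', 'Awesome Hospital'], ['Maiden Name', 'Rubin'],
     ['Hobby Experience', '10 years'], ['Maiden Name', 'Simpson'],
     ['DOB', '2015/04/09'], ['DOB', '2015/04/16']],
    index=[0, 1, 2, 0, 1, 2, 0, 2],  # original row each pair came from
)

# Move the row label and the key into the index, then pivot the key to columns.
wide = others.reset_index().set_index(['index', 0]).unstack()
# Columns are now (1, key) pairs; keep only the key level.
wide.columns = wide.columns.get_level_values(1)

# Align on the row index and glue the fixed columns back on.
result = pd.concat([data[['Full Name', 'Town']], wide], axis=1)
print(result)
```

Note that `unstack()` needs each (row, key) pair to be unique; if the same key appeared twice in one row, pandas would raise a `ValueError` instead of reshaping.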
Answer 1 (score: 1)
The parse library has a nice interface and may be a good choice for pulling out data like this:
>>> import parse
>>> format_spec='{}: {}'
>>> string='Hobbies: Cooking'
>>> parse.parse(format_spec, string).fixed
('Hobbies', 'Cooking')
If you are going to parse the same spec repeatedly, use compile:
>>> other_parser = parse.compile(format_spec)
>>> other_parser.parse(string).fixed
('Hobbies', 'Cooking')
>>> other_parser.parse('Maiden Name: Rubin').fixed
('Maiden Name', 'Rubin')
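The fixed attribute returns the parsed arguments as a tuple. With these tuples we can build a bunch of dictionaries, feed them to pd.DataFrame, and merge the result with the first df: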
import parse
import pandas as pd

# slice the first two columns from the original dataframe
# (sep must be '\t' for a tab; .ix is gone from current pandas, so use .iloc)
first_df = pd.read_csv(filepath, sep='\t').iloc[:, 0:2]
# make the parser
other_parser = parse.compile('{}: {}')
# parse the remaining columns into a new dataframe
with open(filepath) as f:
    next(f)  # skip the header row that read_csv already consumed
    rows = []
    for line in f:
        cells = line.rstrip('\n').split('\t')[2:]
        results = (other_parser.parse(cell) for cell in cells)
        # one dict per row; empty cells parse to None and are skipped
        rows.append(dict(r.fixed for r in results if r is not None))
# the dict keys become the column names
others_df = pd.DataFrame(rows)
# merge on the indexes
df = pd.merge(first_df, others_df, left_index=True, right_index=True)