I'm just looking at the pandas library to see whether it suits my task.
I have a fixed-width file, which I define like this:
col_names = ['RecType','Region','SecCode','Data','FRF','Date']
col_def =[
[0,1],[1,4],[4,5],[5,123],[123,128],[128,132]
]
and read like this:
df = pd.read_fwf(datafile, colspecs=col_def, names=col_names)
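The layout above is correct for every line in the file. However, the structure of the data in the 'Data' column changes depending on the value of 'SecCode'.
For example, if SecCode is 'P', the data needs to be split as follows: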
col_names = ['RecType','Region','SecCode','SubCode','Name', 'Data','FRF','Date']
col_def =[
[0,1],[1,4],[4,5],[5,6],[6,16],[16,122],[122,127],[127,131]
]
However, if SecCode is 'W', the data needs to be split like this:
col_names = ['RecType','Region','SecCode','Name','SubCode', 'Data','FRF','Date']
col_def =[
[0,1],[1,4],[4,5],[5,15],[15,16],[16,122],[122,127],[127,131]
]
Sample data:
SAFRPAWIDGETA-1 DAAEDAFD26 D 02172DMEDAPC1E S TF BJA DA 08120071 D + 02297 -300 S 378651811
SAFRWWIDGETB-1 X DAAEDAFD26 D 02172DMEDAPC2P 378661811
SAFRPAWIDGETA-2 DAAEDAFD26 D 03152DMEDAPC1E S TF BJA DA 08120051 D + 01657 -300 S 378671811
SAFRWWIDGETB-2 X DAAEDAFD26 D 03152DMEDAPC2P 378681811
SAFRWWIDGETB-3 X DAAEDAFD26 D 041MD26 DAPC1EY M TF BJA DA 08120041 D 01329 -300 S 378691811
SAFRPAWIDGETA-3 DAAEDAFD26 D 041MD26 DAPC2P 378701811
SAFRPAWIDGETA-4 DAAEDAFD26 D 042BJA DAD 1V M TF 2610 + 00420 06600 A 378711811
SAFRWWIDGETB-4 X DAAEDAFD26 D 042BJA DAD 2P 378721811
SAFRPAWIDGETA-5 DAAEDAFD26 D 052BJA DAD 1VE FM BJA DA 359200103230 D + 06200 160 - A 378731811
SAFRWWIDGETB-5 X DAAEDAFD26 D 052BJA DAD 2P 378741811
In this sample data, the SubCode is A in rows where SecCode = 'P' and X in rows where SecCode = 'W'.
Is this possible? If so, how would I go about it?
Answer 0 (score: 1)
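I suggest iterating over the lines of the file, checking whether each line is of type "P" or "W", and then using the column definition for that type (see the comments in the code).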
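Note: when using pd.read_fwf(), I did not find a way to import a single line into a DataFrame together with col_names, because reading a single line always puts the row's values into the column headers. I am therefore using the odd-looking construct read_fwf(…).columns.values to get the values. Alternatively, you can use vals = [line.rstrip()[a:b] for (a, b) in col_def_P] to get the values directly as a list.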
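The two ways of getting one line's values can be compared side by side. The sketch below is not the answer's original code: it assumes that passing `header=None` together with `names=` to `pd.read_fwf` keeps a single line as a data row instead of a header, and it uses a synthetic 131-character record invented for illustration rather than the question's data. `dtype=str` and `keep_default_na=False` are added so that values stay plain strings.

```python
import io
import pandas as pd

# Column names and "P" layout copied from the answer's code.
col_names = ['RecType', 'Region', 'SecCode', 'SubCode', 'Name', 'Value', 'FRF', 'Date']
col_def_P = [[0, 1], [1, 4], [4, 5], [5, 6], [6, 16], [16, 122], [122, 127], [127, 131]]

# Synthetic 131-character "P" record, made up for illustration only.
line = "S" + "AFR" + "P" + "A" + "WIDGETA-1 " + "x" * 106 + "37865" + "1811" + "\n"

# Option 1: header=None tells read_fwf the line is data, not a header row.
one_row = pd.read_fwf(io.StringIO(line), colspecs=col_def_P, header=None,
                      names=col_names, dtype=str, keep_default_na=False)
vals_fwf = one_row.iloc[0].tolist()

# Option 2: plain slicing; .strip() trims the padding inside each field.
vals_sliced = [line.rstrip("\n")[a:b].strip() for (a, b) in col_def_P]

print(vals_fwf)
print(vals_sliced)
```

Plain slicing avoids building a temporary DataFrame for every line, which may matter on large files.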
import io
import pandas as pd

## create an empty DataFrame with column headers:
col_names = ['RecType', 'Region', 'SecCode', 'SubCode', 'Name', 'Value', 'FRF', 'Date']
df = pd.DataFrame(columns=col_names)

## create the column definitions for .read_fwf()
col_def_P = [[0,1],[1,4],[4,5],[5,6],[6,16],[16,122],[122,127],[127,131]]
## note that for type "W", we use a "non-continuous" order, i.e. we read [15,16] first,
#  and [5,15] next; this way, we have the values in the anticipated ordering
col_def_W = [[0,1],[1,4],[4,5],[15,16],[5,15],[16,122],[122,127],[127,131]]

with open('untitled.txt', 'r') as f:
    for line in f:
        if line[4:5] == "P":
            # wrap the line in a file-like object; read_fwf puts its values into the header
            vals = pd.read_fwf(io.StringIO(line), colspecs=col_def_P).columns.values
            ## or:
            # vals = [line.rstrip()[a:b] for (a, b) in col_def_P]
            df.loc[len(df)] = vals
        elif line[4:5] == "W":
            vals = pd.read_fwf(io.StringIO(line), colspecs=col_def_W).columns.values
            df.loc[len(df)] = vals
print(df)
Output:
   RecType  Region  SecCode  SubCode       Name                          Value    FRF  Date
0        S     AFR        P        A  WIDGETA-1  DAAEDAFD26 D 02172DMEDAPC1...  37865  1811
1        S     AFR        W        X  WIDGETB-1  DAAEDAFD26 D 02172DMEDAPC2...  37866  1811
2        S     AFR        P        A  WIDGETA-2  DAAEDAFD26 D 03152DMEDAPC1...  37867  1811
3        S     AFR        W        X  WIDGETB-2  DAAEDAFD26 D 03152DMEDAPC2...  37868  1811
4        S     AFR        W        X  WIDGETB-3  DAAEDAFD26 D 041MD26 DAPC1...  37869  1811
5        S     AFR        P        A  WIDGETA-3  DAAEDAFD26 D 041MD26 DAPC2...  37870  1811
6        S     AFR        P        A  WIDGETA-4   DAAEDAFD26 D 042BJA DAD 1...  37871  1811
7        S     AFR        W        X  WIDGETB-4   DAAEDAFD26 D 042BJA DAD 2...  37872  1811
8        S     AFR        P        A  WIDGETA-5   DAAEDAFD26 D 052BJA DAD 1...  37873  1811
9        S     AFR        W        X  WIDGETB-5   DAAEDAFD26 D 052BJA DAD 2...  37874  1811
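Appending with df.loc[len(df)] grows the DataFrame one row at a time, which can become slow on large files. One possible alternative, sketched here under assumptions, is to group the lines by SecCode first, parse each group with a single read_fwf call, and then restore the original line order. The names `LAYOUTS` and `read_mixed_fwf` are hypothetical, and the sample records are synthetic 131-character lines built for illustration, not the question's data.

```python
import io
import pandas as pd

COL_NAMES = ['RecType', 'Region', 'SecCode', 'SubCode', 'Name', 'Value', 'FRF', 'Date']
# One layout per SecCode, taken from the answer's col_def_P / col_def_W.
LAYOUTS = {
    'P': [(0, 1), (1, 4), (4, 5), (5, 6), (6, 16), (16, 122), (122, 127), (127, 131)],
    'W': [(0, 1), (1, 4), (4, 5), (15, 16), (5, 15), (16, 122), (122, 127), (127, 131)],
}

def read_mixed_fwf(lines):
    """Parse lines whose layout depends on the SecCode character at position 4."""
    groups = {code: [] for code in LAYOUTS}
    for pos, line in enumerate(lines):
        code = line[4:5]
        if code in groups:  # lines with an unknown SecCode are skipped
            groups[code].append((pos, line.rstrip("\n")))
    frames = []
    for code, items in groups.items():
        if not items:
            continue
        text = "\n".join(line for _, line in items)
        frame = pd.read_fwf(io.StringIO(text), colspecs=LAYOUTS[code],
                            header=None, names=COL_NAMES, dtype=str,
                            keep_default_na=False)
        frame.index = [pos for pos, _ in items]  # remember original line numbers
        frames.append(frame)
    if not frames:
        return pd.DataFrame(columns=COL_NAMES)
    return pd.concat(frames).sort_index()

# Synthetic records for illustration only.
sample = [
    "S" + "AFR" + "P" + "A" + "WIDGETA-1 " + "x" * 106 + "37865" + "1811",
    "S" + "AFR" + "W" + "WIDGETB-1 " + "X" + "y" * 106 + "37866" + "1811",
    "S" + "AFR" + "P" + "A" + "WIDGETA-2 " + "z" * 106 + "37867" + "1811",
]
result = read_mixed_fwf(sample)
print(result[['SecCode', 'SubCode', 'Name', 'FRF', 'Date']])
```

For a real file, the lines could be passed in with something like `read_mixed_fwf(open('untitled.txt'))`; because every column is read as a string here, any numeric conversion would be a separate step.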