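I have a UTF-8 encoded text file containing report output that I want to get into a DataFrame. My problem is that the data is not in tabular format; it is made up of parent rows, child rows, page headers and so on.
Here is a sample of the file layout; the full file has roughly 2,000 records.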
ACME LTD (SP) Report for Mexico Time 14:18:11 Date 04082019
Mexico *********/JOEOD Page 1
Cnno Acct no Tax number Address
1 ABC3415 899111752 Kellys Hair ONE ST JOHNS CHURCHYARD ED45 8LP LONDON
PstDte Docno DocDte Reference no ClgDte WT code Invoice amnt Base amount tax Net amount T x-exempt amt
tax type: W1 tax code: WA
80519 5100002076 70519 20006874 50719 WA 1156961002 1156961003 76311439 1156961002 -1
10619 5100002673 70519 20007095 50719 WA 2147567637 2147567637 144956394 2147567637 0
****** WA 3304528639 330452864 221267833 3304528639 -1
** **** 3304528639 330452864 221267833 3304528639 -1
2 BFG4919 7880487069 SPA LTD OHNSON HOUSE GREENBY SQHH1 3DF READING
tax type: W1 tax code: WA
30619 5100002672 30619 90331014 20719 WA 2260302 1883585 1260708 1883585 376717
30619 5100002681 30619 90331015 20719 WA 73519295 61266079 4100618 61266079 12253216
10719 5100002679 30619 90331016 20719 WA 105593207 87994339 5719633 87994339 17598868
10719 5100002680 30619 90331017 20719 WA 82808594 69007162 4485466 69007162 13801432
10719 5100003245 10719 90332783 300719 WA 80358636 6696553 4447229 6696553 13393106
10719 5100003246 10719 90332782 300719 WA 102408262 85340218 5667505 85340218 17068044
10719 5100003247 10719 90332781 300719 WA 73498752 6124896 4067587 6124896 12249792
10719 5100003248 10719 90332780 300719 WA 22784614 18987178 1260952 18987178 3797436
****** WA 56357438 469645316 31009698 469645316 93929064
** **** 56357438 469645316 31009698 469645316 93929064
3 KLU5437 6781754415 BIRDS SERVICES LIMITED GREEN HOUSE REDCAR INDUSTEC4L 4HJ LONDON
tax type: CS tax code: CS
110619 5100002956 120619 1975674 90719 CS 1839932 17523288 91166 17523288 876032
10719 5100003373 120619 1975677 120719 CS 78940756 705990901 35886346 754108083 83416659
10719 5100003391 120619 1975675 120719 CS 643442103 61280197 31149443 61280197 30640133
****** CS 1451248983 1336316159 67947449 1384433341 114932824
tax type: W1 tax code: WA
110619 5100002956 120619 1975674 90719 WA 1839932 17523288 1185159 17523288 876032
10719 5100003373 120619 1975677 120719 WA 78940756 754108084 49831859 754108083 35299476
10719 5100003389 60619 1975671 120719 WA 368898403 368898403 24377001 368898403 0
10719 5100003391 120619 1975675 120719 WA 643442103 61280197 40494277 61280197 30640133
10719 5100003394 110619 1975678 120719 WA 1421290282 1421290283 93919609 1421290282 -1
10719 5100003513 120619 1975676 190719 WA 172718664 172718664 11434027 172718664 0
10719 5100003626 210619 1975693 260719 WA 276901444 25751819 17101966 276901444 19383254
****** WA 3691057776 3604858882 238343898 3624242134 86198894
tax type: X1 tax code: XA
110619 5100002956 120619 1975674 90719 XA 1839932 17523288 91167 17523288 876032
10719 5100003373 120619 1975677 120719 XA 78940756 754108084 383322 754108083 35299476
10719 5100003389 60619 1975671 120719 XA 368898403 368898403 1875154 368898403 0
10719 5100003391 120619 1975675 120719 XA 643442103 61280197 3114945 61280197 30640133
10719 5100003394 110619 1975678 120719 XA 1421290282 1421290283 7224586 1421290282 -1
10719 5100003513 120619 1975676 190719 XA 172718664 172718664 879541 172718664 0
10719 5100003626 210619 1975693 260719 XA 276901444 25751819 1315536 276901444 19383254
****** XA 3691057776 3604858882 18334149 3624242134 86198894
ACME LTD (SP) Report for Mexico Time 14:18:11 Date 04082019
Mexico *********/JOEOD Page 2
Cnno Acct no Tax number Address
3 KLU5437 6781754415 BIRDS SERVICES LIMITED GREEN HOUSE REDCAR INDUSTEC4L 4HJ LONDON
PstDte Docno DocDte Reference no ClgDte WT code Invoice amnt Base amount Withholdtax Net amount T x-exempt amt
** **** 3691057776 8546033923 324625496 3624242134 -4854976147
4 KLD15935 837960557 BOJACK GROUP LTD HORSEMAN HOUSE SHADWELLGH12 3BB ABERDEEN
tax type: W1 tax code: WA
10719 5100003296 290519 82620012754 90719 WA 6863606446 6863606446 443122606 6863606446 0
10719 5100003654 210619 82620013425 260719 WA 5854587092 585458709 381911219 5854587092 2
****** WA 12718193538 12718193536 825033825 12718193538 2
** **** 12718193538 12718193536 825033825 12718193538 2
5 HDH943859 Rover Energy Schweiz AG SWIZSTRASSE 345 1005 ZURICH
tax type: W1 tax code: WA
10719 5100003613 20419 2963427 260719 WA 2893481234 2893481234 190177614 2893481234 0
10719 5100003614 20419 2963426 260719 WA 2893481234 2893481234 190177614 2893481234 0
****** WA 5786962468 5786962468 380355228 5786962468 0
** **** 5786962468 5786962468 380355228 5786962468 0
I would like to format the data into the following flat structure:
Cnno, Acct no, Tax number, Address, PstDte, Docno, DocDte, Reference no, clg date,tax type, WT code, Invoice amnt,Base amount,tax,Net amount,T x-exempt amt
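Frankly, I have not got further than loading the data into a DataFrame and dropping the blank rows. I have searched but cannot seem to find any similar examples, so links to tutorials that tackle a similar data extraction problem would be great, and any ideas on how to approach it would be a start.
Answer 0 (score: 0)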
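After looking into it further, this is the cleaning approach I took.
Load the file into a df with no header, so the columns are simply named 0, 1, 2, etc. and missing cells are NaN.
Drop all rows that are entirely NaN (axis=0 drops rows, not columns):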
df2 = df.dropna(axis = 0, how ='all').copy()
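I wanted to keep the company name but none of the other data in that column, such as the report title or the country, so I split the string to discard the text I did not need, then created a mask for the rows containing Mexico and filtered the df to remove them.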
df2[0] = df2[0].str.split(' ').str[0]
mask = (df2[0] == 'Mexico')
df3 = df2[~mask].copy()
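I then used a forward fill to copy the company name onto every row of the df (there are multiple company names; the report lists all records for one company, then the next, and so on).
df3[0] = df3[0].ffill()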
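The split, mask and forward-fill steps above can be tried on a tiny frame. This is a minimal sketch, assuming the page-header text lands in column 0 and the data rows leave that column empty; the frame `raw` and its contents are made up for illustration and are not the real file.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the loaded report: header text in column 0,
# data rows with column 0 empty (NaN).
raw = pd.DataFrame({
    0: ["ACME LTD (SP) Report for Mexico", "Mexico *********/JOEOD", np.nan, np.nan],
    1: [np.nan, np.nan, "1", "80519"],
})

# Keep only the first word of column 0 (the company name on title rows).
raw[0] = raw[0].str.split(" ").str[0]

# Drop the rows whose first word is 'Mexico'; NaN rows compare False and survive.
mask = raw[0] == "Mexico"
kept = raw[~mask].copy()

# Copy the company name down onto the data rows that follow it.
kept[0] = kept[0].ffill()
print(kept)
```

Building the mask with `==` and negating it keeps the NaN rows, which is what lets the forward fill reach the data rows afterwards.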
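Column [1] holds both the parent record's Cnno and the child record's PstDte. These are numbers stored as text, so I filtered this column with to_numeric, which removed all the repeated header and page number rows and left only the parent and child rows.
df4 = df3[df3[[1]].apply(pd.to_numeric, errors='coerce').notnull().all(axis=1)].copy()
I then created a new column 'Cnno' and populated it using
df4.loc[pd.to_numeric(df4[1]) < 9999, 'Cnno'] = df4[1]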
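Cnno and PstDte are both numeric, but because PstDte is a "date" with a minimum length of 5 and Cnno is never longer than 4 digits, I can use that to tell parent rows from child rows.
In the DataFrame each parent row is followed by its children, so I can forward-fill 'Cnno' to copy the parent's Cnno onto its children and link the records.
df4['Cnno'] = df4['Cnno'].ffill()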
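To make the parent/child rule concrete, here is a small sketch of the same idea on made-up values. It assumes, as described above, that a parent Cnno never exceeds 4 digits while a PstDte always has at least 5; the frame `rows` and the names `data` and `is_parent` are illustrative only.

```python
import pandas as pd

# Hypothetical contents of column 1: repeated header words mixed with
# parent numbers (Cnno) and child dates (PstDte), all stored as text.
rows = pd.DataFrame({1: ["Cnno", "1", "80519", "10619", "PstDte", "2", "30619"]})

# Non-numeric header rows become NaN under errors='coerce' and are dropped.
numeric = pd.to_numeric(rows[1], errors="coerce")
data = rows[numeric.notna()].copy()
data[1] = pd.to_numeric(data[1])

# Values below 9999 are parents; everything else is a child row.
is_parent = data[1] < 9999

# Keep the number only on parent rows, then fill it down onto the children.
data["Cnno"] = data[1].where(is_parent).ffill()
print(data)
```

Converting column 1 to numbers before comparing matters, because comparing text values with 9999 would raise an error rather than classify the rows.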
I then created a Parent column to flag the parent records (not strictly necessary).
df4['Parent'] = (pd.to_numeric(df4[1]) < 9999).astype(int)
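I then filtered on the Parent column and copied the data to a new df, dropped all the empty columns, dropped column [1] (the old Cnno data) and added new column headers for the rest. Whenever a new page starts in the original file the parent row is repeated, so the same data appears on multiple rows; I therefore dropped the duplicates and kept only the first occurrence.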
Parent = df4[df4['Parent'] == 1].copy()
Parent = Parent.dropna(axis=1, how='all')
Parent = Parent.drop(Parent.columns[1] , axis=1)
Parent.columns = ['Company','Account No','Tax Code','Vendor Address','Cnno','Parent']
Parent.drop_duplicates(keep='first', inplace=True)
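As a quick illustration of the de-duplication step, the sketch below repeats one parent row the way a page break would. The frame `parents` is invented for the example and uses only a subset of the columns.

```python
import pandas as pd

# Hypothetical parent rows: account KLU5437 appears twice because the
# report reprints the parent line at the top of a new page.
parents = pd.DataFrame({
    "Company": ["ACME", "ACME", "ACME"],
    "Account No": ["BFG4919", "KLU5437", "KLU5437"],
    "Cnno": [2, 3, 3],
})

# Rows identical in every column collapse to their first occurrence.
unique_parents = parents.drop_duplicates(keep="first").reset_index(drop=True)
print(unique_parents)
```

Note that drop_duplicates only removes rows that match in every column, so a repeated parent whose address was split differently across pages would still need separate handling.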
This gives a clean df of parent records only:
Company, Account No, Tax Code, Vendor Address, Cnno, Parent
5 ACME Ltd, ABC3415, 899111752, Kellys Hair ONE ST JOHNS CHURCHYARD ED45 8LP LONDON, 1, 1
18 ACME Ltd, BFG4919, 7880487069, SPA LTD OHNSON HOUSE GREENBY SQHH1 3DF READING, 2, 1
I then did essentially the same for the child records:
Children = df4[df4['Parent'] != 1].copy()
Children = Children.dropna(axis=1, how='all')
Children.columns = ['Company','PstDte', 'DocNo','DocDte','Reference no','ClgDte','WT code','Invoice amnt','Base amount','tax','Net amount','T x-exempt amt','Cnno','Parent']
This gave me a clean df of all the child records. I then merged the parent and child records using Company and Cnno as the keys.
Final = pd.merge(Parent, Children, how='left', left_on=['Company','Cnno'], right_on = ['Company','Cnno'])
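The merge can be checked on a couple of rows. This is a sketch with invented frames `parents` and `children`; the `validate` argument is an optional extra that is not part of the original answer, added here so that any parent duplicates left over from page breaks raise an error instead of silently multiplying child rows.

```python
import pandas as pd

# Hypothetical cleaned parent and child frames sharing the two key columns.
parents = pd.DataFrame({
    "Company": ["ACME", "ACME"],
    "Cnno": [1, 2],
    "Account No": ["ABC3415", "BFG4919"],
})
children = pd.DataFrame({
    "Company": ["ACME", "ACME", "ACME"],
    "Cnno": [1, 1, 2],
    "DocNo": ["5100002076", "5100002673", "5100002672"],
})

# Left merge: every parent is kept and repeated once per matching child.
# validate='one_to_many' asserts the parent keys are unique.
flat = parents.merge(children, how="left", on=["Company", "Cnno"],
                     validate="one_to_many")
print(flat)
```

Each child row now carries its parent's account details, which is the flat structure asked for in the question.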
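After that it is a case of formatting each date column, plus any other bits such as dtypes.
# Zero-pad to six digits first: a value such as 10719 means 01/07/19, and without padding it can be misread or rejected
Final['PstDte'] = pd.to_datetime(Final['PstDte'].astype(str).str.zfill(6), format='%d%m%y')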
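The same conversion applies to the other date columns. Below is a minimal sketch, assuming all three columns use day-month-year with the leading zero stripped (so 80519 is 08/05/19); the helper name `parse_ddmmyy` and the sample frame `final` are hypothetical.

```python
import pandas as pd

def parse_ddmmyy(values: pd.Series) -> pd.Series:
    """Parse DDMMYY values whose leading zero may have been dropped."""
    # Restore the leading zero so every value is exactly six characters.
    padded = values.astype(str).str.zfill(6)
    return pd.to_datetime(padded, format="%d%m%y")

# Hypothetical slice of the merged frame with dates still stored as text.
final = pd.DataFrame({
    "PstDte": ["80519", "10719"],
    "DocDte": ["70519", "10719"],
    "ClgDte": ["50719", "300719"],
})

for col in ["PstDte", "DocDte", "ClgDte"]:
    final[col] = parse_ddmmyy(final[col])
print(final)
```

If the values have passed through a float column at any point, they would need converting to integers before the string step, since a trailing ".0" would break the format.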