我有从pdf文件解析的csv文件,由于pdf文件中的表格有多行,因此无法正确解析。 将其导入pd DataFrame如下所示。
Record Operational Address BIC
2 2007-03-03 Omladinskih Brigada 90V 11070 BEOGRAD SERBIA, AAAARSBG
3 NaN REPUBLIC OF NaN
4 1994-03-07 SAFAT ALI AL SALEM STREET, MUBARAKIA AAACKWKW
5 NaN OPPOSITE PUBLIC LIBRARY 13022 KUWAIT NaN
6 NaN KUWAIT NaN
7 2006-06-03 CHEZ NSM 3, AVENUE HOCHE 75008 PARIS AAADFRP1
8 NaN FRANCE NaN
9 2006-06-03 10 RUE DU COLISEE 75008 PARIS FRANCE AAAGFRP1
10 NaN NaN NaN
11 2014-07-05 152, 6TH OF SEPTEMBER BLVD. BUSINESS AAAJBG21
12 NaN CENTER LEGIS 4000 PLOVDIV BULGARIA NaN
13 1989-03-29 DHABAB STREET HEAD OFFICE BUILDING 11431 AAALSARI
14 NaN RIYADH SAUDI ARABIA NaN
如果“记录”列中的下一个值为NaN,我想将下一行连接到当前行。
我是说我想得到
Record Operational Address BIC
2 2007-03-03 Omladinskih Brigada 90V 11070 BEOGRAD SERBIA, REPUBLIC OF AAAARSBG
4 1994-03-07 SAFAT ALI AL SALEM STREET, MUBARAKIA OPPOSITE PUBLIC LIBRARY 13022 KUWAIT KUWAIT AAACKWKW
7 2006-06-03 CHEZ NSM 3, AVENUE HOCHE 75008 PARIS FRANCE AAADFRP1
9 2006-06-03 10 RUE DU COLISEE 75008 PARIS FRANCE AAAGFRP1
11 2014-07-05 152, 6TH OF SEPTEMBER BLVD. BUSINESS CENTER LEGIS 4000 PLOVDIV BULGARIA AAAJBG21
13 1989-03-29 DHABAB STREET HEAD OFFICE BUILDING 11431 RIYADH SAUDI ARABIA AAALSARI
这是数据框
import numpy a np
data = {'Record': {2: '2007-03-03',
3: np.NaN,
4: '1994-03-07',
5: np.NaN,
6: np.NaN,
7: '2006-06-03',
8: np.NaN,
9: '2006-06-03',
10: np.NaN,
11: '2014-07-05',
12: np.NaN,
13: '1989-03-29',
14: np.NaN},
'Operational Address': {2: 'Omladinskih Brigada 90V 11070 BEOGRAD SERBIA,',
3: 'REPUBLIC OF',
4: 'SAFAT ALI AL SALEM STREET, MUBARAKIA',
5: 'OPPOSITE PUBLIC LIBRARY 13022 KUWAIT',
6: 'KUWAIT',
7: 'CHEZ NSM 3, AVENUE HOCHE 75008 PARIS',
8: 'FRANCE',
9: '10 RUE DU COLISEE 75008 PARIS FRANCE',
10: np.NaN,
11: '152, 6TH OF SEPTEMBER BLVD. BUSINESS',
12: 'CENTER LEGIS 4000 PLOVDIV BULGARIA',
13: 'DHABAB STREET HEAD OFFICE BUILDING 11431',
14: 'RIYADH SAUDI ARABIA'},
'BIC': {2: 'AAAARSBG',
3: np.NaN,
4: 'AAACKWKW',
5: np.NaN,
6: np.NaN,
7: 'AAADFRP1',
8: np.NaN,
9: 'AAAGFRP1',
10: np.NaN,
11: 'AAAJBG21',
12: np.NaN,
13: 'AAALSARI',
14: np.NaN}}
df = pd.DataFrame(data=data)
答案 0 :(得分:1)
使用cumsum
形成组并为每列指定一个汇总字典。
agg_d = {'Record': 'first',
'Operational Address': lambda x: ' '.join(x.dropna()),
'BIC': 'first'}
df.groupby(df.Record.notnull().cumsum().rename(None)).agg(agg_d)
Record Operational Address BIC
1 2007-03-03 Omladinskih Brigada 90V 11070 BEOGRAD SERBIA, REPUBLIC OF AAAARSBG
2 1994-03-07 SAFAT ALI AL SALEM STREET, MUBARAKIA OPPOSITE PUBLIC LIBRARY 13022 KUWAIT KUWAIT AAACKWKW
3 2006-06-03 CHEZ NSM 3, AVENUE HOCHE 75008 PARIS FRANCE AAADFRP1
4 2006-06-03 10 RUE DU COLISEE 75008 PARIS FRANCE AAAGFRP1
5 2014-07-05 152, 6TH OF SEPTEMBER BLVD. BUSINESS CENTER LEGIS 4000 PLOVDIV BULGARIA AAAJBG21
6 1989-03-29 DHABAB STREET HEAD OFFICE BUILDING 11431 RIYADH SAUDI ARABIA AAALSARI