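I'm trying to clean up a PDF and turn it into a file I can use for address geocoding. I've been using tabula-py to split up the PDF, and it has worked well up to the point of dropping rows that are entirely empty. I'm not even sure this is a valid approach.
I've tried most of the solutions SO suggested to me, but I still can't figure it out. I've tried setting inplace=True, axis=0 and axis=1, and how="all". I also tried indexing the NaN values, but that didn't work either.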
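For reference, here is a minimal sketch of the kind of drop being attempted, on a small made-up frame (the `toy` data and the names `unchanged` and `cleaned` are hypothetical, not taken from the PDF). It assumes that some cells hold the literal string "nan" rather than a real NaN, which is what `str(x)` produces for a missing value; in that case `how="all"` never sees a fully empty row.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data mixing real NaN with the literal string "nan"
toy = pd.DataFrame({
    "County_Name": ["Adams", np.nan, np.nan],
    "Address": ["9137 State Route 136", np.nan, "Suite E"],
    "ClinicNum": ["100", "nan", "nan"],
})

# Drops nothing: the string "nan" counts as a real value, so no row is all-NaN
unchanged = toy.dropna(axis=0, how="all")

# Turn the placeholder strings into real NaN first, then drop fully empty rows
cleaned = toy.replace("nan", np.nan).dropna(axis=0, how="all")

print(len(unchanged), len(cleaned))
```

If this assumption holds for the real data, only the row with no values at all is removed, and a partially filled row such as the "Suite E" one is kept.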
import pandas as pd
import tabula
pd.set_option('display.width', 500)
df = tabula.read_pdf("C:\\Users\\Jack\\Documents\\Schoolwork\\Schoolwork\\WICResearch\\RefDocs\\wicclinicdirectory.pdf", pages='all', guess = False, pandas_options={'header': None})
df.columns = ["County_Name", "Clinic_Number", "Clinic_Name", "Address", "City", "Zip_Code", "Phone_Number", "Hours_of_Operation"]
df.drop(["Phone_Number", "Hours_of_Operation"], axis = 1, inplace = True)
# A lot of code here removes the unwanted repeated column headers that result from reading the PDF with tabula.
df.drop(["Clinic_Name"], axis = 1, inplace = True)
df[['ClinicNum','ClinicName']] = df.Clinic_Number.apply(lambda x: pd.Series(str(x).split(" ", maxsplit = 1)))
df.drop(["Clinic_Number"], axis = 1, inplace = True)
#df[~df.isin(['NaN', 'NaT']).any(axis=1)]
#df.dropna(axis= 0, how ='all', inplace = True)
NaNIndex = df.index[df.isnull().all(1)]
print(NaNIndex)
print(df)
The above code gives this output:
Index: []
County_Name Address City Zip_Code ClinicNum ClinicName
0 NaN NaN Ohio WIC Clinic Locations NaN nan NaN
1 NaN NaN NaN NaN Clinic NaN
3 Adams 9137 State Route 136 West Union 45693 100 Adams County WIC Program
4 NaN NaN NaN NaN nan NaN
5 NaN NaN NaN NaN nan NaN
6 Allen 940 North Cable Road, Suite 4 Lima 45805 200 Allen County WIC Program
7 Ashland 934 Center Street Ashland 44805 300 Ashland County WIC Program
8 NaN Suite E NaN NaN nan NaN
9 Ashtabula Geneva Community Center Geneva 44041 403 Geneva WIC Clinic
10 NaN 62 West Main Street NaN NaN nan NaN
11 Ashtabula Jefferson United Methodist Church Jefferson 44047 402 Jefferson WIC Clinic
12 NaN 125 East Jefferson Street NaN NaN nan NaN
13 Ashtabula Conneaut Human Resource Center Conneaut 44030 401 Conneaut WIC Clinic
14 NaN 327 Mill Street NaN NaN nan NaN
15 Ashtabula 3225 Lake Avenue Ashtabula 44004 400 Ashtabula County WIC Program
16 NaN NaN NaN NaN nan NaN
18 NaN NaN NaN NaN Clinic NaN
20 Ashtabula St. Mary's Catholic Church Orwell 44076 490 Orwell WIC Clinic
What I want is:
Index: []
County_Name Address City Zip_Code ClinicNum ClinicName
3 Adams 9137 State Route 136 West Union 45693 100 Adams County WIC Program
6 Allen 940 North Cable Road, Suite 4 Lima 45805 200 Allen County WIC Program
7 Ashland 934 Center Street Ashland 44805 300 Ashland County WIC Program
8 NaN Suite E NaN NaN nan NaN
9 Ashtabula Geneva Community Center Geneva 44041 403 Geneva WIC Clinic
10 NaN 62 West Main Street NaN NaN nan NaN
11 Ashtabula Jefferson United Methodist Church Jefferson 44047 402 Jefferson WIC Clinic
12 NaN 125 East Jefferson Street NaN NaN nan NaN
13 Ashtabula Conneaut Human Resource Center Conneaut 44030 401 Conneaut WIC Clinic
14 NaN 327 Mill Street NaN NaN nan NaN
15 Ashtabula 3225 Lake Avenue Ashtabula 44004 400 Ashtabula County WIC Program
18 NaN NaN NaN NaN Clinic NaN
20 Ashtabula St. Mary's Catholic Church Orwell 44076 490 Orwell WIC Clinic
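I'm able to create the desired dataframe with the right headers, but it either still doesn't drop the NaN rows or drops everything. I'd also like to be able to move the rows that aren't entirely NaN into the row they belong to, so that each record is on a single line.
I'm also not sure how reproducible this is, since I fiddled with a lot of things in tabula trying to get this PDF to convert.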
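To make the "single line per record" goal concrete, here is a rough sketch on made-up rows shaped like the output above. It assumes that a new record starts wherever County_Name is present and that every following row with a missing County_Name is a continuation of it; `toy`, `record_id`, `join_non_null`, and `merged` are illustrative names only.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: each clinic is followed by a continuation line for its address
toy = pd.DataFrame({
    "County_Name": ["Ashland", np.nan, "Ashtabula", np.nan],
    "Address": ["934 Center Street", "Suite E",
                "Geneva Community Center", "62 West Main Street"],
    "City": ["Ashland", np.nan, "Geneva", np.nan],
})

# Number the records: the counter increases on every row that has a County_Name
record_id = toy["County_Name"].notna().cumsum().rename("record_id")

def join_non_null(values):
    """Join the non-missing pieces of one column within one record."""
    parts = values.dropna().astype(str)
    return ", ".join(parts) if len(parts) else np.nan

# Collapse each record's rows into one row, column by column
merged = toy.groupby(record_id).agg(join_non_null).reset_index(drop=True)
print(merged)
```

Rows that appear before the first County_Name would end up in their own group under this scheme, so leftover header rows would need to be dropped beforehand.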