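I'm new to Python.
I've been trying to scrape a table from http://www.phc4.org/reports/utilization/inpatient/CountyReport20192C001.htm. The target table is titled "Utilization by Body System".
I was able to capture the table with BeautifulSoup; however, the layout of the scraped data is driving me crazy, and I can't find a way to fix it.
My code: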
import re
import urllib.request

import bs4

source = urllib.request.urlopen('http://www.phc4.org/reports/utilization/inpatient/CountyReport20192C001.htm').read()
soup = bs4.BeautifulSoup(source, 'lxml')

# Find the county utilization table by MDC:
# locate the title text, then walk up to the table that contains it.
table_mdc = soup.find(string=re.compile("Utilization by Body System")).find_parent('table')
# print(table_mdc)

# Print every cell of the table (one print call per cell)
for row in table_mdc.find_all('tr'):
    for cell in row.find_all('td'):
        print(cell.text)

# Write every cell to a text file (no separator between cells or rows)
with open('utilization.txt', 'w') as r:
    for row in table_mdc.find_all('tr'):
        for cell in row.find_all('td'):
            r.write(cell.text)
For example, the scraped data prints as:
Utilization by Body System
MDC Description
Total Cases
Number
Percent
Total Charges
% of Charges
Avg. Charge
Total Days
% of Total Days
Avg. LOS
Total
2,594
100.0%
$101,757,824
100.0%
$39,228
11,972
100.0%
4.6
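Both the printed output and the txt file contain a lot of line breaks. The ideal txt file should look like this:
(with no "Total" in the header)
What can I do to overcome these problems?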
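
For reference, here is a minimal sketch of the direction I think this might need to go: clean each cell, then emit one line per table row instead of one line per cell. It is only an illustration under assumptions. `SAMPLE_HTML` is a simplified, hypothetical stand-in for the real page (the real table has more columns and a two-level header), `table_to_rows` is a helper name I made up, and the tab delimiter is an arbitrary choice.

```python
import csv
import io

import bs4

# Hypothetical, simplified stand-in for the real page so the sketch runs offline.
SAMPLE_HTML = """
<table>
  <tr><td colspan="3">Utilization by Body System</td></tr>
  <tr><td>MDC Description</td><td>Total Cases</td><td>Total Charges</td></tr>
  <tr><td>
      Total
  </td><td> 2,594 </td><td>$101,757,824</td></tr>
</table>
"""


def table_to_rows(table):
    """Return one list of cleaned cell strings per <tr>, skipping empty rows."""
    rows = []
    for tr in table.find_all("tr"):
        # Collapse embedded newlines and runs of spaces inside each cell.
        cells = [" ".join(td.get_text().split()) for td in tr.find_all(["td", "th"])]
        if any(cells):
            rows.append(cells)
    return rows


soup = bs4.BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = table_to_rows(soup.find("table"))

# Write one tab-separated line per table row (to memory here, for illustration).
buffer = io.StringIO()
csv.writer(buffer, delimiter="\t", lineterminator="\n").writerows(rows)
print(buffer.getvalue())
```

To write to disk instead, the same `csv.writer` call could be given a file opened with `open('utilization.txt', 'w', newline='')`. I have not checked this against the real page; if the page nests tables inside one another, `find_all('tr')` would also pick up the inner rows, so the row selection may need adjusting.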