Scraping data using BeautifulSoup

Time: 2019-10-03 16:19:20

Tags: python-3.x web-scraping beautifulsoup

I am trying to parse HTML with BeautifulSoup, but I cannot get the data out of it.

<tr>
<td align="left" colspan="3" style="font-size:10.0pt;font-weight:800;" valign="top">49-009-41057
<td align="left" colspan="4" style="font-size:10.0pt;font-weight:800;" valign="top">CHESAPEAKE OPERATING LLC 
<td align="left" colspan="1" style="font-size:10.0pt;font-weight:800;" valign="top"> 
<tr>
<td align="left" colspan="3" style="border-top:none; border-left:none;border-bottom:none; border-right:none;padding:.01in .01in .01in .01in;height:6.75pt" valign="top"><span style="font-size:5.75pt;font-family:Arial;color:Darkgray;">Well Name</span></td>
<td align="left" colspan="4" style="border-top:none; border-left:none;border-bottom:none; border-right:none;padding:.01in .01in .01in .01in;height:6.75pt" valign="top"><span style="font-size:5.75pt;font-family:Arial;color:Darkgray;">Field</span></td>
<td align="left" colspan="1" style="border-top:none; border-left:none;border-bottom:none; border-right:none;padding:.01in .01in .01in .01in;height:6.75pt" valign="top"><span style="font-size:5.75pt;font-family:Arial;color:Darkgray;"> </span></td>
<tr>
<td align="left" colspan="3" style="font-size:10.0pt;font-weight:800;" valign="top">SFU 10-34-72 USAC TR 23H 
<td align="left" colspan="4" style="font-size:10.0pt;font-weight:800;" valign="top">WC 
<td align="left" colspan="1" style="font-size:10.0pt;font-weight:800;" valign="top">
<tr>
<td align="left" colspan="3" style="border-top:none; border-left:none;border-bottom:none; border-right:none;padding:.01in .01in .01in .01in;height:6.75pt" valign="top"><span style="font-size:5.75pt;font-family:Arial;color:Darkgray;">Surface Location</span></td>
<td align="left" colspan="1" style="border-top:none; border-left:none;border-bottom:none; border-right:none;padding:.01in .01in .01in .01in;height:6.75pt" valign="top"><span style="font-size:5.75pt;font-family:Arial;color:Darkgray;">Section</span></td>
<td align="left" colspan="2" style="border-top:none; border-left:none;border-bottom:none; border-right:none;padding:.01in .01in .01in .01in;height:6.75pt" valign="top"><span style="font-size:5.75pt;font-family:Arial;color:Darkgray;">Township/Range</span></td>
<td align="left" colspan="1" style="border-top:none; border-left:none;border-bottom:none; border-right:none;padding:.01in .01in .01in .01in;height:6.75pt" valign="top"><span style="font-size:5.75pt;font-family:Arial;color:Darkgray;">Latitude</span></td>
<td align="left" colspan="1" style="border-top:none; border-left:none;border-bottom:none; border-right:none;padding:.01in .01in .01in .01in;height:6.75pt" valign="top"><span style="font-size:5.75pt;font-family:Arial;color:Darkgray;">Longitude</span></td>
<tr>
<td align="left" colspan="3" style="font-size:10.0pt;font-weight:800;" valign="top">2188 FNL AND 984 FEL  ( SE NE )
<td align="left" colspan="1" style="font-size:10.0pt;font-weight:800;" valign="top">10 
<td align="left" colspan="2" style="font-size:10.0pt;font-weight:800;" valign="top">34 NORTH 72 WEST
<td align="left" colspan="1" style="font-size:10.0pt;font-weight:800;" valign="top">42.934003 
<td align="left" colspan="1" nowrap="" style="font-size:10.0pt;font-weight:800;" valign="top">-105.480115 

I am able to load the HTML with BeautifulSoup:

soup = BeautifulSoup(body, 'html.parser')
tr = soup.find_all('tr')

How can I get the values API = 49-009-41057, Company = CHESAPEAKE OPERATING LLC, Well Name = SFU 10-34-72 USAC TR 23H, and so on?

1 Answer:

Answer 0: (score: 1)

The BeautifulSoup approach:

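from bs4 import BeautifulSoup

# body is the HTML string from the question
soup = BeautifulSoup(body, 'html.parser')
trs = soup.find_all('tr')

# Walk every row and print the text of each cell it contains
for tr in trs:
    tds = tr.find_all('td')
    for td in tds:
        print(td.text)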
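
To turn the printed cells into named values such as API, Company and Well Name, one option is to pair each row of grey label cells with the row of bold value cells that follows it. The sketch below is only one way to do that, under a few assumptions: `SAMPLE` is a shortened stand-in for the question's HTML, `own_text` and `parse_well_record` are hypothetical helper names, and the keys "API" and "Company" for the first row are taken from the question because the excerpt shows no label row for it. Since the value cells omit their closing `</td>`, `html.parser` may nest them inside one another, so the helper reads only the text that sits directly in each cell.

```python
from bs4 import BeautifulSoup

# Shortened sample in the same shape as the question's HTML: value cells
# omit their closing </td>, label cells wrap their text in a <span>.
SAMPLE = """
<tr>
<td colspan="3" style="font-weight:800;">49-009-41057
<td colspan="4" style="font-weight:800;">CHESAPEAKE OPERATING LLC
<tr>
<td colspan="3"><span style="color:Darkgray;">Well Name</span></td>
<td colspan="4"><span style="color:Darkgray;">Field</span></td>
<tr>
<td colspan="3" style="font-weight:800;">SFU 10-34-72 USAC TR 23H
<td colspan="4" style="font-weight:800;">WC
"""


def own_text(td):
    """Return the text that belongs to this cell only, not to nested cells."""
    span = td.find("span", recursive=False)
    if span is not None:
        return span.get_text(strip=True)
    first = td.find(string=True, recursive=False)
    return first.strip() if first else ""


def parse_well_record(body):
    soup = BeautifulSoup(body, "html.parser")
    record = {}
    labels = None
    for tr in soup.find_all("tr"):
        # Keep only the cells whose nearest enclosing row is this one.
        cells = [td for td in tr.find_all("td") if td.find_parent("tr") is tr]
        texts = [own_text(td) for td in cells]
        if any(td.find("span", recursive=False) is not None for td in cells):
            labels = texts  # a label row: remember it for the next value row
        elif labels is not None:
            # Pair labels with values by position; skip blank labels.
            record.update({k: v for k, v in zip(labels, texts) if k})
            labels = None
        else:
            # Assumption: an unlabeled leading row holds the API number and company.
            record.update(zip(("API", "Company"), texts))
    return record


record = parse_well_record(SAMPLE)
print(record)
```

Pairing by position relies on each label row having the same number of cells as the value row after it, which holds for the rows shown in the question but may not hold for the whole page.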

Or you can try lxml with XPath:

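from lxml import html

# html.parse() expects a file name, URL or file object;
# body is a string, so use html.fromstring() instead
tree = html.fromstring(body)

# First cell of every row
td1 = tree.xpath("//tr/td[1]")
for td in td1:
    # td.text is None for the label cells, whose text sits inside a <span>
    if td.text:
        print(td.text.strip())
# repeat the same for other tds in the html body.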

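Output: the BeautifulSoup approach should print the cell text in document order. The lxml approach should print something like this:

'49-009-41057'

'SFU 10-34-72 USAC TR 23H'

'2188 FNL AND 984 FEL  ( SE NE )'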