How do I extract text from HTML using BeautifulSoup?

Asked: 2018-04-12 14:34:53

Tags: html web-scraping beautifulsoup webdriver

I want to extract some words from this HTML, such as

<tr class="BgSilver" style="border-color:Gray;border-width:1px;border-style:Solid;">
  <td align="right" style="width:75px;" valign="top">
    <span id="ctl00_cph1_grdAwardSearch_ctl26_lblRowNum" style="display:inline-block;width:50px;">124</span>
  </td>
   <td align="left" valign="top">
    <span id="ctl00_cph1_grdAwardSearch_ctl26_lblAwardBasicNumber" style="display:inline-block;width:150px;"><a href="https://dibbs2.bsm.dla.mil/Downloads/Awards/27FEB17/SP450017D0007.PDF" target="DIBBSDocuments" title="Link To Award/Basic Document"><img alt="PDF Document" border="0" height="16" hspace="2" src="https://www.dibbs.bsm.dla.mil/app_themes/images/icons/IconPdf.gif" width="16"/></a><a href="https://dibbs2.bsm.dla.mil/Downloads/Awards/27FEB17/SP450017D0007.PDF" target="DIBBSDocuments" title="Link To Award/Basic Document">SP450017D0007</a></span>
   </td>
    <td align="left" valign="top">
    <span id="ctl00_cph1_grdAwardSearch_ctl26_lblDeliveryOrder" style="display:inline-block;width:175px;"><img alt="-spacer-" border="0" height="16" hspace="1" src="https://www.dibbs.bsm.dla.mil/app_themes/images/common/space.gif" width="16"/>0243 <br/><img alt="-spacer-" border="0" height="16" hspace="1" src="https://www.dibbs.bsm.dla.mil/app_themes/images/common/space.gif" width="16"/><span style="font-size: 9px;">» <a href="https://www.dibbs.bsm.dla.mil/Awards/AwdRec.aspx?contract=SP450017D0007&amp;dlv=0243&amp;cnt=" target="DIBBS" title="Delivery Order Package View">Delivery Order Package View</a></span></span>
   </td>
    <td align="right" valign="top">
    <span id="ctl00_cph1_grdAwardSearch_ctl26_lblDeliveryOrderCounter" style="display:inline-block;width:50px;"> </span>
    </td>
    <td align="left" valign="top">
    <span id="ctl00_cph1_grdAwardSearch_ctl26_lblLastModPostingDate" style="display:inline-block;width:75px;">04-12-2018</span>
   </td>

</tr>

This is the part of my code that produces the HTML above:

import requests
from bs4 import BeautifulSoup
from selenium.webdriver.common.keys import Keys
from selenium import webdriver
import urllib3
import numpy as np
import re
import colorama

from datetime import datetime, timedelta


containers = pagesoup.find_all('tr', {'class': ['BgWhite', 'BgSilver']})

for batch in containers:
    for item in range(53)[2:]:
        try:
            # batch is the HTML above
            print(batch)

            uid = "ctl00_cph1_grdAwardSearch_ctl" + str(item) + "_lblAwardBasicNumber"
            print("uid id ", uid)
            awardid = batch.find_all("span", text=re.compile("_lblAwardBasicNumber"))

            print("award id is")
            print(awardid)

        except Exception as e:
            print(colorama.Fore.MAGENTA + "award error.." + str(e))
            # print(container1)
            continue

except Exception as e:
    raise e

print(batch) outputs the HTML shown above, and I want to get the number SP450017D0007 from this part of it:

<span id="ctl00_cph1_grdAwardSearch_ctl26_lblAwardBasicNumber" style="display:inline-block;width:150px;"><a href="https://dibbs2.bsm.dla.mil/Downloads/Awards/27FEB17/SP450017D0007.PDF" target="DIBBSDocuments" title="Link To Award/Basic Document"><img alt="PDF Document" border="0" height="16" hspace="2" src="https://www.dibbs.bsm.dla.mil/app_themes/images/icons/IconPdf.gif" width="16"/></a><a href="https://dibbs2.bsm.dla.mil/Downloads/Awards/27FEB17/SP450017D0007.PDF" target="DIBBSDocuments" title="Link To Award/Basic Document">SP450017D0007</a></span>

However, awardid comes back empty. How can I extract SP450017D0007?

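1 Answer:

Answer 0: (score: 0)

Solution:

To get the text SP450017D0007, I used pagesoup.find('a', text=True).text, which returns the first <a> tag that directly contains a text string (the image-only link is skipped).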
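For context, the find_all call in the question most likely returned nothing because the text filter is compared against a tag's text content, while _lblAwardBasicNumber appears only in the id attribute. Below is a minimal sketch of an id-based lookup. The trimmed html string and the names row, by_text, span and award_id are illustrative assumptions, not the asker's real page or code. The string= argument used here is the newer name for text= (Beautiful Soup 4.4.0 and later).

```python
import re

from bs4 import BeautifulSoup

# Trimmed, illustrative copy of the award cell from the question.
html = '''
<tr class="BgSilver">
  <td><span id="ctl00_cph1_grdAwardSearch_ctl26_lblAwardBasicNumber"><a href="https://dibbs2.bsm.dla.mil/Downloads/Awards/27FEB17/SP450017D0007.PDF"><img alt="PDF Document"/></a><a href="https://dibbs2.bsm.dla.mil/Downloads/Awards/27FEB17/SP450017D0007.PDF">SP450017D0007</a></span></td>
</tr>
'''

row = BeautifulSoup(html, "html.parser").find("tr")

# string= (formerly text=) is matched against the tag's text,
# so a fragment of the id attribute never matches here.
by_text = row.find_all("span", string=re.compile("_lblAwardBasicNumber"))

# Matching on the id attribute finds the span whatever the ctlNN part is.
span = row.find("span", id=re.compile(r"_lblAwardBasicNumber$"))
award_id = span.get_text(strip=True) if span else None

print(by_text)
print(award_id)
```

Searching from row rather than from the whole document keeps the lookup scoped to one table row, which matters once the page contains more than one result.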

Note:

Your code above contains the following extra lines, which should be removed:

except Exception as e:
    raise e

Code:

from bs4 import BeautifulSoup
import colorama  # used only in the error message below

data = '''
<tr class="BgSilver" style="border-color:Gray;border-width:1px;border-style:Solid;">
  <td align="right" style="width:75px;" valign="top">
    <span id="ctl00_cph1_grdAwardSearch_ctl26_lblRowNum" style="display:inline-block;width:50px;">124</span>
  </td>
   <td align="left" valign="top">
    <span id="ctl00_cph1_grdAwardSearch_ctl26_lblAwardBasicNumber" style="display:inline-block;width:150px;"><a href="https://dibbs2.bsm.dla.mil/Downloads/Awards/27FEB17/SP450017D0007.PDF" target="DIBBSDocuments" title="Link To Award/Basic Document"><img alt="PDF Document" border="0" height="16" hspace="2" src="https://www.dibbs.bsm.dla.mil/app_themes/images/icons/IconPdf.gif" width="16"/></a><a href="https://dibbs2.bsm.dla.mil/Downloads/Awards/27FEB17/SP450017D0007.PDF" target="DIBBSDocuments" title="Link To Award/Basic Document">SP450017D0007</a></span>
   </td>
    <td align="left" valign="top">
    <span id="ctl00_cph1_grdAwardSearch_ctl26_lblDeliveryOrder" style="display:inline-block;width:175px;"><img alt="-spacer-" border="0" height="16" hspace="1" src="https://www.dibbs.bsm.dla.mil/app_themes/images/common/space.gif" width="16"/>0243 <br/><img alt="-spacer-" border="0" height="16" hspace="1" src="https://www.dibbs.bsm.dla.mil/app_themes/images/common/space.gif" width="16"/><span style="font-size: 9px;">» <a href="https://www.dibbs.bsm.dla.mil/Awards/AwdRec.aspx?contract=SP450017D0007&amp;dlv=0243&amp;cnt=" target="DIBBS" title="Delivery Order Package View">Delivery Order Package View</a></span></span>
   </td>
    <td align="right" valign="top">
    <span id="ctl00_cph1_grdAwardSearch_ctl26_lblDeliveryOrderCounter" style="display:inline-block;width:50px;"> </span>
    </td>
    <td align="left" valign="top">
    <span id="ctl00_cph1_grdAwardSearch_ctl26_lblLastModPostingDate" style="display:inline-block;width:75px;">04-12-2018</span>
   </td>

</tr>
'''

pagesoup = BeautifulSoup(data, 'html.parser')
# Result rows use one of these two CSS classes.
containers = pagesoup.find_all('tr', {'class': ['BgWhite', 'BgSilver']})

for batch in containers:
    for item in range(53)[2:]:
        try:
            print(batch)
            uid = "ctl00_cph1_grdAwardSearch_ctl" + str(item) + "_lblAwardBasicNumber"
            print("uid id ", uid)
            # First <a> that directly contains text; the image-only link has none.
            # Note: this searches the whole document, not just the current row.
            awardid = pagesoup.find('a', text=True).text
            print("award id is")
            print(awardid)
            # The span id is hard-coded for row ctl26 of the sample data.
            dateid = pagesoup.find('span', id='ctl00_cph1_grdAwardSearch_ctl26_lblLastModPostingDate').text
            print("date id is")
            print(dateid)

        except Exception as e:
            print(colorama.Fore.MAGENTA + "award error.." + str(e))
            # print(container1)
            continue

Output (first iteration of the inner loop; later iterations differ only in the uid line):

<tr class="BgSilver" style="border-color:Gray;border-width:1px;border-style:Solid;">
<td align="right" style="width:75px;" valign="top">
<span id="ctl00_cph1_grdAwardSearch_ctl26_lblRowNum" style="display:inline-block;width:50px;">124</span>
</td>
<td align="left" valign="top">
<span id="ctl00_cph1_grdAwardSearch_ctl26_lblAwardBasicNumber" style="display:inline-block;width:150px;"><a href="https://dibbs2.bsm.dla.mil/Downloads/Awards/27FEB17/SP450017D0007.PDF" target="DIBBSDocuments" title="Link To Award/Basic Document"><img alt="PDF Document" border="0" height="16" hspace="2" src="https://www.dibbs.bsm.dla.mil/app_themes/images/icons/IconPdf.gif" width="16"/></a><a href="https://dibbs2.bsm.dla.mil/Downloads/Awards/27FEB17/SP450017D0007.PDF" target="DIBBSDocuments" title="Link To Award/Basic Document">SP450017D0007</a></span>
</td>
<td align="left" valign="top">
<span id="ctl00_cph1_grdAwardSearch_ctl26_lblDeliveryOrder" style="display:inline-block;width:175px;"><img alt="-spacer-" border="0" height="16" hspace="1" src="https://www.dibbs.bsm.dla.mil/app_themes/images/common/space.gif" width="16"/>0243 <br/><img alt="-spacer-" border="0" height="16" hspace="1" src="https://www.dibbs.bsm.dla.mil/app_themes/images/common/space.gif" width="16"/><span style="font-size: 9px;">» <a href="https://www.dibbs.bsm.dla.mil/Awards/AwdRec.aspx?contract=SP450017D0007&amp;dlv=0243&amp;cnt=" target="DIBBS" title="Delivery Order Package View">Delivery Order Package View</a></span></span>
</td>
<td align="right" valign="top">
<span id="ctl00_cph1_grdAwardSearch_ctl26_lblDeliveryOrderCounter" style="display:inline-block;width:50px;"> </span>
</td>
<td align="left" valign="top">
<span id="ctl00_cph1_grdAwardSearch_ctl26_lblLastModPostingDate" style="display:inline-block;width:75px;">04-12-2018</span>
</td>
</tr>
uid id  ctl00_cph1_grdAwardSearch_ctl2_lblAwardBasicNumber
award id is
SP450017D0007
date id is
04-12-2018
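
The code above looks up the date by a hard-coded ctl26 id and searches the whole document on every pass. As a possible alternative, the sketch below scopes each lookup to its own row and matches ids by their suffix with a CSS attribute selector, so no row index is needed. The function name extract_rows is hypothetical, the first row is a trimmed copy of the sample, and the second row (EXAMPLE0001, 01-01-2000) is made-up placeholder data added only to show that rows are kept separate.

```python
from bs4 import BeautifulSoup

# Row 1 is trimmed from the question; row 2 is placeholder data for illustration.
html = '''
<table>
<tr class="BgSilver">
  <td><span id="ctl00_cph1_grdAwardSearch_ctl26_lblAwardBasicNumber"><a href="#"><img alt="PDF Document"/></a><a href="#">SP450017D0007</a></span></td>
  <td><span id="ctl00_cph1_grdAwardSearch_ctl26_lblLastModPostingDate">04-12-2018</span></td>
</tr>
<tr class="BgWhite">
  <td><span id="ctl00_cph1_grdAwardSearch_ctl27_lblAwardBasicNumber"><a href="#"><img alt="PDF Document"/></a><a href="#">EXAMPLE0001</a></span></td>
  <td><span id="ctl00_cph1_grdAwardSearch_ctl27_lblLastModPostingDate">01-01-2000</span></td>
</tr>
</table>
'''


def extract_rows(markup):
    """Return a list of (award_id, posting_date) tuples, one per result row."""
    soup = BeautifulSoup(markup, "html.parser")
    results = []
    for row in soup.select("tr.BgWhite, tr.BgSilver"):
        # [id$="..."] matches ids that end with the given suffix.
        award = row.select_one('span[id$="_lblAwardBasicNumber"]')
        date = row.select_one('span[id$="_lblLastModPostingDate"]')
        results.append((
            award.get_text(strip=True) if award else None,
            date.get_text(strip=True) if date else None,
        ))
    return results


rows = extract_rows(html)
for award_id, posting_date in rows:
    print(award_id, posting_date)
```

A missing span yields None for that field instead of raising, so one incomplete row does not stop the rest from being processed.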