soup.findAll not working for a table

Date: 2017-09-07 12:41:04

Tags: python web-scraping beautifulsoup

I am trying to parse this website: https://www.dibbs.bsm.dla.mil/RFQ/RfqRecs.aspx?category=issue&TypeSrch=dt&Value=09-07-2017

using the following code:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import ssl
context = ssl._create_unverified_context()
dibbsurl = 'https://www.dibbs.bsm.dla.mil/RFQ/RfqRecs.aspx?category=issue&TypeSrch=dt&Value=09-07-2017'
uClient = uReq(dibbsurl, context=context)
dibbshtml = uClient.read()
uClient.close()

#html parser
dibbssoup = soup(dibbshtml, "html.parser")

#grabs each rfq
containers = dibbssoup.findAll("tr",{"Class":"Bgwhite"})

For research purposes, I want to get the National Stock Number, the nomenclature, and the quantity from the table.

containers = dibbssoup.findAll("tr",{"Class":"Bgwhite"})

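I am trying to grab each row of the table, but containers does not seem to capture any of them. When I run len(containers) it shows 0. Why is the table not being captured, and how can I fix it?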

Update: here is sample HTML from the website

<tr class="BgWhite">
    <td headers="th0" valign="top">
        1
    </td>
    <td headers="th1" style="width: 125px;" valign="top">
        <a href="https://www.dibbs.bsm.dla.mil/RFQ/RFQNsn.aspx?value=8465015550093&amp;category=issue&amp;Scope=" title="go to NSN view">8465-01-555-0093</a>
    </td>
    <td headers="th2" valign="top">
        SNAP LINK, RAPPELLER
    </td>
    <td headers="th3" valign="top">
        None
    </td>
    <td headers="th4" style="width: 150px;" valign="top">
        <a href="https://dibbs2.bsm.dla.mil/Downloads/RFQ/8/SPE1C117T2608.PDF" title="RFQ document" target="DIBBSDocuments">SPE1C1-17-T-2608</a><br>&nbsp;&nbsp;<span style="font-size: 9px; color: #505050;">» <a href="https://www.dibbs.bsm.dla.mil/rfq/rfqrec.aspx?sn=SPE1C117T2608" title="Package View" class="SubMenuLink">Package View</a></span><a href="https://www.dibbs.bsm.dla.mil/RFQ/RFQQHlp.aspx?ht=fi"><img src="https://www.dibbs.bsm.dla.mil/app_themes/images/icons/iconFastPace.gif" alt="Fast Award Candidate.  Micro-purchase quotes may be awarded prior to the solicitation return date.  See Master Solicitation for Additional Info" width="14" height="11" hspace="0" border="0" align="middle"></a><br><img src="https://www.dibbs.bsm.dla.mil/app_themes/images/icons/iconEproc.gif" width="36" height="16" hspace="1" border="0" alt="DLA E-Procurement" style="border-width:0px;  vertical-align: bottom;">
    </td>
    <td headers="th5" valign="top">
        <span style="color:#000099">Open</span><br><a href="https://www.dibbs.bsm.dla.mil/RA/Quote/QuoteFrm.aspx?sn=SPE1C117T2608"><img src="https://www.dibbs.bsm.dla.mil/app_themes/images/buttons/btnQ.gif" width="18" height="18" border="0" alt="Click to submit Quote" hspace="1" align="bottom"></a><a href="https://www.dibbs.bsm.dla.mil/RA/Quote/QuoteFrm.aspx?sn=SPE1C117T2608"><span style="font-size: 9px;">uote</span></a>&nbsp;&nbsp;<img src="https://www.dibbs.bsm.dla.mil/app_themes/images/icons/iconSpace1010.gif" alt=" " width="18" height="16" hspace="0" border="0">
    </td>
    <td headers="th6" valign="top">
        0070631319<br>QTY: 400
    </td>
    <td headers="th7" valign="top">
        09-07-2017
    </td>
    <td headers="th8" valign="top">
        09-18-2017
    </td>
</tr>

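1 answer:

Answer 0: (score: 2)

I analyzed the website you are trying to scrape and found that it has a warning page, similar to terms and conditions, that you must agree to before you can view any content. To "agree", a form has to be submitted, so the page your code downloads most likely does not contain the table at all. The solution therefore retrieves page source in three steps: fetch the warning page, submit its form, then request the results page.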
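The form-collection step can be tried in isolation, without touching the site. The sketch below runs on a made-up form: the field names in SAMPLE_FORM and the helper name form_payload are assumptions for illustration, not the real warning page's markup. It uses the built-in "html.parser" so nothing beyond BeautifulSoup is needed, and it skips inputs that have no name attribute, since those would otherwise end up under a None key.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the warning page; the real form's fields may differ.
SAMPLE_FORM = """
<form method="post" action="dodwarning.aspx">
  <input type="hidden" name="__VIEWSTATE" value="abc123">
  <input type="hidden" name="__EVENTVALIDATION" value="xyz789">
  <input type="submit" name="butAgree" value="OK">
  <input type="button" value="no name here">
</form>
"""


def form_payload(html):
    """Collect name/value pairs from the named <input> elements of the first form."""
    soup = BeautifulSoup(html, "html.parser")
    form = soup.find("form")
    if form is None:
        # No form on the page: nothing to submit.
        return {}
    return {
        inp["name"]: inp.get("value", "")
        for inp in form.find_all("input")
        if inp.get("name")  # skip inputs without a name attribute
    }


payload = form_payload(SAMPLE_FORM)
print(payload)
```

The resulting dictionary is the kind of object that would be passed as data= when posting the form back with a requests session.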

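I used requests and html5lib in this example because they are easy to use. You can install them with pip.

The last part, parsing the table, is similar to what you already did. Note that the filter uses the lowercase attribute name "class" and the value "BgWhite" exactly as it is capitalized in the HTML.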

import requests
from bs4 import BeautifulSoup
import urllib3

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

request_headers = {
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'en-US,en;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
}

req = requests.Session()
warning_url = 'https://www.dibbs.bsm.dla.mil/dodwarning.aspx'

# get initial warning page
get_warning_page = req.get(warning_url, headers=request_headers, verify=False)
warning_soup = BeautifulSoup(get_warning_page.content, 'html5lib')

# parse forms needed to be submitted later (T&C of the site that you need to agree before proceeding)
payload = {}
for inp in warning_soup.find('form').find_all('input'):
    payload[inp.get('name')] = inp.get('value')

# submit the warning form (means you already agreed on the T&C)
submit_warning_form = req.post(warning_url, headers=request_headers, data=payload, verify=False)

# lastly, navigate to the main page that contains the table
main_page = req.post('https://www.dibbs.bsm.dla.mil/RFQ/RfqRecs.aspx?category=issue&TypeSrch=dt&Value=09-07-2017', headers=request_headers, verify=False)

# parsing of table
dibbssoup = BeautifulSoup(main_page.content, 'html5lib')
#grabs each rfq
containers = dibbssoup.find_all("tr", {"class": "BgWhite"})

print(containers)
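Once the rows are found, the three fields asked about in the question can be pulled out of each one. The following is a minimal sketch that runs on a row trimmed from the sample HTML in the question, so it needs no network access. The helper name parse_row and the reading of the "th6" cell (a number, a line break, then "QTY: n") are assumptions based on that single sample row; other rows on the live site may be laid out differently.

```python
from bs4 import BeautifulSoup

# One row trimmed from the sample HTML in the question.
SAMPLE_ROW = """
<table>
<tr class="BgWhite">
    <td headers="th1"><a href="#">8465-01-555-0093</a></td>
    <td headers="th2">
        SNAP LINK, RAPPELLER
    </td>
    <td headers="th6">
        0070631319<br>QTY: 400
    </td>
</tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_ROW, "html.parser")

# The parser lowercases attribute names and compares class values
# case-sensitively, so the filter from the question matches nothing.
print(len(soup.find_all("tr", {"Class": "Bgwhite"})))  # 0


def parse_row(row):
    """Return the NSN, nomenclature and quantity found in one result row."""
    nsn = row.find("td", attrs={"headers": "th1"}).get_text(strip=True)
    nomenclature = row.find("td", attrs={"headers": "th2"}).get_text(strip=True)
    # Assumed layout of the th6 cell: a number, a <br>, then "QTY: n".
    parts = list(row.find("td", attrs={"headers": "th6"}).stripped_strings)
    quantity = next(
        (int(p.split(":", 1)[1]) for p in parts if p.upper().startswith("QTY")),
        None,  # no quantity text found in this cell
    )
    return {"nsn": nsn, "nomenclature": nomenclature, "quantity": quantity}


rows = [parse_row(tr) for tr in soup.find_all("tr", class_="BgWhite")]
print(rows)
```

Selecting cells by their headers attribute rather than by position keeps the lookup working if columns are added or reordered, provided the site keeps the same header ids.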

If you have any questions or run into errors, let me know. If this solves your problem, please mark it as the answer. Thanks!