bs4 python找不到文本

时间:2019-03-10 21:22:59

标签: python web-scraping beautifulsoup

我有一个通过漂亮的汤抓取的html文档。 html的摘录在此问题的底部。我在用美丽的汤和硒。

有人告诉我,我每小时只能提取这么多数据,而当我让此页面等待一会儿(一个好小时)时。

这就是我试图提取数据的方式:

def get_page_data(self):
    opts = Options()
    opts.headless = True
    assert opts.headless  # Operating in headless mode
    browser_detail = Firefox(options=opts)
    url = self.base_url.format(str(self.tracking_id))
    print(url)
    browser_detail.get(url)
    self.page_data = bs4(browser_detail.page_source, 'html.parser')
    Error_Check = 1 if len(self.page_data.findAll(text='Error Report Number')) > 0 else 0
    Error_Check = 2 if len(self.page_data.findAll(text='exceeded the maximum number of sessions per hour allowed')) > 0 else Error_Check
    print(self.page_data.findAll(text='waiting an hour and trying your query again')). ##<<--- The Problem is this line.
    print(self.page_data)
    return Error_Check

问题是这一行:

print(self.page_data.findAll(text='waiting an hour and trying your query again')). ##<<--- The Problem is this line.

代码无法在页面中找到该行。我想念什么?谢谢

<html><head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<link href="/CMPL/styles/ogm_style.css;jsessionid=rw9pc8-bncrIy_4KSZmJ8BxN2Z2hnKVwcr79Vho4-99gxTPrxNbo!-68716939" rel="stylesheet" type="text/css"/>
<body>
<!-- Content Area -->
<table style="width:100%; margin:auto;">
<tbody><tr valign="top">
<td class="ContentArea" style="width:100%;">
<span id="messageArea">
<!-- /tiles/messages.jsp BEGIN -->
<ul>
</ul><b>
</b><table style="width:100%; margin:auto; white-space: pre-wrap; text-align: left;">
<tbody><tr><td align="left"><b><li><font color="red"></font></li></b></td>
<td align="left"><font color="red">You have exceeded the maximum number of sessions per hour allowed for the public queries. You may still access the public</font></td>
</tr>
<tr><td><font color="red"><li style="list-style: none;"></li></font></td>
<td align="left"><font color="red">queries by waiting an hour and trying your query again. The RRC public queries are provided to facilitate online research and are not intended to be accessed by automated tools or scripts. For questions or concerns please contact the RRC HelpDesk at helpdesk@rrc.state.tx.us or 512-463-7229</font></td>
</tr>
</tbody></table>
<p>....more html...</p>
</body></html>

2 个答案:

答案 0 :(得分:2)

您可以使用以下CSS选择器

tr:last-child:not([valign])

from bs4 import BeautifulSoup as bs
html = '''yourHTML'''    
soup = bs(html, 'lxml')   
item = soup.select_one('tr:last-child:not([valign])')
print(item.text)

如果这返回了多个项目,则可以循环搜索包含感兴趣字符串的项目的列表过滤。您可以限制为td的选择器,并执行类似的操作。

items = soup.select('tr:last-child:not([valign])')
for item in items:
    if 'queries by waiting an hour' in item.text:
        print(item.text)

BeautifulSoup 4.7.1

答案 1 :(得分:1)

我不确定这不是您要寻找的东西,但是如果您是:

WITH CTE_1 AS (SELECT *, (ROW_NUMBER() OVER(ORDER BY [Location],     
               [Site] ) - 1) / 4 as grpno FROM #xmpltbl), 
     CTE_2 AS (SELECT * , ROW_NUMBER() OVER(PARTITION BY grpno ORDER BY grpno) rn 
               FROM CTE_1),
     CTE_3 AS (SELECT *, case when rn = 2 Then 1 else 0 end as Product, case when rn = 3 Then 1 else 0 end as Supplies
               FROM CTE_2)
SELECT DISTINCT [Location], [Site], Year([Period]) as [Period], 
                FIRST_VALUE(StringValue) OVER(PARTITION BY grpno ORDER BY rn) as [Customer] ,
                FIRST_VALUE(StringValue) OVER(PARTITION BY grpno ORDER BY Product DESC) as [Product] ,
                FIRST_VALUE(StringValue) OVER(PARTITION BY grpno ORDER BY Supplies DESC) as [Supplies] ,
                MAX([NumericValue]) OVER(PARTITION BY grpno) as [Total] 
from CTE_3

输出:

FIRST_VALUE()