Scraping a value from an HTML web page

Date: 2017-07-31 11:22:47

Tags: python html web-scraping beautifulsoup

I wrote a Python program with BeautifulSoup that is supposed to find one specific value on a site, but the program does not seem to find it.

import bs4
from urllib.request import urlopen as ureq
from bs4 import BeautifulSoup as soup
my_url = 'http://www.calcalist.co.il/stocks/home/0,7340,L-4135-22212222,00.html?quote=%D7%93%D7%95%D7%9C%D7%A8'
uclient = ureq(my_url)
page_html = uclient.read()
uclient.close()
page_soup = soup(page_html, "html.parser")
value = page_soup.find("td",{"class":"RightBlack"})
print(value)

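The value I am trying to find is the exchange rate of the US dollar against the Israeli currency, but for some reason the line that should retrieve it:

value = page_soup.find("td",{"class":"RightBlack"})

does not find it.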

1 answer:

Answer 0: (score: 2)

1. First option: what you can do with BeautifulSoup

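Note that the element you are trying to get is inside an iframe, which means its content comes from a separate request, different from the one you made. You can write code that iterates over all the iframes and prints the price as soon as iframe_soup.find("td",{"class":"RightBlack"}) finds it.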
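To make the iframe point concrete, here is a small offline sketch. The two HTML strings, the src path and BASE_URL are hypothetical stand-ins, not the real calcalist markup; they only illustrate that the outer document carries the iframe tag while the td lives in the document the iframe points to.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical stand-ins for the two responses (not the real site markup).
OUTER_HTML = (
    '<html><body>'
    '<iframe id="StockQuoteIFrame" src="/stocks/quote.html"></iframe>'
    '</body></html>'
)
IFRAME_HTML = '<table><tr><td class="RightBlack">3.5630</td></tr></table>'
BASE_URL = "http://www.calcalist.co.il/stocks/home/page.html"  # hypothetical

outer = BeautifulSoup(OUTER_HTML, "html.parser")

# The outer page contains only the <iframe> tag, so the search comes back empty.
print(outer.find("td", {"class": "RightBlack"}))  # None

# Resolve every iframe src against the page URL to get the URLs to request next.
iframe_urls = [urljoin(BASE_URL, tag["src"])
               for tag in outer.find_all("iframe", src=True)]
print(iframe_urls)

# The cell is found only once the iframe's own document is parsed.
inner = BeautifulSoup(IFRAME_HTML, "html.parser")
print(inner.find("td", {"class": "RightBlack"}).text)
```

On the real page the inner document would of course have to be downloaded from each resolved URL rather than read from a string.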

I suggest using a try/except statement, because it is easy to hit a bad URL when doing this:

from urllib.parse import urljoin
from urllib.request import urlopen as ureq
from bs4 import BeautifulSoup as soup

my_url = 'http://www.calcalist.co.il/stocks/home/0,7340,L-4135-22212222,00.html?quote=%D7%93%D7%95%D7%9C%D7%A8'
uclient = ureq(my_url)
page_html = uclient.read()
uclient.close()
page_soup = soup(page_html, "html.parser")

# Each iframe is a separate document, so request every one and search it.
iframesList = page_soup.find_all('iframe')
price = None
for i, iframe in enumerate(iframesList, 1):
    print(i, ' out of ', len(iframesList), '...')
    try:
        # urljoin resolves a src such as "/path" against the page URL
        uclient = ureq(urljoin(my_url, iframe.attrs['src']))
        iframe_soup = soup(uclient.read(), "html.parser")
        uclient.close()
        price = iframe_soup.find("td",{"class":"RightBlack"})
        if price:
            print(price)
            break
    except Exception as exc:
        # a missing src or a failed request should not stop the loop
        print("something went wrong:", exc)

Running the code gives this output:

1  out of  8 ...
2  out of  8 ...
3  out of  8 ...
4  out of  8 ...
5  out of  8 ...
<td class="RightBlack">3.5630</td>

So now we have what we wanted:

>>> price
<td class="RightBlack">3.5630</td>
>>> price.text
'3.5630'
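Since price.text is a string, a conversion step is needed before doing any arithmetic with the rate. The following is a minimal sketch of one way to do that; parse_rate is a hypothetical helper name, not part of BeautifulSoup, and Decimal is used here only to keep the digits exactly as the page printed them.

```python
from decimal import Decimal, InvalidOperation

def parse_rate(text):
    """Return the scraped text as a Decimal, or None if it is not a number."""
    try:
        return Decimal(text.strip())
    except InvalidOperation:
        return None

rate = parse_rate("3.5630")   # the same string that price.text returned above
print(rate)                   # 3.5630
print(rate * 100)             # 356.3000, i.e. 100 dollars at that rate
```

Returning None for unparseable text lets the caller decide what to do if the site ever shows a placeholder instead of a number.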

2. Second option: using Selenium

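This is a suggestion: to make the request and have the JavaScript processed as well, you can use Selenium with a browser driver. I am using ChromeDriver, but you can also use PhantomJS for headless browsing. Inspecting the frame element, we see that its ID is "StockQuoteIFrame"; we use .switch_to.frame to enter it and can then easily find price:

from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'http://www.calcalist.co.il/stocks/home/0,7340,L-4135-22212222,00.html?quote=%D7%93%D7%95%D7%9C%D7%A8'

browser = webdriver.Chrome()
browser.get(url)

# Selenium 4 syntax; older releases used switch_to_frame and find_element_by_id
browser.switch_to.frame(browser.find_element(By.ID, "StockQuoteIFrame"))
price = browser.find_element(By.CLASS_NAME, "RightBlack").text
browser.quit()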

Of course, the output is the same as with the first option:

>>> price
'3.5630'