BeautifulSoup解析器无法访问html元素

时间:2017-08-04 06:20:50

标签: python python-3.x parsing web-scraping beautifulsoup

我正试图抓住所有列表的hrefs。我对beautifulsoup相当新,之前做过一些刮痧,但之前做过一些刮痧。但我不能为我的生活提取。见下面我的代码。运行此脚本时,容器的长度为零。

我也尝试选择价格(soup.findAll(" span",{" class":" amount"}),但它没有&#39 ;反思。欢迎任何建议:)

import urllib.request
import urllib.parse
from bs4 import BeautifulSoup

url = 'https://www.takealot.com/computers/laptops-10130'   
headers = {}
headers['User-Agent'] = "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17"
req = urllib.request.Request(url, headers=headers)
resp = urllib.request.urlopen(req)

respData = str(resp.read())

soup = BeautifulSoup(respData, 'html.parser')

container = soup.find_all("div", {"class": "p-data left"})

1 个答案:

答案 0 :(得分:0)

该页面使用JavaScript呈现。有几种方法可以渲染和刮擦它。

我可以用Selenium刮掉它。 首先安装Selenium:

sudo pip3 install selenium

然后得到一个司机https://sites.google.com/a/chromium.org/chromedriver/downloads你可以使用无头版Chrome和#34; Chrome Canary"如果您使用的是Windows或Mac。

from bs4 import BeautifulSoup
from selenium import webdriver

browser = webdriver.Chrome()
url = ('https://www.takealot.com/computers/laptops-10130')
browser.get(url)
respData = browser.page_source
browser.quit()
soup = BeautifulSoup(respData, 'html.parser')
containers = soup.find_all("div", {"class": "p-data left"})
for container in containers:
    print(container.text)
    print(container.find("span", {"class": "amount"}).text)

或者使用PyQt5

from PyQt5.QtGui import *
from PyQt5.QtCore import *
from PyQt5.QtWebKit import *
from PyQt5.QtWebKitWidgets import QWebPage
from PyQt5.QtWidgets import QApplication
from bs4 import BeautifulSoup
import sys


class Render(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()

url = 'https://www.takealot.com/computers/laptops-10130'
r = Render(url)
respData = r.frame.toHtml()
soup = BeautifulSoup(respData, 'html.parser')
containers = soup.find_all("div", {"class": "p-data left"})
for container in containers:
    print (container.text)
    print (container.find("span", {"class":"amount"}).text)

或者使用dryscrape

from bs4 import BeautifulSoup
import dryscrape

url = 'https://www.takealot.com/computers/laptops-10130'
session = dryscrape.Session()
session.visit(url)
respData = session.body()
soup = BeautifulSoup(respData, 'html.parser')
containers = soup.find_all("div", {"class": "p-data left"})
for container in containers:
    print(container.text)
    print(container.find("span", {"class": "amount"}).text)

所有情况下的输出:

Dell Inspiron 3162 Intel Celeron 11.6" Wifi Notebook (Various Colours)11.6 Inch Display; Wifi Only (Red; White & Blue Available)R 3,999R 4,999i20% OffeB 39,990Discovery Miles 39,990On Credit: R 372 / monthi
3,999
HP 250 G5 Celeron N3060 Notebook - Dark ash silverNBHPW4M70EAR 4,499R 4,999ieB 44,990Discovery Miles 44,990On Credit: R 419 / monthiIn StockShippingThis item is in stock in our CPT warehouse and can be shipped from there. You can also collect it yourself from our warehouse during the week or over weekends.CPT | ShippingThis item is in stock in our JHB warehouse and can be shipped from there. No collection facilities available, sorry!JHBWhen do I get it?
4,499
Asus Vivobook ...

然而,当您使用您的网址进行测试时,我发现每次都无法重现结果,有时候我没有在#34;容器"页面呈现后。