Python (Beautiful Soup) returns "None" for existing HTML when crawling

Time: 2018-11-04 14:49:31

Tags: python-3.x selenium-webdriver beautifulsoup web-crawler ssl-certificate

I am just trying to get the HTML of the search bar of the https://www.daraz.com.pk website. I wrote some code and tried it on "https://www.amazon.com", "https://www.alibaba.com", "https://www.goto.com.pk", etc., where it works fine, but it does not work on https://www.daraz.com.pk:

    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    import ssl
    import requests

    # disable SSL certificate verification
    ssl._create_default_https_context = ssl._create_unverified_context

    html = urlopen("https://www.daraz.com.pk")
    bsObj = BeautifulSoup(html, features="lxml")
    nameList = bsObj.find("input", {"type": "search"})
    print(nameList)

It returns None, when instead it should return:

    <input type="search" id="q" name="q" placeholder="Search in Daraz" class="search-box__input--O34g" tabindex="1" value="" data-spm-anchor-id="a2a0e.home.search.i0.35e34937eWCmbI">
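To rule out a parsing problem, one thing worth checking is whether that tag is present in the raw server response at all, or whether it is injected later by JavaScript. A minimal diagnostic along these lines (same urlopen call as above):

    from urllib.request import urlopen
    import ssl

    ssl._create_default_https_context = ssl._create_unverified_context

    # is the search input even in the raw server response?
    raw = urlopen("https://www.daraz.com.pk").read().decode("utf-8", errors="replace")
    print(len(raw))                # size of the document the server actually sent
    print('type="search"' in raw)  # False would mean the input is injected client-side by JavaScript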

I have also tried similar code on Amazon, Alibaba, and a few other sites, and it successfully returned their HTML:

    html = urlopen("https://www.amazon.com")
    bsObj = BeautifulSoup(html, features="lxml")
    nameList = bsObj.find("input", {"type": "text"})
    print(nameList)

I also tried it this way:

    bsObj = BeautifulSoup(requests.get("https://www.daraz.com.pk").content,
                          "html.parser")

    nameList = bsObj.find("input", {"type": "search"})
    print(nameList)
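Some sites also serve different markup when the client does not look like a browser, so a variant that sends a browser-like User-Agent header might behave differently. This is only a guess, and the header string below is an arbitrary example:

    import requests
    from bs4 import BeautifulSoup

    # assumption: the default requests User-Agent might be treated differently by the server
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}  # example value
    resp = requests.get("https://www.daraz.com.pk", headers=headers)
    bsObj = BeautifulSoup(resp.content, "html.parser")
    print(bsObj.find("input", {"type": "search"}))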

I also used Selenium, like this:

    from selenium import webdriver
    from bs4 import BeautifulSoup
    import time

    driver = webdriver.Firefox()
    driver.get("https://www.daraz.com.pk")

    time.sleep(2)
    content = driver.page_source.encode('utf-8').strip()
    soup = BeautifulSoup(content, "html.parser")
    time.sleep(2)
    officials = soup.find("input", {"type": "search"})
    print(str(officials))

But it failed.
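If the search box is injected by JavaScript after the initial page load, the fixed time.sleep(2) may simply be too short. An explicit wait should be more reliable; a sketch of what I mean, using Selenium's standard WebDriverWait (the 10-second timeout is an arbitrary choice):

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Firefox()
    driver.get("https://www.daraz.com.pk")

    # block until the search input shows up in the DOM (or raise TimeoutException after 10 s)
    search_box = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, 'input[type="search"]'))
    )
    print(search_box.get_attribute("outerHTML"))
    driver.quit()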

0 Answers:

No answers yet.