Unable to scrape this website. How can I scrape data from it?

Date: 2019-04-05 10:38:28

Tags: python web-scraping beautifulsoup screen-scraping

I am unable to scrape data from this website.

I tried the same code on other websites, and it works there...

from bs4 import BeautifulSoup
from urllib.request import urlopen

response = urlopen("https://www.daraz.com.np/catalog/?spm=a2a0e.searchlistcategory.search.2.3eac4b8amQJ0zd&q=samsung%20m20&_keyori=ss&from=suggest_normal&sugg=samsung%20m20_1_1")

html = response.read()

parsed_html = BeautifulSoup(html, "html.parser")

containers = parsed_html.find_all("div", {"class" : "c2prKC"})

print(len(containers))

3 answers:

Answer 0 (score: 1)

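It looks like the content is rendered into the page by JavaScript after the initial load, so the raw HTML returned by urlopen does not contain those elements. You can use Selenium to render the page and Beautiful Soup to pick out the elements.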

from bs4 import BeautifulSoup
from selenium import webdriver
import time
driver = webdriver.Chrome()
driver.get("https://www.daraz.com.np/catalog/?spm=a2a0e.searchlistcategory.search.2.3eac4b8amQJ0zd&q=samsung%20m20&_keyori=ss&from=suggest_normal&sugg=samsung%20m20_1_1")
time.sleep(5)

html = driver.page_source

parsed_html = BeautifulSoup(html, "html.parser")

containers = parsed_html.find_all("div", {"class" : "c2prKC"})

print(len(containers))

Answer 1 (score: 1)

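The information you want is inside a script tag. You can either loop over the script tags or use a regular expression to get the right string, then (after a small modification) parse it as JSON.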

import requests
import json
from bs4 import BeautifulSoup as bs
import pandas as pd

headers = {
    'User-Agent' : 'Mozilla/5.0'
}
res = requests.get('https://www.daraz.com.np/catalog/?spm=a2a0e.searchlistcategory.search.2.3eac4b8amQJ0zd&q=samsung%20m20&_keyori=ss&from=suggest_normal&sugg=samsung%20m20_1_1', headers = headers)
soup = bs(res.content, 'lxml')
for script in soup.select('script'):
    if 'window.pageData=' in script.text:
        script = script.text.replace('window.pageData=','')
        break
items = json.loads(script)['mods']['listItems']
results = []

for item in items:
    #print(item)
    #extract other info you want
    row = [item['name'], item['priceShow'], item['productUrl'], item['ratingScore']]
    results.append(row)

df = pd.DataFrame(results, columns = ['Name', 'Price', 'ProductUrl', 'Rating'])

print(df.head())

Regex version:

import re  # needed for re.compile below
import requests
import json
from bs4 import BeautifulSoup as bs
import pandas as pd

headers = {
    'User-Agent' : 'Mozilla/5.0'
}
res = requests.get('https://www.daraz.com.np/catalog/?spm=a2a0e.searchlistcategory.search.2.3eac4b8amQJ0zd&q=samsung%20m20&_keyori=ss&from=suggest_normal&sugg=samsung%20m20_1_1', headers = headers)
soup = bs(res.content, 'lxml')
r = re.compile(r'window.pageData=(.*)')
data = soup.find('script', text=r).text
script = r.findall(data)[0]
items = json.loads(script)['mods']['listItems']
results = []

for item in items:
    row = [item['name'], item['priceShow'], item['productUrl'], item['ratingScore']]
    results.append(row)

df = pd.DataFrame(results, columns = ['Name', 'Price', 'ProductUrl', 'Rating'])

print(df.head())
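
For anyone who wants to try the script-tag approach without hitting the site, here is a minimal offline sketch. It assumes the page embeds its data as a `window.pageData={...}` assignment with a `mods.listItems` array, as the code above implies. `SAMPLE_HTML`, `PAGE_DATA`, `extract_list_items` and the item values are hypothetical names and placeholder data made up for illustration; the real payload is much larger and may be laid out differently.

```python
import json
import re

from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page (placeholder values only).
SAMPLE_HTML = """
<html><head>
<script>var unrelated = 1;</script>
<script>window.pageData={"mods": {"listItems": [{"name": "Example Phone A", "priceShow": "Rs. 100"}, {"name": "Example Phone B", "priceShow": "Rs. 200"}]}}</script>
</head><body></body></html>
"""

# Capture everything from the first "{" to the last "}" after the assignment.
PAGE_DATA = re.compile(r"window\.pageData\s*=\s*(\{.*\})", re.DOTALL)


def extract_list_items(html):
    """Return the mods.listItems list from a window.pageData script, or []."""
    soup = BeautifulSoup(html, "html.parser")
    for script in soup.find_all("script"):
        # .string is None for empty script tags, so fall back to "".
        match = PAGE_DATA.search(script.string or "")
        if match:
            data = json.loads(match.group(1))
            return data.get("mods", {}).get("listItems", [])
    # No matching script tag: return an empty list instead of raising.
    return []


items = extract_list_items(SAMPLE_HTML)
print([item["name"] for item in items])
```

Returning an empty list when no script matches avoids passing a Tag object to `json.loads`, which is what the loop version above would do if the marker string were ever missing.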

Answer 2 (score: 1)

  setAttribute(Qt::WA_TranslucentBackground);
    QtWin::enableBlurBehindWindow(this);