Web scraping with BeautifulSoup and urllib

Date: 2017-08-30 09:46:16

Tags: python web-scraping beautifulsoup

I'm using Python 3.6 and I can scrape text with BeautifulSoup. I'm practicing on the Walmart website and trying to scrape text from a Walmart product page. Here is my code.

from bs4 import BeautifulSoup
from urllib.request import urlopen
main_page=urlopen('http://www.walmart.com/ip/Sceptre-32-Class-HD-720P-LED-TV-X322BV-SR/55427159')
soup = BeautifulSoup(main_page,"lxml")
title=soup.select_one("h1.prod-ProductTitle.no-margin.heading-a").get_text()
price=soup.select_one("span.Price-group").get_text()
highLights=soup.select_one("div.ProductPage-short-description-body").get_text()
description=soup.select_one("div.about-desc").get_text()
print(title,"\n",highLights,"\n",description,"\n",price)

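In the code above I extract the product name, price, highlights, and description, but I can't get the description (the "About This Item" section). Instead of the description I get something else.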

Please help me solve this.

2 answers:

Answer 0: (score: 0)

There are two divs with class="about-desc". select_one returns only the first match, but you need the second one. Here is the adjustment:

description=soup.select("div.about-desc")[1].get_text()
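The difference between select_one and select can be checked offline on a small snippet. This is a minimal sketch: the two-div markup below is a hypothetical stand-in for the Walmart page, and the length check is an added safeguard, since indexing [1] raises IndexError when the page has fewer than two matches.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the product page: two divs share a class.
html = """
<div class="about-desc">Short highlights block</div>
<div class="about-desc">Full About This Item description</div>
"""

soup = BeautifulSoup(html, "html.parser")

# select_one returns only the first matching element (or None if nothing matches).
first = soup.select_one("div.about-desc").get_text()

# select returns a list of every match, so the second div is at index 1.
matches = soup.select("div.about-desc")
second = matches[1].get_text() if len(matches) > 1 else None

print(first)
print(second)
```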

Update: the site actually blocks urllib's default user agent, so you should send a browser-like User-Agent header instead.

from bs4 import BeautifulSoup
import urllib.request
user_agent = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0'}
req = urllib.request.Request(url="http://www.walmart.com/ip/Sceptre-32-Class-HD-720P-LED-TV-X322BV-SR/55427159", headers=user_agent)
main_page = urllib.request.urlopen(req)
soup = BeautifulSoup(main_page,"lxml")
title=soup.select_one("h1.prod-ProductTitle.no-margin.heading-a").get_text()
price=soup.select_one("span.Price-group").get_text()
highLights=soup.select_one("div.ProductPage-short-description-body").get_text()
description=soup.select("div.about-desc")[1].get_text()
print(title,"\n",highLights,"\n",description,"\n",price)
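To see what the header change does without hitting the network, the request object can be inspected before it is sent. A minimal sketch: build_request is a hypothetical helper name, and nothing here calls urlopen. It relies on urllib storing header names in capitalized form ("User-agent") and on the default opener identifying itself as Python-urllib.

```python
import urllib.request

URL = "http://www.walmart.com/ip/Sceptre-32-Class-HD-720P-LED-TV-X322BV-SR/55427159"
USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0"


def build_request(url, user_agent=USER_AGENT):
    """Hypothetical helper: build a Request carrying a browser-like User-Agent."""
    return urllib.request.Request(url=url, headers={"User-Agent": user_agent})


# The User-Agent urllib sends when no header is supplied, e.g. "Python-urllib/3.x".
default_agent = dict(urllib.request.build_opener().addheaders)["User-agent"]

req = build_request(URL)
# Request normalizes header names with str.capitalize(), hence "User-agent".
sent_agent = req.get_header("User-agent")

print(default_agent)
print(sent_agent)
```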

Answer 1: (score: -1)

There are two options:

  • Parse the JSON embedded in the page using json + requests + beautifulsoup
  • Use requests-html or selenium

Running this in the Chrome console on the product page returns the product data embedded in the page:

test = JSON.parse(document.querySelector("#item").textContent).item.product.buyBox.products[0]
console.log(test)

The same JSON can be read in Python:

import json, requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

response = requests.get('https://www.walmart.com/ip/Wilson-The-Duke-Official-NFL-Game-Football/5192758', headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

# https://stackoverflow.com/a/63151716/15164646
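# The element with id="item" holds the page data as a JSON string
fus = soup.select_one('#item').string
ro = json.loads(fus)
# Pretty-print the parsed object, not the raw string
dah = json.dumps(ro, indent=2, ensure_ascii=False)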
print(dah)

Partial output:

{
  "item": {
    "ads": {
      "config": {
        "lazy-homepage-expose1": "800",
        "lazy-search-expose1": "800",
        "lazy-browse-expose1": "800",
        "lazy-category-expose1": "200",
        "no-category-marquee2": true,
        "no-deals-skyline1": true,
        "no-homepage-twocolumnhp": true,
        "lazy-item-expose1": "800",
        "lazy-item-marquee2": "1200",
        "lazy-item-rightrail2": "1200",
        "adblockImgSource": "//i5.walmartimages.com/dfw/63fd9f59-8bc2/8fe200ec-4c4d-4ab0-89e5-0662af6f506d/v1/ads.png",
        "displayAdsS2sScript": "//i5.wal.co/dfw/63fd9f59-a579/be6f8cae-248d-40e2-8cad-32d04468ea59/v29/usgm-s2s-midas.js",
        "displayAdsS2sScriptWithPoly": "//i5.wal.co/dfw/63fd9f59-5870/c8ceb4ee-1e68-40ec-a38e-ca0623f075a0/v29/usgm-s2s-midas-poly.js",
        "safeframeUrl": "https://i5.wal.co/dfw/63fd9f59-d6ba/07b8ea82-184c-4ea3-8ac0-5dc1981e40c8/v50/safeframe.html",
        "displayAds": true,
        "exts2s": true,
        "isTwoDayDeliveryTextEnabled": true,
        "ads2s": true,
        "bypassproxy": false,
        "adblockDetectionEnabled": false,
        "marqueeSafeframe": true,
        "exposeSafeframe": true,
        "skylineSafeframe": true,
        "leftrailSafeframe": true,
        "rightrailSafeframe": true,
        "cloud": "scus-prod-a29"
      }
}
# much more down below...
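Once the JSON is parsed, the path used in the Chrome console snippet (item.product.buyBox.products[0]) can be followed in Python. The sketch below is an assumption-laden illustration: dig is a hypothetical helper, and the sample string is a heavily trimmed stand-in whose productName and priceMap fields are made up for the example. Returning a default instead of raising helps when the page structure changes.

```python
import json

# Hypothetical, heavily trimmed stand-in for the JSON found in the #item element.
raw = (
    '{"item": {"product": {"buyBox": {"products": '
    '[{"productName": "Example TV", "priceMap": {"price": 129.0}}]}}}}'
)


def dig(data, *path, default=None):
    """Follow a sequence of dict keys / list indexes; return default if any step fails."""
    current = data
    for key in path:
        try:
            current = current[key]
        except (KeyError, IndexError, TypeError):
            return default
    return current


data = json.loads(raw)

# Same path as the console snippet: item.product.buyBox.products[0]
product = dig(data, "item", "product", "buyBox", "products", 0)
# A path that does not exist yields the default instead of an exception.
missing = dig(data, "item", "product", "reviews", 0)

print(product)
print(missing)
```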

The following code uses requests-html. One way to get the "About This Item" description is with XPath.

Code (tested on multiple listings):

from requests_html import HTMLSession

session = HTMLSession()
response = session.get('https://www.walmart.com/ip/Sceptre-32-Class-720P-HD-LED-TV-X322BV-SR/55427159')

# first=True means that it will grab the first occurrence and skip everything else
title = response.html.find('.prod-productTitle-buyBox', first=True).text
price = response.html.find('.prod-PriceHero', first=True).text.split('$')[1]
description = response.html.xpath('//*[@id="about-product-section"]/div/div[1]/div[1]/div[3]/p[1]/text()', first=True)
key_features = response.html.xpath('//*[@id="about-product-section"]/div/div[1]/div[1]/div[3]/ul[1]', first=True).text

print(title)
print(price)
print(description)
print(key_features)

Output:

Sceptre 32" Class 720P HD LED TV X322BV-SR
129.00
Escape into a world of splendid color and clarity with the X322BV-SR. 
Clear QAM tuner is included to make cable connection as easy as possible, without an antenna. 
HDMI input delivers the unbeatable combination of high-definition video and clear audio. 
A USB port comes in handy when you want to flip through all of your stored pictures and tune into your stored music. 
More possibilities: with HDMI, VGA, Component and Composite inputs, we offer a convenient balance between the old and new to suit your diverse preferences. 
With the ability to connect your computer, laptop, monitor, or TV to all your favorite variety of input options, VGA inputs deliver superb analog video.

Screen Size (Diag.) 31.5"
Backlight Type LED
Resolution 720p
Effective Refresh Rate 60Hz
Smart Functionality no
Aspect Ratio 16 9
Dynamic Contrast Ratio 5,000 1
Viewable Angle (H/V) 178 degrees/178 degrees
Number of Colors 16.7 M
OSD Language English, Spanish, French
Speakers/Power Output 10W x 2
Surround Sound Mode
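The XPath expressions above select by position (p[1], ul[1]) under the element with id="about-product-section". To try that idea without a network request, here is a rough sketch using the standard library's xml.etree.ElementTree instead of requests-html. ElementTree supports only a subset of XPath (no text(), so .text is read from the element), and the markup is a simplified, hypothetical stand-in for the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the product page markup.
SAMPLE = """
<html><body>
  <div id="about-product-section">
    <div>
      <p>First paragraph of the description.</p>
      <p>Second paragraph.</p>
      <ul><li>Feature A</li><li>Feature B</li></ul>
    </div>
  </div>
</body></html>
"""

root = ET.fromstring(SAMPLE)

# p[1] is the first <p> child of the div (XPath positions start at 1).
first_p = root.find(".//*[@id='about-product-section']/div/p[1]")
description = first_p.text if first_p is not None else None

# Every <li> under the first <ul> of the same section.
key_features = [
    li.text for li in root.findall(".//*[@id='about-product-section']/div/ul[1]/li")
]

print(description)
print(key_features)
```

Positional paths like these break as soon as the site reorders its markup, so checking for None before reading .text is worth keeping.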

Alternatively, you can use the third-party Walmart Product API from SerpApi. It's a paid API with a free trial of 5,000 searches. A completely free trial is currently under development.

Code to integrate:

from serpapi import GoogleSearch

params = {
  "api_key": "YOUR_API_KEY",
  "engine": "walmart_product",
  "product_id": "55427159"
}

search = GoogleSearch(params)
results = search.get_dict()

title = results['product_result']['title']
price = results['product_result']['price_map']['price']
key_features = results['product_result']['detailed_description_html']
print(title)
print(price)
print(key_features)

Output:

Sceptre 32" Class 720P HD LED TV X322BV-SR
129
<b>Sceptre 32" Class 720p HD LED TV X322BV-SR</b><br /><b>Key Features </b></p><ul><li>Screen Size (Diag.) 31.5"</li><li>Backlight Type LED</li><li>Resolution 720p</li><li>Effective Refresh Rate 60Hz</li><li>Smart Functionality no</li><li>Aspect Ratio 16 9</li><li>Dynamic Contrast Ratio 5,000 1</li><li>Viewable Angle (H/V) 178 degrees/178 degrees</li><li>Number of Colors 16.7 M</li><li>OSD Language English, Spanish, French</li><li>Speakers/Power Output 10W x 2</li><li>Surround Sound Mode</li></ul><b>Connectivity </b><ul><li>Component/Composite Video 1</li><li>HDMI 2</li><li>Headphone 1</li><li>Optical Digital Audio 1</li><li>RCA Audio L+R 1</li><li>RF (Coaxial) 1</li><li>USB 2.0 1</li><li>Assembled Product Dims 28.78 x 18.39 x 7.95 Inches<br /></li></ul><b>What's In The Box </b><ul><li>Remote Control</li></ul><b>Wall-mountable </b><ul><li>Mount Pattern 100mm x 100mm</li><li>Screw Size M4</li><li>Screw Length 6mm</li></ul><b>Support and Warranty </b><ul><li>1-year limited labor and parts</li></ul><br /><br />Flat Screen TV stand sold separately. See all <b> TV stands.</b><br /><br />Flat Screen TV mount sold separately. See all <b> TV mounts. </b><br /><br />TV audio equipment sold separately. See all <b> Home Theater Systems. </b><br /><br />HDMI cables sold separately. See all <b> HDMI Cables.</b><br /><br />Accessories sold separately. See all <b> Accessories.<br /></b><br /><br /><b>ENERGY STAR<sup></sup></b><br />Products that are ENERGY STAR-qualified prevent greenhouse gas emissions by meeting strict energy efficiency guidelines set by the U.S. Environmental Protection Agency and the U.S. Department of Energy. The ENERGY STAR name and marks are registered marks owned by the U.S. government, as part of their energy efficiency and environmental activities.
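Since detailed_description_html comes back as raw HTML, the list items may need to be pulled out as plain text. The following is one possible sketch using only the standard library's html.parser. ListItemCollector is a hypothetical class name, and the sample string is a shortened excerpt of the output above.

```python
from html.parser import HTMLParser


class ListItemCollector(HTMLParser):
    """Hypothetical helper: collect the text of every <li> element."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._in_li = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._in_li = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "li" and self._in_li:
            # Collapse runs of whitespace and skip empty items.
            text = " ".join("".join(self._buf).split())
            if text:
                self.items.append(text)
            self._in_li = False

    def handle_data(self, data):
        if self._in_li:
            self._buf.append(data)


# Shortened excerpt of the detailed_description_html value shown above.
sample = (
    '<b>Key Features </b><ul><li>Screen Size (Diag.) 31.5"</li>'
    "<li>Backlight Type LED</li><li>Resolution 720p</li></ul>"
)

collector = ListItemCollector()
collector.feed(sample)
collector.close()
features = collector.items

print(features)
```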

Disclaimer: I work for SerpApi.