Beautiful Soup does not return all tags

Date: 2019-07-11 07:45:46

Tags: web-scraping beautifulsoup

I am new to web scraping and am using python3 and beautifulsoup4 to scrape information about some phones from att.com.

Here is my code, which extracts the outer div of each phone from the HTML (there are 49 phones in total).

from bs4 import BeautifulSoup
import requests
import csv

source = requests.get('https://www.att.com/buy/phones/').text
#soup = BeautifulSoup(source,'lxml')
#soup = BeautifulSoup(source,'html5lib')
soup  = BeautifulSoup(source,'html.parser')
phone_div=soup.findAll('div',class_='_1hOzu')
#phone_div=soup.findAll('div',class_='_2Ldwa')
#phone_div=soup.find('div',class_='_3kwdR')
#phone_div=soup.findAll('div',class_='_1BGB4')
print(phone_div[1].prettify())
print(phone_div[5].prettify())

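Below is the output for the first phone div (the first four phones look similar). It contains all the information about the phone, such as its name and price: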

<div class="_1hOzu">
 <div class="_14rcf _1NPjc false false" data-index="1" tabindex="0">
  <a class="_3-Yg9 _13w_Y" data-qa="DeviceTile-PDPlink-iPhone XS Max" href="/buy/phones/apple-iphone-xs-max-64gb-silver.html" tabindex="-1">
   <div class="_27UM0 false">
    <div class="_3C82I">
     <div class="_bOwfD">
      <span class="_2VSUp">
       Buy one, give one.
      </span>
     </div>
    </div>
   </div>
   <div class="_2pI5U">
    <div class="_3AUSX">
     <i class="_3cKi3" style="height:50px;width:50px">
     </i>
    </div>
    <div class="_VzvqU">
    </div>
   </div>
   <div>
    <div class="_1bjup">
     <div>
      <div class="_2Ldwa">
       APPLE
      </div>
      <div>
       <div class="_1BGB4">
        iPhone XS Max
       </div>
       <div class="_izQNb">
        placeholder
       </div>
      </div>
     </div>
    </div>
    <div class="_1NK_S">
     <div class="_1O0IX">
      <div class="_3AUSX">
       <i class="_3cKi3" style="height:50px;width:50px">
       </i>
      </div>
     </div>
    </div>
    <div>
     <div class="_3JaQ9 ">
      <div class="_1dPLs _3yvoJ _38PTM">
       <label class="_1ih28">
        <i class="_9V5dD _10JvD">
        </i>
        <span class="_1C-NR">
         Star Ratings
        </span>
        <input class="_ZI8n9" name="Customer Reviews" readonly="" type="radio" value="1"/>
       </label>
       <label class="_1ih28">
        <i class="_9V5dD _10JvD">
        </i>
        <span class="_1C-NR">
         Star Ratings
        </span>
        <input class="_ZI8n9" name="Customer Reviews" readonly="" type="radio" value="2"/>
       </label>
       <label class="_1ih28">
        <i class="_9V5dD _10JvD">
        </i>
        <span class="_1C-NR">
         Star Ratings
        </span>
        <input class="_ZI8n9" name="Customer Reviews" readonly="" type="radio" value="3"/>
       </label>
       <label class="_1ih28">
        <i class="_9V5dD _10JvD">
        </i>
        <span class="_1C-NR">
         Star Ratings
        </span>
        <input class="_ZI8n9" name="Customer Reviews" readonly="" type="radio" value="4"/>
       </label>
       <label class="_1ih28">
        <i class="_18XCu _10JvD">
        </i>
        <i class="_fLbUs _9V5dD _10JvD" style="width:58.95%">
        </i>
        <span class="_1C-NR">
         Star Ratings
        </span>
        <input class="_ZI8n9" name="Customer Reviews" readonly="" type="radio" value="5"/>
       </label>
      </div>
      <span>
       4.6
       <span class="_VCKql">
        |
       </span>
       531
      </span>
     </div>
     <p class="_2bs9E ">
      $36.67
      <span class="_31cDG">
       /mo.
      </span>
     </p>
    </div>
    <div class="_1YUjH">
     <div>
     </div>
     <div class="_3gbuG">
      <div>
       Req.’s 0% APR 30-mo. installment agmt, qual. credit and service.
      </div>
      <div class="_3gbuG">
       <button class="_1oGNe" data-index="1" data-qa="DeviceTilePLP-SeePriceDetails" tabindex="0">
        See
        <!-- -->
        price details.
       </button>
      </div>
     </div>
    </div>
    <div class="_3_rcU">
    </div>
   </div>
   <div class="_37Icd ">
   </div>
  </a>
 </div>
</div>

Output for the remaining phone divs:

<div class="_1hOzu">
 <div class="_14rcf _1NPjc false false" data-index="5" tabindex="0">
  <div class="_3AUSX">
   <i class="_3cKi3" style="height:50px;width:50px">
   </i>
  </div>
 </div>
</div>

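The nested inner tags of the remaining divs are not returned, so I cannot extract anything from them. I have read several answers about missing inner tags and tried different parsers as they suggested, but it did not help. Any idea where I am going wrong?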

2 Answers:

Answer 0 (score: 0)

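Because the page builds its content dynamically, requests does not return all the tags you see in Inspect Element. (Look at View Page Source instead: that is the response you actually receive.)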
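One way to confirm this is to count how many tiles in the downloaded HTML actually contain a product link. The sketch below is only an illustration: `SAMPLE_HTML` is a trimmed, hypothetical stand-in for the response (reusing class names from the question's output), and `split_tiles` is a helper name invented here, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed sample: one filled tile and one placeholder tile,
# modelled on the two outputs shown in the question.
SAMPLE_HTML = """
<div class="_1hOzu">
 <div class="_14rcf" data-index="1">
  <a href="/buy/phones/apple-iphone-xs-max-64gb-silver.html">
   <div class="_1BGB4">iPhone XS Max</div>
  </a>
 </div>
</div>
<div class="_1hOzu">
 <div class="_14rcf" data-index="5">
  <div class="_3AUSX"><i class="_3cKi3"></i></div>
 </div>
</div>
"""


def split_tiles(html, tile_class="_1hOzu"):
    """Separate tiles that contain a product link from empty placeholders."""
    soup = BeautifulSoup(html, "html.parser")
    filled, placeholders = [], []
    for tile in soup.find_all("div", class_=tile_class):
        # A tile counts as filled only if an <a> element exists inside it.
        if tile.find("a") is not None:
            filled.append(tile)
        else:
            placeholders.append(tile)
    return filled, placeholders


filled, placeholders = split_tiles(SAMPLE_HTML)
print(len(filled), "filled,", len(placeholders), "placeholder")
```

Running the same helper on `requests.get(...).text` should show whether most tiles are placeholders in the raw response, which would point to JavaScript rendering rather than to the parser.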

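To get this data, try using selenium instead of a plain requests call. It returns the dynamically rendered response, just as Inspect Element shows it.

Example: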

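from bs4 import BeautifulSoup
from selenium import webdriver

# Load the page in a real browser so that its JavaScript runs
driver = webdriver.Chrome()
driver.get('https://www.att.com/buy/phones/')
# Parse the rendered DOM rather than the raw HTTP response
content = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()  # close the browser once the HTML has been captured
phone_div = content.find_all('div', class_='_1hOzu')
print(phone_div[1].prettify())
print(phone_div[5].prettify())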
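Reading `page_source` immediately after `get()` can still capture placeholder tiles if the scripts have not finished. A possible refinement, sketched here under assumptions, is an explicit wait. The function name `fetch_rendered_html`, the selector `div._1hOzu a`, and the `min_tiles` and `timeout` defaults are all illustrative choices, not values confirmed for this site. If the page only fills tiles as they scroll into view, scrolling would be needed as well.

```python
def fetch_rendered_html(url, link_css="div._1hOzu a", min_tiles=10, timeout=20):
    """Load `url` in Chrome and return the page source once enough tiles are filled."""
    # Imported here so the module can be loaded on machines without selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Poll until at least `min_tiles` product links exist inside tiles;
        # WebDriverWait raises TimeoutException if that never happens.
        WebDriverWait(driver, timeout).until(
            lambda d: len(d.find_elements(By.CSS_SELECTOR, link_css)) >= min_tiles
        )
        return driver.page_source
    finally:
        # Always close the browser, even when the wait times out.
        driver.quit()
```

The returned string can be passed to `BeautifulSoup(html, 'html.parser')` exactly as in the example above, for instance after `html = fetch_rendered_html('https://www.att.com/buy/phones/')`.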

Answer 1 (score: 0)

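The data is loaded via JavaScript, but the content is already embedded in the page. With a bit of regex we can extract the JSON content (the whole payload is stored in the variable data):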

import re
import json
import requests

url = 'https://www.att.com/buy/phones/'
html_text = requests.get(url).text

data = json.loads(re.findall(r'__NEXT_DATA__ = (.*?});', html_text)[0])
print(json.dumps(data['props']['pageProps']['deviceList'], indent=4))

Prints:

[
    {
        "color": "Black",
        "manufacturerShortName": "apple",
        "paymentType": "postpaid",
        "deviceSubType": "pda",
        "iotDevice": false,
        "starRatings": 4.6092,
        "newArrival": false,
        "imageUrl": "https://www.att.com/catalog/en/skus/images/apple-iphone%20xr-black-100x160.jpg",
        "model": "iPhone XR",
        "brand": "Apple",
        "skuId": "sku9240254",
        "displayContentItems": [
            {
                "displayType": "ribbon",
                "contentSource": "cms",
                "marketingPriority": 1,
                "flowTypes": [
                    "NEW",
                    "UP",
                    "AL"
                ],
                "enable": true,
                "description": "Buy one, give one.",
                "customerTypes": [
                    "CRU"
                ],
                "contentType": "image"
            },
            {
                "displayType": "ribbon",
                "contentSource": "cms",
                "marketingPriority": 1,
                "flowTypes": [
                    "NEW",
                    "UP",
                    "AL"
                ],
                "enable": true,
                "description": "Buy one, give one.",
                "customerTypes": [
                    "CONSUMER",
                    "IRU"
                ],
                "contentType": "image"
            },

...and so on.
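Since the question imports csv, the device list could then be flattened into rows. The following is a minimal sketch under assumptions: `SAMPLE_HTML` is a hypothetical, heavily trimmed stand-in for the page source, the helper names `extract_devices` and `devices_to_csv` are invented for illustration, and the field list only uses keys visible in the output above. `re.DOTALL` is added because the sample spans several lines.

```python
import csv
import io
import json
import re

# Hypothetical, heavily trimmed stand-in for the real page source.
SAMPLE_HTML = """
<script>
__NEXT_DATA__ = {"props": {"pageProps": {"deviceList": [
  {"brand": "Apple", "model": "iPhone XR", "color": "Black",
   "starRatings": 4.6092, "skuId": "sku9240254", "iotDevice": false}
]}}};
</script>
"""

# Keys taken from the printed output; any other keys are ignored.
FIELDS = ["brand", "model", "color", "starRatings", "skuId"]


def extract_devices(html_text):
    """Pull the deviceList array out of the embedded __NEXT_DATA__ JSON."""
    match = re.search(r'__NEXT_DATA__ = (.*?});', html_text, flags=re.DOTALL)
    if match is None:
        raise ValueError("__NEXT_DATA__ not found in page")
    data = json.loads(match.group(1))
    return data["props"]["pageProps"]["deviceList"]


def devices_to_csv(devices, fields=FIELDS):
    """Render the selected fields of each device as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(devices)
    return buffer.getvalue()


print(devices_to_csv(extract_devices(SAMPLE_HTML)))
```

Raising an explicit error when the marker is missing makes it easier to notice if the site changes how it embeds the data.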