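I'm new to web scraping and am using Python 3 and BeautifulSoup 4 to scrape information about some phones from att.com.
Here is my code, which extracts the outer div of each phone from the HTML (there are 49 phones in total).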
from bs4 import BeautifulSoup
import requests
import csv
source = requests.get('https://www.att.com/buy/phones/').text
#soup = BeautifulSoup(source,'lxml')
#soup = BeautifulSoup(source,'html5lib')
soup = BeautifulSoup(source,'html.parser')
phone_div=soup.findAll('div',class_='_1hOzu')
#phone_div=soup.findAll('div',class_='_2Ldwa')
#phone_div=soup.find('div',class_='_3kwdR')
#phone_div=soup.findAll('div',class_='_1BGB4')
print(phone_div[1].prettify())
print(phone_div[5].prettify())
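Below is the output for the first phone div (the first four phones look similar). It contains all the information about the phone, such as its name and price: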
<div class="_1hOzu">
<div class="_14rcf _1NPjc false false" data-index="1" tabindex="0">
<a class="_3-Yg9 _13w_Y" data-qa="DeviceTile-PDPlink-iPhone XS Max" href="/buy/phones/apple-iphone-xs-max-64gb-silver.html" tabindex="-1">
<div class="_27UM0 false">
<div class="_3C82I">
<div class="_bOwfD">
<span class="_2VSUp">
Buy one, give one.
</span>
</div>
</div>
</div>
<div class="_2pI5U">
<div class="_3AUSX">
<i class="_3cKi3" style="height:50px;width:50px">
</i>
</div>
<div class="_VzvqU">
</div>
</div>
<div>
<div class="_1bjup">
<div>
<div class="_2Ldwa">
APPLE
</div>
<div>
<div class="_1BGB4">
iPhone XS Max
</div>
<div class="_izQNb">
placeholder
</div>
</div>
</div>
</div>
<div class="_1NK_S">
<div class="_1O0IX">
<div class="_3AUSX">
<i class="_3cKi3" style="height:50px;width:50px">
</i>
</div>
</div>
</div>
<div>
<div class="_3JaQ9 ">
<div class="_1dPLs _3yvoJ _38PTM">
<label class="_1ih28">
<i class="_9V5dD _10JvD">
</i>
<span class="_1C-NR">
Star Ratings
</span>
<input class="_ZI8n9" name="Customer Reviews" readonly="" type="radio" value="1"/>
</label>
<label class="_1ih28">
<i class="_9V5dD _10JvD">
</i>
<span class="_1C-NR">
Star Ratings
</span>
<input class="_ZI8n9" name="Customer Reviews" readonly="" type="radio" value="2"/>
</label>
<label class="_1ih28">
<i class="_9V5dD _10JvD">
</i>
<span class="_1C-NR">
Star Ratings
</span>
<input class="_ZI8n9" name="Customer Reviews" readonly="" type="radio" value="3"/>
</label>
<label class="_1ih28">
<i class="_9V5dD _10JvD">
</i>
<span class="_1C-NR">
Star Ratings
</span>
<input class="_ZI8n9" name="Customer Reviews" readonly="" type="radio" value="4"/>
</label>
<label class="_1ih28">
<i class="_18XCu _10JvD">
</i>
<i class="_fLbUs _9V5dD _10JvD" style="width:58.95%">
</i>
<span class="_1C-NR">
Star Ratings
</span>
<input class="_ZI8n9" name="Customer Reviews" readonly="" type="radio" value="5"/>
</label>
</div>
<span>
4.6
<span class="_VCKql">
|
</span>
531
</span>
</div>
<p class="_2bs9E ">
$36.67
<span class="_31cDG">
/mo.
</span>
</p>
</div>
<div class="_1YUjH">
<div>
</div>
<div class="_3gbuG">
<div>
Req.’s 0% APR 30-mo. installment agmt, qual. credit and service.
</div>
<div class="_3gbuG">
<button class="_1oGNe" data-index="1" data-qa="DeviceTilePLP-SeePriceDetails" tabindex="0">
See
<!-- -->
price details.
</button>
</div>
</div>
</div>
<div class="_3_rcU">
</div>
</div>
<div class="_37Icd ">
</div>
</a>
</div>
</div>
Output for the remaining phone divs:
<div class="_1hOzu">
<div class="_14rcf _1NPjc false false" data-index="5" tabindex="0">
<div class="_3AUSX">
<i class="_3cKi3" style="height:50px;width:50px">
</i>
</div>
</div>
</div>
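The nested inner tags of the remaining divs are not retrieved, so I can't extract anything from them. I have read several answers about missing inner tags and tried different parsers as they suggest, but that didn't help. Any idea where I'm going wrong?
Answer 0 (score: 0)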
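Because the page content is loaded dynamically, requests does not return all the tags you see in Inspect Element. (View the page source; that is the response you actually get.)
To get this data, try using selenium instead of plain requests. It returns the dynamically rendered page, just as Inspect Element shows it.
Example: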
from bs4 import BeautifulSoup
from selenium import webdriver
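# Start a real browser so the page's JavaScript runs
driver = webdriver.Chrome()
driver.get('https://www.att.com/buy/phones/')
# Parse the rendered DOM rather than the raw response body
content = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()  # close the browser once the HTML has been captured
phone_div = content.find_all('div', class_='_1hOzu')
print(phone_div[1].prettify())
print(phone_div[5].prettify())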
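To check which tiles a given response actually filled in, a rough sketch like the one below may help. It uses only the standard library's html.parser, so it runs without a browser. TileCounter and the inline sample HTML are hypothetical and made up for illustration; the _1hOzu class is simply the one from the question and may change whenever the site is rebuilt.

```python
from html.parser import HTMLParser


class TileCounter(HTMLParser):
    """Count tile <div>s and how many of them contain a product link."""

    def __init__(self, tile_class="_1hOzu"):
        super().__init__()
        self.tile_class = tile_class
        self.depth = 0           # <div> nesting depth inside the current tile (0 = outside)
        self.has_link = False    # whether the current tile contains an <a> tag
        self.populated = 0
        self.placeholders = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div":
            if self.depth == 0 and self.tile_class in classes:
                self.depth = 1
                self.has_link = False
            elif self.depth > 0:
                self.depth += 1
        elif tag == "a" and self.depth > 0:
            self.has_link = True

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:  # the tile's own <div> just closed
                if self.has_link:
                    self.populated += 1
                else:
                    self.placeholders += 1


# Hypothetical sample: one populated tile and one empty placeholder tile.
sample = """
<div class="_1hOzu"><div class="_14rcf" data-index="1">
  <a href="/buy/phones/apple-iphone-xs-max-64gb-silver.html"><div class="_1BGB4">iPhone XS Max</div></a>
</div></div>
<div class="_1hOzu"><div class="_14rcf" data-index="5">
  <div class="_3AUSX"><i class="_3cKi3"></i></div>
</div></div>
"""

counter = TileCounter()
counter.feed(sample)
print(counter.populated, counter.placeholders)
```

Feeding it requests.get(...).text and then driver.page_source should show whether the placeholder count drops once the page is rendered in a browser.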
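Answer 1 (score: 0)
The data is loaded via JavaScript, but the content is embedded in the page itself. With a regular expression we can extract the JSON content (all of it is stored in the variable data):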
import re
import json
import requests
url = 'https://www.att.com/buy/phones/'
html_text = requests.get(url).text
data = json.loads(re.findall(r'__NEXT_DATA__ = (.*?});', html_text)[0])
print(json.dumps(data['props']['pageProps']['deviceList'], indent=4))
Prints:
[
{
"color": "Black",
"manufacturerShortName": "apple",
"paymentType": "postpaid",
"deviceSubType": "pda",
"iotDevice": false,
"starRatings": 4.6092,
"newArrival": false,
"imageUrl": "https://www.att.com/catalog/en/skus/images/apple-iphone%20xr-black-100x160.jpg",
"model": "iPhone XR",
"brand": "Apple",
"skuId": "sku9240254",
"displayContentItems": [
{
"displayType": "ribbon",
"contentSource": "cms",
"marketingPriority": 1,
"flowTypes": [
"NEW",
"UP",
"AL"
],
"enable": true,
"description": "Buy one, give one.",
"customerTypes": [
"CRU"
],
"contentType": "image"
},
{
"displayType": "ribbon",
"contentSource": "cms",
"marketingPriority": 1,
"flowTypes": [
"NEW",
"UP",
"AL"
],
"enable": true,
"description": "Buy one, give one.",
"customerTypes": [
"CONSUMER",
"IRU"
],
"contentType": "image"
},
...and so on.
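If the goal is a CSV file (the question already imports csv), the extracted JSON could be flattened roughly as follows. This is only a sketch under assumptions: extract_next_data, devices_to_csv and sample_html are hypothetical names, the inline HTML is a trimmed stand-in for the real page, and the field names are taken from the output printed above. It also swaps the regular expression for json.JSONDecoder.raw_decode, which is a different technique from the one used in the answer.

```python
import csv
import io
import json

MARKER = "__NEXT_DATA__ = "


def extract_next_data(html_text):
    """Decode the JSON object that follows the `__NEXT_DATA__ = ` marker."""
    start = html_text.index(MARKER) + len(MARKER)
    # raw_decode parses one complete JSON value and ignores whatever follows,
    # so no pattern has to guess where the object ends.
    data, _end = json.JSONDecoder().raw_decode(html_text, start)
    return data


def devices_to_csv(devices, fields=("brand", "model", "color", "starRatings", "skuId")):
    """Return CSV text with one row per device, keeping only `fields`."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fields, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(devices)
    return buffer.getvalue()


# Hypothetical stand-in for requests.get(url).text, trimmed to one device.
sample_html = (
    '<script>__NEXT_DATA__ = {"props": {"pageProps": {"deviceList": ['
    '{"color": "Black", "model": "iPhone XR", "brand": "Apple", '
    '"starRatings": 4.6092, "skuId": "sku9240254", "newArrival": false}'
    ']}}};</script>'
)

data = extract_next_data(sample_html)
csv_text = devices_to_csv(data["props"]["pageProps"]["deviceList"])
print(csv_text)
```

To write a file instead of a string, the same DictWriter can be pointed at open('phones.csv', 'w', newline=''); the file name here is just an example.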