My code so far: http://pastebin.com/CdUiXpdf
    import requests
    from bs4 import BeautifulSoup

    def web_crawler(max_pages):
        page = 1
        while page <= max_pages:
            url = "https://www.kupindo.com/Knjige/artikli/1_strana_" + str(page)
            source_code = requests.get(url)
            plain_text = source_code.text
            soup = BeautifulSoup(plain_text, "html.parser")
            print("PAGE: " + str(page))
            for link in soup.find_all("a", class_="item_link"):
                href = link.get("href")
                # title = link.string
                print(href)
                # print(title)
                extended_crawler(href)
            page += 1

    def extended_crawler(item_url):
        source_code = requests.get(item_url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        for view_counter in soup.find_all("span", id="BrojPregleda"):
            print("View Count: ", view_counter.text)

    web_crawler(1)
The output looks like this, for example:
PAGE: 1
https://www.kupindo.com/showcontent/2143/Beletristika/37875219_VUK-DRASKOVIC-Izabrana-dela-1-7-Srpska-rec
View Count:
So View Count is empty: even though the extended_crawler function looks up the span with the id BrojPregleda, nothing is displayed.
Answer 0 (score: 1)
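This happens because the span with the ID BrojPregleda is populated by an AJAX call, so it is still empty in the HTML that requests downloads. Either use Selenium to read the value, or follow these steps:
1) Get the item's ID from the URL.
2) POST to http://www.kupindo.com/inc/ajx/Predmet/ajxGetBrojPregleda.php with a single FormData key, IDPredmet, whose value is the ID from step 1.
3) Read the view count from the response.
Example: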
    def extended_crawler(item_url):
        source_code = requests.get(item_url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")  # not needed for the view count itself
        # The item ID is the text between the last "/" and the last "_" in the URL.
        item_id = item_url[item_url.rfind('/') + 1:item_url.rfind('_')]
        view_count = requests.post(
            'http://www.kupindo.com/inc/ajx/Predmet/ajxGetBrojPregleda.php',
            data={'IDPredmet': item_id},
        )
        print(view_count.text)
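
The slice in the example depends on the last underscore in the URL sitting directly after the ID, which would give a wrong value if the title part of a URL ever contained an underscore. As a minimal sketch, step 1 could be made a bit more defensive with a regular expression. The helper names `extract_item_id` and `build_view_count_payload` are hypothetical, and the assumption that the ID is always the leading run of digits in the last path segment is based only on the sample URL in the question.

```python
import re
from urllib.parse import urlparse

# Endpoint quoted in the answer; not verified here.
VIEW_COUNT_ENDPOINT = "http://www.kupindo.com/inc/ajx/Predmet/ajxGetBrojPregleda.php"


def extract_item_id(item_url):
    """Return the digits that start the last path segment, or None if absent."""
    last_segment = urlparse(item_url).path.rstrip("/").rsplit("/", 1)[-1]
    match = re.match(r"(\d+)_", last_segment)
    return match.group(1) if match else None


def build_view_count_payload(item_url):
    """Build the form data for the view-count POST, or None if no ID is found."""
    item_id = extract_item_id(item_url)
    return {"IDPredmet": item_id} if item_id is not None else None
```

The payload can then be sent with `requests.post(VIEW_COUNT_ENDPOINT, data=payload)`, skipping the request when the payload is `None`.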