Cannot display the content between span tags

Time: 2017-02-25 21:04:00

Tags: python beautifulsoup

My code so far: http://pastebin.com/CdUiXpdf

import requests
from bs4 import BeautifulSoup


def web_crawler(max_pages):
    page = 1
    while page <= max_pages:
        url = "https://www.kupindo.com/Knjige/artikli/1_strana_" + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        print("PAGE: " + str(page))
        for link in soup.find_all("a", class_="item_link"):
            href = link.get("href")
            # title = link.string
            print(href)
            # print(title)
            extended_crawler(href)
        page += 1


def extended_crawler(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "html.parser")
    for view_counter in soup.find_all("span", id="BrojPregleda"):
        print("View Count: ", view_counter.text)


web_crawler(1)

The output is, for example:

PAGE: 1
https://www.kupindo.com/showcontent/2143/Beletristika/37875219_VUK-DRASKOVIC-Izabrana-dela-1-7-Srpska-rec
View Count:  

So the View Count is empty: even though the extended_crawler function looks up the span with id BrojPregleda, nothing is displayed.

1 answer:

Answer 0: (score: 1)

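This is because the span with ID BrojPregleda is populated by an AJAX call, so its text is not in the HTML that requests downloads. Either use Selenium to get the value, or follow these steps:

1) Get the product's ID from the URL

2) POST to http://www.kupindo.com/inc/ajx/Predmet/ajxGetBrojPregleda.php with a single FormData key, IDPredmet, whose value is the ID from step 1)

3) Read the view count from the response

Example: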

def extended_crawler(item_url):
    # The item ID sits between the last "/" and the last "_" of the URL.
    item_id = item_url[item_url.rfind('/') + 1:item_url.rfind('_')]
    # The span is filled in by an AJAX call, so query that endpoint directly
    # instead of parsing the item page.
    view_count = requests.post(
        'http://www.kupindo.com/inc/ajx/Predmet/ajxGetBrojPregleda.php',
        data={'IDPredmet': item_id})
    print(view_count.text)
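
The slicing in the example relies on the URL containing exactly one underscore after the last slash. As a minimal sketch, assuming item URLs always look like the one in the question (a numeric ID, then "_", then the title), the ID can instead be taken from the leading digits of the last path segment. The helper name extract_item_id is hypothetical and is not part of the original answer; it uses a regular expression rather than the answer's rfind slicing.

```python
import re
from urllib.parse import urlparse


def extract_item_id(item_url):
    """Return the numeric ID that starts the last path segment, or None.

    Assumes URLs shaped like .../<digits>_<title>, as in the question.
    """
    # Take everything after the final "/" of the URL path.
    last_segment = urlparse(item_url).path.rsplit("/", 1)[-1]
    # The ID is the run of digits immediately followed by an underscore.
    match = re.match(r"(\d+)_", last_segment)
    return match.group(1) if match else None


url = ("https://www.kupindo.com/showcontent/2143/Beletristika/"
       "37875219_VUK-DRASKOVIC-Izabrana-dela-1-7-Srpska-rec")
print(extract_item_id(url))  # -> 37875219
```

The returned string can be passed as the IDPredmet value in the requests.post call above. When the helper returns None, the URL does not match the assumed pattern and the POST is probably best skipped.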