The scraper is skipping the content of the first page

Date: 2017-06-01 07:16:10

Tags: python-3.x web-scraping web-crawler

I've created a scraper that parses some content from a website.

First, it scrapes the links to the categories from the left-side bar.

Second, it collects all the links spread across the pagination that lead to the profile pages.

Finally, going to each profile page, it scrapes the name, phone and website.

So far so good. The only problem I see with this crawler is that it always starts scraping from the second page, skipping the first page. I suppose there might be a way to work around this. Here is the complete code I am trying:

import requests
from lxml import html

url="https://www.houzz.com/professionals/"

def category_links(mainurl):
    req=requests.Session()
    response = req.get(mainurl).text
    tree = html.fromstring(response)
    for titles in tree.xpath("//a[@class='sidebar-item-label']/@href"):
        next_pagelink(titles)   # links to the category from left-sided bar


def next_pagelink(process_links):
    req=requests.Session()
    response = req.get(process_links).text
    tree = html.fromstring(response)
    for link in tree.xpath("//ul[@class='pagination']//a[@class='pageNumber']/@href"):
        profile_pagelink(link)      # the whole links spread through pagination connected to the profile page


def profile_pagelink(procured_links):
    req=requests.Session()
    response = req.get(procured_links).text
    tree = html.fromstring(response)
    for titles in tree.xpath("//div[@class='name-info']"):
        links = titles.xpath(".//a[@class='pro-title']/@href")[0]
        target_pagelink(links)         # profile page of each link


def target_pagelink(main_links):
    req=requests.Session()
    response = req.get(main_links).text
    tree = html.fromstring(response)

    def if_exist(titles,xpath):
        info=titles.xpath(xpath)
        if info:
            return info[0]
        return ""

    for titles in tree.xpath("//div[@class='container']"):
        name = if_exist(titles,".//a[@class='profile-full-name']/text()")
        phone = if_exist(titles,".//a[contains(concat(' ', @class, ' '), ' click-to-call-link ')]/@phone")
        web = if_exist(titles,".//a[@class='proWebsiteLink']/@href")
        print(name,phone,web)

category_links(url)

1 Answer:

Answer 0 (score: 1)

The problem with the first page is that it has no 'pagination' class, so this expression: tree.xpath("//ul[@class='pagination']//a[@class='pageNumber']/@href") returns an empty list and the profile_pagelink function is never executed for it.
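The failure mode is just ordinary loop behaviour over an empty list; a tiny network-free illustration (`collect_pages` is a made-up stand-in for the dispatch loop in `next_pagelink`):

```python
def collect_pages(pagination_hrefs):
    # Mirrors the loop in next_pagelink: every href found in the
    # pagination bar would be handed to profile_pagelink.
    visited = []
    for link in pagination_hrefs:
        visited.append(link)  # stands in for profile_pagelink(link)
    return visited

print(collect_pages(["/p/2", "/p/3"]))  # pages 2+ expose the bar → ['/p/2', '/p/3']
print(collect_pages([]))                # first page: XPath matched nothing → []
```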

As a quick solution, you can handle this case separately in the category_links function:

def category_links(mainurl):
    response = requests.get(mainurl).text
    tree = html.fromstring(response)
    if mainurl == "https://www.houzz.com/professionals/": 
        profile_pagelink("https://www.houzz.com/professionals/")
    for titles in tree.xpath("//a[@class='sidebar-item-label']/@href"):
        next_pagelink(titles)   
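A more general fix is to always treat the category URL itself as page 1 and only append whatever the pagination XPath found. This is a sketch under that assumption; `pages_to_visit` is a hypothetical helper name, not part of the original script:

```python
def pages_to_visit(category_url, pagination_links):
    """Return every page to scrape for one category, first page included.

    `pagination_links` is whatever the pageNumber XPath returned
    (an empty list on the first page, which never links to itself).
    """
    pages = [category_url]
    for link in pagination_links:
        if link != category_url:  # don't visit page 1 twice
            pages.append(link)
    return pages

# The first page is now always present, even with an empty pagination bar:
print(pages_to_visit("https://www.houzz.com/professionals/", []))
# → ['https://www.houzz.com/professionals/']
```

Inside next_pagelink you would then iterate over `pages_to_visit(process_links, links)` instead of the raw XPath result, and no special case in category_links is needed.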

Also, I noticed that target_pagelink prints a lot of empty strings because if_exist returns "". You can skip those cases by adding a condition to the for loop:

for titles in tree.xpath("//div[@class='container']"):    # use class='profile-cover' if you get duplicates #
    name = if_exist(titles,".//a[@class='profile-full-name']/text()")
    phone = if_exist(titles,".//a[contains(concat(' ', @class, ' '), ' click-to-call-link ')]/@phone")
    web = if_exist(titles,".//a[@class='proWebsiteLink']/@href")
    if name+phone+web : 
        print(name,phone,web)

Finally, requests.Session is mainly used for storing cookies and other headers, which your script doesn't need. You can just use requests.get and get the same results.
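That said, if you keep a Session, create one and share it across all requests rather than building a fresh Session inside every function as the question's code does; a single session also reuses TCP connections to the same host. A minimal sketch; `fetch` is a hypothetical helper name:

```python
import requests

# One module-level session, created once and reused everywhere.
# A new Session per call (as in the original script) gains nothing.
session = requests.Session()

def fetch(url):
    # A timeout keeps one stalled server from hanging the whole crawl.
    return session.get(url, timeout=10).text
```

Each of the four functions could then call `fetch(url)` instead of `requests.Session().get(url).text`.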