Trouble dealing with links that have different pagination structures

Date: 2018-12-23 23:36:32

Tags: python python-3.x function web-scraping

I've written a script in Python to scrape the titles of the different items located in the right-hand area next to the map on their landing pages. I used two links in the script: one that has pagination and one that does not.

When I execute the script, it first checks for a pagination link. If it finds one, it passes the link to the get_paginated_info() function and prints the results there. However, if no pagination link is found, it passes the soup object to the get_info() function and prints the results there. At the moment the script works exactly the way I've described.

How can I make my script print the results only within the get_info() function, whether a link has pagination or not, so that I can drop get_paginated_info() from my script altogether?

This is my attempt so far:

import requests 
from bs4 import BeautifulSoup
from urllib.parse import urljoin

urls = (
    'https://www.mobilehome.net/mobile-home-park-directory/maine/all',
    'https://www.mobilehome.net/mobile-home-park-directory/rhode-island/all',
)

def get_names(link):
    r = requests.get(link)
    soup = BeautifulSoup(r.text, "lxml")
    # Look for a "next page" link inside the pagination block
    items = soup.select_one(".pagination a.next_page")
    if items:
        # The sibling right before "next" carries the last page number in its href
        npagelink = items.find_previous_sibling().get("href").split("/")[-1]
        return [get_paginated_info(link + "/page/{}".format(page)) for page in range(1, int(npagelink) + 1)]
    else:
        # No pagination: parse the landing page that was already fetched
        return [get_info(soup)]

def get_info(soup):
    # Print titles from a page without pagination
    print("================links without pagination==============")
    for items in soup.select("td[class='table-row-price']"):
        item = items.select_one("h2 a").text
        print(item)

def get_paginated_info(url):
    # Fetch one page of a paginated listing and print its titles
    r = requests.get(url)
    sauce = BeautifulSoup(r.text, "lxml")
    print("================links with pagination==============")
    for content in sauce.select("td[class='table-row-price']"):
        title = content.select_one("h2 a").text
        print(title)

if __name__ == '__main__':
    for url in urls:
        get_names(url)

Any better design capable of handling these different scenarios will be highly appreciated.

1 answer:

Answer 0 (score: 1)

I changed the logic slightly. Now the script calls get_info in both cases, with pagination and without it, but in the second case the for loop is executed for only one iteration.

import requests 
from bs4 import BeautifulSoup
from urllib.parse import urljoin

urls = (
    'https://www.mobilehome.net/mobile-home-park-directory/maine/all',
    'https://www.mobilehome.net/mobile-home-park-directory/rhode-island/all',
)

def get_names(link):
    r = requests.get(link)
    soup = BeautifulSoup(r.text, "lxml")
    items = soup.select_one(".pagination a.next_page")
    try:
        # Last page number taken from the link just before "next"
        npagelink = items.find_previous_sibling().get("href").split("/")[-1]
    except AttributeError:
        # No pagination block on the page: treat it as a single page
        npagelink = 1
    return [get_info(link + "/page/{}".format(page)) for page in range(1, int(npagelink) + 1)]


def get_info(url):
    # Fetch one page and print every title found in the listing table
    r = requests.get(url)
    sauce = BeautifulSoup(r.text, "lxml")
    for content in sauce.select("td[class='table-row-price']"):
        title = content.select_one("h2 a").text
        print(title)

if __name__ == '__main__':
    for url in urls:
        get_names(url)

Double-check the output to make sure everything works as expected.
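
If you would rather verify the results programmatically than eyeball the printed titles, one option is to return the titles instead of printing them and then compare the counts per link. The sketch below is a minimal variant of the code above, not part of the original answer: the collect_names helper and the count summary are my own additions, and they assume the same selectors still match the site's markup.

import requests
from bs4 import BeautifulSoup

urls = (
    'https://www.mobilehome.net/mobile-home-park-directory/maine/all',
    'https://www.mobilehome.net/mobile-home-park-directory/rhode-island/all',
)

def collect_names(link):
    # Same pagination detection as get_names(), but the titles are
    # collected and returned instead of printed.
    r = requests.get(link)
    soup = BeautifulSoup(r.text, "lxml")
    next_page = soup.select_one(".pagination a.next_page")
    try:
        last_page = int(next_page.find_previous_sibling().get("href").split("/")[-1])
    except AttributeError:
        # No pagination block found: only one page to fetch
        last_page = 1
    titles = []
    for page in range(1, last_page + 1):
        r = requests.get("{}/page/{}".format(link, page))
        sauce = BeautifulSoup(r.text, "lxml")
        for content in sauce.select("td[class='table-row-price']"):
            titles.append(content.select_one("h2 a").text)
    return titles

if __name__ == '__main__':
    for url in urls:
        names = collect_names(url)
        print("{} -> {} titles".format(url, len(names)))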