Unable to return all the results at once

Date: 2019-01-20 10:35:07

Tags: python python-3.x web-scraping return

I've written a script in Python to fetch some links from a webpage. There are two functions in my script: the first collects links to local businesses from a webpage, and the second traverses those links and collects the URLs of various events.

When I try with the script found here, I get the desired results.

How can I return all the results while complying with the design below?

The following script returns the results for a single link at a time, whereas I wish to return all the results at once, keeping the design intact (the logic may differ).

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

linklist = []

def collect_links(link):
    res = requests.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    items = [urljoin(url,item.get("href")) for item in soup.select(".business-listings-category-list .field-content a[hreflang]")]
    return items

def fetch_info(ilink):
    res = requests.get(ilink)
    soup = BeautifulSoup(res.text, "lxml")
    for item in soup.select(".business-teaser-title a[title]"):
        linklist.append(urljoin(url,item.get("href")))
    return linklist

if __name__ == '__main__':
    url = "https://www.parentmap.com/atlas"
    for itemlink in collect_links(url):
        print(fetch_info(itemlink))

2 Answers:

Answer 0 (score: 2)

First, I removed the global linklist, which was being returned from the function anyway; keeping it global produced overlapping results. Next, I added a function that "assembles" the links the way you want. I used a set to prevent duplicate links.

#!/usr/bin/python

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def collect_links(link):
    res = requests.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    items = [urljoin(url,item.get("href")) for item in soup.select(".business-listings-category-list .field-content a[hreflang]")]
    return items

def fetch_info(ilink):
    linklist = []
    res = requests.get(ilink)
    soup = BeautifulSoup(res.text, "lxml")
    for item in soup.select(".business-teaser-title a[title]"):
        linklist.append(urljoin(url,item.get("href")))
    return linklist

def fetch_all_links(url):
    links = set()
    for itemlink in collect_links(url):
        links.update(fetch_info(itemlink))
    return list(links)

if __name__ == '__main__':
    url = "https://www.parentmap.com/atlas"
    print(fetch_all_links(url))
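One side effect of the set in fetch_all_links is that the original link order is lost. If ordering matters, an order-preserving dedup is possible; a minimal sketch (the helper name is my own, not from the answer) using dict.fromkeys, whose keys preserve insertion order in Python 3.7+:

```python
def dedupe_keep_order(links):
    # dict keys preserve insertion order (Python 3.7+),
    # so duplicates are dropped without reshuffling the list
    return list(dict.fromkeys(links))

print(dedupe_keep_order(["a", "b", "a", "c", "b"]))  # ['a', 'b', 'c']
```

You could return dedupe_keep_order(links) from fetch_all_links (with links built as a plain list) instead of list(links) built from a set.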

Answer 1 (score: 0)

The main reason you get the results one at a time is that you call fetch_info again and again inside a loop, which prints the data repeatedly. Instead, move the loop inside the fetch_info function.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

linklist = []

def collect_links(link):
    res = requests.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    items = [urljoin(url,item.get("href")) for item in soup.select(".business-listings-category-list .field-content a[hreflang]")]
    return items

def fetch_info(url):
    # Loop over the business links here, instead of calling
    # fetch_info once per link from the caller
    for itemlink in collect_links(url):
        res = requests.get(itemlink)
        soup = BeautifulSoup(res.text, "lxml")
        for item in soup.select(".business-teaser-title a[title]"):
            linklist.append(urljoin(url, item.get("href")))
    return linklist

if __name__ == '__main__':
    url = "https://www.parentmap.com/atlas"
    print(fetch_info(url))
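If you would rather not accumulate every link in memory before printing, the same two-level traversal can also be written as a generator that yields event links as they are found; the caller can still collect them all at once with list(). A minimal sketch with stubbed fetch functions (assumed stand-ins for the real requests/BeautifulSoup calls, so no network access is needed):

```python
def crawl(url, collect_links, fetch_info):
    # Yield event links lazily, one business page at a time
    for business_link in collect_links(url):
        yield from fetch_info(business_link)

# Stub fetchers simulating the two scraping stages (assumed data)
def fake_collect_links(url):
    return [url + "/biz1", url + "/biz2"]

def fake_fetch_info(link):
    return [link + "/event1", link + "/event2"]

all_links = list(crawl("https://example.com", fake_collect_links, fake_fetch_info))
print(all_links)
```

With the real scraper, collect_links and fetch_info from the question would be passed in (or called directly), and the if __name__ block would reduce to print(list(crawl(url, collect_links, fetch_info))).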