Question

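I've been tinkering with Beautiful Soup as a hobby project, and this seems to work as expected: it fetches the HTML, grabs each headline title + link, adds them all to a set, then iterates over the set and prints them in human-readable form. I'm using a set because a plain list seemed to duplicate every title + link pair; I'm not sure how to stop that part from running twice.

However, I've noticed that once I call the NYT_Spider() function, the program stops responding for a short while, though it does come back and display the correct results.

My question is twofold:

1) What exactly causes the program to hang while the NYT_Spider function runs, and can this reasonably be corrected?

2) The spider also pulls and prints links to articles/ads that have no title, which look similar to: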

Nonehttps://www.nytimes.com/2017/11/02/education/edlife/bilingual-mfa-writing.html

Is there a simple way to remove those using the current modules, or should I simply run a regex over the results?

Here is the full code:

import requests
from bs4 import BeautifulSoup

def NYT_Spider():

    result_list = set()
    url = "https://www.nytimes.com/section/world/americas"
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "html.parser")

    for link in soup.findAll('a', {'class': "story-link"}):
        href = link.get('href')
        title = get_single_item_data(href)
        results = str(title) + str(href) + "\n" + "\n"
        result_list.add(results)

    for i in result_list:
        print(i)


def get_single_item_data(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "html.parser")
    for item_name in soup.findAll('h1', {"class": "headline"}):
        return item_name.string + "\n"

NYT_Spider()
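
For reference, here is a minimal sketch of the de-duplication and "no title" filtering described above, kept separate from the scraping itself. It assumes the pause comes from get_single_item_data issuing one requests.get call per link, one after another, with nothing printed until every call has finished; I have not confirmed this by timing it. The names collect_results, fetch_title and fake_fetch_title are hypothetical and not part of requests or Beautiful Soup. fetch_title stands in for get_single_item_data, which returns None implicitly whenever no h1 with class "headline" is found, and str(None) is where the "None" prefix in the output comes from.

```python
def collect_results(hrefs, fetch_title):
    """Return (title, href) pairs, one per unique link, skipping untitled pages.

    `fetch_title` is a hypothetical stand-in for get_single_item_data: it takes
    a URL and returns the headline string, or None when no headline is found.
    """
    seen = set()    # hrefs already handled, so each page is fetched only once
    results = []    # a list preserves the order in which links appear
    for href in hrefs:
        if not href or href in seen:
            continue                # missing href or duplicate link: skip
        seen.add(href)
        title = fetch_title(href)   # in the real spider: one request per link
        if title is None:
            continue                # no headline found: drop the entry
        results.append((title.strip(), href))
    return results


# Stand-in fetcher so the sketch runs without network access (illustrative data).
def fake_fetch_title(url):
    titles = {"https://example.com/a": "First headline\n"}
    return titles.get(url)


links = [
    "https://example.com/a",
    "https://example.com/a",    # duplicate link on the section page
    "https://example.com/ad",   # page without a headline
]
for title, href in collect_results(links, fake_fetch_title):
    print(title)
    print(href)
    print()
```

In NYT_Spider, hrefs would be the link.get('href') values from the soup.findAll loop and fetch_title would be get_single_item_data. Skipping duplicate hrefs before fetching also avoids requesting the same article twice, which should shorten the pause somewhat, though each unique link still costs one request.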

Python 3.x, Beautiful Soup, spider

0 answers: