Question

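I recently started learning Python. While learning web scraping, I followed an example that scrapes Google News. After running my code, I get the message "Process finished with exit code 0" and no results. If I change the URL to "https://yahoo.com", I do get results. Can anyone point out what I am doing wrong?

Code: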

import urllib.request
from bs4 import BeautifulSoup


class Scraper:
    def __init__(self, site):
        self.site = site

    def scrape(self):
        r = urllib.request.urlopen(self.site)
        html = r.read()
        parser = "html.parser"
        sp = BeautifulSoup(html, parser)
        for tag in sp.find_all("a"):
            url = tag.get("href")
            if url is None:
                continue
            if "html" in url:
                print("\n" + url)

news = "https://news.google.com/"
Scraper(news).scrape()

Answer 1

Try this:

import urllib.request
from bs4 import BeautifulSoup


class Scraper:

    def __init__(self, site):
        self.site = site

    def scrape(self):
        r = urllib.request.urlopen(self.site)
        html = r.read()
        parser = "html.parser"
        sp = BeautifulSoup(html, parser)
        for tag in sp.find_all("a"):
            url = tag.get("href")
            if url is None:
                continue
            else:
                print("\n" + url)


if __name__ == '__main__':
    news = "https://news.google.com/"
    Scraper(news).scrape()

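Originally, you were checking each link to see whether it contained "html". I assume the example you are following was checking whether links end with ".html". Exit code 0 only means the script ran without errors; presumably none of the links on the Google News page matched that check, so nothing was printed.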
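To make the difference between the two checks concrete, here is a minimal sketch that needs no network access. The href values are hypothetical, invented only for illustration; they are not taken from any real page.

```python
# Hypothetical href values, invented for illustration only.
sample_hrefs = [
    "https://example.com/story.html",
    "https://example.com/html-tips/",
    "./articles/abc123",
    "/about",
]

# Substring test, as in the question: matches "html" anywhere in the link.
contains_html = [u for u in sample_hrefs if "html" in u]

# Suffix test: matches only links that end with ".html".
ends_with_html = [u for u in sample_hrefs if u.endswith(".html")]

print(contains_html)
print(ends_with_html)
```

If a page's links look like the last two samples, both checks match nothing and the loop prints no output at all.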

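Beautiful Soup works well, but you need to look at the source of the site you are scraping to understand how its markup is laid out. DevTools in Chrome is very good for this, and pressing F12 opens it quickly.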
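One way to do that kind of inspection from Python is to collect every link first and only then decide on a filter. The sketch below uses the standard library's html.parser instead of Beautiful Soup so that it runs without extra installs. The markup, the base URL, and the names LinkCollector and summarize_links are all hypothetical, and the assumption that a site uses relative links like these should be confirmed in DevTools.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


def summarize_links(html, base_url):
    """Return (all links made absolute, the subset containing 'html')."""
    collector = LinkCollector()
    collector.feed(html)
    absolute = [urljoin(base_url, h) for h in collector.hrefs]
    with_html = [u for u in absolute if "html" in u]
    return absolute, with_html


# Hypothetical markup, invented for illustration only.
sample_page = """
<a href="./articles/abc123">Story one</a>
<a href="/topics/world">World</a>
<a>No href here</a>
"""

links, html_links = summarize_links(sample_page, "https://news.example.com/")
print(len(links), "links found,", len(html_links), "contain 'html'")
for link in links:
    print(link)
```

Printing the counts before filtering shows right away whether the page has no links at all or whether the filter is discarding them.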

I removed:

if "html" in url:
    print("\n" + url)

and replaced it with:

else:
    print("\n" + url)

Problem with a Python web scraper in PyCharm (beginner)
