Recursively getting all sublinks of a website with Beautiful Soup in Python

Time: 2018-05-10 09:57:18

Tags: python for-loop recursion web-scraping beautifulsoup

Latest update: I am narrowing my question down to this: how do I get all the links from a website, including the sublinks of every page, and so on, recursively?

I think I know how to get all the sublinks of a single page:

from bs4 import BeautifulSoup
import requests
import re

def get_links(site, filename):
    f=open(filename, 'w')
    url = requests.get(site)
    data = url.text
    soup = BeautifulSoup(data, 'lxml')
    for links in soup.find_all('a'):
        f.write(str(links.get('href'))+"\n")
    f.close()

r="https://en.wikipedia.org/wiki/Main_Page"
filename="wiki"
get_links(r,filename)

How do I recursively make sure that all the links on the website are also collected and written to the same file?

So I tried this, and it doesn't even compile.

def is_url(link):
    #checks using regex if 'link' is a valid url
    url = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*/\\,() ]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', link)
    return (" ".join(url)==link)

def get_links(site, filename):
    f=open(filename, 'a')
    url = requests.get(site)
    data = url.text
    soup = BeautifulSoup(data, 'lxml')
    for links in soup.find_all('a'):
        if is_url(links):
            f.write(str(links.get('href'))+"\n")
            get_links(links, filename)
    f.close()

1 Answer:

Answer 0 (score: 2):

To answer your question, this is how I would get all the links of a page using BeautifulSoup and save them to a file:

from bs4 import BeautifulSoup
import requests

def get_links(url):
    response = requests.get(url)
    data = response.text
    soup = BeautifulSoup(data, 'lxml')

    links = []
    for link in soup.find_all('a'):
        link_url = link.get('href')
        if link_url is not None and link_url.startswith('http'):
            links.append(link_url + '\n')

    write_to_file(links)
    return links

def write_to_file(links):
    with open('data.txt', 'a') as f:
        f.writelines(links)

def get_all_links(url):
    for link in get_links(url):
        get_all_links(link)

r = 'https://en.wikipedia.org/wiki/Main_Page'
write_to_file([r])
get_all_links(r)

However, this will not prevent cycles (which would cause infinite recursion). To handle that, you can use a set to store the links you have already visited and skip any link you have seen before.
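
A minimal sketch of that idea could look like the following. It reuses get_links() from the snippet above; the visited set and the max_depth cutoff are illustrative additions, not part of the original code:

def get_all_links_safe(url, visited=None, max_depth=2, depth=0):
    # 'visited' remembers every page already crawled, so a link that points
    # back to an earlier page no longer causes infinite recursion.
    # 'max_depth' is an extra guard so the sketch stops after a few levels.
    if visited is None:
        visited = set()
    url = url.strip()  # get_links() appends '\n' to each stored link
    if url in visited or depth > max_depth:
        return
    visited.add(url)
    for link in get_links(url):  # get_links() is defined in the snippet above
        get_all_links_safe(link, visited, max_depth, depth + 1)

get_all_links_safe('https://en.wikipedia.org/wiki/Main_Page')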

That said, you should really consider using something like Scrapy for this kind of task. In particular, have a look at CrawlSpider.

To extract URLs from the wikipedia.org domain, you could do something like this:

from scrapy.spiders import CrawlSpider
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor

from scrapy import Item
from scrapy import Field


class UrlItem(Item):
    url = Field()


class WikiSpider(CrawlSpider):
    name = 'wiki'
    allowed_domains = ['wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/Main_Page/']

    rules = (
        Rule(LinkExtractor(), callback='parse_url'),
    )

    def parse_url(self, response):
        item = UrlItem()
        item['url'] = response.url

        return item

Run it with scrapy crawl wiki -o wiki.csv -t csv and you will get the URLs in CSV format in the wiki.csv file.