Latest update: I am narrowing my question down to this: how do I get all the links from a website, including the links on every sub-page, recursively?
I think I know how to get all the links of a single page:
from bs4 import BeautifulSoup
import requests
import re

def get_links(site, filename):
    f = open(filename, 'w')
    url = requests.get(site)
    data = url.text
    soup = BeautifulSoup(data, 'lxml')
    for links in soup.find_all('a'):
        f.write(str(links.get('href')) + "\n")
    f.close()

r = "https://en.wikipedia.org/wiki/Main_Page"
filename = "wiki"
get_links(r, filename)
How do I recursively make sure that all of the links on the site are also collected and written to the same file?
So I tried this, and it does not even compile.
def is_url(link):
    # checks using regex if 'link' is a valid url
    url = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*/\\,() ]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', link)
    return (" ".join(url) == link)

def get_links(site, filename):
    f = open(filename, 'a')
    url = requests.get(site)
    data = url.text
    soup = BeautifulSoup(data, 'lxml')
    for links in soup.find_all('a'):
        if is_url(links):
            f.write(str(links.get('href')) + "\n")
            get_links(links, filename)
    f.close()
Answer 0 (score: 2)
To answer your question, this is how I would use BeautifulSoup to get all of the links on a page and save them to a file:
Note, however, that the code below will not prevent cycles (which would cause infinite recursion). To avoid that, you can use a set to store the links you have already visited and skip them instead of crawling them again; a sketch of that idea follows the BeautifulSoup code below. You should also consider using something like Scrapy for this kind of task; in particular, take a look at CrawlSpider.
from bs4 import BeautifulSoup
import requests

def get_links(url):
    response = requests.get(url)
    data = response.text
    soup = BeautifulSoup(data, 'lxml')
    links = []
    for link in soup.find_all('a'):
        link_url = link.get('href')
        if link_url is not None and link_url.startswith('http'):
            links.append(link_url + '\n')
    write_to_file(links)
    return links

def write_to_file(links):
    with open('data.txt', 'a') as f:
        f.writelines(links)

def get_all_links(url):
    for link in get_links(url):
        get_all_links(link)

r = 'https://en.wikipedia.org/wiki/Main_Page'
write_to_file([r])
get_all_links(r)
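A minimal sketch of the visited-set idea mentioned above, assuming the get_links function defined just above (the get_all_links_safe name and structure are only illustrative, not the only way to do it):

# Sketch only: remember every URL that has already been crawled so the
# recursion stops instead of revisiting the same pages forever.
visited = set()

def get_all_links_safe(url):
    url = url.strip()        # get_links() appends '\n' to each collected link
    if url in visited:       # page already crawled: skip it to break cycles
        return
    visited.add(url)
    for link in get_links(url):
        get_all_links_safe(link)

You would call get_all_links_safe(r) instead of get_all_links(r); each page is then requested at most once.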
To extract the urls from the wikipedia.org domain, you could do something like the following:

from scrapy.spiders import CrawlSpider
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from scrapy import Item
from scrapy import Field

class UrlItem(Item):
    url = Field()

class WikiSpider(CrawlSpider):
    name = 'wiki'
    allowed_domains = ['wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/Main_Page/']

    rules = (
        Rule(LinkExtractor(), callback='parse_url'),
    )

    def parse_url(self, response):
        item = UrlItem()
        item['url'] = response.url
        return item
Run it with scrapy crawl wiki -o wiki.csv -t csv and you will get the urls in csv format in the wiki.csv file.