So I'm following a tutorial on web scraping with Python. Whenever I run the code, I get this error:
FileNotFoundError: [Errno 2] No such file or directory: './data/nyct/turnstile/turnstile_200314.txt'
I have a hunch this means the scraper can't access the file, but when I check the HTML the file is there. Please help. Here is my code for reference:
import requests
import urllib.request
import time
from bs4 import BeautifulSoup

# Set the URL you want to webscrape from
url = 'http://web.mta.info/developers/turnstile.html'

# Connect to the URL
response = requests.get(url)

# Parse HTML and save to BeautifulSoup object
soup = BeautifulSoup(response.text, 'html.parser')

# Loop to download the whole dataset
linecount = 1  # var to track the current line
for onetag in soup.findAll('a'):
    if linecount >= 36:
        link = onetag['href']
        downloadurl = 'http://web.mta.info/developers/' + link
        urllib.request.urlretrieve(downloadurl, './' + link[link.find('/turnsttile_') + 1:])
        time.sleep(3)  # pause code so as to not get flagged as a spammer
    # increment for next line
    linecount += 1
Answer 0 (score: 0)
Put the following script in a folder and run it. Make sure to adjust the [:2] part to suit your needs; I've only set it that way for testing:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = 'http://web.mta.info/developers/turnstile.html'
base = 'http://web.mta.info/developers/'

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

for tag in soup.select('a[href^="data/nyct/"]')[:2]:
    filename = tag['href'].split("_")[1]
    with open(filename, "wb") as f:
        f.write(requests.get(urljoin(base, tag['href'])).content)
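Note that split("_")[1] names each file by the part after the underscore (e.g. 200314.txt). If you'd rather keep the full original file names, a minimal variation of the same script, assuming the same page and selector, is to take the basename of the href and check the HTTP status before writing:

import os
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = 'http://web.mta.info/developers/turnstile.html'
base = 'http://web.mta.info/developers/'

soup = BeautifulSoup(requests.get(url).text, 'html.parser')

for tag in soup.select('a[href^="data/nyct/"]')[:2]:
    # 'data/nyct/turnstile/turnstile_200314.txt' -> 'turnstile_200314.txt'
    filename = os.path.basename(tag['href'])
    resp = requests.get(urljoin(base, tag['href']))
    resp.raise_for_status()  # fail loudly instead of silently writing an error page to disk
    with open(filename, "wb") as f:
        f.write(resp.content)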
If you want to stick with .find_all(), you can achieve the same thing like this:
for onetag in soup.find_all('a', href=True):
    if not onetag['href'].startswith('data/nyct/'):
        continue
    link = urljoin(base, onetag['href'])
    print(link)
Or like this:
for onetag in soup.find_all('a',href=lambda e: e and e.startswith("data/nyct/")):
link = urljoin(base,onetag['href'])
print(link)
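Either of these loops only prints the links. A minimal sketch of plugging the download back in, assuming you want to save each file under its bare name in the current folder (so no './data/nyct/turnstile/' directory is needed) and keep the pause from the question's code, could look like this:

import os
import time
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = 'http://web.mta.info/developers/turnstile.html'
base = 'http://web.mta.info/developers/'

soup = BeautifulSoup(requests.get(url).text, 'html.parser')

for onetag in soup.find_all('a', href=lambda e: e and e.startswith("data/nyct/")):
    link = urljoin(base, onetag['href'])
    # save next to the script using only the file name, e.g. 'turnstile_200314.txt'
    with open(os.path.basename(onetag['href']), "wb") as f:
        f.write(requests.get(link).content)
    time.sleep(3)  # pause between requests, as in the original code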