So I'm following a tutorial on web scraping with Python. Whenever I run the code, I get this error:
FileNotFoundError: [Errno 2] No such file or directory: './data/nyct/turnstile/turnstile_200314.txt'
I have a hunch this means the scraper can't access the file, but when I check the HTML the file is there. Please help. Here is my code for reference:
import requests
import urllib.request
import time
from bs4 import BeautifulSoup

# Set the URL you want to webscrape from
url = 'http://web.mta.info/developers/turnstile.html'

# Connect to the URL
response = requests.get(url)

# Parse HTML and save to BeautifulSoup object
soup = BeautifulSoup(response.text, 'html.parser')

# Loop to download the whole dataset
linecount = 1  # var to track the current line
for onetag in soup.findAll('a'):
    if linecount >= 36:
        link = onetag['href']
        downloadurl = 'http://web.mta.info/developers/' + link
        urllib.request.urlretrieve(downloadurl, './' + link[link.find('/turnsttile_') + 1:])
        time.sleep(3)  # pause code so as to not get flagged as a spammer
    # increment for next line
    linecount += 1
Answer 0 (score: 0)
Put the following script in a folder and run it. Make sure to adjust the [:2] part to suit your needs; I've only set it that way for testing:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = 'http://web.mta.info/developers/turnstile.html'
base = 'http://web.mta.info/developers/'

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

for tag in soup.select('a[href^="data/nyct/"]')[:2]:
    filename = tag['href'].split("_")[1]
    with open(filename, "wb") as f:
        f.write(requests.get(urljoin(base, tag['href'])).content)
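Note that split("_")[1] names each file by the part after the underscore (e.g. 200314.txt). If you'd rather keep the full original file names, a minimal variation of the same script, assuming the same page and selector, is to take the basename of the href and check the HTTP status before writing:

import os
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = 'http://web.mta.info/developers/turnstile.html'
base = 'http://web.mta.info/developers/'

soup = BeautifulSoup(requests.get(url).text, 'html.parser')

for tag in soup.select('a[href^="data/nyct/"]')[:2]:
    # 'data/nyct/turnstile/turnstile_200314.txt' -> 'turnstile_200314.txt'
    filename = os.path.basename(tag['href'])
    resp = requests.get(urljoin(base, tag['href']))
    resp.raise_for_status()  # fail loudly instead of silently writing an error page to disk
    with open(filename, "wb") as f:
        f.write(resp.content)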
If you want to stick with .find_all(), you can achieve the same thing like this:
for onetag in soup.find_all('a', href=True):
    if not onetag['href'].startswith('data/nyct/'):
        continue
    link = urljoin(base, onetag['href'])
    print(link)
Or like this:
for onetag in soup.find_all('a',href=lambda e: e and e.startswith("data/nyct/")):
link = urljoin(base,onetag['href'])
print(link)
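Either of these loops only prints the links. A minimal sketch of plugging the download back in, assuming you want to save each file under its bare name in the current folder (so no './data/nyct/turnstile/' directory is needed) and keep the pause from the question's code, could look like this:

import os
import time
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = 'http://web.mta.info/developers/turnstile.html'
base = 'http://web.mta.info/developers/'

soup = BeautifulSoup(requests.get(url).text, 'html.parser')

for onetag in soup.find_all('a', href=lambda e: e and e.startswith("data/nyct/")):
    link = urljoin(base, onetag['href'])
    # save next to the script using only the file name, e.g. 'turnstile_200314.txt'
    with open(os.path.basename(onetag['href']), "wb") as f:
        f.write(requests.get(link).content)
    time.sleep(3)  # pause between requests, as in the original code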