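I want to scrape the cover images of different journals from the Cambridge University Press website and save each one under the journal's online ISSN. The code below works, but after one or two journals it gives me this error: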
Traceback (most recent call last):
  File "C:\Users\Boys\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\connection.py", line 141, in _new_conn
    (self.host, self.port), self.timeout, **extra_kw)
  File "C:\Users\Boys\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\util\connection.py", line 60, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "C:\Users\Boys\AppData\Local\Programs\Python\Python36-32\lib\socket.py", line 745, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno 11004] getaddrinfo failed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\Boys\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\connectionpool.py", line 601, in urlopen
    chunked=chunked)
  File "C:\Users\Boys\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\connectionpool.py", line 357, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "C:\Users\Boys\AppData\Local\Programs\Python\Python36-32\lib\http\client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "C:\Users\Boys\AppData\Local\Programs\Python\Python36-32\lib\http\client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "C:\Users\Boys\AppData\Local\Programs\Python\Python36-32\lib\http\client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "C:\Users\Boys\AppData\Local\Programs\Python\Python36-32\lib\http\client.py", line 1026, in _send_output
    self.send(msg)
  File "C:\Users\Boys\AppData\Local\Programs\Python\Python36-32\lib\http\client.py", line 964, in send
    self.connect()
  File "C:\Users\Boys\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\connection.py", line 166, in connect
    conn = self._new_conn()
  File "C:\Users\Boys\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\connection.py", line 150, in _new_conn
    self, "Failed to establish a new connection: %s" % e)
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x030DB770>: Failed to establish a new connection: [Errno 11004] getaddrinfo failed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\Boys\AppData\Local\Programs\Python\Python36-32\lib\site-packages\requests\adapters.py", line 440, in send
    timeout=timeout
  File "C:\Users\Boys\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\connectionpool.py", line 639, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "C:\Users\Boys\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\util\retry.py", line 388, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='ore', port=80): Max retries exceeded with url: /services/aop-file-manager/file/57f386d3efeebb2f18eac486 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x030DB770>: Failed to establish a new connection: [Errno 11004] getaddrinfo failed',))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\Boys\Documents\Python\python_work\Kudos\CUPgetcovers.py", line 19, in <module>
    f.write(requests.get("http://" + imagefound).content)
  File "C:\Users\Boys\AppData\Local\Programs\Python\Python36-32\lib\site-packages\requests\api.py", line 72, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Users\Boys\AppData\Local\Programs\Python\Python36-32\lib\site-packages\requests\api.py", line 58, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Users\Boys\AppData\Local\Programs\Python\Python36-32\lib\site-packages\requests\sessions.py", line 508, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Users\Boys\AppData\Local\Programs\Python\Python36-32\lib\site-packages\requests\sessions.py", line 618, in send
    r = adapter.send(request, **kwargs)
  File "C:\Users\Boys\AppData\Local\Programs\Python\Python36-32\lib\site-packages\requests\adapters.py", line 508, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='ore', port=80): Max retries exceeded with url: /services/aop-file-manager/file/57f386d3efeebb2f18eac486 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x030DB770>: Failed to establish a new connection: [Errno 11004] getaddrinfo failed',))
What am I doing wrong? I couldn't find any answer on Google, and this was working fine before. Thanks in advance.
Edit: launch.py:
import httplib2
from bs4 import BeautifulSoup, SoupStrainer
import csv
import requests
from time import sleep
with open('listoflinks.csv', encoding="utf8") as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')
    for row in readCSV:
        http = httplib2.Http()
        status, response = http.request(("https://www.cambridge.org" + row[0]))
        soup = BeautifulSoup(response, "html.parser")
        txt = (t.text for t in soup.find_all("span", class_="value"))
        issn = next(t[:9] for t in txt if t.endswith("(Online)"))
        for a in soup.find_all('a', attrs={'class' : 'image'}):
            if a.img:
                imagefound = (a.img['src'])
                imagefound = imagefound[2:]
                f = open((issn + ".jpg"),'wb')
                f.write(requests.get("http://" + imagefound).content)
                f.close()
listoflinks.csv:
/core/journals/journal-of-materials-research
/core/journals/journal-of-mechanics
/core/journals/journal-of-modern-african-studies
/core/journals/journal-of-navigation
/core/journals/journal-of-nutritional-science
/core/journals/journal-of-pacific-rim-psychology
/core/journals/journal-of-paleontology
/core/journals/journal-of-pension-economics-and-finance
/core/journals/journal-of-plasma-physics
/core/journals/journal-of-policy-history
/core/journals/journal-of-psychologists-and-counsellors-in-schools
/core/journals/journal-of-public-policy
/core/journals/journal-of-race-ethnicity-and-politics
/core/journals/journal-of-radiotherapy-in-practice
/core/journals/journal-of-relationships-research
/core/journals/journal-of-roman-archaeology
/core/journals/journal-of-roman-studies
/core/journals/journal-of-smoking-cessation
/core/journals/journal-of-social-policy
/core/journals/journal-of-southeast-asian-studies
/core/journals/journal-of-symbolic-logic
/core/journals/journal-of-the-american-philosophical-association
/core/journals/journal-of-the-australian-mathematical-society
/core/journals/journal-of-the-gilded-age-and-progressive-era
/core/journals/journal-of-the-history-of-economic-thought
/core/journals/journal-of-the-institute-of-mathematics-of-jussieu
/core/journals/journal-of-the-international-neuropsychological-society
/core/journals/journal-of-the-international-phonetic-association
/core/journals/journal-of-the-marine-biological-association-of-the-united-kingdom
/core/journals/journal-of-the-royal-asiatic-society
/core/journals/journal-of-the-society-for-american-music
/core/journals/journal-of-tropical-ecology
/core/journals/journal-of-tropical-psychology
/core/journals/journal-of-wine-economics
/core/journals/kantian-review
/core/journals/knowledge-engineering-review
/core/journals/language-and-cognition
/core/journals/language-in-society
/core/journals/language-teaching
/core/journals/language-variation-and-change
/core/journals/laser-and-particle-beams
/core/journals/latin-american-antiquity
/core/journals/latin-american-politics-and-society
/core/journals/law-and-history-review
/core/journals/legal-information-management
/core/journals/legal-studies
/core/journals/legal-theory
/core/journals/leiden-journal-of-international-law
/core/journals/libyan-studies
/core/journals/lichenologist
/core/journals/lms-journal-of-computation-and-mathematics
/core/journals/macroeconomic-dynamics
/core/journals/management-and-organization-review
/core/journals/mathematical-gazette
/core/journals/mathematical-proceedings-of-the-cambridge-philosophical-society
/core/journals/mathematical-structures-in-computer-science
/core/journals/mathematika
/core/journals/medical-history
/core/journals/medical-history-supplements
/core/journals/melanges-d-histoire-sociale
/core/journals/microscopy-and-microanalysis
/core/journals/microscopy-today
/core/journals/mineralogical-magazine
/core/journals/modern-american-history
/core/journals/modern-asian-studies
/core/journals/modern-intellectual-history
/core/journals/modern-italy
/core/journals/mrs-advances
/core/journals/mrs-bulletin
/core/journals/mrs-communications
/core/journals/mrs-energy-and-sustainability
/core/journals/mrs-online-proceedings-library-archive
/core/journals/nagoya-mathematical-journal
/core/journals/natural-language-engineering
/core/journals/netherlands-journal-of-geosciences
/core/journals/network-science
/core/journals/new-perspectives-on-turkey
/core/journals/new-surveys-in-the-classics
/core/journals/new-testament-studies
/core/journals/new-theatre-quarterly
/core/journals/nineteenth-century-music-review
/core/journals/nordic-journal-of-linguistics
/core/journals/numerical-mathematics-theory-methods-and-applications
/core/journals/nutrition-research-reviews
/core/journals/organised-sound
/core/journals/oryx
/core/journals/paleobiology
/core/journals/the-paleontological-society-papers
/core/journals/palliative-and-supportive-care
/core/journals/papers-of-the-british-school-at-rome
/core/journals/parasitology
/core/journals/parasitology-open
/core/journals/personality-neuroscience
/core/journals/perspectives-on-politics
/core/journals/philosophy
/core/journals/phonology
/core/journals/plainsong-and-medieval-music
/core/journals/plant-genetic-resources
/core/journals/polar-record
/core/journals/political-analysis
/core/journals/political-science-research-and-methods
/core/journals/politics-and-gender
/core/journals/politics-and-religion
/core/journals/politics-and-the-life-sciences
/core/journals/popular-music
/core/journals/powder-diffraction
/core/journals/prehospital-and-disaster-medicine
/core/journals/primary-health-care-research-and-development
/core/journals/probability-in-the-engineering-and-informational-sciences
/core/journals/proceedings-of-the-asil-annual-meeting
/core/journals/proceedings-of-the-edinburgh-mathematical-society
/core/journals/proceedings-of-the-international-astronomical-union
/core/journals/proceedings-of-the-nutrition-society
/core/journals/proceedings-of-the-prehistoric-society
/core/journals/proceedings-of-the-royal-society-of-edinburgh-section-a-mathematics
/core/journals/ps-political-science-and-politics
/core/journals/psychological-medicine
/core/journals/public-health-nutrition
/core/journals/publications-of-the-astronomical-society-of-australia
/core/journals/quarterly-reviews-of-biophysics
/core/journals/quaternary-research
/core/journals/queensland-review
/core/journals/radiocarbon
/core/journals/ramus
/core/journals/recall
/core/journals/religious-studies
/core/journals/renewable-agriculture-and-food-systems
/core/journals/review-of-international-studies
/core/journals/review-of-middle-east-studies
/core/journals/review-of-politics
/core/journals/review-of-symbolic-logic
/core/journals/revista-de-historia-economica-journal-of-iberian-and-latin-american-economic-history
/core/journals/robotica
/core/journals/royal-historical-society-camden-fifth-series
/core/journals/royal-institute-of-philosophy-supplements
/core/journals/rural-history
/core/journals/science-in-context
/core/journals/scottish-journal-of-theology
/core/journals/seed-science-research
/core/journals/slavic-review
/core/journals/social-philosophy-and-policy
/core/journals/social-policy-and-society
/core/journals/social-science-history
/core/journals/spanish-journal-of-psychology
/core/journals/studies-in-american-political-development
/core/journals/studies-in-church-history
/core/journals/studies-in-second-language-acquisition
/core/journals/tempo
/core/journals/theatre-research-international
/core/journals/theatre-survey
/core/journals/theory-and-practice-of-logic-programming
/core/journals/think
/core/journals/traditio
/core/journals/trans-trans-regional-and-national-studies-of-southeast-asia
/core/journals/transactions-of-the-royal-historical-society
/core/journals/transnational-environmental-law
/core/journals/twentieth-century-music
/core/journals/twin-research-and-human-genetics
/core/journals/urban-history
/core/journals/utilitas
/core/journals/victorian-literature-and-culture
/core/journals/visual-neuroscience
/core/journals/weed-science
/core/journals/weed-technology
/core/journals/wireless-power-transfer
/core/journals/world-politics
/core/journals/world-s-poultry-science-journal
/core/journals/world-trade-review
/core/journals/zygote
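Answer 0 (score: 0)
You should simplify your code and your scraping strategy, although I can see that not all journal pages have the same structure. On most pages you can easily get the ISSN from a form value. On the others (the free-access ones, I think) you need to apply some kind of heuristic to get the ISSN. Also, I don't know why you are using both httplib2 and requests, since they provide more or less the same functionality. Anyway, here is some code that does what you want... sort of (I also removed the CSV code, because it isn't needed for a file with one path per line):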
import requests
from bs4 import BeautifulSoup

with open('listoflinks.csv', encoding="utf8") as f:
    for line in f:
        path = line.strip()
        print("getting", path)
        response = requests.get("https://www.cambridge.org" + path)
        soup = BeautifulSoup(response.text, "html.parser")
        issn = None
        try:
            # Most pages expose the ISSN as a form value.
            issn = soup.find("input", attrs={'name': 'productIssn'}).get('value')
        except AttributeError:
            # find() returned None: fall back to the "NNNN-NNNN (Online)" span.
            for v in soup.find_all("span", class_="value"):
                if v.string and "(Online)" in v.string:
                    issn = v.string.split(" ")[0]
                    break
        print("issn:", issn)
        if issn is None:
            continue  # no ISSN found, so there is no file name to save under
        details_container = soup.find("div", class_="details-container")
        image = details_container.find("img")
        # Assumes src is protocol-relative ("//host/path"); drop the leading "//".
        imgurl = image['src'][2:]
        print("imgurl:", imgurl)
        with open(issn + ".jpg", 'wb') as output:
            output.write(requests.get("http://" + imgurl).content)
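
A note on the host='ore' in the traceback: stripping the first two characters of src only works when the value starts with "//". If a page instead returns a root-relative path such as /core/services/aop-file-manager/file/..., the slice leaves "ore/services/...", and "http://" + that string points at a host named "ore", which would explain the getaddrinfo failure. One way around this, as a sketch rather than a tested fix for every journal page, is to let urllib.parse.urljoin resolve the src against the site root. The helper name absolute_image_url and the static.cambridge.org example path below are made up for illustration.

```python
from urllib.parse import urljoin

# Site root used by the scraping code above.
BASE = "https://www.cambridge.org"


def absolute_image_url(src, base=BASE):
    """Resolve an <img src> value against the base URL.

    Works for protocol-relative ("//host/path"), root-relative ("/path")
    and already-absolute URLs, so no manual slicing is needed.
    """
    return urljoin(base, src)


# Hypothetical protocol-relative src: the scheme is taken from the base URL.
print(absolute_image_url("//static.cambridge.org/covers/example.jpg"))

# Root-relative src like the one in the traceback: the host stays www.cambridge.org.
print(absolute_image_url("/core/services/aop-file-manager/file/57f386d3efeebb2f18eac486"))
```

With this in place, the download line would become requests.get(absolute_image_url(image['src'])) instead of building the URL by hand.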