I have a text file containing a bunch of Wikipedia links:
http://en.wikipedia.org/wiki/List_of_cities_and_towns_in_Alabama
http://en.wikipedia.org/wiki/List_of_cities_and_census-designated_places_in_Alaska
http://en.wikipedia.org/wiki/List_of_cities_and_towns_in_Arizona
http://en.wikipedia.org/wiki/List_of_cities_and_towns_in_Arkansas
http://en.wikipedia.org/wiki/List_of_cities_and_towns_in_California
http://en.wikipedia.org/wiki/List_of_cities_and_towns_in_Colorado
... etc
and the following Python script, which is meant to pull down the HTML of each page:
import urllib.request
for line in open("sites.txt", "r"):
    print("Pulling: " + line)
    urllib.request.urlretrieve(line, line.split('/'))
But when I run it, I get the following error:
Traceback (most recent call last):
  File "C:\Users\brandon\Desktop\site thing\miner.py", line 5, in <module>
    urllib.request.urlretrieve(line, line.split('/'))
  File "C:\Python3\lib\urllib\request.py", line 188, in urlretrieve
    tfp = open(filename, 'wb')
TypeError: invalid file: ['http:', '', 'en.wikipedia.org', 'wiki', 'List_of_cities_and_towns_in_Alabama\n']
Any ideas on how to fix this and do what I want?
---EDIT---
Solution:
import urllib.request
for line in open("sites.txt", "r"):
    article = line.replace('\n', '')
    print("Pulling: " + article)
    urllib.request.urlretrieve(article, article.split('/')[-1] + ".html")
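For reference, the original error came from passing the list returned by split('/') as the file name, with the trailing newline still attached to the last element. A minimal sketch of deriving the file name separately is below; filename_from_url is a hypothetical helper name, not part of the script above, and the ".html" suffix simply mirrors the solution.

```python
from urllib.parse import urlsplit


def filename_from_url(line, suffix=".html"):
    """Turn one line of sites.txt into a local file name (hypothetical helper)."""
    url = line.strip()                      # drop the trailing newline and spaces
    path = urlsplit(url).path               # path component only, without scheme or host
    name = path.rstrip("/").split("/")[-1]  # keep the last path segment
    return name + suffix


if __name__ == "__main__":
    line = "http://en.wikipedia.org/wiki/List_of_cities_and_towns_in_Alabama\n"
    print(filename_from_url(line))
```

The resulting string can be passed as the second argument to urllib.request.urlretrieve, which expects a single file name rather than a list.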
Answer 0 (score: 1)
Try this (I prefer the requests library):
import requests
with open('sites.txt', 'r') as url_list:
    for url in url_list:
        url = url.strip()  # remove the trailing newline before requesting
        print("Getting: " + url)
        r = requests.get(url)
        # do whatever you want with the page text,
        # which is available as r.text
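If the goal is to save each page to disk, as in the question, the loop could be extended roughly as follows. This is only a sketch: save_pages and fetch_with_requests are hypothetical names, the output naming follows the asker's own solution, and the fetch parameter is there purely so the saving logic can be tried without network access.

```python
import os


def fetch_with_requests(url):
    """Download a page body as text; needs the third-party requests package."""
    import requests  # imported here so the rest of the sketch works without it
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.text


def save_pages(lines, out_dir=".", fetch=fetch_with_requests):
    """Fetch every non-blank URL in lines and write it into out_dir."""
    saved = []
    for line in lines:
        url = line.strip()  # drop the trailing newline
        if not url:
            continue        # skip empty lines
        name = url.rstrip("/").split("/")[-1] + ".html"
        path = os.path.join(out_dir, name)
        with open(path, "w", encoding="utf-8") as out:
            out.write(fetch(url))
        saved.append(path)
    return saved


if __name__ == "__main__":
    with open("sites.txt", "r") as url_list:
        for path in save_pages(url_list):
            print("Saved: " + path)
```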
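Answer 1 (score: 0)
Code to extract links from a web page (Python 2 only; htmllib and formatter are not available in current Python 3):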
import urllib, htmllib, formatter
website = urllib.urlopen("http://en.wikipedia.org")
data = website.read()
website.close()
format = formatter.AbstractFormatter(formatter.NullWriter())
ptext = htmllib.HTMLParser(format)
ptext.feed(data)
for link in ptext.anchorlist:
    print(link)
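Since the question uses Python 3, a rough equivalent built on the standard-library html.parser module might look like the sketch below. LinkCollector and extract_links are hypothetical names introduced here, and decoding the page as UTF-8 is an assumption.

```python
from html.parser import HTMLParser
import urllib.request


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # tag and attribute names arrive lower-cased
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html_text):
    """Return all anchor targets found in html_text, in document order."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


if __name__ == "__main__":
    with urllib.request.urlopen("http://en.wikipedia.org") as website:
        data = website.read().decode("utf-8", errors="replace")
    for link in extract_links(data):
        print(link)
```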
Full web page content and response extractor:
import urllib.request
response = urllib.request.urlopen('http://en.wikipedia.org')
print('RESPONSE:', response)
print('URL :', response.geturl())
headers = response.info()
print('DATE :', headers['date'])
print('HEADERS :')
print('---------')
print(headers)
data = response.read()  # raw bytes of the response body
print('LENGTH :', len(data))
print('DATA :')
print('---------')
print(data)