Pulling down Wikipedia pages in Python

Asked: 2014-11-15 02:17:31

Tags: python file download wiki wikipedia

So I have a text file with a bunch of Wikipedia links:

http://en.wikipedia.org/wiki/List_of_cities_and_towns_in_Alabama
http://en.wikipedia.org/wiki/List_of_cities_and_census-designated_places_in_Alaska
http://en.wikipedia.org/wiki/List_of_cities_and_towns_in_Arizona
http://en.wikipedia.org/wiki/List_of_cities_and_towns_in_Arkansas
http://en.wikipedia.org/wiki/List_of_cities_and_towns_in_California
http://en.wikipedia.org/wiki/List_of_cities_and_towns_in_Colorado
... etc

and the following Python script, which is meant to pull down the HTML of each page:

import urllib.request
for line in open("sites.txt", "r"):
  print("Pulling: " + line)
  urllib.request.urlretrieve(line, line.split('/'))

But when I run it, I get the following error:

Traceback (most recent call last):
File "C:\Users\brandon\Desktop\site thing\miner.py", line 5, in <module>
  urllib.request.urlretrieve(line, line.split('/'))
File "C:\Python3\lib\urllib\request.py", line 188, in urlretrieve
  tfp = open(filename, 'wb')
TypeError: invalid file: ['http:', '', 'en.wikipedia.org', 'wiki', 'List_of_cities_and_towns_in_Alabama\n']

Any ideas on how to fix this and get it to do what I want?

--- EDIT ---

Solution: the second argument to urlretrieve has to be a filename string, but line.split('/') returns a list; stripping the trailing newline and keeping only the last path component as the filename fixes it:

import urllib.request
for line in open("sites.txt", "r"):
  article = line.replace('\n', '')  # strip the trailing newline
  print("Pulling: " + article)
  # save each page under its article name, e.g. List_of_cities_and_towns_in_Alabama.html
  urllib.request.urlretrieve(article, article.split('/')[-1] + ".html")

2 Answers:

Answer 0 (score: 1):

Try this (I prefer the requests library):

import requests

with open('sites.txt', 'r') as url_list:
    for url in url_list:
        url = url.strip()  # drop the trailing newline from each line
        print("Getting: " + url)
        r = requests.get(url)
        # do whatever you want with the text,
        # using r.text to access it
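
If the goal is still to save each page's HTML to disk, a minimal sketch building on the above (the output filenames are my own choice, not part of the original answer) could be:

import requests

with open('sites.txt', 'r') as url_list:
    for url in url_list:
        url = url.strip()          # drop the trailing newline
        if not url:
            continue               # skip blank lines
        print("Getting: " + url)
        r = requests.get(url)
        r.raise_for_status()       # fail loudly on a bad response
        # name the file after the last path component, e.g.
        # List_of_cities_and_towns_in_Alabama.html
        filename = url.split('/')[-1] + ".html"
        with open(filename, 'w', encoding='utf-8') as out:
            out.write(r.text)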

Answer 1 (score: 0):

Code to extract the links from a web page (note: this answer is written for Python 2; urllib.urlopen and htmllib do not exist in Python 3):

import urllib, htmllib, formatter

# fetch the raw HTML of the page (Python 2 urllib)
website = urllib.urlopen("http://en.wikipedia.org")
data = website.read()
website.close()

# parse the HTML; htmllib collects every <a href> into anchorlist
format = formatter.AbstractFormatter(formatter.NullWriter())
ptext = htmllib.HTMLParser(format)
ptext.feed(data)
for link in ptext.anchorlist:
    print(link)
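
Since the question is using Python 3, here is a rough equivalent sketch with the standard html.parser module (my own adaptation, not part of the original answer):

import urllib.request
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect the href of every <a> tag, similar to htmllib's anchorlist."""
    def __init__(self):
        super().__init__()
        self.anchorlist = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.anchorlist.append(value)

with urllib.request.urlopen("http://en.wikipedia.org") as website:
    data = website.read().decode('utf-8', errors='replace')

parser = AnchorCollector()
parser.feed(data)
for link in parser.anchorlist:
    print(link)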

Full web page content & response extractor (also Python 2):

import urllib

response = urllib.urlopen('http://en.wikipedia.org')
print 'RESPONSE:', response
print 'URL     :', response.geturl()

headers = response.info()
print 'DATE    :', headers['date']
print 'HEADERS :'
print '---------'
print headers

data = response.read()
print 'LENGTH  :', len(data)
print 'DATA    :'
print '---------'
print data
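
For reference, roughly the same thing in Python 3, where urlopen lives in urllib.request and print is a function, might look like:

import urllib.request

with urllib.request.urlopen('http://en.wikipedia.org') as response:
    print('RESPONSE:', response)
    print('URL     :', response.geturl())

    headers = response.info()
    print('DATE    :', headers['date'])
    print('HEADERS :')
    print('---------')
    print(headers)

    data = response.read()
    print('LENGTH  :', len(data))
    print('DATA    :')
    print('---------')
    print(data.decode('utf-8', errors='replace'))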