Question

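I've worked through a tutorial, and I want my scraper to collect every link on one specific page, where each link points to information about a police station. Instead, it returns almost the entire site.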

from urllib import urlopen
import re

f = urlopen("http://www.emergencyassistanceuk.co.uk/list-of-uk-police-stations.html").read()

b = re.compile('<span class="listlink-police"><a href="(.*)">')
a = re.findall(b, f)

listiterator = []
listiterator[:] = range(0,16)

for i in listiterator:
    print a 
    print "\n"

f.close()

Answer 1

Use BeautifulSoup:

from bs4 import BeautifulSoup
from urllib2 import urlopen

f = urlopen("http://www.emergencyassistanceuk.co.uk/list-of-uk-police-stations.html").read()

# Name the parser explicitly; "html.parser" ships with Python's standard library
bs = BeautifulSoup(f, "html.parser")

for tag in bs.find_all('span', {'class': 'listlink-police'}):
    print tag.a['href']

Answer 2

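You are using a regular expression to parse HTML. You shouldn't, because you end up with exactly this kind of problem. For a start, the .* wildcard is greedy: it matches as much text as it possibly can. But once you fix that, you'll only be picking the next fruit off the tree of frustration. Use a proper HTML parser instead.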
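To make the greediness point concrete, here is a minimal sketch. The markup in `html` is a made-up sample (two links on one line), not the real page, and the code uses function-style `print()` so it should run unchanged on Python 3.

```python
import re

# Hypothetical sample markup: two links on ONE line, which is where greediness bites.
html = ('<span class="listlink-police"><a href="a.html">A</a></span>'
        '<span class="listlink-police"><a href="b.html">B</a></span>')

# Greedy: (.*) runs on to the LAST '">' on the line, swallowing everything between.
greedy = re.findall(r'<span class="listlink-police"><a href="(.*)">', html)

# Non-greedy: (.*?) stops at the FIRST '">' after each href.
lazy = re.findall(r'<span class="listlink-police"><a href="(.*?)">', html)

print(len(greedy))  # a single, oversized match
print(lazy)         # one entry per link
```

Since `.` does not match a newline by default, the greedy pattern only misbehaves when several links share a line. That is one more reason a regex is fragile here: the result depends on how the HTML happens to be wrapped.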

Answer 3

There are over 1.6k links on that page.

I think it's working correctly... what makes you think it isn't?
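One possible reason the output looks like the whole site: the loop in the question prints the full list `a` on each of its 16 passes, rather than one element per pass. A rough sketch of the difference, where `links` is a hypothetical stand-in for the list that `re.findall` returns (1600 entries is only an approximation of "over 1.6k"):

```python
# Hypothetical stand-in for the list that re.findall() returns in the question.
links = ["station-%d.html" % n for n in range(1600)]

# As written in the question, the WHOLE list is printed on every pass,
# so 16 passes emit 16 * 1600 entries.
printed_by_question = sum(len(links) for i in range(0, 16))

# Indexing by i yields one link per pass instead.
first_sixteen = [links[i] for i in range(0, 16)]
for link in first_sixteen:
    print(link)
```

In the original code that would mean `print a[i]` rather than `print a`, or simply iterating over `a` directly.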

You should definitely use Beautiful Soup, though. It's simple and extremely useful.

Webscraper not working
