Question

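I am trying to collect cost-of-living index data for every city in Texas from the following website: http://www.city-data.com/city/Texas.html

What is the easiest way to scrape data from a web page? I tried a Chrome extension called Web Scraper, without success. I think using the XML package or trying Scrapy might work better. I looked up both approaches but got a bit lost, and I'm looking for some direction. Any input would help.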

Answer 1

You can use BeautifulSoup4 (bs4) to parse and read the HTML data. Take a look at this example:

In [4]: from urllib2 import urlopen

In [5]: citylinkpage = urlopen("http://www.city-data.com/city/Texas.html")

In [7]: from bs4 import BeautifulSoup as BS

In [8]: soup = BS(citylinkpage)

In [9]: allImportantLinks = soup.select('table.cityTAB td.ph a')

In [10]: print allImportantLinks[:10]
[<a href='javascript:l("Abbott");'>Abbott</a>, <a href='javascript:l("Abernathy");'>Abernathy</a>, <a href="Abilene-Texas.html">Abilene, TX</a>, <a href="Addison-Texas.html">Addison, TX</a>, <a href="Alamo-Heights-Texas.html">Alamo Heights</a>, <a href='javascript:l("Albany");'>Albany, TX</a>, <a href="Alice-Texas.html">Alice</a>, <a href="Allen-Texas.html">Allen, TX</a>, <a href='javascript:l("Alma");'>Alma, TX</a>, <a href="Alpine-Texas.html">Alpine, TX</a>]

In [14]: allCityUrls = ["http://www.city-data.com/city/"+a.get('href') for a in allImportantLinks if not a.get('href').startswith('javascript:')]

In [15]: allCityUrls
Out[15]: 
['http://www.city-data.com/city/Abilene-Texas.html',
 'http://www.city-data.com/city/Addison-Texas.html',
 'http://www.city-data.com/city/Alamo-Heights-Texas.html',
 'http://www.city-data.com/city/Alice-Texas.html',
 'http://www.city-data.com/city/Allen-Texas.html',
 'http://www.city-data.com/city/Alpine-Texas.html',
 'http://www.city-data.com/city/Amarillo-Texas.html',
...
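
The link-filtering step above can also be sketched without a live request. This is only a minimal illustration, written for Python 3 and using the standard library's html.parser instead of bs4, so it collects every <a> tag rather than only those matching 'table.cityTAB td.ph a'. SAMPLE_HTML is made-up markup imitating the output shown above, and LinkCollector and city_urls are hypothetical helper names, not part of any library.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "http://www.city-data.com/city/Texas.html"

# Made-up sample markup imitating the link list printed in the session.
SAMPLE_HTML = """
<table class="cityTAB"><tr>
<td class="ph"><a href='javascript:l("Abbott");'>Abbott</a></td>
<td class="ph"><a href="Abilene-Texas.html">Abilene, TX</a></td>
<td class="ph"><a href="Addison-Texas.html">Addison, TX</a></td>
</tr></table>
"""


class LinkCollector(HTMLParser):
    """Hypothetical helper: collect the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def city_urls(html, base_url=BASE_URL):
    """Return absolute URLs for all links that are not javascript: pseudo-links."""
    collector = LinkCollector()
    collector.feed(html)
    return [
        urljoin(base_url, href)  # resolve relative hrefs against the index page
        for href in collector.hrefs
        if not href.lower().startswith("javascript:")
    ]


print(city_urls(SAMPLE_HTML))
```

Resolving with urljoin instead of string concatenation keeps the result correct if a link ever turns out to be absolute or rooted at "/".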

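Because each city's page appears to be badly formed HTML (especially around this index), it seems better to search the page with a regular expression (using the built-in re module):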

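import re

# Fetch one city page; urlopen is the function imported from urllib2 above.
cityPageAdress = "http://www.city-data.com/city/Abilene-Texas.html"
pageSourceCode = urlopen(cityPageAdress).read()
# Capture the number between "cost of living index in <city>:</b>" and the next <b>.
expr = re.compile(r"cost of living index in .*?:</b>\s*(\d+(\.\d+)?)\s*<b>")
# findall returns (number, fractional part) tuples; take the number from the first match.
print expr.findall(pageSourceCode)[0][0]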
Out: 83.5
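
To try the regular expression without hitting the site, something like the following might work. It is a rough Python 3 sketch, not the answer's exact code: SAMPLE is a made-up fragment shaped to fit the pattern (the real page markup may differ), cost_of_living_index is a hypothetical helper name, and the inner group is made non-capturing so the match yields a single string.

```python
import re

# Same pattern as above, with a non-capturing group for the fractional part.
COST_RE = re.compile(r"cost of living index in .*?:</b>\s*(\d+(?:\.\d+)?)\s*<b>")


def cost_of_living_index(page_source):
    """Return the index as a float, or None if the pattern is not found."""
    if isinstance(page_source, bytes):
        # In Python 3, urlopen(...).read() returns bytes; the encoding is an assumption.
        page_source = page_source.decode("utf-8", errors="replace")
    match = COST_RE.search(page_source)
    return float(match.group(1)) if match else None


# Made-up fragment for illustration only.
SAMPLE = "<b>cost of living index in Abilene:</b> 83.5 <b>(low)</b>"

print(cost_of_living_index(SAMPLE))                  # 83.5
print(cost_of_living_index("<p>no index here</p>"))  # None
```

Returning None instead of indexing into an empty findall result avoids an IndexError on city pages where the index is missing.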

Answer 2

Try Scrapy. Check out my blog post on recursive scraping.
