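I'm trying to retrieve the URLs and category names from Amazon's Best Sellers list. For some reason my RE stops when it reaches /ref=, and I really don't understand why. I'm using Python 2.7 on a Windows 7 box.
A typical record is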
<li><a href="http://www.amazon.ca/Best-Sellers-Appstore-Android/zgbs/mobile-apps/ref=zg_bs_nav_0">Appstore for Android</a></li>
My RE is
Regex = "<li><a href='(http://www.amazon.ca/Best-Sellers.*?)'>(.*?)</a></li>"
Category = re.compile(Regex)
which returns a tuple
[][0] http://www.amazon.ca/Best-Sellers-Appstore-Android/zgbs/mobile-apps
[][1] Appstore for Android
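I do get all the right records, but as you can see, the URL is missing /ref=zg_bs_nav_0.
The other levels of the category hierarchy show the same problem: everything in the URL from /ref= onward is missing.
After taking Martijn's advice, this is my code snippet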
import urllib
from bs4 import BeautifulSoup

# First page of the list of Best Sellers categories
URL = "http://www.amazon.ca/gp/bestsellers"
# Retrieve the page source
HTMLFile = urllib.urlopen(URL)
HTMLText = HTMLFile.read()
soup = BeautifulSoup(HTMLText)
# Print the URL and name of every Best Sellers category link
for link in soup.select('li > a[href^=http://www.amazon.ca/Best-Sellers]'):
    print link['href']
    print link.get_text()
Answer 0 (score: 4)
You are using a regular expression, but matching XML with such expressions gets too complicated, too fast. Don't do that.
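Use an HTML parser instead; Python has several to choose from. Two of them, BeautifulSoup and lxml, also handle malformed HTML quite gracefully, making decent sense of many a botched website. In fact, BeautifulSoup 4 uses lxml as its parser of choice if it is installed.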
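To make the brittleness concrete, here is a minimal sketch that runs the pattern from the question against the sample record written with each of the two attribute-quoting styles HTML allows. The variable names are illustrative and not from the question, and the sketch is written for Python 3 (hence print as a function).

```python
import re

# Pattern copied from the question; it only accepts single-quoted attributes.
PATTERN = re.compile(
    r"<li><a href='(http://www.amazon.ca/Best-Sellers.*?)'>(.*?)</a></li>"
)

URL = ("http://www.amazon.ca/Best-Sellers-Appstore-Android"
       "/zgbs/mobile-apps/ref=zg_bs_nav_0")

# The same record, once with single and once with double quotes.
single_quoted = "<li><a href='" + URL + "'>Appstore for Android</a></li>"
double_quoted = '<li><a href="' + URL + '">Appstore for Android</a></li>'

matches_single = PATTERN.findall(single_quoted)
matches_double = PATTERN.findall(double_quoted)

print(matches_single)  # one (url, name) pair, with /ref=zg_bs_nav_0 intact
print(matches_double)  # empty list: the quote style alone defeats the pattern
```

Two things follow from this. The pattern does not itself stop at /ref=, so a truncated URL has to come from the HTML that was actually served. And a purely cosmetic change in the markup is enough to make the pattern miss every record, which is the kind of fragility a parser avoids.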
BeautifulSoup example:
from bs4 import BeautifulSoup
soup = BeautifulSoup(htmlsource)
for link in soup.select('li > a[href^=http://www.amazon.ca/Best-Sellers]'):
    print link['href'], link.get_text()
This uses a CSS selector to find all <a> elements directly contained in a <li> element, where the href attribute starts with the text http://www.amazon.ca/Best-Sellers.
Demo:
>>> from bs4 import BeautifulSoup
>>> htmlsource = '<li><a href="http://www.amazon.ca/Best-Sellers-Appstore-Android/zgbs/mobile-apps/ref=zg_bs_nav_0">Appstore for Android</a></li>'
>>> soup = BeautifulSoup(htmlsource)
>>> for link in soup.select('li > a[href^=http://www.amazon.ca/Best-Sellers]'):
...     print link['href'], link.get_text()
...
http://www.amazon.ca/Best-Sellers-Appstore-Android/zgbs/mobile-apps/ref=zg_bs_nav_0 Appstore for Android
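For readers who want to see what the selector is doing without installing anything, the following is a rough standard-library sketch of the same rule: an <a> whose direct parent is <li> and whose href starts with the prefix. It assumes Python 3 (the module is called HTMLParser on 2.7), the class and function names are made up for illustration, and it does no real tree building, so it is not a substitute for BeautifulSoup on messy pages.

```python
from html.parser import HTMLParser  # Python 3 name; on 2.7 the module is HTMLParser

PREFIX = "http://www.amazon.ca/Best-Sellers"
# Tags that never get a closing tag, so they must not count as a parent.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class BestSellerLinks(HTMLParser):
    """Collect (href, text) for <a> tags that are direct children of <li>
    and whose href starts with PREFIX."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # currently open tags, innermost last
        self.links = []       # collected (href, text) pairs
        self._href = None     # href of the qualifying <a> being read
        self._text = []       # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        parent = self.open_tags[-1] if self.open_tags else None
        if tag == "a" and parent == "li":
            href = dict(attrs).get("href") or ""
            if href.startswith(PREFIX):
                self._href, self._text = href, []
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass


def find_links(html):
    parser = BestSellerLinks()
    parser.feed(html)
    parser.close()
    return parser.links


htmlsource = ('<li><a href="http://www.amazon.ca/Best-Sellers-Appstore-Android'
              '/zgbs/mobile-apps/ref=zg_bs_nav_0">Appstore for Android</a></li>')
print(find_links(htmlsource))
```

On the sample record this yields the same URL and category name as the demo above. One side note for current library versions: recent BeautifulSoup releases delegate select() to the soupsieve package, which expects the attribute value to be quoted, as in a[href^="http://www.amazon.ca/Best-Sellers"].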
Note that Amazon alters the response based on the request headers:
>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get('http://www.amazon.ca/gp/bestsellers')
>>> soup = BeautifulSoup(r.content)
>>> soup.select('li > a[href^=http://www.amazon.ca/Best-Sellers]')[0]
<a href="http://www.amazon.ca/Best-Sellers-Appstore-Android/zgbs/mobile-apps">Appstore for Android</a>
>>> r = requests.get('http://www.amazon.ca/gp/bestsellers', headers={
...     'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36'})
>>> soup = BeautifulSoup(r.content)
>>> soup.select('li > a[href^=http://www.amazon.ca/Best-Sellers]')[0]
<a href="http://www.amazon.ca/Best-Sellers-Appstore-Android/zgbs/mobile-apps/ref=zg_bs_nav_0/185-3312534-9864113">Appstore for Android</a>
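The question fetches the page with urllib rather than requests, so here is a hedged sketch of sending the same header using only the standard library. The build_request helper is a hypothetical name introduced for illustration, the User-Agent string is the one from the session above, and the actual network call is left commented out; whether Amazon still varies its markup this way is not something this sketch can confirm.

```python
try:
    from urllib.request import Request, urlopen  # Python 3
except ImportError:
    from urllib2 import Request, urlopen         # Python 2.7

URL = "http://www.amazon.ca/gp/bestsellers"
# Browser-like User-Agent copied from the requests session above.
USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"
)


def build_request(url=URL, user_agent=USER_AGENT):
    """Return a Request object that carries a browser-like User-Agent."""
    return Request(url, headers={"User-Agent": user_agent})


request = build_request()

# Uncomment to actually fetch the page (requires network access):
# html_text = urlopen(request).read()
```

On Python 2.7 note that urllib.urlopen, as used in the question, takes only a URL string; it is urllib2.urlopen that accepts a Request object with custom headers.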