Question

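I want to extract the following from each "article" tag:

the link
the latitude and longitude
the number of photos for each house

But this doesn't work.

Here is the Python code: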

import urllib
import urllib2
import re
import socket

def getPage(infoUrl):
    url = infoUrl
    try:
        request =  urllib2.Request(url)
        request.add_header("User-Agent","Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:21.0) Gecko/20100101 Firefox/21.0")
        response = urllib2.urlopen(request)
    except urllib2.URLError, e:
        print "Bad Url or timeout"
        print type(e)
        print e
        return ''
    except socket.timeout,e:
        print "socket timeout"
        print type(e)
        print e
        return ''
    else:
        return response.read().decode('utf8')
        print "Done"

pattern = re.compile(r'<article.*?latitude="(.*?)".*?longtitude="(.*?)"><a href="(.*?)".*?<figcaption.*?>(.*?)</figcaption>.*?</a>',re.S)

infoUrl = 'http://www.zillow.com/homes/MA-02139_rb/'
page = getPage(infoUrl)

items = re.findall(pattern,page)
print items
for item in items:
    print item

By the way, this Python script runs very slowly.

Any suggestions for optimizing it?

Answer 1

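I strongly recommend using a library like Beautiful Soup to parse the HTML. This is exactly the use case it was built for, and it will work better than your regular expression.

For example: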

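from bs4 import BeautifulSoup  # assumes Beautiful Soup 4 is installed
soup = BeautifulSoup(your_html_text, "html.parser")
article = soup.article  # the first <article> tag in the document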

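This will give you the <article> tag.

Edit: Since the question has just changed, take a look at the BeautifulSoup documentation linked above. It will answer your basic question.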
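
To illustrate the general idea of using a real HTML parser instead of a regular expression, here is a minimal sketch. It uses the standard library's html.parser rather than Beautiful Soup so that it runs without installing anything, and it is written for Python 3 rather than the Python 2 of the question. The sample markup, the attribute names latitude and longitude, and the class name ArticleParser are assumptions modelled on the regular expression in the question; the real page may be structured differently.

```python
from html.parser import HTMLParser

# Hypothetical markup modelled on the regex in the question;
# the real page may use different tags and attribute names.
SAMPLE_HTML = """
<article id="a1" latitude="42364000" longitude="-71104000">
  <a href="/homedetails/1/"><figure><img src="x.jpg">
  <figcaption class="photo-count">12 photos</figcaption></figure></a>
</article>
<article id="a2" latitude="42370000" longitude="-71110000">
  <a href="/homedetails/2/"><figure><img src="y.jpg">
  <figcaption class="photo-count">3 photos</figcaption></figure></a>
</article>
"""


class ArticleParser(HTMLParser):
    """Collects link, coordinates and figcaption text for each <article>."""

    def __init__(self):
        super().__init__()
        self.items = []           # one dict per finished <article>
        self._current = None      # dict for the <article> being read, if any
        self._in_caption = False  # True while inside a <figcaption>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "article":
            self._current = {
                "latitude": attrs.get("latitude"),
                "longitude": attrs.get("longitude"),
                "link": None,
                "photos": "",
            }
        elif self._current is not None:
            # Keep only the first link found inside the article.
            if tag == "a" and self._current["link"] is None:
                self._current["link"] = attrs.get("href")
            elif tag == "figcaption":
                self._in_caption = True

    def handle_data(self, data):
        # Accumulate the caption text, e.g. "12 photos".
        if self._in_caption and self._current is not None:
            self._current["photos"] += data.strip()

    def handle_endtag(self, tag):
        if tag == "figcaption":
            self._in_caption = False
        elif tag == "article" and self._current is not None:
            self.items.append(self._current)
            self._current = None


parser = ArticleParser()
parser.feed(SAMPLE_HTML)
for item in parser.items:
    print(item)
```

Unlike the regular expression, this does not depend on the attributes appearing in a particular order or on the link following the opening tag immediately. With Beautiful Soup the same extraction would likely be shorter, since each article's attributes and child tags can be looked up directly.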
