I'm struggling with Python and HTML. I know the class of a certain header, and I need the information from a generic <a href...> inside that h1

Asked: 2015-05-06 14:52:21

Tags: python html urllib2

So, I have this:

<h1 class='entry-title'>
    <a href='http://theurlthatvariesinlengthbasedonwhenirequesthehtml'>theTitleIneedthatvariesinlength</a>
</h1>

How can I retrieve the URL (it is not always the same) and the title (also not always the same)?

2 answers:

Answer 0 (score: 0)

Parse it with an HTML parser. With BeautifulSoup, for example, it would be:

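from bs4 import BeautifulSoup

data = "your HTML here"  # data can be the result of urllib2.urlopen(url)

# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(data, "html.parser")
# select() returns a list; [0] raises IndexError if nothing matches
link = soup.select("h1.entry-title > a")[0]

print(link.get("href"))   # the URL
print(link.get_text())    # the title text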

Here h1.entry-title > a is a CSS selector that matches an a element that is a direct child of an h1 element with class="entry-title".

Answer 1 (score: 0)

Well, using only string methods, you can do:

>>> s = '''<h1 class='entry-title'>
...     <a href='http://theurlthatvariesinlengthbasedonwhenirequesthehtml'>theTitleIneedthatvariesinlength</a>
... </h1>'''
>>> s.split('>')[1].strip().split('=')[1].strip("'")
'http://theurlthatvariesinlengthbasedonwhenirequesthehtml'
>>> s.split('>')[2][:-3]
'theTitleIneedthatvariesinlength'

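There are other (and better) options available for parsing HTML. The splits above depend on the exact layout of this snippet and break as soon as the markup changes.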
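One such option, if installing a third-party library is not possible, is the standard library's HTML parser. The following is only a minimal sketch, assuming Python 3 (where the module is html.parser; the question itself targets Python 2 with urllib2). The names EntryTitleParser and extract_entry_links are hypothetical, introduced here for illustration. Unlike a direct-child CSS selector, this sketch accepts any a element nested anywhere inside the h1.

```python
from html.parser import HTMLParser


class EntryTitleParser(HTMLParser):
    """Collect (href, text) for each <a> nested inside <h1 class="entry-title">.

    Hypothetical helper class, not part of any library.
    """

    def __init__(self):
        super().__init__()
        self.in_title = False   # True while inside <h1 class="entry-title">
        self.in_link = False    # True while inside an <a> within that <h1>
        self.href = None        # href of the <a> currently being read
        self.text_parts = []    # text chunks of the <a> currently being read
        self.links = []         # collected (href, text) pairs

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "h1" and "entry-title" in classes:
            self.in_title = True
        elif tag == "a" and self.in_title:
            self.in_link = True
            self.href = attrs.get("href")
            self.text_parts = []

    def handle_data(self, data):
        if self.in_link:
            self.text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.links.append((self.href, "".join(self.text_parts).strip()))
            self.in_link = False
        elif tag == "h1":
            self.in_title = False


def extract_entry_links(html):
    """Return a list of (href, text) pairs; empty if nothing matches."""
    parser = EntryTitleParser()
    parser.feed(html)
    parser.close()
    return parser.links


html = """<h1 class='entry-title'>
    <a href='http://theurlthatvariesinlengthbasedonwhenirequesthehtml'>theTitleIneedthatvariesinlength</a>
</h1>"""

links = extract_entry_links(html)
for href, title in links:
    print(href)
    print(title)
```

Returning a list rather than indexing with [0] means a page without a matching header yields an empty result instead of an exception, which is easier to handle when the fetched HTML changes between requests.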