Question

So, I have this:

<h1 class='entry-title'>
    <a href='http://theurlthatvariesinlengthbasedonwhenirequesthehtml'>theTitleIneedthatvariesinlength</a>
</h1>

How can I retrieve the URL (it is not always the same) and the title (also not always the same)?

Answer 1

Parse it with an HTML parser. With BeautifulSoup, for example, it would be:

from bs4 import BeautifulSoup

data = "your HTML here"  # e.g. the result of urllib.request.urlopen(url).read()

soup = BeautifulSoup(data, "html.parser")  # name the parser explicitly
link = soup.select("h1.entry-title > a")[0]  # first matching <a>

print(link.get("href"))
print(link.get_text())

where h1.entry-title > a is a CSS selector that matches an a element that is a direct child of an h1 element with class="entry-title".

Answer 2

Well, working with just the string, you can do:

>>> s = '''<h1 class='entry-title'>
...     <a href='http://theurlthatvariesinlengthbasedonwhenirequesthehtml'>theTitleIneedthatvariesinlength</a>
... </h1>'''
>>> s.split('>')[1].strip().split('=')[1].strip("'")
'http://theurlthatvariesinlengthbasedonwhenirequesthehtml'
>>> s.split('>')[2][:-3]
'theTitleIneedthatvariesinlength'

There are other (and better) options available for parsing HTML.
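
One such option, if you would rather not install anything, is the standard library's html.parser. The following is only a rough sketch, not a drop-in solution: the names EntryTitleParser and extract_entry_links are made up for illustration, and unlike the "h1.entry-title > a" selector it accepts any a nested inside the h1, not only direct children.

```python
from html.parser import HTMLParser


class EntryTitleParser(HTMLParser):
    """Collect (href, text) for each <a> inside <h1 class="entry-title">."""

    def __init__(self):
        super().__init__()
        self.in_title = False  # currently inside <h1 class="entry-title">
        self.in_link = False   # currently inside an <a> within that h1
        self.href = None       # href of the <a> being read
        self.text = []         # text chunks of the <a> being read
        self.links = []        # finished (href, text) pairs

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h1" and "entry-title" in (attrs.get("class") or "").split():
            self.in_title = True
        elif tag == "a" and self.in_title:
            self.in_link = True
            self.href = attrs.get("href")
            self.text = []

    def handle_data(self, data):
        # Text may arrive in several chunks, so collect and join later.
        if self.in_link:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.links.append((self.href, "".join(self.text).strip()))
            self.in_link = False
        elif tag == "h1":
            self.in_title = False


def extract_entry_links(html):
    """Return a list of (href, text) tuples found in the given HTML string."""
    parser = EntryTitleParser()
    parser.feed(html)
    parser.close()
    return parser.links


sample = """<h1 class='entry-title'>
    <a href='http://theurlthatvariesinlengthbasedonwhenirequesthehtml'>theTitleIneedthatvariesinlength</a>
</h1>"""

links = extract_entry_links(sample)
for href, title in links:
    print(href)
    print(title)
```

Because it follows the tags and not the exact character positions, this should keep working when the URL or title changes length, or when the whitespace around the tags is different.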
