So, I have this:
<h1 class='entry-title'>
<a href='http://theurlthatvariesinlengthbasedonwhenirequesthehtml'>theTitleIneedthatvariesinlength</a>
</h1>
How can I retrieve the URL (it is not always the same) and the title (also not always the same)?
答案 0 :(得分:0)
使用 HTML解析器解析它,例如使用BeautifulSoup
,它将是:
from bs4 import BeautifulSoup
data = "your HTML here" # data can be the result of urllib2.urlopen(url)
soup = BeautifulSoup(data)
link = soup.select("h1.entry-title > a")[0]
print link.get("href")
print link.get_text()
其中h1.entry-title > a
是与a
元素直接匹配h1
元素的CSS selector与class="entry-title"
。
答案 1 :(得分:0)
好吧,只需使用字符串,就可以
>>> s = '''<h1 class='entry-title'>
... <a href='http://theurlthatvariesinlengthbasedonwhenirequesthehtml'>theTitleIneedthatvariesinlength</a>
... </h1>'''
>>> s.split('>')[1].strip().split('=')[1].strip("'")
'http://theurlthatvariesinlengthbasedonwhenirequesthehtml'
>>> s.split('>')[2][:-3]
'theTitleIneedthatvariesinlength'
还有其他(和更好的)选项可用于解析HTML。