Question

I have the following markup in my HTML, and I want to extract only the href values, i.e. Quatermass_2_Vintage_Movie_Poster-61-10782 and the one for Hard Day's Night.

<span class="small">
                                Ref.No:10782<br/>
<a href="Quatermass_2_Vintage_Movie_Poster-61-10782" title="Click for more details and a larger picture of Quatermass 2">
                                Click for more details and a larger picture of <b>Quatermass 2</b>
</a>
</span>, <span class="small">
                                Ref.No:10781<br/>
<a href="Hard_Day__039_s_Night_Vintage_Movie_Poster-61-10781" title="Click for more details and a larger picture of Hard Day's Night">
                                Click for more details and a larger picture of <b>Hard Day's Night</b>
</a>
</span>

The following Python code only lets me find the whole tag:

from bs4 import BeautifulSoup

html = ['table2.html']

with open("table2.html", "r") as f:
    contents = f.read()


soup = BeautifulSoup(contents, "lxml")

for name in soup.find_all("span", {"class": "small"}):
    print(name)

But I cannot select just the href. I tried:

for name in soup.find_all("span", {"class": "small"}.get(href)):
    print(name)

I also tried putting the href lookup in the print statement:

for name in soup.find_all("span", {"class": "small"}:
    print(name.get('href'))

Can anyone help?

Answer 1

Once you have the span tag, you need to find the a tag inside it and then read its href attribute.

Something like this will work:

for name in soup.find_all("span", {"class": "small"}):
    print(name.find("a").get("href"))
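One caveat with the loop above: if a span with class "small" contains no link, name.find("a") returns None and the .get call raises AttributeError. Below is a minimal self-contained sketch that guards against that case. The function name extract_hrefs, the trimmed sample markup (titles shortened to "..."), and the extra link-less span are assumptions added for illustration, and it uses the built-in "html.parser" backend instead of "lxml" only so that it runs without an extra dependency.

```python
from bs4 import BeautifulSoup

# Sample markup adapted from the question (whitespace and titles trimmed).
# The last span has no link; it is a hypothetical case added for illustration.
SAMPLE_HTML = """
<span class="small">Ref.No:10782<br/>
<a href="Quatermass_2_Vintage_Movie_Poster-61-10782" title="...">Click for more details</a>
</span>, <span class="small">Ref.No:10781<br/>
<a href="Hard_Day__039_s_Night_Vintage_Movie_Poster-61-10781" title="...">Click for more details</a>
</span>
<span class="small">No link here</span>
"""


def extract_hrefs(html):
    """Return the href of the first <a> inside each <span class="small">."""
    soup = BeautifulSoup(html, "html.parser")
    hrefs = []
    for span in soup.find_all("span", {"class": "small"}):
        # href=True only matches <a> tags that actually carry an href attribute.
        link = span.find("a", href=True)
        if link is not None:  # skip spans without a link
            hrefs.append(link["href"])
    return hrefs


print(extract_hrefs(SAMPLE_HTML))
```

With the sample markup this should print the two href values from the question and silently skip the span that has no link.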

Answer 2

You can use a regular expression to extract the values, like this:

import re

text = 'adde <a href="coedd.com" > algo</a>'

# Match href="..." where the value contains only letters, digits, _, - or .
patt = r'href="[a-zA-Z0-9_\-.]+"'

search = re.findall(patt, text, re.I)

print(search)

This returns a list of all the matches.
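Note that the pattern above returns each match with the surrounding href="..." text included, and it only accepts letters, digits, underscores, hyphens and dots in the value. A hedged variation, assuming the attribute values are always wrapped in double quotes, uses a capture group so that only the value itself is returned. The names HREF_PATTERN and find_hrefs are illustrative, not part of the original answer, and a regular expression remains a fragile way to read HTML compared with a real parser.

```python
import re

# The capture group keeps only the text between the double quotes.
# [^"]* accepts any character except a double quote, so values containing
# slashes, colons and so on are matched as well.
HREF_PATTERN = re.compile(r'href="([^"]*)"', re.IGNORECASE)


def find_hrefs(text):
    """Return every double-quoted href value found in text."""
    return HREF_PATTERN.findall(text)


sample = (
    'adde <a href="coedd.com" > algo</a> '
    '<a href="Hard_Day__039_s_Night_Vintage_Movie_Poster-61-10781">x</a>'
)
print(find_hrefs(sample))
```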

I hope this helps.

Regards.

Extracting "href" from a tag using Beautiful Soup
