Extracting href from an attribute with BeautifulSoup

Date: 2010-07-15 10:06:59

Tags: beautifulsoup

I am using this approach:

allcity = dom.body.findAll(attrs={'id' : re.compile("\d{1,2}")})

It returns a list like this:

[<a onmousedown="return c({'fm':'as','F':'77B717EA','F1':'9D73F1E4','F2':'4CA6DE6B','F3':'54E5243F','T':'1279189248','title':this.innerHTML,'url':this.href,'p1':1,'y':'B2D76EFF'})" href="http://www.ylyd.com/showurl.asp?id=6182" target="_blank"><font size="3">掳虏驴碌路驴碌脴虏煤脨脜脧垄脥酶 隆煤 脢脦脝路脦露脕卢陆脫</font></a>, 
<a href="http://cache.baidu.com/c?m=9f65cb4a8c8507ed4fece763105392230e54f728629c86027fa3c215cc791a1b1a23a4fb7935107380843e7000db120afdf14076340920a3de95c81cd2ace52f38fb5023716c914b19c46ea8dc4755d650e34d99aa0ee6cae74596b9a1d6c85523dd58716df7f49c5b7003c065e76445&amp;p=8b2a9403c0934eaf5abfc8385864&amp;user=baidu" target="_blank" class="m">掳脵露脠驴矛脮脮</a>]

How can I extract this href?

http://www.ylyd.com/showurl.asp?id=6182

Thanks. :)

2 answers:

Answer 0 (score: 0)

You can use:

for a in dom.body.findAll(attrs={'id': re.compile(r"\d{1,2}")}, href=True):
    print(a['href'])
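A minimal, self-contained sketch of this pattern. The sample anchors and their numeric `id` values below are hypothetical, mirroring the question's markup; `href=True` makes `find_all` skip anchors that have no `href` attribute at all:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample resembling the question's page: anchors with numeric ids.
html = '''
<a id="1" href="http://www.ylyd.com/showurl.asp?id=6182">first</a>
<a id="12" href="http://cache.baidu.com/c?m=abc" class="m">second</a>
'''

soup = BeautifulSoup(html, 'html.parser')

# attrs={'id': re.compile(...)} matches tags whose id attribute contains 1-2 digits;
# href=True additionally requires an href attribute to be present.
links = [a['href']
         for a in soup.find_all(attrs={'id': re.compile(r'\d{1,2}')}, href=True)]
print(links)
```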

Answer 1 (score: 0)

In this example, there is no real need for a regex; you can simply access the <a> tag and then its ['href'] attribute, like so:

get_me_url = soup.a['href'] # http://www.ylyd.com/showurl.asp?id=6182
# cached URL
get_me_cached_url = soup.find('a', class_='m')['href']

You can always use the prettify() method to get a better look at the HTML.

from bs4 import BeautifulSoup

string = '''
[
<a href="http://www.ylyd.com/showurl.asp?id=6182" onmousedown="return c({'fm':'as','F':'77B717EA','F1':'9D73F1E4','F2':'4CA6DE6B','F3':'54E5243F','T':'1279189248','title':this.innerHTML,'url':this.href,'p1':1,'y':'B2D76EFF'})" target="_blank">
 <font size="3">
  掳虏驴碌路驴碌脴虏煤脨脜脧垄脥酶 隆煤 脢脦脝路脦露脕卢陆脫
 </font>
</a>
,
<a class="m" href="http://cache.baidu.com/c?m=9f65cb4a8c8507ed4fece763105392230e54f728629c86027fa3c215cc791a1b1a23a4fb7935107380843e7000db120afdf14076340920a3de95c81cd2ace52f38fb5023716c914b19c46ea8dc4755d650e34d99aa0ee6cae74596b9a1d6c85523dd58716df7f49c5b7003c065e76445&amp;p=8b2a9403c0934eaf5abfc8385864&amp;user=baidu" target="_blank">
 掳脵露脠驴矛脮脮
</a>
]
'''

soup = BeautifulSoup(string, 'html.parser')
href = soup.a['href']
cache_href = soup.find('a', class_='m')['href']
print(f'{href}\n{cache_href}')

# output:
'''
http://www.ylyd.com/showurl.asp?id=6182
http://cache.baidu.com/c?m=9f65cb4a8c8507ed4fece763105392230e54f728629c86027fa3c215cc791a1b1a23a4fb7935107380843e7000db120afdf14076340920a3de95c81cd2ace52f38fb5023716c914b19c46ea8dc4755d650e34d99aa0ee6cae74596b9a1d6c85523dd58716df7f49c5b7003c065e76445&p=8b2a9403c0934eaf5abfc8385864&user=baidu
'''

Alternatively, you can achieve the same thing with the Baidu Organic Results API from SerpApi. It is a paid API with a free trial of 5,000 searches.

Essentially, the main difference in this example is that you don't have to figure out how to extract particular elements, since that has already been done for the end user via the JSON output.

Code to get the href / cached href from the first page of results:

from serpapi import BaiduSearch

params = {
  "api_key": "YOUR_API_KEY",
  "engine": "baidu",
  "q": "ylyd"
}

search = BaiduSearch(params)
results = search.get_dict()

for result in results['organic_results']:
  # try/except used since sometimes there's no link/cached link
  try:
    link = result['link']
  except KeyError:
    link = None
  try:
    cached_link = result['cached_page_link']
  except KeyError:
    cached_link = None
  print(f'{link}\n{cached_link}\n')

# Part of the output:
'''
http://www.baidu.com/link?url=7VlSB5iaA1_llQKA3-0eiE8O9sXe4IoZzn0RogiBMCnJHcgoDDYxz2KimQcSDoxK
http://cache.baiducontent.com/c?m=LU3QMzVa1VhvBXthaoh17aUpq4KUpU8MCL3t1k8LqlKPUU9qqZgQInMNxAPNWQDY6pkr-tWwNiQ2O8xfItH5gtqxpmjXRj0m2vEHkxLmsCu&p=882a9646d5891ffc57efc63e57519d&newp=926a8416d9c10ef208e2977d0e4dcd231610db2151d6d5106b82c825d7331b001c3bbfb423291505d3c77e6305a54d5ceaf13673330923a3dda5c91d9fb4c57479c77a&s=c81e728d9d4c2f63&user=baidu&fm=sc&query=ylyd&qid=e42a54720006d857&p1=1
'''
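As a side note, the try/except blocks above can be replaced with dict.get(), which returns None when a key is missing. A minimal sketch, using a hypothetical result dict in place of an actual API response:

```python
# dict.get() returns None (or a supplied default) for a missing key,
# avoiding explicit try/except for optional fields.
result = {'link': 'http://www.ylyd.com/showurl.asp?id=6182'}  # hypothetical organic result

link = result.get('link')
cached_link = result.get('cached_page_link')  # key absent -> None
print(link, cached_link)
```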
Disclaimer: I work for SerpApi.