I have the following HTML code from a web page:
<div class="align-center">
<a target="_blank" rel="nofollow" class="link-block-2 w-inline-block w-condition-invisible">
<img src="https://global.com/slack-symbol.png" alt="Slack link">
</a>
<a target="_blank" rel="nofollow" href="https://twitter.com/abc" class="link-block-2 w-inline-block">
<img src="https://global.com/twitter.png" width="16" alt="Twitter link">
</a>
<a target="_blank" rel="nofollow" href="https://t.me/abc" class="link-block-2 w-inline-block">
<img src="https://global.com/telegram.png" alt="Telegram link">
</a>
</div>
I also have the following list of link names:
links_dict = {}
links = ["Slack","Twitter","Telegram"]
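I want to extract the href value for each corresponding link. If there is no href (see Slack in the sample code above), it means there is no link.
Expected output: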
"Slack" -> "None"
"Twitter" -> "https://twitter.com/abc"
"Telegram" -> "https://t.me/abc"
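I cannot reach these a elements by searching for a href alone, because the page has many other div elements that contain other a elements.
I would like to use BeautifulSoup, or Selenium with PhantomJS. This is what I have tried:
BeautifulSoup: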
import requests
from bs4 import BeautifulSoup

res = requests.get("https://myurl.com")
soup = BeautifulSoup(res.content, 'html.parser')
tags = soup.find_all(class_="align-center")
for tag in tags:
    print(tag.text.strip())
Selenium:
from selenium import webdriver

driver = webdriver.PhantomJS()
driver.set_window_size(1120, 550)
driver.get("https://mytest.com")
tags = driver.find_elements_by_class_name("align-center")
for tag in tags:
    tag.find_element_by_tag_name("a").click()
    url = driver.current_url
    print(url)
driver.quit()
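Answer 0 (score: 0)
Sticking with BeautifulSoup, you can find all the img elements inside each tag and check whether each one's alt attribute matches the expected pattern.
If it does, take the link from its parent element.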
import re
...
links = []
tags = soup.find_all(class_="align-center")
for tag in tags:
    # For each tag, get all the images
    for img in tag.find_all('img'):
        # Ensure the img has the correct `alt` pattern (a missing alt is treated as '')
        if re.match('(Twitter|Slack|Telegram) link', img.get('alt', '')):
            # Store the parent's href; this is None when the parent has no href.
            links.append(img.find_parent().get('href'))
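The question asks for a mapping from each name in `links` to its href, not a flat list, so the same idea can be sketched as a dictionary lookup. This is only a minimal sketch under stated assumptions: it uses the sample markup from the question, assumes every alt text has the form "<name> link", and uses the built-in "html.parser" backend so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question.
html = """
<div class="align-center">
<a target="_blank" rel="nofollow" class="link-block-2 w-inline-block w-condition-invisible">
<img src="https://global.com/slack-symbol.png" alt="Slack link">
</a>
<a target="_blank" rel="nofollow" href="https://twitter.com/abc" class="link-block-2 w-inline-block">
<img src="https://global.com/twitter.png" width="16" alt="Twitter link">
</a>
<a target="_blank" rel="nofollow" href="https://t.me/abc" class="link-block-2 w-inline-block">
<img src="https://global.com/telegram.png" alt="Telegram link">
</a>
</div>
"""

links = ["Slack", "Twitter", "Telegram"]
# Start with every name mapped to None, so names without a link stay None.
links_dict = dict.fromkeys(links)

soup = BeautifulSoup(html, "html.parser")
for img in soup.select(".align-center a img"):
    # Assumption: the alt text always has the form "<name> link".
    name = img.get("alt", "").split(" ")[0]
    anchor = img.find_parent("a")
    if name in links_dict and anchor is not None:
        # Tag.get returns None when the <a> has no href attribute.
        links_dict[name] = anchor.get("href")

for name in links:
    print(name, "->", links_dict[name])
```

Looking up the enclosing a from each img, rather than pairing two separate result lists, keeps each name attached to its own link even if some a elements in the block contain no img.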
Answer 1 (score: 0)
Try the following script. It should give you the desired result.
from bs4 import BeautifulSoup
content="""
<div class="align-center">
<a target="_blank" rel="nofollow" class="link-block-2 w-inline-block w-condition-invisible">
<img src="https://global.com/slack-symbol.png" alt="Slack link">
</a>
<a target="_blank" rel="nofollow" href="https://twitter.com/abc" class="link-block-2 w-inline-block">
<img src="https://global.com/twitter.png" width="16" alt="Twitter link">
</a>
<a target="_blank" rel="nofollow" href="https://t.me/abc" class="link-block-2 w-inline-block">
<img src="https://global.com/telegram.png" alt="Telegram link">
</a>
</div>
"""
soup = BeautifulSoup(content,"html5lib")
links = {item.get("alt").split(" ")[0]:link.get('href') for item,link in zip(soup.select(".align-center a img"),soup.select(".align-center a"))}
print(links)
Output:
{'Slack': None, 'Telegram': 'https://t.me/abc', 'Twitter': 'https://twitter.com/abc'}
Or you can do the same thing in a slightly different way:
soup = BeautifulSoup(content,"html5lib")
for item in soup.select(".align-center a img"):
    title = item.get("alt").split(" ")[0]
    link = item.find_parent().get('href')
    print(title,link)
Output:
Slack None
Twitter https://twitter.com/abc
Telegram https://t.me/abc
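Answer 2 (score: 0)
If you want to extract the href attribute of each link together with the alt value of its child img node, you can use Selenium as in the following code block: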
tags = driver.find_elements_by_xpath("//div[@class='align-center']/a/img")
my_alt = []
my_href = []
for tag in tags:
    # The Python bindings use get_attribute (getAttribute is the Java name)
    alt_text = tag.get_attribute("alt")
    my_alt.append(alt_text)
    # The parent of each matched img is its enclosing <a>
    my_href.append(tag.find_element_by_xpath("./..").get_attribute("href"))
for alt, href in zip(my_alt, my_href):
    print(alt, href)