Question

I have the following HTML from a web page:

<div class="align-center">
<a target="_blank" rel="nofollow" class="link-block-2 w-inline-block w-condition-invisible">
  <img src="https://global.com/slack-symbol.png" alt="Slack link">
</a>
<a target="_blank" rel="nofollow" href="https://twitter.com/abc" class="link-block-2 w-inline-block">
  <img src="https://global.com/twitter.png" width="16" alt="Twitter link">
</a>
<a target="_blank" rel="nofollow" href="https://t.me/abc" class="link-block-2 w-inline-block">
  <img src="https://global.com/telegram.png" alt="Telegram link">
</a>
</div>

I also have the following list of link names:

links_dict = {}
links = ["Slack","Twitter","Telegram"]

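I want to extract the href value for each corresponding link. If an element has no href (see Slack in the sample HTML above), that means there is no link.

The expected output is: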

"Slack" -> "None"
"Twitter" -> "https://twitter.com/abc"
"Telegram" -> "https://t.me/abc"

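I cannot select these a elements by a href alone, because the page contains many other div elements with their own a elements.

I want to use BeautifulSoup, or Selenium with PhantomJS. This is what I have tried:

BeautifulSoup:

import requests
from bs4 import BeautifulSoup

res = requests.get("https://myurl.com")
soup = BeautifulSoup(res.content, 'html.parser')
tags = soup.find_all(class_="align-center")
for tag in tags:
    print(tag.text.strip())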

Selenium:

from selenium import webdriver

driver = webdriver.PhantomJS()
driver.set_window_size(1120, 550)
driver.get("https://mytest.com")

tags = driver.find_elements_by_class_name("align-center")

for tag in tags:
    tag.find_element_by_tag_name("a").click()
    url = driver.current_url
    print(url)
driver.quit()

Answer 1

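Sticking with BeautifulSoup, you can find all the img tags inside each matching element, then check whether each image's alt text matches the expected pattern.

If the pattern matches, take the href from the image's parent a element.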

import re

...

links = []
tags = soup.find_all(class_="align-center")
for tag in tags:
    # For each tag, get all the images
    for img in tag.find_all('img'):
        # Ensure the img has the correct `alt` pattern
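        # Default to '' so an img without an alt attribute does not raise TypeError
        if re.match('(Twitter|Slack|Telegram) link', img.attrs.get('alt', '')):
            # Store the href of the enclosing <a> (None when it has no href).
            links.append(img.find_parent('a').attrs.get('href'))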
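
The snippet above collects only the href values, so the link names from the question are lost. As a minimal sketch of how the same idea could fill a dictionary keyed by the names in `links`, the following builds the alt pattern from that list. The helper name `extract_links` and the `HTML` constant are assumptions made for this illustration, and the built-in `html.parser` is used so that no extra parser needs to be installed.

```python
import re
from bs4 import BeautifulSoup

# Sample markup copied from the question.
HTML = """
<div class="align-center">
<a target="_blank" rel="nofollow" class="link-block-2 w-inline-block w-condition-invisible">
  <img src="https://global.com/slack-symbol.png" alt="Slack link">
</a>
<a target="_blank" rel="nofollow" href="https://twitter.com/abc" class="link-block-2 w-inline-block">
  <img src="https://global.com/twitter.png" width="16" alt="Twitter link">
</a>
<a target="_blank" rel="nofollow" href="https://t.me/abc" class="link-block-2 w-inline-block">
  <img src="https://global.com/telegram.png" alt="Telegram link">
</a>
</div>
"""


def extract_links(html, names):
    """Hypothetical helper: map each name to the href of its <a>, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # Start with None everywhere so names that never appear are still reported.
    result = {name: None for name in names}
    # Build "(Slack|Twitter|Telegram) link" from the list of names.
    pattern = re.compile(r"(%s) link$" % "|".join(re.escape(n) for n in names))
    for container in soup.find_all(class_="align-center"):
        for img in container.find_all("img"):
            match = pattern.match(img.get("alt", ""))
            anchor = img.find_parent("a")
            if match and anchor is not None:
                # .get() returns None when the <a> has no href attribute.
                result[match.group(1)] = anchor.get("href")
    return result


links = ["Slack", "Twitter", "Telegram"]
links_dict = extract_links(HTML, links)
for name in links:
    print(name, "->", links_dict[name])
```

Names that have no matching image, or whose a element lacks an href, stay mapped to `None`, which corresponds to the "no link" case described in the question.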

Answer 2

Try the following script. It should give you the desired result.

from bs4 import BeautifulSoup

content="""
<div class="align-center">
<a target="_blank" rel="nofollow" class="link-block-2 w-inline-block w-condition-invisible">
  <img src="https://global.com/slack-symbol.png" alt="Slack link">
</a>
<a target="_blank" rel="nofollow" href="https://twitter.com/abc" class="link-block-2 w-inline-block">
  <img src="https://global.com/twitter.png" width="16" alt="Twitter link">
</a>
<a target="_blank" rel="nofollow" href="https://t.me/abc" class="link-block-2 w-inline-block">
  <img src="https://global.com/telegram.png" alt="Telegram link">
</a>
</div>
"""
soup = BeautifulSoup(content,"html5lib")
links = {
    item.get("alt").split(" ")[0]: link.get("href")
    for item, link in zip(
        soup.select(".align-center a img"),
        soup.select(".align-center a"),
    )
}
print(links)

Output:

{'Slack': None, 'Telegram': 'https://t.me/abc', 'Twitter': 'https://twitter.com/abc'}

Or you can do the same thing in a slightly different way:

soup = BeautifulSoup(content,"html5lib")
for item in soup.select(".align-center a img"):
    title = item.get("alt").split(" ")[0]
    link = item.findParent().get('href')
    print(title,link)

Output:

Slack None
Twitter https://twitter.com/abc
Telegram https://t.me/abc
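
One caveat with the zip-based version: it pairs images and anchors purely by position, so an a element without an img inside the container would shift every following pair. A minimal sketch of a variant that starts from each anchor and looks up its own image is shown below. The `HTML` sample is a hypothetical variation of the question's markup with a text-only link added at the top, and `links_by_name` is a name chosen for this illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical variation of the question's markup: the first <a> has no <img>.
HTML = """
<div class="align-center">
<a href="https://example.com/text-only">Text-only link</a>
<a target="_blank" rel="nofollow" class="link-block-2 w-inline-block w-condition-invisible">
  <img src="https://global.com/slack-symbol.png" alt="Slack link">
</a>
<a target="_blank" rel="nofollow" href="https://twitter.com/abc" class="link-block-2 w-inline-block">
  <img src="https://global.com/twitter.png" width="16" alt="Twitter link">
</a>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")
links_by_name = {}
for anchor in soup.select(".align-center a"):
    # Only consider images that actually carry an alt attribute.
    img = anchor.find("img", alt=True)
    if img is None:
        continue  # skip anchors without a labelled image
    name = img["alt"].split(" ")[0]
    # .get() returns None when the anchor has no href.
    links_by_name[name] = anchor.get("href")

print(links_by_name)
```

Because each image is looked up inside its own anchor, the text-only link is skipped and the remaining names still line up with the correct href values.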

Answer 3

If you want the href of the parent a element for each alt value found on the child img nodes, you can use Selenium as in the following code block:

tags = driver.find_elements_by_xpath("//div[@class='align-center']/a/img")
my_alt = []
my_href = []
for tag in tags:
    # The Python bindings use get_attribute(), not the Java-style getAttribute()
    alt_text = tag.get_attribute("alt")
    my_alt.append(alt_text)
    # "./.." selects the img's parent <a>; its href is None when the attribute is missing
    my_href.append(tag.find_element_by_xpath("./..").get_attribute("href"))
for alt, href in zip(my_alt, my_href):
    print(alt, href)
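
The `find_elements_by_xpath` helpers used in this thread belong to older Selenium releases; recent releases expose `find_elements(by, value)` instead. As a rough sketch under that assumption, the same lookup could be wrapped in a function that returns a dictionary. The function name `collect_links` and its `container_class` parameter are hypothetical, and the locator strategy is passed as the plain string `"xpath"`, which is the value of `By.XPATH`.

```python
def collect_links(driver, container_class="align-center"):
    """Hypothetical helper: map the first word of each img alt to its parent's href."""
    xpath = "//div[@class='%s']/a/img" % container_class
    result = {}
    for img in driver.find_elements("xpath", xpath):
        # Fall back to an empty string if the image has no alt text.
        alt_text = img.get_attribute("alt") or ""
        name = alt_text.split(" ")[0]
        # "./.." selects the parent element, which is the enclosing <a> here.
        parent = img.find_element("xpath", "./..")
        # Expected to be None when the <a> has no href attribute.
        result[name] = parent.get_attribute("href")
    return result


# Usage, assuming a driver has already been created and the page loaded:
#     driver.get("https://mytest.com")
#     print(collect_links(driver))
```

Returning a dictionary instead of two parallel lists keeps each name attached to its href, which matches the output format requested in the question.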

How to extract the href from a given div?
