Question

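I'm trying to extract all the links on a page. So far I can get the links, but their anchor text doesn't carry any useful information. That information is in another, sibling tag.

Here is the HTML layout: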

<tbody>
<tr>
     <td>
        <h3>Driver with license E or F</h3>
        <div class = "date">..</div>
        <br>
        <p>...</p>
        <div id='print'>
        <a href="show_classifieds?..." class="bar">Go To Details</a>
        </div>
        <br>    
    </td>
</tr>
    <tr>
    <td>
        <h3>Payroll Administrator</h3>
        <div class = "date">..</div>
        <br>
        <p>...</p>
        <div id='print'>
        <a href="show_classifieds?..." class="bar">Go To Details</a>
        </div>
        <br>    
    </td>
</tr>
<tr>
    <td>
        <h3>Head of Sales and Marketing</h3>
        <div class = "date">..</div>
        <br>
        <p>...</p>
        <div id='print'>
        <a href="show_classifieds?..." class="bar">Go To Details</a>
        </div>
        <br>    
   </td>
</tr>
</tbody>

When I extract the links, I get the following:

<a href="show_classifieds?..." class="bar">Go To Details</a>
<a href="show_classifieds?..." class="bar">Go To Details</a>
<a href="show_classifieds?..." class="bar">Go To Details</a>

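However:

I'd like to replace the text "Go To Details" with the text of the h3 tag in each case.
These links will be displayed on an external site, so I'd prefer absolute URLs rather than relative ones.

So in the end I'd like to get something like this: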

<a href="http://www.example.com/show_classifieds?..." class="bar">Driver with license E or F</a>
<a href="http://www.example.com/show_classifieds?..." class="bar">Payroll Administrator</a>
<a href="http://www.example.com/show_classifieds?..." class="bar">Head of Sales and Marketing</a>

Any help would be greatly appreciated.

Answer 1

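For this to be a stable solution, you really need to make sure that every page follows exactly the same pattern as your example.

Basic assumption:

The text you want is always in an h3 tag, and that h3 is a sibling of the div with id 'print', which in turn is the parent of the anchor.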

from bs4 import BeautifulSoup

# `html` is assumed to hold the page source as a string
soup = BeautifulSoup(html, 'html.parser')
for a in soup.find_all('a'):
    # get the text of the 'h3' tag that precedes the anchor's parent div
    header = a.parent.find_previous_sibling('h3').text
    # replace the anchor's text with the text of the 'h3' tag
    a.string = header
    print(a)
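If some rows might not follow the pattern, a guarded variant avoids the AttributeError that the loop above would raise when no h3 is found. This is only a sketch: the sample markup is hypothetical (its second row deliberately has no h3), and filtering on class "bar" assumes every link of interest carries that class, as in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the second row has no <h3>, to show the guard in action.
html = """
<table><tbody>
<tr><td>
  <h3>Payroll Administrator</h3>
  <div id='print'><a href="show_classifieds?..." class="bar">Go To Details</a></div>
</td></tr>
<tr><td>
  <div id='print'><a href="show_classifieds?..." class="bar">Go To Details</a></div>
</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")

relabelled = []
for a in soup.find_all("a", class_="bar"):
    # find_previous_sibling returns None when no earlier <h3> sibling exists
    heading = a.parent.find_previous_sibling("h3")
    if heading is None:
        continue  # leave anchors that do not match the pattern untouched
    a.string = heading.get_text(strip=True)
    relabelled.append(a)

for a in relabelled:
    print(a)
```

Only the first anchor is relabelled here; the second keeps its original "Go To Details" text.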

Further reading: tag.string

(If needed, you can build absolute URLs by using urljoin with the domain name.)

Output (attribute order may vary with the Python and parser version):

<a class="bar" href="show_classifieds?...">Driver with license E or F</a>
<a class="bar" href="show_classifieds?...">Payroll Administrator</a>
<a class="bar" href="show_classifieds?...">Head of Sales and Marketing</a>
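The urljoin suggestion above could look roughly like this. It is a minimal sketch that assumes the site root is http://www.example.com/, as in the desired output from the question; the elided query string is kept exactly as posted.

```python
from urllib.parse import urljoin

# Assumed base: the site root taken from the desired output in the question.
base_url = "http://www.example.com/"
# Relative href as it appears in the page (query string elided in the question).
relative_href = "show_classifieds?..."

# urljoin resolves the relative reference against the base URL.
absolute_href = urljoin(base_url, relative_href)
print(absolute_href)  # http://www.example.com/show_classifieds?...
```

Inside the loop, the same call can be applied to each anchor with a["href"] = urljoin(base_url, a["href"]). If the links are relative to a page deeper in the site, passing that page's own URL as the base is probably the safer choice.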

BeautifulSoup: replace anchor text with text from another tag
