Question

I have a script that parses the References section of a Wikipedia article. At the moment it returns the URL of every item in the References section.

I'm trying to get it to output both the link (which it already does) and the link text, either on a single line:

https://this.is.the.url "And this is the article header"

or on consecutive lines:

https://this.is.the.url
"And this is the article header"

Sample link

 <a 
   rel="nofollow" 
   class="external text" 
   href="https://www.mmajunkie.usatoday.com/2020/08/gerald-meerschaert-tests-positive-covid-19-ed-herman-fight-off-ufc-on-espn-plus-31/amp">
   "Gerald Meerschaert tests positive for COVID-19; Ed Herman fight off UFC on ESPN+ 31"
 </a>

Scraper

import requests
import sys
from bs4 import BeautifulSoup

session = requests.Session()
selectWikiPage = "https://en.wikipedia.org/wiki/UFC_Fight_Night:_Waterson_vs._Hill"


if "wikipedia" in selectWikiPage:
    html = session.post(selectWikiPage)
    bsObj = BeautifulSoup(html.text, "html.parser")
    references = bsObj.find('ol', {'class': 'references'})
    href = BeautifulSoup(str(references), "html.parser")
    links = [a["href"] for a in href.find_all("a", class_="external text", href=True)]
    title = [a["href"] for a in href.find_all("a", class_="external text", href=True)]
    for link in links:
        print(link)

else:
    print("Error: Please enter a valid Wikipedia URL")

Answer 1

Solved:

import requests
from bs4 import BeautifulSoup

session = requests.Session()
selectWikiPage = "https://en.wikipedia.org/wiki/UFC_Fight_Night:_Waterson_vs._Hill"

if "wikipedia" in selectWikiPage:
    # Fetch the article with GET; a POST is not needed to read a page.
    html = session.get(selectWikiPage)
    bsObj = BeautifulSoup(html.text, "html.parser")
    # The reference list is the <ol class="references"> element.
    references = bsObj.find('ol', {'class': 'references'})
    href = BeautifulSoup(str(references), "html.parser")

    # Each external link yields a [url, link text] pair.
    for a in href.find_all("a", class_="external text", href=True):
        listitem = [a["href"], a.getText()]
        print(listitem)

else:
    print("Error: Please enter a valid Wikipedia URL")
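The code above re-parses `str(references)` into a second soup. That works, but `find_all` can also be called on the element that `find` returns. Below is a minimal sketch of that variant, with parsing separated from fetching so it can be tried on a string. The helper name `extract_reference_links` and the sample HTML are assumptions for illustration and are not part of the original answer.

```python
from bs4 import BeautifulSoup


def extract_reference_links(html):
    """Return [url, text] pairs for external links in the reference list.

    Hypothetical helper, not part of the original answer.
    """
    soup = BeautifulSoup(html, "html.parser")
    references = soup.find("ol", class_="references")
    if references is None:
        # No reference list on this page: return nothing rather than fail.
        return []
    # Search inside the <ol> element directly instead of re-parsing it.
    return [
        [a["href"], a.get_text(strip=True)]
        for a in references.find_all("a", class_="external text", href=True)
    ]


# Made-up sample markup, loosely modelled on the link in the question.
sample = """
<ol class="references">
  <li><a rel="nofollow" class="external text" href="https://this.is.the.url">And this is the article header</a></li>
  <li><a class="internal" href="/wiki/Other">Other</a></li>
</ol>
"""
print(extract_reference_links(sample))
print(extract_reference_links("<p>No references here</p>"))
```

With a live page, the same function could be given `session.get(selectWikiPage).text`.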

Answer 2

You can grab not only the anchor tag's href attribute but also the link's text.

This can be done simply with:

links = [(a["href"], a.text)
         for a in href.find_all("a", class_="external text", href=True)]
for link, title in links:
    print(link, title)

Each element of links is now a tuple containing the link and its title, so you can display it however you need.
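To get from those tuples to the two layouts asked for in the question, something like the following could work. It is only a sketch: `format_links` is a hypothetical helper, and stripping quotes that are already part of the link text before re-quoting is an assumption based on the sample link, whose text includes its own quotation marks.

```python
def format_links(pairs, single_line=True):
    """Format (url, title) tuples in one of the two layouts from the question."""
    lines = []
    for url, title in pairs:
        # Assumption: drop surrounding whitespace and any quotes already in
        # the text, then add exactly one pair of double quotes.
        clean_title = title.strip().strip('"')
        quoted = f'"{clean_title}"'
        if single_line:
            lines.append(f"{url} {quoted}")
        else:
            lines.extend([url, quoted])
    return "\n".join(lines)


# Placeholder data in the shape produced by the list comprehension above.
links = [("https://this.is.the.url", ' "And this is the article header" ')]
print(format_links(links))                     # url and title on one line
print(format_links(links, single_line=False))  # url and title on consecutive lines
```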

Also, a.text can equally be written as a.getText() or a.get_text(), so pick whichever suits your code style.
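One practical difference is that the method form accepts arguments. The sample link in the question has whitespace around its text, and `get_text(strip=True)` trims it. A small sketch on a made-up anchor (the HTML string here is an assumption, adapted from the question's placeholder URL):

```python
from bs4 import BeautifulSoup

# Sample anchor; the whitespace around the text is kept on purpose.
SAMPLE_HTML = """
<a rel="nofollow" class="external text" href="https://this.is.the.url">
   And this is the article header
</a>
"""

anchor = BeautifulSoup(SAMPLE_HTML, "html.parser").find("a", class_="external text")

raw_text = anchor.text                       # property form
same_text = anchor.get_text()                # method form, same result
stripped_text = anchor.get_text(strip=True)  # whitespace trimmed

print(repr(raw_text))
print(repr(stripped_text))
```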
