Retrieving multiple values with Beautiful Soup

Date: 2020-09-24 12:46:47

Tags: python python-3.x beautifulsoup

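I have a script that parses the articles linked in the References section of a Wikipedia page. At the moment it returns the URL of every item in the References section.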

I am trying to get it to output both the link (which it already does) and the link text, either on a single line:

https://this.is.the.url "And this is the article header"

or on consecutive lines:

https://this.is.the.url
"And this is the article header"

Sample link

 <a 
   rel="nofollow" 
   class="external text" 
   href="https://www.mmajunkie.usatoday.com/2020/08/gerald-meerschaert-tests-positive-covid-19-ed-herman-fight-off-ufc-on-espn-plus-31/amp">
   "Gerald Meerschaert tests positive for COVID-19; Ed Herman fight off UFC on ESPN+ 31"
 </a>

Scraper

import requests
import sys
from bs4 import BeautifulSoup

session = requests.Session()
selectWikiPage = "https://en.wikipedia.org/wiki/UFC_Fight_Night:_Waterson_vs._Hill"


if "wikipedia" in selectWikiPage:
    html = session.post(selectWikiPage)
    bsObj = BeautifulSoup(html.text, "html.parser")
    references = bsObj.find('ol', {'class': 'references'})
    href = BeautifulSoup(str(references), "html.parser")
    links = [a["href"] for a in href.find_all("a", class_="external text", href=True)]
    title = [a["href"] for a in href.find_all("a", class_="external text", href=True)]
    for link in links:
        print(link)

else:
    print("Error: Please enter a valid Wikipedia URL")

2 answers:

Answer 0 (score: 0)

Solved:

import requests
from bs4 import BeautifulSoup

session = requests.Session()
selectWikiPage = "https://en.wikipedia.org/wiki/UFC_Fight_Night:_Waterson_vs._Hill"

if "wikipedia" in selectWikiPage:
    # Fetch the article with GET; reading a page does not need POST.
    html = session.get(selectWikiPage)
    bsObj = BeautifulSoup(html.text, "html.parser")
    # The reference list is rendered as <ol class="references">.
    references = bsObj.find('ol', {'class': 'references'})
    href = BeautifulSoup(str(references), "html.parser")

    # For each external link, keep both the URL and the link text.
    for a in href.find_all("a", class_="external text", href=True):
        listitem = [a["href"], a.getText()]
        print(listitem)

else:
    print("Error: Please enter a valid Wikipedia URL")

Answer 1 (score: 0)

You can get the link text as well as the href attribute of each anchor tag.

This can be done simply with:

links = [(a["href"], a.text)
         for a in href.find_all("a", class_="external text", href=True)]
for link, title in links:
    print(link, title)

Each element of links is now a tuple containing the link and its title, so you can display it however you like.
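
As a rough sketch of that display step, the helper below produces either of the two layouts asked for in the question. The name format_reference and its single_line parameter are hypothetical, not part of Beautiful Soup, and the sample tuple stands in for the scraped links list. It assumes the title may or may not already be wrapped in double quotes, as in the sample link, and only adds quotes when they are missing.

```python
def format_reference(link, title, single_line=True):
    """Format one (link, title) pair in either layout from the question."""
    # Link text scraped from HTML often carries surrounding whitespace.
    title = title.strip()
    # Add double quotes only if the title is not already quoted.
    if not (title.startswith('"') and title.endswith('"')):
        title = f'"{title}"'
    separator = " " if single_line else "\n"
    return f"{link}{separator}{title}"


# Hypothetical sample data standing in for the scraped `links` list.
links = [("https://this.is.the.url", "And this is the article header")]

for link, title in links:
    print(format_reference(link, title))                     # one line
    print(format_reference(link, title, single_line=False))  # two lines
```

Keeping the formatting separate from the scraping means the same list of tuples can later be written to a file or a CSV without touching the parsing code.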

Also, a.text can equally be written as a.getText() or a.get_text(), so pick whichever fits your code style.
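
To illustrate, here is a small sketch that parses the sample link from the question as a standalone string (no network request) and checks that the three spellings return the same text. The variable names sample_html, same_text and clean_title are made up for this example. It also uses get_text(strip=True), which trims the surrounding whitespace and newlines that the raw link text would otherwise keep.

```python
from bs4 import BeautifulSoup

# The sample anchor from the question, used as a standalone HTML string.
sample_html = '''
<a rel="nofollow" class="external text"
   href="https://www.mmajunkie.usatoday.com/2020/08/gerald-meerschaert-tests-positive-covid-19-ed-herman-fight-off-ufc-on-espn-plus-31/amp">
   "Gerald Meerschaert tests positive for COVID-19; Ed Herman fight off UFC on ESPN+ 31"
</a>
'''

soup = BeautifulSoup(sample_html, "html.parser")
a = soup.find("a", class_="external text", href=True)

# .text, .getText() and .get_text() all return the same string.
same_text = a.text == a.getText() == a.get_text()

# strip=True removes the leading/trailing whitespace around the title.
clean_title = a.get_text(strip=True)

print(same_text)
print(a["href"], clean_title)
```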