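I have a script that I'm using to parse the articles cited in the References section of a Wikipedia page. At the moment it returns the URL of every item in the References section.
I'm trying to get it to output both the link (which it already does) and the link text, either on a single line: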
https://this.is.the.url "And this is the article header"
or on consecutive lines:
https://this.is.the.url
"And this is the article header"

Each reference link in the page's HTML looks like this:

<a
    rel="nofollow"
    class="external text"
    href="https://www.mmajunkie.usatoday.com/2020/08/gerald-meerschaert-tests-positive-covid-19-ed-herman-fight-off-ufc-on-espn-plus-31/amp">
    "Gerald Meerschaert tests positive for COVID-19; Ed Herman fight off UFC on ESPN+ 31"
</a>

Here is my current code:

import requests
import sys
from bs4 import BeautifulSoup

session = requests.Session()
selectWikiPage = "https://en.wikipedia.org/wiki/UFC_Fight_Night:_Waterson_vs._Hill"

if "wikipedia" in selectWikiPage:
    html = session.post(selectWikiPage)
    bsObj = BeautifulSoup(html.text, "html.parser")
    references = bsObj.find('ol', {'class': 'references'})
    href = BeautifulSoup(str(references), "html.parser")
    links = [a["href"] for a in href.find_all("a", class_="external text", href=True)]
    title = [a["href"] for a in href.find_all("a", class_="external text", href=True)]
    for link in links:
        print(link)
else:
    print("Error: Please enter a valid Wikipedia URL")

Answer 0 (score: 0)
Solved:

import requests
import sys
from bs4 import BeautifulSoup

session = requests.Session()
selectWikiPage = "https://en.wikipedia.org/wiki/UFC_Fight_Night:_Waterson_vs._Hill"

if "wikipedia" in selectWikiPage:
    html = session.post(selectWikiPage)
    bsObj = BeautifulSoup(html.text, "html.parser")
    # The reference list is the <ol class="references"> element
    references = bsObj.find('ol', {'class': 'references'})
    href = BeautifulSoup(str(references), "html.parser")
    for a in href.find_all("a", class_="external text", href=True):
        # Pair each external link's URL with its visible text
        listitem = [a["href"], a.getText()]
        print(listitem)
else:
    print("Error: Please enter a valid Wikipedia URL")
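
A side note on the `"wikipedia" in selectWikiPage` test used above: it accepts any string that merely contains the word, such as a URL on another site with "wikipedia" in its path. A stricter check could look at the parsed hostname instead. The following is only a sketch; the helper name `is_wikipedia_url` is hypothetical and not part of the original code, and it assumes only http(s) URLs on wikipedia.org or its subdomains should be accepted.

```python
from urllib.parse import urlparse


def is_wikipedia_url(url):
    """Return True if url is an http(s) URL on wikipedia.org or a subdomain.

    Hypothetical helper: a stricter stand-in for `"wikipedia" in url`.
    """
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    return parsed.scheme in ("http", "https") and (
        host == "wikipedia.org" or host.endswith(".wikipedia.org")
    )


print(is_wikipedia_url("https://en.wikipedia.org/wiki/UFC_Fight_Night:_Waterson_vs._Hill"))
print(is_wikipedia_url("https://example.com/wikipedia"))
```

The condition `if is_wikipedia_url(selectWikiPage):` could then replace the substring test without changing the rest of the script.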

Answer 1 (score: 0)
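You can grab not just the href attribute of each anchor tag but also the link's text.
This can be done simply with:

links = [(a["href"], a.text)
         for a in href.find_all("a", class_="external text", href=True)]
for link, title in links:
    print(link, title)

Each element of links is now a tuple containing the link and its title, so you can display it however you like.
Also, a.text can equally be written as a.getText() or a.get_text(), so pick whichever suits your code style.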
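
To get from those (link, title) pairs to the two layouts asked for in the question, something along these lines might work. This is a minimal sketch: `format_reference` and its `single_line` parameter are hypothetical names introduced here, the sample `links` list stands in for the scraped data, and it assumes the title should be wrapped in double quotes unless the link text already includes them.

```python
def format_reference(link, title, single_line=True):
    """Format one (link, title) pair in either layout from the question.

    Hypothetical helper; `single_line` chooses between 'url "title"'
    on one line and the URL and title on two consecutive lines.
    """
    # Collapse newlines and repeated whitespace inside the link text
    title = " ".join(title.split())
    # Add quotes only if the text is not already quoted
    already_quoted = len(title) >= 2 and title.startswith('"') and title.endswith('"')
    if not already_quoted:
        title = f'"{title}"'
    separator = " " if single_line else "\n"
    return f"{link}{separator}{title}"


# Sample data standing in for the scraped `links` list
links = [("https://this.is.the.url", "And this is the article header")]
for link, title in links:
    print(format_reference(link, title))
    print(format_reference(link, title, single_line=False))
```

Keeping the formatting in its own function leaves the scraping loop unchanged, so switching between the two layouts is a one-argument change.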