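I'm trying to scrape a certain type of Wikipedia page, and I'd like the approach to be general enough that I can repeat it across many pages. You can use this page as an example: https://en.m.wikipedia.org/wiki/Template:POTD/2009-01-01
I want to scrape the summary below the picture. However, when I use the Wikipedia module, I get an empty string. When I use Beautiful Soup and try to navigate to where the summary sits, I can't work out what to write that would reliably give me the summary text. This is what I have: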
soup.find(style ='display:inline-block; margin-left: 4px ; width: 314px ; vertical-align:middle;')
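Note the bolded value: it changes depending on the page, while the rest of the style text stays largely the same on other pages. So I'd like to get the text by matching that style somehow, or perhaps there is a simpler way. I'd appreciate any ideas on how to solve this.
Below is some code you can use to see what I'm aiming for, although I'd like it to work across multiple pages: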
import requests
from bs4 import BeautifulSoup

source = requests.get('https://en.m.wikipedia.org/wiki/Template:POTD/2009-01-01').text
soup = BeautifulSoup(source, 'lxml')
# Matches the exact style string, so it only works when the width is 314px
summary = soup.find(class_='content').find(style='display: inline-block; margin-left: 4px; width: 314px; vertical-align: middle;').text
print(summary)
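Result:
A team of obstetricians perform a Caesarean section (commonly called a "C-section") in a modern hospital. The image shows the very first moment the mother glimpses her new-born child. This is a surgical procedure in which incisions are made through a mother's abdomen (laparotomy) and uterus (hysterotomy) to deliver one or more babies. It is usually performed when a vaginal delivery would put the baby's or mother's life or health at risk, although in recent times it has been also performed upon request for childbirths that would otherwise have been natural.Photo credit: Salim Fadhley Archive – More featured pictures...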
Answer 0 (score: 0)
I wouldn't target the tag by its style attribute; that's too fragile. A better approach is to use a CSS selector:
#mw-content-text div:has(> a > img) + div
This selects the <div> that immediately follows a <div> containing a linked <img>, anywhere under the element with id=mw-content-text.
import requests
from bs4 import BeautifulSoup

url = 'https://en.m.wikipedia.org/wiki/Template:POTD/2009-01-01'
soup = BeautifulSoup(requests.get(url).text, 'lxml')
# Print the text of every <div> matched by the selector
for tag in soup.select('#mw-content-text div:has(> a > img) + div'):
    print(tag.text.strip())
Prints:
A team of obstetricians perform a Caesarean section (commonly called a "C-section") in a modern hospital. The image shows the very first moment the mother glimpses her new-born child. This is a surgical procedure in which incisions are made through a mother's abdomen (laparotomy) and uterus (hysterotomy) to deliver one or more babies. It is usually performed when a vaginal delivery would put the baby's or mother's life or health at risk, although in recent times it has been also performed upon request for childbirths that would otherwise have been natural.Photo credit: Salim Fadhley
Archive – More featured pictures...
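Since the question asks for something reusable across pages, here is a minimal sketch of how the selector above might be wrapped in helper functions. The names `potd_url`, `extract_summary` and `fetch_summary` are hypothetical, the URL pattern is assumed from the single example page, and `html.parser` is used only to avoid the lxml dependency. Whether other POTD pages share the same layout has not been checked.

```python
from datetime import date

import requests
from bs4 import BeautifulSoup

# Selector from the answer: the <div> right after a <div> holding a linked <img>.
SUMMARY_SELECTOR = '#mw-content-text div:has(> a > img) + div'


def potd_url(day):
    """Build the page URL for a date (pattern assumed from the example page)."""
    return f'https://en.m.wikipedia.org/wiki/Template:POTD/{day.isoformat()}'


def extract_summary(html):
    """Return the summary text from page HTML, or None if nothing matches."""
    soup = BeautifulSoup(html, 'html.parser')
    tag = soup.select_one(SUMMARY_SELECTOR)
    return tag.get_text().strip() if tag else None


def fetch_summary(day):
    """Download the page for a date and extract its summary (hypothetical helper)."""
    response = requests.get(potd_url(day), timeout=10)
    response.raise_for_status()
    return extract_summary(response.text)
```

Something like `fetch_summary(date(2009, 1, 1))` could then be called in a loop over dates. Keeping the parsing in `extract_summary` separate from the download makes it possible to check the selector against saved HTML without hitting the network.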
More about CSS selectors here.
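If matching on the style attribute is still preferred, as in the question, one option might be to pass a compiled regular expression to `find` so that the varying width is ignored. This is only a sketch: the pattern and the function name `find_by_style` are assumptions based on the one style string shown in the question, and it remains more fragile than the structural selector.

```python
import re

from bs4 import BeautifulSoup

# Match the stable parts of the style string and skip whatever sits between them
# (for example the width, which differs from page to page).
STYLE_PATTERN = re.compile(r'display:\s*inline-block;.*vertical-align:\s*middle')


def find_by_style(html):
    """Return the text of the first tag whose style matches, or None."""
    soup = BeautifulSoup(html, 'html.parser')
    tag = soup.find(style=STYLE_PATTERN)
    return tag.get_text().strip() if tag else None
```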