I'm trying to use Scrapy to scrape data from a web page that looks like this:
<h3 id="greatCleave">Great Cleave [General]</h3>
<h5>Prerequisites</h5>
<p>
Str 13, <a href="#cleave">Cleave</a>, <a href="#powerAttack">Power Attack</a>, base attack bonus +4.
</p>
<h5>Benefit</h5>
<p>
This feat works like <a href="#cleave">Cleave</a>, except that there is no limit to the number of times you can use it per round.
</p>
<h5>Special</h5>
<p>
A <a href="/srd/classes/fighter.htm">fighter</a> may select Great Cleave as one of his fighter bonus feats.
</p>
<h3 id="greatFortitude">Great Fortitude [General]</h3>
<h5>Benefit</h5>
<p>
You get a +2 bonus on all <a href="/srd/combat/combatStatistics.htm#fortitude">Fortitude saving throws</a>.
</p>
<h3 id="greaterSpellFocus">Greater Spell Focus [General]</h3>
<p>
Choose a school of magic to which you already have applied the <a href="#spellFocus">Spell Focus</a> feat.
</p>
<h5>Benefit</h5>
<p>
Add +1 to the Difficulty Class for all <a href="/srd/combat/combatStatistics.htm#savingThrows">saving throws</a> against spells from the school of magic you select. This bonus stacks with the bonus from <a href="#spellFocus">Spell Focus</a>.
</p>
<h5>Special</h5>
<p>
You can gain this feat multiple times. Its effects do not stack. Each time you take the feat, it applies to a new school of magic to which you already have applied the <a href="#spellFocus">Spell Focus</a> feat.
</p>
etc...
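The general pattern is an h3 heading followed by one to four h5 subheadings, each with a descriptive paragraph. I want to split these into separate items, where the h3 is the name and the h5s are the various attributes. Right now, the code in my Scrapy parser looks like this: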
    def parse(self, response):
        for feat in response.xpath('//h3'):
            item = FeatItem()
            item['name'] = feat.xpath(".//text()").extract_first()
            headerlist = feat.xpath("./following-sibling::h5/text()").extract()
            paralist = feat.xpath("./following-sibling::p/text()").extract()
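This gives me more than I want: I end up with every h5 or p that follows the h3, not just those that come before the next h3. Because the amount of data between h3 tags varies, I would like to split the page into sections at each h3 tag so that I can examine each block separately. Is this possible?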