Question

I am trying to use scrapy to scrape data from a web page that looks like this:

<h3 id="greatCleave">Great Cleave [General]</h3>

<h5>Prerequisites</h5>
<p>
    Str 13, <a href="#cleave">Cleave</a>, <a href="#powerAttack">Power Attack</a>, base attack bonus +4.
</p>

<h5>Benefit</h5>
<p>
    This feat works like <a href="#cleave">Cleave</a>, except that there is no limit to the number of times you can use it per round.
</p>

<h5>Special</h5>
<p>
    A <a href="/srd/classes/fighter.htm">fighter</a> may select Great Cleave as one of his fighter bonus feats.
</p>



<h3 id="greatFortitude">Great Fortitude [General]</h3>

<h5>Benefit</h5>
<p>
    You get a +2 bonus on all <a href="/srd/combat/combatStatistics.htm#fortitude">Fortitude saving throws</a>.
</p>



<h3 id="greaterSpellFocus">Greater Spell Focus [General]</h3>
<p>
    Choose a school of magic to which you already have applied the <a href="#spellFocus">Spell Focus</a> feat.
</p>

<h5>Benefit</h5>
<p>
    Add +1 to the Difficulty Class for all <a href="/srd/combat/combatStatistics.htm#savingThrows">saving throws</a> against spells from the school of magic you select. This bonus stacks with the bonus from <a href="#spellFocus">Spell Focus</a>.
</p>

<h5>Special</h5>
<p>
    You can gain this feat multiple times. Its effects do not stack. Each time you take the feat, it applies to a new school of magic to which you already have applied the <a href="#spellFocus">Spell Focus</a> feat.
</p>


etc...

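The general pattern is an h3 heading followed by 1 to 4 h5 subheadings, each with a description paragraph. I want to split these into separate items, where the h3 is the name and the h5s are the various attributes. Right now, the code in my scrapy parser looks like this: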

def parse(self, response):
    for feat in response.xpath('//h3'):
        item = FeatItem()
        item['name'] = feat.xpath(".//text()").extract_first()

        # following-sibling:: matches every later sibling on the page,
        # so these lists run past the next h3
        headerlist = feat.xpath("./following-sibling::h5/text()").extract()
        paralist = feat.xpath("./following-sibling::p/text()").extract()

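This gives me almost what I want. The problem is that I end up with all of the following h5 or p data, not just the elements that come before the next h3. Since the amount of data between h3 tags varies, I would like to split the page into sections at the h3 tags so that I can examine each block separately. Is this possible?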
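
For reference, here is a minimal sketch of the kind of splitting I have in mind, not a confirmed solution. It assumes the h3, h5 and p elements are all siblings as in the sample above, and that no unrelated p elements appear elsewhere on the page. The helper name group_by_h3, the "sections" key and the "Description" fallback key are hypothetical names made up for illustration. The idea is to select all three tags in one query, which returns them in document order, and start a new block each time an h3 shows up.

```python
def group_by_h3(nodes):
    """Group a document-ordered sequence of (tag, text) pairs into one dict per h3."""
    feats = []
    current = None  # the feat currently being filled
    header = None   # most recent h5 heading inside the current feat
    for tag, text in nodes:
        text = " ".join(text.split())  # collapse newlines and indentation
        if tag == "h3":
            current = {"name": text, "sections": {}}
            feats.append(current)
            header = None
        elif current is None:
            continue  # skip anything that appears before the first h3
        elif tag == "h5":
            header = text
            current["sections"][header] = ""
        elif tag == "p":
            # A p with no h5 before it goes under a hypothetical "Description" key.
            key = header if header is not None else "Description"
            previous = current["sections"].get(key, "")
            current["sections"][key] = (previous + " " + text).strip()
    return feats


def parse(self, response):
    # Hypothetical replacement for the spider's parse method.
    # The union query returns h3, h5 and p nodes in document order;
    # name() gives the tag name, string(.) gives text including nested <a> text.
    nodes = (
        (n.xpath("name()").extract_first(), n.xpath("string(.)").extract_first())
        for n in response.xpath("//h3 | //h5 | //p")
    )
    for feat in group_by_h3(nodes):
        yield feat  # plain dicts here; these could be copied into FeatItem instead
```

Using string(.) instead of text() keeps the link text inside each paragraph (for example "Cleave" and "Power Attack"), which text() on the p alone would drop. A pure XPath alternative, under the same all-siblings assumption, would be to filter the following siblings of the N-th h3 with a predicate like [count(preceding-sibling::h3) = N].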

Can I use scrapy to split up the contents of a web page when I hit a certain tag?

0 answers: