Question

This is the table I am trying to get data from:

<table class="statBlock" cellspacing="0">
<tr>
    <th>
        <a href="/srd/magicOverview/spellDescriptions.htm#level">Level</a>:
    </th>
    <td>
        <a href="/srd/spellLists/clericSpells.htm#thirdLevelClericSpells">Clr 3</a>
    </td>
</tr>
<tr>
    <th>
        <a href="/srd/magicOverview/spellDescriptions.htm#components">Components</a>:
    </th>
    <td>
        V, S
    </td>
</tr>
<tr>
    <th>
        <a href="/srd/magicOverview/spellDescriptions.htm#castingTime">Casting Time</a>:
    </th>
    <td>
        1 <a href="/srd/combat/actionsInCombat.htm#standardActions">standard action</a>
    </td>
</tr>

ETC...

This is the Scrapy code I have so far to parse it:

    for sel in response.xpath('//tr'):
        string = " ".join(response.xpath('//th/a/text()').extract()) + ":" + " ".join(response.xpath('//td/text()').extract())
        print string

But this produces a result like this:

Level Components Casting Time Range Effect Duration Saving Throw Spell Resistance:V, S, M, XP 12 hours 0 ft. One duplicate creature Instantaneous None No

when the output should look like this:

Level: CLR 1  Components:V, S, M etc...

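Basically, for some reason it is not looping over each row of the table, finding one <th> cell and one <td> cell per row, and sticking them together. Instead it finds all the data from every <th>, then all the data from every <td>, and sticks those two sets together. I think my for statement needs fixing. How do I get it to examine each row individually?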

Answer 1

When you query an XPath like

response.xpath('//th/a/text()')

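this returns the text of every <a> element inside a <th> element anywhere in the document, not just in the current row. That is not what you want. You should do this instead: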

for sel in response.xpath('//tr'):
    string = " ".join(sel.xpath('.//th/a/text()').extract()) + ":" + " ".join(sel.xpath('.//td/text()').extract())
    print string

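The leading dot in the XPath expressions inside the loop makes them run relative to the current node (sel, the current row) rather than from the root of the document.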
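To see the per-row behaviour in isolation, here is a minimal sketch that runs without Scrapy. It is an illustration under assumptions, not the code from the question: it uses the standard library's xml.etree.ElementTree instead of Scrapy selectors, the markup is a shortened copy of the table with placeholder href values, and the helper names cell_text and rows_to_pairs are made up for this example. ElementTree's XPath subset has no text(), so the text is collected with itertext() instead.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the table in the question (href values are placeholders).
MARKUP = """
<table class="statBlock">
<tr>
    <th><a href="#level">Level</a>:</th>
    <td><a href="#clr">Clr 3</a></td>
</tr>
<tr>
    <th><a href="#components">Components</a>:</th>
    <td>V, S</td>
</tr>
<tr>
    <th><a href="#castingTime">Casting Time</a>:</th>
    <td>1 <a href="#std">standard action</a></td>
</tr>
</table>
"""


def cell_text(element):
    """Join all text inside an element (nested tags included) and collapse whitespace."""
    return " ".join("".join(element.itertext()).split())


def rows_to_pairs(markup):
    """Return one (label, value) pair per table row."""
    root = ET.fromstring(markup.strip())
    pairs = []
    for row in root.iter("tr"):
        # The leading "." makes each lookup relative to the current row,
        # so only this row's cells are matched.
        header = row.find("./th/a")
        value = row.find("./td")
        if header is None or value is None:
            continue  # skip rows that lack a header link or a value cell
        pairs.append((cell_text(header), cell_text(value)))
    return pairs


pairs = rows_to_pairs(MARKUP)
for label, value in pairs:
    print("%s: %s" % (label, value))
```

One related detail: in XPath, td/text() selects only the text nodes directly inside the <td>, so text wrapped in a nested link such as "Clr 3" is skipped. If you want that text as well in Scrapy, './/td//text()' selects text at any depth below the cell.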

For more details on relative XPaths, see "Working with Relative XPaths" in the Scrapy documentation.
