Question

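I am using Python.org version 2.7 64-bit on Vista 64-bit. I have the Scrapy code below, which currently extracts text fine, but I am stuck on how to get data from tables on the website. I have looked at answers online, but I am still unsure. As an example, I would like to get the data contained in Wayne Rooney's goal data table: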

http://www.whoscored.com/Players/3859/Fixtures/Wayne-Rooney

My current code is:

from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.utils.markup import remove_tags
from scrapy.cmdline import execute
import re


class MySpider(Spider):
    name = "goals"  # must match the name passed to 'scrapy crawl' below
    allowed_domains = ["whoscored.com"]
    start_urls = ["http://www.whoscored.com/Players/3859/Fixtures/Wayne-Rooney"]

    def parse(self, response):
        titles = response.selector.xpath("normalize-space(//title)")
        for title in titles:

            body = response.xpath("//p").extract()
            body2 = "".join(body)

            print remove_tags(body2).encode('utf-8')

execute(['scrapy','crawl','goals'])

What syntax do I need to use in the xpath() call to get the table data?

Thanks

Answer 1

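I just looked at the linked page, and this XPath expression returns all of the tournament table rows you want: '//table[@id="player-fixture"]//tr[td[@class="tournament"]]'.

I'll try to explain each part of this XPath expression:

//table[@id="player-fixture"]: selects the table whose id attribute is player-fixture, which you can verify by inspecting the page.
//tr[td[@class="tournament"]]: selects every row that contains a td with class tournament, i.e. the rows holding the per-match information you want.

You could also use the shorter XPath expression //tr[td[@class="tournament"]] on its own. But I think the longer expression is more robust, because it states explicitly that you want only the tr rows under one specific table, identified by its unique id (player-fixture).

Once you have all the rows, you can loop over them and pull the information you need from each row.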
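To see what the predicate tr[td[@class="tournament"]] does without running a crawl, here is a minimal sketch. It is not the Scrapy API: it uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath, so the nested predicate is spelled out as a Python filter. The SAMPLE markup, its values, and the helper name fixture_rows are hypothetical stand-ins, not the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the real page markup.
SAMPLE = """
<html><body>
  <table id="player-fixture">
    <tr><th>Tournament</th><th>Date</th></tr>
    <tr><td class="tournament">EPL</td><td class="date">01-01-2014</td></tr>
    <tr><td class="tournament">UCL</td><td class="date">08-01-2014</td></tr>
  </table>
  <table id="other"><tr><td class="tournament">ignored</td></tr></table>
</body></html>
"""


def fixture_rows(root):
    """Return the <tr> elements inside table#player-fixture that have a
    <td class="tournament"> child, mirroring tr[td[@class="tournament"]]."""
    table = root.find(".//table[@id='player-fixture']")
    if table is None:
        return []
    return [tr for tr in table.iter("tr")
            if tr.find("td[@class='tournament']") is not None]


rows = fixture_rows(ET.fromstring(SAMPLE))
for tr in rows:
    print(tr.find("td[@class='tournament']").text)
```

The header row is skipped because it has no td with class tournament, and the second table is skipped because the search starts from the table with id player-fixture.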

Answer 2

To scrape the data, you typically identify the table and then loop over its rows. An HTML table like this usually has the following format:

<table id="thistable">
  <tr>
    <th>Header1</th>
    <th>Header2</th>
  </tr>
  <tr>
    <td>data1</td>
    <td>data2</td>
  </tr>
</table>
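As a rough illustration of "identify the table, then loop over the rows", the sketch below parses exactly that generic table. It assumes the standard library's xml.etree.ElementTree rather than Scrapy, which only works here because the snippet is well-formed XML; the helper name table_to_records is made up for this example.

```python
import xml.etree.ElementTree as ET

TABLE = """
<table id="thistable">
  <tr>
    <th>Header1</th>
    <th>Header2</th>
  </tr>
  <tr>
    <td>data1</td>
    <td>data2</td>
  </tr>
</table>
"""


def table_to_records(table):
    """Turn a header row plus data rows into a list of dicts."""
    headers = [th.text for th in table.findall("./tr/th")]
    records = []
    for tr in table.findall("./tr"):
        cells = [td.text for td in tr.findall("./td")]
        if cells:  # the header row has no <td> cells, so it is skipped
            records.append(dict(zip(headers, cells)))
    return records


records = table_to_records(ET.fromstring(TABLE))
print(records)
```

Keying each row by the header text keeps the column meaning attached to every value, which is the same role the item fields play in a Scrapy project.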

Here is an example of parsing this fixtures table:

from scrapy.spider import Spider
from scrapy.http import Request
from myproject.items import Fixture

class GoalSpider(Spider):
    name = "goal"
    allowed_domains = ["whoscored.com"]
    start_urls = (
        'http://www.whoscored.com/',
        )

    def parse(self, response):
        return Request(
            url="http://www.whoscored.com/Players/3859/Fixtures/Wayne-Rooney",
            callback=self.parse_fixtures
        )

    def parse_fixtures(self, response):
        sel = response.selector
        # Each <tr> in the fixtures table becomes one Fixture item.
        for tr in sel.css("table#player-fixture>tbody>tr"):
            item = Fixture()
            item['tournament'] = tr.xpath('td[@class="tournament"]/span/a/text()').extract()
            item['date'] = tr.xpath('td[@class="date"]/text()').extract()
            item['team_home'] = tr.xpath('td[@class="team home "]/a/text()').extract()
            yield item

First, I use sel.css("table#player-fixture>tbody>tr") to identify the data rows and loop over the results, then I extract the data from each row.
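The per-row extraction can be tried offline with a small sketch. This again assumes the standard library's xml.etree.ElementTree in place of Scrapy selectors, and the ROWS markup, its values, and the helper text_of are hypothetical, modelled only on the class names used in the spider. Note that Scrapy's .extract() returns a list of strings per field, whereas this sketch stores a single stripped string.

```python
import xml.etree.ElementTree as ET

# Hypothetical row markup modelled on the class names used in the spider.
ROWS = """
<table id="player-fixture"><tbody>
  <tr>
    <td class="tournament"><span><a>EPL</a></span></td>
    <td class="date"> 01-01-2014 </td>
    <td class="team home "><a>Manchester United</a></td>
  </tr>
</tbody></table>
"""


def text_of(row, path):
    """Return the stripped text at a row-relative path, or None if absent."""
    node = row.find(path)
    if node is None or node.text is None:
        return None
    return node.text.strip()


fixtures = []
for tr in ET.fromstring(ROWS).findall("./tbody/tr"):
    fixtures.append({
        "tournament": text_of(tr, "td[@class='tournament']/span/a"),
        "date": text_of(tr, "td[@class='date']"),
        # @class is compared as a whole string, so the trailing space matters.
        "team_home": text_of(tr, "td[@class='team home ']/a"),
    })
print(fixtures)
```

Every path is relative to the current row, so each record only picks up cells from its own tr. Matching on the full class string is brittle: if the page drops or adds a space in "team home ", the field silently comes back empty.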

Edit: items.py (http://doc.scrapy.org/en/latest/topics/items.html)

from scrapy.item import Item, Field

class Fixture(Item):
    tournament = Field()
    date = Field()
    team_home = Field()

Answer 3

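First, for each symbol you want, you need to know the name associated with it. For example, for goals I see <span> elements whose title attribute equals "Goal", and for the assist symbol I see <span> elements whose title attribute equals "Assist".

With that information, you can check each row you retrieved to see whether it contains a span whose title matches the symbol you want.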

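To get all the goal symbols for a row, evaluate the expression .//span[@title="Goal"] against that row (the leading dot keeps the search inside the row; a bare // would search the whole document):

for row in response.selector.xpath(
        '//table[@id="player-fixture"]//tr[td[@class="tournament"]]'):
    # Does this row contain goal symbols?
    list_of_goals = row.xpath('.//span[@title="Goal"]')
    if list_of_goals:
        # One span per goal, so the list length is the goal count.
        print("Goals in this row: %d" % len(list_of_goals))
    # ...

If it returns a non-empty list, the row contains goal symbols. The number of goals to output is then the length of the span list returned above.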
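The counting idea can be checked on a small hand-made sample. The sketch below is an illustration under assumptions: it uses the standard library's xml.etree.ElementTree instead of Scrapy, and the SAMPLE markup (including the incidents class) and the helper count_symbols are hypothetical. Only the title values "Goal" and "Assist" come from the answer above.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: one <span> per symbol, identified by its title.
SAMPLE = """
<table id="player-fixture">
  <tr>
    <td class="tournament">EPL</td>
    <td class="incidents">
      <span title="Goal"/><span title="Goal"/><span title="Assist"/>
    </td>
  </tr>
  <tr>
    <td class="tournament">UCL</td>
    <td class="incidents"><span title="Assist"/></td>
  </tr>
</table>
"""


def count_symbols(row, title):
    """Count <span title=...> elements inside this row only."""
    # The leading ".//" keeps the search relative to the row.
    return len(row.findall(".//span[@title='{0}']".format(title)))


table = ET.fromstring(SAMPLE)
goals_per_row = [count_symbols(tr, "Goal") for tr in table.findall("./tr")]
assists_per_row = [count_symbols(tr, "Assist") for tr in table.findall("./tr")]
print(goals_per_row)
print(assists_per_row)
```

A row with no matching span yields a count of 0, which corresponds to the empty-list case described above.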

Trying to extract data from a table using Scrapy
