Question

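I'm just getting started with Python / Scrapy. I was able to follow the tutorials successfully, but I'm struggling with a "test" scrape that I want to do on my own.

What I'm trying to do now is go to http://jobs.walmart.com/search/finance-jobs and scrape the job listings.

However, I think I'm doing something wrong in the XPath, but I'm not sure what.

The table has no "id", so I'm using its class.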

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector


class MySpider(BaseSpider):
  name = "walmart"
  allowed_domains = ["jobs.walmart.com"]
  start_urls = ["http://jobs.walmart.com/search/finance-jobs"]

  def parse(self, response):
      hxs = HtmlXPathSelector(response)
      titles = hxs.select("//table[@class='tableSearchResults']")
      items = []
      for titles in titles:
          item = walmart()
          item ["title"] = titles.select("a/text()").extract()
          item ["link"] = titles.select("a/@href").extract()
          items.append(item)
      return items

This is what the page source looks like:

Answer 1

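As you said, the problem is your XPath. It is always useful to run:

scrapy view http://jobs.walmart.com/search/finance-jobs

before running the spider, to see what the site looks like from Scrapy's point of view.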
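The same kind of check can also be done in plain Python once the page body has been saved. The sketch below is not part of Scrapy; it uses only the standard library to list the class attribute of every table in an HTML string, which helps confirm that the class named in the XPath actually exists in the downloaded markup. `TableClassCollector`, `list_table_classes`, and `SAMPLE_HTML` are hypothetical names and data made up for illustration.

```python
from html.parser import HTMLParser


class TableClassCollector(HTMLParser):
    """Collect the class attribute of every <table> in an HTML document."""

    def __init__(self):
        super().__init__()
        self.table_classes = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attrs is a list of (name, value) pairs.
        if tag == "table":
            self.table_classes.append(dict(attrs).get("class"))


def list_table_classes(html_text):
    """Return the class of each table, or None where a table has no class."""
    collector = TableClassCollector()
    collector.feed(html_text)
    return collector.table_classes


# Hypothetical page body standing in for the real response.
SAMPLE_HTML = """
<html><body>
  <table class="tableSearchResults"><tr><td class="td1">x</td></tr></table>
  <table><tr><td>no class</td></tr></table>
</body></html>
"""

print(list_table_classes(SAMPLE_HTML))
```

If the class you expect is missing from the list, the XPath cannot match, whatever the rest of the expression looks like.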

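Your expression selects the table itself, and "a/text()" relative to the table only matches anchors that are direct children of it. The links sit inside the td cells of each row, so select the rows and then go through the cell. This should work now: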

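from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

# Note: walmart is assumed to be the Item class defined in your project's
# items module; import it from there.


class MySpider(BaseSpider):
    name = "walmart"
    allowed_domains = ["jobs.walmart.com"]
    start_urls = ["http://jobs.walmart.com/search/finance-jobs"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # Select each row of the results table, not the table itself.
        rows = hxs.select("//table[@class='tableSearchResults']/tr")
        items = []
        for row in rows:
            # Skip rows (such as headers) that have no job link.
            if row.select("td[@class='td1']/a").extract():
                # Create a new item per row; reusing one item would make
                # every entry in the list point at the same object.
                item = walmart()
                item["title"] = row.select("td[@class='td1']/a/text()").extract()
                item["link"] = row.select("td[@class='td1']/a/@href").extract()
                items.append(item)
        return items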
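
To see why a relative path from the table finds nothing while a path through the rows does, here is a minimal sketch that runs without Scrapy. It uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath, and a hypothetical, well-formed stand-in for the results table. The real page markup is not shown above, so the job titles and hrefs here are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the results table (not the real page source).
SAMPLE = """
<html><body>
<table class="tableSearchResults">
  <tr><th>Job title</th></tr>
  <tr><td class="td1"><a href="/job/1">Financial Analyst</a></td></tr>
  <tr><td class="td1"><a href="/job/2">Senior Accountant</a></td></tr>
</table>
</body></html>
"""

root = ET.fromstring(SAMPLE.strip())
table = root.find(".//table[@class='tableSearchResults']")

# "a" relative to the table matches only direct children, so this is empty.
direct_links = table.findall("a")

# Walking tr -> td -> a reaches the anchors, one row at a time.
jobs = []
for row in table.findall("tr"):
    link = row.find("td[@class='td1']/a")
    if link is not None:  # the header row has no link and is skipped
        # Build a new dict per row so each entry keeps its own values.
        jobs.append({"title": link.text, "link": link.get("href")})

print(direct_links)
print(jobs)
```

The first print shows an empty list, which mirrors the empty results in the question. The second shows one entry per job row.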

HtmlXPathSelector for Scrapy returns null results
