Question

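Please note: I am not an experienced programmer, so don't be mad at me... I am exploring what Scrapy can do (I have some Python programming skills).

Scraping a website: let's imagine that we can extract some information, e.g. 'title', 'url' and 'description', from Open Graph (og:) and other information, e.g. 'author', from schema.org. Finally, 'title', 'url', 'description' and 'date' should be extracted from the HTML with "normal" XPath if they are not available from Open Graph (og:) and schema.org.

I create three item classes, OpengraphItem(Item), SchemaItem(Item) and MyItem(Item), in separate .py files. Each class has an extract function that fills in its fields, as in the following example: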

class OpengraphItem(Item):
    title = Field()
    url = Field()
    description = Field()

    def extract(self, hxs):
        self.title = hxs.xpath('/html/head/meta[@property="og:title"]/@content').extract()
        self.url = hxs.xpath('/html/head/meta[@property="og:url"]/@content').extract()
        self.description = hxs.xpath('/html/head/meta[@property="og:description"]/@content').extract()

Then, in the spider code, the extract functions are called like this:

def parse_item(self, response):
    hxs = HtmlXPathSelector(response)

    my_item = MyItem()
    item_opengraph = OpengraphItem()
    item_opengraph.extract(hxs)

    item_schema = SchemaItem()
    item_schema.extract(hxs)

    my_item['date'] = hxs.xpath('/html/body//*/div[@class="reviewDate"]/span/time[@class="dtreviewed"]/@content').extract()

    my_item['title'] = item_opengraph.get('title', None)
    my_item['url'] = item_opengraph.get('url', None)
    my_item['description'] = item_opengraph.get('description', None)

    if my_item['url'] == None:
        my_item['url'] = response.url

    if my_item['title'] == None:
        my_item['title'] = hxs.xpath('/html/head/title/text()').extract()

    if my_item['description'] == None:
        my_item['description'] = hxs.xpath('/html/head/meta[@name="description"]/@content').extract()

    return my_item

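Does this make sense? Is it right to create extract methods in the item classes?

I looked at other questions: scrapy crawler to pass multiple item classes to pipeline - I don't know whether it is correct to have a single items.py that contains several different classes.

Scrapy item extraction scope issue and scrapy single spider to pass multiple item classes to pipeline - Should I have an Item Pipeline? I'm not familiar with those, but from what the Scrapy documentation says about their purpose, I don't think they fit this problem. And what about Item Loaders?

I have omitted some parts of the code.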

Answer 1

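Is it right to create extract methods in the item classes?

It is unusual. I can't say it is "not right", because the code can still work, but usually all the code that depends on the page structure (such as selectors) stays in the Spider.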
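
To illustrate keeping that logic on the spider side, here is a minimal sketch of a fallback helper. The name first_value is my own invention, not part of Scrapy, and the lists below are stand-ins for what hxs.xpath(...).extract() would return in parse_item:

```python
def first_value(*candidates):
    """Return the first element of the first non-empty list, else None.

    Each candidate is assumed to be a list, such as the result of
    hxs.xpath(...).extract() inside the spider's parse_item method.
    """
    for values in candidates:
        if values:
            return values[0]
    return None


# Stand-ins for extraction results: Open Graph title missing, <title> present.
og_title = []
html_title = ['Plain title']

title = first_value(og_title, html_title)
```

With a helper like this, the three "if ... == None" blocks in parse_item could probably collapse into one call per field, and the item classes would not need to know anything about XPath.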

Item Loaders may be useful for what you are trying to do; you should definitely give them a try.

Another thing: assigning attributes to item fields, as in

  def extract(self, hxs):
        self.title = hxs [...]

does not work. Scrapy items behave like dicts, so you should assign to, e.g., self['title'].
