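Why are all of my Scrapy items identical?

Time: 2015-10-13 03:03:07

Tags: python web-crawler scrapy

I'm new to Scrapy and I've run into a problem. There is a website where I want to create one unique item for each cell of a table, but every item comes out identical and I don't know why. Here is my code: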

from scrapy import Spider
from scrapy.selector import Selector

from petroleo.items import PetroleoItem


class PetroleoSpider(Spider):
  name = "petroleo"
  site = "http://www.glossary.oilfield.slb.com/"
  allowed_domains = [site]
  start_urls = [site + 'en/Terms.aspx?filter=sym&LookIn=term%20name&searchtype=starts%20with',]

  def parse(self, response):

    words = Selector(response).xpath("//table[@id='pagecolumns_0_columncontent_0__rptLetter_ctl00__dlTerms']//td")

    for word in words:
        item = PetroleoItem()

        if word.xpath("//table[@id='pagecolumns_0_columncontent_0__rptLetter_ctl00__dlTerms']//td/a/em").extract():

            item['title'] = word.xpath(
                    "//table[@id='pagecolumns_0_columncontent_0__rptLetter_ctl00__dlTerms']//td/a/em/text()").extract()[0]
            item['title'] += word.xpath(
                    "//table[@id='pagecolumns_0_columncontent_0__rptLetter_ctl00__dlTerms']//td/a/sub/text()").extract()[0]


        if word.xpath("//table[@id='pagecolumns_0_columncontent_0__rptLetter_ctl00__dlTerms']//td/a/i").extract():
            item['title'] = {'en': word.xpath(
                "//table[@id='pagecolumns_0_columncontent_0__rptLetter_ctl00__dlTerms']//td/a/i/text()").extract()}
            item['title']['en'][0] += word.xpath(
                "//table[@id='pagecolumns_0_columncontent_0__rptLetter_ctl00__dlTerms']//td/a/i/sub/text()").extract()[0]

        if word.xpath("//table[@id='pagecolumns_0_columncontent_0__rptLetter_ctl00__dlTerms']//td/a/text()").extract():
            item['title'] = {'en': word.xpath(
                "//table[@id='pagecolumns_0_columncontent_0__rptLetter_ctl00__dlTerms']//td/a/text()").extract()}

        yield item

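1 answer:

Answer 0 (score: 1)

Make the expressions context-specific by prepending a dot, and do not repeat the //table[@id='pagecolumns_0_columncontent_0__rptLetter_ctl00__dlTerms']//td part. An XPath that starts with // is evaluated from the document root even when it is called on a nested selector, so every cell in the loop matched the same nodes and [0] always returned the first one: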

words = response.xpath("//table[@id='pagecolumns_0_columncontent_0__rptLetter_ctl00__dlTerms']//td")

for word in words:
    item = PetroleoItem()

    if word.xpath("./a/em").extract():
        item['title'] = word.xpath("./a/em/text()").extract()[0]
        item['title'] += word.xpath("./a/sub/text()").extract()[0]

    if word.xpath("./a/i").extract():
        item['title'] = {'en': word.xpath("./a/i/text()").extract()}
        item['title']['en'][0] += word.xpath("./a/i/sub/text()").extract()[0]

    if word.xpath("./a/text()").extract():
        item['title'] = {'en': word.xpath("./a/text()").extract()}

    yield item

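I don't particularly like, or fully understand, what you are trying to do inside the loop, but this should at least fix the problem you describe in the question.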