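I'm new to Scrapy and I've run into a problem. There is a website where I want to create a separate item for each cell of a table, but every item comes out identical and I don't know why. Here is my code: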
from scrapy import Spider
from scrapy.selector import Selector
from petroleo.items import PetroleoItem


class PetroleoSpider(Spider):
    name = "petroleo"
    site = "http://www.glossary.oilfield.slb.com/"
    allowed_domains = [site]
    start_urls = [site + 'en/Terms.aspx?filter=sym&LookIn=term%20name&searchtype=starts%20with',]

    def parse(self, response):
        words = Selector(response).xpath("//table[@id='pagecolumns_0_columncontent_0__rptLetter_ctl00__dlTerms']//td")
        for word in words:
            item = PetroleoItem()
            if word.xpath("//table[@id='pagecolumns_0_columncontent_0__rptLetter_ctl00__dlTerms']//td/a/em").extract():
                item['title'] = word.xpath(
                    "//table[@id='pagecolumns_0_columncontent_0__rptLetter_ctl00__dlTerms']//td/a/em/text()").extract()[0]
                item['title'] += word.xpath(
                    "//table[@id='pagecolumns_0_columncontent_0__rptLetter_ctl00__dlTerms']//td/a/sub/text()").extract()[0]
            if word.xpath("//table[@id='pagecolumns_0_columncontent_0__rptLetter_ctl00__dlTerms']//td/a/i").extract():
                item['title'] = {'en': word.xpath(
                    "//table[@id='pagecolumns_0_columncontent_0__rptLetter_ctl00__dlTerms']//td/a/i/text()").extract()}
                item['title']['en'][0] += word.xpath(
                    "//table[@id='pagecolumns_0_columncontent_0__rptLetter_ctl00__dlTerms']//td/a/i/sub/text()").extract()[0]
            if word.xpath("//table[@id='pagecolumns_0_columncontent_0__rptLetter_ctl00__dlTerms']//td/a/text()").extract():
                item['title'] = {'en': word.xpath(
                    "//table[@id='pagecolumns_0_columncontent_0__rptLetter_ctl00__dlTerms']//td/a/text()").extract()}
            yield item
Answer 0 (score: 1)
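An XPath that starts with // is evaluated from the document root even when you call it on a sub-selector, so every word.xpath(...) call in your loop matches the same nodes. Make the expressions context-specific by prepending a dot, and don't repeat the //table[@id='pagecolumns_0_columncontent_0__rptLetter_ctl00__dlTerms']//td part: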
words = response.xpath("//table[@id='pagecolumns_0_columncontent_0__rptLetter_ctl00__dlTerms']//td")
for word in words:
    item = PetroleoItem()
    if word.xpath("./a/em").extract():
        item['title'] = word.xpath("./a/em/text()").extract()[0]
        item['title'] += word.xpath("./a/sub/text()").extract()[0]
    if word.xpath("./a/i").extract():
        item['title'] = {'en': word.xpath("./a/i/text()").extract()}
        item['title']['en'][0] += word.xpath("./a/i/sub/text()").extract()[0]
    if word.xpath("./a/text()").extract():
        item['title'] = {'en': word.xpath("./a/text()").extract()}
    yield item
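I don't particularly like, or fully understand, what you are trying to do inside the loop, but this should at least fix the problem you describe in the question.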