Question

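I need to extract the links from a table on a website (the table's class name is "internal"), but Scrapy always reports 0 pages crawled, even though the downloader receives a bunch of bytes.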

import scrapy


class geneDetails(scrapy.Spider):
    name = "details"

    def start_requests(self):
        urls = ['https://ecocyc.org/gene?orgid=ECOLI&id=G7688']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        details = response.xpath('//*[contains(@class,"internal")]/tbody/tr').extract()

        for det in details:
            gene_det = det.xpath('./text()').extract()

I have already tried a lot of things, but nothing worked; the code above is my latest attempt. Sorry, I am no scrapy / xpath expert.

Answer 1

In your code, you are not opening the specific "GO" section:

https://ecocyc.org/gene?orgid=ECOLI&id=G7688#tab=GO

To get this data, you need to load:

https://ecocyc.org/gene-tab?id=G7688&orgid=ECOLI&tab=GO

You can find this URL part in the page's JavaScript:

tabIds[tabIds.length] = 'GO';Y.one('#GO').setData('uri', '/gene-tab?id=G7688&orgid=ECOLI&tab=GO');
Y.one('#GO').setData('clim-reqd-p', 'true');
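
As a rough illustration of how that URL part could be pulled out programmatically, here is a minimal sketch using only the standard library. The regular expression and the helper name `find_tab_urls` are my own assumptions, based solely on the two script lines quoted above; the real page may format these calls differently.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://ecocyc.org/gene?orgid=ECOLI&id=G7688"

# The script lines quoted above, used here as sample input.
PAGE_SCRIPT = """
tabIds[tabIds.length] = 'GO';Y.one('#GO').setData('uri', '/gene-tab?id=G7688&orgid=ECOLI&tab=GO');
Y.one('#GO').setData('clim-reqd-p', 'true');
"""

# Assumed pattern: Y.one('#<tab>').setData('uri', '<path>').
# Captures the tab id and the relative path.
TAB_URI_RE = re.compile(r"Y\.one\('#(\w+)'\)\.setData\('uri',\s*'([^']+)'\)")


def find_tab_urls(script_text, base_url=BASE_URL):
    """Return {tab_id: absolute_url} for every tab URI found in the script."""
    return {
        tab: urljoin(base_url, path)
        for tab, path in TAB_URI_RE.findall(script_text)
    }


tab_urls = find_tab_urls(PAGE_SCRIPT)
print(tab_urls["GO"])
```

In a spider, the same helper could be fed `response.text` from the main gene page, and each resulting URL requested in turn.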

The next step is to parse the resulting table.
