I'm trying to scrape the page http://www.jcvi.org/charprotdb/index.cgi/l_search?terms.1.field=all&terms.1.search_text=cancer&submit=+++Search+++&sort.key=organism&sort.order=%2B with Scrapy, but something seems to be wrong: I get no data at all when I crawl it.
Here is my spider code:
import scrapy
from scrapy.selector import Selector
from scrapy_Data.items import CharProt

class CPSpider(scrapy.Spider):
    name = "CharProt"
    allowed_domains = ["jcvi.org"]
    start_urls = ["http://www.jcvi.org/charprotdb/index.cgi/l_search?terms.1.field=all&terms.1.search_text=cancer&submit=+++Search+++&sort.key=organism&sort.order=%2B"]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//*[@id="middle_content_template"]/table/tbody/tr')
        for site in sites:
            item = CharProt()
            item['protein_name'] = site.xpath('td[1]/a/text()').extract()
            item['pn_link'] = site.xpath('td[1]/a/@href').extract()
            item['organism'] = site.xpath('td[2]/a/text()').extract()
            item['organism_link'] = site.xpath('td[2]/a/@href').extract()
            item['status'] = site.xpath('td[3]/a/text()').extract()
            item['status_link'] = site.xpath('td[3]/a/@href').extract()
            item['references'] = site.xpath('td[4]/a').extract()
            item['source'] = "CharProt"
            # collection.update({"protein_name": item['protein_name']}, dict(item), upsert=True)
            yield item
And here is the log:
2016-05-28 17:25:06 [scrapy] INFO: Spider opened
2016-05-28 17:25:06 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-05-28 17:25:06 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-05-28 17:25:07 [scrapy] DEBUG: Crawled (200) <GET http://www.jcvi.org/charprotdb/index.cgi/l_search?terms.1.field=all&terms.1.search_text=cancer&submit=+++Search+++&sort.key=organism&sort.order=%2B> (referer: None)
<200 http://www.jcvi.org/charprotdb/index.cgi/l_search?terms.1.field=all&terms.1.search_text=cancer&submit=+++Search+++&sort.key=organism&sort.order=%2B>
2016-05-28 17:25:08 [scrapy] INFO: Closing spider (finished)
2016-05-28 17:25:08 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 337,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 26198,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 5, 28, 9, 25, 8, 103577),
'log_count/DEBUG': 2,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2016, 5, 28, 9, 25, 6, 55848)}
All my other spiders work fine when I run them. So can someone tell me what's wrong with my code, or is there a problem with this webpage?
Answer 0 (score: 0)
You are crawling the page, but your xpath is wrong. When you inspect the element in your browser, the dev tools show a <tbody> tag, but that tag is nowhere in the actual page source, so nothing gets extracted! Remove it from the path:
sites = sel.xpath('//*[@id="middle_content_template"]/table/tr')
That should work.
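You can verify this quickly with scrapy shell: the tbody variant should come back empty while the corrected one matches the rows. A sketch of the check (the comments describe the expected outcome, not captured output):

scrapy shell "http://www.jcvi.org/charprotdb/index.cgi/l_search?terms.1.field=all&terms.1.search_text=cancer&submit=+++Search+++&sort.key=organism&sort.order=%2B"
>>> response.xpath('//*[@id="middle_content_template"]/table/tbody/tr')  # tbody exists only in the browser DOM: expect []
>>> response.xpath('//*[@id="middle_content_template"]/table/tr')        # matches the rows in the raw source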
Edit:
Side note: extract() returns a list rather than the single element you want, so you should use the extract_first() method or extract()[0]. For example:
item['protein_name'] = site.xpath('td[1]/a/text()').extract_first()
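Putting the two fixes together (the tbody-free xpath and extract_first()), the corrected parse method would look roughly like this; it is a sketch built from the question's item fields, not tested against the live page:

def parse(self, response):
    # no <tbody>: the browser inserts it, but it is absent from the raw HTML
    sites = response.xpath('//*[@id="middle_content_template"]/table/tr')
    for site in sites:
        item = CharProt()
        # extract_first() returns a single string (or None), not a list
        item['protein_name'] = site.xpath('td[1]/a/text()').extract_first()
        item['pn_link'] = site.xpath('td[1]/a/@href').extract_first()
        item['organism'] = site.xpath('td[2]/a/text()').extract_first()
        item['organism_link'] = site.xpath('td[2]/a/@href').extract_first()
        item['status'] = site.xpath('td[3]/a/text()').extract_first()
        item['status_link'] = site.xpath('td[3]/a/@href').extract_first()
        item['references'] = site.xpath('td[4]/a').extract()
        item['source'] = "CharProt"
        yield item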
Answer 1 (score: -1)
Your xpath is wrong: you can't reach the table rows through tbody, since it isn't in the page source; go from the table straight to its rows with table//tr. The correct xpath would be:
sites = sel.xpath('//*[@id="middle_content_template"]//table//tr')
An even better xpath would be:
sites = response.xpath('//table[@class="search_results"]/tr')
As you can see in the example above, you don't need to create a selector object with Selector(response) to run xpath queries. In newer Scrapy versions a selector attribute has been added to the response class, and you can use it as described below:

response.selector.xpath(...)

or in the short form:

response.xpath(...)
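For illustration, a minimal self-contained spider using this shorthand together with the class-based xpath above; the search_results class name comes from this answer and the yielded dict keys are assumptions, neither verified against the page:

import scrapy

class CPSpider(scrapy.Spider):
    name = "CharProt"
    allowed_domains = ["jcvi.org"]
    start_urls = ["http://www.jcvi.org/charprotdb/index.cgi/l_search?terms.1.field=all&terms.1.search_text=cancer&submit=+++Search+++&sort.key=organism&sort.order=%2B"]

    def parse(self, response):
        # response.xpath(...) is shorthand for response.selector.xpath(...)
        for row in response.xpath('//table[@class="search_results"]/tr'):
            yield {
                'protein_name': row.xpath('td[1]/a/text()').extract_first(),
                'organism': row.xpath('td[2]/a/text()').extract_first(),
            }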