How can I return NaN for URLs where the XPath ".//*[@id='object']//tbody//tr//td//span//a[2]" matches no links? I tried:
def parse(self, response):
    links = response.xpath(".//*[@id='object']//tbody//tr//td//span//a[2]")
    if not links:
        item = ToyItem()
        item['link'] = 'NaN'
        item['name'] = response.url
        return item
    for link in links:
        item = ToyItem()
        item['link'] = link.xpath('@href').extract_first()
        item['name'] = response.url # <-- see here
        yield item
        list_of_dics = []
        list_of_dics.append(item)
        df = pd.DataFrame(list_of_dics)
        print(df)
        df.to_csv('/Users/user/Desktop/crawled_table.csv', index=False)
However, instead of returning (*):
'link1.com' 'NaN'
'link2.com' 'NaN'
'link3.com' 'extracted3.link.com'
I get:
'link3.com' 'extracted3.link.com'
How can I return (*)?
Answer 0 (score: 1)
You can rework this to use a Scrapy item pipeline:
from scrapy import Spider

# Assumption: ToyItem is defined in the project's items.py
from myproject.items import ToyItem


class MySpider(Spider):
    name = 'myspider'
    start_urls = ['link1', 'link2', 'link3']

    def parse(self, response):
        links = response.xpath(".//*[@id='object']//tbody//tr//td//span//a[2]")
        if not links:
            # No matching anchor on this page: emit a placeholder row
            item = ToyItem()
            item['link'] = 'NaN'
            item['name'] = response.url
            yield item
        else:
            for link in links:
                item = ToyItem()
                item['link'] = link.xpath('@href').extract_first()
                item['name'] = response.url # <-- see here
                yield item
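The key difference from the question's code is `yield item` in the no-links branch. A function that contains `yield` anywhere is a generator, and a value passed to `return` inside a generator is never handed to the consumer. The sketch below illustrates this without Scrapy; the function names, the plain dicts standing in for `ToyItem`, and the `pages` mapping are all made up for the illustration.

```python
def parse_with_return(links, url):
    """Mimics the question's parse(): `return item` inside a generator."""
    if not links:
        # In a generator this only ends iteration; the dict is not emitted.
        return {"name": url, "link": "NaN"}
    for link in links:
        yield {"name": url, "link": link}


def parse_with_yield(links, url):
    """Mimics the answer's parse(): the placeholder row is yielded."""
    if not links:
        yield {"name": url, "link": "NaN"}
    else:
        for link in links:
            yield {"name": url, "link": link}


# Hypothetical crawl results: page URL -> links found on that page
pages = {
    "link1.com": [],
    "link2.com": [],
    "link3.com": ["extracted3.link.com"],
}

lost = [item for url, links in pages.items()
        for item in parse_with_return(links, url)]
kept = [item for url, links in pages.items()
        for item in parse_with_yield(links, url)]

print(len(lost))  # 1
print(len(kept))  # 3
```

Only the `link3.com` row survives in `lost`, which matches the single row the question reports.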
Now in pipelines.py:
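import pandas as pd


class PandasPipeline:
    def open_spider(self, spider):
        # Start with an empty buffer for each crawl
        self.data = []

    def process_item(self, item, spider):
        # Store a plain dict so pandas builds one column per field
        self.data.append(dict(item))
        return item

    def close_spider(self, spider):
        # Write the CSV once, after every page has been parsed
        df = pd.DataFrame(self.data)
        print('saving dataframe')
        df.to_csv('/Users/user/Desktop/crawled_table.csv', index=False)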
And in settings.py:
ITEM_PIPELINES = {
    'myproject.pipelines.PandasPipeline': 900,
}
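To see why collecting items in a pipeline produces all three rows, it may help to trace the hook order: `open_spider` once, `process_item` once per yielded item, then `close_spider` once. The sketch below is not Scrapy code. `CollectingPipeline` and `run` are hypothetical stand-ins that call the hooks in that order, and the standard-library `csv` module replaces pandas so the example runs without extra dependencies.

```python
import csv
import io


class CollectingPipeline:
    """Hypothetical stand-in for PandasPipeline using only the stdlib."""

    def open_spider(self, spider):
        self.data = []

    def process_item(self, item, spider):
        self.data.append(dict(item))
        return item

    def close_spider(self, spider):
        # Write everything once, into an in-memory buffer instead of a file
        self.output = io.StringIO()
        writer = csv.DictWriter(self.output, fieldnames=["name", "link"])
        writer.writeheader()
        writer.writerows(self.data)


def run(pipeline, items):
    """Call the pipeline hooks in the order described above."""
    pipeline.open_spider(spider=None)
    for item in items:
        pipeline.process_item(item, spider=None)
    pipeline.close_spider(spider=None)
    return pipeline.output.getvalue()


# Hypothetical items, shaped like the ones the spider yields
items = [
    {"name": "link1.com", "link": "NaN"},
    {"name": "link2.com", "link": "NaN"},
    {"name": "link3.com", "link": "extracted3.link.com"},
]

csv_text = run(CollectingPipeline(), items)
print(csv_text)
```

Because the buffer lives on the pipeline instance rather than inside `parse`, it is not reset for each response, and the output is written once at the end instead of being overwritten per page.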