How to remove from a list of links?

Date: 2017-04-12 00:58:55

Tags: python python-3.x beautifulsoup scrapy web-crawler

I have a list of links, each of which also contains some interesting urls:

start_urls = ['link1.com', 'link2.com', 'link3.com', ..., 'linkN.com']

Using scrapy, how can I get, for every start url, the link extracted from it:

'link1.com'   'extracted1.link.com'
'link2.com'   'extracted2.link.com'
'link3.com'   'extracted3.link.com'
...
'linkN.com'    'extractedN.link.com'

Since I am new to scrapy, I tried with just a single link:

class ToySpider(scrapy.Spider):
    name = "toy"
    allowed_domains = ["https://www.example.com/"]
    start_urls = ['link1.com']


    def parse(self, response):

        for link in response.xpath(".//*[@id='object']//tbody//tr//td//span//a[2]"):
            item = ToyItem()
            item['link'] = link.xpath('@href').extract_first()
            item['interesting_link'] = link
            yield item

However, this returned:

{'link': 'extracted1.link.com', 'name': <Selector xpath=".//*[@id='object']//tbody//tr//td//span//a[2]" data='<a href="extracted1.link.com'>}

How can I perform the above for all the elements of start_urls and return the following list:

[{'link': 'extracted1.link.com', 'name': 'link1.com'},
 {'link': 'extracted2.link.com', 'name': 'link2.com'},
 {'link': 'extracted3.link.com', 'name': 'link3.com'},
 ....
 {'link': 'extractedN.link.com', 'name': 'linkN.com'}]

UPDATE

After trying @Granitosaurus's answer, which yields a placeholder for start urls that do not match:

response.xpath(".//*[@id='object']//tbody//tr//td//span//a[2]")

I did the following (*) in order to find the links without 'NaN':

def parse(self, response):
    links = response.xpath(".//*[@id='object']//tbody//tr//td//span//a[2]")
    if not links:
        item = ToyItem()
        item['link'] = 'NaN'
        item['name'] = response.url
        return item
    for links in links:
        item = ToyItem()
        item['link'] = links.xpath('@href').extract_first()
        item['name'] = response.url  # <-- see here
        yield item

list_of_dics = []
list_of_dics.append(item)
df = pd.DataFrame(list_of_dics)
print(df)
df.to_csv('/Users/user/Desktop/crawled_table.csv', index=False)
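As written, the (*) snippet calls list_of_dics.append(item) once, after the for loop has finished, so at most the last item can reach the DataFrame. A minimal sketch of collecting every row first, with plain dicts standing in for ToyItem and hard-coded (name, link) pairs standing in for the real extraction:

```python
# Plain dicts stand in for ToyItem; the (name, link) pairs are placeholders
# for whatever the spider actually extracts per start url.
rows = []
for name, link in [('link1.com', 'NaN'),
                   ('link2.com', 'NaN'),
                   ('link3.com', 'extracted3.link.com')]:
    # Append inside the loop so every row is kept, not just the last one.
    rows.append({'link': link, 'name': name})

print(rows)
```

In a real spider it is usually simpler not to build the CSV by hand at all and instead let scrapy collect the yielded items via a feed export, e.g. `scrapy crawl toy -o crawled_table.csv`.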

I got:

'link1.com'   'NaN'
'link2.com'   'NaN'
'link3.com'   'extracted3.link.com'

However, instead of returning the rows whose link is 'NaN', how can I return only:

'link3.com'   'extracted3.link.com'

1 answer:

Answer 0 (score: 1)

You can retrieve the current url your spider is crawling from the response.url attribute:
start_urls = ['link1.com', 'link2.com', 'link3.com', ...,'linkN.com']

def parse(self, response):
    links = response.xpath(".//*[@id='object']//tbody//tr//td//span//a[2]")
    if not links:
        item = ToyItem()
        item['link'] = None
        item['name'] = response.url
        # parse() contains a yield below, so it is a generator; a bare
        # `return item` would silently drop the item, so yield it instead.
        yield item
        return
    for link in links:
        item = ToyItem()
        item['link'] = link.xpath('@href').extract_first()
        item['name'] = response.url  # <-- see here
        yield item
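Following up on the UPDATE part of the question: once every yielded item carries either a real href or the 'NaN' placeholder, dropping the placeholder rows is a plain filter. A minimal sketch over dicts shaped like the spider's items, assuming 'NaN' is the string used to mark missing links:

```python
# Items as the updated parse() would yield them; 'NaN' marks start urls
# where the xpath matched nothing.
items = [
    {'link': 'NaN', 'name': 'link1.com'},
    {'link': 'NaN', 'name': 'link2.com'},
    {'link': 'extracted3.link.com', 'name': 'link3.com'},
]

# Keep only the rows whose link was actually extracted.
found = [item for item in items if item['link'] != 'NaN']

print(found)  # [{'link': 'extracted3.link.com', 'name': 'link3.com'}]
```

With the answer above, which stores None rather than 'NaN', the condition becomes `item['link'] is not None`; the pandas equivalent on the DataFrame from the update is `df.dropna(subset=['link'])`.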