How to change my scrapy spider from opening each link in start_urls to opening links from my csv

Time: 2019-01-28 19:03:02

Tags: python scrapy

Before I lose my mind, here is the situation: my spider currently opens each link in the IMDb top 250 chart and scrapes the information I need.

Now I have a csv file containing 500 links, and I need the spider to open them one by one and get the same information. But I'm a bit lost and don't know how to do it. I was thinking of changing

def parse(self,response) 

but I'm not sure how.

Here is my previous code:

import scrapy
from imdb2.items import Imdb2Item


class ThirdSpider(scrapy.Spider):
    name = "imdbtestspider"
    allowed_domains = ["imdb.com"]
    start_urls = (
        'http://www.imdb.com/chart/top',
    )

    def parse(self, response):
        links = response.xpath('//tbody[@class="lister-list"]/tr/td[@class="titleColumn"]/a/@href').extract()
        i = 1
        for link in links:
            abs_url = response.urljoin(link)
            # the rating lives in the chart row, so grab it by row position before following the link
            url_next = '//*[@id="main"]/div/span/div/div/div[2]/table/tbody/tr[' + str(i) + ']/td[3]/strong/text()'
            rating = response.xpath(url_next).extract()
            if i <= len(links):
                i = i + 1
            yield scrapy.Request(abs_url, callback=self.parse_indetail, meta={'rating': rating})

    def parse_indetail(self, response):
        item = Imdb2Item()
        # drop the trailing whitespace character after the title
        item['title'] = response.xpath('//div[@class="title_wrapper"]/h1/text()').extract()[0][:-1]
        item['production'] = response.xpath('//h4[contains(text(), "Production Co")]/following-sibling::a/text()').extract()

        return item

My code now looks like this:

import scrapy
from imdb2.items import Imdb2Item
import csv
import re
from scrapy.contrib.linkextractors import LinkExtractor


class ThirdSpider(scrapy.Spider):
    name = "imdbtestspider"
    allowed_domains = []

    with open('links.csv') as f:
        start_urls = [url.strip() for url in f.readlines()]

    def parse(self, response):
        # this should change, I guess?
        pass

    def parse_indetail(self, response):
        item = Imdb2Item()
        # drop the trailing whitespace character after the title
        item['title'] = response.xpath('//div[@class="title_wrapper"]/h1/text()').extract()[0][:-1]
        item['production'] = response.xpath('//h4[contains(text(), "Production Co")]/following-sibling::a/text()').extract()

        return item

I added the part that gets the links from the csv file, but I don't know what to change in def parse.

Thanks.

1 answer:

Answer 0 (score: 1):

Does your csv file contain the movie links themselves? In that case your code would look like this:

import scrapy
from imdb2.items import Imdb2Item


class ThirdSpider(scrapy.Spider):
    name = "imdbtestspider"

    def start_requests(self):
        # read one URL per line from the csv and request each one directly
        with open('links.csv', 'r') as f:
            for url in f.readlines():
                yield scrapy.Request(url.strip())

    def parse(self, response):
        item = Imdb2Item()
        item['title'] = response.xpath('//div[@class="title_wrapper"]/h1/text()').extract()[0][:-1]
        item['production'] = response.xpath('//h4[contains(text(), "Production Co")]/following-sibling::a/text()').extract()
        yield item
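A side note: `f.readlines()` works fine when the file is literally one bare URL per line, but if `links.csv` has a header row or extra columns, the stdlib `csv` module is more robust. A minimal sketch of a helper you could call from `start_requests` (the column name `url` and the helper name are assumptions, not from the original question):

```python
import csv
import io


def read_start_urls(csv_text, column="url"):
    """Yield non-empty, stripped URLs from a CSV that has a header row.

    `column` names the header of the column holding the links (assumed here).
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        url = row[column].strip()
        if url:
            yield url


# Demo with an in-memory CSV instead of links.csv:
sample = (
    "url\n"
    "http://www.imdb.com/title/tt0111161/\n"
    "http://www.imdb.com/title/tt0068646/\n"
)
print(list(read_start_urls(sample)))
```

Inside the spider you would open the real file and do `yield scrapy.Request(url)` for each URL the helper yields, exactly as in the answer above.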