I can't figure out how to use a csv file in a list comprehension for a scrapinghub deployment

Asked: 2019-04-26 06:27:22

Tags: python scrapy scrapinghub

I'm trying to deploy a spider to scrapinghub but can't work out how to handle the data input. I need to read IDs from a csv and append them to my start URLs as a list for the spider to crawl:

    import pkgutil

    import scrapy


    class exampleSpider(scrapy.Spider):
        name = "exampleSpider"

        # local scrapy method to extract data
        # PID = pd.read_csv('resources/PID_list.csv')

        # scrapinghub method
        csvdata = pkgutil.get_data("exampleSpider", "resources/PID_list.csv")

        start_urls = ['http://www.example.com/PID=' + str(x) for x in csvdata]

The requirements file and the pkgutil.get_data part both work, but I'm stuck on turning the data IO into a list. What is the process for converting the data call into a list comprehension?

Edit: Thanks! This got me 90% of the way there!

    import csv
    import pkgutil
    from io import StringIO

    import scrapy


    class exampleSpider(scrapy.Spider):
        name = "exampleSpider"

        # local scrapy method to extract data
        # PID = pd.read_csv('resources/PID_list.csv')

        # scrapinghub method; in Python 3, get_data() returns bytes,
        # so decode before wrapping in StringIO
        csvdata = pkgutil.get_data("exampleSpider", "resources/PID_list.csv").decode("utf-8")
        csvio = StringIO(csvdata)
        raw = csv.reader(csvio)

        start_urls = ['http://www.example.com/PID=' + str(x[0]) for x in raw]

str(x) needed to become str(x[0]) as a quick fix, because the loop was pulling the square brackets into the URL encoding, which broke the links: str(x) produced "http://www.example.com/PID=%5B'0001'%5D", but str(x[0]) strips the list brackets: "http://www.example.com/PID=0001".
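The bracket issue comes from calling str() on the whole row list rather than on its first field; a minimal sketch of the difference (the row value is illustrative):

```python
row = ["0001"]  # csv.reader yields each row as a list of field strings

# Stringifying the whole list gives its repr, brackets and quotes included;
# URL-encoding that repr is what produced the %5B ... %5D in the links
assert str(row) == "['0001']"

# Indexing first gives just the field value
assert str(row[0]) == "0001"
```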

1 answer:

Answer 0 (score: 1)

    import csv
    import pkgutil
    from io import StringIO

    import scrapy


    class exampleSpider(scrapy.Spider):
        name = "exampleSpider"

        # local scrapy method to extract data
        # PID = pd.read_csv('resources/PID_list.csv')

        # scrapinghub method; in Python 3, get_data() returns bytes,
        # so decode before wrapping in StringIO
        csvdata = pkgutil.get_data("exampleSpider", "resources/PID_list.csv").decode("utf-8")
        csvio = StringIO(csvdata)
        raw = csv.reader(csvio)

        # TODO : update code to get exact value from raw
        start_urls = ['http://www.example.com/PID=' + str(x) for x in raw]

You can use StringIO to wrap the string in a file-like object (one with a read() method) that csv.reader can handle. Hope this helps :)
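The whole pattern can be sketched outside the spider; the bytes literal below stands in for the real pkgutil.get_data() result, which returns bytes in Python 3 and must be decoded before StringIO will accept it:

```python
import csv
from io import StringIO

# Stand-in for pkgutil.get_data("exampleSpider", "resources/PID_list.csv"),
# which returns the packaged file's contents as bytes
csvdata = b"0001\n0002\n0003\n"

# Decode to str first: StringIO and csv.reader work on text, not bytes
csvio = StringIO(csvdata.decode("utf-8"))
raw = csv.reader(csvio)

# Each row x is a list of field strings; x[0] is the single PID column
start_urls = ["http://www.example.com/PID=" + x[0] for x in raw]
# start_urls == ['http://www.example.com/PID=0001',
#                'http://www.example.com/PID=0002',
#                'http://www.example.com/PID=0003']
```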