I'm trying to deploy a spider to Scrapinghub but can't figure out how to solve a data-input problem. I need to read IDs from a CSV and append them to my start URLs, building the list of pages for the spider to crawl:
class exampleSpider(scrapy.Spider):
    name = "exampleSpider"

    # local scrapy method to extract data
    # PID = pd.read_csv('resources/PID_list.csv')

    # scrapinghub method
    csvdata = pkgutil.get_data("exampleSpider", "resources/PID_list.csv")
    start_urls = ['http://www.example.com/PID=' + str(x) for x in csvdata]
The requirements file and the pkgutil.get_data part both work, but I'm stuck on turning the data I/O into a list. What is the process for converting the data call into a list comprehension?
Edit: Thanks! That got me 90% of the way there!
class exampleSpider(scrapy.Spider):
    name = "exampleSpider"

    # local scrapy method to extract data
    # PID = pd.read_csv('resources/PID_list.csv')

    # scrapinghub method
    csvdata = pkgutil.get_data("exampleSpider", "resources/PID_list.csv")
    csvio = StringIO(csvdata)
    raw = csv.reader(csvio)
    start_urls = ['http://www.example.com/PID=' + str(x[0]) for x in raw]
str(x) needed to become str(x[0]) as a quick fix, because the loop was reading the square brackets into the URL encoding, which broke the links:

str(x)

produced "http://www.example.com/PID=%5B'0001'%5D"

but str(x[0])

strips the list brackets: "http://www.example.com/PID='0001'"
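The bracket issue above can be sketched in isolation: csv.reader yields each row as a Python list, so str(row) stringifies the list literal, brackets and quotes included, and those characters then get percent-encoded in the URL. Indexing row[0] pulls out the bare field instead. (The inline CSV string here is a stand-in for the real PID_list.csv.)

```python
import csv
import io

# csv.reader returns rows as lists, even for single-column data
rows = list(csv.reader(io.StringIO("0001\n0002\n")))

print(str(rows[0]))     # "['0001']" -- list brackets leak into the URL
print(str(rows[0][0]))  # "0001"     -- the bare ID, safe to append
```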
Answer 0 (score: 1)
class exampleSpider(scrapy.Spider):
    name = "exampleSpider"

    # local scrapy method to extract data
    # PID = pd.read_csv('resources/PID_list.csv')

    # scrapinghub method
    csvdata = pkgutil.get_data("exampleSpider", "resources/PID_list.csv")
    csvio = StringIO(csvdata)
    raw = csv.reader(csvio)
    start_urls = ['http://www.example.com/PID=' + str(x) for x in raw]
You can use StringIO to turn the string into an object with a read() method, which csv.reader should be able to handle. Hope that helps :)
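One caveat worth noting: on Python 3, pkgutil.get_data returns bytes, so passing it straight to io.StringIO raises a TypeError; decoding first makes the pipeline work. A minimal sketch of the full flow, with an inline bytes literal standing in for what pkgutil.get_data would return from resources/PID_list.csv:

```python
import csv
import io

# stand-in for: pkgutil.get_data("exampleSpider", "resources/PID_list.csv")
csvdata = b"0001\n0002\n0003\n"

# decode bytes -> str before wrapping, then parse with csv.reader
csvio = io.StringIO(csvdata.decode("utf-8"))
start_urls = ['http://www.example.com/PID=' + row[0]
              for row in csv.reader(csvio)]

print(start_urls)
# ['http://www.example.com/PID=0001',
#  'http://www.example.com/PID=0002',
#  'http://www.example.com/PID=0003']
```

On Python 2 (where str is bytes) the decode step is unnecessary, which is presumably why the answer's version worked as posted.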