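I have been using Scrapinghub for quite a while. I have several spiders that each run a job every day. Every weekend I log in to collect the scraped data. So I end up opening one spider at a time, going through its seven jobs, downloading the data, then moving on to the next spider, and so on. Is there a way to get all the extracted data from a spider's finished jobs at once?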
Answer 0 (score: 1)
What I do is use the Scrapinghub Python API client, so if you are familiar with Python I would suggest using it; otherwise you can use curl... https://doc.scrapinghub.com/api/items.html#item-object
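
For the curl route, here is a rough sketch of how a request to the Items API could be put together. The endpoint `https://storage.scrapinghub.com/items`, the `format=json` query parameter, and the helper names `items_url` and `basic_auth_header` are my assumptions based on the Items API page linked above, so check them against the docs before relying on them. The API key goes in as the basic-auth username with an empty password.

```python
import base64

# Assumed endpoint, taken from the Items API documentation linked above.
ITEMS_ENDPOINT = "https://storage.scrapinghub.com/items"


def items_url(job_key, fmt="json"):
    """Build the items URL for a job key of the form 'project/spider/job'.

    Hypothetical helper; the 'format' query parameter is an assumption.
    """
    return "{}/{}?format={}".format(ITEMS_ENDPOINT, job_key, fmt)


def basic_auth_header(api_key):
    """Return an HTTP basic-auth header with the API key as the username
    and an empty password (the same thing `curl -u APIKEY:` sends)."""
    token = base64.b64encode("{}:".format(api_key).encode("ascii"))
    return {"Authorization": "Basic " + token.decode("ascii")}


if __name__ == "__main__":
    # Only builds and prints the URL; no network request is made here.
    print(items_url("180064/3/1"))
```

The URL and header can then be passed to whatever HTTP client you prefer, once per job you want to download.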
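I have a pet project that scrapes various video hosting sites for the video title, stream URL and category (depending on which scraper calls it)... It is deployed to Scrapinghub, and then I use shub's API (Python-based) to iterate over the items like a dictionary and create an .m3u playlist.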
The purpose is to aggregate all the videos I want into one playlist (in my case, played with VLC).
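
To make the playlist step concrete, here is a minimal sketch of turning item dictionaries into .m3u text. It assumes each item has `title` and `vidsrc` fields (the names used in my snippet) and that the values may arrive as single-element lists, which is common for Scrapy items. The helpers `first_value` and `to_m3u` are hypothetical names for illustration, and the `#EXTM3U` header line is something I add here because extended playlists normally start with it.

```python
def first_value(field):
    """Return the first entry of a list-valued field, or the field itself.

    Avoids stray "['...']" brackets when list values are converted to text.
    """
    if isinstance(field, (list, tuple)):
        return field[0] if field else ""
    return field


def to_m3u(items):
    """Build extended-M3U playlist text from an iterable of item dicts.

    Assumes each item has 'title' and 'vidsrc' keys.
    """
    lines = ["#EXTM3U"]
    for item in items:
        lines.append("#EXTINF:0, {}".format(first_value(item["title"])))
        lines.append(str(first_value(item["vidsrc"])))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = [{"title": ["Clip A"], "vidsrc": ["http://example.com/a.mp4"]}]
    print(to_m3u(sample))
```

Handling the list values up front means there is no need for a second pass that strips brackets out of the file afterwards.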
Here is a scrap code snippet (not my actual project app):
from __future__ import print_function
from scrapinghub import Connection
import os

# Authenticate with your Scrapinghub API key
conn = Connection('YOURAPIKEYGOESHERE')
# Job IDs have the form project/spider/job, e.g. 179923/1/1
# Renamed from `list` to avoid shadowing the built-in
project_ids = conn.project_ids()
print("PROJECTS")
print("-#-" * 30)
# Note: the [1::] slice skips the first project ID
for index, item in enumerate(project_ids[1::]):
    index = str(index)
    item = str(item)
    project = conn[item]
    pspi = project.spiders()
    jobs = project.jobs()
    for x in pspi:
        print("[" + index + "] | PROJECT ID " + item, x['id'], x['tags'])
print("-#-" * 30)
# Print the first 24 project IDs in rows of four
for start in range(0, 24, 4):
    print(project_ids[start:start + 4])
print("-#-" * 30)
project = conn['180064']  # Manually inserted project ID
print("CONNECTING 2 | {}".format(project.id))
print(project)
print("-#-" * 30)
# List the spiders in this project
pspi = project.spiders()
for x in pspi:
    print(x)
print("-#-" * 30)
# List the jobs in this project
jobs = project.jobs()
print(jobs)
for job in jobs:
    print(job)
job = project.job(u'180064/3/1')  # Manually inserted job ID
print(job)
print("ITEMS")
print("-#-" * 30)
itemCount = job.info['items_scraped']
print("Items Scraped: {}".format(itemCount))
print("-#-" * 30)
def printF():
    # On Python 2, use raw_input() instead of input()
    ipr = input("Do you wish to print? [y/n] \n")
    if ipr == "y":
        name = input("what is the name of project?\n")
        print("-#-" * 30)
        print("Printing items to m3u")
        print("-#-" * 30)
        infile = name + ".m3u"
        outfile = name + "_clean.m3u"
        # Open the playlist once and append one entry per scraped item
        with open(infile, 'a') as f:
            for item in job.items():
                f.write('#EXTINF:0, ' + str(item['title']) + '\n' + str(item['vidsrc']) + '\n')
        # Strip the "['" and "']" left over from list-valued fields
        delete_list = ["['", "']"]
        with open(infile) as fin, open(outfile, "w+") as fout:
            for line in fin:
                for word in delete_list:
                    line = line.replace(word, "")
                fout.write(line)
    else:
        print("Not printing")
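
To get at the original question (everything from the finished jobs in one go), the same client objects can be looped over instead of hard-coding one job ID. This is only a sketch: it assumes `project.jobs()` accepts a `state="finished"` filter and that each job exposes `id` and `items()`, as the objects in my snippet do. `collect_finished_items` is a hypothetical helper name, not part of the library.

```python
def collect_finished_items(project):
    """Return {job_id: [items]} for every finished job in a project.

    `project` is expected to behave like conn['<project id>'] from the
    snippet above: jobs(state=...) yields jobs with .id and .items().
    """
    collected = {}
    for job in project.jobs(state="finished"):
        # Materialise the items so they can be reused after the loop
        collected[job.id] = list(job.items())
    return collected


# Example usage (needs a real API key and project ID, so left commented out):
# from scrapinghub import Connection
# conn = Connection('YOURAPIKEYGOESHERE')
# all_items = collect_finished_items(conn['180064'])
# for job_id, items in all_items.items():
#     print(job_id, len(items))
```

Keeping the result keyed by job ID makes it easy to write one file per job, or to merge everything into a single playlist.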
Answer 1 (score: 0)
Here is my final code:
{{1}}