Scrapy: using a feed exporter for specific spiders in a project (and not for the others)

Time: 2018-11-15 11:52:18

Tags: python json scrapy

Environment: Windows 7, Python 3.6.5, Scrapy 1.5.1

Problem description:

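I have a project named project_github that contains 3 spiders: spider1, spider2, and spider3. Each of these spiders scrapes data from a website specific to that spider.

I am trying to export a JSON file automatically whenever a particular spider runs, named in the format NameOfSpider_TodaysDate.json, so that from the command line I can:

Run scrapy crawl spider1, which produces spider1_181115.json

Currently I am using item exporters in settings.py, with the following code: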

import datetime
FEED_URI = 'spider1_' + datetime.datetime.today().strftime('%y%m%d') + '.json'
FEED_FORMAT = 'json'
FEED_EXPORTERS = {'json': 'scrapy.exporters.JsonItemExporter'}
FEED_EXPORT_ENCODING = 'utf-8'

Obviously, this code always writes spider1_TodaysDate.json no matter which spider is run. Any suggestions?

1 Answer:

Answer 0: (score: 0)

The way to do this is to define the feed export settings (such as FEED_URI) in the custom_settings attribute of the specific spider that should write the export. Spider-level settings override project settings.

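So, for the spider1 class, the export settings go in that class's own custom_settings attribute rather than in settings.py.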