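I am using a Python script to scrape a page (for example, a Facebook page) and would like to pass each post on to be written to a file, similar to what the GetTwitter processor does. ExecuteScript is the first processor in my NiFi data flow. I managed to create a flow file with session.create() without any problem.
However, I am confused about how to put the data I read from Facebook into an OutputStreamCallback. Most of the examples I have seen use a Java override, but I have to use Python, and I must admit I am not very experienced with it.
I have found plenty of examples that read flow files, but none that write to them. Below is the Java version of what I want to do in Python.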
    FlowFile flowFile = session.create();
    flowFile = session.write(flowFile, new OutputStreamCallback() {
        @Override
        public void process(final OutputStream out) throws IOException {
            out.write(tweet.getBytes(StandardCharsets.UTF_8));
        }
    });
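If there is another way to do this, please point me to it. Thanks.
After applying the changes suggested by @James, I wrote the snippet below. It produces no compile errors, but it does not transfer the flow file.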
    import urllib2
    import json
    import datetime
    import csv
    import time
    import sys
    import traceback
    from org.apache.nifi.processor.io import OutputStreamCallback
    from org.python.core.util import StringUtil

    class WriteContentCallback(OutputStreamCallback):
        def __init__(self, content):
            self.content_text = content

        def process(self, outputStream):
            try:
                outputStream.write(StringUtil.toBytes(self.content_text))
            except:
                traceback.print_exc(file=sys.stdout)
                raise

    #app_id = "<FILL IN>"
    #app_secret = "<FILL IN>" # DO NOT SHARE WITH ANYONE!
    page_id = "dsssssss"
    #page_id = raw_input("Please Paste Public Page Name:")
    #access_token = app_id + "|" + app_secret
    access_token = "sdfsdfsf%sdfsdf"
    #access_token = raw_input("Please Paste Your Access Token:")

    def scrapeFacebookPageFeedStatus(page_id, access_token):
        flowFile = session.create()
        flowFile = session.write(flowFile, WriteContentCallback("Hello there this is my data"))
        flowFile = session.write()
        session.transfer(flowFile, REL_SUCCESS)
        has_next_page = False
        num_processed = 0 # keep a count on how many we've processed
        scrape_starttime = datetime.datetime.now()
        while has_next_page:
            print "Scraping %s Facebook Page: %s\n" % (page_id, scrape_starttime)
            has_next_page = False
        print "\nDone!\n%s Statuses Processed in %s" % \
            (num_processed, datetime.datetime.now() - scrape_starttime)

    if __name__ == '__main__':
        scrapeFacebookPageFeedStatus(page_id, access_token)
        flowFile = session.create()
        flowFile = session.write(flowFile, WriteContentCallback("and your data"))
        session.transfer(flowFile, REL_SUCCESS)
Here is the output from nifi-app.log:
> [root@ambari logs]# tail -100 nifi-app.log
> 2017-04-03 14:08:07,989 INFO [StandardProcessScheduler Thread-6] o.a.n.c.s.TimerDrivenSchedulingAgent Scheduled ExecuteScript[id=a62f4b97-8fd7-15cd-95b9-505e1b960805] to run with 1 threads
> 2017-04-03 14:08:08,938 INFO [Flow Service Tasks Thread-2] o.a.nifi.controller.StandardFlowService Saved flow controller org.apache.nifi.controller.FlowController@44ec5960 // Another save pending = false
> 2017-04-03 14:08:13,789 INFO [StandardProcessScheduler Thread-3] o.a.n.c.s.TimerDrivenSchedulingAgent Scheduled PutFile[id=a62f4b8e-8fd7-15cd-7517-56593deabf55] to run with 1 threads
> 2017-04-03 14:08:14,296 INFO [Flow Service Tasks Thread-2] o.a.nifi.controller.StandardFlowService Saved flow controller org.apache.nifi.controller.FlowController@44ec5960 // Another save pending = false
Answer 0 (score: 0)
Here is a simple implementation of NiFi's OutputStreamCallback in a Python ExecuteScript:
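A minimal sketch of such a callback, assuming the script runs in ExecuteScript's Jython engine, where NiFi binds `session` and `REL_SUCCESS` for you. The names `PyOutputStreamCallback` and `emit_posts` are made up for illustration, and the `ImportError` and `NameError` fallbacks are there only so the sketch can be dry-run outside NiFi.

```python
try:
    # Resolvable only inside NiFi's Jython script engine.
    from org.apache.nifi.processor.io import OutputStreamCallback
except ImportError:
    # Assumption: stand-in base class so the sketch can be dry-run outside NiFi.
    OutputStreamCallback = object


class PyOutputStreamCallback(OutputStreamCallback):
    """Writes one piece of text to the flow file content as UTF-8."""

    def __init__(self, text):
        self.text = text

    def process(self, outputStream):
        # NiFi passes in the flow file's java.io.OutputStream;
        # Jython converts the bytearray into a Java byte[].
        outputStream.write(bytearray(self.text.encode('utf-8')))


def emit_posts(session, rel_success, posts):
    """Create, fill, and transfer one flow file per post (hypothetical helper)."""
    for post in posts:
        flow_file = session.create()
        # session.write returns the updated flow file, so keep that reference.
        flow_file = session.write(flow_file, PyOutputStreamCallback(post))
        session.transfer(flow_file, rel_success)


# Inside ExecuteScript these two names are bound by NiFi; elsewhere they are
# undefined, in which case the sketch simply does nothing at import time.
try:
    nifi_session, success = session, REL_SUCCESS
except NameError:
    nifi_session = None

if nifi_session is not None:
    # Placeholder posts; a real script would pass the scraped post texts.
    emit_posts(nifi_session, success, [u'first post', u'second post'])
```

Note that the call sits at module level. ExecuteScript evaluates the script through a script engine rather than launching it as a program, so an `if __name__ == '__main__':` guard may never be true there; if the guarded code is the only place that creates and transfers flow files, nothing will leave the processor.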