I want to parse a JSON file in Spark (Scala) and then save the result as a txt file. The JSON file is stored in HDFS.
How can I parse the JSON file using Scala?
Example JSON file: metadata.json
{"ID": "ABCDEFG", "product": "computer", "review": "good"}
{"ID": "ZXCVBND", "product": "computer", "review": "bad"}
I want to extract the ID and review fields. After parsing ==>
ABCDEFG :: good
ZXCVBND :: bad
Answer 0: (score: 1)
Reading JSON in Spark is a matter of using the SparkSession's .read.json(path), which returns a DataFrame (an alias for Dataset[Row]).
You can then call .select("ID", "review") on it to get those two values as another DataFrame. On the DF that you want to write to HDFS (sorry, I haven't written to DynamoDB from Spark yet), you can call .write.format(...).save(path), where the format is json/csv/parquet/etc., whichever format you want written out to the HDFS directory.
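As a minimal sketch of that flow (the paths, the spark session name, and the choice of csv output are assumptions, not part of the answer above):

// read the JSON file from HDFS into a DataFrame (Dataset[Row])
val df = spark.read.json("hdfs:///path/to/metadata.json")
// keep only the two wanted columns as another DataFrame
val pairs = df.select("ID", "review")
// write to an HDFS directory; "csv" could be "json", "parquet", etc.
pairs.write.format("csv").save("hdfs:///path/to/output")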
Answer 1: (score: 1)
It looks straightforward: read the data from JSON, build a query with Spark SQL, and save the result to HDFS:
import spark.implicits._ // provides the String encoder needed by .map below

val path = "json/in/hdfs/data.json" // HDFS path to the JSON file
val df = spark.read.json(path)
df.show() // quick look at the parsed rows

val myDF = spark.read.json(path)
myDF.printSchema() // for debug purposes
myDF.createOrReplaceTempView("myData")

val selectedDF = spark.sql("SELECT ID, review FROM myData")
  .map(attributes => attributes.getString(0) + " :: " + attributes.getString(1))
selectedDF.write.text("hdfs://...") // writes the "ID :: review" lines as plain text files
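With the sample metadata.json above, the text files written to the output directory should contain lines like:
ABCDEFG :: good
ZXCVBND :: bad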