I am following the advice at https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/dealing_with_bad_data.html to clean up some json data.
However, the guide is outdated, and I would like to use a SparkSession
to load the dataset and parse the json:
spark.read.text('file.json').as[String].map(x => parse_json(x))
So I end up with a Dataset[String]
rather than an RDD[String].
How can I read the json lines in a Dataset?
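For context, the Dataset-based equivalent of the linked RDD recipe might look like the sketch below. This is a minimal sketch, not the guide's exact code: it assumes the json4s library (which older Spark guides commonly use for line-by-line parsing), and the file and application names are placeholders. json4s's parse throws on malformed input, so wrapping it in Try lets bad lines be filtered out:

```scala
import org.apache.spark.sql.SparkSession
import org.json4s.jackson.JsonMethods.parse
import scala.util.Try

object CleanJson {
  def main(args: Array[String]): Unit = {
    // Assumed local setup for illustration
    val spark = SparkSession.builder()
      .appName("clean-json")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Each element of the Dataset[String] is one line of the file
    val lines = spark.read.text("file.json").as[String]

    // Keep only the lines that parse as valid JSON; Try absorbs
    // the exception that parse throws on malformed input
    val valid = lines.filter(line => Try(parse(line)).isSuccess)
    valid.show()

    spark.stop()
  }
}
```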
Answer 0 (score: 0)
Define a case class with the expected structure (similar to a Java POJO) and map the input data to it. Columns are matched by name automatically, and their types are preserved. Consider person.json as:
{"name": "Narsireddy", "age": 30, "technology": "hadoop"}
Define the case class as case class Person(name: String, age: Integer, technology: String), then read the json file and map it to a Dataset of Person.
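Putting the answer's steps together, a minimal sketch (assuming a local SparkSession and that person.json is in JSON Lines format, one record per line):

```scala
import org.apache.spark.sql.SparkSession

// Case class with the expected structure; columns are matched by name
case class Person(name: String, age: Integer, technology: String)

object ReadPersonJson {
  def main(args: Array[String]): Unit = {
    // Assumed local setup for illustration
    val spark = SparkSession.builder()
      .appName("read-person-json")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // spark.read.json parses each line as one JSON record,
    // and .as[Person] maps the rows onto the case class by field name
    val people = spark.read.json("person.json").as[Person]
    people.show()

    spark.stop()
  }
}
```

Unlike the manual spark.read.text + parse approach in the question, spark.read.json does the parsing itself and places unparseable lines in a _corrupt_record column by default, so no hand-written parse step is needed.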