My JSON file below contains six rows:
[
{"events":[[{"v":"INPUT","n":"type"},{"v":"2016-08-24 14:23:12 EST","n":"est"}]],
"apps":[],
"agent":{"calls":[],"info":[{"v":"7990994","n":"agentid"},{"v":"7999994","n":"stationid"}]},
"header":[{"v":"TUSTX002LKVT1JN","n":"host"},{"v":"192.168.1.18","n":"ip"},{"v":"V740723","n":"vzid"},{"v":"16.3.16.0","n":"version"},{"v":"12","n":"cpu"},{"v":"154665","n":"seq"},{"v":"2016-08-24 14:23:17 EST","n":"est"}]
},
{"events":[[{"v":"INPUT","n":"type"},{"v":"2016-08-24 14:23:14 EST","n":"est"}]],"apps":[],"agent":{"calls":[],"info":[{"v":"7990994","n":"agentid"},{"v":"7999994","n":"stationid"}]},"header":[{"v":"TUSTX002LKVT1JN","n":"host"},{"v":"192.168.1.18","n":"ip"},{"v":"V740723","n":"vzid"},{"v":"16.3.16.0","n":"version"},{"v":"5","n":"cpu"},{"v":"154666","n":"seq"},{"v":"2016-08-24 14:23:23 EST","n":"est"}]},
{"events":[[{"v":"LOGOFF","n":"type"},{"v":"2016-08-24 14:24:04 EST","n":"est"}]],"apps":[],"agent":{"calls":[],"info":[{"v":"7990994","n":"agentid"},{"v":"7999994","n":"stationid"}]},"header":[{"v":"TUSTX002LKVT1JN","n":"host"},{"v":"192.168.1.18","n":"ip"},{"v":"V740723","n":"vzid"},{"v":"16.3.16.0","n":"version"},{"v":"0","n":"cpu"},{"v":"154667","n":"seq"},{"v":"2016-08-24 14:24:05 EST","n":"est"}]},
{"events":[],"apps":[[{"v":"ccSvcHst","n":"pname"},{"v":"7704","n":"pid"},{"v":"Old Virus Definition File","n":"title"},{"v":"O","n":"state"},{"v":"5376","n":"mem"},{"v":"0","n":"cpu"}]],"agent":{"calls":[],"info":[{"v":"7990994","n":"agentid"},{"v":"7999994","n":"stationid"}]},"header":[{"v":"TUSTX002LKVT1JN","n":"host"},{"v":"192.168.0.5","n":"ip"},{"v":"V740723","n":"vzid"},{"v":"16.3.16.0","n":"version"},{"v":"29","n":"cpu"},{"v":"154668","n":"seq"},{"v":"2016-09-25 16:57:24 EST","n":"est"}]},
{"events":[],"apps":[[{"v":"ccSvcHst","n":"pname"},{"v":"7704","n":"pid"},{"v":"Old Virus Definition File","n":"title"},{"v":"F","n":"state"},{"v":"5588","n":"mem"},{"v":"0","n":"cpu"}]],"agent":{"calls":[],"info":[{"v":"7990994","n":"agentid"},{"v":"7999994","n":"stationid"}]},"header":[{"v":"TUSTX002LKVT1JN","n":"host"},{"v":"192.168.0.5","n":"ip"},{"v":"V740723","n":"vzid"},{"v":"16.3.16.0","n":"version"},{"v":"16","n":"cpu"},{"v":"154669","n":"seq"},{"v":"2016-09-25 16:57:30 EST","n":"est"}]},
{"events":[],"apps":[[{"v":"ccSvcHst","n":"pname"},{"v":"7704","n":"pid"},{"v":"Old Virus Definition File","n":"title"},{"v":"F","n":"state"},{"v":"5588","n":"mem"},{"v":"0","n":"cpu"}]],"agent":{"calls":[],"info":[{"v":"7990994","n":"agentid"},{"v":"7999994","n":"stationid"}]},"header":[{"v":"TUSTX002LKVT1JN","n":"host"},{"v":"192.168.0.5","n":"ip"},{"v":"V740723","n":"vzid"},{"v":"16.3.16.0","n":"version"},{"v":"17","n":"cpu"},{"v":"154670","n":"seq"},{"v":"2016-09-25 16:57:36 EST","n":"est"}]}
]
The JSON corresponds to records like the following:
JSON
0
1
2
3
4
5
Required output:
Count
6
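Answer 0: (score: 1)
OK, you are in Spark, so you need to convert the JSON into a Dataset and then run the appropriate operation on it. Below I lay out the workflow from JSON to Dataset, with the steps illustrated on an example. I think answering this way is more useful, because you can see each step and then decide what to do with the data.
Input data: you have the JSON, and that is the data you start from. Next, decide which fields matter. In most cases a count is only a small part of the job, and you do not want to load fields you do not need.
Create case classes: case classes let you read the input data into typed objects. To keep things simple, suppose I have doctors who belong to departments, and the data arrives as JSON. I can use the following case classes: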
case class Department(name: String, address: String)
case class Doctor(name: String, department: Department)
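As the code above shows, I build the data model bottom-up. Your JSON has many fields whose meaning I cannot work out (for example v), so take care not to mix them up.
Get a Dataset: the following code reads the JSON into the case classes we defined: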
spark.read.json("doctorsData.json").as[Doctor]
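Two notes. spark is a SparkSession, which you need to create; the instance is named spark here, but any name works. You also need import spark.implicits._ so that the encoder required by .as[Doctor] is in scope.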
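The two notes above can be put together in a minimal sketch. The application name, the local[*] master and the object name ReadDoctors are assumptions for illustration; the case classes and the file name are the ones used above. Note that spark.read.json, with default options, expects one JSON object per line.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

case class Department(name: String, address: String)
case class Doctor(name: String, department: Department)

object ReadDoctors {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session for experimentation; on a cluster
    // the master is usually supplied by spark-submit instead.
    val spark = SparkSession.builder()
      .appName("read-doctors")
      .master("local[*]")
      .getOrCreate()

    // Brings the implicit encoders into scope; required by .as[Doctor].
    import spark.implicits._

    val doctors: Dataset[Doctor] =
      spark.read.json("doctorsData.json").as[Doctor]
    doctors.show(false)

    spark.stop()
  }
}
```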
In business! You are now in the Spark world. Just call count() on the Dataset. The following method shows how:
def recordsCount(myDataset: Dataset[Doctor]): Long = myDataset.count()
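Applied to the JSON in the question, the same steps might look like the sketch below. It assumes that n is a field name and v its value, which the question does not confirm; NameValue, Record, CountRecords and the path events.json are hypothetical names. The multiLine option is used because the file is a single JSON array spread over several lines rather than one object per line, and only the header column is loaded, in line with the advice not to read fields you do not need.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical model of one name/value pair such as {"v":"12","n":"cpu"}.
case class NameValue(n: String, v: String)
// Only the header field is modelled; the other fields are not needed for a count.
case class Record(header: Seq[NameValue])

object CountRecords {
  def recordsCount(records: Dataset[Record]): Long = records.count()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("count-json-records")
      .master("local[*]") // assumption: local run
      .getOrCreate()
    import spark.implicits._

    val records: Dataset[Record] = spark.read
      .option("multiLine", true) // the file is one multi-line JSON array
      .json("events.json")       // hypothetical path to the question's file
      .select("header")
      .as[Record]

    // Each element of the top-level array becomes one row.
    println(recordsCount(records))
    spark.stop()
  }
}
```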
Answer 1: (score: 0)
I have a file of three correctly formatted records; with Spark 2.x, read it into a DataFrame/Dataset:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
val df = spark.read
.option("multiLine", true)
.option("mode", "PERMISSIVE")
.option("inferSchema", true)
.json("/FileStore/tables/json_01.txt")
df.select("*").show(false)
df.printSchema()
df.count()
If all you need is the total count, the last line is enough.
res15: Long = 3
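One caveat with PERMISSIVE mode is that input Spark cannot parse is kept in a corrupt-record column instead of raising an error, so the count can look plausible while being wrong. A minimal sketch of a guard follows; SafeJsonCount and countJsonRecords are hypothetical names, and the check relies on the default corrupt-record column name _corrupt_record, which is an assumption if that option has been changed.

```scala
import org.apache.spark.sql.SparkSession

object SafeJsonCount {
  // Hypothetical helper: counts records, but fails loudly when Spark
  // could not parse some of the input as JSON.
  def countJsonRecords(spark: SparkSession, path: String): Long = {
    val df = spark.read
      .option("multiLine", true)
      .json(path)
    // In PERMISSIVE mode (the default), malformed input ends up in
    // the _corrupt_record column of the inferred schema.
    require(
      !df.columns.contains("_corrupt_record"),
      s"Could not parse $path as JSON"
    )
    df.count()
  }
}
```

With the spark session from the answer above, this would be called as SafeJsonCount.countJsonRecords(spark, "/FileStore/tables/json_01.txt").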