PySpark XML to JSON with time series data

Asked: 2018-04-22 16:06:01

Tags: python apache-spark pyspark spark-dataframe

I have nearly 500,000 XML files containing time series data, each about 2-3 MB and each holding roughly 10k rows of time series data. The idea is to convert the XML to JSON for each unique ID. However, the time series data for each ID needs to be broken into batches of 10 rows, converted to JSON, and written to a NoSQL database. Originally, the code iterated over one monolithic DataFrame per ID, stepping through it 10 rows at a time, and then wrote the documents to the DB.

import pandas as pd

def resample_idx(X, resample_rate):
    # yield consecutive slices of `resample_rate` rows
    for idx in range(0, len(X), resample_rate):
        yield X.iloc[idx:idx + resample_rate, :]

# Batch documents: one dict per batch of 10 rows
for idx, df_batch in enumerate(resample_idx(df, 10)):
    dict_ = {}
    dict_['id'] = soup.find('id').contents[0]
    dict_['data'] = [v for k, v in pd.DataFrame.to_dict(df_batch.T).items()]
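
For example, a quick toy run (hypothetical 25-row frame, just to illustrate the batching) shows that the generator yields consecutive slices, with the last batch holding the remainder:

sample = pd.DataFrame({'A': range(25), 'B': range(25)})
print([len(batch) for batch in resample_idx(sample, 10)])  # [10, 10, 5]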

A sample of the resulting JSON documents looks like this:

{'id': '123456A',
 'data': [{'A': 251.23,
           'B': 130.56,
           'dtim': Timestamp('2011-03-24 11:18:13.350000')
          },
          {
           'A': 253.23,
           'B': 140.56,
           'dtim': Timestamp('2011-03-24 11:19:21.310000')
          },
          .........
         ]
},
{'id': '123593X',
 'data': [{'A': 641.13,
           'B': 220.51,
           'C': 10.45,
           'dtim': Timestamp('2011-03-26 12:11:13.350000')
          },
          {
           'A': 153.25,
           'B': 810.16,
           'C': 12.5,
           'dtim': Timestamp('2011-03-26 12:11:13.310000')
          },
          .........
         ]
}

This works for small samples, but it quickly becomes impractical when creating the batches at this scale. Hence the wish to replicate it in Spark. My Spark experience is limited, but here is what I have tried:

First, load all the time series data for all IDs:

df = sqlContext.read.format("com.databricks.spark.xml").options(rowTag='log').load("dbfs:/mnt/timedata/")

XML Schema

 |-- _id: string (nullable = true)   
 |-- collect_list(TimeData): array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- data: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- ColNames: string (nullable = true)
 |    |    |-- Units: string (nullable = true)

Query to get the Spark DataFrame:

d = df.select("_id", "TimeData.data", "TimeData.ColNames")

Current Spark DataFrame:

+--------------------+--------------------+--------------------+
|                id  |                data|            ColNames|
+--------------------+--------------------+--------------------+
|123456A             |[2011-03-24 11:18...|dTim,A,B            |
|123456A             |[2011-03-24 11:19...|dTim,A,B            |
|123593X             |[2011-03-26 12:11...|dTim,A,B,C          |
|123593X             |[2011-03-26 12:11...|dTim,A,B,C          |
+--------------------+--------------------+--------------------+

Expected Spark DataFrame:

+--------------------+--------------------+----------+----------+
|                id  |               dTime|         A|         B|
+--------------------+--------------------+----------+----------+
|123456A             |2011-03-24 11:18... |    251.23|    130.56|
|123456A             |2011-03-24 11:19... |    253.23|    140.56|
+--------------------+--------------------+----------+----------+

+--------------------+--------------------+----------+----------+----------+
|                id  |               dTime|         A|         B|         C|
+--------------------+--------------------+----------+----------+----------+
|123593X             |2011-03-26 12:11... |    641.13|    220.51|     10.45|
|123593X             |2011-03-26 12:11... |    153.25|    810.16|      12.5|
+--------------------+--------------------+----------+----------+----------+

I am only showing data for two timestamps here, but how can I turn the DataFrame above into batched JSON files for every n rows (for each id), similar to how it was done with Pandas above? The initial thought was to do a groupBy and apply a UDF to each ID? The output would look like the JSON structure above.

XML structure:

<log>
   <id>"ABC"</id>
   <TimeData>
      <colNames>dTim,colA,colB,colC,</colNames>
      <data>2011-03-24T11:18:13.350Z,0.139,38.988,0,110.307</data>
      <data>2011-03-24T11:18:43.897Z,0.138,39.017,0,110.307</data>
  </TimeData>
</log>

Note that the number of colNames is not fixed per ID; it ranges from 5 to 30 depending on the data sources collected for that ID.

2 Answers:

Answer 0 (score: 2):

Based on this information, this could be a solution. Unfortunately my Python is a little rusty, but all the Scala functions here should have equivalents.

// Assume nth is based on dTim ordering
val windowSpec = Window
  .partitionBy($"_id")
  .orderBy($"dTim".desc)

val nthRow = 2  // define the nth item to be fetched

df.select(
    $"_id",
    $"TimeData.data".getItem(0).getItem(0).cast(TimestampType).alias("dTim"),
    $"TimeData.data".getItem(0).getItem(1).cast(DoubleType).alias("A"),
    $"TimeData.data".getItem(0).getItem(2).cast(DoubleType).alias("B"),
    $"TimeData.data".getItem(0).getItem(3).cast(DoubleType).alias("C")
  )
  .withColumn("n", row_number().over(windowSpec))
  .filter(col("n") === nthRow)
  .drop("n")
  .show()

This will output something like:
+-------+--------------------+------+------+-----+
|    _id|                dTim|     A|     B|    C|
+-------+--------------------+------+------+-----+
|123456A|2011-03-24 11:18:...|251.23|130.56| null|
|123593X|2011-03-26 12:11:...|641.13|220.51|10.45|
+-------+--------------------+------+------+-----+

I will improve the answer if I learn more.
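
For reference, a rough PySpark rendering of the snippet above could look like this (an untested sketch that assumes the same DataFrame df and the same nested TimeData layout):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

window_spec = Window.partitionBy("_id").orderBy(F.col("dTim").desc())
nth_row = 2  # the nth item to fetch

(df.select(
        F.col("_id"),
        F.col("TimeData.data").getItem(0).getItem(0).cast("timestamp").alias("dTim"),
        F.col("TimeData.data").getItem(0).getItem(1).cast("double").alias("A"),
        F.col("TimeData.data").getItem(0).getItem(2).cast("double").alias("B"),
        F.col("TimeData.data").getItem(0).getItem(3).cast("double").alias("C"))
    .withColumn("n", F.row_number().over(window_spec))
    .filter(F.col("n") == nth_row)
    .drop("n")
    .show())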

Update

I like this puzzle, so if I understand the problem correctly, this could be a solution:

I created 3 XML files, each with 2 data records, covering 2 distinct IDs in total.

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.TimestampType
import spark.implicits._

val df = spark
  .sqlContext
  .read
  .format("com.databricks.spark.xml")
  .option("rowTag", "log")
  .load("src/main/resources/xml")


// Could be computationally heavy; maybe cache df first if possible, otherwise run it on a sample, or hardcode the possible columns
val colNames = df
  .select(explode(split($"TimeData.colNames",",")).as("col"))
  .distinct()
  .filter($"col" =!= lit("dTim") && $"col" =!= "")
  .collect()
  .map(_.getString(0))
  .toList
  .sorted

// or list all possible columns
//val colNames = List("colA", "colB", "colC")


// In the XML, colNames and data are comma-separated strings that have to be split. This could be done with the SQL split function, but this UDF maps the columns to the correct fields
def mapColsToData = udf((cols:String, data:Seq[String]) =>
  if(cols == null || data == null) Seq.empty[Map[String, String]]
  else {
    data.map(str => (cols.split(",") zip str.split(",")).toMap)
  }
)

// The result of this step is 1 record per data point across all the XMLs. Each data record is a key->value map of colName -> value
val denorm = df.select($"id", explode(mapColsToData($"TimeData.colNames", $"TimeData.data")).as("data"))

denorm.show(false)

Output:

+-------+-------------------------------------------------------------------------------+
|id     |data                                                                           |
+-------+-------------------------------------------------------------------------------+
|123456A|Map(dTim -> 2011-03-24T11:18:13.350Z, colA -> 0.139, colB -> 38.988, colC -> 0)|
|123456A|Map(dTim -> 2011-03-24T11:18:43.897Z, colA -> 0.138, colB -> 39.017, colC -> 0)|
|123593X|Map(dTim -> 2011-03-26T11:20:13.350Z, colA -> 1.139, colB -> 28.988)           |
|123593X|Map(dTim -> 2011-03-26T11:20:43.897Z, colA -> 1.138, colB -> 29.017)           |
|123456A|Map(dTim -> 2011-03-27T11:18:13.350Z, colA -> 0.129, colB -> 35.988, colC -> 0)|
|123456A|Map(dTim -> 2011-03-27T11:18:43.897Z, colA -> 0.128, colB -> 35.017, colC -> 0)|
+-------+-------------------------------------------------------------------------------+
// pull the timestamp out of the data map into its own column (the other values stay in the map)
val columized = denorm.select(
  $"id",
  $"data.dTim".cast(TimestampType).alias("dTim"),
  $"data"
)

columized.show()

Output:

+-------+--------------------+--------------------+
|     id|                dTim|                data|
+-------+--------------------+--------------------+
|123456A|2011-03-24 12:18:...|Map(dTim -> 2011-...|
|123456A|2011-03-24 12:18:...|Map(dTim -> 2011-...|
|123593X|2011-03-26 12:20:...|Map(dTim -> 2011-...|
|123593X|2011-03-26 12:20:...|Map(dTim -> 2011-...|
|123456A|2011-03-27 13:18:...|Map(dTim -> 2011-...|
|123456A|2011-03-27 13:18:...|Map(dTim -> 2011-...|
+-------+--------------------+--------------------+
// create window over which to resample
val windowSpec = Window
  .partitionBy($"id")
  .orderBy($"dTim".desc)

val resampleRate = 2

// add batchId based on the resample rate, then group by id and batch and collect the data maps into one list per batch
val batched = columized
  .withColumn("batchId", floor((row_number().over(windowSpec) - lit(1)) / lit(resampleRate)))
  .groupBy($"id", $"batchId")
  .agg(collect_list($"data").as("data"))
  .drop("batchId")

batched.show(false)

Output:

+-------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|id     |data                                                                                                                                                              |
+-------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|123593X|[Map(dTim -> 2011-03-26T11:20:43.897Z, colA -> 1.138, colB -> 29.017), Map(dTim -> 2011-03-26T11:20:13.350Z, colA -> 1.139, colB -> 28.988)]                      |
|123456A|[Map(dTim -> 2011-03-27T11:18:43.897Z, colA -> 0.128, colB -> 35.017, colC -> 0), Map(dTim -> 2011-03-27T11:18:13.350Z, colA -> 0.129, colB -> 35.988, colC -> 0)]|
|123456A|[Map(dTim -> 2011-03-24T11:18:43.897Z, colA -> 0.138, colB -> 39.017, colC -> 0), Map(dTim -> 2011-03-24T11:18:13.350Z, colA -> 0.139, colB -> 38.988, colC -> 0)]|
+-------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+
// Store as 1 huge json file (drop the repartition if you can handle multiple json files; better for the master as well)
batched.repartition(1).write.mode(SaveMode.Overwrite).json("/tmp/xml")

Output JSON:

{"id":"123593X","data":[{"dTim":"2011-03-26T12:20:43.897+01:00","colA":"1.138","colB":"29.017"},{"dTim":"2011-03-26T12:20:13.350+01:00","colA":"1.139","colB":"28.988"}]}
{"id":"123456A","data":[{"dTim":"2011-03-27T13:18:43.897+02:00","colA":"0.128","colB":"35.017","colC":"0"},{"dTim":"2011-03-27T13:18:13.350+02:00","colA":"0.129","colB":"35.988","colC":"0"}]}
{"id":"123456A","data":[{"dTim":"2011-03-24T12:18:43.897+01:00","colA":"0.138","colB":"39.017","colC":"0"},{"dTim":"2011-03-24T12:18:13.350+01:00","colA":"0.139","colB":"38.988","colC":"0"}]}

Answer 1 (score: 1):

Here is another way to do it that does not rely on hard-coding the column names. Basically, the idea is to explode the data and ColNames columns to obtain a 'melted' DataFrame, which can then be pivoted into the form you want:

import pyspark.sql.functions as f

# define a function that processes elements of the rdd
# underlying the DF to get a melted RDD
def process(row, cols):
    """cols is the list of target columns to explode"""
    row = row.asDict()
    exploded = [[row['id']] + list(elt) for elt in zip(*[row[col] for col in cols])]
    return exploded


# Now split ColNames:
df=df.withColumn('col_split', f.split('ColNames',","))

# define target cols to explode, each element of each col 
# can be of different length
cols=['data', 'col_split']

# apply function and flatmap the results to get melted RDD/DF
df=df.select(['id']+cols).rdd\
    .flatMap(lambda row: process(row, cols))\
    .toDF(schema=['id', 'value', 'name'])

# Pivot to get the required form
df.groupby('id').pivot('name').agg(f.max('value')).show()