Decode a column and extract it into several columns using PySpark

Asked: 2018-08-16 21:21:26

Tags: pyspark pyspark-sql

Given: I have the following PySpark dataframe:

s_df.show(10)

+-------------+--------------------+-------+--------------------+
|    timestamp|               value|quality|          attributes|
+-------------+--------------------+-------+--------------------+
|1506846688201|eyJGbG9vcl9OdW1iZ...|      3|[WrappedArray(0.3...|
|1506846714421|eyJGbG9vcl9OdW1iZ...|      3|[WrappedArray(0.3...|
|1506853046041|eyJGbG9vcl9OdW1iZ...|      3|[WrappedArray(0.3...|
|1506853069411|eyJGbG9vcl9OdW1iZ...|      3|[WrappedArray(0.3...|
|1506853175701|eyJGbG9vcl9OdW1iZ...|      3|[WrappedArray(0.3...|
|1506853278721|eyJWYWx1ZSI6ICJOQ...|      3|[WrappedArray(0.3...|
|1506853285741|eyJWYWx1ZSI6ICJOQ...|      3|[WrappedArray(0.3...|
|1506853313701|eyJWYWx1ZSI6ICJOQ...|      3|[WrappedArray(0.3...|
|1506856544461|eyJJbnNlcnRUaW1lI...|      3|[WrappedArray(0.3...|
|1506856563751|eyJJbnNlcnRUaW1lI...|      3|[WrappedArray(0.3...|
+-------------+--------------------+-------+--------------------+
only showing top 10 rows

Goal: I want to decode the value column and extract its data into a dataframe like this:

  Counter  Duration   EventEndTime  ...   Floor_Number     InsertTime Value
0      1.0      2790  1506846690991  ...             NA  1507645527691     0
0      1.0      2760  1506846717181  ...             NA  1507645530751     0
0      1.0      5790  1506853051831  ...             NA  1509003670478    NA
0      1.0      6060  1506853075471  ...             NA  1509003671231    NA
0      1.0      3480  1506853179181  ...             NA  1509003671935    NA
0      1.0      2760  1506853281481  ...             NA  1509004002809    NA
0      1.0      3030  1506853288771  ...             NA  1509004003249    NA
0      1.0      2790  1506853316491  ...             NA  1509004004038    NA
0      1.0      3510  1506856547971  ...             NA  1509003922437    NA
0      1.0      3810  1506856567561  ...             NA  1509003923116    NA 

Difficulty: I can do the decoding, but I cannot extract the resulting dict into columns in PySpark. I ended up doing that part in pandas, but I would like to do everything in PySpark.

My attempt

I decode the value column with the following UDF (a built-in alternative is sketched after the output below):

import base64
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def decode_values(x):
    # b64decode returns bytes in Python 3, so decode back to a UTF-8 string
    return base64.b64decode(x).decode('utf-8')
udf_myFunction = udf(decode_values, StringType())
result = s_df.withColumn('value', udf_myFunction('value'))

which gives:

result.show(10)

+-------------+--------------------+-------+--------------------+
|    timestamp|               value|quality|          attributes|
+-------------+--------------------+-------+--------------------+
|1506846688201|{"Floor_Number": ...|      3|[WrappedArray(0.3...|
|1506846714421|{"Floor_Number": ...|      3|[WrappedArray(0.3...|
|1506853046041|{"Floor_Number": ...|      3|[WrappedArray(0.3...|
|1506853069411|{"Floor_Number": ...|      3|[WrappedArray(0.3...|
|1506853175701|{"Floor_Number": ...|      3|[WrappedArray(0.3...|
|1506853278721|{"Value": "NA", "...|      3|[WrappedArray(0.3...|
|1506853285741|{"Value": "NA", "...|      3|[WrappedArray(0.3...|
|1506853313701|{"Value": "NA", "...|      3|[WrappedArray(0.3...|
|1506856544461|{"InsertTime": 15...|      3|[WrappedArray(0.3...|
|1506856563751|{"InsertTime": 15...|      3|[WrappedArray(0.3...|
+-------------+--------------------+-------+--------------------+
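
(As an aside, I think the decode step alone could also be done with Spark's built-in unbase64 function instead of a Python UDF; this is only a minimal sketch I have not fully verified on my data:)

from pyspark.sql.functions import unbase64, col

# unbase64 returns a binary column, so cast it back to a string
result = s_df.withColumn('value', unbase64(col('value')).cast('string'))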

The decoded value column is now a JSON string (a dict), so I did the extraction in pandas:

import json
import pandas as pd
from pandas.io.json import json_normalize

result2 = result.toPandas()
final_df = result2.value.apply(json.loads).apply(json_normalize).pipe(lambda x: pd.concat(x.values))
final_df.head()

  Counter  Duration   EventEndTime  ...   Floor_Number     InsertTime Value
0      1.0      2790  1506846690991  ...             NA  1507645527691     0
0      1.0      2760  1506846717181  ...             NA  1507645530751     0
0      1.0      5790  1506853051831  ...             NA  1509003670478    NA
0      1.0      6060  1506853075471  ...             NA  1509003671231    NA
0      1.0      3480  1506853179181  ...             NA  1509003671935    NA
0      1.0      2760  1506853281481  ...             NA  1509004002809    NA
0      1.0      3030  1506853288771  ...             NA  1509004003249    NA
0      1.0      2790  1506853316491  ...             NA  1509004004038    NA
0      1.0      3510  1506856547971  ...             NA  1509003922437    NA
0      1.0      3810  1506856567561  ...             NA  1509003923116    NA
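
(Since the decoded dicts appear to be flat, I believe the same flattening could be done a little more simply in pandas, though I have only spot-checked this:)

import json
import pandas as pd

# Parse each JSON string and build one DataFrame from the resulting list of flat dicts
final_df = pd.DataFrame(result2['value'].apply(json.loads).tolist())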

I then tried to write another UDF to do the extraction, but it does not work (I suspect because a row-level UDF receives and returns a single value, not a whole pandas Series or DataFrame):

def convert_dict_to_columns(x):
    df_out = x.apply(json.loads).apply(json_normalize).pipe(lambda y: pd.concat(y.values))
    return df_out

# Convert the dict to columns
udf_convert_d2c = udf(convert_dict_to_columns, StringType())
final_result = result.withColumn('value', udf_convert_d2c('value'))

Any ideas on how I can do this in a PySpark-y way?

Here is my minimal working code:

import pandas as pd
df = pd.DataFrame(columns=['Timestamp', 'value'])
df.loc[:, 'Timestamp'] = [1501891200, 1501891200, 1501891200, 1501891200, 1501891200, 1501891200]
df.loc[:, 'value'] = [{"Floor_Number": "NA", "Value": 0.0,  "InsertTime": 1507645527691, "EventStartTime": 1506846688201},
                      {"Floor_Number": "NA", "Value": 0.0,  "InsertTime": 1507645527691, "EventStartTime": 1506846688201},
                      {"Floor_Number": "NA", "Value": 0.0,  "InsertTime": 1507645527691, "EventStartTime": 1506846688201},
                      {"Floor_Number": "NA", "Value": 0.0,  "InsertTime": 1507645527691, "EventStartTime": 1506846688201},
                      {"Floor_Number": "NA", "Value": 0.0,  "InsertTime": 1507645527691, "EventStartTime": 1506846688201},
                      {"Floor_Number": "NA", "Value": 0.0,  "InsertTime": 1507645527691, "EventStartTime": 1506846688201}]

s_df = spark.createDataFrame(df)
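
For reference, the direction I have been exploring for a pure-PySpark solution on the original base64-encoded dataframe is to combine the unbase64 decode sketched above with from_json and an explicit schema, roughly as below. The field list here is only a partial, hypothetical schema (the real payload also has Counter, Duration, EventEndTime, ...), and I have not got this working end to end:

from pyspark.sql.functions import unbase64, from_json, col
from pyspark.sql.types import StructType, StructField, StringType, LongType

# Hypothetical partial schema for the JSON stored in the value column
value_schema = StructType([
    StructField('Floor_Number', StringType()),
    StructField('Value', StringType()),
    StructField('InsertTime', LongType()),
    StructField('EventStartTime', LongType()),
])

# Decode base64 -> JSON string, parse it, then expand the struct into columns
decoded = s_df.withColumn('value', unbase64(col('value')).cast('string'))
extracted = decoded.withColumn('parsed', from_json(col('value'), value_schema)) \
                   .select('timestamp', 'quality', 'attributes', 'parsed.*')
extracted.show(10)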

0 Answers:

There are no answers yet.