Given: I have the following pySpark dataframe:
s_df.show(10)
+-------------+--------------------+-------+--------------------+
| timestamp| value|quality| attributes|
+-------------+--------------------+-------+--------------------+
|1506846688201|eyJGbG9vcl9OdW1iZ...| 3|[WrappedArray(0.3...|
|1506846714421|eyJGbG9vcl9OdW1iZ...| 3|[WrappedArray(0.3...|
|1506853046041|eyJGbG9vcl9OdW1iZ...| 3|[WrappedArray(0.3...|
|1506853069411|eyJGbG9vcl9OdW1iZ...| 3|[WrappedArray(0.3...|
|1506853175701|eyJGbG9vcl9OdW1iZ...| 3|[WrappedArray(0.3...|
|1506853278721|eyJWYWx1ZSI6ICJOQ...| 3|[WrappedArray(0.3...|
|1506853285741|eyJWYWx1ZSI6ICJOQ...| 3|[WrappedArray(0.3...|
|1506853313701|eyJWYWx1ZSI6ICJOQ...| 3|[WrappedArray(0.3...|
|1506856544461|eyJJbnNlcnRUaW1lI...| 3|[WrappedArray(0.3...|
|1506856563751|eyJJbnNlcnRUaW1lI...| 3|[WrappedArray(0.3...|
+-------------+--------------------+-------+--------------------+
only showing top 10 rows
Goal: I want to decode the value column and extract the data into a dataframe that looks like this:
Counter Duration EventEndTime ... Floor_Number InsertTime Value
0 1.0 2790 1506846690991 ... NA 1507645527691 0
0 1.0 2760 1506846717181 ... NA 1507645530751 0
0 1.0 5790 1506853051831 ... NA 1509003670478 NA
0 1.0 6060 1506853075471 ... NA 1509003671231 NA
0 1.0 3480 1506853179181 ... NA 1509003671935 NA
0 1.0 2760 1506853281481 ... NA 1509004002809 NA
0 1.0 3030 1506853288771 ... NA 1509004003249 NA
0 1.0 2790 1506853316491 ... NA 1509004004038 NA
0 1.0 3510 1506856547971 ... NA 1509003922437 NA
0 1.0 3810 1506856567561 ... NA 1509003923116 NA
Difficulty: I can do the decoding, but I cannot extract the dict into columns in pySpark. I ended up doing it in pandas, but I would like to do everything in pySpark.
My attempt:
I decode the value column using the following udf:
import base64
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def decode_values(x):
    # b64decode returns bytes, so decode back to str for the StringType udf
    return base64.b64decode(x).decode('utf-8')
udf_myFunction = udf(decode_values, StringType())
result = s_df.withColumn('value', udf_myFunction('value'))
and I get the following:
result.show(10)
+-------------+--------------------+-------+--------------------+
| timestamp| value|quality| attributes|
+-------------+--------------------+-------+--------------------+
|1506846688201|{"Floor_Number": ...| 3|[WrappedArray(0.3...|
|1506846714421|{"Floor_Number": ...| 3|[WrappedArray(0.3...|
|1506853046041|{"Floor_Number": ...| 3|[WrappedArray(0.3...|
|1506853069411|{"Floor_Number": ...| 3|[WrappedArray(0.3...|
|1506853175701|{"Floor_Number": ...| 3|[WrappedArray(0.3...|
|1506853278721|{"Value": "NA", "...| 3|[WrappedArray(0.3...|
|1506853285741|{"Value": "NA", "...| 3|[WrappedArray(0.3...|
|1506853313701|{"Value": "NA", "...| 3|[WrappedArray(0.3...|
|1506856544461|{"InsertTime": 15...| 3|[WrappedArray(0.3...|
|1506856563751|{"InsertTime": 15...| 3|[WrappedArray(0.3...|
+-------------+--------------------+-------+--------------------+
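As an aside, the decoding step itself may not need a Python udf: pyspark.sql.functions has a built-in unbase64 that returns the decoded bytes, which can be cast back to a string. A minimal sketch of that alternative, producing the same value column:

from pyspark.sql.functions import unbase64

# unbase64 yields binary, so cast the decoded bytes back to a string column
result = s_df.withColumn('value', unbase64('value').cast('string'))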
The value column now shows up as a dict, so I do the following in pandas:
import json
import pandas as pd
from pandas.io.json import json_normalize
result2 = result.toPandas()
final_df = result2.value.apply(json.loads).apply(json_normalize).pipe(lambda x: pd.concat(x.values))
final_df.head()
Counter Duration EventEndTime ... Floor_Number InsertTime Value
0 1.0 2790 1506846690991 ... NA 1507645527691 0
0 1.0 2760 1506846717181 ... NA 1507645530751 0
0 1.0 5790 1506853051831 ... NA 1509003670478 NA
0 1.0 6060 1506853075471 ... NA 1509003671231 NA
0 1.0 3480 1506853179181 ... NA 1509003671935 NA
0 1.0 2760 1506853281481 ... NA 1509004002809 NA
0 1.0 3030 1506853288771 ... NA 1509004003249 NA
0 1.0 2790 1506853316491 ... NA 1509004004038 NA
0 1.0 3510 1506856547971 ... NA 1509003922437 NA
0 1.0 3810 1506856567561 ... NA 1509003923116 NA
I tried to create another udf function, but it doesn't work:
def convert_dict_to_columns(x):
    df_out = x.apply(json.loads).apply(json_normalize).pipe(lambda y: pd.concat(y.values))
    return df_out

# Convert the dict to columns
udf_convert_d2c = udf(convert_dict_to_columns, StringType())
final_result = result.withColumn('value', udf_convert_d2c('value'))
Any ideas how I can do this in a pySpark-y way?
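For reference, one possible pure-pySpark sketch: a Python udf receives a single cell value (here a str) per row, not a pandas Series, so the .apply chain above has nothing to operate on. Instead, from_json (available since Spark 2.1) can parse the decoded JSON string against an explicit schema, and the resulting struct can be expanded into top-level columns. The field list below is assumed from the truncated sample output above, so extend it with the remaining keys:

from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, LongType

# Schema assumed from the columns visible in the sample output; adjust fields/types as needed
value_schema = StructType([
    StructField('Counter', DoubleType()),
    StructField('Duration', LongType()),
    StructField('EventEndTime', LongType()),
    StructField('EventStartTime', LongType()),
    StructField('Floor_Number', StringType()),
    StructField('InsertTime', LongType()),
    StructField('Value', StringType()),
])

# Parse the decoded JSON string and expand the struct fields into columns
final_result = (result
                .withColumn('parsed', from_json(col('value'), value_schema))
                .select('timestamp', 'quality', 'parsed.*'))

final_result.show() should then give one column per JSON key, similar to the pandas output above.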
Here is my minimal working code:
import pandas as pd
df = pd.DataFrame(columns=['Timestamp', 'value'])
df.loc[:, 'Timestamp'] = [1501891200, 1501891200, 1501891200, 1501891200, 1501891200, 1501891200]
df.loc[:, 'value'] = [{"Floor_Number": "NA", "Value": 0.0, "InsertTime": 1507645527691, "EventStartTime": 1506846688201},
{"Floor_Number": "NA", "Value": 0.0, "InsertTime": 1507645527691, "EventStartTime": 1506846688201},
{"Floor_Number": "NA", "Value": 0.0, "InsertTime": 1507645527691, "EventStartTime": 1506846688201},
{"Floor_Number": "NA", "Value": 0.0, "InsertTime": 1507645527691, "EventStartTime": 1506846688201},
{"Floor_Number": "NA", "Value": 0.0, "InsertTime": 1507645527691, "EventStartTime": 1506846688201},
{"Floor_Number": "NA", "Value": 0.0, "InsertTime": 1507645527691, "EventStartTime": 1506846688201}]
s_df = spark.createDataFrame(df)
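Note that in this minimal example the value column holds Python dicts, whereas the real data holds Base64-encoded JSON strings. If the example should mirror the real column, the dicts can be encoded before createDataFrame is called; a small sketch using only the standard library:

import base64
import json

# Encode each dict as a Base64 JSON string so the toy data matches the real 'value' column
df.loc[:, 'value'] = [base64.b64encode(json.dumps(v).encode('utf-8')).decode('utf-8')
                      for v in df['value']]
s_df = spark.createDataFrame(df)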