Base64 decoding of a DataFrame

Date: 2019-05-31 20:53:57

Tags: scala apache-spark dataframe pyspark base64

I have an encoded DataFrame, and I managed to decode it with the following PySpark code. Is there a simple way, in Scala or PySpark, to add the decoded result as an extra column on the DataFrame itself?

    import base64
    import numpy as np

    df = spark.read.parquet("file_path")
    # Pull the first row to the driver and base64-decode its column2 value
    # (base64.decodestring is deprecated; b64decode is the modern equivalent)
    encodedColumn = base64.b64decode(df.take(1)[0].column2)
    # Interpret the decoded bytes as little-endian 32-bit floats
    t1 = np.frombuffer(encodedColumn, dtype='<f4')

I looked at several similar questions, but I could not get them to work.

1 Answer:

Answer 0 (score: 1)

You have the base64 and unbase64 functions available.

http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=streaming#pyspark.sql.functions.base64

You can do the following:

    from pyspark.sql.functions import unbase64, base64

    got = spark.createDataFrame([(1, "Jon"), (2, "Danny"), (3, "Tyrion")], ("id", "name"))
    got.show()

+---+------+
| id|  name|
+---+------+
|  1|   Jon|
|  2| Danny|
|  3|Tyrion|
+---+------+

    encoded_got = got.withColumn('encoded_base64', base64(got.name))
    encoded_got.show()

+---+------+--------------+
| id|  name|encoded_base64|
+---+------+--------------+
|  1|   Jon|          Sm9u|
|  2| Danny|      RGFubnk=|
|  3|Tyrion|      VHlyaW9u|
+---+------+--------------+

    decoded_got = encoded_got.withColumn('decoded_base64', unbase64(encoded_got.encoded_base64).cast("string"))
    # unbase64 returns binary, so cast("string") is needed to get readable text back
    decoded_got.show()


+---+------+--------------+--------------+
| id|  name|encoded_base64|decoded_base64|
+---+------+--------------+--------------+
|  1|   Jon|          Sm9u|           Jon|
|  2| Danny|      RGFubnk=|         Danny|
|  3|Tyrion|      VHlyaW9u|        Tyrion|
+---+------+--------------+--------------+
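
Going back to the question itself: unbase64 yields raw bytes, so for a binary float payload like the one in the question you still need NumPy to interpret them, which a plain cast cannot do. Below is a minimal sketch using a Python UDF, assuming (as in the question's snippet) that column2 holds base64-encoded little-endian 32-bit floats; decode_floats and decoded_column2 are illustrative names, not part of any API.

    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, FloatType
    import base64
    import numpy as np

    # Hypothetical UDF: base64-decode the text, then reinterpret the raw
    # bytes as little-endian 32-bit floats, returned as a Spark array
    @udf(ArrayType(FloatType()))
    def decode_floats(b64_string):
        raw = base64.b64decode(b64_string)
        return np.frombuffer(raw, dtype='<f4').tolist()

    df = spark.read.parquet("file_path")
    # Adds the decoded values as a new column, keeping the original DataFrame intact
    decoded_df = df.withColumn('decoded_column2', decode_floats(df.column2))

A Scala version would follow the same shape: unbase64(col("column2")) to get the bytes, then a UDF that reads them as little-endian floats through java.nio.ByteBuffer.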