I'm having trouble decoding decimal values into binary in PySpark. This is what I do in plain Python:

a = 28
b = format(a, "09b")
print(b)
-> 000011100

Here is the sample DataFrame I want to convert:

from pyspark import Row
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([Row(a=1, b='28', c='11', d='foo'),
                            Row(a=2, b='28', c='44', d='bar'),
                            Row(a=3, b='28', c='22', d='foo')])

+---+---+---+---+
|  a|  b|  c|  d|
+---+---+---+---+
|  1| 28| 11|foo|
|  2| 28| 44|bar|
|  3| 28| 22|foo|
+---+---+---+---+

I would like the 'b' column decoded the same way. Thanks for your help!
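For context, the "09b" format spec means: render in binary ("b"), zero-padded ("0") to a width of 9. A quick plain-Python check, including an equivalent spelling with bin() and str.zfill():

```python
a = 28
b = format(a, "09b")             # binary, zero-padded to 9 characters
print(b)                         # -> 000011100
# same result via bin() plus zfill(); bin() prefixes "0b", so strip it
assert b == bin(a)[2:].zfill(9)
```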
Answer 0 (score: 1)
Use the bin and lpad functions to achieve the same output:
import pyspark.sql.functions as f
from pyspark import Row
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([Row(a=1, b='28', c='11', d='foo'),
                            Row(a=2, b='28', c='44', d='bar'),
                            Row(a=3, b='28', c='22', d='foo')])
# bin() renders the value as a binary string; lpad() zero-pads it to width 9
df = df.withColumn('b', f.lpad(f.bin(df['b']), 9, '0'))
df.show()
Or using a UDF:
import pyspark.sql.functions as f
from pyspark import Row
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([Row(a=1, b='28', c='11', d='foo'),
                            Row(a=2, b='28', c='44', d='bar'),
                            Row(a=3, b='28', c='22', d='foo')])

# the default UDF return type is StringType, which matches format()'s output
@f.udf()
def to_binary(value):
    return format(int(value), "09b")

df = df.withColumn('b', to_binary(df['b']))
df.show()
Output:
+---+---------+---+---+
| a| b| c| d|
+---+---------+---+---+
| 1|000011100| 11|foo|
| 2|000011100| 44|bar|
| 3|000011100| 22|foo|
+---+---------+---+---+