I am working with a DataFrame that has a compressed column, and I want to decompress it using zlib.decompress. The following snippet is my attempt:
from zlib import decompress
from pyspark.sql.functions import udf
toByteStr = udf(bytes)
unzip = udf(decompress)
df = (spark.read.format("xx.xxx.xx.xx")
      .load())
df1 = df.withColumn("message", unzip(toByteStr("content"), 15 + 32))
This is the error message I get:
An error occurred while calling z:org.apache.spark.sql.functions.col. Trace:
py4j.Py4JException: Method col([class java.lang.Integer]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:339)
at py4j.Gateway.invoke(Gateway.java:274)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Traceback (most recent call last):
I would really appreciate your help with this. Thanks.
More information:
I just realized that the data is actually compressed in pkzip format, which zlib does not support. I tried to decompress it with the following code instead.
import StringIO
import zipfile
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def unZip(buf):
    # Wrap the raw bytes in a file-like object and read the first archive member.
    fio = StringIO.StringIO(buf)
    z = zipfile.ZipFile(fio, 'r')
    result = z.open(z.infolist()[0]).read()
    return result

toByteStr = udf(bytes, StringType())
unzip = udf(unZip, StringType())

df = (spark.read.format("xxx.xxx.xxx.xx")
      .option("env", "xxx")
      .option("table", "xxxxx.xxxxxx.xxxx")
      .load())
df1 = df.withColumn("message", unzip(toByteStr("content")))
df1.show()
I tried the unZip function on a zip string directly and it works fine. But when I register it as a UDF so it can run in parallel on the Spark cluster, it tells me the file is not a zip file, even though I am sure it is (a minimal local check is sketched after the trace). The error is as follows:
BadZipfile: File is not a zip file
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:144)
at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:87)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
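For reference, a minimal local check along these lines succeeds (the in-memory sample archive and the makeZip helper are purely illustrative):

import StringIO
import zipfile

def makeZip(payload):
    # Build an in-memory pkzip archive with a single member, for testing only.
    buf = StringIO.StringIO()
    zf = zipfile.ZipFile(buf, 'w', zipfile.ZIP_DEFLATED)
    zf.writestr('data.txt', payload)
    zf.close()
    return buf.getvalue()

print(unZip(makeZip('hello world')))  # prints: hello world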
Answer 0 (score: 1)
The second argument must also be a Column, so you need to wrap the constant with the lit function:
from pyspark.sql.functions import lit
df.withColumn("message", unzip(toByteStr("content"), lit(15 + 32)))
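For completeness, here is a sketch of the original zlib attempt with this fix applied. It assumes the column really holds zlib/gzip data (the question later notes the data is pkzip, which needs zipfile rather than zlib), and the data-source format string stays a placeholder as in the question:

from zlib import decompress
from pyspark.sql.functions import udf, lit

toByteStr = udf(bytes)
unzip = udf(decompress)

df = (spark.read.format("xx.xxx.xx.xx")
      .load())

# lit() wraps the constant so both UDF arguments are Columns;
# wbits = 15 + 32 tells zlib to auto-detect a zlib or gzip header.
df1 = df.withColumn("message", unzip(toByteStr("content"), lit(15 + 32)))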