Good day!
I am trying to apply a very simple Pandas UDF in PySpark (Spark 2.4.5), but it isn't working for me. Example:
pyspark --master local[4] --conf "spark.pyspark.python=/opt/anaconda/envs/bd9/bin/python3" --conf "spark.pyspark.driver.python=/opt/anaconda/envs/bd9/bin/python3"
>>> my_df = spark.createDataFrame(
... [
... (1, 0),
... (2, 1),
... (3, 1),
... ],
... ["uid", "partition_id"]
... )
>>> from pyspark.sql.types import StructType, StructField, StringType
>>> schema = StructType([StructField("uid", StringType())])
>>> from pyspark.sql.functions import pandas_udf, PandasUDFType
>>> import pandas
>>> @pandas_udf(schema, PandasUDFType.GROUPED_MAP)
... def apply_model(sample_df):
... print(sample_df)
... return pandas.DataFrame({"uid": sample_df["uid"]})
...
>>> result = my_df.groupBy("partition_id").apply(apply_model)
>>> result.show()
uid partition_id
0 1 0
[Stage 13:==================================================> (92 + 4) / 100] uid partition_id
0 2 1
1 3 1
+---+
|uid|
+---+
| |
| |
| |
+---+
Somehow, uid is not reflected in the result.
Can you tell me what I am missing here?
Thanks.
Answer 0 (score: 0)
Sorry, my bad: I wrote the wrong type in the schema. It should be LongType() instead of StringType().
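For reference, a minimal corrected sketch of the same session with the schema declared as LongType() (the uid values in the sample data are Python ints, which map to Spark longs):

>>> from pyspark.sql.types import StructType, StructField, LongType
>>> schema = StructType([StructField("uid", LongType())])
>>> @pandas_udf(schema, PandasUDFType.GROUPED_MAP)
... def apply_model(sample_df):
...     # the returned pandas DataFrame must match the declared schema types
...     return pandas.DataFrame({"uid": sample_df["uid"]})
...
>>> my_df.groupBy("partition_id").apply(apply_model).show()

With the matching type, show() should now print the uid values 1, 2, and 3 (row order may vary) instead of an empty column.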