I have a Spark DataFrame. The code below generates a sample DataFrame.
import numpy as np
import pandas as pd
# assumes an active SparkSession named `spark`
arr = np.array([
['b5ad805c-f295-4852-82fc-961a88',12732936],
['0FD6955D-484C-4FC8-8C3F-DA7D28','Gklb38'],
['0E3D17EA-BEEF-4931-8104',12909841],
['CC2877D0-A15C-4C0A-AD65-762A35C1','12645715'],
['CC2877D0-A15C-4C0A-AD65-762A35C1',12909837],
['6AC9C45D-A891-4BEA-92B1-04224E9C65ED', '12894376'],
['CFF7BAB7-C5E1-490D-B257-AE58CA071362', 'Gklb38' ]])
df_purchases = pd.DataFrame(arr, columns = ['user_id','basket'])
df_spark = spark.createDataFrame(df_purchases)
df_spark.show()
To create an index for each unique product_id (basket), I used zipWithIndex():
products_only = df_spark[['basket']]
products_df = products_only.distinct()
indexed_products = products_df.rdd.zipWithIndex()
Then I converted the result back to a DataFrame:
# convert to spark data frame
products_ind_df = indexed_products.toDF(["product_id", "index"])
When I check the types, I get:
products_ind_df.dtypes
Output:
[('product_id', 'struct<basket:string>'), ('index', 'bigint')]
whereas:
products_df.dtypes
Output:
[('basket', 'string')]
My question is why the types are not:
[('product_id', 'string'), ('index', 'bigint')]
and how can I change product_id to a string?
Answer 0 (score: 1)
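Because products_df.rdd is an RDD of Row objects, zipWithIndex() pairs each whole Row with its index, and the Row becomes a struct column. You need to extract basket from each Row as a string first: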
products_df.rdd.map(lambda r: r.basket).zipWithIndex().toDF(['product_id', 'index'])
# DataFrame[product_id: string, index: bigint]
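To see why the struct appears, it may help to mimic the two pipelines in plain Python. The sketch below is only an analogy: `Row` here is a `namedtuple` stand-in for `pyspark.sql.Row`, and `enumerate` stands in for `zipWithIndex()`. Neither is Spark itself, and the sample values are taken from the question.

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row with a single field "basket"
# (assumption: used for illustration only, not the real Spark class).
Row = namedtuple("Row", ["basket"])

rows = [Row("12732936"), Row("Gklb38"), Row("12909841")]

# Mimics products_df.rdd.zipWithIndex(): each element becomes (element, index),
# so the first item of every pair is still a whole one-field record.
zipped_rows = [(row, i) for i, row in enumerate(rows)]

# Mimics products_df.rdd.map(lambda r: r.basket).zipWithIndex():
# the field is unwrapped first, so the first item is a plain string.
zipped_values = [(row.basket, i) for i, row in enumerate(rows)]

print(zipped_rows[0])    # (Row(basket='12732936'), 0)
print(zipped_values[0])  # ('12732936', 0)
```

When the first item of each pair is a one-field record, the resulting column has to carry that field as a nested structure, which corresponds to the struct<basket:string> type in the question. Unwrapping before pairing leaves a plain string.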
Alternatively, if all you need is to map each product ID to an integer, you can use StringIndexer from the pyspark.ml.feature module:
from pyspark.ml.feature import StringIndexer
from pyspark.sql.functions import col
stringIndexer = StringIndexer(inputCol="basket", outputCol="index")
model = stringIndexer.fit(df_spark)
df_spark_index = model.transform(df_spark).withColumn("index", col("index").cast("int"))
df_spark_index.show()
+--------------------+--------+-----+
|             user_id|  basket|index|
+--------------------+--------+-----+
|b5ad805c-f295-485...|12732936|    2|
|0FD6955D-484C-4FC...|  Gklb38|    0|
|0E3D17EA-BEEF-493...|12909841|    1|
|CC2877D0-A15C-4C0...|12645715|    5|
|CC2877D0-A15C-4C0...|12909837|    3|
|6AC9C45D-A891-4BE...|12894376|    4|
|CFF7BAB7-C5E1-490...|  Gklb38|    0|
+--------------------+--------+-----+
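Gklb38 receives index 0 in the output above because StringIndexer, by default, orders labels by descending frequency, and Gklb38 is the only basket that appears twice. The plain-Python sketch below imitates that ordering; it is not Spark code. The alphabetical tie-break is an assumption made for this sketch, so the indices it assigns to the baskets that appear once do not match the order in the table.

```python
from collections import Counter

# Basket values from the question's sample data.
baskets = ["12732936", "Gklb38", "12909841", "12645715",
           "12909837", "12894376", "Gklb38"]

counts = Counter(baskets)

# Most frequent label first; ties broken alphabetically
# (assumption for this sketch; Spark's tie order may differ).
ordered = sorted(counts, key=lambda label: (-counts[label], label))

# Assign consecutive integers in that order.
label_to_index = {label: i for i, label in enumerate(ordered)}

# Attach the index to every original value, duplicates included.
indexed = [(b, label_to_index[b]) for b in baskets]

for basket, index in indexed:
    print(basket, index)
```

Unlike zipWithIndex() on the distinct values, this kind of mapping is driven by label frequency, so the same basket always gets the same index wherever it appears.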