I have a PySpark dataframe:
+-------+--------------+----+----+
|address| date|name|food|
+-------+--------------+----+----+
|1111111|20151122045510| Yin|gre |
|1111111|20151122045501| Yin|gre |
|1111111|20151122045500| Yln|gra |
|1111112|20151122065832| Yun|ddd |
|1111113|20160101003221| Yan|fdf |
|1111111|20160703045231| Yin|gre |
|1111114|20150419134543| Yin|fdf |
|1111115|20151123174302| Yen|ddd |
|2111115| 20123192| Yen|gre |
+-------+--------------+----+----+
I want to convert it for use with pyspark.ml. I can use StringIndexer to convert the name column to a numeric category:
from pyspark.ml.feature import StringIndexer

indexer = StringIndexer(inputCol="name", outputCol="name_index").fit(df)
df_ind = indexer.transform(df)
df_ind.show()
+-------+--------------+----+----------+----+
|address| date|name|name_index|food|
+-------+--------------+----+----------+----+
|1111111|20151122045510| Yin| 0.0|gre |
|1111111|20151122045501| Yin| 0.0|gre |
|1111111|20151122045500| Yln| 2.0|gra |
|1111112|20151122065832| Yun| 4.0|ddd |
|1111113|20160101003221| Yan| 3.0|fdf |
|1111111|20160703045231| Yin| 0.0|gre |
|1111114|20150419134543| Yin| 0.0|fdf |
|1111115|20151123174302| Yen| 1.0|ddd |
|2111115| 20123192| Yen| 1.0|gre |
+-------+--------------+----+----------+----+
How can I use StringIndexer to convert multiple columns (for example, name and food, each with its own StringIndexer) and then use VectorAssembler to generate a feature vector? Or do I have to create a separate StringIndexer for each column?
**Edit**: This is not a dupe, because I need to do this programmatically for several dataframes with different column names. I can't use VectorIndexer or VectorAssembler because the columns are not numeric.
**Edit 2**: Tentative solution
indexers = [StringIndexer(inputCol=column, outputCol=column + "_index").fit(df).transform(df) for column in df.columns]
This creates a list of three dataframes, each identical to the original plus one transformed column. Now I would need to join them to form the final dataframe, which is very inefficient.
Answer 0 (score: 46)
The best way I found to do this is to combine several StringIndexers in a list and use a Pipeline to execute them all:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer
indexers = [StringIndexer(inputCol=column, outputCol=column + "_index").fit(df) for column in list(set(df.columns) - set(['date']))]
pipeline = Pipeline(stages=indexers)
df_r = pipeline.fit(df).transform(df)
df_r.show()
+-------+--------------+----+----+----------+----------+-------------+
|address| date|food|name|food_index|name_index|address_index|
+-------+--------------+----+----+----------+----------+-------------+
|1111111|20151122045510| gre| Yin| 0.0| 0.0| 0.0|
|1111111|20151122045501| gra| Yin| 2.0| 0.0| 0.0|
|1111111|20151122045500| gre| Yln| 0.0| 2.0| 0.0|
|1111112|20151122065832| gre| Yun| 0.0| 4.0| 3.0|
|1111113|20160101003221| gre| Yan| 0.0| 3.0| 1.0|
|1111111|20160703045231| gre| Yin| 0.0| 0.0| 0.0|
|1111114|20150419134543| gre| Yin| 0.0| 0.0| 5.0|
|1111115|20151123174302| ddd| Yen| 1.0| 1.0| 2.0|
|2111115| 20123192| ddd| Yen| 1.0| 1.0| 4.0|
+-------+--------------+----+----+----------+----------+-------------+
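From here, the VectorAssembler step the question asks about can be appended as one more pipeline stage that collects the indexed columns into a single vector. A minimal sketch, assuming the example dataframe above (the "features" output column name is my own choice):
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler

# index every column except the numeric date column
cols = [c for c in df.columns if c != 'date']
indexers = [StringIndexer(inputCol=c, outputCol=c + "_index") for c in cols]
# collect the indexed columns into a single vector column
assembler = VectorAssembler(inputCols=[c + "_index" for c in cols], outputCol="features")
pipeline = Pipeline(stages=indexers + [assembler])
df_r = pipeline.fit(df).transform(df)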
Answer 1 (score: 1)
With PySpark 3.0+ this is now easier: you can use the inputCols and outputCols options:
http://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.StringIndexer
class pyspark.ml.feature.StringIndexer(inputCol=None, outputCol=None, inputCols=None, outputCols=None, handleInvalid='error', stringOrderType='frequencyDesc')
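A minimal sketch of this multi-column form, reusing the name and food columns from the question (the output column names are my own):
from pyspark.ml.feature import StringIndexer

# a single indexer handles both columns in Spark 3.0+
indexer = StringIndexer(inputCols=["name", "food"], outputCols=["name_index", "food_index"])
df_r = indexer.fit(df).transform(df)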
Answer 2 (score: 1)
Applying StringIndexer to several columns in a PySpark DataFrame, for Spark 2.4.7:
from pyspark.ml.feature import StringIndexer
from pyspark.ml import Pipeline
indexers = [StringIndexer(inputCol="F1", outputCol="F1Index"), StringIndexer(inputCol="F5", outputCol="F5Index")]
pipeline = Pipeline(stages=indexers)
DF6 = pipeline.fit(DF5).transform(DF5)
DF6.show()
Answer 3 (score: 0)
I can offer you the following solution. It is best to use pipelines for these kinds of transformations on larger datasets; they also make your code easier to read and maintain. You can add more stages to the pipeline if needed, for example an encoder (see the sketch after the code below).
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer

#create a list of the columns that are string typed
categoricalColumns = [item[0] for item in df.dtypes if item[1].startswith('string')]
#define a list of stages in your pipeline. The string indexer will be one stage
stages = []
#iterate through all categorical values
for categoricalCol in categoricalColumns:
#create a string indexer for those categorical values and assign a new name including the word 'Index'
stringIndexer = StringIndexer(inputCol = categoricalCol, outputCol = categoricalCol + 'Index')
#append the string Indexer to our list of stages
stages += [stringIndexer]
#Create the pipeline. Assign the stages list to the pipeline keyword stages
pipeline = Pipeline(stages = stages)
#fit the pipeline to our dataframe
pipelineModel = pipeline.fit(df)
#transform the dataframe
df= pipelineModel.transform(df)
Please take a look at my reference.
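As a sketch of the encoder stage mentioned above, a OneHotEncoder can be appended after the indexers. This assumes Spark 3.x, where OneHotEncoder is an estimator that accepts inputCols/outputCols; the 'Vec' suffix is my own naming:
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer

categoricalColumns = [c for c, t in df.dtypes if t.startswith('string')]
indexers = [StringIndexer(inputCol=c, outputCol=c + 'Index') for c in categoricalColumns]
# one-hot encode the indexed columns into sparse vectors
encoder = OneHotEncoder(inputCols=[c + 'Index' for c in categoricalColumns],
                        outputCols=[c + 'Vec' for c in categoricalColumns])
pipeline = Pipeline(stages=indexers + [encoder])
df = pipeline.fit(df).transform(df)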