How to declare that a given column in a DataFrame contains categorical information?
I have a Spark SQL DataFrame loaded from a database. Many of the columns in this DataFrame carry categorical information, but they are encoded as Longs (for privacy).

I would like to be able to tell spark-ml that even though a column is numeric, the information it holds is actually categorical. The category indices may have a few holes, which is acceptable (for example, a column might contain the values [1, 0, 0, 4]).
I know StringIndexer exists, but I would rather avoid the hassle of encoding and decoding, especially because I have many columns with this behavior.

I am looking for something like the following:
train = load_from_database()

categorical_cols = ["CategoricalColOfLongs1",
                    "CategoricalColOfLongs2"]
numeric_cols = ["NumericColOfLongs1"]

## This is what I am looking for
## this step detects the min and max value of both columns
## and adds metadata to indicate this as a categorical column
## with (1 + max - min) categories
categorizer = ColumnCategorizer(columns=categorical_cols,
                                autoDetectMinMax=True)
##
vectorizer = VectorAssembler(inputCols=categorical_cols + numeric_cols,
                             outputCol="features")
classifier = DecisionTreeClassifier()
pipeline = Pipeline(stages=[categorizer, vectorizer, classifier])
model = pipeline.fit(train)
Answer 0 (score: 2)
"I would rather avoid the hassle of encoding and decoding"

You cannot avoid it completely. The metadata required for a categorical variable is really just a mapping between values and indices. Still, there is no need to do this manually or to create a custom transformer. Let's assume you have a data frame like this:
import numpy as np
import pandas as pd

df = sqlContext.createDataFrame(pd.DataFrame({
    "x1": np.random.random(1000),
    "x2": np.random.choice(3, 1000),
    "x4": np.random.choice(5, 1000)
}))
All you need is an assembler and an indexer:
from pyspark.ml.feature import VectorAssembler, VectorIndexer
from pyspark.ml import Pipeline

pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=df.columns, outputCol="features_raw"),
    VectorIndexer(
        inputCol="features_raw", outputCol="features", maxCategories=10)])

transformed = pipeline.fit(df).transform(df)

transformed.schema.fields[-1].metadata
## {'ml_attr': {'attrs': {'nominal': [{'idx': 1,
##      'name': 'x2',
##      'ord': False,
##      'vals': ['0.0', '1.0', '2.0']},
##     {'idx': 2,
##      'name': 'x4',
##      'ord': False,
##      'vals': ['0.0', '1.0', '2.0', '3.0', '4.0']}],
##    'numeric': [{'idx': 0, 'name': 'x1'}]},
##   'num_attrs': 3}}
This example also shows the kind of information you have to provide to mark a given element as a categorical variable:

{
    'idx': 2,      # Index (position in the vector)
    'name': 'x4',  # Name
    'ord': False,  # Is it ordinal?
    # Mapping between values and labels
    'vals': ['0.0', '1.0', '2.0', '3.0', '4.0']
}
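Estimators that understand ML attribute metadata pick this up from the features column automatically, so a tree-based model treats x2 and x4 as nominal without any extra configuration. A minimal sketch (the label column is hypothetical, added here only so the example runs):

from pyspark.ml.classification import DecisionTreeClassifier

# Hypothetical label, derived from x2 purely for illustration.
labeled = transformed.withColumn("label", (transformed["x2"] == 1).cast("double"))

dt = DecisionTreeClassifier(featuresCol="features", labelCol="label")
model = dt.fit(labeled)
# The fitted tree treats x2 and x4 as categorical features based on the metadata above.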
If you want to build this from scratch, all you have to do is provide the correct schema:
from pyspark.sql.types import *
from pyspark.mllib.linalg import VectorUDT

# Let's assume we have only a vector
raw = transformed.select("features_raw")

# Dictionary equivalent to transformed.schema.fields[-1].metadata shown above
meta = ...

schema = StructType([StructField("features", VectorUDT(), metadata=meta)])
sqlContext.createDataFrame(raw.rdd, schema)
But this is quite inefficient because of the serialization and deserialization required.

Since Spark 2.2 you can also use the metadata argument:
from pyspark.sql.functions import col

df.withColumn("features", col("features").alias("features", metadata=meta))
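Applied to the question's setup, a minimal sketch could attach nominal metadata directly to one of the Long columns before assembling. The column name, the maximum code 4, and the train DataFrame come from the question; the metadata layout follows the nominal format shown above:

from pyspark.sql.functions import col

# Largest code observed in the column (4, as in the example values [1, 0, 0, 4]).
max_val = 4
meta = {"ml_attr": {"type": "nominal",
                    "name": "CategoricalColOfLongs1",
                    "vals": [str(float(i)) for i in range(max_val + 1)]}}

train = train.withColumn(
    "CategoricalColOfLongs1",
    col("CategoricalColOfLongs1").cast("double")
        .alias("CategoricalColOfLongs1", metadata=meta))

A VectorAssembler should then carry this attribute metadata into the assembled features vector, so no separate VectorIndexer pass is needed for that column.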
Answer 1 (score: 0)
Hey zero323, I used the same technique to inspect the metadata, and I wrote this Transformer:
def _transform(self, data):
    maxValues = self.getOrDefault(self.maxValues)
    categoricalCols = self.getOrDefault(self.categoricalCols)
    new_schema = types.StructType(data.schema.fields)
    new_data = data
    for (col, maxVal) in zip(categoricalCols, maxValues):
        # I have not decided if I should make a new column or
        # overwrite the original column
        new_col_name = col + "_categorical"
        new_data = new_data.withColumn(new_col_name,
                                       data[col].astype(types.DoubleType()))
        # metadata for a categorical column
        meta = {u'ml_attr': {u'vals': [unicode(i) for i in range(maxVal + 1)],
                             u'type': u'nominal',
                             u'name': new_col_name}}
        new_schema.add(new_col_name, types.DoubleType(), True, meta)
    return data.sql_ctx.createDataFrame(new_data.rdd, new_schema)
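For reference, a minimal sketch of the class around this method (the class name echoes the ColumnCategorizer from the question, and the Param plumbing is an assumption; only _transform above comes from this answer):

from pyspark import keyword_only
from pyspark.ml import Transformer
from pyspark.ml.param import Param, Params
from pyspark.sql import types

class ColumnCategorizer(Transformer):
    # Columns to mark as categorical and the largest code in each of them.
    categoricalCols = Param(Params._dummy(), "categoricalCols",
                            "columns to mark as categorical")
    maxValues = Param(Params._dummy(), "maxValues",
                      "maximum value of each categorical column")

    @keyword_only
    def __init__(self, categoricalCols=None, maxValues=None):
        super(ColumnCategorizer, self).__init__()
        self._set(**self._input_kwargs)

    def _transform(self, data):
        ...  # body as shown above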