Processing only specific columns in a Spark DataFrame

Time: 2016-10-30 11:56:08

Tags: scala apache-spark apache-spark-sql

I have been working from this answer (link), but I have a more specific requirement.

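I only need to select the columns whose names start with "cat". I cannot work out how to select columns by a name pattern. I do not need to filter the rows of the DataFrame, only to pick the columns whose names begin with that prefix.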

val transformers: Array[PipelineStage] = df.select("cat*").columns.map(
  cname =>
    new StringIndexer()
      .setInputCol(cname)
      .setOutputCol(s"${cname}_index")
  )

val stages: Array[PipelineStage] = transformers

val pipeline = new Pipeline().setStages(stages)
val model = pipeline.fit(df)

This code produces the following error:

org.apache.spark.sql.AnalysisException: cannot resolve 'cat*' given input columns: [cat3, cat7, cat25,...

2 answers:

Answer 0: (score: 1)

This is simple. You only need to filter the column names and keep those that start with "cat".
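As a minimal sketch of the name filtering this answer describes: `df.columns` returns an `Array[String]`, so the prefix match is plain Scala collection code. The object `CatColumns`, the helper `startingWith`, and the sample column names are hypothetical and only stand in for a real DataFrame's columns.

```scala
object CatColumns {
  // Keep only the names that begin with the given prefix.
  def startingWith(columns: Array[String], prefix: String): Array[String] =
    columns.filter(_.startsWith(prefix))

  def main(args: Array[String]): Unit = {
    // Hypothetical names standing in for df.columns.
    val columns = Array("id", "cat3", "cat7", "cat25", "label")
    val catColumns = startingWith(columns, "cat")
    println(catColumns.mkString(", ")) // prints: cat3, cat7, cat25
  }
}
```

With a real DataFrame, the filtered names could then be passed to `select`, for example `df.select(catColumns.head, catColumns.tail: _*)`, assuming at least one name matches.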

Answer 1: (score: 0)

Why select from the DataFrame just to get the columns? Why not filter the full list of column names instead:

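import org.apache.spark.ml.PipelineStage
import org.apache.spark.ml.feature.StringIndexer

// Filter the column names directly, then build one StringIndexer per match.
val transformers: Array[PipelineStage] = df.columns.filter(_.startsWith("cat")).map(
  cname =>
    new StringIndexer()
      .setInputCol(cname)
      .setOutputCol(s"${cname}_index")
)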
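For completeness, here is a sketch of how this filter might fit into the full pipeline from the question. It assumes Spark 2.x with `spark-sql` and `spark-mllib` on the classpath and a local `SparkSession`; the object `CatIndexerExample`, the helper `indexerStages`, and the sample data are hypothetical.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession

object CatIndexerExample {
  // Build one StringIndexer stage for every column name with the given prefix.
  def indexerStages(columns: Array[String], prefix: String): Array[PipelineStage] =
    columns.filter(_.startsWith(prefix)).map { cname =>
      new StringIndexer()
        .setInputCol(cname)
        .setOutputCol(s"${cname}_index"): PipelineStage
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("cat-indexer-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data: two "cat" columns and one numeric column.
    val df = Seq(("a", "x", 1.0), ("b", "y", 2.0), ("a", "y", 3.0))
      .toDF("cat3", "cat7", "num1")

    val stages = indexerStages(df.columns, "cat")
    val model = new Pipeline().setStages(stages).fit(df)

    // The result keeps the original columns and adds cat3_index and cat7_index.
    model.transform(df).show()

    spark.stop()
  }
}
```

Filtering `df.columns` avoids the `AnalysisException` from the question because no column reference is resolved until each `StringIndexer` receives a concrete, existing column name.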