Converting from a vector column to a Double array column in Scala Spark

Posted: 2019-04-08 05:23:40

Tags: scala apache-spark

I have a dataframe doubleSeq whose structure is as below:


res274: org.apache.spark.sql.DataFrame = [finalFeatures: vector]

The first record of the column is as follows:


res281: org.apache.spark.sql.Row = [[3.0,6.0,-0.7876947819954485,-0.21757635218517163,0.9731844373162398,-0.6641741696340383,-0.6860072219935377,-0.2990737363481845,-0.7075863760365155,0.8188108975549018,-0.8468559840943759,-0.04349947247406488,-0.45236764452589984,1.0333959313820456,0.6097566070878347,-0.7106619551471779,-0.7750330808435969,-0.08097610412658443,-0.45338437108038904,-0.2952869863393396,-0.30959772365257004,0.6988768123463287,0.17049117199049213,3.2674649019757385,-0.8333373234944124,1.8462942520757128,-0.49441222531240125,-0.44187299748074166,-0.300810826687287]]

I want to extract the double array

[3.0,6.0,-0.7876947819954485,-0.21757635218517163,0.9731844373162398,-0.6641741696340383,-0.6860072219935377,-0.2990737363481845,-0.7075863760365155,0.8188108975549018,-0.8468559840943759,-0.04349947247406488,-0.45236764452589984,1.0333959313820456,0.6097566070878347,-0.7106619551471779,-0.7750330808435969,-0.08097610412658443,-0.45338437108038904,-0.2952869863393396,-0.30959772365257004,0.6988768123463287,0.17049117199049213,3.2674649019757385,-0.8333373234944124,1.8462942520757128,-0.49441222531240125,-0.44187299748074166,-0.300810826687287]

from this.

Scala Spark - split vector column into separate columns in a Spark DataFrame

does not solve my problem, but it is a pointer.

1 answer:

Answer 0 (score: 1):

So you want to extract a Vector from a Row and turn it into an array of doubles.

The problem with your code is that the get method (and the implicit apply method you are using) returns an object of type Any. Indeed, Row is a generic, unparameterized container: there is no way at compile time to know what types it holds. It is a bit like Lists in Java 1.4 and earlier. To fix this, you can use the getAs method, which you can parameterize with the type of your choice.
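To make this concrete, here is a minimal sketch (the Row and its contents are invented purely for illustration):

import org.apache.spark.sql.Row

val row = Row("label", 1.0)
val untyped: Any = row.get(1)            // get (and row(1)) only ever give you Any
val typed: Double = row.getAs[Double](1) // getAs[T] hands back the type you ask for

Keep in mind that getAs performs a cast under the hood, so requesting the wrong type still fails, just at runtime rather than at compile time.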

In your case, you seem to have a dataframe containing a vector (org.apache.spark.ml.linalg.Vector).

import org.apache.spark.ml.linalg._
val firstRow = df.head(1)(0) // or simply df.head
val vect : Vector = firstRow.getAs[Vector](0)
// or all in one: df.head.getAs[Vector](0)

// to transform into a regular array
val array : Array[Double] = vect.toArray

Note also that you can access the column by name, as follows:

val vect : Vector = firstRow.getAs[Vector]("finalFeatures")
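And if the goal is to convert the whole column rather than a single collected row, as the question title suggests, the same toArray call can be wrapped in a UDF. This is only a sketch, reusing the df from above; the output column name finalFeaturesArray is an arbitrary choice:

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// Wrap Vector.toArray in a UDF so the conversion runs on every row,
// distributed across the cluster, instead of collecting to the driver.
val vecToArray = udf { v: Vector => v.toArray }
val dfWithArray = df.withColumn("finalFeaturesArray", vecToArray(col("finalFeatures")))

This keeps everything inside Spark: dfWithArray carries an extra column of type array<double> alongside the original vector column.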