Question

I want to subset a DataFrame from a df that was generated from a Parquet file:

+----+-----+----------+-----+-----------------+-----+-----------+-----+
|year|state|count1    |rowId|count2           |rowId|count3     |rowId|
+----+-----+----------+-----+-----------------+-----+-----------+-----+
|2014|   CT|    343477|    0|           343477|    0|     343477|    0|
|2014|   DE|    123431|    1|           123431|    1|     123431|    1|
|2014|   MD|    558686|    2|           558686|    2|     558686|    2|
|2014|   NJ|    773321|    3|           773321|    3|     773321|    3|
|2015|   CT|    343477|    4|           343477|    4|     343477|    4|
|2015|   DE|    123431|    5|           123431|    5|     123431|    5|
|2015|   MD|    558686|    6|           558686|    6|     558686|    6|

I want to keep one "rowId" column, drop the other "rowId" columns, and also make rowId the first column:

    +-----+----+-----+----------+----------+----------+
    |rowId|year|state|count1    |count2    |count3    |
    +-----+----+-----+----------+----------+----------+
    |    0|2014|   CT|    343477|    343477|    343477|
    |    1|2014|   DE|    123431|    123431|    123431|
    |    2|2014|   MD|    558686|    558686|    558686|
    |    3|2014|   NJ|    773321|    773321|    773321|
    |    4|2015|   CT|    343477|    343477|    343477|
    |    5|2015|   DE|    123431|    123431|    123431|
    |    6|2015|   MD|    558686|    558686|    558686|
    +-----+----+-----+----------+----------+----------+

My attempt:

 df.createOrReplaceTempView("test")
 val sqlDF = spark.sql("SELECT rowId, year, state, count1, count2, count3 FROM test")

I get the error: org.apache.spark.sql.AnalysisException: Reference 'rowId' is ambiguous, could be: rowId#3356L, rowId#3368L, rowId#3378L, rowId#3388L, rowId#3398L, rowId#3408L. How can I do this? Thanks...

Answer 1

You can map the columns by their positional index, which sidesteps the ambiguous name, as shown below:

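import spark.implicits._  // provides the tuple encoder needed by map and toDF

df.map(attributes =>
    (attributes.getLong(3),    // rowId (the "L" suffix in the error suggests a Long column)
     attributes.getInt(0),     // year
     attributes.getString(1),  // state
     attributes.getInt(2),     // count1
     attributes.getInt(4),     // count2
     attributes.getInt(6)))    // count3
  .toDF("rowId", "year", "state", "count1", "count2", "count3")
  .show()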

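Feel free to adjust the getters in the statement above to match your actual column data types; a getter that does not match the column type fails at runtime.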
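
If hard-coding a typed getter for every column is inconvenient, one possible alternative is to rename the columns by position so that every name is unique, and then select the ones to keep by name. The sketch below is only an illustration under assumptions: the helper name `keepFirstRowId`, the object name `SubsetSketch`, the names `rowId2` and `rowId3`, and the sample rows are hypothetical, and it assumes the eight-column layout shown in the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SubsetSketch {
  // Hypothetical helper: assumes exactly the eight columns from the question,
  // in the same order. toDF renames columns positionally, so the duplicate
  // "rowId" names become unique and can then be referenced without ambiguity.
  def keepFirstRowId(df: DataFrame): DataFrame =
    df.toDF("year", "state", "count1", "rowId", "count2", "rowId2", "count3", "rowId3")
      .select("rowId", "year", "state", "count1", "count2", "count3")

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; in spark-shell `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("subset-sketch")
      .getOrCreate()
    import spark.implicits._

    // Stand-in for the Parquet-derived df, with repeated "rowId" column names.
    val df = Seq(
      (2014, "CT", 343477, 0L, 343477, 0L, 343477, 0L),
      (2014, "DE", 123431, 1L, 123431, 1L, 123431, 1L)
    ).toDF("year", "state", "count1", "rowId", "count2", "rowId", "count3", "rowId")

    keepFirstRowId(df).show()
    spark.stop()
  }
}
```

Because the rename is purely positional, this variant does not depend on the column data types, but it does break if the column order of the input changes.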
