PySpark: add a sequential and deterministic index to a DataFrame

Time: 2018-09-13 16:28:16

Tags: indexing pyspark

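I need to add an index column to a DataFrame, with three very simple constraints:

  • start from 0

  • be sequential

  • be deterministic

I'm sure I'm missing something obvious, because the examples I'm finding either look very convoluted for such a simple task, or use non-sequential, non-deterministic monotonically increasing IDs. I don't want to zip with index and then have to split the previously separate columns back out of a single column, because my DataFrames are in the terabytes and it seems unnecessary. I don't need to partition by anything or order by anything, yet the examples I'm finding do exactly that (using window functions and row_number). All I need is a simple integer sequence from 0 to df.count. What am I missing here?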


2 answers:

Answer 0: (score: 2)

  

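I mean: how can I add a column with an ordered sequence 0:df.count that increases monotonically by 1? (from comments)

You can use row_number() here, but for that you need to specify an orderBy(). Since you don't have an ordering column, just use monotonically_increasing_id():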

from pyspark.sql.functions import row_number, monotonically_increasing_id
from pyspark.sql import Window

df = df.withColumn(
    "index",
    row_number().over(Window.orderBy(monotonically_increasing_id()))-1
)

Also, row_number() starts at 1, so you have to subtract 1 to make it start from 0. The last value will be df.count - 1.
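To see why monotonically_increasing_id() on its own does not meet the "sequential" constraint, here is a minimal plain-Python sketch (no Spark required) that mimics the layout described in the Spark documentation: the partition ID in the upper 31 bits and the record number within the partition in the lower 33 bits. The helper name `simulated_ids` and the partition sizes are made up for illustration; this is not Spark's actual code.

```python
def simulated_ids(partition_sizes):
    """Mimic the documented ID layout: (partition_id << 33) + record_number."""
    ids = []
    for partition_id, size in enumerate(partition_sizes):
        for record_number in range(size):
            ids.append((partition_id << 33) + record_number)
    return ids


# Hypothetical DataFrame with two partitions holding 2 and 3 rows.
ids = simulated_ids([2, 3])
print(ids)  # increasing and unique, but with a large gap between partitions

# Ranking the IDs (what row_number() over an ordered window does, minus 1)
# turns them into a gap-free sequence starting at 0.
index = [rank for rank, _ in enumerate(sorted(ids))]
print(index)
```

The IDs are only used as a sort key here. Note that a window with no partitionBy pulls all rows into a single partition for the ranking step, which may be expensive on a very large DataFrame.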


  

I don't want to zip with index and then have to separate the previously separated columns that are now in a single column

If you use map after calling zipWithIndex, you can avoid having all the separate columns collapsed into a single column:

cols = df.columns
df = df.rdd.zipWithIndex().map(lambda row: (row[1],) + tuple(row[0])).toDF(["index"] + cols)
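Conceptually, zipWithIndex numbers rows by partition order first and then by position within each partition. The plain-Python sketch below imitates that idea with per-partition start offsets, then applies the same reshaping as the map() call. The function `zip_with_index` and the sample partitions are hypothetical stand-ins, not Spark's implementation.

```python
from itertools import accumulate


def zip_with_index(partitions):
    """Pair each row with a global index, ordered by partition then position."""
    sizes = [len(part) for part in partitions]
    # Start offset of a partition = number of rows in all partitions before it.
    offsets = [0] + list(accumulate(sizes))[:-1]
    return [
        [(row, offset + i) for i, row in enumerate(part)]
        for part, offset in zip(partitions, offsets)
    ]


# Hypothetical data split across two partitions.
partitions = [[("id1", "a"), ("id2", "b")], [("id2", "c")]]
indexed = zip_with_index(partitions)

# Same reshaping as the map() step: index first, then the original columns.
flat = [(idx,) + tuple(row) for part in indexed for row, idx in part]
print(flat)
```

Because each (row, index) pair is flattened into one tuple, the original columns stay separate and the index becomes just another leading column.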


Answer 1: (score: 0)

Not sure about performance, but here is a trick.

Note: toPandas will collect all the data to the driver.

from pyspark.sql import SparkSession

# speed up toPandas using arrow
spark = SparkSession.builder.appName('seq-no') \
        .config("spark.sql.execution.arrow.pyspark.enabled", "true") \
        .config("spark.sql.execution.arrow.enabled", "true") \
        .getOrCreate()

df = spark.createDataFrame([
    ('id1', "a"),
    ('id2', "b"),
    ('id2', "c"),
], ["ID", "Text"])

df1 = spark.createDataFrame(df.toPandas().reset_index()).withColumnRenamed("index","seq_no")

df1.show()

+------+---+----+
|seq_no| ID|Text|
+------+---+----+
|     0|id1|   a|
|     1|id2|   b|
|     2|id2|   c|
+------+---+----+