How do I create integer-indexed rows?

Date: 2017-10-03 03:00:33

Tags: python apache-spark pyspark apache-spark-sql

I have a DataFrame:


+-----+--------+---------+
|  usn|log_type|item_code|
+-----+--------+---------+
|    0|      11|    I0938|
|  916|      19|    I0009|
|  916|      51|    I1097|
|  916|      19|    C0723|
|  916|      19|    I0010|
|  916|      19|    I0010|
|12331|      19|    C0117|
|12331|      19|    C0117|
|12331|      19|    I0009|
|12331|      19|    I0009|
|12331|      19|    I0010|
|12838|      19|    I1067|
|12838|      19|    I1067|
|12838|      19|    C1083|
|12838|      11|    B0250|
|12838|      19|    C1346|
+-----+--------+---------+

and I want to create an index for each item_code, like this:

+---------+-----+
|item_code|numId|
+---------+-----+
|    I0938|    0|
|    I0009|    1|
|    I1097|    2|
|    C0723|    3|
|    I0010|    4|
|    C0117|    5|
|    I1067|    6|
|    C1083|    7|
|    B0250|    8|
|    C1346|    9|
+---------+-----+

I am not using `monotonically_increasing_id()`, because it returns a bigint.

1 Answer:

Answer 0 (score: 2)

Using `monotonically_increasing_id()` only guarantees that the numbers are increasing; it guarantees neither the starting value nor consecutive numbering. If you want to be sure to get 0, 1, 2, 3, ... you can use the RDD function `zipWithIndex()`.
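Conceptually, `distinct` followed by `zipWithIndex()` pairs each unique value with a consecutive 0-based index, much like Python's built-in `enumerate`. A minimal plain-Python sketch (no Spark required; note that Spark's `distinct` does not preserve input order, so only the indexing behavior carries over):

```python
# The item_code column from the question, duplicates included.
codes = ["I0938", "I0009", "I1097", "C0723", "I0010", "I0010",
         "C0117", "C0117", "I0009", "I0009", "I0010", "I1067",
         "I1067", "C1083", "B0250", "C1346"]

# dict.fromkeys removes duplicates while keeping first-seen order.
distinct = list(dict.fromkeys(codes))

# enumerate plays the role of zipWithIndex: consecutive ids from 0.
indexed = [(code, i) for i, code in enumerate(distinct)]
# indexed[0] == ("I0938", 0)
```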

Since I am not too familiar with Spark and Python, the example below uses Scala, but it should be easy to convert.

import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

val df = Seq("I0938","I0009","I1097","C0723","I0010","I0010",
    "C0117","C0117","I0009","I0009","I0010","I1067",
    "I1067","C1083","B0250","C1346")
  .toDF("item_code")

val df2 = df.distinct.rdd
  .map{case Row(item: String) => item}
  .zipWithIndex()
  .toDF("item_code", "numId")

This will give you the desired result:

+---------+-----+
|item_code|numId|
+---------+-----+
|    I0010|    0|
|    I1067|    1|
|    C0117|    2|
|    I0009|    3|
|    I1097|    4|
|    C1083|    5|
|    I0938|    6|
|    C0723|    7|
|    B0250|    8|
|    C1346|    9|
+---------+-----+