I have a DataFrame:
+-----+--------+---------+
| usn|log_type|item_code|
+-----+--------+---------+
| 0| 11| I0938|
| 916| 19| I0009|
| 916| 51| I1097|
| 916| 19| C0723|
| 916| 19| I0010|
| 916| 19| I0010|
|12331| 19| C0117|
|12331| 19| C0117|
|12331| 19| I0009|
|12331| 19| I0009|
|12331| 19| I0010|
|12838| 19| I1067|
|12838| 19| I1067|
|12838| 19| C1083|
|12838| 11| B0250|
|12838| 19| C1346|
+-----+--------+---------+
and I want to make an index for each item_code, like this:
+---------+------+
|item_code| numId|
+---------+------+
|    I0938|   0  |
|    I0009|   1  |
|    I1097|   2  |
|    C0723|   3  |
|    I0010|   4  |
|    C0117|   5  |
|    I1067|   6  |
|    C1083|   7  |
|    B0250|   8  |
|    C1346|   9  |
+---------+------+

I am not using monotonically_increasing_id() because it returns a bigint.
Answer 0 (score: 2)
Using monotonically_increasing_id only guarantees that the numbers increase; it makes no guarantee about the starting number or about the numbers being consecutive. If you want to be sure to get 0, 1, 2, 3, ... you can use the RDD function zipWithIndex().
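To see why the ids are increasing but not consecutive: a minimal pure-Python sketch of how monotonically_increasing_id composes its values, assuming the documented layout (partition ID in the upper bits, row number within the partition in the lower 33 bits). The function name and partition sizes are illustrative, not Spark API.

```python
# Sketch (assumption): Spark's monotonically_increasing_id puts the
# partition ID in the upper bits and the per-partition row number in
# the lower 33 bits, so ids jump at every partition boundary.
def simulated_monotonic_ids(partition_sizes):
    """Simulate the ids a DataFrame with these partition sizes would get."""
    ids = []
    for pid, size in enumerate(partition_sizes):
        for row in range(size):
            ids.append((pid << 33) + row)
    return ids

# Two partitions of two rows each: strictly increasing, far from 0..3.
print(simulated_monotonic_ids([2, 2]))  # [0, 1, 8589934592, 8589934593]
```

This is exactly why the answer recommends zipWithIndex() when consecutive numbering matters.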
Since I am not that familiar with Spark in Python, the example below uses Scala, but it should be easy to convert.
import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

val df = Seq("I0938","I0009","I1097","C0723","I0010","I0010",
             "C0117","C0117","I0009","I0009","I0010","I1067",
             "I1067","C1083","B0250","C1346")
  .toDF("item_code")

// Deduplicate, then pair each distinct code with a consecutive 0-based index.
val df2 = df.distinct.rdd
  .map{ case Row(item: String) => item }
  .zipWithIndex()
  .toDF("item_code", "numId")
This will give you the desired result:
+---------+-----+
|item_code|numId|
+---------+-----+
| I0010| 0|
| I1067| 1|
| C0117| 2|
| I0009| 3|
| I1097| 4|
| C1083| 5|
| I0938| 6|
| C0723| 7|
| B0250| 8|
| C1346| 9|
+---------+-----+
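The distinct-then-enumerate idea translates directly to plain Python; a minimal sketch below (not Spark, just the logic). Note one difference: Spark's distinct() does not preserve input order, which is why the Scala output above is numbered differently from the question's desired table, while this sketch keeps first-seen order.

```python
# Pure-Python sketch of "distinct, then zipWithIndex":
# de-duplicate the codes and assign each a consecutive 0-based id.
codes = ["I0938", "I0009", "I1097", "C0723", "I0010", "I0010",
         "C0117", "C0117", "I0009", "I0009", "I0010", "I1067",
         "I1067", "C1083", "B0250", "C1346"]

# dict.fromkeys removes duplicates while keeping first-seen order
num_id = {code: i for i, code in enumerate(dict.fromkeys(codes))}

print(num_id["I0938"], num_id["C1346"])  # 0 9
```

With first-seen order this reproduces the exact numId values the question asks for.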