我需要向具有三个非常简单约束的数据框添加索引列:
从0开始
是连续的
具有确定性
我确定我遗漏了一些明显的东西,因为对于这样一个简单的任务,或者使用非顺序,不确定性越来越单调的id,我发现的示例看起来非常复杂。我不想使用index压缩,然后不得不将以前分开的列现在分开放在单列中,因为我的数据帧在TB中,这似乎是不必要的。我不需要按任何内容进行分区,也不需要按任何顺序进行排序,而我所找到的示例可以做到这一点(使用窗口函数和row_number)。我需要的只是一个简单的0到df.count整数序列。我在这里想念什么?
答案 0 :(得分:2)
我的意思是:如何添加一列,该列的有序单调递增1序列0:df.count? (from comments)
您可以在此处使用row_number()
,但是为此您需要指定一个orderBy()
。由于您没有排序列,因此只需使用monotonically_increasing_id()
。
from pyspark.sql.functions import row_number, monotonically_increasing_id
from pyspark.sql import Window
df = df.withColumn(
"index",
row_number().over(Window.orderBy(monotonically_increasing_id()))-1
)
此外,row_number()
从1开始,因此您必须减去1才能使其从0开始。最后一个值将是df.count - 1
。
我不想使用index进行压缩,然后不得不将以前分隔的列现在分隔为一个列
如果您在调用zipWithIndex
之后使用map
,则可以使用cols = df.columns
df = df.rdd.zipWithIndex().map(lambda row: (row[1],) + tuple(row[0])).toDF(["index"] + cols
,以避免将所有分开的列都变成一个列:
li.dropdown-submenu a#shoppingMenuLabel{
transition-timing-function: ease-in-out 1s;
-moz-transition-timing-function: ease-in-out 1s;
-o-transition-timing-function: ease-in-out 1s;
-webkit-transition-timing-function: ease-in-out 1s;
transition-duration: 1s;
}
.fade {
opacity: 1;
transition: opacity .40s ease-in-out;
-moz-transition: opacity .40s ease-in-out;
-webkit-transition: opacity .40s ease-in-out;
}
#imageContainer img{
width: 427px;
background-repeat: no-repeat;
}
.dropdown-submenu {
position: initial;
}
ul.dropdown-menu{
transition-timing-function: ease-in 2s;
-moz-transition-timing-function: ease-in 2s;
-o-transition-timing-function: ease-in 2s;
-webkit-transition-timing-function: ease-in s;
transition-duration: 1s;
}
ul.dropdown-menu li a{
transition-timing-function: ease-out .50s;
-moz-transition-timing-function: ease-out .50s;
-o-transition-timing-function: ease-out .50s;
-webkit-transition-timing-function: ease-out .50s;
transition-duration: .50s;
}
ul.dropdown-menu li.dropdown-submenu{
transition-timing-function: ease-out 3s;
-moz-transition-timing-function: ease-out 3s;
-o-transition-timing-function: ease-out 3s;
-webkit-transition-timing-function: ease-out 3s;
transition-duration: 3s;
}
ul.dropdown-menu li.dropdown-submenu a:hover{
font-weight: bold;
}
li.dropdown-submenu a ul.dropdown-menu{
transition: ease-out;
}
.dropdown-content a:hover{
background: transparent;
font-weight: bold;
}
li.dropdown-submenu a:hover{
font-weight: bold !important;
}
.dropdown-submenu>.dropdown-menu {
top: 0;
left: 95%;
margin-top: -6px;
margin-left: -1px;
padding-left: 10px;
border: 0;
border-left: 2px solid #f1f1f1 !important;
}
.dropdown-submenu:hover>.dropdown-menu {
display: block;
}
.dropdown-submenu>a:after {
display: block;
content: " ";
float: right;
width: 0;
height: 0;
border-color: transparent;
border-style: solid;
border-width: 5px 0 5px 5px;
border-left-color: searchResults.htmlccc;
margin-top: 5px;
margin-right: 0px;
}
.dropdown-submenu:hover>a:after {
border-left-color: searchResults.htmlfff;
}
.dropdown-submenu.pull-left {
float: none;
}
.dropdown-submenu.pull-left>.dropdown-menu {
left: -100%;
margin-left: 10px;
border: 0;
}
ul.mainDropDown {
margin: 0;
padding: 0;
}
ul.mainDropDown li {
list-style: none;
}
ul.dropdown-menu {
width: 285px;
}
li.dropdown-submenu a {
display: block;
}
@media only screen and (max-width: 800px) {
ul.dropdown-menu {
width: 150px;
}
}
答案 1 :(得分:0)
不确定性能,但这里有一个技巧。
<块引用>注意 - toPandas 会将所有数据收集到驱动程序
from pyspark.sql import SparkSession
# speed up toPandas using arrow
spark = SparkSession.builder.appName('seq-no') \
.config("spark.sql.execution.arrow.pyspark.enabled", "true") \
.config("spark.sql.execution.arrow.enabled", "true") \
.getOrCreate()
df = spark.createDataFrame([
('id1', "a"),
('id2', "b"),
('id2', "c"),
], ["ID", "Text"])
df1 = spark.createDataFrame(df.toPandas().reset_index()).withColumnRenamed("index","seq_no")
df1.show()
+------+---+----+
|seq_no| ID|Text|
+------+---+----+
| 0|id1| a|
| 1|id2| b|
| 2|id2| c|
+------+---+----+