Adding a row index to a PySpark dataframe (to append a new column / concatenate dataframes side by side)

Time: 2019-03-26 22:42:41

Tags: python apache-spark pyspark

I am trying to concatenate two dataframes side by side. I came across this. In the description of monotonically_increasing_id() it says:

"monotonically_increasing_id() - Returns monotonically increasing 64-bit integers. The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the record number within each partition in the lower 33 bits. The assumption is that the dataframe has fewer than 1 billion partitions, and each partition has fewer than 8 billion records. The function is non-deterministic because its result depends on partition IDs."

I am trying to understand how we can assume that monotonically_increasing_id() produces identical results for the two dataframes being joined, given that it is non-deterministic. If it generates different row numbers for these dataframes, they won't join correctly. The "result depends on partition IDs" part might be the answer, but I don't understand it. Can someone explain?
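For illustration, the bit layout described in the docstring can be reproduced in pure Python. This is only a sketch of the documented encoding, not Spark's actual implementation:

```python
def monotonic_id(partition_id, record_number):
    """Sketch of the 64-bit ID layout: partition ID in the upper
    31 bits, per-partition record number in the lower 33 bits."""
    return (partition_id << 33) | record_number

# Records 0 and 1 in partition 0 get consecutive IDs...
print(monotonic_id(0, 0), monotonic_id(0, 1))  # 0 1
# ...but the first record of partition 1 jumps to 2**33.
print(monotonic_id(1, 0))  # 8589934592
```

This makes the non-determinism concrete: if a shuffle or recomputation places the same row in a different partition, or at a different position within its partition, the ID it receives changes.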

2 answers:

Answer 0 (score: 0):

This is from my experience. monotonically_increasing_id() is somewhat rough. For small use cases you will always get a generically increasing ID. However, with complex shuffles or data usage patterns, it can and will not produce the same values on every evaluation. By this I mean my DF1 went from 1 -> ~100000000, but during a reshuffle DF2 was recomputed by Spark's lazy evaluation, and it went from 1 -> ~48000000 and then 48000001.23 -> 100000000.23. This meant I lost a ton of rows.

The way I solved the problem was with unique Row_IDs. To do that, I have a function below, called Row_Hash, that builds a unique row ID at the front of the columns. No matter how many times the data is shuffled or written, I maintain the uniqueness of my join conditions.

EDIT: What I am going to do is turn all elements of the dataframe's metadata into arrays. The reason for this is that you can specify which element of an array you want to query. This is unlike a dataframe, where, because of shuffles and repartitions, calling take(n) might yield different results, whereas calling array(n) will always output the same result.

With this in mind, let's return to the problem: we need to create a local row identifier where none exists. To do this we fully concatenate the rows (this is for scenarios with no row keys) and call MD5 on top of the product (yes, there is a chance of collision, but it is exceedingly low). This yields a large character string for each row, independent of the rest of the system, allowing the user to use it as a unique row-join key.
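The same idea can be sketched in pure Python with hashlib (this is just an illustration of the hashing scheme, not the Spark code that follows):

```python
import hashlib

def row_hash(row):
    """Concatenate all fields as strings and MD5 the result,
    mirroring md5(concat(cast(col as string), ...)) in Spark SQL."""
    joined = "".join(str(field) for field in row)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

row = ("Alice", "1 Main St", "2019-03-26")
print(row_hash(row))  # deterministic 32-character hex string
```

One caveat of separator-less concatenation (which applies to Spark's concat too): rows ("ab", "c") and ("a", "bc") hash identically, so inserting a delimiter unlikely to appear in the data makes collisions even rarer.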

// Call in the input data frame
val inputDF = ...

// Returns an array of strings holding the column names of the input dataframe
val columnArray = inputDF.columns

// In Scala a var allows us to dynamically update the value.
// This is the start of the command where we concatenate all fields and run MD5;
// we just need to add in the other fields.
var commandString = "md5(concat("
// This will be the set of expression strings we want Spark to run on our columns.
// The reason we pass the names through is that we want to return the base columns.
// Think of a select query.
var commandArray = columnArray

// This is an iterator moving from 1 to n, n being the number of columns
var columnIterator = 1

// Run while there are still columns we have not acted upon.
while (columnIterator <= columnArray.length) {

  // Take the Nth element of the columns and append a statement casting it to string
  // (note we append to commandString rather than overwrite it)
  commandString = commandString + "cast(" + columnArray(columnIterator - 1) + " as string)"

  // If we are not at the last element of the column array, add a comma so that
  // N elements can be concatenated (the space is aesthetically pleasing)
  if (columnIterator != columnArray.length) { commandString = commandString + ", " }
  // Iterate
  columnIterator = columnIterator + 1
}

// Prepend the command we just built to the front of the command array, with a few
// extra characters to close the expression and name it something consistent.
// So for a DF of Name, Addr, Date, the statement below produces
// Array("md5(concat(cast(Name as string), cast(Addr as string), cast(Date as string))) as Row_Hash", "Name", "Addr", "Date")
commandArray = Array(commandString + ")) as Row_Hash") ++ commandArray

// selectExpr runs independent strings through a standard SQL framework
// (kind of a little bit of column A, a little bit of column B).
// Each string is its own element, so based on the example DF,
// inputDF.selectExpr("Name", "length(Addr) as Addr_Length", "Addr", "Date")
// will output a DF with four elements: Name, an integer length of column Addr, Addr, and Date.
// In the lines above we built those strings into the command array.
// commandArray: _* means we want Spark to run all elements of the array through the select statement.
val finalDF = inputDF.selectExpr(commandArray: _*)
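Since the question is about PySpark, the same expression-building loop can be sketched in Python; the column names Name, Addr, Date are just the example from the comments above:

```python
def build_row_hash_exprs(columns):
    """Build the selectExpr string list: an MD5 over all columns
    cast to string, followed by the original base columns."""
    casts = ", ".join("cast({} as string)".format(c) for c in columns)
    return ["md5(concat({})) as Row_Hash".format(casts)] + list(columns)

exprs = build_row_hash_exprs(["Name", "Addr", "Date"])
print(exprs[0])
# md5(concat(cast(Name as string), cast(Addr as string), cast(Date as string))) as Row_Hash

# With Spark (not run here): final_df = input_df.selectExpr(*exprs)
```

Unpacking the list with *exprs plays the role of commandArray: _* in the Scala version.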

Answer 1 (score: 0):

Here is the best way I have found so far to add an index to a dataframe df:

new_columns = df.columns + ["row_idx"]

# Adding row index
df = df\
    .rdd\
    .zipWithIndex()\
    .map(lambda pair: pair[0] + (pair[1],)).toDF()  # pair = (row, rowindex)

# Renaming all the columns
df = df.toDF(*new_columns)

It does have the overhead of converting to an rdd and then back to a dataframe. However, monotonically_increasing_id() is non-deterministic, and row_number() requires a Window, which may not be ideal unless used with PARTITION BY; otherwise it shuffles all the data to one partition, defeating the purpose of pyspark.
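The zipWithIndex/map step above can be checked without Spark, with plain tuples standing in for Row objects (note that Spark's zipWithIndex yields (row, index), while Python's enumerate yields (index, row), so the unpacking order differs):

```python
rows = [("a", 1), ("b", 2), ("c", 3)]

# Pair each row with its position, analogous to rdd.zipWithIndex()
with_index = list(enumerate(rows))  # [(0, ("a", 1)), (1, ("b", 2)), ...]

# Append the index as a new trailing field, like the
# lambda pair: pair[0] + (pair[1],) step in the answer above
indexed_rows = [row + (idx,) for idx, row in with_index]
print(indexed_rows)  # [('a', 1, 0), ('b', 2, 1), ('c', 3, 2)]
```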

So, to add a list as a new column in a dataframe, simply convert the list to a dataframe

new_df = spark.createDataFrame([(l,) for l in lst], ['new_col'])

and add the row index to it as above. Then join:

joined_df = df.join(new_df, ['row_idx'], 'inner')
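The end-to-end pattern (index both sides, then inner-join on row_idx) can also be simulated without Spark using dictionaries keyed by index; the sample data here is hypothetical:

```python
df_rows = [("a", 1), ("b", 2)]  # stand-in for the original dataframe
lst = [10.0, 20.0]              # the list to attach as a new column

# Index both sides, as done with zipWithIndex above
left = {idx: row for idx, row in enumerate(df_rows)}
right = {idx: (val,) for idx, val in enumerate(lst)}

# Inner join on the shared row index
joined = [left[k] + right[k] for k in sorted(left.keys() & right.keys())]
print(joined)  # [('a', 1, 10.0), ('b', 2, 20.0)]
```

The inner join keeps only indices present on both sides, matching the 'inner' join type in the snippet above.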