
Question

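Is there a specific way in PySpark to bind two data frames column-wise, the way cbind does in R?

Example:

Data frame 1 has 10 columns
Data frame 2 has 1 column

I need to bind the two data frames side by side so that they form a single data frame in PySpark.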

Answer 1

First, let's create our data frames:

df1 = spark.createDataFrame(sc.parallelize([10*[c] for c in range(10)]), ["c"+ str(i) for i in range(10)])
df2 = spark.createDataFrame(sc.parallelize([[c] for c in range(10, 20, 1)]), ["c10"])
df1.show()
df2.show()

    +---+---+---+---+---+---+---+---+---+---+
    | c0| c1| c2| c3| c4| c5| c6| c7| c8| c9|
    +---+---+---+---+---+---+---+---+---+---+
    |  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|
    |  1|  1|  1|  1|  1|  1|  1|  1|  1|  1|
    |  2|  2|  2|  2|  2|  2|  2|  2|  2|  2|
    |  3|  3|  3|  3|  3|  3|  3|  3|  3|  3|
    |  4|  4|  4|  4|  4|  4|  4|  4|  4|  4|
    |  5|  5|  5|  5|  5|  5|  5|  5|  5|  5|
    |  6|  6|  6|  6|  6|  6|  6|  6|  6|  6|
    |  7|  7|  7|  7|  7|  7|  7|  7|  7|  7|
    |  8|  8|  8|  8|  8|  8|  8|  8|  8|  8|
    |  9|  9|  9|  9|  9|  9|  9|  9|  9|  9|
    +---+---+---+---+---+---+---+---+---+---+

    +---+
    |c10|
    +---+
    | 10|
    | 11|
    | 12|
    | 13|
    | 14|
    | 15|
    | 16|
    | 17|
    | 18|
    | 19|
    +---+

Then we want to uniquely identify the rows. There is an RDD function that does this: zipWithIndex. The helper below uses it to add an index column to each data frame, so that the two can then be joined on that index:

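from pyspark.sql.types import LongType, StructField, StructType

def zipindexdf(df):
    # Build a new schema; df.schema.add() would mutate the schema object in place
    schema_new = StructType(df.schema.fields + [StructField("index", LongType(), False)])
    # zipWithIndex() yields (Row, index) pairs; flatten each pair into one list of values
    return df.rdd.zipWithIndex().map(lambda l: list(l[0]) + [l[1]]).toDF(schema_new)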

df1_index = zipindexdf(df1)
df1_index.show()
df2_index = zipindexdf(df2)
df2_index.show()

    +---+---+---+---+---+---+---+---+---+---+-----+
    | c0| c1| c2| c3| c4| c5| c6| c7| c8| c9|index|
    +---+---+---+---+---+---+---+---+---+---+-----+
    |  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|    0|
    |  1|  1|  1|  1|  1|  1|  1|  1|  1|  1|    1|
    |  2|  2|  2|  2|  2|  2|  2|  2|  2|  2|    2|
    |  3|  3|  3|  3|  3|  3|  3|  3|  3|  3|    3|
    |  4|  4|  4|  4|  4|  4|  4|  4|  4|  4|    4|
    |  5|  5|  5|  5|  5|  5|  5|  5|  5|  5|    5|
    |  6|  6|  6|  6|  6|  6|  6|  6|  6|  6|    6|
    |  7|  7|  7|  7|  7|  7|  7|  7|  7|  7|    7|
    |  8|  8|  8|  8|  8|  8|  8|  8|  8|  8|    8|
    |  9|  9|  9|  9|  9|  9|  9|  9|  9|  9|    9|
    +---+---+---+---+---+---+---+---+---+---+-----+

    +---+-----+
    |c10|index|
    +---+-----+
    | 10|    0|
    | 11|    1|
    | 12|    2|
    | 13|    3|
    | 14|    4|
    | 15|    5|
    | 16|    6|
    | 17|    7|
    | 18|    8|
    | 19|    9|
    +---+-----+

Answer 2

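To get a column of monotonically increasing, unique, and consecutive integers, use the following on each DataFrame, where colName is the name of the column by which you want to order each DataFrame.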

import pyspark.sql.functions as F
from pyspark.sql.window import Window as W

window = W.orderBy('colName').rowsBetween(W.unboundedPreceding, W.currentRow)

df = df\
 .withColumn('int', F.lit(1))\
 .withColumn('consec_id', F.sum('int').over(window))\
 .drop('int')

To check that everything lines up correctly, use the following code to view the tail of the DataFrame, that is, its last rownums rows.

rownums = 10
df.where(F.col('consec_id')>df.count()-rownums).show()

Use the following code to view the rows of the DataFrame that lie between start_row and end_row (both bounds are exclusive).

start_row = 20
end_row = 30
df.where((F.col('consec_id')>start_row) & (F.col('consec_id')<end_row)).show()

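Update

Another approach that works is the RDD method zipWithIndex(). To simply add a column of consecutive IDs to an existing DataFrame using this RDD method, I:

converted df to an RDD,
applied the zipWithIndex() method,
converted the returned RDD to a DataFrame,
converted that DataFrame back to an RDD,
mapped a lambda function over the RDD to combine the Row objects of the original DataFrame with the index,
converted the final RDD to a DataFrame with the original column names plus an ID column holding the integers created by zipWithIndex().

I also tried modifying the original DataFrame with an index column containing the zipWithIndex() output, similar to what @MaFF did, but that turned out to be even slower. The window function is about an order of magnitude faster than either of these. Most of the extra time appears to come from converting the DataFrame to an RDD and back again.

Please let us know if there is a faster way to add the output of the zipWithIndex() RDD method as a column in the original DataFrame.

Testing on a DataFrame with 42,000 rows and 90 columns gave the following results.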

import time

def test_zip(df):
  startTime = time.time()
  df_1 = df \
  .rdd.zipWithIndex().toDF() \
  .rdd.map(lambda row: (row._1) + (row._2,)) \
  .toDF(df.columns + ['consec_id'])

  start_row = 20000
  end_row = 20010
  df_1.where((F.col('consec_id')>start_row) & (F.col('consec_id')<end_row)).show()
  endTime = time.time() - startTime
  return str(round(endTime,3)) + " seconds"

[test_zip(df) for _ in range(5)]

['59.813 seconds', '39.574 seconds', '36.074 seconds', '35.436 seconds', '35.636 seconds']

import time
import pyspark.sql.functions as F
from pyspark.sql.window import Window as W

def test_win(df):
  startTime = time.time()
  window = W.orderBy('colName').rowsBetween(W.unboundedPreceding, W.currentRow)
  df_2 = df \
  .withColumn('int', F.lit(1)) \
  .withColumn('consec_id', F.sum('int').over(window)) \
  .drop('int')

  start_row = 20000
  end_row = 20010
  df_2.where((F.col('consec_id')>start_row) & (F.col('consec_id')<end_row)).show()
  endTime = time.time() - startTime
  return str(round(endTime,3)) + " seconds"  

[test_win(df) for _ in range(5)]

['4.19 seconds', '4.508 seconds', '4.099 seconds', '4.012 seconds', '4.045 seconds']

import time
from pyspark.sql.types import StructType, StructField
import pyspark.sql.types as T

def test_zip2(df):
  startTime = time.time()
  schema_new = StructType(list(df.schema) + [StructField("consec_id", T.LongType(), False)])
  df_3 = df.rdd.zipWithIndex().map(lambda l: list(l[0]) + [l[1]]).toDF(schema_new)

  start_row = 20000
  end_row = 20010
  df_3.where((F.col('consec_id')>start_row) & (F.col('consec_id')<end_row)).show()
  endTime = time.time() - startTime
  return str(round(endTime,3)) + " seconds"

[test_zip2(df) for _ in range(5)]

['82.795 seconds', '61.689 seconds', '58.181 seconds', '58.01 seconds', '57.765 seconds']
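
The three test functions above repeat the same timing code. One way to factor it out is sketched below; time_runs is a hypothetical helper name, and the call at the end uses the built-in sum only as a stand-in for a Spark job. Because Spark transformations are lazy, the function being timed has to trigger an action such as show() or count(), otherwise only the construction of the query plan is measured.

```python
import time

def time_runs(fn, *args, repeats=5):
    """Call fn(*args) `repeats` times and return each wall-clock duration as a string."""
    results = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        elapsed = time.perf_counter() - start
        results.append(f"{round(elapsed, 3)} seconds")
    return results

# Stand-in workload; with Spark this would be a function that ends in an action.
timings = time_runs(sum, range(1000), repeats=3)
print(timings)
```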

PySpark column-wise bind
