Split a Spark DataFrame by column value and get x rows per column value in the result

Time: 2017-07-02 16:59:49

Tags: python pyspark

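I have the following Spark DataFrame. I am trying to split it by the values of one column and return a new DataFrame that contains x rows for each value of that column.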

Suppose this is the DataFrame I have:

from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, StructType, StructField, IntegerType, DoubleType

sc = SparkContext.getOrCreate()
spark = SparkSession.builder.master('local').getOrCreate()


schema = StructType([
    StructField("INDEX", IntegerType(), True),
    StructField("SYMBOL", StringType(), True),
    StructField("DATETIMETS", StringType(), True),
    StructField("PRICE", DoubleType(), True),
    StructField("SIZE", IntegerType(), True),
])

df = spark\
    .createDataFrame(
        data=[(0,'A','2002-12-01 9:30:20',19.75,30200),
             (1,'A','2002-12-02 9:31:20',29.75,30200),             
             (2,'A','2004-12-03 10:36:20',3.0,30200),
             (3,'A','2006-12-06 22:41:20',24.0,30200),
             (4,'A','2006-12-08 22:42:20',60.0,30200),
             (5,'B','2002-12-09 9:30:20',15.75,30200),
             (6,'B','2002-12-12 9:31:20',49.75,30200),             
             (7,'C','2004-11-02 10:36:20',6.0,30200),
             (8,'C','2007-12-02 22:41:20',50.0,30200),
             (9,'D','2008-12-02 22:42:20',60.0,30200),
             (10,'E','2052-12-02 9:30:20',14.75,30200),
             (11,'A','2062-12-02 9:31:20',12.75,30200),             
             (12,'A','2007-12-02 11:36:20',5.0,30200),
             (13,'A','2008-12-02 22:41:20',40.0,30200),
             (14,'A','2008-12-02 22:42:20',50.0,30200)],
        schema=schema)

Suppose I need at most two rows per symbol, i.e. I want to create a new DataFrame with the following data.

Resulting dataframe

Is there a way to do this other than looping over the symbols and filtering the DataFrame with a 'where' clause for each one?

1 answer:

Answer 0: (score: 1)

Here is one option for taking the first two rows from each SYMBOL:

df.rdd.groupBy(lambda r: r['SYMBOL']).flatMap(lambda x: list(x[1])[:2]).toDF().show()

+-----+------+-------------------+-----+-----+
|INDEX|SYMBOL|         DATETIMETS|PRICE| SIZE|
+-----+------+-------------------+-----+-----+
|    0|     A| 2002-12-01 9:30:20|19.75|30200|
|    1|     A| 2002-12-02 9:31:20|29.75|30200|
|   10|     E| 2052-12-02 9:30:20|14.75|30200|
|    9|     D|2008-12-02 22:42:20| 60.0|30200|
|    7|     C|2004-11-02 10:36:20|  6.0|30200|
|    8|     C|2007-12-02 22:41:20| 50.0|30200|
|    5|     B| 2002-12-09 9:30:20|15.75|30200|
|    6|     B| 2002-12-12 9:31:20|49.75|30200|
+-----+------+-------------------+-----+-----+