I have the following Spark dataframe, and I am trying to split it by column value and return a new dataframe containing x rows per column value.
Suppose this is the dataframe I have:
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, StructType, StructField, IntegerType, DoubleType

spark = SparkSession.builder.master('local').getOrCreate()
schema = StructType([
    StructField("INDEX", IntegerType(), True),
    StructField("SYMBOL", StringType(), True),
    StructField("DATETIMETS", StringType(), True),
    StructField("PRICE", DoubleType(), True),
    StructField("SIZE", IntegerType(), True),
])
df = spark\
.createDataFrame(
data=[(0,'A','2002-12-01 9:30:20',19.75,30200),
(1,'A','2002-12-02 9:31:20',29.75,30200),
(2,'A','2004-12-03 10:36:20',3.0,30200),
(3,'A','2006-12-06 22:41:20',24.0,30200),
(4,'A','2006-12-08 22:42:20',60.0,30200),
(5,'B','2002-12-09 9:30:20',15.75,30200),
(6,'B','2002-12-12 9:31:20',49.75,30200),
(7,'C','2004-11-02 10:36:20',6.0,30200),
(8,'C','2007-12-02 22:41:20',50.0,30200),
(9,'D','2008-12-02 22:42:20',60.0,30200),
(10,'E','2052-12-02 9:30:20',14.75,30200),
(11,'A','2062-12-02 9:31:20',12.75,30200),
(12,'A','2007-12-02 11:36:20',5.0,30200),
(13,'A','2008-12-02 22:41:20',40.0,30200),
(14,'A','2008-12-02 22:42:20',50.0,30200)],
        schema=schema)
Suppose I need at most two rows per symbol, i.e. I want to create a new dataframe with the data shown below.
Is there a way to do this other than looping over each SYMBOL with a 'where' clause?
Answer 0 (score: 1)
Here is one option for getting the first two rows of each SYMBOL:
df.rdd.groupBy(lambda r: r['SYMBOL']).flatMap(lambda x: list(x[1])[:2]).toDF().show()
+-----+------+-------------------+-----+-----+
|INDEX|SYMBOL| DATETIMETS|PRICE| SIZE|
+-----+------+-------------------+-----+-----+
| 0| A| 2002-12-01 9:30:20|19.75|30200|
| 1| A| 2002-12-02 9:31:20|29.75|30200|
| 10| E| 2052-12-02 9:30:20|14.75|30200|
| 9| D|2008-12-02 22:42:20| 60.0|30200|
| 7| C|2004-11-02 10:36:20| 6.0|30200|
| 8| C|2007-12-02 22:41:20| 50.0|30200|
| 5| B| 2002-12-09 9:30:20|15.75|30200|
| 6| B| 2002-12-12 9:31:20|49.75|30200|
+-----+------+-------------------+-----+-----+