I have a PySpark dataframe like this:
+---+---+----------+----------+
|id_|  p|        d1|        d2|
+---+---+----------+----------+
|  1|  A|2018-09-26|2018-10-26|
|  2|  B|2018-06-21|2018-07-19|
|  2|  B|2018-08-13|2018-10-07|
|  2|  B|2018-12-31|2019-02-27|
|  2|  B|2019-05-28|2019-06-25|
|  3|  C|2018-06-15|2018-07-13|
|  3|  C|2018-08-15|2018-10-09|
|  3|  C|2018-12-03|2019-03-12|
|  3|  C|2019-05-10|2019-06-07|
|  4|  A|2019-01-30|2019-03-01|
|  4|  A|2019-05-30|2019-07-25|
|  5|  C|2018-09-19|2018-10-17|
+---+---+----------+----------+
From this, I want to create and fill another PySpark dataframe that has n columns ranging from min(d1) to max(d2), where each column is one date in that range.
I want to fill this dataframe with 1s and 0s for each row. For row 1, I want to fill all the days from d1 of row 1 to d2 of row 1 with 1 and the rest of the columns with 0, and similarly for every other row of the dataframe.
I do something like this in pandas for it:
import numpy as np
import pandas as pd

result = pd.DataFrame(data=0, columns=pd.period_range(data['d1'].min(), data['d2'].max(), freq='D'), index=data.index)
for c in result.columns:
    result[c] = np.where((c.end_time >= data.d1) & (c.start_time <= data.d2), 1, 0)
How do I do the same thing in PySpark?
Answer 0 (score: 1)
Here is one way (I used only a few rows and a smaller date range here so the output can be printed):
from pyspark.sql import SparkSession, Row
import pyspark.sql.functions as F
import pyspark.sql.types as T
import datetime

# returns a dict with one key per date in the range; the value is 1 if that
# date lies between d1 and d2 (inclusive), otherwise 0
def fill_dates(d1, d2, start_date, no_of_date_cols):
    start_date = datetime.datetime.strptime(start_date, '%Y-%m-%d')
    d1 = datetime.datetime.strptime(d1, '%Y-%m-%d')
    d2 = datetime.datetime.strptime(d2, '%Y-%m-%d')
    cols = {}
    for x in range(0, no_of_date_cols):
        day = start_date + datetime.timedelta(days=x)
        col = day.strftime('%Y-%m-%d')
        cols[col] = 1 if d1 <= day <= d2 else 0
    return cols
spark = SparkSession \
    .builder \
    .appName("Filling_Dates_Cols") \
    .config("master", "local") \
    .getOrCreate()
df = spark.createDataFrame([
    [1, 'A', '2018-09-26', '2018-09-28'],
    [2, 'B', '2018-09-20', '2018-09-22'],
    [2, 'B', '2018-09-23', '2018-09-26'],
    [3, 'C', '2018-09-15', '2018-09-26']
], schema=['id', 'p', 'd1', 'd2'])
min_max_dates = df.select(
    F.min('d1').alias('min'),
    F.max('d2').alias('max')
).collect()[0]

min_date = min_max_dates[0]
max_date = min_max_dates[1]

d1 = datetime.datetime.strptime(min_date, '%Y-%m-%d')
d2 = datetime.datetime.strptime(max_date, '%Y-%m-%d')

no_of_date_cols = (d2 - d1).days + 1

# build a struct schema with one IntegerType field per date in the range
schema = T.StructType()
for x in range(0, no_of_date_cols):
    new_col = (d1 + datetime.timedelta(days=x)).strftime('%Y-%m-%d')
    schema = schema.add(new_col, T.IntegerType())

fill_dates_udf = F.udf(fill_dates, schema)

df = df.withColumn(
    'dates',
    fill_dates_udf(F.col('d1'), F.col('d2'), F.lit(min_date), F.lit(no_of_date_cols))
)

# expand the struct into one column per date
df.select('id', 'p', 'd1', 'd2', 'dates.*').show()
This results in:
+---+---+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+
| id| p| d1| d2|2018-09-15|2018-09-16|2018-09-17|2018-09-18|2018-09-19|2018-09-20|2018-09-21|2018-09-22|2018-09-23|2018-09-24|2018-09-25|2018-09-26|2018-09-27|2018-09-28|
+---+---+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+
| 1| A|2018-09-26|2018-09-28| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 1| 1| 1|
| 2| B|2018-09-20|2018-09-22| 0| 0| 0| 0| 0| 1| 1| 1| 0| 0| 0| 0| 0| 0|
| 2| B|2018-09-23|2018-09-26| 0| 0| 0| 0| 0| 0| 0| 0| 1| 1| 1| 1| 0| 0|
| 3| C|2018-09-15|2018-09-26| 1| 1| 1| 1| 1| 1| 1| 1| 1| 1| 1| 1| 0| 0|
+---+---+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+
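To see what the UDF produces for a single row, fill_dates can also be called as a plain Python function: it just returns a dict mapping each date string in the window to 0 or 1, which is what the struct return type captures. A quick illustration outside Spark, using a hypothetical 5-day window:
# hypothetical standalone call, outside any Spark job
print(fill_dates('2018-09-20', '2018-09-22', '2018-09-19', 5))
# {'2018-09-19': 0, '2018-09-20': 1, '2018-09-21': 1, '2018-09-22': 1, '2018-09-23': 0}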
Answer 1 (score: 1)
One way using a list comprehension:
Update: per request, adjusted the d1 and d2 fields from StringType to DateType.
import pandas as pd
from pyspark.sql import functions as F

#... skip the code to initialize SparkSession spark and df

# if d1 and d2 were read as String, convert them to Date using the following.
# Or if the data were already imported with an explicit schema or inferSchema=True when running read.csv(), then skip the following:
df = df.withColumn('d1', F.to_date('d1')) \
       .withColumn('d2', F.to_date('d2'))
>>> df.show()
+---+---+----------+----------+
|id_| p| d1| d2|
+---+---+----------+----------+
| 1| A|2018-09-26|2018-10-26|
| 2| B|2018-06-21|2018-07-19|
| 2| B|2018-08-13|2018-10-07|
| 2| B|2018-12-31|2019-02-27|
| 2| B|2019-05-28|2019-06-25|
| 3| C|2018-06-15|2018-07-13|
| 3| C|2018-08-15|2018-10-09|
| 3| C|2018-12-03|2019-03-12|
| 3| C|2019-05-10|2019-06-07|
| 4| A|2019-01-30|2019-03-01|
| 4| A|2019-05-30|2019-07-25|
| 5| C|2018-09-19|2018-10-17|
+---+---+----------+----------+
>>> df.printSchema()
root
 |-- id_: string (nullable = true)
 |-- p: string (nullable = true)
 |-- d1: date (nullable = true)
 |-- d2: date (nullable = true)
Get min(d1) as start_date and max(d2) as end_date:
d = df.select(F.min('d1').alias('start_date'), F.max('d2').alias('end_date')).first()
>>> d
Row(start_date=datetime.date(2018, 6, 15), end_date=datetime.date(2019, 7, 25))
cols = [ c.to_timestamp().date() for c in pd.period_range(d.start_date, d.end_date, freq='D') ]
>>> cols
[datetime.date(2018, 6, 15),
datetime.date(2018, 6, 16),
...
datetime.date(2019, 7, 23),
datetime.date(2019, 7, 24),
datetime.date(2019, 7, 25)]
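If you would rather not depend on pandas just for the date range, a plain datetime loop builds the same list (a small sketch, assuming the Row d obtained above):
import datetime

# pandas-free equivalent of the period_range call above
n_days = (d.end_date - d.start_date).days + 1
cols = [d.start_date + datetime.timedelta(days=i) for i in range(n_days)]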
Use a list comprehension to iterate over all the dates in cols, setting each column's value with F.when(condition, 1).otherwise(0) and using str(c) as the column name (alias):
result = df.select('id_', *[ F.when((df.d1 <= c)&(df.d2 >= c),1).otherwise(0).alias(str(c)) for c in cols ])
# check data in some columns
result.select('id_', str(d.start_date), '2019-01-01', str(d.end_date)).show()
+---+----------+----------+----------+
|id_|2018-06-15|2019-01-01|2019-07-25|
+---+----------+----------+----------+
| 1| 0| 0| 0|
| 2| 0| 0| 0|
| 2| 0| 0| 0|
| 2| 0| 1| 0|
| 2| 0| 0| 0|
| 3| 1| 0| 0|
| 3| 0| 0| 0|
| 3| 0| 1| 0|
| 3| 0| 0| 0|
| 4| 0| 0| 0|
| 4| 0| 0| 1|
| 5| 0| 0| 0|
+---+----------+----------+----------+
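If you also want p, d1 and d2 next to the indicator columns, they can simply be added to the same select (a sketch built on the same expression as above; result_full is just an illustrative name):
# variant keeping the original columns alongside the generated date columns
result_full = df.select(
    'id_', 'p', 'd1', 'd2',
    *[F.when((df.d1 <= c) & (df.d2 >= c), 1).otherwise(0).alias(str(c)) for c in cols]
)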