PySpark groupBy and regex-based filter

Date: 2020-03-12 02:32:46

Tags: python dataframe apache-spark pyspark filtering

I have a PySpark df:
from pyspark.sql import functions as F
df.groupBy(['issue_month', 'loan_status']).count().show()

+-----------+------------------+-----+
|issue_month|       loan_status|count|
+-----------+------------------+-----+
|         06|        Fully Paid|12632|
|         03|        Fully Paid|16243|
|         07|           Default|    1|
|         02|        Fully Paid|16467|
|         06|           Default|    1|
|         07|   In Grace Period|  289|
|         01|       Charged Off| 5975|
|         05|       Charged Off| 5209|
|         02|Late (31-120 days)|  184|
|         11|           Current|17525|
|         12|   In Grace Period|  369|
|         10|        Fully Paid|19222|
|         04|        Fully Paid|16802|
|         07|       Charged Off| 7072|
|         06|       Charged Off| 4589|
|         04| Late (16-30 days)|   98|
|       null|              null|    2|
|         10|Late (31-120 days)|  621|
|         07| Late (16-30 days)|  125|
|         10|           Default|    2|
+-----------+------------------+-----+

I only want to keep the rows where loan_status is late, which can be the value "Late (16-30 days)" or "Late (31-120 days)". So I tried:

print(df.groupBy(['issue_month', 'loan_status']).count().filter((F.col('loan_status')=='Late (31-120 days)')|F.col('loan_status')=='Late (16-30 days)').show())

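This fails, and it is clumsy in any case. I would like something simple, as in pandas, where I can filter with a regular expression. In my case, it could look something like this: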

F.col('loan_status').contains("Late")

1 answer:

Answer 0 (score: 1)

PySpark also has the contains() and like() functions, which we can use in .filter().

Example:

from pyspark.sql.functions import col, lower

#sample data
df.show()
#+-----------+------------------+
#|issue_month|       loan_status|
#+-----------+------------------+
#|         10|        Fully Paid|
#|         10|           Default|
#|         10|Late (31-120 days)|
#+-----------+------------------+

#in the filter, convert loan_status to lower case and look for the substring "late"
df.groupBy("issue_month","loan_status").\
count().\
filter(lower(col("loan_status")).contains("late")).\
show()

#using the like function
df.groupBy("issue_month","loan_status").\
count().\
filter(lower(col("loan_status")).like("late%")).\
show()

#I would suggest filtering rows before the groupBy; on big data this can significantly improve performance
df.filter(lower(col("loan_status")).like("late%")).\
groupBy("issue_month","loan_status").\
count().\
show()
#+-----------+------------------+-----+
#|issue_month|       loan_status|count|
#+-----------+------------------+-----+
#|         10|Late (31-120 days)|    1|
#+-----------+------------------+-----+

We can use .agg(sum("count")) to get the sum of the counts regardless of issue_month.

Example:

from pyspark.sql.functions import col, count, lower, sum as _sum
df.show()
#+-----------+------------------+
#|issue_month|       loan_status|
#+-----------+------------------+
#|         10|        Fully Paid|
#|         10|           Default|
#|         11|Late (31-120 days)|
#|         11|Late (31-120 days)|
#|         10| Late (16-30 days)|
#+-----------+------------------+

df.filter(lower(col("loan_status")).contains("late")).\
groupBy("issue_month","loan_status").\
count().\
agg(_sum("count").alias("sum")).\
show()

#+---+
#|sum|
#+---+
#|  3|
#+---+

Update:

df.filter(lower(col("loan_status")).like("late%")).\
groupBy("issue_month","loan_status").\
count().\
groupBy("loan_status").\
agg(_sum("count").alias("sum_count")).\
show()

#the same result can be obtained with a single groupBy
df.filter(lower(col("loan_status")).contains("late")).\
groupBy("loan_status").\
agg(count("*").alias("sum_count")).\
show()

#+------------------+---------+
#|       loan_status|sum_count|
#+------------------+---------+
#|Late (31-120 days)|        2|
#| Late (16-30 days)|        1|
#+------------------+---------+