I have a PySpark df:

from pyspark.sql import functions as F
print(df.groupBy(['issue_month', 'loan_status']).count().show())
+-----------+------------------+-----+
|issue_month| loan_status|count|
+-----------+------------------+-----+
| 06| Fully Paid|12632|
| 03| Fully Paid|16243|
| 07| Default| 1|
| 02| Fully Paid|16467|
| 06| Default| 1|
| 07| In Grace Period| 289|
| 01| Charged Off| 5975|
| 05| Charged Off| 5209|
| 02|Late (31-120 days)| 184|
| 11| Current|17525|
| 12| In Grace Period| 369|
| 10| Fully Paid|19222|
| 04| Fully Paid|16802|
| 07| Charged Off| 7072|
| 06| Charged Off| 4589|
| 04| Late (16-30 days)| 98|
| null| null| 2|
| 10|Late (31-120 days)| 621|
| 07| Late (16-30 days)| 125|
| 10| Default| 2|
+-----------+------------------+-----+
I only want to keep the rows where loan_status is late, i.e. the value is "Late (16-30 days)" or "Late (31-120 days)". So I tried:
print(df.groupBy(['issue_month', 'loan_status']).count().filter((F.col('loan_status')=='Late (31-120 days)')|F.col('loan_status')=='Late (16-30 days)').show())
This fails, but never mind that it's messy. I'd like something simple, like in pandas where I can filter on a regex. In my case that might look something like:
F.col('loan_status').contains("Late")
Answer (score: 1)
Pyspark also has contains() (or) like functions that we can use inside .filter().
Example:
from pyspark.sql.functions import col, lower

#sample data
df.show()
#+-----------+------------------+
#|issue_month| loan_status|
#+-----------+------------------+
#| 10| Fully Paid|
#| 10| Default|
#| 10|Late (31-120 days)|
#+-----------+------------------+
#in the filter, convert loan_status to lower case and look for the substring "late"
df.groupBy("issue_month","loan_status").\
count().\
filter(lower(col("loan_status")).contains("late")).\
show()
#by using like function
df.groupBy("issue_month","loan_status").\
count().\
filter(lower(col("loan_status")).like("late%")).\
show()
#I would suggest filtering rows before the groupBy; it significantly improves performance on big data!
df.filter(lower(col("loan_status")).like("late%")).\
groupBy("issue_month","loan_status").\
count().\
show()
#+-----------+------------------+-----+
#|issue_month| loan_status|count|
#+-----------+------------------+-----+
#| 10|Late (31-120 days)| 1|
#+-----------+------------------+-----+
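If you specifically want the pandas-style regex filter mentioned in the question, rlike() takes a regular expression; a minimal sketch on the same sample data (the (?i) flag is just an assumption to make the match case-insensitive):

#rlike() takes a (Java) regular expression; (?i) makes the match case-insensitive
df.filter(col("loan_status").rlike("(?i)late")).\
groupBy("issue_month","loan_status").\
count().\
show()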
We can use .agg(sum("count")) to get the sum of the counts irrespective of issue_month.
Example:
from pyspark.sql.functions import sum as _sum, count
df.show()
#+-----------+------------------+
#|issue_month| loan_status|
#+-----------+------------------+
#| 10| Fully Paid|
#| 10| Default|
#| 11|Late (31-120 days)|
#| 11|Late (31-120 days)|
#| 10| Late (16-30 days)|
#+-----------+------------------+
df.filter(lower(col("loan_status")).contains("late")).\
groupBy("issue_month","loan_status").\
count().\
agg(_sum("count").alias("sum")).\
show()
#+---+
#|sum|
#+---+
#| 3|
#+---+
Update:
df.filter(lower(col("loan_status")).like("late%")).\
groupBy("issue_month","loan_status").\
count().\
groupBy("loan_status").\
agg(_sum("count").alias("sum_count")).\
show()
#the same result can be obtained with a single groupBy as well
df.filter(lower(col("loan_status")).contains("late")).\
groupBy("loan_status").\
agg(count("*").alias("sum_count")).\
show()
#+------------------+---------+
#| loan_status|sum_count|
#+------------------+---------+
#|Late (31-120 days)| 2|
#| Late (16-30 days)| 1|
#+------------------+---------+
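As a side note, the filter in the question most likely fails because of operator precedence: in Python, | binds tighter than ==, so each comparison needs its own parentheses. A sketch of a corrected version, plus an isin() alternative for matching the two exact values (both assume the df and the F alias from the question):

#wrap each comparison in parentheses so | combines two boolean columns
df.groupBy("issue_month","loan_status").count().\
filter((F.col("loan_status")=="Late (31-120 days)") | (F.col("loan_status")=="Late (16-30 days)")).\
show()

#or list the two exact values with isin()
df.filter(F.col("loan_status").isin("Late (16-30 days)","Late (31-120 days)")).\
groupBy("issue_month","loan_status").\
count().\
show()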