I have a dataset with 'user_name', 'mac', and 'dayte' (day) columns. I want to GROUP BY ['user_name'], and then for each group build a rolling 30-day WINDOW over 'dayte'. Within that rolling 30-day period I want to count the number of distinct 'mac' values and add that count to my data frame. Sample data:
user_name mac dayte
0 001j 7C:D1 2017-09-15
1 0039711 40:33 2017-07-25
2 0459 F0:79 2017-08-01
3 0459 F0:79 2017-08-06
4 0459 F0:79 2017-08-31
5 0459 78:D7 2017-09-08
6 0459 E0:C7 2017-09-16
7 133833 18:5E 2017-07-27
8 133833 F4:0F 2017-07-31
9 133833 A4:E4 2017-08-07
I tried to solve this with a pandas DataFrame:
df['ct_macs'] = df.groupby(['user_name']).rolling('30d', on='dayte').mac.apply(lambda x:len(x.unique()))
but got the error:
Exception: cannot handle a non-unique multi-index!
I also tried it in PySpark, but got errors there as well.
from pyspark.sql import functions as F
from pyspark.sql import Window
#function to calculate number of seconds from number of days
days = lambda i: i * 86400
#convert string timestamp to timestamp type
df= df.withColumn('dayte', df.dayte.cast('timestamp'))
#create window by casting timestamp to long (number of seconds)
w = Window.partitionBy("user_name").orderBy("dayte").rangeBetween(-days(30), 0)
df= df.select("user_name","mac","dayte",F.size(F.denseRank().over(w).alias("ct_mac")))
but got the error:
Py4JJavaError: An error occurred while calling o464.select.
: org.apache.spark.sql.AnalysisException: Window function dense_rank does not take a frame specification.;
I also tried:
df = df.select("user_name", "dayte", F.countDistinct(F.col("mac")).over(w).alias("ct_mac"))
but apparently that is not supported in Spark either (countDistinct over a Window). I'm open to a pure SQL approach, in MySQL or SQL Server, but would prefer Python or Spark.
Answer 0 (score: 0)
PySpark
Window functions in Spark are limited in this respect: countDistinct does not exist as a window function. Instead, you can join the table with itself.
First, let's create the data frame:
df = sc.parallelize([["001j", "7C:D1", "2017-09-15"], ["0039711", "40:33", "2017-07-25"], ["0459", "F0:79", "2017-08-01"],
["0459", "F0:79", "2017-08-06"], ["0459", "F0:79", "2017-08-31"], ["0459", "78:D7", "2017-09-08"],
["0459", "E0:C7", "2017-09-16"], ["133833", "18:5E", "2017-07-27"], ["133833", "F4:0F", "2017-07-31"],
["133833", "A4:E4", "2017-08-07"]]).toDF(["user_name", "mac", "dayte"])
Now for the join and groupBy:
import pyspark.sql.functions as psf
df.alias("left")\
.join(
df.alias("right"),
(psf.col("left.user_name") == psf.col("right.user_name"))
& (psf.col("right.dayte").between(psf.date_add("left.dayte", -30), psf.col("left.dayte"))),
"leftouter")\
.groupBy(["left." + c for c in df.columns])\
.agg(psf.countDistinct("right.mac").alias("ct_macs"))\
.sort("user_name", "dayte").show()
+---------+-----+----------+-------+
|user_name| mac| dayte|ct_macs|
+---------+-----+----------+-------+
| 001j|7C:D1|2017-09-15| 1|
| 0039711|40:33|2017-07-25| 1|
| 0459|F0:79|2017-08-01| 1|
| 0459|F0:79|2017-08-06| 1|
| 0459|F0:79|2017-08-31| 1|
| 0459|78:D7|2017-09-08| 2|
| 0459|E0:C7|2017-09-16| 3|
| 133833|18:5E|2017-07-27| 1|
| 133833|F4:0F|2017-07-31| 2|
| 133833|A4:E4|2017-08-07| 3|
+---------+-----+----------+-------+
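As a side note: although countDistinct is rejected over a window, collect_set does accept a frame specification (Spark 2.0+), so the rolling distinct count can also be computed without a self-join. A minimal sketch, assuming 'dayte' can be cast to a timestamp as in the question:
from pyspark.sql import Window
from pyspark.sql import functions as F

days = lambda i: i * 86400  # number of seconds in i days

# range frames need a numeric ordering column, so cast the timestamp to long
w = Window.partitionBy("user_name")\
    .orderBy(F.col("dayte").cast("timestamp").cast("long"))\
    .rangeBetween(-days(30), 0)

# the set of macs seen in the trailing 30 days; its size is the distinct count
df = df.withColumn("ct_macs", F.size(F.collect_set("mac").over(w)))
Since collect_set deduplicates within the frame, its size is exactly the distinct count, at the cost of materializing each window's set.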
Pandas
This works in Python 3:
import pandas as pd
import numpy as np

# rolling with a '30D' time offset requires a datetime column
df["dayte"] = pd.to_datetime(df["dayte"])
# rolling.apply only handles numeric data, so encode each mac as an
# integer code first (note this overwrites the original mac column)
df["mac"] = pd.factorize(df["mac"])[0]
df.groupby('user_name').rolling('30D', on="dayte").mac.apply(lambda x: len(np.unique(x)))
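To add the counts back to the data frame, as the question asks: the groupby/rolling apply returns a Series with a MultiIndex whose outer level is user_name. A minimal sketch, assuming a pandas version where the inner index level is df's original (unique) row index:
counts = df.groupby('user_name').rolling('30D', on="dayte").mac.apply(lambda x: len(np.unique(x)))
# drop the user_name level so the counts align with df's original index
df["ct_macs"] = counts.reset_index(level=0, drop=True).astype(int)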