Since the dataset is too large, I am building the Spark DataFrame in two steps. This is e-commerce data containing purchases and events such as list_view, add_to_card, etc. My goal is to get the events (list_view) of users who have made a purchase.
First, I select a sample of purchasing users. Then I want to fetch the list_view and other events for those purchasers.
The approach below does not work:
def get_sparkdf(data_paths, filters, agg_by, agg_func, fields, sample_rate=None, seed=None):
    spark_df = spark.read.parquet(*data_paths)
    # filtering
    if filters is not None:
        for filter_ in filters:
            spark_df = spark_df.filter(filter_)
    # perform groupby if arguments are given
    if (agg_by is not None) and (agg_func is not None):
        spark_df = spark_df.groupby(agg_by).agg(*agg_func)
    # select fields if given
    if fields is not None:
        spark_df = spark_df.select(*fields)
    # sample to scale down the data
    if sample_rate is not None:
        spark_df = spark_df.sample(False, sample_rate, seed)
    return spark_df
# filters_attr is a filter on purchase events.
df_events = get_sparkdf(attr_path, filters=filters_attr, agg_by=None,
                        agg_func=None, fields=['user_id', 'event_data', 'ts'], sample_rate=None)
df_uid_sampled = df_events.select('user_id').distinct().sample(False, 0.04, None)
For step 2 (fetching the other events, e.g. list_view, for that sample of users), I added a join afterwards, but it is too slow and never completes. Roughly, the join step looked like the sketch below.
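(view_path and filters_view here are placeholders standing in for my real arguments, not the exact code:)

# Rough shape of the slow step: load the view events with the same helper,
# then inner-join against the sampled purchasers
df_view = get_sparkdf(view_path, filters=filters_view, agg_by=None,
                      agg_func=None, fields=['user_id', 'event_data', 'ts'])
df_joined = df_view.join(df_uid_sampled, on='user_id', how='inner')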
So instead I want to run a direct SQL query against the parquet files, but I don't know how to do it. Here is my attempt:
df_uid_sampled.createOrReplaceTempView("purchasers")
df_view_events = spark.sql("SELECT user_id, event_data, ts, country FROM parquet <parquet_file> WHERE user_id IN purchasers.user_id")
The error is as follows:
mismatched input ':' expecting {<EOF>, '(', ',', 'WHERE', 'GROUP', 'ORDER', 'HAVING', 'LIMIT', 'JOIN', 'CROSS', 'INNER', 'LEFT', 'RIGHT', 'FULL', 'NATURAL', 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'ANTI'}(line 1, pos 57)
== SQL ==
SELECT user_id, event_data, ts, country FROM parquet hdfs:///history_parquet/bids/2018/03/*/*.parquet WHERE user_id IN purchasers.user_id
---------------------------------------------------------^^^
Answer 0 (score: 0):
I have never seen this syntax (SELECT ... FROM parquet <parquet_file>).
What you can do is load the file, register it as a temporary view, and use that view in the query:
df = spark.read.format('parquet').load('<parquet_file>')
df.createOrReplaceTempView('people')
spark.sql('SELECT user_id, event_data, ts, country FROM people WHERE user_id IN (SELECT user_id FROM purchasers)')
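If the IN subquery is also slow, an equivalent formulation is a left-semi join in the DataFrame API, which keeps only the rows of df whose user_id appears in the sampled purchasers (a sketch reusing df and df_uid_sampled from the question):

# Semi-join alternative: same result as the IN subquery above
df_view_events = df.join(df_uid_sampled, on='user_id', how='left_semi')

Since df_uid_sampled is small after the 4% sample, you can also cache it (df_uid_sampled.cache()) so the sample is not recomputed on every action.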