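I'm running into some strange behavior: a DataFrame, and the downstream list and map generated from its RDD equivalent, seem to return different rows. What could be going wrong? Any help is appreciated.
Below is the code snippet along with its output: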
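samples is a DataFrame with 10 rows and 3 columns (obtained by drawing 10 random rows from another, larger DataFrame, subset_df). I then concatenate the first two columns. I understand that .collect() may return rows in a different order, but some of the returned rows are completely different. For example, the third output seems to produce several URLs that never existed in the DataFrame this RDD was generated from. This looks very strange! Full code: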
# Assumed import: func is taken to be the usual alias for pyspark.sql.functions
from pyspark.sql import functions as func

# Sample rows with a non-empty URL from subset_df, capped at num_samples
samples = subset_df.select("post_visid_low", "post_visid_high", "post_page_url").where(
    subset_df["post_page_url"] != "").sample(False, 0.1, seed=0).limit(num_samples)
# Concatenate the two visitor-id columns into a single user_id column
tmp = samples.select(
    func.concat(func.col("post_visid_low"), func.lit("-"), func.col("post_visid_high")).alias("user_id"),
    "post_page_url")
print("tmp show:")
tmp.show(10, False)
# term freq computation
vocab = tmp.select("post_page_url").groupBy("post_page_url").count().rdd.collectAsMap()
for k, v in vocab.items():
    print(k, v)
# group by user_ids
user_id_urls = tmp.rdd.reduceByKey(lambda x, y: x + "," + y)
num_users = user_id_urls.count()
print("user_id_urls:")
user_id_urls.collect()
Output:
tmp dataframe show():
+---------------------------------------+--------------------------------------------------------------------------------------------+
|user_id                                |post_page_url                                                                               |
+---------------------------------------+--------------------------------------------------------------------------------------------+
|6917530152391623611-2707424459370863148|http://www.backcountry.com/Store/catalog/shopAllBrands.jsp                                  |
|6917530609264617841-2788188800375174579|http://www.backcountry.com/Store/catalog/shopAllBrands.jsp                                  |
|6917530818644021208-2821777435347267515|http://www.backcountry.com                                                                  |
|6917530818644021208-2821777435347267515|http://www.backcountry.com/rc/mens-sale-snow-outerwear-jackets                              |
|6917530818644021208-2821777435347267515|http://www.backcountry.com/rc/mens-sale-snow-outerwear-jackets                              |
|6917530818644021208-2821777435347267515|http://www.backcountry.com/dakine-washburn-jacket-mens                                      |
|1657310128-1262694438                  |http://www.backcountry.com/santa-cruz-bicycles-5010-2.0-carbon-r-complete-mountain-bike-2016|
|4611687717086954899-2907911088913069555|http://www.backcountry.com/ugg-bixbee-bootie-toddler-infant-boys                            |
|2023386797-562458996                   |http://www.backcountry.com                                                                  |
|6917530783747871522-2923626095076314968|http://www.backcountry.com/pikolinos-verona-boot-womens                                     |
+---------------------------------------+--------------------------------------------------------------------------------------------+
vocab map:
http://www.backcountry.com/boys-jackets 2
http://www.backcountry.com/dakine-titan-mittens 1
https://www.backcountry.com/Store/account/account.jsp 1
http://www.backcountry.com/ski-clothing 1
http://www.backcountry.com/the-north-face-runners-1-etip-glove 1
http://www.backcountry.com/patagonia 1
http://www.backcountry.com/burton-boys-clothing 1
http://www.backcountry.com/mens-shorts 1
https://www.backcountry.com/Store/account/login.jsp 1
user_id_urls rdd:
[(u'4611687717086954899-2907911088913069555',
u'http://www.backcountry.com/ugg-bixbee-bootie-toddler-infant-boys'),
(u'2023386797-562458996', u'http://www.backcountry.com'),
(u'6917530783747871522-2923626095076314968',
u'http://www.backcountry.com/pikolinos-verona-boot-womens'),
(u'6917530818644021208-2821777435347267515',
u'http://www.backcountry.com,http://www.backcountry.com/rc/mens-sale-snow-outerwear-jackets,http://www.backcountry.com/rc/mens-sale-snow-outerwear-jackets,http://www.backcountry.com/dakine-washburn-jacket-mens'),
(u'6917530152391623611-2707424459370863148',
u'http://www.backcountry.com/Store/catalog/shopAllBrands.jsp'),
(u'6917530609264617841-2788188800375174579',
u'http://www.backcountry.com/Store/catalog/shopAllBrands.jsp'),
(u'1657310128-1262694438',
u'http://www.backcountry.com/santa-cruz-bicycles-5010-2.0-carbon-r-complete-mountain-bike-2016')]
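To make the mismatch concrete, here is a minimal sketch in plain Python (no Spark needed) that compares the URLs printed by tmp.show() with the keys of the vocab map. The URL values are copied from the output above; the names shown_urls, vocab_urls and unexpected_urls are just illustrative helpers, not part of my actual job.

```python
# URLs copied from the tmp.show() output above (duplicates collapsed by the set)
shown_urls = {
    "http://www.backcountry.com/Store/catalog/shopAllBrands.jsp",
    "http://www.backcountry.com",
    "http://www.backcountry.com/rc/mens-sale-snow-outerwear-jackets",
    "http://www.backcountry.com/dakine-washburn-jacket-mens",
    "http://www.backcountry.com/santa-cruz-bicycles-5010-2.0-carbon-r-complete-mountain-bike-2016",
    "http://www.backcountry.com/ugg-bixbee-bootie-toddler-infant-boys",
    "http://www.backcountry.com/pikolinos-verona-boot-womens",
}

# Keys copied from the vocab map output above
vocab_urls = {
    "http://www.backcountry.com/boys-jackets",
    "http://www.backcountry.com/dakine-titan-mittens",
    "https://www.backcountry.com/Store/account/account.jsp",
    "http://www.backcountry.com/ski-clothing",
    "http://www.backcountry.com/the-north-face-runners-1-etip-glove",
    "http://www.backcountry.com/patagonia",
    "http://www.backcountry.com/burton-boys-clothing",
    "http://www.backcountry.com/mens-shorts",
    "https://www.backcountry.com/Store/account/login.jsp",
}


def unexpected_urls(vocab_keys, shown):
    """Return the vocab keys that never appear among the shown rows, sorted."""
    return sorted(set(vocab_keys) - set(shown))


missing = unexpected_urls(vocab_urls, shown_urls)
print(len(missing), "of", len(vocab_urls), "vocab URLs are not in tmp")
# prints: 9 of 9 vocab URLs are not in tmp
```

So none of the vocab keys overlap with the rows that tmp.show() printed, while the user_id_urls output does match them. Since show, collectAsMap and collect are each separate Spark actions, this at least confirms that they are not all operating on the same 10 rows.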