DataFrame and its corresponding RDD return different rows (PySpark)

Date: 2016-07-30 22:20:22

Tags: apache-spark pyspark rdd spark-dataframe pyspark-sql

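I am seeing strange behavior: a DataFrame, and the downstream list and map generated from its RDD equivalent, appear to return different rows. What could be going wrong? Any help is appreciated.

The code snippet and its output are below: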

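  1. samples is a DataFrame with 10 rows and 3 columns (obtained by drawing 10 random rows from another, larger DataFrame, subset_df). I then concatenate the first two columns.
  2. The detailed code is below. I print the DataFrame, then a key-value map of counts generated from it, and finally a processed version of its RDD. Ideally, all three should contain the same set of URLs, but they differ. I would understand if only the order differed (since calling .collect() on an RDD may return rows in a different order), but some of the returned rows are entirely different. For example, the third output seems to contain several URLs that never existed in the DataFrame this RDD was generated from. This looks very strange!
  3. Full code: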

    from pyspark.sql import functions as func

    # Take up to num_samples random rows with a non-empty URL from subset_df.
    samples = (subset_df.select("post_visid_low", "post_visid_high", "post_page_url")
               .where(subset_df["post_page_url"] != "")
               .sample(False, 0.1, seed=0)
               .limit(num_samples))

    # Build user_id as "<post_visid_low>-<post_visid_high>".
    tmp = samples.select(
        func.concat(func.col("post_visid_low"), func.lit("-"), func.col("post_visid_high")).alias("user_id"),
        "post_page_url")
    print("tmp show:")
    tmp.show(10, False)

    # term freq computation
    vocab = tmp.select("post_page_url").groupBy("post_page_url").count().rdd.collectAsMap()
    for k, v in vocab.items():
        print(k, v)

    # group by user_ids
    user_id_urls = tmp.rdd.reduceByKey(lambda x, y: x + "," + y)
    num_users = user_id_urls.count()
    print("user_id_urls:")
    user_id_urls.collect()
    

Output:

tmp dataframe show():

    +---------------------------------------+--------------------------------------------------------------------------------------------+ 
    |user_id                                |post_page_url                                                                               | 
    +---------------------------------------+--------------------------------------------------------------------------------------------+ 
    |6917530152391623611-2707424459370863148|http://www.backcountry.com/Store/catalog/shopAllBrands.jsp                                  | 
    |6917530609264617841-2788188800375174579|http://www.backcountry.com/Store/catalog/shopAllBrands.jsp                                  | 
    |6917530818644021208-2821777435347267515|http://www.backcountry.com                                                                  | 
    |6917530818644021208-2821777435347267515|http://www.backcountry.com/rc/mens-sale-snow-outerwear-jackets                              | 
    |6917530818644021208-2821777435347267515|http://www.backcountry.com/rc/mens-sale-snow-outerwear-jackets                              | 
    |6917530818644021208-2821777435347267515|http://www.backcountry.com/dakine-washburn-jacket-mens                                      | 
    |1657310128-1262694438                  |http://www.backcountry.com/santa-cruz-bicycles-5010-2.0-carbon-r-complete-mountain-bike-2016| 
    |4611687717086954899-2907911088913069555|http://www.backcountry.com/ugg-bixbee-bootie-toddler-infant-boys                            | 
    |2023386797-562458996                   |http://www.backcountry.com                                                                  | 
    |6917530783747871522-2923626095076314968|http://www.backcountry.com/pikolinos-verona-boot-womens                                     | 
    +---------------------------------------+--------------------------------------------------------------------------------------------+ 
    

vocab map:

    http://www.backcountry.com/boys-jackets 2 
    http://www.backcountry.com/dakine-titan-mittens 1 
    https://www.backcountry.com/Store/account/account.jsp 1 
    http://www.backcountry.com/ski-clothing 1 
    http://www.backcountry.com/the-north-face-runners-1-etip-glove 1 
    http://www.backcountry.com/patagonia 1 
    http://www.backcountry.com/burton-boys-clothing 1 
    http://www.backcountry.com/mens-shorts 1 
    https://www.backcountry.com/Store/account/login.jsp 1
    

user_id_urls rdd:

    [(u'4611687717086954899-2907911088913069555', 
      u'http://www.backcountry.com/ugg-bixbee-bootie-toddler-infant-boys'), 
     (u'2023386797-562458996', u'http://www.backcountry.com'), 
     (u'6917530783747871522-2923626095076314968', 
      u'http://www.backcountry.com/pikolinos-verona-boot-womens'), 
     (u'6917530818644021208-2821777435347267515', 
      u'http://www.backcountry.com,http://www.backcountry.com/rc/mens-sale-snow-outerwear-jackets,http://www.backcountry.com/rc/mens-sale-snow-outerwear-jackets,http://www.backcountry.com/dakine-washburn-jacket-mens'), 
     (u'6917530152391623611-2707424459370863148', 
      u'http://www.backcountry.com/Store/catalog/shopAllBrands.jsp'), 
     (u'6917530609264617841-2788188800375174579', 
      u'http://www.backcountry.com/Store/catalog/shopAllBrands.jsp'), 
     (u'1657310128-1262694438', 
      u'http://www.backcountry.com/santa-cruz-bicycles-5010-2.0-carbon-r-complete-mountain-bike-2016')] 
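
For comparison, here is a minimal plain-Python sketch of what the two aggregations would produce if both were computed from one fixed snapshot of the rows. The `rows` list is copied by hand from the tmp.show() output above, and the helper names `term_counts` and `urls_by_user` are hypothetical, not part of PySpark. In a real session the snapshot could come from a single `tmp.collect()` call. Against this baseline, the user_id_urls output shown above matches the displayed DataFrame, while the keys of the vocab map do not. One possibility this suggests (an assumption, not confirmed here) is that each Spark action re-evaluates the lazy `sample(...).limit(...)` plan and may pick different rows.

```python
from collections import Counter

# (user_id, post_page_url) pairs copied from the tmp.show() output above.
# Assumption: in Spark this snapshot would be taken once, e.g. rows = tmp.collect().
rows = [
    ("6917530152391623611-2707424459370863148", "http://www.backcountry.com/Store/catalog/shopAllBrands.jsp"),
    ("6917530609264617841-2788188800375174579", "http://www.backcountry.com/Store/catalog/shopAllBrands.jsp"),
    ("6917530818644021208-2821777435347267515", "http://www.backcountry.com"),
    ("6917530818644021208-2821777435347267515", "http://www.backcountry.com/rc/mens-sale-snow-outerwear-jackets"),
    ("6917530818644021208-2821777435347267515", "http://www.backcountry.com/rc/mens-sale-snow-outerwear-jackets"),
    ("6917530818644021208-2821777435347267515", "http://www.backcountry.com/dakine-washburn-jacket-mens"),
    ("1657310128-1262694438", "http://www.backcountry.com/santa-cruz-bicycles-5010-2.0-carbon-r-complete-mountain-bike-2016"),
    ("4611687717086954899-2907911088913069555", "http://www.backcountry.com/ugg-bixbee-bootie-toddler-infant-boys"),
    ("2023386797-562458996", "http://www.backcountry.com"),
    ("6917530783747871522-2923626095076314968", "http://www.backcountry.com/pikolinos-verona-boot-womens"),
]


def term_counts(rows):
    """Count occurrences of each URL (mirrors groupBy("post_page_url").count())."""
    return Counter(url for _, url in rows)


def urls_by_user(rows):
    """Join each user's URLs with commas (mirrors reduceByKey(lambda x, y: x + "," + y))."""
    grouped = {}
    for user_id, url in rows:
        if user_id in grouped:
            grouped[user_id] = grouped[user_id] + "," + url
        else:
            grouped[user_id] = url
    return grouped


vocab = term_counts(rows)
user_id_urls = urls_by_user(rows)

# Every URL key in vocab comes from the same snapshot that feeds user_id_urls.
for url, count in vocab.items():
    print(url, count)
for user_id, urls in user_id_urls.items():
    print(user_id, urls)
```

If the snapshot hypothesis holds, a common thing to try in Spark itself is persisting `samples` (for example with `samples.cache()`) before running several actions on it, then checking whether the three outputs agree.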
    

0 answers:

No answers yet.