How to split a column containing multiple key-value pairs into different columns in pyspark

Asked: 2019-04-23 03:48:18

Tags: python-3.x amazon-web-services pyspark

I am working with a large dataset, Reddit, hosted on AWS. I first read in a small sample:

file_lzo = sc.newAPIHadoopFile("s3://mv559/reddit/sample-data/",
                               "com.hadoop.mapreduce.LzoTextInputFormat",
                               "org.apache.hadoop.io.LongWritable",
                               "org.apache.hadoop.io.Text")

So I got an rdd called file_lzo. I took the first element, and the data looks like:

[(0,
  '{"archived":false,"author":"TistedLogic","author_created_utc":1312615878,"author_flair_background_color":null,"author_flair_css_class":null,"author_flair_richtext":[],"author_flair_template_id":null,"author_flair_text":null,"author_flair_text_color":null,"author_flair_type":"text","author_fullname":"t2_5mk6v","author_patreon_flair":false,"body":"Is it still r\\/BoneAppleTea worthy if it\'s the opposite?","can_gild":true,"can_mod_post":false,"collapsed":false,"collapsed_reason":null,"controversiality":0,"created_utc":1538352000,"distinguished":null,"edited":false,"gilded":0,"gildings":{"gid_1":0,"gid_2":0,"gid_3":0},"id":"e6xucdd","is_submitter":false,"link_id":"t3_9ka1hp","no_follow":true,"parent_id":"t1_e6xu13x","permalink":"\\/r\\/Unexpected\\/comments\\/9ka1hp\\/jesus_fking_woah\\/e6xucdd\\/","removal_reason":null,"retrieved_on":1539714091,"score":2,"send_replies":true,"stickied":false,"subreddit":"Unexpected","subreddit_id":"t5_2w67q","subreddit_name_prefixed":"r\\/Unexpected","subreddit_type":"public"}')]

Then I created a dataframe from this rdd using

df = spark.createDataFrame(file_lzo, ['idx', 'map_col'])
df.show(4)

which looks like this:

+-----+--------------------+
|  idx|             map_col|
+-----+--------------------+
|    0|{"archived":false...|
|70139|{"archived":false...|
|70139|{"archived":false...|
|70139|{"archived":false...|
+-----+--------------------+
only showing top 4 rows

Finally, I want to get the data in the dataframe format shown below, and save it in parquet format in S3 for future use.

[image: the desired results]

I tried to create a schema and then use read.json, but all the values I got were Null.

1 Answer:

Answer 0 (score: 0)

Looking at your desired output, you can treat the json as a column of MapType() and then extract the columns from it.

Start by creating the dataframe:

my_rdd = [(0, {"author": "abc", "id": "012", "archived": "False"}),
          (1, {"author": "bcd", "id": "013", "archived": "False"}),
          (2, {"author": "cde", "id": "014", "archived": "True"}),
          (3, {"author": "edf", "id": "015", "archived": "False"})]
df = sqlContext.createDataFrame(my_rdd, ['idx', 'map_col'])
df.show()
# +---+--------------------+
# |idx|             map_col|
# +---+--------------------+
# |  0|Map(id -> 012, au...|
# |  1|Map(id -> 013, au...|
# |  2|Map(id -> 014, au...|
# |  3|Map(id -> 015, au...|
# +---+--------------------+

Then, if you don't know in advance which keys you want to extract, collect one row and get the keys, for example:

from pyspark.sql import functions as f

one = df.select(f.col('map_col')).rdd.take(1)
my_dict = one[0][0].keys()
my_dict
# dict_keys(['id', 'author', 'archived'])

If you already know the list of keys beforehand, use it directly.

Hence, you can flatten the map column:

keep_cols = [f.col('map_col').getItem(k).alias(k) for k in my_dict]
df.select(keep_cols).show()
# +---+------+--------+
# | id|author|archived|
# +---+------+--------+
# |012|   abc|   False|
# |013|   bcd|   False|
# |014|   cde|    True|
# |015|   edf|   False|
# +---+------+--------+

The methods getItem() and alias() do the magic: the first extracts the selected key from the map column, and the second renames the obtained column as desired.