PySpark: group by and count strings in an array

Time: 2019-04-26 14:32:49

Tags: python group-by count pyspark pyspark-sql

I have a DataFrame that was read from a Twitter JSON file.

I am trying to count every URL in the dataset. The URLs are stored in arrays of strings, so some objects have one URL and others have several, as shown here: [image]

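How can I count the occurrences of the URLs inside each array, and get output that shows a count for every individual URL stored in those arrays?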

I am using the following, but it does not produce the result I expect:

print(withURLenglish.groupby('entities.urls.expanded_url').count().sort(desc('count')).show(n=1500,truncate=False))

'entities.urls.expanded_url' returns a single column whose values are arrays of strings.
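Because each value in that column is a whole array, grouping on it treats every distinct array as one key instead of counting the URLs inside it. The plain-Python sketch below only illustrates that difference; it does not use Spark, and the `rows` list and its example.com URLs are made-up stand-ins for the real column.

```python
from collections import Counter

# Hypothetical stand-in for the 'entities.urls.expanded_url' column:
# one list of URL strings per tweet (sample values are made up).
rows = [
    ["http://example.com/a", "http://example.com/b"],
    ["http://example.com/a"],
    ["http://example.com/b", "http://example.com/a"],
]

# Grouping by the whole array: every distinct list becomes its own key,
# so the counts describe arrays, not individual URLs.
per_array = Counter(tuple(urls) for urls in rows)

# Flattening first (one item per URL) and then counting
# gives one count per individual URL.
per_url = Counter(url for urls in rows for url in urls)

for url, count in per_url.most_common():
    print(url, count)
```

In PySpark, the flattening step would presumably correspond to `pyspark.sql.functions.explode` applied to the array column before grouping, for example `withURLenglish.select(explode('entities.urls.expanded_url').alias('url')).groupBy('url').count()`. That line is an untested sketch, and the alias `url` is an assumption, not something from the question.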

My DataFrame is defined in the following variables:

from pyspark.sql.functions import desc, size

data = df.na.drop(subset=["user.id"]).select(["user","text","entities", "lang"])
withURLenglish = data.filter(size(data['entities.urls']) > 0).filter(data['lang']=='en').select(["user","text","entities", "lang"])

The schema is as follows: [image]

print(withURLenglish.head())

produces:

  

Row(user=Row(contributors_enabled=False, created_at='Mon Sep 22 20:19:17 +0000 2008', default_profile=False, default_profile_image=False, description='my tweets are like ziti.', favourites_count=605, follow_request_sent=None, followers_count=518, following=None, friends_count=495, geo_enabled=True, id=16409225, id_str='16409225', is_translator=False, lang='en', listed_count=17, location='Same City', name='bojack horton', notifications=None, profile_background_color='FFFFFF', profile_background_image_url='http://pbs.twimg.com/profile_background_images/502679497120837632/BxqZlfVD.jpeg', profile_background_image_url_https='https://pbs.twimg.com/profile_background_images/502679497120837632/BxqZlfVD.jpeg', profile_background_tile=True, profile_banner_url='https://pbs.twimg.com/profile_banners/16409225/1398568361', profile_image_url='http://pbs.twimg.com/profile_images/509891854360264704/R9q_xrfd_normal.jpeg', profile_image_url_https='https://pbs.twimg.com/profile_images/509891854360264704/R9q_xrfd_normal.jpeg', profile_link_color='121444', profile_sidebar_border_color='FFFFFF', profile_sidebar_fill_color='FAFEFF', profile_text_color='243536', profile_use_background_image=False, protected=False, screen_name='aehorton', statuses_count=43978, time_zone='Eastern Time (US & Canada)', url='http://wtfismikewearing.tumblr.com/', utc_offset=-18000, verified=False), text='RT @ComplexMag: Radio stations that switch to classic hip-hop only are seeing a ratings increase. http://tgegregergeg.co/Fn81nNs68R http://tssdfsfsf.co/UvBQ4MDbu9', entities=Row(hashtags=[], media=None, symbols=[], trends=[], urls=[Row(display_url='trib.al/htg6YTP', expanded_url='http://trib.al/htg6YTP', indices=[95, 117], url='http://tregregergge.co/Fn81nNs68R'), Row(display_url='pic.twitter.com/UvBQ4MDbu9', expanded_url='http://twitter.com/ComplexMag/status/549700671382233088/photo/1', indices=[118, 140], url='http://tsdfsfdsf.co/UvBQ4MDbu9')], user_mentions=[Row(id=13049362, id_str='13049362', indices=[3, 14], name='Complex', screen_name='ComplexMag')]), lang='en')

Any ideas are greatly appreciated.

0 answers:

No answers