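I have a DataFrame that was read from a Twitter JSON file.
I am trying to count every URL in the dataset. The URLs are stored in arrays of strings, so some objects have one URL and others have several (the sample row further down has two).
How can I count the occurrences of each URL across these arrays and get output that lists every URL with its count?
I am using the following, but it does not show the expected result: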
print(withURLenglish.groupby('entities.urls.expanded_url').count().sort(desc('count')).show(n=1500,truncate=False))
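'entities.urls.expanded_url' returns a column whose values are arrays of strings, so the groupby above groups by whole arrays rather than by individual URLs.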
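One direction that may work is to flatten the arrays first, so that each URL gets its own row, and only then group and count. The following is a minimal sketch under that assumption, not a verified solution: `count_expanded_urls` and `count_urls_local` are hypothetical helper names, and the sketch assumes the nested path 'entities.urls.expanded_url' resolves to an array<string> column as described above. The second helper repeats the same flatten-then-count logic on plain Python lists, which can be handy for checking a small collected sample.

```python
from collections import Counter


def count_expanded_urls(df):
    """Return one row per distinct URL with its count, most frequent first.

    Assumes `df` is a Spark DataFrame in which 'entities.urls.expanded_url'
    is an array<string> column (an assumption taken from the question).
    """
    # Imported lazily so the plain-Python helper below works without Spark.
    from pyspark.sql import functions as F

    return (
        # explode() emits one output row per array element.
        df.select(F.explode("entities.urls.expanded_url").alias("expanded_url"))
        .groupBy("expanded_url")
        .count()
        .orderBy(F.desc("count"))
    )


def count_urls_local(url_lists):
    """Same flatten-then-count logic on plain Python lists.

    `url_lists` is an iterable of lists of URL strings; None and empty
    lists are skipped.
    """
    counts = Counter(url for urls in url_lists for url in (urls or []))
    # most_common() sorts by count, highest first.
    return counts.most_common()
```

With the DataFrame from this question, the call would look like `count_expanded_urls(withURLenglish).show(n=1500, truncate=False)`. Note that `show()` prints the table itself and returns None, so wrapping it in `print()` only adds a line reading `None`.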
My DataFrame is defined in the following variables:
data = df.na.drop(subset=["user.id"]).select(["user","text","entities", "lang"])
withURLenglish = data.filter(size(data['entities.urls']) > 0).filter(data['lang']=='en').select(["user","text","entities", "lang"])
print(withURLenglish.head())
which produces:
Row(user=Row(contributors_enabled=False, created_at='Mon Sep 22 20:19:17 +0000 2008', default_profile=False, default_profile_image=False, description='my tweets are like ziti.', favourites_count=605, follow_request_sent=None, followers_count=518, following=None, friends_count=495, geo_enabled=True, id=16409225, id_str='16409225', is_translator=False, lang='en', listed_count=17, location='Same City', name='bojack horton', notifications=None, profile_background_color='FFFFFF', profile_background_image_url='http://pbs.twimg.com/profile_background_images/502679497120837632/BxqZlfVD.jpeg', profile_background_image_url_https='https://pbs.twimg.com/profile_background_images/502679497120837632/BxqZlfVD.jpeg', profile_background_tile=True, profile_banner_url='https://pbs.twimg.com/profile_banners/16409225/1398568361', profile_image_url='http://pbs.twimg.com/profile_images/509891854360264704/R9q_xrfd_normal.jpeg', profile_image_url_https='https://pbs.twimg.com/profile_images/509891854360264704/R9q_xrfd_normal.jpeg', profile_link_color='121444', profile_sidebar_border_color='FFFFFF', profile_sidebar_fill_color='FAFEFF', profile_text_color='243536', profile_use_background_image=False, protected=False, screen_name='aehorton', statuses_count=43978, time_zone='Eastern Time (US & Canada)', url='http://wtfismikewearing.tumblr.com/', utc_offset=-18000, verified=False), text='RT @ComplexMag: Radio stations switching to classic hip-hop only are increasing their ratings. http://tgegregergeg.co/Fn81nNs68R http://tssdfsfsf.co/UvBQ4MDbu9', entities=Row(hashtags=[], media=None, symbols=[], trends=[], urls=[Row(display_url='trib.al/htg6YTP', expanded_url='http://trib.al/htg6YTP', indices=[95, 117], url='http://tregregergge.co/Fn81nNs68R'), Row(display_url='pic.twitter.com/UvBQ4MDbu9', expanded_url='http://twitter.com/ComplexMag/status/549700671382233088/photo/1', indices=[118, 140], url='http://tsdfsfdsf.co/UvBQ4MDbu9')], user_mentions=[Row(id=13049362, id_str='13049362', indices=[3, 14], name='Complex', screen_name='ComplexMag')]), lang='en')
Any ideas would be much appreciated.