Question

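I have a DataFrame that was read from a Twitter JSON file.

I am trying to count every URL in the dataset. The URLs are stored in arrays of strings, so some objects have one URL while others have several (the sample row further down has two).

How can I count the occurrences of each URL across these arrays and get output showing the count for every URL stored in them?

I am using the following, but it does not give the expected result: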

print(withURLenglish.groupby('entities.urls.expanded_url').count().sort(desc('count')).show(n=1500,truncate=False))

'entities.urls.expanded_url' returns a column whose values are arrays of strings.
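To make the goal concrete, here is a minimal sketch of the flatten-then-count idea. It is an illustration under stated assumptions, not a confirmed solution: `sample_rows`, `count_urls_plain` and `count_urls_spark` are hypothetical names, and the sample data is made up from the URLs in this post. The plain-Python part shows the desired result. The PySpark part assumes the DataFrame has the nested `entities.urls.expanded_url` column described above and uses `explode` from `pyspark.sql.functions` to turn each array element into its own row before grouping.

```python
from collections import Counter

# Hypothetical sample: each inner list stands for one tweet's
# 'entities.urls.expanded_url' array.
sample_rows = [
    ["http://trib.al/htg6YTP",
     "http://twitter.com/ComplexMag/status/549700671382233088/photo/1"],
    ["http://trib.al/htg6YTP"],
]


def count_urls_plain(rows):
    """Flatten the per-tweet URL lists, then count each distinct URL."""
    return Counter(url for urls in rows for url in urls)


def count_urls_spark(df):
    """Sketch of the same idea in PySpark (assumes the schema from this post)."""
    # Imported lazily so the plain-Python part runs without PySpark installed.
    from pyspark.sql.functions import col, desc, explode

    return (
        # One output row per URL instead of one array per tweet.
        df.select(explode(col("entities.urls.expanded_url")).alias("expanded_url"))
          .groupBy("expanded_url")
          .count()
          .orderBy(desc("count"))
    )


url_counts = count_urls_plain(sample_rows)
```

With a DataFrame like the one in this question, the call would presumably be `count_urls_spark(withURLenglish).show(n=1500, truncate=False)`. Note that `show()` prints the table itself and returns `None`, so wrapping it in `print(...)` only adds a stray `None` to the output.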

My DataFrame is defined by the following variables:

data = df.na.drop(subset=["user.id"]).select(["user","text","entities", "lang"])
withURLenglish = data.filter(size(data['entities.urls']) > 0).filter(data['lang']=='en').select(["user","text","entities", "lang"])

A sample row shows its structure:

print(withURLenglish.head())

which produces:

Row(user=Row(contributors_enabled=False, created_at='Mon Sep 22 20:19:17 +0000 2008', default_profile=False, default_profile_image=False, description='My tweets are like ziti.', favourites_count=605, follow_request_sent=None, followers_count=518, following=None, friends_count=495, geo_enabled=True, id=16409225, id_str='16409225', is_translator=False, lang='en', listed_count=17, location='Same City', name='bojack horton', notifications=None, profile_background_color='FFFFFF', profile_background_image_url='http://pbs.twimg.com/profile_background_images/502679497120837632/BxqZlfVD.jpeg', profile_background_image_url_https='https://pbs.twimg.com/profile_background_images/502679497120837632/BxqZlfVD.jpeg', profile_background_tile=True, profile_banner_url='https://pbs.twimg.com/profile_banners/16409225/1398568361', profile_image_url='http://pbs.twimg.com/profile_images/509891854360264704/R9q_xrfd_normal.jpeg', profile_image_url_https='https://pbs.twimg.com/profile_images/509891854360264704/R9q_xrfd_normal.jpeg', profile_link_color='121444', profile_sidebar_border_color='FFFFFF', profile_sidebar_fill_color='FAFEFF', profile_text_color='243536', profile_use_background_image=False, protected=False, screen_name='aehorton', statuses_count=43978, time_zone='Eastern Time (US & Canada)', url='http://wtfismikewearing.tumblr.com/', utc_offset=-18000, verified=False), text='RT @ComplexMag: Radio stations switching to classic hip-hop only are seeing a ratings boost. http://tgegregergeg.co/Fn81nNs68R http://tssdfsfsf.co/UvBQ4MDbu9', entities=Row(hashtags=[], media=None, symbols=[], trends=[], urls=[Row(display_url='trib.al/htg6YTP', expanded_url='http://trib.al/htg6YTP', indices=[95, 117], url='http://tregregergge.co/Fn81nNs68R'), Row(display_url='pic.twitter.com/UvBQ4MDbu9', expanded_url='http://twitter.com/ComplexMag/status/549700671382233088/photo/1', indices=[118, 140], url='http://tsdfsfdsf.co/UvBQ4MDbu9')], user_mentions=[Row(id=13049362, id_str='13049362', indices=[3, 14], name='Complex', screen_name='ComplexMag')]), lang='en')

Any ideas would be greatly appreciated.

PySpark: groupby and count strings in an array

0 answers: