Question

Sample record:

    Row(user_id='KxGeqg5ccByhaZfQRI4Nnw', gender='male', year='2015', month='September', day='20', 
hour='16', weekday='Sunday', reviewClass='place love back', business_id='S75Lf-Q3bCCckQ3w7mSN2g', 
business_name='Notorious Burgers', city='Scottsdale', categories='Nightlife, American (New), Burgers, 
Comfort Food, Cocktail Bars, Restaurants, Food, Bars, American (Traditional)', user_funny='1', 
review_sentiment='Positive', friend_id='my4q3Sy6Ei45V58N2l8VGw')

This table has more than 100 million records. My SQL query should do the following:

Select the most occurring review_sentiment among the friends (friend_id) and the most occurring gender among friends of a particular user visiting a specific business

friend_id is eventually a user_id

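Example scenario:

One user
visited 4 businesses
has 10 friends
5 of those friends visited businesses 1 and 2, the other 5 visited only business 3, and none of them visited business 4
Now, the 5 friends have more +ve than -ve sentiment for B1, more -ve than +ve sentiment for B2, and the friends' sentiment for B3 is all -ve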

I want the following output:

**user_id | business_id | friend_common_sentiment | mostCommonGender | .... otherCols**

user_id_1 | business_id_1 | positive | male | .... otherCols
user_id_1 | business_id_2 | negative | female | .... otherCols
user_id_1 | business_id_3 | negative | female | .... otherCols

Here is a simple query I wrote for this in PySpark:

SELECT user_id, gender, year, month, day, hour, weekday, reviewClass, business_id, business_name, city, 
categories, user_funny, review_sentiment FROM events1 GROUP BY user_id, friend_id, business_id ORDER BY 
COUNT(review_sentiment) DESC LIMIT 1

This query does not give the desired result, and I'm not sure exactly how to fit an INNER JOIN into it.

Answer 1

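Man, that data structure makes things difficult. But let's break it down into steps:

You need to self-join the table to get the friends' data
Once you have the friends' data, run aggregate functions to get the count of each possible value, grouped by user and business
Wrap the above in a subquery so you can decide between the values based on the counts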

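I'm just going to call your table "tags", so the join will look like the following. Sadly, just like in real life, we can't assume that everyone has friends, and since you didn't say to exclude the forever-alone crowd, we need a left join to keep users who have no friends.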

From tags as user
left outer join tags as friends on user.friend_id = friends.user_id
    and friends.business_id = user.business_id
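
To see what this join produces, here is a minimal sketch on made-up data. It runs the same join through Python's built-in sqlite3 module as a stand-in for Spark SQL; the rows, the IDs and the short aliases `u` and `f` are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tags (
        user_id TEXT, gender TEXT, business_id TEXT,
        review_sentiment TEXT, friend_id TEXT
    )
""")
rows = [
    # user_id, gender, business_id, review_sentiment, friend_id (all made up)
    ("u1", "male",   "b1", "Positive", "f1"),
    ("u1", "male",   "b1", "Positive", "f2"),
    ("f1", "female", "b1", "Negative", "u1"),
    ("f2", "female", "b2", "Positive", "u1"),  # f2 reviewed a different business
    ("u2", "female", "b1", "Positive", None),  # u2 has no friends
]
conn.executemany("INSERT INTO tags VALUES (?, ?, ?, ?, ?)", rows)

# Same join shape as above: match each row's friend_id to that friend's own
# rows for the same business, keeping rows that find no match.
joined = conn.execute("""
    SELECT u.user_id, u.friend_id, f.gender, f.review_sentiment
    FROM tags AS u
    LEFT OUTER JOIN tags AS f
      ON u.friend_id = f.user_id
     AND f.business_id = u.business_id
    WHERE u.business_id = 'b1'
    ORDER BY u.user_id, u.friend_id
""").fetchall()

for row in joined:
    print(row)
```

Rows for u2 (no friends) and for u1's friend f2 (who did not review b1) survive the join with NULLs on the friend side, which is the point of the left join. Note also that f1's friend u1 comes back twice, because u1 has two rows in the table, one per friend.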

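Next you have to work out what the most common gender/sentiment is for a given user and business combination. This is where the data structure really works against us: some clever window functions could do this in one step, but I want this answer to be easy to follow, so I'll use a subquery and a CASE expression. For simplicity I assume a binary gender, but depending on how your application models gender you can follow the same pattern to add more values.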

-- string comparison is case-sensitive; the sample record stores gender as 'male'
select user.user_id, user.business_id
, sum(case when friends.gender = 'male' then 1 else 0 end) as MaleFriends
, sum(case when friends.gender = 'female' then 1 else 0 end) as FemaleFriends
, sum(case when friends.review_sentiment = 'Positive' then 1 else 0 end) as FriendsPositive
, sum(case when friends.review_sentiment = 'Negative' then 1 else 0 end) as FriendsNegative
From tags as user
left outer join tags as friends on user.friend_id = friends.user_id
  and friends.business_id = user.business_id
where user.business_id = <<your business id here>>
group by user.user_id, user.business_id
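
One thing to watch with these counts: if the table holds one row per (review, friend) pair, as the sample record suggests, then a friend who has several friends of their own appears several times on the right-hand side of the join and gets counted several times. The sketch below compares the plain counts with counts taken after deduplicating the friend side via `SELECT DISTINCT`. It uses Python's sqlite3 as a stand-in for Spark SQL; the data, the aliases `u` and `f`, and the extra `u.user_id = 'u1'` filter are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tags (
        user_id TEXT, gender TEXT, business_id TEXT,
        review_sentiment TEXT, friend_id TEXT
    )
""")
conn.executemany("INSERT INTO tags VALUES (?, ?, ?, ?, ?)", [
    # u1 reviewed b1 and has three friends: m1, w1, w2
    ("u1", "male", "b1", "Positive", "m1"),
    ("u1", "male", "b1", "Positive", "w1"),
    ("u1", "male", "b1", "Positive", "w2"),
    # m1 has three friends, so his single review is stored three times
    ("m1", "male", "b1", "Negative", "u1"),
    ("m1", "male", "b1", "Negative", "x1"),
    ("m1", "male", "b1", "Negative", "x2"),
    # w1 and w2 each have one friend, so one row each
    ("w1", "female", "b1", "Positive", "u1"),
    ("w2", "female", "b1", "Positive", "u1"),
])

COUNT_SQL = """
    SELECT u.user_id, u.business_id,
           SUM(CASE WHEN f.gender = 'male' THEN 1 ELSE 0 END) AS MaleFriends,
           SUM(CASE WHEN f.gender = 'female' THEN 1 ELSE 0 END) AS FemaleFriends,
           SUM(CASE WHEN f.review_sentiment = 'Positive' THEN 1 ELSE 0 END) AS FriendsPositive,
           SUM(CASE WHEN f.review_sentiment = 'Negative' THEN 1 ELSE 0 END) AS FriendsNegative
    FROM tags AS u
    LEFT OUTER JOIN {friend_side} AS f
      ON u.friend_id = f.user_id AND f.business_id = u.business_id
    WHERE u.business_id = 'b1' AND u.user_id = 'u1'
    GROUP BY u.user_id, u.business_id
"""

# Plain self-join: m1's three rows are all counted.
naive = conn.execute(COUNT_SQL.format(friend_side="tags")).fetchone()

# Friend side reduced to one row per (user, business, gender, sentiment).
distinct_friends = (
    "(SELECT DISTINCT user_id, business_id, gender, review_sentiment FROM tags)"
)
deduped = conn.execute(COUNT_SQL.format(friend_side=distinct_friends)).fetchone()

print("naive:  ", naive)
print("deduped:", deduped)
```

In this toy data u1 really has one male and two female friends, yet the plain join reports three male friends because m1's review is stored once per friend of m1. Whether that matters for the real table depends on how its rows are laid out, so it is worth checking on a small sample.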

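Now we just need to take the data from the subquery and make some decisions. You may want to add further options, for example a separate gender/sentiment value for when there are no friends or when the friends are evenly split. It is the same pattern as below, just with more values to choose from.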

select user_id
, business_id
-- a tie (or no friends at all) falls through to the else branch
, case when MaleFriends > FemaleFriends then 'male' else 'female' end as MostCommonGender
, case when FriendsPositive > FriendsNegative then 'Positive' else 'Negative' end as MostCommonSentiment
from (    select user.user_id, user.business_id
, sum(case when friends.gender = 'male' then 1 else 0 end) as MaleFriends
, sum(case when friends.gender = 'female' then 1 else 0 end) as FemaleFriends
, sum(case when friends.review_sentiment = 'Positive' then 1 else 0 end) as FriendsPositive
, sum(case when friends.review_sentiment = 'Negative' then 1 else 0 end) as FriendsNegative
From tags as user
left outer join tags as friends on user.friend_id = friends.user_id
  and friends.business_id = user.business_id
where user.business_id = <<your business id here>>
group by user.user_id, user.business_id) as a
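
As a rough sketch of the extra options for ties and missing friends, the CASE expressions can be given more branches. The labels 'tie' and 'unknown', the sample rows and the aliases `u` and `f` are assumptions for illustration, and the query runs on Python's sqlite3 as a stand-in for Spark SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tags (
        user_id TEXT, gender TEXT, business_id TEXT,
        review_sentiment TEXT, friend_id TEXT
    )
""")
conn.executemany("INSERT INTO tags VALUES (?, ?, ?, ?, ?)", [
    ("u1", "male",   "b1", "Positive", "f1"),  # u1: one female, one male friend
    ("u1", "male",   "b1", "Positive", "f2"),
    ("u2", "female", "b1", "Negative", None),  # u2: no friends
    ("u3", "male",   "b1", "Positive", "f1"),  # u3: two female friends
    ("u3", "male",   "b1", "Positive", "f3"),
    # each friend appears once here to keep the counts simple
    ("f1", "female", "b1", "Positive", "u1"),
    ("f2", "male",   "b1", "Negative", "u1"),
    ("f3", "female", "b1", "Positive", "u3"),
])

result = conn.execute("""
    SELECT user_id, business_id,
           CASE WHEN MaleFriends + FemaleFriends = 0 THEN 'unknown'
                WHEN MaleFriends > FemaleFriends THEN 'male'
                WHEN FemaleFriends > MaleFriends THEN 'female'
                ELSE 'tie' END AS MostCommonGender,
           CASE WHEN FriendsPositive + FriendsNegative = 0 THEN 'unknown'
                WHEN FriendsPositive > FriendsNegative THEN 'Positive'
                WHEN FriendsNegative > FriendsPositive THEN 'Negative'
                ELSE 'tie' END AS MostCommonSentiment
    FROM (
        SELECT u.user_id, u.business_id,
               SUM(CASE WHEN f.gender = 'male' THEN 1 ELSE 0 END) AS MaleFriends,
               SUM(CASE WHEN f.gender = 'female' THEN 1 ELSE 0 END) AS FemaleFriends,
               SUM(CASE WHEN f.review_sentiment = 'Positive' THEN 1 ELSE 0 END) AS FriendsPositive,
               SUM(CASE WHEN f.review_sentiment = 'Negative' THEN 1 ELSE 0 END) AS FriendsNegative
        FROM tags AS u
        LEFT OUTER JOIN tags AS f
          ON u.friend_id = f.user_id AND f.business_id = u.business_id
        WHERE u.business_id = 'b1'
        GROUP BY u.user_id, u.business_id
    ) AS a
    ORDER BY user_id
""").fetchall()

# Map user_id -> (MostCommonGender, MostCommonSentiment) for easy lookup.
by_user = {row[0]: (row[2], row[3]) for row in result}
print(by_user["u1"], by_user["u2"], by_user["u3"])
```

Here 'unknown' covers both a user with no friends and a user whose friends did not review the business, since the left join yields zero counts in either case.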

That gives you the steps to follow and, hopefully, a clear explanation of how they work. Good luck!
