Question

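I have a fairly complex (at least for me) SQL Server query to write against a demographic data set. I need to figure out how many respondents in the system fall into a particular demographic group.

I have 2 main tables. I'll list the relevant columns. Assume every row has a unique ID.

Table Respondents: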

[RespondentID] [SystemEntryDate]

Table RespondentProfiles:

[QuestionID] [AnswerID]

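Respondents is linked to RespondentProfiles by the respondent ID. One row is created for each question answered. QuestionID identifies a particular question (e.g. gender, ethnicity, state, or car ownership), and the meaning of AnswerID depends on the question: 1 might be male and 2 female, or 1 might be white, 2 Hispanic, 3 Pacific Islander, and so on.

I also have a table called Conditions. It looks like this: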

[ConditionSetID] [QuestionID] [AnswerID]

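The condition set ID groups individual conditions into a condition set. So I can pass a condition set ID to the query, and it should return a count of how many respondents match those conditions, along with the minimum and maximum dates for that set.

My query would look something like this: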

create procedure query
    @ConditionSetID int
as
select count(distinct r.ID) as Respondents,
       min(r.SystemEntryDate) as EarliestDate,
       max(r.SystemEntryDate) as LatestDate
  from Respondents r
  join RespondentProfiles rp
    on r.ID = rp.RespondentID
  join Conditions c
    on c.ConditionSetID = @ConditionSetID
   and c.QuestionID = rp.QuestionID
 where rp.AnswerID = c.AnswerID

As an example, I might have a RespondentProfiles table like this:

  [RespondentID] [QuestionID] [AnswerID]

      10001      1 (gender)    1 (male)
      10001      2 (ethnicity) 1 (white)
      10001      3 (car)       23 (lexus)
      10002      1 (gender)    2 (female)
      10002      2 (ethnicity) 2 (black)
      10002      3 (car)       24 (buick)
      10003      1 (gender)    2 (female)
      10003      2 (ethnicity) 1 (white)
      10003      3 (car)       5 (honda)
      10004      1 (gender)    1 (male)
      10004      2 (ethnicity) 2 (black)
      10004      3 (car)       24 (buick)

If I select a particular condition set, its rows might look like:

      [QuestionID] [AnswerID]

      1 (gender)    2 (female)
      2 (ethnicity) 2 (black)
      3 (car)       24 (buick)

This asks for all black women who own a Buick, which should give a count of 1 (respondent 10002 in the sample data).

Or I could have:

      [QuestionID] [AnswerID]

      3 (car)       23 (lexus)
      3 (car)       24 (buick)

This asks for everyone who owns a Buick or a Lexus, which would be 3 people.

Then as a final example:

      [QuestionID] [AnswerID]
      2 (ethnicity) 2 (black)
      3 (car)       23 (lexus)
      3 (car)       24 (buick)

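This asks for everyone who is black and owns a Lexus, or is black and owns a Buick, which would be 2 people.

I know this isn't wildly complicated, but it's the most complex thing I've attempted, and any help would be greatly appreciated. I'm having a lot of trouble figuring out how to set up the where clause; even a general direction would be appreciated. There are also about 800,000 records in the RespondentProfiles table, so it has to be efficient.

The where clause I set up isn't quite right, because it treats the conditions for different questions as if they were OR'd together, rather than AND'd across questions and OR'd within a question. So it returns a row for a respondent even if only one answer matches, which is wrong. A given respondent must satisfy all of the conditions in the selected condition set.

Maybe I need to select into a temp table one question at a time? Or use some sort of grouping? I'm really confused about where to go with this. I hope I've provided enough information to fully show my dilemma.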
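
For illustration only, the "some sort of grouping" idea can be sketched like this. It is not taken from any answer in this thread: a respondent qualifies when the number of distinct questions they match equals the number of distinct questions in the condition set, which gives AND across questions and OR within a question. The sketch assumes an in-memory SQLite database as a stand-in for SQL Server, a RespondentID column on RespondentProfiles (as in the sample table), and made-up condition set IDs 1, 2 and 3 for the three examples above.

```python
import sqlite3

# Sample rows from the question: (RespondentID, QuestionID, AnswerID).
PROFILES = [
    (10001, 1, 1), (10001, 2, 1), (10001, 3, 23),
    (10002, 1, 2), (10002, 2, 2), (10002, 3, 24),
    (10003, 1, 2), (10003, 2, 1), (10003, 3, 5),
    (10004, 1, 1), (10004, 2, 2), (10004, 3, 24),
]

# Hypothetical condition sets: (ConditionSetID, QuestionID, AnswerID).
CONDITIONS = [
    (1, 1, 2), (1, 2, 2), (1, 3, 24),   # female AND black AND buick
    (2, 3, 23), (2, 3, 24),             # lexus OR buick
    (3, 2, 2), (3, 3, 23), (3, 3, 24),  # black AND (lexus OR buick)
]

# Keep respondents whose matched distinct questions cover every
# distinct question that appears in the chosen condition set.
MATCH_SQL = """
select rp.RespondentID
  from RespondentProfiles rp
  join Conditions c
    on c.QuestionID = rp.QuestionID
   and c.AnswerID = rp.AnswerID
 where c.ConditionSetID = :set_id
 group by rp.RespondentID
having count(distinct rp.QuestionID) =
       (select count(distinct QuestionID)
          from Conditions
         where ConditionSetID = :set_id)
 order by rp.RespondentID
"""

def matching_respondents(set_id):
    """Return the sorted RespondentIDs that satisfy the given condition set."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table RespondentProfiles "
                 "(RespondentID int, QuestionID int, AnswerID int)")
    conn.execute("create table Conditions "
                 "(ConditionSetID int, QuestionID int, AnswerID int)")
    conn.executemany("insert into RespondentProfiles values (?, ?, ?)", PROFILES)
    conn.executemany("insert into Conditions values (?, ?, ?)", CONDITIONS)
    rows = conn.execute(MATCH_SQL, {"set_id": set_id}).fetchall()
    conn.close()
    return [row[0] for row in rows]

if __name__ == "__main__":
    for set_id in (1, 2, 3):
        print(set_id, matching_respondents(set_id))
```

On the sample data this yields 1, 3 and 2 respondents for the three sets, matching the counts expected above. In SQL Server the same select could presumably serve as a subquery joined back to Respondents to get the count and the min/max SystemEntryDate, but that part is untested here.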

Answer 1

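The following example shows how to get the respondent IDs of respondents who answered: question A, yes; question B, no; question C, yes.

Assuming you're actually using SQL Server (you tagged the question with both mysql and sql server), you can use: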

select RespondentID
  from RespondentProfiles
 where QuestionID = 'a'
   and AnswerID = 'yes'
intersect
select RespondentID
  from RespondentProfiles
 where QuestionID = 'b'
   and AnswerID = 'no'
intersect
select RespondentID
  from RespondentProfiles
 where QuestionID = 'c'
   and AnswerID = 'yes'
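
To try the intersect idea against the sample data from the question, here is a minimal sketch. It assumes an in-memory SQLite database (which also supports intersect) instead of SQL Server, selects the RespondentID column from the sample table, and uses the question's numeric IDs in place of 'a'/'yes'. The helper name respondents_matching_all is made up for the example; it builds one select per (QuestionID, AnswerID) pair and intersects them.

```python
import sqlite3

# Sample rows from the question: (RespondentID, QuestionID, AnswerID).
PROFILES = [
    (10001, 1, 1), (10001, 2, 1), (10001, 3, 23),
    (10002, 1, 2), (10002, 2, 2), (10002, 3, 24),
    (10003, 1, 2), (10003, 2, 1), (10003, 3, 5),
    (10004, 1, 1), (10004, 2, 2), (10004, 3, 24),
]

# One parameterized select per required answer.
ONE_ANSWER_SQL = (
    "select RespondentID from RespondentProfiles "
    "where QuestionID = ? and AnswerID = ?"
)

def respondents_matching_all(pairs):
    """Return sorted RespondentIDs that gave every (QuestionID, AnswerID) in pairs."""
    if not pairs:
        return []  # nothing to intersect
    # Chain the per-answer selects with the intersect set operator.
    sql = "\nintersect\n".join(ONE_ANSWER_SQL for _ in pairs)
    # Flatten [(q1, a1), (q2, a2)] into [q1, a1, q2, a2] for the placeholders.
    params = [value for pair in pairs for value in pair]

    conn = sqlite3.connect(":memory:")
    conn.execute("create table RespondentProfiles "
                 "(RespondentID int, QuestionID int, AnswerID int)")
    conn.executemany("insert into RespondentProfiles values (?, ?, ?)", PROFILES)
    rows = conn.execute(sql, params).fetchall()
    conn.close()
    return sorted(row[0] for row in rows)

if __name__ == "__main__":
    # female AND black AND buick
    print(respondents_matching_all([(1, 2), (2, 2), (3, 24)]))
```

Note that intersect only expresses AND between conditions; an "or" within one question (Lexus or Buick) would need something like an IN list inside the corresponding select.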

If you're using MySQL, you can use:

select x.RespondentID
  from RespondentProfiles x
  join (select RespondentID
          from RespondentProfiles
         where QuestionID = 'b'
           and AnswerID = 'no') y
    on x.RespondentID = y.RespondentID
  join (select RespondentID
          from RespondentProfiles
         where QuestionID = 'c'
           and AnswerID = 'yes') z
    on y.RespondentID = z.RespondentID
 where x.QuestionID = 'a'
   and x.AnswerID = 'yes'
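
The join variant can be checked the same way. This is a rough sketch under the same assumptions as before: in-memory SQLite rather than MySQL, the RespondentID column and numeric IDs from the question's sample table, and a hypothetical helper join_match that takes three (QuestionID, AnswerID) pairs. The where clause is placed after the joins, since that is the order SQL requires.

```python
import sqlite3

# Sample rows from the question: (RespondentID, QuestionID, AnswerID).
PROFILES = [
    (10001, 1, 1), (10001, 2, 1), (10001, 3, 23),
    (10002, 1, 2), (10002, 2, 2), (10002, 3, 24),
    (10003, 1, 2), (10003, 2, 1), (10003, 3, 5),
    (10004, 1, 1), (10004, 2, 2), (10004, 3, 24),
]

# Self-join: one inline view per additional required answer.
# distinct guards against duplicate profile rows multiplying the result.
JOIN_SQL = """
select distinct x.RespondentID
  from RespondentProfiles x
  join (select RespondentID
          from RespondentProfiles
         where QuestionID = ? and AnswerID = ?) y
    on x.RespondentID = y.RespondentID
  join (select RespondentID
          from RespondentProfiles
         where QuestionID = ? and AnswerID = ?) z
    on y.RespondentID = z.RespondentID
 where x.QuestionID = ? and x.AnswerID = ?
 order by x.RespondentID
"""

def join_match(a, b, c):
    """Return RespondentIDs that gave all three (QuestionID, AnswerID) answers."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table RespondentProfiles "
                 "(RespondentID int, QuestionID int, AnswerID int)")
    conn.executemany("insert into RespondentProfiles values (?, ?, ?)", PROFILES)
    # Placeholder order follows the SQL text: y (b), z (c), then x (a).
    params = [b[0], b[1], c[0], c[1], a[0], a[1]]
    rows = conn.execute(JOIN_SQL, params).fetchall()
    conn.close()
    return [row[0] for row in rows]

if __name__ == "__main__":
    # female, black, buick
    print(join_match((1, 2), (2, 2), (3, 24)))
```

Each extra condition adds another join, so this form gets unwieldy for long condition lists, but it works on engines without intersect.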

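Just adding what I put in the comments to my answer: you don't need your Conditions table. You don't need a table like that to query for respondents who answered 2 or more questions in a certain way. You can use inline views and/or subqueries to do this (or, in the case of SQL Server, the intersect set operator).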

Robust SQL query on a demographic data set
