假设我有一个mySQL表(用户),其中包含以下字段:
userid
gender
region
age
ethnicity
income
我希望能够根据用户输入的数量返回总记录数。此外,他们还将提供额外的标准。
在最简单的例子中,他们可能要求提供1,000条记录,其中600条记录应具有性别='男性'和性别='女性的400条记录。这很简单。
现在,再往前走一步。假设他们现在想要指定Region:
GENDER
Male: 600 records
Female: 400 records
REGION
North: 100 records
South: 200 records
East: 300 records
West: 400 records
同样,只应返回1000条记录,但最终必须有600名男性,400名女性,100名北方人,200名南方人,300名东方人和400名西方人。
我知道这不是有效的语法,但使用伪mySQL代码,它有望说明我尝试做的事情:
(SELECT * FROM users WHERE gender = 'Male' LIMIT 600
UNION
SELECT * FROM users WHERE gender = 'Female' LIMIT 400)
INTERSECT
(SELECT * FROM users WHERE region = 'North' LIMIT 100
UNION
SELECT * FROM users WHERE region = 'South' LIMIT 200
UNION
SELECT * FROM users WHERE region = 'East' LIMIT 300
UNION
SELECT * FROM users WHERE region = 'West' LIMIT 400)
请注意,我没有寻找一次性查询。每个标准中的记录总数和记录数将根据用户的输入不断变化。因此,我试图提出一种可以反复使用的通用解决方案,而不是硬编码解决方案。
为了使事情变得更复杂,现在添加更多标准。也可能有年龄,种族和收入,每个都有自己设定的每组记录数,附加的代码如上:
INTERSECT
(SELECT * FROM users WHERE age >= 18 and age <= 24 LIMIT 300
UNION
SELECT * FROM users WHERE age >= 25 and age <= 36 LIMIT 200
UNION
SELECT * FROM users WHERE age >= 37 and age <= 54 LIMIT 200
UNION
SELECT * FROM users WHERE age >= 55 LIMIT 300)
INTERSECT
etc.
我不确定是否可以在一个查询中写入,或者这是否需要多个语句和迭代。
答案 0 :(得分:16)
您可以将多维标准展平为单一级别标准
现在可以在一个查询中实现此标准,如下所示
(SELECT * FROM users WHERE gender = 'Male' AND region = 'North' LIMIT 40) UNION ALL
(SELECT * FROM users WHERE gender = 'Male' AND region = 'South' LIMIT 80) UNION ALL
(SELECT * FROM users WHERE gender = 'Male' AND region = 'East' LIMIT 120) UNION ALL
(SELECT * FROM users WHERE gender = 'Male' AND region = 'West' LIMIT 160) UNION ALL
(SELECT * FROM users WHERE gender = 'Female' AND region = 'North' LIMIT 60) UNION ALL
(SELECT * FROM users WHERE gender = 'Female' AND region = 'South' LIMIT 120) UNION ALL
(SELECT * FROM users WHERE gender = 'Female' AND region = 'East' LIMIT 180) UNION ALL
(SELECT * FROM users WHERE gender = 'Female' AND region = 'West' LIMIT 240)
<强>问题强>
假设只有不到40名男性和北方用户。然后,您需要调整其他标准数量以涵盖“男性”和“北部”中缺少的数量。我相信用裸SQL做这件事是不可能的。这是我想到的伪代码。为简化起见,我认为我们只会查询男性,女性,北方和南方
conditions.add({ gender: 'Male', region: 'North', limit: 40 })
conditions.add({ gender: 'Male', region: 'South', limit: 80 })
conditions.add({ gender: 'Female', region: 'North', limit: 60 })
conditions.add({ gender: 'Female', region: 'South', limit: 120 })
foreach(conditions as condition) {
temp = getResultFromDatabaseByCondition(condition)
conditions.remove(condition)
// there is not enough result for this condition,
// increase other condition quantity
if (temp.length < condition.limit) {
adjust(...);
}
}
假设只有30名北方男性。所以我们需要调整+10男性和+10北方人。
To Adjust
---------------------------------------------------
Male +10
North +10
Remain Conditions
----------------------------------------------------
{ gender: 'Male', region: 'South', limit: 80 }
{ gender: 'Female', region: 'North', limit: 60 }
{ gender: 'Female', region: 'South', limit: 120 }
'男'+'南'是第一个符合'男性'调整条件的条件。将其增加+10,并将其从“保留条件”列表中删除。因为,我们增加南方,我们需要在其他条件下减少它。因此,将“South”条件添加到“To Adjust”列表中
To Adjust
---------------------------------------------------
South -10
North +10
Remain Conditions
----------------------------------------------------
{ gender: 'Female', region: 'North', limit: 60 }
{ gender: 'Female', region: 'South', limit: 120 }
Final Conditions
----------------------------------------------------
{ gender: 'Male', region: 'South', limit: 90 }
找到与“南方”匹配的条件并重复相同的过程。
To Adjust
---------------------------------------------------
Female +10
North +10
Remain Conditions
----------------------------------------------------
{ gender: 'Female', region: 'North', limit: 60 }
Final Conditions
----------------------------------------------------
{ gender: 'Female', region: 'South', limit: 110 }
{ gender: 'Male', region: 'South', limit: 90 }
最后
{ gender: 'Female', region: 'North', limit: 70 }
{ gender: 'Female', region: 'South', limit: 110 }
{ gender: 'Male', region: 'South', limit: 90 }
我还没有想出调整的确切实施。这比我想象的要困难得多。一旦我弄清楚如何实现它,我会更新。
答案 1 :(得分:10)
您描述的问题是多维建模问题。特别是,您试图同时沿多个维度获取分层样本。关键是要降低到最小的粒度级别并从那里建立样本。
我进一步假设您希望样本在所有级别都具有代表性。也就是说,你不希望所有来自&#34; North&#34;是女性。或者所有&#34;男性&#34;来自&#34; West&#34;,即使这符合最终标准。
首先考虑每个维度的记录,维度和分配总数。例如,对于第一个样本,将其视为:
然后,您想要将这些数字分配给每个性别/地区组合。数字是:
您会看到这些在尺寸上加起来。
计算每个单元格中的数字非常简单。它是百分比乘以总数的乘积。所以,&#34;东方,女性&#34;是30%* 40%* 1000。 。 。瞧!值为120。
以下是解决方案:
假设您有一个表cells
,其中包含预期的计数和原始数据(users
)。
select enumerated.*
from (select u.*,
(@rn := if(@dims = concat_ws(':', dim1, dim2, dim3), @rn + 1,
if(@dims := concat_ws(':', dim1, dim2, dim3), 1, 1)
)
) as seqnum
from users u cross join
(select @dims = '', @rn := '') vars
order by dim1, dim2, dim3, rand()
) enumerated join
cells
on enumerated.dims = cells.dims
where enuemrated.seqnum <= cells.expectedcount;
请注意,这是解决方案的草图。您必须填写有关尺寸的详细信息。
只要您有足够的数据用于所有单元格,这将有效。
在实践中,当进行这种类型的多维分层抽样时,您确实存在细胞空或太小的风险。发生这种情况时,您可以经常通过额外的传递来解决此问题。从足够大的细胞中获取尽可能多的东西。这些通常占所需数据的大部分。然后添加记录以满足最终计数。要添加的记录是其值与最需要的维度所需的值匹配的记录。但是,此解决方案只是假设有足够的数据来满足您的标准。
答案 2 :(得分:3)
您的请求有问题,可以使用大量选项来实现建议的数字:
Male Female Sum
-----------------------------
North: 100 0 100
South: 200 0 200
East: 300 0 300
West: 0 400 400
Sum: 600 400
-----------------------------
North: 99 1 100
South: 200 0 200
East: 300 0 300
West: 1 399 400
Sum: 600 400
-----------------------------
....
-----------------------------
North: 0 100 100
South: 200 0 200
East: 0 300 300
West: 400 0 400
Sum: 600 400
只需将北,东和西(南方总是男性:200)结合起来,您将获得400种可能性来实现建议的数字。当每个“类”(男/北=“类”)只有有限数量的记录时,它会变得更加复杂。
对于上表中的每个单元格,您可能需要最多MIN(COUNT(gender), COUNT(location))
个记录(对于它的对应部分为零的情况)。
这取决于:
Male Female
---------------------
North: 100 100
South: 200 200
East: 300 300
West: 400 400
因此,您需要计算每个性别/位置对AVAILABLE(gender, location)
的可用记录。
找到特别适合似乎接近 semimagic squares [1] [2]。
有关此[3] [4]的math.stackexchange.com上有几个问题。
我最后阅读some paper关于如何构建这些内容的方法,我怀疑用一个选项可以做到这一点。
如果你有足够的记录,并且最终不会出现这样的情况:
Male Female
---------------------
North: 100 0
South: 200 200
East: 300 0
West: 200 200
我会选择迭代槽位并在每一步中添加比例数量的男性/女性:
但是这只会给你一些近似的结果,并且在验证那些之后你可能想要通过几次迭代结果并将每个类别中的计数调整为“足够好”。
答案 3 :(得分:1)
这可以通过两个步骤解决。我将描述如何在性别和区域是维度的示例中执行此操作。然后我将描述更一般的情况。在第一步中,我们求解一个8个变量的方程组,然后我们采用由步骤1中找到的解决方案限制的8个select语句的不相交联合。请注意,任何行只有8种可能性。他们可以是男性或女性,然后该地区是北部,南部,东部或西部之一。现在让,
X1 equal the number of rows that are male and from the north,
X2 equal the number of rows that are male and from the south,
X3 equal the number of rows that are male and from the east,
X4 equal then number that are male and from the west
X5 equal the number of rows that are female and from the north,
X6 equal the number of rows that are female and from the south,
X7 equal the number of rows that are female and from the east,
X8 equal then number that are female and from the west
方程是:
X1+X2+X3+X4=600
X5+X6+X7+X8=400
X1+X5=100
X2+X6=200
X3+X7=300
X4+X8=400
现在解决上面的X1,X2,...... X8。有很多解决方案(我将在稍后介绍如何解决)这是一个解决方案:
X1=60, X2=120, X3=180,X4=240,X5=40,X6=80,X7=120,X8=160.
现在我们可以通过8个选择的简单联合得到结果:
(select * from user where gender='m' and region="north" limit 60)
union distinct(select * from user where gender='m' and region='south' limit 120)
union distinct(select * from user where gender='m' and region='east' limit 180)
union distinct(select * from user where gender='m' and region='west' limit 240)
union distinct(select * from user where gender='f' and region='north' limit 40)
union distinct(select * from user where gender='f' and region='south' limit 80)
union distinct(select * from user where gender='f' and region='east' limit 120)
union distinct(select * from user where gender='f' and region='west' limit 160);
请注意,如果数据库中没有60行满足上面的第一个选择,那么给定的特定解决方案将不起作用。所以我们必须添加其他约束,LT:
0<X1 <= (select count(*) from user where from user where gender='m' and region="north")
0<X2 <= (select count(*) from user where gender='m' and region='south')
0<X3 <= (select count(*) from user where gender='m' and region='east' )
0<X4 <= (select count(*) from user where gender='m' and region='west')
0<X5 <= (select count(*) from user where gender='f' and region='north' )
0<X6 <= (select count(*) from user where gender='f' and region='south')
0<X7 <= (select count(*) from user where gender='f' and region='east' )
0<X8 <= (select count(*) from user where gender='f' and region='west');
现在让我们概括一下这种情况,允许任何分裂。方程是E:
X1+X2+X3+X4=n1
X5+X6+X7+X8=n2
X1+X5=m1
X2+X6=m2
X3+X7=m3
X4+X8=m4
给出数n1,n2,m1,m2,m3,m4并且满足n1 + n2 =(m1 + m2 + m3 + m4)。因此我们将问题简化为求解上述方程式LT和E.这只是一个线性编程问题,可以使用单纯形法或其他方法解决。另一种可能性是将其视为线性丢番图方程组,并使用其方法来找到解。在任何情况下,我都将问题简化为找到上述方程的解。 (假设方程是一种特殊的形式,那么使用单纯形法或求解线性不定方程组的系统可能会有更快的方法。)一旦我们求解Xi,最终解是:
(select * from user where gender='m' and region="north" limit :X1)
union distinct(select * from user where gender='m' and region='south' limit :X2)
union distinct(select * from user where gender='m' and region='east' limit :X3)
union distinct(select * from user where gender='m' and region='west' limit :X4)
union distinct(select * from user where gender='f' and region='north' limit :X5)
union distinct(select * from user where gender='f' and region='south' limit :X6)
union distinct(select * from user where gender='f' and region='east' limit :X7)
union distinct(select * from user where gender='f' and region='west' limit :X8);
让我们将具有n种可能性的维D表示为D:n。假设您有D1:n1,D2:n2,... DM:nM尺寸。将生成n1 * n2 * ... nM变量。生成的方程的数量是n1 + n2 + ... nM。而不是定义一般方法让我们采取3维,4维和2维的另一种情况;让我们将D1的可能值称为d11,d12,d13,D2为d21,d22,d23,d24,D3值为d31,d32。我们将有24个变量,方程式为:
X1 + X2 + ...X8=n11
X9 + X10 + ..X16=n12
X17+X18 + ...X24=n13
X1+X2+X9+x10+x17+x18=n21
X3+X4+X11+x12+x19+x20=n22
X5+X6+X13+x14+x21+x22=n23
X7+X8+X15+x116+x23+x24=n24
X1+X3+X5+...X23=n31
X2+X4+......X24=n32
哪里
X1 equals number with D1=d11 and D2=d21 and D3=d31
X2 equals number with D1=d11 and D2=d21 and D3 = d31
....
X24 equals number with D1=D13 and D2=d24, and D3=d32.
添加少于约束。然后求解X1,X2,...... X24。创建24个select语句并使用不相交的联合。 我们可以针对任何维度进行类似的解决。
总结:给定尺寸D1:n1,D2:n2,... DM:nM我们可以解决相应的线性规划问题,如上面针对n1 * n2 * ... nM变量描述的那样,然后生成一个解决方案将不相交的联合取为n1 * n2 * ... nM选择语句。所以是的,我们可以通过select语句生成一个解决方案,但首先我们必须通过获取每个n1 * n2 * ... nM变量的计数来求解方程并确定限制。
即使赏金结束了,我还要为你感兴趣的人增加一点点。我在这里声称,如果有解决方案,我已经完全展示了如何解决这个问题。
澄清我的方法。在3维的情况下,假设我们将年龄分为3种可能性中的一种。然后在问题中使用性别和地区。每个用户有24种不同的可能性,对应于它们属于这些类别的位置。设Xi是最终结果中每种可能性的数量。让我写一个矩阵,其中每一行代表每种可能性之一。每个用户将贡献最多1到m或f,1到北,南,东或西,1到年龄类别。并且用户只有24种可能性。让我们看一个矩阵:(abc)3个年龄,(nsew)区域和 (mf)男性或女性:a年龄小于或等于10岁,b岁年龄在11岁至30岁之间,c岁年龄在31岁至50岁之间。
abc nsew mf
X1 100 1000 10
X2 100 1000 01
X3 100 0100 10
X4 100 0100 01
X5 100 0010 10
X6 100 0010 01
X7 100 0001 10
X8 100 0001 01
X9 010 1000 10
X10 010 1000 01
X11 010 0100 10
X12 010 0100 01
X13 010 0010 10
X14 010 0010 01
X15 010 0001 10
X16 010 0001 01
X17 001 1000 10
X18 001 1000 01
X19 001 0100 10
X20 001 0100 01
X21 001 0010 10
X22 001 0010 01
X23 001 0001 10
X24 001 0001 01
每行代表一个用户,如果它对结果有贡献,则列中有1。例如,第一行显示1表示a,1表示n,1表示m。这意味着用户的年龄小于或等于10岁,来自北方并且是男性。
Xi表示最终结果中有多少行。因此,假设X1是10,这意味着我们说最终结果有10个结果,所有这些都来自北方,是男性,并且小于或等于10.好吧现在我们只需添加一些东西。请注意,前8个X1+X2+X3+X4+X5+X6+X7+X8
是年龄小于或等于10的所有行。它们必须加起来为我们为该类别选择的任何内容。同样适用于接下来的两组8。
所以到目前为止,我们得到了方程式:(na是年龄小于10的数字,nb年龄在10到20之间,nc是年龄小于50的数字
X1+X2+X3+X4+X5+X6+X7+X8 = na
X9+X10+X11 + .... X16 = nb
X17+X18+X19+... X24=nc
这些是年龄分裂。现在让我们看看区域分裂。只需将变量加在&#34; n&#34;柱,
X1+X2+X9+X10+X17+X18 = nn
X3+X4+X11+X12+X19+20=ns
...
等。 你看到我如何通过俯视列来获得这些方程式吗? 继续ew和mf。总共给出3 + 4 + 2个方程。所以我在这里做的很简单。我已经推断出,你选择的任何一行都会对3个维度中的每个维度贡献一个,并且只有24种可能性。然后让Xi为每种可能性的数字,并得到需要求解的方程式。在我看来,你提出的任何方法都必须是这些方程的解决方案。换句话说,我只是在解决这些方程式时重新解决了这个问题。
现在我们想要一个整数解,因为我们不能有一个小数行。请注意,这些都是线性方程式。但我们想要一个整数解决方案。以下是描述如何解决这些问题的论文的链接:https://www.math.uwaterloo.ca/~wgilbert/Research/GilbertPathria.pdf
答案 4 :(得分:1)
我将构建数据库分布的映射,并使用它来实现抽样逻辑。奖金包括向用户添加快速人口统计反馈的可能性,并且不会给服务器带来额外负担。另外,您需要实现一种机制来保持数据库和地图同步。
使用JSON可能看起来像这样:
{"gender":{
"Male":{
"amount":35600,
"region":{
"North":{
"amount":25000,
"age":{
"18":{
"amount":2400,
"ethnicity":{
...
"income":{
...
}
},
"income":{
...
"ethnicity":{
...
}
}
},
"19":{
...
},
...
"120":{
...
}
},
"ethnicity":{
...
},
"income":{
...
}
},
"South":{
...
},
...
}
"age":{
...
}
"ethnicity":{
...
},
"income":{
...
}
},
"Female":{
...
}
},
"region":{
...
},
"age":{
...
},
"ethnicity":{
...
},
"income":{
...
}}
因此用户选择
total 1000
600 Male
400 Female
100 North
200 South
300 East
400 West
300 <20 years old
300 21-29 years old
400 >=30 years old
计算线性分布:
male-north-u20: 1000*0.6*0.1*0.3=18
male-north-21to29: 18
male-north-o29: 24 (keep a track of rounding errors)
etc
然后我们检查地图:
tmp.male.north.u20=getSumUnder(JSON.gender.Male.region.North.age,20) // == 10
tmp.male.north.f21to29=getSumBetween(JSON.gender.Male.region.North.age,21,29) // == 29
tmp.male.north.o29=getSumOver(JSON.gender.Male.region.north.age,29) // == 200
etc
将符合线性分布的所有内容标记为正常并跟踪盈余。如果某些东西(如male.north.u20)低于第一次调整父级(以确保male.north符合条件),则u20会丢失8,而f21to29则会过度使用8。首次运行后,调整其他区域中的每个缺失标准。就像tmp.male.south.u20+=8;tmp.male.south.f21to29-=8;
。
要做到这一点非常繁琐。
最后,您可以使用正确的分发来构建一个简单的SQL查询。
答案 5 :(得分:0)
我希望您能够根据所需的过滤器生成大量查询。
我将使用完整的代码示例解释一种可能的方法 - 但请注意稍后的注意事项。
我还将解决您无法通过比例分配来满足所请求样本的问题,但您可以从调整后的分布中解决 - 并解释如何进行调整
基本算法如下:
从一组过滤器{F1, F2, ... Fn}
开始,每个过滤器都有一组值,以及应在这些值之间分配的百分比。例如,F1可能是性别,有2个值(F1V1 =男性:60%,F1V2 =女性:40%)您还需要所需的总样本量(调用此X
)从此起点开始然后,您可以组合每个过滤器中的所有过滤器项目,以获得所有组合过滤器项目的单个集合,以及每个过滤器项目所需的数量。
代码应该能够处理任意数量的过滤器,具有任意数量的值(精确值或范围)
EG:假设2个过滤器,F1:性别,{F1V1 =男性:60%,F1V2 =女性:40%},F2:区域,{F2V1 =北:50%,F2V2 =南:50%}和X = 10人所需的总样本 在这个样本中,我们认为其中6个是男性,4个是女性,5个来自北方,5个来自南方。
然后我们
WHERE gender = 'Male'
:0.6,WHERE gender = 'Female'
:0.4)WHERE gender = 'Male' AND region = 'North'
:0.6 * 0.5 = 0.3,WHERE gender = 'Female' AND region = 'North'
:0.4 * 0.5 = 0.2,WHERE gender = 'Male' AND region = 'South'
:0.6 * 0.5 = 0.3,WHERE gender = 'Female' AND region = 'South'
:0.4 * 0.5 = 0.2 代码示例
我会为此提供C#代码,但将其转换为其他语言应该很容易。 在纯动态SQL中尝试这个是非常棘手的
请注意,这是未经测试的 - 并且可能充满了错误 - 但它是您可以采取的方法的想法。
我已经定义了一个公共方法和一个公共类 - 这将是切入点。
// This is an example of a public class you could use to hold one of your filters
// For example - if you wanted 60% male / 40% female, you could have an item with
// item1 = {Fraction: 0.6, ValueExact: 'Male', RangeStart: null, RangeEnd: null}
// & item2 = {Fraction: 0.4, ValueExact: 'Female', RangeStart: null, RangeEnd: null}
public class FilterItem{
public decimal Fraction {get; set;}
public string ValueExact {get; set;}
public int? RangeStart {get; set;}
public int? RangeEnd {get; set;}
}
// This is an example of a public method you could call to build your SQL
// - passing in a generic list of desired filter
// for example the dictionary entry for the above filter would be
// {Key: "gender", Value: new List<FilterItem>(){item1, item2}}
public string BuildSQL(Dictionary<string, List<FilterItem>> filters, int TotalItems)
{
// we want to build up a list of SQL stubs that can be unioned together.
var sqlStubItems = new List<SqlItem>();
foreach(var entry in filters)
{
AddFilter(entry.Key, entry.Value, sqlStubItems);
}
// ok - now just combine all of the sql stubs into one big union.
var result = ""; // Id use a stringbuilder for this normally,
// but this is probably more cross-language readable.
int limitSum = 0;
for(int i = 0; i < sqlStubItems.Count; i++) // string.Join() would be more succinct!
{
var item = sqlStubItems[i];
if (i > 0)
{
result += " UNION ";
}
int limit = (int)Math.Round(TotalItems * item.Fraction, 0);
limitSum+= limit;
if (i == sqlStubItems.Count - 1 && limitSum != TotalItems)
{
//may need to adjust one of the rounded items to account
//for rounding errors making a total that is not the
//originally required total limit.
limit += (TotalItems - limitSum);
}
result += item.Sql + " LIMIT "
+ Convert.ToString(limit);
}
return result;
}
// This method expands the number of SQL stubs for every filter that has been added.
// each existing filter is split by the number of items in the newly added filter.
private void AddFilter(string filterType,
List<FilterItem> filterValues,
List<SqlItem> SqlItems)
{
var newItems = new List<SqlItem>();
foreach(var filterItem in filterValues)
{
string filterAddon;
if (filterItem.RangeStart.HasValue && filterItem.RangeEnd.HasValue){
filterAddon = filterType + " >= " + filterItem.RangeStart.ToString()
+ " AND " + filterType + " <= " + filterItem.RangeEnd.ToString();
} else {
filterAddon = filterType + " = '"
+ filterItem.ValueExact.Replace("'","''") + "'";
//beware of SQL injection. (hence the .Replace() above)
}
if(SqlItems.Count() == 0)
{
newItems.Add(new SqlItem(){Sql = "Select * FROM users WHERE "
+ filterAddon, Fraction = filterItem.Fraction});
} else {
foreach(var existingItem in SqlItems)
{
newItems.Add(new SqlItem()
{
Sql = existingItem + " AND " + filterAddon,
Fraction = existingItem.Fraction * filterItem.Fraction
});
}
}
}
SqlItems.Clear();
SqlItems.AddRange(newItems);
}
// this class is for part-built SQL strings, with the fraction
private class SqlItem{
public string Sql { get; set;}
public decimal Fraction{get; set;}
}
备注(根据Sign的评论)
'
字符。分布不均的样本问题
如何根据代表性分割(上述算法给出)来解决我们其中一个桶中的项目不足以创建样本的问题?或者如果你的数字不是整数怎么办?
嗯,我甚至没有提供代码,但我会描述一种可行的方法。您需要更改上面的代码,因为sql存根的平面列表不会再削减它。相反,您需要构建一个SQL存根的n维矩阵(为每个过滤器添加一个维度 - n)在上面的步骤4完成之后(我们有了每个SQL所需的但不一定可能的数字)存根项目),我期望做的是
因此,假设继续前面的例子 - 我们的代表性分歧是:
男/北= 3,女/北= 2,男/南= 3,女/南= 2,但北方只有2名男性(但我们可以选择其他组中的人数)
当调整更高维度的交叉项目时,此图示可能会有所帮助(仅显示最多4个维度,但应该有助于描绘需要完成的工作!每个点代表n维矩阵中的一个SQL存根项目(并且有一个相关的限制数字)一条线代表一个共同的标准值(例如性别=男性)。目标是在调整完成后,任何一行的总数应该保持不变!我们从红点开始,然后继续每个额外的维度......在上面的例子中,我们只看到2个维度 - 一个由红点组成的正方形,它上面和右边的2个橙色点,以及1个绿色点到NE完成方。
答案 6 :(得分:0)
在SQL中形成业务逻辑绝不是一个好主意,因为它会妨碍吸收甚至微小变化的能力。
我的建议是在ORM中执行此操作,并保持业务逻辑从SQL中抽象出来。
例如,如果您使用 Django :
您的模型看起来像:
class User(models.Model):
GENDER_CHOICES = (
('M', 'Male'),
('F','Female')
)
gender = models.CharField(max_length=1, choices=GENDER_CHOICES)
REGION_CHOICES = (
('E', 'East'),
('W','West'),
('N','North'),
('S','South')
)
region = models.CharField(max_length=1, choices=REGION_CHOICES)
age = models.IntegerField()
ETHNICITY_CHOICES = (
.......
)
ethnicity = models.CharField(max_length=1, choices=ETHNICITY_CHOICES)
income = models.FloatField()
您的查询功能可能是这样的:
# gender_limits is a dict like {'M':400, 'F':600}
# region_limits is a dict like {'N':100, 'E':200, 'W':300, 'S':400}
def get_users_by_gender_and_region(gender_limits,region_limits):
for gender in gender_limits:
gender_queryset = gender_queryset | User.objects.filter(gender=gender)[:gender_limits[gender]]
for region in region_limits:
region_queryset = region_queryset | User.objects.filter(region=region)[:region_limits[region]]
return gender_queryset & region_queryset
查询功能可以通过您计划支持的所有查询的知识进一步抽象,但这应该作为一个例子。
如果您正在使用不同的ORM,那么同样的想法也可以转换为任何好的ORM将具有并集和交集抽象。
答案 7 :(得分:0)
我会使用编程语言来生成SQL语句,但下面是纯mySQL中的解决方案。一个假设是:在一个地区总有足够的男性/女性来适应这些数字(例如,如果没有女性生活在北方,会怎么样?)。
例程正在预先计算所需的行数。无法使用变量指定限制。我更像是一个具有分析功能的神谕家伙。 MySQL还通过允许变量在某种程度上提供了这一点。所以我设定了目标地区和性别并计算了细目。然后我使用计算限制我的输出。
此查询显示用于证明概念的计数。
set @male=600;
set @female=400;
set @north=100;
set @south=200;
set @east=300;
set @west=400;
set @north_male=@north*(@male/(@male+@female));
set @south_male=@south*(@male/(@male+@female));
set @east_male =@east *(@male/(@male+@female));
set @west_male =@west *(@male/(@male+@female));
set @north_female=@north*(@female/(@male+@female));
set @south_female=@south*(@female/(@male+@female));
set @east_female =@east *(@female/(@male+@female));
set @west_female =@west *(@female/(@male+@female));
select gender, region, count(*)
from (
select * from (select @north_male :=@north_male-1 as row, userid, gender, region from users where gender = 'Male' and region = 'North' ) mn where row>=0
union all select * from (select @south_male :=@south_male-1 as row, userid, gender, region from users where gender = 'Male' and region = 'South' ) ms where row>=0
union all select * from (select @east_male :=@east_male-1 as row, userid, gender, region from users where gender = 'Male' and region = 'East' ) me where row>=0
union all select * from (select @west_male :=@west_male-1 as row, userid, gender, region from users where gender = 'Male' and region = 'West' ) mw where row>=0
union all select * from (select @north_female:=@north_female-1 as row, userid, gender, region from users where gender = 'Female' and region = 'North' ) fn where row>=0
union all select * from (select @south_female:=@south_female-1 as row, userid, gender, region from users where gender = 'Female' and region = 'South' ) fs where row>=0
union all select * from (select @east_female :=@east_female-1 as row, userid, gender, region from users where gender = 'Female' and region = 'East' ) fe where row>=0
union all select * from (select @west_female :=@west_female-1 as row, userid, gender, region from users where gender = 'Female' and region = 'West' ) fw where row>=0
) a
group by gender, region
order by gender, region;
输出:
Female East 120
Female North 40
Female South 80
Female West 160
Male East 180
Male North 60
Male South 120
Male West 240
删除外部部分以获取真实记录:
set @male=600;
set @female=400;
set @north=100;
set @south=200;
set @east=300;
set @west=400;
set @north_male=@north*(@male/(@male+@female));
set @south_male=@south*(@male/(@male+@female));
set @east_male =@east *(@male/(@male+@female));
set @west_male =@west *(@male/(@male+@female));
set @north_female=@north*(@female/(@male+@female));
set @south_female=@south*(@female/(@male+@female));
set @east_female =@east *(@female/(@male+@female));
set @west_female =@west *(@female/(@male+@female));
select * from (select @north_male :=@north_male-1 as row, userid, gender, region from users where gender = 'Male' and region = 'North' ) mn where row>=0
union all select * from (select @south_male :=@south_male-1 as row, userid, gender, region from users where gender = 'Male' and region = 'South' ) ms where row>=0
union all select * from (select @east_male :=@east_male-1 as row, userid, gender, region from users where gender = 'Male' and region = 'East' ) me where row>=0
union all select * from (select @west_male :=@west_male-1 as row, userid, gender, region from users where gender = 'Male' and region = 'West' ) mw where row>=0
union all select * from (select @north_female:=@north_female-1 as row, userid, gender, region from users where gender = 'Female' and region = 'North' ) fn where row>=0
union all select * from (select @south_female:=@south_female-1 as row, userid, gender, region from users where gender = 'Female' and region = 'South' ) fs where row>=0
union all select * from (select @east_female :=@east_female-1 as row, userid, gender, region from users where gender = 'Female' and region = 'East' ) fe where row>=0
union all select * from (select @west_female :=@west_female-1 as row, userid, gender, region from users where gender = 'Female' and region = 'West' ) fw where row>=0
;
为了进行测试,我编写了一个程序,它可以完全随机地创建10000个样本记录:
use test;
drop table if exists users;
create table users (userid int not null auto_increment, gender VARCHAR (20), region varchar(20), primary key (userid) );
drop procedure if exists load_users_table;
delimiter #
create procedure load_users_table()
begin
declare l_max int unsigned default 10000;
declare l_cnt int unsigned default 0;
declare l_gender varchar(20);
declare l_region varchar(20);
declare l_rnd smallint;
truncate table users;
start transaction;
WHILE l_cnt < l_max DO
set l_rnd = floor( 0 + (rand()*2) );
if l_rnd = 0 then
set l_gender = 'Male';
else
set l_gender = 'Female';
end if;
set l_rnd=floor(0+(rand()*4));
if l_rnd = 0 then
set l_region = 'North';
elseif l_rnd=1 then
set l_region = 'South';
elseif l_rnd=2 then
set l_region = 'East';
elseif l_rnd=3 then
set l_region = 'West';
end if;
insert into users (gender, region) values (l_gender, l_region);
set l_cnt=l_cnt+1;
end while;
commit;
end #
delimiter ;
call load_users_table();
select gender, region, count(*)
from users
group by gender, region
order by gender, region;
希望这一切对你有所帮助。底线是:使用UNION ALL
并使用预先计算的变量限制LIMIT
。
答案 8 :(得分:0)
嗯,我认为问题是随机获取记录而不是所有地区的比例为60/40。我为地区和性别做过。它可以以同样的方式推广到其他领域,如年龄,收入和种族。
Declare @Mlimit bigint
Declare @Flimit bigint
Declare @Northlimit bigint
Declare @Southlimit bigint
Declare @Eastlimit bigint
Declare @Westlimit bigint
Set @Mlimit= 600
Set @Flimit=400
Set @Northlimit= 100
Set @Southlimit=200
Set @Eastlimit=300
Set @Westlimit=400
CREATE TABLE #Users(
[UserId] [int] NOT NULL,
[gender] [varchar](10) NULL,
[region] [varchar](10) NULL,
[age] [int] NULL,
[ethnicity] [varchar](50) NULL,
[income] [bigint] NULL
)
Declare @MnorthCnt bigint
Declare @MsouthCnt bigint
Declare @MeastCnt bigint
Declare @MwestCnt bigint
Declare @FnorthCnt bigint
Declare @FsouthCnt bigint
Declare @FeastCnt bigint
Declare @FwestCnt bigint
Select @MnorthCnt=COUNT(*) from users where gender='male' and region='north'
Select @FnorthCnt=COUNT(*) from users where gender='female' and region='north'
Select @MsouthCnt=COUNT(*) from users where gender='male' and region='south'
Select @FsouthCnt=COUNT(*) from users where gender='female' and region='south'
Select @MeastCnt=COUNT(*) from users where gender='male' and region='east'
Select @FeastCnt=COUNT(*) from users where gender='female' and region='east'
Select @MwestCnt=COUNT(*) from users where gender='male' and region='west'
Select @FwestCnt=COUNT(*) from users where gender='female' and region='west'
If (@Northlimit=@MnorthCnt+@FnorthCnt)
begin
Insert into #Users select * from Users where region='north'
set @Northlimit=0
set @Mlimit-=@MnorthCnt
set @Flimit-=@FnorthCnt
set @MnorthCnt=0
set @FnorthCnt=0
end
If (@Southlimit=@MSouthCnt+@FSouthCnt)
begin
Insert into #Users select * from Users where region='South'
set @Southlimit=0
set @Mlimit-=@MSouthCnt
set @Flimit-=@FSouthCnt
set @MsouthCnt=0
set @FsouthCnt=0
end
If (@Eastlimit=@MEastCnt+@FEastCnt)
begin
Insert into #Users select * from Users where region='East'
set @Eastlimit=0
set @Mlimit-=@MEastCnt
set @Flimit-=@FEastCnt
set @MeastCnt=0
set @FeastCnt=0
end
If (@Westlimit=@MWestCnt+@FWestCnt)
begin
Insert into #Users select * from Users where region='West'
set @Westlimit=0
set @Mlimit-=@MWestCnt
set @Flimit-=@FWestCnt
set @MwestCnt=0
set @FwestCnt=0
end
If @MnorthCnt<@Northlimit
Begin
insert into #Users select top (@Northlimit-@MnorthCnt) * from Users where gender='female' and region='north'
and userid not in (select userid from #users)
set @Flimit-=(@Northlimit-@MnorthCnt)
set @FNorthCnt-=(@Northlimit-@MnorthCnt)
set @Northlimit-=(@Northlimit-@MnorthCnt)
End
If @FnorthCnt<@Northlimit
Begin
insert into #Users select top (@Northlimit-@FnorthCnt) * from Users where gender='male' and region='north'
and userid not in (select userid from #users)
set @Mlimit-=(@Northlimit-@FnorthCnt)
set @MNorthCnt-=(@Northlimit-@FnorthCnt)
set @Northlimit-=(@Northlimit-@FnorthCnt)
End
if @MsouthCnt<@southlimit
Begin
insert into #Users select top (@southlimit-@MsouthCnt) * from Users where gender='female' and region='south'
and userid not in (select userid from #users)
set @Flimit-=(@southlimit-@MsouthCnt)
set @FSouthCnt-=(@southlimit-@MsouthCnt)
set @southlimit-=(@southlimit-@MsouthCnt)
End
if @FsouthCnt<@southlimit
Begin
insert into #Users select top (@southlimit-@FsouthCnt) * from Users where gender='male' and region='south'
and userid not in (select userid from #users)
set @Mlimit-=(@southlimit-@FsouthCnt)
set @MSouthCnt-=(@southlimit-@FsouthCnt)
set @southlimit-=(@southlimit-@FsouthCnt)
End
if @MeastCnt<@eastlimit
Begin
insert into #Users select top (@eastlimit-@MeastCnt) * from Users where gender='female' and region='east'
and userid not in (select userid from #users)
set @Flimit-=(@eastlimit-@MeastCnt)
set @FEastCnt-=(@eastlimit-@MeastCnt)
set @eastlimit-=(@eastlimit-@MeastCnt)
End
if @FeastCnt<@eastlimit
Begin
insert into #Users select top (@eastlimit-@FeastCnt) * from Users where gender='male' and region='east'
and userid not in (select userid from #users)
set @Mlimit-=(@eastlimit-@FeastCnt)
set @MEastCnt-=(@eastlimit-@FeastCnt)
set @eastlimit-=(@eastlimit-@FeastCnt)
End
if @MwestCnt<@westlimit
Begin
insert into #Users select top (@westlimit-@MwestCnt) * from Users where gender='female' and region='west'
and userid not in (select userid from #users)
set @Flimit-=(@westlimit-@MwestCnt)
set @FWestCnt-=(@westlimit-@MwestCnt)
set @westlimit-=(@westlimit-@MwestCnt)
End
if @FwestCnt<@westlimit
Begin
insert into #Users select top (@westlimit-@FwestCnt) * from Users where gender='male' and region='west'
and userid not in (select userid from #users)
set @Mlimit-=(@westlimit-@FwestCnt)
set @MWestCnt-=(@westlimit-@FwestCnt)
set @westlimit-=(@westlimit-@FwestCnt)
End
IF (@MnorthCnt>=@Northlimit and @FnorthCnt>=@Northlimit and @MsouthCnt>=@southlimit and @FsouthCnt>=@southlimit and @MeastCnt>=@eastlimit and @FeastCnt>=@eastlimit and @MwestCnt>=@westlimit and @FwestCnt>=@westlimit and not(@Mlimit=0 and @Flimit=0))
Begin
---Create Cursor
DECLARE UC CURSOR FAST_forward
FOR
SELECT *
FROM Users
where userid not in (select userid from #users)
Declare @UserId [int] ,
@gender [varchar](10) ,
@region [varchar](10) ,
@age [int] ,
@ethnicity [varchar](50) ,
@income [bigint]
OPEN UC
FETCH NEXT FROM UC
INTO @UserId ,@gender, @region, @age, @ethnicity, @income
WHILE @@FETCH_STATUS = 0 and not (@Mlimit=0 and @Flimit=0)
BEGIN
If @gender='male' and @region='north' and @Northlimit>0 AND @Mlimit>0
begin
insert into #Users values (@UserId ,@gender, @region, @age, @ethnicity, @income)
set @Mlimit-=1
set @MNorthCnt-=1
set @Northlimit-=1
end
If @gender='male' and @region='south' and @southlimit>0 AND @Mlimit>0
begin
insert into #Users values (@UserId ,@gender, @region, @age, @ethnicity, @income)
set @Mlimit-=1
set @MsouthCnt-=1
set @Southlimit-=1
end
If @gender='male' and @region='east' and @eastlimit>0 AND @Mlimit>0
begin
insert into #Users values (@UserId ,@gender, @region, @age, @ethnicity, @income)
set @Mlimit-=1
set @MeastCnt-=1
set @eastlimit-=1
end
If @gender='male' and @region='west' and @westlimit>0 AND @Mlimit>0
begin
insert into #Users values (@UserId ,@gender, @region, @age, @ethnicity, @income)
set @Mlimit-=1
set @MwestCnt-=1
set @westlimit-=1
end
If @gender='female' and @region='north' and @Northlimit>0 AND @flimit>0
begin
insert into #Users values (@UserId ,@gender, @region, @age, @ethnicity, @income)
set @Flimit-=1
set @FNorthCnt-=1
set @Northlimit-=1
end
If @gender='female' and @region='south' and @southlimit>0 AND @flimit>0
begin
insert into #Users values (@UserId ,@gender, @region, @age, @ethnicity, @income)
set @Flimit-=1
set @FsouthCnt-=1
set @Southlimit-=1
end
If @gender='female' and @region='east' and @eastlimit>0 AND @flimit>0
begin
insert into #Users values (@UserId ,@gender, @region, @age, @ethnicity, @income)
set @flimit-=1
set @feastCnt-=1
set @eastlimit-=1
end
If @gender='female' and @region='west' and @westlimit>0 AND @flimit>0
begin
insert into #Users values (@UserId ,@gender, @region, @age, @ethnicity, @income)
set @flimit-=1
set @fwestCnt-=1
set @westlimit-=1
end
FETCH NEXT FROM UC
INTO @UserId ,@gender, @region, @age, @ethnicity, @income
END
CLOSE UC
DEALLOCATE UC
end
Select * from #Users
SELECT GENDER, REGION, COUNT(*) AS COUNT FROM #USERS
GROUP BY GENDER, REGION
DROP TABLE #Users
答案 9 :(得分:0)
我选择GROUP BY
:
SELECT gender,region,count(*) FROM users GROUP BY gender,region
+----------------------+
|gender|region|count(*)|
+----------------------+
|f |E | 129|
|f |N | 43|
|f |S | 84|
|f |W | 144|
|m |E | 171|
|m |N | 57|
|m |S | 116|
|m |W | 256|
+----------------------+
你可以证实你有600名男性,400名女性,100名北方,200名南方,300名东方和400名西方。
您也可以包含其他字段。
对于范围字段,例如年龄和收入,您可以按照以下示例:
SELECT
gender,
region,
case when age < 30 then 'Young'
when age between 30 and 59 then 'Middle aged'
else 'Old' end as age_range,
count(*)
FROM users
GROUP BY gender,region, age_range
所以,结果就像是:
+----------------------------------+
|gender|region|age |count(*)|
+----------------------------------+
|f |E |Middle aged| 56|
|f |E |Old | 31|
|f |E |Young | 42|
|f |N |Middle aged| 14|
|f |N |Old | 11|
|f |N |Young | 18|
|f |S |Middle aged| 40|
|f |S |Old | 23|
|f |S |Young | 21|
|f |W |Middle aged| 67|
|f |W |Old | 42|
|f |W |Young | 35|
|m |E |Middle aged| 77|
|m |E |Old | 56|
|m |E |Young | 38|
|m |N |Middle aged| 13|
|m |N |Old | 25|
|m |N |Young | 19|
|m |S |Middle aged| 46|
|m |S |Old | 39|
|m |S |Young | 31|
|m |W |Middle aged| 103|
|m |W |Old | 66|
|m |W |Young | 87|
+----------------------------------+