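Correlating varchar values

Time: 2011-07-26 18:31:15

Tags: sql oracle statistics

Is there a built-in way in Oracle 11 to check the correlation of values in a varchar2 field? For example, given a simple table like this: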

MEAL_NUM  INGREDIENT
--------------------
1         BEEF
1         CHEESE
1         PASTA
2         CHEESE
2         PASTA
2         FISH
3         CHEESE
3         CHICKEN

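I'd like to get a numeric representation, based on MEAL_NUM, showing that CHEESE is paired mostly with PASTA and to a lesser degree with BEEF, CHICKEN, and FISH.

My first inclination is to use the CORR function and convert the strings to numbers, either by enumerating them beforehand or by taking the rownum from a unique select.

Any suggestions on how to approach this?

4 answers:

Answer 0: (score: 3)

You don't want to use CORR. If you create a "food number" and assign Beef = 1, Chicken = 2, and Pasta = 3, the correlation coefficient tells you whether more cheese is associated with a higher "food number". But a higher or lower "food number" doesn't mean anything, because you made the numbering up. So don't use CORR unless your foods are actually ordered in some way, the way numbers are.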
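To make the point about made-up numbering concrete, here is a minimal Python sketch (outside the database, using the sample rows from the question). The two encodings are hypothetical choices for illustration: one numbers the foods in order of first appearance, the other numbers them in the opposite order. The `pearson` helper is a plain textbook Pearson coefficient, not Oracle's CORR implementation.

```python
from math import sqrt

# Sample rows from the question: (meal_num, ingredient).
ROWS = [
    (1, "BEEF"), (1, "CHEESE"), (1, "PASTA"),
    (2, "CHEESE"), (2, "PASTA"), (2, "FISH"),
    (3, "CHEESE"), (3, "CHICKEN"),
]


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def corr_with_encoding(order):
    """Correlate meal_num with food numbers assigned by position in `order`."""
    code = {name: i + 1 for i, name in enumerate(order)}
    meal_nums = [meal for meal, _ in ROWS]
    food_nums = [code[ingredient] for _, ingredient in ROWS]
    return pearson(meal_nums, food_nums)


# Hypothetical encoding: number the foods in order of first appearance.
first_seen = ["BEEF", "CHEESE", "PASTA", "FISH", "CHICKEN"]
corr_forward = corr_with_encoding(first_seen)

# The same foods, numbered in the opposite order.
corr_reversed = corr_with_encoding(list(reversed(first_seen)))

print(corr_forward, corr_reversed)
```

The two coefficients have the same magnitude and opposite signs, even though the data is identical. The result reflects the arbitrary numbering rather than anything about the meals.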

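The way statisticians talk about this is in terms of levels of measurement. In that vocabulary, MEAL_NUM is a nominal measure, or possibly an ordinal measure if the meals happened in order. Either way, using a correlation coefficient on it is a really bad idea.

What you probably want instead is something like "what percentage of beef meals also have cheese?" For each ingredient, the following returns the number of meals containing it, and the number of meals containing both it and cheese. The trick is that COUNT only counts non-null values.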

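SELECT Other.Ingredient,
       COUNT(*) AS TotalMeals,
       COUNT(Cheese.Ingredient) AS CheesyMeals  -- COUNT skips the NULLs from unmatched rows
  FROM meals Other                              -- "meals" stands in for your table name
  LEFT JOIN meals Cheese
    ON (Cheese.Ingredient = 'CHEESE'            -- uppercase, to match the sample data
   AND Cheese.Meal_Num = Other.Meal_Num)
 GROUP BY Other.Ingredient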

Warning: this returns wrong results if an ingredient is included twice in any one meal.
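As a cross-check of what the query is meant to count, here is a small Python sketch over the question's sample rows. It is an illustration only, not a replacement for the SQL; the function name `cheesy_counts` is made up for this example. Unlike the query, it stores each meal as a set, so a duplicated ingredient within a meal would not inflate the counts.

```python
from collections import defaultdict

# Sample rows from the question: (meal_num, ingredient).
ROWS = [
    (1, "BEEF"), (1, "CHEESE"), (1, "PASTA"),
    (2, "CHEESE"), (2, "PASTA"), (2, "FISH"),
    (3, "CHEESE"), (3, "CHICKEN"),
]


def cheesy_counts(rows, target="CHEESE"):
    """Return {ingredient: (total_meals, meals_also_containing_target)}."""
    meals = defaultdict(set)  # meal_num -> set of ingredients
    for meal_num, ingredient in rows:
        meals[meal_num].add(ingredient)

    totals = defaultdict(int)
    with_target = defaultdict(int)
    for ingredients in meals.values():
        for ingredient in ingredients:
            totals[ingredient] += 1
            if target in ingredients:
                with_target[ingredient] += 1
    return {name: (totals[name], with_target[name]) for name in totals}


counts = cheesy_counts(ROWS)
for name in sorted(counts):
    total, cheesy = counts[name]
    print(name, total, cheesy, cheesy / total)
```

With this particular sample every meal contains CHEESE, so every ingredient comes out at 100%. The percentage only becomes informative on data where some meals lack the target ingredient.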

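Edit: It turns out you're not interested in cheese specifically. You really want all the "correlated" pairs. So we can abstract out "Cheese" and just call them the first and second ingredients. I've added a "PossibleScore" to this one, which tries to behave like the percentage of meals, but doesn't give a strong score when there are very few instances of the ingredient.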

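SELECT f.Ingredient AS FirstIngredient,
       s.Ingredient AS SecondIngredient,
       t.MealsWithFirst,
       COUNT(*) AS MealsWithBoth,
       COUNT(*) / (t.MealsWithFirst + 3) AS PossibleScore
  FROM meals f                         -- "meals" stands in for your table name
  JOIN meals s
    ON s.Meal_Num = f.Meal_Num
   AND s.Ingredient <> f.Ingredient    -- skip pairing an ingredient with itself
  JOIN (SELECT Ingredient, COUNT(*) AS MealsWithFirst
          FROM meals
         GROUP BY Ingredient) t        -- meals containing the first ingredient
    ON t.Ingredient = f.Ingredient
 GROUP BY f.Ingredient, s.Ingredient, t.MealsWithFirst
 ORDER BY PossibleScore DESC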

When ordered by score, this should return:

PASTA    CHEESE    2    2    0.400
CHEESE   PASTA     3    2    0.333
BEEF     CHEESE    1    1    0.250
BEEF     PASTA     1    1    0.250
FISH     CHEESE    1    1    0.250
FISH     PASTA     1    1    0.250
CHICKEN  CHEESE    1    1    0.250
PASTA    BEEF      2    1    0.200
PASTA    FISH      2    1    0.200
CHEESE   BEEF      3    1    0.167
CHEESE   FISH      3    1    0.167
CHEESE   CHICKEN   3    1    0.167
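The figures in that table can be re-derived by hand: each score is MealsWithBoth / (MealsWithFirst + 3), so PASTA/CHEESE is 2 / (2 + 3) = 0.400 and CHEESE/PASTA is 2 / (3 + 3) = 0.333. The following Python sketch recomputes all twelve rows from the sample data. The names `pair_scores` and `damping` are invented for this illustration, and the code assumes an ingredient counts at most once per meal.

```python
from collections import defaultdict
from itertools import permutations

# Sample rows from the question: (meal_num, ingredient).
ROWS = [
    (1, "BEEF"), (1, "CHEESE"), (1, "PASTA"),
    (2, "CHEESE"), (2, "PASTA"), (2, "FISH"),
    (3, "CHEESE"), (3, "CHICKEN"),
]


def pair_scores(rows, damping=3):
    """Return {(first, second): (meals_with_first, meals_with_both, score)}."""
    meals = defaultdict(set)  # meal_num -> set of ingredients
    for meal_num, ingredient in rows:
        meals[meal_num].add(ingredient)

    meals_with_first = defaultdict(int)
    meals_with_both = defaultdict(int)
    for ingredients in meals.values():
        for name in ingredients:
            meals_with_first[name] += 1
        # Ordered pairs, so (PASTA, CHEESE) and (CHEESE, PASTA) are both counted.
        for first, second in permutations(ingredients, 2):
            meals_with_both[(first, second)] += 1

    scores = {}
    for (first, second), both in meals_with_both.items():
        n_first = meals_with_first[first]
        # The damping constant keeps rare ingredients from scoring 1.0.
        scores[(first, second)] = (n_first, both, both / (n_first + damping))
    return scores


scores = pair_scores(ROWS)
for pair, (n_first, both, score) in sorted(
    scores.items(), key=lambda item: -item[1][2]
):
    print(f"{pair[0]:8} {pair[1]:8} {n_first} {both} {score:.3f}")
```

The constant 3 in the denominator works as a simple smoothing term: an ingredient seen in only one meal can score at most 1 / (1 + 3) = 0.25, while a pair that co-occurs across many meals approaches the plain percentage. Rows with equal scores may print in a different order than in the table above.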

Answer 1: (score: 2)

Do a self join to get all ingredient combinations, then correlate on the two meal_nums:

SELECT t1.INGREDIENT, t2.INGREDIENT, CORR(t1.MEAL_NUM, t2.MEAL_NUM)
FROM TheTable t1, TheTable t2
WHERE t1.INGREDIENT < t2.INGREDIENT
GROUP BY t1.INGREDIENT, t2.INGREDIENT

This should give you something like:

BEEF    CHEESE  0.999
BEEF    PASTA   0.998
CHEESE  PASTA   0.977

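Update: Chris points out that this won't work. I was hoping there might be some way to fudge a mapping from the ordinal meal_num to an interval value (@Chris, thanks for the link). That may not be possible, in which case this answer doesn't help.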

Answer 2: (score: 1)

Try DBMS_FREQUENT_ITEMSET:

--Create sample data
create table meals(meal_num number, ingredient varchar2(10));

insert into meals
select 1, 'BEEF' from dual union all
select 1, 'CHEESE' from dual union all
select 1, 'PASTA' from dual union all
select 2, 'CHEESE' from dual union all
select 2, 'PASTA' from dual union all
select 2, 'FISH' from dual union all
select 3, 'CHEESE' from dual union all
select 3, 'CHICKEN' from dual;

commit;

--Create nested table type to hold results
CREATE OR REPLACE TYPE fi_varchar_nt AS TABLE OF VARCHAR2(10);
/

--Find the items most frequently combined with CHEESE.
select bt.setid, nt.column_value, support occurances_of_itemset
    ,length, total_tranx
from
(
    select
        cast(itemset as fi_varchar_nt) itemset, rownum setid
        ,support, length, total_tranx
    from table(dbms_frequent_itemset.fi_transactional(
        tranx_cursor => cursor(select meal_num, ingredient from meals),
        support_threshold => 0,
        itemset_length_min => 2,
        itemset_length_max => 2,
        including_items => cursor(select 'CHEESE' from dual),
        excluding_items => null))
) bt,
table(bt.itemset) nt
where column_value <> 'CHEESE'
order by 3 desc;


     SETID COLUMN_VAL OCCURANCES_OF_ITEMSET     LENGTH TOTAL_TRANX
---------- ---------- --------------------- ---------- -----------
         4 PASTA                          2          2           3
         3 FISH                           1          2           3
         1 BEEF                           1          2           3
         2 CHICKEN                        1          2           3
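For readers without an Oracle instance at hand, the support counts above can be sanity-checked with a short Python sketch. It only mimics the idea of counting two-item itemsets that include CHEESE; it is not how DBMS_FREQUENT_ITEMSET is implemented, and the name `pair_support` is made up for this example.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Sample rows from the question: (meal_num, ingredient).
ROWS = [
    (1, "BEEF"), (1, "CHEESE"), (1, "PASTA"),
    (2, "CHEESE"), (2, "PASTA"), (2, "FISH"),
    (3, "CHEESE"), (3, "CHICKEN"),
]


def pair_support(rows, must_include="CHEESE"):
    """Count, per two-item itemset containing `must_include`, the meals holding it."""
    meals = defaultdict(set)  # meal_num -> set of ingredients (one transaction)
    for meal_num, ingredient in rows:
        meals[meal_num].add(ingredient)

    support = Counter()
    for ingredients in meals.values():
        # Sorting gives each unordered pair a single canonical form.
        for pair in combinations(sorted(ingredients), 2):
            if must_include in pair:
                support[pair] += 1
    return support, len(meals)


support, total_tranx = pair_support(ROWS)
for pair, count in support.most_common():
    other = pair[0] if pair[1] == "CHEESE" else pair[1]
    print(other, count, total_tranx)
```

The counts correspond to the OCCURANCES_OF_ITEMSET and TOTAL_TRANX columns in the output above: PASTA appears with CHEESE in 2 of 3 meals, and every other ingredient in 1 of 3.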

Answer 3: (score: 0)

How about a query like this?

select t1.INGREDIENT, count(*) as shared_meals
from meals t1,                      -- "meals" stands in for your table name
     (select meal_num
      from meals
      where INGREDIENT = 'CHEESE') t2
where t1.INGREDIENT <> 'CHEESE'
and t1.meal_num = t2.meal_num
group by t1.INGREDIENT;

The result should be the number of meals that each ingredient shares with CHEESE.