Conditional counting in a data.table

Time: 2018-04-23 22:20:46

Tags: r data.table

Suppose I have the following data.table:

   hs_code country        city           company
1:  apples  Canada     Calgary          West Jet
2:  apples  Canada     Calgary            United
3:  apples      US Los Angeles            Alaska
4:  apples      US     Chicago            Alaska
5: oranges   Korea       Seoul          West Jet
6: oranges   China    Shanghai John's Freight Co
7: oranges   China      Harbin John's Freight Co
8: oranges   China      Ningbo John's Freight Co
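
The code that built this table is not shown in the post. A minimal sketch that reconstructs it, assuming the table is named `dt` (the name the answers use) and that every column is a character vector:

```r
library(data.table)

# Assumed reconstruction of the table printed above; `dt` is the name used in the answers.
dt <- data.table(
  hs_code = c(rep("apples", 4), rep("oranges", 4)),
  country = c("Canada", "Canada", "US", "US", "Korea", "China", "China", "China"),
  city    = c("Calgary", "Calgary", "Los Angeles", "Chicago",
              "Seoul", "Shanghai", "Harbin", "Ningbo"),
  company = c("West Jet", "United", "Alaska", "Alaska",
              "West Jet", rep("John's Freight Co", 3))
)

print(dt)
```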

The logic is as follows:

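For each commodity, I want the first column to give the number of unique countries. For apples, that is 2. Based on this value, I want the city column to hold a 2-tuple giving the number of unique cities in each country. Since Canada has only one unique city and the US has two, the value becomes (1,2). Note that this tuple sums to 3. Finally, in the company column, I want a 3-tuple giving the number of unique companies for each (country, city) combination. Since both West Jet and United appear for the (Canada, Calgary) pair, I assign 2. The next two values are 1 and 1, because Los Angeles and Chicago each have only one shipping company listed.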
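
To make the three levels concrete, the counts can be computed one level at a time with `uniqueN()`. This is only a sketch of the logic described above, not a full solution: it assumes the table is called `dt` and rebuilds it so the snippet runs on its own, and the names `lvl1`, `lvl2`, and `lvl3` are purely illustrative.

```r
library(data.table)

# Assumed reconstruction of the example table.
dt <- data.table(
  hs_code = c(rep("apples", 4), rep("oranges", 4)),
  country = c("Canada", "Canada", "US", "US", "Korea", "China", "China", "China"),
  city    = c("Calgary", "Calgary", "Los Angeles", "Chicago",
              "Seoul", "Shanghai", "Harbin", "Ningbo"),
  company = c("West Jet", "United", "Alaska", "Alaska",
              "West Jet", rep("John's Freight Co", 3))
)

# Level 1: unique countries per commodity
lvl1 <- dt[, .(countries = uniqueN(country)), by = hs_code]

# Level 2: unique cities per (commodity, country)
lvl2 <- dt[, .(cities = uniqueN(city)), by = .(hs_code, country)]

# Level 3: unique companies per (commodity, country, city)
lvl3 <- dt[, .(companies = uniqueN(company)), by = .(hs_code, country, city)]

lvl2[hs_code == "apples", cities]     # 1 2
lvl3[hs_code == "apples", companies]  # 2 1 1
```

The remaining work is to collapse `lvl2` and `lvl3` back into one row per `hs_code`.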

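I know this is confusing and involved, but any help would be greatly appreciated. I have tried using data.table methods to get output such as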

   hs_code countries city company
1:  apples         2  1,2   2,1,1
2: oranges         2  1,3 1,1,1,1

but I am not sure how to conveniently pass this, in list form, into data.table recursively.

Thanks!

2 Answers:

Answer 0: (score: 0)

Well, this is a kind of nested transformation, and you can do it in three steps:

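# Step 1: count unique companies per (hs_code, country, city)
dt[, .(companies = length(unique(company))), by = .(hs_code, country, city)][,
  # Step 2: count unique cities per (hs_code, country) and collapse the company counts
  .(cities = length(unique(city)),
    companies = paste0(companies, collapse = ",")), by = .(hs_code, country)][,
  # Step 3: count unique countries per hs_code and collapse both lower levels
  .(countries = length(unique(country)),
    cities = paste0(cities, collapse = ","),
    companies = paste0(companies, collapse = ",")), by = hs_code]
#    hs_code countries cities companies
# 1:  apples         2    1,2     2,1,1
# 2: oranges         2    1,3   1,1,1,1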

Answer 1: (score: 0)

You can use the .SD[] notation to create subgroups with finer-grained grouping:

dt[, .(
    countries = uniqueN(country),
    # unique cities within each country of the current hs_code
    city = c(.SD[, uniqueN(city), .(country)][, .(V1)]),
    # unique companies within each (country, city) pair of the current hs_code
    company = c(.SD[, uniqueN(company), .(country, city)][, .(V1)])
    ), .(hs_code)]

#    hs_code countries city company
# 1:  apples         2  1,2   2,1,1
# 2: oranges         2  1,3 1,1,1,1
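
One difference between the two answers is worth noting: the `paste0()` approach yields character strings, whereas the `.SD` approach appears to yield list columns whose elements are integer vectors. The sketch below shows an explicit list-column variant and how an individual tuple can be pulled back out. It assumes the table is called `dt` (rebuilt here so the snippet runs on its own), and `res` is just an illustrative name.

```r
library(data.table)

# Assumed reconstruction of the example table.
dt <- data.table(
  hs_code = c(rep("apples", 4), rep("oranges", 4)),
  country = c("Canada", "Canada", "US", "US", "Korea", "China", "China", "China"),
  city    = c("Calgary", "Calgary", "Los Angeles", "Chicago",
              "Seoul", "Shanghai", "Harbin", "Ningbo"),
  company = c("West Jet", "United", "Alaska", "Alaska",
              "West Jet", rep("John's Freight Co", 3))
)

# Wrap each per-group vector in list() so it becomes one list-column cell.
res <- dt[, .(
  countries = uniqueN(country),
  city      = list(.SD[, uniqueN(city), by = country]$V1),
  company   = list(.SD[, uniqueN(company), by = .(country, city)]$V1)
), by = hs_code]

apple_cities    <- res[hs_code == "apples", city[[1]]]     # 1 2
apple_companies <- res[hs_code == "apples", company[[1]]]  # 2 1 1

# The city tuple sums to the length of the company tuple, as the question notes.
sum(apple_cities) == length(apple_companies)               # TRUE
```

Keeping the tuples as integer vectors makes follow-up arithmetic easier. The comma-separated strings are easier to export or display.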