Suppose I have the following data table:
   hs_code country        city           company
1:  apples  Canada     Calgary          West Jet
2:  apples  Canada     Calgary            United
3:  apples      US Los Angeles            Alaska
4:  apples      US     Chicago            Alaska
5: oranges   Korea       Seoul          West Jet
6: oranges   China    Shanghai John's Freight Co
7: oranges   China      Harbin John's Freight Co
8: oranges   China      Ningbo John's Freight Co
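The code that builds this table is not shown, so here is a minimal sketch for rebuilding it, assuming all four columns are plain character vectors and that the table is called `dt` (the name used in the answers):

```r
library(data.table)

# Rebuild the example table; the values are copied from the printed rows above.
dt <- data.table(
  hs_code = c(rep("apples", 4), rep("oranges", 4)),
  country = c("Canada", "Canada", "US", "US", "Korea", "China", "China", "China"),
  city    = c("Calgary", "Calgary", "Los Angeles", "Chicago",
              "Seoul", "Shanghai", "Harbin", "Ningbo"),
  company = c("West Jet", "United", "Alaska", "Alaska",
              "West Jet", rep("John's Freight Co", 3))
)

print(dt)
```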
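The logic is as follows:
For each commodity, I want the first column to give the number of unique countries. For apples, that is 2. Based on this value, I want a 2-tuple in the city column that gives the number of unique cities for each country. Since Canada has only one unique city and the US has two, the value becomes (1,2). Note that this tuple sums to 3. Finally, in the company column, I want a 3-tuple that gives the number of unique companies for each (country, city) combination. Since both West Jet and United appear for the (Canada, Calgary) pair, I assign 2. The next two values are 1 and 1, because Los Angeles and Chicago each have only one shipping company listed.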
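I know this is confusing and involved, but any help would be greatly appreciated. The result I am trying to get is:
   hs_code countries city company
1:  apples         2  1,2   2,1,1
2: oranges         2  1,3 1,1,1,1
I have tried data.table approaches, but I am not sure how to conveniently pass this, in list form, into data.table recursively.
Thanks!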
Answer 0: (score: 0)
Well, this is a kind of nested transformation, and you can do it in three steps:
# Step 1: count unique companies per (hs_code, country, city)
dt[, .(companies = length(unique(company))), by = .(hs_code, country, city)][,
   # Step 2: per country, count unique cities and collect the company counts
   .(cities = length(unique(city)),
     companies = paste0(companies, collapse = ",")), by = .(hs_code, country)][,
   # Step 3: per commodity, count unique countries and collect both tuples
   .(countries = length(unique(country)),
     cities = paste0(cities, collapse = ","),
     companies = paste0(companies, collapse = ",")), by = hs_code]
#    hs_code countries cities companies
# 1:  apples         2    1,2     2,1,1
# 2: oranges         2    1,3   1,1,1,1
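To make the three steps easier to follow, the chain can also be split into named intermediate tables. This is only a sketch: the names `step1`, `step2` and `result` are made up for illustration, `dt` is rebuilt from the question's printed rows, and `uniqueN()` is used as shorthand for `length(unique())`.

```r
library(data.table)

# Example data rebuilt from the rows printed in the question.
dt <- data.table(
  hs_code = c(rep("apples", 4), rep("oranges", 4)),
  country = c("Canada", "Canada", "US", "US", "Korea", "China", "China", "China"),
  city    = c("Calgary", "Calgary", "Los Angeles", "Chicago",
              "Seoul", "Shanghai", "Harbin", "Ningbo"),
  company = c("West Jet", "United", "Alaska", "Alaska",
              "West Jet", rep("John's Freight Co", 3))
)

# Step 1: one row per (hs_code, country, city) with its unique company count.
step1 <- dt[, .(companies = uniqueN(company)), by = .(hs_code, country, city)]

# Step 2: one row per (hs_code, country); count the cities and glue the
# per-city company counts into a comma-separated string.
step2 <- step1[, .(cities    = uniqueN(city),
                   companies = paste(companies, collapse = ",")),
               by = .(hs_code, country)]

# Step 3: one row per hs_code; count the countries and glue both tuples.
result <- step2[, .(countries = uniqueN(country),
                    cities    = paste(cities, collapse = ","),
                    companies = paste(companies, collapse = ",")),
                by = hs_code]

print(step1)
print(step2)
print(result)
```

Printing `step1` and `step2` shows how each level of grouping collapses into the next. The tuples come out in the order in which countries and cities first appear in `dt`.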
Answer 1: (score: 0)
You can use the .SD[] notation to create subgroups with finer-grained grouping:
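dt[, .(
  countries = uniqueN(country),
  # unique cities per country, stored as one list element per hs_code
  cities = c(.SD[, uniqueN(city), .(country)][, .(V1)]),
  # unique companies per (country, city) pair, stored the same way
  companies = c(.SD[, uniqueN(company), .(country, city)][, .(V1)])
), .(hs_code)]
#    hs_code countries cities companies
# 1:  apples         2    1,2     2,1,1
# 2: oranges         2    1,3   1,1,1,1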
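Because the .SD approach keeps the tuples as list columns rather than strings, the relationships described in the question (the city tuple has one entry per country, and it sums to the length of the company tuple) can be checked directly. The following is a sketch of a variant, not the answer's exact code: it wraps the counts in `list()` and extracts `$V1`, and the names `res`, `cities_match` and `companies_match` are invented for illustration.

```r
library(data.table)

# Example data rebuilt from the rows printed in the question.
dt <- data.table(
  hs_code = c(rep("apples", 4), rep("oranges", 4)),
  country = c("Canada", "Canada", "US", "US", "Korea", "China", "China", "China"),
  city    = c("Calgary", "Calgary", "Los Angeles", "Chicago",
              "Seoul", "Shanghai", "Harbin", "Ningbo"),
  company = c("West Jet", "United", "Alaska", "Alaska",
              "West Jet", rep("John's Freight Co", 3))
)

# Nested counts stored as integer vectors inside list columns.
res <- dt[, .(
  countries = uniqueN(country),
  cities    = list(.SD[, uniqueN(city), by = country]$V1),
  companies = list(.SD[, uniqueN(company), by = .(country, city)]$V1)
), by = hs_code]

# Consistency checks taken from the question's description:
# one city count per country, and one company count per (country, city) pair.
checks <- res[, .(hs_code,
                  cities_match    = lengths(cities) == countries,
                  companies_match = lengths(companies) == sapply(cities, sum))]

print(res)
print(checks)
```

Keeping integer vectors instead of pasted strings makes follow-up computations like these possible without parsing the text again.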