Contingency table for a list

Time: 2017-12-13 13:20:56

Tags: r contingency

I have this data frame; here is glimpse(df):

Observations: 2,211
Variables: 3
$ city       <chr> "Las Vegas", "Pittsburgh", "Las Vegas", "Phoenix", "Las Vegas", "Las Veg...
$ categories <chr> "c(\"Korean\", \"Sushi Bars\")", "c(\"Japanese\", \"Sushi Bars\")", "Tha...
$ is_open    <chr> "0", "0", "1", "0", "1", "1", "0", "1", "0", "1", "1", "1", "0", "1", "1...

Here is a small dput():

structure(list(city = c("Las Vegas", "Pittsburgh", "Las Vegas", 
"Phoenix", "Las Vegas"), categories = c("c(\"Korean\", \"Sushi Bars\")", 
"c(\"Japanese\", \"Sushi Bars\")", "Thai", "c(\"Sushi Bars\", \"Japanese\")", 
"Korean"), is_open = c("0", "0", "1", "0", "1")), .Names = c("city", 
"categories", "is_open"), row.names = c(NA, 5L), class = "data.frame")

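The data contains the categories for different cities. I would like to build a contingency table that shows which cuisines are associated with being closed (is_open = 0) or open (is_open = 1).

I want to do this with a contingency table. To that end I tried the following, but I got this error: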

xtabs(is_open ~., data = df)

Error in FUN(X[[i]], ...) : invalid 'type' (character) of argument

When I convert the variables with as.factor(), I get many tables instead of one. Is there a way to make the result look like the one below?

Category/City           Las Vegas     Pittsburgh
           Korean       50/50         30/70
           Sushi Bars   40/60         40/60

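The numbers in the columns are the frequencies of closed (is_open = 0) and open (is_open = 1) for each category in each city (for example, the closed (0) / open (1) distribution for Korean in Las Vegas is 50/50).

1 answer:

Answer 0: (score: 1)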

Here is a solution for your data using data.table, with a counting function based on stri_count from the stringi package. The counting could also be done by an ifelse with a sum(grepl()) construct (depending on how flexible the data structure needs to be, speed requirements, etc.). Note that I also reformatted your data into a cleaner "long format" with the help of this answer. If your data is formatted this way from the beginning, you can skip the reformatting.

# your data
df <- structure(list(city = c("Las Vegas", "Pittsburgh", "Las Vegas", 
"Phoenix", "Las Vegas"), categories = c("c(\"Korean\", \"Sushi Bars\")", 
"c(\"Japanese\", \"Sushi Bars\")", "Thai", "c(\"Sushi Bars\", \"Japanese\")", 
"Korean"), is_open = c("0", "0", "1", "0", "1")), .Names = c("city", 
"categories", "is_open"), row.names = c(NA, 5L), class = "data.frame")

library(data.table)
library(stringi)

# format data to correct "long format":
# strip the c("...") wrapper, then split into one row per category
DT <- as.data.table(df)
DT[, categories := gsub("c\\(\"|\"|\"\\)", "", categories)]
DT <- DT[, .(categories = unlist(strsplit(as.character(categories), ", ", fixed = TRUE))),
         by = .(city, is_open)]
#          city is_open categories
# 1:  Las Vegas       0     Korean
# 2:  Las Vegas       0 Sushi Bars
# 3: Pittsburgh       0   Japanese
# 4: Pittsburgh       0 Sushi Bars
# 5:  Las Vegas       1       Thai
# 6:  Las Vegas       1     Korean
# 7:    Phoenix       0 Sushi Bars
# 8:    Phoenix       0   Japanese

# specify all_unique_count_items to also cover items that are not present in x
calc_count_distr <- function(x, all_unique_count_items) {
  # percentage share of each item (here "0" and "1") within x
  count_distribution <- sapply(all_unique_count_items, function(y) {
    100 * round(sum(stri_count_fixed(x, y)) / length(x), digits = 2)
  })
  paste(count_distribution, collapse = "/")
}

# one row per category, one column per city, each cell "closed/open" in percent
dcast(DT, categories ~ city, value.var = "is_open",
      fun.aggregate = function(x) calc_count_distr(x, all_unique_count_items = unique(DT$is_open)),
      fill = NA)
#    categories Las Vegas Phoenix Pittsburgh
# 1:   Japanese        NA   100/0      100/0
# 2:     Korean     50/50      NA         NA
# 3: Sushi Bars     100/0   100/0      100/0
# 4:       Thai     0/100      NA         NA

I hope this is what you are looking for.
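
The answer mentions that the counting could also be done with a plain sum() construct instead of stri_count. The following is a minimal base R sketch of that idea, not the answerer's code: the helper names split_categories and pct_closed_open are hypothetical, the cleanup pattern is a simplified variant of the one above, and the sketch assumes is_open only ever holds "0" or "1".

```r
# Sample data from the question
df <- data.frame(
  city = c("Las Vegas", "Pittsburgh", "Las Vegas", "Phoenix", "Las Vegas"),
  categories = c("c(\"Korean\", \"Sushi Bars\")", "c(\"Japanese\", \"Sushi Bars\")",
                 "Thai", "c(\"Sushi Bars\", \"Japanese\")", "Korean"),
  is_open = c("0", "0", "1", "0", "1"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop the leading c(, the trailing ) and all quotes,
# then split each entry on ", " into its individual categories
split_categories <- function(x) {
  cleaned <- gsub("^c\\(|\\)$|\"", "", x)
  strsplit(cleaned, ", ", fixed = TRUE)
}

# Long format: one row per (city, is_open, category)
cats <- split_categories(df$categories)
long <- data.frame(
  city = rep(df$city, lengths(cats)),
  is_open = rep(df$is_open, lengths(cats)),
  categories = unlist(cats),
  stringsAsFactors = FALSE
)

# Hypothetical helper: share of closed ("0") and open ("1") entries in percent,
# formatted as "closed/open" (assumes x contains only "0" and "1")
pct_closed_open <- function(x) {
  closed <- round(100 * sum(x == "0") / length(x))
  open <- round(100 * sum(x == "1") / length(x))
  paste(closed, open, sep = "/")
}

# Category-by-city matrix; combinations with no data stay NA
result <- tapply(long$is_open, list(long$categories, long$city), pct_closed_open)
print(result)
```

For the five sample rows, the Korean / Las Vegas cell comes out as 50/50, matching the data.table result. Whether this is preferable depends on the data size; tapply is convenient here but is not expected to match data.table's speed on large inputs.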
