I have this data frame; here is the output of glimpse(df):
Observations: 2,211
Variables: 3
$ city <chr> "Las Vegas", "Pittsburgh", "Las Vegas", "Phoenix", "Las Vegas", "Las Veg...
$ categories <chr> "c(\"Korean\", \"Sushi Bars\")", "c(\"Japanese\", \"Sushi Bars\")", "Tha...
$ is_open <chr> "0", "0", "1", "0", "1", "1", "0", "1", "0", "1", "1", "1", "0", "1", "1...
Here is a small dput() sample:
structure(list(city = c("Las Vegas", "Pittsburgh", "Las Vegas",
"Phoenix", "Las Vegas"), categories = c("c(\"Korean\", \"Sushi Bars\")",
"c(\"Japanese\", \"Sushi Bars\")", "Thai", "c(\"Sushi Bars\", \"Japanese\")",
"Korean"), is_open = c("0", "0", "1", "0", "1")), .Names = c("city",
"categories", "is_open"), row.names = c(NA, 5L), class = "data.frame")
The data contains the categories (categories) found in different cities (city).
I want to build a contingency table showing which cuisines are associated with being closed (is_open = 0) or open (is_open = 1). To do that I tried the following, but I got this error:
xtabs(is_open ~., data = df)
Error in FUN(X[[i]], ...) : invalid 'type' (character) of argument
When I convert the variables with as.factor(), I get many tables instead of one. Is there a way to make the result look like the one below?
Category/City   Las Vegas   Pittsburgh
Korean          50/50       30/70
Sushi Bars      40/60       40/60
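The numbers in the columns are the frequencies of closed (is_open = 0) and open (is_open = 1) for each category in each city (for example, Korean in Las Vegas has a closed (0) / open (1) split of 50/50).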
Answer 0 (score: 1)
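Here is a solution that uses data.table to cast your data into a table, with the counting function stri_count from the stringi package. The latter could also be done with a sum(grepl()) construct or with ifelse.
#your data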
df <- structure(list(city = c("Las Vegas", "Pittsburgh", "Las Vegas", "Phoenix", "Las Vegas")
,categories = c("c(\"Korean\", \"Sushi Bars\")",
"c(\"Japanese\", \"Sushi Bars\")", "Thai", "c(\"Sushi Bars\", \"Japanese\")",
"Korean")
,is_open = c("0", "0", "1", "0", "1"))
,.Names = c("city", "categories", "is_open"), row.names = c(NA, 5L), class = "data.frame")
library(data.table)
library(stringi)
#format data to correct "long format"
DT <- as.data.table(df)
DT[, categories := gsub("c\\(\"|\"|\"\\)", "", categories)]
DT <- DT[, .(categories = unlist(strsplit(as.character(categories), ", ", fixed = TRUE))),
by = .(city, is_open)]
#          city is_open categories
# 1:  Las Vegas       0     Korean
# 2:  Las Vegas       0 Sushi Bars
# 3: Pittsburgh       0   Japanese
# 4: Pittsburgh       0 Sushi Bars
# 5:  Las Vegas       1       Thai
# 6:  Las Vegas       1     Korean
# 7:    Phoenix       0 Sushi Bars
# 8:    Phoenix       0   Japanese
#specify all_unique_count_items to also cover items that are not present in x
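calc_count_distr <- function(x, all_unique_count_items) {
  # share (in percent) of each item among the values in x
  count_distribution <- sapply(all_unique_count_items, function(y) {
    100 * round(sum(stri_count_fixed(x, y)) / length(x), digits = 2)
  })
  # join the shares as "closed/open", e.g. "50/50"
  paste(count_distribution, collapse = "/")
}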
dcast.data.table(DT, categories ~ city, value.var = "is_open"
,fun.aggregate = function(x) calc_count_distr(x, all_unique_count_items = unique(DT$is_open))
,fill = NA)
#    categories Las Vegas Phoenix Pittsburgh
# 1:   Japanese        NA   100/0      100/0
# 2:     Korean     50/50      NA         NA
# 3: Sushi Bars     100/0   100/0      100/0
# 4:       Thai     0/100      NA         NA
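Which counting approach to use depends on the flexibility your data structure requires, speed requirements, and so on. Note that I also reformatted your data into a cleaner "long format" with the help of this answer. If you format your data this way from the start, you can skip this reformatting. I hope this is what you are looking for.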
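As a rough illustration of the sum(grepl()) alternative, here is a minimal base R sketch that needs neither data.table nor stringi. The helper name calc_share_grepl and its lvls argument are made up for this example, the long-format data is typed in by hand to match the reformatted table, and tapply() stands in for the casting step. Note that grepl(..., fixed = TRUE) matches substrings, which is only safe here because is_open holds the single characters "0" and "1".

```r
# Long-format data, typed in by hand to mirror the reformatted table above.
long <- data.frame(
  city = c("Las Vegas", "Las Vegas", "Pittsburgh", "Pittsburgh",
           "Las Vegas", "Las Vegas", "Phoenix", "Phoenix"),
  is_open = c("0", "0", "0", "0", "1", "1", "0", "0"),
  categories = c("Korean", "Sushi Bars", "Japanese", "Sushi Bars",
                 "Thai", "Korean", "Sushi Bars", "Japanese"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: percentage share of each level in x, joined as "closed/open".
# sum(grepl()) counts the elements of x that contain the level as a substring.
calc_share_grepl <- function(x, lvls = c("0", "1")) {
  shares <- sapply(lvls, function(lv) {
    100 * round(sum(grepl(lv, x, fixed = TRUE)) / length(x), digits = 2)
  })
  paste(shares, collapse = "/")
}

# tapply() applies the helper to every category/city cell; empty cells are NA.
result <- tapply(long$is_open, list(long$categories, long$city), calc_share_grepl)
print(result)
```

The result is a character matrix with categories as rows and cities as columns, so a single cell can be read with, for example, result["Korean", "Las Vegas"].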