I'm fairly new to R. My question is not as simple as the title suggests. This is what df looks like:
id amenities
1 wireless internet, air conditioning, pool, kitchen
2 pool, kitchen, washer, dryer
3 wireless internet, kitchen, dryer
4
5 wireless internet
This is what I want df to look like:
id wireless internet air conditioning pool kitchen washer dryer
1 1 1 1 1 0 0
2 0 0 1 1 1 1
3 1 0 0 1 0 1
4 0 0 0 0 0 0
5 1 0 0 0 0 0
Sample code to reproduce the data:
df <- data.frame(id = c(1, 2, 3, 4, 5),
amenities = c("wireless internet, air conditioning, pool, kitchen",
"pool, kitchen, washer, dryer",
"wireless internet, kitchen, dryer",
"",
"wireless internet"),
stringsAsFactors = FALSE)
Answer 0 (score: 3)
A solution using dplyr and tidyr. Note that I replace "" with None, because that makes the column name easier to handle later.
library(dplyr)
library(tidyr)
df2 <- df %>%
  # split on ", " (comma plus space) so that " pool" and "pool" share one column
  separate_rows(amenities, sep = ", ") %>%
  mutate(amenities = ifelse(amenities %in% "", "None", amenities)) %>%
  mutate(value = 1) %>%
  spread(amenities, value, fill = 0) %>%
  select(-None)
df2
#   id air conditioning dryer kitchen pool washer wireless internet
# 1  1                1     0       1    1      0                 1
# 2  2                0     1       1    1      1                 0
# 3  3                0     1       1    0      0                 1
# 4  4                0     0       0    0      0                 0
# 5  5                0     0       0    0      0                 1
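As a variation on the approach above, here is a minimal sketch using pivot_wider(), which supersedes spread() in current tidyr. It is not part of the original answer; the name df_wide and the helper column value are only illustrative, and the sketch assumes a tidyr version that provides pivot_wider() (1.0.0 or later).

```r
library(dplyr)
library(tidyr)

# Same sample data as in the question
df <- data.frame(id = c(1, 2, 3, 4, 5),
                 amenities = c("wireless internet, air conditioning, pool, kitchen",
                               "pool, kitchen, washer, dryer",
                               "wireless internet, kitchen, dryer",
                               "",
                               "wireless internet"),
                 stringsAsFactors = FALSE)

df_wide <- df %>%
  # one row per (id, amenity); split on comma plus space
  separate_rows(amenities, sep = ", ") %>%
  # keep ids without amenities by giving them a placeholder name
  mutate(amenities = ifelse(amenities == "", "None", amenities),
         value = 1) %>%
  # one column per amenity, 0 where an id lacks it
  pivot_wider(names_from = amenities, values_from = value,
              values_fill = list(value = 0)) %>%
  # drop the placeholder column again
  select(-None)

df_wide
```

Unlike spread(), pivot_wider() keeps columns in order of first appearance, so the columns come out in the order shown in the question rather than alphabetically.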
Answer 1 (score: 3)
FWIW, here is a base R approach (assuming df contains the data shown in the question):
dat <- with(df, strsplit(amenities, ', '))
df2 <- data.frame(id = factor(rep(df$id, times = lengths(dat)),
levels = df$id),
amenities = unlist(dat))
df3 <- as.data.frame(cbind(id = df$id,
table(df2$id, df2$amenities)))
This results in
> df3
id air conditioning dryer kitchen pool washer wireless internet
1 1 1 0 1 1 0 1
2 2 0 1 1 1 1 0
3 3 0 1 1 0 0 1
4 4 0 0 0 0 0 0
5 5 0 0 0 0 0 1
Breaking down what is happening:
dat <- with(df, strsplit(amenities, ', '))
splits the amenities variable on ', ', which gives
> dat
[[1]]
[1] "wireless internet" "air conditioning" "pool"
[4] "kitchen"
[[2]]
[1] "pool" "kitchen" "washer" "dryer"
[[3]]
[1] "wireless internet" "kitchen" "dryer"
[[4]]
character(0)
[[5]]
[1] "wireless internet"
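The second line takes dat and converts it to a vector, and we add an id column by repeating each original id value as many times as that id has amenities. Making id a factor with levels = df$id keeps id 4, which has no amenities, as a level. This results in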
> df2
id amenities
1 1 wireless internet
2 1 air conditioning
3 1 pool
4 1 kitchen
5 2 pool
6 2 kitchen
7 2 washer
8 2 dryer
9 3 wireless internet
10 3 kitchen
11 3 dryer
12 5 wireless internet
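To make the repetition step concrete, here is a small sketch of how lengths() and rep() work together. The toy list items and the vector ids are made up for illustration and are not part of the answer's data.

```r
# A toy list standing in for the output of strsplit(); the second
# element is empty, like id 4 in the question
items <- list(c("pool", "kitchen"), character(0), "dryer")
ids <- c(10, 20, 30)

# Number of entries per list element
n_items <- lengths(items)

# Repeat each id once per entry; an id with zero entries disappears
long_ids <- rep(ids, times = n_items)

# Flatten the list so that it lines up with long_ids
long_items <- unlist(items)

data.frame(id = long_ids, item = long_items)
```

Because id 20 has no entries, it does not appear in the long table at all, which is why the answer stores id as a factor with explicit levels.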
The table() function then creates a contingency table of id against amenities, and we add the id column to it.
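The role of levels = df$id in the code above can be seen in a small sketch: table() counts every factor level, including levels that never occur, so an id without amenities still gets a row of zeros. The vectors ids and items below are illustrative only.

```r
# Long-format toy data: id 4 never occurs
ids <- c(1, 1, 2, 5)
items <- c("pool", "kitchen", "pool", "dryer")

# Without explicit levels, only the ids that occur get a row
tab_plain <- table(ids, items)

# With explicit levels, every id from 1 to 5 gets a row
tab_levels <- table(factor(ids, levels = 1:5), items)

nrow(tab_plain)
nrow(tab_levels)
tab_levels["4", ]
```

The first table has three rows (ids 1, 2 and 5), while the second has five, with all zeros in the rows for ids 3 and 4.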
Answer 2 (score: 0)
The dummies package is useful here; try it.
Answer 3 (score: 0)
For completeness, here is also a concise data.table solution:
library(data.table)
setDT(df)[, strsplit(amenities, ", "), by = id][
, dcast(.SD, id ~ V1, length)]
   id air conditioning dryer kitchen pool washer wireless internet
1:  1                1     0       1    1      0                 1
2:  2                0     1       1    1      1                 0
3:  3                0     1       1    0      0                 1
4:  5                0     0       0    0      0                 1
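After coercing df to a data.table, amenities is split on ", " into a separate row for each item (long format). This is then reshaped to wide format with dcast(), using length() as the aggregation function. Note that id 4, which has no amenities, does not appear in the result.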
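If the row for id 4 is needed, as in the output requested in the question, one possible extension is to join the wide table back onto the full list of ids and fill the gaps with zeros. This is only a sketch and not part of the original answer; the names dt, long, wide, wide_all and amenity are illustrative.

```r
library(data.table)

# Same sample data as in the question
dt <- data.table(id = c(1, 2, 3, 4, 5),
                 amenities = c("wireless internet, air conditioning, pool, kitchen",
                               "pool, kitchen, washer, dryer",
                               "wireless internet, kitchen, dryer",
                               "",
                               "wireless internet"))

# Long format: one row per (id, amenity); id 4 yields no rows
long <- dt[, .(amenity = unlist(strsplit(amenities, ", "))), by = id]

# Wide format: count occurrences of each amenity per id
wide <- dcast(long, id ~ amenity, fun.aggregate = length, value.var = "amenity")

# Join onto all ids so that id 4 comes back, with NA in every amenity column
wide_all <- wide[dt[, .(id)], on = "id"]

# Replace those NAs with 0, column by column
for (col in setdiff(names(wide_all), "id")) {
  set(wide_all, which(is.na(wide_all[[col]])), col, 0L)
}

wide_all
```

The join keeps one row per id in dt, so ids without any amenities end up as rows of zeros instead of being dropped.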