Question

My data frame contains three categorical variables (x, y, z) plus an ID column:

df <- frame_data(
  ~id, ~x, ~y, ~z,
  1, "a", "c" ,"v",
  1, "b", "d", "f",
  2, "a", "d", "v",
  2, "b", "d", "v")

I want to apply spread() to every categorical variable, grouped by ID.

The output should look like this:

id  a  b  c  d  v  f
1  1  1  1  1  1  1
2  1  1  0  2  2  0

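I tried this, but I can only do it for one variable at a time.

For example, applying spread() to column y only (x and z can be handled separately in the same way, but not all together in one line):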

df %>% count(id,y) %>% spread(y,n,fill=0)
# A tibble: 2 x 3
id     c     d
<dbl> <int> <int>
1.00     1     1
2.00     0     2

My code, explained in three steps:

Step 1: compute the frequencies

df %>% count(id,y)    
id     y         n
<dbl> <chr> <int>
1.00   c     1
1.00   d     1
2.00   d     2

Step 2: apply spread()

df %>% count(id,y) %>% spread(y,n)
# A tibble: 2 x 3
id     c     d
<dbl> <int> <int>
1  1.00     1     1
2  2.00    NA     2

Step 3: add fill = 0 to replace the NA, which indicates that c never occurs in column y for id 2 (as you can see in df)

df %>% count(id,y) %>% spread(y,n,fill=0)
# A tibble: 2 x 3
id     c     d
<dbl> <int> <int>
1.00     1     1
2.00     0     2

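Question: in my actual dataset I have 20 such categorical variables, so I cannot do them one by one. I would like to do it all in one go. Is it possible to apply tidyr's spread() to all categorical variables at once? If not, could you suggest an alternative?

Note: I also tried these answers, but they did not help in this particular case: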

A further related question:

Two categorical columns (for example, in a survey dataset) may share the same values, as below.

df <- frame_data(
  ~id, ~Do_you_Watch_TV, ~Do_you_Drive, 
  1, "yes", "yes",
  1, "yes", "no",
  2, "yes", "no",
  2, "no", "yes")

# A tibble: 4 x 3
     id Do_you_Watch_TV Do_you_Drive
  <dbl> <chr>           <chr>
1  1.00 yes             yes
2  1.00 yes             no
3  2.00 yes             no
4  2.00 no              yes

Running the following code does not distinguish the yes/no counts of 'Do_you_Watch_TV' from those of 'Do_you_Drive':

df %>% gather(Key, value, -id) %>% 
  group_by(id, value) %>%
  summarise(count = n())  %>%
  spread(value, count, fill = 0) %>%
  as.data.frame()
id no yes
1  1   3
2  2   2

whereas the expected output is:
id Do_you_Watch_TV_no   Do_you_Watch_TV_yes  Do_you_Drive_no   Do_you_Drive_yes
1         0               2                    1                 1
2         1               1                    1                 1

So we need to treat the no and yes values of Do_you_Watch_TV and Do_you_Drive separately by adding a prefix: Do_you_Drive_yes, Do_you_Drive_no, Do_you_Watch_TV_yes, Do_you_Watch_TV_no.

How can we achieve this?

Thanks

Answer 1

First, you need to convert your data frame to long format before you can actually transform it to wide format. So first use tidyr::gather to convert the data frame to long format. After that, you have a couple of options:

Option #1: using tidyr::spread:

library(tidyverse)

# data
df <- frame_data(
  ~id, ~x, ~y, ~z,
  1, "a", "c" ,"v",
  1, "b", "d", "f",
  2, "a", "d", "v",
  2, "b", "d", "v")

df %>% gather(Key, value, -id) %>% 
  group_by(id, value) %>%
  summarise(count = n()) %>%
  spread(value, count, fill = 0) %>%
  as.data.frame()

#   id a b c d f v
# 1  1 1 1 1 1 1 1
# 2  2 1 1 0 2 0 2

Option #2: another option is to use reshape2::dcast:

library(tidyverse)
library(reshape2)

df %>% gather(Key, value, -id) %>% 
  dcast(id~value, fun.aggregate = length)

#   id a b c d f v
# 1  1 1 1 1 1 1 1
# 2  2 1 1 0 2 0 2

Edited: to include a solution for the second data frame.
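For the second data frame in the question, where both columns share the values yes and no, one possible approach is to glue each value to its column name before counting and spreading. The following is only a minimal sketch under that assumption, not necessarily the code the answer originally contained: the names df2, Key_value and result are illustrative, and tidyr::unite is used here to build the prefixed labels.

```r
library(dplyr)
library(tidyr)

# Second data frame from the question (named df2 here for illustration)
df2 <- tibble::tribble(
  ~id, ~Do_you_Watch_TV, ~Do_you_Drive,
  1, "yes", "yes",
  1, "yes", "no",
  2, "yes", "no",
  2, "no",  "yes"
)

result <- df2 %>%
  gather(Key, value, -id) %>%              # long format: one row per id / column / answer
  unite(Key_value, Key, value) %>%         # prefix the answer with its column name, sep = "_"
  count(id, Key_value) %>%                 # frequency of each prefixed answer per id
  spread(Key_value, n, fill = 0) %>%       # wide format, missing combinations become 0
  as.data.frame()

result
#   id Do_you_Drive_no Do_you_Drive_yes Do_you_Watch_TV_no Do_you_Watch_TV_yes
# 1  1               1                1                  0                   2
# 2  2               1                1                  1                   1
```

Because the column name becomes part of the key, identical answers in different columns no longer collapse into a single count. In newer tidyr releases gather() and spread() are superseded by pivot_longer() and pivot_wider(), although both still work.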
