Question

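I have a survey table and need to collapse this data set into one row per respondent, but I am running into problems with spread and group_by.

My data set has the following format: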

country date_   user_id int_id  user_name   ext_name    q_order questions   answers
AR  2019    AR-100  XP200   jhon foo    damian, khon    1   Question1 … yes
AR  2019    AR-100  XP200   jhon foo    damian, khon    2   Question2 … 0
AR  2019    AR-100  XP200   jhon foo    damian, khon    3   Question3 … no apply
AR  2019    AR-100  XP200   jhon foo    damian, khon    4   Question4 … 0
AR  2019    AR-100  XP200   jhon foo    damian, khon    5   Question5 … 0
AR  2019    AR-100  XP200   jhon foo    damian, khon    6   Question6 … yes
US  2018    US-100  PP300   Peter fields    jhon voigh  1   Question1 … no
US  2018    US-100  PP300   Peter fields    jhon voigh  2   Question2 … 0
US  2018    US-100  PP300   Peter fields    jhon voigh  3   Question3 … yes apply
US  2018    US-100  PP300   Peter fields    jhon voigh  4   Question4 … 0
US  2018    US-100  PP300   Peter fields    jhon voigh  5   Question5 … 0
US  2018    US-100  PP300   Peter fields    jhon voigh  6   Question6 … no

I tried to group the resulting data set, but I always get 14 rows instead of 2.

Code:

data %>% 
    group_by(country=.$country  ,
             date_ = .$date_,
             medic_id=.$user_id,
             user_id= .$int_id,
             user_name= .$user_name,
             ext_name= .$ext_name,
             q_order=.$q_order
             ) %>% 
    spread(questions, answers)

The code above makes me run out of memory.

I even tried dcast:

data %>% 
    select(-q_order) %>% 
    dcast( ...  ~ questions, value.var = "answers")

I get the following:

Country.Code    Created.Date    user_id int_id  user_name   ext_name    Question1 … Question2 … Question3 … Question4 … Question5 … Question6 …
AR  3/28/2019   AR-100  XP200   jhon foo    damian, khon    1   2   0   1   1   1
US  4/28/2019   US-100  PP300   Peter fields    jhon voigh  0   1   1   2   1   2

But I need:

Country.Code    Created.Date    user_id int_id  user_name   ext_name    Question1 … Question2 … Question3 … Question4 … Question5 … Question6 …
AR  3/28/2019   AR-100  XP200   jhon foo    damian, khon    yes 0   no apply    0   0   yes
US  4/28/2019   US-100  PP300   Peter fields    jhon voigh  no  0   yes apply   0   0   no

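Why does dcast convert the values of the answers variable to numbers? (I even tried var.values = 'answers'.)

My question is very similar to this link!

But I could not get that solution to run: it either runs out of memory or produces numeric values instead of the values of the answers variable.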

Answer 1

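I finally found the answer!

The problem (I am a newbie in R) is that I wanted the values of one column laid out across a row, but those values are character strings, and most solutions deal with numbers rather than characters.

On the other hand, my solution with reshape worked fine on a 5-row sample, but on the real (small-to-medium) data set I ran out of memory and it never finished.

For example, the following code never finishes (and yes, I also tried it with grouping, as I said):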

b<-reshape(data=a %>% select(-q_order) ,
           direction="wide",
           idvar = c("Country.Code","Created.Date", "user_id", "int_id", "user_name",
                     "ext_name"),
           timevar="questions" )

This solution runs in 2 seconds:

b<-dcast( a, Country.Code+Created.Date+user_id+int_id +user_name+ ext_name ~ questions,
          toString, value.var="answers")
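As a rough illustration of why the earlier dcast call returned numbers: when the id columns plus `questions` do not identify a single row, reshape2's dcast falls back to `length` as the aggregation function and reports counts. The sketch below uses a made-up three-row data frame (`toy` and its values are hypothetical, not the real survey) to show the default behaviour next to `toString`.

```r
library(reshape2)

# Hypothetical toy data: Question1 appears twice for the same respondent,
# so user_id + questions does not identify a single row.
toy <- data.frame(
  user_id   = c("AR-100", "AR-100", "AR-100"),
  questions = c("Question1", "Question1", "Question2"),
  answers   = c("yes", "yes", "0"),
  stringsAsFactors = FALSE
)

# No aggregation function given: dcast defaults to length() and returns counts.
counts <- dcast(toy, user_id ~ questions, value.var = "answers")

# toString pastes the character values that fall into each cell.
texts <- dcast(toy, user_id ~ questions,
               fun.aggregate = toString, value.var = "answers")

# Assumed variant: collapse duplicated answers before pasting.
texts_unique <- dcast(toy, user_id ~ questions,
                      fun.aggregate = function(x) toString(unique(x)),
                      value.var = "answers")
```

If a cell in the real result shows something like "yes, yes", the input probably contains duplicated rows for that respondent and question; the `unique` variant is one possible way to handle that, assuming the duplicates are true repeats.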

Finally:

Country.Code    Created.Date    user_id int_id  user_name   ext_name    Question1 … Question2 … Question3 … Question4 … Question5 … Question6 …
AR  3/28/2019   AR-100  XP200   jhon foo    damian, khon    yes 0   no apply    0   0   yes
US  4/28/2019   US-100  PP300   Peter fields    jhon voigh  no  0   yes apply   0   0   no
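For readers who prefer to stay in the tidyverse, here is a hedged sketch of the same reshaping with `tidyr::pivot_wider`, which is not what the answer above used. It assumes tidyr 1.1.0 or later (where `values_fn` accepts a single function) and a small invented data frame named `toy`. It also shows why keeping `q_order` among the grouping keys prevents the rows from collapsing: `q_order` differs on every row of a respondent, so each row stays in its own group. Leaving it out of `id_cols` gives one row per respondent.

```r
library(tidyr)

# Hypothetical toy data with the same shape as the survey (two respondents,
# two questions each); the values are made up for illustration.
toy <- data.frame(
  country   = c("AR", "AR", "US", "US"),
  user_id   = c("AR-100", "AR-100", "US-100", "US-100"),
  q_order   = c(1, 2, 1, 2),
  questions = c("Question1", "Question2", "Question1", "Question2"),
  answers   = c("yes", "0", "no", "0"),
  stringsAsFactors = FALSE
)

# id_cols lists only the columns that identify a respondent; q_order is
# deliberately left out so it cannot split a respondent across rows.
wide <- pivot_wider(
  toy,
  id_cols     = c(country, user_id),
  names_from  = questions,
  values_from = answers,
  values_fn   = toString  # keeps character answers, pastes any duplicates
)
```

Whether this performs as well as the dcast call on the real data set has not been checked here.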
