为什么在使用pivot_wider时会产生NA值?

时间:2020-01-07 06:41:22

标签: r dataframe tidyverse tidyr na

我正在尝试使用pivot wider创建包含值的多个列/变量,但是我不应该在列中使用NA。

以下是数据的代表性示例:

df <- structure(list(Condition = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Control", "Retraction1", 
"Retraction2"), class = "factor"), First = structure(c(2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Journalist", 
"Police", "Reviewer", "Spokesperson"), class = "factor"), Second = structure(c(3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("Journalist", 
"Police", "Reviewer", "Spokesperson"), class = "factor"), Third = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Journalist", 
"Police", "Reviewer", "Spokesperson"), class = "factor"), Fourth = structure(c(4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), .Label = c("Journalist", 
"Police", "Reviewer", "Spokesperson"), class = "factor"), ID = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", 
"14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", 
"25", "26", "27", "28", "29", "30", "31", "32", "33", "34", "35", 
"36", "37", "38", "39", "40", "41", "42", "43", "44", "45", "46", 
"47", "48", "49", "50", "51", "52", "53", "54", "55", "56", "57", 
"58", "59", "60", "61", "62", "63", "64", "65", "66", "67", "68", 
"69", "70", "71", "72", "73", "74", "75", "76", "77", "78", "79", 
"80", "81", "82", "83", "84", "85", "86", "87", "88", "89", "90", 
"91", "92", "93", "94", "95", "96", "97", "98", "99", "100", 
"101"), class = "factor"), Scenario = structure(c(1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 1L, 2L, 3L, 4L), .Label = c("J", "P", "R", 
"S"), class = "factor"), Estimate = structure(c(4L, 8L, 7L, 11L, 
9L, 12L, 10L, 2L, 5L, 6L, 4L, 7L, 11L, 9L, 12L, 10L, 2L, 3L, 
5L, 6L, 4L, 8L, 7L, 11L, 9L, 12L, 10L, 2L, 5L, 6L, 4L, 8L, 7L, 
11L, 9L, 12L, 10L, 2L, 5L, 6L, 1L, 1L, 1L, 1L), .Label = c("CompMean", 
"P.H.Reps.", "P.H.Reps..1", "P.Rel.", "P.Rel1.Reps.", "P.Rel2.Reps.", 
"P.Rep1.nH.nRel.", "P.Rep1.nH.Rel.", "P.Rep2.nH.nRel.nRep1.", 
"P.Rep2.nH.nRel.Rep1.", "P.Rep2.nH.Rel.nRep1.", "P.Rep2.nH.Rel.Rep1."
), class = "factor"), value = c(90L, 8L, 82L, 11L, 82L, 11L, 
82L, 100L, 99L, NA, 62L, 11L, 91L, 12L, 91L, 5L, 82L, 91L, 80L, 
NA, 92L, 12L, 61L, 18L, 90L, 21L, 81L, 96L, 92L, NA, 91L, 10L, 
72L, 22L, 62L, 21L, 73L, 99L, 98L, NA, 7L, 7L, 7L, 7L)), row.names = c(NA, 
-44L), class = c("tbl_df", "tbl", "data.frame"))

head(df)

这是来自一个主题的数据。 P.Rel2.Reps.中应该只有 个NA,而没有其他

但是,当我像这样使用pivotwide时,其他一些列中也有NA:

pivot_wider(df, names_from = Estimate, values_from = value)

以下是数据在旋转更宽后的外观示例。

df2 <- structure(list(Condition = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L), .Label = c("Control", "Retraction1", "Retraction2"
), class = "factor"), First = structure(c(2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L), .Label = c("Journalist", "Police", "Reviewer", 
"Spokesperson"), class = "factor"), Second = structure(c(3L, 
3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L), .Label = c("Journalist", 
"Police", "Reviewer", "Spokesperson"), class = "factor"), Third = structure(c(1L, 
1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("Journalist", 
"Police", "Reviewer", "Spokesperson"), class = "factor"), Fourth = structure(c(4L, 
4L, 4L, 4L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Journalist", 
"Police", "Reviewer", "Spokesperson"), class = "factor"), ID = structure(c(1L, 
1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L), .Label = c("1", "2", "3", 
"4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", 
"16", "17", "18", "19", "20", "21", "22", "23", "24", "25", "26", 
"27", "28", "29", "30", "31", "32", "33", "34", "35", "36", "37", 
"38", "39", "40", "41", "42", "43", "44", "45", "46", "47", "48", 
"49", "50", "51", "52", "53", "54", "55", "56", "57", "58", "59", 
"60", "61", "62", "63", "64", "65", "66", "67", "68", "69", "70", 
"71", "72", "73", "74", "75", "76", "77", "78", "79", "80", "81", 
"82", "83", "84", "85", "86", "87", "88", "89", "90", "91", "92", 
"93", "94", "95", "96", "97", "98", "99", "100", "101"), class = "factor"), 
    Scenario = structure(c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 
    2L), .Label = c("J", "P", "R", "S"), class = "factor"), P.Rel. = c(90L, 
    62L, 92L, 91L, 57L, 81L, 71L, 80L, 40L, 75L), P.Rep1.nH.Rel. = c(8L, 
    NA, 12L, 10L, 31L, NA, 19L, 17L, 25L, NA), P.Rep1.nH.nRel. = c(82L, 
    11L, 61L, 72L, 89L, 15L, 79L, 84L, 76L, 25L), P.Rep2.nH.Rel.nRep1. = c(11L, 
    91L, 18L, 22L, 35L, 64L, 30L, 22L, 25L, 50L), P.Rep2.nH.nRel.nRep1. = c(82L, 
    12L, 90L, 62L, 62L, 13L, 45L, 53L, 25L, 50L), P.Rep2.nH.Rel.Rep1. = c(11L, 
    91L, 21L, 21L, 15L, 52L, 9L, 10L, 100L, 50L), P.Rep2.nH.nRel.Rep1. = c(82L, 
    5L, 81L, 73L, 67L, 22L, 60L, 61L, 100L, 25L), P.H.Reps. = c(100L, 
    82L, 96L, 99L, 81L, 40L, 71L, 76L, 75L, 90L), P.Rel1.Reps. = c(99L, 
    80L, 92L, 98L, 81L, 80L, 89L, 79L, 75L, 76L), P.Rel2.Reps. = c(NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_), P.H.Reps..1 = c(NA, 
    91L, NA, NA, NA, 80L, NA, NA, NA, 100L), CompMean = c(7L, 
    7L, 7L, 7L, 7L, 7L, 7L, 6L, 4L, 7L)), row.names = c(NA, -10L
), class = c("tbl_df", "tbl", "data.frame"))

head(df2)

我已经看到有关此主题的类似文章,但没有回答为什么在我的情况下会生成NA。

我需要添加一些其他参数吗?

2 个答案:

答案 0 :(得分:2)

在查看数据时,好像您在一处有一些损坏的数据。您可以通过

进行更正
df$Estimate <- replace(df$Estimate, df$Estimate == "P.H.Reps..1", "P.Rep1.nH.Rel.") 

,然后使用pivot_wider,它只会在列NA中给您P.Rel2.Reps.

tidyr::pivot_wider(df, names_from = Estimate, values_from = value) 

答案 1 :(得分:1)

NA值将导致原始长数据帧中不存在的新数据透视列的类别的任何组合。例如,让我们用Estimate=="P.Rep1.nH.Rel."查看长数据帧的行:

df %>% filter(Estimate=="P.Rep1.nH.Rel.")
  Condition First  Second   Third      Fourth       ID    Scenario Estimate       value
1 Control   Police Reviewer Journalist Spokesperson 1     J        P.Rep1.nH.Rel.     8
2 Control   Police Reviewer Journalist Spokesperson 1     R        P.Rep1.nH.Rel.    12
3 Control   Police Reviewer Journalist Spokesperson 1     S        P.Rep1.nH.Rel.    10

现在查看pivot_wider的结果(为简洁起见,我仅保留了相关的列)。请注意,在下面的输出中,P.Rep1.nH.Rel.列中缺少一个值。在Scenario=="P"时会出现缺失值,因为长数据帧没有P.Rep1.nH.Rel.Scenario=="P"的一行,从而导致宽数据帧中的缺失值。出于类似原因,P.H.Reps..1列中还会出现缺失值,因为在长数据帧中只有一行包含Estimate=="P.H.Reps..1,并且它有Scenario=="P"。因此,其他三个方案的值均缺失。

pivot_wider(df, names_from = Estimate, values_from = value) %>% 
   select(Condition:Scenario, P.Rep1.nH.Rel., P.H.Reps..1)
  Condition First  Second   Third      Fourth       ID    Scenario P.Rep1.nH.Rel. P.H.Reps..1
1 Control   Police Reviewer Journalist Spokesperson 1     J                     8          NA
2 Control   Police Reviewer Journalist Spokesperson 1     P                    NA          91
3 Control   Police Reviewer Journalist Spokesperson 1     R                    12          NA
4 Control   Police Reviewer Journalist Spokesperson 1     S                    10          NA

这可能是数据错误,如@RonakShah所建议,但是如果数据正确,则在转向宽格式时自然会产生NA值。您可以通过向values_fill=list(value=0)添加参数pivot_wider来用其他值来填充缺失的值(您当然可以使用所需的任何填充值;我只是使用0进行说明) 。请注意,即使您使用values_fill参数,原始长数据中的显式缺失值仍将保留在宽数据框中。旋转操作只会导致丢失的值被填充为其他值。