Removing duplicate rows and replacing missing values in specific columns

Asked: 2019-04-01 15:35:58

Tags: r excel

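I have been working with this data set, which was originally an Excel sheet. After bringing it into RStudio, I have run into a few problems.

First, once loaded, all the blank cells were changed to NA. I ran: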

CarparkData[is.na(CarparkData)] <- 0

It only changes the data that was not a blank cell originally.

Second, to remove duplicate data I used the following code, but nothing happened.

library("dplyr")
install.packages("tidyverse")
library(tidyverse)
x <-CarparkData
duplicated(x)


x[duplicated(x),]
x[!duplicated(x),]

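Since I have a column for date and time, I want to use that column to remove the duplicate rows. Some rows hold the same data at different times, and I want to tell those apart from rows with the same data and the same date and time.

Third, replacing missing values. Some of the data has FULL written in it. I want to go into one column and change FULL to the number at which that particular car park is full, so only the FULL cells in that column change, not all FULL cells.

Sample data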

> dput(head(CarparkData))
structure(list(Parnell = c(188L, 183L, 185L, 229L, 237L, 272L
), Ilac = c(665, 683, 694, 769, 786, 839), Jervis = c(421, 408, 
403, 417, 423, 455), Arnotts = c(340, 344, 350, 359, 359, 355
), Malboro = c(160L, 160L, 156L, 157L, 173L, 207L), Abbey = c(0, 
0, 0, 0, 0, 0), `Thomas Street` = c(173, 173, 173, 186, 189, 
198), `Christ Church` = c(77, 76, 74, 73, 83, 91), Setanta = structure(c(24L, 
23L, 23L, NA, NA, 46L), .Label = c("10", "100", "101", "102", 
"103", "104", "107", "108", "110", "111", "112", "113", "114", 
"115", "120", "123", "125", "128", "129", "131", "14", "17", 
"19", "21", "24", "27", "28", "29", "30", "31", "32", "34", "36", 
"39", "40", "44", "45", "47", "48", "51", "52", "53", "56", "57", 
"6", "60", "63", "66", "67", "7", "70", "72", "74", "78", "79", 
"80", "81", "82", "84", "85", "86", "89", "9", "91", "92", "93", 
"94", "96", "98", "FULL"), class = "factor"), Dawson = c(70, 
87, 83, 118, 122, 140), Trinity = c(142L, 143L, 145L, 165L, 167L, 
191L), Greenrcs = structure(c(NA, 8L, 9L, NA, 4L, 5L), .Label = c("1125", 
"157", "205", "250", "262", "264", "266", "267", "270", "296", 
"305", "311", "319", "320", "324", "327", "342", "347", "350", 
"353", "364", "371", "374", "375", "378", "379", "459", "463", 
"591", "729", "754", "761", "879", "902", "903", "907", "911", 
"913", "916", "917", "922", "931", "944", "955", "974", "985", 
"FULL"), class = "factor"), Drury = c(148, 143, 147, 182, 193, 
235), `Brown Thomas` = c(230, 231, 0, 267, 272, 293), `Date & Time` = structure(1:6, .Label = c("2019-03-19 13:43:33", 
"2019-03-19 13:55:39", "2019-03-19 14:07:35", "2019-03-19 15:45:02", 
"2019-03-19 16:00:02", "2019-03-19 16:45:03", "2019-03-19 17:00:02", 
"2019-03-19 17:45:03", "2019-03-19 18:00:01", "2019-03-19 18:00:02", 
"2019-03-19 18:45:03", "2019-03-19 19:00:01", "2019-03-19 19:00:02", 
"2019-03-19 19:07:12", "2019-03-19 19:45:03", "2019-03-19 20:00:01", 
"2019-03-19 20:00:02", "2019-03-19 20:45:03", "2019-03-19 21:00:01", 
"2019-03-19 21:00:03", "2019-03-19 21:45:04", "2019-03-19 22:00:01", 
"2019-03-19 22:00:03", "2019-03-19 22:45:04", "2019-03-19 23:00:01", 
"2019-03-19 23:00:02", "2019-03-19 23:00:03", "2019-03-19 23:45:04", 
"2019-03-20 00:00:01", "2019-03-20 00:00:02", "2019-03-20 00:00:03", 
"2019-03-20 00:45:04", "2019-03-20 01:00:01", "2019-03-20 01:00:02", 
"2019-03-20 01:00:03", "2019-03-20 01:45:04", "2019-03-20 02:00:01", 
"2019-03-20 02:00:02", "2019-03-20 02:00:03", "2019-03-20 02:45:04", 
"2019-03-20 03:00:01", "2019-03-20 03:00:02", "2019-03-20 03:00:03", 
"2019-03-20 03:45:05", "2019-03-20 04:00:01", "2019-03-20 04:00:02", 
"2019-03-20 04:00:04", "2019-03-20 04:45:05", "2019-03-20 05:00:01", 
"2019-03-20 05:00:02",

Thanks.

1 answer:

Answer 0 (score: 0)

First question... if you want to explicitly set all empty cells to NA, you can use a custom function like this:

empty_as_na <- function(x){
  if("factor" %in% class(x)) x <- as.character(x) ## since ifelse won't work with factors
  ifelse(as.character(x)!="", x, NA)
}

Then apply this function:

dplyr::mutate_all(df, .funs = empty_as_na)

where df is your data frame.
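As a rough illustration of the helper above, here is a minimal sketch on a made-up data frame. The `toy` data and its values are assumptions for demonstration only and are not taken from the question's data.

```r
library(dplyr)

# Same helper as in the answer: blank strings become NA,
# factors are converted to character first
empty_as_na <- function(x) {
  if (is.factor(x)) x <- as.character(x)
  ifelse(as.character(x) != "", x, NA)
}

# Hypothetical toy data (not the asker's data)
toy <- data.frame(
  Setanta = c("24", "", "FULL"),
  Dawson  = c(70, 87, 83),
  stringsAsFactors = FALSE
)

# mutate_all() returns a new data frame, so the result has to be assigned
cleaned <- mutate_all(toy, .funs = empty_as_na)
is.na(cleaned$Setanta)
```

Note that the result has to be assigned to a name (here `cleaned`); calling the function without assignment only prints the result and leaves the original data frame unchanged.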


Second question... to remove duplicate rows, you should take a look at dplyr::distinct()
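A minimal sketch of how distinct() could be used to drop rows that share a timestamp, assuming the timestamp column is named `Date & Time` as in the question's dput output. The `toy` data frame is made up for illustration.

```r
library(dplyr)

# Hypothetical toy data mimicking the question's layout
toy <- data.frame(
  Parnell = c(188L, 188L, 183L, 183L),
  Ilac    = c(665, 665, 683, 683),
  `Date & Time` = c("2019-03-19 13:43:33", "2019-03-19 13:43:33",
                    "2019-03-19 13:55:39", "2019-03-19 14:07:35"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Keep the first row for each timestamp;
# .keep_all = TRUE retains all the other columns
deduped <- distinct(toy, `Date & Time`, .keep_all = TRUE)
nrow(deduped)
```

Rows 3 and 4 hold the same counts at different times, so both are kept; only the second copy of the first timestamp is dropped. As with the base R attempt in the question, the result must be assigned, because neither `distinct()` nor `x[!duplicated(x), ]` modifies the data frame in place.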


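Third question... I didn't get what the problem is... maybe you can clarify?


Sorry that I can't give you a full working example with the data you provided... but these functions should get you where you need to go.

Edit

Solution for the third issue, based on the comments...

Probably not the most elegant solution, but again, it is limited because no reprex was provided.

df is the data frame, column_new is the new column, column_number is the column that holds either numbers or FULL, and column_car is the column that says which car it is.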

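df %>% 
  mutate(
    # case_when() needs every branch to return the same type, so the remaining
    # values are converted to numeric (via character, in case of a factor)
    column_new = case_when(
      column_number == "FULL" & column_car == "car_a" ~ 300,
      column_number == "FULL" & column_car == "car_b" ~ 500,
      TRUE ~ suppressWarnings(as.numeric(as.character(column_number)))
    )
  )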
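
The sample data in the question is in a wide layout, with one column per car park, so a per-column replacement may fit better there. The following is a sketch in base R under that assumption. The `capacity` values are hypothetical, since the real capacities are not given in the question, and `toy` is made-up data.

```r
# Hypothetical capacities; the real numbers are not given in the question
capacity <- c(Setanta = 150, Greenrcs = 1000)

# Hypothetical toy data with factor columns, as in the question's dput output
toy <- data.frame(
  Setanta  = factor(c("24", "FULL", "46")),
  Greenrcs = factor(c("FULL", "267", "270")),
  Dawson   = c(70, 87, 83)
)

for (col in names(capacity)) {
  values <- as.character(toy[[col]])              # factor -> character
  values[values %in% "FULL"] <- capacity[[col]]   # only this column's FULL cells
  toy[[col]] <- as.numeric(values)                # character -> numeric
}

toy
```

Converting through as.character() matters here: calling as.numeric() directly on a factor returns the internal level codes rather than the numbers shown in the cells.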