I have a large dataset in which some of the information I need is stored in the first column as a semicolon-separated string. For example:
TestData <- data.frame("Information" = c("Forrest;Trees;Unknown", "Forrest;Trees;Leaves", "Forrest;Trees;Trunks", "Forrest;Shrubs;Unknown", "Forrest;Shrubs;Branches", "Forrest;Shrubs;Leaves", "Forrest;Shrubs;NA"), "Data" = c(5,1,3,4,2,1,3))
which gives:
              Information Data
1   Forrest;Trees;Unknown    5
2    Forrest;Trees;Leaves    1
3    Forrest;Trees;Trunks    3
4  Forrest;Shrubs;Unknown    4
5 Forrest;Shrubs;Branches    2
6   Forrest;Shrubs;Leaves    1
7       Forrest;Shrubs;NA    3
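I need to simplify the names so that each row keeps only its last name when that name is unique and is not "Unknown" or "NA". When the last name is "Unknown", "NA", or shared by several rows (such as "Leaves"), the name before it is kept as a prefix. My data frame would then become:
     Information Data
1  Trees;Unknown    5
2   Trees;Leaves    1
3         Trunks    3
4 Shrubs;Unknown    4
5       Branches    2
6  Shrubs;Leaves    1
7      Shrubs;NA    3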
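Answer 0 (score: 1)
It may not be the most efficient or elegant solution, but it works for the sample data. Hopefully it is good enough for your needs as well: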
library(stringr)
library(dplyr)
TestData <- data.frame("Information" = c("Forrest;Trees;Unknown", "Forrest;Trees;Leaves", "Forrest;Trees;Trunks", "Forrest;Shrubs;Unknown", "Forrest;Shrubs;Branches", "Forrest;Shrubs;Leaves", "Forrest;Shrubs;NA"), "Data" = c(5,1,3,4,2,1,3))
# split text into 3 columns
TestData[3:5] <- str_split_fixed(TestData$Information, ";", 3)
# filter Unknown and NA values, count frequencies to determine unique values
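a <- TestData %>%
  filter(!V5 %in% c("Unknown", "NA")) %>%
  group_by(V5) %>%
  summarise(count = n())
# join back to original data (explicit key avoids the "Joining, by" message)
TestData <- TestData %>%
  left_join(a, by = "V5")
# keep the parent (V4) as a prefix when the last name is shared by several rows
# or was filtered out above (count is NA); otherwise keep the last name alone
TestData$Clean <- ifelse(TestData$count > 1 | is.na(TestData$count),
                         paste0(TestData$V4, ";", TestData$V5),
                         TestData$V5)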
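For comparison, the same idea can be sketched in base R without any packages. This is only an illustration, not part of the original answer: `simplify_info` and its `placeholders` argument are hypothetical names, and the sketch assumes every string has at least two semicolon-separated components.

```r
# Hypothetical helper (not from the original answer): keep the last component
# when it is unique and informative, otherwise prefix it with its parent.
simplify_info <- function(x, placeholders = c("Unknown", "NA")) {
  # as.character() guards against factor columns (the default before R 4.0)
  parts  <- strsplit(as.character(x), ";", fixed = TRUE)
  last   <- vapply(parts, function(p) p[length(p)], character(1))
  parent <- vapply(parts, function(p) p[length(p) - 1], character(1))
  # number of rows sharing each last component
  n_last <- as.vector(table(last)[last])
  keep_parent <- last %in% placeholders | n_last > 1
  ifelse(keep_parent, paste(parent, last, sep = ";"), last)
}

TestData <- data.frame(
  Information = c("Forrest;Trees;Unknown", "Forrest;Trees;Leaves",
                  "Forrest;Trees;Trunks", "Forrest;Shrubs;Unknown",
                  "Forrest;Shrubs;Branches", "Forrest;Shrubs;Leaves",
                  "Forrest;Shrubs;NA"),
  Data = c(5, 1, 3, 4, 2, 1, 3)
)
TestData$Clean <- simplify_info(TestData$Information)
```

Because it takes the last two components of each string, this sketch does not depend on the strings having exactly three levels, which may help if the real dataset is less regular than the sample.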
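Answer 1 (score: 0)
In general, storing several variables in a single column is not recommended, but tidyr's separate() combined with dplyr should get you most of the way. Note that the filter() step drops the "Unknown" and "NA" rows rather than keeping them as in the desired output: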
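library(dplyr)
library(tidyr)  # separate() comes from tidyr, not dplyr
TestData_filtered <- TestData %>%
  separate(Information, into = c("common", "TS", "BL"), sep = ";", remove = FALSE) %>%
  filter(!grepl("Unknown|NA", BL)) %>%
  mutate(wanted = paste(TS, BL, sep = ";"))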
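If all seven rows should be kept, as in the desired output from the question, the pipeline could be adapted along these lines. This is a sketch and not the answerer's code: the names `TestData_all` and `n_BL` are made up for illustration, and it assumes a dplyr version in which `add_count()` accepts a `name` argument.

```r
library(dplyr)
library(tidyr)

TestData <- data.frame(
  Information = c("Forrest;Trees;Unknown", "Forrest;Trees;Leaves",
                  "Forrest;Trees;Trunks", "Forrest;Shrubs;Unknown",
                  "Forrest;Shrubs;Branches", "Forrest;Shrubs;Leaves",
                  "Forrest;Shrubs;NA"),
  Data = c(5, 1, 3, 4, 2, 1, 3)
)

TestData_all <- TestData %>%
  separate(Information, into = c("common", "TS", "BL"), sep = ";", remove = FALSE) %>%
  # count how many rows share each last component, without dropping any rows
  add_count(BL, name = "n_BL") %>%
  # prefix with the parent when the last component is a placeholder or not unique
  mutate(wanted = if_else(BL %in% c("Unknown", "NA") | n_BL > 1,
                          paste(TS, BL, sep = ";"),
                          BL))
```

Here `add_count()` replaces the separate summarise-and-join step, so the row order and row count of the original data are left untouched.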