Summarising data strings in R

Time: 2017-02-02 16:40:46

Tags: r split

I have a large dataset in which some of the information I need is stored in the first column as semicolon-separated strings. For example:

TestData <- data.frame("Information" = c("Forrest;Trees;Unknown", "Forrest;Trees;Leaves", "Forrest;Trees;Trunks", "Forrest;Shrubs;Unknown", "Forrest;Shrubs;Branches", "Forrest;Shrubs;Leaves", "Forrest;Shrubs;NA"), "Data" = c(5,1,3,4,2,1,3))

which gives:

              Information Data
1   Forrest;Trees;Unknown    5
2    Forrest;Trees;Leaves    1
3    Forrest;Trees;Trunks    3
4  Forrest;Shrubs;Unknown    4
5 Forrest;Shrubs;Branches    2
6   Forrest;Shrubs;Leaves    1
7       Forrest;Shrubs;NA    3

I need to simplify the names so that I keep only the last name when it is unique and is not "Unknown" or "NA", so that my data frame becomes:

     Information Data
1  Trees;Unknown    5
2   Trees;Leaves    1
3         Trunks    3
4 Shrubs;Unknown    4
5       Branches    2
6  Shrubs;Leaves    1
7      Shrubs;NA    3

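2 answers:

Answer 0: (score: 1)

It may not be the most efficient or elegant solution, but it works on the sample data. Hopefully it is good enough for your needs as well: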

library(stringr)
library(dplyr)


TestData <- data.frame("Information" = c("Forrest;Trees;Unknown", "Forrest;Trees;Leaves", "Forrest;Trees;Trunks", "Forrest;Shrubs;Unknown", "Forrest;Shrubs;Branches", "Forrest;Shrubs;Leaves", "Forrest;Shrubs;NA"), "Data" = c(5,1,3,4,2,1,3))

# split text into 3 columns
TestData[3:5] <- str_split_fixed(TestData$Information, ";", 3)

# filter Unknown and NA values, count frequencies to determine unique values
a <- TestData %>%
  filter(!V5 %in% c("Unknown", "NA")) %>%
  group_by(V5) %>%
  summarise(count = n())

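# join the counts back to the original data
TestData <- TestData %>%
  left_join(a, by = "V5")

# keep only the last element when it is unique; otherwise prefix it with the middle element
TestData$Clean <- ifelse(TestData$count > 1 | is.na(TestData$count), paste0(TestData$V4, ";", TestData$V5), TestData$V5)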
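The same idea can also be sketched in base R, without extra packages. This is only an illustration under a few assumptions: every string has at least two semicolon-separated pieces, only the exact values "Unknown" and "NA" count as placeholders, and the helper names (parts, last, parent, placeholders, is_unique) are made up for this example.

```r
# Same sample data as in the question, kept as character strings.
TestData <- data.frame(
  Information = c("Forrest;Trees;Unknown", "Forrest;Trees;Leaves",
                  "Forrest;Trees;Trunks", "Forrest;Shrubs;Unknown",
                  "Forrest;Shrubs;Branches", "Forrest;Shrubs;Leaves",
                  "Forrest;Shrubs;NA"),
  Data = c(5, 1, 3, 4, 2, 1, 3),
  stringsAsFactors = FALSE
)

# Split each string on ";" and pick out the last two pieces.
# Assumption: every string has at least two pieces.
parts  <- strsplit(TestData$Information, ";", fixed = TRUE)
last   <- vapply(parts, function(p) p[length(p)], character(1))
parent <- vapply(parts, function(p) p[length(p) - 1], character(1))

# Count how often each last piece occurs across the whole column.
counts      <- table(last)
occurrences <- as.vector(counts[last])

# A last piece stands alone only if it occurs once and is not a placeholder.
# Assumption: only these two exact strings are placeholders.
placeholders <- c("Unknown", "NA")
is_unique    <- occurrences == 1 & !(last %in% placeholders)

TestData$Clean <- ifelse(is_unique, last, paste(parent, last, sep = ";"))
print(TestData[, c("Clean", "Data")])
```

On the sample data the Clean column should match the names requested in the question; on real data it is worth checking how strings with fewer than two pieces ought to be handled.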

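Answer 1: (score: 0)

In general, storing several variables in a single column is not recommended, but dplyr together with tidyr (which provides separate()) should do what you need: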

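library(dplyr)
library(tidyr)

# note: filter() drops the rows whose last element is "Unknown" or "NA"
TestData_filtered <- TestData %>%
  separate(Information, into = c("common", "TS", "BL"), remove = FALSE) %>%
  filter(!grepl("Unknown|NA", BL)) %>%
  mutate(wanted = paste(TS, BL, sep = ";"))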
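Because the filter() step removes the "Unknown" and "NA" rows, the result has fewer rows than the output requested in the question. A possible variant that keeps every row is sketched below. It assumes dplyr 0.8 or later (for the name argument of add_count()), and the names n_BL and TestData_clean are invented for this example.

```r
library(dplyr)
library(tidyr)

# Same sample data as in the question, kept as character strings.
TestData <- data.frame(
  Information = c("Forrest;Trees;Unknown", "Forrest;Trees;Leaves",
                  "Forrest;Trees;Trunks", "Forrest;Shrubs;Unknown",
                  "Forrest;Shrubs;Branches", "Forrest;Shrubs;Leaves",
                  "Forrest;Shrubs;NA"),
  Data = c(5, 1, 3, 4, 2, 1, 3),
  stringsAsFactors = FALSE
)

TestData_clean <- TestData %>%
  # split the string into three columns, keeping the original column
  separate(Information, into = c("common", "TS", "BL"), sep = ";", remove = FALSE) %>%
  # count how often each last element occurs (n_BL is a hypothetical name)
  add_count(BL, name = "n_BL") %>%
  # keep the last element alone only when it is unique and not a placeholder
  mutate(wanted = if_else(n_BL == 1 & !BL %in% c("Unknown", "NA"),
                          BL,
                          paste(TS, BL, sep = ";"))) %>%
  select(wanted, Data)

print(TestData_clean)
```

Using add_count() instead of filter() keeps all rows in their original order, so the result lines up with the Data column row for row.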