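I need to rephrase my question, because my original data frame did not include the other kinds of data, and those cause a lot of problems when splitting on whitespace. Really sorry!
Important: whitespace cannot be trusted anywhere in the data set, because it occurs erratically, even within the same type of data (type1a and type1b in my example).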
df <- data.table(v=c( " 555 OUT XYZ STR44W PASSED TRUE", #interesting data type1
" A 45 OUT XYW STR44W PASSED TRUE",
" 555 OUT XYZ STR55W PASSED TRUE",
" 6755 OUT XYZ 4444W PASSED TRUE",
" 75/850CC/PF ", #erratic data to be ignored
" BY HHU 56TT00 6 415 UP HHU 88H900 ", #interesting data type2
" 555 OUT WWWZ STR44W PASSED TRUE"))
Desired result:
T1 T2 T1_V1 T1_V2 T1_V3 T2_V1 T2_V2 T2_V3 T2_V4 T2_V5
1 0 555 XYZ STR44W NA NA NA NA NA
1 0 A 45 XYW STR44W NA NA NA NA NA
1 0 555 XYZ STR55W NA NA NA NA NA
1 0 6755 XYZ 4444W NA NA NA NA NA
0 0 NA NA NA NA NA NA NA NA
0 1 NA NA NA HHU 56TT00 6 415 HHU 88H900
1 0 555 NA STR44W NA NA NA NA NA
My attempt so far at the type1 data:
library(data.table)
df <- data.table(v = c("555 OUT XYZ STR44W PASSED TRUE", # Type1a
"45 OUT XYW STR44W PASSED TRUE", # Type1b
"555 OUT XYZ STR55W PASSED TRUE", # Type1a
"6755 OUT XYZ 4444W PASSED TRUE", # Type1a
"75 / 850CC / PF", # !! new erratic data
"BY HHU 56TT00 6 415 UP HHU 88H900", # Type2
"555 OUT WWWZ STR44W PASSED TRUE")) # Type1a
df$T1 <- 0
df$T1[grepl("PASSED TRUE", df$v)] <- 1
is_t1 <- df$T1 == 1 # rows flagged as type1
df$T1_V1 <- NA_character_ # create the column before filling the type1 rows
df$T1_V1[is_t1] <- gsub("\\OUT.*", "", df$v[is_t1]) # Getting rid of everything after "OUT"
df$T1_V2 <- NA_character_
df$T1_V2[is_t1] <- gsub(".*\\OUT", "", df$v[is_t1]) # Getting rid of everything before "OUT"
df$T1_V2 <- gsub("\\PASSED.*", "", df$T1_V2) # Getting rid of everything after "PASSED"
df$T1_V2 <- strsplit(df$T1_V2, "[[:blank:]*]") # Separation of the two relevant strings by strsplit
df$T1_V2 <- lapply(df$T1_V2, head) # head() defaults to n = 6, so this keeps every piece
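Because the spacing is unreliable, one alternative to chaining gsub calls is to describe a whole type1 record with a single regular expression and pull out its capture groups. The following is only a minimal sketch: the helper extract_groups and the pattern pat_t1 are hypothetical names, the irregular spacing in the sample strings is made up, and the pattern assumes that a type1 line always reads <V1> OUT <V2> <V3> PASSED TRUE. It relies on regexec(perl = TRUE), which needs R 3.3.0 or later.

```r
# Sample strings modelled on the question; the irregular spacing is made up.
v <- c("  555   OUT  XYZ STR44W   PASSED TRUE",
       " A 45 OUT XYW   STR44W PASSED TRUE",
       " 75/850CC/PF ",
       " BY HHU 56TT00 6 415 UP HHU 88H900 ")

# Assumed shape of a type1 line: <V1> OUT <V2> <V3> PASSED TRUE,
# with any amount of whitespace between the pieces.
pat_t1 <- "^\\s*(.+?)\\s+OUT\\s+(\\S+)\\s+(\\S+)\\s+PASSED\\s+TRUE\\s*$"

# Hypothetical helper: returns one row per string and one column per
# capture group; strings that do not match yield a row of NA.
extract_groups <- function(x, pattern, n) {
  hits <- regmatches(x, regexec(pattern, x, perl = TRUE))
  t(vapply(hits, function(h) {
    if (length(h) == 0) rep(NA_character_, n) else h[-1]
  }, character(n)))
}

t1 <- extract_groups(v, pat_t1, 3)
res <- data.frame(T1 = as.integer(!is.na(t1[, 1])),
                  T1_V1 = t1[, 1], T1_V2 = t1[, 2], T1_V3 = t1[, 3],
                  stringsAsFactors = FALSE)
res
```

Type2 lines could presumably be handled the same way with a second pattern anchored on BY and UP, once it is decided which of the tokens between them belong together.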
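Old question:
First post; I did my best to find an answer and to prepare my question.
I need to clean a nasty string that contains a lot of irregular whitespace. I am trying to get the first chunk before "OUT", and the second and third chunks between "OUT" and "PASSED". Afterwards, the data should be checked against a list to verify that v4 is valid.
Using strsplit and afterwards head/tail does not work. I would greatly appreciate any help. Thanks a lot in advance!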
library(data.table)
df <- data.table(v=c(" 555 OUT XYZ STR44W PASSED TRUE",
" A 45 OUT XYW STR44W PASSED TRUE",
" 555 OUT XYZ STR55W PASSED TRUE",
" 6755 OUT XYZ 4444W PASSED TRUE",
" 555 OUT WWWZ STR44W PASSED TRUE"))
control <-data.table(control=c("XYZ","PPO","XMX","WWWZ"))
df$v1 <-gsub("\\OUT.*","",df$v) #Getting rid of everything after "OUT"
df$v2<-gsub(".*\\OUT","",df$v) #Getting rid of everything before "OUT"
df$v2 <-gsub("\\PASSED.*","",df$v2) #Getting rid of everything after "PASSED"
df$v2<-strsplit(df$v2, "[[:blank:]*]") # Separation of the two relevant strings by strsplit
df$v3<- lapply(df$v2, head) #Taking the first element from the stringsplit
df$v4<- lapply(df$v2, head,2) #Taking the second element from the stringsplit
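After running this, RStudio shows c("", "XYZ") for v4. The first element seems to be an empty element? I cannot continue working with that expression: neither by checking it directly against my control list (fail 1), nor by converting it (fail 2), nor by using unlist (fail 3).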
#fail#1
df$v4[!(df$v4 %in% control$control)] <- NA
#fail#2
df$v4 <- as.character(df$v4)
#fail3
df$v4 <- unlist(df$v4)
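The empty first element comes from strsplit itself: when a string starts with the delimiter, the first piece of the result is "". In addition, v4 is a list column whose entries are length-2 vectors, which is likely why the comparison against the control values never matches. A minimal sketch of one way around both issues, assuming illustrative strings that stand in for df$v2 after the gsub steps (the vector names here are hypothetical):

```r
# Illustrative stand-ins for df$v2 after the gsub steps (spacing is made up).
v2 <- c(" XYZ STR44W ", "  XYW   STR44W ", " WWWZ STR44W ")
control <- c("XYZ", "PPO", "XMX", "WWWZ")

# A leading delimiter produces an empty first piece.
first_try <- strsplit(v2[1], "[[:blank:]]")[[1]]

# Trim first, then split on runs of whitespace so no empty pieces appear.
pieces <- strsplit(trimws(v2), "[[:space:]]+")

# Pull the first piece of every row into a plain character vector,
# which can then be compared against the control values.
v4 <- vapply(pieces, `[`, character(1), 1)
v4[!(v4 %in% control)] <- NA
v4
```

Keeping v4 as an atomic character vector rather than a list column is what makes the %in% check behave as intended here.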
Answer 0 (score: 0)
This works for your current data and makes it tidier.
library(data.table)
df <- data.table(v=c(" 555 OUT XYZ STR44W PASSED TRUE",
" A 45 OUT XYW STR44W PASSED TRUE",
" 555 OUT XYZ STR55W PASSED TRUE",
" 6755 OUT XYZ 4444W PASSED TRUE",
" 555 OUT WWWZ STR44W PASSED TRUE"))
control <-data.table(control=c("XYZ","PPO","XMX","WWWZ"))
df$v1 <-gsub("\\OUT.*","",df$v) #Getting rid of everything after "OUT"
df$v2<-gsub(".*\\OUT","",df$v) #Getting rid of everything before "OUT"
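Trim the whitespace and split on spaces, then cbind the result to the current df. After that we can rename the columns so they are easier to navigate.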
lists <- strsplit(trimws(df$v2), " ")
extra <- data.frame(do.call(rbind, lists))
newdf <- cbind(df, extra)
colnames(newdf) <- c("full string", paste0("piece_", 1:6))
newdf
   full string                      piece_1  piece_2                  piece_3  piece_4  piece_5  piece_6
1: 555 OUT XYZ STR44W PASSED TRUE   555      XYZ STR44W PASSED TRUE   XYZ      STR44W   PASSED   TRUE
2: A 45 OUT XYW STR44W PASSED TRUE  A 45     XYW STR44W PASSED TRUE   XYW      STR44W   PASSED   TRUE
3: 555 OUT XYZ STR55W PASSED TRUE   555      XYZ STR55W PASSED TRUE   XYZ      STR55W   PASSED   TRUE
4: 6755 OUT XYZ 4444W PASSED TRUE   6755     XYZ 4444W PASSED TRUE    XYZ      4444W    PASSED   TRUE
5: 555 OUT WWWZ STR44W PASSED TRUE  555      WWWZ STR44W PASSED TRUE  WWWZ     STR44W   PASSED   TRUE
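Note that do.call(rbind, ...) only gives a clean result when every row splits into the same number of pieces; with rows of different lengths rbind recycles values and warns. A minimal sketch of padding the shorter rows with NA first, assuming made-up input strings and hypothetical variable names:

```r
# Made-up inputs: one regular row and one erratic row with fewer pieces.
x <- c("  XYZ   STR44W PASSED TRUE", " 75/850CC/PF ")

# Trim, then split on runs of whitespace instead of a single space.
lists <- strsplit(trimws(x), "[[:space:]]+")

# Pad every row with NA up to the length of the longest one.
width <- max(lengths(lists))
padded <- lapply(lists, function(p) c(p, rep(NA_character_, width - length(p))))

# Now all rows have equal length and can be bound into a data frame.
extra <- data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
extra
```

Whether padding on the right is appropriate depends on the data; for erratic rows it may be better to filter them out before splitting.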
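Answer 1 (score: 0)
I don't fully understand what your final result should look like. You don't need to use gsub at every step. You can split everything by space and then select the columns that need further processing.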
library(data.table) # data.table()
library(dplyr) # %>%, select() and filter()
library(splitstackshape) # cSplit function
df_selected <- df %>% cSplit("v", " ") %>% select(v_1,v_3,v_4,v_6)
control <-data.table(control=c("XYZ","PPO","XMX","WWWZ"))
filter(df_selected, v_3 %in% control$control) # keep rows whose v_3 is in the control list
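Selecting columns by position assumes that every row splits into the same number of tokens. A row such as "A 45 OUT ..." has one token more, so the code would no longer sit in v_3 there. A minimal base R sketch that instead takes the token directly after "OUT", wherever it falls, and checks it against the control values (the variable names code and keep are hypothetical, and the sketch assumes "OUT" occurs once per line):

```r
v <- c(" 555 OUT XYZ STR44W PASSED TRUE",
       " A 45 OUT XYW STR44W PASSED TRUE",
       " 555 OUT WWWZ STR44W PASSED TRUE")
control <- c("XYZ", "PPO", "XMX", "WWWZ")

# Capture the first run of non-whitespace characters after "OUT".
# Lines without "OUT" are returned unchanged by sub() and so will
# simply fail the control check below.
code <- sub("^.*OUT[[:space:]]+([^[:space:]]+).*$", "\\1", v)

# Keep only the lines whose code appears in the control list.
keep <- v[code %in% control]
keep
```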