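I'm trying to read some files in which certain lines contain extra semicolons inside the text strings (I don't know what causes this).
As an example, here is a heavily simplified dataset with the same problem: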
bad_data <- "100; Mc Donalds; Seattle; normal day
115; Starbucks; Boston; normal day
400; PF Chang; Chicago; busy day
400;; Texas; busy day
10; D;unkin Donuts; Washin;gton; lazy day"
It has no header, so I tried reading it with:
library(data.table)
fread(bad_data, sep = ";", header = F, na.strings = c("", NA), strip.white = T)
But no luck... These lines are close to impossible to parse, so if there is no clean solution I would like to simply skip them.
Answer 0: (score: 1)
If you only want to drop the lines that don't contain the expected number of separators:
library(stringi)
library(magrittr)
bad_data <-
"100; Mc Donalds; Seattle; normal day
115; Starbucks; Boston; normal day
400; PF Chang; Chicago; busy day
400;; Texas; busy day
10; D;unkin Donuts; Washin;gton; lazy day"
# split to lines. you could also use readLines if it's coming from a file
text_lines <- unlist(strsplit(bad_data, '\n'))
# which lines contain the expected number of semicolons?
good_lines <- sapply(text_lines, function(x) stri_count_fixed(x, ';') == 3)
# for those lines, split to vectors and (optional bonus) trim whitespace
good_vectors <- lapply(
text_lines[good_lines],
function(x) x %>% strsplit(';') %>% unlist %>% trimws)
# flatten to matrix (from which you can make a data.frame or whatever you want)
my_mat <- do.call(rbind, good_vectors)
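As a follow-up to the filtering idea above, here is a minimal sketch that stays in base R plus data.table and returns a data.table instead of a matrix. It assumes every valid line has exactly three semicolons, and that your data.table version supports the `text` argument of `fread`; the names `n_sep` and `clean` are only illustrative.

```r
library(data.table)

bad_data <- "100; Mc Donalds; Seattle; normal day
115; Starbucks; Boston; normal day
400; PF Chang; Chicago; busy day
400;; Texas; busy day
10; D;unkin Donuts; Washin;gton; lazy day"

# split into lines (use readLines() instead when reading from a file)
text_lines <- strsplit(bad_data, "\n", fixed = TRUE)[[1]]

# count the semicolons on each line; regmatches() returns an empty
# vector for lines without any match, so lengths() gives 0 there
n_sep <- lengths(regmatches(text_lines, gregexpr(";", text_lines, fixed = TRUE)))

# keep only the lines with the expected number of separators (assumed: 3)
good_lines <- text_lines[n_sep == 3]

# let fread parse the surviving lines
clean <- fread(text = good_lines, sep = ";", header = FALSE,
               na.strings = "", strip.white = TRUE)
```

With the sample data this keeps the first four lines and skips the last one, which has five semicolons.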
Answer 1: (score: 1)
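You could try removing all semicolons that sit inside the text strings (this assumes every unwanted semicolon is entirely inside a string, with non-whitespace characters on both sides):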
gsub("(\\S);(\\S)", "\\1\\2", bad_data, perl=TRUE)
[1] "100; Mc Donalds; Seattle; normal day\n 115; Starbucks; Boston; normal day\n 400; PF Chang; Chicago; busy day\n 400; Texas; busy day\n 10; Dunkin Donuts; Washington; lazy day"
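Note that in the output above the pattern also collapses the `;;` on the fourth line into a single `;`, so that row ends up with one field fewer. A possible variation, offered as a sketch rather than a definitive fix, excludes semicolons from both character classes so that empty fields survive. It still assumes that real separators are always followed by whitespace or another semicolon; `fixed_lines` and `clean` are illustrative names.

```r
library(data.table)

bad_data <- "100; Mc Donalds; Seattle; normal day
115; Starbucks; Boston; normal day
400; PF Chang; Chicago; busy day
400;; Texas; busy day
10; D;unkin Donuts; Washin;gton; lazy day"

# drop a semicolon only when both neighbours are neither whitespace
# nor another semicolon, so the empty field in "400;;" is left alone
fixed <- gsub("([^;\\s]);([^;\\s])", "\\1\\2", bad_data, perl = TRUE)

# split into lines and hand them to fread
fixed_lines <- strsplit(fixed, "\n", fixed = TRUE)[[1]]
clean <- fread(text = fixed_lines, sep = ";", header = FALSE,
               na.strings = "", strip.white = TRUE)
```

On the sample data all five rows are kept, and the last row reads as Dunkin Donuts and Washington.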