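I have 24 data files (bsls). Each file contains a fixed number of rows but a variable number of columns (sites). I have a clean list of 23 sites, but it cannot be matched exactly, because the column name associated with each site contains additional information.
I have read these files into R with the following code: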
library(gdata)  # provides read.xls()

# list files from dir and read, skipping rows until 'Q Num'
temp <- list.files() # e.g. info-stuff-nameofbsl-otherStuff.csv
# read.xls and strip bsl name from file and assign as object name
for (i in temp) {
  assign(unlist(strsplit(i, split = "-", fixed = TRUE))[3],
         read.xls(i, pattern = "Q Num"))
}
# create list of data frames (24 bsls)
bsls <- Filter(function(x) is(x, "data.frame"), mget(ls()))
# clean list of site names
sites <- c("NewYork", "London", "Sydney", "Paris", "Manchester", "Angers",
           "Venice", "Bangkok", "Glasgow", "Boston", "Perth", "Canberra",
           "Lyons", "Washington", "Milan", "Cardiff", "Dublin", "Frankfurt",
           "Ottawa", "Toronto", "El.Salvador", "Taltal", "Caldera")
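As a side note, the assign()/mget(ls()) round trip can be avoided by reading straight into a named list. A minimal sketch, assuming the file names follow the info-stuff-nameofbsl-otherStuff.csv pattern from the comment above. bsl_name_from_file and read_bsls are hypothetical helper names, and the reader function is passed in so the sketch does not depend on gdata:

```r
# Hypothetical helper: take the third "-"-separated field of each file name.
bsl_name_from_file <- function(path) {
  vapply(strsplit(basename(path), "-", fixed = TRUE),
         function(parts) parts[3], character(1))
}

# Hypothetical helper: read every file into one named list of data frames.
# `reader` is whatever function reads a single file, for example
# function(f) gdata::read.xls(f, pattern = "Q Num")
read_bsls <- function(files, reader) {
  setNames(lapply(files, reader), bsl_name_from_file(files))
}

example_files <- c("info-stuff-unit-otherStuff.csv",
                   "info-stuff-team-otherStuff.csv")
bsl_name_from_file(example_files)
```

Keeping the data frames in a list from the start also means that unrelated data frames in the workspace cannot end up in bsls by accident.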
Example of the first 3 rows of 1 of the 24 bsls datasets, e.g. BSL1:
QNum, QuestionText, % unrelatedCol, NewYork_Other_info, London_some_other_info, Venice_other_diff_info,
q17a, question?, 74%, 69%, 81%, 76%,
q17b, Another question?, 72%, 73%, 77%, 74%,
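Given column names like those in the sample above, the columns for one site could be picked by prefix. A minimal sketch, assuming every site column starts with the site name followed by an underscore. site_columns is a hypothetical helper name and the data frame is a toy stand-in for one bsl:

```r
# Toy stand-in for one bsl, using the column names from the sample above.
bsl1 <- data.frame(
  QNum = c("q17a", "q17b"),
  QuestionText = c("question?", "Another question?"),
  NewYork_Other_info = c("69%", "73%"),
  London_some_other_info = c("81%", "77%"),
  Venice_other_diff_info = c("76%", "74%"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: names of the columns that belong to `site`.
# startsWith() compares fixed strings, so the "." in "El.Salvador" is not
# treated as a regex wildcard, and a site name that only occurs inside the
# extra information does not match.
site_columns <- function(df, site) {
  names(df)[startsWith(names(df), paste0(site, "_"))]
}

site_columns(bsl1, "London")
```

A site with no column in a given bsl simply yields an empty character vector, so no special case is needed for bsls that lack a site.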
The result I need is a .csv file for each of the 23 sites, containing all of the columns for that site found across the 24 data files (bsls).
My current attempt:
for(site in sites){ #for each site
  assign(site, data.frame()) #create empty data frame to add vectors to
  for(bsl in dfs){ #for each dataset
    if (grepl(site, colnames(bsl))){ #substring match
      next #go back to for loop
    }
    assign(site$bsl, bsl[,grepl("site", colnames(bsl))]) #assign column to dataframe
  }
}
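Two details in the attempt above are worth illustrating: grepl() returns one logical value per column name, so it cannot serve directly as an if() condition (recent R versions raise an error for a condition of length greater than one), and "site" in quotes is the literal string rather than the loop variable. A small sketch on made-up column names:

```r
cols <- c("QNum", "NewYork_Other_info", "London_some_other_info")
site <- "London"

hits <- grepl(site, cols)    # one TRUE/FALSE per column name
any(hits)                    # a single TRUE/FALSE, usable inside if()

# The quoted string "site" is searched for literally and matches nothing here.
any(grepl("site", cols))

cols[hits]                   # the column names that mention the site
```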
The solution would look something like this, e.g. London.csv:
QNum, QuestionText, BSLname1_Other_info, BSLname2_some_other_info, BSL5other_diff_info,
q17a, question?, 74%, 69%, 81%, 76%,
q17b, Another question?, 72%, 73%, 77%, 74%,
There will be 23 files, one per site, each containing the columns from the 24 input bsl files that relate to that site.
Edit - it is worth noting that the bsls are not actually called bsl1, bsl2, etc.; each one is a unique string, e.g. unit, section, team, and so on.
Answer 0: (score: 0)
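library(dplyr)
library(tidyr)

bind_rows(bsls, .id = "bsl") %>%
  # stack every site column into variable/value pairs
  gather(variable, value,
         matches(paste(sites, collapse = "|")),
         na.rm = TRUE) %>%
  # "London_some_other_info" -> site = "London", new_variable = "some_other_info"
  separate(variable, c("site", "new_variable"),
           sep = "_", extra = "merge") %>%
  # prefix each variable with the bsl it came from
  unite(final_variable, bsl, new_variable, sep = "_") %>%
  spread(final_variable, value) %>%
  group_by(site) %>%
  # do() must return a data frame, so hand the group back after writing it
  do({
    write.csv(., paste0("site_", first(.$site), ".csv"), row.names = FALSE)
    .
  })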
Answer 1: (score: 0)
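The following code finally solved my problem. I first had to break the original problem down further by renaming all of the columns in the bsls list of data frames before the for loop. This was so that each site column records which bsl it came from - the renaming logic can be found here.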
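The linked renaming logic is not reproduced in this post. Purely as an illustration of what such a renaming step could look like, here is a sketch that assumes bsls is a named list and that site columns start with the site name followed by an underscore. tag_columns_with_bsl is a hypothetical helper, not the code behind the link:

```r
# Hypothetical helper: turn "London_some_info" into "London_<bsl>_some_info"
# for every site column, leaving other columns (QNum, ...) untouched.
tag_columns_with_bsl <- function(df, bsl_name, sites) {
  for (site in sites) {
    prefix <- paste0(site, "_")
    is_site <- startsWith(names(df), prefix)
    if (any(is_site)) {
      rest <- substring(names(df)[is_site], nchar(prefix) + 1)
      names(df)[is_site] <- paste0(prefix, bsl_name, "_", rest)
    }
  }
  df
}

# Toy data: two bsls called "unit" and "team" (made-up values).
bsls_toy <- list(
  unit = data.frame(QNum = "q17a", London_some_info = "81%"),
  team = data.frame(QNum = "q17a", London_more_info = "70%", Paris_x = "66%")
)
bsls_toy <- Map(tag_columns_with_bsl, bsls_toy, names(bsls_toy),
                MoreArgs = list(sites = c("London", "Paris")))
names(bsls_toy$team)
```

With the bsl name embedded in each column name, the columns stay distinguishable after they are bound together into one file per site.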
Loop solution
# this loop writes one file per site
for (site in sites) {
  # start the new file with the two question columns (QNum, QuestionText);
  # taken from the first bsl here, assuming every bsl has the same rows
  newfile <- data.frame(bsls[[1]][, 1:2], stringsAsFactors = FALSE)
  # search for columns in bsls relating to site
  for (bsl in bsls) {
    colids <- grepl(site, colnames(bsl))
    cols <- bsl[, colids, drop = FALSE]
    newfile <- cbind(newfile, cols)
  }
  filename <- paste0("Site ", site, ".csv")
  write.csv(newfile, file = filename, row.names = FALSE)
}
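To check the loop logic without the real files, it can be run on toy data. A sketch with two made-up bsls (values and column names are invented for illustration), written to a temporary directory with base write.csv() and read back:

```r
sites_toy <- c("London", "Paris")
bsls_toy <- list(
  unit = data.frame(QNum = c("q17a", "q17b"),
                    QuestionText = c("question?", "Another question?"),
                    London_unit_info = c("81%", "77%"),
                    Paris_unit_info = c("60%", "61%"),
                    stringsAsFactors = FALSE),
  team = data.frame(QNum = c("q17a", "q17b"),
                    QuestionText = c("question?", "Another question?"),
                    London_team_info = c("70%", "72%"),
                    stringsAsFactors = FALSE)
)

out_dir <- tempdir()
for (site in sites_toy) {
  # start from the two question columns, then append the matching columns
  newfile <- bsls_toy[[1]][, 1:2]
  for (bsl in bsls_toy) {
    newfile <- cbind(newfile, bsl[, grepl(site, colnames(bsl)), drop = FALSE])
  }
  write.csv(newfile, file.path(out_dir, paste0("Site ", site, ".csv")),
            row.names = FALSE)
}

# Read one result back to inspect its columns.
london <- read.csv(file.path(out_dir, "Site London.csv"),
                   stringsAsFactors = FALSE)
names(london)
```

The drop = FALSE matters here: without it, a bsl with exactly one matching column would be reduced to a bare vector and lose its column name in cbind().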