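I am trying to read in multiple CSVs whose headers start on different rows, and then map them into one data frame. I tried the code provided in the question linked here, but I could not get the function to work.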
Read CSV into R based on where header begins
Here are two example data frames:
file1 <- structure(list(X..Text = c("# Text", "#", "agency_cd", "5s",
"USGS", "USGS"), X = c("", "", "site_no", "15s", "4294000", "4294000"
), X.1 = c("", "", "datetime", "20d", "6/24/13 0:00", "6/24/13 0:15"
), X.2 = c("", "", "tz_cd", "6s", "EDT", "EDT"), X.3 = c("",
"", "Gage height", "14n", "1.63", "1.59"), X.4 = c("", "", " Discharge",
"14n", "1310", "1250")), class = "data.frame", row.names = c(NA,
-6L))
file2 <- structure(list(X..Text = c("# Text", "# Text", "#", "agency_cd",
"5s", "USGS", "USGS"), X = c("", "", "", "site_no", "15s", "4294002",
"4294002"), X.1 = c("", "", "", "datetime", "20d", "6/24/13 0:00",
"6/24/13 0:15"), X.2 = c("", "", "", "tz_cd", "6s", "EDT", "EDT"
), X.3 = c("", "", "", "Gage height", "14n", "1.63", "1.59"),
X.4 = c("", "", "", " Discharge", "14n", "1310", "1250")), class =
"data.frame", row.names = c(NA,
-7L))
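I would like to use a solution similar to the one in the linked question above, although I also need to skip the row after the header (the header row is the row that begins with "agency_cd"), and then do something like the following to bind all the CSVs into one data frame with the file name as a column: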
# Path to the data
data_path <- "Data/folder1/folder2"
# Bind all files together to form one data frame
discharge <-
  # Find all file names ending in CSV in all subfolders
  dir(data_path, pattern = "*.csv", recursive = TRUE) %>%
  # Create a dataframe holding the file names
  data_frame(filename = .) %>%
  # Read in all CSV files into a new data frame,
  # Create a new column with the filenames
  mutate(file_contents = map(filename, ~ read_csv(file.path(data_path, .), col_types = cols(.default = "c")))
  ) %>%
  # Unpack the list-columns to make a useful data frame
  unnest()
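If I use the example function provided in the linked question above: A) I cannot get the header_begins line to give me a vector, and B) I do not know how to incorporate that function into the read_csv call above.
First, I tried this using the solution from the linked question: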
# Function
detect_header_line <- function(file_names, column_name) {
  header_begins <- NULL
  for(i in 1:length(file_names)){
    lines_read <- readLines(file_names[i], warn=F)
    header_begins[i] <- grep(column_name, lines_read)
  }
}
# Path to the data
data_path <- "Data/RACC_2012-2016/discharge"
# Get all CSV file names
file_names = dir(data_path, pattern = "*.csv", recursive = TRUE)
# Get beginning rows of each CSV file
header_begins <- detect_header_line(file.path(data_path, file_names), 'agency_cd')
But the header_begins vector is empty. And even if I can solve that, I still need help incorporating it into the code above.
Thank you very much for your help!
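A likely reason the vector comes back empty: the last expression evaluated inside detect_header_line is the for loop, which evaluates to NULL, so the function never returns header_begins. The following is a minimal sketch of a corrected version, assuming the first line that matches column_name is the header; the two temporary files are made up for illustration and stand in for the real CSVs.

```r
# Return the line number of the first line containing `column_name` in each file
# (NA if a file has no such line).
detect_header_line <- function(file_names, column_name) {
  vapply(file_names, function(f) {
    lines_read <- readLines(f, warn = FALSE)
    hits <- grep(column_name, lines_read, fixed = TRUE)
    if (length(hits) == 0) NA_integer_ else hits[1]
  }, integer(1), USE.NAMES = FALSE)
}

# Hypothetical demo files with the header on different lines
tmp1 <- tempfile(fileext = ".csv")
tmp2 <- tempfile(fileext = ".csv")
writeLines(c("# Text", "#", "agency_cd,site_no", "5s,15s", "USGS,4294000"), tmp1)
writeLines(c("# Text", "# Text", "#", "agency_cd,site_no", "5s,15s", "USGS,4294002"), tmp2)

header_begins <- detect_header_line(c(tmp1, tmp2), "agency_cd")
header_begins
```

Using vapply() instead of growing a vector inside a loop makes the return value explicit and guarantees one integer per file.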
Answer 0 (score: 1)
Using file1 as shown in the question, convert it to lines of text in Lines1, then read it with read.table as shown. Proceed the same way for file2.
Lines1 <- capture.output(write.table(file1, stdout(), row.names = FALSE, quote = FALSE))
ix <- grep("agency", Lines1) # line number of header
DF1 <- read.table(text = Lines1[-c(seq_len(ix-1), ix+1)], header = TRUE)
giving:
> DF1
agency_cd site_no datetime tz_cd Gage height Discharge
1 USGS 4294000 6/24/13 0:00 EDT 1.63 1310
2 USGS 4294000 6/24/13 0:15 EDT 1.59 1250
Fixed.
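The same idea can be applied to a file on disk: read the lines, locate the header, drop everything before it plus the format row right after it, and parse what remains. The sketch below assumes comma-separated files; read_after_header and the temporary file are hypothetical names used only for illustration.

```r
# Read a CSV whose header starts at the first line containing `header_pattern`,
# skipping the lines before the header and the single format row after it.
read_after_header <- function(path, header_pattern = "agency_cd") {
  lines <- readLines(path, warn = FALSE)
  ix <- grep(header_pattern, lines, fixed = TRUE)[1]  # line number of header
  stopifnot(!is.na(ix))
  keep <- lines[-c(seq_len(ix - 1), ix + 1)]
  read.csv(text = keep, stringsAsFactors = FALSE, check.names = FALSE)
}

# Hypothetical demo file shaped like file1 from the question
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "# Text,,,,,",
  "#,,,,,",
  "agency_cd,site_no,datetime,tz_cd,Gage height,Discharge",
  "5s,15s,20d,6s,14n,14n",
  "USGS,4294000,6/24/13 0:00,EDT,1.63,1310",
  "USGS,4294000,6/24/13 0:15,EDT,1.59,1250"
), tmp)

DF <- read_after_header(tmp)
str(DF)
```

Because the format row is removed before parsing, the numeric columns are read as numbers directly, and check.names = FALSE keeps the "Gage height" column name unchanged.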
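Answer 1 (score: 0)
Here is a base R solution that wraps the two steps, finding the header row and then reading the file, in a loop so that a whole directory of files can be processed.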
#define column names
#columnnames<-c("agency_cd","site_no", "datetime", "tz_cd", "Gage height", "Discharge")
#find files that match pattern
fname<-dir( pattern = "file[0-9]\\.csv")
#loop and read all files
dfs<-lapply(fname, function(f) {
  #find header row
  headerline<-grep("agency_cd", readLines(f))
  #read data with header row and following row
  #by reading the header row bind will align the columns
  df<- read.csv(f, skip=headerline-1, stringsAsFactors = FALSE)
})
finalanswer<-do.call(rbind, dfs)
> finalanswer
# agency_cd site_no datetime tz_cd Gage.height Discharge
# 5s 15s 20d 6s 14n 14n
# USGS 4294000 6/24/13 0:00 EDT 1.63 1310
# USGS 4294000 6/24/13 0:15 EDT 1.59 1250
# 5s 15s 20d 6s 14n 14n
# USGS 4294002 6/24/13 0:00 EDT 1.63 1310
# USGS 4294002 6/24/13 0:15 EDT 1.59 1250
You now need to remove the rows that do not contain USGS and then convert the columns from character to numeric.
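One way that cleanup step might look in base R is sketched below. The toy data frame is a hand-built stand-in for finalanswer (it is not read from real files), and the choice of which columns to convert is an assumption based on the output shown above.

```r
# Toy stand-in for `finalanswer`: every column is character because the
# format row ("5s", "15s", ...) was read along with the data.
finalanswer <- data.frame(
  agency_cd   = c("5s", "USGS", "USGS", "5s", "USGS", "USGS"),
  site_no     = c("15s", "4294000", "4294000", "15s", "4294002", "4294002"),
  Gage.height = c("14n", "1.63", "1.59", "14n", "1.63", "1.59"),
  Discharge   = c("14n", "1310", "1250", "14n", "1310", "1250"),
  stringsAsFactors = FALSE
)

# Keep only the data rows
cleaned <- finalanswer[finalanswer$agency_cd == "USGS", ]
rownames(cleaned) <- NULL

# Convert the measurement columns from character to numeric
num_cols <- c("Gage.height", "Discharge")
cleaned[num_cols] <- lapply(cleaned[num_cols], as.numeric)

str(cleaned)
```

Filtering before converting avoids the "NAs introduced by coercion" warning that as.numeric() would raise on values such as "14n".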
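Note the "\\." in the dir function. A dot has a special meaning in regular expressions: it matches any character. To make the dot match only a literal period, escape it with a double backslash in R.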
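The difference is easy to see with grepl() on a few made-up file names (the names below are purely illustrative):

```r
# Hypothetical file names; the last one has no period before "csv"
candidates <- c("file1.csv", "file2.csv", "file3xcsv")

# Unescaped dot: matches any character, so "file3xcsv" also matches
unescaped <- grepl("file[0-9].csv", candidates)

# Escaped dot: matches only a literal period
escaped <- grepl("file[0-9]\\.csv", candidates)

data.frame(candidates, unescaped, escaped)
```

With the unescaped pattern all three names match; with the escaped pattern only the two real .csv names do.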
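Answer 2 (score: 0)
I found two solutions. The first uses most of @Dave2e's solution, but instead of using do.call(rbind, dfs) to bind all the dfs into one, I used dplyr::bind_rows(). do.call(rbind, dfs) did not work because my header column names sometimes differ slightly, which caused this error: Error in match.names(clabs, names(xi)) : names do not match previous names. dplyr::bind_rows() is more flexible with differing column names. I also used readr::read_csv instead of read.csv out of personal preference.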
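To illustrate the difference, here is a small sketch with two hand-made one-row data frames whose column names differ only in capitalization. The data frames are invented for the demonstration, and coalesce() is shown as one possible alternative to an ifelse()/is.na() merge of the two columns.

```r
library(dplyr)

# Two toy data frames whose gage-height columns are spelled differently
a <- data.frame(agency_cd = "USGS", `Gage Height` = "1.63",
                check.names = FALSE, stringsAsFactors = FALSE)
b <- data.frame(agency_cd = "USGS", `Gage height` = "1.59",
                check.names = FALSE, stringsAsFactors = FALSE)

# Base rbind() stops because the column names differ; capture the message
rbind_result <- tryCatch(rbind(a, b), error = function(e) conditionMessage(e))

# bind_rows() keeps both columns and fills the gaps with NA
combined <- bind_rows(a, b)

# Merge the two spellings into a single vector, taking the first non-NA value
gage_height <- coalesce(combined[["Gage Height"]], combined[["Gage height"]])
```

After bind_rows() the result has three columns, and each row has NA in whichever spelling its source file did not use.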
# First solution using most of @Dave2e's solution
library(tidyverse)
# Path to the data
data_path <- "Data/RACC_2012-2016/discharge"
# Get all CSV file names (the pattern is a regular expression, not a glob)
file_names <- dir(data_path, pattern = "\\.csv$", recursive = TRUE)
# Loop and read all files
dfs <- lapply(file.path(data_path, file_names), function(f) {
  # Find header row
  headerline <- grep("agency_cd", readLines(f))
  # Read from the header row onward, keeping every column as character;
  # reading the real header lets bind_rows() align the columns by name
  read_csv(f, col_types = cols(.default = "c"), skip = headerline - 1)
}) %>%
  # Bind all into one data frame
  bind_rows() %>%
  # Drop the format row below the header row, which does not contain data
  dplyr::filter(agency_cd != "5s") %>%
  # Combine "Gage Height" and "Gage height" columns into one
  # First rename the columns to make them easier to call
  rename(Gage_height = "Gage Height", Gage_height2 = "Gage height") %>%
  mutate(Gage_height = ifelse(is.na(Gage_height), Gage_height2, Gage_height)) %>%
  select(-Gage_height2)
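The second solution does the same thing as the first, except that it also lets me add the original file name as a column in the final data frame. Instead of lapply as above, I used purrr::map, and I also used the fs package to handle the file paths.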
# Second solution
library(tidyverse)
library(fs)
# Path to the data
data_path <- "Data/RACC_2012-2016/discharge"
# Bind all files together to form one data frame
discharge <-
  # Find all file names ending in CSV in all subfolders
  # (regexp is a regular expression, not a glob)
  fs::dir_ls(data_path, regexp = "\\.csv$", recursive = TRUE) %>%
  # Create a data frame holding the file names
  tibble(filename = .) %>%
  # Read in all CSV files into a new list-column,
  # keeping the file names in their own column
  mutate(file_contents = map(filename,
    # dir_ls() already returns full paths, so no file.path() is needed here.
    # Force all columns to be read as character because the typecasting was causing problems.
    # skip = grep("agency_cd", readLines(.)) - 1 starts reading at the header row
    ~ read_csv(., col_types = cols(.default = "c"), skip = grep("agency_cd", readLines(.)) - 1))
  ) %>%
  # Unpack the list-column to make a useful data frame
  unnest(file_contents) %>%
  # Drop the format row below the header row, which does not contain data
  dplyr::filter(agency_cd != "5s") %>%
  # Combine "Gage Height" and "Gage height" columns into one
  # First rename the columns to make them easier to call
  rename(Gage_height = "Gage Height", Gage_height2 = "Gage height") %>%
  mutate(Gage_height = ifelse(is.na(Gage_height), Gage_height2, Gage_height)) %>%
  select(-Gage_height2)
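For comparison, a file-name column can also be produced without a list-column: name the list of data frames after their files and let bind_rows() write those names into a column through its .id argument. The sketch below is only an illustration under stated assumptions; the temporary directory, the two demo files, and read_one are hypothetical and not part of the real data.

```r
library(dplyr)

# Hypothetical demo directory with two small files, header on different lines
dir_tmp <- tempfile("discharge_")
dir.create(dir_tmp)
writeLines(c("# Text", "#", "agency_cd,site_no,Discharge",
             "5s,15s,14n", "USGS,4294000,1310"),
           file.path(dir_tmp, "file1.csv"))
writeLines(c("# Text", "# Text", "#", "agency_cd,site_no,Discharge",
             "5s,15s,14n", "USGS,4294002,1250"),
           file.path(dir_tmp, "file2.csv"))

paths <- list.files(dir_tmp, pattern = "\\.csv$", full.names = TRUE)

# Read one file starting at its header row, all columns as character
read_one <- function(f) {
  headerline <- grep("agency_cd", readLines(f), fixed = TRUE)[1]
  read.csv(f, skip = headerline - 1, colClasses = "character")
}

discharge <- lapply(setNames(paths, basename(paths)), read_one) %>%
  # The list names become the values of the new "filename" column
  bind_rows(.id = "filename") %>%
  # Drop the format row below each header
  filter(agency_cd != "5s")

discharge
```

Whether this or the map()/unnest() pipeline reads better is largely a matter of taste; both keep track of which file each row came from.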
Thanks to everyone for the help! I also got help from: https://serialmentor.com/blog/2016/6/13/reading-and-combining-many-tidy-data-files-in-R and How to import multiple .csv files at once?