Creating a "for loop" to merge pairs of CSV files

Time: 2019-06-16 20:38:12

Tags: r csv for-loop filter merge

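First, I'm a biologist who tracks the movements and behavior of seabirds. I attach two separate biologgers to each bird, and they collect data simultaneously. One is a GPS that records coordinates every 2 minutes; the other is a time-depth recorder (TDR) that records depth every 1 second (when a bird dives beyond a certain depth, that dive can be treated as a foraging dive). Combining these data will help identify where the birds are foraging. So each bird I track has a pair of GPS and TDR datasets that need to be combined by their timestamps. What would make my life easier is batch-processing these pairs with a for loop or some other approach, because I have tracked more than 20 birds and merging them one by one is very tedious. I have almost no experience writing loops and need help. Does anyone have any suggestions?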

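What I currently do is merge the two datasets for each bird one at a time, matching the timestamps in the GPS data (date) against the timestamps in the TDR data (DateTime), which filters out the depth records that have no corresponding coordinates.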

# Read in GPS and TDR files for each bird
rh01gps <- read.csv(file.choose(), sep=",", stringsAsFactors = F, strip.white = T, na.strings = c(""))

head(rh01gps)
          x        y              date      id
1 -123.0033 37.69831 6/3/2018 01:02:00 2018_01
2 -123.0033 37.69826 6/3/2018 01:04:00 2018_01
3 -123.0032 37.69821 6/3/2018 01:06:00 2018_01
4 -123.0033 37.69829 6/3/2018 01:08:00 2018_01
5 -123.0033 37.69830 6/3/2018 01:10:00 2018_01
6 -123.0033 37.69832 6/3/2018 01:12:00 2018_01

rh01tdr <- read.csv(file.choose(), sep=",", stringsAsFactors = F, strip.white = T, na.strings = c(""))

head(rh01tdr)
      Date Pressure   Temp        Time          DateTime
1 6/3/2018    -0.94 25.203 12:00:00 AM 6/3/2018 00:00:00
2 6/3/2018    -0.94 25.203 12:00:01 AM 6/3/2018 00:00:01
3 6/3/2018    -0.94 25.203 12:00:02 AM 6/3/2018 00:00:02
4 6/3/2018    -0.94 25.203 12:00:03 AM 6/3/2018 00:00:03
5 6/3/2018    -0.94 25.203 12:00:04 AM 6/3/2018 00:00:04
6 6/3/2018    -0.94 25.203 12:00:05 AM 6/3/2018 00:00:05

# Create a dataframe with dates from TDR file that match GPS datetime (many 
# more data points from TDRs than GPS, need to filter out dates that won't 
# have a match in the GPS file)
rh_gps_tdr <- subset(rh01tdr, DateTime %in% rh01gps$date)

# Merge newly created data
merge <- cbind(rh_gps_tdr, rh01gps$x, rh01gps$y)

# Rename longitude (rh01gps$x) and latitude (rh01gps$y) columns to "x" and "y"
colnames(merge)[colnames(merge)=="rh01gps$x"] <- "x"
colnames(merge)[colnames(merge)=="rh01gps$y"] <- "y"

# Subset data to filter out unnecessary columns
rh01_gt <- subset(merge, select = c(5, 6, 7, 2, 3))

# Combined GPS coordinates plus pressure data.
head(rh01_gt)
           DateTime         x        y Pressure   Temp
1 6/3/2018 01:02:00 -123.0033 37.69831    -0.94 24.828
2 6/3/2018 01:04:00 -123.0033 37.69826    -0.91 24.703
3 6/3/2018 01:06:00 -123.0032 37.69821    -0.94 24.625
4 6/3/2018 01:08:00 -123.0033 37.69829    -0.94 24.578
5 6/3/2018 01:10:00 -123.0033 37.69830    -0.91 24.531
6 6/3/2018 01:12:00 -123.0033 37.69832    -0.94 24.516

write.csv(rh01_gt, "RHAU01_2018_TDR&GPS.csv")

The code I've provided works for one bird's dataset, but I'd like to know whether there is a way to run it for every bird in a single process.

1 answer:

Answer 0 (score: 1)

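I've put your code into a for loop. The loop should work as long as there is an equal number of each kind of csv file and they follow the same naming pattern. In my test, the files were named rh01gps.csv, rh02gps.csv… and rh01tdr.csv, rh02tdr.csv… I had to set the date format because it didn't work otherwise (note that I'm assuming your dates are in dd/mm/yyyy format). I also changed the subset, because I don't think the DateTime column is needed when there is a Date column (feel free to change it).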
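
Before running the loop on every file, the date-format assumption can be checked on a single value. This is a minimal sketch, not part of the tested loop: the timestamp is copied from the question's sample output, and the "UTC" time zone is an assumption made only for this illustration. It shows that as.Date() keeps the calendar day and drops the time of day, while as.POSIXct() keeps both.

```r
# One timestamp in the format shown in the question's GPS file
stamp <- "6/3/2018 01:02:00"

# as.Date() keeps only the calendar day; the time of day is dropped
day_only <- as.Date(stamp, format = "%d/%m/%Y")

# as.POSIXct() keeps the full date and time
# (tz = "UTC" is an assumption for this illustration)
full_time <- as.POSIXct(stamp, format = "%d/%m/%Y %H:%M:%S", tz = "UTC")

print(format(day_only))
print(format(full_time, "%Y-%m-%d %H:%M:%S"))

# If the loggers wrote month/day/year instead, swap %d and %m
print(format(as.Date(stamp, format = "%m/%d/%Y")))
```

If the printed day and month are swapped relative to when the bird was actually tracked, the format string needs changing before the files are processed.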

# your directory with all the csv files
setwd('yourpath')

# list tdr files by pattern 'tdr'
tdr.list<-list.files(pattern='tdr')

# list gps files by pattern 'gps'
gps.list<-list.files(pattern='gps')

# starting loop
for (i in seq_along(gps.list))
{
  # open each csv
  tdr<-read.csv(tdr.list[i], sep=",", stringsAsFactors = F, strip.white = T, na.strings = c(""))
  gps<-read.csv(gps.list[i], sep=",", stringsAsFactors = F, strip.white = T, na.strings = c(""))

  # set date format 
  gps$date<-as.Date(gps$date, '%d/%m/%Y')
  tdr$Date<-as.Date(tdr$Date, '%d/%m/%Y')

  # Create a dataframe with dates from TDR file that match GPS datetime (many 
  # more data points from TDRs than GPS, need to filter out dates that won't 
  # have a match in the GPS file)
  rh_gps_tdr <- subset(tdr, Date %in% gps$date) # subset made with date

  # Merge newly created data
  merge <- cbind(rh_gps_tdr, gps$x, gps$y)

  # Rename longitude (gps$x) and latitude (gps$y) columns to "x" and "y"
  colnames(merge)[colnames(merge)=="gps$x"] <- "x"
  colnames(merge)[colnames(merge)=="gps$y"] <- "y"

  # Subset data to filter out unnecessary columns
  gt <- subset(merge, select = c(5, 6, 7, 2, 3))

  # get the file number to have it in the output file
  filenumber<-substr(gps.list[i], 3,4) # 3 & 4 are the position of the number in the name (rhXXgps.csv)

  # writing csv file
  write.csv(gt, paste0("RHAU", filenumber, "_2018_TDR&GPS.csv"))
}
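
One thing to watch in the loop above: cbind() pairs rows by position, so it relies on the filtered TDR rows and the GPS rows having the same count and order. A possible alternative, sketched here under assumptions, looks up each GPS timestamp in the TDR DateTime column with match() instead. The helper names merge_bird and merge_all_birds are hypothetical, the column names are taken from the question, and the sketch assumes both timestamp columns are written as identical text (as in the question's sample) and that files are named like rh01gps.csv and rh01tdr.csv.

```r
# Hypothetical helper: attach TDR readings to GPS fixes by exact timestamp.
merge_bird <- function(gps, tdr) {
  # For each GPS timestamp, find the row of the TDR table with the same
  # timestamp text (NA when the TDR has no reading for that second)
  idx  <- match(gps$date, tdr$DateTime)
  keep <- !is.na(idx)

  # Keep only GPS fixes that found a TDR reading, in GPS file order
  data.frame(
    DateTime = gps$date[keep],
    x        = gps$x[keep],
    y        = gps$y[keep],
    Pressure = tdr$Pressure[idx[keep]],
    Temp     = tdr$Temp[idx[keep]],
    stringsAsFactors = FALSE
  )
}

# Hypothetical driver: pair files by name instead of by list position.
merge_all_birds <- function(dir = ".") {
  # Assumed naming pattern: rhXXgps.csv with a matching rhXXtdr.csv
  gps.list <- list.files(dir, pattern = "^rh[0-9]+gps\\.csv$")

  for (gps.file in gps.list) {
    tdr.file <- sub("gps\\.csv$", "tdr.csv", gps.file)
    if (!file.exists(file.path(dir, tdr.file))) {
      warning("no TDR file found for ", gps.file)
      next
    }

    gps <- read.csv(file.path(dir, gps.file), stringsAsFactors = FALSE,
                    strip.white = TRUE, na.strings = c(""))
    tdr <- read.csv(file.path(dir, tdr.file), stringsAsFactors = FALSE,
                    strip.white = TRUE, na.strings = c(""))

    gt <- merge_bird(gps, tdr)

    # Bird number taken from the file name (rhXXgps.csv -> XX)
    filenumber <- sub("^rh([0-9]+)gps\\.csv$", "\\1", gps.file)
    write.csv(gt,
              file.path(dir, paste0("RHAU", filenumber, "_2018_TDR&GPS.csv")),
              row.names = FALSE)
  }
  invisible(gps.list)
}
```

Calling merge_all_birds('yourpath') would then write one combined file per bird into the same directory. Deriving the TDR file name from the GPS file name means a missing file produces a warning for that bird rather than silently shifting the pairing of all later files.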