R: iterating over a 50k-row data frame takes a very long time

Time: 2017-04-10 13:50:14

Tags: r csv parsing dataframe

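I am writing a simple program that should parse one .tsv file into multiple .csv files. The problem is that it takes a very long time (I think 9 minutes for 50k rows is terrible performance). Could someone please look at my code and tell me what I am doing wrong?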

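My table contains the name of the participant, the name of the media, a timestamp and some coordinate data. In my data there can be one or more participants, and each participant works with 2 media files. I want to create a csv file for each media file of each participant.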

For example, I have 2 participants, P1 and P2, and each participant works with the media files M1 and M2. So I want to create P1_M1.csv, P1_M2.csv, P2_M1.csv and P2_M2.csv.

The data looks like this:

P1 | M1 | data...
P1 | M1 | data...
...
P1 | M2 | data...
...
P2 | m1 | data...
...
...

Here is my code:

data = read.table("./data.tsv", header = T, sep = "\t", stringsAsFactors = F) # load data from tsv

# function for creating csv file
writeData = function(filename, d){
  filename = paste("./", filename, ".csv", sep = "")
  write.csv(d, file = filename, row.names = F)
}

# initialize auxiliary variables
participantName = ""
mediaName = ""
# initialize empty dataframe
subdata <- data.frame(TimeStamp = numeric(), GazeLeftX = integer(), GazeLeftY = integer(), GazeRightX = integer(), GazeRightY = integer())

# for each row in original data...
for(r in 1:nrow(data))
{
  # check if last participant is same as participant on actual row
  if(participantName != data[r, 'ParticipantName']){
    # check if last participant is not empty (like no participant was processed yet)
    if(participantName != ""){
      # if it is not, then the participant's work on the media file has ended, so write data to csv
      writeData(filename = paste(participantName,"_",mediaName, sep = ""), d = subdata)
      # empty auxiliary dataframe and also mediaName
      subdata = subdata[0,]
      mediaName = ""
    }
    # we detected new participant so record it into last participant variable
    participantName = data[r, 'ParticipantName']
  }
  # do same checks for media file because there can also change only mediafile and participant can be the same
  if(mediaName != data[r, 'MediaName']){
    if(mediaName != ""){
      writeData(filename = paste(participantName,"_",mediaName, sep = ""), d = subdata)
      subdata = subdata[0,]
    }
    mediaName = data[r, 'MediaName']  
  }
  # in every iteration append the current row to the auxiliary dataframe
  subdata = rbind(subdata,
                  data.frame(TimeStamp = data[r, 'EyeTrackerTimestamp'],
                             GazeLeftX = data[r, 'GazeLeftX'],
                             GazeLeftY = data[r, 'GazeLeftY'],
                             GazeRightX = data[r, 'GazeRightX'],
                             GazeRightY = data[r, 'GazeRightY']))
}
# if there are any data left in auxiliary dataframe, save it to csv
if(nrow(subdata) != 0){
  writeData(filename = paste(participantName,"_",mediaName, sep = ""), d = subdata)
}

1 answer:

Answer 0 (score: 1)

You are looking for ?split. Try, for example:

split(data,data[,c("ParticipantName","MediaName")],drop=TRUE)

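This will create a list containing a data.frame for each ParticipantName - MediaName pair. If you want to write each data frame to a different file, you can try something like this:

res<-split(data,data[,c("ParticipantName","MediaName")],drop=TRUE)
Map(writeData,names(res),res)

where writeData is the function you defined.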
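
Two details may matter if the output has to match what the loop in the question produced. As far as I can tell, splitting on a two-column data frame joins the group labels with "." (so the files would be named like P1.M1.csv), and every column of data ends up in each file. The sketch below is one possible variation, not the answerer's code: it builds the labels with interaction(..., sep = "_") and keeps only the five gaze columns. The toy data frame, the helper names cols, gaze, groups, paths and the out_dir target are assumptions made up for illustration.

```r
# Toy data standing in for data.tsv (values are made up for illustration)
data <- data.frame(
  ParticipantName     = c("P1", "P1", "P1", "P2", "P2"),
  MediaName           = c("M1", "M1", "M2", "M1", "M2"),
  EyeTrackerTimestamp = c(1, 2, 3, 4, 5),
  GazeLeftX  = c(10L, 11L, 12L, 13L, 14L),
  GazeLeftY  = c(20L, 21L, 22L, 23L, 24L),
  GazeRightX = c(30L, 31L, 32L, 33L, 34L),
  GazeRightY = c(40L, 41L, 42L, 43L, 44L),
  stringsAsFactors = FALSE
)

# Keep only the gaze columns and rename the timestamp, as the loop did
cols <- c("EyeTrackerTimestamp", "GazeLeftX", "GazeLeftY", "GazeRightX", "GazeRightY")
gaze <- setNames(data[, cols], c("TimeStamp", cols[-1]))

# One group label per row, e.g. "P1_M1"; drop = TRUE skips unused combinations
groups <- interaction(data$ParticipantName, data$MediaName, sep = "_", drop = TRUE)
res <- split(gaze, groups)

# Write each piece to <out_dir>/<label>.csv; out_dir is a hypothetical target
out_dir <- tempdir()
paths <- file.path(out_dir, paste0(names(res), ".csv"))
invisible(Map(function(d, p) write.csv(d, file = p, row.names = FALSE), res, paths))
```

Because the grouping is done once over the whole table instead of growing a data frame with rbind on every row, this should avoid the repeated copying that makes the original loop slow; for the real file, data would come from the read.table call in the question.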