Question

I am trying to read two CSV files (dataset1 and dataset2), one of which has about 400 million rows. Both files have the same number of columns: 7.

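In the code below, I read both files in fixed-size chunks, rbind the chunks, apply a function to the result, and then write the returned output to a file in append mode.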

Here is my code:

# set x to 0 - number of lines to skip in dataset1
# set y to 7924 - number of lines to read in dataset1
# dataset1 has 60498*7924 lines
x = 0
y = 7924

# set a to 0 - number of lines to skip in dataset2
# set b to 734 - number of lines to read in dataset2
# dataset2 has 60498*734 lines
a = 0
b = 734

# run the loop from 1 to 60498
# each time skip lines already read in
# each time read fixed number of rows
for(i in 1:60498)
{
  # read both datasets and combine in one
  dat <- read.csv('dataset1.csv', skip = x, nrows = y, header = FALSE)
  dat2 <- read.csv('dataset2.csv', skip = a, nrows = b, header = FALSE)
  dat3 <- rbind(dat, dat2)

  # apply function to this dataset and return the output
  # the function is too long and not in the scope so I will skip it
  # it returns a dataframe of 1 row
  res <- limma.test(dat3)

  # write out the output in append mode
  # so at the end of the loop, out.txt should have 60498 lines
  write.table(res, file = 'out.txt', append = TRUE, quote = FALSE, col.names = FALSE)

  # advance x and a so that the lines already read in are skipped
  x = x + 7924
  a = a + 734
}
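
One thing worth noting about the loop above: `read.csv(..., skip = x)` has to scan past the first `x` lines on every call, so the reading cost grows with each iteration. A minimal sketch of an alternative is shown here, assuming the chunks can be consumed strictly in order: open the file once as a connection and let each `read.csv` call continue from the current position. The toy file and the names `n_chunks`, `rows_per_chunk`, `chunk_rows` and `first_vals` are hypothetical and only serve the illustration.

```r
# Toy input standing in for dataset1.csv (hypothetical sizes).
n_chunks <- 3
rows_per_chunk <- 4
f <- tempfile(fileext = ".csv")
write.table(matrix(seq_len(n_chunks * rows_per_chunk * 7), ncol = 7), f,
            sep = ",", row.names = FALSE, col.names = FALSE)

# Open the file once; the read position persists between calls.
con <- file(f, open = "r")
chunk_rows <- integer(n_chunks)
first_vals <- integer(n_chunks)
for (i in seq_len(n_chunks)) {
  # No skip argument: each call continues where the previous one stopped.
  chunk <- read.csv(con, nrows = rows_per_chunk, header = FALSE)
  chunk_rows[i] <- nrow(chunk)
  first_vals[i] <- chunk$V1[1]
}
close(con)
```

In the real loop, two connections (one per dataset) would be opened before the `for` and closed after it, with `nrows = 7924` and `nrows = 734` respectively.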

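The function itself is very fast and is not the bottleneck. However, running the for loop 60498 times takes a very long time. I have an 8-core machine. How can I modify my code to run the for loop in parallel and minimize the run time?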
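
For illustration, a minimal sketch of what a parallel version might look like with the base `parallel` package, assuming the iterations are independent of each other. The toy files, the sizes, and the names `process_chunk` and `chunk_stat` (a placeholder for `limma.test`) are hypothetical. The skip offsets are derived from the chunk index instead of being accumulated, and the results are collected and written once rather than appended from several workers at the same time.

```r
library(parallel)

# Toy stand-ins for the two CSV files (hypothetical sizes).
n_chunks <- 4; rows1 <- 3; rows2 <- 2
f1 <- tempfile(fileext = ".csv")
f2 <- tempfile(fileext = ".csv")
write.table(matrix(seq_len(n_chunks * rows1 * 7), ncol = 7), f1,
            sep = ",", row.names = FALSE, col.names = FALSE)
write.table(matrix(seq_len(n_chunks * rows2 * 7), ncol = 7), f2,
            sep = ",", row.names = FALSE, col.names = FALSE)

# Placeholder for limma.test(): returns a one-row data frame.
chunk_stat <- function(d) data.frame(n = nrow(d), total = sum(d))

# Process chunk i; offsets depend only on i, so chunks can run in any order.
process_chunk <- function(i, f1, f2, rows1, rows2, fun) {
  d1 <- utils::read.csv(f1, skip = (i - 1) * rows1, nrows = rows1, header = FALSE)
  d2 <- utils::read.csv(f2, skip = (i - 1) * rows2, nrows = rows2, header = FALSE)
  fun(rbind(d1, d2))
}

cl <- makeCluster(2)  # on an 8-core machine, e.g. detectCores() - 1
res_list <- parLapply(cl, seq_len(n_chunks), process_chunk,
                      f1 = f1, f2 = f2, rows1 = rows1, rows2 = rows2,
                      fun = chunk_stat)
stopCluster(cl)

# parLapply returns results in input order; combine and write once.
res <- do.call(rbind, res_list)
out <- tempfile(fileext = ".txt")
write.table(res, file = out, quote = FALSE, col.names = FALSE)
```

Each worker still has to scan past the skipped lines on every `read.csv` call, so this spreads the reading cost across cores but does not remove it.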

Thanks!

R: running a for loop in parallel

0 answers: