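I am trying to read two CSV files (dataset1 and dataset2), one of which has roughly 400 million rows. Both files have the same number of columns, namely 7.
In the code below, I read both files in fixed-size chunks, rbind them, apply a function, and then write the returned output to a file in append mode.
Here is my code: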
# set x to 0 - number of lines to skip in dataset1
# set y to 7924 - number of lines to read in dataset1
# dataset1 has 60498*7924 lines
x = 0
y = 7924
# set a to 0 - number of lines to skip in dataset2
# set b to 734 - number of lines to read in dataset2
# dataset2 has 60498*734 lines
a = 0
b = 734
# run the loop from 1 to 60498
# each time skip lines already read in
# each time read fixed number of rows
for (i in 1:60498)
{
  # read the next chunk of both datasets and combine them into one
  dat <- read.csv('dataset1.csv', skip = x, nrows = y, header = FALSE)
  dat2 <- read.csv('dataset2.csv', skip = a, nrows = b, header = FALSE)
  dat3 <- rbind(dat, dat2)
  # apply function to this dataset and return the output
  # the function is too long and not in the scope so I will skip it
  # it returns a dataframe of 1 row
  res <- limma.test(dat3)
  # write out the output in append mode
  # so at the end of the loop, out.txt should have 60498 lines
  write.table(res, file = 'out.txt', append = TRUE, quote = FALSE, col.names = FALSE)
  # advance x and a so the next iteration skips the lines already read in
  x = x + 7924
  a = a + 734
}
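The function itself is very fast and is not the bottleneck. However, running the for loop 60498 times takes a very long time. I have an 8-core machine. How can I modify my code to run the for loop in parallel and minimize the run time?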
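To make the question concrete, here is a minimal sketch of the kind of thing I have in mind, using `parLapply` from the base `parallel` package. Everything specific in it is an assumption for illustration: `chunk_test` is a hypothetical stand-in for `limma.test`, `process_chunk` is a helper name I made up, and the two small temporary files replace dataset1.csv and dataset2.csv (5 chunks of 4 and 2 rows instead of 60498 chunks of 7924 and 734 rows) so that the sketch runs on its own.

```r
library(parallel)

# Hypothetical stand-in for limma.test(): returns a one-row data frame.
chunk_test <- function(d) data.frame(n = nrow(d), mean_v1 = mean(d[[1]]))

# Hypothetical helper: read chunk i of each file, combine, apply the function.
process_chunk <- function(i, file1, file2, n1, n2, test_fun) {
  dat  <- utils::read.csv(file1, skip = (i - 1) * n1, nrows = n1, header = FALSE)
  dat2 <- utils::read.csv(file2, skip = (i - 1) * n2, nrows = n2, header = FALSE)
  test_fun(rbind(dat, dat2))
}

# Small demo files (7 columns each) so the sketch is self-contained.
f1 <- tempfile(fileext = ".csv")
f2 <- tempfile(fileext = ".csv")
write.table(matrix(1:(5 * 4 * 7), ncol = 7, byrow = TRUE), f1,
            sep = ",", row.names = FALSE, col.names = FALSE)
write.table(matrix(1:(5 * 2 * 7), ncol = 7, byrow = TRUE), f2,
            sep = ",", row.names = FALSE, col.names = FALSE)

# Start worker processes; 2 here, but this could go up to the number of cores.
cl <- makeCluster(2)
res_list <- parLapply(cl, 1:5, process_chunk,
                      file1 = f1, file2 = f2, n1 = 4, n2 = 2,
                      test_fun = chunk_test)
stopCluster(cl)

# Results come back in chunk order, so write them once instead of appending.
res <- do.call(rbind, res_list)
out <- tempfile(fileext = ".txt")
write.table(res, file = out, quote = FALSE, col.names = FALSE)
```

The results are collected and written in a single call so that several workers never append to the same file at once. As far as I understand, each `read.csv` call with `skip` still has to scan past all the skipped lines, so running in parallel would only spread that cost across cores rather than remove it.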
Thanks!