I am using a topic modeling approach that works fine on my computer in RStudio, but takes a while. So I am using a Linux cluster. However, even though I request a lot of capacity, it does not really speed things up:
Sorry, I am quite new to this. This is what I use in the shell:
salloc -N 240 --mem=61440 -t 06:00:00 -p med
#!/bin/sh
#SBATCH --nodes=200
#SBATCH --time=06:00:00
#SBATCH --partition=med
#SBATCH --mem=102400
#SBATCH --job-name=TestJobUSERNAME
#SBATCH --mail-user=username@ddomain.com
#SBATCH --mail-type=ALL
#SBATCH --cpus-per-task=100
squeue -u username
cd /work/username/data
module load R
export OMP_NUM_THREADS=100
echo "sbatch: START SLURM_JOB_ID $SLURM_JOB_ID (SLURM_TASK_PID $SLURM_TASK_PID) on $SLURMD_NODENAME"
echo "sbatch: SLURM_JOB_NODELIST $SLURM_JOB_NODELIST"
echo "sbatch: SLURM_JOB_ACCOUNT $SLURM_JOB_ACCOUNT"
Rscript myscript.R
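For reference: since a plain Rscript runs on a single node, a request sized to one node with several CPUs is closer to what the job can actually use than 200 nodes. A sketch only — the partition name, core count, and memory are placeholder assumptions, not recommendations for this cluster:

```shell
#!/bin/sh
#SBATCH --nodes=1              # one node: a plain Rscript cannot span nodes
#SBATCH --cpus-per-task=16     # cores available for within-node parallelism
#SBATCH --mem=32G
#SBATCH --time=06:00:00
#SBATCH --partition=med
#SBATCH --job-name=TestJobUSERNAME

module load R
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
cd /work/username/data
Rscript myscript.R
```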
I am fairly sure something is wrong with my input, because:
These are my typical results:
# just a very small request so I can copy/paste the results; usually I request the one above
[username@gw02 ~]$ salloc -N 2 --mem=512 -t 00:10:00 -p short
salloc: Granted job allocation 1234567
salloc: Waiting for resource configuration
salloc: Nodes cstd01-[218-219] are ready for job
Disk quotas for user username (uid 12345):
-- disk space --
Filesystem limit used avail use%
/home/user 32G 432M 32G 2%
/work/user 1T 219M 1024G 0%
[username@gw02 ~]$ squeue -u username
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1234567 short bash username R 2:14 2 cstd01-[218-219]
#(directory, module load, etc.)
#missing output for SLURM_TASK_PID and SLURMD_NODENAME:
[username@gw02 data]$ echo "sbatch: START SLURM_JOB_ID $SLURM_JOB_ID (SLURM_TASK_PID $SLURM_TASK_PID) on $SLURMD_NODENAME"
sbatch: START SLURM_JOB_ID 1314914 (SLURM_TASK_PID ) on
Can anyone help? Thank you very much!
EDIT: As Ralf Stubner pointed out in his comment, I do not parallelize in my R code. I have absolutely no idea how to do that. Here is a sample calculation:
# Create the data frame
col1 <- runif (12^5, 0, 2)
col2 <- rnorm (12^5, 0, 2)
col3 <- rpois (12^5, 3)
col4 <- rchisq (12^5, 2)
df <- data.frame (col1, col2, col3, col4)
# Original R code: Before vectorization and pre-allocation
system.time({
for (i in 1:nrow(df)) { # for every row
if ((df[i, "col1"] + df[i, "col2"] + df[i, "col3"] + df[i, "col4"]) > 4) { # check if > 4
df[i, 5] <- "greater_than_4" # assign 5th column
} else {
df[i, 5] <- "lesser_than_4" # assign 5th column
}
}
})
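For comparison, this loop can be vectorized: `ifelse()` computes the whole fifth column in one pass over the row sums. A minimal sketch, operating on the same `df` as above:

```r
# Vectorized version: compute all row sums at once instead of looping per row
system.time({
  df$col5 <- ifelse(df$col1 + df$col2 + df$col3 + df$col4 > 4,
                    "greater_than_4", "lesser_than_4")
})
```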
...and the shortened "real code":
library(NLP)
library(tm)
library(SnowballC)
library(topicmodels)
library(lda)
library(textclean)
# load data and create corpus
filenames <- list.files(getwd(),pattern='*.txt')
files <- lapply(filenames,readLines)
docs <- Corpus(VectorSource(files))
# clean data (shortened, just two examples)
docs.adj <- tm_map(docs, removeWords, stopwords('english'))
docs.adj <- tm_map(docs.adj, content_transformer(tolower))
# create document-term matrix
dtm <- DocumentTermMatrix(docs.adj)
dtm_stripped <- removeSparseTerms(dtm, 0.8)
rownames(dtm_stripped) <- filenames
freq <- colSums(as.matrix(dtm_stripped))
ord <- order(freq,decreasing=TRUE)
### find optimal number of k
burnin <- 10000
iter <- 250
thin <- 50
seed <-list(3)
nstart <- 1
best <- TRUE
seq_start <- 2
seq_end <- length(files)
iteration <- floor(length(files)/5)
best.model <- lapply(seq(seq_start,seq_end, by=iteration), function(k){LDA(dtm_stripped, k, method = 'Gibbs',control=list(nstart=nstart, seed = seed, best=best, burnin = burnin, iter = iter, thin=thin))})
best.model.logLik <- as.data.frame(as.matrix(lapply(best.model, logLik)))
best.model.logLik.df <- data.frame(topics=c(seq(seq_start,seq_end, by=iteration)), LL=as.numeric(as.matrix(best.model.logLik)))
optimal_k <- best.model.logLik.df[which.max(best.model.logLik.df$LL), "topics"] # take the k value itself, not the whole row
print(optimal_k)
### do topic modeling with more iterations on optimal_k
burnin <- 4000
iter <- 1000
thin <- 100
seed <-list(2003,5,63)
nstart <- 3
best <- TRUE
ldaOut <-LDA(dtm_stripped,optimal_k, method='Gibbs', control=list(nstart=nstart, seed = seed, best=best, burnin = burnin, iter = iter, thin=thin))
Answer 0 (score: 0)
From a quick look at your R script, it appears that most of the processing time is spent in:

best.model <- lapply(seq(seq_start, seq_end, by=iteration), function(k){
  LDA(dtm_stripped, k, method = 'Gibbs', control=list(nstart=nstart, seed = seed, best=best, burnin = burnin, iter = iter, thin=thin))
})

Here you can try to parallelize the code by using future_lapply() instead of lapply(), i.e.

best.model <- future_lapply(seq(seq_start, seq_end, by=iteration), function(k){
  LDA(dtm_stripped, k, method = 'Gibbs', control=list(nstart=nstart, seed = seed, best=best, burnin = burnin, iter = iter, thin=thin))
}, future.seed = TRUE)

I also added future.seed = TRUE to make sure random number generation is statistically sound when done in parallel. The future_lapply() function is in the future.apply package (*), so you need to put

library(future.apply)

at the top of your script. One last thing - you need to make it actually run in parallel (the default is sequential) by adding

plan(multiprocess)

also at the top (after attaching future.apply). The default is to use whatever cores are "available", where "available" means that it also agrees with the number of cores that an HPC scheduler (e.g. Slurm) has allocated to your job. If you try the above on your local machine, it will default to the number of cores that machine has. That is, you can also verify the code on your local machine, and you should already see some speedup there. When you know it works, you can rerun it on the cluster via a Slurm allocation, and it should work out of the box - just with more parallel processes.
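If future.apply happens not to be installed on the cluster, the same fan-out over candidate values of k can be sketched with base R's parallel package, which ships with R. A sketch under assumptions: `fit_one()` is a toy stand-in for the real LDA call, and `mc.cores = 2` is a placeholder (in a Slurm job you would take it from `SLURM_CPUS_PER_TASK`):

```r
library(parallel)

# Toy stand-in for fitting one model per candidate k;
# in the real script this would be LDA(dtm_stripped, k, ...)
fit_one <- function(k) {
  k^2
}

ks <- seq(2, 10, by = 2)
# mclapply forks one worker per element (not available on Windows;
# there, use makeCluster() + parLapply() instead)
results <- mclapply(ks, fit_one, mc.cores = 2)
```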
You may find my blog post on future.apply from 2018-06-23 useful - there is a small FAQ at the end.
(*) Disclaimer: I am the author of future.apply.