Distributed training hangs right at the start

Date: 2020-12-25 07:35:09

Tags: allennlp

I am using the allennlp framework to learn NLP. With a single GPU everything works, but as soon as I switch to multiple GPUs, training hangs right at the start.

The same configuration runs fine on a single GPU.
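I have not pasted my full config here; the distributed part follows the usual allennlp 1.3 pattern. Roughly (the reader type `my_simple_reader` and the train path come from the logs below; everything else in this sketch is a placeholder, not my real config):

```jsonnet
// Sketch of the relevant training config (allennlp 1.3).
// Only the "distributed" block differs from my working single-GPU config.
{
  "dataset_reader": { "type": "my_simple_reader" },
  "train_data_path": "data/train.txt",
  "model": { /* placeholder */ },
  "data_loader": { "batch_size": 8 },
  "trainer": { "num_epochs": 10, "optimizer": "adam" },
  // This block is what switches allennlp into distributed mode:
  "distributed": { "cuda_devices": [0, 1] }
}
```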

Environment

Anaconda on Ubuntu 16.04

pytorch==1.7.1
allennlp==1.3.0
nvcc -V: v10.2.89
driver version: 440.33.01
cuda version: 10.2

Hardware: 2 × GTX 1080 Ti, AMD Ryzen 5 1600

The program produces three logs: out.log, out_worker0.log, and out_worker1.log. They are listed below.

# out.log

2020-12-25 14:54:22,558 - INFO - allennlp.common.params - datasets_for_vocab_creation = None
2020-12-25 14:54:22,558 - INFO - allennlp.common.params - dataset_reader.type = my_simple_reader
2020-12-25 14:54:22,559 - INFO - allennlp.common.params - dataset_reader.lazy = False
2020-12-25 14:54:22,559 - INFO - allennlp.common.params - dataset_reader.cache_directory = None
2020-12-25 14:54:22,559 - INFO - allennlp.common.params - dataset_reader.max_instances = None
2020-12-25 14:54:22,559 - INFO - allennlp.common.params - dataset_reader.manual_distributed_sharding = False
2020-12-25 14:54:22,559 - INFO - allennlp.common.params - dataset_reader.manual_multi_process_sharding = False
2020-12-25 14:54:22,559 - INFO - allennlp.common.params - train_data_path = data/train.txt
2020-12-25 14:54:22,559 - INFO - allennlp.training.util - Reading training data from data/train.txt
2020-12-25 14:54:22,561 - INFO - tqdm - reading instances: 0it [00:00, ?it/s]
2020-12-25 14:54:23,212 - INFO - allennlp.common.params - vocabulary.type = from_instances
2020-12-25 14:54:23,213 - INFO - allennlp.common.params - vocabulary.min_count = None
2020-12-25 14:54:23,213 - INFO - allennlp.common.params - vocabulary.max_vocab_size = None
2020-12-25 14:54:23,213 - INFO - allennlp.common.params - vocabulary.non_padded_namespaces = ('*tags', '*labels')
2020-12-25 14:54:23,213 - INFO - allennlp.common.params - vocabulary.pretrained_files = None
2020-12-25 14:54:23,213 - INFO - allennlp.common.params - vocabulary.only_include_pretrained_words = False
2020-12-25 14:54:23,213 - INFO - allennlp.common.params - vocabulary.tokens_to_add = None
2020-12-25 14:54:23,213 - INFO - allennlp.common.params - vocabulary.min_pretrained_embeddings = None
2020-12-25 14:54:23,213 - INFO - allennlp.common.params - vocabulary.padding_token = @@PADDING@@
2020-12-25 14:54:23,213 - INFO - allennlp.common.params - vocabulary.oov_token = @@UNKNOWN@@
2020-12-25 14:54:23,213 - INFO - allennlp.data.vocabulary - Fitting token dictionary from dataset.
2020-12-25 14:54:23,214 - INFO - tqdm - building vocab: 0it [00:00, ?it/s]
2020-12-25 14:54:23,214 - INFO - allennlp.training.util - writing the vocabulary to tmp/debugger/vocabulary.
2020-12-25 14:54:23,214 - INFO - allennlp.training.util - done creating vocab
2020-12-25 14:54:23,214 - INFO - root - Switching to distributed training mode since multiple GPUs are configured | Master is at: 127.0.0.1:37039 | Rank of this node: 0 | Number of workers in this node: 2 | Number of nodes: 1 | World size: 2

# out_worker0.log

0 | 2020-12-25 14:54:24,863 - INFO - allennlp.common.params - random_seed = 13370
0 | 2020-12-25 14:54:24,863 - INFO - allennlp.common.params - numpy_seed = 1337
0 | 2020-12-25 14:54:24,863 - INFO - allennlp.common.params - pytorch_seed = 133
0 | 2020-12-25 14:54:24,864 - INFO - allennlp.common.checks - Pytorch version: 1.7.1

# out_worker1.log

1 | 2020-12-25 14:54:24,826 - INFO - allennlp.common.params - random_seed = 13370
1 | 2020-12-25 14:54:24,826 - INFO - allennlp.common.params - numpy_seed = 1337
1 | 2020-12-25 14:54:24,826 - INFO - allennlp.common.params - pytorch_seed = 133
1 | 2020-12-25 14:54:24,827 - INFO - allennlp.common.checks - Pytorch version: 1.7.1

It stayed stuck for more than 10 minutes, so I pressed Ctrl-C to interrupt it. The resulting traceback is below:

^CTraceback (most recent call last):
  File "/home/axx/anaconda3/envs/allen-test/bin/allennlp", line 8, in <module>
    sys.exit(run())
  File "/home/axx/anaconda3/envs/allen-test/lib/python3.6/site-packages/allennlp/__main__.py", line 34, in run
    main(prog="allennlp")
  File "/home/axx/anaconda3/envs/allen-test/lib/python3.6/site-packages/allennlp/commands/__init__.py", line 118, in main
    args.func(args)
  File "/home/axx/anaconda3/envs/allen-test/lib/python3.6/site-packages/allennlp/commands/train.py", line 119, in train_model_from_args
    file_friendly_logging=args.file_friendly_logging,
  File "/home/axx/anaconda3/envs/allen-test/lib/python3.6/site-packages/allennlp/commands/train.py", line 178, in train_model_from_file
    file_friendly_logging=file_friendly_logging,
  File "/home/axx/anaconda3/envs/allen-test/lib/python3.6/site-packages/allennlp/commands/train.py", line 323, in train_model
    nprocs=num_procs,
  File "/home/axx/anaconda3/envs/allen-test/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/axx/anaconda3/envs/allen-test/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/home/axx/anaconda3/envs/allen-test/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 77, in join
    timeout=timeout,
  File "/home/axx/anaconda3/envs/allen-test/lib/python3.6/multiprocessing/connection.py", line 911, in wait
    ready = selector.select(timeout)
  File "/home/axx/anaconda3/envs/allen-test/lib/python3.6/selectors.py", line 376, in select
    fd_event_list = self._poll.poll(timeout)
KeyboardInterrupt
^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/home/axx/anaconda3/envs/allen-test/lib/python3.6/multiprocessing/popen_fork.py", line 28, in poll
    pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt
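One thing I plan to try next is enabling NCCL's own logging to see where initialization stalls, and disabling GPU peer-to-peer transport, which I have read can cause exactly this kind of startup hang on some dual-GPU consumer setups (the allennlp command shown in the comment is a placeholder for my actual invocation):

```shell
# Make NCCL print its setup steps so the hang point becomes visible
export NCCL_DEBUG=INFO
# Rule out broken GPU peer-to-peer transport, a known cause of
# startup hangs on some dual-GPU consumer machines
export NCCL_P2P_DISABLE=1
# Then rerun training as usual, e.g.:
#   allennlp train my_config.jsonnet -s tmp/debugger
echo "NCCL_DEBUG=$NCCL_DEBUG NCCL_P2P_DISABLE=$NCCL_P2P_DISABLE"
```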
