PyTorch/MMCV Deep-Learning Error Messages: Causes and Fixes (Continuously Updated)
My research is based on PyTorch and MMCV. Because deep-learning models are so complex, all kinds of errors come up constantly during research; sometimes a single typo can trigger a long cascade of failures, which has cost me a lot of time. So I am collecting the errors I have run into, along with their fixes, here, both for my own reference and in the hope that it helps others.
Preface
Deep-learning models pull in a large number of dependencies and packages, so error output is often a huge wall of text. Stay calm when reading it. My advice: scroll straight to the top and look at the first reported error, which is usually the root cause (an example follows below).
1. Assertion `srcIndex < srcSelectDimSize` failed
```
C:\w\b\windows\pytorch\aten\src\ATen\native\cuda\Indexing.cu:646: block: [0,0,0], thread: [64,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:\w\b\windows\pytorch\aten\src\ATen\native\cuda\Indexing.cu:646: block: [0,0,0], thread: [65,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
```

(The same line repeats for threads [66,0,0] through [69,0,0].)

I hit this error while running an MMCV-based object-detection model. The cause was that I had not changed the number of classes when training on my own dataset, so the class count the model was built with did not match the class indices in the labels.
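The CUDA assertion is cryptic, but the underlying problem is simply an index lookup with an out-of-range index (a label index >= the configured number of classes). A minimal sketch of the same mistake (the `nn.Embedding` here is only a stand-in for any index-based lookup, not the actual detection head): on CPU it raises a readable `IndexError`, which is why re-running a failing model on CPU is a quick way to diagnose this class of error.

```python
import torch
import torch.nn as nn

# A lookup table sized for 20 classes; stands in for any per-class lookup.
emb = nn.Embedding(num_embeddings=20, embedding_dim=4)

labels = torch.tensor([0, 5, 21])  # 21 >= 20: label outside the class range
try:
    emb(labels)
except IndexError as e:
    # On CPU the error is explicit, unlike the opaque CUDA-side assertion.
    print("caught:", e)
```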
The fix is equally simple: update the config file so that `num_classes` matches your dataset:

```
num_classes = 20
```

2. torch.distributed.elastic.multiprocessing.errors.ChildFailedError in Distributed Training
My first attempt at distributed training: it really is fast, and it really is easy to run into problems. The error message was:
```
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by making sure all `forward` function outputs participate in calculating loss. If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
```

(The `ChildFailedError` in the section title is just the elastic launcher's wrapper around a worker failure; scrolling up through the output reveals this `RuntimeError` as the real cause, exactly as advised in the preface.)

The cause is that some layers created in the model are never used in producing the loss, so they never participate in the parameter update. To locate them, prefix the launch command with `TORCH_DISTRIBUTED_DEBUG=INFO`, for example:
```
TORCH_DISTRIBUTED_DEBUG=INFO bash tools/dist_train.sh
```

Run this way, the command still fails, but it now also prints the parameters that received no gradient:
```
Parameters which did not receive grad for rank 1: rpn_head.density_scorer.weight, rpn_head.density_scorer.bias, rpn_head.CCM.conv1.weight, rpn_head.CCM.conv1.bias, rpn_head.CCM.ccm.0.weight, rpn_head.CCM.ccm.0.bias, rpn_head.CCM.ccm.2.weight, rpn_head.CCM.ccm.2.bias, rpn_head.CCM.ccm.4.weight, rpn_head.CCM.ccm.4.bias, rpn_head.CCM.ccm.6.weight, rpn_head.CCM.ccm.6.bias, rpn_head.CCM.ccm.8.weight, rpn_head.CCM.ccm.8.bias, rpn_head.CCM.ccm.10.weight, rpn_head.CCM.ccm.10.bias, rpn_head.CCM.linear.weight, rpn_head.CCM.linear.bias
```

Trace each listed parameter back to the line of code where the layer is created, then delete or comment it out. (Alternatively, as the error message itself suggests, you can pass `find_unused_parameters=True` to `DistributedDataParallel`, at the cost of extra overhead on every iteration.)
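You can also find such dead layers without DDP, in plain PyTorch: after one backward pass, any parameter whose `.grad` is still `None` never participated in the loss. A minimal sketch (the toy `Net` and its layer names are made up for illustration):

```python
import torch
import torch.nn as nn

class Net(nn.Module):
    """Toy model with a layer that is defined but never used in forward."""
    def __init__(self):
        super().__init__()
        self.used = nn.Linear(4, 4)
        self.unused = nn.Linear(4, 4)  # declared, but never called below

    def forward(self, x):
        return self.used(x)  # self.unused does not contribute to the output

net = Net()
loss = net(torch.randn(2, 4)).sum()
loss.backward()

# Parameters whose grad is still None received no gradient -- exactly the
# ones DDP complains about in the "did not receive grad" message.
stale = [name for name, p in net.named_parameters() if p.grad is None]
print(stale)  # ['unused.weight', 'unused.bias']
```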
