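I recently ran into a strange error.
On my local machine, training a dla34 3D model with Keras on TensorFlow
works fine.
But running the same code on a DGX fails with the following error: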
Traceback (most recent call last):
  File "main_train.py", line 655, in <module>
    history = net_final.fit_generator(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/keras/engine/training.py", line 1943, in fit_generator
    return self.fit(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/keras/engine/training.py", line 1214, in fit
    val_logs = self.evaluate(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/keras/engine/training.py", line 1489, in evaluate
    tmp_logs = self.test_function(iterator)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/def_function.py", line 889, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/def_function.py", line 956, in _call
    return self._concrete_stateful_fn._call_flat(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/function.py", line 1960, in _call_flat
    return self._build_call_outputs(self._inference_function.call(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/function.py", line 591, in call
    outputs = execute.execute(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError: side_input shape must be equal to input shape: [2,32,24,24,24] != [2,32,24,576]
	 [[node model/base.level2.tree1.tree2.bn2/FusedBatchNormV3 (defined at main_train.py:655) ]] [Op:__inference_test_function_6241]
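I checked again and again:
the input data really is the same,
the model code really is the same,
and the package versions do not differ by much.
Yet on the DGX this error still shows up.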
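One way to make the "package versions" check concrete is to dump the installed versions on both machines and diff the results. The following is only a minimal sketch using the standard library; the package list and the helper names collect_versions and diff_versions are hypothetical and not from the original code.

```python
import json
from importlib import metadata

# Hypothetical list of packages worth comparing; adjust to your project.
PACKAGES = ["tensorflow", "keras", "numpy", "h5py"]


def collect_versions(names):
    """Return {package: version or None if it is not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def diff_versions(local, remote):
    """Return {package: (local_version, remote_version)} for mismatches only."""
    names = sorted(set(local) | set(remote))
    return {
        name: (local.get(name), remote.get(name))
        for name in names
        if local.get(name) != remote.get(name)
    }


# Run this on each machine, save the JSON, then feed both dicts to diff_versions.
print(json.dumps(collect_versions(PACKAGES), indent=2))
```

Even a small version difference can matter, so an exact diff is more reliable than eyeballing the two environments.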
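I then tried a lot of debugging approaches,
and none of them solved the problem.
The output of the conv3D right before FusedBatchNormV3 always comes out with one dimension fewer.
Very strange.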
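A closer look at the two shapes in the error message shows that they hold the same number of elements: the 4-D shape is the 5-D shape with its last two axes merged (24 x 24 = 576). The small check below only verifies that arithmetic. It does not explain why the merge happens on the DGX, which this post does not establish.

```python
from math import prod

# Shapes copied from the error message.
side_input_shape = [2, 32, 24, 24, 24]
input_shape = [2, 32, 24, 576]

# Both tensors contain the same number of elements.
same_element_count = prod(side_input_shape) == prod(input_shape)

# Merging the last two axes of the 5-D shape gives the 4-D shape.
merged_last_two = side_input_shape[:-2] + [
    side_input_shape[-2] * side_input_shape[-1]
]

print(same_element_count)
print(merged_last_two == input_shape)
```

So the data looks reshaped rather than lost, which hints at something in the graph-mode execution path rather than at the input pipeline.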
Then I thought of something to try:
in the DGX version of the code, add
tf.config.run_functions_eagerly(True)
And it turned out...
it just ran normally.
Well...
okay then.
So when you hit some strange error,
you can first try turning on
tf.config.run_functions_eagerly(True)
as a debugging mode.
It may make the problem go away.
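As a rough illustration of where the switch goes, here is a minimal sketch. The tiny Conv3D model and the random data are hypothetical stand-ins, not the dla34 model from this post. The run_eagerly=True argument to compile is shown as a per-model alternative that the original code did not use.

```python
import numpy as np
import tensorflow as tf

# Global switch used in this post: run tf.function code eagerly.
# Call it before building and training the model.
tf.config.run_functions_eagerly(True)

# Hypothetical stand-in model with a Conv3D followed by batch normalization.
inputs = tf.keras.Input(shape=(8, 8, 8, 1))
x = tf.keras.layers.Conv3D(4, 3, padding="same")(inputs)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.GlobalAveragePooling3D()(x)
outputs = tf.keras.layers.Dense(1)(x)
model = tf.keras.Model(inputs, outputs)

# Per-model alternative (an assumption, not from the post):
# ask Keras to run this model's train/test steps eagerly.
model.compile(optimizer="adam", loss="mse", run_eagerly=True)

# Random data, only to show that training runs end to end.
x_data = np.random.rand(4, 8, 8, 8, 1).astype("float32")
y_data = np.random.rand(4, 1).astype("float32")
history = model.fit(x_data, y_data, batch_size=2, epochs=1, verbose=0)

print(tf.config.functions_run_eagerly())
```

Eager execution skips graph compilation, so training is usually slower. It works around the error rather than explaining it, so it is worth switching back once the real cause is found.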
Hope this is a useful reference for everyone.