I recently ran into a strange error.
On my local machine, training a DLA-34 3D model with Keras/TensorFlow
works fine.
But running the same code on a DGX produces the following error:

 

Traceback (most recent call last):
  File "main_train.py", line 655, in <module>
    history = net_final.fit_generator(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/keras/engine/training.py", line 1943, in fit_generator
    return self.fit(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/keras/engine/training.py", line 1214, in fit
    val_logs = self.evaluate(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/keras/engine/training.py", line 1489, in evaluate
    tmp_logs = self.test_function(iterator)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/def_function.py", line 889, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/def_function.py", line 956, in _call
    return self._concrete_stateful_fn._call_flat(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/function.py", line 1960, in _call_flat
    return self._build_call_outputs(self._inference_function.call(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/function.py", line 591, in call
    outputs = execute.execute(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError: side_input shape must be equal to input shape: [2,32,24,24,24] != [2,32,24,576]
  [[node model/base.level2.tree1.tree2.bn2/FusedBatchNormV3 (defined at main_train.py:655) ]] [Op:__inference_test_function_6241]

 

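I checked again and again:
the input data really is the same,
the model code really is the same,
and the package versions do not differ by much.
Yet on the DGX this error keeps showing up.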

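I then tried many debugging approaches,
but none of them solved the problem.
The output of the Conv3D just before FusedBatchNormV3 keeps coming out one dimension short: the error compares a 5-D shape [2,32,24,24,24] with a 4-D shape [2,32,24,576], and since 576 = 24 × 24, the last two axes appear to have been merged.
Very strange.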
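To make the relationship between the two shapes in the error message concrete, here is a minimal pure-Python sketch. The helper `merged_axes` is a hypothetical name made up for illustration and is not part of TensorFlow. It only checks the arithmetic of the reported shapes and says nothing about why graph mode produced the 4-D tensor.

```python
from math import prod

# Shapes copied from the InvalidArgumentError message.
side_input_shape = [2, 32, 24, 24, 24]
input_shape = [2, 32, 24, 576]

# Both shapes hold the same number of elements, so no data is lost;
# the tensor has only been laid out with fewer axes.
same_size = prod(side_input_shape) == prod(input_shape)


def merged_axes(full, collapsed):
    """Hypothetical helper: return (first, last) indices of the consecutive
    axes in `full` whose product equals a single axis of `collapsed`,
    with all other axes matching; return None if there is no such run."""
    extra = len(full) - len(collapsed)
    for i in range(len(collapsed)):
        if (full[:i] == collapsed[:i]
                and prod(full[i:i + extra + 1]) == collapsed[i]
                and full[i + extra + 1:] == collapsed[i + 1:]):
            return (i, i + extra)
    return None


print(same_size)
print(merged_axes(side_input_shape, input_shape))
```

Under these assumptions the check points at the last two spatial axes (indices 3 and 4) being folded into one axis of size 576.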


Then I thought of something to try:
adding the following to the DGX version of the code:

 

tf.config.run_functions_eagerly(True)

 

And it turned out...
it just ran normally.
Well...
okay then.
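For reference, here is a minimal sketch of how the switch could be toggled around a debugging session, assuming TensorFlow 2.3 or later (where `tf.config.functions_run_eagerly()` is available to read the flag back). The function `flatten_last_two` is a made-up example, not part of the original training script. Eager execution is generally slower than graph execution, so it may be worth switching it back off once the problem is understood.

```python
import tensorflow as tf

# Debug switch: run tf.function-decorated code eagerly, op by op,
# instead of as a traced graph.
tf.config.run_functions_eagerly(True)
eager_on = tf.config.functions_run_eagerly()


@tf.function
def flatten_last_two(x):
    # Hypothetical example function. While the switch is on, x is a
    # concrete tensor here, so shapes and values can be printed directly.
    return tf.reshape(x, [2, 32, 24, -1])


y = flatten_last_two(tf.zeros([2, 32, 24, 24, 24]))
result_shape = y.shape.as_list()

# Restore the default graph behaviour after debugging.
tf.config.run_functions_eagerly(False)
eager_off = tf.config.functions_run_eagerly()
```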
So whenever you hit some strange error,
you can first try turning on

 

tf.config.run_functions_eagerly(True)

 


as a debug mode.
It might solve the problem.
Hope this helps, everyone.