I am training my model with DeepSpeed using the Hugging Face Trainer API, and I was wondering how Trainer/Accelerate handles the logging of the loss and `compute_metrics`. My observation is that DeepSpeed creates a separate process per GPU for training, and each process has its own Trainer. Based on this observation I have the following hypothesis:
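For context, this is roughly how I launch and set up training; the model name, dataset, and `ds_config.json` below are just placeholders standing in for my actual setup:

```python
# Launched with something like: deepspeed --num_gpus=2 train.py
# (each GPU gets its own process, and each process builds its own Trainer)
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "bert-base-uncased"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Placeholder dataset, tokenized the usual way
dataset = load_dataset("imdb")
dataset = dataset.map(
    lambda batch: tokenizer(
        batch["text"], truncation=True, padding="max_length", max_length=128
    ),
    batched=True,
)

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=8,
    logging_steps=50,
    report_to="tensorboard",
    deepspeed="ds_config.json",  # placeholder DeepSpeed config (ZeRO stage, etc.)
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    # compute_metrics=compute_metrics,  # see the sketch further down
)
trainer.train()
trainer.evaluate()
```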
Accelerate will gather the losses from all processes and average them in one of the processes before logging them to TensorBoard.
So my questions are:
- Is this hypothesis correct, or do I need additional handling for reporting when using DeepSpeed?
- Is this hypothesis valid during the evaluation phase as well?
- Is this hypothesis valid for the `compute_metrics` function as well? If not, how do I write a process-safe `compute_metrics` function that reports global statistics? (A rough sketch of my current `compute_metrics` is after this list.)
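For reference, here is roughly what my current `compute_metrics` looks like (accuracy is just a placeholder metric). My concern is whether the `eval_pred` it receives already contains predictions gathered from every process or only the local shard:

```python
import numpy as np

def compute_metrics(eval_pred):
    # Placeholder metric: plain accuracy over whatever predictions and
    # labels this function receives in eval_pred (an EvalPrediction).
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": float((predictions == labels).mean())}
```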
Thanks.