I am training my model with DeepSpeed using the Hugging Face Trainer API, and I was wondering how Trainer/Accelerate handles the logging of the loss and `compute_metrics`. My observation is that DeepSpeed creates a separate process per GPU for training, and each process has its own Trainer. Based on this observation I have the following hypothesis:
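For context, this is roughly how I launch and set up training; the model name, dataset, and `ds_config.json` below are just placeholders standing in for my actual setup:

```python
# Launched with something like: deepspeed --num_gpus=2 train.py
# (each GPU gets its own process, and each process builds its own Trainer)
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "bert-base-uncased"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Placeholder dataset, tokenized the usual way
dataset = load_dataset("imdb")
dataset = dataset.map(
    lambda batch: tokenizer(
        batch["text"], truncation=True, padding="max_length", max_length=128
    ),
    batched=True,
)

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=8,
    logging_steps=50,
    report_to="tensorboard",
    deepspeed="ds_config.json",  # placeholder DeepSpeed config (ZeRO stage, etc.)
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    # compute_metrics=compute_metrics,  # see the sketch further down
)
trainer.train()
trainer.evaluate()
```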
Accelerate will gather the losses from all processes and average them in one of the processes before logging them to TensorBoard.
So my questions are:
- Is this hypothesis correct, or do I need additional handling for reporting when using DeepSpeed?
- Is this hypothesis valid during the evaluation phase as well?
- Is this hypothesis valid for the `compute_metrics` function as well? If not, how do I write a process-safe `compute_metrics` function that reports global statistics? (A rough sketch of my current `compute_metrics` is after this list.)
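For reference, here is roughly what my current `compute_metrics` looks like (accuracy is just a placeholder metric). My concern is whether the `eval_pred` it receives already contains predictions gathered from every process or only the local shard:

```python
import numpy as np

def compute_metrics(eval_pred):
    # Placeholder metric: plain accuracy over whatever predictions and
    # labels this function receives in eval_pred (an EvalPrediction).
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": float((predictions == labels).mean())}
```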
Thanks.