A couple of days ago, I ran 4 GPU jobs (jobs 413-416) that don't seem to have actually used the GPU.
This wasted 8 hours of GPU time (each job spent 2 hours evaluating a model that took 1 hour to train).
After a short investigation, I noticed a (rather obvious) inefficiency in my code. I timed the code before and after the fix using shorter runs (10 epochs instead of 255):
Job 423, before fix: the evaluation section took 1.09 minutes (10 epochs).
Job 424, after fix: the evaluation section took 0.67 minutes (10 epochs).
So even before my change, I would've expected a run time of at most 28 minutes (1.09 minutes * 255 / 10, assuming evaluation time scales linearly with the number of epochs), not 2 hours.
I also have similar past jobs that imply the run time should've been around 20-30 minutes.
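
For reference, here is a minimal sketch of the arithmetic behind the estimate above. It assumes the evaluation time scales linearly with the number of epochs (which may not hold exactly if there is fixed startup overhead); the function name `extrapolate_minutes` is just an illustrative helper, not part of my actual code.

```python
def extrapolate_minutes(measured_minutes, measured_epochs, target_epochs):
    """Scale a measured run time to a different epoch count.

    Assumption: run time is proportional to the number of epochs.
    """
    return measured_minutes * target_epochs / measured_epochs


# Timings from the 10-epoch test jobs (423 and 424).
before_fix = extrapolate_minutes(1.09, measured_epochs=10, target_epochs=255)
after_fix = extrapolate_minutes(0.67, measured_epochs=10, target_epochs=255)

observed_minutes = 2 * 60  # each of jobs 413-416 took about 2 hours

print(f"expected before fix: ~{before_fix:.1f} min")
print(f"expected after fix:  ~{after_fix:.1f} min")
print(f"observed / expected (before fix): ~{observed_minutes / before_fix:.1f}x")
```

Under that assumption, the observed 2-hour evaluation is roughly four times longer than the pre-fix code alone would explain.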