Currently, the output data generated at /output is kept when a job finishes, but if the job is interrupted through floyd stop <ID>, any data created during the run is lost.
It would be great if we could issue a command like floyd stop -keep <ID>, so that we can interrupt the experiment without losing the output.
Use case: Let's say I'm running a big model and, for some reason (maybe random initializers), it already has a low-enough error rate. I would like to be able to stop it then and download the checkpoints, without having to wait for it to finish or pre-program stopping logic.