I'm not sure whether this is a bug or just an ambiguous error message, but something seems off. While experimenting with FloydHub's port of PyTorch's DCGAN example...
https://docs.floydhub.com/examples/dcgan/
...I cannot run more than 29 iterations without hitting the following error when the job attempts to persist data in home:
2019-01-04 09:31:27,692 INFO - Waiting for container to complete...
2019-01-04 09:31:27,914 INFO - Persisting outputs...
2019-01-04 09:31:28,127 INFO - Creating data module for output...
2019-01-04 09:31:28,179 INFO - Data module created for output.
2019-01-04 09:31:28,179 INFO - Persisting data in home...
2019-01-04 09:31:32,158 ERROR - Error finalizing data in home.
2019-01-04 09:31:32,158 INFO - [success] Finished execution
If anyone has any ideas on what is going on, I'd appreciate them. Thanks! More context below:
This succeeds and generates output files:
floyd run --gpu --env pytorch-0.2 --data osetinsky/datasets/private-dataset/1:private-dataset 'python main.py --dataset private-dataset --dataroot ./ --outf trained_models --cuda --ngpu 1 --niter 29'
This, and anything with --niter > 29, finishes all epochs but fails when attempting to persist/finalize the data in home:
floyd run --gpu --env pytorch-0.2 --data osetinsky/datasets/private-dataset/1:private-dataset 'python main.py --dataset private-dataset --dataroot ./ --outf trained_models --cuda --ngpu 1 --niter 30'
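One thing I've considered (purely a guess at the cause, since the error gives no reason): the stock main.py saves both netG and netD checkpoints every epoch, so the output volume grows with --niter. In case the persist step is choking on output size, a workaround sketch I may try is gating the per-epoch saves. Note that maybe_save and save_every below are my own additions, not part of the example:

```python
def maybe_save(epoch, save_every, save_fn):
    """Call the (expensive) checkpoint-saving function only every
    `save_every` epochs, instead of on every epoch."""
    if epoch % save_every == 0:
        save_fn(epoch)
        return True
    return False

# In main.py's training loop, the per-epoch torch.save(...) calls for
# netG/netD would move inside save_fn, e.g. (hypothetical):
#   maybe_save(epoch, 10, lambda e: torch.save(
#       netG.state_dict(), '%s/netG_epoch_%d.pth' % (opt.outf, e)))
```

With save_every=10 and --niter 30, that would write only 3 generator checkpoints instead of 30.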
Full logs...
Successful 29-epoch run:
2019-01-04 09:27:00,271 INFO - Run Output:
2019-01-04 09:27:00,279 INFO - Starting services.
2019-01-04 09:27:00,286 INFO - supervisor: unrecognized service
2019-01-04 09:27:00,756 INFO - Namespace(batchSize=64, beta1=0.5, cuda=True, dataroot='./', dataset='private-dataset', imageSize=64, lr=0.0002, manualSeed=None, ndf=64, netD='', netG='', ngf=64, ngpu=1, niter=29, nz=100, outf='trained_models', workers=2)
2019-01-04 09:27:00,756 INFO - Random Seed: 7884
2019-01-04 09:27:01,041 INFO - _netG (
2019-01-04 09:27:01,041 INFO - (main): Sequential (
2019-01-04 09:27:01,041 INFO - (0): ConvTranspose2d(100, 512, kernel_size=(4, 4), stride=(1, 1), bias=False)
2019-01-04 09:27:01,042 INFO - (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
2019-01-04 09:27:01,042 INFO - (2): ReLU (inplace)
2019-01-04 09:27:01,042 INFO - (3): ConvTranspose2d(512, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
2019-01-04 09:27:01,042 INFO - (4): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
2019-01-04 09:27:01,042 INFO - (5): ReLU (inplace)
2019-01-04 09:27:01,042 INFO - (6): ConvTranspose2d(256, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
2019-01-04 09:27:01,042 INFO - (7): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
2019-01-04 09:27:01,043 INFO - (8): ReLU (inplace)
2019-01-04 09:27:01,043 INFO - (9): ConvTranspose2d(128, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
2019-01-04 09:27:01,043 INFO - (10): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
2019-01-04 09:27:01,043 INFO - (11): ReLU (inplace)
2019-01-04 09:27:01,043 INFO - (12): ConvTranspose2d(64, 3, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
2019-01-04 09:27:01,043 INFO - (13): Tanh ()
2019-01-04 09:27:01,043 INFO - )
2019-01-04 09:27:01,044 INFO - )
2019-01-04 09:27:01,253 INFO - _netD (
2019-01-04 09:27:01,254 INFO - (main): Sequential (
2019-01-04 09:27:01,254 INFO - (0): Conv2d(3, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
2019-01-04 09:27:01,254 INFO - (1): LeakyReLU (0.2, inplace)
2019-01-04 09:27:01,254 INFO - (2): Conv2d(64, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
2019-01-04 09:27:01,254 INFO - (3): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
2019-01-04 09:27:01,254 INFO - (4): LeakyReLU (0.2, inplace)
2019-01-04 09:27:01,255 INFO - (5): Conv2d(128, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
2019-01-04 09:27:01,255 INFO - (6): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
2019-01-04 09:27:01,255 INFO - (7): LeakyReLU (0.2, inplace)
2019-01-04 09:27:01,255 INFO - (8): Conv2d(256, 512, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
2019-01-04 09:27:01,255 INFO - (9): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
2019-01-04 09:27:01,255 INFO - (10): LeakyReLU (0.2, inplace)
2019-01-04 09:27:01,255 INFO - (11): Conv2d(512, 1, kernel_size=(4, 4), stride=(1, 1), bias=False)
2019-01-04 09:27:01,256 INFO - (12): Sigmoid ()
2019-01-04 09:27:01,256 INFO - )
2019-01-04 09:27:01,256 INFO - )
2019-01-04 09:27:04,211 INFO - [0/29][0/1] Loss_D: 1.0293 Loss_G: 19.6073 D(x): 0.6288 D(G(z)): 0.4318 / 0.0000
2019-01-04 09:27:04,551 INFO - [1/29][0/1] Loss_D: 0.0371 Loss_G: 9.9628 D(x): 1.0000 D(G(z)): 0.0365 / 0.0000
2019-01-04 09:27:04,770 INFO - [2/29][0/1] Loss_D: 0.0156 Loss_G: 7.6343 D(x): 1.0000 D(G(z)): 0.0155 / 0.0005
2019-01-04 09:27:04,996 INFO - [3/29][0/1] Loss_D: 0.0499 Loss_G: 10.4233 D(x): 1.0000 D(G(z)): 0.0487 / 0.0000
2019-01-04 09:27:05,222 INFO - [4/29][0/1] Loss_D: 0.0339 Loss_G: 9.5846 D(x): 1.0000 D(G(z)): 0.0334 / 0.0001
2019-01-04 09:27:05,447 INFO - [5/29][0/1] Loss_D: 0.0285 Loss_G: 9.4756 D(x): 1.0000 D(G(z)): 0.0281 / 0.0001
2019-01-04 09:27:05,672 INFO - [6/29][0/1] Loss_D: 0.4217 Loss_G: 23.2836 D(x): 1.0000 D(G(z)): 0.3441 / 0.0000
2019-01-04 09:27:05,899 INFO - [7/29][0/1] Loss_D: 0.0006 Loss_G: 9.9534 D(x): 1.0000 D(G(z)): 0.0006 / 0.0000
2019-01-04 09:27:06,120 INFO - [8/29][0/1] Loss_D: 0.0001 Loss_G: 9.7888 D(x): 1.0000 D(G(z)): 0.0001 / 0.0001
2019-01-04 09:27:06,341 INFO - [9/29][0/1] Loss_D: 0.0018 Loss_G: 7.1280 D(x): 1.0000 D(G(z)): 0.0018 / 0.0008
2019-01-04 09:27:06,559 INFO - [10/29][0/1] Loss_D: 0.0016 Loss_G: 7.0320 D(x): 1.0000 D(G(z)): 0.0016 / 0.0009
2019-01-04 09:27:06,780 INFO - [11/29][0/1] Loss_D: 0.0029 Loss_G: 6.5007 D(x): 1.0000 D(G(z)): 0.0029 / 0.0015
2019-01-04 09:27:07,014 INFO - [12/29][0/1] Loss_D: 0.0009 Loss_G: 7.2578 D(x): 1.0000 D(G(z)): 0.0009 / 0.0007
2019-01-04 09:27:07,245 INFO - [13/29][0/1] Loss_D: 0.0554 Loss_G: 11.7622 D(x): 1.0000 D(G(z)): 0.0539 / 0.0000
2019-01-04 09:27:07,471 INFO - [14/29][0/1] Loss_D: 0.0047 Loss_G: 6.9018 D(x): 1.0000 D(G(z)): 0.0047 / 0.0010
2019-01-04 09:27:07,698 INFO - [15/29][0/1] Loss_D: 0.0133 Loss_G: 7.3852 D(x): 1.0000 D(G(z)): 0.0132 / 0.0006
2019-01-04 09:27:07,924 INFO - [16/29][0/1] Loss_D: 0.0372 Loss_G: 10.5192 D(x): 1.0000 D(G(z)): 0.0365 / 0.0000
2019-01-04 09:27:08,146 INFO - [17/29][0/1] Loss_D: 0.0254 Loss_G: 9.7352 D(x): 1.0000 D(G(z)): 0.0250 / 0.0001
2019-01-04 09:27:08,367 INFO - [18/29][0/1] Loss_D: 0.0270 Loss_G: 9.6917 D(x): 1.0000 D(G(z)): 0.0266 / 0.0001
2019-01-04 09:27:08,588 INFO - [19/29][0/1] Loss_D: 0.0068 Loss_G: 7.1343 D(x): 1.0000 D(G(z)): 0.0068 / 0.0008
2019-01-04 09:27:08,809 INFO - [20/29][0/1] Loss_D: 0.0006 Loss_G: 8.1597 D(x): 1.0000 D(G(z)): 0.0006 / 0.0003
2019-01-04 09:27:09,034 INFO - [21/29][0/1] Loss_D: 0.0077 Loss_G: 6.9068 D(x): 1.0000 D(G(z)): 0.0077 / 0.0010
2019-01-04 09:27:09,258 INFO - [22/29][0/1] Loss_D: 0.0059 Loss_G: 6.8561 D(x): 1.0000 D(G(z)): 0.0059 / 0.0011
2019-01-04 09:27:09,479 INFO - [23/29][0/1] Loss_D: 0.0021 Loss_G: 6.9945 D(x): 1.0000 D(G(z)): 0.0021 / 0.0009
2019-01-04 09:27:09,705 INFO - [24/29][0/1] Loss_D: 0.1880 Loss_G: 25.8647 D(x): 1.0000 D(G(z)): 0.1714 / 0.0000
2019-01-04 09:27:09,934 INFO - [25/29][0/1] Loss_D: 0.0040 Loss_G: 9.6543 D(x): 1.0000 D(G(z)): 0.0040 / 0.0001
2019-01-04 09:27:10,163 INFO - [26/29][0/1] Loss_D: 0.0000 Loss_G: 13.5612 D(x): 1.0000 D(G(z)): 0.0000 / 0.0000
2019-01-04 09:27:10,403 INFO - [27/29][0/1] Loss_D: 0.0001 Loss_G: 9.8095 D(x): 1.0000 D(G(z)): 0.0001 / 0.0001
2019-01-04 09:27:10,639 INFO - [28/29][0/1] Loss_D: 0.0009 Loss_G: 7.5156 D(x): 1.0000 D(G(z)): 0.0009 / 0.0005
2019-01-04 09:27:10,985 INFO -
################################################################################
2019-01-04 09:27:10,985 INFO - Waiting for container to complete...
2019-01-04 09:27:11,178 INFO - Persisting outputs...
2019-01-04 09:27:11,384 INFO - Creating data module for output...
2019-01-04 09:27:11,414 INFO - Data module created for output.
2019-01-04 09:27:11,414 INFO - Persisting data in home...
2019-01-04 09:27:37,777 INFO - Home data persisted.
2019-01-04 09:27:37,779 INFO - [success] Finished execution
Unsuccessful 30-epoch run:
2019-01-04 09:29:18,600 INFO - Run Output:
2019-01-04 09:29:21,529 INFO - Starting services.
2019-01-04 09:29:21,751 INFO - supervisor: unrecognized service
2019-01-04 09:29:53,296 INFO - Namespace(batchSize=64, beta1=0.5, cuda=True, dataroot='./', dataset='private-dataset', imageSize=64, lr=0.0002, manualSeed=None, ndf=64, netD='', netG='', ngf=64, ngpu=1, niter=30, nz=100, outf='trained_models', workers=2)
2019-01-04 09:29:53,297 INFO - Random Seed: 2334
2019-01-04 09:29:53,524 INFO - XXX
2019-01-04 09:29:53,525 INFO - /floyd/home
2019-01-04 09:29:53,893 INFO - _netG (
2019-01-04 09:29:53,893 INFO - (main): Sequential (
2019-01-04 09:29:53,893 INFO - (0): ConvTranspose2d(100, 512, kernel_size=(4, 4), stride=(1, 1), bias=False)
2019-01-04 09:29:53,893 INFO - (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
2019-01-04 09:29:53,894 INFO - (2): ReLU (inplace)
2019-01-04 09:29:53,894 INFO - (3): ConvTranspose2d(512, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
2019-01-04 09:29:53,894 INFO - (4): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
2019-01-04 09:29:53,894 INFO - (5): ReLU (inplace)
2019-01-04 09:29:53,894 INFO - (6): ConvTranspose2d(256, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
2019-01-04 09:29:53,894 INFO - (7): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
2019-01-04 09:29:53,894 INFO - (8): ReLU (inplace)
2019-01-04 09:29:53,894 INFO - (9): ConvTranspose2d(128, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
2019-01-04 09:29:53,895 INFO - (10): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
2019-01-04 09:29:53,895 INFO - (11): ReLU (inplace)
2019-01-04 09:29:53,895 INFO - (12): ConvTranspose2d(64, 3, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
2019-01-04 09:29:53,895 INFO - (13): Tanh ()
2019-01-04 09:29:53,895 INFO - )
2019-01-04 09:29:53,895 INFO - )
2019-01-04 09:29:54,096 INFO - _netD (
2019-01-04 09:29:54,097 INFO - (main): Sequential (
2019-01-04 09:29:54,097 INFO - (0): Conv2d(3, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
2019-01-04 09:29:54,097 INFO - (1): LeakyReLU (0.2, inplace)
2019-01-04 09:29:54,097 INFO - (2): Conv2d(64, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
2019-01-04 09:29:54,097 INFO - (3): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
2019-01-04 09:29:54,097 INFO - (4): LeakyReLU (0.2, inplace)
2019-01-04 09:29:54,097 INFO - (5): Conv2d(128, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
2019-01-04 09:29:54,098 INFO - (6): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
2019-01-04 09:29:54,098 INFO - (7): LeakyReLU (0.2, inplace)
2019-01-04 09:29:54,098 INFO - (8): Conv2d(256, 512, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
2019-01-04 09:29:54,098 INFO - (9): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
2019-01-04 09:29:54,098 INFO - (10): LeakyReLU (0.2, inplace)
2019-01-04 09:29:54,098 INFO - (11): Conv2d(512, 1, kernel_size=(4, 4), stride=(1, 1), bias=False)
2019-01-04 09:29:54,098 INFO - (12): Sigmoid ()
2019-01-04 09:29:54,099 INFO - )
2019-01-04 09:29:54,099 INFO - )
2019-01-04 09:31:19,079 INFO - [0/30][0/1] Loss_D: 1.3548 Loss_G: 15.4099 D(x): 0.3965 D(G(z)): 0.3494 / 0.0000
2019-01-04 09:31:19,407 INFO - [1/30][0/1] Loss_D: 0.9297 Loss_G: 18.1907 D(x): 1.0000 D(G(z)): 0.6053 / 0.0000
2019-01-04 09:31:19,608 INFO - [2/30][0/1] Loss_D: 0.1771 Loss_G: 11.5425 D(x): 1.0000 D(G(z)): 0.1623 / 0.0000
2019-01-04 09:31:19,815 INFO - [3/30][0/1] Loss_D: 0.0589 Loss_G: 7.6991 D(x): 1.0000 D(G(z)): 0.0572 / 0.0005
2019-01-04 09:31:20,017 INFO - [4/30][0/1] Loss_D: 0.0269 Loss_G: 6.3628 D(x): 1.0000 D(G(z)): 0.0266 / 0.0017
2019-01-04 09:31:20,218 INFO - [5/30][0/1] Loss_D: 0.0402 Loss_G: 6.6518 D(x): 1.0000 D(G(z)): 0.0394 / 0.0013
2019-01-04 09:31:20,422 INFO - [6/30][0/1] Loss_D: 0.0293 Loss_G: 6.5363 D(x): 1.0000 D(G(z)): 0.0289 / 0.0014
2019-01-04 09:31:20,631 INFO - [7/30][0/1] Loss_D: 0.0165 Loss_G: 6.1067 D(x): 1.0000 D(G(z)): 0.0163 / 0.0022
2019-01-04 09:31:20,846 INFO - [8/30][0/1] Loss_D: 0.6438 Loss_G: 21.7612 D(x): 1.0000 D(G(z)): 0.4747 / 0.0000
2019-01-04 09:31:21,056 INFO - [9/30][0/1] Loss_D: 0.0093 Loss_G: 6.8389 D(x): 1.0000 D(G(z)): 0.0092 / 0.0011
2019-01-04 09:31:21,262 INFO - [10/30][0/1] Loss_D: 0.0052 Loss_G: 6.4490 D(x): 1.0000 D(G(z)): 0.0052 / 0.0016
2019-01-04 09:31:21,464 INFO - [11/30][0/1] Loss_D: 0.0083 Loss_G: 6.1403 D(x): 1.0000 D(G(z)): 0.0083 / 0.0022
2019-01-04 09:31:21,666 INFO - [12/30][0/1] Loss_D: 0.0069 Loss_G: 5.9454 D(x): 1.0000 D(G(z)): 0.0069 / 0.0026
2019-01-04 09:31:21,872 INFO - [13/30][0/1] Loss_D: 0.0249 Loss_G: 6.3679 D(x): 1.0000 D(G(z)): 0.0246 / 0.0017
2019-01-04 09:31:22,079 INFO - [14/30][0/1] Loss_D: 0.0932 Loss_G: 10.5466 D(x): 1.0000 D(G(z)): 0.0890 / 0.0000
2019-01-04 09:31:22,280 INFO - [15/30][0/1] Loss_D: 0.3812 Loss_G: 23.3383 D(x): 1.0000 D(G(z)): 0.3169 / 0.0000
2019-01-04 09:31:22,482 INFO - [16/30][0/1] Loss_D: 0.0049 Loss_G: 8.0218 D(x): 1.0000 D(G(z)): 0.0049 / 0.0003
2019-01-04 09:31:22,684 INFO - [17/30][0/1] Loss_D: 0.0005 Loss_G: 8.9741 D(x): 1.0000 D(G(z)): 0.0005 / 0.0001
2019-01-04 09:31:22,890 INFO - [18/30][0/1] Loss_D: 0.0009 Loss_G: 7.6702 D(x): 1.0000 D(G(z)): 0.0009 / 0.0005
2019-01-04 09:31:23,096 INFO - [19/30][0/1] Loss_D: 0.0010 Loss_G: 7.2427 D(x): 1.0000 D(G(z)): 0.0010 / 0.0007
2019-01-04 09:31:23,300 INFO - [20/30][0/1] Loss_D: 0.0032 Loss_G: 6.2463 D(x): 1.0000 D(G(z)): 0.0032 / 0.0019
2019-01-04 09:31:23,503 INFO - [21/30][0/1] Loss_D: 0.0087 Loss_G: 5.7332 D(x): 1.0000 D(G(z)): 0.0087 / 0.0032
2019-01-04 09:31:23,709 INFO - [22/30][0/1] Loss_D: 0.8705 Loss_G: 27.5817 D(x): 1.0000 D(G(z)): 0.5813 / 0.0000
2019-01-04 09:31:23,910 INFO - [23/30][0/1] Loss_D: 0.0077 Loss_G: 8.3669 D(x): 1.0000 D(G(z)): 0.0077 / 0.0002
2019-01-04 09:31:24,112 INFO - [24/30][0/1] Loss_D: 0.0000 Loss_G: 15.0154 D(x): 1.0000 D(G(z)): 0.0000 / 0.0000
2019-01-04 09:31:24,318 INFO - [25/30][0/1] Loss_D: 0.0002 Loss_G: 9.5998 D(x): 1.0000 D(G(z)): 0.0002 / 0.0001
2019-01-04 09:31:24,520 INFO - [26/30][0/1] Loss_D: 0.0072 Loss_G: 6.0238 D(x): 1.0000 D(G(z)): 0.0072 / 0.0024
2019-01-04 09:31:24,741 INFO - [27/30][0/1] Loss_D: 0.0000 Loss_G: 13.7234 D(x): 1.0000 D(G(z)): 0.0000 / 0.0000
2019-01-04 09:31:24,943 INFO - [28/30][0/1] Loss_D: 0.0059 Loss_G: 5.8825 D(x): 1.0000 D(G(z)): 0.0059 / 0.0028
2019-01-04 09:31:25,145 INFO - [29/30][0/1] Loss_D: 0.1510 Loss_G: 15.4843 D(x): 1.0000 D(G(z)): 0.1401 / 0.0000
2019-01-04 09:31:27,692 INFO -
################################################################################
2019-01-04 09:31:27,692 INFO - Waiting for container to complete...
2019-01-04 09:31:27,914 INFO - Persisting outputs...
2019-01-04 09:31:28,127 INFO - Creating data module for output...
2019-01-04 09:31:28,179 INFO - Data module created for output.
2019-01-04 09:31:28,179 INFO - Persisting data in home...
2019-01-04 09:31:32,158 ERROR - Error finalizing data in home.
2019-01-04 09:31:32,158 INFO - [success] Finished execution
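If it would help to compare the on-disk size of trained_models between the 29- and 30-epoch runs, I can gather that too. Here's the small helper I'd use to total a directory locally (my own sketch, nothing FloydHub-specific):

```python
import os

def dir_size_bytes(path):
    """Total size in bytes of all regular files under `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total
```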