Ephemeral storage

Ephemeral storage is scratch space that a CML session, job, application, or model can use. Configuring it helps schedule CML pods more effectively and provides a safety valve that prevents runaway computations from consuming all of the available scratch space on the node.
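
For example, a long-running computation can track how much it has written to its scratch directory so that it stays under the pod's ephemeral storage limit. The following is a minimal Python sketch; the /tmp path is an assumption, so substitute wherever your code writes temporary files.

    # Sum the sizes of files under a scratch directory so a job can keep
    # its own usage under the pod's ephemeral storage limit.
    import os

    def scratch_usage_bytes(path="/tmp"):  # "/tmp" is an assumed scratch path
        total = 0
        for root, _dirs, files in os.walk(path):
            for name in files:
                try:
                    total += os.path.getsize(os.path.join(root, name))
                except OSError:
                    pass  # a temporary file may disappear mid-walk
        return total

    print(f"Scratch usage: {scratch_usage_bytes() / 1e9:.2f} GB")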

By default, each user pod in CML requests 0 GB of scratch space and is allowed to use up to 10 GB. These settings can be applied site-wide or overridden on a per-project basis.
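
These values surface as the Kubernetes ephemeral-storage resource request and limit on each user pod. The sketch below, which assumes cluster access and a hypothetical namespace name, shows one way to confirm the values a pod actually received using the kubernetes Python client.

    # List the ephemeral-storage request and limit on each pod in a namespace.
    # Requires the kubernetes Python client and a kubeconfig with cluster access.
    from kubernetes import client, config

    config.load_kube_config()          # or config.load_incluster_config()
    v1 = client.CoreV1Api()

    NAMESPACE = "cml-workspace"        # hypothetical; use your workspace namespace

    for pod in v1.list_namespaced_pod(NAMESPACE).items:
        for c in pod.spec.containers:
            requests = c.resources.requests or {}
            limits = c.resources.limits or {}
            print(pod.metadata.name,
                  "request:", requests.get("ephemeral-storage", "0"),
                  "limit:", limits.get("ephemeral-storage", "unset"))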

Change Site-wide Ephemeral Storage Configuration

In Site Administration > Settings > Advanced, use the fields to change the ephemeral storage request (minimum) and limit (maximum).

Override Site-wide Ephemeral Storage Configuration

You can override the site-wide ephemeral storage settings on a per-project basis. Open your project, then click Project Settings > Advanced and adjust the ephemeral storage parameters.

AWS Known Issues

There is a known issue with the cluster autoscaler that prevents a node group from scaling from 0 to 1 node when a non-zero Ephemeral Storage Request is set. This affects both the CPU and GPU node groups of the CML workspace. When this happens, the autoscaler reports the following error:

pod didn't trigger scale-up: 1 Insufficient ephemeral-storage

This occurs even though the nodes in the CML autoscaling groups have sufficient ephemeral storage space in their group templates. See the related GitHub issue for details; although that issue is closed, the problem still persists.

The issue only affects node groups with a [0, x] autoscaling range, that is, groups that can scale down to zero nodes.
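
One way to confirm that you are hitting this issue is to look at Pending pods and their scheduling events. The sketch below, again assuming cluster access and a hypothetical namespace name, prints any event messages that mention ephemeral-storage.

    # Print scheduling events for Pending pods that mention ephemeral-storage.
    from kubernetes import client, config

    config.load_kube_config()
    v1 = client.CoreV1Api()

    NAMESPACE = "cml-workspace"  # hypothetical; use your workspace namespace

    pending = v1.list_namespaced_pod(
        NAMESPACE, field_selector="status.phase=Pending").items
    for pod in pending:
        events = v1.list_namespaced_event(
            NAMESPACE,
            field_selector=f"involvedObject.name={pod.metadata.name}").items
        for ev in events:
            if "ephemeral-storage" in (ev.message or ""):
                print(pod.metadata.name, "->", ev.message)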

As a workaround, set the Ephemeral Storage Request value to 0 in both the site-wide and project settings.