Known Issues and Limitations

There are some known issues you might run into while using Cloudera Machine Learning.

Service disruption when creating multiple teams (DSE-23440)

If the Admin user creates multiple teams within a short period of time, a brief service disruption may occur. To avoid this, spread the creation of multiple teams over a longer period of time.

Applications appear in failed state after upgrade (DSE-23330)

After upgrading CML from version 1.29.0 on AWS, some applications may be in a Failed state. The workaround is to restart the application.

Cannot use hashtag character in JDBC connection string

The special character # (hashtag) cannot be used in a password that is embedded in a JDBC connection string. Avoid this character in such passwords, or replace it with its URL-encoded form, %23.
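As an illustration only (the host, port, and password below are hypothetical, and the connection string format is illustrative), the password can be percent-encoded before it is embedded in the connection string:

        # Percent-encode a hypothetical password so that '#' becomes '%23'.
        ENCODED_PASSWORD=$(python3 -c 'import urllib.parse; print(urllib.parse.quote("my#pass", safe=""))')
        # Prints: jdbc:hive2://example-host:10000/default;password=my%23pass
        echo "jdbc:hive2://example-host:10000/default;password=${ENCODED_PASSWORD}"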

CML workspace installation fails

CML workspace installation with Azure NetApp Files on NFS v4.1 fails. The workaround is to use NFS v3.

Spark executors fail due to insufficient disk space

Generally, the administrator should estimate the size of the shuffle data set before provisioning the workspace, and then specify a root volume size for the compute nodes that can accommodate that estimate. For more specific guidelines, see the following resources.
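As a rough, back-of-the-envelope sketch only (all figures below are placeholder assumptions, not recommendations), the per-node root volume can be sized from the shuffle estimate plus some headroom:

        # Placeholder sizing sketch; substitute your own estimates.
        SHUFFLE_GB=500       # estimated peak shuffle data for the heaviest Spark job
        NODES=10             # compute nodes expected to run executors
        HEADROOM_GB=50       # space for the OS, container images, and logs
        echo "Root volume per node: $(( SHUFFLE_GB / NODES + HEADROOM_GB )) GB"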

Runtime Addon fails to load (DSE-16200)

A Spark runtime add-on may fail to load when a workspace is upgraded.

Solution: Try reloading the add-on. In Site Administration > Runtime/Engine, open the options menu next to the failed add-on and select Reload.

CML workspace provisioning times out

When provisioning a CML workspace, the process may time out with an error similar to Warning FailedMount or Failed to sync secret cache: timed out waiting for the condition. This can happen on AWS or Azure.

Solution: Delete the workspace and retry provisioning.

CML endpoint connectivity from DataHub and Cloudera Data Engineering (DSE-14882)

When CDP services connect to CML services, if the ML workspace is provisioned on a public subnet, traffic is routed out of the VPC first, and then routed back in. On Private Cloud CML, traffic is not routed externally.

Transparent proxy supported only on AWS (DSE-13937)

Cloudera Machine Learning, when used on the AWS public cloud, supports transparent proxies. A transparent proxy enables CML to proxy web requests without requiring any particular browser setup. In normal operation, CML requires the ability to reach several external domains. For more information, see: Outbound network access destinations.

Jupyter Notebook sessions do not time out (DSE-13741)

Jupyter Notebook sessions in legacy engine:8 through engine:13 do not exit after IDLE_MAXIMUM_MINUTES of inactivity. They will run until SESSION_MAXIMUM_MINUTES (which is seven days by default).

Workaround

You can change the configuration of your cluster to apply the fix for this issue. Change the editor command for Jupyter Notebook in every engine that uses it to the following:

NOTEBOOK_TIMEOUT_SECONDS=$(python3 -c "print(${IDLE_MAXIMUM_MINUTES}*60)")
        /usr/local/bin/jupyter notebook --no-browser --ip=127.0.0.1 --port=${CDSW_APP_PORT} \
        --NotebookApp.token= --NotebookApp.allow_remote_access=True --NotebookApp.quit_button=False \
        --log-level=ERROR --NotebookApp.shutdown_no_activity_timeout=300 \
        --MappingKernelManager.cull_idle_timeout=${NOTEBOOK_TIMEOUT_SECONDS} \
        --TerminalManager.cull_inactive_timeout=${NOTEBOOK_TIMEOUT_SECONDS} \
        --MappingKernelManager.cull_interval=60 --TerminalManager.cull_interval=60 \
        --MappingKernelManager.cull_connected=True
This does the following:
  • Kills each running notebook after IDLE_MAXIMUM_MINUTES of inactivity
  • Kills the CDSW/CML session in which Jupyter is running after 5 minutes with no notebooks

NFS performance issues on AWS EFS (DSE-12404)

CML uses NFS as the filesystem for storing application and user data. NFS performance may be much slower than expected in situations where a data scientist writes a very large number (typically in the thousands) of small files. Example tasks include: using git clone to clone a very large source repository (such as TensorFlow), or using pip to install a Python package that includes JavaScript code (such as plotly). Reduced performance is particularly common with CML on AWS (which uses EFS), but it may be seen in other environments.

Disable file upload and download (DSE-12065)

You cannot disable file upload and download when using the Jupyter Notebook.

Remove Workspace operation fails (DSE-8834)

The Remove Workspace operation fails if workspace creation is still in progress.

CML does not support modifying CPU/GPU scaling limits on provisioned ML workspaces (DSE-8407)

When provisioning a workspace, CML currently supports a maximum of 30 nodes of each type: CPU and GPU. CML does not provide a way to increase this limit for existing workspaces.

Workaround:
  1. Log in to the CDP web interface at https://console.us-west-1.cdp.cloudera.com using your corporate credentials or any other credentials that you received from your CDP administrator.
  2. Click ML Workspaces.
  3. Select the workspace whose limits you want to modify and go to its Details page.
  4. Copy the Liftie Cluster ID of the workspace. It is of the form liftie-abcdefgh.
  5. Log in to the AWS EC2 console and click Auto Scaling Groups.
  6. Paste the Liftie Cluster ID into the search filter box and press Enter.
  7. Click the auto-scaling group with a name like liftie-abcdefgh-ml-pqrstuv-xyz-cpu-workers-0-NodeGroup. Note the 'cpu-workers' segment in the middle of the name.
  8. On the Details page of this auto-scaling group, click Edit.
  9. Set Max capacity to the desired value and click Save.

Note that CML does not support lowering the maximum instances of an auto scaling group due to certain limitations in AWS.
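The same increase can also be applied with the AWS CLI instead of the EC2 console. The sketch below assumes the auto-scaling group name found in step 7 and a placeholder capacity:

        # Placeholder group name and capacity; use the values from the steps above.
        aws autoscaling update-auto-scaling-group \
            --auto-scaling-group-name liftie-abcdefgh-ml-pqrstuv-xyz-cpu-workers-0-NodeGroup \
            --max-size 40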

API does not enforce a maximum number of nodes for ML workspaces

Problem: When the API is used to provision new ML workspaces, it does not enforce an upper limit on the autoscale range.

Downscaling ML workspace nodes does not work as expected (MLX-637, MLX-638)

Problem: Downscaling nodes does not work as seamlessly as expected due to a lack of bin packing in the default Spark scheduler, and because dynamic allocation is not currently enabled. As a result, infrastructure pods, Spark driver/executor pods, and session pods are currently tagged as non-evictable using the cluster-autoscaler.kubernetes.io/safe-to-evict: "false" annotation.
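If you have kubectl access to the workspace cluster, the following sketch (which assumes kubectl and jq are available and uses a placeholder namespace) lists the pods that carry the non-evictable annotation:

        # List pods tagged as non-evictable; <workspace-namespace> is a placeholder.
        kubectl get pods -n <workspace-namespace> -o json \
          | jq -r '.items[]
                   | select(.metadata.annotations["cluster-autoscaler.kubernetes.io/safe-to-evict"] == "false")
                   | .metadata.name'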