Data Lake scale

The scale of a Data Lake affects how many workload clusters can access your data using the security and governance services configured in the Data Lake, as well as resiliency of the Data Lake. CDP supports both light duty Data Lakes and medium duty Data Lakes for AWS, Azure, and GCP.

If you want to scale an existing light duty Data Lake to a medium duty Data Lake, you can perform Data Lake scaling, which is a preview feature currently under entitlement. Data Lake scaling is not supported for GCP.

Medium duty Data Lakes incur additional cost but are required for production scenarios that need resiliency and scale. They can also service a larger number of clients concurrently.

At this time, the following Data Lake scales are supported in CDP:

  • High availability: Not available on a light duty Data Lake; available on a medium duty Data Lake.
  • Backups: Available on both.
  • Availability zones: Single availability zone on both.
  • Security: Kerberos + LDAP/AD on both.
  • Scale: About 5 concurrent workload clusters (light duty); about 20 concurrent workload clusters (medium duty).
  • Node count:
    Light duty: 1 master node running SDX services and 1 IDBroker node running authentication services.
    Medium duty: 2 master nodes running core services in HA-enabled mode, with replication for resilience and scale; 2 IDBroker nodes running authentication services; 3 core nodes running HDFS, Kafka, Solr, and HBase; 2 gateway nodes running services with API/UI access; and 1 auxiliary node for services that cannot run in HA mode.

  • Fault tolerance: On a light duty Data Lake, services are unavailable during cluster node repair. On a medium duty Data Lake, availability depends on the node being repaired: with the exception of the gateway and auxiliary nodes, each node group can typically survive a single node failure without affecting workloads or UI/API access.

    In the event of a gateway node failure on a medium duty AWS or Azure Data Lake, the load balancer seamlessly routes traffic to the other gateway node. Because Cloudera Manager runs on only one gateway node (either node 0 or node 1), if the gateway node hosting the CM server fails, CM is unavailable, but UI and API calls that bypass CM are routed by the load balancer to the healthy gateway node. If the other gateway node fails, CM remains available, and the load balancer seamlessly routes to the healthy gateway node.

    On a GCP medium duty Data Lake, a gateway node failure affects UI/API access.

  • Cloud-based load balancer: Not applicable on a light duty Data Lake, since only one instance of each service runs. On a medium duty Data Lake, a network-based load balancer fronts the UI and API services; available for AWS and Azure only.
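The load-balancing behavior described above can be sketched as follows. This is a minimal illustration, not the actual CDP implementation: the node names, health model, and round-robin policy are all assumptions made for the example.

```python
# Sketch: a network load balancer in front of two gateway nodes forwards
# UI/API traffic only to healthy nodes, so a single gateway failure does
# not interrupt access. Illustrative only -- not CDP internals.

class GatewayNode:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


class LoadBalancer:
    """Round-robin over gateway nodes, skipping failed ones."""

    def __init__(self, nodes):
        self.nodes = nodes
        self._next = 0

    def route(self, request):
        for _ in range(len(self.nodes)):
            node = self.nodes[self._next]
            self._next = (self._next + 1) % len(self.nodes)
            if node.healthy:
                return node.handle(request)
        raise RuntimeError("no healthy gateway nodes")


gw0 = GatewayNode("gateway-node-0")
gw1 = GatewayNode("gateway-node-1")
lb = LoadBalancer([gw0, gw1])

print(lb.route("GET /atlas"))   # served by gateway-node-0
gw0.healthy = False             # simulate a gateway node failure
print(lb.route("GET /atlas"))   # transparently served by gateway-node-1
```

When one gateway node fails, the same request is transparently served by the other node, which mirrors the seamless routing described for AWS and Azure medium duty Data Lakes; the GCP shape has no such load balancer, so this rerouting does not happen there.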

Light Duty Data Lakes

If the master node of a light duty Data Lake fails, compute engine clients such as Hive, Impala, and Spark are partially resilient due to caching, but new queries cannot run without updated policy information, and audit information can also be affected. Because the Knox gateway also runs on the master node, clients with UI access (such as the Ranger Admin UI and the Atlas UI) or API access are unavailable in the event of a master node failure. In a light duty Data Lake, the cloud-based load balancer exists for networking purposes and has no effect on scale.

If the IDBroker node fails, compute-engine clients are affected because cloud access tokens cannot be verified. Clients with UI/API access remain available.
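The "partially resilient due to caching" behavior above can be illustrated with a small sketch: a client that caches authorization policies keeps serving queries it has already seen, while queries that need fresh policy information fail when the master node is down. All class, method, and policy names here are illustrative assumptions, not CDP internals.

```python
# Sketch: compute-engine clients cache recently fetched policies locally,
# so cached work survives a master node outage while new queries fail.
# Illustrative only -- not the actual Ranger/CDP policy mechanism.

class MasterNode:
    def __init__(self):
        self.up = True
        self.policies = {"sales_db": "allow"}

    def fetch_policy(self, resource):
        if not self.up:
            raise ConnectionError("master node unavailable")
        return self.policies.get(resource, "deny")


class ComputeClient:
    def __init__(self, master):
        self.master = master
        self.cache = {}

    def run_query(self, resource):
        if resource in self.cache:                   # served from cache
            return f"query on {resource}: {self.cache[resource]}"
        policy = self.master.fetch_policy(resource)  # needs the master
        self.cache[resource] = policy
        return f"query on {resource}: {policy}"


master = MasterNode()
client = ComputeClient(master)
client.run_query("sales_db")        # populates the cache while master is up

master.up = False                   # simulate master node failure
print(client.run_query("sales_db"))     # still works: policy is cached
try:
    client.run_query("hr_db")           # fails: no cached policy
except ConnectionError as e:
    print("new query failed:", e)
```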

Medium Duty Data Lakes (AWS and Azure)

Medium duty Data Lakes for AWS and Azure provide failure resilience for compute engine clients such as Hive, Impala, and Spark, as well as for clients with UI and API access, such as the Ranger Admin UI and the Atlas UI.

Note that while CM typically runs on gateway node 0, it can be installed on either gateway node 0 or gateway node 1. To see which node has CM installed, check the Hardware tab of the Data Lake for the gateway node marked "CM Server."

Failures in a medium duty Data Lake impact services as follows:

  • Master node failure. Compute engine clients (for example, Hive, Impala, and Spark) are resilient to the failure, due to fallback high availability with smart client failover.
  • IDBroker node failure. Both compute engine clients that use standard data connectors (Hive, Impala, Spark) and compute engine clients that use custom data connectors (for example, Hue) are resilient to the failure.
  • Gateway node failure. Load-balanced UI and API access are available without interruption.
  • Core node failure. Compute engine clients (for example, Hive, Impala, and Spark) are resilient to the failure, due to fallback high availability with smart client failover.
  • Auxiliary node failure. Ranger user and tag sync are unavailable.
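The "fallback high availability with smart client failover" mentioned in the bullets above can be sketched roughly as follows: the client tries the last known-good master first and fails over to the standby when it is unreachable. The names, promotion policy, and retry mechanics are illustrative assumptions, not the actual client implementation.

```python
# Sketch: a smart client keeps an ordered list of master endpoints,
# promotes whichever one last worked, and fails over on connection
# errors. Illustrative only -- not CDP's actual failover protocol.

class MasterService:
    def __init__(self, name, up=True):
        self.name = name
        self.up = up

    def execute(self, query):
        if not self.up:
            raise ConnectionError(f"{self.name} unreachable")
        return f"{self.name} executed: {query}"


class SmartClient:
    """Tries masters in order; promotes the one that worked."""

    def __init__(self, masters):
        self.masters = list(masters)

    def execute(self, query):
        last_error = None
        for master in list(self.masters):
            try:
                result = master.execute(query)
                # Promote the working master so later calls try it first.
                self.masters.remove(master)
                self.masters.insert(0, master)
                return result
            except ConnectionError as e:
                last_error = e
        raise RuntimeError("all masters down") from last_error


active = MasterService("master-0")
standby = MasterService("master-1")
client = SmartClient([active, standby])

print(client.execute("SELECT 1"))  # served by master-0
active.up = False                  # simulate master node failure
print(client.execute("SELECT 1"))  # smart failover to master-1
```

The same fallback pattern covers both the master node and core node bullets: as long as one replica of a service is reachable, the client's next attempt lands on it without the workload failing.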

Medium Duty Data Lakes (GCP)

Medium duty Data Lakes for GCP provide failure resilience for compute engine clients such as Hive, Impala, and Spark. The medium duty shape for GCP does not include a cloud-based load balancer for the gateway nodes, which makes UI and API access susceptible to failure in the event of a gateway node 0 failure.

Failures in a medium duty Data Lake impact services as follows:

  • Master node failure. Compute engine clients (for example, Hive, Impala, and Spark) are resilient to the failure, due to fallback high availability with smart client failover.
  • IDBroker node failure. Both compute engine clients that use standard data connectors (Hive, Impala, Spark) and compute engine clients that use custom data connectors (for example, Hue) are resilient to the failure.
  • Gateway node failure. UI and API access for services such as Cloudera Manager and Atlas are unavailable in the event of a CM Server gateway node failure, since there is no load balancer to reroute traffic.
  • Core node failure. Compute engine clients (for example, Hive, Impala, and Spark) are resilient to the failure, due to fallback high availability with smart client failover.
  • Auxiliary node failure. Ranger user and tag sync are unavailable.