ElasticSearch status in red after full restart of vIDM cluster

After completely shutting down a vIDM cluster and starting it up again, I noticed that the ElasticSearch status on all three nodes was red. You can see this in the System Diagnosis Dashboard.

Initially I thought it was just a matter of time before everything was properly synced, but unfortunately the status remained red.

The VMware documentation contains information on how to troubleshoot ElasticSearch.

To check the health state via command line, open an SSH session to one of the vIDM nodes and enter this command:

curl 'http://localhost:9200/_cluster/health?pretty'

which gives the following output:

curl 'http://localhost:9200/_cluster/health?pretty'
{
  "cluster_name" : "horizon",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 28,
  "active_shards" : 56,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 14,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 80.0
}
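If you suspect the cluster only needs time to recover, you can keep an eye on the status by polling this endpoint, for example like this (assuming the watch utility is present on the appliance):

watch -n 30 "curl -s 'http://localhost:9200/_cluster/health?pretty'"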

A restart of the ElasticSearch service did not change anything, and no errors could be found in the logs.
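For reference, on the vIDM appliance the restart can typically be done with the following command (the exact service name may differ between vIDM versions):

service elasticsearch restart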

Looking more closely at the output of the health check command, however, I could see that there were unassigned shards:

"unassigned_shards" : 14,

To understand the meaning of shards, please see: https://www.elastic.co/guide/en/elasticsearch/reference/6.2/_basic_concepts.html

To get more detailed info on the shards, execute this command on one of the vIDM nodes:

curl 'localhost:9200/_cat/shards?v'

From the output of this command I could see that the shards created after the startup of the cluster had been assigned successfully, and that the unassigned shards all dated from the moment the cluster was shut down.

This indicated that the ElasticSearch cluster was running without any errors, but some older entries were stuck in an unassigned state.

The output below shows the unassigned shards:

curl 'localhost:9200/_cat/shards?v' | grep UNASS

% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 5822 100 5822 0 0 844k 0 --:--:-- --:--:-- --:--:-- 947k
searchentities 0 p UNASSIGNED
searchentities 0 r UNASSIGNED

(There were 14 unassigned shards, but for clarity I removed the others.)
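Depending on the ElasticSearch version bundled with vIDM, you can also ask the _cat/shards API why a shard is unassigned; the unassigned.reason column is available from ElasticSearch 1.7 onwards:

curl 'localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason' | grep UNASSIGNED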

To resolve this, you have to manually reallocate the shards.

Use the following command for this:

curl -XPOST 'localhost:9200/_cluster/reroute' -d '{
  "commands": [{
    "allocate": {
      "index": "one_of_the_indexes",
      "shard": shard_number,
      "node": "one_of_nodes",
      "allow_primary": 1
    }
  }]
}'
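A side note: the ElasticSearch version embedded in vIDM accepts the command as shown above, but if you ever run it against ElasticSearch 6.0 or newer, requests with a body also need an explicit Content-Type header:

curl -XPOST 'localhost:9200/_cluster/reroute' -H 'Content-Type: application/json' -d '{ ... }'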

To find the names of the nodes, execute this command:

curl 'localhost:9200/_cat/nodes?v'

The names of the nodes are shown in the last column.
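To give an idea of the layout (the hostname and values below are made up; only the node name matches the one used further on), the output looks similar to this:

host    ip        heap.percent ram.percent load node.role master name
vidm-01 10.0.0.11 42           81          0.10 d         *      Ringleader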

In my case the command to reallocate the shards looks like this:

curl -XPOST 'localhost:9200/_cluster/reroute' -d '{"commands":[{"allocate":{"index": "searchentities","shard": 0,"node": "Ringleader","allow_primary": 1}}]}'

Listing the unassigned shards again, you can see that "searchentities" has disappeared from the unassigned list.

curl 'localhost:9200/_cat/shards?v' | grep UNASS

% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 5822 100 5822 0 0 844k 0 --:--:-- --:--:-- --:--:-- 947k


and the searchentities shards are now assigned and started:

curl 'localhost:9200/_cat/shards?v' | grep "searchentities"
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 2240 100 2240 0 0 286k 0 --:--:-- --:--:-- --:--:-- 364k
searchentities 0 p STARTED
searchentities 0 r STARTED

If you have multiple unassigned shards, run the reallocation command for each of them, or script it as in the sketch below.
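This is a minimal sketch that reads the unassigned shards from _cat/shards and fires a reroute for each of them; it assumes "Ringleader" as the target node name, so adjust it to your environment and test it on a single shard first:

# reroute every unassigned shard to the node in $NODE (test on one shard first!)
NODE="Ringleader"
curl -s 'localhost:9200/_cat/shards' | grep UNASSIGNED | while read index shard prirep state; do
  curl -XPOST 'localhost:9200/_cluster/reroute' \
    -d "{\"commands\":[{\"allocate\":{\"index\":\"$index\",\"shard\":$shard,\"node\":\"$NODE\",\"allow_primary\":1}}]}"
done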

When all shards have been reallocated, the ElasticSearch cluster becomes green again 🙂

SSL handshake errors when load balancing IDM 19.03 with NetScaler

Load balancing VMware Identity Manager with NetScaler is quite easy to set up. Carl Stalhood has written an excellent blog with step-by-step instructions.

However, when implementing load balancing of vIDM 19.03 appliances with NetScaler at a customer, I was not able to establish a secure connection from the NetScaler to the vIDM appliances. The virtual servers stayed in the DOWN state and the vIDM URL was not accessible.

First I checked the obvious things, such as the firewall, to rule out blocked ports. This was all OK.

Next up: certificates. The certificates on the NetScaler and the vIDM appliances were valid, and the intermediate and root certificates were also uploaded and chained.

TLS 1.0 is disabled on Identity Manager 2.6 and newer. The NetScaler version was 12.0-56.20_nc_32, which supports TLS 1.2 (see the release notes).

The next step was creating a trace log on the NetScaler for deeper investigation.
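For reference, a trace can be taken from the NetScaler CLI with the nstrace commands (the resulting capture files can be opened in Wireshark):

start nstrace -size 0
stop nstrace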

From the trace we can see that communication is established with TLS 1.2.

The error we get is "Encrypted Alert 21", which means that decryption fails. (https://tools.ietf.org/html/rfc5246#section-7.2)
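To test the TLS handshake against a vIDM appliance directly, bypassing the NetScaler, you can use OpenSSL from any machine that can reach the appliance (vidm.example.local is a placeholder for your own FQDN):

openssl s_client -connect vidm.example.local:443 -tls1_2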

To get around this error, we decided to set up a test NetScaler VPX appliance with the same version as the production NetScaler. As expected, the SSL connection failed there as well. After upgrading the NetScaler VPX appliance to the latest release (version 12.1-52.15_nc_64), all SSL errors disappeared, the virtual servers were "UP", and we were able to establish an SSL handshake with the vIDM appliances.

Kudos to my colleague @Vincent_VTH for helping me investigate and fix this issue.