Common problems
Operator crashes on startup with OOMKilled ¶
On very large Kubernetes clusters with many hundreds of resources (pods, secrets, config maps, and so on), the operator may fail to start, with its pod terminated with an OOMKilled reason:
kubectl -n elastic-system \
get pods -o=jsonpath='{.items[].status.containerStatuses}' | jq
[
  {
    "containerID": "containerd://...",
    "image": "docker.elastic.co/eck/eck-operator:2.16.1",
    "imageID": "docker.elastic.co/eck/eck-operator@sha256:...",
    "lastState": {
      "terminated": {
        "containerID": "containerd://...",
        "exitCode": 137,
        "finishedAt": "2022-07-04T09:47:02Z",
        "reason": "OOMKilled",
        "startedAt": "2022-07-04T09:46:43Z"
      }
    },
    "name": "manager",
    "ready": false,
    "restartCount": 2,
    "started": false,
    "state": {
      "waiting": {
        "message": "back-off 20s restarting failed container=manager pod=elastic-operator-0_elastic-system(57de3efd-57e0-4c1e-8151-72b0ac4d6b14)",
        "reason": "CrashLoopBackOff"
      }
    }
  }
]
This is an issue with the controller-runtime framework on top of which the operator is built. Even though the operator is only interested in the resources created by itself, the framework code needs to gather information about all relevant resources in the Kubernetes cluster in order to provide the filtered view of cluster state required by the operator. On very large clusters, this information gathering can use up a lot of memory and exceed the default resource limit defined for the operator pod.
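To confirm that the operator is actually approaching its memory limit before changing it, you can check the pod's live memory usage. This sketch assumes the metrics-server is installed in your cluster:

```shell
# Requires metrics-server; shows current CPU and memory usage of the operator pod.
kubectl -n elastic-system top pod elastic-operator-0
```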
The default memory limit for the operator pod is set to 1 Gi. You can increase (or decrease) this limit to a value suited to your cluster as follows:
kubectl patch sts elastic-operator -n elastic-system -p '{"spec":{"template":{"spec":{"containers":[{"name":"manager", "resources":{"limits":{"memory":"2Gi"}}}]}}}}'
Note
Set limits (spec.containers[].resources.limits) that match requests (spec.containers[].resources.requests) to prevent the operator's Pod from being terminated during node-pressure eviction.
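Following the note above, a sketch of a single patch that raises the limit and sets a matching request (the 2Gi value is illustrative; adjust it to your cluster):

```shell
# Set matching memory request and limit on the operator StatefulSet.
kubectl patch sts elastic-operator -n elastic-system -p '{
  "spec": {"template": {"spec": {"containers": [{
    "name": "manager",
    "resources": {
      "requests": {"memory": "2Gi"},
      "limits":   {"memory": "2Gi"}
    }
  }]}}}
}'
```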
Timeout when submitting a resource manifest ¶
When submitting an ECK resource manifest, you may encounter an error message similar to the following:
Error from server (Timeout): error when creating "elasticsearch.yaml": Timeout: request did not complete within requested timeout 30s
This error is usually an indication of a problem communicating with the validating webhook. If you are running ECK on a private Google Kubernetes Engine (GKE) cluster, you may need to add a firewall rule allowing port 9443 from the API server. Another possible cause for failure is if a strict network policy is in effect. Refer to the webhook troubleshooting documentation for more details and workarounds.
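On a private GKE cluster, a firewall rule along the following lines usually unblocks the webhook. The rule name, network, control-plane CIDR, and node tag below are placeholders you must replace with your cluster's values:

```shell
# Allow the GKE control plane to reach the validating webhook on port 9443.
gcloud compute firewall-rules create allow-eck-webhook \
  --network <cluster-network> \
  --direction INGRESS \
  --source-ranges <master-ipv4-cidr> \
  --target-tags <node-tag> \
  --allow tcp:9443
```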
Copying secrets with Owner References ¶
Copying the Elasticsearch Secrets generated by ECK (for instance, the certificate authority or the elastic user) wholesale into another namespace can trigger a Kubernetes bug that deletes all of the Elasticsearch-related resources, including the data volumes. Since ECK 1.3.1, the OwnerReference is removed from both the Elasticsearch Secrets containing public certificates and the Secret holding the elastic user credentials, as these are the Secrets most likely to be copied. If Secrets were copied into other namespaces before ECK 1.3.1, make sure you manually remove the OwnerReference, as those Secrets might still be affected even after ECK has been upgraded.
For example, a source secret might be:
kubectl get secret quickstart-es-elastic-user -o yaml
apiVersion: v1
data:
  elastic: NGw2Q2REMjgwajZrMVRRS0hxUDVUUTU0
kind: Secret
metadata:
  creationTimestamp: "2020-06-09T19:11:41Z"
  labels:
    common.k8s.elastic.co/type: elasticsearch
    eck.k8s.elastic.co/credentials: "true"
    elasticsearch.k8s.elastic.co/cluster-name: quickstart
  name: quickstart-es-elastic-user
  namespace: default
  ownerReferences:
  - apiVersion: elasticsearch.k8s.elastic.co/v1
    blockOwnerDeletion: true
    controller: true
    kind: Elasticsearch
    name: quickstart
    uid: c7a9b436-aa07-4341-a2cc-b33b3dfcbe29
  resourceVersion: "13048277"
  selfLink: /api/v1/namespaces/default/secrets/quickstart-es-elastic-user
  uid: 04cdf334-77d3-4de6-a2e8-7a2b23366a27
type: Opaque
To copy it to a different namespace, strip the metadata.ownerReferences field as well as the object-specific data:
apiVersion: v1
data:
  elastic: NGw2Q2REMjgwajZrMVRRS0hxUDVUUTU0
kind: Secret
metadata:
  labels:
    common.k8s.elastic.co/type: elasticsearch
    eck.k8s.elastic.co/credentials: "true"
    elasticsearch.k8s.elastic.co/cluster-name: quickstart
  name: quickstart-es-elastic-user
  namespace: default
type: Opaque
Failure to do so can cause data loss.
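Stripping those fields can also be done mechanically. The jq filter below is a sketch (it assumes jq is installed) that you can splice between `kubectl get -o json` and `kubectl apply` when copying the Secret:

```shell
# Filter that drops ownerReferences plus the object-specific metadata fields.
filter='del(.metadata.ownerReferences,
            .metadata.uid,
            .metadata.resourceVersion,
            .metadata.selfLink,
            .metadata.creationTimestamp)'
# Demonstrated here on an inline sample; in practice pipe the real Secret through it:
#   kubectl get secret quickstart-es-elastic-user -o json | jq "$filter" \
#     | kubectl apply -n <target-namespace> -f -
echo '{"apiVersion":"v1","kind":"Secret","metadata":{"name":"quickstart-es-elastic-user","uid":"04cdf334","resourceVersion":"13048277","ownerReferences":[{"kind":"Elasticsearch","name":"quickstart"}]}}' \
  | jq "$filter"
```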
Scale down of Elasticsearch master-eligible Pods seems stuck ¶
If a master-eligible Elasticsearch Pod was never successfully scheduled and the Elasticsearch cluster is running version 7.8 or earlier, ECK may fail to scale down the Pod. To find out whether you are affected, check if the Pod in question is pending:
> kubectl get pods
pod/<cluster-name>-es-<nodeset>-1 0/1 Pending 0 26m <none> <none>
Check the operator logs for an error similar to:
"unable to add to voting_config_exclusions: 400 Bad Request: add voting config exclusions request for [<cluster-name>-es-<nodeset>-1] matched no master-eligible nodes",
To work around this issue, scale down the underlying StatefulSet manually. First, identify the affected StatefulSet and the number of Pods that are ready (denoted m in this example):
> kubectl get sts -l elasticsearch.k8s.elastic.co/cluster-name=<cluster-name>
NAME READY AGE
<cluster-name>-es-<nodeset> m/n 44h
Then, scale down the StatefulSet to the correct size m, removing the pending Pod:
> kubectl scale --replicas=m sts/<cluster-name>-es-<nodeset>
Warning
Do not use this method to scale down Pods that have already joined the Elasticsearch cluster, as it sidesteps the additional data loss protections that ECK applies.
Pods are not replaced after a configuration update ¶
An update to an existing Elasticsearch cluster configuration can fail if the operator is unable to replace the Pods of the cluster to apply the required changes.
A key indicator is the Phase of the Elasticsearch resource staying in the ApplyingChanges state for too long:
kubectl get es
NAME HEALTH NODES VERSION PHASE AGE
elasticsearch-sample yellow 2 7.9.2 ApplyingChanges 36m
Possible causes include:
- The Elasticsearch cluster is not healthy:

kubectl get elasticsearch
NAME                                                              HEALTH   NODES   VERSION   PHASE   AGE
elasticsearch.elasticsearch.k8s.elastic.co/elasticsearch-sample   yellow   1       7.9.2     Ready   3m50s
- Scheduling issues. The scheduling fails with the following message:

kubectl get events --sort-by='{.lastTimestamp}' | tail
LAST SEEN   TYPE      REASON             OBJECT                        MESSAGE
10s         Warning   FailedScheduling   pod/quickstart-es-default-2   0/3 nodes are available: 3 Insufficient memory.

kubectl get pod elasticsearch-sample-es-default-2 -o go-template="{{.status}}"
map[conditions:[map[lastProbeTime:<nil> lastTransitionTime:2020-12-07T09:31:06Z message:0/3 nodes are available: 3 Insufficient cpu. reason:Unschedulable status:False type:PodScheduled]] phase:Pending qosClass:Guaranteed]
- The operator is not able to restart some nodes:

kubectl -n elastic-system logs statefulset.apps/elastic-operator | tail
{"log.level":"info","@timestamp":"2020-11-19T17:34:48.769Z","log.logger":"driver","message":"Cannot restart some nodes for upgrade at this time","service.version":"1.3.0+6db1914b","service.type":"eck","ecs.version":"1.4.0","namespace":"default","es_name":"quickstart","failed_predicates":{"do_not_restart_healthy_node_if_MaxUnavailable_reached":["quickstart-es-default-1","quickstart-es-default-0"]}}

- A Pod is stuck in a Pending status:

kubectl get pods
NAME                      READY   STATUS    RESTARTS   AGE
quickstart-es-default-0   1/1     Running   0          146m
quickstart-es-default-1   1/1     Running   0          146m
quickstart-es-default-2   0/1     Pending   0          134m
For more information, check Troubleshooting methods.
ECK operator upgrade stays pending when using OLM ¶
When using Operator Lifecycle Manager (OLM) to install and upgrade the ECK operator, an upgrade of ECK will not complete on older versions of OLM. This is due to an issue in OLM itself that is fixed in OLM version 0.16.0 and later. OLM is also used behind the scenes when you install ECK as a Red Hat Certified Operator on OpenShift or as a community operator through operatorhub.io.
> oc get csv
NAME DISPLAY VERSION REPLACES PHASE
elastic-cloud-eck.v1.3.1 Elasticsearch (ECK) Operator 1.3.1 elastic-cloud-eck.v1.3.0 Replacing
elastic-cloud-eck.v1.4.0 Elasticsearch (ECK) Operator 1.4.0 elastic-cloud-eck.v1.3.1 Pending
If you are using one of the affected versions of OLM and upgrading OLM to a newer version is not possible, ECK can still be upgraded by uninstalling and reinstalling it. This can be done by removing the Subscription and both ClusterServiceVersion resources and adding them again. On OpenShift, the same workaround can be performed in the UI by clicking "Uninstall Operator" and then reinstalling it through OperatorHub.
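As a command-line sketch of that workaround, using the CSV names from the example output above (the subscription name and namespace are assumptions; check them with `oc get subscription -A`):

```shell
# Remove the Subscription and both ClusterServiceVersions.
oc delete subscription elastic-cloud-eck -n openshift-operators
oc delete csv elastic-cloud-eck.v1.3.1 elastic-cloud-eck.v1.4.0 -n openshift-operators
# Then reinstall, for example by re-applying your original Subscription
# manifest or through the OperatorHub UI.
```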
If you upgraded Elasticsearch to the wrong version ¶
If you accidentally upgrade one of your Elasticsearch clusters to a version that does not exist, or to a version that cannot be reached by a direct upgrade from your currently deployed version, a validation prevents you from going back to the previous version. This validation exists because ECK does not allow downgrades: Elasticsearch does not support them, and once the Elasticsearch data directory has been upgraded there is no way back to the old version without a snapshot restore.
These two upgrade scenarios, however, are exceptions because Elasticsearch never started up successfully. If you annotate the Elasticsearch resource with eck.k8s.elastic.co/disable-downgrade-validation=true, ECK allows you to go back to the old version at your own risk. If you also attempted an upgrade of other related Elastic Stack applications at the same time, you can use the same annotation to revert them. Remove the annotation afterwards to prevent accidental downgrades and reduced availability.
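For example, using the quickstart cluster name as a placeholder, the annotation can be applied and later removed as follows:

```shell
# Allow the downgrade (at your own risk).
kubectl annotate elasticsearch quickstart \
  eck.k8s.elastic.co/disable-downgrade-validation=true
# After reverting spec.version, remove the annotation again
# (the trailing "-" deletes an annotation).
kubectl annotate elasticsearch quickstart \
  eck.k8s.elastic.co/disable-downgrade-validation-
```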
Reconfigure stack config policy based role mappings after an upgrade to 8.15.3 from 8.14.x or 8.15.x ¶
You have role mappings defined in a StackConfigPolicy, and you upgraded from 8.14.x or 8.15.x to 8.15.3.
Examples:
- 8.14.2 → 8.15.2 → 8.15.3
- 8.14.2 → 8.15.3
- 8.15.2 → 8.15.3
The best option is to upgrade to 8.16.0, which fixes the problem automatically. If this is not possible and you are stuck on 8.15.3, you have to perform two manual steps to correctly reconfigure the role mappings, because a bug caused them to be duplicated.
- Force reload the StackConfigPolicy configuration
Force reload the StackConfigPolicy configuration containing the role mappings definition by adding metadata to any of the mappings:
apiVersion: stackconfigpolicy.k8s.elastic.co/v1alpha1
kind: StackConfigPolicy
spec:
  elasticsearch:
    securityRoleMappings:
      <roleName>:
        metadata:
          force_reload: anything # add dummy metadata to force reload the config
Check that the role mapping is now in the cluster state:
GET /_cluster/state/metadata?filter_path=metadata.role_mappings.role_mappings
{"metadata":{"role_mappings":{"role_mappings":[{"enabled":true,"roles":["superuser"],"rules":{"all":[{"field":{"realm.name":"oidc1"}},{"field":{"username":"*"}}]},"metadata":{"force_reload":"dummy"}}]}}}
- Remove duplicated role mappings exposed via the API
Start by listing all the role mappings defined in your StackConfigPolicy:
kubectl get scp <scpName> -o json | jq '.spec.elasticsearch.securityRoleMappings | to_entries[].key' -r
<roleName>
Delete each role mapping:
DELETE /_security/role_mapping/<roleName>
{"found": true}
Check that the role mapping was deleted:
GET /_security/role_mapping/<roleName>
{}
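The per-mapping DELETE calls can be scripted. The sketch below is a hypothetical helper: it assumes your cluster is named quickstart, that jq is installed, and that a port-forward to the cluster is open on localhost:9200; it reads the elastic user's password from the Secret shown earlier and deletes every role mapping listed in the StackConfigPolicy:

```shell
# Read the elastic user's password from its Secret.
PASSWORD=$(kubectl get secret quickstart-es-elastic-user \
  -o go-template='{{.data.elastic | base64decode}}')
# Delete each role mapping defined in the StackConfigPolicy.
for name in $(kubectl get scp <scpName> -o json \
    | jq -r '.spec.elasticsearch.securityRoleMappings | to_entries[].key'); do
  curl -sk -u "elastic:$PASSWORD" \
    -X DELETE "https://localhost:9200/_security/role_mapping/$name"
done
```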