Replace / remove a node

  1. Does k8ssandra provide any functions to replace / remove a specific Cassandra node?

  2. K8ssandraTask runs a command specified by K8ssandraTask.spec.template.jobs.command. Is that an arbitrary string, e.g. “/usr/bin/ls”?

Thanks.

Hi @vs123,

Yes, you can replace a broken node using a ReplaceNode K8ssandraTask like this:

apiVersion: control.k8ssandra.io/v1alpha1
kind: K8ssandraTask
metadata:
  name: test-replacenode1
  namespace: <namespace>
spec:
  cluster:
    name: <cluster name>
    namespace: <namespace>
  datacenters:
    - dc1
  template:
    jobs:
      - args:
          pod_name: test-dc1-r1-sts-2
        command: replacenode
        name: ''

It will rebootstrap the node with a new PV, keeping token ownership and streaming data from the remaining replicas.

Jobs can’t run arbitrary commands, just a fixed set.
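
For anyone following along, here is a minimal sketch of applying a manifest like the one above and then listing the resulting tasks. The function name is hypothetical, and the manifest file and namespace are placeholders for your own values.

```shell
# Hypothetical helper: apply a K8ssandraTask manifest, then list the
# K8ssandraTask and the per-datacenter CassandraTask resources in the namespace.
# $1 = manifest file, $2 = namespace (both placeholders for your own values).
run_replace_task() {
  manifest="$1"
  ns="$2"
  kubectl apply -f "$manifest" &&
    kubectl get k8ssandratasks,cassandratasks -n "$ns"
}
```

It could be called as `run_replace_task replace.yml k8ssandra-operator`, assuming the manifest was saved under that file name.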

OK, I tried that. I don’t see any changes in the cluster.

k8ssandra-operator-cass-operator (1.16.0) logs show:

2024-03-12T21:44:47.828Z ERROR Reconciler error {"controller": "cassandratask", "controllerGroup": "control.k8ssandra.io", "controllerKind": "CassandraTask", "CassandraTask": {"name":"replace-node1-dc1","namespace":"k8ssandra-operator"}, "namespace": "k8ssandra-operator", "name": "replace-node1-dc1", "reconcileID": "55d72e4a-593f-4856-88a1-ad54312a8658", "error": "CassandraTask.control.k8ssandra.io \"replace-node1-dc1\" is invalid: [conditions[0].message: Required value, conditions[0].reason: Required value]"}

“kubectl describe K8ssandraTask -A” outputs for this task:

Status:
  Conditions:
    Last Transition Time:  2024-03-12T21:22:56Z
    Message:               
    Reason:                Running
    Status:                False
    Type:                  Running
    Last Transition Time:  2024-03-12T21:22:56Z
    Message:               
    Reason:                Failed
    Status:                False
    Type:                  Failed
  Datacenters:
    dc1:
Events:
  Type    Reason               Age   From                      Message
  ----    ------               ----  ----                      -------
  Normal  CreateCassandraTask  34m   k8ssandratask-controller  Created CassandraTask k8ssandra-operator.replace-node1-dc1

I tried running a similar command later; now the cass-operator logs show:

2024-03-12T23:32:58.885Z DEBUG this job isn't allowed to run due to ConcurrencyPolicy restrictions {"controller": "cassandratask", "controllerGroup": "control.k8ssandra.io", "controllerKind": "CassandraTask", "CassandraTask": {"name":"replace-node4-dc1","namespace":"k8ssandra-operator"}, "namespace": "k8ssandra-operator", "name": "replace-node4-dc1", "reconcileID": "0f421de2-bdc7-4275-ae2f-888de97840ce", "datacenterName": "dc1", "clusterName": "test-cassandra", "activeTasks": 1}

You should check the cass-operator pod logs to see what the underlying error is.
Please share the manifest you used for the K8ssandraTask, as well as the manifest of the corresponding generated CassandraTask.
It also seems you'll have to delete your existing K8ssandraTasks before trying again.
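
A rough sketch of those steps as shell helpers. The deployment name k8ssandra-operator-cass-operator is an assumption taken from the log source quoted above and may differ in your install. The `<task>-<datacenter>` naming of the generated CassandraTask is inferred from the event output, and the function names are made up for illustration.

```shell
# Hypothetical helpers. The deployment name is an assumption based on the
# log source mentioned earlier in the thread and may differ in your install.

# Print recent cass-operator log lines that mention a given task name.
cass_operator_errors() {
  ns="$1"; task="$2"
  kubectl logs -n "$ns" deployment/k8ssandra-operator-cass-operator --tail=500 |
    grep -F "$task"
}

# Dump the K8ssandraTask and its generated CassandraTask (assumed to be
# named <task>-<dc>) as YAML, ready to paste into a reply.
dump_task_manifests() {
  ns="$1"; task="$2"; dc="$3"
  kubectl get k8ssandratask "$task" -n "$ns" -o yaml
  kubectl get cassandratask "${task}-${dc}" -n "$ns" -o yaml
}

# Delete a K8ssandraTask before retrying, as suggested above.
delete_task() {
  ns="$1"; task="$2"
  kubectl delete k8ssandratask "$task" -n "$ns"
}
```

For example, `dump_task_manifests k8ssandra-operator <task name> dc1` prints both manifests for one task.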

I collected more information. Thanks for looking into this.
Using k8ssandra-operator-1.12.0.

replace.yml:

apiVersion: control.k8ssandra.io/v1alpha1
kind: K8ssandraTask
metadata:
  name: replace-node7
  namespace: k8ssandra-operator
spec:
  cluster:
    name: test-cassandra
    namespace: k8ssandra-operator
  #datacenters:
  template:
    jobs:
      - args:
          pod_name: test-cassandra-dc1-us-west-2a-sts-0
          rack: us-west-2a
        command: replacenode
        name: ''

$ kubectl describe K8ssandraTask -A

Name:         replace-node7
Namespace:    k8ssandra-operator
Labels:       <none>
Annotations:  <none>
API Version:  control.k8ssandra.io/v1alpha1
Kind:         K8ssandraTask
Metadata:
  Creation Timestamp:  2024-04-08T23:41:41Z
  Finalizers:
    control.k8ssandra.io/finalizer
  Generation:  1
  Owner References:
    API Version:           k8ssandra.io/v1alpha1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  K8ssandraCluster
    Name:                  test-cassandra
    UID:                   fac9a86b-d312-4766-bfe6-147950266252
  Resource Version:        77439
  UID:                     cafee656-8f9d-477c-9665-93b351a0acb9
Spec:
  Cluster:
    Name:       test-cassandra
    Namespace:  k8ssandra-operator
  Template:
    Jobs:
      Args:
        pod_name:  test-cassandra-dc1-us-west-2a-sts-0
        Rack:      us-west-2a
      Command:     replacenode
      Name:        
Status:
  Conditions:
    Last Transition Time:  2024-04-08T23:41:42Z
    Message:               
    Reason:                Running
    Status:                False
    Type:                  Running
    Last Transition Time:  2024-04-08T23:41:42Z
    Message:               
    Reason:                Failed
    Status:                False
    Type:                  Failed
  Datacenters:
    dc1:
Events:
  Type    Reason               Age   From                      Message
  ----    ------               ----  ----                      -------
  Normal  CreateCassandraTask  103s  k8ssandratask-controller  Created CassandraTask k8ssandra-operator.replace-node7-dc1

$ kubectl describe CassandraTask -A

Name:         replace-node7-dc1
Namespace:    k8ssandra-operator
Labels:       app.kubernetes.io/component=cassandra
              app.kubernetes.io/created-by=cass-operator
              app.kubernetes.io/instance=cassandra-test-cassandra
              app.kubernetes.io/managed-by=cass-operator
              app.kubernetes.io/name=cassandra
              app.kubernetes.io/part-of=k8ssandra
              app.kubernetes.io/version=4.1.3
              cassandra.datastax.com/cluster=test-cassandra
              cassandra.datastax.com/datacenter=dc1
              control.k8ssandra.io/status=active
              k8ssandra.io/task-name=replace-node7
              k8ssandra.io/task-namespace=k8ssandra-operator
Annotations:  <none>
API Version:  control.k8ssandra.io/v1alpha1
Kind:         CassandraTask
Metadata:
  Creation Timestamp:  2024-04-08T23:41:41Z
  Generation:          2
  Owner References:
    API Version:           cassandra.datastax.com/v1beta1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  CassandraDatacenter
    Name:                  dc1
    UID:                   bcec25f8-9652-4a33-86ec-d47c02730e7b
  Resource Version:        77437
  UID:                     06ed1aa7-a7ca-4495-849c-f41bb6a3f0bf
Spec:
  Concurrency Policy:  Forbid
  Datacenter:
    Name:       dc1
    Namespace:  k8ssandra-operator
  Jobs:
    Args:
      pod_name:                test-cassandra-dc1-us-west-2a-sts-0
      Rack:                    us-west-2a
    Command:                   replacenode
    Name:                      
  Restart Policy:              Never
  Ttl Seconds After Finished:  0
Events:                        <none>

k8ssandra-operator-cass-operator logs:

2024-04-08T23:41:42.030Z	INFO	KubeAPIWarningLogger	unknown field "status.conditions[0].lastProbeTime"
2024-04-08T23:41:42.030Z	ERROR	Reconciler error	{"controller": "cassandratask", "controllerGroup": "control.k8ssandra.io", "controllerKind": "CassandraTask", "CassandraTask": {"name":"replace-node7-dc1","namespace":"k8ssandra-operator"}, "namespace": "k8ssandra-operator", "name": "replace-node7-dc1", "reconcileID": "878cadc5-a80b-4f58-b306-25bbd6f9fd0e", "error": "CassandraTask.control.k8ssandra.io \"replace-node7-dc1\" is invalid: [conditions[0].message: Required value, conditions[0].reason: Required value]"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:329
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:274
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:235

Any comments? Thanks.

Hey, this is usually an indication of a failed CRD upgrade (you have an old CassandraTask CRD installed in the system) or an incompatible Kubernetes version.

Did you install k8ssandra-operator using Kustomize or Helm?

k8ssandra-operator is installed using Helm, APP VERSION 1.12.0

kubectl version:
Server Version: v1.28.8-eks-adc7111

How do I check the version of the CassandraTask CRD? I ran
kubectl get crd cassandratasks.control.k8ssandra.io -oyaml
It was created recently:

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  annotations:
    controller-gen.kubebuilder.io/version: v0.12.0
  creationTimestamp: "2024-04-15T18:41:21Z"
  generation: 1
  name: cassandratasks.control.k8ssandra.io
  resourceVersion: "4267"
  uid: 84b320dc-a728-461c-a802-86b63eb49770

You were right, we had an image tag mismatch. We fixed the tags, and replacenode works now.
Thanks.
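
For later readers who hit the same "Required value" error, here is a sketch of one way to spot mismatched image tags. It assumes the k8ssandra-operator namespace used in this thread as the default, and the function name is hypothetical.

```shell
# Hypothetical helper: list every deployment in the namespace together with
# its container images, so differing operator image tags stand out.
# $1 = namespace (defaults to the one used in this thread).
list_operator_images() {
  ns="${1:-k8ssandra-operator}"
  kubectl get deployments -n "$ns" \
    -o custom-columns='NAME:.metadata.name,IMAGES:.spec.template.spec.containers[*].image'
}
```

Comparing the printed tags against the APP VERSION reported by `helm list -n <namespace>` is one way to check that the chart and the running images agree.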