StatefulSet pods stuck in 1/2 Ready state

Hi, I’ve deployed K8ssandra but I’ve run into the following situation.

[root@node1 ~]# kubectl get pods
NAME                                                  READY   STATUS     RESTARTS   AGE
k8ssandra-cass-operator-766b945f65-ntb9s              1/1     Running    0          24m
k8ssandra-dc1-default-sts-0                           1/2     Running    0          24m
k8ssandra-dc1-default-sts-1                           1/2     Running    0          24m
k8ssandra-dc1-default-sts-2                           1/2     Running    0          24m
k8ssandra-dc1-default-sts-3                           1/2     Running    1          24m
k8ssandra-dc1-stargate-7d79856946-qjjl7               0/1     Init:0/1   0          24m
k8ssandra-grafana-dfdb5cc5c-4zq4n                     2/2     Running    0          24m
k8ssandra-kube-prometheus-operator-7dcccdcc86-tv7qc   1/1     Running    0          24m
k8ssandra-reaper-operator-566cdc787-nz5mf             1/1     Running    0          24m
prometheus-k8ssandra-kube-prometheus-prometheus-0     2/2     Running    1          24m

I’ve also reinstalled K8ssandra many times; sometimes just one of the StatefulSet pods runs as expected, for example:

[root@node1 ~]# kubectl get pods
NAME                                                  READY   STATUS     RESTARTS   AGE
k8ssandra-cass-operator-766b945f65-ntb9s              1/1     Running    0          24m
k8ssandra-dc1-default-sts-0                           1/2     Running    0          24m
k8ssandra-dc1-default-sts-1                           1/2     Running    0          24m
k8ssandra-dc1-default-sts-2                           2/2     Running    0          24m
k8ssandra-dc1-default-sts-3                           1/2     Running    1          24m
k8ssandra-dc1-stargate-7d79856946-qjjl7               0/1     Init:0/1   0          24m
k8ssandra-grafana-dfdb5cc5c-4zq4n                     2/2     Running    0          24m
k8ssandra-kube-prometheus-operator-7dcccdcc86-tv7qc   1/1     Running    0          24m
k8ssandra-reaper-operator-566cdc787-nz5mf             1/1     Running    0          24m
prometheus-k8ssandra-kube-prometheus-prometheus-0     2/2     Running    1          24m

helm values:

[root@node1 ~]# helm get values k8ssandra
USER-SUPPLIED VALUES:
cassandra:
  allowMultipleNodesPerWorker: false
  cassandraLibDirVolume:
    size: 5Gi
    storageClass: rook-ceph-block
  datacenters:
  - name: dc1
    racks:
    - name: default
    size: 4
  enabled: true
  heap:
    newGenSize: 24G
    size: 24G
  resources:
    limits:
      cpu: 3000m
      memory: 24Gi
    requests:
      cpu: 3000m
      memory: 24Gi
  version: 3.11.10
kube-prometheus-stack:
  grafana:
    adminPassword: admin123
    adminUser: admin
stargate:
  cpuLimMillicores: 1000
  cpuReqMillicores: 200
  enabled: true
  heapMB: 1024
  replicas: 1
[root@node1 ~]# kubectl logs k8ssandra-dc1-default-sts-0 -c cassandra

INFO  [nioEventLoopGroup-2-2] 2021-05-27 09:21:52,340 Cli.java:617 - address=/10.233.96.0:34922 url=/api/v0/probes/readiness status=500 Internal Server Error
INFO  [nioEventLoopGroup-2-1] 2021-05-27 09:22:01,047 Cli.java:617 - address=/10.233.96.0:34954 url=/api/v0/probes/liveness status=200 OK
INFO  [epollEventLoopGroup-170-1] 2021-05-27 09:22:02,337 Clock.java:47 - Using native clock for microsecond precision
WARN  [epollEventLoopGroup-170-2] 2021-05-27 09:22:02,338 AbstractBootstrap.java:452 - Unknown channel option 'TCP_NODELAY' for channel '[id: 0xa37e9fbc]'
WARN  [epollEventLoopGroup-170-2] 2021-05-27 09:22:02,339 Loggers.java:39 - [s165] Error connecting to Node(endPoint=/tmp/cassandra.sock, hostId=null, hashCode=762c7772), trying next node (FileNotFoundException: null)
[root@node1 ~]# kubectl get pvc
NAME                                      STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
server-data-k8ssandra-dc1-default-sts-0   Bound    pvc-3e796c50-1dc0-4b10-a02c-94e83def42dd   5Gi        RWO            rook-ceph-block   32m
server-data-k8ssandra-dc1-default-sts-1   Bound    pvc-27ecb64f-97f4-401b-944d-161650784be0   5Gi        RWO            rook-ceph-block   32m
server-data-k8ssandra-dc1-default-sts-2   Bound    pvc-174c6237-c386-401e-8551-a1d39e266838   5Gi        RWO            rook-ceph-block   32m
server-data-k8ssandra-dc1-default-sts-3   Bound    pvc-5d0fa6fd-e7c9-459c-91c9-8226d363536e   5Gi        RWO            rook-ceph-block   32m
[root@node1 ~]# kubectl describe pod k8ssandra-dc1-default-sts-0
Name:         k8ssandra-dc1-default-sts-0
Namespace:    k8ssandra
Priority:     0
Node:         node7/172.16.11.183
Start Time:   Thu, 27 May 2021 11:51:35 +0300
Labels:       app.kubernetes.io/managed-by=cass-operator
              cassandra.datastax.com/cluster=k8ssandra
              cassandra.datastax.com/datacenter=dc1
              cassandra.datastax.com/node-state=Ready-to-Start
              cassandra.datastax.com/rack=default
              controller-revision-hash=k8ssandra-dc1-default-sts-865d88bd4
              statefulset.kubernetes.io/pod-name=k8ssandra-dc1-default-sts-0
Annotations:  <none>
Status:       Running
IP:           10.233.96.6
IPs:
  IP:           10.233.96.6
Controlled By:  StatefulSet/k8ssandra-dc1-default-sts
Init Containers:
  base-config-init:
    Container ID:  docker://752e5e85c3cdde14d850998552809d3e98a85c2dfa647cb608034b6a180b1e83
    Image:         k8ssandra/cass-management-api:3.11.10-v0.1.25
    Image ID:      docker-pullable://k8ssandra/cass-management-api@sha256:ef5e007d37b57d905c706c1221c96228c4387abb8a96f994af8aae3423dc9f2a
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
    Args:
      -c
      cp -r /etc/cassandra/* /cassandra-base-config/
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 27 May 2021 11:53:52 +0300
      Finished:     Thu, 27 May 2021 11:53:52 +0300
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /cassandra-base-config/ from cassandra-config (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-mtdjk (ro)
  server-config-init:
    Container ID:   docker://683835e66c8a9b4fd42900e0cc7f7b6930254bb042aab3e326b8b047f3665b63
    Image:          docker.io/datastax/cass-config-builder:1.0.4
    Image ID:       docker-pullable://datastax/cass-config-builder@sha256:0cfa1f1270f1c211ae4ac8eb690dd9e909cf690126e5ed5ddb08bba78902d1a1
    Port:           <none>
    Host Port:      <none>
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 27 May 2021 11:53:59 +0300
      Finished:     Thu, 27 May 2021 11:54:01 +0300
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  256M
    Requests:
      cpu:     1
      memory:  256M
    Environment:
      POD_IP:                      (v1:status.podIP)
      HOST_IP:                     (v1:status.hostIP)
      USE_HOST_IP_FOR_BROADCAST:  false
      RACK_NAME:                  default
      PRODUCT_VERSION:            3.11.10
      PRODUCT_NAME:               cassandra
      DSE_VERSION:                3.11.10
      CONFIG_FILE_DATA:           {"cassandra-yaml":{"authenticator":"PasswordAuthenticator","authorizer":"CassandraAuthorizer","credentials_update_interval_in_ms":3600000,"credentials_validity_in_ms":3600000,"num_tokens":256,"permissions_update_interval_in_ms":3600000,"permissions_validity_in_ms":3600000,"role_manager":"CassandraRoleManager","roles_update_interval_in_ms":3600000,"roles_validity_in_ms":3600000},"cluster-info":{"name":"k8ssandra","seeds":"k8ssandra-seed-service"},"datacenter-info":{"graph-enabled":0,"name":"dc1","solr-enabled":0,"spark-enabled":0},"jvm-options":{"additional-jvm-opts":["-Dcassandra.system_distributed_replication_dc_names=dc1","-Dcassandra.system_distributed_replication_per_dc=4"],"heap_size_young_generation":"24G","initial_heap_size":"24G","max_heap_size":"24G"}}
    Mounts:
      /config from server-config (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-mtdjk (ro)
  jmx-credentials:
    Container ID:  docker://d152b98f82f2d628069b567622e7a32168169195ef9ab22b59591af37138d5cb
    Image:         busybox
    Image ID:      docker-pullable://busybox@sha256:b5fc1d7b2e4ea86a06b0cf88de915a2c43a99a00b6b3c0af731e5f4c07ae8eff
    Port:          <none>
    Host Port:     <none>
    Args:
      /bin/sh
      -c
      echo "$REAPER_JMX_USERNAME $REAPER_JMX_PASSWORD" > /config/jmxremote.password && echo "$SUPERUSER_JMX_USERNAME $SUPERUSER_JMX_PASSWORD" >> /config/jmxremote.password
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 27 May 2021 11:54:02 +0300
      Finished:     Thu, 27 May 2021 11:54:02 +0300
    Ready:          True
    Restart Count:  0
    Environment:
      REAPER_JMX_USERNAME:     <set to the key 'username' in secret 'k8ssandra-reaper-jmx'>  Optional: false
      REAPER_JMX_PASSWORD:     <set to the key 'password' in secret 'k8ssandra-reaper-jmx'>  Optional: false
      SUPERUSER_JMX_USERNAME:  <set to the key 'username' in secret 'k8ssandra-superuser'>   Optional: false
      SUPERUSER_JMX_PASSWORD:  <set to the key 'password' in secret 'k8ssandra-superuser'>   Optional: false
    Mounts:
      /config from server-config (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-mtdjk (ro)
Containers:
  cassandra:
    Container ID:   docker://217d76c7eb3153e77000da0043bfa31b4b45f1500002f9c8aac8a8e8ab94731d
    Image:          k8ssandra/cass-management-api:3.11.10-v0.1.25
    Image ID:       docker-pullable://k8ssandra/cass-management-api@sha256:ef5e007d37b57d905c706c1221c96228c4387abb8a96f994af8aae3423dc9f2a
    Ports:          9042/TCP, 9142/TCP, 7000/TCP, 7001/TCP, 7199/TCP, 8080/TCP, 9103/TCP, 9160/TCP
    Host Ports:     0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP
    State:          Running
      Started:      Thu, 27 May 2021 11:54:03 +0300
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     3
      memory:  24Gi
    Requests:
      cpu:      3
      memory:   24Gi
    Liveness:   http-get http://:8080/api/v0/probes/liveness delay=15s timeout=1s period=15s #success=1 #failure=3
    Readiness:  http-get http://:8080/api/v0/probes/readiness delay=20s timeout=1s period=10s #success=1 #failure=3
    Environment:
      LOCAL_JMX:                no
      DS_LICENSE:               accept
      DSE_AUTO_CONF_OFF:        all
      USE_MGMT_API:             true
      MGMT_API_EXPLICIT_START:  true
      DSE_MGMT_EXPLICIT_START:  true
    Mounts:
      /config from server-config (rw)
      /etc/encryption/ from encryption-cred-storage (rw)
      /var/lib/cassandra from server-data (rw)
      /var/log/cassandra from server-logs (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-mtdjk (ro)
  server-system-logger:
    Container ID:   docker://23b2501e7c93b5923c43d5f596ea9d9ce268f835ba0181b25364a01cc37c8c0a
    Image:          k8ssandra/system-logger:9c4c3692
    Image ID:       docker-pullable://k8ssandra/system-logger@sha256:6208a1e3d710d022c9e922c8466fe7d76ca206f97bf92902ff5327114696f8b1
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Thu, 27 May 2021 11:54:07 +0300
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     100m
      memory:  64M
    Requests:
      cpu:        100m
      memory:     64M
    Environment:  <none>
    Mounts:
      /var/log/cassandra from server-logs (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-mtdjk (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  server-data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  server-data-k8ssandra-dc1-default-sts-0
    ReadOnly:   false
  cassandra-config:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  server-config:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  server-logs:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  encryption-cred-storage:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  dc1-keystore
    Optional:    false
  default-token-mtdjk:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-mtdjk
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                  Age                    From                     Message
  ----     ------                  ----                   ----                     -------
  Normal   Scheduled               32m                    default-scheduler        Successfully assigned k8ssandra/k8ssandra-dc1-default-sts-0 to node7
  Normal   SuccessfulAttachVolume  32m                    attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-3e796c50-1dc0-4b10-a02c-94e83def42dd"
  Normal   Pulling                 32m                    kubelet                  Pulling image "k8ssandra/cass-management-api:3.11.10-v0.1.25"
  Normal   Pulled                  30m                    kubelet                  Successfully pulled image "k8ssandra/cass-management-api:3.11.10-v0.1.25" in 1m59.009421126s
  Normal   Started                 30m                    kubelet                  Started container base-config-init
  Normal   Created                 30m                    kubelet                  Created container base-config-init
  Normal   Pulling                 30m                    kubelet                  Pulling image "docker.io/datastax/cass-config-builder:1.0.4"
  Normal   Pulled                  30m                    kubelet                  Successfully pulled image "docker.io/datastax/cass-config-builder:1.0.4" in 6.631303128s
  Normal   Created                 30m                    kubelet                  Created container server-config-init
  Normal   Started                 30m                    kubelet                  Started container server-config-init
  Normal   Started                 30m                    kubelet                  Started container jmx-credentials
  Normal   Created                 30m                    kubelet                  Created container jmx-credentials
  Normal   Pulled                  30m                    kubelet                  Container image "busybox" already present on machine
  Normal   Pulled                  30m                    kubelet                  Container image "k8ssandra/cass-management-api:3.11.10-v0.1.25" already present on machine
  Normal   Created                 30m                    kubelet                  Created container cassandra
  Normal   Started                 30m                    kubelet                  Started container cassandra
  Normal   Pulling                 30m                    kubelet                  Pulling image "k8ssandra/system-logger:9c4c3692"
  Normal   Pulled                  30m                    kubelet                  Successfully pulled image "k8ssandra/system-logger:9c4c3692" in 3.859718237s
  Normal   Created                 30m                    kubelet                  Created container server-system-logger
  Normal   Started                 30m                    kubelet                  Started container server-system-logger
  Warning  Unhealthy               2m31s (x166 over 30m)  kubelet                  Readiness probe failed: HTTP probe failed with statuscode: 500

My environment specs are:
Kubernetes: 1.20
CNI: Weave
Storage provider for pvcs: Rook-ceph

In my experience, the most common cause of this problem is insufficient RAM, which prevents Cassandra from starting. I note that you’ve allocated 24GB to the heap.

Could you tell us about your environment, particularly the hardware specs like number of CPUs and memory?

In the meantime, I’ll get the K8ssandra devs to chime in as well. Cheers!


Hi, thanks for the reply.
I have 5 worker nodes: four of them have 64 GiB of RAM and the other one has 128 GiB.
Each node also has a 10-core Intel(R) Core™ i9-10850K CPU.
Kubernetes is installed on bare-metal servers with Kubespray.

In addition, if there were an OOM issue, Kubernetes would indicate it in the output of kubectl describe pod.
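
That check can also be done directly from the pod status; a sketch, using one of the pod names from the listing above (adjust the namespace if your context differs):

```shell
# Print the last termination reason of the cassandra container.
# An OOM-killed container reports "OOMKilled"; here Restart Count is 0,
# so no termination (and no OOM kill) has been recorded.
kubectl get pod k8ssandra-dc1-default-sts-0 -n k8ssandra -o \
  jsonpath='{.status.containerStatuses[?(@.name=="cassandra")].lastState.terminated.reason}'
```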

I noticed that you’ve set both the heap and NewGen to 24GB.

For CMS, our general recommendation with NewGen is to allocate 100MB for each CPU core. So if the Cassandra node has 4 cores, set NewGen to 400MB.
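
That arithmetic can be sketched directly (the 4-core figure is just the example above; substitute the core count your Cassandra pods actually get):

```shell
# CMS rule of thumb: NewGen = 100MB per CPU core available to Cassandra.
cores=4                       # example core count from the recommendation above
newgen="$((cores * 100))M"    # 4 cores -> 400M
echo "$newgen"                # prints "400M"
```

In the helm values this would go under cassandra.heap.newGenSize.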

Could you try that and let us know how you go? Cheers!

The problem still persists.

[root@node1 k8ssandra]# helm get values k8ssandra
USER-SUPPLIED VALUES:
cassandra:
  allowMultipleNodesPerWorker: false
  cassandraLibDirVolume:
    size: 5Gi
    storageClass: rook-ceph-block
  datacenters:
  - name: dc1
    racks:
    - name: default
    size: 4
  enabled: true
  heap:
    newGenSize: 400M
    size: 8G
  resources:
    limits:
      cpu: 3000m
      memory: 16Gi
    requests:
      cpu: 3000m
      memory: 8Gi
  version: 3.11.10
kube-prometheus-stack:
  grafana:
    adminPassword: admin123
    adminUser: admin
stargate:
  cpuLimMillicores: 1000
  cpuReqMillicores: 200
  enabled: true
  heapMB: 1024
  replicas: 1

NAME                                                  READY   STATUS     RESTARTS   AGE
k8ssandra-cass-operator-766b945f65-27kq6              1/1     Running    0          5m9s
k8ssandra-dc1-default-sts-0                           1/2     Running    0          5m
k8ssandra-dc1-default-sts-1                           1/2     Running    0          5m
k8ssandra-dc1-default-sts-2                           1/2     Running    0          5m
k8ssandra-dc1-default-sts-3                           1/2     Running    0          5m
k8ssandra-dc1-stargate-7d79856946-qtflg               0/1     Init:0/1   0          5m9s
k8ssandra-grafana-dfdb5cc5c-n49k5                     2/2     Running    0          5m9s
k8ssandra-kube-prometheus-operator-7dcccdcc86-tx84j   1/1     Running    0          5m9s
k8ssandra-reaper-operator-566cdc787-kr98x             1/1     Running    0          5m9s
prometheus-k8ssandra-kube-prometheus-prometheus-0     2/2     Running    1          5m7s

Do you see anything in the server-system-logger container for the Cassandra pods that aren’t running? That container should have the actual system logs from Cassandra.

Definitely check the logs of server-system-logger. If Cassandra is failing at startup, there is a good chance the error message will not be logged there, so you should also check the contents of /var/log/cassandra/stdout.log. cass-operator only starts one Cassandra node at a time, so unless you are checking the logs of the node it is currently trying to start, the file may be empty or not exist. Make sure you check the correct pod (or check all of them to be safe).
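
A loop like the following (a sketch; adjust the namespace and pod names to your deployment) covers both places across all four pods:

```shell
# For every StatefulSet pod, dump the tailed system log from the
# server-system-logger container and stdout.log from the cassandra
# container (the latter may be empty on nodes cass-operator has not
# started yet).
for i in 0 1 2 3; do
  pod="k8ssandra-dc1-default-sts-$i"
  echo "=== $pod ==="
  kubectl logs "$pod" -c server-system-logger --tail=50
  kubectl exec "$pod" -c cassandra -- cat /var/log/cassandra/stdout.log 2>/dev/null
done
```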

It looks like it’s a Kubernetes networking issue; once I resolve it I’ll update you ASAP.
Thank you for your interest, I really appreciate it.
The output of server-system-logger suggests it’s a seed problem.
Notice that in cassandra.yaml the seed provider looks like this:

seed_provider:
- class_name: org.apache.cassandra.locator.K8SeedProvider
  parameters:
  - seeds: k8ssandra-seed-service

So I created a dnsutils pod in the k8ssandra namespace:

[root@node1 ~]# kubectl exec -it dnsutils -- sh
/ # nslookup k8ssandra-seed-service
Server:		169.254.25.10
Address:	169.254.25.10#53

** server can't find k8ssandra-seed-service.k8ssandra.svc.cluster.local: SERVFAIL

server-system-logger output:

INFO  [main] 2021-05-28 04:39:29,359 DatabaseDescriptor.java:381 - DiskAccessMode 'auto' determined to be mmap, indexAccessMode is mmap
INFO  [main] 2021-05-28 04:39:29,359 DatabaseDescriptor.java:439 - Global memtable on-heap threshold is enabled at 2048MB
INFO  [main] 2021-05-28 04:39:29,360 DatabaseDescriptor.java:443 - Global memtable off-heap threshold is enabled at 2048MB
WARN  [main] 2021-05-28 04:39:29,365 DatabaseDescriptor.java:503 - Small commitlog volume detected at /opt/cassandra/data/commitlog; setting commitlog_total_space_in_mb to 1243.  You can override this in cassandra.yaml
WARN  [main] 2021-05-28 04:39:29,365 DatabaseDescriptor.java:530 - Small cdc volume detected at /opt/cassandra/data/cdc_raw; setting cdc_total_space_in_mb to 621.  You can override this in cassandra.yaml
WARN  [main] 2021-05-28 04:39:29,553 DatabaseDescriptor.java:579 - Only 4.839GiB free across all data volumes. Consider adding more capacity to your cluster or removing obsolete snapshots
INFO  [main] 2021-05-28 04:39:29,574 RateBasedBackPressure.java:123 - Initialized back-pressure with high ratio: 0.9, factor: 5, flow: FAST, window size: 2000.
INFO  [main] 2021-05-28 04:39:29,575 DatabaseDescriptor.java:775 - Back-pressure is disabled with strategy null.
INFO  [main] 2021-05-28 04:39:29,683 GossipingPropertyFileSnitch.java:64 - Loaded cassandra-topology.properties for compatibility
INFO  [ScheduledTasks:1] 2021-05-28 04:39:34,717 TokenMetadata.java:517 - Updating topology for all endpoints that have changed
WARN  [main] 2021-05-28 04:39:34,776 K8SeedProvider3x.java:54 - Seed provider couldn't lookup host k8ssandra-seed-service
ERROR [main] 2021-05-28 04:39:34,777 CassandraDaemon.java:803 - Exception encountered during startup: The seed provider lists no seeds.
INFO  [main] 2021-05-28 04:39:58,938 SystemDistributedReplicationInterceptor.java:73 - Using override for distributed system keyspaces: {dc1=4, class=NetworkTopologyStrategy}
INFO  [main] 2021-05-28 04:39:59,183 YamlConfigurationLoader.java:92 - Configuration location: file:/etc/cassandra/cassandra.yaml
INFO  [main] 2021-05-28 04:39:59,346 Config.java:537 - Node configuration:[allocate_tokens_for_keyspace=null; authenticator=PasswordAuthenticator; authorizer=CassandraAuthorizer; auto_bootstrap=true; auto_snapshot=true; back_pressure_enabled=false; back_pressure_strategy=null; batch_size_fail_threshold_in_kb=640; batch_size_warn_threshold_in_kb=64; batchlog_replay_throttle_in_kb=1024; broadcast_address=null; broadcast_rpc_address=10.233.96.6; buffer_pool_use_heap_if_exhausted=true; cas_contention_timeout_in_ms=1000; cdc_enabled=false; cdc_free_space_check_interval_ms=250; cdc_raw_directory=null; cdc_total_space_in_mb=0; check_for_duplicate_rows_during_compaction=true; check_for_duplicate_rows_during_reads=true; client_encryption_options=<REDACTED>; cluster_name=k8ssandra; column_index_cache_size_in_kb=2; column_index_size_in_kb=64; commit_failure_policy=stop; commitlog_compression=null; commitlog_directory=null; commitlog_max_compression_buffers_in_pool=3; commitlog_periodic_queue_size=-1; commitlog_segment_size_in_mb=32; commitlog_sync=periodic; commitlog_sync_batch_window_in_ms=NaN; commitlog_sync_period_in_ms=10000; commitlog_total_space_in_mb=null; compaction_large_partition_warning_threshold_mb=100; compaction_throughput_mb_per_sec=16; concurrent_compactors=null; concurrent_counter_writes=32; concurrent_materialized_view_writes=32; concurrent_reads=32; concurrent_replicates=null; concurrent_writes=32; counter_cache_keys_to_save=2147483647; counter_cache_save_period=7200; counter_cache_size_in_mb=null; counter_write_request_timeout_in_ms=5000; credentials_cache_max_entries=1000; credentials_update_interval_in_ms=3600000; credentials_validity_in_ms=3600000; cross_node_timeout=false; data_file_directories=[Ljava.lang.String;@f001896; disk_access_mode=auto; disk_failure_policy=stop; disk_optimization_estimate_percentile=0.95; disk_optimization_page_cross_chance=0.1; disk_optimization_strategy=ssd; dynamic_snitch=true; dynamic_snitch_badness_threshold=0.1; 
dynamic_snitch_reset_interval_in_ms=600000; dynamic_snitch_update_interval_in_ms=100; enable_materialized_views=true; enable_sasi_indexes=true; enable_scripted_user_defined_functions=false; enable_user_defined_functions=false; enable_user_defined_functions_threads=true; encryption_options=<REDACTED>; endpoint_snitch=GossipingPropertyFileSnitch; file_cache_round_up=null; file_cache_size_in_mb=null; gc_log_threshold_in_ms=200; gc_warn_threshold_in_ms=1000; hinted_handoff_disabled_datacenters=[]; hinted_handoff_enabled=true; hinted_handoff_throttle_in_kb=1024; hints_compression=null; hints_directory=null; hints_flush_period_in_ms=10000; incremental_backups=false; index_interval=null; index_summary_capacity_in_mb=null; index_summary_resize_interval_in_minutes=60; initial_token=null; inter_dc_stream_throughput_outbound_megabits_per_sec=200; inter_dc_tcp_nodelay=false; internode_authenticator=null; internode_compression=dc; internode_recv_buff_size_in_bytes=0; internode_send_buff_size_in_bytes=0; key_cache_keys_to_save=2147483647; key_cache_save_period=14400; key_cache_size_in_mb=null; listen_address=10.233.96.6; listen_interface=null; listen_interface_prefer_ipv6=false; listen_on_broadcast_address=false; max_hint_window_in_ms=10800000; max_hints_delivery_threads=2; max_hints_file_size_in_mb=128; max_mutation_size_in_kb=null; max_streaming_retries=3; max_value_size_in_mb=256; memtable_allocation_type=heap_buffers; memtable_cleanup_threshold=null; memtable_flush_writers=0; memtable_heap_space_in_mb=null; memtable_offheap_space_in_mb=null; min_free_space_per_drive_in_mb=50; native_transport_flush_in_batches_legacy=true; native_transport_max_concurrent_connections=-1; native_transport_max_concurrent_connections_per_ip=-1; native_transport_max_concurrent_requests_in_bytes=-1; native_transport_max_concurrent_requests_in_bytes_per_ip=-1; native_transport_max_frame_size_in_mb=256; native_transport_max_negotiable_protocol_version=-2147483648; native_transport_max_threads=128; 
native_transport_port=9042; native_transport_port_ssl=null; num_tokens=256; otc_backlog_expiration_interval_ms=200; otc_coalescing_enough_coalesced_messages=8; otc_coalescing_strategy=DISABLED; otc_coalescing_window_us=200; partitioner=org.apache.cassandra.dht.Murmur3Partitioner; permissions_cache_max_entries=1000; permissions_update_interval_in_ms=3600000; permissions_validity_in_ms=3600000; phi_convict_threshold=8.0; prepared_statements_cache_size_mb=null; range_request_timeout_in_ms=10000; read_request_timeout_in_ms=5000; repair_session_max_tree_depth=18; request_scheduler=org.apache.cassandra.scheduler.NoScheduler; request_scheduler_id=null; request_scheduler_options=null; request_timeout_in_ms=10000; role_manager=CassandraRoleManager; roles_cache_max_entries=1000; roles_update_interval_in_ms=3600000; roles_validity_in_ms=3600000; row_cache_class_name=org.apache.cassandra.cache.OHCProvider; row_cache_keys_to_save=2147483647; row_cache_save_period=0; row_cache_size_in_mb=0; rpc_address=0.0.0.0; rpc_interface=null; rpc_interface_prefer_ipv6=false; rpc_keepalive=true; rpc_listen_backlog=50; rpc_max_threads=2147483647; rpc_min_threads=16; rpc_port=9160; rpc_recv_buff_size_in_bytes=null; rpc_send_buff_size_in_bytes=null; rpc_server_type=sync; saved_caches_directory=null; seed_provider=org.apache.cassandra.locator.K8SeedProvider{seeds=k8ssandra-seed-service}; server_encryption_options=<REDACTED>; slow_query_log_timeout_in_ms=500; snapshot_before_compaction=false; snapshot_on_duplicate_row_detection=false; ssl_storage_port=7001; sstable_preemptive_open_interval_in_mb=50; start_native_transport=true; start_rpc=true; storage_port=7000; stream_throughput_outbound_megabits_per_sec=200; streaming_keep_alive_period_in_secs=300; streaming_socket_timeout_in_ms=86400000; thrift_framed_transport_size_in_mb=15; thrift_max_message_length_in_mb=16; thrift_prepared_statements_cache_size_mb=null; tombstone_failure_threshold=100000; tombstone_warn_threshold=1000; 
tracetype_query_ttl=86400; tracetype_repair_ttl=604800; transparent_data_encryption_options=org.apache.cassandra.config.TransparentDataEncryptionOptions@13f17eb4; trickle_fsync=true; trickle_fsync_interval_in_kb=10240; truncate_request_timeout_in_ms=60000; unlogged_batch_across_partitions_warn_threshold=10; user_defined_function_fail_timeout=1500; user_defined_function_warn_timeout=500; user_function_timeout_policy=die; windows_timer_interval=1; write_request_timeout_in_ms=2000]

Output of my service list:

[root@node1 ~]# kubectl get svc
NAME                                   TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                                                 AGE
cass-operator-metrics                  ClusterIP   10.233.55.87    <none>        8383/TCP,8686/TCP                                       13h
k8ssandra-dc1-all-pods-service         ClusterIP   None            <none>        9042/TCP,8080/TCP,9103/TCP                              13h
k8ssandra-dc1-service                  ClusterIP   None            <none>        9042/TCP,9142/TCP,8080/TCP,9103/TCP,9160/TCP            13h
k8ssandra-dc1-stargate-service         ClusterIP   10.233.42.47    <none>        8080/TCP,8081/TCP,8082/TCP,8084/TCP,8085/TCP,9042/TCP   13h
k8ssandra-grafana                      ClusterIP   10.233.32.226   <none>        80/TCP                                                  13h
k8ssandra-kube-prometheus-operator     ClusterIP   10.233.43.117   <none>        443/TCP                                                 13h
k8ssandra-kube-prometheus-prometheus   ClusterIP   10.233.10.212   <none>        9090/TCP                                                13h
k8ssandra-reaper-reaper-service        ClusterIP   10.233.62.19    <none>        8080/TCP                                                13h
k8ssandra-seed-service                 ClusterIP   None            <none>        <none>                                                  13h
prometheus-operated                    ClusterIP   None            <none>        9090/TCP                                                13h
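
Since k8ssandra-seed-service is headless, its DNS records come from the endpoints backing it; a sketch to see whether it has any and whether cluster DNS can resolve it (reusing the dnsutils pod from earlier):

```shell
# A headless Service only gets DNS A records for the pods behind it;
# if no Cassandra pod carries the seed label yet, there is nothing to resolve.
kubectl get endpoints k8ssandra-seed-service -n k8ssandra -o wide

# Resolve the name from inside the cluster.
kubectl exec -n k8ssandra dnsutils -- \
  nslookup k8ssandra-seed-service.k8ssandra.svc.cluster.local
```

Note that SERVFAIL (as opposed to NXDOMAIN or an empty answer) generally means the DNS server itself failed to answer, which is consistent with a cluster DNS/networking problem rather than a missing service.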

Also, the containers in the k8ssandra-dc1-default-sts-* pods are running; it’s only the cassandra container that is not Ready.

Containers:
  cassandra:
    Container ID:   docker://614dc73dd4d41a56856b73ade0f75c145652cc15f3c0d6d2f908c5151db156b3
    Image:          k8ssandra/cass-management-api:3.11.10-v0.1.25
    Image ID:       docker-pullable://k8ssandra/cass-management-api@sha256:ef5e007d37b57d905c706c1221c96228c4387abb8a96f994af8aae3423dc9f2a
    Ports:          9042/TCP, 9142/TCP, 7000/TCP, 7001/TCP, 7199/TCP, 8080/TCP, 9103/TCP, 9160/TCP
    Host Ports:     0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP
    State:          Running
      Started:      Thu, 27 May 2021 18:45:48 +0300
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     3
      memory:  16Gi
    Requests:
      cpu:      3
      memory:   8Gi
    Liveness:   http-get http://:8080/api/v0/probes/liveness delay=15s timeout=1s period=15s #success=1 #failure=3
    Readiness:  http-get http://:8080/api/v0/probes/readiness delay=20s timeout=1s period=10s #success=1 #failure=3
    Environment:
      LOCAL_JMX:                no
      DS_LICENSE:               accept
      DSE_AUTO_CONF_OFF:        all
      USE_MGMT_API:             true
      MGMT_API_EXPLICIT_START:  true
      DSE_MGMT_EXPLICIT_START:  true
    Mounts:
      /config from server-config (rw)
      /etc/encryption/ from encryption-cred-storage (rw)
      /var/lib/cassandra from server-data (rw)
      /var/log/cassandra from server-logs (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-j4mp7 (ro)
  server-system-logger:
    Container ID:   docker://e0f2ad80ac2b5de879a028c9f4b863994a8ebcd9d7f3cdd8f34c9c18858af839
    Image:          k8ssandra/system-logger:9c4c3692
    Image ID:       docker-pullable://k8ssandra/system-logger@sha256:6208a1e3d710d022c9e922c8466fe7d76ca206f97bf92902ff5327114696f8b1
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Thu, 27 May 2021 18:45:48 +0300
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     100m
      memory:  64M
    Requests:
      cpu:        100m
      memory:     64M
    Environment:  <none>
    Mounts:
      /var/log/cassandra from server-logs (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-j4mp7 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  server-data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  server-data-k8ssandra-dc1-default-sts-1
    ReadOnly:   false
  cassandra-config:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  server-config:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  server-logs:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  encryption-cred-storage:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  dc1-keystore
    Optional:    false
  default-token-j4mp7:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-j4mp7
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                     From     Message
  ----     ------     ----                    ----     -------
  Warning  Unhealthy  4m31s (x4709 over 13h)  kubelet  Readiness probe failed: HTTP probe failed with statuscode: 500
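
The readiness probe in the container spec above hits the management API on port 8080 (`/api/v0/probes/readiness`). One way to see why it keeps returning 500 is to call that endpoint directly from inside the cassandra container and check the container logs — a sketch using the pod names from this thread, assuming `curl` is available in the cass-management-api image:

```shell
# Call the readiness endpoint the kubelet is probing (port/path from the pod spec above)
kubectl exec -it k8ssandra-dc1-default-sts-1 -c cassandra -- \
  curl -v http://localhost:8080/api/v0/probes/readiness

# The management API and Cassandra logs usually explain the 500
kubectl logs k8ssandra-dc1-default-sts-1 -c cassandra
kubectl logs k8ssandra-dc1-default-sts-1 -c server-system-logger
```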

Output of /config/cassandra.yaml from the cassandra container in the pods:


cassandra@k8ssandra-dc1-default-sts-1:/config$ cat  cassandra.yaml 
ssl_storage_port: 7001
storage_port: 7000
batchlog_replay_throttle_in_kb: 1024
commit_failure_policy: stop
unlogged_batch_across_partitions_warn_threshold: 10
commitlog_segment_size_in_mb: 32
start_rpc: true
credentials_validity_in_ms: 3600000
client_encryption_options:
  enabled: false
concurrent_materialized_view_writes: 32
inter_dc_tcp_nodelay: false
column_index_cache_size_in_kb: 2
rpc_server_type: sync
authorizer: CassandraAuthorizer
num_tokens: 256
row_cache_save_period: 0
disk_failure_policy: stop
native_transport_port: 9042
server_encryption_options:
  internode_encryption: none
dynamic_snitch_reset_interval_in_ms: 600000
compaction_throughput_mb_per_sec: 16
role_manager: CassandraRoleManager
column_index_size_in_kb: 64
batch_size_warn_threshold_in_kb: 64
windows_timer_interval: 1
compaction_large_partition_warning_threshold_mb: 100
rpc_keepalive: true
batch_size_fail_threshold_in_kb: 640
snapshot_before_compaction: false
credentials_update_interval_in_ms: 3600000
tracetype_query_ttl: 86400
concurrent_reads: 32
key_cache_save_period: 14400
row_cache_size_in_mb: 0
tracetype_repair_ttl: 604800
enable_materialized_views: true
tombstone_warn_threshold: 1000
rpc_address: 0.0.0.0
concurrent_writes: 32
commitlog_sync: periodic
counter_cache_save_period: 7200
roles_update_interval_in_ms: 3600000
back_pressure_enabled: false
enable_sasi_indexes: true
slow_query_log_timeout_in_ms: 500
trickle_fsync: true
write_request_timeout_in_ms: 2000
incremental_backups: false
truncate_request_timeout_in_ms: 60000
enable_scripted_user_defined_functions: false
read_request_timeout_in_ms: 5000
request_timeout_in_ms: 10000
start_native_transport: true
memtable_allocation_type: heap_buffers
transparent_data_encryption_options:
  enabled: false
  chunk_length_kb: 64
  cipher: AES/CBC/PKCS5Padding
  key_alias: testing:1
internode_compression: dc
authenticator: PasswordAuthenticator
max_hints_delivery_threads: 2
cross_node_timeout: false
partitioner: org.apache.cassandra.dht.Murmur3Partitioner
tombstone_failure_threshold: 100000
hinted_handoff_enabled: true
hints_flush_period_in_ms: 10000
enable_user_defined_functions: false
hinted_handoff_throttle_in_kb: 1024
max_hint_window_in_ms: 10800000
broadcast_rpc_address: 10.233.96.6
auto_snapshot: true
index_summary_resize_interval_in_minutes: 60
range_request_timeout_in_ms: 10000
sstable_preemptive_open_interval_in_mb: 50
seed_provider:
- class_name: org.apache.cassandra.locator.K8SeedProvider
  parameters:
  - seeds: k8ssandra-seed-service
dynamic_snitch_update_interval_in_ms: 100
trickle_fsync_interval_in_kb: 10240
listen_address: 10.233.96.6
commitlog_sync_period_in_ms: 10000
cdc_enabled: false
max_hints_file_size_in_mb: 128
counter_write_request_timeout_in_ms: 5000
cluster_name: k8ssandra
concurrent_counter_writes: 32
endpoint_snitch: GossipingPropertyFileSnitch
dynamic_snitch_badness_threshold: 0.1
permissions_validity_in_ms: 3600000
permissions_update_interval_in_ms: 3600000
roles_validity_in_ms: 3600000
rpc_port: 9160
cas_contention_timeout_in_ms: 1000
thrift_framed_transport_size_in_mb: 15
gc_warn_threshold_in_ms: 1000
request_scheduler: org.apache.cassandra.scheduler.NoScheduler

Hi, after digging into Kubernetes internals, I found out that CentOS 8 does not support iptables in legacy mode.
My CNI was Weave, and it was using iptables in legacy mode, which is why my pods could not resolve k8ssandra-seed-service.
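
For anyone hitting the same symptom, the two things involved here can be checked directly — the iptables backend on the node, and DNS resolution of the seed service from inside the cluster. A sketch (the `nslookup` step assumes a busybox image can be pulled):

```shell
# On the node: check which iptables backend is in use.
# CentOS 8 defaults to the nf_tables backend, e.g.
# "iptables v1.8.4 (nf_tables)" vs "(legacy)".
iptables --version

# From inside the cluster: verify the seed service name resolves
kubectl run dns-test --rm -it --image=busybox --restart=Never -- \
  nslookup k8ssandra-seed-service
```

If the name fails to resolve, the problem is at the CNI/DNS layer rather than in K8ssandra itself.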

After fixing the resolution issue, all the K8ssandra services worked as expected.
Thank you all for your attention.

Best regards,
Eren Cankurtaran


@ieuD really glad to hear you figured it out. More importantly, thanks for circling back and letting us know the root cause. Cheers!


How did you do this? Please help me.