Reaper crashing - failure to connect to cluster

The main symptom of this issue is that Reaper is stuck in what looks like an endless restart loop.

As you can see from the pods here:

oda-k8ssandra-cass-operator-85c9f8c4c7-zqfdp         1/1     Running     0                22d
oda-k8ssandra-medusa-operator-f4f75c7bb-6bb47        1/1     Running     0                22d
oda-k8ssandra-oslo-a1-sts-0                          3/3     Running     0                22d
oda-k8ssandra-oslo-a2-sts-0                          3/3     Running     0                22d
oda-k8ssandra-oslo-b1-sts-0                          3/3     Running     0                22d
oda-k8ssandra-oslo-b2-sts-0                          3/3     Running     0                22d
oda-k8ssandra-oslo-m1-sts-0                          3/3     Running     0                22d
oda-k8ssandra-reaper-7f744bcfb7-2c8w6                0/1     Running     2535 (21s ago)   22d
oda-k8ssandra-reaper-operator-654f9f9bdc-22lb7       1/1     Running     0                22d
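Given the 2535 restarts on the reaper pod, it may help to pull the logs from the last crashed container and check the termination reason (e.g. OOMKilled vs. a liveness-probe failure). A sketch, using the pod name from the listing above — substitute your own namespace:

```shell
# Logs from the previously crashed container, not the current one:
kubectl logs oda-k8ssandra-reaper-7f744bcfb7-2c8w6 --previous -n <namespace>

# Termination reason / exit code and recent events (probe failures, OOM kills):
kubectl describe pod oda-k8ssandra-reaper-7f744bcfb7-2c8w6 -n <namespace>
```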

When we first updated the cluster to the newer version of k8ssandra (1.4.1), I had to run the migration patches manually or it ran into issues. For a while after that, Reaper appeared healthy and restarted only occasionally, but it has slowly entered a state where it restarts so often that not much gets done. We originally suspected it wasn't finding the IPs of the cluster (they are not static), so that as they changed over time it would eventually lose contact with the cluster. On closer inspection, though, I'm not sure that is the case, since it does reference the oda-k8ssandra-oslo-service.

Nodetool status doesn’t seem to indicate any issues with the cluster:

Datacenter: oslo
================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens       Owns    Host ID                               Rack
UN  172.16.6.17    3 GiB      256          ?       f269d385-50f6-4130-b6d2-bcfe581ac7a7  m1
UN  172.16.4.146   16.01 GiB  256          ?       5c4879e0-b757-4d96-b068-572c26ef4a4a  b2
UN  172.16.14.245  12.09 GiB  256          ?       e2513d2f-e261-46d3-95c6-f487b403975b  b1
UN  172.16.8.235   14.59 GiB  256          ?       0d93d741-2922-4b58-9182-22a8981c91a0  a2
UN  172.16.10.174  10.79 GiB  256          ?       26599a8f-a58e-453d-aed6-cb000ac475eb  a1

Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
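The QUORUM errors further down ("2 responses were required but only 1 replica responded") can point at the replication settings of the keyspace Reaper stores its state in (typically reaper_db in a k8ssandra install, but verify the name). A hedged sketch for inspecting it — pod name, credentials, keyspace, and datacenter name are all assumptions taken from the output above:

```shell
# Show the replication settings for Reaper's backend keyspace
# (keyspace name assumed; list them first with "DESCRIBE KEYSPACES;"):
kubectl exec -it oda-k8ssandra-oslo-a1-sts-0 -c cassandra -n <namespace> -- \
  cqlsh -u <user> -p <password> -e "DESCRIBE KEYSPACE reaper_db;"

# If the replication factor doesn't match what the cluster can actually serve,
# one option is to align it and then repair the keyspace, e.g.:
#   ALTER KEYSPACE reaper_db
#     WITH replication = {'class': 'NetworkTopologyStrategy', 'oslo': 3};
# followed by: nodetool repair reaper_db
```

Since nodetool shows all five nodes as UN, a mismatched replication factor is only one possible explanation; stale peer entries (checked below) are another.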

On startup I see:

INFO   [2022-01-01 18:20:28,670] [main] c.d.d.c.p.DCAwareRoundRobinPolicy - Using data-center name 'oslo' for DCAwareRoundRobinPolicy (if this is incorrect, please provide the correct datacenter name with DCAwareRoundRobinPolicy constructor) 
INFO   [2022-01-01 18:20:28,671] [main] c.d.d.c.Cluster - New Cassandra host oda-k8ssandra-oslo-service/172.16.14.245:9042 added 
INFO   [2022-01-01 18:20:28,671] [main] c.d.d.c.Cluster - New Cassandra host oda-k8ssandra-oslo-service/172.16.4.146:9042 added 
INFO   [2022-01-01 18:20:28,672] [main] c.d.d.c.Cluster - New Cassandra host oda-k8ssandra-oslo-service/172.16.6.17:9042 added 
INFO   [2022-01-01 18:20:28,672] [main] c.d.d.c.Cluster - New Cassandra host oda-k8ssandra-oslo-service/172.16.10.174:9042 added 
INFO   [2022-01-01 18:20:28,672] [main] c.d.d.c.Cluster - New Cassandra host oda-k8ssandra-oslo-service/172.16.8.235:9042 added 
ERROR  [2022-01-01 18:21:25,580] [main] i.c.ReaperApplication - Storage is not ready yet, trying again to connect shortly...
com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra timeout during read query at consistency QUORUM (2 responses were required but only 1 replica responded)

When it manages to get a bit further through startup, the logs show other issues that may need to be dealt with:

INFO   [2022-01-01 18:21:42,098] [main] i.c.s.RepairManager - re-trigger a running run after restart, with id df8295a0-5eed-11ec-b11b-e5ac57216973 
INFO   [2022-01-01 18:21:42,194] [main] i.c.ReaperApplication - Initialization complete! 
WARN   [2022-01-01 18:21:42,194] [main] i.c.ReaperApplication - Reaper is ready to get things done! 
INFO   [2022-01-01 18:21:42,197] [main] i.d.s.ServerFactory - Starting cassandra-reaper
_________                                          .___               __________
\_   ___ \_____    ______ ___________    ____    __| _/___________    \______   \ ____ _____  ______   ___________
/    \  \/\__  \  /  ___//  ___/\__  \  /    \  / __ |\_  __ \__  \    |       _// __ \\__  \ \____ \_/ __ \_  __ \
\     \____/ __ \_\___ \ \___ \  / __ \|   |  \/ /_/ | |  | \// __ \_  |    |   \  ___/ / __ \|  |_> >  ___/|  | \/
 \______  (____  /____  >____  >(____  /___|  /\____ | |__|  (____  /  |____|_  /\___  >____  /   __/ \___  >__|
        \/     \/     \/     \/      \/     \/      \/            \/          \/     \/     \/|__|        \/
 
INFO   [2022-01-01 18:21:42,243] [main] o.e.j.s.SetUIDListener - Opened application@320be73{HTTP/1.1,[http/1.1]}{0.0.0.0:8080} 
INFO   [2022-01-01 18:21:42,243] [main] o.e.j.s.SetUIDListener - Opened admin@435e416c{HTTP/1.1,[http/1.1]}{0.0.0.0:8081} 
INFO   [2022-01-01 18:21:42,245] [main] o.e.j.s.Server - jetty-9.4.z-SNAPSHOT; built: 2019-04-29T20:42:08.989Z; git: e1bc35120a6617ee3df052294e433f3a25ce7097; jvm 1.8.0_312-b07 
INFO   [2022-01-01 18:21:42,290] [oda-k8ssandra:df8295a0-5eed-11ec-b11b-e5ac57216973] i.c.s.RepairRunner - Attempting to run new segment... 
INFO   [2022-01-01 18:21:42,567] [oda-k8ssandra:df8295a0-5eed-11ec-b11b-e5ac57216973] i.c.s.RepairRunner - All nodes are busy or have too many pending compactions for the remaining candidate segments. 
INFO   [2022-01-01 18:21:42,596] [oda-k8ssandra:df8295a0-5eed-11ec-b11b-e5ac57216973] i.c.s.RepairRunner - All nodes are busy or have too many pending compactions for the remaining candidate segments. 
INFO   [2022-01-01 18:21:42,602] [oda-k8ssandra:df8295a0-5eed-11ec-b11b-e5ac57216973] i.c.s.RepairRunner - Repair amount done 310.0

Later in the logs we see it trying to add new IPs (I think?), but running into connection issues:

INFO   [2022-01-01 18:09:19,395] [main] c.d.d.c.p.DCAwareRoundRobinPolicy - Using data-center name 'oslo' for DCAwareRoundRobinPolicy (if this is incorrect, please provide the correct datacenter name with DCAwareRoundRobinPolicy constructor) 
INFO   [2022-01-01 18:09:19,395] [main] c.d.d.c.Cluster - New Cassandra host /172.16.14.86:9042 added 
INFO   [2022-01-01 18:09:19,395] [main] c.d.d.c.Cluster - New Cassandra host /172.16.4.209:9042 added 
INFO   [2022-01-01 18:09:19,395] [main] c.d.d.c.Cluster - New Cassandra host oda-k8ssandra-oslo-service/172.16.10.23:9042 added 
INFO   [2022-01-01 18:09:19,395] [main] c.d.d.c.Cluster - New Cassandra host /172.16.8.50:9042 added 
INFO   [2022-01-01 18:09:19,395] [main] c.d.d.c.Cluster - New Cassandra host oda-k8ssandra-oslo-service/172.16.6.15:9042 added 
WARN   [2022-01-01 18:09:24,401] [clustername-nio-worker-0] c.d.d.c.HostConnectionPool - Error creating connection to /172.16.14.86:9042

and:

ERROR  [2022-01-01 18:09:24,423] [main] i.c.ReaperApplication - Storage is not ready yet, trying again to connect shortly... 
com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: oda-k8ssandra-oslo-service/172.16.6.15:9042 (com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency QUORUM (2 required but only 1 alive)), oda-k8ssandra-oslo-service/172.16.10.23:9042 (com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency QUORUM (2 required but only 1 alive)))
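One way to check whether the IPs Reaper is dialing are still live is to compare them against the service's current endpoints and the pods' actual IPs; stale rows in Cassandra's system.peers table can also leave the driver trying addresses that no longer exist. A sketch — pod names and namespace are assumptions based on the listings above:

```shell
# Current endpoints behind the service Reaper uses as a contact point:
kubectl get endpoints oda-k8ssandra-oslo-service -n <namespace> -o wide

# Actual pod IPs, to compare against the addresses in Reaper's log:
kubectl get pods -n <namespace> -o wide | grep oda-k8ssandra-oslo

# Peer entries the driver may be handed on connect (run against any node);
# entries for IPs that no longer exist suggest stale peers:
kubectl exec -it oda-k8ssandra-oslo-a1-sts-0 -c cassandra -n <namespace> -- \
  cqlsh -u <user> -p <password> -e "SELECT peer, host_id FROM system.peers;"
```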

We are a bit stuck and unsure how best to debug this. I may need to inspect the particular IPs that Reaper is trying to contact at a specific point in time to see whether they are correct. We may also need to do some manual work to bring the cluster back into a healthier state, but I worry it would just slide back into the current state unless we change or improve something in our setup. Any advice is appreciated!

Hey @louise,

Can you take a look at K8SSAND-1288 ⁃ Hung jobs, reaper is trying to connect to old IPs that don't exists · Issue #1314 · k8ssandra/k8ssandra · GitHub and see if it is related? This comment might also be relevant.