The main symptom of this issue is that Reaper appears to be stuck in a permanent restart loop.
As you can see from the pods here:
oda-k8ssandra-cass-operator-85c9f8c4c7-zqfdp 1/1 Running 0 22d
oda-k8ssandra-medusa-operator-f4f75c7bb-6bb47 1/1 Running 0 22d
oda-k8ssandra-oslo-a1-sts-0 3/3 Running 0 22d
oda-k8ssandra-oslo-a2-sts-0 3/3 Running 0 22d
oda-k8ssandra-oslo-b1-sts-0 3/3 Running 0 22d
oda-k8ssandra-oslo-b2-sts-0 3/3 Running 0 22d
oda-k8ssandra-oslo-m1-sts-0 3/3 Running 0 22d
oda-k8ssandra-reaper-7f744bcfb7-2c8w6 0/1 Running 2535 (21s ago) 22d
oda-k8ssandra-reaper-operator-654f9f9bdc-22lb7 1/1 Running 0 22d
When we first upgraded the cluster to the newer version of k8ssandra (1.4.1), I had to run the migration patches manually or else the upgrade ran into issues. For a while afterwards Reaper appeared healthy and only restarted occasionally, but it has slowly slipped into a state where it restarts so often that very little work gets done. Our initial theory was that Reaper was losing contact with the cluster because it was holding on to pod IPs (which are not static) and those IPs changed over time. On closer inspection I am not sure this is the case, since it does reference the oda-k8ssandra-oslo-service.
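For reference, this is roughly how I have been checking whether the current pod IPs still line up with the endpoints behind the service Reaper connects to (the service name is taken from the Reaper logs below; the namespace "oda" is just a placeholder for ours):

# List the Cassandra pods with their current IPs
kubectl -n oda get pods -o wide

# List the endpoints currently registered behind the service Reaper uses
kubectl -n oda get endpoints oda-k8ssandra-oslo-service -o yaml

The idea is to compare the endpoint IPs with what nodetool status reports.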
Nodetool status doesn’t seem to indicate any issues with the cluster:
Datacenter: oslo
================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 172.16.6.17 3 GiB 256 ? f269d385-50f6-4130-b6d2-bcfe581ac7a7 m1
UN 172.16.4.146 16.01 GiB 256 ? 5c4879e0-b757-4d96-b068-572c26ef4a4a b2
UN 172.16.14.245 12.09 GiB 256 ? e2513d2f-e261-46d3-95c6-f487b403975b b1
UN 172.16.8.235 14.59 GiB 256 ? 0d93d741-2922-4b58-9182-22a8981c91a0 a2
UN 172.16.10.174 10.79 GiB 256 ? 26599a8f-a58e-453d-aed6-cb000ac475eb a1
Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
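That note made me want to double-check the replication settings of the non-system keyspaces, in particular Reaper's backend keyspace. A rough sketch of how that can be done from one of the Cassandra pods (the container name "cassandra", the credential placeholders, and the reaper_db keyspace name are assumptions based on the defaults, not something confirmed from our config):

# Open cqlsh inside one of the Cassandra pods
kubectl -n oda exec -it oda-k8ssandra-oslo-a1-sts-0 -c cassandra -- cqlsh -u <superuser> -p <password>

-- Inside cqlsh: show the replication settings of every keyspace
SELECT keyspace_name, replication FROM system_schema.keyspaces;

-- And the full definition of Reaper's keyspace
DESCRIBE KEYSPACE reaper_db;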
On startup I see:
INFO [2022-01-01 18:20:28,670] [main] c.d.d.c.p.DCAwareRoundRobinPolicy - Using data-center name 'oslo' for DCAwareRoundRobinPolicy (if this is incorrect, please provide the correct datacenter name with DCAwareRoundRobinPolicy constructor)
INFO [2022-01-01 18:20:28,671] [main] c.d.d.c.Cluster - New Cassandra host oda-k8ssandra-oslo-service/172.16.14.245:9042 added
INFO [2022-01-01 18:20:28,671] [main] c.d.d.c.Cluster - New Cassandra host oda-k8ssandra-oslo-service/172.16.4.146:9042 added
INFO [2022-01-01 18:20:28,672] [main] c.d.d.c.Cluster - New Cassandra host oda-k8ssandra-oslo-service/172.16.6.17:9042 added
INFO [2022-01-01 18:20:28,672] [main] c.d.d.c.Cluster - New Cassandra host oda-k8ssandra-oslo-service/172.16.10.174:9042 added
INFO [2022-01-01 18:20:28,672] [main] c.d.d.c.Cluster - New Cassandra host oda-k8ssandra-oslo-service/172.16.8.235:9042 added
ERROR [2022-01-01 18:21:25,580] [main] i.c.ReaperApplication - Storage is not ready yet, trying again to connect shortly...
com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra timeout during read query at consistency QUORUM (2 responses were required but only 1 replica responded)
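Since Reaper's storage keyspace apparently cannot answer at QUORUM, one thing I am considering is repairing that keyspace specifically and verifying it has enough replicas in this datacenter. A sketch of what that might look like (again assuming the default reaper_db keyspace name and placeholder JMX credentials; the ALTER is only relevant if the replication factor turns out to be lower than expected, and we have not tried it yet):

# Repair only Reaper's backend keyspace on each node
nodetool -u <jmx-user> -pw <jmx-password> repair reaper_db

-- If reaper_db turns out to have too few replicas in this DC, something like this
-- (followed by another repair) might be needed:
ALTER KEYSPACE reaper_db WITH replication = {'class': 'NetworkTopologyStrategy', 'oslo': 3};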
When it does manage to get a bit further through startup, we see other information about issues that may need to be dealt with:
INFO [2022-01-01 18:21:42,098] [main] i.c.s.RepairManager - re-trigger a running run after restart, with id df8295a0-5eed-11ec-b11b-e5ac57216973
INFO [2022-01-01 18:21:42,194] [main] i.c.ReaperApplication - Initialization complete!
WARN [2022-01-01 18:21:42,194] [main] i.c.ReaperApplication - Reaper is ready to get things done!
INFO [2022-01-01 18:21:42,197] [main] i.d.s.ServerFactory - Starting cassandra-reaper
_________ .___ __________
\_ ___ \_____ ______ ___________ ____ __| _/___________ \______ \ ____ _____ ______ ___________
/ \ \/\__ \ / ___// ___/\__ \ / \ / __ |\_ __ \__ \ | _// __ \\__ \ \____ \_/ __ \_ __ \
\ \____/ __ \_\___ \ \___ \ / __ \| | \/ /_/ | | | \// __ \_ | | \ ___/ / __ \| |_> > ___/| | \/
\______ (____ /____ >____ >(____ /___| /\____ | |__| (____ / |____|_ /\___ >____ / __/ \___ >__|
\/ \/ \/ \/ \/ \/ \/ \/ \/ \/ \/|__| \/
INFO [2022-01-01 18:21:42,243] [main] o.e.j.s.SetUIDListener - Opened application@320be73{HTTP/1.1,[http/1.1]}{0.0.0.0:8080}
INFO [2022-01-01 18:21:42,243] [main] o.e.j.s.SetUIDListener - Opened admin@435e416c{HTTP/1.1,[http/1.1]}{0.0.0.0:8081}
INFO [2022-01-01 18:21:42,245] [main] o.e.j.s.Server - jetty-9.4.z-SNAPSHOT; built: 2019-04-29T20:42:08.989Z; git: e1bc35120a6617ee3df052294e433f3a25ce7097; jvm 1.8.0_312-b07
INFO [2022-01-01 18:21:42,290] [oda-k8ssandra:df8295a0-5eed-11ec-b11b-e5ac57216973] i.c.s.RepairRunner - Attempting to run new segment...
INFO [2022-01-01 18:21:42,567] [oda-k8ssandra:df8295a0-5eed-11ec-b11b-e5ac57216973] i.c.s.RepairRunner - All nodes are busy or have too many pending compactions for the remaining candidate segments.
INFO [2022-01-01 18:21:42,596] [oda-k8ssandra:df8295a0-5eed-11ec-b11b-e5ac57216973] i.c.s.RepairRunner - All nodes are busy or have too many pending compactions for the remaining candidate segments.
INFO [2022-01-01 18:21:42,602] [oda-k8ssandra:df8295a0-5eed-11ec-b11b-e5ac57216973] i.c.s.RepairRunner - Repair amount done 310.0
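Because of the "too many pending compactions" messages, I also want to look at the compaction backlog and thread pools on each node, roughly like this (the container name "cassandra" is again an assumption):

# Pending/active compactions on a node
kubectl -n oda exec oda-k8ssandra-oslo-a1-sts-0 -c cassandra -- nodetool compactionstats

# Thread pool stats, to see if anything is blocked or dropping tasks
kubectl -n oda exec oda-k8ssandra-oslo-a1-sts-0 -c cassandra -- nodetool tpstats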
Later in the logs we see the driver adding hosts with new IPs (I think?), but then running into connection issues:
INFO [2022-01-01 18:09:19,395] [main] c.d.d.c.p.DCAwareRoundRobinPolicy - Using data-center name 'oslo' for DCAwareRoundRobinPolicy (if this is incorrect, please provide the correct datacenter name with DCAwareRoundRobinPolicy constructor)
INFO [2022-01-01 18:09:19,395] [main] c.d.d.c.Cluster - New Cassandra host /172.16.14.86:9042 added
INFO [2022-01-01 18:09:19,395] [main] c.d.d.c.Cluster - New Cassandra host /172.16.4.209:9042 added
INFO [2022-01-01 18:09:19,395] [main] c.d.d.c.Cluster - New Cassandra host oda-k8ssandra-oslo-service/172.16.10.23:9042 added
INFO [2022-01-01 18:09:19,395] [main] c.d.d.c.Cluster - New Cassandra host /172.16.8.50:9042 added
INFO [2022-01-01 18:09:19,395] [main] c.d.d.c.Cluster - New Cassandra host oda-k8ssandra-oslo-service/172.16.6.15:9042 added
WARN [2022-01-01 18:09:24,401] [clustername-nio-worker-0] c.d.d.c.HostConnectionPool - Error creating connection to /172.16.14.86:9042
and:
ERROR [2022-01-01 18:09:24,423] [main] i.c.ReaperApplication - Storage is not ready yet, trying again to connect shortly...
com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: oda-k8ssandra-oslo-service/172.16.6.15:9042 (com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency QUORUM (2 required but only 1 alive)), oda-k8ssandra-oslo-service/172.16.10.23:9042 (com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency QUORUM (2 required but only 1 alive)))
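To figure out where these IPs are coming from, the plan is to compare what the driver logs with what Cassandra itself advertises in its peers table (since the driver discovers nodes from there) and with the live pod IPs. Roughly, run from cqlsh on any node plus kubectl:

-- What each node believes its peers are
SELECT peer, data_center, rack, rpc_address, host_id FROM system.peers;

# Current pod IPs for comparison
kubectl -n oda get pods -o wide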
We are a bit stuck and unsure how best to debug this. I may need to take a closer look at the specific IPs Reaper is trying to contact at a given point in time, to see whether they are correct. We could also do some manual work to bring the cluster back into a healthier state, but I worry it would slide back into the current state unless we change or improve something in our setup. Any advice is appreciated!