MigrationException, 017_add_custom_jmx_port.cql, Invalid column name properties

Hi,

Using k8ssandra 1.1.0 on Kubernetes 1.19.7 (Azure)

We have the same install on several environments. Somehow, today, on one of the environments, the pod running Reaper failed with a MigrationException. Can you please give some hints to guide the investigation further and possibly suggest a fix? Here is an excerpt of the log from kubectl logs pod/myk8ssandra--reaper-6ad8f94d59:

Thanks for any help

INFO   [2021-06-07 15:41:44,871] [main] i.d.s.DefaultServerFactory - Registering jersey handler with root path prefix: / 
INFO   [2021-06-07 15:41:44,873] [main] i.d.s.DefaultServerFactory - Registering admin handler with root path prefix: / 
INFO   [2021-06-07 15:41:44,873] [main] i.d.a.AssetsBundle - Registering AssetBundle with name: assets for path /webui/* 
INFO   [2021-06-07 15:41:44,912] [main] i.c.ReaperApplication - initializing runner thread pool with 15 threads and 2 repair runners 
INFO   [2021-06-07 15:41:44,912] [main] i.c.ReaperApplication - initializing storage of type: cassandra 
INFO   [2021-06-07 15:41:44,942] [main] c.d.d.core - DataStax Java driver 3.10.1 for Apache Cassandra 
INFO   [2021-06-07 15:41:44,944] [main] c.d.d.c.GuavaCompatibility - Detected Guava >= 19 in the classpath, using modern compatibility layer 
INFO   [2021-06-07 15:41:45,130] [main] c.d.d.c.ClockFactory - Using native clock to generate timestamps. 
INFO   [2021-06-07 15:41:45,199] [main] c.d.d.c.NettyUtil - Did not find Netty's native epoll transport in the classpath, defaulting to NIO. 
INFO   [2021-06-07 15:41:45,635] [main] c.d.d.c.p.DCAwareRoundRobinPolicy - Using data-center name 'dc1' for DCAwareRoundRobinPolicy (if this is incorrect, please provide the correct datacenter name with DCAwareRoundRobinPolicy constructor) 
INFO   [2021-06-07 15:41:45,636] [main] c.d.d.c.Cluster - New Cassandra host myk8ssandra-dc1-service/172.20.146.196:9042 added 
INFO   [2021-06-07 15:41:45,637] [main] c.d.d.c.Cluster - New Cassandra host myk8ssandra-dc1-service/172.20.145.90:9042 added 
INFO   [2021-06-07 15:41:46,175] [main] o.c.c.m.MigrationRepository - Found 12 migration scripts 
WARN   [2021-06-07 15:41:46,175] [main] i.c.s.CassandraStorage - Starting db migration from 16 to 27… 
INFO   [2021-06-07 15:41:46,377] [main] o.c.c.m.MigrationRepository - Found 12 migration scripts 
WARN   [2021-06-07 15:41:46,506] [clustername-worker-0] c.d.d.c.Cluster - Re-preparing already prepared query is generally an anti-pattern and will likely affect performance. Consider preparing the statement only once. Query='insert into schema_migration(applied_successful, version, script_name, script, executed_at) values(?, ?, ?, ?, ?)' 
WARN   [2021-06-07 15:41:46,510] [clustername-worker-0] c.d.d.c.Cluster - Re-preparing already prepared query is generally an anti-pattern and will likely affect performance. Consider preparing the statement only once. Query='INSERT INTO schema_migration_leader (keyspace_name, leader, took_lead_at, leader_hostname) VALUES (?, ?, dateOf(now()), ?) IF NOT EXISTS USING TTL 300' 
WARN   [2021-06-07 15:41:46,514] [clustername-worker-0] c.d.d.c.Cluster - Re-preparing already prepared query is generally an anti-pattern and will likely affect performance. Consider preparing the statement only once. Query='DELETE FROM schema_migration_leader where keyspace_name = ? IF leader = ?' 
ERROR  [2021-06-07 15:41:46,557] [main] i.c.ReaperApplication - Storage is not ready yet, trying again to connect shortly... 
org.cognitor.cassandra.migration.MigrationException: Error during migration of script 017_add_custom_jmx_port.cql while executing 'ALTER TABLE cluster ADD properties text;'
	at org.cognitor.cassandra.migration.Database.execute(Database.java:269)
	at java.util.Collections$SingletonList.forEach(Collections.java:4824)
	at org.cognitor.cassandra.migration.MigrationTask.migrate(MigrationTask.java:68)
	at io.cassandrareaper.storage.CassandraStorage.migrate(CassandraStorage.java:362)
	at io.cassandrareaper.storage.CassandraStorage.initializeCassandraSchema(CassandraStorage.java:293)
	at io.cassandrareaper.storage.CassandraStorage.initializeAndUpgradeSchema(CassandraStorage.java:250)
	at io.cassandrareaper.storage.CassandraStorage.<init>(CassandraStorage.java:238)
	at io.cassandrareaper.ReaperApplication.initializeStorage(ReaperApplication.java:480)
	at io.cassandrareaper.ReaperApplication.tryInitializeStorage(ReaperApplication.java:303)
	at io.cassandrareaper.ReaperApplication.run(ReaperApplication.java:181)
	at io.cassandrareaper.ReaperApplication.run(ReaperApplication.java:98)
	at io.dropwizard.cli.EnvironmentCommand.run(EnvironmentCommand.java:43)
	at io.dropwizard.cli.ConfiguredCommand.run(ConfiguredCommand.java:87)
	at io.dropwizard.cli.Cli.run(Cli.java:78)
	at io.dropwizard.Application.run(Application.java:93)
	at io.cassandrareaper.ReaperApplication.main(ReaperApplication.java:117)
Caused by: com.datastax.driver.core.exceptions.InvalidQueryException: Invalid column name properties because it conflicts with an existing column
	at com.datastax.driver.core.exceptions.InvalidQueryException.copy(InvalidQueryException.java:50)
	at com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:35)
	at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:293)
	at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:58)
	at org.cognitor.cassandra.migration.Database.executeStatement(Database.java:277)
	at org.cognitor.cassandra.migration.Database.execute(Database.java:261)
	... 15 common frames omitted
Caused by: com.datastax.driver.core.exceptions.InvalidQueryException: Invalid column name properties because it conflicts with an existing column
	at com.datastax.driver.core.Responses$Error.asException(Responses.java:181)
	at com.datastax.driver.core.DefaultResultSetFuture.onSet(DefaultResultSetFuture.java:215)
	at com.datastax.driver.core.RequestHandler.setFinalResult(RequestHandler.java:235)
	at com.datastax.driver.core.RequestHandler.access$2600(RequestHandler.java:61)
	at com.datastax.driver.core.RequestHandler$SpeculativeExecution.setFinalResult(RequestHandler.java:1011)
	at com.datastax.driver.core.RequestHandler$SpeculativeExecution.onSet(RequestHandler.java:814)
	at com.datastax.driver.core.Connection$Dispatcher.channelRead0(Connection.java:1287)
	at com.datastax.driver.core.Connection$Dispatcher.channelRead0(Connection.java:1205)
	at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:335)
	at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:335)
	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:335)
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:312)
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:286)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:335)
	at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86)
	at com.datastax.driver.core.InboundTrafficMeter.channelRead(InboundTrafficMeter.java:38)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:335)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1304)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:921)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:135)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:646)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:581)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:460)
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:748)

Could this be caused by our having added one extra node (a physical Kubernetes node, not a Cassandra node) to the Kubernetes cluster? That is indeed the one difference in the k8ssandra install affected by the above exception, and the timestamps in the Reaper pod logs above roughly coincide with the moment the extra node was added.

Hi @Tri

Could this be caused because we have added one extra node (physical Kubernetes node, not Cassandra node) in the Kubernetes cluster?

I wouldn’t think so, unless it caused a Cassandra pod to get rescheduled while Reaper was applying schema changes, leading to an inconsistent schema across the C* cluster.

Did this happen on a helm install or on a helm upgrade?

My guess is that the schema is in an inconsistent state. Can you check the Cassandra logs as well?
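If it helps, here are two checks I might run first; a minimal sketch with placeholder values (replace <namespace>, <user>, and <password> with your own; the pod name follows the naming used later in this thread). nodetool describecluster reports the schema version each node is on, and system_schema.columns shows whether the properties column that migration 017 tries to add already exists:

# Schema agreement check: a healthy cluster reports exactly one
# entry under "Schema versions" in the output.
kubectl exec -it myk8ssandra-dc1-zone1-sts-0 -c cassandra -n <namespace> -- \
  nodetool describecluster

# Check whether the column from the failing ALTER TABLE already exists.
kubectl exec -it myk8ssandra-dc1-zone1-sts-0 -c cassandra -n <namespace> -- \
  cqlsh -u <user> -p <password> -e "SELECT column_name, type FROM system_schema.columns WHERE keyspace_name='reaper_db' AND table_name='cluster';"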

I would be inclined to try reinstalling Reaper, but first make sure your Cassandra pods are in a ready state. If you want to try reinstalling, I might go about it this way:

  • Set the reaper.enabled chart property to false and run a helm upgrade. This will remove the Reaper bits.
  • Then drop the reaper_db keyspace.
  • Set reaper.enabled back to true and run another helm upgrade.

Cheers

John

@jsanda Works perfectly, thank you very much.

Did this happen on a helm install or on a helm upgrade?

Neither; the Cassandra instance was running happily. The incident causing the Reaper pod CrashLoopBackOff occurred shortly after we added more hardware to the Azure AKS cluster (one node). We don’t know the root cause. I speculate this was unfortunate timing, where the Reaper job kicked in at about the same time Azure AKS was redistributing the workload across the nodes. If the Cassandra pods were being moved at that time, maybe Reaper lost its context and recorded some corrupted state.

Here are the detailed steps I followed, in case a future reader runs into the same problem.

Set common parameters in shell variables:

K8X_RELEASE_NAME='myk8ssandra'
K8X_NS='k8cass'

Check that Cassandra is ready:

# Verify all Cassandra pods are running and READY
kubectl get pods -l cassandra.datastax.com/datacenter=dc1 -n $K8X_NS

# Status of the Cassandra datacenter `dc1`
kubectl get CassandraDatacenter/dc1 -o 'jsonpath={.status.cassandraOperatorProgress}{"\n"}' -n $K8X_NS

Uninstall Reaper:

helm upgrade $K8X_RELEASE_NAME k8ssandra/k8ssandra --namespace $K8X_NS \
  --reuse-values --set reaper.enabled=false

Drop the reaper_db keyspace. Wait a few minutes (about 5?) first and check that Cassandra is ready again:

K8SADMIN_USER=$(kubectl -n $K8X_NS get secret ${K8X_RELEASE_NAME}-superuser -o jsonpath='{.data.username}' | base64 -d)
K8SADMIN_PASS=$(kubectl -n $K8X_NS get secret ${K8X_RELEASE_NAME}-superuser -o jsonpath='{.data.password}' | base64 -d)

kubectl exec -it ${K8X_RELEASE_NAME}-dc1-zone1-sts-0 -c cassandra -n $K8X_NS -- sh -c "cqlsh -u $K8SADMIN_USER -p $K8SADMIN_PASS"

abc-superuser@cqlsh>  DROP KEYSPACE reaper_db;

-- Make sure the keyspace no longer exists
abc-superuser@cqlsh>  SELECT * FROM system_schema.keyspaces WHERE keyspace_name='reaper_db';

exit

Re-install Reaper:

helm upgrade $K8X_RELEASE_NAME k8ssandra/k8ssandra --namespace $K8X_NS \
  --reuse-values --set reaper.enabled=true
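Rather than guessing at the timing, one option is to simply watch the pods come back; a minimal sketch:

# Watch pod statuses until the Reaper pod shows Running and READY
# (Ctrl-C to stop watching).
kubectl get pods -n $K8X_NS -w | grep reaper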

Wait about 5 minutes and everything is back in order again (the Reaper pod status is Running and Ready). The slightly scary part was a warning when running DROP KEYSPACE reaper_db; I’m not sure what to make of it:

Warning: schema version mismatch detected; check the schema versions of your nodes in system.local and system.peers.
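For reference, the schema versions the warning refers to can be inspected directly from cqlsh; a minimal sketch (on a healthy cluster, every node reports the same version UUID):

-- Schema version reported by the local node:
SELECT schema_version FROM system.local;

-- Schema versions the local node has recorded for its peers:
SELECT peer, schema_version FROM system.peers;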

Hey @Tri

Thanks for the detailed update. Glad to hear you got things working!

When checking datacenter readiness, it is not sufficient to check only .status.cassandraOperatorProgress. You should also check the Ready condition, e.g.:

status:
  cassandraOperatorProgress: Ready
  conditions:
  - lastTransitionTime: "2021-06-04T22:19:10Z"
    message: ""
    reason: ""
    status: "False"
    type: ScalingUp
  - lastTransitionTime: "2021-06-05T20:50:21Z"
    message: ""
    reason: ""
    status: "False"
    type: Updating
  - lastTransitionTime: "2021-06-04T22:17:21Z"
    message: ""
    reason: ""
    status: "False"
    type: Stopped
  - lastTransitionTime: "2021-06-04T21:08:15Z"
    message: ""
    reason: ""
    status: "False"
    type: ReplacingNodes
  - lastTransitionTime: "2021-06-04T21:08:15Z"
    message: ""
    reason: ""
    status: "False"
    type: RollingRestart
  - lastTransitionTime: "2021-06-04T22:19:10Z"
    message: ""
    reason: ""
    status: "False"
    type: Resuming
  - lastTransitionTime: "2021-06-04T21:08:15Z"
    message: ""
    reason: ""
    status: "False"
    type: ScalingDown
  - lastTransitionTime: "2021-06-04T21:08:15Z"
    message: ""
    reason: ""
    status: "True"
    type: Valid
  - lastTransitionTime: "2021-06-04T21:08:35Z"
    message: ""
    reason: ""
    status: "True"
    type: Initialized
  - lastTransitionTime: "2021-06-04T22:19:10Z"
    message: ""
    reason: ""
    status: "True"
    type: Ready
  lastServerNodeStarted: "2021-06-09T20:50:14Z"
  nodeStatuses:
    multi-dc-test-dc-west-rack1-sts-0:
      hostID: c2106ff5-8dda-4e68-a56a-8d5339d421c7
    multi-dc-test-dc-west-rack2-sts-0:
      hostID: a89895d4-056c-4ee3-83ba-c22e2d956d8a
    multi-dc-test-dc-west-rack3-sts-0:
      hostID: 89e193ab-ff5b-48e8-9298-d210be2d49a7
  observedGeneration: 4
  quietPeriod: "2021-06-09T20:50:56Z"
  superUserUpserted: "2021-06-09T20:50:51Z"
  usersUpserted: "2021-06-09T20:50:51Z"
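For a scripted check, something along these lines should also work; a minimal sketch, assuming the dc1 datacenter and the $K8X_NS variable used earlier in this thread (it prints True once the Ready condition is met):

kubectl get cassandradatacenter/dc1 -n $K8X_NS \
  -o 'jsonpath={.status.conditions[?(@.type=="Ready")].status}{"\n"}'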

As for the schema mismatch warning, I suggest doing a rolling restart of the datacenter to see if that clears it up.

Cheers

John

Can you please show how to do a rolling restart the “Kubernetes” way?

Set rollingRestartRequested: true.

For reference, see the example-cassdc-full.yaml:

  # Setting rollingRestartRequested to true will have Cass Operator do a rolling
  # restart on this CassDC at the next opportunity. The operator will set this
  # back to false once the restart is in progress. 
  rollingRestartRequested: false
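If you would rather not edit the full manifest, the same field can be set with a one-line merge patch; a minimal sketch, assuming the dc1 datacenter from this thread (replace mycass with your namespace):

kubectl patch cassandradatacenter dc1 -n mycass --type merge \
  -p '{"spec": {"rollingRestartRequested": true}}'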

Currently using cass-operator 0.29.1 from k8ssandra Helm chart version 1.1.0. It seems the rollingRestartRequested field is missing from the CassandraDatacenter definition:

kubectl get cassdc/dc1 -n mycass -o yaml > ~/Downloads/dc1.yaml

Should I just add this field under spec: as in the example you showed, and then kubectl apply?

Correct. In my case I did:

$ kubectl get cassdc dc1 -o yaml > cassdc-dc1.yaml

Then I updated cassdc-dc1.yaml and inserted rollingRestartRequested just before podTemplateSpec here:

...
spec:
  allowMultipleNodesPerWorker: true
  clusterName: k8ssandra
  config:
    ...
  configBuilderResources: {}
  dockerImageRunsAsCassandra: true
  managementApiAuth:
    insecure: {}
  rollingRestartRequested: true          <----- INSERTED HERE
  podTemplateSpec:
     ...

Then applied the changes with:

$ kubectl apply -f cassdc-dc1.yaml

Note that if you do another kubectl get, you still won’t see rollingRestartRequested in the output, since the operator sets it back to false once the restart is in progress.
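One way to confirm the restart was picked up is to watch the RollingRestart condition shown in the status example earlier in this thread; a minimal sketch, assuming the dc1 datacenter and mycass namespace used above:

# Prints "True" while the rolling restart is in progress and "False"
# once it has completed.
kubectl get cassdc dc1 -n mycass \
  -o 'jsonpath={.status.conditions[?(@.type=="RollingRestart")].status}{"\n"}'

Cheers!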