Article - CS438655

Issue with one or more Zookeeper nodes prevented ThingWorx Platform Active-Passive High Availability (HA) from failing over successfully

Modified: 04-Mar-2025   


Applies To

  • ThingWorx Platform 8.4 to 8.5
  • Zookeeper

Description

  • Attempting to complete maintenance tasks on a ThingWorx Platform Active-Passive High Availability (HA) configuration resulted in downtime despite the proper number of nodes being available at all times
  • Taking one of three available Zookeeper nodes offline for maintenance made ThingWorx Platform inaccessible
  • Unexpected downtime occurred when performing maintenance on a ThingWorx Platform Active-Passive HA environment
  • Only two of the three Zookeeper nodes were part of the quorum, which resulted in downtime for ThingWorx Platform when one of those nodes went offline
  • Ensured the following node counts were online and available in the ThingWorx Active-Passive HA configuration but still experienced unplanned downtime:
    • 1 ThingWorx node
    • 2 Zookeeper nodes
  • Zookeeper logs indicated that only two of the three nodes were part of the quorum:
    • [myid:<ZK ID>] - INFO [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Leader@1296] - Have quorum of supporters, sids: [ [<ZK ID 1> <ZK ID 2>],[<ZK ID 1>, <ZK ID 2>] ]; starting up and setting last processed zxid: 0x2900000000
  • Restarting a Zookeeper node caused it to immediately form a quorum in which it was the leader, per the Zookeeper logs:
    • [myid:<ZK ID>] - INFO [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Leader@464] - LEADING - LEADER ELECTION TOOK - <Time> MS
    • When restarting a single Zookeeper node, it should join the existing quorum as a FOLLOWER
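
The symptoms above follow from Zookeeper's majority-quorum rule: an ensemble of n voting members needs floor(n/2) + 1 of them online to serve requests. A minimal sketch of that arithmetic (illustrative only, not part of the article):

```python
# Majority-quorum arithmetic for a Zookeeper ensemble.

def quorum_size(ensemble_size: int) -> int:
    """Minimum voting members required for a quorum: floor(n/2) + 1."""
    return ensemble_size // 2 + 1

# A 3-node ensemble needs at least 2 voting members at all times.
assert quorum_size(3) == 2

# Healthy case: all 3 nodes are in the quorum, one is taken down for
# maintenance -> 2 remain, quorum holds, ThingWorx stays up.
assert 3 - 1 >= quorum_size(3)

# Failure case in this article: only 2 of the 3 nodes had joined the
# quorum, so taking one offline left 1 active member, below the
# required 2 -> quorum lost and ThingWorx Platform became inaccessible.
assert not (2 - 1 >= quorum_size(3))
```

This is why an odd-sized ensemble with all members participating tolerates the planned loss of one node, while an ensemble where a node silently failed to join does not.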