REBALANCELEADERS is not reliable

Discussion:

Bernd Fehling

2018-11-27 14:12:33 UTC

Hi list,

unfortunately REBALANCELEADERS is not reliable and the leader
election has unpredictable results with SolrCloud 6.6.5 and
Zookeeper 3.4.10.
Seen with 5 shards / 3 replicas.

- CLUSTERSTATUS reports all replicas (core_nodes) as state=active.
- setting the property preferredLeader on other replicas with ADDREPLICAPROP
- calling REBALANCELEADERS
- some leaders have changed, some not.

I then tried:
- removing all preferredLeader properties from the replicas where the leader change succeeded.
- trying REBALANCELEADERS again for the rest. No success.
- Shutting down nodes to force the leader to a specific replica left running.
No success.
- calling REBALANCELEADERS responds that the replica is inactive!!!
- calling CLUSTERSTATUS reports that the replica is active!!!
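
For readers who want to reproduce the steps above, here is a minimal sketch that only builds the Collections API request URLs for ADDREPLICAPROP, REBALANCELEADERS and CLUSTERSTATUS. The base URL and the collection, shard and replica names are hypothetical placeholders, and the helper function names are made up for this illustration; nothing is sent to a server.

```python
from urllib.parse import urlencode

# Hypothetical base URL; adjust host and port for your own cluster.
SOLR_BASE = "http://localhost:8983/solr"


def collections_api_url(action, params):
    """Build a Collections API URL for the given action and parameters."""
    query = {"action": action, "wt": "json"}
    query.update(params)
    return f"{SOLR_BASE}/admin/collections?{urlencode(query)}"


def add_preferred_leader_url(collection, shard, replica):
    """URL that marks one replica (core_node name) as preferredLeader."""
    return collections_api_url("ADDREPLICAPROP", {
        "collection": collection,
        "shard": shard,
        "replica": replica,
        "property": "preferredLeader",
        "property.value": "true",
    })


def rebalance_leaders_url(collection):
    """URL that asks Solr to move leadership to the preferred replicas."""
    return collections_api_url("REBALANCELEADERS", {"collection": collection})


def cluster_status_url(collection):
    """URL that fetches the cluster state for one collection."""
    return collections_api_url("CLUSTERSTATUS", {"collection": collection})


if __name__ == "__main__":
    # Placeholder names; print the three requests in the order used above.
    print(add_preferred_leader_url("mycoll", "shard1", "core_node3"))
    print(rebalance_leaders_url("mycoll"))
    print(cluster_status_url("mycoll"))
```

The printed URLs can be opened with any HTTP client; comparing the CLUSTERSTATUS output before and after REBALANCELEADERS shows which leaders actually moved.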

Also, the replica which doesn't want to become leader is not in the list
of collections->[collection_name]->leader_elect->shard1..x->election.
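
A rough sketch of that check: given the child node names of a shard's election path and the replica names of that shard, report which replicas are missing from the election queue. It assumes election node names look like <sessionId>-<core_node>-n_<sequence>, which is an assumption about the naming and should be verified against your own ZooKeeper tree. The function names and sample values are hypothetical, and fetching the node list from ZooKeeper is left out.

```python
def election_path(collection, shard):
    """ZooKeeper path of the election queue for one shard (no chroot assumed)."""
    return f"/collections/{collection}/leader_elect/{shard}/election"


def replica_of(election_node):
    """Extract the core_node name from an election node name.

    Assumption: names have the form <sessionId>-<core_node>-n_<sequence>.
    """
    session_and_replica, _, _ = election_node.rpartition("-n_")
    _, _, replica = session_and_replica.partition("-")
    return replica


def missing_from_election(election_nodes, replicas):
    """Return the replicas that have no node in the election queue."""
    queued = {replica_of(node) for node in election_nodes}
    return sorted(r for r in replicas if r not in queued)


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    nodes = [
        "98765432101234567-core_node1-n_0000000004",
        "98765432101234568-core_node2-n_0000000005",
    ]
    print(election_path("mycoll", "shard1"))
    print(missing_from_election(nodes, ["core_node1", "core_node2", "core_node3"]))
```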

Where is CLUSTERSTATUS getting its state info from?

Does anyone else have problems with REBALANCELEADERS?

I noticed that the Reference Guide writes "preferredLeader" (with capital "L")
but the Java code has "preferredleader".
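
Because of that spelling difference, a script that inspects CLUSTERSTATUS output is safer when it compares property names case-insensitively. The sketch below lists replicas that carry the preferred-leader property but are not the leader. It assumes the parsed JSON has the shape cluster -> collections -> [name] -> shards -> [shard] -> replicas, and that the property appears under a key like "property.preferredleader"; the key name, the function name and the sample data are assumptions for illustration.

```python
def preferred_but_not_leader(cluster_status, collection):
    """List (shard, replica, state) where preferredLeader is set but unmet."""
    shards = cluster_status["cluster"]["collections"][collection]["shards"]
    mismatches = []
    for shard_name, shard in shards.items():
        for replica_name, replica in shard["replicas"].items():
            # Lower-case keys and values so "preferredLeader" and
            # "preferredleader" are treated the same.
            props = {k.lower(): str(v).lower() for k, v in replica.items()}
            preferred = props.get("property.preferredleader") == "true"
            is_leader = props.get("leader") == "true"
            if preferred and not is_leader:
                mismatches.append((shard_name, replica_name, props.get("state")))
    return sorted(mismatches)


if __name__ == "__main__":
    # Made-up sample in the assumed CLUSTERSTATUS shape.
    sample = {"cluster": {"collections": {"mycoll": {"shards": {
        "shard1": {"replicas": {
            "core_node1": {"state": "active", "leader": "true"},
            "core_node2": {"state": "active",
                           "property.preferredLeader": "true"},
        }},
    }}}}}
    print(preferred_but_not_leader(sample, "mycoll"))
```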

Regards, Bernd

Vadim Ivanov

2018-11-27 14:47:22 UTC

Hi, Bernd
I have tried REBALANCELEADERS with Solr 6.3 and 7.5
I had very similar results and the same impression that it's not reliable :(
--
Br, Vadim

Bernd Fehling

2018-11-28 07:17:00 UTC

Hi Vadim,

thanks for confirming.
So it seems to be a general problem with Solr 6.x and 7.x, and it might
still be there in the most recent versions.

But where to start debugging this problem? Is something not
stored correctly in ZooKeeper, or is the overseer the problem?

I was also reading something about a "leader queue" where possible
leaders have to be requeued or something similar.

Maybe I should try to get into a situation where a "locked" core
is on the overseer and then connect the debugger to it and step
through it.
Peeking and poking around, like in the old Commodore 64 days :-)

Regards, Bernd

Aman Tandon

2018-11-29 19:40:36 UTC

Today I deleted the leader replica of one shard in a two-shard
collection. The other replica of that shard was then being elected leader.

After waiting for a long time, I tried setting the preferredLeader property
with ADDREPLICAPROP on one of the replicas and then tried FORCELEADER, but
no luck. I then also tried REBALANCELEADERS, but that did not help. Finally
I had to recreate the whole collection.

I am not sure what the issue was, but neither FORCELEADER nor
REBALANCELEADERS worked when there was no leader, even though the
preferredLeader property was set.
