solr reads whole index on startup

Discussion:

lstusr 5u93n4

2018-12-05 17:53:36 UTC

Hi All,

We have a collection:
- solr 7.5
- 3 shards, replication factor 2 for a total of 6 NRT replicas
- 3 servers, 16GB ram each
- 2 billion documents
- autoAddReplicas: false
- 2.1 TB on-disk index size
- index stored on hdfs on separate servers.

If we (gracefully) shut down solr on all 3 servers, when we re-launch solr
we notice that the nodes go into "Recovering" state for about 10-12 hours
before finally coming alive.

During this recovery time, we notice high network traffic outbound from our
HDFS servers to our solr servers. The sum total of which is roughly
equivalent to the index size on disk.

So it seems to us that on startup, solr has to re-read the entire index
before coming back alive.

1. is this assumption correct?
2. is there any way to mitigate this, so that solr can launch faster?

Thanks!

Kyle

Kevin Risden

2018-12-05 17:58:38 UTC

Permalink

Do you have logs right before the following?

"we notice that the nodes go into "Recovering" state for about 10-12 hours
before finally coming alive."

Is there a peersync failure or something else in the logs indicating why
there is a full recovery?

Kevin Risden

Post by lstusr 5u93n4
Hi All,
- solr 7.5
- 3 shards, replication factor 2 for a total of 6 NRT replicas
- 3 servers, 16GB ram each
- 2 billion documents
- autoAddReplicas: false
- 2.1 TB on-disk index size
- index stored on hdfs on separate servers.
If we (gracefully) shut down solr on all 3 servers, when we re-launch solr
we notice that the nodes go into "Recovering" state for about 10-12 hours
before finally coming alive.
During this recovery time, we notice high network traffic outbound from our
HDFS servers to our solr servers. The sum total of which is roughly
equivalent to the index size on disk.
So it seems to us that on startup, solr has to re-read the entire index
before coming back alive.
1. is this assumption correct?
2. is there any way to mitigate this, so that solr can launch faster?
Thanks!
Kyle

lstusr 5u93n4

2018-12-05 18:12:20 UTC

Permalink

Hi Kevin,

We do have logs. Grepping for peersync, I can see

solr | 2018-12-05 03:31:41.301 INFO
(coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr)
[c:iglshistory s:shard3 r:core_node12 x:iglshistory_shard3_replica_n10]
o.a.s.u.PeerSync PeerSync: core=iglshistory_shard3_replica_n10 url=
http://solr.node2.metaonly01.eca.local:8983/solr START replicas=[
http://solr.node3.metaonly01.eca.local:8983/solr/iglshistory_shard3_replica_n8/]
nUpdates=100
solr | 2018-12-05 03:31:41.307 INFO
(coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr)
[c:iglshistory s:shard3 r:core_node12 x:iglshistory_shard3_replica_n10]
o.a.s.u.PeerSync PeerSync: core=iglshistory_shard3_replica_n10 url=
http://solr.node2.metaonly01.eca.local:8983/solr DONE. We have no
versions. sync failed.
solr | 2018-12-05 03:31:44.007 INFO
(coreZkRegister-1-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:iglshistory_shard2_replica_n4 c:iglshistory s:shard2 r:core_node7)
[c:iglshistory s:shard2 r:core_node7 x:iglshistory_shard2_replica_n4]
o.a.s.u.PeerSync PeerSync: core=iglshistory_shard2_replica_n4 url=
http://solr.node2.metaonly01.eca.local:8983/solr START replicas=[
http://solr.node4.metaonly01.eca.local:8983/solr/iglshistory_shard2_replica_n6/]
nUpdates=100
solr | 2018-12-05 03:31:44.008 INFO
(coreZkRegister-1-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:iglshistory_shard2_replica_n4 c:iglshistory s:shard2 r:core_node7)
[c:iglshistory s:shard2 r:core_node7 x:iglshistory_shard2_replica_n4]
o.a.s.u.PeerSync PeerSync: core=iglshistory_shard2_replica_n4 url=
http://solr.node2.metaonly01.eca.local:8983/solr DONE. We have no
versions. sync failed.
solr | 2018-12-05 05:48:44.202 INFO
(coreZkRegister-1-thread-3-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:igls_shard1_replica_n1 c:igls s:shard1 r:core_node3) [c:igls s:shard1
r:core_node3 x:igls_shard1_replica_n1] o.a.s.u.PeerSync PeerSync:
core=igls_shard1_replica_n1 url=
http://solr.node2.metaonly01.eca.local:8983/solr START replicas=[
http://solr.node4.metaonly01.eca.local:8983/solr/igls_shard1_replica_n2/]
nUpdates=100
solr | 2018-12-05 05:48:44.646 WARN
(coreZkRegister-1-thread-3-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:igls_shard1_replica_n1 c:igls s:shard1 r:core_node3) [c:igls s:shard1
r:core_node3 x:igls_shard1_replica_n1] o.a.s.u.PeerSync PeerSync:
core=igls_shard1_replica_n1 url=
http://solr.node2.metaonly01.eca.local:8983/solr got a 503 from
http://solr.node4.metaonly01.eca.local:8983/solr/igls_shard1_replica_n2/,
counting as success
solr | 2018-12-05 05:48:44.682 INFO
(coreZkRegister-1-thread-3-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:igls_shard1_replica_n1 c:igls s:shard1 r:core_node3) [c:igls s:shard1
r:core_node3 x:igls_shard1_replica_n1] o.a.s.u.PeerSync PeerSync:
core=igls_shard1_replica_n1 url=
http://solr.node2.metaonly01.eca.local:8983/solr DONE. sync succeeded
solr | 2018-12-05 05:51:18.369 INFO (qtp2131952342-10232)
[c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSync
PeerSync: core=igls_shard2_replica_n6 url=
http://solr.node2.metaonly01.eca.local:8983/solr START replicas=[
http://solr.node3.metaonly01.eca.local:8983/solr/igls_shard2_replica_n4/]
nUpdates=100
solr | 2018-12-05 06:01:18.496 WARN (qtp2131952342-10232)
[c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSync
Replica did not return a fingerprint - possibly an older Solr version or
exception
solr | 2018-12-05 06:01:31.126 INFO
(recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
r:core_node9 x:igls_shard2_replica_n6] o.a.s.c.RecoveryStrategy Attempting
to PeerSync from [
http://solr.node3.metaonly01.eca.local:8983/solr/igls_shard2_replica_n4/] -
recoveringAfterStartup=[true]
solr | 2018-12-05 06:01:31.129 INFO
(recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSyncWithLeader
PeerSync: core=igls_shard2_replica_n6 url=
http://solr.node2.metaonly01.eca.local:8983/solr START leader=
http://solr.node3.metaonly01.eca.local:8983/solr/igls_shard2_replica_n4/
nUpdates=100
solr | 2018-12-05 06:11:18.535 WARN (qtp2131952342-10232)
[c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSync
PeerSync: core=igls_shard2_replica_n6 url=
http://solr.node2.metaonly01.eca.local:8983/solr exception talking to
http://solr.node3.metaonly01.eca.local:8983/solr/igls_shard2_replica_n4/,
failed
solr | 2018-12-05 06:11:18.583 INFO (qtp2131952342-10232)
[c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSync
PeerSync: core=igls_shard2_replica_n6 url=
http://solr.node2.metaonly01.eca.local:8983/solr DONE. sync failed
solr | at
org.apache.solr.update.PeerSyncWithLeader.request(PeerSyncWithLeader.java:321)
solr | at
org.apache.solr.update.PeerSyncWithLeader.alreadyInSync(PeerSyncWithLeader.java:341)
solr | at
org.apache.solr.update.PeerSyncWithLeader.sync(PeerSyncWithLeader.java:135)
solr | 2018-12-05 09:21:31.831 INFO
(recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
r:core_node9 x:igls_shard2_replica_n6] o.a.s.c.RecoveryStrategy Attempting
to PeerSync from [
http://solr.node3.metaonly01.eca.local:8983/solr/igls_shard2_replica_n4/] -
recoveringAfterStartup=[true]
solr | 2018-12-05 09:21:31.831 INFO
(recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSyncWithLeader
PeerSync: core=igls_shard2_replica_n6 url=
http://solr.node2.metaonly01.eca.local:8983/solr START leader=
http://solr.node3.metaonly01.eca.local:8983/solr/igls_shard2_replica_n4/
nUpdates=100
solr | 2018-12-05 12:28:26.698 INFO
(recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSyncWithLeader
Fingerprint comparison result: 1
solr | 2018-12-05 12:28:26.698 INFO
(recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSyncWithLeader Leader
fingerprint: {maxVersionSpecified=9223372036854775807,
maxVersionEncountered=1619003108526915584, maxInHash=1619003108526915584,
versionsHash=-2647181038600325382, numVersions=701868706,
numDocs=701868706, maxDoc=13478974}, Our fingerprint:
{maxVersionSpecified=9223372036854775807,
maxVersionEncountered=1618994939968880640, maxInHash=1618994939968880640,
versionsHash=-2861577316323569427, numVersions=701868706,
numDocs=701868706, maxDoc=13478974}
solr | 2018-12-05 12:30:38.183 INFO
(recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSyncWithLeader Leader
fingerprint {maxVersionSpecified=9223372036854775807,
maxVersionEncountered=1619014810634027008, maxInHash=1619014810634027008,
versionsHash=-5367247708654169186, numVersions=701868706,
numDocs=701868706, maxDoc=13478974}
solr | 2018-12-05 12:30:38.183 INFO
(recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSyncWithLeader
PeerSync: core=igls_shard2_replica_n6 url=
http://solr.node2.metaonly01.eca.local:8983/solr Received 13 versions from
http://solr.node3.metaonly01.eca.local:8983/solr/igls_shard2_replica_n4/
solr | 2018-12-05 12:30:38.184 INFO
(recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSync PeerSync:
core=igls_shard2_replica_n6 url=
http://solr.node2.metaonly01.eca.local:8983/solr Our versions are too old.
ourHighThreshold=1618976334275411968 otherLowThreshold=1619012608476577792
ourHighest=1619014810634027008 otherHighest=1619014936356192256
solr | 2018-12-05 12:30:38.185 INFO
(recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSyncWithLeader
PeerSync: core=igls_shard2_replica_n6 url=
http://solr.node2.metaonly01.eca.local:8983/solr DONE. sync failed
solr | 2018-12-05 12:30:38.185 INFO
(recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
r:core_node9 x:igls_shard2_replica_n6] o.a.s.c.RecoveryStrategy PeerSync
Recovery was not successful - trying replication.

Expanding one of them, we have:

solr | 2018-12-05 03:31:39.421 INFO
(AutoscalingActionExecutor-8-thread-1) [ ] o.a.s.c.a.ExecutePlanAction No
operations to execute for event: {
solr | "id":"11402ae38efc8T1dr1vcl6ut11qh9fq6ncle94t",
solr | "source":".auto_add_replicas",
solr | "eventTime":303476722167752,
solr | "eventType":"NODELOST",
solr | "properties":{
solr | "eventTimes":[303476722167752],
solr | "preferredOperation":"movereplica",
solr | "_enqueue_time_":303959110445696,
solr |
"nodeNames":["solr.node4.metaonly01.eca.local:8983_solr"]}}
solr | 2018-12-05 03:31:41.290 INFO
(coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr)
[c:iglshistory s:shard3 r:core_node12 x:iglshistory_shard3_replica_n10]
o.a.s.c.ShardLeaderElectionContext Enough replicas found to continu
e.
solr | 2018-12-05 03:31:41.291 INFO
(coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr)
[c:iglshistory s:shard3 r:core_node12 x:iglshistory_shard3_replica_n10]
o.a.s.c.ShardLeaderElectionContext I may be the new leader - try an
d sync
solr | 2018-12-05 03:31:41.291 INFO
(coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr)
[c:iglshistory s:shard3 r:core_node12 x:iglshistory_shard3_replica_n10]
o.a.s.c.SyncStrategy Sync replicas to http://solr.node2.metaonly01.
eca.local:8983/solr/iglshistory_shard3_replica_n10/
solr | 2018-12-05 03:31:41.301 INFO
(coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr)
[c:iglshistory s:shard3 r:core_node12 x:iglshistory_shard3_replica_n10]
o.a.s.u.PeerSync PeerSync: core=iglshistory_shard3_replica_n10 url=
http://solr.node2.metaonly01.eca.local:8983/solr START replicas=[
http://solr.node3.metaonly01.eca.local:8983/solr/iglshistory_shard3_replica_n8/]
nUpdates=100
solr | 2018-12-05 03:31:41.307 INFO
(coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr)
[c:iglshistory s:shard3 r:core_node12 x:iglshistory_shard3_replica_n10]
o.a.s.u.PeerSync PeerSync: core=iglshistory_shard3_replica_n10 url=
http://solr.node2.metaonly01.eca.local:8983/solr DONE. We have no
versions. sync failed.

Not quite sure why it would be executing the auto-add-replicas trigger
since it's definitely set to false on the collection. (I checked). Also,
this log comes from very early in the startup procedure, probably before
solr has been launched on the other servers.

Kyle

Post by Kevin Risden
Do you have logs right before the following?
"we notice that the nodes go into "Recovering" state for about 10-12 hours
before finally coming alive."
Is there a peersync failure or something else in the logs indicating why
there is a full recovery?
Kevin Risden

solr

Post by lstusr 5u93n4
we notice that the nodes go into "Recovering" state for about 10-12 hours
before finally coming alive.
During this recovery time, we notice high network traffic outbound from

our

Post by lstusr 5u93n4
HDFS servers to our solr servers. The sum total of which is roughly
equivalent to the index size on disk.
So it seems to us that on startup, solr has to re-read the entire index
before coming back alive.
1. is this assumption correct?
2. is there any way to mitigate this, so that solr can launch faster?
Thanks!
Kyle

lstusr 5u93n4

2018-12-05 20:15:03 UTC

Permalink

I just repeated the procedure, same effect. I'm an hour in and it's still
recovering. Looked at the autoscaling API, but it's configured not to do
anything, which makes sense given the previous output.

One thing I did see, just now:

solr | 2018-12-05 20:02:37.922 INFO (qtp2131952342-5208) [
x:igls_shard3_replica_n8] o.a.s.h.a.CoreAdminOperation It has been
requested that we recover: core=igls_shard3_replica_n8
solr | 2018-12-05 20:02:37.923 ERROR (qtp2131952342-5208) [
x:igls_shard3_replica_n8] o.a.s.h.RequestHandlerBase
org.apache.solr.common.SolrException: Unable to locate core
igls_shard3_replica_n8
solr | at
org.apache.solr.handler.admin.CoreAdminOperation.lambda$static$5(CoreAdminOperation.java:167)
solr | at
org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:360)
solr | at
org.apache.solr.handler.admin.CoreAdminHandler$CallInfo.call(CoreAdminHandler.java:395)
solr | at
org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:180)
solr | at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)
solr | at
org.apache.solr.servlet.HttpSolrCall.handleAdmin(HttpSolrCall.java:734)
solr | at
org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:715)
solr | at
org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:496)
solr | at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:377)
solr | at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:323)
solr | at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1634)
solr | at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:533)
solr | at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:146)
solr | at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
solr | at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
solr | at
org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
solr | at
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1595)
solr | at
org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
solr | at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1317)
solr | at
org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
solr | at
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)
solr | at
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1564)
solr | at
org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
solr | at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1219)
solr | at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
solr | at
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)
solr | at
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126)
solr | at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
solr | at
org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
solr | at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
solr | at
org.eclipse.jetty.server.Server.handle(Server.java:531)
solr | at
org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:352)
solr | at
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)
solr | at
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:281)
solr | at
org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102)
solr | at
org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)
solr | at
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
solr | at
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
solr | at
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
solr | at
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)
solr | at
org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)
solr | at
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:762)
solr | at
org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:680)
solr | at java.lang.Thread.run(Thread.java:748)

But I'm not sure why it can't locate the core, because it's there in
${SOLR_HOME} and also exists on hdfs...

Thanks!

Kyle

Post by lstusr 5u93n4
Hi Kevin,
We do have logs. Grepping for peersync, I can see
solr | 2018-12-05 03:31:41.301 INFO
(coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr)
[c:iglshistory s:shard3 r:core_node12 x:iglshistory_shard3_replica_n10]
o.a.s.u.PeerSync PeerSync: core=iglshistory_shard3_replica_n10 url=
http://solr.node2.metaonly01.eca.local:8983/solr START replicas=[
http://solr.node3.metaonly01.eca.local:8983/solr/iglshistory_shard3_replica_n8/]
nUpdates=100
solr | 2018-12-05 03:31:41.307 INFO
(coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr)
[c:iglshistory s:shard3 r:core_node12 x:iglshistory_shard3_replica_n10]
o.a.s.u.PeerSync PeerSync: core=iglshistory_shard3_replica_n10 url=
http://solr.node2.metaonly01.eca.local:8983/solr DONE. We have no
versions. sync failed.
solr | 2018-12-05 03:31:44.007 INFO
(coreZkRegister-1-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:iglshistory_shard2_replica_n4 c:iglshistory s:shard2 r:core_node7)
[c:iglshistory s:shard2 r:core_node7 x:iglshistory_shard2_replica_n4]
o.a.s.u.PeerSync PeerSync: core=iglshistory_shard2_replica_n4 url=
http://solr.node2.metaonly01.eca.local:8983/solr START replicas=[
http://solr.node4.metaonly01.eca.local:8983/solr/iglshistory_shard2_replica_n6/]
nUpdates=100
solr | 2018-12-05 03:31:44.008 INFO
(coreZkRegister-1-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:iglshistory_shard2_replica_n4 c:iglshistory s:shard2 r:core_node7)
[c:iglshistory s:shard2 r:core_node7 x:iglshistory_shard2_replica_n4]
o.a.s.u.PeerSync PeerSync: core=iglshistory_shard2_replica_n4 url=
http://solr.node2.metaonly01.eca.local:8983/solr DONE. We have no
versions. sync failed.
solr | 2018-12-05 05:48:44.202 INFO
(coreZkRegister-1-thread-3-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:igls_shard1_replica_n1 c:igls s:shard1 r:core_node3) [c:igls s:shard1
core=igls_shard1_replica_n1 url=
http://solr.node2.metaonly01.eca.local:8983/solr START replicas=[
http://solr.node4.metaonly01.eca.local:8983/solr/igls_shard1_replica_n2/]
nUpdates=100
solr | 2018-12-05 05:48:44.646 WARN
(coreZkRegister-1-thread-3-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:igls_shard1_replica_n1 c:igls s:shard1 r:core_node3) [c:igls s:shard1
core=igls_shard1_replica_n1 url=
http://solr.node2.metaonly01.eca.local:8983/solr got a 503 from
http://solr.node4.metaonly01.eca.local:8983/solr/igls_shard1_replica_n2/,
counting as success
solr | 2018-12-05 05:48:44.682 INFO
(coreZkRegister-1-thread-3-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:igls_shard1_replica_n1 c:igls s:shard1 r:core_node3) [c:igls s:shard1
core=igls_shard1_replica_n1 url=
http://solr.node2.metaonly01.eca.local:8983/solr DONE. sync succeeded
solr | 2018-12-05 05:51:18.369 INFO (qtp2131952342-10232)
[c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSync
PeerSync: core=igls_shard2_replica_n6 url=
http://solr.node2.metaonly01.eca.local:8983/solr START replicas=[
http://solr.node3.metaonly01.eca.local:8983/solr/igls_shard2_replica_n4/]
nUpdates=100
solr | 2018-12-05 06:01:18.496 WARN (qtp2131952342-10232)
[c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSync
Replica did not return a fingerprint - possibly an older Solr version or
exception
solr | 2018-12-05 06:01:31.126 INFO
(recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
r:core_node9 x:igls_shard2_replica_n6] o.a.s.c.RecoveryStrategy Attempting
to PeerSync from [
http://solr.node3.metaonly01.eca.local:8983/solr/igls_shard2_replica_n4/]
- recoveringAfterStartup=[true]
solr | 2018-12-05 06:01:31.129 INFO
(recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSyncWithLeader
PeerSync: core=igls_shard2_replica_n6 url=
http://solr.node2.metaonly01.eca.local:8983/solr START leader=
http://solr.node3.metaonly01.eca.local:8983/solr/igls_shard2_replica_n4/
nUpdates=100
solr | 2018-12-05 06:11:18.535 WARN (qtp2131952342-10232)
[c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSync
PeerSync: core=igls_shard2_replica_n6 url=
http://solr.node2.metaonly01.eca.local:8983/solr exception talking to
http://solr.node3.metaonly01.eca.local:8983/solr/igls_shard2_replica_n4/,
failed
solr | 2018-12-05 06:11:18.583 INFO (qtp2131952342-10232)
[c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSync
PeerSync: core=igls_shard2_replica_n6 url=
http://solr.node2.metaonly01.eca.local:8983/solr DONE. sync failed
solr | at
org.apache.solr.update.PeerSyncWithLeader.request(PeerSyncWithLeader.java:321)
solr | at
org.apache.solr.update.PeerSyncWithLeader.alreadyInSync(PeerSyncWithLeader.java:341)
solr | at
org.apache.solr.update.PeerSyncWithLeader.sync(PeerSyncWithLeader.java:135)
solr | 2018-12-05 09:21:31.831 INFO
(recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
r:core_node9 x:igls_shard2_replica_n6] o.a.s.c.RecoveryStrategy Attempting
to PeerSync from [
http://solr.node3.metaonly01.eca.local:8983/solr/igls_shard2_replica_n4/]
- recoveringAfterStartup=[true]
solr | 2018-12-05 09:21:31.831 INFO
(recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSyncWithLeader
PeerSync: core=igls_shard2_replica_n6 url=
http://solr.node2.metaonly01.eca.local:8983/solr START leader=
http://solr.node3.metaonly01.eca.local:8983/solr/igls_shard2_replica_n4/
nUpdates=100
solr | 2018-12-05 12:28:26.698 INFO
(recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSyncWithLeader
Fingerprint comparison result: 1
solr | 2018-12-05 12:28:26.698 INFO
(recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSyncWithLeader Leader
fingerprint: {maxVersionSpecified=9223372036854775807,
maxVersionEncountered=1619003108526915584, maxInHash=1619003108526915584,
versionsHash=-2647181038600325382, numVersions=701868706,
{maxVersionSpecified=9223372036854775807,
maxVersionEncountered=1618994939968880640, maxInHash=1618994939968880640,
versionsHash=-2861577316323569427, numVersions=701868706,
numDocs=701868706, maxDoc=13478974}
solr | 2018-12-05 12:30:38.183 INFO
(recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSyncWithLeader Leader
fingerprint {maxVersionSpecified=9223372036854775807,
maxVersionEncountered=1619014810634027008, maxInHash=1619014810634027008,
versionsHash=-5367247708654169186, numVersions=701868706,
numDocs=701868706, maxDoc=13478974}
solr | 2018-12-05 12:30:38.183 INFO
(recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSyncWithLeader
PeerSync: core=igls_shard2_replica_n6 url=
http://solr.node2.metaonly01.eca.local:8983/solr Received 13 versions from
http://solr.node3.metaonly01.eca.local:8983/solr/igls_shard2_replica_n4/
solr | 2018-12-05 12:30:38.184 INFO
(recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
core=igls_shard2_replica_n6 url=
http://solr.node2.metaonly01.eca.local:8983/solr Our versions are too
old. ourHighThreshold=1618976334275411968
otherLowThreshold=1619012608476577792 ourHighest=1619014810634027008
otherHighest=1619014936356192256
solr | 2018-12-05 12:30:38.185 INFO
(recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSyncWithLeader
PeerSync: core=igls_shard2_replica_n6 url=
http://solr.node2.metaonly01.eca.local:8983/solr DONE. sync failed
solr | 2018-12-05 12:30:38.185 INFO
(recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
r:core_node9 x:igls_shard2_replica_n6] o.a.s.c.RecoveryStrategy PeerSync
Recovery was not successful - trying replication.
solr | 2018-12-05 03:31:39.421 INFO
(AutoscalingActionExecutor-8-thread-1) [ ] o.a.s.c.a.ExecutePlanAction No
operations to execute for event: {
solr | "id":"11402ae38efc8T1dr1vcl6ut11qh9fq6ncle94t",
solr | "source":".auto_add_replicas",
solr | "eventTime":303476722167752,
solr | "eventType":"NODELOST",
solr | "properties":{
solr | "eventTimes":[303476722167752],
solr | "preferredOperation":"movereplica",
solr | "_enqueue_time_":303959110445696,
solr |
"nodeNames":["solr.node4.metaonly01.eca.local:8983_solr"]}}
solr | 2018-12-05 03:31:41.290 INFO
(coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr)
[c:iglshistory s:shard3 r:core_node12 x:iglshistory_shard3_replica_n10]
o.a.s.c.ShardLeaderElectionContext Enough replicas found to continu
e.
solr | 2018-12-05 03:31:41.291 INFO
(coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr)
[c:iglshistory s:shard3 r:core_node12 x:iglshistory_shard3_replica_n10]
o.a.s.c.ShardLeaderElectionContext I may be the new leader - try an
d sync
solr | 2018-12-05 03:31:41.291 INFO
(coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr)
[c:iglshistory s:shard3 r:core_node12 x:iglshistory_shard3_replica_n10]
o.a.s.c.SyncStrategy Sync replicas to http://solr.node2.metaonly01.
eca.local:8983/solr/iglshistory_shard3_replica_n10/
solr | 2018-12-05 03:31:41.301 INFO
(coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr)
[c:iglshistory s:shard3 r:core_node12 x:iglshistory_shard3_replica_n10]
o.a.s.u.PeerSync PeerSync: core=iglshistory_shard3_replica_n10 url=
http://solr.node2.metaonly01.eca.local:8983/solr START replicas=[
http://solr.node3.metaonly01.eca.local:8983/solr/iglshistory_shard3_replica_n8/]
nUpdates=100
solr | 2018-12-05 03:31:41.307 INFO
(coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr)
[c:iglshistory s:shard3 r:core_node12 x:iglshistory_shard3_replica_n10]
o.a.s.u.PeerSync PeerSync: core=iglshistory_shard3_replica_n10 url=
http://solr.node2.metaonly01.eca.local:8983/solr DONE. We have no
versions. sync failed.
Not quite sure why it would be executing the auto-add-replicas trigger
since it's definitely set to false on the collection. (I checked). Also,
this log comes from very early in the startup procedure, probably before
solr has been launched on the other servers.
Kyle

solr

Post by lstusr 5u93n4
we notice that the nodes go into "Recovering" state for about 10-12

hours

Post by lstusr 5u93n4
before finally coming alive.
During this recovery time, we notice high network traffic outbound from

our

Erick Erickson

2018-12-05 20:38:29 UTC

Permalink

Solr should _not_ have to replicate the index or go into peer sync on startup.

are you stopping indexing before you shut your servers down?
Be very sure you have passed your autocommit interval after you've stopped indexing and before you stop Solr.
How are you shutting down? bin/solr stop is the proper way, and you should NOT be seeing any kind of message about force killing the Solr process. There are timeouts in solr.in.sh (unix) that you can lengthen if you are seeing any message about force killing.
Bring all the nodes back up before you start indexing again.

With the above, you should not see peer sync on startup. Of course you
have to have all your Solr nodes in the "active" state before shutting
down.

your indexes are very large. Very very large given your hardware. You have roughly 700M docs per replica, I'm not quite sure whether your 2.1T is aggregate across all 6 replicas or not. Even if so (and that's the best case), each replica is roughly 350G in size.You have, at best, 700G per server trying to run with 16G of RAM. This is extremely undersized in my experience.
I suspect you're hitting a lot of GC. What kind of GC observations have you made?
I would turn off autoscaling, your index sizes and hardware are too small to even consider automatically moving that much data around.

In summary, this seems on the surface to be an extremely undersized
system, and unless and until you properly size it you'll have problems

Best,
Erick

Hi Kevin,
We do have logs. Grepping for peersync, I can see
solr | 2018-12-05 03:31:41.301 INFO
(coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr)
[c:iglshistory s:shard3 r:core_node12 x:iglshistory_shard3_replica_n10]
o.a.s.u.PeerSync PeerSync: core=iglshistory_shard3_replica_n10 url=
http://solr.node2.metaonly01.eca.local:8983/solr START replicas=[
http://solr.node3.metaonly01.eca.local:8983/solr/iglshistory_shard3_replica_n8/]
nUpdates=100
solr | 2018-12-05 03:31:41.307 INFO
(coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr)
[c:iglshistory s:shard3 r:core_node12 x:iglshistory_shard3_replica_n10]
o.a.s.u.PeerSync PeerSync: core=iglshistory_shard3_replica_n10 url=
http://solr.node2.metaonly01.eca.local:8983/solr DONE. We have no
versions. sync failed.
solr | 2018-12-05 03:31:44.007 INFO
(coreZkRegister-1-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:iglshistory_shard2_replica_n4 c:iglshistory s:shard2 r:core_node7)
[c:iglshistory s:shard2 r:core_node7 x:iglshistory_shard2_replica_n4]
o.a.s.u.PeerSync PeerSync: core=iglshistory_shard2_replica_n4 url=
http://solr.node2.metaonly01.eca.local:8983/solr START replicas=[
http://solr.node4.metaonly01.eca.local:8983/solr/iglshistory_shard2_replica_n6/]
nUpdates=100
solr | 2018-12-05 03:31:44.008 INFO
(coreZkRegister-1-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:iglshistory_shard2_replica_n4 c:iglshistory s:shard2 r:core_node7)
[c:iglshistory s:shard2 r:core_node7 x:iglshistory_shard2_replica_n4]
o.a.s.u.PeerSync PeerSync: core=iglshistory_shard2_replica_n4 url=
http://solr.node2.metaonly01.eca.local:8983/solr DONE. We have no
versions. sync failed.
solr | 2018-12-05 05:48:44.202 INFO
(coreZkRegister-1-thread-3-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:igls_shard1_replica_n1 c:igls s:shard1 r:core_node3) [c:igls s:shard1
core=igls_shard1_replica_n1 url=
http://solr.node2.metaonly01.eca.local:8983/solr START replicas=[
http://solr.node4.metaonly01.eca.local:8983/solr/igls_shard1_replica_n2/]
nUpdates=100
solr | 2018-12-05 05:48:44.646 WARN
(coreZkRegister-1-thread-3-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:igls_shard1_replica_n1 c:igls s:shard1 r:core_node3) [c:igls s:shard1
core=igls_shard1_replica_n1 url=
http://solr.node2.metaonly01.eca.local:8983/solr got a 503 from
http://solr.node4.metaonly01.eca.local:8983/solr/igls_shard1_replica_n2/,
counting as success
solr | 2018-12-05 05:48:44.682 INFO
(coreZkRegister-1-thread-3-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:igls_shard1_replica_n1 c:igls s:shard1 r:core_node3) [c:igls s:shard1
core=igls_shard1_replica_n1 url=
http://solr.node2.metaonly01.eca.local:8983/solr DONE. sync succeeded
solr | 2018-12-05 05:51:18.369 INFO (qtp2131952342-10232)
[c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSync
PeerSync: core=igls_shard2_replica_n6 url=
http://solr.node2.metaonly01.eca.local:8983/solr START replicas=[
http://solr.node3.metaonly01.eca.local:8983/solr/igls_shard2_replica_n4/]
nUpdates=100
solr | 2018-12-05 06:01:18.496 WARN (qtp2131952342-10232)
[c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSync
Replica did not return a fingerprint - possibly an older Solr version or
exception
solr | 2018-12-05 06:01:31.126 INFO
(recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
r:core_node9 x:igls_shard2_replica_n6] o.a.s.c.RecoveryStrategy Attempting
to PeerSync from [
http://solr.node3.metaonly01.eca.local:8983/solr/igls_shard2_replica_n4/] -
recoveringAfterStartup=[true]
solr | 2018-12-05 06:01:31.129 INFO
(recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSyncWithLeader
PeerSync: core=igls_shard2_replica_n6 url=
http://solr.node2.metaonly01.eca.local:8983/solr START leader=
http://solr.node3.metaonly01.eca.local:8983/solr/igls_shard2_replica_n4/
nUpdates=100
solr | 2018-12-05 06:11:18.535 WARN (qtp2131952342-10232)
[c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSync
PeerSync: core=igls_shard2_replica_n6 url=
http://solr.node2.metaonly01.eca.local:8983/solr exception talking to
http://solr.node3.metaonly01.eca.local:8983/solr/igls_shard2_replica_n4/,
failed
solr | 2018-12-05 06:11:18.583 INFO (qtp2131952342-10232)
[c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSync
PeerSync: core=igls_shard2_replica_n6 url=
http://solr.node2.metaonly01.eca.local:8983/solr DONE. sync failed
solr | at
org.apache.solr.update.PeerSyncWithLeader.request(PeerSyncWithLeader.java:321)
solr | at
org.apache.solr.update.PeerSyncWithLeader.alreadyInSync(PeerSyncWithLeader.java:341)
solr | at
org.apache.solr.update.PeerSyncWithLeader.sync(PeerSyncWithLeader.java:135)
solr | 2018-12-05 09:21:31.831 INFO
(recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
r:core_node9 x:igls_shard2_replica_n6] o.a.s.c.RecoveryStrategy Attempting
to PeerSync from [
http://solr.node3.metaonly01.eca.local:8983/solr/igls_shard2_replica_n4/] -
recoveringAfterStartup=[true]
solr | 2018-12-05 09:21:31.831 INFO
(recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSyncWithLeader
PeerSync: core=igls_shard2_replica_n6 url=
http://solr.node2.metaonly01.eca.local:8983/solr START leader=
http://solr.node3.metaonly01.eca.local:8983/solr/igls_shard2_replica_n4/
nUpdates=100
solr | 2018-12-05 12:28:26.698 INFO
(recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSyncWithLeader
Fingerprint comparison result: 1
solr | 2018-12-05 12:28:26.698 INFO
(recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSyncWithLeader Leader
fingerprint: {maxVersionSpecified=9223372036854775807,
maxVersionEncountered=1619003108526915584, maxInHash=1619003108526915584,
versionsHash=-2647181038600325382, numVersions=701868706,
{maxVersionSpecified=9223372036854775807,
maxVersionEncountered=1618994939968880640, maxInHash=1618994939968880640,
versionsHash=-2861577316323569427, numVersions=701868706,
numDocs=701868706, maxDoc=13478974}
solr | 2018-12-05 12:30:38.183 INFO
(recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSyncWithLeader Leader
fingerprint {maxVersionSpecified=9223372036854775807,
maxVersionEncountered=1619014810634027008, maxInHash=1619014810634027008,
versionsHash=-5367247708654169186, numVersions=701868706,
numDocs=701868706, maxDoc=13478974}
solr | 2018-12-05 12:30:38.183 INFO
(recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSyncWithLeader
PeerSync: core=igls_shard2_replica_n6 url=
http://solr.node2.metaonly01.eca.local:8983/solr Received 13 versions from
http://solr.node3.metaonly01.eca.local:8983/solr/igls_shard2_replica_n4/
solr | 2018-12-05 12:30:38.184 INFO
(recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
core=igls_shard2_replica_n6 url=
http://solr.node2.metaonly01.eca.local:8983/solr Our versions are too old.
ourHighThreshold=1618976334275411968 otherLowThreshold=1619012608476577792
ourHighest=1619014810634027008 otherHighest=1619014936356192256
solr | 2018-12-05 12:30:38.185 INFO
(recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSyncWithLeader
PeerSync: core=igls_shard2_replica_n6 url=
http://solr.node2.metaonly01.eca.local:8983/solr DONE. sync failed
solr | 2018-12-05 12:30:38.185 INFO
(recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
r:core_node9 x:igls_shard2_replica_n6] o.a.s.c.RecoveryStrategy PeerSync
Recovery was not successful - trying replication.
solr | 2018-12-05 03:31:39.421 INFO
(AutoscalingActionExecutor-8-thread-1) [ ] o.a.s.c.a.ExecutePlanAction No
operations to execute for event: {
solr | "id":"11402ae38efc8T1dr1vcl6ut11qh9fq6ncle94t",
solr | "source":".auto_add_replicas",
solr | "eventTime":303476722167752,
solr | "eventType":"NODELOST",
solr | "properties":{
solr | "eventTimes":[303476722167752],
solr | "preferredOperation":"movereplica",
solr | "_enqueue_time_":303959110445696,
solr |
"nodeNames":["solr.node4.metaonly01.eca.local:8983_solr"]}}
solr | 2018-12-05 03:31:41.290 INFO
(coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr)
[c:iglshistory s:shard3 r:core_node12 x:iglshistory_shard3_replica_n10]
o.a.s.c.ShardLeaderElectionContext Enough replicas found to continu
e.
solr | 2018-12-05 03:31:41.291 INFO
(coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr)
[c:iglshistory s:shard3 r:core_node12 x:iglshistory_shard3_replica_n10]
o.a.s.c.ShardLeaderElectionContext I may be the new leader - try an
d sync
solr | 2018-12-05 03:31:41.291 INFO
(coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr)
[c:iglshistory s:shard3 r:core_node12 x:iglshistory_shard3_replica_n10]
o.a.s.c.SyncStrategy Sync replicas to http://solr.node2.metaonly01.
eca.local:8983/solr/iglshistory_shard3_replica_n10/
solr | 2018-12-05 03:31:41.301 INFO
(coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr)
[c:iglshistory s:shard3 r:core_node12 x:iglshistory_shard3_replica_n10]
o.a.s.u.PeerSync PeerSync: core=iglshistory_shard3_replica_n10 url=
http://solr.node2.metaonly01.eca.local:8983/solr START replicas=[
http://solr.node3.metaonly01.eca.local:8983/solr/iglshistory_shard3_replica_n8/]
nUpdates=100
solr | 2018-12-05 03:31:41.307 INFO
(coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr)
[c:iglshistory s:shard3 r:core_node12 x:iglshistory_shard3_replica_n10]
o.a.s.u.PeerSync PeerSync: core=iglshistory_shard3_replica_n10 url=
http://solr.node2.metaonly01.eca.local:8983/solr DONE. We have no
versions. sync failed.
Not quite sure why it would be executing the auto-add-replicas trigger
since it's definitely set to false on the collection. (I checked). Also,
this log comes from very early in the startup procedure, probably before
solr has been launched on the other servers.
Kyle

solr

Post by lstusr 5u93n4
we notice that the nodes go into "Recovering" state for about 10-12 hours
before finally coming alive.
During this recovery time, we notice high network traffic outbound from

our

lstusr 5u93n4

2018-12-05 22:01:47 UTC

Permalink

Hey Erick,

Some thoughts:

Solr should _not_ have to replicate the index or go into peer sync on

Post by Erick Erickson
startup.

Okay, that's good to know! Tells us that this is a problem that can be
solved.

Post by Erick Erickson

are you stopping indexing before you shut your servers down?

By indexing, you mean adding new entries to solr, right? We havent'
explicity stopped it, but it's very quiet (rate of 1 doc / minute).

Post by Erick Erickson

Be very sure you have passed your autocommit interval after you've

stopped indexing and before you stop Solr.

This we have not done... Autocommit is set fairly long on our system (5
mins) I'll do another test where I completely quiesce first.

Post by Erick Erickson

How are you shutting down? bin/solr stop is the proper way, and you

should NOT be seeing any kind of message about force killing the Solr
process. There are timeouts in solr.in.sh (unix) that you can lengthen if
you are seeing any message about force killing.

Nothing about force killing shows up in the logs, but I'll extend the
timeouts anyways and monitor the next shutdown (yes with bin/solr stop).
One thought: could it be that we're waiting TOO LONG between stopping the
solr processes on the different servers? Like by killing it on server 1,
we're causing servers 2 and 3 to go into a non-stable state? So they're not
perfectly "green" at the time of shutdown?

Post by Erick Erickson

Bring all the nodes back up before you start indexing again.

if an entry were to trickle in before the last node were online, this could
cause the behaviour? I'll block it anyways and see.

Post by Erick Erickson
With the above, you should not see peer sync on startup. Of course you
have to have all your Solr nodes in the "active" state before shutting
down.

your indexes are very large. Very very large given your hardware. You

have roughly 700M docs per replica, I'm not quite sure whether your 2.1T
is aggregate across all 6 replicas or not. Even if so (and that's the best
case), each replica is roughly 350G in size.You have, at best, 700G per
server trying to run with 16G of RAM. This is extremely undersized in my
experience.

Your math is good. :D The replicas range between 337G and 355G in size.
That said, I'm curious as to why you say this is undersized. We're
optimized for index speed, not query speed. Based on the amount of data we
want to hold, we have to accept the fact that the full index will not fit
in RAM. (We're not going to provision 2-3T of RAM on the servers...) We
also have unpredictable query patterns, so we don't benefit much by
caching. Some data will have to be read from disk on nearly every query.
The queries on this system DO take a little bit long (like 10 seconds or
so, up to 100 seconds at the worst) but that's acceptable for our use
case. Resource-wise, each server has 6GB allocated for solr's jvm, and
we're not pushing that beyond about 50% utilization (jvm memory usage
hovers around 3G, typically.) So, do we really gain anything by adding more
hardware?

Post by Erick Erickson

I suspect you're hitting a lot of GC. What kind of GC observations have

you made?

None, yet. Given that the jvm isn't pushing its limits, we haven't profiled
the GC. We will one day, certainly, to squeeze more performance out of the
system, but haven't gone down that road so far.

Post by Erick Erickson

I would turn off autoscaling, your index sizes and hardware are too

small to even consider automatically moving that much data around.

Yeah, absolutely. We explicity set autoAddReplicas to false on the
collection, and we have no other triggers setup.

Anyways, I'll report back with some findings, after the system comes back
online and I can kill it again. :D

Thanks for all your help!

Post by Erick Erickson
In summary, this seems on the surface to be an extremely undersized
system, and unless and until you properly size it you'll have problems
Best,
Erick

Hi Kevin,
We do have logs. Grepping for peersync, I can see
solr | 2018-12-05 03:31:41.301 INFO

(coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr)

http://solr.node3.metaonly01.eca.local:8983/solr/iglshistory_shard3_replica_n8/
]

nUpdates=100
solr | 2018-12-05 03:31:41.307 INFO

(coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr)

(coreZkRegister-1-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr

x:iglshistory_shard2_replica_n4 c:iglshistory s:shard2 r:core_node7)
[c:iglshistory s:shard2 r:core_node7 x:iglshistory_shard2_replica_n4]
o.a.s.u.PeerSync PeerSync: core=iglshistory_shard2_replica_n4 url=
http://solr.node2.metaonly01.eca.local:8983/solr START replicas=[

http://solr.node4.metaonly01.eca.local:8983/solr/iglshistory_shard2_replica_n6/
]

nUpdates=100
solr | 2018-12-05 03:31:44.008 INFO

(coreZkRegister-1-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr

(coreZkRegister-1-thread-3-processing-n:solr.node2.metaonly01.eca.local:8983_solr

x:igls_shard1_replica_n1 c:igls s:shard1 r:core_node3) [c:igls s:shard1
core=igls_shard1_replica_n1 url=
http://solr.node2.metaonly01.eca.local:8983/solr START replicas=[
http://solr.node4.metaonly01.eca.local:8983/solr/igls_shard1_replica_n2/

]

nUpdates=100
solr | 2018-12-05 05:48:44.646 WARN

(coreZkRegister-1-thread-3-processing-n:solr.node2.metaonly01.eca.local:8983_solr

x:igls_shard1_replica_n1 c:igls s:shard1 r:core_node3) [c:igls s:shard1
core=igls_shard1_replica_n1 url=
http://solr.node2.metaonly01.eca.local:8983/solr got a 503 from
http://solr.node4.metaonly01.eca.local:8983/solr/igls_shard1_replica_n2/

counting as success
solr | 2018-12-05 05:48:44.682 INFO

(coreZkRegister-1-thread-3-processing-n:solr.node2.metaonly01.eca.local:8983_solr

x:igls_shard1_replica_n1 c:igls s:shard1 r:core_node3) [c:igls s:shard1
core=igls_shard1_replica_n1 url=
http://solr.node2.metaonly01.eca.local:8983/solr DONE. sync succeeded
solr | 2018-12-05 05:51:18.369 INFO (qtp2131952342-10232)
[c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSync
PeerSync: core=igls_shard2_replica_n6 url=
http://solr.node2.metaonly01.eca.local:8983/solr START replicas=[
http://solr.node3.metaonly01.eca.local:8983/solr/igls_shard2_replica_n4/

]

nUpdates=100
solr | 2018-12-05 06:01:18.496 WARN (qtp2131952342-10232)
[c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSync
Replica did not return a fingerprint - possibly an older Solr version or
exception
solr | 2018-12-05 06:01:31.126 INFO

(recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr

x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
r:core_node9 x:igls_shard2_replica_n6] o.a.s.c.RecoveryStrategy

Attempting

to PeerSync from [
http://solr.node3.metaonly01.eca.local:8983/solr/igls_shard2_replica_n4/]

recoveringAfterStartup=[true]
solr | 2018-12-05 06:01:31.129 INFO

(recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr

x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSyncWithLeader
PeerSync: core=igls_shard2_replica_n6 url=
http://solr.node2.metaonly01.eca.local:8983/solr START leader=
http://solr.node3.metaonly01.eca.local:8983/solr/igls_shard2_replica_n4/
nUpdates=100
solr | 2018-12-05 06:11:18.535 WARN (qtp2131952342-10232)
[c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSync
PeerSync: core=igls_shard2_replica_n6 url=
http://solr.node2.metaonly01.eca.local:8983/solr exception talking to
http://solr.node3.metaonly01.eca.local:8983/solr/igls_shard2_replica_n4/

failed
solr | 2018-12-05 06:11:18.583 INFO (qtp2131952342-10232)
[c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSync
PeerSync: core=igls_shard2_replica_n6 url=
http://solr.node2.metaonly01.eca.local:8983/solr DONE. sync failed
solr | at

org.apache.solr.update.PeerSyncWithLeader.request(PeerSyncWithLeader.java:321)

solr | at

org.apache.solr.update.PeerSyncWithLeader.alreadyInSync(PeerSyncWithLeader.java:341)

solr | at

org.apache.solr.update.PeerSyncWithLeader.sync(PeerSyncWithLeader.java:135)

solr | 2018-12-05 09:21:31.831 INFO

(recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr

x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
r:core_node9 x:igls_shard2_replica_n6] o.a.s.c.RecoveryStrategy

Attempting

to PeerSync from [
http://solr.node3.metaonly01.eca.local:8983/solr/igls_shard2_replica_n4/]

recoveringAfterStartup=[true]
solr | 2018-12-05 09:21:31.831 INFO

(recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr

x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSyncWithLeader
Fingerprint comparison result: 1
solr | 2018-12-05 12:28:26.698 INFO

(recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr

x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSyncWithLeader Leader
fingerprint: {maxVersionSpecified=9223372036854775807,
maxVersionEncountered=1619003108526915584, maxInHash=1619003108526915584,
versionsHash=-2647181038600325382, numVersions=701868706,
{maxVersionSpecified=9223372036854775807,
maxVersionEncountered=1618994939968880640, maxInHash=1618994939968880640,
versionsHash=-2861577316323569427, numVersions=701868706,
numDocs=701868706, maxDoc=13478974}
solr | 2018-12-05 12:30:38.183 INFO

(recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr

x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSyncWithLeader Leader
fingerprint {maxVersionSpecified=9223372036854775807,
maxVersionEncountered=1619014810634027008, maxInHash=1619014810634027008,
versionsHash=-5367247708654169186, numVersions=701868706,
numDocs=701868706, maxDoc=13478974}
solr | 2018-12-05 12:30:38.183 INFO

(recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr

from

http://solr.node3.metaonly01.eca.local:8983/solr/igls_shard2_replica_n4/
solr | 2018-12-05 12:30:38.184 INFO

(recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr

x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
core=igls_shard2_replica_n6 url=
http://solr.node2.metaonly01.eca.local:8983/solr Our versions are too

old.

ourHighThreshold=1618976334275411968

otherLowThreshold=1619012608476577792

ourHighest=1619014810634027008 otherHighest=1619014936356192256
solr | 2018-12-05 12:30:38.185 INFO

(recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr

x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
r:core_node9 x:igls_shard2_replica_n6] o.a.s.c.RecoveryStrategy PeerSync
Recovery was not successful - trying replication.
solr | 2018-12-05 03:31:39.421 INFO
(AutoscalingActionExecutor-8-thread-1) [ ] o.a.s.c.a.ExecutePlanAction

operations to execute for event: {
solr | "id":"11402ae38efc8T1dr1vcl6ut11qh9fq6ncle94t",
solr | "source":".auto_add_replicas",
solr | "eventTime":303476722167752,
solr | "eventType":"NODELOST",
solr | "properties":{
solr | "eventTimes":[303476722167752],
solr | "preferredOperation":"movereplica",
solr | "_enqueue_time_":303959110445696,
solr |
"nodeNames":["solr.node4.metaonly01.eca.local:8983_solr"]}}
solr | 2018-12-05 03:31:41.290 INFO

(coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr)

[c:iglshistory s:shard3 r:core_node12 x:iglshistory_shard3_replica_n10]
o.a.s.c.ShardLeaderElectionContext Enough replicas found to continu
e.
solr | 2018-12-05 03:31:41.291 INFO

(coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr)

[c:iglshistory s:shard3 r:core_node12 x:iglshistory_shard3_replica_n10]
o.a.s.c.ShardLeaderElectionContext I may be the new leader - try an
d sync
solr | 2018-12-05 03:31:41.291 INFO

(coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr)

[c:iglshistory s:shard3 r:core_node12 x:iglshistory_shard3_replica_n10]
o.a.s.c.SyncStrategy Sync replicas to http://solr.node2.metaonly01.
eca.local:8983/solr/iglshistory_shard3_replica_n10/
solr | 2018-12-05 03:31:41.301 INFO

(coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr)

http://solr.node3.metaonly01.eca.local:8983/solr/iglshistory_shard3_replica_n8/
]

nUpdates=100
solr | 2018-12-05 03:31:41.307 INFO

(coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr)

[c:iglshistory s:shard3 r:core_node12 x:iglshistory_shard3_replica_n10]
o.a.s.u.PeerSync PeerSync: core=iglshistory_shard3_replica_n10 url=
http://solr.node2.metaonly01.eca.local:8983/solr DONE. We have no
versions. sync failed.
Not quite sure why it would be executing the auto-add-replicas trigger
since it's definitely set to false on the collection. (I checked). Also,
this log comes from very early in the startup procedure, probably before
solr has been launched on the other servers.
Kyle

Post by Kevin Risden
Do you have logs right before the following?
"we notice that the nodes go into "Recovering" state for about 10-12

hours

Post by Kevin Risden
before finally coming alive."
Is there a peersync failure or something else in the logs indicating

why

Post by Kevin Risden
there is a full recovery?
Kevin Risden

solr

Post by lstusr 5u93n4
we notice that the nodes go into "Recovering" state for about 10-12

hours

Post by Kevin Risden

Post by lstusr 5u93n4
before finally coming alive.
During this recovery time, we notice high network traffic outbound

from

Post by Kevin Risden
our

Post by lstusr 5u93n4
HDFS servers to our solr servers. The sum total of which is roughly
equivalent to the index size on disk.
So it seems to us that on startup, solr has to re-read the entire

index

Post by Kevin Risden

Post by lstusr 5u93n4
before coming back alive.
1. is this assumption correct?
2. is there any way to mitigate this, so that solr can launch faster?
Thanks!
Kyle

Erick Erickson

2018-12-06 17:26:36 UTC

Permalink

First, your indexing rate _probably_ isn't the culprit if it's as slow
as you indicate, although testing will tell.

bq. could it be that we're waiting TOO LONG between stopping the solr
processes on the different servers?
At your query rate this is probably not an issue. One thing you might
do is change the updatelog, see
numRecordsToKeep here:
https://lucene.apache.org/solr/guide/7_5/updatehandlers-in-solrconfig.html
This defines how many documents are kept in each replica's transaction
log. When told to recover,
Solr tries first to replay documents in the tlog and if it's not
possible/reasonable then does the full
recovery (i.e. copies the index from the leader). The default number
is 100, making it 1,000 might help,
especially at your update rate.

I wouldn't bother changing your autocommit interval with an update
rate so low, it's pretty irrelevant.

bq. I'm curious as to why you say this is undersized

Solr is pretty sensitive to GC pauses when it comes to recoveries. One
scenario: ZooKeeper sends a keep-alive to each Solr instance, and when
the response exceeds the timeout, ZK says, in effect, "that node is
dead, when I see it again I'll tell it to recover". Actually, the
leader of that shard will fire a "Leader Initiated Recovery". A
stop-the-world GC pause will trigger this situation if it takes longer
than the timeout. With your index soooo much larger than your RAM, my
first guess would be exactly that. Collect the GC logs and use
something like GCViewer to see....

Another scenario is when the leader sends the doc to the follower, if
the request times out (again possibly due to a stop-the-world GC
pause) the leader, you guessed it, sends the follower into recovery.

You also say that you care most about indexing, now querying and that
querying is unpredictable. That'll drive a lot of swapping and/or GC
and/or....

And once a follower _does_ go into recovery, you'll have some
"interesting" times moving that much data around, contributing to GC
pressure, I/O waits, network saturation leading, possibly, to _more_
recoveries. The fact that it takes 10-12 hours to resync on startup
leads me to question whether your network is slow as well. Admittedly
it's a lot of data, but it's still just a copy operation.

All the above is just things that I've seen go wrong, by no means
exhaustive. I've seen a lot of Solr installations, and never seen one
be stable with that much disparity between index size and available
RAM so part of this is "the voice of experience". Whether that
experience is accurate or not is certainly debatable.

Best,
Erick

Post by lstusr 5u93n4
Hey Erick,
Solr should _not_ have to replicate the index or go into peer sync on

Post by Erick Erickson
startup.

Okay, that's good to know! Tells us that this is a problem that can be
solved.

Post by Erick Erickson

are you stopping indexing before you shut your servers down?

By indexing, you mean adding new entries to solr, right? We havent'
explicity stopped it, but it's very quiet (rate of 1 doc / minute).

Post by Erick Erickson

Be very sure you have passed your autocommit interval after you've

stopped indexing and before you stop Solr.

This we have not done... Autocommit is set fairly long on our system (5
mins) I'll do another test where I completely quiesce first.

Post by Erick Erickson

How are you shutting down? bin/solr stop is the proper way, and you

should NOT be seeing any kind of message about force killing the Solr
process. There are timeouts in solr.in.sh (unix) that you can lengthen if
you are seeing any message about force killing.

Post by Erick Erickson

Bring all the nodes back up before you start indexing again.

if an entry were to trickle in before the last node were online, this could
cause the behaviour? I'll block it anyways and see.

Post by Erick Erickson
With the above, you should not see peer sync on startup. Of course you
have to have all your Solr nodes in the "active" state before shutting
down.

your indexes are very large. Very very large given your hardware. You

Post by Erick Erickson

I suspect you're hitting a lot of GC. What kind of GC observations have

you made?

None, yet. Given that the jvm isn't pushing its limits, we haven't profiled
the GC. We will one day, certainly, to squeeze more performance out of the
system, but haven't gone down that road so far.

Post by Erick Erickson

I would turn off autoscaling, your index sizes and hardware are too

small to even consider automatically moving that much data around.

Yeah, absolutely. We explicity set autoAddReplicas to false on the
collection, and we have no other triggers setup.
Anyways, I'll report back with some findings, after the system comes back
online and I can kill it again. :D
Thanks for all your help!