Hi Kevin,
We do have logs. Grepping for peersync, I can see
solr | 2018-12-05 03:31:41.301 INFO
(coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr)
[c:iglshistory s:shard3 r:core_node12 x:iglshistory_shard3_replica_n10]
o.a.s.u.PeerSync PeerSync: core=iglshistory_shard3_replica_n10 url=
http://solr.node2.metaonly01.eca.local:8983/solr START replicas=[
http://solr.node3.metaonly01.eca.local:8983/solr/iglshistory_shard3_replica_n8/]
nUpdates=100
solr | 2018-12-05 03:31:41.307 INFO
(coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr)
[c:iglshistory s:shard3 r:core_node12 x:iglshistory_shard3_replica_n10]
o.a.s.u.PeerSync PeerSync: core=iglshistory_shard3_replica_n10 url=
http://solr.node2.metaonly01.eca.local:8983/solr DONE. We have no
versions. sync failed.
solr | 2018-12-05 03:31:44.007 INFO
(coreZkRegister-1-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:iglshistory_shard2_replica_n4 c:iglshistory s:shard2 r:core_node7)
[c:iglshistory s:shard2 r:core_node7 x:iglshistory_shard2_replica_n4]
o.a.s.u.PeerSync PeerSync: core=iglshistory_shard2_replica_n4 url=
http://solr.node2.metaonly01.eca.local:8983/solr START replicas=[
http://solr.node4.metaonly01.eca.local:8983/solr/iglshistory_shard2_replica_n6/]
nUpdates=100
solr | 2018-12-05 03:31:44.008 INFO
(coreZkRegister-1-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:iglshistory_shard2_replica_n4 c:iglshistory s:shard2 r:core_node7)
[c:iglshistory s:shard2 r:core_node7 x:iglshistory_shard2_replica_n4]
o.a.s.u.PeerSync PeerSync: core=iglshistory_shard2_replica_n4 url=
http://solr.node2.metaonly01.eca.local:8983/solr DONE. We have no
versions. sync failed.
solr | 2018-12-05 05:48:44.202 INFO
(coreZkRegister-1-thread-3-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:igls_shard1_replica_n1 c:igls s:shard1 r:core_node3) [c:igls s:shard1
r:core_node3 x:igls_shard1_replica_n1] o.a.s.u.PeerSync PeerSync:
core=igls_shard1_replica_n1 url=
http://solr.node2.metaonly01.eca.local:8983/solr START replicas=[
http://solr.node4.metaonly01.eca.local:8983/solr/igls_shard1_replica_n2/]
nUpdates=100
solr | 2018-12-05 05:48:44.646 WARN
(coreZkRegister-1-thread-3-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:igls_shard1_replica_n1 c:igls s:shard1 r:core_node3) [c:igls s:shard1
r:core_node3 x:igls_shard1_replica_n1] o.a.s.u.PeerSync PeerSync:
core=igls_shard1_replica_n1 url=
http://solr.node2.metaonly01.eca.local:8983/solr got a 503 from
http://solr.node4.metaonly01.eca.local:8983/solr/igls_shard1_replica_n2/,
counting as success
solr | 2018-12-05 05:48:44.682 INFO
(coreZkRegister-1-thread-3-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:igls_shard1_replica_n1 c:igls s:shard1 r:core_node3) [c:igls s:shard1
r:core_node3 x:igls_shard1_replica_n1] o.a.s.u.PeerSync PeerSync:
core=igls_shard1_replica_n1 url=
http://solr.node2.metaonly01.eca.local:8983/solr DONE. sync succeeded
solr | 2018-12-05 05:51:18.369 INFO (qtp2131952342-10232)
[c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSync
PeerSync: core=igls_shard2_replica_n6 url=
http://solr.node2.metaonly01.eca.local:8983/solr START replicas=[
http://solr.node3.metaonly01.eca.local:8983/solr/igls_shard2_replica_n4/]
nUpdates=100
solr | 2018-12-05 06:01:18.496 WARN (qtp2131952342-10232)
[c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSync
Replica did not return a fingerprint - possibly an older Solr version or
exception
solr | 2018-12-05 06:01:31.126 INFO
(recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
r:core_node9 x:igls_shard2_replica_n6] o.a.s.c.RecoveryStrategy Attempting
to PeerSync from [
http://solr.node3.metaonly01.eca.local:8983/solr/igls_shard2_replica_n4/] -
recoveringAfterStartup=[true]
solr | 2018-12-05 06:01:31.129 INFO
(recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSyncWithLeader
PeerSync: core=igls_shard2_replica_n6 url=
http://solr.node2.metaonly01.eca.local:8983/solr START leader=
http://solr.node3.metaonly01.eca.local:8983/solr/igls_shard2_replica_n4/
nUpdates=100
solr | 2018-12-05 06:11:18.535 WARN (qtp2131952342-10232)
[c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSync
PeerSync: core=igls_shard2_replica_n6 url=
http://solr.node2.metaonly01.eca.local:8983/solr exception talking to
http://solr.node3.metaonly01.eca.local:8983/solr/igls_shard2_replica_n4/,
failed
solr | 2018-12-05 06:11:18.583 INFO (qtp2131952342-10232)
[c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSync
PeerSync: core=igls_shard2_replica_n6 url=
http://solr.node2.metaonly01.eca.local:8983/solr DONE. sync failed
solr | at
org.apache.solr.update.PeerSyncWithLeader.request(PeerSyncWithLeader.java:321)
solr | at
org.apache.solr.update.PeerSyncWithLeader.alreadyInSync(PeerSyncWithLeader.java:341)
solr | at
org.apache.solr.update.PeerSyncWithLeader.sync(PeerSyncWithLeader.java:135)
solr | 2018-12-05 09:21:31.831 INFO
(recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
r:core_node9 x:igls_shard2_replica_n6] o.a.s.c.RecoveryStrategy Attempting
to PeerSync from [
http://solr.node3.metaonly01.eca.local:8983/solr/igls_shard2_replica_n4/] -
recoveringAfterStartup=[true]
solr | 2018-12-05 09:21:31.831 INFO
(recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSyncWithLeader
PeerSync: core=igls_shard2_replica_n6 url=
http://solr.node2.metaonly01.eca.local:8983/solr START leader=
http://solr.node3.metaonly01.eca.local:8983/solr/igls_shard2_replica_n4/
nUpdates=100
solr | 2018-12-05 12:28:26.698 INFO
(recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSyncWithLeader
Fingerprint comparison result: 1
solr | 2018-12-05 12:28:26.698 INFO
(recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSyncWithLeader Leader
fingerprint: {maxVersionSpecified=9223372036854775807,
maxVersionEncountered=1619003108526915584, maxInHash=1619003108526915584,
versionsHash=-2647181038600325382, numVersions=701868706,
numDocs=701868706, maxDoc=13478974}, Our fingerprint:
{maxVersionSpecified=9223372036854775807,
maxVersionEncountered=1618994939968880640, maxInHash=1618994939968880640,
versionsHash=-2861577316323569427, numVersions=701868706,
numDocs=701868706, maxDoc=13478974}
solr | 2018-12-05 12:30:38.183 INFO
(recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSyncWithLeader Leader
fingerprint {maxVersionSpecified=9223372036854775807,
maxVersionEncountered=1619014810634027008, maxInHash=1619014810634027008,
versionsHash=-5367247708654169186, numVersions=701868706,
numDocs=701868706, maxDoc=13478974}
solr | 2018-12-05 12:30:38.183 INFO
(recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSyncWithLeader
PeerSync: core=igls_shard2_replica_n6 url=
http://solr.node2.metaonly01.eca.local:8983/solr Received 13 versions from
http://solr.node3.metaonly01.eca.local:8983/solr/igls_shard2_replica_n4/
solr | 2018-12-05 12:30:38.184 INFO
(recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSync PeerSync:
core=igls_shard2_replica_n6 url=
http://solr.node2.metaonly01.eca.local:8983/solr Our versions are too old.
ourHighThreshold=1618976334275411968 otherLowThreshold=1619012608476577792
ourHighest=1619014810634027008 otherHighest=1619014936356192256
solr | 2018-12-05 12:30:38.185 INFO
(recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSyncWithLeader
PeerSync: core=igls_shard2_replica_n6 url=
http://solr.node2.metaonly01.eca.local:8983/solr DONE. sync failed
solr | 2018-12-05 12:30:38.185 INFO
(recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
r:core_node9 x:igls_shard2_replica_n6] o.a.s.c.RecoveryStrategy PeerSync
Recovery was not successful - trying replication.
Expanding one of them, we have:
solr | 2018-12-05 03:31:39.421 INFO
(AutoscalingActionExecutor-8-thread-1) [ ] o.a.s.c.a.ExecutePlanAction No
operations to execute for event: {
solr | "id":"11402ae38efc8T1dr1vcl6ut11qh9fq6ncle94t",
solr | "source":".auto_add_replicas",
solr | "eventTime":303476722167752,
solr | "eventType":"NODELOST",
solr | "properties":{
solr | "eventTimes":[303476722167752],
solr | "preferredOperation":"movereplica",
solr | "_enqueue_time_":303959110445696,
solr |
"nodeNames":["solr.node4.metaonly01.eca.local:8983_solr"]}}
solr | 2018-12-05 03:31:41.290 INFO
(coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr)
[c:iglshistory s:shard3 r:core_node12 x:iglshistory_shard3_replica_n10]
o.a.s.c.ShardLeaderElectionContext Enough replicas found to continu
e.
solr | 2018-12-05 03:31:41.291 INFO
(coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr)
[c:iglshistory s:shard3 r:core_node12 x:iglshistory_shard3_replica_n10]
o.a.s.c.ShardLeaderElectionContext I may be the new leader - try an
d sync
solr | 2018-12-05 03:31:41.291 INFO
(coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr)
[c:iglshistory s:shard3 r:core_node12 x:iglshistory_shard3_replica_n10]
o.a.s.c.SyncStrategy Sync replicas to http://solr.node2.metaonly01.
eca.local:8983/solr/iglshistory_shard3_replica_n10/
solr | 2018-12-05 03:31:41.301 INFO
(coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr)
[c:iglshistory s:shard3 r:core_node12 x:iglshistory_shard3_replica_n10]
o.a.s.u.PeerSync PeerSync: core=iglshistory_shard3_replica_n10 url=
http://solr.node2.metaonly01.eca.local:8983/solr START replicas=[
http://solr.node3.metaonly01.eca.local:8983/solr/iglshistory_shard3_replica_n8/]
nUpdates=100
solr | 2018-12-05 03:31:41.307 INFO
(coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr)
[c:iglshistory s:shard3 r:core_node12 x:iglshistory_shard3_replica_n10]
o.a.s.u.PeerSync PeerSync: core=iglshistory_shard3_replica_n10 url=
http://solr.node2.metaonly01.eca.local:8983/solr DONE. We have no
versions. sync failed.
Not quite sure why it would be executing the auto-add-replicas trigger
since it's definitely set to false on the collection. (I checked). Also,
this log comes from very early in the startup procedure, probably before
solr has been launched on the other servers.
Kyle
Post by Kevin RisdenDo you have logs right before the following?
"we notice that the nodes go into "Recovering" state for about 10-12 hours
before finally coming alive."
Is there a peersync failure or something else in the logs indicating why
there is a full recovery?
Kevin Risden
Post by lstusr 5u93n4Hi All,
- solr 7.5
- 3 shards, replication factor 2 for a total of 6 NRT replicas
- 3 servers, 16GB ram each
- 2 billion documents
- autoAddReplicas: false
- 2.1 TB on-disk index size
- index stored on hdfs on separate servers.
If we (gracefully) shut down solr on all 3 servers, when we re-launch
solr
Post by lstusr 5u93n4we notice that the nodes go into "Recovering" state for about 10-12 hours
before finally coming alive.
During this recovery time, we notice high network traffic outbound from
our
Post by lstusr 5u93n4HDFS servers to our solr servers. The sum total of which is roughly
equivalent to the index size on disk.
So it seems to us that on startup, solr has to re-read the entire index
before coming back alive.
1. is this assumption correct?
2. is there any way to mitigate this, so that solr can launch faster?
Thanks!
Kyle