Discussion:
CDCR Replication sensitive to network problems
Webster Homer
2018-12-07 18:35:24 UTC
We are using Solr 7.2. We have two SolrCloud clusters hosted on Google Cloud. These are targets for an on-prem SolrCloud cluster where we run our ETL loads, and CDCR replicates the data to the Google Cloud clusters. This mostly works pretty well. However, networks can fail. After a brief network outage we frequently see corrupted tlog files: either 0-length tlog files or files that appear to be truncated. When this happens we see lots of CDCR errors. If there is a corrupt tlog, we delete it and things go back to normal.
The frequency of the errors is troubling. CDCR needs to be more robust against network problems. I don't know how tlogs get corrupted in this scenario, but they obviously do.
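
For anyone who wants to automate the check for 0-length tlog files, a minimal sketch in Python follows. It is not part of Solr. The function name `find_empty_tlogs` is hypothetical, and it assumes that transaction log file names start with `tlog.` and sit somewhere under the directory you pass in. It only detects empty files. It cannot tell whether a non-empty tlog has been truncated, and it deletes nothing.

```python
from pathlib import Path


def find_empty_tlogs(root):
    """Return zero-length transaction log files under root, sorted by path.

    Assumption: tlog file names start with "tlog." (hypothetical helper,
    not a Solr API).
    """
    empty = []
    for path in Path(root).rglob("tlog.*"):
        # Only regular files count; a 0-byte tlog cannot hold any update record.
        if path.is_file() and path.stat().st_size == 0:
            empty.append(path)
    return sorted(empty)
```

Pass it the Solr data directory of a node and review the returned paths by hand before removing anything.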

Today we started seeing lots of CdcrReplicator errors but could not find a corrupt tlog. This is a trace from the logs:
java.io.EOFException
at org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:168)
at org.apache.solr.common.util.JavaBinCodec.readStr(JavaBinCodec.java:863)
at org.apache.solr.common.util.JavaBinCodec.readStr(JavaBinCodec.java:857)
at org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:266)
at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:256)
at org.apache.solr.common.util.JavaBinCodec.readSolrInputDocument(JavaBinCodec.java:603)
at org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:315)
at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:256)
at org.apache.solr.common.util.JavaBinCodec.readArray(JavaBinCodec.java:747)
at org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:272)
at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:256)
at org.apache.solr.update.TransactionLog$LogReader.next(TransactionLog.java:690)
at org.apache.solr.update.CdcrTransactionLog$CdcrLogReader.next(CdcrTransactionLog.java:304)
at org.apache.solr.update.CdcrUpdateLog$CdcrLogReader.next(CdcrUpdateLog.java:633)
at org.apache.solr.handler.CdcrReplicator.run(CdcrReplicator.java:77)
at org.apache.solr.handler.CdcrReplicatorScheduler.lambda$null$0(CdcrReplicatorScheduler.java:81)
at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:188)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Our admins restarted the source Solr servers, and that seems to have helped.