Discussion:
SolrCloud OOM Problem
dancoleman
2014-08-11 23:27:01 UTC
My SolrCloud of 3 shards / 3 replicas is having a lot of OOM errors. Here are
some specs on my setup:

hosts: all are EC2 m1.large with 250G data volumes
documents: 120M total
zookeeper: 5 external t1.micros

startup command with memory and GC values
===============================================================
root 12499 1 61 19:36 pts/0 01:49:18 /usr/lib/jvm/jre/bin/java
-XX:NewSize=1536m -XX:MaxNewSize=1536m -Xms5120m -Xmx5120m -XX:+UseParNewGC
-XX:+CMSParallelRemarkEnabled -XX:+UseConcMarkSweepGC
-Djavax.sql.DataSource.Factory=org.apache.commons.dbcp.BasicDataSourceFactory
-DnumShards=3 -Dbootstrap_confdir=/data/solr/lighting_products/conf
-Dcollection.configName=lighting_products_cloud_conf
-DzkHost=ec2-00-17-55-217.compute-1.amazonaws.com:2181,ec2-00-82-150-252.compute-1.amazonaws.com:2181,ec2-00-234-237-109.compute-1.amazonaws.com:2181,ec2-00-224-205-204.compute-1.amazonaws.com:2181,ec2-00-20-72-124.compute-1.amazonaws.com:2181
-classpath
:/usr/share/tomcat6/bin/bootstrap.jar:/usr/share/tomcat6/bin/tomcat-juli.jar:/usr/share/java/commons-daemon.jar
-Dcatalina.base=/usr/share/tomcat6 -Dcatalina.home=/usr/share/tomcat6
-Djava.endorsed.dirs= -Djava.io.tmpdir=/var/cache/tomcat6/temp
-Djava.util.logging.config.file=/usr/share/tomcat6/conf/logging.properties
-Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager
org.apache.catalina.startup.Bootstrap start


Linux "top" command output with no indexing
=======================================================
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
8654 root 20 0 95.3g 6.4g 1.1g S 27.6 87.4 83:46.19 java


Linux "top" command output with indexing
=======================================================
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
12499 root 20 0 95.8g 5.8g 556m S 164.3 80.2 110:40.99 java


So it appears our indexing is clobbering the CPU but not the memory.

The user queries are pretty bad, and I will provide a few examples here.
Note that the sort date "updated_dt" does not match the order in which
documents are indexed; the user application should be sorting on a
different date.
=======================================================
INFO: [lighting_products] webapp=/solr path=/select
params={facet=true&sort=updated_dt+desc&f.content_videotype_s.facet.missing=true&spellcheck.q=candid+camera&nocache=1407790600582&distrib=false&version=2&oe=UTF-8&fl=id,score&df=text&shard.url=10.211.82.113:80/solr/lighting_products/|10.249.34.65:80/solr/lighting_products/&NOW=1407790600883&ie=UTF-8&facet.field=content_videotype_s&fq=my_database_s:training&fq=my_server_s:mydomain\-arc\-v2.lightingservices.com&fq=allnamespaces_s_mv:(happyland\-site+OR+tags+OR+predicates+OR+authorities+OR+happyland+OR+movies+OR+global+OR+devicegroup+OR+people+OR+entertainment)&fq={!tag%3Dct0}_contenttype_s:Standard\:ShowVideo&fsv=true&site=solr_arc&_restype=SOLR&wt=javabin&defType=dismax&rows=50&start=0&f.content_videotype_s.facet.limit=160&q=candid+camera&q.op=AND&isShard=true}
hits=22 status=0 QTime=118

INFO: [lighting_products] webapp=/solr path=/select
params={facet=false&f.content_videotype_s.facet.missing=true&spellcheck.q=candid+camera&nocache=1407790600582&distrib=false&version=2&oe=UTF-8&fl=uuid_s,_shortid_s,_contenttype_s,_contentnamespace_s,_secondarynamespaces_s_mv,content_title_s,content_name_s,content_typename_s,content_predicatename_s,content_tagtext_s,content_ldapid_s,content_groupname_s,content_rolename_s,content_sitekey_s,score&fl=id&df=text&shard.url=10.211.82.113:80/solr/lighting_products/|10.249.34.65:80/solr/lighting_products/&NOW=1407790600883&ie=UTF-8&facet.field=content_videotype_s&fq=my_database_s:training&fq=my_server_s:mydomain\-arc\-v2.lightingservices.com&fq=allnamespaces_s_mv:(happyland\-site+OR+tags+OR+predicates+OR+authorities+OR+happyland+OR+movies+OR+global+OR+devicegroup+OR+people+OR+entertainment)&fq={!tag%3Dct0}_contenttype_s:Standard\:ShowVideo&site=solr_arc&facet.mincount=1&ids=mydomain.v2.lightingservices.com-training-ea5526a9-06a8-4d29-a3f7-f132f1a9aeeb,mydomain.v2.lightingservices.com-training-0bce8225-38f4-4793-b1a2-3764aeb5d4c8,mydomain.v2.lightingservices.com-training-d98bc070-aa30-4e40-8e99-f8c855f499d3,mydomain.v2.lightingservices.com-training-ce286605-d781-4804-88ba-5d9f6ca3cf6a,mydomain.v2.lightingservices.com-training-12cfae76-10a4-4e3b-b9c4-c7d5621ffc85,mydomain.v2.lightingservices.com-training-d8e2b40b-3553-4094-8bb0-e56eb50db753,mydomain.v2.lightingservices.com-training-687527fa-f3fe-4580-b3b5-d19e6f8ea144,mydomain.v2.lightingservices.com-training-2e1d09f1-5b89-4419-a455-68e0ef354be3,mydomain.v2.lightingservices.com-training-6c1cfd74-8504-4166-bad2-c63e1ac11c8d,mydomain.v2.lightingservices.com-training-be9e93c1-65f9-4de8-a909-40cca4e0f145,mydomain.v2.lightingservices.com-training-104b6e52-c997-40bd-bca5-9fd51e79584d,mydomain.v2.lightingservices.com-training-0f231cb2-4947-4602-aeb8-30fa0f104fa8,mydomain.v2.lightingservices.com-training-cdad1d41-50ca-4edc-97dd-043b3db8bf17,mydomain.v2.lightingservices.com-training-ecf06a7b-dc89-4912-af57-ea721ce67b08,mydomain.v2.lightingservices.com-training-3a01d385-5a77-4d34-b130-f1bb227451a7,mydomain.v2.lightingservices.com-training-7debc9f4-cdef-47d6-97ed-d960eba8d34f,mydomain.v2.lightingservices.com-training-b5dfd62e-abe1-410e-8ef3-5d1479f9ee98&_restype=SOLR&wt=javabin&defType=dismax&rows=50&start=0&q=candid+camera&q.op=AND&isShard=true}
status=0 QTime=391

INFO: [lighting_products] webapp=/solr path=/select
params={site=solr_arc&facet=false&facet.mincount=1&ids=mydomain.v2.lightingservices.com-training-42f475a9-20a3-4c29-a6a5-a77c96b1533c,mydomain.v2.lightingservices.com-training-0dc61b6a-217c-11e4-b48f-0026b9414f30,mydomain.v2.lightingservices.com-training-0d8f3f14-217c-11e4-b48f-0026b9414f30,mydomain.v2.lightingservices.com-training-496f3606-b306-4ab6-bef6-23b42d9b5555,mydomain.v2.lightingservices.com-training-e2038ec2-5995-47e9-8645-6ad05a35a5d2,mydomain.v2.lightingservices.com-training-0adc8c78-6823-4876-b84e-e7156af6a5e8,mydomain.v2.lightingservices.com-training-d9093dd3-4f7d-4b13-bd52-0d14c41ac578,mydomain.v2.lightingservices.com-training-357f61b2-88a6-4984-9bb2-dab69e5e6282,mydomain.v2.lightingservices.com-training-7ae34b55-0b92-419d-8099-299e3e557a99,mydomain.v2.lightingservices.com-training-154b4b08-0dbc-477a-a068-b831fa05965e,mydomain.v2.lightingservices.com-training-0b06e710-7aba-4a48-9a8e-316b4468ca0c,mydomain.v2.lightingservices.com-training-0daa21da-217c-11e4-b48f-0026b9414f30,mydomain.v2.lightingservices.com-training-2c56f8d2-217d-11e4-b48f-0026b9414f30,mydomain.v2.lightingservices.com-training-abcc37bc-a2fc-4192-bac4-188ee0703f07,mydomain.v2.lightingservices.com-training-f4032364-2623-4e66-80e0-9f7af693934e,mydomain.v2.lightingservices.com-training-c802d95b-4549-4cc9-b977-d97ced8c566e,mydomain.v2.lightingservices.com-training-1c97e884-af6c-494e-8820-6dddec7b7086,mydomain.v2.lightingservices.com-training-89854375-7bc2-4adc-9864-30db2af824d7&_restype=SOLR&nocache=1407790543361&distrib=false&q.alt=*:*&wt=javabin&version=2&rows=50&defType=dismax&oe=UTF-8&f.content_imagetype_s.facet.missing=true&NOW=1407790543725&shard.url=10.211.82.113:80/solr/lighting_products/|10.249.34.65:80/solr/lighting_products/&df=text&fl=uuid_s,shortid_s,contenttype_s,contentnamespace_s,secondarynamespaces_s_mv,content_title_s,content_name_s,content_typename_s,content_predicatename_s,content_tagtext_s,content_ldapid_s,content_groupname_s,content_rolename_s,content_sitekey_s,score&fl=id&start=0&ie=UTF-8&q.op=AND&facet.field=content_imagetype_s&isShard=true&fq=my_database_s:training&fq=my_server_s:mydomain\-arc\-v2.lightingservices.com&fq=allnamespaces_s_mv:(santa\-intl+OR+tags+OR+predicates+OR+santajr+OR+authorities+OR+santa+OR+movies+OR+global+OR+devicegroup+OR+kids+OR+people+OR+teensanta+OR+kids\-and\-family)&fq={!tag%3Dct0}contenttype_s:Standard\:Image}
status=0 QTime=1809

INFO: [lighting_products] webapp=/solr path=/select
params={site=solr_arc&ids=mydomain.v2.lightingservices.com-training-81a2a4d0-914b-11e2-8a73-0026b9414f30,mydomain.v2.lightingservices.com-training-da3add36-9ecc-4891-90ce-5202d508fce1,mydomain.v2.lightingservices.com-training-d7e3e102-914a-11e2-8a73-0026b9414f30,mydomain.v2.lightingservices.com-training-a2a8e1ae-914a-11e2-8a73-0026b9414f30,mydomain.v2.lightingservices.com-training-5d89a724-914b-11e2-8a73-0026b9414f30,mydomain.v2.lightingservices.com-training-94342a6e-b097-4e93-8fdc-8597e1f5179d,mydomain.v2.lightingservices.com-training-361292b4-914b-11e2-8a73-0026b9414f30,mydomain.v2.lightingservices.com-training-a4556418-914b-11e2-8a73-0026b9414f30,mydomain.v2.lightingservices.com-training-89b75f97-adec-4407-b648-5c563c07ee6a,mydomain.v2.lightingservices.com-training-e67fa642-5667-4e5f-bc83-d02981890f2e,mydomain.v2.lightingservices.com-training-45592376-082d-41cf-b7a7-f7064f65c1c4,mydomain.v2.lightingservices.com-training-cade1838-914a-11e2-8a73-0026b9414f30,mydomain.v2.lightingservices.com-training-e804a7f1-2003-4ef7-92ca-8fdaa269270b,mydomain.v2.lightingservices.com-training-17920180-914b-11e2-8a73-0026b9414f30,mydomain.v2.lightingservices.com-training-13fdfb8c-914b-11e2-8a73-0026b9414f30,mydomain.v2.lightingservices.com-training-d77dd216-8fb2-11e2-8a73-0026b9414f30&spellcheck.q=spike&qf=title_autocomplete&_restype=SOLR&nocache=1407776350213&distrib=false&wt=javabin&version=2&rows=50&defType=dismax&oe=UTF-8&NOW=1407776350272&shard.url=10.211.82.113:80/solr/lighting_products/|10.249.34.65:80/solr/lighting_products/&df=text&fl=uuid_s,shortid_s,contenttype_s,contentnamespace_s,secondarynamespaces_s_mv,content_name_s,urlkey_s,content_description_t,score&fl=id&start=0&q=spike&ie=UTF-8&bf=recip(ms(NOW/HOUR,updated_dt),3.16e-11,1,1)&q.op=AND&isShard=true&fq=my_database_s:training&fq=my_server_s:mydomain\-arc\-v2.lightingservices.com&fq=allnamespaces_s_mv:(spike\-site+OR+tags+OR+spike+OR+predicates+OR+authorities+OR+movies+OR+global+OR+devicegroup+OR+people+OR+entertainment)&fq=contenttype_s:(Standard\:DistPolicy)}
status=0 QTime=808

Any ideas on why memory usage is so high are appreciated.
Shawn Heisey
2014-08-12 00:15:01 UTC
Post by dancoleman
My SolrCloud of 3 shard / 3 replicas is having a lot of OOM errors. Here are
hosts: all are EC2 m1.large with 250G data volumes
documents: 120M total
zookeeper: 5 external t1.micros
<snip>
Post by dancoleman
Linux "top" command output with no indexing
=======================================================
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
8654 root 20 0 95.3g 6.4g 1.1g S 27.6 87.4 83:46.19 java
Linux "top" command output with indexing
=======================================================
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
12499 root 20 0 95.8g 5.8g 556m S 164.3 80.2 110:40.99 java
I think you're likely going to need a much larger heap than 5GB, or
you're going to need a lot more machines and shards, so that each
machine has a much smaller piece of the index. The java heap is only
one part of the story here, though.

Solr performance is terrible when the OS cannot effectively cache the
index, because Solr must actually read the disk to get the data required
for a query. Disks are incredibly SLOW. Even SSD storage is a *lot*
slower than RAM.

Your setup does not have anywhere near enough memory for the size of
your shards. Amazon's website says that the m1.large instance has 7.5GB
of RAM. You're allocating 5GB of that to Solr (the java heap) according
to your startup options. If you subtract a little more for the
operating system and basic system services, that leaves about 2GB of RAM
for the disk cache. Based on the numbers from top, that Solr instance
is handling nearly 90GB of index. 2GB of RAM for caching is nowhere
near enough -- you will want between 32GB and 96GB of total RAM for that
much index.

http://wiki.apache.org/solr/SolrPerformanceProblems#RAM

Thanks,
Shawn
dancoleman
2014-08-12 00:50:03 UTC
90G is correct, each host is currently holding that much data.

Are you saying that 32GB to 96GB would be needed for each host? Assuming
we did not add more shards that is.
Shawn Heisey
2014-08-12 01:12:21 UTC
Post by dancoleman
90G is correct, each host is currently holding that much data.
Are you saying that 32GB to 96GB would be needed for each host? Assuming
we did not add more shards that is.
If you want good performance and enough memory to give Solr the heap it
will need, yes. Lucene (the search API that Solr uses) relies on good
operating system caching for the index. Having enough memory to cache the
ENTIRE index is not usually required, but it is recommended.

Alternatively, you can add a lot more hosts and create a new collection
with a lot more shards. The total memory requirement across the whole
cloud won't go down, but each host won't require as much.
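
For reference, a new collection with more shards is created through the
Collections API. The host, collection name, and shard/replica counts below
are placeholders; only the configName is taken from your startup options:

http://solrhost:8080/solr/admin/collections?action=CREATE&name=lighting_products_v2&numShards=12&replicationFactor=2&maxShardsPerNode=2&collection.configName=lighting_products_cloud_conf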

Thanks,
Shawn
tuxedomoon
2014-08-12 21:12:18 UTC
I have modified my instances to m2.4xlarge 64-bit with 68.4G memory. Hate to
ask this but can you recommend Java memory and GC settings for 90G data and
the above memory? Currently I have

CATALINA_OPTS="${CATALINA_OPTS} -XX:NewSize=1536m -XX:MaxNewSize=1536m
-Xms5120m -Xmx5120m -XX:+UseParNewGC -XX:+CMSParallelRemarkEnabled
-XX:+UseConcMarkSweepGC"

Doesn't this mean I am starting with 5G and never going over 5G?

I've seen a few of those uninverted multi-valued field OOMs already on the
upgraded host.

Thanks

Tux
Shawn Heisey
2014-08-12 21:55:27 UTC
Post by tuxedomoon
I have modified my instances to m2.4xlarge 64-bit with 68.4G memory. Hate to
ask this but can you recommend Java memory and GC settings for 90G data and
the above memory? Currently I have
CATALINA_OPTS="${CATALINA_OPTS} -XX:NewSize=1536m -XX:MaxNewSize=1536m
-Xms5120m -Xmx5120m -XX:+UseParNewGC -XX:+CMSParallelRemarkEnabled
-XX:+UseConcMarkSweepGC"
Doesn't this mean I am starting with 5G and never going over 5G?
Yes, that's exactly what it means -- you have a heap size limit of 5GB.
The OutOfMemory error indicates that Solr needs more heap space than it
is getting. You'll need to raise the -Xmx value. It is usually
advisable to configure -Xms to match.

The wiki page I linked before includes a link to the following page,
listing the GC options that I use beyond the -Xmx setting:

http://wiki.apache.org/solr/ShawnHeisey#GC_Tuning

Thanks,
Shawn
tuxedomoon
2014-08-13 11:34:15 UTC
Great info. Can I ask how much data you are handling with that 6G or 7G
heap?
Shawn Heisey
2014-08-13 13:10:45 UTC
Post by tuxedomoon
Great info. Can I ask how much data you are handling with that 6G or 7G
heap?
My dev server is the one with the 7GB heap. My production servers only
handle half the index shards, so they have the smaller heap. Here is
the index size info from my dev server:

[***@bigindy5 ~]# du -sh /index/solr4/data/
131G /index/solr4/data/

This represents about 116 million total documents.

Thanks,
Shawn
tuxedomoon
2014-08-13 13:52:31 UTC
I applied the OPTS you pointed me to; here's the full string:

CATALINA_OPTS="${CATALINA_OPTS} -XX:NewSize=1536m -XX:MaxNewSize=1536m
-Xms12288m -Xmx12288m -XX:NewRatio=3 -XX:SurvivorRatio=4
-XX:TargetSurvivorRatio=90 -XX:MaxTenuringThreshold=8
-XX:+UseConcMarkSweepGC -XX:+CMSScavengeBeforeRemark
-XX:PretenureSizeThreshold=64m -XX:CMSFullGCsBeforeCompaction=1
-XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=70
-XX:CMSTriggerPermRatio=80 -XX:CMSMaxAbortablePrecleanTime=6000
-XX:+CMSParallelRemarkEnabled -XX:+ParallelRefProcEnabled -XX:+UseLargePages
-XX:+AggressiveOpts"

jConsole is now showing lower heap usage. It had been climbing to 12G
consistently; now it is only spiking to 10G every 10 minutes or so.

Here's my "top" output
=======================================================
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4250 root 20 0 129g 14g 1.9g S 2.0 21.3 17:40.61 java

tuxedomoon
2014-08-13 11:42:43 UTC
Have you used a queue to intercept queries, and if so, what was your
implementation? We are indexing huge amounts of data from 7 SolrJ instances
which run independently, so there's a lot of concurrent indexing.
Shawn Heisey
2014-08-13 13:10:20 UTC
Post by tuxedomoon
Have you used a queue to intercept queries and if so what was your
implementation? We are indexing huge amounts of data from 7 SolrJ instances
which run independently, so there's a lot of concurrent indexing.
On my setup, the queries come from a Java webapp that uses SolrJ, which
is running on multiple servers in a cluster. The updates come from a
custom SolrJ application that I wrote. There is no queue; Solr is more
than capable of handling the load that we give it.
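
For illustration only, here is a minimal sketch of a batching SolrJ 4.x
update client along those lines. The ZooKeeper string, collection name,
field names, and batch size are placeholders, not the actual application:

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

// Placeholder batching indexer: documents are buffered and sent in batches
// of 1000 instead of one update request per document, with a single commit
// at the end (or rely on autoCommit in solrconfig.xml).
public class BatchingIndexer {
    public static void main(String[] args) throws Exception {
        CloudSolrServer solr = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        solr.setDefaultCollection("example_collection");

        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < 100000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);
            doc.addField("content_title_s", "title " + i);
            batch.add(doc);

            if (batch.size() == 1000) {   // send a full batch, then reuse the list
                solr.add(batch);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            solr.add(batch);              // send the final partial batch
        }
        solr.commit();
        solr.shutdown();
    }
}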

Full rebuilds are done with the dataimport handler. The source of all
our Solr data is a MySQL database.

Thanks,
Shawn
Toke Eskildsen
2014-08-12 07:56:30 UTC
Post by dancoleman
My SolrCloud of 3 shard / 3 replicas is having a lot of OOM errors. Here are
hosts: all are EC2 m1.large with 250G data volumes
Is that 3 (each running a primary and a replica shard) or 6 instances?
Post by dancoleman
documents: 120M total
zookeeper: 5 external t1.micros
If your facet fields have many unique values and if you have many
concurrent requests, then memory usage will be high. But by the looks of
it, I guess that the facet fields have relatively few values?

Still, if you have many concurrent queries, you might consider using a
queue in front of your SolrCloud instead of just starting new requests,
in order to set an effective limit on heap usage.
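
A minimal sketch of that kind of queue, assuming SolrJ 4.x; the ZooKeeper
string and collection name are placeholders, and the query is just one of
the examples from earlier in the thread. A fixed-size pool with a bounded
work queue caps concurrent requests, and the caller-runs policy makes
submitters wait instead of starting ever more requests:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ThrottledSearch {
    public static void main(String[] args) throws Exception {
        final CloudSolrServer solr = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        solr.setDefaultCollection("example_collection");

        // At most 8 queries run concurrently and at most 100 wait in the queue;
        // when the queue is full the submitting thread runs the query itself,
        // which throttles how fast new requests reach the cluster.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                8, 8, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<Runnable>(100),
                new ThreadPoolExecutor.CallerRunsPolicy());

        for (int i = 0; i < 500; i++) {
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        SolrQuery q = new SolrQuery("candid camera");  // stand-in user query
                        q.setRows(50);
                        QueryResponse rsp = solr.query(q);
                        System.out.println(rsp.getResults().getNumFound() + " hits");
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }

        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.MINUTES);
        solr.shutdown();
    }
}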

- Toke Eskildsen, State and University Library, Denmark