Discussion:
How to Fast Bulk Insert Documents
Troy Edwards
2015-08-19 17:09:50 UTC
Permalink
I have a requirement where I have to bulk insert a lot of documents in
SolrCloud.

My average document size is 400 bytes
Number of documents that need to be inserted 250000/second (for a total of
about 3.6 Billion documents)

Any ideas/suggestions on how that can be done? (use a client or uploadcsv
or stream or data import handler)

How can SolrCloud be configured to allow this fast bulk insert?

Any thoughts on what the SolrCloud configuration would probably look like?

Thanks
Vineeth Dasaraju
2015-08-19 17:22:23 UTC
Permalink
I have been using the SolrJ client and get speeds of 1,000 objects per
second. The size of my objects is around 4 KB.
Post by Troy Edwards
I have a requirement where I have to bulk insert a lot of documents in
SolrCloud.
My average document size is 400 bytes
Number of documents that need to be inserted 250000/second (for a total of
about 3.6 Billion documents)
Any ideas/suggestions on how that can be done? (use a client or uploadcsv
or stream or data import handler)
How can SolrCloud be configured to allow this fast bulk insert?
Any thoughts on what the SolrCloud configuration would probably look like?
Thanks
Shawn Heisey
2015-08-19 18:04:33 UTC
Permalink
Post by Troy Edwards
I have a requirement where I have to bulk insert a lot of documents in
SolrCloud.
My average document size is 400 bytes
Number of documents that need to be inserted 250000/second (for a total of
about 3.6 Billion documents)
Any ideas/suggestions on how that can be done? (use a client or uploadcsv
or stream or data import handler)
How can SolrCloud be configured to allow this fast bulk insert?
Any thoughts on what the SolrCloud configuration would probably look like?
I think this is an unrealistic goal, unless you're planning on a couple
hundred shards with a very small number of shards (1 or 2) per server.
This would also require a very large number of very fast servers with a
fair amount of RAM. The more shards you have on each server, the more
likely it is that you'll need SSD storage. This will get very expensive.

It is likely going to take a lot longer than 4 hours (which is what 3.6
billion documents at 250,000 per second works out to) to rebuild your
entire 3.6 billion document index. Your small document size will help
keep the rebuild time lower than I would otherwise expect, but 3.6
billion is a VERY large number. I can achieve about 6000 docs per
second on my largest index, which means that each of my cold shards
indexes at about 1000 docs per second. I'm not sure how large my
documents are, but a few kilobytes is probably about right. The entire
rebuild takes over 9 hours for a little more than 200 million documents.

The best performance is likely to come from a heavily multi-threaded
SolrJ 5.2.1 or later application using CloudSolrClient, with at least
version 5.2.1 on your servers. Even if you build the hardware
infrastructure I described above, it won't perform to your expectations
unless you've got someone with considerable Java programming skills.
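
For illustration only (the ZooKeeper hosts, collection name, field names,
batch size and thread count below are all placeholders, not anything specific
to this setup), a heavily simplified version of such a multi-threaded SolrJ
indexer might look like this:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BulkIndexer {
  public static void main(String[] args) throws Exception {
    // One ZooKeeper-aware client shared by all indexing threads
    CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181");
    client.setDefaultCollection("mycollection");

    int threads = 16;        // tune to your hardware
    int batchSize = 1000;    // send batches, never one document per request
    ExecutorService pool = Executors.newFixedThreadPool(threads);

    for (int t = 0; t < threads; t++) {
      final int threadId = t;
      pool.submit(() -> {
        List<SolrInputDocument> batch = new ArrayList<>();
        // each thread indexes its own slice of the source data
        for (int i = 0; i < 1_000_000; i++) {
          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", threadId + "-" + i);
          doc.addField("payload_s", "roughly 400 bytes of real data would go here");
          batch.add(doc);
          if (batch.size() == batchSize) {
            client.add(batch);   // routed to shard leaders by CloudSolrClient
            batch.clear();
          }
        }
        if (!batch.isEmpty()) client.add(batch);
        return null;
      });
    }

    pool.shutdown();
    pool.awaitTermination(24, TimeUnit.HOURS);
    // rely on autoCommit/commitWithin in solrconfig.xml rather than committing per batch
    client.close();
  }
}

Batching the documents and leaving commits to autoCommit are usually the two
biggest wins over sending one document per request.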

Thanks,
Shawn
Toke Eskildsen
2015-08-19 18:13:44 UTC
Permalink
Post by Troy Edwards
My average document size is 400 bytes
Number of documents that need to be inserted 250000/second
(for a total of about 3.6 Billion documents)
Any ideas/suggestions on how that can be done? (use a client
or uploadcsv or stream or data import handler)
Use more than one cloud. Make them fully independent. As I suggested when you asked 4 days ago. That would also make it easy to scale: Just measure how much a single setup can take and do the math.

- Toke Eskildsen
Upayavira
2015-08-19 18:36:06 UTC
Permalink
Post by Toke Eskildsen
Post by Troy Edwards
My average document size is 400 bytes
Number of documents that need to be inserted 250000/second
(for a total of about 3.6 Billion documents)
Any ideas/suggestions on how that can be done? (use a client
or uploadcsv or stream or data import handler)
Use more than one cloud. Make them fully independent. As I suggested when
you asked 4 days ago. That would also make it easy to scale: Just measure
how much a single setup can take and do the math.
Yes - work out how much each node can handle, then you can work out how
many nodes you need.

You could consider using implicit routing rather than compositeId, which
means that you take on responsibility for hashing your ID to push
content to the right node. (Or, if you use compositeId, you could use
the same algorithm and be sure that you send docs directly to the
correct shard.)

At the moment, if you push five documents to a five shard collection,
the node you send them to could end up doing four HTTP requests to the
other nodes in the collection. This means you don't need to worry about
where to post your content - it is just handled for you. However, there
is a performance hit there. Push content directly to the correct node
(either using implicit routing, or by replicating the compositeId hash
calculation in your client) and you'd increase your indexing throughput
significantly, I would theorise.
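
As a rough sketch of that second option (the collection name, ZooKeeper
address and id below are placeholders, and this leans on SolrJ's internal
cluster-state classes, so treat it as illustrative rather than a recipe), you
can ask the collection's router which shard owns an id and post straight to
that shard's leader:

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.cloud.DocCollection;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.Slice;

public class DirectToLeader {
  public static void main(String[] args) throws Exception {
    CloudSolrClient cloud = new CloudSolrClient("zk1:2181");
    cloud.connect();
    DocCollection coll =
        cloud.getZkStateReader().getClusterState().getCollection("mycollection");

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "some-id");

    // The default router is CompositeIdRouter; this is the same hash calculation
    // a receiving node would otherwise do before forwarding the document.
    Slice slice = coll.getRouter().getTargetSlice("some-id", doc, null, null, coll);
    Replica leader = slice.getLeader();
    String leaderUrl = leader.getStr("base_url") + "/" + leader.getStr("core");

    // Post straight to the leader core, skipping the extra forwarding hop
    try (HttpSolrClient direct = new HttpSolrClient(leaderUrl)) {
      direct.add(doc);
    }
    cloud.close();
  }
}

Note that CloudSolrClient already does this routing for the batches it sends,
which is much of why the SolrJ route tends to be the fastest.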

Upayavira
Erick Erickson
2015-08-19 18:41:21 UTC
Permalink
If you're sitting on HDFS anyway, you could use MapReduceIndexerTool. I'm not
sure that'll hit your rate, since it spends some time copying things around.
If you're not on HDFS, though, it's not an option.

Best,
Erick
Post by Upayavira
Post by Toke Eskildsen
Post by Troy Edwards
My average document size is 400 bytes
Number of documents that need to be inserted 250000/second
(for a total of about 3.6 Billion documents)
Any ideas/suggestions on how that can be done? (use a client
or uploadcsv or stream or data import handler)
Use more than one cloud. Make them fully independent. As I suggested when
you asked 4 days ago. That would also make it easy to scale: Just measure
how much a single setup can take and do the math.
Yes - work out how much each node can handle, then you can work out how
many nodes you need.
You could consider using implicit routing rather than compositeId, which
means that you take on responsibility for hashing your ID to push
content to the right node. (Or, if you use compositeId, you could use
the same algorithm, and be sure that you send docs directly to the
correct shard.
At the moment, if you push five documents to a five shard collection,
the node you send them to could end up doing four HTTP requests to the
other nodes in the collection. This means you don't need to worry about
where to post your content - it is just handled for you. However, there
is a performance hit there. Push content direct to the correct node
(either using implicit routing, or by replicating the compositeId hash
calculation in your client) and you'd increase your indexing throughput
significantly, I would theorise.
Upayavira
Susheel Kumar
2015-08-19 20:43:27 UTC
Permalink
For indexing 3.6 billion documents, you will run into bottlenecks not only
in Solr but also at other places (data acquisition, Solr document
object creation, submitting in bulk/batches to Solr).

This will require parallelizing the operations at each of these steps,
which can get you maximum throughput. A multi-threaded Java SolrJ-based
indexer with CloudSolrClient is required, as described by Shawn. I have
used ConcurrentUpdateSolrClient in the past, but with CloudSolrClient,
setParallelUpdates should also be tried out.
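
For illustration (the URLs, queue size and thread counts are placeholders),
the two clients mentioned above are set up roughly like this:

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;

public class ClientSetup {
  public static void main(String[] args) throws Exception {
    // Option 1: queue-based client with background sender threads, tied to one node/core
    ConcurrentUpdateSolrClient concurrent = new ConcurrentUpdateSolrClient(
        "http://solr1:8983/solr/mycollection", 10000 /* queue size */, 8 /* sender threads */);

    // Option 2: cluster-aware client that routes documents to shard leaders;
    // setParallelUpdates lets it send the per-shard requests concurrently
    CloudSolrClient cloud = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181");
    cloud.setDefaultCollection("mycollection");
    cloud.setParallelUpdates(true);

    // ... index with whichever client fits the pipeline ...

    concurrent.close();
    cloud.close();
  }
}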

Thanks,
Susheel
Post by Erick Erickson
If you're sitting on HDFS anyway, you could use MapReduceIndexerTool. I'm not
sure that'll hit your rate, since it spends some time copying things around.
If you're not on HDFS, though, it's not an option.
Best,
Erick
Post by Upayavira
Post by Toke Eskildsen
Post by Troy Edwards
My average document size is 400 bytes
Number of documents that need to be inserted 250000/second
(for a total of about 3.6 Billion documents)
Any ideas/suggestions on how that can be done? (use a client
or uploadcsv or stream or data import handler)
Use more than one cloud. Make them fully independent. As I suggested when
you asked 4 days ago. That would also make it easy to scale: Just measure
how much a single setup can take and do the math.
Yes - work out how much each node can handle, then you can work out how
many nodes you need.
You could consider using implicit routing rather than compositeId, which
means that you take on responsibility for hashing your ID to push
content to the right node. (Or, if you use compositeId, you could use
the same algorithm, and be sure that you send docs directly to the
correct shard.
At the moment, if you push five documents to a five shard collection,
the node you send them to could end up doing four HTTP requests to the
other nodes in the collection. This means you don't need to worry about
where to post your content - it is just handled for you. However, there
is a performance hit there. Push content direct to the correct node
(either using implicit routing, or by replicating the compositeId hash
calculation in your client) and you'd increase your indexing throughput
significantly, I would theorise.
Upayavira
Toke Eskildsen
2015-08-19 20:58:23 UTC
Permalink
Post by Toke Eskildsen
Use more than one cloud. Make them fully independent.
As I suggested when you asked 4 days ago. That would
also make it easy to scale: Just measure how much a
single setup can take and do the math.
The goal is 250K documents/second.

I tried modifying the books.csv example that comes with Solr to use lines with 400 characters and inflated it to 4 * 1 million entries. I then started a Solr with the techproducts example and ingested the 4*1M entries using curl from 4 prompts at the same time. The longest-running took 138 seconds. 4M/138 seconds = 29K documents/second.
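
(For reference, each of those curl runs corresponds roughly to the following
SolrJ call; the file name is a placeholder for the inflated CSV.)

import java.io.File;

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class CsvPost {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient client =
             new HttpSolrClient("http://localhost:8983/solr/techproducts")) {
      ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update");
      req.addFile(new File("books-inflated.csv"), "text/csv");  // placeholder file name
      req.setParam("commit", "true");
      client.request(req);
    }
  }
}

Running four of these in parallel against the same Solr mirrors the
four-terminal curl test.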

My machine is a 4-core (8 with HyperThreading) i7 laptop, using SSD. On a modern server and with a custom schema & config, the speed should of course be better. On the other hand, the rate might slow down as the shards grow.

Give or take, something like 10 machines could conceivably be enough to handle the Solr load if the analysis chain is near the books example in complexity. Of course, real-data tests are needed and the CSV data must be constructed somehow.

- Toke Eskildsen
Troy Edwards
2015-08-20 01:47:58 UTC
Permalink
Thank you for taking the time to do the test.

I have been doing similar tests using the post tool (SimplePostTool) with
the real data and was able to get to about 10K documents/second.

I am considering using multiple files (one per client) FTP'd to a Solr
node and then using a scheduled job to run the post tool and post them to
Solr.

The only issue I have run into so far is that if there is an error in the data
(e.g. a required field missing), the post tool stops processing the rest of
the file.
Post by Toke Eskildsen
Post by Toke Eskildsen
Use more than one cloud. Make them fully independent.
As I suggested when you asked 4 days ago. That would
also make it easy to scale: Just measure how much a
single setup can take and do the math.
The goal is 250K documents/second.
I tried modifying the books.csv-example that comes with Solr to use lines
with 400 characters and inflated it to 4 * 1 million entries. I then
started a Solr with the techproduct-example and ingested the 4*1M entries
using curl from 4 prompts a the same time. The longest running took 138
seconds. 4M/138 seconds = 29K documents/second.
My machine is a 4 core (8 with HyperThreading) i7 laptop, using SSD. On a
modern server and with custom schema & config, the speed should of course
be better. On the other hand, the rate might slow down as the shards grows.
Give or take, something like 10 machines could conceivably be enough to
handle the Solr load if the analysis chain is near the books-example in
complexity. Of course real data tests are needed and the CSv-data must be
constructed somehow.
- Toke Eskildsen
Troy Edwards
2015-08-20 01:42:36 UTC
Permalink
Are you suggesting that requests come into a service layer that identifies
which client is on which SolrCloud and passes the request to that cloud?

Thank you
Post by Toke Eskildsen
Post by Troy Edwards
My average document size is 400 bytes
Number of documents that need to be inserted 250000/second
(for a total of about 3.6 Billion documents)
Any ideas/suggestions on how that can be done? (use a client
or uploadcsv or stream or data import handler)
Use more than one cloud. Make them fully independent. As I suggested when
you asked 4 days ago. That would also make it easy to scale: Just measure
how much a single setup can take and do the math.
- Toke Eskildsen