Discussion:
Can Apache Solr Handle TeraByte Large Data
mustafozbek
2012-01-13 12:08:00 UTC
Permalink
I have been an Apache Solr user for about a year. I have used Solr for simple
search tools, but now I want to use Solr with 5TB of data. I estimate that the
5TB of data will grow to about 7TB once Solr indexes it, given the filters I use.
I will then add roughly 50MB of data per hour to the same index.
1- Are there any problems with using a single Solr server for 5TB of data (without
shards)?
 a- Can the Solr server answer queries in an acceptable time?
 b- What is the expected time for committing 50MB of data to a 7TB index?
 c- Is there an upper limit on index size?
2- What suggestions do you have?
 a- How many shards should I use?
 b- Should I use Solr cores?
 c- What commit frequency do you recommend? (Is 1 hour OK?)
3- Are there any test results for this kind of large data?

There is no 5TB data set available yet; I just want to estimate what the result
will be.
Note: you can assume that hardware resources are not a problem.


--
View this message in context: http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-tp3656484p3656484.html
Sent from the Solr - User mailing list archive at Nabble.com.
Daniel Brügge
2012-01-13 14:49:06 UTC
Permalink
Hi,

it's definitely a problem to store 5TB in Solr without sharding. I try to split the data over multiple Solr instances
so that each index fits into memory on its server.

I already ran into trouble with a single Solr instance holding a 50GB index.

Daniel
d***@ontrenet.com
2012-01-13 15:00:42 UTC
Permalink
Maybe also have a look at these links.

http://www.hathitrust.org/blogs/large-scale-search/performance-5-million-volumes
http://www.hathitrust.org/blogs/large-scale-search
Robert Stewart
2012-01-13 16:06:17 UTC
Permalink
Any idea how many documents your 5TB of data contains? Certain features such as faceting depend more on the total number of documents than on the actual size of the data.

I have tested approximately 1TB (100 million documents) running on a single machine (40 cores, 128GB RAM), using distributed search across 10 shards (10 million docs each), i.e. running 10 Solr processes. Search performance is good (under 1 second on average, including faceting).

Based on that, for 5TB (assuming roughly 500 million docs) you could probably shard across a few such machines and get decent performance with distributed search.

The indexes were sharded by time. New documents go into a single index (the "current" index), and once that index reaches 10 million docs, a new index is created to become the "current" index. The oldest index is then dropped from search (so the total remains 10 shards). It is news data, so older data is less important.
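
For reference, that kind of distributed search is driven by Solr's "shards" request parameter. Below is a minimal SolrJ sketch of such a query; it is a sketch only, the host names, core names, field names and the SolrJ 5.x HttpSolrClient constructor are assumptions, not the setup described above:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class DistributedSearchExample {
    public static void main(String[] args) throws Exception {
        // Query one node; it fans the request out to every shard listed below.
        HttpSolrClient client = new HttpSolrClient("http://host1:8983/solr/news1");

        SolrQuery q = new SolrQuery("title:earthquake");
        // Hypothetical shard list: cores spread over several hosts.
        q.set("shards", "host1:8983/solr/news1,host1:8983/solr/news2,host2:8983/solr/news3");
        q.setFacet(true);
        q.addFacetField("category");
        q.setRows(10);

        QueryResponse rsp = client.query(q);
        System.out.println("Total hits across shards: " + rsp.getResults().getNumFound());
        client.close();
    }
}

Each listed shard is searched, and the node that received the request merges the results before returning them.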
mustafozbek
2012-01-16 08:50:12 UTC
Permalink
All of the documents we use are rich-text documents, and we parse them with
Tika. We need to search in real time.
Post by Robert Stewart
Any idea how many documents your 5TB data contains?
There are about 3 million documents. The situation is that our documents are
large in size but small in number. Is that fine?


--
View this message in context: http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-tp3656484p3662567.html
Sent from the Solr - User mailing list archive at Nabble.com.
Otis Gospodnetic
2012-01-16 17:11:05 UTC
Permalink
Hello,
Post by mustafozbek
All documents that we use are rich text documents and we parse them with
tika. we need to search real time.
Because of the real-time requirement, you'll need to use an unreleased/dev version of Solr.
Post by Robert Stewart
Any idea how many documents your 5TB data contains?
There are about 3millions document. You see the problem is that we have
documents large in size and small in numbers. Is that fine?
That's fine. But you may want to think about breaking large documents up into smaller Solr documents: when a match occurs in a very large document, it is hard for users to get to the match unless you highlight the matches and let the user jump from match to match within the document.

Otis
----
Performance Monitoring SaaS for Solr - http://sematext.com/spm/solr-performance-monitoring/index.html
Burton-West, Tom
2012-01-16 22:00:35 UTC
Permalink
Hello,

Searching real-time sounds difficult with that amount of data. With large documents, 3 million documents, and 5TB of data the index will be very large. With indexes that large your performance will probably be I/O bound.

Do you plan on allowing phrase or proximity searches? If so, your performance will be even more I/O bound, as documents that large will have huge positions indexes that will need to be read into memory for processing phrase queries. To reduce I/O you need as much of the index in memory as possible (Lucene/Solr caches, and operating system disk cache). Every commit invalidates the Solr/Lucene caches (unless the newer NRT code has solved this for Solr).

If you index and serve on the same server, you are also going to get terrible response time whenever your commits trigger a large merge.

If you need to service 10-100 qps or more, you may need to look at putting your index on SSDs or spreading it over enough machines so it can stay in memory.

What kind of response times are you looking for and what query rate?

We have somewhat smaller documents. We have 10 million documents and about 6-8TB of data in HathiTrust and have spread the index over 12 shards on 4 machines (i.e. 3 shards per machine). We get an average of around 200-300ms response time but our 95th percentile times are about 800ms and 99th percentile are around 2 seconds. This is with an average load of less than 1 query/second.

As Otis suggested, you may want to implement a strategy that allows users to search within the large documents by breaking the documents up into smaller units. What we do is have two Solr indexes. The first indexes complete documents. When the user clicks on a result, we index the entire document on a page level in a small Solr index on-the-fly. That way they can search within the document and get page level results.

More details about our setup: http://www.hathitrust.org/blogs/large-scale-search

Tom Burton-West
University of Michigan Library
www.hathitrust.org
Memory Makers
2012-01-17 05:15:28 UTC
Permalink
I've been toying with the idea of setting up an experiment to index a large
(1+ TB) document set -- any thoughts on an open data set that one could use
for this purpose?

Thanks.
Otis Gospodnetic
2012-01-18 06:30:16 UTC
Permalink
Could indexing the English Wikipedia dump over and over get you there?

Otis 
----
Performance Monitoring SaaS for Solr - http://sematext.com/spm/solr-performance-monitoring/index.html
Post by Memory Makers
I've been toying with the idea of setting up an experiment to index a large
document set 1+ TB -- any thoughts on an open data set that one could use
for this purpose?
Thanks.
Otis Gospodnetic
2012-01-14 09:53:12 UTC
Permalink
Hello,
 
Inline

----- Original Message -----
Post by mustafozbek
I am an apache solr user about a year. I used solr for simple search tools
but now I want to use solr with 5TB of data. I assume that 5TB data will be
7TB when solr index it according to filter that I use. And then I will add
nearly 50MB of data per hour to the same index.
1-    Are there any problem using single solr server with 5TB data. (without
shards)
  a-    Can solr server answers the queries in an acceptable time
Not likely, unless the diversity of queries is very small, the OS can keep the relevant parts of the index cached, and the Solr caches get hit a lot.
Post by mustafozbek
  b-    what is the expected time for commiting of 50MB data on 7TB index.
Depends on settings like ramBufferSizeMB and how you add the data (e.g. via DIH, via SolrJ, via CSV import...)
Post by mustafozbek
  c-    Is there an upper limit for index size.
Yes, the Lucene doc ID limit caps the size of a single index, but you will hit hardware limits before you hit that one.
Post by mustafozbek
2-    what are the suggestions that you offer
  a-    How many shards should I use
Depends primarily on the number of servers available and their capacity.
Post by mustafozbek
  b-    Should I use solr cores
Sounds like you should really start by using SolrCloud.
Post by mustafozbek
  c-    What is the committing frequency you offered. (is 1 hour OK)
Depends on how often you want to see new data show up in search results. Some people need that to be immediate, or within 1 second or 1 hour, while some are OK with 24h.
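For instance, SolrJ's commitWithin parameter lets the client hand commit timing off to Solr instead of issuing explicit commits; a minimal sketch, assuming SolrJ 5.x and hypothetical field names:

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class CommitWithinExample {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/mycore");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");                       // hypothetical fields
        doc.addField("text", "parsed rich-text content");

        // Ask Solr to make this document searchable within one hour (3600000 ms)
        // rather than issuing an explicit commit from the client.
        client.add(doc, 3600000);
        client.close();
    }
}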
Post by mustafozbek
3-    are there any test results for this kind of large data
Nothing official, but it's been done.  For example, we've done large-scale stuff like this with Solr for our clients at Sematext, but we can't publish technical details.
Post by mustafozbek
There is no available 5TB data, I just want to estimate what will be the
result.
Note: You can assume that hardware resourses are not a problem.
Otis
----
Performance Monitoring SaaS for Solr - http://sematext.com/spm/solr-performance-monitoring/index.html
Mugeesh Husain
2015-08-03 15:42:53 UTC
Permalink
Hi,
I am new to Solr development and have the same requirement, and with the help
of Googling I have already picked up some knowledge, such as how many shards
have to be created for this amount of data.

I would like some suggestions: there are several methods for indexing, such
as DIH, Solr's post tool, and SolrJ.

Please suggest which way I should do it:
1.) Should I use SolrJ?
2.) Should I use DIH?
3.) Should I use the post method (in the terminal)?

Or is there any other way to index this amount of data?




--
View this message in context: http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-tp3656484p4220469.html
Sent from the Solr - User mailing list archive at Nabble.com.
Alexandre Rafalovitch
2015-08-03 16:06:48 UTC
Permalink
That's still a VERY open question. The answer is yes, but the details
depend on the shape and source of your data, and on the searches you are
anticipating.

Is this a lot of entries with a small number of fields, or a -
relatively - small number of entries with huge field counts? Do you
need to store/return all those fields or just search them?

Is the content coming as one huge file (in which format?) or from an
external source such as a database?

And so on.

Regards,
Alex.
----
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/
Mugeesh Husain
2015-08-03 17:56:40 UTC
Permalink
Hi Alexandre,
I have 40 million files stored in a file system; the filenames are saved in the
form ARIA_SSN10_0007_LOCATION_0000129.pdf.
1.) I have to split out the underscore-separated values from each filename, and
these values have to be indexed into Solr.
2.) I do not need the file contents (text) to be indexed.

You told me "The answer is Yes", but I didn't get in which way you meant yes.

Thanks




--
View this message in context: http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-tp3656484p4220527.html
Sent from the Solr - User mailing list archive at Nabble.com.
Erick Erickson
2015-08-03 18:22:45 UTC
Permalink
I'd go with SolrJ personally. For a terabyte of data that (I'm inferring)
consists of PDF files and the like (aka "semi-structured documents"), you'll
need to have Tika parse out the data you need to index. Doing
that through posting or DIH puts all the analysis on the Solr servers,
which will work, but not optimally.

Here's something to get you started:

https://lucidworks.com/blog/indexing-with-solrj/

Best,
Erick
Erik Hatcher
2015-08-03 18:22:47 UTC
Permalink
Most definitely yes, given your criteria below. If you don’t need the text within the files to be parsed and indexed, it sounds like a simple file system crawler that just walks the directory listings and posts the file names, split as you’d like, to Solr would suffice.
—
Erik Hatcher, Senior Solutions Architect
http://www.lucidworks.com <http://www.lucidworks.com/>
Erick Erickson
2015-08-03 18:29:40 UTC
Permalink
Ahhh, listen to Hatcher if you're not indexing the _contents_ of the
files, just the filenames....

Erick
Mugeesh Husain
2015-08-03 19:21:04 UTC
Permalink
@Erik Hatcher You mean I have to use SolrJ for the indexing, right?

Can SolrJ handle the large amount of data I mentioned in my previous post?
If I use DIH instead, how would I split the values out of the filenames?

I want to start my development in the right direction, which is why I am a little
confused about which way to start on my requirement.

Please tell me: is the "yes" you gave for SolrJ, or for DIH?



--
View this message in context: http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-tp3656484p4220550.html
Sent from the Solr - User mailing list archive at Nabble.com.
Alexandre Rafalovitch
2015-08-03 18:59:37 UTC
Permalink
Just to reconfirm, are you indexing file content? Because if you are,
you need to be aware that most PDFs do not extract well, as they do
not have the text flow preserved.

If you are indexing PDF files, I would run a sample through Tika
directly (that's what Solr uses under the covers anyway) and see what
the output looks like.

Apart from that, either SolrJ or DIH would work. If this is for a
production system, I'd use SolrJ with client-side Tika parsing. But
you could use DIH for a quick test run.

Regards,
Alex.
----
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/
Mugeesh Husain
2015-08-03 19:34:59 UTC
Permalink
@Alexandre No, I don't need the content of the files. I am repeating my requirement:

I have 40 million files stored in a file system;
the filenames are saved in the form ARIA_SSN10_0007_LOCATION_0000129.pdf.

I just split out all the values from the filename; these values are what I have to index.

I am interested in indexing the values into Solr, not the file contents.

I have tested DIH from a file system and it works fine, but I don't know how
to implement my code in DIH:
if my code extracts some values, how can I index them using DIH?

If I use DIH, how would I perform the split operation and get the values from
it?





--
View this message in context: http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-tp3656484p4220552.html
Sent from the Solr - User mailing list archive at Nabble.com.
Alexandre Rafalovitch
2015-08-03 21:01:05 UTC
Permalink
Well,

If it is just file names, I'd probably use SolrJ client, maybe with
Java 8. Read file names, split the name into parts with regular
expressions, stuff parts into different field names and send to Solr.
Java 8 has FileSystem walkers, etc to make it easier.
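
A minimal sketch of that approach, assuming SolrJ 5.x and Java 8; the directory path and field names below are hypothetical, and the underscore-separated parts are simply mapped to dynamic string fields:

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class FilenameIndexer {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/files");

        // Walk the directory tree (Java 8) and keep regular files only.
        List<Path> files;
        try (Stream<Path> walk = Files.walk(Paths.get("/data/files"))) {
            files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
        }

        for (Path p : files) {
            String name = p.getFileName().toString();        // e.g. ARIA_SSN10_0007_LOCATION_0000129.pdf
            String[] parts = name.replaceFirst("\\.[^.]+$", "").split("_");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", name);                        // filename as the unique key (hypothetical schema)
            for (int i = 0; i < parts.length; i++) {
                doc.addField("part_" + i + "_s", parts[i]);  // one dynamic string field per underscore part
            }
            client.add(doc);                                 // in practice, batch these adds (see later in the thread)
        }
        client.commit();
        client.close();
    }
}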

You could do it with DIH, but it would be with nested entities and the
inner entity would probably try to parse the file. So, a lot of wasted
effort if you just care about the file names.

Or, I would just do a directory listing in the operating system and
use regular expressions to split it into CSV file, which I would then
import into Solr directly.

In all of these cases, the question would be which field is the ID of
the record to ensure no duplicates.

Regards,
Alex.

----
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/
Upayavira
2015-08-03 21:59:56 UTC
Permalink
SolrJ is just a Solr client ("SolrClient"). In pseudocode, you say:

SolrClient client =
    new HttpSolrClient("http://localhost:8983/solr/whatever");

List<SolrInputDocument> docs = new ArrayList<>();
SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "abc123");
doc.addField("some-text-field", "I like it when the sun shines");
docs.add(doc);
client.add(docs);
client.commit();

(warning, the above is typed from memory; SolrClient itself is abstract, so
you instantiate a concrete implementation such as HttpSolrClient)

So, the question is simply how many documents do you add to docs before
you do client.add(docs);

And how often (if at all) do you call client.commit().

So when you are told "Use SolrJ", really, you are being told to write
some Java code that happens to use the SolrJ client library for Solr.

Upayavira
Konstantin Gribov
2015-08-03 22:15:21 UTC
Permalink
Upayavira, manual commits aren't good advice, especially with small batches
or single documents, are they? I mostly see recommendations to use
autoCommit+autoSoftCommit instead of manual commits.
--
Best regards,
Konstantin Gribov
Upayavira
2015-08-04 09:24:37 UTC
Permalink
Yes, you are right - generally autoCommit is the better way. If you are
doing a one-off indexing run, a manual commit may well be the best
option, but in the general case autoCommit is preferable.

Upayavira
Mugeesh Husain
2015-08-04 17:13:17 UTC
Permalink
@Upayavira If I use SolrJ for indexing, will autoCommit or autoSoftCommit
work in the case of SolrJ?



--
View this message in context: http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-tp3656484p4220796.html
Sent from the Solr - User mailing list archive at Nabble.com.
Upayavira
2015-08-04 17:50:21 UTC
Permalink
Post by Mugeesh Husain
@Upayavira if i uses Solrj for indexing. autocommit or Softautocommit will
work in case of SolJ
There are two ways to get content into Solr:

* push it in via an HTTP post.
- this is what SolrJ uses, what bin/post uses, and everything else
other than:
* DIH: this runs inside Solr and pulls content into the index via
configurations

Personally, I'm not a fan of the DIH. It works for simple scenarios, but
as soon as your needs get a little complex, it seems to struggle, and it
seems your needs have become sufficiently complex already.

Solr itself does the autocommit, so it will work with anything that you
use to push content into Solr, SolrJ, bin/post, DIH or anything else.

Upayavira
Mugeesh Husain
2015-08-05 15:50:14 UTC
Permalink
@Upayavira

Thanks, these things are very useful for my understanding.
I am thinking that I will create an XML or CSV file from my requirement using
Java, and then index it via HTTP POST or bin/post.

I am not using DIH because I didn't find any link or idea on how to split the
data and add it to Solr one record at a time (as I mentioned in my requirement).

Tell me, is indexing XML files or CSV files the better way?

With CSV I noticed that it didn't parse the data into the correct fields. So
how do we ensure that the data is stored correctly in Solr?

Or is XML the correct way to parse it?



--
View this message in context: http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-tp3656484p4221051.html
Sent from the Solr - User mailing list archive at Nabble.com.
Upayavira
2015-08-05 15:59:22 UTC
Permalink
If you are using Java, you will likely find SolrJ the best way - it uses
serialised Java objects to communicate with Solr - you don't need to
worry about that. Just use code similar to that earlier in the thread.
No XML, no CSV, just simple Java code.

Upayavira
Mugeesh Husain
2015-08-05 16:07:49 UTC
Permalink
The filesystem holds about 40 million documents, so it will iterate 40 million
times. Can SolrJ handle a loop of 40 million iterations? (Before indexing I have
to split the values from the filename and do some processing, then index into Solr.)

Will it index continuously through all 40 million iterations, or do I have to
sleep for some interval in between?

Will it take the same amount of time compared to HTTP or bin/post?



--
View this message in context: http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-tp3656484p4221060.html
Sent from the Solr - User mailing list archive at Nabble.com.
Upayavira
2015-08-05 16:21:06 UTC
Permalink
Post your docs in sets of 1000. Create a:

List<SolrInputDocument> docs

Then add 1000 docs to it, then client.add(docs);

Repeat until your 40m are indexed.
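
A rough sketch of that batching pattern (assuming SolrJ; buildDoc() below is a
hypothetical helper standing in for whatever filename-splitting logic produces
each document):

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
    static void indexAll(SolrClient client, Iterable<String> filenames) throws Exception {
        List<SolrInputDocument> docs = new ArrayList<>();
        for (String name : filenames) {
            docs.add(buildDoc(name));
            if (docs.size() == 1000) {      // send every 1000 documents
                client.add(docs);
                docs.clear();
            }
        }
        if (!docs.isEmpty()) {
            client.add(docs);               // send the final partial batch
        }
        client.commit();                    // or rely on autoCommit, as discussed above
    }

    // Hypothetical helper: turns one filename into a SolrInputDocument.
    static SolrInputDocument buildDoc(String name) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", name);           // hypothetical schema
        return doc;
    }
}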

Upayavira
Mugeesh Husain
2015-08-05 16:49:46 UTC
Permalink
Thank you, Upayavira.

I think I can do all of these things using SolrJ, which was useful to know before
starting development of the project.
I hope I will not run into any issues using SolrJ; I have gotten a lot of useful
material from this thread.

Thanks
Mugeesh Husain



--
View this message in context: http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-tp3656484p4221066.html
Sent from the Solr - User mailing list archive at Nabble.com.

Mugeesh Husain
2015-08-04 09:51:57 UTC
Permalink
Thanks @Alexandre, Erickson and Hatcher.

I will generate an MD5 ID from the filename using Java.
I can do this nicely with SolrJ, because I am a Java developer. Apart from
this, the question arises that the data is too large, so I think it will be broken into
multiple shards (cores).
With multi-core indexing, how can I detect duplicate IDs while re-indexing
the whole set (using SolrJ)? And
how can I analyse how much data one core contains versus another?

I have decided to do it with SolrJ, because I don't have a good
understanding of DIH for the type of operation my requirement needs.
I googled but was unable to find the kind of DIH example that I
could apply to my problem.





--
View this message in context: http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-tp3656484p4220673.html
Sent from the Solr - User mailing list archive at Nabble.com.
Erik Hatcher
2015-08-04 12:03:37 UTC
Permalink
If your data consists only of an id (the full filename) and a filename field (indexed, tokenized), 40M of those will fit comfortably into a single shard, provided there is enough RAM to operate.

I know SolrJ is tossed out there a lot as a/the way to index - but if you’ve got a directory tree of files and want to index _just_ the file names, then a shell script that generates a CSV could be easy and clean. It’s trivial to `bin/post -c <your collection> data.csv`
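
A rough sketch of that idea, written in Java for consistency with the rest of the thread (the directory path and CSV field layout are hypothetical); the resulting data.csv can then be loaded with bin/post as above:

import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class CsvFromFilenames {
    public static void main(String[] args) throws Exception {
        try (PrintWriter out = new PrintWriter("data.csv", "UTF-8");
             Stream<Path> walk = Files.walk(Paths.get("/data/files"))) {
            // Hypothetical header: assumes five underscore-separated parts per filename.
            out.println("id,part_0_s,part_1_s,part_2_s,part_3_s,part_4_s");
            walk.filter(Files::isRegularFile)
                .forEach(p -> {
                    String name = p.getFileName().toString();   // e.g. ARIA_SSN10_0007_LOCATION_0000129.pdf
                    String[] parts = name.replaceFirst("\\.[^.]+$", "").split("_");
                    out.println(name + "," + String.join(",", parts));
                });
        }
    }
}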

—
Erik Hatcher, Senior Solutions Architect
http://www.lucidworks.com <http://www.lucidworks.com/>
Mugeesh Husain
2015-08-04 17:10:32 UTC
Permalink
Thank you Erik, but I would prefer XML files instead of CSV.
For my requirement, if I want to use DIH for indexing, how could I perform this
splitting operation or include Java code in DIH?
I have googled but did not find this kind of requirement covered.
Please give me a link for it, or some suggestions on how to do it.





--
View this message in context: http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-tp3656484p4220793.html
Sent from the Solr - User mailing list archive at Nabble.com.
Mikhail Khludnev
2015-08-04 19:00:03 UTC
Permalink
Post by Mugeesh Husain
Thanks you Erik, I will preferred XML files instead of csv.
On my requirement if i want to use DIH for indexing than how could i split
these operation or include java clode to DIH..
Here is my favorite way to tweak data in DIH
https://wiki.apache.org/solr/DataImportHandler#ScriptTransformer
You can even do it in Java (https://wiki.apache.org/solr/DIHCustomTransformer),
but personally I prefer JavaScript.

Note: as a big fan of DIH, I have to say that it's not an option in the case of
SolrCloud; I explained why at
http://blog.griddynamics.com/2015/07/how-to-import-structured-data-into-solr.html

--
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
<***@griddynamics.com>
Mugeesh Husain
2015-08-05 16:24:19 UTC
Permalink
@Mikhail Using the data import handler, if I define my baseDir as
D:/work/folder, will it also work for sub-folders, and sub-folders of sub-folders,
etc.?



--
View this message in context: http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-tp3656484p4221063.html
Sent from the Solr - User mailing list archive at Nabble.com.