Discussion: Solr feasibility with terabyte-scale data

Phillip Farber
2008-01-18 22:26:21 UTC
Hello everyone,

We are considering Solr 1.2 to index and search a terabyte-scale dataset
of OCR. Initially our requirements are simple: basic tokenizing, score
sorting only, no faceting. The schema is simple too: a document
consists of a numeric id (stored and indexed) and a large text field
(indexed, not stored) containing the OCR, typically ~1.4 MB. Some limited
faceting or additional metadata fields may be added later.

The data in question currently amounts to about 1.1 TB of OCR (about 1M
docs), which we expect to increase to 10 TB over time. Pilot tests on a
desktop (2.6 GHz P4 with 2.5 GB memory, 1 GB Java heap) over ~180 MB of
data via HTTP suggest we can index at a rate sufficient to keep up with
the inputs (after getting over the initial 1.1 TB hump). We envision nightly
commits/optimizes.

We expect a low QPS rate (<10) and probably will not need
millisecond query response.

Our environment makes available Apache on blade servers (Dell 1955 dual
dual-core 3.x GHz Xeons w/ 8GB RAM) connected to a *large*,
high-performance NAS system over a dedicated (out-of-band) GbE switch
(Dell PowerConnect 5324) using a 9K MTU (jumbo packets). We are starting
with 2 blades and will add as demands require.

While we have a lot of storage, the idea of master/slave Solr Collection
Distribution to add more Solr instances clearly means duplicating an
immense index. Is it possible to use one instance to update the index
on NAS while other instances only read the index and commit to keep
their caches warm instead?

Should we expect Solr indexing time to slow significantly as we scale
up? What kind of query performance could we expect? Is it totally
naive even to consider Solr at this kind of scale?

Given these parameters is it realistic to think that Solr could handle
the task?

Any advice/wisdom greatly appreciated,

Phil
Srikant Jakilinki
2008-01-19 04:14:30 UTC
Nice description of a use-case. My 2 pennies embedded...
Post by Phillip Farber
Hello everyone,
We are considering Solr 1.2 to index and search a terabyte-scale
dataset of OCR. Initially our requirements are simple: basic
tokenizing, score sorting only, no faceting. The schema is simple
too. A document consists of a numeric id, stored and indexed and a
large text field, indexed not stored, containing the OCR typically
~1.4Mb. Some limited faceting or additional metadata fields may be
added later.
It has been my experience that large fields are a bit of a problem. If
possible, try to segment them. Just suggesting this as indexing seems to
be the bottleneck, not the queries.
Post by Phillip Farber
The data in question currently amounts to about 1.1Tb of OCR (about 1M
docs) which we expect to increase to 10Tb over time. Pilot tests on
the desktop w/ 2.6 GHz P4 with 2.5 Gb memory, java 1Gb heap on ~180 Mb
of data via HTTP suggest we can index at a rate sufficient to keep up
with the inputs (after getting over the 1.1 Tb hump). We envision
nightly commits/optimizes.
We expect to have low QPS (<10) rate and probably will not need
millisecond query response.
Our environment makes available Apache on blade servers (Dell 1955 dual
dual-core 3.x GHz Xeons w/ 8GB RAM) connected to a *large*,
high-performance NAS system over a dedicated (out-of-band) GbE switch
(Dell PowerConnect 5324) using a 9K MTU (jumbo packets). We are starting
with 2 blades and will add as demands require.
While we have a lot of storage, the idea of master/slave Solr
Collection Distribution to add more Solr instances clearly means
duplicating an immense index. Is it possible to use one instance to
update the index on NAS while other instances only read the index and
commit to keep their caches warm instead?
So, let me get this straight. You want to put the index in 'true' shared
storage -- just one copy of it on NAS, with Solr used in a diskless
fashion? If this is the case, I do not see why you cannot do what you
want to do. Just do not have a master/slave configuration at all, since
that configuration exists to keep all Solr boxes as similar and
up-to-date as possible by creating multiple copies of the index. Have one
or more Solr boxes update the index, and one or more Solr boxes search
it. Load balance the boxes yourself with a simple round robin.
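
To make that concrete, here is a rough sketch of how the indexing box could nudge the read-only instances to re-open their searchers after the nightly commit/optimize. It assumes all instances mount the same index directory on the NAS and expose the standard XML update handler; the host names and ports are made up, and whether a commit issued on a non-writing box is enough to refresh it against a shared index is exactly the open question here.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class ReaderRefresher
{
    // Read-only Solr instances sharing the index on the NAS (placeholder hosts).
    private static final String[] READERS = {
        "http://blade1:8983/solr/update",
        "http://blade2:8983/solr/update"
    };

    public static void main(String[] args) throws Exception
    {
        for (String reader : READERS) {
            HttpURLConnection conn = (HttpURLConnection) new URL(reader).openConnection();
            conn.setDoOutput(true);
            conn.setRequestMethod("POST");
            conn.setRequestProperty("Content-Type", "text/xml; charset=utf-8");
            OutputStream out = conn.getOutputStream();
            // An empty commit asks the instance to open a new searcher and warm its caches.
            out.write("<commit/>".getBytes("UTF-8"));
            out.close();
            System.out.println(reader + " -> HTTP " + conn.getResponseCode());
            conn.disconnect();
        }
    }
}
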
Post by Phillip Farber
Should we expect Solr indexing time to slow significantly as we scale
up? What kind of query performance could we expect? Is it totally
naive even to consider Solr at this kind of scale?
Solr can be used, and can prove itself, at this kind of scale. But if it is simple
index/search you are after, you might also consider writing the
index-search-update programs yourself.
Post by Phillip Farber
Given these parameters is it realistic to think that Solr could handle
the task?
I am sure it would. Please keep us posted on your approach and what
worked for you, as yours is a very generic problem and should be
documented from the use-case (your mail) to design (this thread) to
implementation (your decision) and performance (your benchmarks).
Post by Phillip Farber
Any advice/wisdom greatly appreciated,
Phil
Ryan McKinley
2008-01-19 19:09:07 UTC
Post by Phillip Farber
We are considering Solr 1.2 to index and search a terabyte-scale dataset
of OCR. Initially our requirements are simple: basic tokenizing, score
sorting only, no faceting. The schema is simple too. A document
consists of a numeric id, stored and indexed and a large text field,
indexed not stored, containing the OCR typically ~1.4Mb. Some limited
faceting or additional metadata fields may be added later.
I have not done anything on this scale... but with
https://issues.apache.org/jira/browse/SOLR-303 it will be possible to
split a large index into many smaller indices and return the union of
all results. This may or may not be necessary depending on what the
data actually looks like (if your text just uses 100 words, your index
may not be that big).

How many documents are you talking about?
Post by Phillip Farber
Should we expect Solr indexing time to slow significantly as we scale
up? What kind of query performance could we expect? Is it totally
naive even to consider Solr at this kind of scale?
You may want to check out the lucene benchmark stuff
http://lucene.apache.org/java/docs/benchmarks.html

http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/benchmark/byTask/package-summary.html


ryan
Phillip Farber
2008-01-22 19:05:14 UTC
Post by Ryan McKinley
Post by Phillip Farber
We are considering Solr 1.2 to index and search a terabyte-scale
dataset of OCR. Initially our requirements are simple: basic
tokenizing, score sorting only, no faceting. The schema is simple
too. A document consists of a numeric id, stored and indexed and a
large text field, indexed not stored, containing the OCR typically
~1.4Mb. Some limited faceting or additional metadata fields may be
added later.
https://issues.apache.org/jira/browse/SOLR-303 it will be possible to
split a large index into many smaller indices and return the union of
all results. This may or may not be necessary depending on what the
data actually looks like (if you text just uses 100 words, your index
may not be that big)
How many documents are you talking about?
Currently 1M docs @ ~1.4 MB/doc, scaling to 7M docs. This is OCR, so we
are talking perhaps 50K distinct words total to index, so as you point out
the index might not be too big. It's the *data* that is big, not the
*index*, right? So I don't think SOLR-303 (distributed search) is
required here.

Obviously as the number of documents increases the index size must
increase to some degree -- I think linearly? But what index size will
result for 7M documents over 50K words where we're talking just 2 fields
per doc: 1 id field and one OCR field of ~1.4 MB? Ballpark?

Regarding single word queries, do you think, say, 0.5 sec/query to
return 7M score-ranked IDs is possible/reasonable in this scenario?
Ryan McKinley
2008-01-22 19:37:43 UTC
Post by Phillip Farber
Obviously as the number of documents increase the index size must
increase to some degree -- I think linearly? But what index size will
result for 7M documents over 50K words where we're talking just 2 fields
per doc: 1 id field and one OCR field of ~1.4M? Ballpark?
Regarding single word queries, do you think, say, 0.5 sec/query to
return 7M score-ranked IDs is possible/reasonable in this scenario?
The only real advice I can add is to give it a try. If you have test
data, try testing it and see what happens. 1/2 sec queries are likely
possible with the right hardware and settings -- but run a few tests
before signing any contracts ;) If the index is really large, SOLR-303
should help make it more manageable.

Let us know how things go and post data to:
http://wiki.apache.org/solr/SolrPerformanceData

ryan
Mike Klaas
2008-01-22 20:37:35 UTC
Post by Phillip Farber
we are talking perhaps 50K words total to index so as you point out
the index might not be too big. It's the *data* that is big not
the *index*, right?. So I don't think SOLR-303 (distributed
search) is required here.
Obviously as the number of documents increase the index size must
increase to some degree -- I think linearly? But what index size
will result for 7M documents over 50K words where we're talking
just 2 fields per doc: 1 id field and one OCR field of ~1.4M?
Ballpark?
That's 280K tokens per document, assuming ~5 chars/word. That's 2
trillion tokens. Lucene's posting list compression is decent, but
you're still talking about a minimum of 2-4TB for the index (that's
assuming 1 or 2 bytes per token).
Regarding single word queries, do you think, say, 0.5 sec/query to
return 7M score-ranked IDs is possible/reasonable in this scenario?
Well, the average compressed posting list will be at least 80MB that
needs to be read from the NAS and decoded and ranked. Since the size
is exponentially distributed, common terms will be much bigger and
rarer terms much smaller.

You want to return all 7M ids for every query? That in itself would
be hundreds of MB of XML to generate, transfer, and parse.

0.5s seems a little optimistic.

Since your queries are so simple, I think it might be better to use
lucene directly. You can read the matching doc ids for a term
directly from the posting list in that case.

-Mike
Erick Erickson
2008-01-23 00:57:34 UTC
Just to add another wrinkle, how clean is your OCR? I've seen it
range from very nice (i.e. 99.9% of the words are actually words) to
horrible (60%+ of the "words" are nonsense). I saw one attempt
to OCR a family tree. As in a stylized tree with the data
hand-written along the various branches in every orientation. Not a
recognizable word in the bunch <G>....

Best
Erick
Phillip Farber
2008-01-23 16:15:03 UTC
For sure this is a problem. We have considered some strategies. One
might be to use a dictionary to clean up the OCR, but that gets hard for
proper names and technical jargon. Another is to use stop words (which
has the unfortunate side effect of making phrase searches like "to be or
not to be" impossible). I've heard you can't make a silk purse out of a
sow's ear ...
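
A minimal sketch of the dictionary idea, against the Lucene 2.x-era TokenFilter API, is below. It is illustration only; the dictionary set, the lowercasing, and any whitelist for proper names or jargon are assumptions, not something that has been built here.

import java.io.IOException;
import java.util.Set;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Drops tokens that are not in a known-word dictionary, i.e. likely OCR noise.
public class DictionaryFilter extends TokenFilter
{
    private final Set<String> dictionary;

    public DictionaryFilter(TokenStream in, Set<String> dictionary)
    {
        super(in);
        this.dictionary = dictionary;
    }

    public Token next() throws IOException
    {
        for (Token t = input.next(); t != null; t = input.next()) {
            if (dictionary.contains(t.termText().toLowerCase())) {
                return t;  // keep recognized words
            }
            // otherwise skip the token and look at the next one
        }
        return null;
    }
}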

Phil
Lance Norskog
2008-01-23 20:19:28 UTC
We use two indexed copies of the same text, one with stemming and stopwords
and the other with neither. We do phrase search on the second.
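
A minimal Lucene-level sketch of that dual-field setup is below (in Solr the equivalent is a copyField into two differently analyzed field types). The field names, and the choice of WhitespaceAnalyzer for the raw copy, are illustrative; a real stemming chain would need its own analyzer.

import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

// Same text indexed twice: once analyzed (stopwords removed), once raw for exact phrases.
public class DualFieldIndexer
{
    public static IndexWriter openWriter(String indexDir) throws Exception
    {
        PerFieldAnalyzerWrapper analyzer =
            new PerFieldAnalyzerWrapper(new StandardAnalyzer());      // default field: stopwords removed
        analyzer.addAnalyzer("ocr_exact", new WhitespaceAnalyzer());  // raw copy: every token kept
        return new IndexWriter(indexDir, analyzer, true);
    }

    public static void addDoc(IndexWriter writer, String id, String ocrText) throws Exception
    {
        Document doc = new Document();
        doc.add(new Field("id", id, Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("ocr", ocrText, Field.Store.NO, Field.Index.TOKENIZED));
        // "to be or not to be" stays searchable as a phrase in this field.
        doc.add(new Field("ocr_exact", ocrText, Field.Store.NO, Field.Index.TOKENIZED));
        writer.addDocument(doc);
    }
}
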

You might use two different OCR implementations and cross-correlate the
output.

Lance

Otis Gospodnetic
2008-01-21 07:04:26 UTC
Hi,
Some quick notes, since it's late here.

- You'll need to wait for SOLR-303 - there is no way even a big machine will be able to search such a large index in a reasonable amount of time, plus you may simply not have enough RAM for such a large index.

- I'd suggest you wait for Solr 1.3 (or some -dev version that uses the about-to-be-released Lucene 2.3), for performance reasons.

- As for avoiding index duplication - how about having a SAN with a single copy of the index that all searchers (and the master) point to?


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

Phillip Farber
2008-01-23 00:20:22 UTC
Post by Otis Gospodnetic
Hi,
Some quick notes, since it's late here.
- You'll need to wait for SOLR-303 - there is no way even a big machine will be able to search such a large index in a reasonable amount of time, plus you may simply not have enough RAM for such a large index.
Are you basing this on data similar to what Mike Klaas outlines?

Quoting Mike Klaas:

"That's 280K tokens per document, assuming ~5 chars/word. That's 2
trillion tokens. Lucene's posting list compression is decent, but
you're still talking about a minimum of 2-4TB for the index (that's
assuming 1 or 2 bytes per token). "

and

"Well, the average compressed posting list will be at least 80MB that
needs to be read from the NAS and decoded and ranked. Since the size is
exponentially distributed, common terms will be much bigger and rarer
terms much smaller."

End of quoting Mike Klaas.

We would need all 7M ids scored so we could push them through a filter
query to reduce them to a much smaller number on the order of 100-10,000
representing just those that correspond to items in a collection.

So to ask again, do you think it's possible to do this in, say, under 15
seconds? (I think I'm giving up on 0.5 sec. ...)
Post by Otis Gospodnetic
- I'd suggest you wait for Solr 1.3 (or some -dev version that uses the about-to-be-released Lucene 2.3)...performance reasons.
- As for avoiding index duplication - how about having a SAN with a single copy of the index that all searchers (and the master) point to?
Yes, we're thinking of a single copy of the index, using hardware-based
snapshot technology for the readers, while a dedicated indexing Solr
instance updates the index. Reasonable?
Mike Klaas
2008-01-23 01:47:14 UTC
Post by Phillip Farber
We would need all 7M ids scored so we could push them through a
filter query to reduce them to a much smaller number on the order
of 100-10,000 representing just those that correspond to items in a
collection.
You could pass the filter to Solr to improve the speed dramatically.
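
For example (sketch only: the host, the "collection" field, and the query term are hypothetical, and this assumes the standard request handler's fq filter-query parameter), the request could look something like this, so only ids inside the collection ever leave Solr:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;

public class FilteredQuery
{
    public static void main(String[] args) throws Exception
    {
        String q  = URLEncoder.encode("ocr:whale", "UTF-8");
        String fq = URLEncoder.encode("collection:melville", "UTF-8");
        // Push the collection restriction into Solr instead of post-filtering 7M ids client-side.
        URL url = new URL("http://blade1:8983/solr/select?q=" + q
                          + "&fq=" + fq + "&fl=id,score&rows=100");
        BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"));
        for (String line; (line = in.readLine()) != null;) {
            System.out.println(line);  // XML response containing only the filtered, ranked ids
        }
        in.close();
    }
}
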
Post by Phillip Farber
So to ask again, do you think it's possible to do this in, say,
under 15 seconds? (I think I'm giving up on 0.5 sec. ...)
At this point, no one is going to be able to answer your question
unless they have done something similar. The largest individual
index I've worked with is on the order of 10GB, and one thing I've
learned is not to extrapolate several orders of magnitude beyond my
experience.

-Mike
marcusherou
2008-05-09 06:37:19 UTC
Hi.

I will as well head down a path like yours within some months from now.
Currently I have an index of ~10M docs and only store ids in the index,
for performance and distribution reasons. When we enter a new market I'm
assuming we will soon hit 100M and quite soon after that 1G documents.
Each document has on average about 3-5 KB of data.

We will use a GlusterFS installation with RAID1 (or RAID10) SATA enclosures
as shared storage (think of it as a SAN, or shared storage at least: one
mount point). Hope this will be the right choice; only the future can tell.

Since we are developing a search engine, I frankly don't think even having
hundreds of Solr instances serving the index will cut it performance-wise if we
have one big index. I totally agree with the others claiming that you will most
definitely run out of memory, or hit some other constraint of Solr, if you must
hold the whole result set in memory, sort it, and create an XML response. I did hit
such constraints when I couldn't afford to give the instances enough memory,
and I had only 1M docs back then. And think of it... optimizing a TB
index will take a long, long time, and you really want an optimized
index if you want to reduce search time.

I am thinking of a sharding solution where I fragment the index over the
disk(s) and let each Solr instance have only a little piece of the total
index. This will require a master database or namenode (or, more simply, a
properties file in each index dir) of some sort to know which docs are located
on which machine, or at least how many docs each shard has. This is to
ensure that whenever you introduce a new Solr instance with a new shard, the
master indexer will know which shard to prioritize. This is probably not
enough either, since all new docs will go to the new shard until it is filled
(has the same size as the others); only then will all shards receive docs in
a load-balanced fashion. So whenever you want to add a new indexer you
probably need to initiate a "stealing" process where it steals docs from the
others until it reaches some sort of threshold (10 servers = each shard
should have 1/10 of the docs, or such).

I think this will cut it and enable us to grow with the data. I think
doing a distributed reindexing will also be a good thing when it comes to
cutting both indexing and optimizing time. Probably each indexer should
buffer its shard locally on RAID1 SCSI disks, optimize it, and then just
copy it to the main index to minimize the burden on the shared storage.

Let's say the indexing part will be all fancy and working at TB scale; now we
come to searching. I personally believe, after talking to other guys who
have built big search engines, that you need to introduce a controller-like
searcher on the client side which itself searches all of the shards and
merges the responses. Perhaps Distributed Solr solves this; I will love to
test it whenever my new installation of servers and enclosures is finished.

Currently my idea is something like this.
public Page<Document> search(SearchDocumentCommand sdc)
{
    Set<Integer> ids = documentIndexers.keySet();
    int nrOfSearchers = ids.size();
    int totalItems = 0;
    Page<Document> docs = new Page<Document>(sdc.getPage(), sdc.getPageSize());
    for (Iterator<Integer> iterator = ids.iterator(); iterator.hasNext();)
    {
        Integer id = iterator.next();
        // Pick a random replica of this shard.
        List<DocumentIndexer> indexers = documentIndexers.get(id);
        DocumentIndexer indexer = indexers.get(random.nextInt(indexers.size()));
        // Rescale the requested page across the shards and search that copy.
        SearchDocumentCommand sdc2 = copy(sdc);
        sdc2.setPage(sdc.getPage() / nrOfSearchers);
        Page<Document> res = indexer.search(sdc2);
        totalItems += res.getTotalItems();
        docs.addAll(res);
    }

    if (sdc.getComparator() != null)
    {
        // Merge-sort the partial results client-side.
        Collections.sort(docs, sdc.getComparator());
    }

    docs.setTotalItems(totalItems);

    return docs;
}

This is my RaidedDocumentIndexer which wraps a set of DocumentIndexers. I
switch from Solr to raw Lucene back and forth benchmarking and comparing
stuff so I have two implementations of DocumentIndexer (SolrDocumentIndexer
and LuceneDocumentIndexer) to make the switch easy.

I think this approach is quite OK, but the paging stuff is broken, I think.
However, the searching speed will at best be proportional to the
number of searchers, probably a lot worse. To get even more speed, each
document indexer should be put into a separate thread with something like
EDU.oswego.cs.dl.util.concurrent.FutureResult in conjunction with a thread
pool. The future result times out after, let's say, 750 msec and the client
ignores all searchers which are slower. Probably some performance metrics
should be gathered about each searcher so the client knows which indexers to
prefer over the others.
But of course if you have 50 searchers, having each client thread spawn yet
another 50 threads isn't a good thing either. So perhaps a combo of
iterative and parallel search needs to be done, with the ratio configurable.
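
A sketch of that threaded fan-out using java.util.concurrent (the standard-library successor to the EDU.oswego classes) is below. It assumes the DocumentIndexer, SearchDocumentCommand, Page and Document types from the snippet above; shards that miss the deadline are simply skipped.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ParallelSearcher
{
    private final ExecutorService pool = Executors.newFixedThreadPool(16);

    public List<Page<Document>> search(final SearchDocumentCommand sdc,
                                       List<DocumentIndexer> shards,
                                       long timeoutMillis) throws InterruptedException
    {
        // Submit one search task per shard so they run concurrently.
        List<Future<Page<Document>>> futures = new ArrayList<Future<Page<Document>>>();
        for (final DocumentIndexer shard : shards) {
            futures.add(pool.submit(new Callable<Page<Document>>() {
                public Page<Document> call() {
                    return shard.search(sdc);
                }
            }));
        }

        // Collect whatever arrives before the deadline (e.g. 750 ms); ignore the rest.
        long deadline = System.currentTimeMillis() + timeoutMillis;
        List<Page<Document>> partials = new ArrayList<Page<Document>>();
        for (Future<Page<Document>> f : futures) {
            long remaining = Math.max(0, deadline - System.currentTimeMillis());
            try {
                partials.add(f.get(remaining, TimeUnit.MILLISECONDS));
            } catch (TimeoutException e) {
                f.cancel(true);       // too slow for this request
            } catch (ExecutionException e) {
                // a failed shard contributes nothing to this result
            }
        }
        return partials;
    }
}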

The controller pattern is used by Google, I think; Peter Zaitzev
(mysqlperformanceblog) once told me so.

Hope I gave some insight into how I plan to scale to TB size, and hopefully
someone smacks me on the head and says "Hey dude, do it like this instead".

Kindly

//Marcus
James Brady
2008-05-09 07:36:44 UTC
Hi, we have an index of ~300GB, which is at least approaching the
ballpark you're in.

Lucky for us, to coin a phrase, we have an 'embarrassingly
partitionable' index, so we can just scale out horizontally across
commodity hardware with no problems at all. We're also using the
multicore features available in the development Solr version to reduce
the granularity of core size by an order of magnitude: this makes for lots
of small commits, rather than a few long ones.

There was mention somewhere in the thread of document collections: if
you're going to be filtering by collection, I'd strongly recommend
partitioning too. It makes scaling so much less painful!

James
Marcus Herou
2008-05-09 08:17:04 UTC
Cool.

Since you must certainly already have a good partitioning scheme, could you
elaborate at a high level on how you set this up?

I'm certain that I will shoot myself in the foot both once and twice before
getting it right, but this is what I'm good at: never stop trying :)
However, it is nice to start playing at least on the right side of the
football field, so a little push in the back would be really helpful.

Kindly

//Marcus
--
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
***@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/
James Brady
2008-05-09 17:56:58 UTC
So our problem is made easier by having complete index
partitionability by a user_id field. That means at one end of the
spectrum we could have one monolithic index for everyone, while at
the other end of the spectrum we could have individual cores for each
user_id.

At the moment, we've gone for a halfway house somewhere in the middle:
I've got several large EC2 instances (currently 3), each running a
single master/slave pair of Solr servers. The servers have several
cores (currently 10 - a guesstimated good number). As new users
register, I automatically distribute them across cores. I would like
to do something with clustering users based on geo-location so that
cores will get 'time off' for maintenance and optimization for that
user cluster's nighttime. I'd also like to move in the 1 core per user
direction as dynamic core creation becomes available.
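
For illustration only, a hash-based variant of that user-to-core routing might look like the sketch below. The host list, core count, and modulo scheme are made up; as described above, the real assignment happens at registration time and would be stored rather than recomputed.

public class CoreRouter
{
    // Placeholder hosts, each running several Solr cores.
    private static final String[] HOSTS = {
        "http://search1:8983/solr/",
        "http://search2:8983/solr/",
        "http://search3:8983/solr/"
    };
    private static final int CORES_PER_HOST = 10;

    // Map a user_id to a stable core URL, e.g. "http://search2:8983/solr/core7/select".
    public static String selectUrlFor(long userId)
    {
        int slot = (int) (Math.abs(userId) % (HOSTS.length * CORES_PER_HOST));
        String host = HOSTS[slot / CORES_PER_HOST];
        int core = slot % CORES_PER_HOST;
        return host + "core" + core + "/select";
    }

    public static void main(String[] args)
    {
        System.out.println(selectUrlFor(42L));
        System.out.println(selectUrlFor(1001L));
    }
}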

It seems a lot of what you're describing is really similar to
MapReduce, so I think Otis' suggestion to look at Hadoop is a good
one: it might prevent a lot of headaches, and they've already solved a
lot of the tricky problems. There are a number of ridiculously sized
projects using it to solve their scale problems, not least Yahoo...

James
Post by Marcus Herou
Cool.
Since you must certainly already have a good partitioning scheme, could you
elaborate on high level how you set this up ?
I'm certain that I will shoot myself in the foot both once and twice before
getting it right but this is what I'm good at; to never stop trying :)
However it is nice to start playing at least on the right side of the
football field so a little push in the back would be really helpful.
Kindly
//Marcus
Hi, we have an index of ~300GB, which is at least approaching the ballpark
you're in.
Lucky for us, to coin a phrase we have an 'embarassingly
partitionable'
index so we can just scale out horizontally across commodity
hardware with
no problems at all. We're also using the multicore features
available in
development Solr version to reduce granularity of core size by an order of
magnitude: this makes for lots of small commits, rather than few long ones.
There was mention somewhere in the thread of document collections: if
you're going to be filtering by collection, I'd strongly recommend
partitioning too. It makes scaling so much less painful!
James
Post by marcusherou
Hi.
I will as well head into a path like yours within some months from now.
Currently I have an index of ~10M docs and only store id's in the
index
for
performance and distribution reasons. When we enter a new market I'm
assuming we will soon hit 100M and quite soon after that 1G
documents.
Each
document have in average about 3-5k data.
We will use a GlusterFS installation with RAID1 (or RAID10) SATA enclosures
as shared storage (think of it as a SAN or shared storage at
least, one
mount point). Hope this will be the right choice, only future can tell.
Since we are developing a search engine I frankly don't think even having
100's of SOLR instances serving the index will cut it performance
wise if
we
have one big index. I totally agree with the others claiming that you most
definitely will go OOE or hit some other constraints of SOLR if you must
have the whole result in memory sort it and create a xml response.
I did
hit
such constraints when I couldn't afford the instances to have enough memory
and I had only 1M of docs back then. And think of it... Optimizing a TB
index will take a long long time and you really want to have an optimized
index if you want to reduce search time.
I am thinking of a sharding solution where I fragment the index over the
disk(s) and let each SOLR instance only have little piece of the total
index. This will require a master database or namenode (or simpler just a
properties file in each index dir) of some sort to know what docs is located
on which machine or at least how many docs each shard have. This is to
ensure that whenever you introduce a new SOLR instance with a new
shard
the
master indexer will know what shard to prioritize. This is
probably not
enough either since all new docs will go to the new shard until it
is
filled
(have the same size as the others) only then will all shards
receive docs
in
a loadbalanced fashion. So whenever you want to add a new indexer you
probably need to initiate a "stealing" process where it steals
docs from
the
others until it reaches some sort of threshold (10 servers = each shard
should have 1/10 of the docs or such).
I think this will cut it and enabling us to grow with the data. I think
doing a distributed reindexing will as well be a good thing when
it comes
to
cutting both indexing and optimizing speed. Probably each indexer should
buffer it's shard locally on RAID1 SCSI disks, optimize it and then just
copy it to the main index to minimize the burden of the shared storage.
Let's say the indexing part will be all fancy and working i TB
scale now
we
come to searching. I personally believe after talking to other guys which
have built big search engines that you need to introduce a
controller like
searcher on the client side which itself searches in all of the shards and
merges the response. Perhaps Distributed Solr solves this and will love to
test it whenever my new installation of servers and enclosures is finished.
Currently my idea is something like this.
public Page<Document> search(SearchDocumentCommand sdc)
{
Set<Integer> ids = documentIndexers.keySet();
int nrOfSearchers = ids.size();
int totalItems = 0;
Page<Document> docs = new Page(sdc.getPage(),
sdc.getPageSize());
for (Iterator<Integer> iterator = ids.iterator();
iterator.hasNext();)
{
Integer id = iterator.next();
List<DocumentIndexer> indexers = documentIndexers.get(id);
DocumentIndexer indexer =
indexers.get(random.nextInt(indexers.size()));
SearchDocumentCommand sdc2 = copy(sdc);
sdc2.setPage(sdc.getPage()/nrOfSearchers);
Page<Document> res = indexer.search(sdc);
totalItems += res.getTotalItems();
docs.addAll(res);
}
if(sdc.getComparator() != null)
{
Collections.sort(docs, sdc.getComparator());
}
docs.setTotalItems(totalItems);
return docs;
}
This is my RaidedDocumentIndexer which wraps a set of
DocumentIndexers. I
switch from Solr to raw Lucene back and forth benchmarking and comparing
stuff so I have two implementations of DocumentIndexer
(SolrDocumentIndexer
and LuceneDocumentIndexer) to make the switch easy.
I think this approach is quite OK, but the paging logic is broken, I think. And the search time will at best grow in proportion to the number of searchers, probably a lot worse. To get more speed, each document indexer should be put into a separate thread with something like EDU.oswego.cs.dl.util.concurrent.FutureResult in conjunction with a thread pool. The future result times out after, say, 750 msec, and the client ignores all searchers that are slower. Some performance metrics should probably be gathered about each searcher so the client knows which indexers to prefer over the others.
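For illustration, that fan-out could be sketched with the JDK's java.util.concurrent (the successor of the EDU.oswego package mentioned above); DocumentIndexer, SearchDocumentCommand, Page and Document are the types from the snippet above, the pool size is arbitrary, and the 750 ms budget is the one suggested in the text:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

public class ParallelSearcher
{
    private final ExecutorService pool = Executors.newFixedThreadPool(16);  // size is arbitrary

    // Query all shards in parallel and drop any shard that cannot answer within the budget,
    // e.g. searchAll(shards, sdc, 750).
    public List<Page<Document>> searchAll(List<DocumentIndexer> shards,
                                          final SearchDocumentCommand sdc,
                                          long budgetMillis) throws InterruptedException
    {
        List<Callable<Page<Document>>> tasks = new ArrayList<Callable<Page<Document>>>();
        for (final DocumentIndexer shard : shards) {
            tasks.add(new Callable<Page<Document>>() {
                public Page<Document> call() {
                    return shard.search(sdc);
                }
            });
        }
        // invokeAll cancels every task that has not finished within the budget.
        List<Future<Page<Document>>> futures =
            pool.invokeAll(tasks, budgetMillis, TimeUnit.MILLISECONDS);
        List<Page<Document>> results = new ArrayList<Page<Document>>();
        for (Future<Page<Document>> f : futures) {
            try {
                results.add(f.get());
            } catch (CancellationException slowShardIgnored) {
                // too slow: skipped, as described above
            } catch (ExecutionException failedShardIgnored) {
                // a failed shard is treated like a slow one
            }
        }
        return results;
    }
}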
But of course, if you have 50 searchers, having each client thread spawn yet another 50 threads isn't a good thing either. So perhaps a combination of iterative and parallel search needs to be done, with the ratio configurable. The controller pattern is used by Google, I think; Peter Zaitzev (mysqlperformanceblog) once told me so.
Hope I gave some insight into how I plan to scale to TB size, and hopefully someone smacks me on the head and says "Hey dude, do it like this instead".
Kindly
//Marcus
--
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
http://www.tailsweep.com/
http://blogg.tailsweep.com/
Ken Krugler
2008-05-09 20:26:19 UTC
Permalink
Hi Marcus,
Post by James Brady
It seems a lot of what you're describing is really similar to
MapReduce, so I think Otis' suggestion to look at Hadoop is a good
one: it might prevent a lot of headaches and they've already solved
a lot of the tricky problems. There a number of ridiculously sized
projects using it to solve their scale problems, not least Yahoo...
You should also look at a new project called Katta:
http://katta.wiki.sourceforge.net/
First code check-in should be happening this weekend, so I'd wait until Monday to take a look :)
-- Ken
Post by James Brady
Post by Marcus Herou
Cool.
Since you must certainly already have a good partitioning scheme, could you
elaborate on high level how you set this up ?
I'm certain that I will shoot myself in the foot both once and twice before
getting it right but this is what I'm good at; to never stop trying :)
However it is nice to start playing at least on the right side of the
football field so a little push in the back would be really helpful.
Kindly
//Marcus
Hi, we have an index of ~300GB, which is at least approaching the ballpark
you're in.
Lucky for us, to coin a phrase, we have an 'embarrassingly partitionable'
index so we can just scale out horizontally across commodity hardware with
no problems at all. We're also using the multicore features available in
development Solr version to reduce granularity of core size by an order of
magnitude: this makes for lots of small commits, rather than few long ones.
There was mention somewhere in the thread of document collections: if
you're going to be filtering by collection, I'd strongly recommend
partitioning too. It makes scaling so much less painful!
James
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"
Marcus Herou
2008-05-10 15:39:16 UTC
Permalink
Thanks Ken.
I will take a look, be sure of that :)
Kindly
//Marcus
--
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
***@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/
Lance Norskog
2008-05-09 20:54:47 UTC
Permalink
A useful schema trick: MD5 or SHA-1 ids. We generate our unique ID with the MD5 cryptographic checksumming algorithm. This takes any number of bytes of data and creates a 128-bit "random" number, i.e. 128 "random" bits. In practice there are no reports of two different datasets accidentally producing the same checksum.
This gives some handy things:
a) A fixed-size unique ID field, giving fixed space requirements. The standard representation of this is 32 hex characters, i.e. 'deadbeefdeadbeefdeadbeefdeadbeef'. You could make a special 128-bit Lucene data type for this.
b) The ability to change your mind about the uniqueness formula for your data.
c) A handy primary key for cross-correlating with other databases. Think of external DBs which supply data for some records; the primary key is the MD5 signature.
d) The ability to randomly pick subsets of your data. The record 'id:deadbeefdeadbeefdeadbeefdeadbeef' will match the wildcard string 'deadbeef*', and also 'd*'. 'd*' selects an essentially random subset of your data, 1/16 of the total size; a two-character prefix such as 'de*' gives 1/256. This works because MD5 produces such a "perfectly" random hashcode.
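A minimal sketch of generating such an ID with the JDK's MessageDigest; the class name and the example key are made up for illustration:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5Id
{
    // Turn whatever uniquely identifies a record (URL, file path, raw OCR bytes, ...)
    // into a fixed-size 32-hex-character ID suitable for a uniqueKey field.
    public static String md5Hex(String key) throws NoSuchAlgorithmException
    {
        byte[] digest = MessageDigest.getInstance("MD5")
                                     .digest(key.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder(32);
        for (byte b : digest) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws NoSuchAlgorithmException
    {
        System.out.println(md5Hex("http://example.com/ocr/volume-123"));  // 32 hex chars
    }
}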
This should go on a wiki page 'SchemaDesignTips'.
Cheers,
Lance Norskog
Otis Gospodnetic
2008-05-09 16:54:55 UTC
Permalink
Marcus,
You are headed in the right direction.
We've built a system like this at Technorati (Lucene, not Solr) and had components like the "namenode" or "controller" that you mention. If you look at the Hadoop project, you will see something similar in concept (NameNode), though it deals with raw data blocks, their placement in the cluster, etc. As a matter of fact, I am currently running its "re-balancer" in order to move some of the blocks around in the cluster. That matches what you are describing for moving documents from one shard to the other. Of course, you can simplify things and just have this central piece be aware of any new servers and simply get it to place any new docs on the new servers and create a new shard there. Or you can get fancy and take into consideration the hardware resources - the CPU, the disk space, the memory - and use that to figure out how much each machine in your cluster can handle and maximize its use based on this knowledge. :)
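Purely as an illustration of that "get fancy" option, here is a toy sketch of scoring machines by free resources before placing a new shard; the Machine fields, the weights and the class names are all invented:

import java.util.Comparator;
import java.util.List;

// Invented bookkeeping: what the central component might track per machine.
class Machine
{
    String host;
    double cpuIdleFraction;   // 0..1
    long freeDiskBytes;
    long freeHeapBytes;

    // Crude capacity score; real weights would have to come from measurement.
    double capacityScore()
    {
        return cpuIdleFraction * 0.3
             + (freeDiskBytes / 1e12) * 0.5
             + (freeHeapBytes / 1e10) * 0.2;
    }
}

class ShardPlacer
{
    // Put the next new shard (or the next batch of new docs) on the machine with the most headroom.
    static Machine pickTarget(List<Machine> cluster)
    {
        return cluster.stream()
                      .max(Comparator.comparingDouble(Machine::capacityScore))
                      .orElseThrow(() -> new IllegalStateException("empty cluster"));
    }
}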
I think Solr and Nutch are in desperate need of this central component (it must not be a SPOF!) for shard management.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
Marcus Herou
2008-05-10 16:03:49 UTC
Permalink
Hi Otis.
Thanks for the insights. Nice to get feedback from a Technorati guy. Nice to see that your snippet is almost a copy of mine; it gives me the right gut feeling about this :)
I'm quite familiar with Hadoop, as you can see if you check out the code of my OS project AbstractCache -> http://dev.tailsweep.com/projects/abstractcache/. AbstractCache is a project which aims to create storage solutions based on the Map and SortedMap interfaces. I use it everywhere in Tailsweep.com and used it as well at my former employer Eniro.se (the largest yellow pages site in Sweden). It has been in constant development for five years.
Since I'm a cluster freak of nature, I love a project named GlusterFS, where they have managed to create a system without master/slave[s] and NameNode. The advantage of this is that it is a lot more scalable; the drawback is that you can get into split-brain situations, which people on the mailing list are complaining about. Anyway, I tend to solve this with JGroups membership, where the coordinator can be any machine in the cluster, but in the group-joining process the first machine to join gets the privilege of becoming coordinator. But even with JGroups you can run into trouble with race conditions of all kinds (distributed locks, for example).
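As an illustration of that membership approach, a minimal JGroups listener could look like the sketch below (assuming a JGroups 3.x-style API; the cluster name is made up, and the rule that the first member of the view acts as coordinator is the convention described above):

import org.jgroups.Address;
import org.jgroups.JChannel;
import org.jgroups.ReceiverAdapter;
import org.jgroups.View;

public class IndexerCluster extends ReceiverAdapter
{
    private JChannel channel;

    public void start() throws Exception
    {
        channel = new JChannel();          // default protocol stack
        channel.setReceiver(this);
        channel.connect("indexer-cluster");
    }

    // Called on every membership change; by JGroups convention the first (oldest)
    // member of the view acts as coordinator, matching the scheme described above.
    @Override
    public void viewAccepted(View view)
    {
        Address coordinator = view.getMembers().get(0);
        boolean iAmCoordinator = coordinator.equals(channel.getAddress());
        System.out.println("view=" + view + ", coordinator=" + coordinator
                           + ", this node is coordinator: " + iAmCoordinator);
    }
}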
I've created an alternative to the Hadoop file system (mostly for fun) where you just add an object to the cluster and, based on which algorithm you choose, it is RAIDed or striped across the cluster.
Anyway, this was off topic, but I think my experience in building membership-aware clusters will help me in this particular case.
Kindly
//Marcus
--
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
***@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/
Marcus Herou
2008-05-11 14:42:57 UTC
Permalink
Quick reply to Otis and Ken.
Otis: After a night's sleep I think you are absolutely right that some HPC grid like Hadoop or perhaps GlusterHPC should be used, regardless of my last comment.
Ken: Looked at the architecture of Katta and it looks really nice. I really believe Katta could live as a subproject of Lucene, since it surely fills a gap that is not filled by Nutch. Nutch surely does similar stuff, but this could actually be something that Nutch uses as a component to distribute the crawled index. I will try to join the Stefan & friends team.
Kindly
//Marcus
--
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
***@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/
Otis Gospodnetic
2008-05-09 21:13:06 UTC
Permalink
You can't believe how much it pains me to see such a nice piece of work live so separately. But I also think I know why it happened :(. Do you know if Stefan & Co. have the intention to bring it under some contrib/ around here? Would that not make sense?
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
----- Original Message ----
Sent: Friday, May 9, 2008 4:26:19 PM
Subject: Re: Solr feasibility with terabyte-scale data
Hi Marcus,
Post by James Brady
It seems a lot of what you're describing is really similar to
MapReduce, so I think Otis' suggestion to look at Hadoop is a good
one: it might prevent a lot of headaches and they've already solved
a lot of the tricky problems. There a number of ridiculously sized
projects using it to solve their scale problems, not least Yahoo...
http://katta.wiki.sourceforge.net/
First code check-in should be happening this weekend, so I'd wait
until Monday to take a look :)
-- Ken
Post by James Brady
Post by Marcus Herou
Cool.
Since you must certainly already have a good partitioning scheme, could you
elaborate on high level how you set this up ?
I'm certain that I will shoot myself in the foot both once and twice before
getting it right but this is what I'm good at; to never stop trying :)
However it is nice to start playing at least on the right side of the
football field so a little push in the back would be really helpful.
Kindly
//Marcus
On Fri, May 9, 2008 at 9:36 AM, James Brady
Hi, we have an index of ~300GB, which is at least approaching the ballpark
you're in.
Lucky for us, to coin a phrase we have an 'embarassingly partitionable'
index so we can just scale out horizontally across commodity hardware with
no problems at all. We're also using the multicore features available in
development Solr version to reduce granularity of core size by an order of
magnitude: this makes for lots of small commits, rather than few long ones.
There was mention somewhere in the thread of document collections: if
you're going to be filtering by collection, I'd strongly recommend
partitioning too. It makes scaling so much less painful!
James
Post by marcusherou
Hi.
I will as well head into a path like yours within some months from now.
Currently I have an index of ~10M docs and only store id's in the index
for
performance and distribution reasons. When we enter a new market I'm
assuming we will soon hit 100M and quite soon after that 1G documents.
Each
document have in average about 3-5k data.
We will use a GlusterFS installation with RAID1 (or RAID10) SATA
enclosures
as shared storage (think of it as a SAN or shared storage at least, one
mount point). Hope this will be the right choice, only future can tell.
Since we are developing a search engine I frankly don't think even having
100's of SOLR instances serving the index will cut it performance wise if
we
have one big index. I totally agree with the others claiming that you most
definitely will go OOE or hit some other constraints of SOLR if you must
have the whole result in memory sort it and create a xml response. I did
hit
such constraints when I couldn't afford the instances to have enough
memory
and I had only 1M of docs back then. And think of it... Optimizing a TB
index will take a long long time and you really want to have an optimized
index if you want to reduce search time.
I am thinking of a sharding solution where I fragment the index over the
disk(s) and let each SOLR instance only have little piece of the total
index. This will require a master database or namenode (or simpler just a
properties file in each index dir) of some sort to know what docs is
located
on which machine or at least how many docs each shard have. This is to
ensure that whenever you introduce a new SOLR instance with a new shard
the
master indexer will know what shard to prioritize. This is probably not
enough either since all new docs will go to the new shard until it is
filled
(have the same size as the others) only then will all shards receive docs
in
a loadbalanced fashion. So whenever you want to add a new indexer you
probably need to initiate a "stealing" process where it steals docs from
the
others until it reaches some sort of threshold (10 servers = each shard
should have 1/10 of the docs or such).
I think this will cut it and enabling us to grow with the data. I think
doing a distributed reindexing will as well be a good thing when it comes
to
cutting both indexing and optimizing speed. Probably each indexer should
buffer it's shard locally on RAID1 SCSI disks, optimize it and then just
copy it to the main index to minimize the burden of the shared storage.
Let's say the indexing part will be all fancy and working i TB scale now
we
come to searching. I personally believe after talking to other guys which
have built big search engines that you need to introduce a controller like
searcher on the client side which itself searches in all of the shards and
merges the response. Perhaps Distributed Solr solves this and will love to
test it whenever my new installation of servers and enclosures is
finished.
Currently my idea is something like this.
// Relies on fields defined elsewhere in the class: a Map<Integer, List<DocumentIndexer>>
// called documentIndexers, a Random called random, and a copy(SearchDocumentCommand) helper.
// The angle-bracket generics below were eaten by the list archive; Page<Document> is assumed.
public Page<Document> search(SearchDocumentCommand sdc)
{
    Set<Integer> ids = documentIndexers.keySet();
    int nrOfSearchers = ids.size();
    int totalItems = 0;
    Page<Document> docs = new Page<Document>(sdc.getPage(), sdc.getPageSize());
    for (Iterator<Integer> iterator = ids.iterator(); iterator.hasNext();)
    {
        Integer id = iterator.next();
        // Each shard id maps to one or more replicas; pick one at random.
        List<DocumentIndexer> indexers = documentIndexers.get(id);
        DocumentIndexer indexer = indexers.get(random.nextInt(indexers.size()));
        // Naive split of the requested page across the shards (the paging admitted
        // to be broken further down).
        SearchDocumentCommand sdc2 = copy(sdc);
        sdc2.setPage(sdc.getPage() / nrOfSearchers);
        Page<Document> res = indexer.search(sdc2); // search with the adjusted copy, not the original
        totalItems += res.getTotalItems();
        docs.addAll(res);
    }
    // Re-sort the merged results before returning them.
    if (sdc.getComparator() != null)
    {
        Collections.sort(docs, sdc.getComparator());
    }
    docs.setTotalItems(totalItems);
    return docs;
}
This is my RaidedDocumentIndexer, which wraps a set of DocumentIndexers. I
switch back and forth between Solr and raw Lucene, benchmarking and comparing
stuff, so I have two implementations of DocumentIndexer (SolrDocumentIndexer
and LuceneDocumentIndexer) to make the switch easy.
I think this approach is quite OK, but the paging stuff is broken, I think.
However, the searching speed will at best be proportional to the number of
searchers, and probably a lot worse. To get even more speed, each document
indexer should be put into a separate thread, with something like
EDU.oswego.cs.dl.util.concurrent.FutureResult in conjunction with a thread
pool. The future result times out after, let's say, 750 msec, and the client
ignores all searchers which are slower. Probably some performance metrics
should be gathered about each searcher so the client knows which indexers to
prefer over the others.
But of course, if you have 50 searchers, having each client thread spawn yet
another 50 threads isn't a good thing either. So perhaps a combination of
iterative and parallel search needs to be done, with the ratio configurable.
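(A minimal sketch of that fan-out with a per-request deadline, using java.util.concurrent, the JDK successor of the EDU.oswego package mentioned above; shard results are modeled as plain Strings and the pool size is arbitrary, just to keep the example short.)

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

class ParallelSearcher
{
    // One shared pool for all requests, so 50 client threads don't each spawn 50 more.
    private final ExecutorService pool = Executors.newFixedThreadPool(16);

    // Query every shard in parallel and drop any shard that hasn't answered by the deadline.
    List<String> searchAll(List<Callable<String>> shardQueries, long timeoutMillis)
        throws InterruptedException
    {
        List<Future<String>> futures = new ArrayList<Future<String>>();
        for (Callable<String> q : shardQueries)
        {
            futures.add(pool.submit(q));
        }
        List<String> results = new ArrayList<String>();
        long deadline = System.currentTimeMillis() + timeoutMillis;
        for (Future<String> f : futures)
        {
            long remaining = deadline - System.currentTimeMillis();
            try
            {
                results.add(f.get(Math.max(0, remaining), TimeUnit.MILLISECONDS));
            }
            catch (TimeoutException e)
            {
                f.cancel(true); // too slow: ignore this searcher for this request
            }
            catch (ExecutionException e)
            {
                // a failed shard is treated like a slow one and skipped
            }
        }
        return results;
    }
}

Per-shard response times could be recorded in the same loop, feeding the "prefer the faster indexers" metric mentioned above.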
The controller pattern is used by Google, I think; Peter Zaitzev
(mysqlperformanceblog) once told me so.
Hope I gave some insight into how I plan to scale to TB size, and hopefully
someone smacks me on the head and says "Hey dude, do it like this instead".
Kindly
//Marcus
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"
Ken Krugler
2008-05-09 21:37:19 UTC
Permalink
Hi Otis,
Post by Otis Gospodnetic
You can't believe how much it pains me to see such a nice piece of
work live so separately. But I also think I know why it happened
:(. Do you know if Stefan & Co. have the intention to bring it
under some contrib/ around here? Would that not make sense?
I'm not working on the project, so I can't speak for Stefan &
friends...but my guess is that it's going to live separately as
something independent of Solr/Nutch. If you view it as search
plumbing that's usable in multiple environments, then that makes
sense. If you view it as replicating core Solr (or Nutch)
functionality, then it sucks. Not sure what the outcome will be.
-- Ken
Post by Otis Gospodnetic
----- Original Message ----
Sent: Friday, May 9, 2008 4:26:19 PM
Subject: Re: Solr feasibility with terabyte-scale data
Hi Marcus,
Post by James Brady
It seems a lot of what you're describing is really similar to
MapReduce, so I think Otis' suggestion to look at Hadoop is a good
one: it might prevent a lot of headaches and they've already solved
a lot of the tricky problems. There are a number of ridiculously sized
projects using it to solve their scale problems, not least Yahoo...
http://katta.wiki.sourceforge.net/
First code check-in should be happening this weekend, so I'd wait
until Monday to take a look :)
-- Ken
Post by Marcus Herou
Cool. Since you must certainly already have a good partitioning scheme, could
you elaborate, on a high level, on how you set this up? I'm certain that I will
shoot myself in the foot both once and twice before getting it right, but this
is what I'm good at: to never stop trying :) However, it is nice to start
playing at least on the right side of the football field, so a little push in
the back would be really helpful.
Kindly
//Marcus
On Fri, May 9, 2008 at 9:36 AM, James Brady
Post by James Brady
Hi, we have an index of ~300GB, which is at least approaching the ballpark
you're in.
Otis Gospodnetic
2008-05-09 23:20:55 UTC
Permalink
From what I can tell from the overview on http://katta.wiki.sourceforge.net/, it's a partial replication of Solr/Nutch functionality, plus some goodies. It might have been better to work those goodies into some friendly contrib/, be it Solr, Nutch, Hadoop, or Lucene. Anyhow, let's see what happens there! :)
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch