Discussion:
Solr on HDFS vs local storage - Benchmarking
Greenhorn Techie
2017-11-22 13:59:21 UTC
Hi,

Good Afternoon!!

While the discussion around issues related to "Solr on HDFS" is live, I
would like to understand if anyone has done any performance benchmarking
for both Solr indexing and search between HDFS vs local file system.

Also, from experience, what would the community folks suggest? Solr on
local file system or Solr on HDFS? Has anyone done a comparative study of
these choices?

Thanks
Hendrik Haddorp
2017-11-22 14:16:13 UTC
We did some testing and, strangely, the performance was even better with
HDFS than with the local file system. But this seems to greatly
depend on what your setup looks like and what actions you perform. We now
had a pattern with lots of small updates and commits, and that seems to be
quite a bit slower. We are about to do performance testing on that now.

The reason we switched to HDFS was largely connected to us using Docker
and Marathon/Mesos. With HDFS the data is in a shared file system and
thus it is possible to move the replica to a different instance on a
different host.

regards,
Hendrik
Greenhorn Techie
2017-11-22 16:06:26 UTC
Hendrik,

Thanks for your response.

Regarding "But this seems to greatly depend on what your setup looks like
and what actions you perform": may I know which factors influence this
and what considerations should be taken into account?

Thanks
Erick Erickson
2017-11-22 17:41:47 UTC
In my experience, for relatively static indexes the performance is
roughly similar. Once the data has been read from whatever data source,
it's in memory, and where the data came from is (largely) secondary in
importance.

In cases where there's a lot of I/O, I expect HDFS to be slower. This
fits Hendrik's observation: "We now had a pattern with lots of small
updates and commits and that seems to be quite a bit slower". He's
merging segments and (presumably) autowarming frequently, implying
lots of I/O, and HDFS adds an extra layer.

Personally, I'd use whichever is most convenient and see if the
performance is "good enough". I wouldn't recommend _installing_ HDFS
just to use it with Solr; why add another complication? If you need
the redundancy, add replicas. If you already have the HDFS
infrastructure in place and using HDFS is easier than local storage,
feel free....
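
One way to check whether performance is "good enough" on either storage
backend is a small timing harness run against each setup with the same
workload. The following is only a rough sketch, not a benchmark anyone in
this thread used: `time_calls` and `fake_query` are hypothetical names, and
`fake_query` is a placeholder that should be replaced by a real query or
update request against the Solr instance under test.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_calls(run_once: Callable[[], None],
               repeats: int = 50,
               warmup: int = 5) -> Dict[str, float]:
    """Time a callable and return latency statistics in milliseconds."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    # Warm-up calls are discarded so that caches are populated first.
    for _ in range(warmup):
        run_once()
    samples: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    # Nearest-rank style index for the 95th percentile, clamped to the list.
    p95_index = min(len(samples) - 1, int(0.95 * len(samples)))
    return {
        "count": float(len(samples)),
        "median_ms": statistics.median(samples),
        "p95_ms": samples[p95_index],
        "max_ms": samples[-1],
    }


def fake_query() -> None:
    # Placeholder workload (assumption): swap in a real Solr request here.
    sum(range(1000))


if __name__ == "__main__":
    print(time_calls(fake_query, repeats=20, warmup=2))
```

Running the same harness once against a local-storage setup and once against
an HDFS-backed setup, with identical data and queries, gives numbers that
can be compared directly.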

Best,
Erick


Hendrik Haddorp
2017-11-22 19:31:01 UTC
We actually use no autowarming. Our collections are pretty small and
the query performance is not really a problem so far. We are using lots
of collections, and most Solr caches seem to be per core and not global,
so we also have a problem with caching. I have to test the HDFS cache
some more, as that should work across collections.

We also had an HDFS setup already, so it looked like a good option to not
lose data. Earlier we had a few cases where we lost the machines, so HDFS
looked safer for that.

I would expect that the HDFS performance is also quite good if you have
lots of document adds and not so frequent commits. Frequent adds with
commits, which is likely not good in general anyway, do look quite a
bit slower than local storage so far. As we didn't see that in our
earlier tests, which were more query-focused, I said it largely depends
on what you are doing.
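
To illustrate the "lots of adds, not so frequent commits" pattern: a
client can send documents in batches and leave the commit timing to Solr
through the `commitWithin` request parameter, instead of committing after
every small update. This is a minimal sketch under assumptions, not our
actual client code; the function names, the base URL, the collection name
and the batch size are all made up for illustration, and the code only
builds the requests without sending them.

```python
import json
from typing import Dict, Iterable, Iterator, List, Tuple
from urllib.parse import urlencode


def batched(docs: Iterable[Dict], size: int) -> Iterator[List[Dict]]:
    """Yield lists of at most `size` documents."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch: List[Dict] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def build_update_requests(base_url: str,
                          collection: str,
                          docs: Iterable[Dict],
                          batch_size: int = 500,
                          commit_within_ms: int = 60000) -> List[Tuple[str, str]]:
    """Return (url, json_body) pairs, one per batch, with no explicit commit."""
    # commitWithin asks Solr to make the documents visible within the given
    # number of milliseconds, so the client never issues its own commit.
    query = urlencode({"commitWithin": commit_within_ms})
    url = f"{base_url}/{collection}/update?{query}"
    return [(url, json.dumps(batch)) for batch in batched(docs, batch_size)]


if __name__ == "__main__":
    # Hypothetical host and collection name.
    docs = [{"id": str(i)} for i in range(1200)]
    for url, body in build_update_requests("http://localhost:8983/solr", "demo", docs):
        print(url, len(json.loads(body)))
```

Each body would be POSTed with a JSON content type; 1200 documents with a
batch size of 500 become three requests rather than 1200 add-and-commit
round trips.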

Hendrik
Erick Erickson
2017-11-23 01:13:53 UTC
bq: We also had an HDFS setup already so it looked like a good option
to not lose data. Earlier we had a few cases where we lost the
machines so HDFS looked safer for that.

Right, that's one of the places where using HDFS to back Solr makes a
lot of sense. The other approach is to just have replicas for each
shard distributed across different physical machines. But whatever
works is fine.

And there are a bunch of parameters you can tune both on HDFS and for
local file systems so "it's more an art than a science".

bq: Frequent adds with commits, which is likely not good in general
anyway, do look quite a bit slower than local storage so far.

I think you can go a long way towards fixing this by doing some
autowarming. I wouldn't want to open a new searcher every second and
do much autowarming over HDFS, but if you can stand less frequent
commits (say every minute?) you might be able to smooth out the
performance....
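
As a rough sketch of the "commit every minute" idea: the soft commit
interval controls how often a new searcher can be opened, and it can be
expressed as a Config API `set-property` payload. The property name
`updateHandler.autoSoftCommit.maxTime` is my understanding of the Config
API and should be checked against the reference guide for your version;
the helper names are hypothetical, and nothing is sent to a server here.

```python
import json


def soft_commit_payload(interval_seconds: int) -> str:
    """Build a Config API body that sets the soft commit interval."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    # maxTime is expressed in milliseconds.
    return json.dumps({
        "set-property": {
            "updateHandler.autoSoftCommit.maxTime": interval_seconds * 1000
        }
    })


def max_searcher_opens_per_hour(interval_seconds: int) -> int:
    """Upper bound on new searchers per hour caused by soft commits."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return 3600 // interval_seconds


if __name__ == "__main__":
    print(soft_commit_payload(60))
    print(max_searcher_opens_per_hour(1), max_searcher_opens_per_hour(60))
```

Going from a one-second to a one-minute interval cuts the upper bound from
3600 to 60 searcher opens per hour, which is what leaves room for some
autowarming on each new searcher.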

Best,
Erick
