Discussion:
What is the correct URL for POSTing new data?
Christopher Schultz
2018-04-13 13:49:18 UTC
All,

I've recently been encountering some frustrations with Solr 7.3 after
configuring TLS; since the command-line tools (which are a breeze to use
when you have a "toy" Solr installation) stop working when TLS is
enabled, I'm finding myself having to perform the following tasks in
order to get bin/post to work:

1. patch bin/post:

234,235c234,235
< echo "$JAVA" -classpath "${TOOL_JAR[0]}" "${PROPS[@]}"
org.apache.solr.util.SimplePostTool "${PARAMS[@]}"
< "$JAVA" -classpath "${TOOL_JAR[0]}" "${PROPS[@]}"
org.apache.solr.util.SimplePostTool "${PARAMS[@]}"
---
org.apache.solr.util.SimplePostTool "${PARAMS[@]}"


2. Run the command with lots of manual options:

$ SOLR_POST_OPTS="-Djavax.net.ssl.trustStore=/etc/solr/solr-client.p12
-Djavax.net.ssl.trustStorePassword=whatevs
-Djavax.net.ssl.trustStoreType=PKCS12" /usr/local/solr/bin/post -c
new_core https://localhost:8983/solr/new_core

[time passes while bin/post uploads a very large file]

SimplePostTool version 5.0.0
Posting files to [base] url https://localhost:8983/solr/new_core...
Entering auto mode. File endings considered are
xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file new_core.json (application/json) to [base]/json/docs
SimplePostTool: WARNING: Solr returned an error #404 (Not Found) for
url: https://localhost:8983/solr/new_core/json/docs
SimplePostTool: WARNING: Response: <html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Error 404 Not Found</title>
</head>
<body><h2>HTTP ERROR 404</h2>
<p>Problem accessing /solr/new_core/json/docs. Reason:
<pre> Not Found</pre></p>
</body>
</html>
SimplePostTool: WARNING: IOException while reading response:
java.io.FileNotFoundException:
https://localhost:8983/solr/new_core/json/docs
1 files indexed.
COMMITting Solr index changes to https://localhost:8983/solr/new_core...
SimplePostTool: WARNING: Solr returned an error #404 (Not Found) for
url: https://localhost:8983/solr/new_core?commit=true
SimplePostTool: WARNING: Response: <html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Error 404 Not Found</title>
</head>
<body><h2>HTTP ERROR 404</h2>
<p>Problem accessing /solr/new_core. Reason:
<pre> Not Found</pre></p>
</body>
</html>
Time spent: 0:00:04.710

I'm guessing that I just don't know what the URL is supposed to be for
that core. When browsing the web UI, I can examine the core here:

https://localhost:8983/solr/#/~cores/new_core

Solr reports:

startTime: a day ago
instanceDir: /var/solr/data/new_core
dataDir: /var/solr/data/new_core/data/

Index
lastModified: -
version: 2
numDocs: 0
maxDoc: 0
deletedDocs: 0
current: [check-mark]


So the core is there. I suspect I'm simply not addressing it correctly.
How should I modify the URL I pass on the command-line so that bin/post
can inject a new batch of data?

Thanks,
-chris
Shawn Heisey
2018-04-13 22:02:58 UTC
Post by Christopher Schultz
$ SOLR_POST_OPTS="-Djavax.net.ssl.trustStore=/etc/solr/solr-client.p12
-Djavax.net.ssl.trustStorePassword=whatevs
-Djavax.net.ssl.trustStoreType=PKCS12" /usr/local/solr/bin/post -c
new_core https://localhost:8983/solr/new_core
[time passes while bin/post uploads a very large file]
SimplePostTool version 5.0.0
Posting files to [base] url https://localhost:8983/solr/new_core...
Entering auto mode. File endings considered are
xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file new_core.json (application/json) to [base]/json/docs
SimplePostTool: WARNING: Solr returned an error #404 (Not Found) for
url: https://localhost:8983/solr/new_core/json/docs
The URL path (beyond the core name) it's ending up with is /json/docs,
when it should be /update/json/docs.

If you hadn't given the command a specific URL, it probably would have
figured out the correct URL on its own.  The base URL for the post tool
normally includes the /update path, which is different than the base URL
for something like HttpSolrClient (in the SolrJ library).  Changing the
handler path is done differently in SolrJ than it is with the post tool.

I know, we've violated that principle again. :)

The bin/post tool is a *simple* tool.  The java class that it calls is
even named "SimplePostTool".  It is expected that most users will
outgrow its functionality quickly and write their own indexing software
that does whatever custom processing they require.  The tool doesn't get
a lot of improvements because we don't intend it to be used as a
production indexing mechanism.  If it does what you need, there's
nothing wrong with production usage, but you need to be aware that it
doesn't have robust error handling, which is usually pretty important
for production.

Thanks,
Shawn
Christopher Schultz
2018-04-15 20:24:46 UTC
Shawn,
Post by Shawn Heisey
Post by Christopher Schultz
$ SOLR_POST_OPTS="-Djavax.net.ssl.trustStore=/etc/solr/solr-client.p12
-Djavax.net.ssl.trustStorePassword=whatevs
-Djavax.net.ssl.trustStoreType=PKCS12" /usr/local/solr/bin/post -c
new_core https://localhost:8983/solr/new_core
[time passes while bin/post uploads a very large file]
SimplePostTool version 5.0.0
Posting files to [base] url https://localhost:8983/solr/new_core...
Entering auto mode. File endings considered are
xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file new_core.json (application/json) to [base]/json/docs
SimplePostTool: WARNING: Solr returned an error #404 (Not Found) for
url: https://localhost:8983/solr/new_core/json/docs
The URL path (beyond the core name) it's ending up with is /json/docs,
when it should be /update/json/docs.
Looks like that worked. I could find that nowhere in the documentation.
Post by Shawn Heisey
If you hadn't given the command a specific URL, it probably would
have figured out the correct URL on its own.
No, it wouldn't have. It doesn't read any configuration files and
guesses its way through everything. Simply adding HTTPS support
required me to modify the script and manually-specify the URL. That's
why I went through the trouble of explaining so in my initial post.
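For anyone who would rather not patch the script, here is a minimal sketch of calling the tool class directly, mirroring the java invocation that the patched line of bin/post wraps. It only prints the command rather than running it. The TOOL_JAR path and the input file name are assumptions to adjust for a given install; the trust store settings are the ones from my original command, and the URL includes the /update path discussed above.

```shell
#!/bin/sh
# Sketch only: prints the java command instead of executing it.
# TOOL_JAR is an assumed location for a jar containing SimplePostTool.
TOOL_JAR="${TOOL_JAR:-/usr/local/solr/example/exampledocs/post.jar}"
UPDATE_URL="https://localhost:8983/solr/new_core/update"

post_cmd() {
  # $1 = file to index
  echo java \
    -Djavax.net.ssl.trustStore=/etc/solr/solr-client.p12 \
    -Djavax.net.ssl.trustStorePassword=whatevs \
    -Djavax.net.ssl.trustStoreType=PKCS12 \
    -Dauto=yes \
    "-Durl=$UPDATE_URL" \
    -classpath "$TOOL_JAR" \
    org.apache.solr.util.SimplePostTool "$1"
}

CMD=$(post_cmd new_core.json)
echo "$CMD"
```

Dropping the echo inside post_cmd would run the command for real; the JSSE trust store properties are passed as ordinary JVM system properties, the same way the patched script passes them.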
Post by Shawn Heisey
The base URL for the post tool normally includes the /update path,
which is different than the base URL for something like
HttpSolrClient (in the SolrJ library). Changing the handler path
is done differently in SolrJ than it is with the post tool.
I know, we've violated that principle again. :)
;)

I don't mind all surprises. It's the ones that have zero documentation
that are the most surprising.
Post by Shawn Heisey
The bin/post tool is a *simple* tool. The java class that it calls
is even named "SimplePostTool". It is expected that most users
will outgrow its functionality quickly and write their own indexing
software that does whatever custom processing they require. The
tool doesn't get a lot of improvements because we don't intend it
to be used as a production indexing mechanism.
I'm using it as a bulk-loading operation. I have no need in production
to completely bootstrap a document collection unless the existing one
has been trashed for some reason. Why bother writing my own client
that does the equivalent of "SELECT * FROM table" and then loops over
the ResultSet calling SolrJ's add-document method?

The SimplePostTool should be able to handle that for me, and if it
did, I'd have less code to babysit in perpetuity.
Post by Shawn Heisey
If it does what you need, there's nothing wrong with production
usage, but you need to be aware that it doesn't have robust error
handling, which is usually pretty important for production.
I'm okay with terse error messages.

-chris
Shawn Heisey
2018-04-15 20:33:53 UTC
Post by Christopher Schultz
No, it wouldn't have. It doesn't read any configuration files and
guesses its way through everything. Simply adding HTTPS support
required me to modify the script and manually-specify the URL. That's
why I went through the trouble of explaining so in my initial post.
Gotcha.  I haven't used SSL with Solr myself.  Nobody can get directly
to the Solr servers, so we don't need it.  If somebody is able to
penetrate our systems to the point where they can sniff Solr traffic,
they will already have full access to things far more sensitive than our
search index.

I'll see what I can do about the documentation to make it clear that the
URL given to the post tool needs the request handler path.

Thanks,
Shawn
Christopher Schultz
2018-04-16 13:05:32 UTC
Shawn,
Post by Christopher Schultz
No, it wouldn't have. It doesn't read any configuration files
and guesses its way through everything. Simply adding HTTPS
support required me to modify the script and manually-specify the
URL. That's why I went through the trouble of explaining so in my
initial post.
Post by Shawn Heisey
Gotcha. I haven't used SSL with Solr myself. Nobody can get
directly to the Solr servers, so we don't need it. If somebody is
able to penetrate our systems to the point where they can sniff
Solr traffic, they will already have full access to things far more
sensitive than our search index.
Not necessarily, but that depends entirely upon your environment. We
have a policy of "no privileged network positions" so we don't even
trust our "private networks". Someone at the data center could
inadvertently configure a switch port to suddenly join our VLAN or a
network plug might be incorrectly assigned, etc. So we don't want our
data flying around in a way that can be intercepted.
Post by Shawn Heisey
I'll see what I can do about the documentation to make it clear
that the URL given to the post tool needs the request handler
path.
That would be great. Even poking around in the Solr web UI doesn't
reveal that path because of all the JavaScript magic in the interface.

It's unreasonable to expect everyone to read source code in order to
learn how to use tools that don't require direct programming.

Let me take a step back and say that Solr in fact has great
documentation. There are evidently some things it lacks for the
uninitiated.

Thanks,
-chris
