Apache Solr – indexing text files

The problem?

While following the Apache Solr documentation, I couldn’t understand why none of my files were being read or indexed by Solr. My files are plain text files, and whenever I used cURL to try indexing one of them, I kept getting ‘Invalid character’ errors. I was perplexed.

What I tried: curl http://localhost:8983/solr/update --data-binary testfile.txt -H 'Content-type:text/xml; charset=utf-8'

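What I received: Unexpected_character_1_code_49_in_prolog_expected___at_rowcol_unknownsource_11

In other words, the request declared its content as XML (Content-type: text/xml), so the XML parser behind /update choked on the very first character it received, a ‘1’ (character code 49), where it expected ‘<’.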
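A quick check before posting can catch this kind of mismatch. The sketch below is my own illustration, not something from the Solr docs: it tests whether a file's first non-whitespace character is ‘<’, which is what an XML parser expects to see. The function name looks_like_xml is made up, and this is only a rough heuristic, not real XML validation.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit 0) if the file's first
# non-whitespace character is '<', fails otherwise.
looks_like_xml() {
  # Strip spaces, tabs and line breaks, then keep only the first byte.
  first=$(tr -d ' \t\r\n' < "$1" | head -c 1)
  [ "$first" = "<" ]
}
```

Usage: looks_like_xml testfile.txt || echo "not XML, use /update/extract instead"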

The solution?

Solr needs to be configured to index non-XML files, and the indexing command (the cURL update call) needs to be modified slightly:

  1. Edit solr/conf/schema.xml
    1. On line ~450 add the following entry:
      1. <field name="body" type="text" indexed="true" stored="true" multiValued="true"/>
    2. On line ~540 add the following entry:
      1. <copyField source="body" dest="text"/>
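  2. Use a different cURL upload command
    1. curl "http://localhost:8983/solr/update/extract?literal.id=testfile&uprefix=attr_&fmap.content=body&commit=true" -F "myfile=@./testfile.txt"
    2. literal.id sets the unique document ID for your file; fmap.content=body maps the extracted text to the body field added in step 1; uprefix=attr_ prefixes any extracted fields not defined in the schema with attr_; commit=true commits the change right away so the document becomes searchable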
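If you have a whole folder of text files, the same command can be wrapped in a loop. The following is a minimal sketch under a few assumptions of mine: the helper names (solr_extract_url, index_text_files) and the SOLR_BASE variable are hypothetical, the files end in .txt, and the file names contain nothing that needs URL-encoding, since the name (minus .txt) is used directly as literal.id.

```shell
#!/bin/sh
# Assumed default; adjust to your own Solr installation.
SOLR_BASE="${SOLR_BASE:-http://localhost:8983/solr}"

# Hypothetical helper: build the extract URL for a given document ID.
# Assumes the ID needs no URL-encoding.
solr_extract_url() {
  printf '%s/update/extract?literal.id=%s&uprefix=attr_&fmap.content=body&commit=true\n' \
    "$SOLR_BASE" "$1"
}

# Hypothetical helper: post every .txt file in the given directory,
# using the file name without its .txt extension as the document ID.
index_text_files() {
  for f in "$1"/*.txt; do
    [ -e "$f" ] || continue        # skip if the glob matched nothing
    id=$(basename "$f" .txt)
    curl "$(solr_extract_url "$id")" -F "myfile=@$f"
  done
}
```

Usage: index_text_files ./docs. Committing after every file keeps the example close to the command above; for many files it would probably be cheaper to drop commit=true and commit once at the end.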

For more information, see the following guide: http://wiki.apache.org/solr/ExtractingRequestHandler
