Debugging Solr RSS DataImportHandler


I have an existing collection and want to add an RSS importer to it. I've copied what I could glean from the example-DIH/solr/rss code.

The details are below, but the bottom line is that the import seems to run, yet it says "Fetched: 0" (and I get no documents). There are no exceptions in the Tomcat log.

Questions:

  • Is there a way to turn on debugging for RSS importers?
  • Can I see Solr's actual request and response?
  • What would cause the request to succeed but fetch no rows?
  • Is there a tutorial for adding an RSS DIH to an existing collection?

Thanks!

My solrconfig.xml file contains this requestHandler:

<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
        <str name="config">rss-data-config.xml</str>
    </lst>
</requestHandler>

And rss-data-config.xml:

<dataConfig>
    <dataSource type="URLDataSource" />
    <document>
        <entity name="slashdot"
                pk="link"
                url="http://rss.slashdot.org/slashdot/slashdot"
                processor="XPathEntityProcessor"
                forEach="/rss/channel | /rss/item"
                transformer="DateFormatTransformer">

            <field column="source_name" xpath="/rss/channel/title" commonField="true" />

            <field column="title" xpath="/rss/item/title" />
            <field column="link" xpath="/rss/item/link" />
            <field column="body" xpath="/rss/item/description" />
            <field column="date" xpath="/rss/item/date" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss" />
        </entity>
    </document>
</dataConfig>

And schema.xml:

<fields>
    <field name="title" type="text_general" required="true" indexed="true" stored="true"/>
    <field name="link" type="string" required="true" indexed="true" stored="true"/>
    <field name="source_name" type="text_general" required="true" indexed="true" stored="true"/>
    <field name="body" type="text_general" required="false" indexed="false" stored="true"/>
    <field name="date" type="date" required="true" indexed="true" stored="true"/>
    <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
    <field name="_version_" type="long" indexed="true" stored="true"/>
</fields>

When I run the dataimport from the admin web page, it all seems to go well. It shows "Requests: 1" and there are no exceptions in the Tomcat log:

Mar 12, 2013 9:02:58 PM org.apache.solr.handler.dataimport.DataImporter maybeReloadConfiguration
INFO: Loading DIH Configuration: rss-data-config.xml
Mar 12, 2013 9:02:58 PM org.apache.solr.handler.dataimport.DataImporter loadDataConfig
INFO: Data Configuration loaded
Mar 12, 2013 9:02:58 PM org.apache.solr.handler.dataimport.DataImporter doFullImport
INFO: Starting Full Import
Mar 12, 2013 9:02:58 PM org.apache.solr.handler.dataimport.SimplePropertiesWriter readIndexerProperties
INFO: Read dataimport.properties
Mar 12, 2013 9:02:59 PM org.apache.solr.handler.dataimport.DocBuilder execute
INFO: Time taken = 0:0:0.693
Mar 12, 2013 9:02:59 PM org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: [articles] webapp=/solr path=/dataimport params={optimize=false&clean=false&indent=true&commit=false&verbose=true&entity=slashdot&command=full-import&debug=true&wt=json} {} 0 706
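
For reference, the request recorded in the last log entry can be rebuilt and opened directly in a browser or with curl, which returns the handler's raw response instead of the admin page summary. This is a minimal sketch: the parameters are taken from the log, the core name articles is inferred from the log, and the host and port (localhost:8080) are assumptions that need adjusting. build_dataimport_url is an illustrative helper, not part of Solr.

```python
from urllib.parse import urlencode

# Parameters copied from the params={...} part of the last log entry.
DIH_PARAMS = {
    "command": "full-import",
    "entity": "slashdot",
    "clean": "false",
    "commit": "false",
    "optimize": "false",
    "verbose": "true",
    "debug": "true",
    "indent": "true",
    "wt": "json",
}


def build_dataimport_url(base_url, core, params):
    """Return the /dataimport request URL for the given core (hypothetical helper)."""
    return f"{base_url.rstrip('/')}/{core}/dataimport?{urlencode(params)}"


# Assumption: Tomcat serves Solr at localhost:8080; change this to the real host and port.
url = build_dataimport_url("http://localhost:8080/solr", "articles", DIH_PARAMS)
print(url)
```

With debug=true and verbose=true in the request, the JSON that comes back is the place to look for what each entity actually processed.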

Your problem here is due to your rss-data-config.xml and the XPaths defined in it.

If you open the URL http://rss.slashdot.org/slashdot/slashdot in Internet Explorer and hit F12 to open the developer tools, it will show you the structure of the feed's markup.

You can see that the <item> node is a child of <channel>, not of <rss>, so /rss/item never matches anything and no rows are fetched. Your config should be as follows:

<?xml version="1.0" encoding="UTF-8" ?>
<dataConfig>
    <dataSource type="URLDataSource" />
    <document>
        <entity name="slashdot"
                pk="link"
                url="http://rss.slashdot.org/slashdot/slashdot"
                processor="XPathEntityProcessor"
                forEach="/rss/channel | /rss/channel/item"
                transformer="DateFormatTransformer">

            <field column="source_name" xpath="/rss/channel/title" commonField="true" />

            <field column="title" xpath="/rss/channel/item/title" />
            <field column="link" xpath="/rss/channel/item/link" />
            <field column="body" xpath="/rss/channel/item/description" />
            <field column="date" xpath="/rss/channel/item/date" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss" />
        </entity>
    </document>
</dataConfig>
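
The difference between the two paths can be checked outside Solr. The sketch below uses Python's standard ElementTree on a small made-up feed (SAMPLE_FEED is a hypothetical stand-in, not the real Slashdot response). It only illustrates the document structure; Solr's XPathEntityProcessor has its own XPath handling, so treat this as a sanity check rather than a reproduction of what DIH does.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for an RSS 2.0 feed: <item> is nested inside <channel>.
SAMPLE_FEED = """
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>First</title><link>http://example.com/1</link></item>
    <item><title>Second</title><link>http://example.com/2</link></item>
  </channel>
</rss>
"""

root = ET.fromstring(SAMPLE_FEED.strip())  # the root element is <rss>


def count(path):
    """Count elements matching a path relative to the <rss> root."""
    return len(root.findall(path))


old_matches = count("item")          # corresponds to /rss/item
new_matches = count("channel/item")  # corresponds to /rss/channel/item
print(old_matches, new_matches)
```

On this sample the first path matches nothing and the second matches both items, which is consistent with an import that runs cleanly but fetches zero rows.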
