Debugging Solr RSS DataImportHandler
I have an existing collection and want to add an RSS importer to it. I've copied what I could glean from the example-dih/solr/rss code.
The details are below, but the bottom line is that the import seems to run, yet it says "fetched: 0" (and I get no documents). There are no exceptions in the Tomcat log.
Questions:
- Is there a way to turn on debugging for RSS importers?
- Can I see Solr's actual request and the response it gets?
- What would cause the request to succeed but no rows to be fetched?
- Is there a tutorial on adding an RSS DIH to an existing collection?
Thanks!
My solrconfig.xml file contains this requestHandler:
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">rss-data-config.xml</str>
  </lst>
</requestHandler>
And rss-data-config.xml:
<dataConfig>
  <dataSource type="URLDataSource" />
  <document>
    <entity name="slashdot"
            pk="link"
            url="http://rss.slashdot.org/slashdot/slashdot"
            processor="XPathEntityProcessor"
            forEach="/rss/channel | /rss/item"
            transformer="DateFormatTransformer">
      <field column="source_name" xpath="/rss/channel/title" commonField="true" />
      <field column="title" xpath="/rss/item/title" />
      <field column="link" xpath="/rss/item/link" />
      <field column="body" xpath="/rss/item/description" />
      <field column="date" xpath="/rss/item/date" dateTimeFormat="yyyy-MM-dd'T'HH:mm:ss" />
    </entity>
  </document>
</dataConfig>
And schema.xml:
<fields>
  <field name="title" type="text_general" required="true" indexed="true" stored="true"/>
  <field name="link" type="string" required="true" indexed="true" stored="true"/>
  <field name="source_name" type="text_general" required="true" indexed="true" stored="true"/>
  <field name="body" type="text_general" required="false" indexed="false" stored="true"/>
  <field name="date" type="date" required="true" indexed="true" stored="true" />
  <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
  <field name="_version_" type="long" indexed="true" stored="true"/>
</fields>
When I run the dataimport from the admin web page, it all seems to go well. It shows "Requests: 1" and there are no exceptions in the Tomcat log:
Mar 12, 2013 9:02:58 PM org.apache.solr.handler.dataimport.DataImporter maybeReloadConfiguration
INFO: Loading DIH Configuration: rss-data-config.xml
Mar 12, 2013 9:02:58 PM org.apache.solr.handler.dataimport.DataImporter loadDataConfig
INFO: Data Configuration loaded
Mar 12, 2013 9:02:58 PM org.apache.solr.handler.dataimport.DataImporter doFullImport
INFO: Starting Full Import
Mar 12, 2013 9:02:58 PM org.apache.solr.handler.dataimport.SimplePropertiesWriter readIndexerProperties
INFO: Read dataimport.properties
Mar 12, 2013 9:02:59 PM org.apache.solr.handler.dataimport.DocBuilder execute
INFO: Time taken = 0:0:0.693
Mar 12, 2013 9:02:59 PM org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: [articles] webapp=/solr path=/dataimport params={optimize=false&clean=false&indent=true&commit=false&verbose=true&entity=slashdot&command=full-import&debug=true&wt=json} {} 0 706
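Your problem here is due to your rss-data-config.xml and the XPaths defined in it.
If you open the URL http://rss.slashdot.org/slashdot/slashdot in Internet Explorer and hit F12 for the developer tools, it will show you the structure of the document.
You can see that the <item> node is a child of <channel>, not of <rss>. Since /rss/item matches nothing, no rows are fetched even though the request itself succeeds. Your config should be as follows: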
<?xml version="1.0" encoding="UTF-8" ?>
<dataConfig>
  <dataSource type="URLDataSource" />
  <document>
    <entity name="slashdot"
            pk="link"
            url="http://rss.slashdot.org/slashdot/slashdot"
            processor="XPathEntityProcessor"
            forEach="/rss/channel | /rss/channel/item"
            transformer="DateFormatTransformer">
      <field column="source_name" xpath="/rss/channel/title" commonField="true" />
      <field column="title" xpath="/rss/channel/item/title" />
      <field column="link" xpath="/rss/channel/item/link" />
      <field column="body" xpath="/rss/channel/item/description" />
      <field column="date" xpath="/rss/channel/item/date" dateTimeFormat="yyyy-MM-dd'T'HH:mm:ss" />
    </entity>
  </document>
</dataConfig>
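If you want to sanity-check paths like these outside Solr, a rough sketch along the following lines may help. It is not part of DIH: it uses Python's standard xml.etree.ElementTree against a made-up, trimmed-down feed (SAMPLE_FEED and count_matches are hypothetical names, and the sample is not the real Slashdot output), and it only handles simple absolute paths of the /a/b/c kind.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down RSS 2.0 document used only for illustration;
# it is NOT the real Slashdot feed.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First story</title>
      <link>http://example.com/1</link>
      <description>Body of the first story</description>
    </item>
    <item>
      <title>Second story</title>
      <link>http://example.com/2</link>
      <description>Body of the second story</description>
    </item>
  </channel>
</rss>"""


def count_matches(xml_text, absolute_path):
    """Count elements matching a simple absolute path such as /rss/channel/item."""
    root = ET.fromstring(xml_text)
    steps = absolute_path.strip("/").split("/")
    # The first step has to name the root element itself.
    if steps[0] != root.tag:
        return 0
    if len(steps) == 1:
        return 1
    # ElementTree paths are relative to the root element, so drop the first step.
    return len(root.findall("/".join(steps[1:])))


old_items = count_matches(SAMPLE_FEED, "/rss/item")          # path from the question
new_items = count_matches(SAMPLE_FEED, "/rss/channel/item")  # path from the corrected config
print("matches for /rss/item:", old_items)
print("matches for /rss/channel/item:", new_items)
```

With a feed shaped like the sample, the path from the question matches nothing while the corrected one matches every item, which is consistent with a run that succeeds but fetches 0 rows.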