Solr: FileListEntityProcessor is executing sub entities multiple times


I have configured a dih-import.xml as shown below. The FileListEntityProcessor walks through some folders and then, for each file, executes an XPathEntityProcessor entity and a DB entity.

When I ran a full import for ~30,000 files, it took almost 3 hours. The DIH debug console showed that 2 DB calls were made for the first file found, 4 for the second, then 6, 8, and so on.

Google didn't turn up anything on this subject, so I'm hoping you can help :)

Thanks in advance

<?xml version="1.0" encoding="UTF-8"?>
<dataConfig>
  <dataSource name="cr-db" jndiName="xyz" type="JdbcDataSource" />
  <dataSource name="cr-xml" type="FileDataSource" encoding="utf-8" />
  <document name="doc">
    <entity dataSource="cr-xml" name="f" processor="FileListEntityProcessor"
            baseDir="/path/to/xml" filename="*.xml" recursive="true"
            rootEntity="true" onError="skip">
      <entity name="xml-data" dataSource="cr-xml" processor="XPathEntityProcessor"
              forEach="/root" url="${f.fileAbsolutePath}"
              transformer="DateFormatTransformer" onError="skip">
        <field column="id" xpath="/root/id" />
        <field column="A" xpath="/root/a" />
      </entity>
      <entity name="db-data" dataSource="cr-db"
              query="SELECT id, b FROM a_table WHERE id = '${f.file}'">
        <field column="B" name="b" />
      </entity>
    </entity>
  </document>
</dataConfig>

<hr />

<strong>EDIT</strong> I found the same problem reported elsewhere via Google, but there is no answer there either: <a href="http://osdir.com/ml/solr-user.lucene.apache.org/2010-04/msg00138.html" rel="nofollow">http://osdir.com/ml/solr-user.lucene.apache.org/2010-04/msg00138.html</a>

<hr />

<strong>Another edit</strong>

I updated Solr from 3.6 to 4.1 and ran the importer again. The problem still exists, except that the sub-entities are no longer called 2n times (2, 4, 6, 8, ...) but only n times.


If the main issue is the number of hits on the database when you use JdbcDataSource, you could try switching the DB entity to <a href="http://wiki.apache.org/solr/DataImportHandler#CachedSqlEntityProcessor" rel="nofollow">CachedSqlEntityProcessor</a>, which runs the query once and serves later lookups from an in-memory cache.
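
As a rough sketch of what that could look like for the `db-data` entity from the question (untested against your setup): the query drops its `WHERE` clause so the whole table is read once, and each file is then matched against the cache. The `cacheKey` and `cacheLookup` attributes are, as far as I know, available from Solr 3.6 onward; the wiki examples for older versions use a `where="id=f.file"` attribute instead. This assumes `a_table` is small enough to fit in memory and that `id` is a string column whose values match `${f.file}` exactly.

```xml
<!-- Sketch only: replaces the db-data entity nested inside entity "f". -->
<!-- Assumption: a_table fits in memory, since every row is cached. -->
<entity name="db-data" dataSource="cr-db"
        processor="CachedSqlEntityProcessor"
        query="SELECT id, b FROM a_table"
        cacheKey="id" cacheLookup="f.file">
  <!-- cacheKey: column of this entity used as the cache key -->
  <!-- cacheLookup: parent entity field looked up in the cache -->
  <field column="B" name="b" />
</entity>
```

With this in place the debug console should show a single query against `a_table` for the whole import rather than one or more per file, though it is worth confirming that on a small subset of files first.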

You may also want to track <a href="https://issues.apache.org/jira/browse/SOLR-2943" rel="nofollow">SOLR-2943</a>, which aims to address exactly your problem, hopefully in the upcoming Solr 4.2.


