Solr: FileListEntityProcessor is executing sub entities multiple times

Question:

I have configured a dih-import.xml as shown below. The FileListEntityProcessor walks through some folders and then executes an XPathEntityProcessor entity and a DB entity for each file.

When I ran a full import of ~30,000 files, it took almost 3 hours. The DIH debug console showed that 2 DB calls were made for the first file found, 4 for the second, then 6, 8, and so on.

Google didn't turn up anything on this subject, so I am hoping you can help :)

Thanks in advance

<?xml version="1.0" encoding="UTF-8"?>
<dataConfig>
  <dataSource name="cr-db" jndiName="xyz" type="JdbcDataSource" />
  <dataSource name="cr-xml" type="FileDataSource" encoding="utf-8" />
  <document name="doc">
    <entity dataSource="cr-xml" name="f" processor="FileListEntityProcessor"
            baseDir="/path/to/xml" filename="*.xml" recursive="true"
            rootEntity="true" onError="skip">
      <entity name="xml-data" dataSource="cr-xml" processor="XPathEntityProcessor"
              forEach="/root" url="${f.fileAbsolutePath}"
              transformer="DateFormatTransformer" onError="skip">
        <field column="id" xpath="/root/id" />
        <field column="A" xpath="/root/a" />
      </entity>
      <entity name="db-data" dataSource="cr-db"
              query=" SELECT id, b FROM a_table WHERE id = '${f.file}'">
        <field column="B" name="b" />
      </entity>
    </entity>
  </document>
</dataConfig>

<hr />

<strong>EDIT</strong> I found the same problem reported via Google, but there is no answer there either: <a href="http://osdir.com/ml/solr-user.lucene.apache.org/2010-04/msg00138.html" rel="nofollow">http://osdir.com/ml/solr-user.lucene.apache.org/2010-04/msg00138.html</a>

<hr />

<strong>and another edit</strong>

I updated Solr from 3.6 to 4.1 and ran the importer again. The problem still exists, except that the sub-entities are now called n times for the n-th file rather than 2n times (2, 4, 6, 8, ..).

Answer1:

If the main issue is the number of hits on the database when you use JdbcDataSource, you could try switching to <a href="http://wiki.apache.org/solr/DataImportHandler#CachedSqlEntityProcessor" rel="nofollow">CachedSqlEntityProcessor</a>.
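
As a rough sketch of what that switch might look like for the db-data sub-entity from the question (untested against this setup): the query drops its per-file WHERE clause so the rows are fetched once and cached, and the `where` attribute does the lookup against `f.file`. This assumes a_table is small enough to hold in memory and that the JDBC driver returns the key column as `id` in exactly that case.

```xml
<!-- Sketch only: a possible replacement for the db-data sub-entity. -->
<!-- Assumption: a_table fits in memory, since all of its rows are cached. -->
<entity name="db-data" dataSource="cr-db"
        processor="CachedSqlEntityProcessor"
        query="SELECT id, b FROM a_table"
        where="id=f.file">
  <!-- Rows are looked up in the cache by id = f.file instead of hitting the DB per file. -->
  <field column="B" name="b" />
</entity>
```

This only reduces the load on the database; it would not by itself stop DIH from invoking the sub-entity more often than expected.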

You may also want to track <a href="https://issues.apache.org/jira/browse/SOLR-2943" rel="nofollow">SOLR-2943</a>, which aims to address exactly your problem, hopefully in the upcoming Solr 4.2.
