
Massive number of XML edits

I need to load a mid-sized XML file into memory, make many random-access modifications to it (perhaps hundreds of thousands), then write the result to standard output. Most of these modifications will be node insertions/deletions, as well as character insertions/deletions within the text nodes. The XML files will be small enough to fit into memory, but large enough that I won't want to keep multiple copies around.

I am trying to settle on the architecture/libraries and am looking for suggestions.

Here is what I have come up with so far:

I am looking for the ideal XML library for this, and so far I haven't found anything that fits the bill. The libraries generally store nodes in Haskell lists and text in Data.Text values. Inserting into or deleting from a list of nodes takes time linear in the position, and I believe that each insert/delete on a Data.Text value has to copy the entire text.

I think storing both nodes and text in sequences (Data.Sequence) is the way to go. Data.Sequence supports O(log N) inserts and deletes, and only needs to rebuild a small fraction of the tree on each alteration. None of the XML libraries are based on it, though, so I will have to either write my own library or use one of the existing libraries to parse and then convert the result to my own form (given how easy it is to parse XML, I would almost just as soon do the former, rather than have a shadow parse of everything).
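As a minimal sketch of the idea above, assuming the `containers` package's Data.Sequence: the helper names `insertText` and `deleteText` are made up for illustration and are not part of any library. Each edit splits the sequence at the target index and rejoins the pieces, so only the spine near the edit point is rebuilt.

```haskell
import Data.Foldable (toList)
import Data.Sequence (Seq, (><))
import qualified Data.Sequence as Seq

-- Insert a run of characters at index i (hypothetical helper).
-- The split and the joins are logarithmic; building the new run
-- is linear in the length of the inserted string only.
insertText :: Int -> String -> Seq Char -> Seq Char
insertText i new s = before >< Seq.fromList new >< after
  where
    (before, after) = Seq.splitAt i s

-- Delete len characters starting at index i (hypothetical helper).
deleteText :: Int -> Int -> Seq Char -> Seq Char
deleteText i len s = before >< Seq.drop len rest
  where
    (before, rest) = Seq.splitAt i s

main :: IO ()
main = do
  let s0 = Seq.fromList "hello world"
      s1 = insertText 5 "," s0
      s2 = deleteText 0 7 s1
  putStrLn (toList s1)  -- hello, world
  putStrLn (toList s2)  -- world
```

The same two functions would work unchanged on a `Seq` of nodes if their types were generalized from `Char` to any element type.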

I had briefly considered the possibility that this might be a rare case where Haskell is not the best tool. But then I realized that mutability doesn't offer much of an advantage here, because my modifications aren't character replacements but insertions and deletions. If I wrote this in C, I would still need to store the strings/nodes in some sort of tree structure to avoid large byte moves on each insert/delete. (Actually, Haskell probably has some of the best tools for this, but I am open to suggestions of a better language for the task if you feel there is one.)

To summarize:

<ol> <li>

Is Haskell the right choice for this?

</li> <li>

Does any Haskell library support fast (O(log N)) node/text inserts and deletes?

</li> <li>

Is Data.Sequence the best data structure for storing a list of items (in my case, nodes and Chars) with fast inserts and deletes?

</li> </ol>

Answer1:

I will answer my own question:

I chose to wrap a Text.XML tree in a custom type that stores nodes and text in Data.Sequence values. Because Haskell is lazy, I believe it holds the Text.XML data in memory only temporarily, node by node as the data streams in, and that data is garbage collected before I start any real work modifying the Sequence trees.

(It would be nice if someone here could verify that this is how Haskell works internally. I've implemented this, and the performance seems reasonable, though not great: about 30k inserts/deletes per second, which should do.)
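A rough sketch of what such a wrapper might look like, not the actual implementation. Everything here is an assumption for illustration: `ListNode` is a simplified stand-in for a list-based parser tree (the real Text.XML types also carry attributes, comments, and so on), and `SeqNode`, `toSeqNode`, `insertChildAt`, `deleteChildAt`, and `render` are hypothetical names. A node is addressed by a path of child indices from the root.

```haskell
import Data.Foldable (toList)
import Data.Sequence (Seq)
import qualified Data.Sequence as Seq

-- Simplified stand-in for a list-based parser tree (assumption).
data ListNode = LElem String [ListNode] | LText String

-- Edit-friendly tree: children and characters both live in Seq.
data SeqNode = SElem String (Seq SeqNode) | SText (Seq Char)
  deriving (Eq, Show)

-- One-time conversion from the list-based tree.
toSeqNode :: ListNode -> SeqNode
toSeqNode (LElem name kids) = SElem name (Seq.fromList (map toSeqNode kids))
toSeqNode (LText t)         = SText (Seq.fromList t)

-- Insert a new child at index i of the element reached by the path.
-- Paths that run into a text node leave the tree unchanged.
insertChildAt :: [Int] -> Int -> SeqNode -> SeqNode -> SeqNode
insertChildAt [] i new (SElem n kids) = SElem n (Seq.insertAt i new kids)
insertChildAt (p:ps) i new (SElem n kids) =
  SElem n (Seq.adjust (insertChildAt ps i new) p kids)
insertChildAt _ _ _ leaf = leaf

-- Delete the child at index i of the element reached by the path.
deleteChildAt :: [Int] -> Int -> SeqNode -> SeqNode
deleteChildAt [] i (SElem n kids) = SElem n (Seq.deleteAt i kids)
deleteChildAt (p:ps) i (SElem n kids) =
  SElem n (Seq.adjust (deleteChildAt ps i) p kids)
deleteChildAt _ _ leaf = leaf

-- Naive serializer for the demo; it does no escaping.
render :: SeqNode -> String
render (SText t)      = toList t
render (SElem n kids) = "<" ++ n ++ ">" ++ concatMap render kids ++ "</" ++ n ++ ">"

main :: IO ()
main = do
  let t0 = toSeqNode (LElem "a" [LElem "b" [LText "hi"], LText "x"])
      t1 = insertChildAt [0] 1 (SText (Seq.fromList "!")) t0
      t2 = deleteChildAt [] 1 t1
  putStrLn (render t0)  -- <a><b>hi</b>x</a>
  putStrLn (render t1)  -- <a><b>hi!</b>x</a>
  putStrLn (render t2)  -- <a><b>hi!</b></a>
```

Each edit rebuilds only the nodes along the path to the edit point, so the untouched subtrees are shared between the old and new versions rather than copied.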
