
Regex to extract nested patterns [duplicate]

<strong>Possible Duplicate:</strong> Matching Nested Structures With Regular Expressions in Python

I can't wrap my head around this problem. I have a string like the following one:

Lorem ipsum dolor sit amet [@a xxx yyy [@b xxx yyy [@c xxx yyy]]] lorem ipsum sit amet

My task is to extract the commands (they always start with <strong>[@</strong> and end with <strong>]</strong>) and their subcommands. A result like

[
  [@a xxx yyy [@b xxx yyy [@c xxx yyy]]],  # the outermost
  [@b xxx yyy [@c xxx yyy]],               # the middle one
  [@c xxx yyy]                             # the innermost
]

would be highly appreciated. The problem is that these kinds of commands can occur in very long text messages, so a "performant" solution would be nice.

I was toying around with some regex patterns, most of the time something like

(\[@.*?\]\s) # for the outer one

but I have had no luck matching the middle and inner ones. To make it more complicated, the number of nested commands is variable. Might some special regex be the solution? I have read about lookaheads and lookbehinds but have no idea how to use them in this special case.

Thanks a bunch!

<strong>UPDATE</strong>

@Cyborgx37 pointed me to another post that uses the pyparsing package. It would be nice to have a solution without an external package or library. But pyparsing definitely solves that problem!

Answer1:

C# has recursive/nested regex; I don't believe Python does. You could re-run the regex search on previous results, but that is probably less efficient (given the overhead of a regex engine for such a simple search) than writing a custom parser. The text you're searching for, "[@" and "]", isn't very complex.

Here's a custom parser (in JavaScript) that would do the job.

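var txt = "Lorem ipsum dolor sit amet [@a xxx yyy [@b xxx yyy [@c xxx yyy]]] lorem ipsum sit amet";

function parse(s) {
    var stack = [];   // text of every command that is still open
    var result = [];  // completed commands, innermost first
    for (var x = 0; x < s.length; x++) {
        var c = s.charAt(x);
        if (c == '[' && x + 1 < s.length && s.charAt(x + 1) == '@') {
            // Opening "[@": extend all open commands, then start a new one.
            for (var y = 0; y < stack.length; y++) stack[y] += "[@";
            stack.push("[@");
            x++;  // skip the '@'
        } else if (c == ']' && stack.length > 0) {
            // Closing "]": extend all open commands, then finish the innermost.
            for (var y = 0; y < stack.length; y++) stack[y] += "]";
            result.push(stack.pop());
        } else {
            // Ordinary character: append it to every open command.
            for (var y = 0; y < stack.length; y++) stack[y] += c;
        }
    }
    return result;
}

parse(txt);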

It loops through the characters of the text once and uses a stack with an if / else if / else chain to push, pop, and modify the values on that stack. Note that each character is appended to every open entry on the stack, so the work per character grows with the nesting depth.
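Since the question is about Python, here is a rough Python sketch of the same stack idea. It is not a port of the code above: instead of growing a string per open command, it only remembers where each "[@" started and slices the text when the matching "]" arrives. The name `extract_commands` is made up for this example, and it assumes (like the JavaScript version) that any "]" closes the innermost open command and that unclosed commands are simply dropped.

```python
def extract_commands(text):
    """Return every [@ ... ] command found in text, ordered by start position."""
    open_starts = []  # indices of "[@" openers that have not been closed yet
    found = []        # (start_index, command_text) pairs
    i = 0
    while i < len(text):
        if text.startswith("[@", i):
            # A new command opens here; remember where it began.
            open_starts.append(i)
            i += 2
        elif text[i] == "]" and open_starts:
            # Assumption: any "]" closes the innermost open command.
            start = open_starts.pop()
            found.append((start, text[start:i + 1]))
            i += 1
        else:
            i += 1
    # Commands complete innermost first; sorting by start index puts the
    # outermost command before the ones nested inside it.
    found.sort()
    return [command for _, command in found]


if __name__ == "__main__":
    sample = ("Lorem ipsum dolor sit amet "
              "[@a xxx yyy [@b xxx yyy [@c xxx yyy]]] lorem ipsum sit amet")
    for command in extract_commands(sample):
        print(command)
```

The scan itself touches each character once; the slices still copy the nested text, so very deep nesting costs more than flat text does.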

Answer2:

Coming from a C# background, I'm not sure this is going to help, but since you have to parse the inner commands anyway, why not store the contents of the command and then run your regex function again on the inner data? I'm probably missing something, but that's what I would try at least.

Answer3:

No wonder you cannot wrap your head around the problem. Formal language theory explains why. Noam Chomsky described four categories of languages, known as the Chomsky hierarchy. Regular expressions can describe only the simplest category, the regular languages. Languages with nested paired structures fall outside the regular languages, so they cannot be described or accepted by regular expressions in the formal sense.

Among the parsers that are easiest to implement are those based on recursive calls of functions, each of which parses one element of the language (recursive descent).
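To make the recursive idea concrete, here is a minimal sketch in Python, under a few assumptions of mine: the only "element of the language" is a command that starts with "[@" and ends with "]", any "]" inside a command closes it, and an unclosed command is treated as an error. The names `parse_command` and `parse_text` are hypothetical, chosen for this example.

```python
def parse_command(text, pos):
    """Parse one command whose "[@" opener sits at text[pos].

    Returns (node, next_pos). A node is a dict holding the command's full
    source text and the list of nodes for the commands nested inside it.
    """
    start = pos
    pos += 2  # skip the "[@" opener
    children = []
    while pos < len(text):
        if text.startswith("[@", pos):
            # A nested command: let a recursive call consume it entirely.
            child, pos = parse_command(text, pos)
            children.append(child)
        elif text[pos] == "]":
            node = {"text": text[start:pos + 1], "children": children}
            return node, pos + 1
        else:
            pos += 1
    raise ValueError(f"unclosed command starting at index {start}")


def parse_text(text):
    """Return the top-level commands in text as a list of nodes."""
    nodes = []
    pos = 0
    while pos < len(text):
        if text.startswith("[@", pos):
            node, pos = parse_command(text, pos)
            nodes.append(node)
        else:
            pos += 1
    return nodes
```

Each recursive call handles one nesting level, so the result is a tree rather than a flat list; walking that tree gives the outer, middle and inner commands. One caveat: Python limits recursion depth (about 1000 frames by default), so pathologically deep nesting would need an explicit stack instead.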
