Regex to extract nested patterns [duplicate]

<strong>Possible Duplicate:</strong> Matching Nested Structures With Regular Expressions in Python

I can't wrap my head around this problem. I have a string like the following one:

Lorem ipsum dolor sit amet [@a xxx yyy [@b xxx yyy [@c xxx yyy]]] lorem ipsum sit amet

My task is to extract the commands (they always start with <strong>[@</strong> and end with <strong>]</strong>) and their subcommands. A result like

[
    [@a xxx yyy [@b xxx yyy [@c xxx yyy]]],  # the outermost
    [@b xxx yyy [@c xxx yyy]],               # the middle one
    [@c xxx yyy]                             # the innermost
]

would be highly appreciated. The problem is that these kinds of commands can occur in very long text messages, so a "performant" solution would be nice.

I was toying around with some regex patterns, most of the time something like

(\[@.*?\]\s) # for the outer one

but I have seen no light in matching the middle and inner ones. To make it more complicated, the number of nested commands is variable... Might some special regex be the solution? I have read about lookaheads and lookbehinds, but I have no idea how to use them in this special case.

Thanks a bunch!

<strong>UPDATE</strong>

@Cyborgx37 pointed me to another post that uses the pyparsing package. It would be nice to have a solution without an external package or library. But pyparsing definitely solves the problem!

Answer1:

C# regexes can handle nested structures (via balancing groups); I don't believe Python's built-in re module can. You could re-run the regex search on previous results, but this is probably less efficient (the overhead of a regex for such a simple search) than just writing a custom parser. The text you're searching for, "[@" and "]", isn't very complex.

Here's a custom parser (in JavaScript) that would do the job.

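var txt = "Lorem ipsum dolor sit amet [@a xxx yyy [@b xxx yyy [@c xxx yyy]]] lorem ipsum sit amet";

function parse(s) {
    var stack = [];   // text of every command that is still open
    var result = [];  // completed commands, innermost first
    for (var x = 0; x < s.length; x++) {
        var c = s.charAt(x);
        if (c == '[' && x + 1 < s.length && s.charAt(x + 1) == '@') {
            // A new command opens: it is also part of every open command.
            for (var y = 0; y < stack.length; y++) stack[y] += "[@";
            stack.push("[@");
            x++;  // skip the '@'
        } else if (c == ']' && stack.length > 0) {
            // The innermost open command closes.
            for (var y = 0; y < stack.length; y++) stack[y] += "]";
            result.push(stack.pop());
        } else {
            // Ordinary character: append it to every open command.
            for (var y = 0; y < stack.length; y++) stack[y] += c;
        }
    }
    return result;
}

parse(txt);

It loops through the characters of the text only once and uses a stack with an if / else if / else chain to push, pop, and extend the values on that stack. Note that every character is appended to each command that is still open, so the work per character grows with the nesting depth.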
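Since the question is about Python, here is a rough Python sketch of the same stack idea. It is not a line-by-line port: instead of building strings on the stack, it remembers the index where each open command starts and slices the text when the matching "]" shows up. The function name extract_commands is made up for this example.

```python
def extract_commands(text):
    """Return every [@ ... ] command found in text, innermost first."""
    starts = []  # start indices of the "[@" openers that are still unclosed
    found = []
    i = 0
    while i < len(text):
        if text.startswith("[@", i):
            starts.append(i)
            i += 2  # skip both characters of the opener
        elif text[i] == "]" and starts:
            # Close the most recently opened command and slice it out.
            found.append(text[starts.pop():i + 1])
            i += 1
        else:
            i += 1
    return found


txt = "Lorem ipsum dolor sit amet [@a xxx yyy [@b xxx yyy [@c xxx yyy]]] lorem ipsum sit amet"
for command in extract_commands(txt):
    print(command)
```

The commands come out innermost first, like in the JavaScript version; reverse the list if you want the outermost one first. A "]" with no open command is simply ignored, and a command that is never closed is dropped.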

Answer2:

I come from a C# background, so I'm not sure this is going to help, but since you have to parse the inner commands anyway, why not store the contents of each command and then run your regex function again on the inner data? I'm probably missing something, but that's what I would try first.
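A minimal sketch of that idea in Python, under a big assumption: each command contains at most one subcommand, and no stray "]" appears after the outermost command. The greedy pattern grabs everything from the first "[@" to the last "]", so sibling commands such as "[@a x] [@b y]" would be merged into one wrong match. The names COMMAND and extract_chain are made up for this example.

```python
import re

# Greedy on purpose: from the first "[@" to the LAST "]" in the searched text.
COMMAND = re.compile(r"\[@.*\]", re.DOTALL)


def extract_chain(text):
    """Return a single chain of nested commands, outermost first."""
    found = []
    match = COMMAND.search(text)
    while match:
        command = match.group(0)
        found.append(command)
        # Strip the outer "[@" and "]" and search the contents again.
        match = COMMAND.search(command[2:-1])
    return found


txt = "Lorem ipsum dolor sit amet [@a xxx yyy [@b xxx yyy [@c xxx yyy]]] lorem ipsum sit amet"
for command in extract_chain(txt):
    print(command)
```

For the sample string this gives the outer, middle and inner command in that order, but every level rescans the remaining text, so it does more work than a single pass.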

Answer3:

No wonder you cannot wrap your head around the problem. Formal language theory explains why. Noam Chomsky described four categories of languages, known as the Chomsky hierarchy. Regular expressions can describe only the simplest category, the regular languages. Languages with arbitrarily nested paired structures lie outside the regular languages, so they cannot be described or accepted by regular expressions.

One of the most easily implemented kinds of parser is the recursive descent parser, built from functions that call themselves (or each other) recursively to parse the elements of the language.
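To illustrate, here is a small recursive sketch in Python for the "[@ ... ]" commands from the question. The names parse_commands and parse_command are made up for this example, and raising ValueError on an unclosed command is my own choice, not something the question asks for.

```python
def parse_commands(text):
    """Return all [@ ... ] commands, each command before its subcommands."""
    found = []

    def parse_command(start):
        """Parse the command opening at text[start]; return the index after its ']'."""
        slot = len(found)
        found.append("")  # placeholder keeps the outer command ahead of its children
        i = start + 2     # skip the "[@" opener
        while i < len(text):
            if text.startswith("[@", i):
                i = parse_command(i)  # recurse into the subcommand
            elif text[i] == "]":
                found[slot] = text[start:i + 1]
                return i + 1
            else:
                i += 1
        raise ValueError(f"unclosed command starting at index {start}")

    i = 0
    while i < len(text):
        if text.startswith("[@", i):
            i = parse_command(i)
        else:
            i += 1
    return found


txt = "Lorem ipsum dolor sit amet [@a xxx yyy [@b xxx yyy [@c xxx yyy]]] lorem ipsum sit amet"
for command in parse_commands(txt):
    print(command)
```

Each nesting level costs one Python stack frame, so extremely deep nesting would run into the interpreter's recursion limit; for ordinary messages that should not matter.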
