How to regex match pairs within pairs

My question is fairly straightforward, even if the purpose it will serve is pretty complicated. I will use a simple example:

AzzAyyAxxxxByyBzzB

Normally I would want to get everything between A and B. However, because the content between the first A and the last B (one pair) contains additional AB pairs, I need to push the end of the match back to the B that actually closes the first A.

So what I'm looking for is some RegEx that would allow me to have the following output:

Match 1 Group 1: AzzAyyAxxxxByyBzzB Group 2: zzAyyAxxxxByyBzz

Then I would match it again to get:

Match 2 Group 1: AyyAxxxxByyB Group 2: yyAxxxxByy

Then finally again to get:

Match 3 Group 1: AxxxxB Group 2: xxxx

Obviously if I try (A(.*?)B) on the whole input I get:

Match x Group 1: AzzAyyAxxxxB Group 2: zzAyyAxxxx

Which is not what I'm looking for :)

I hope this makes sense. I understand if this can't be done in RegEx, but I thought I would ask some of you regex wizards before I give up on it and try something else. Thanks!

<strong>Additional Info:</strong>

The project I'm working on is written in Java.

One other problem is that I'm parsing a document which could contain something like this:

AzzAyyAxxxxByyBzzB Here is some unrelated stuff AzzAyyAxxxxByyBzzB AzzzBxxArrrBAssssB

And the top AB pairs need to be kept separate from the bottom AB pairs.

Answer1:

You made your regex explicitly ungreedy by using the ?. Just leave it out and the regex will consume as much as possible before matching the B:

(A(.*)B)
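To see the difference on the input from the question, here is a minimal Java sketch. The class name `GreedyVsReluctant` and the helper `firstMatch` are made up for illustration; only `java.util.regex` is used.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsReluctant {
    // Hypothetical helper: returns group 1 of the first match, or null if nothing matches.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "AzzAyyAxxxxByyBzzB";
        // Reluctant quantifier: stops at the first B it can reach.
        System.out.println(firstMatch("(A(.*?)B)", input)); // AzzAyyAxxxxB
        // Greedy quantifier: runs on to the last B in the input.
        System.out.println(firstMatch("(A(.*)B)", input));  // AzzAyyAxxxxByyBzzB
    }
}
```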

However, in general nested structures are beyond the scope of regular expressions. In a case like this:

AxxxByyyAzzzB

You would now also match from the first A to the last B. If this is possible in your scenario, you might be better off going through the string yourself character by character and counting As and Bs to figure out which ones belong together.
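The counting idea could look roughly like this in Java. It is only a sketch under the assumption that the delimiters are the single characters A and B, as in the toy example; the names `TopLevelPairs` and `topLevelPairs` are invented here. A depth counter goes up on every A and down on every B, and a top-level pair ends whenever the counter returns to zero.

```java
import java.util.ArrayList;
import java.util.List;

class TopLevelPairs {
    // Returns every outermost A...B span, delimiters included.
    static List<String> topLevelPairs(String input) {
        List<String> pairs = new ArrayList<>();
        int depth = 0;   // number of As that are currently open
        int start = -1;  // index of the A that opened the current outermost pair
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == 'A') {
                if (depth == 0) {
                    start = i;
                }
                depth++;
            } else if (c == 'B' && depth > 0) {
                depth--;
                if (depth == 0) {
                    pairs.add(input.substring(start, i + 1));
                }
            }
            // A B seen while nothing is open is ignored (an assumption of this sketch).
        }
        return pairs;
    }

    public static void main(String[] args) {
        System.out.println(topLevelPairs("AxxxByyyAzzzB")); // [AxxxB, AzzzB]
    }
}
```

Consecutive pairs come out as separate entries, so unrelated text between them does not get swallowed.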

<strong>EDIT:</strong>

Now that you have updated the question and we figured this out in the comments, you <strong>do</strong> have the problem of multiple consecutive pairs. In this case, this cannot be done with a regex engine that does not support recursion.

However, you can switch to matching from the inside out:

A([^AB]*)B

This will only get innermost pairs, because there can be neither an A nor a B between the delimiters. If you find it, you can then remove the pair and continue with your next match.
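A rough Java sketch of that find-and-remove loop, using the pattern above. The names `InsideOut` and `peel` are made up for illustration. Note that because each inner pair is cut out before the next search, the outer matches no longer contain the inner text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InsideOut {
    // Matches only a pair with no A or B between the delimiters.
    private static final Pattern INNERMOST = Pattern.compile("A([^AB]*)B");

    // Repeatedly finds the first innermost pair, records it, and cuts it out of the text.
    static List<String> peel(String input) {
        List<String> found = new ArrayList<>();
        String rest = input;
        Matcher m = INNERMOST.matcher(rest);
        while (m.find()) {
            found.add(m.group());
            rest = rest.substring(0, m.start()) + rest.substring(m.end());
            m = INNERMOST.matcher(rest); // search the shortened text from the start again
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(peel("AzzAyyAxxxxByyBzzB")); // [AxxxxB, AyyyyB, AzzzzB]
    }
}
```

If the outer matches must keep their nested content, recording start and end positions instead of removing text would be one way around that.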

Answer2:

Use word boundaries, or line anchors if you use multiline mode:

\bA(.*)B\b #for matches that do not span the whole line

or

^A(.*)B$ #for matches that run from the beginning of the line to the end

Answer3:

You won't be able to do this with Regular Expressions alone. What you're describing is more Context-Free than Regular. In order to parse something like this you need to push a new context onto a stack every time you encounter an 'A' and pop the stack every time you encounter a 'B'. You need something more like a pushdown automaton than a regular expression.
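The push/pop idea can be sketched in Java along these lines. This assumes single-character delimiters as in the question, and the names `StackPairs` and `allPairs` are hypothetical. Each A pushes its index, each B pops the most recent open A, and the span between them is one complete pair with its nested content intact.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class StackPairs {
    // Returns every A...B pair, nested ones included, in the order their B is reached.
    static List<String> allPairs(String input) {
        List<String> pairs = new ArrayList<>();
        Deque<Integer> open = new ArrayDeque<>(); // indices of As not yet closed
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == 'A') {
                open.push(i);
            } else if (c == 'B' && !open.isEmpty()) {
                int start = open.pop();
                pairs.add(input.substring(start, i + 1));
            }
            // A B with nothing open is skipped (an assumption of this sketch).
        }
        return pairs;
    }

    public static void main(String[] args) {
        // [AxxxxB, AyyAxxxxByyB, AzzAyyAxxxxByyBzzB]
        System.out.println(allPairs("AzzAyyAxxxxByyBzzB"));
    }
}
```

These are the three matches asked for in the question, innermost first; reversing the list gives them in the order the question shows.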
