# R: gsub and capture

I am trying to extract the contents between square brackets from a string:

    eq <- "(5) h[m] + nadh[m] + q10[m] --> (4) h[c] + nad[m] + q10h2[m]"

I can filter them out:

    gsub("\\[.+?\\]", "", eq)  ## replaces square brackets and everything inside them
    [1] "(5) h + nadh + q10 --> (4) h + nad + q10h2"

But how can I capture what's inside the brackets? I tried the following:

    gsub("\\[(.+)?\\])", "\\1", eq)
    grep("\\[(.+)?\\]", eq, value=TRUE)

but both return the whole string:

    [1] "(5) h[m] + nadh[m] + q10[m] --> (4) h[c] + nad[m] + q10h2[m]"

Also, in my application I never know how many such bracketed terms occur, so I wouldn't know what the replacement argument in gsub should look like (e.g. \\1 or \\1_\\2). Thanks in advance!

Try this:

    eq <- "(5) h[m] + nadh[m] + q10[m] --> (4) h[c] + nad[m] + q10h2[m]"
    pattern <- "\\[.+?\\]"
    m <- gregexpr(pattern, eq)
    regmatches(eq, m)
    [[1]]
    [1] "[m]" "[m]" "[m]" "[c]" "[m]" "[m]"
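Since the question asks for what is inside the brackets, here is a minimal base R sketch that builds on the gregexpr/regmatches approach above and then strips the enclosing brackets from each match. The names eqs, matches and inside, and the second input string, are made up for illustration.

```r
# Hypothetical input: the question's equation plus a string with no brackets
eqs <- c(
  "(5) h[m] + nadh[m] + q10[m] --> (4) h[c] + nad[m] + q10h2[m]",
  "no compartments here"
)

# Find every bracketed term in every element (one list entry per string)
matches <- regmatches(eqs, gregexpr("\\[.+?\\]", eqs))

# Drop the leading "[" and the trailing "]" from each match
inside <- lapply(matches, function(x) gsub("^\\[|\\]$", "", x))

inside[[1]]  # "m" "m" "m" "c" "m" "m"
inside[[2]]  # character(0), because nothing matched
```

Because the result is a list with one character vector per input string, the number of bracketed terms does not need to be known in advance.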

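Your first pattern didn't work because it contains an extra closing parenthesis after \\], and no ] in your string is followed by a ), so nothing matched. The ? was also misplaced: (.+)? makes the whole group optional, whereas (.+?) makes the match lazy.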

    gsub("\\[(.+)?\\])", "\\1", eq)  # Yours
    gsub("\\[(.+?)\\]", "\\1", eq)   # Corrected -- kind of
    [1] "(5) hm + nadhm + q10m --> (4) hc + nadm + q10h2m"

What you are essentially doing is replacing every match with its first capture group (the text inside the brackets), which isn't what you want: the brackets disappear, but everything else in the string stays.
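To make the role of \\1 and of the lazy quantifier concrete, here is a small sketch on a short, made-up input string (x and the result names are assumptions for illustration, not part of the question):

```r
x <- "a[1] b[2]"  # hypothetical short input

# Greedy: .+ runs from the first "[" to the last "]", so there is one big match
greedy <- gsub("\\[(.+)\\]", "\\1", x)

# Lazy: .+? stops at the nearest "]", so each bracketed term is its own match
lazy <- gsub("\\[(.+?)\\]", "\\1", x)

# \\1 is only the captured text; whatever surrounds it in the replacement is kept
wrapped <- gsub("\\[(.+?)\\]", "<\\1>", x)

greedy   # "a1] b[2"
lazy     # "a1 b2"
wrapped  # "a<1> b<2>"
```

In every case gsub returns the whole string with the matches rewritten, which is why it cannot return just the captured pieces.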

Your second attempt, using grep, only tests each element of the input for a match. With value=TRUE it returns the elements that contain the pattern, and since your one string contains it, you get that string back whole.
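A short sketch of that grep behaviour on a two-element vector (the vector x and the result names are made up for illustration):

```r
x <- c("h[m] + nadh[m]", "no brackets here")  # hypothetical input

idx   <- grep("\\[.+?\\]", x)                # positions of matching elements
vals  <- grep("\\[.+?\\]", x, value = TRUE)  # the matching elements themselves
flags <- grepl("\\[.+?\\]", x)               # one TRUE/FALSE per element

idx    # 1
vals   # "h[m] + nadh[m]"
flags  # TRUE FALSE
```

None of these return the matched substrings; they only tell you which elements contain a match.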

Another option:

    library(stringr)
    pattern <- "\\[.+?\\]"
    str_extract_all(eq, pattern)
    [[1]]
    [1] "[m]" "[m]" "[m]" "[c]" "[m]" "[m]"
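If only the text inside the brackets is wanted, one possible variation is stringr's str_match_all, which returns the capture groups alongside the full match. This is a sketch that assumes the stringr package is installed; the name caps is made up for illustration.

```r
library(stringr)  # assumption: stringr is installed

eq <- "(5) h[m] + nadh[m] + q10[m] --> (4) h[c] + nad[m] + q10h2[m]"

# One matrix per input string: column 1 is the full match, column 2 the first group
caps <- str_match_all(eq, "\\[(.+?)\\]")[[1]]

caps[, 1]  # "[m]" "[m]" "[m]" "[c]" "[m]" "[m]"
caps[, 2]  # "m" "m" "m" "c" "m" "m"
```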

gsub replaces portions of a string with replacement strings but here we wish to extract the strings rather than replace them.

<strong>strapplyc</strong>

strapplyc in the gsubfn package can do that. Use your pattern, but put parentheses around the portion you wish to capture (or omit the parentheses to capture the entire match, including the square brackets):

    > library(gsubfn)
    > strapplyc(eq, "\\[(.*?)\\]")[[1]]
    [1] "m" "m" "m" "c" "m" "m"

The guts of strapplyc are written in Tcl, so it's quite fast, although for small strings such as the ones here the speed will not really matter.

<strong>strapply</strong>

There is also strapply, which takes a third argument (a function, list or proto object) that is applied to each extracted capture. For example:

    > # function
    > strapply(eq, "\\[(.*?)\\]", toupper)[[1]]
    [1] "M" "M" "M" "C" "M" "M"
    > # list
    > strapply(eq, "\\[(.*?)\\]", list(c = "crunchy", m = "munchy"))[[1]]
    [1] "munchy"  "munchy"  "munchy"  "crunchy" "munchy"  "munchy"