Thursday, April 4, 2013

Regexp for matching text within parentheses, brackets, quotes, etc.

Sometimes we want to find text that is placed within (...), "...", '...', [...], etc., without including the delimiters themselves. This can be achieved with a look-behind expression.

Let's take the following text as our example:
(leak from java heap) sucks ( rocks!)
We are interested in:
leak from java heap rocks!
People usually propose a simple regexp:
\(.+?\)
1. ( and ) must be escaped with \ because they have a special meaning in regexps.
2. The reluctant quantifier +? is used rather than the greedy quantifier + because we are interested in the shortest possible match.
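
To see what the reluctant quantifier changes, here is a minimal Java sketch that runs both variants on the example text. The class name GreedyVsReluctant and the helper findAll are made up for this illustration; nothing beyond java.util.regex is assumed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsReluctant {
    // Hypothetical helper: collects every match of the regexp in the text.
    static List<String> findAll(String regex, String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String text = "(leak from java heap) sucks ( rocks!)";
        // Reluctant +? stops at the first closing parenthesis.
        System.out.println(findAll("\\(.+?\\)", text));
        // prints [(leak from java heap), ( rocks!)]
        // Greedy + runs to the last closing parenthesis.
        System.out.println(findAll("\\(.+\\)", text));
        // prints [(leak from java heap) sucks ( rocks!)]
    }
}
```

Note that the backslashes are doubled only because the regexp lives inside a Java string literal.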

However, the proposed regexp returns:
(leak from java heap)( rocks!)
and we do not want to include ( ). We want to get rid of the ( ) in a single regexp pass. This can be done with the look-behind operator. As its name suggests, the operator only examines what precedes our target, but does not include it in the match:

(?<=\()[\w\s!]+

It says:
Find ( [do not include it] and consume the following characters until you encounter something other than \w, \s or !. The ( can be replaced with any desired delimiter.
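
As a rough sketch, the look-behind regexp can be tried out in Java like this. The class name LookBehindDemo and the helper findAll are hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookBehindDemo {
    // Hypothetical helper: collects every match of the regexp in the text.
    static List<String> findAll(String regex, String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String text = "(leak from java heap) sucks ( rocks!)";
        // (?<=\() requires a preceding ( but leaves it out of the match.
        List<String> inside = findAll("(?<=\\()[\\w\\s!]+", text);
        for (String s : inside) {
            System.out.println("[" + s + "]");
        }
        // prints [leak from java heap] and then [ rocks!]
    }
}
```

The second match keeps the space that follows its opening parenthesis, because \s is part of the character class; call trim() on the result if that space is unwanted.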

Now, let's try a more difficult example that I had to deal with at work: replacing whitespace within '...' with _.

Finding the whitespace within '...':

(?<='[\w\s]{0,100})\s

It says:
Find ', continue through \w and \s characters [do not include them], and when you find a \s, return it.
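
Here is a minimal sketch of the replacement itself, assuming the pattern is applied with Matcher.replaceAll. The class name QuoteWhitespace, the method underscoreQuoted and the sample inputs are invented for this illustration.

```java
import java.util.regex.Pattern;

public class QuoteWhitespace {
    // Whitespace preceded by a ' and at most 100 word or whitespace characters.
    static final Pattern WS_AFTER_QUOTE =
            Pattern.compile("(?<='[\\w\\s]{0,100})\\s");

    // Hypothetical helper: replaces each matched whitespace character with _.
    static String underscoreQuoted(String text) {
        return WS_AFTER_QUOTE.matcher(text).replaceAll("_");
    }

    public static void main(String[] args) {
        System.out.println(underscoreQuoted("name='leak from java heap';"));
        // prints name='leak_from_java_heap';
        System.out.println(underscoreQuoted("a b 'c d';"));
        // prints a b 'c_d';
    }
}
```

One caveat with this sketch: the pattern never checks for the closing ', so whitespace that directly follows the closing quote is matched as well (for instance 'a b' c becomes 'a_b'_c). The sample inputs end the quoted part with ; to avoid that case.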
  
{0,100} is used because Java (at least in the version I used) does not support unbounded quantifiers such as +, +? or * inside a look-behind. Using them resulted in:
"Look-behind group does not have an obvious maximum length".
