Via Twitter, an interesting regexp_replace puzzle

Update: Laurent Schneider offers more elegant solutions and points out an embarrassing bug. Rather than edit my post to appear more clever than I actually am, I will instead direct you to his comment. :-)

LewisC (aka @oracle_ace) posed a question on Twitter yesterday, in the following two tweets, about using regexp_replace to match unescaped characters:

oracle_aceTweets.png

I'm no guru, but as an unreformed Perl geek, I'm a sucker for a regex problem, and liked the added challenge of conforming to Oracle's regexp_*() syntax. After a little bit of tweaking and testing, I found that I could even fit the solution in less than 140 characters, so I tweeted back the following:

select '"a"bcd\"efg"hi"' tst,
regexp_replace('"a"bcd\"efg"hi"','(^"+)|([^\\])"+','\2|') fixed
from dual;

Since one person's regex is often another person's line noise, I thought I'd explain what's going on:

  • (^"+) Matches one or more double quote characters at the start of the string. This is basically just handling an edge condition that can't be addressed by the second match. The circumflex character, ^ (aka "caret"), matches the beginning of a line.
  • ([^\\])"+ Matches any character except a slash, followed by a quote. So, per Lewis' spec, an escaped character like \" would be ignored. Note that in this context, the circumflex character ^ means "not."
  • \2| in the replacement string replaces the double-quote characters with a pipe character. The \2 prepends what was matched in the second grouped expression ([^\\]). If the first match condition (^"+) is met, \2 is empty, so nothing is prepended to the replacement character, which is exactly what we want.

This solution can be easily modified to use your choice of match and replacement characters. If one of your chosen characters is \, however, you should expect a little extra quoting/escaping work. ;-)

There is a limitation to this solution that I didn't realize until I started composing this post: Thanks to the + character in the match condition, multiple matched characters will be compressed to a single replacement character. For example:

SQL> select '"""a"bcd\"efg""""hi""""' tst,
2 regexp_replace('"""a"bcd\"efg"""hi""""','(^"+)|([^\\])"+','\2|') fixed
3 from dual
4 /

TST                     FIXED
----------------------- ---------------
"""a"bcd\"efg""""hi"""" |a|bcd\"efg|hi|

If the goal is to eliminate the matched characters altogether (i.e. no replacement character), then this is not a problem. If this compression is a concern, however, then a slightly modified match (+ characters removed) with a multi-pass approach might be appropriate:

SQL> select '"""a"bcd\"efg""""hi""""' tst,
2 regexp_replace(regexp_replace('"""a"bcd\"efg"""hi""""',
3                               '(^")|([^\\])"','\2|'),
4                '(^")|([^\\])"','\2|') fixed
5 from dual
6 /
 
TST                     FIXED
----------------------- ----------------------
"""a"bcd\"efg""""hi"""" |||a|bcd\"efg|||hi||||

Not particularly elegant, but it works. Output from using this revised match condition in a single-pass approach is left as an exercise for the reader. ;-)

Finally, if you're familiar with perl regular expressions, you may note that what Lewis asked for could be handled with a zero-width negative look-behind assertion, without having to resort to multiple passes over the string:

[oracle@vir ~]$ test_string='"""a"bcd\"efg""""hi""""'
[oracle@vir ~]$ echo $test_string | perl -pe 's/(?<!\\)"/|/g'
|||a|bcd\"efg||||hi||||

As far as I know, however, Oracle's regexp functions have not implemented Perl's "extended pattern" syntax. Furthermore, Lewis seemed to be asking for an Oracle-centric solution, and if I'd replied to Lewis' question with "something like Perl's zero-width negative look-behind assertion extended pattern," I would've used up way too many characters in my tweet. ;-)

2 Comments

  1. Posted 7 April 2009 at 2:41 | Permalink

    for the first solution, you can make it 5 char shorter ;)

    regexp_replace('"a"bcd\"efg"hi"','(^|[^\])"+','\1|')

    for the second solution, it is not only buggy, (there are four double-quotes between efg and hi), but it is also way to long and way to slow !

    replace(replace(replace('"""a"bcd\"efg""""hi""""','\"','ÿ'),'"','|'),'ÿ','\"')

    (I hope the backslash will print correctly, I have lots of issues with backslash on my blog...)

    Thanks for the challenge!

  2. Posted 7 April 2009 at 7:27 | Permalink

    Hello Laurent,

    Re #1: Fewer characters == better. Your solution is much more tidy, thank you!

    Re #2: Argh, you are absolutely correct about the bug...I was far too quick when reviewing that code. :-( Thank you for your replace() solution, too. It's a good illustration that even if it looks like a regex problem, it may not be.

    Thank you very much for the beneficial corrections, and for the visit!

    Regards,

    John P.

    Note to readers: The backslashes in Laurent's solution printed correctly, but my blog has a bad habit of inserting "smart" quotes instead of normal quotes, even in <pre> and <code> blocks. Please copy and paste with care, until I can correct my WordPress setup. :) Fixed, I think!

Post a Comment

Your email is never published nor shared. Required fields are marked *

*
*