Tuesday, April 8, 2014

Regular Expressions in 10g


This article briefly introduces Oracle’s support for regular expressions in 10g, considered by many developers to be long overdue. The new functions (available in both SQL and PL/SQL) are:
  • REGEXP_LIKE
  • REGEXP_INSTR
  • REGEXP_SUBSTR
  • REGEXP_REPLACE
Each of these functions are covered below. This article assumes that the reader has some knowledge of regular expressions and pattern matching. For further details, see the SQL Reference in the online Oracle documentation.

setup

To demonstrate regular expressions we’ll use a small set of arbitrary test data, created as follows.
SQL> CREATE TABLE t (x VARCHAR2(30));

Table created.

SQL> INSERT ALL
  2     INTO t VALUES ('XYZ123')
  3     INTO t VALUES ('XYZ 123')
  4     INTO t VALUES ('xyz 123')
  5     INTO t VALUES ('X1Y2Z3')
  6     INTO t VALUES ('123123')
  7     INTO t VALUES ('?/*.')
  8     INTO t VALUES ('\?.')
  9  SELECT * FROM dual;

7 rows created.

SQL> SELECT * FROM t;

X
------------
XYZ123
XYZ 123
xyz 123
X1Y2Z3
123123
?/*.
\?.

7 rows selected.

regexp_like

First we’ll look at REGEXP_LIKE and use this to introduce the basic means of specifying regular expressions. This function returns TRUE when it matches patterns using standard (or POSIX) notation. The pattern to match is passed as the second parameter to the function and is represented as a string using a combination of literal or meta characters. These can be quite sophisticated and flexible and it is worth reading the documentation for a full list of meta-characters, POSIX classes and references.
First we’ll look for data with a lower-case letter followed by a space and a number using standard pattern notation.
SQL> SELECT * FROM t WHERE REGEXP_LIKE(x, '[a-z] [0-9]');

X
------------------------------
xyz 123
Next we look for data with a lower-case letter followed by any single character (represented by the “.”) then a number.
SQL> SELECT * FROM t WHERE REGEXP_LIKE(x, '[a-z].[0-9]');

X
------------------------------
xyz 123
The question mark character in the following expression represents either zero or one occurrence of the previous pattern (in this example a lower-case letter). Hence this regular expression matches a wider range of records.
SQL> SELECT * FROM t WHERE REGEXP_LIKE(x, '[a-z]?[0-9]');

X
------------------------------
XYZ123
XYZ 123
xyz 123
X1Y2Z3
123123
The “*” wildcard represents zero or more occurrences of the preceding expression, so it could be argued that the following pattern is simply looking for any data containing at least one number (zero or more lower-case characters followed by a digit).
SQL> SELECT * FROM t WHERE REGEXP_LIKE(x, '[a-z]*[0-9]');

X
------------------------------
XYZ123
XYZ 123
xyz 123
X1Y2Z3
123123
To search for a specific number of occurrences of an expression, it is followed by the number of occurrences in braces. The following expression is searching for data with three consecutive capital letters.
SQL> SELECT * FROM t WHERE REGEXP_LIKE(x, '[A-Z]{3}');

X
------------------------------
XYZ123
XYZ 123
This time we search for three occurrences of a capital letter followed by a single digit. To ensure that we search for the entire pattern three times (and not just the digit), we need to wrap the combined expression in parentheses (this is known as a subexpression).
SQL> SELECT * FROM t WHERE REGEXP_LIKE(x, '([A-Z][0-9]){3}');

X
------------------------------
X1Y2Z3
Note that to search for at least three occurrences of the subexpression, the number of occurrences would need to be followed by a comma, i.e. {3,}.
The following expression is one way of searching for all-numeric data. The caret (^) and dollar ($) denote start and end of line, respectively. The plus (+) after the digit class represents one or more occurrences.
SQL> SELECT * FROM t WHERE REGEXP_LIKE(x, '^[0-9]+$');

X
------------------------------
123123
We might sometimes need to match meta-characters, meaning that these must be “escaped”. For example, we’ve seen above that the question mark (?) is a special character meaning zero or one of the preceding pattern in the expression. To search explicitly for data containing this character, we must escape it using a backslash (\).
SQL> SELECT * FROM t WHERE REGEXP_LIKE(x, '\?');

X
------------------------------
?/*.
\?.
To specify “not” in a regular expression, a caret (^) is used within the brackets to exclude the referenced characters. Note that outside of these brackets the caret denotes start of line. The following example searches for data that is not all digits, which in our dataset is all but one record. Note that this is not the same as searching for data without any digits at all (that would require a NOT REGEXP_LIKE() or a REGEXP_INSTR() = 0 expression).
SQL> SELECT * FROM t WHERE REGEXP_LIKE(x, '[^0-9]+');

X
------------
XYZ123
XYZ 123
xyz 123
X1Y2Z3
?/*.
\?.

6 rows selected.
Finally, we can also specify “either-or” using a pipe (|) meta-character. The following searches for all data containing either an ‘X’ or a ‘1’.
SQL> SELECT * FROM t WHERE REGEXP_LIKE(x, 'X|1');

X
------------
XYZ123
XYZ 123
xyz 123
X1Y2Z3
123123

5 rows selected.

regexp_instr

REGEXP_INSTR returns the position of a string that matches a given pattern. As with the INSTR function, we can specify a starting point within the string (default 1) and the nth occurrence to search for. There are additional parameters to REGEXP_INSTR that we shall briefly cover below. First, we’ll search for the position of a literal question mark in our test data.
SQL> SELECT x
  2  ,      REGEXP_INSTR(x, '\?') AS "POSITION_OF_?"
  3  FROM   t;

X                              POSITION_OF_?
------------------------------ -------------
XYZ123                                     0
XYZ 123                                    0
xyz 123                                    0
X1Y2Z3                                     0
123123                                     0
?/*.                                       1
\?.                                        2

7 rows selected.
As mentioned above, REGEXP_INSTR has additional (optional) parameters over and above those we use for standard INSTR-type operations. These additional parameters allow for offset and matching. Offset enables us to specify whether we want the position of the start of the expression match (0) or the position immediately after the end of the regex match (1). A match parameter enables us to request case sensitivity using flags (e.g. 'i' for ignore or 'c' for case sensitivity. Default is whatever is specified in NLS_SORT). The following demonstrates three variations while searching for three capital letters in our test data.
SQL> SELECT x
  2         --<>--
  3  ,      REGEXP_INSTR(
  4            x,
  5            '[A-Z]{3}', --expression
  6            1,          --start at
  7            1,          --nth occurrence
  8            0           --offset position
  9            ) AS regexp_offset_0
 10         --<>--
 11  ,      REGEXP_INSTR(
 12            x,
 13            '[A-Z]{3}',
 14            1,
 15            1,
 16            1
 17            ) AS regexp_offset_1
 18         --<>--
 19  ,      REGEXP_INSTR(
 20            x,
 21            '[A-Z]{3}',
 22            1,
 23            1,
 24            0,
 25            'i'        --match parameter
 26            ) AS regexp_case_insensitive
 27         --<>--
 28  FROM   t;

X            REGEXP_OFFSET_0 REGEXP_OFFSET_1 REGEXP_CASE_INSENSITIVE
------------ --------------- --------------- -----------------------
XYZ123                     1               4                       1
XYZ 123                    1               4                       1
xyz 123                    0               0                       1
X1Y2Z3                     0               0                       0
123123                     0               0                       0
?/*.                       0               0                       0
\?.                        0               0                       0

7 rows selected.

regexp_substr

REGEXP_SUBSTR is incredibly useful. It returns the part of the string that matches the pattern only. It follows the same convention as REGEXP_INSTR with respect to starting position, nth occurrence and match-parameter arguments. However, it doesn't work like the standard SUBSTR, for which you specify the length of string to return.
For a very simple demonstration of REGEXP_SUBSTR, we’ll look at the default behaviour when searching for a string of one or more digits. We’ll also try to strip a second occurrence of the same pattern by using the relevant parameter. Incidentally, I’ve deliberately introduced an alternative means of expressing “digits-only” by using the POSIX character class (there are quite a few of these classes described in the online documentation).
SQL> set null {null}

SQL> SELECT x
  2         --<>--
  3  ,      REGEXP_SUBSTR(
  4            x,
  5            '[[:digit:]]+' --note POSIX char-class
  6            ) AS first_occurrence
  7         --<>--
  8  ,      REGEXP_SUBSTR(
  9            x,
 10            '[[:digit:]]+',
 11            1,
 12            2
 13            ) AS second_occurrence
 14  FROM   t;

X            FIRST_OCCURRENCE     SECOND_OCCURRENCE
------------ -------------------- --------------------
XYZ123       123                  {null}
XYZ 123      123                  {null}
xyz 123      123                  {null}
X1Y2Z3       1                    2
123123       123123               {null}
?/*.         {null}               {null}
\?.          {null}               {null}

7 rows selected.

regexp_replace

Finally, we’ll take a brief look at REGEXP_REPLACE. As its name suggests, this function searches for a specified number of occurrences (default all) of a pattern and replaces it with a given string or NULL. As with REGEXP_INSTR and REGEXP_SUBSTR, we can supply starting position, nth occurrence and a match-parameter. In the following simple example, we’ll replace all digits with a hyphen (-).
SQL> SELECT x
  2  ,      REGEXP_REPLACE(x, '[[:digit:]]', '-') AS nums_to_hyphens
  3  FROM   t;

X            NUMS_TO_HYPHENS
------------ --------------------
XYZ123       XYZ---
XYZ 123      XYZ ---
xyz 123      xyz ---
X1Y2Z3       X-Y-Z-
123123       ------
?/*.         ?/*.
\?.          \?.

7 rows selected.
What if we need to replace a given pattern with the pattern itself plus some additional data? REGEXP_REPLACE enables us to back-reference the regular expression (up to 500 expressions in fact) and include it in our replacement. The following example shows how we can replace all digits with the digits themselves followed by an asterisk. We enclose the regular expression in parentheses and reference it in the replacement string using \n (in our case \1 as we only have one preceding expression).

SQL> SELECT x
  2  ,      REGEXP_REPLACE(
  3            x,
  4            '([0-9])', --note the parentheses
  5            '\1*'      --\1 means include 1st pattern
  6            ) AS search_pattern_included
  7  FROM   t;

X            SEARCH_PATTERN_INCLUDED
------------ -----------------------------------
XYZ123       XYZ1*2*3*
XYZ 123      XYZ 1*2*3*
xyz 123      xyz 1*2*3*
X1Y2Z3       X1*Y2*Z3*
123123       1*2*3*1*2*3*
?/*.         ?/*.
\?.          \?.

7 rows selected.

putting regular expressions to use

Once we become familiar with regular expressions in Oracle, we’ll find many uses for them, replacing older methods that used nested SUBSTR(INSTR,INSTR) and so on. The web is also awash with new solutions to old problems that are served by regular expressions in 10g. For example, postcode (zip) validation, telephone number validation, “IS_NUMBER”, email address validation and so on.


No comments:

Post a Comment