Regular expressions

This article is written so that anyone can learn how to use regular expressions, even readers who know nothing about them.

Introduction

Probably every programmer has at least heard of regular expressions. A number of day-to-day tasks involve finding specific data in text according to predefined rules, validating user-supplied data, or modifying text in some way.

Even though each of these tasks can be solved by splitting a string into characters and processing them one by one, that is rarely the best approach.

To illustrate, let’s look at code that checks a username for validity. There is no need to understand the regular expression yet; just compare the two approaches:

import java.util.regex.Matcher;  
import java.util.regex.Pattern;  
   
public class UserNameCheck {  
      
    public static void main(String[] args){  
        System.out.println("Cool check:");  
          
        System.out.println(checkWithRegExp("_@BEST"));  
        System.out.println(checkWithRegExp("vovan"));  
        System.out.println(checkWithRegExp("vo"));  
        System.out.println(checkWithRegExp("Z@OZA"));  
          
        System.out.println("\nDumb check:");  
          
        System.out.println(dumbCheck("_@BEST"));  
        System.out.println(dumbCheck("vovan"));  
        System.out.println(dumbCheck("vo"));  
        System.out.println(dumbCheck("Z@OZA"));  
    }  
      
    public static boolean checkWithRegExp(String userNameString){  
        Pattern p = Pattern.compile("^[a-z0-9_-]{3,15}$");  
        Matcher m = p.matcher(userNameString);  
        return m.matches();  
    }  
      
    public static boolean dumbCheck(String userNameString){  
          
        char[] symbols = userNameString.toCharArray();  
        if(symbols.length < 3 || symbols.length > 15 ) return false;  
          
        // Same character set as the regular expression above, including the hyphen
        String validationString = "abcdefghijklmnopqrstuvwxyz0123456789_-";  
          
        for(char c : symbols){  
            if(validationString.indexOf(c)==-1) return false;  
        }  
          
        return true;  
    }  
} 
Result:
Cool check:
false
true
false
false

Dumb check:
false
true
false
false

As we can see, the program contains two methods that check whether a username is valid. The first one, checkWithRegExp(String userNameString), uses a regular expression, while the second one, dumbCheck(String userNameString), does the check “manually”.

Both produce the same results, but if we later decide to change the validation rules, in the first case it is enough to modify the regular expression string. In the “classical” dumbCheck we would have to rewrite the checking logic itself.
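
To make this concrete, here is a minimal sketch of such a change. It assumes, purely for illustration, that the rules are relaxed to allow upper-case letters as well; the relaxed rule and the class name RelaxedUserNameCheck are hypothetical and not part of the original program. Only the pattern string changes:

```java
import java.util.regex.Pattern;

class RelaxedUserNameCheck {

    // Original rule: 3 to 15 lower-case letters, digits, underscores or hyphens.
    static final Pattern STRICT = Pattern.compile("^[a-z0-9_-]{3,15}$");

    // Hypothetical relaxed rule: upper-case letters are allowed too.
    static final Pattern RELAXED = Pattern.compile("^[a-zA-Z0-9_-]{3,15}$");

    // Returns true if the whole user name satisfies the given rule.
    static boolean check(Pattern rule, String userName) {
        return rule.matcher(userName).matches();
    }

    public static void main(String[] args) {
        System.out.println(check(STRICT, "Vovan"));   // false
        System.out.println(check(RELAXED, "Vovan"));  // true
        System.out.println(check(RELAXED, "Z@OZA"));  // false, '@' is still not allowed
    }
}
```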

General concepts

A regular expression (abbreviated regex or regexp, and sometimes called a rational expression) is a string written in a formal language for searching and manipulating substrings within text, based on the use of metacharacters (wildcards). Simply put, it is a string of ordinary characters and metacharacters that specifies a search pattern.

Regular expressions are used in a number of programming languages. Java has a dedicated package for working with them: java.util.regex.

“Knowing how to wield regular expressions unleashes processing powers you might not even know were available. Numerous times in any given day, regular expressions help me solve problems both large and small (and quite often, ones that are small but would be large if not for regular expressions). With specific examples that provide the key to solving a large problem, the benefit of regular expressions is obvious. Perhaps not so obvious is the way they can be used throughout the day to solve rather "uninteresting" problems. "Uninteresting" in the sense that such problems are not often the subject of barroom war stories, but quite interesting in that until they're solved, you can't get on with your real work. I find the ability to quickly save an hour of frustration to be somehow exciting. As a simple example, I needed to check a slew of files (the 70 or so files comprising the source for this book, actually) to confirm that each file contained 'SetSize' exactly as often (or as rarely) as it contained 'ResetSize'. To complicate matters, I needed to disregard capitalization (such that, for example, 'setSIZE' would be counted just the same as 'SetSize'). The thought of inspecting the 32,000 lines of text by hand makes me shudder. Even using the normal "find this word" search in an editor would have been truly arduous, what with all the files and all the possible capitalization differences. Regular expressions to the rescue! Typing just a single, short command, I was able to check all files and confirm what I needed to know. Total elapsed time: perhaps 15 seconds to type the command, and another 2 seconds for the actual check of all the data. Wow!” Mastering Regular Expressions, Jeffrey E.F. Friedl

Metacharacters

The basic idea behind regular expressions is that some characters within a pattern string are treated not as ordinary characters but as ones with a special meaning (the so-called metacharacters). This idea is what makes the whole mechanism work: every metacharacter has its own role.

Here are some basic metacharacters (the backslashes are doubled because that is how they are written inside a Java string literal):

  • ^ - (circumflex) beginning of the string
  • $ - (dollar sign) end of the string
  • . - (dot) matches any single character
  • | - logical «OR». Substrings combined in this manner are called alternatives
  • ? - (question mark) means that the preceding element is optional
  • + - means “one or more instances of the immediately preceding element”
  • * - any number of instances of the preceding element (including zero)
  • \\d - a digit
  • \\D - any character that is not a digit
  • \\s - a whitespace character
  • \\S - any character that is not whitespace
  • \\w - a letter, a digit or an underscore
  • \\W - any character other than a letter, a digit or an underscore
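
A quick way to get a feel for these metacharacters is to try them on short inputs. The following is a minimal sketch; the class name MetacharacterDemo and the helper fullMatch are made up for this illustration, while Pattern.matches is the standard method that tests a whole string against a regular expression:

```java
import java.util.regex.Pattern;

class MetacharacterDemo {

    // Returns true if the whole input matches the regular expression.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("colou?r", "color"));           // true: 'u' is optional
        System.out.println(fullMatch("colou?r", "colour"));          // true
        System.out.println(fullMatch("\\d+", "2024"));               // true: one or more digits
        System.out.println(fullMatch("\\d+", ""));                   // false: '+' needs at least one
        System.out.println(fullMatch("\\d*", ""));                   // true: '*' accepts zero
        System.out.println(fullMatch("\\w+\\s\\w+", "hello world")); // true: word, whitespace, word
        System.out.println(fullMatch("c.t", "cat"));                 // true: '.' is any character
        System.out.println(fullMatch("cat|dog", "dog"));             // true: either alternative
    }
}
```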

Let’s consider several examples with some of the above-described metacharacters.

The following method checks whether a string is exactly the word BACON, with no spaces or any other characters. We will talk about the Pattern and Matcher classes later; for now it is enough to know that the matches() method checks whether the entire string corresponds to the regular expression.

public static boolean test(String testString){  
        Pattern p = Pattern.compile("^BACON$");  
        Matcher m = p.matcher(testString);  
        return m.matches();  
} 

Here ^BACON$ = beginning of the string + BACON + end of the string


        System.out.println(test("BACON"));      //true  
        System.out.println(test("  BACON"));    //false  
        System.out.println(test("BACON  "));    //false  
        System.out.println(test("^BACON$"));    //false  
        System.out.println(test("bacon"));      //false 
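
Since matches() always tests the entire string, the ^ and $ anchors start to matter mainly when searching inside a longer text with find(), which looks for the pattern anywhere in the input. A small sketch of the difference, with class and method names invented for the illustration:

```java
import java.util.regex.Pattern;

class MatchesVersusFind {

    private static final Pattern BACON = Pattern.compile("BACON");

    // Whole-string check: true only if the entire input is BACON.
    static boolean isExactlyBacon(String input) {
        return BACON.matcher(input).matches();
    }

    // Substring search: true if BACON occurs anywhere in the input.
    static boolean containsBacon(String input) {
        return BACON.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(isExactlyBacon("BACON"));    // true
        System.out.println(isExactlyBacon("BACON  "));  // false
        System.out.println(containsBacon("BACON  "));   // true
        System.out.println(containsBacon("bacon"));     // false: matching is case-sensitive
    }
}
```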

Let’s go further and write a simple check that a string ends with .com, .ru or .ua: a kind of URL validator, but a very simplified one.


    public static boolean test(String testString){  
        Pattern p = Pattern.compile(".+\\.(com|ua|ru)");  
        Matcher m = p.matcher(testString);  
        return m.matches();  
    }

Its execution result:


        System.out.println(test("trololo.com"));     //true  
        System.out.println(test("trololo.ua "));     //false  
        System.out.println(test("trololo.ua"));      //true  
        System.out.println(test("trololo/ua"));      //false  
        System.out.println(test("i love java because it is cool.ru"));      //true  
        System.out.println(test("BACON.ru"));        //true  
        System.out.println(test("BACON.de"));        //false 

Let’s consider the ".+\\.(com|ua|ru)" string in more detail:

  • .+ - any character (the dot) repeated one or more times, so at least one character must precede the ending we are interested in
  • \\. - an escaped dot. This way we say that a literal dot must follow, rather than any character.
  • (com|ua|ru) - logical OR: either com, ua or ru. (But what would happen if we omitted the parentheses? We would get ".+\\.com" or "ua" or "ru", which is not what we want :) ).
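
The effect of omitting the parentheses is easy to check. In the sketch below (the class and method names are made up for the illustration), the ungrouped pattern accepts the bare string "ua" and rejects "trololo.ua":

```java
class AlternationScope {

    // Alternation limited to the domain suffix by the parentheses.
    static boolean withGroup(String s) {
        return s.matches(".+\\.(com|ua|ru)");
    }

    // Without parentheses the alternatives are ".+\.com", "ua" and "ru".
    static boolean withoutGroup(String s) {
        return s.matches(".+\\.com|ua|ru");
    }

    public static void main(String[] args) {
        System.out.println(withGroup("trololo.ua"));     // true
        System.out.println(withGroup("ua"));             // false
        System.out.println(withoutGroup("trololo.ua"));  // false
        System.out.println(withoutGroup("ua"));          // true
        System.out.println(withoutGroup("trololo.com")); // true
    }
}
```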

Character classes

Sometimes we need to match THE SAME CHARACTER written in several ways. For example, suppose we want to find the word "Thailand" in a text and replace it with some other word, but the word may be written in several ways, for instance with a lower-case first letter instead of a capital one.

Of course, we could use the OR metacharacter and write one of the following regular expressions:
"Thailand|thailand"
"(T|t)hailand"
And it will indeed work. BUT regular expressions provide a more elegant way of handling such situations.
We are talking about the so-called character class: it defines a set of characters that may (or may not) appear at a given position.
A character class matches exactly one character of the processed string, and that character must belong to the set specified by the class.
For example, the character class [aeiou] matches any single lower-case vowel.

Let’s implement our Thailand example using a character class:


public class Rexep {

    public static final String TEXT = "I really like thailand. Thailand is a great country!";

    public static void main(String[] args){
        // [Tt] matches either an upper-case or a lower-case first letter
        System.out.println(TEXT.replaceAll("[Tt]hailand", "Some other country"));
    }

}

An important feature of character classes: the metacharacters listed above either do not work inside them or work differently! Do not be confused: everything inside the square brackets is a character class that describes a single character.

Character classes have their own metacharacters within them:

  • ^ - logical NOT when it is the first character in the class. For example, [^ABC] means not (A or B or C); any other character is fine.
  • - - character range. For example, the expression H[1-6] is equivalent to H[123456]
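
The negation and range notations, as well as the different behaviour of the dot inside square brackets, can be tried out with a short sketch. The class name CharacterClassDemo and the helper fullMatch are assumptions made for this illustration:

```java
import java.util.regex.Pattern;

class CharacterClassDemo {

    // Returns true if the whole input matches the regular expression.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("H[1-6]", "H3"));  // true: 3 is inside the range
        System.out.println(fullMatch("H[1-6]", "H7"));  // false: 7 is outside the range
        System.out.println(fullMatch("[^ABC]", "D"));   // true: anything except A, B, C
        System.out.println(fullMatch("[^ABC]", "A"));   // false
        System.out.println(fullMatch(".", "x"));        // true: outside a class '.' is any character
        System.out.println(fullMatch("[.]", "x"));      // false: inside a class '.' is a literal dot
        System.out.println(fullMatch("[.]", "."));      // true
    }
}
```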

Example:


    public static boolean test(String testString){  
        Pattern p = Pattern.compile("^[a-z]+");  
        Matcher m = p.matcher(testString);  
        return m.matches();  
    } 

Results:


        System.out.println(test("pizza"));   //true  
        System.out.println(test("@pizza"));  //false  
        System.out.println(test("pizza3"));  //false 

"^[a-z]+" = beginning of the string + any character from inside the a-z range (i.e., abcdef...z) any number of times (but not less than once).

Backreferences

In addition to logically grouping parts of an expression, parentheses create the so-called capturing groups. They are useful when your regular expression contains several identical parts: it is enough to describe the group once and then simply refer to it.

For example:

public static void main(String[] args){  
         
        Pattern p = Pattern.compile("(somepattern).+(\\1)");  
        Matcher m = p.matcher("it is an article about regular expressions it is an article about regular expressions it is an article about regular expressions somepattern it is an article somepattern about regular expressions" );  
        if(m.find()){  
            System.out.println(m.group());  
        }  
}

Here is the result:

somepattern it is an article somepattern 

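The first group (somepattern) could contain a more complex expression, in which case the backreference \\1 would make the regular expression considerably shorter. Keep in mind that \\1 matches the exact text captured by group 1, not any text that fits the group's pattern.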
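
A common practical use of \\1 is spotting a word that was accidentally typed twice. The sketch below is one possible way to do it; the class name and method are hypothetical, and it relies on \\b (a word boundary), which this article does not cover:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedWordFinder {

    // (\w+) captures a word, \s+ is the gap, and \1 must repeat
    // exactly the text that group 1 captured.
    private static final Pattern DOUBLED = Pattern.compile("\\b(\\w+)\\s+\\1\\b");

    // Returns the first doubled word, or null if there is none.
    static String firstRepeatedWord(String text) {
        Matcher m = DOUBLED.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstRepeatedWord("this is is a test")); // is
        System.out.println(firstRepeatedWord("no repeats here"));   // null
    }
}
```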

Groups are numbered from left to right starting with 1. Every opening parenthesis increases a group’s number:


(  (  )  )(  (   )  )
^  ^      ^  ^
1  2      3  4

Group zero is the entire match.
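
The numbering can be checked on a small example. The sketch below uses a made-up date-like pattern with nested parentheses (the pattern, the class name and the helper are assumptions for illustration) and collects every group from 0 up to groupCount():

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupNumbering {

    //  group 1 = year-month, group 2 = year, group 3 = month, group 4 = day
    private static final Pattern DATE = Pattern.compile("((\\d+)-(\\d+))-(\\d+)");

    // Returns groups 0..4 of the match, or an empty array if the input does not match.
    static String[] groups(String input) {
        Matcher m = DATE.matcher(input);
        if (!m.matches()) {
            return new String[0];
        }
        String[] result = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            result[i] = m.group(i);
        }
        return result;
    }

    public static void main(String[] args) {
        String[] g = groups("2024-05-17");
        for (int i = 0; i < g.length; i++) {
            System.out.println(i + ": " + g[i]);
        }
        // 0: 2024-05-17
        // 1: 2024-05
        // 2: 2024
        // 3: 05
        // 4: 17
    }
}
```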

Quantifiers

Regular expressions let you specify how many times a character or a group of characters may repeat. You have already seen some of these quantifiers:

  • + - one or more times
  • * - zero or more times
  • ? - zero or one time
  • {n} - exactly n times
  • {m,n} - m to n times inclusive
  • {m,} - not less than m times
  • {0,n} - not more than n times (Java does not accept the shorter {,n} form)
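
The counted quantifiers are easiest to understand by trying them on runs of the same letter. A minimal sketch, with the class name QuantifierDemo and the helper fullMatch invented for the illustration; an upper bound alone is written here as {0,n}:

```java
import java.util.regex.Pattern;

class QuantifierDemo {

    // Returns true if the whole input matches the regular expression.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("a{3}", "aaa"));      // true: exactly three
        System.out.println(fullMatch("a{3}", "aa"));       // false
        System.out.println(fullMatch("a{2,4}", "aaaa"));   // true: between two and four
        System.out.println(fullMatch("a{2,4}", "aaaaa"));  // false: five is too many
        System.out.println(fullMatch("a{2,}", "aaaaaaa")); // true: two or more
        System.out.println(fullMatch("a{0,2}", ""));       // true: at most two, zero is fine
        System.out.println(fullMatch("a{0,2}", "aaa"));    // false
    }
}
```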

Now we can fully understand the regular expression from the very first example: ^[a-z0-9_-]{3,15}$ .

Let’s analyze it piece by piece:

  • ^ - beginning of the string
  • [a-z0-9_-] - a character that is a lower-case Latin letter, a digit, an underscore or a hyphen
  • {3,15} - the preceding element may repeat from 3 to 15 times
  • $ - end of the string

Real-life example:

Let’s consider a regular expression that checks whether an IP address is valid.

import java.util.regex.Matcher;  
import java.util.regex.Pattern;  
   
public class IPAddressValidator{

    private static final String IPADDRESS_PATTERN =
        "^([01]?\\d\\d?|2[0-4]\\d|25[0-5])\\." +
        "([01]?\\d\\d?|2[0-4]\\d|25[0-5])\\." +
        "([01]?\\d\\d?|2[0-4]\\d|25[0-5])\\." +
        "([01]?\\d\\d?|2[0-4]\\d|25[0-5])$";

    private final Pattern pattern;

    public IPAddressValidator(){
        pattern = Pattern.compile(IPADDRESS_PATTERN);
    }

   /**
    * Validate ip address with regular expression
    * @param ip ip address for validation
    * @return true valid ip address, false invalid ip address
    */
    public boolean validate(final String ip){
        // A local Matcher keeps the validator safe to share between threads
        Matcher matcher = pattern.matcher(ip);
        return matcher.matches();
    }
}
Source: how-to-validate-ip-address-with-regular-expression


^                        # beginning of the string
(                        #   beginning of group 1
  [01]?\\d\\d?           #     1 to 3 digits: an optional 0 or 1, then any digit,
                         #     then an optional digit (covers 0-199)
  |                      #     OR
  2[0-4]\\d              #     2, then a digit from 0-4, then any digit (200-249)
  |                      #     OR
  25[0-5]                #     25, then a digit from 0-5 (250-255)
)                        #   end of the group
\\.                      #   a literal dot
...                      # the group and the dot appear three times, followed by
                         # a fourth group without a trailing dot
$                        # end of the string
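
To see the pattern at work, it helps to run it against a few addresses. The sketch below rebuilds the same expression from a single OCTET constant; that constant, the class name IPAddressCheck and the method isValid are assumptions made for this illustration and are not part of the original validator:

```java
import java.util.regex.Pattern;

class IPAddressCheck {

    // One number from 0 to 255, written exactly as in the pattern above.
    private static final String OCTET = "([01]?\\d\\d?|2[0-4]\\d|25[0-5])";

    // Four octets separated by literal dots.
    private static final Pattern IP = Pattern.compile(
            "^" + OCTET + "\\." + OCTET + "\\." + OCTET + "\\." + OCTET + "$");

    static boolean isValid(String ip) {
        return IP.matcher(ip).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("192.168.1.1"));     // true
        System.out.println(isValid("255.255.255.255")); // true
        System.out.println(isValid("256.1.1.1"));       // false: 256 is out of range
        System.out.println(isValid("1.2.3"));           // false: only three parts
        System.out.println(isValid("a.b.c.d"));         // false: not digits
        System.out.println(isValid("01.02.03.04"));     // true: leading zeros are accepted
    }
}
```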

It’s not as hard as it may seem at first! Practice will make you more comfortable with regular expressions.

Regular expressions in Java

The java.util.regex package provides support for regular expressions. Its three main classes are Pattern, Matcher and PatternSyntaxException (there is also the MatchResult interface, as well as internal helper classes such as ASCII and UnicodeProp).

1. Class Pattern - A regular expression given as a string must first be compiled into an object of this class. Once compiled, the object can be used to create a Matcher object.

The following methods are defined in the Pattern class:

  • static Pattern compile(String regex) – returns a Pattern that corresponds to regex.
  • Matcher matcher(CharSequence input) – returns a Matcher that can be used to find matches in the input string.
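
As a rough sketch of how these pieces fit together: a Pattern is typically compiled once and reused for many inputs, and compile() throws PatternSyntaxException when the expression itself is malformed. The class and method names below are hypothetical:

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class PatternCompileDemo {

    // Compiled once, then reused for every call.
    private static final Pattern DOMAIN = Pattern.compile(".+\\.(com|ua|ru)");

    static boolean looksLikeDomain(String s) {
        return DOMAIN.matcher(s).matches();
    }

    // Returns true if the string is a syntactically valid regular expression.
    static boolean isValidRegex(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(looksLikeDomain("trololo.com"));      // true
        System.out.println(isValidRegex(".+\\.(com|ua|ru)"));    // true
        System.out.println(isValidRegex(".+\\.(com|ua|ru"));     // false: the group is never closed
    }
}
```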

2. Class Matcher
A Matcher object scans the input string starting from index 0 and looks for matches of the pattern. Afterwards the Matcher holds information about the match that was found (or not found) in the input string. This information is available through the methods of the Matcher object:

  • boolean matches() indicates whether the entire input sequence matches the pattern.
  • int start() returns the index of the first character of the match
  • int end() returns the index just past the last character of the match (the index of the last matched character + 1)
  • String group() - returns the matched substring
  • String group(int group) - if your regular expression contains groups, this method returns the part of the match that corresponds to the given group.
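
These methods are usually combined with find(), which advances to the next match on every call. A minimal sketch that reports each run of digits together with its position; the class name, the method and the output format are made up for the illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherDemo {

    // Describes every run of digits in the text as "value@start-end".
    static List<String> describeNumbers(String text) {
        Matcher m = Pattern.compile("\\d+").matcher(text);
        List<String> found = new ArrayList<>();
        while (m.find()) {
            // start() is the index of the first matched character,
            // end() is the index just past the last one.
            found.add(m.group() + "@" + m.start() + "-" + m.end());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(describeNumbers("room 42, floor 7")); // [42@5-7, 7@15-16]
    }
}
```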

String class

Let’s take a look at how some methods of the String class are implemented in the JDK:

public boolean matches(String regex) {  
        return Pattern.matches(regex, this);  
} 

The method that replaces the first match found:

public String replaceFirst(String regex, String replacement) {  
        return Pattern.compile(regex).matcher(this).replaceFirst(replacement);  
} 

The method that replaces all matches found:

public String replaceAll(String regex, String replacement) {  
        return Pattern.compile(regex).matcher(this).replaceAll(replacement);  
} 

The method that replaces every occurrence of the literal target character sequence with replacement:

public String replace(CharSequence target, CharSequence replacement) {  
        return Pattern.compile(target.toString(), Pattern.LITERAL).matcher(  
                this).replaceAll(Matcher.quoteReplacement(replacement.toString()));  
}

As we can see, even the String class uses regular expressions (at least in the JDK version quoted here).
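
One practical consequence is worth a quick illustration: replace() treats its first argument as plain text, while replaceAll() and replaceFirst() treat it as a regular expression, so an unescaped dot behaves very differently. The class and method names in this sketch are hypothetical:

```java
class StringRegexMethods {

    // replace(): the target is plain text, so only real dots are replaced.
    static String replaceLiteralDots(String s) {
        return s.replace(".", "_");
    }

    // replaceAll() with an unescaped dot: '.' matches every character.
    static String replaceRegexDot(String s) {
        return s.replaceAll(".", "_");
    }

    // replaceAll() with an escaped dot: only real dots are replaced.
    static String replaceEscapedDots(String s) {
        return s.replaceAll("\\.", "_");
    }

    // replaceFirst(): only the first real dot is replaced.
    static String replaceFirstDot(String s) {
        return s.replaceFirst("\\.", "_");
    }

    public static void main(String[] args) {
        String file = "a.b.c";
        System.out.println(replaceLiteralDots(file)); // a_b_c
        System.out.println(replaceRegexDot(file));    // _____
        System.out.println(replaceEscapedDots(file)); // a_b_c
        System.out.println(replaceFirstDot(file));    // a_b.c
    }
}
```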

Search

There are two main types of regular expression engines: the nondeterministic finite automaton (NFA) and the deterministic finite automaton (DFA), although hybrid engines also exist.

Java uses an NFA engine.

Let’s consider the NFA algorithm (as described in Mastering Regular Expressions by Jeffrey E.F. Friedl) as it looks for a match of the expression to(nite|knight|night) in the text ‘…tonight…’. The engine examines the regular expression component by component, starting with t, and checks whether the component matches the current position in the text. If it does, the next component is checked. This procedure is repeated until a match has been found for every component of the regular expression, at which point an overall match is said to be found.

In our example to(nite|knight|night), t is the first component. The check fails until a ‘t’ is found in the text. When that happens, o is compared to the following character and, if it matches, the engine moves on to the next component. In our case the next component is (nite|knight|night), which means “either nite, or knight, or night”. Faced with three alternatives, the engine simply tries them one by one.

Every alternative is checked in exactly the same way, one character at a time. If an alternative fails, the engine moves on to the next one and keeps doing so until it either finds a match or runs out of alternatives (in which case it reports a failure).
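
The left-to-right order in which alternatives are tried can be observed from Java code. In this sketch (class and method names are invented for the illustration), the first alternative that leads to a match wins, even if a later one would have matched more text:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationOrder {

    // Returns the text captured by group 1 at the first match, or null if nothing matches.
    static String firstGroup(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // nite and knight fail, so the third alternative is the one that matches.
        System.out.println(firstGroup("to(nite|knight|night)", "...tonight...")); // night
        // Alternatives are tried in the order they are written.
        System.out.println(firstGroup("(a|ab)", "ab")); // a
        System.out.println(firstGroup("(ab|a)", "ab")); // ab
    }
}
```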

Conclusion

This article explains the notion of regular expressions in a way that even a reader unfamiliar with them should understand. To strengthen your understanding, the book “Mastering Regular Expressions” by Jeffrey E.F. Friedl is strongly recommended.


Original text by Vladimir Vyshko

