
Introduction to regular expressions Part 2 - ERE POSIX

This tutorial assumes you have already read 'Part 1: General Mechanics' and want to continue your research into POSIX. It also assumes you know the basic parts of a regular expression, such as literal text, metacharacters, and whitespace. If you do not, go read that tutorial now, and then return. Even if you know a little regex but find this tutorial hard to follow, read that one first and then tackle this one. Overall I would say that regular expressions are easy, but only if you have a firm grasp of the most basic concepts.

An Introduction to POSIX
POSIX stands for 'portable operating system interface', and POSIX regex comes in two flavors: BRE (Basic Regular Expressions) and ERE (Extended Regular Expressions). PHP utilizes ERE POSIX, so that is what this tutorial will cover. Assuming you have already read my tutorial on the General Mechanics of regex, I figure you know enough about the basic parts of a regular expression to jump into a few of the metacharacters provided by ERE POSIX. The basic components (ranges such as a-z0-9, character classes such as [a-z0-9A-Z], whitespace, etc.) are the same as the components discussed in the General Mechanics tutorial. The main new parts are the additional metacharacters and the PHP functions available. The best way to learn a subject like regex is from experience, and you don't get experience from listening to me babble... so let's get to work!

'Special Range' Metacharacters
Special range metacharacters are named classes, written like [[:digit:]] or [[:alpha:]], that match common ranges of characters. These will be the most important building blocks of your regular expressions.
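As a rough reference, the sketch below lists the standard POSIX bracket classes and checks a sample character against each one. It assumes a modern PHP, where the ereg_ functions are no longer available (they were deprecated in PHP 5.3 and removed in PHP 7.0), so it uses preg_match(), whose PCRE engine accepts the same [[:class:]] syntax. The $classes array and the matchesClass() helper are names made up for this illustration.

```php
<?php
// Standard POSIX bracket classes with a rough description of each.
$classes = array(
    'alnum'  => 'letters and digits',
    'alpha'  => 'letters',
    'blank'  => 'space and tab',
    'cntrl'  => 'control characters',
    'digit'  => 'decimal digits',
    'graph'  => 'printable characters except space',
    'lower'  => 'lowercase letters',
    'print'  => 'printable characters including space',
    'punct'  => 'punctuation',
    'space'  => 'whitespace',
    'upper'  => 'uppercase letters',
    'xdigit' => 'hexadecimal digits',
);

// Hypothetical helper: does $char belong to the POSIX class $class?
function matchesClass($class, $char)
{
    // A class is only valid inside a bracket expression, e.g. [[:digit:]].
    return preg_match('/^[[:' . $class . ':]]$/', $char) === 1;
}

foreach ($classes as $name => $description) {
    $hit = matchesClass($name, '7') ? 'yes' : 'no';
    echo "[[:$name:]] ($description) matches '7': $hit\n";
}
```

Note that a class name only works inside an outer pair of brackets, which is why every pattern in this tutorial is written [[:digit:]] rather than [:digit:].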


All of these are available to you from within the PHP functions which utilize ERE POSIX (all of the ereg_ functions).
In the next section we will begin writing our first POSIX regular expressions and exploring a few of the most basic ereg_ functions.

--------------------------------------------------------------------------------


Beginning Ereg
Let's take a look at the structure of the ereg function (known as the 'prototype' of the function):
int ereg ( string pattern, string string [, array regs])

int
This tells us that the function will return an int. If there was a match, the function returns a non-zero value that evaluates to TRUE (1; newer PHP versions return the length of the matched string when $regs is passed). If not, it returns FALSE.

string pattern
This is the pattern that you will search for. You will put your POSIX regular expression inside this parameter, and it MUST be a string, meaning inside "" or ''.

string string
This will be the parameter which holds the string you will match the regular expression against.

array regs
This parameter is the optional array, $regs. If you have a regular expression which contains () and you pass this third parameter, PHP will populate the $regs array with the matches of each (pattern). $regs[0] will contain a copy of the entire string matched, and $regs[1] and up will contain the matches of each parenthesized subpattern (from left to right in the regex). Let's take a look at a quick example of this below.


<?php

// This will take the date in YYYY-MM-DD and print it in DD.MM.YYYY.
// This example was taken from the PHP manual.

$date = "1986-05-28";

if (ereg("([0-9]{4})-([0-9]{1,2})-([0-9]{1,2})", $date, $regs)) {
    echo "$regs[3].$regs[2].$regs[1]";
} else {
    echo "Invalid date format: $date";
}

?>

You should already be able to understand what the regex does, but I'll go over it again anyway:
Four digits, followed by -, followed by at least one but no more than two digits, followed by -, followed by at least one but no more than two digits.
Now, if that is found (which in this case it will be), ereg() will return a true value and 'echo "$regs[3].$regs[2].$regs[1]";' will be performed. $regs, as you can see, has been populated with the text matched by each parenthesized expression, from left to right. ($regs[1] is ([0-9]{4}), $regs[2] is the first ([0-9]{1,2}), and $regs[3] is the second ([0-9]{1,2}).)
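For readers on PHP 7 or later, where ereg() has been removed, the same idea can be sketched with preg_match(). This is only an approximation under that assumption, not the original author's code; the reformatDate() function name is invented for the example. The capture array behaves the same way: index 0 holds the whole match and indexes 1 and up hold the groups.

```php
<?php
// Hypothetical helper: converts YYYY-MM-DD to DD.MM.YYYY, or returns null
// when the input contains no date in that shape.
function reformatDate($date)
{
    // preg_match() needs delimiters (the slashes) around the pattern.
    if (preg_match('/([0-9]{4})-([0-9]{1,2})-([0-9]{1,2})/', $date, $regs)) {
        // $regs[0] is the whole match; $regs[1..3] are the groups, left to right.
        return "$regs[3].$regs[2].$regs[1]";
    }
    return null;
}

echo reformatDate("1986-05-28"), "\n";   // prints 28.05.1986
var_dump(reformatDate("not a date"));    // NULL
```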

Now let me rewrite this, using POSIX's new range metacharacters.




<?php

$date = "1986-05-28";

if (ereg("([[:digit:]]{4})-([[:digit:]]{1,2})-([[:digit:]]{1,2})", $date, $regs)) {
    echo "$regs[3].$regs[2].$regs[1]";
} else {
    echo "Invalid date format: $date";
}

?>



As you can see (if you actually tested these two scripts, and if you haven't, you should do so now), these two scripts do (almost) exactly the same thing. There is one difference 'under the hood', and that is 'multi-byte character' matching. This will be explained in more detail once I've shown you a little more about ereg().

Another Ereg Example
Here is another example of ereg() using the special range metacharacters. Say you have a report system for your website, and in this system you can report many things, including bugs, spelling errors, broken links, etc. Say you want to parse the report file (Log.txt) and grab only the bug reports. This is how:


<?php

$log = file('Log.txt');

foreach ($log as $log_data) {
    if (ereg('Bug# ([[:digit:]]+): (.+) END', $log_data, $regs)) {
        echo 'Bug#: '.$regs[1].': '.$regs[2]."\n";
    }
}

?>


Now, let's assume that the data in Log.txt is as follows:

Bug# 1: call to undefined function, mysql_conect END
Spelling# 1: grammer should be grammar END
Broken_Link# 1: /games/demo.zip doesn't exist! END
Bug# 2: call to undefined function, ob_stat END
Spelling# 2: popcern should be popcorn END
Broken_Link# 2: /images/popcern.jpg doesn't exist! END

As you can see, it's just a regular old error report log. Assuming that you had those errors in your Log.txt file, the output of the script above would have been:

Bug#: 1: call to undefined function, mysql_conect
Bug#: 2: call to undefined function, ob_stat

Pretty simple. Now let's take that script a tad further and output each type of report (bugs, spelling errors, and broken links).


<?php

$log = file('Log.txt');

foreach ($log as $log_data) {
    if (ereg('Bug# ([[:digit:]]+): (.+) END', $log_data, $regs)) {
        echo 'Bug#: '.$regs[1].': '.$regs[2]."\n";
    }
}

foreach ($log as $log_data) {
    if (ereg('Spelling# ([[:digit:]]+): (.+) END', $log_data, $regs)) {
        echo 'Spelling Error#: '.$regs[1].': '.$regs[2]."\n";
    }
}

foreach ($log as $log_data) {
    if (ereg('Broken_Link# ([[:digit:]]+): (.+) END', $log_data, $regs)) {
        echo 'Broken Link#: '.$regs[1].': '.$regs[2]."\n";
    }
}

?>



As you can see, the script is still pretty simple. All you have done is copy a foreach and change a couple of words each time, and you have added a lot of functionality to your log parser. Let's take a look at the output:

Bug#: 1: call to undefined function, mysql_conect
Bug#: 2: call to undefined function, ob_stat
Spelling Error#: 1: grammer should be grammar
Spelling Error#: 2: popcern should be popcorn
Broken Link#: 1: /games/demo.zip doesn't exist!
Broken Link#: 2: /images/popcern.jpg doesn't exist!

With minimal effort, I can imagine this tidbit of code being used to create a pretty advanced log reporting system... the possibilities are endless.
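One possible variation, sketched here rather than taken from the original script, reads the log once and sorts each line into a group by its report type, using alternation (Bug|Spelling|Broken_Link) in a single pattern. It assumes PHP 7 or later and therefore uses preg_match() instead of ereg(); the groupReports() function and the in-memory $log array (standing in for file('Log.txt')) are invented for the illustration.

```php
<?php
// Sample lines standing in for file('Log.txt') (assumed data).
$log = array(
    "Bug# 1: call to undefined function, mysql_conect END",
    "Spelling# 1: grammer should be grammar END",
    "Broken_Link# 1: /games/demo.zip doesn't exist! END",
    "Bug# 2: call to undefined function, ob_stat END",
);

// Hypothetical helper: groups report lines by type in a single pass.
function groupReports(array $lines)
{
    // Output label for each report type found in the log.
    $labels = array(
        'Bug'         => 'Bug#',
        'Spelling'    => 'Spelling Error#',
        'Broken_Link' => 'Broken Link#',
    );
    $grouped = array('Bug' => array(), 'Spelling' => array(), 'Broken_Link' => array());

    foreach ($lines as $line) {
        // Group 1: report type, group 2: number, group 3: message.
        if (preg_match('/(Bug|Spelling|Broken_Link)# ([[:digit:]]+): (.+) END/', $line, $regs)) {
            $grouped[$regs[1]][] = $labels[$regs[1]] . ': ' . $regs[2] . ': ' . $regs[3];
        }
    }
    return $grouped;
}

// Print bugs first, then spelling errors, then broken links.
foreach (groupReports($log) as $reports) {
    foreach ($reports as $report) {
        echo $report, "\n";
    }
}
```

The trade-off is one pass over the file instead of three, at the cost of a slightly longer pattern and an extra capture group.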

Before I move on, it's necessary to introduce you to eregi(). I personally hardly ever use it, but it is convenient to have lying around. eregi() performs a case insensitive match, which is best shown by an example.

<?php

$string = "ABC"; // sample input, assumed for illustration

if (eregi("c", $string)) {
    echo "'$string' contains a 'c' or 'C'.";
}

?>



Pretty basic theory: instead of writing a-zA-Z in your pattern, you can write just a-z or just A-Z, and eregi() will match both uppercase and lowercase characters. Next I will briefly explain those 'multibyte characters' I mentioned earlier.
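On PHP 7 or later, where eregi() is gone, a roughly equivalent check can be sketched with preg_match() and the i modifier. This is an assumption about how you might port the example, and containsLetter() is a made-up helper name.

```php
<?php
// Hypothetical helper: true when $text contains $letter in either case.
function containsLetter($letter, $text)
{
    // preg_quote() escapes the letter in case it is a metacharacter; the
    // trailing "i" makes the match case insensitive, as eregi() would.
    return preg_match('/' . preg_quote($letter, '/') . '/i', $text) === 1;
}

var_dump(containsLetter('c', 'ABC'));  // bool(true)
var_dump(containsLetter('c', 'xyz'));  // bool(false)
```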

--------------------------------------------------------------------------------

Multi-byte Characters
As you use your computer, you see text in many forms. When you type on your keyboard, what happens behind the scenes is rather complex, but I'll give it to you in a nutshell. Say you're in Microsoft Word and type "T" (a literal T, not " then T then "). The computer detects that you have pressed a key combination (Shift+T, to create an uppercase T), and then encodes it using whatever character set you're working in. In Unicode, T is the code point U+0054. Microsoft Word stores the encoded number in memory and passes it on to the software controlling your monitor, which uses it as an index to find an image of a T and put it on your screen.

Character sets vary, though. There are many different sets, usually differing by language. In English, letters are used to create words and sentences. In Japanese, hiragana and katakana are used to represent syllables. In Chinese, ideographs are used to represent full words or concepts. These differences in how languages are written are why multibyte characters are needed: depending on your character set, your keystrokes are encoded in different ways. There are 1-byte (8-bit), 2-byte (16-bit), and 4-byte (32-bit) character sets. So although you might be typing the same word, your keystrokes might be encoded differently from everyone else's. Hence the need for multibyte character support in regular expressions, so that everyone's text is matched with nearly the same accuracy.
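To make the byte-versus-character distinction concrete, here is a small sketch. It assumes UTF-8 encoding and uses preg_match() with the u modifier rather than the ereg_ functions, so treat it as an illustration of the concept and not of how ereg() behaves internally.

```php
<?php
// "T" is U+0054 and fits in a single byte.
$t = "T";
// The letter e with an acute accent is U+00E9; in UTF-8 it is stored as
// the two bytes 0xC3 0xA9.
$e = "\xC3\xA9";

printf("T is U+%04X and takes %d byte(s)\n", ord($t), strlen($t));
printf("e-acute takes %d byte(s) in UTF-8\n", strlen($e));

// "." means "one character". Without the u modifier the engine works byte
// by byte, so a two-byte character is not a single ".".
$byteMode = preg_match('/^.$/', $e);
// With the u modifier the pattern and subject are treated as UTF-8.
$utf8Mode = preg_match('/^.$/u', $e);

echo "byte mode: $byteMode, UTF-8 mode: $utf8Mode\n";
```

The same two bytes count as one character or two depending on whether the regex engine knows about the encoding, which is the problem multibyte support exists to solve.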

--------------------------------------------------------------------------------

An Introduction to ereg_replace
ereg_replace replaces portions of a string with another string, based on a regular expression match. Wherever the regular expression finds a match in the string it is searching through, it will replace the matched portion with a given 'replacement string'.
Let's take a gander at the prototype of ereg_replace:

string ereg_replace ( string pattern, string replacement, string string)

string
This tells us that the function will return a string. The string returned will be the result of applying the regular expression and replacement to the string it receives as the third parameter.

string pattern
The pattern string is a string containing the regular expression which will be used to match against the third parameter.

string replacement
The replacement string is the string which will replace the matched portions of the third parameter.

string string
This is the string which will be matched against and, if matched, changed according to the value of the replacement string.

Again, it's best to learn from experience, so let's try out a simple example. The example below will replace "abc" with "cba".


<?php

$stringbefore = "abc 123";

$stringafter = ereg_replace("abc", "cba", $stringbefore);

echo $stringafter;

?>



The output of running this script should have been 'cba 123'. I think it's necessary to show you an example (even if it's an extremely stupid one) of using the special range metacharacters. Here goes nothing...


<?php

$stringbefore = "abcdefghijklmnopqrstuvwxyz0123456789";

$stringafter = ereg_replace("[[:alpha:]]", "<", $stringbefore);

$stringafter = ereg_replace("[[:digit:]]", ">", $stringafter);

echo $stringafter;

?>



Oooooh, wasn't that fun? no? *shrug*
The output of that should have been along the lines of <<<<<<<<<<<<<<<<<<<<<<<<<<>>>>>>>>>>.

Just like ereg, ereg_replace has facilities in place to hold the value of parenthesized matches. It's a little different, though. For each () parenthesized match you have, there is a \\digit substring which you can use in the replacement. The first parenthesized match is \\1, the second is \\2, the third is \\3, etc. \\0 contains the entire matched string. You may only have up to nine substrings, and parentheses may be nested. This idea is best illustrated by examples, so here we go.


<?php

/* this example was taken from the php manual */

$string = "This is a test";

echo ereg_replace(" is", " was", $string);

echo ereg_replace("( )is", "\\1was", $string);

echo ereg_replace("(( )is)", "\\2was", $string);

?>



The output of the above example should have been "This was a test" printed three times. As you can see, the space in "( )is" is \\1, while the space in "(( )is)" is \\2. In actuality, the \\digit substring is \digit, but since you are using it inside "", the \ has to be escaped with another \, creating \\digit. If you use '' in your ereg_replace, you can write \digit (\1, \2, etc.) directly. Another, more 'real-world' example is below (which demonstrates the \digit idea).


<?php

$text = "http://www.google.com/";

$text = ereg_replace('[[:alpha:]]+://[^<>[:space:]]+[[:alnum:]/]', '<a href="\0">\0</a>', $text);

echo $text; // this should print an html link to http://www.google.com/.

?>
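The \digit references and the quoting rule described above can be checked with a short sketch. It assumes PHP 7 or later and so uses preg_replace(), which accepts the same \1 style references in the replacement string; flipDate() is a hypothetical helper written for this example.

```php
<?php
// Inside double quotes the backslash must be doubled; inside single quotes
// a single backslash is enough. All three literals are the same 2 characters.
var_dump("\\1" === '\1');   // bool(true)
var_dump('\\1' === '\1');   // bool(true)

// Hypothetical helper: reorders YYYY-MM-DD into DD.MM.YYYY using the
// \1, \2 and \3 references in the replacement string.
function flipDate($text)
{
    return preg_replace('/([0-9]{4})-([0-9]{2})-([0-9]{2})/', '\3.\2.\1', $text);
}

echo flipDate("Born on 1986-05-28."), "\n";  // Born on 28.05.1986.
```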



As with eregi, there is also an eregi_replace function. Just as eregi() mirrors ereg(), eregi_replace() has the same basic functionality and structure as ereg_replace(); the only difference is that eregi_replace() is case insensitive. Below is the mandatory example...


<?php

$string = "abcdcCccCCCc";

$string = eregi_replace("c", "-", $string);

echo $string; // should print 'ab-d--------'

?>



That should ring a few bells...
More PHP Programming - Regular Expressions Tutorials:
- Introduction to regular expressions Part 1 - General Mechanics