Debian way is:
dpkg-reconfigure locales
preg_match
(PHP 4, PHP 5)
preg_match — Perform a regular expression match
Description
Searches subject for a match to the regular expression given in pattern.
Parameters
- pattern
The pattern to search for, as a string.
- subject
The input string.
- matches
If matches is provided, then it is filled with the results of the search. $matches[0] will contain the text that matched the full pattern, $matches[1] will have the text that matched the first captured parenthesized subpattern, and so on.
- flags
flags can be the following flag:
- PREG_OFFSET_CAPTURE
If this flag is passed, the string offset of every match is returned along with the matched text. Note that this changes the value of matches into an array where every element is an array consisting of the matched string at index 0 and its string offset into subject at index 1.
- offset
Normally, the search starts from the beginning of the subject string. The optional parameter offset can be used to specify an alternate place from which to start the search (in bytes).
Note: Using offset is not equivalent to passing substr($subject, $offset) to preg_match() in place of the subject string, because pattern can contain assertions such as ^, $ or (?<=x). Compare:
<?php
$subject = "abcdef";
$pattern = '/^def/';
preg_match($pattern, $subject, $matches, PREG_OFFSET_CAPTURE, 3);
print_r($matches);
?>
The above example will output:
Array ( )
while this example
<?php
$subject = "abcdef";
$pattern = '/^def/';
preg_match($pattern, substr($subject,3), $matches, PREG_OFFSET_CAPTURE);
print_r($matches);
?>
will produce
Array ( [0] => Array ( [0] => def [1] => 0 ) )
Return Values
preg_match() returns the number of times pattern matches. That will be either 0 times (no match) or 1 time because preg_match() will stop searching after the first match. preg_match_all(), by contrast, will continue until it reaches the end of subject. preg_match() returns FALSE if an error occurred.
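The difference between the two functions, and between a result of 0 and FALSE, can be sketched as follows. The subject string and variable names are illustrative assumptions, not part of the manual text.

```php
<?php
// Sketch: preg_match() stops at the first match, preg_match_all() counts them all.
$subject = "one two three";

$first = preg_match('/\w+/', $subject, $m);       // stops after the first word
$all   = preg_match_all('/\w+/', $subject, $mm);  // counts every word
$none  = preg_match('/\d+/', $subject);           // no digits: 0, which is not an error

var_dump($first, $all, $none);

// 0 and FALSE are different results, so use === when checking for an error.
var_dump($none === false);
```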
ChangeLog
| Version | Description |
|---|---|
| 4.3.3 | The offset parameter was added |
| 4.3.0 | The PREG_OFFSET_CAPTURE flag was added |
| 4.3.0 | The flags parameter was added |
Examples
Example #1 Find the string of text "php"
<?php
// The "i" after the pattern delimiter indicates a case-insensitive search
if (preg_match("/php/i", "PHP is the web scripting language of choice.")) {
echo "A match was found.";
} else {
echo "A match was not found.";
}
?>
Example #2 Find the word "web"
<?php
/* The \b in the pattern indicates a word boundary, so only the distinct
* word "web" is matched, and not a word partial like "webbing" or "cobweb" */
if (preg_match("/\bweb\b/i", "PHP is the web scripting language of choice.")) {
echo "A match was found.";
} else {
echo "A match was not found.";
}
if (preg_match("/\bweb\b/i", "PHP is the website scripting language of choice.")) {
echo "A match was found.";
} else {
echo "A match was not found.";
}
?>
Example #3 Getting the domain name out of a URL
<?php
// get host name from URL
preg_match('@^(?:http://)?([^/]+)@i',
"http://www.php.net/index.html", $matches);
$host = $matches[1];
// get last two segments of host name
preg_match('/[^.]+\.[^.]+$/', $host, $matches);
echo "domain name is: {$matches[0]}\n";
?>
The above example will output:
domain name is: php.net
Example #4 Using named subpattern
<?php
$str = 'foobar: 2008';
preg_match('/(?<name>\w+): (?<digit>\d+)/', $str, $matches);
print_r($matches);
?>
The above example will output:
Array ( [0] => foobar: 2008 [name] => foobar [1] => foobar [digit] => 2008 [2] => 2008 )
Notes
preg_match
04-Apr-2008 02:36
In addition to reiner-keller's comment about Umlaute using setlocale (LC_ALL, 'de_DE');
To enable 'de_DE' on my Debian 4 machine I first had to:
- uncomment 'de_DE' in file /etc/locale.gen and afterwards
- run locale-gen from the shell
18-Mar-2008 03:55
Thought this might be helpful to those people out there writing code for use with Valve's Steam IDs:
<?php
$steam_id = "STEAM_0:1:1234567890";
$pattern = "/^STEAM_[0-2]:[0-2]:[0-9]{1,10}$/";
if (preg_match($pattern, $steam_id)) {
echo "Valid Steam ID";
}
else {
echo "Not Valid Steam ID";
}
// output:
// Valid Steam ID
?>
10-Mar-2008 10:38
Beat this modern email address matcher...
/^([a-z0-9]([a-z0-9_-]*\.?[a-z0-9])*)(\+[a-z0-9]+)?
@([a-z0-9]([a-z0-9-]*[a-z0-9])*\.)*
([a-z0-9]([a-z0-9-]*[a-z0-9]+)*)\.[a-z]{2,6}$/
(Remove the line-breaks!)
Its only bug is it thinks single-number top-level domains are ok. Can you find any others?
06-Mar-2008 08:21
If you want to test for FALSE use === instead.
$result = preg_match("/badtest/J",$string);
if($result === FALSE) {
// bad query
error_log("Whoops!");
} else {
echo("Matched " . $result . " times");
}
05-Mar-2008 02:56
To the comment below about the validation of phone numbers.
PEAR offers some brilliant classes for phone number validation.
Check out http://pear.php.net/packages.php?catpid=50&catname=Validate
Regards
Thijs
28-Dec-2007 12:01
A quick example of using named recursion and negative lookaheads for finding the outermost div. You can use this same idea for any type of nested tags.
<?php
$sample =
"lead in text to capture <div>
outside div text
<div>
inner div text
<div>
deep nested text
</div>
</div>
bottom of outside div text
</div> end of text to capture";
preg_match(
'#^(?P<a>.*?)(?P<b>.?<div((.(?!<div))|(?P>b))*?.</div>)(?P<c>.*?)$#s',
$sample, $matches);
echo "<pre>";
var_dump($matches);
//$matches['a'] == "lead in text to capture"
//$matches['b'] == the outermost <div> and child contents (with a leading space)
//$matches['c'] == " end of text to capture"
?>
27-Dec-2007 10:19
One note on the regular expressions provided that claim to validate e-mail addresses: They're incomplete. To quote the note submission page, just a couple paragraphs above the box where you type in your note:
"(And if you're posting an example of validating email addresses, please don't bother. Your example is almost certainly wrong for some small subset of cases. See this information from O'Reilly Mastering Regular Expressions book [http://examples.oreilly.com/regex/readme.html] for the gory details.)"
That said, the expressions as provided aren't COMPLETELY irrelevant -- they WILL validate MOST e-mail addresses, and you won't really be blocking any significant portion of the population by using them. Just be aware of the limitations.
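As an alternative to a hand-written expression, PHP 5.2 and later ship the filter extension, which has a built-in e-mail check. This is a different technique from the regular expressions in these notes, and it is not a complete RFC validator either. The helper name `looks_like_email` is hypothetical.

```php
<?php
// Hypothetical helper; FILTER_VALIDATE_EMAIL comes from PHP's filter extension.
function looks_like_email($address)
{
    // filter_var() returns the filtered string on success and false on failure.
    return filter_var($address, FILTER_VALIDATE_EMAIL) !== false;
}

var_dump(looks_like_email('user@example.com'));
var_dump(looks_like_email('not an address'));
```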
18-Dec-2007 03:52
To test if a regular expression is syntactically correct:
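<?php
// Returns true if $regex compiles; throws an Exception describing the problem otherwise.
function preg_test($regex)
{
    // preg_match() returns FALSE (not 0) when the pattern fails to compile
    if (@preg_match($regex, '') === false)
    {
        $error = error_get_last();
        // substr(..., 70) drops the leading part of the warning text;
        // the right offset depends on the error message format in use
        throw new Exception(substr($error['message'], 70));
    }
    return true;
}
?>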
usage:
<?php
if (preg_test('/.*/i'))
    print "correct!";
// Prints "correct!"
?>
<?php
if (preg_test('/.**/i'))
    print "correct!";
// Throws exception with message 'Compilation failed: nothing to repeat at offset 2'
?>
19-Nov-2007 02:19
If you try to find the offset when searching in a UTF-8 string (containing multibyte characters, like Cyrillic characters) with preg_match, using the PREG_OFFSET_CAPTURE flag, you may get a different result from what you expected.
First of all, PHP must be compiled with Multibyte Support (mbstring). Then you must either use the Multibyte Support functions (mb_*) or turn on some PHP runtime configuration options (php.ini, Apache vhost conf file, .htaccess or somewhere else):
php_value default_charset UTF-8
php_value mbstring.func_overload 7
php_value mbstring.internal_encoding UTF-8
php_value mbstring.detect_order UTF-8
When using preg_match with the PREG_OFFSET_CAPTURE flag and a UTF-8 string, the function counts bytes and NOT characters, so a multibyte character may count as 2 bytes rather than 1 character. That is why the offset will be larger than you expected.
My simple solution is using mb_strpos:
...
preg_match($pattern, $found_text, $matches, PREG_OFFSET_CAPTURE);
// Convert $matches[0][1] from a byte offset to a character offset (UTF-8)
$matches[0][1] = mb_strpos($found_text, $matches[0][0]);
...
P.S. The $pattern variable must use "/u" switch for Unicode!!!
-------------------------------------------------
PHP Version 5.2.4
Multibyte regex (oniguruma) version 4.4.4
-------------------------------------------------
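A variation on the same idea, offered as a sketch rather than as the poster's method: count the characters in the bytes that precede the match. Unlike searching for the matched text again, this still works when the matched text occurs more than once. It assumes mbstring is loaded and the subject is valid UTF-8; `byte_to_char_offset` is a hypothetical helper name.

```php
<?php
// Hypothetical helper: convert a byte offset reported by PREG_OFFSET_CAPTURE
// into a character offset. Assumes $subject is valid UTF-8 and mbstring is available.
function byte_to_char_offset($subject, $byteOffset)
{
    // Count the characters in the bytes that come before the match.
    return mb_strlen(substr($subject, 0, $byteOffset), 'UTF-8');
}

// Two 2-byte characters (a-umlaut, o-umlaut), then " abc abc"
$subject = "\xC3\xA4\xC3\xB6 abc abc";
preg_match('/abc$/u', $subject, $matches, PREG_OFFSET_CAPTURE);

$byteOffset = $matches[0][1];                              // offset in bytes
$charOffset = byte_to_char_offset($subject, $byteOffset);  // offset in characters

var_dump($byteOffset, $charOffset);
```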
25-Aug-2007 11:17
regex for validating emails, from Perl's RFC2822 package:
http://en.wikipedia.org/wiki/Talk:E-mail_address
26-Jul-2007 04:47
Maybe it will sound obvious, but I've encountered this a few times...
If you are using preg_match() to validate user input, remember to include ^ and $ in your regex, or take the input from $matches[0] after successfully matching a pattern, e.g.
preg_match('/[0-9]+/', '123 UNION SELECT ... --') will return 1, but when you use the input in a SQL statement, the injected code will probably be executed (if you don't escape the user argument). Note that $matches[0] == '123', so it can be used as valid input.
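A minimal sketch of the difference, using the same input as above. The D modifier is an addition here: it keeps $ from matching before a trailing newline.

```php
<?php
$input = '123 UNION SELECT ... --';

// Unanchored: succeeds because "123" appears somewhere in the input.
$unanchored = preg_match('/[0-9]+/', $input, $m);

// Anchored: the whole string must consist of digits, so this fails.
$anchored = preg_match('/^[0-9]+$/D', $input);

var_dump($unanchored, $m[0], $anchored);
```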
10-Jul-2007 01:33
I just started using PHP and this section doesn't clarify whether or not you must use "/" as your regular expression delimiters.
I want to clarify that you can use almost any character as your delimiter. The delimiter is automatically the first character of your regular expression string. This makes it a bit easier if you are looking for things that might contain a forward slash. For example:
preg_match('#</b>#', $string);
Instead of:
preg_match('/<\/b>/', $string);
Or:
preg_match('@/my/dir/name/@', $string);
Instead of:
preg_match('/\/my\/dir\/name\//', $string);
This can greatly boost readability. Not quite as flexible as in Perl (You can't use control characters or \n which can really come in handy when you aren't quite sure what characters might be in your regular expression), but switching to another delimiter can make your code a bit easier to read.
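When the text to match is not known in advance, preg_quote() accepts the delimiter as its second argument and escapes it along with the usual special characters. A small sketch, with a made-up literal that contains both "/" and "#":

```php
<?php
// Sketch: escape literal text for use inside a pattern that uses "#" as delimiter.
$literal = '/my/dir/name/#1';
$pattern = '#^' . preg_quote($literal, '#') . '$#';

$found    = preg_match($pattern, '/my/dir/name/#1');
$notFound = preg_match($pattern, '/my/dir/name/');

var_dump($found, $notFound);
```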
17-Aug-2006 12:27
Concerning the German umlauts (and other language-specific chars such as accented letters etc.): If you use Unicode (UTF-8), you can match them easily with the Unicode character property \pL (match any Unicode letter) and the "u" modifier, so e.g.
<?php preg_match("/[\w\pL]/u",$var); ?>
would really match all "words" in $var - whether they contain umlauts or not. Took me a while to figure this out, so maybe this comment will save the day for someone else :-)
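A short sketch of the same character class applied to a whole string. The anchors and the + quantifier are additions for illustration, and the umlaut is written as UTF-8 bytes so the example does not depend on the file encoding.

```php
<?php
$pattern = '/^[\w\pL]+$/u';

$withUmlaut = preg_match($pattern, "M\xC3\xBCller");   // "Mueller" spelled with u-umlaut
$withSpace  = preg_match($pattern, "M\xC3\xBC ller");  // contains a space, so no match

var_dump($withUmlaut, $withSpace);
```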
29-Jan-2006 10:17
This is the only function in which the assertion \G can be used in a regular expression. \G matches only if the current position in 'subject' is the same as specified by the index 'offset'. It is comparable to the ^ assertion, but whereas ^ matches at position 0, \G matches at position 'offset'.
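The behaviour described above can be sketched with the "abcdef" subject from the manual's offset example:

```php
<?php
$subject = "abcdef";

// ^ still refers to the start of the subject, so this fails at offset 3.
$caret = preg_match('/^def/', $subject, $m1, 0, 3);

// \G anchors at the offset itself, so "def" is found.
$atOffset = preg_match('/\Gdef/', $subject, $m2, 0, 3);

// "def" does not begin at offset 2, so \G prevents a match.
$missed = preg_match('/\Gdef/', $subject, $m3, 0, 2);

var_dump($caret, $atOffset, $missed);
```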
11-Feb-2005 10:03
Pointing to the post of "internet at sourcelibre dot com": Instead of using PerlRegExp for e.g. German "Umlaute" like
<?php
$bolMatch = preg_match("/^[a-zA-ZäöüÄÖÜ]+$/", $strData);
?>
use the setlocale() function and the POSIX format like
<?php
setlocale (LC_ALL, 'de_DE');
$bolMatch = preg_match("/^[[:alpha:]]+$/", $strData);
?>
This works for any country related special character set.
Remember that since "Umlaute" domains have been released, it's almost mandatory to update your RegExp so that forms accept them (e-mail and internet addresses).
Life can be so easy reading the manual ;-)
13-Jan-2005 05:11
Note that the PREG_OFFSET_CAPTURE flag, as far as I've tested, returns the offset in bytes not characters, which may not be what you're expecting if you're using the /u pattern modifier to make the regex UTF-8 aware (i.e. multibyte characters will result in a greater offset than you expect)
17-Jan-2004 11:31
As I did not find any working IPv6 Regexp, I just created one. Here it is:
$pattern1 = '([A-Fa-f0-9]{1,4}:){7}[A-Fa-f0-9]{1,4}';
$pattern2 = '[A-Fa-f0-9]{1,4}::([A-Fa-f0-9]{1,4}:){0,5}[A-Fa-f0-9]{1,4}';
$pattern3 = '([A-Fa-f0-9]{1,4}:){2}:([A-Fa-f0-9]{1,4}:){0,4}[A-Fa-f0-9]{1,4}';
$pattern4 = '([A-Fa-f0-9]{1,4}:){3}:([A-Fa-f0-9]{1,4}:){0,3}[A-Fa-f0-9]{1,4}';
$pattern5 = '([A-Fa-f0-9]{1,4}:){4}:([A-Fa-f0-9]{1,4}:){0,2}[A-Fa-f0-9]{1,4}';
$pattern6 = '([A-Fa-f0-9]{1,4}:){5}:([A-Fa-f0-9]{1,4}:){0,1}[A-Fa-f0-9]{1,4}';
$pattern7 = '([A-Fa-f0-9]{1,4}:){6}:[A-Fa-f0-9]{1,4}';
Patterns 1 to 7 represent different cases. $full is the complete pattern which should work for all correct IPv6 addresses.
$full = "/^($pattern1)$|^($pattern2)$|^($pattern3)$|^($pattern4)$|^($pattern5)$|^($pattern6)$|^($pattern7)$/";
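On PHP 5.2 and later, the filter extension offers a built-in check that can be used instead of, or to cross-check, a hand-written pattern. This is a different technique from the one in this note; `is_ipv6` is a hypothetical helper name.

```php
<?php
// Hypothetical helper built on PHP's filter extension.
function is_ipv6($address)
{
    // filter_var() returns the address on success and false on failure.
    return filter_var($address, FILTER_VALIDATE_IP, FILTER_FLAG_IPV6) !== false;
}

var_dump(is_ipv6('2001:db8::1'));   // compressed IPv6 form
var_dump(is_ipv6('192.168.0.1'));   // IPv4, rejected by the IPv6 flag
```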
23-Nov-2003 01:23
A web server log record can be parsed as follows:
$line_in = '209.6.145.47 - - [22/Nov/2003:19:02:30 -0500] "GET /dir/doc.htm HTTP/1.0" 200 6776 "http://search.yahoo.com/search?p=key+words=UTF-8" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)"';
if (preg_match('!^([^ ]+) ([^ ]+) ([^ ]+) \[([^\]]+)\] "([^ ]+) ([^ ]+) ([^/]+)/([^"]+)" ([^ ]+) ([^ ]+) ([^ ]+) (.+)!',
$line_in,
$elements))
{
print_r($elements);
}
Array
(
[0] => 209.6.145.47 - - [22/Nov/2003:19:02:30 -0500] "GET /dir/doc.htm HTTP/1.0" 200 6776 "http://search.yahoo.com/search?p=key+words=UTF-8" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)"
[1] => 209.6.145.47
[2] => -
[3] => -
[4] => 22/Nov/2003:19:02:30 -0500
[5] => GET
[6] => /dir/doc.htm
[7] => HTTP
[8] => 1.0
[9] => 200
[10] => 6776
[11] => "http://search.yahoo.com/search?p=key+words=UTF-8"
[12] => "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)"
)
Notes:
1) For the referer field ($elements[11]), I intentionally capture the double quotes (") and don't use them as delimiters, because sometimes double quotes do appear in a referer URL. Double quotes can appear as %22 or \". Both have to be handled correctly. So, I strip off the double quotes in a second step.
2) The URLs should be further parsed, using parse_url, which is quicker and more reliable than preg_match.
3) I assume the requested protocol (HTTP/1.1) always has a slash character in the middle, which might not always be the case, but I'll take the risk.
4) The agent field ($elements[12]) is the most unstructured field, so I make no assumptions about its format. If the record is truncated, the agent field will not be delimited properly with a quote at the end. So, both cases must be handled.
5) A hyphen (- or "-") means a field has no value. It is necessary to convert these to an appropriate value (such as empty string, null, or 0).
6) Finally, there should be appropriate code to handle malformed web log entries, which are common, due to junk data. I never assume I've seen all cases.
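Notes 1 and 5 could be handled by a small post-processing step along these lines. This is only a sketch: `normalize_log_field` is a hypothetical helper, and mapping a hyphen to null is one of several reasonable choices.

```php
<?php
// Hypothetical helper: strip one pair of surrounding double quotes, then
// turn a bare hyphen (no value) into null.
function normalize_log_field($value)
{
    // Only strip when both the leading and the trailing quote are present,
    // so a truncated field without a closing quote is left untouched.
    if (strlen($value) >= 2 && $value[0] === '"' && substr($value, -1) === '"') {
        $value = substr($value, 1, -1);
    }
    return ($value === '-') ? null : $value;
}

$referer = normalize_log_field('"http://search.yahoo.com/search?p=key+words=UTF-8"');
$user    = normalize_log_field('-');
$quoted  = normalize_log_field('"-"');

var_dump($referer, $user, $quoted);
```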
31-Mar-2003 05:56
If you want to match all Scandinavian characters (æÆøØåÅöÖäÄ) in addition to those matched by \w, you might want to use this regexp:
/^[\w\xe6\xc6\xf8\xd8\xe5\xc5\xf6\xd6\xe4\xc4]+$/
Remember that \w respects the current locale used in PCRE's character tables.
