Pattern Modifiers

The PCRE modifiers currently available are listed below. The names in parentheses are PCRE's internal names for these modifiers. Spaces and newlines are ignored in modifiers; any other character causes an error.

i (PCRE_CASELESS)

Performs case-insensitive matching: letters in the pattern match both uppercase and lowercase letters.

m (PCRE_MULTILINE)

By default, PCRE treats the subject string as a single line of characters (even if it contains newlines). The "start of line" metacharacter (^) matches only at the start of the string, while the "end of line" metacharacter ($) matches only at the end of the string, or before a terminating newline (unless the D modifier is set). This is the same as Perl. When this modifier is set, the "start of line" and "end of line" constructs also match immediately after and immediately before any newline in the subject string, respectively, as well as at the very start and end. This is equivalent to Perl's /m modifier. If there are no "\n" characters in the subject string, or no occurrences of ^ or $ in the pattern, setting this modifier has no effect.
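As a minimal sketch of the behaviour described above (the subject string and variable names are made up for illustration), the same pattern can be run with and without m:

```php
<?php
// Three lines separated by "\n".
$subject = "first\nsecond\nthird";

// Without m, ^ and $ anchor only at the very start and end of the subject,
// so a pattern that allows only word characters between them cannot match.
$withoutM = preg_match_all('/^\w+$/', $subject, $a);

// With m, ^ and $ also anchor right after and right before each "\n".
$withM = preg_match_all('/^\w+$/m', $subject, $b);

var_dump($withoutM); // int(0)
var_dump($withM);    // int(3)
print_r($b[0]);      // first, second, third
```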

s (PCRE_DOTALL)

With this modifier, the dot metacharacter (.) matches any character, including newlines. Without it, newlines are excluded. This modifier is equivalent to Perl's /s modifier. A negated character class such as [^a] always matches a newline character, regardless of this modifier.
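A short illustration, assuming a two-line subject chosen only for this example:

```php
<?php
$subject = "a\nb";

// Without s, the dot refuses to match the newline between "a" and "b".
$plain = preg_match('/a.b/', $subject);

// With s, the dot matches the newline as well.
$dotall = preg_match('/a.b/s', $subject);

// A negated class matches the newline whether or not s is set.
$negated = preg_match('/a[^x]b/', $subject);

var_dump($plain, $dotall, $negated); // int(0) int(1) int(1)
```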

x (PCRE_EXTENDED)

With this modifier, whitespace characters in the pattern are ignored except when escaped or inside a character class, and characters between an unescaped # outside a character class and the next newline character, inclusive, are also ignored. This is equivalent to Perl's /x modifier and makes it possible to include comments inside complicated patterns. Note, however, that this applies only to data characters. Whitespace characters may never appear within special character sequences in a pattern, for example within the sequence (?( which introduces a conditional subpattern.
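The following sketch shows one way a commented pattern might look. The date format and the variable names are assumptions made for the example only:

```php
<?php
// The pattern is spread over several lines; whitespace and # comments are ignored.
$pattern = '/
    (\d{4})  # year
    -        # literal hyphen
    (\d{2})  # month
    -        # literal hyphen
    (\d{2})  # day
/x';

preg_match($pattern, 'Released on 2024-03-15.', $m);
echo "$m[1] $m[2] $m[3]\n"; // 2024 03 15

// An escaped space is still matched literally...
$spaced = preg_match('/a\ b/x', 'a b');  // 1
// ...while an unescaped one is dropped, so this pattern is effectively "ab".
$ignored = preg_match('/a b/x', 'a b');  // 0
```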

A (PCRE_ANCHORED)

With this modifier, the pattern is forced to be "anchored", that is, it is constrained to match only at the start of the subject string. The same effect can be achieved by appropriate constructs in the pattern itself, which is the only way to do it in Perl.
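For illustration, with a made-up subject, the A modifier and an explicit ^ give the same result here:

```php
<?php
$subject = 'abc123';

// Unanchored: the engine may skip ahead and find "123".
$unanchored = preg_match('/\d+/', $subject, $m1);

// Anchored with A: the match must begin at the start of the subject.
$anchored = preg_match('/\d+/A', $subject, $m2);

// Same effect written inside the pattern.
$explicit = preg_match('/^\d+/', $subject);

var_dump($unanchored, $m1[0]); // int(1) string(3) "123"
var_dump($anchored, $explicit); // int(0) int(0)
```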

D (PCRE_DOLLAR_ENDONLY)

With this modifier, the dollar metacharacter in the pattern matches only at the end of the subject string. Without it, a dollar also matches immediately before the final character if that character is a newline (but not before any other newlines). This modifier is ignored if the m modifier is set. There is no equivalent in Perl.
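This difference matters when validating input that may carry a trailing newline. A small sketch, with a subject invented for the example:

```php
<?php
$subject = "abc\n"; // note the trailing newline

// Default: $ also matches just before a final newline, so this succeeds.
$default = preg_match('/^abc$/', $subject);

// With D: $ matches only at the very end of the subject, so this fails.
$endOnly = preg_match('/^abc$/D', $subject);

var_dump($default, $endOnly); // int(1) int(0)
```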

S

When a pattern is going to be used several times, it is worth spending more time analyzing it in order to speed up later matching. This modifier requests that extra analysis. At present, studying a pattern is useful only for non-anchored patterns that do not start with a single fixed character. As of PHP 7.3.0 this flag has no effect.

U (PCRE_UNGREEDY)

This modifier inverts the "greediness" of the quantifiers, so that they are not greedy by default but become greedy if followed by ?. It is not compatible with Perl. It can also be set by a (?U) modifier setting within the pattern, and a single quantifier can be made ungreedy by placing a question mark after it (e.g. .*?).
Note:
It is usually not possible to match more than pcre.backtrack_limit characters in ungreedy mode.
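The inversion can be seen by running the same quantifiers with and without U. This is only a sketch; the subject and the # delimiter are choices made for the example:

```php
<?php
$subject = '<b>one</b><b>two</b>';

preg_match('#<b>.*</b>#', $subject, $greedy);    // default: greedy
preg_match('#<b>.*</b>#U', $subject, $ungreedy); // U: ungreedy by default
preg_match('#<b>.*?</b>#', $subject, $lazy);     // ? makes one quantifier ungreedy
preg_match('#<b>.*?</b>#U', $subject, $flipped); // under U, ? makes it greedy again

echo $greedy[0], "\n";   // <b>one</b><b>two</b>
echo $ungreedy[0], "\n"; // <b>one</b>
echo $lazy[0], "\n";     // <b>one</b>
echo $flipped[0], "\n";  // <b>one</b><b>two</b>
```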

X (PCRE_EXTRA)

This modifier turns on additional functionality of PCRE that is incompatible with Perl. Any backslash in a pattern that is followed by a letter with no special meaning causes an error, which reserves these combinations for future expansion. By default, as in Perl, a backslash followed by a letter with no special meaning is treated as a literal. At present, this modifier controls no other features.

J (PCRE_INFO_JCHANGED)

The (?J) internal option setting changes the local PCRE_DUPNAMES option, which allows duplicate names for subpatterns. As of PHP 7.2.0, J is also supported as a modifier.
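A typical use of duplicate names is an alternation in which each branch captures under the same name. The sketch below assumes PHP 7.2.0 or later for the trailing J; the pattern and the group name day are invented for the example. Without J (or an inline (?J)), PCRE rejects the duplicate name when compiling the pattern.

```php
<?php
// Requires PHP 7.2.0+ for J as a trailing modifier.
// Both branches store their capture under the same name, "day".
$pattern = '/(?<day>Mon)day|(?<day>Tue)sday/J';

preg_match($pattern, 'Monday', $mon);
preg_match($pattern, 'Tuesday', $tue);

echo $mon['day'], "\n"; // Mon
echo $tue['day'], "\n"; // Tue
```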

u (PCRE_UTF8)

This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern and subject strings are treated as UTF-8. An invalid subject causes the preg_* functions to match nothing; an invalid pattern triggers an error of level E_WARNING. Five and six octet UTF-8 sequences are regarded as invalid.
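Because an invalid subject produces no warning, it can help to check preg_last_error() after a failed call. A minimal sketch, with byte strings chosen for the example:

```php
<?php
$valid   = "\xc3\xb1"; // "ñ" encoded as two UTF-8 octets
$invalid = "\xc3\x28"; // lead octet followed by an invalid continuation octet

$ok  = preg_match('/^.$/u', $valid);   // one character, so this matches
$bad = preg_match('/^.$/u', $invalid); // invalid subject: returns false
$err = preg_last_error();              // explains why the last call failed

var_dump($ok);                          // int(1)
var_dump($bad);                         // bool(false)
var_dump($err === PREG_BAD_UTF8_ERROR); // bool(true)
```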

n (PCRE_NO_AUTO_CAPTURE)

This modifier makes simple (xyz) groups non-capturing. Only named groups such as (?<name>xyz) capture. It affects only which groups capture: numbered subpattern references can still be used, and the matches array still contains numbered results. Available as of PHP 8.2.0.
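As a sketch of the effect (this needs PHP 8.2.0 or later; the pattern and the group name month are made up for the example), the unnamed group below no longer captures, so the named group becomes group 1:

```php
<?php
// Requires PHP 8.2.0+.
// (\d{4}) is treated as non-capturing; only (?<month>...) captures.
preg_match('/(\d{4})-(?<month>\d{2})/n', '2024-03', $m);

print_r($m);
// [0] => 2024-03, [month] => 03, [1] => 03
```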

r (PCRE2_EXTRA_CASELESS_RESTRICT)

When u (PCRE_UTF8) and i (PCRE_CASELESS) are both in effect, this modifier prevents matching between ASCII and non-ASCII characters. For example, preg_match('/\x{212A}/iu', "K") matches, treating the Kelvin sign K (U+212A) and the ASCII letter K as case-insensitively equal. When r is used (preg_match('/\x{212A}/iur', "K")), it does not match. Available as of PHP 8.4.0.

Regarding the validity of a UTF-8 string when using the /u pattern modifier, some things to be aware of:

1. If the pattern itself contains an invalid UTF-8 character, you get an error (as mentioned in the docs above: "UTF-8 validity of the pattern is checked since PHP 4.3.5").
2. When the subject string contains invalid UTF-8 sequences or code points, the result is basically a "quiet death" for the preg_* functions: nothing is matched, and there is no indication that the string is invalid UTF-8.
3. At the time of writing, PCRE regarded five and six octet UTF-8 character sequences as valid (both in patterns and in the subject string), but these are not supported in Unicode (see section 5.9 "Character Encoding" of the "Secure Programming for Linux and Unix HOWTO", which can be found at http://www.tldp.org/ and other places). The description of the u modifier above states that such sequences are now regarded as invalid.
4. For an example algorithm in PHP which tests the validity of a UTF-8 string (and discards five and six octet sequences), head to: http://hsivonen.iki.fi/php-utf8/

The following script should give you an idea of what works and what doesn't:

<?php
$examples = array(
    'Valid ASCII' => "a",
    'Valid 2 Octet Sequence' => "\xc3\xb1",
    'Invalid 2 Octet Sequence' => "\xc3\x28",
    'Invalid Sequence Identifier' => "\xa0\xa1",
    'Valid 3 Octet Sequence' => "\xe2\x82\xa1",
    'Invalid 3 Octet Sequence (in 2nd Octet)' => "\xe2\x28\xa1",
    'Invalid 3 Octet Sequence (in 3rd Octet)' => "\xe2\x82\x28",
    'Valid 4 Octet Sequence' => "\xf0\x90\x8c\xbc",
    'Invalid 4 Octet Sequence (in 2nd Octet)' => "\xf0\x28\x8c\xbc",
    'Invalid 4 Octet Sequence (in 3rd Octet)' => "\xf0\x90\x28\xbc",
    // Only the 4th octet is invalid here (the original note also broke the 2nd octet).
    'Invalid 4 Octet Sequence (in 4th Octet)' => "\xf0\x90\x8c\x28",
    'Valid 5 Octet Sequence (but not Unicode!)' => "\xf8\xa1\xa1\xa1\xa1",
    'Valid 6 Octet Sequence (but not Unicode!)' => "\xfc\xa1\xa1\xa1\xa1\xa1",
);

// Each example used as the pattern: invalid UTF-8 here raises a warning.
echo "++Invalid UTF-8 in pattern\n";
foreach ($examples as $name => $str) {
    echo "$name\n";
    preg_match("/" . $str . "/u", 'Testing');
}

// Each example used as the subject of preg_match().
echo "++ preg_match() examples\n";
foreach ($examples as $name => $str) {
    preg_match("/\xf8\xa1\xa1\xa1\xa1/u", $str, $ar);
    echo "$name: ";
    if (count($ar) == 0) {
        echo "Matched nothing!\n";
    } else {
        echo "Matched {$ar[0]}\n";
    }
}

// Each example used as the subject of preg_match_all(), counting characters.
echo "++ preg_match_all() examples\n";
foreach ($examples as $name => $str) {
    preg_match_all('/./u', $str, $ar);
    echo "$name: ";
    $num_utf8_chars = count($ar[0]);
    if ($num_utf8_chars == 0) {
        echo "Matched nothing!\n";
    } else {
        echo "Matched $num_utf8_chars character\n";
    }
}
?>

I spent a few days trying to work out how to write a pattern for Unicode characters using their hex codes. The manuals I read gave no practical, PHP-valid examples, so here is one. Suppose we want to search for the Japanese-standard circled numbers 1-9 (Unicode code points 0x2460-0x2468). To do this with the hex codes, use the following call: preg_match('/[\x{2460}-\x{2468}]/u', $str); Here $str is the haystack string, \x{hex} specifies a character by its hexadecimal Unicode code point, and /u makes the pattern and subject be treated as UTF-8, so the class is a range of Unicode characters. Hope it's useful.
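A runnable version of that call might look like the sketch below. The sample sentence is invented, and the "\u{...}" string escapes assume PHP 7.0 or later:

```php
<?php
// "\u{2461}" is CIRCLED DIGIT TWO, "\u{2460}" is CIRCLED DIGIT ONE.
$str = "Step \u{2461} follows step \u{2460}.";

// \x{2460}-\x{2468} is a range of code points; /u is required for it to work.
$count = preg_match_all('/[\x{2460}-\x{2468}]/u', $str, $m);

var_dump($count); // int(2)
```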

The description of the "u" flag is a bit misleading. It suggests that it is only required if the pattern contains UTF-8 characters, when in fact it is required if either the pattern or the subject contains UTF-8. Without it, I was having problems with preg_match_all returning invalid multibyte characters when given a UTF-8 subject string. It's fairly clear if you read the documentation for libpcre: "In order process UTF-8 strings, you must build PCRE to include UTF-8 support in the code, and, in addition, you must call pcre_compile() with the PCRE_UTF8 option flag, or the pattern must start with the sequence (*UTF8). When either of these is the case, both the pattern and any subject strings that are matched against it are treated as UTF-8 strings instead of strings of 1-byte characters." [from http://www.pcre.org/pcre.txt]

If the _subject_ contains UTF-8 sequences, the 'u' modifier should be set; otherwise a pattern such as /./ could match a UTF-8 sequence as two to four individual bytes. It is not a requirement, however, as you may need to break UTF-8 sequences apart into single bytes. Most of the time, though, if you're working with UTF-8 strings you should use the 'u' modifier. If the subject doesn't contain any UTF-8 sequences (i.e. it has characters in the range 0x00-0x7F only) but the pattern does, then as far as I can work out, setting the 'u' modifier has no effect on the result.
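The byte-versus-character difference is easy to see by counting matches of a single dot. A small sketch, with a subject chosen for the example:

```php
<?php
$subject = "\xc3\xb1"; // "ñ": one character, two bytes in UTF-8

// Without u, the dot matches one byte at a time.
$bytes = preg_match_all('/./', $subject, $a);

// With u, the dot matches one whole UTF-8 character.
$chars = preg_match_all('/./u', $subject, $b);

var_dump($bytes, $chars); // int(2) int(1)
```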

The PCRE_INFO_JCHANGED modifier is apparently not accepted as a global option (after the closing delimiter) in PHP versions <= 5.4 (not checked in PHP 5.5), but it is allowed in PHP 5.6 (also not checked in PHP 7.X).

The following pattern doesn't work in PHP 5.4, but it works in PHP 5.6:

<?php
//test.php
preg_match_all('/(?<dup_name>\d{1,4})\-(?<dup_name>\d{1,2})/J', '1234-23', $matches);
var_dump($matches);

/* output in PHP 5.4:
Warning: preg_match_all(): Unknown modifier 'J' in test.php on line 3
NULL
--------------
output PHP 5.6:
array(4) {
  [0]=> array(1) { [0]=> string(7) "1234-23" }
  ["dup_name"]=> array(1) { [0]=> string(2) "23" }
  [1]=> array(1) { [0]=> string(4) "1234" }
  [2]=> array(1) { [0]=> string(2) "23" }
}
*/
?>

To resolve this issue in PHP 5.4, one can use the (?J) pattern modifier, which indicates that the pattern (from that point forward) allows duplicate names for subpatterns.

Code which works in PHP 5.4:

<?php
preg_match_all('/(?J)(?<dup_name>\d{1,4})\-(?<dup_name>\d{1,2})/', '1234-23', $matches);
var_dump($matches);

/* output in PHP 5.4:
array(4) {
  [0]=> array(1) { [0]=> string(7) "1234-23" }
  ["dup_name"]=> array(1) { [0]=> string(2) "23" }
  [1]=> array(1) { [0]=> string(4) "1234" }
  [2]=> array(1) { [0]=> string(2) "23" }
}
--------------
output in PHP 5.6 (the same as with /J):
array(4) {
  [0]=> array(1) { [0]=> string(7) "1234-23" }
  ["dup_name"]=> array(1) { [0]=> string(2) "23" }
  [1]=> array(1) { [0]=> string(4) "1234" }
  [2]=> array(1) { [0]=> string(2) "23" }
}
*/
?>

A warning about the /i modifier and POSIX character classes: If you're using POSIX character classes in your regex that indicate case such as [:upper:] or [:lower:] in combination with the /i modifier, then in PHP < 7.3 the /i modifier will take precedence and effectively make both those character classes work as [:alpha:], but in PHP >= 7.3 the character classes overrule the /i modifier.

An important addendum (with a new $pat3_2 that uses \R properly, plus its results and comments):

Note that escape sequences like \r, \R and \v have nuances of meaning and application that can be hard to grasp at first glance. None of them is perfect in all situations, but they are quite useful nevertheless. Some official PCRE control options come in handy too. Unfortunately neither (*ANYCRLF), (*ANY) nor (*CRLF) is documented here on php.net at the moment (although they seem to have been available for over 10 years and 5 months now), but they are described pretty well on Wikipedia ("Newline/linebreak options" at https://en.wikipedia.org/wiki/Perl_Compatible_Regular_Expressions) and on the official PCRE library site ("Newline convention" at http://www.pcre.org/original/doc/html/pcresyntax.html#SEC17). With the default compile-time configuration, \R appears somewhat disappointing when used improperly, according to php.net as well as the official description ("Newline sequences" at https://www.pcre.org/original/doc/html/pcrepattern.html#newlineseq).

A hint for those of you trying to fight off (or at least work around) the problem of matching a pattern correctly at the end (or the beginning) of any line, even without multiline mode (/m) or the meta-character assertions ($ or ^).

<?php
// Different operating systems use different end-of-line (line break) characters:
// - Windows uses CR+LF (\r\n);
// - Linux uses LF (\n);
// - classic Mac OS used CR (\r).
// That is why a single dollar assertion ($) sometimes fails in multiline (/m) mode:
// possibly a bug in PHP 5.3.8, or just a "feature"(?) of the default compile-time
// PCRE configuration for the meta-character assertions (^ and $).
$str = "ABC ABC\n\n123 123\r\ndef def\rnop nop\r\n890 890\nQRS QRS\r\r~-_ ~-_";
// C 3 p 0 _

// Somewhat disappointing according to php.net and pcre.org when used improperly.
$pat3 = '/\w\R?$/mi';

// Much better: a lookahead assertion (detects without capturing) and no multiline (/m) mode.
// With an alternative for end of string ((?=\R|$)) it would grab all 7 elements as expected,
// but '/(*ANYCRLF)\w$/mi' is more straightforward to use anyway.
$pat3_2 = '/\w(?=\R)/i';

$p = preg_match_all($pat3, $str, $m3);
$r = preg_match_all($pat3_2, $str, $m4);

echo $str . "\n3 !!! $pat3 ($p): " . print_r($m3[0], true)
    . "\n3_2 !!! $pat3_2 ($r): " . print_r($m4[0], true);

// Note the difference between the two uses of the very helpful escape sequence \R
// in $pat3 and $pat3_2, for some applications at least.

/* The code above results in the following output:
ABC ABC

123 123
def def
nop nop
890 890
QRS QRS

~-_ ~-_
3 !!! /\w\R?$/mi (5): Array ( [0] => C [1] => 3 [2] => p [3] => 0 [4] => _ )
3_2 !!! /\w(?=\R)/i (6): Array ( [0] => C [1] => 3 [2] => f [3] => p [4] => 0 [5] => S )
*/
?>

Unfortunately, I haven't got access to a server with the latest PHP version: my local PHP is 5.3.8 and my public host's PHP is version 5.2.17.

In case you're wondering what the "S" modifier means, this paragraph might be useful: when the "S" modifier is set, PHP calls the pcre_study() function from the PCRE API before executing the regexp. The result of that function is passed directly to pcre_exec(). For more information about pcre_study() and "Studying the pattern", check the PCRE manual at http://www.pcre.org/pcre.txt PS: Note that the function names "pcre_study" and "pcre_exec" used here refer to PCRE library functions written in C, not to any PHP functions.

A hint for those of you trying to fight off (or at least work around) the problem of matching a pattern correctly at the end ($) of any line in multiline mode (/m).

<?php
// Different operating systems use different end-of-line (line break) characters:
// - Windows uses CR+LF (\r\n);
// - Linux uses LF (\n);
// - classic Mac OS used CR (\r).
// That is why a single dollar assertion ($) sometimes fails in multiline (/m) mode:
// possibly a bug in PHP 5.3.8, or just a "feature"(?).
$str = "ABC ABC\n\n123 123\r\ndef def\rnop nop\r\n890 890\nQRS QRS\r\r~-_ ~-_";
// C 3 p 0 _

$pat1 = '/\w$/mi';           // This works excellently in JavaScript (Firefox 7.0.1+)
$pat2 = '/\w\r?$/mi';
$pat3 = '/\w\R?$/mi';        // Somewhat disappointing according to php.net and pcre.org
$pat4 = '/\w\v?$/mi';
$pat5 = '/(*ANYCRLF)\w$/mi'; // Excellent but undocumented on php.net at the moment

$n = preg_match_all($pat1, $str, $m1);
$o = preg_match_all($pat2, $str, $m2);
$p = preg_match_all($pat3, $str, $m3);
$r = preg_match_all($pat4, $str, $m4);
$s = preg_match_all($pat5, $str, $m5);

echo $str . "\n1 !!! $pat1 ($n): " . print_r($m1[0], true)
    . "\n2 !!! $pat2 ($o): " . print_r($m2[0], true)
    . "\n3 !!! $pat3 ($p): " . print_r($m3[0], true)
    . "\n4 !!! $pat4 ($r): " . print_r($m4[0], true)
    . "\n5 !!! $pat5 ($s): " . print_r($m5[0], true);

// Note the difference among the three very helpful escape sequences in $pat2 (\r),
// $pat3 (\R) and $pat4 (\v), and the altered newline option in $pat5 ((*ANYCRLF)),
// for some applications at least.

/* The code above results in the following output:
ABC ABC

123 123
def def
nop nop
890 890
QRS QRS

~-_ ~-_
1 !!! /\w$/mi (3): Array ( [0] => C [1] => 0 [2] => _ )
2 !!! /\w\r?$/mi (5): Array ( [0] => C [1] => 3 [2] => p [3] => 0 [4] => _ )
3 !!! /\w\R?$/mi (5): Array ( [0] => C [1] => 3 [2] => p [3] => 0 [4] => _ )
4 !!! /\w\v?$/mi (5): Array ( [0] => C [1] => 3 [2] => p [3] => 0 [4] => _ )
5 !!! /(*ANYCRLF)\w$/mi (7): Array ( [0] => C [1] => 3 [2] => f [3] => p [4] => 0 [5] => S [6] => _ )
*/
?>

Unfortunately, I haven't got access to a server with the latest PHP version: my local PHP is 5.3.8 and my public host's PHP is version 5.2.17.
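The (*ANYCRLF) approach from the note can be reduced to a much smaller sketch. The subject below is invented and mixes all three line-ending styles. How the same pattern behaves without (*ANYCRLF) depends on the newline convention PCRE was built with, so only the (*ANYCRLF) result is shown.

```php
<?php
// Four one-letter "lines" separated by CR+LF, CR and LF respectively.
$str = "a\r\nb\rc\nd";

// (*ANYCRLF) tells PCRE to treat CR, LF and CR+LF all as newlines,
// so $ in multiline mode matches before each of them and at the end.
$count = preg_match_all('/(*ANYCRLF)\w$/m', $str, $m);

var_dump($count); // int(4)
print_r($m[0]);   // a, b, c, d
```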

When adding comments with the /x modifier, don't use the pattern delimiter in the comments. It may not be ignored in the comments area. Example:

<?php
$target = 'some text';

if (preg_match('/
    e # Comments here
    /x', $target)) {
    print "Target 1 hit.\n";
}

if (preg_match('/
    e # /Comments here with slash
    /x', $target)) {
    print "Target 2 hit.\n";
}
?>

This prints "Target 1 hit." but then generates a PHP warning message for the second preg_match(), because the slash in the comment ends the pattern and the text after it is read as modifiers:

Warning: preg_match() [function.preg-match]: Unknown modifier 'C' in /ebarnard/x-modifier.php on line 11
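One possible workaround, sketched here as a suggestion rather than as part of the original note, is to pick a delimiter that does not occur in the comments. The ~ delimiter is an arbitrary choice for the example:

```php
<?php
$target = 'some text';

// With ~ as the delimiter, a slash inside the comment no longer ends the pattern.
$hit = preg_match('~
    e   # a comment containing a / slash is harmless with this delimiter
~x', $target);

var_dump($hit); // int(1)
```

The chosen delimiter must then be kept out of the comments instead.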
