PHP: 再帰的パターン

再帰的パターン

どうすればカッコに括られた文字列とマッチできるか、という問題を考えて見ましょう。このとき、カッコは何回でもネストできるとします。再帰が使えないとすると、パターンを用いて、せいぜい、ある一定の深さのネストまでしかマッチできないでしょう。任意の深さのネストを処理することは不可能です。Perl 5.6 では、正規表現において再帰を行う実験的な機能が導入されています。 PCRE では、再帰という特殊なケースに対して専用のシーケンス (?R) が導入されました。上記のカッコ問題を解決する PCRE のコードは以下のようになります（PCRE_EXTENDED オプションが設定され空白文字が無視されると仮定）。

      \( ( (?>[^()]+) | (?R) )* \)

まず、このパターンは開きカッコにマッチします。続いて、カッコ以外の文字が並んでいるか、または、再帰的にパターン自体にマッチする（すなわち正しくカッコで括られている）かする部分文字列に何回でもマッチします。最後に閉じカッコにマッチします。

このパターン例には、ネストした無制限の繰り返しが含まれているため、マッチが成功し得ない文字列に対してこのパターンが適用される場合も考えて、カッコ以外の文字列にマッチングさせる部分に再試行無しのサブパターンを使うことが重要です。例えば、このパターンを

      (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()

という文字列に適用すると、「マッチしない」という判定が速やかに行われます。しかし、再試行無しのサブパターンを使わないと、対象文字列を分割し得る + と * の繰り返し数の種類が非常に多く、そのすべてが確認された後にマッチの失敗が報告されるため、マッチングに非常に時間がかかります。

キャプチャ用サブパターンにセットされる値は、そのサブパターンに値がセットされ得る最も外側で最も後の再帰レベルからのものになります。上記のパターンを

      (ab(cd)ef)

にマッチングさせると、キャプチャされる値は "ef" であり、最上位レベルの最後の値です

      \( ( ( (?>[^()]+) | (?R) )* ) \)

次のようにカッコを追加すると、キャプチャされる文字列は、"ab(cd)ef" となり、最上位レベルのカッコの中身となります。一つのパターン中に 15 以上のキャプチャ用サブパターンを用いると、PCRE は再帰を行っている間のデータ保存用に追加の記憶領域を確保する必要があります。記憶領域が確保できない場合、メモリ不足エラーを再帰の内側から出力する手段がないため、最初の 15 個のキャプチャ用サブパターンについてのみデータが保存されます。

(?1), (?2) といった記法が再帰的サブパターンとして使用可能です。 (?P>name) あるいは (?&name) といった名前付きのサブパターンも使用可能です。

この再帰的サブパターンの記法は（数字によるものも名前によるものの双方とも）参照するカッコの外で使用でき、プログラム言語のサブルーチンの様に処理されます。前に示した例で、パターン (sens|respons)e and \1ibility は "sense and sensibility" および "response and responsibility" にはマッチしますが "sense and responsibility" にはマッチしませんでした。一方、パターン (sens|respons)e and (?1)ibility は、上記 2 つに加え "sense and responsibility" にもマッチします。ただし、参照先のサブパターンの後に記述する必要があります。

対象文字列の最大の長さは、integer 型の最大の値となります。しかし、PCRE は再帰を使用してサブパターンの処理や繰り返し処理を行います。つまり、そのパターンで処理する対象文字列の長さによって、使用可能なスタック空間の大きさが制限されるということです。

Found A Problem?

Learn How To Improve This Page • Submit a Pull Request • Report a Bug

＋add a note

User Contributed Notes 8 notes

down

horvath at webarticum dot hu ¶

13 years ago

With the (?R) item you can link only to the full pattern, because it quasi equals to (?0). You can not use anchors, asserts etc., and you can only check that string CONTAINS a valid hierarchy or not.

This is wrong: ^\(((?>[^()]+)|(?R))*\)$

However, you can bracketing the full expression, and replace (?R) to the relative link (?-2). This make it reusable. So you can check complex expressions, for example:
<?php

$bracket_system = "(\\(((?>[^()]+)|(?-2))*\\))"; // (reuseable)
$bracket_systems = "((?>[^()]+)?$bracket_system)*(?>[^()]+)?"; // (reuseable)
$equation = "$bracket_systems=$bracket_systems"; // Both side of the equation must be contain valid bracket systems
var_dump(preg_match("/^$equation\$/","a*(a-(2a+2))=4(a+3)-2(a-(a-2))")); // Outputs 'int(1)'
var_dump(preg_match("/^$equation\$/","a*(a-(2a+2)=4(a+3)-2(a-(a-2)))")); // Outputs 'int(0)'

?>

You can also catch multibyte quotes with the 'u' modifier (if you use UTF-8), eg:
<?php

$quoted = "(»((?>[^»«]+)|(?-2))*«)"; // (reuseable)
$prompt = "[\\w ]+: $quoted";
var_dump(preg_match("/^$prompt\$/u","Your name: »write here«")); // Outputs 'int(1)'

?>

down

emanueledelgrande at email dot it ¶

16 years ago

The recursion in regular expressions is the only way to allow the parsing of HTML code with nested tags of indefinite depth.
It seems it's not yet a spreaded practice; not so much contents are available on the web regarding regexp recursion, and until now no user contribute notes have been published on this manual page.
I made several tests with complex patterns to get tags with specific attributes or namespaces, studying the recursion of a subpattern only instead of the full pattern.
Here's an example that may power a fast LL parser with recursive descent (http://en.wikipedia.org/wiki/Recursive_descent_parser):

$pattern = "/<([\w]+)([^>]*?) (([\s]*\/>)| (>((([^<]*?|<\!\-\-.*?\-\->)| (?R))*)<\/\\1[\s]*>))/xsm";

The performances of a preg_match or preg_match_all function call over an avarage (x)HTML document are quite fast and may drive you to chose this way instead of classic DOM object methods, which have a lot of limits and are usually poor in performance with their workarounds, too.
I post a sample application in a brief function (easy to be turned into OOP), which returns an array of objects:

<?php
// test function:
function parse($html) {
    // I have split the pattern in two lines not to have long lines alerts by the PHP.net form:
    $pattern = "/<([\w]+)([^>]*?)(([\s]*\/>)|".
    "(>((([^<]*?|<\!\-\-.*?\-\->)|(?R))*)<\/\\1[\s]*>))/sm";
    preg_match_all($pattern, $html, $matches, PREG_OFFSET_CAPTURE);
    $elements = array();
    
    foreach ($matches[0] as $key => $match) {
        $elements[] = (object)array(
            'node' => $match[0],
            'offset' => $match[1],
            'tagname' => $matches[1][$key][0],
            'attributes' => isset($matches[2][$key][0]) ? $matches[2][$key][0] : '',
            'omittag' => ($matches[4][$key][1] > -1), // boolean
            'inner_html' => isset($matches[6][$key][0]) ? $matches[6][$key][0] : ''
        );
    }
    return $elements;
}

// random html nodes as example:
$html = <<<EOD
<div id="airport">
    <div geo:position="1.234324,3.455546" class="index">
        <!-- comment test:
        <div class="index_top" />
        -->
        <div class="element decorator">
                <ul class="lister">
                    <li onclick="javascript:item.showAttribute('desc');">
                        <h3 class="outline">
                            <a href="http://php.net/manual/en/regexp.reference.recursive.php" onclick="openPopup()">Link</a>
                        </h3>
                        <div class="description">Sample description</div>
                    </li>
                </ul>
        </div>
        <div class="clean-line"></div>
    </div>
</div>
<div id="omittag_test" rel="rootChild" />
EOD;

// application:
$elements = parse($html);

if (count($elements) > 0) {
    echo "Elements found: <b>".count($elements)."</b><br />";
    
    foreach ($elements as $element) {
        echo "<p>Tpl node: <pre>".htmlentities($element->node)."</pre>
        Tagname: <tt>".$element->tagname."</tt><br />
        Attributes: <tt>".$element->attributes."</tt><br />
        Omittag: <tt>".($element->omittag ? 'true' : 'false')."</tt><br />
        Inner HTML: <pre>".htmlentities($element->inner_html)."</pre></p>";
    }
}
?>

down

Onyxagargaryll ¶

15 years ago

Here's an approach to create a multidimensional array according to a string's delimiters, i.e. we want to analyze...

"some text (aaa(b(c1)(c2)d)e)(test) more text"

... as multidimensional layers.

<?php
$string = "some text (aaa(b(c1)(c2)d)e)(test) more text";

/*
 * Analyses the string multidimensionally by its opening and closing delimiters
 */
function recursiveSplit($string, $layer) {
    preg_match_all("/\((([^()]*|(?R))*)\)/",$string,$matches);
    // iterate thru matches and continue recursive split
    if (count($matches) > 1) {
        for ($i = 0; $i < count($matches[1]); $i++) {
            if (is_string($matches[1][$i])) {
                if (strlen($matches[1][$i]) > 0) {
                    echo "<pre>Layer ".$layer.":   ".$matches[1][$i]."</pre><br />";
                    recursiveSplit($matches[1][$i], $layer + 1);
                }
            }
        }
    }
}

recursiveSplit($string, 0);

/*

Output:

Layer 0:   aaa(b(c1)(c2)d)e

Layer 1:   b(c1)(c2)d

Layer 2:   c1

Layer 2:   c2

Layer 0:   test
*/
?>

down

Anonymous ¶

9 years ago

sass parse sample

<?php

$data = 'a { b { 1 } c { d { 2 } } }';

preg_match('/a (?<R>\{(?:[^{}]+|(?&R))*\})/', $data, $m);
preg_match('/b (?<R>\{(?:[^{}]+|(?&R))*\})/', $data, $m);
preg_match('/c (?<R>\{(?:[^{}]+|(?&R))*\})/', $data, $m);
preg_match('/d (?<R>\{(?:[^{}]+|(?&R))*\})/', $data, $m);

/*
Array (
    [0] => a { b { 1 } c { d { 2 } } }
    [R] => { b { 1 } c { d { 2 } } }
    [1] => { b { 1 } c { d { 2 } } }
)
Array (
    [0] => b { 1 }
    [R] => { 1 }
    [1] => { 1 }
)
Array (
    [0] => c { d { 2 } }
    [R] => { d { 2 } }
    [1] => { d { 2 } }
)
Array (
    [0] => d { 2 }
    [R] => { 2 }
    [1] => { 2 }
)
*/

down

mzvarik at gmail dot com ¶

5 years ago

This regexp can be used for parsing IF conditions.

$str = '
(IF_MYVAR)My var is printed
(OR_MYVARTWO)My var two is printed
(OR_ANOTHER)if you use OR you don't have to END everytime
(ELSE)Whatever bro(END)

(IF_BLUE)Something (IF_SUPERB)super(END) blue - this is simple IF condition(END)
';

function isCondition($k) {
    // put your user conditions here
    $conds = [];
    $conds['BLUE'] = true;
    $conds['MYVARTWO'] = true;
    $conds['ELSE'] = true; // always true
     return $conds[$k];
}

function findConditions($str) {
    
    $pattern = '~ \(if_([^\)]+)\) ((?: (?!\((end|if_)). | (?R) )*+) \(end\) ~xis';
    
    $str = preg_replace_callback($pattern, function($m) {
        
        $k = $m[1];
        $v = $m[2];
        $v = findConditions($v) ?: $v;
        
        $ors = preg_split('~(?=\((OR_[^\)]+|ELSE))~is', $v);
        
        $v = array_shift($ors); // hlavní
                
        if (isCondition($k)) return findConditions($v);
        else {
            foreach ($ors as $or) {
                list($k, $v) = explode(")", $or, 2);
                $k = substr($k, 1);
                if (isCondition($k)) return findConditions($v);
            }
        }
        return ''; // no matching condition
    }, $str);
    return $str;
};

// will output:  Whatever bro  \n\n  Something blue
echo findConditions($str);

down

-1

Daniel Klein ¶

14 years ago

The order of non-mutually exclusive alternatives within a recursed sub-pattern is important.
<?php
$pattern = '/^(?<octet>[01]?\d?\d|2[0-4]\d|25[0-5])(?:\.(?P>octet)){3}$/';
?>

You might expect that this pattern will match any IP address in dotted-decimal notation (e.g. '123.45.67.89'). The pattern is intended to match four octets in the following ranges: 0-9, 00-99 & 000-255, each separated by a single dot. However, only the first octet can include values from 200-255; the remainder can only have values less than 200. The reason for this is that if the rest of the pattern fails, recursion is not back-tracked into to find an alternative match. The first part of the sub-pattern will match the first two digits of any octet from 200-255. The rest of the pattern will then fail because the third digit in the octet does not match either '\.' or '$'.

<?php
var_export(preg_match($pattern, '255.123.45.67')); // 1 (true)
var_export(preg_match($pattern, '255.200.45.67')); // 0 (false)
var_export(preg_match($pattern, '255.123.45.200')); // 0 (false)
?>

The correct pattern is:
<?php
$pattern = '/^(?<octet>25[0-5]|2[0-4]\d|[01]?\d?\d)(?:\.(?P>octet)){3}$/';
?>

Note that the first two alternatives are mutually exclusive so their order is unimportant. The third alternative, however, is not mutually exclusive but it will now only match when the first two fail.

<?php
var_export(preg_match($pattern, '255.123.45.67')); // 1 (true)
var_export(preg_match($pattern, '255.200.45.67')); // 1 (true)
var_export(preg_match($pattern, '255.123.45.200')); // 1 (true)
?>

down

-1

horvath at webarticum dot hu ¶

13 years ago

Below are some reusable patterns. I used comments with the 'x' modifier for human-readability.

You can write also a function, that generates patterns for specified bracket/quote pairs.

<?php

/* normal paretheses */
$simple_pattern = "(    (?#root pattern)
  (                     (?#text or expression)
    (?>[^\\(\\)]+)      (?#text)
    |
    \\((?-2)\\)         (?#expression and recursion)
  )*
)";

$simple_okay_text = "5( 2a + (b - c) ) - a * ( 2b - (c * 3(b - (c + a) ) ) )";
$simple_bad_text = "5( 2)a + (b - c) ) - )a * ( ((2b - (c * 3(b - (c + a) ) ) )";

echo "Simple pattern results:\n";
var_dump(preg_match("/^$simple_pattern\$/x",$simple_okay_text));
var_dump(preg_match("/^$simple_pattern\$/x",$simple_bad_text));
echo "\n----------\n";

/* some brackets */
$full_pattern = "(                  (?#root pattern)
  (                                 (?#text or expression)
    (?>[^\\(\\)\\{\\}\\[\\]<>]+)    (?#text not contains brackets)
    |
    (
      [\\(\\{\\[<]                  (?#start bracket)
        (?(?<=\\()(?-3)\\)|           (?#if normal)
        (?(?<=\\{)(?-3)\\}|           (?#if coursed)
        (?(?<=\\[)(?-3)\\]|           (?#if squared)
        (?1)\\>                       (?#else so if tag)
      )))                           (?#close nested-but-logically-the-some-level subpatterns)
    )
  )*
)";

$full_okay_text = "5( 2a + [b - c] ) - a * ( 2b - {c * 3<b - (c + a) > } )";
$full_bad_text = "5[ 2a + [b - c} ) - a * ( 2b - [c * 3(b - c + a) ) ) }";

echo "Full pattern results:\n";
var_dump(preg_match("/^$full_pattern\$/x",$simple_okay_text));
var_dump(preg_match("/^$full_pattern\$/x",$full_okay_text));
var_dump(preg_match("/^$full_pattern\$/x",$full_bad_text));
echo "\n----------\n";

/* some brackets and quotes */
$extrafull_pattern = "(             (?#root pattern)
  (                                 (?#text or expression)
    (?>[^\\(\\)\\{\\}\\[\\]<>'\"]+) (?#text not contains brackets and quotes)
    |
    (
      ([\\(\\{\\[<'\"])             (?#start bracket)
        (?(?<=\\()(?-4)\\)|           (?#if normal)
        (?(?<=\\{)(?-4)\\}|           (?#if coursed)
        (?(?<=\\[)(?-4)\\]|           (?#if squared)
        (?(?<=\\<)(?-4)\\>|           (?#if tag)
        ['\"]                         (?#else so if static)
      ))))                          (?#close nested-but-logically-the-some-level subpatterns)
    )
  )*
)";

$extrafull_okay_text = "5( 2a + ['b' - c] ) - a * ( 2b - \"{c * 3<b - (c + a) > }\" )";
$extrafull_bad_text = "5( 2a + ['b' - c] ) - a * ( 2b - \"{c * 3<b - (c + a) > }\" )";

echo "Extra-full pattern results:\n";
var_dump(preg_match("/^$extrafull_pattern\$/x",$simple_okay_text));
var_dump(preg_match("/^$extrafull_pattern\$/x",$full_okay_text));
var_dump(preg_match("/^$extrafull_pattern\$/x",$extrafull_okay_text));
var_dump(preg_match("/^$extrafull_pattern\$/x",$extrafull_bad_text));

/*

Outputs:

Simple pattern results:
int(0)
int(0)

----------
Full pattern results:
int(0)
int(0)
int(0)

----------
Extra-full pattern results:
int(0)
int(0)
int(0)
int(0)

*/

?>

down

-1

jonah at nucleussystems dot com ¶

15 years ago

An unexpected behavior came up that introduced a very hard-to-track bug in some code I was working on.  It has to do with the preg_match_all PREG_OFFSET_CAPTURE flag.  When you capture the offset of a sub-match, it's offset is given _relative_ to it's parent.  For example, if you extract the value between < and > recursively in this string:

<this is a <string>>

You will get an array that looks like this:

Array
(
    [0] => Array
    (
        [0] => Array
        (
            [0] => <this is a <string>>
            [1] => 0
        )
        [1] => Array
        (
            [0] => this is a <string>
            [1] => 1
        )
    )
    [1] => Array
    (
        [0] => Array
        (
            [0] => <string>
            [1] => 0
        )
        [1] => Array
        (
            [0] => string
            [1] => 1
        )
    )
)

Notice that the offset in the last index is one, not the twelve we expected.  The best way to solve this problem is to run over the results with a recursive function, adding the parent's offset.

＋add a note