grapheme_extract

(PHP 5 >= 5.3.0, PHP 7, PHP 8, PECL intl >= 1.0.0)

grapheme_extract — Function to extract a sequence of default grapheme clusters from a text buffer, which must be encoded in UTF-8

Опис

Процедурний стиль

grapheme_extract(
    string $haystack,
    int $size,
    int $type = GRAPHEME_EXTR_COUNT,
    int $offset = 0,
    int &$next = null
): string|false

Function to extract a sequence of default grapheme clusters from a text buffer, which must be encoded in UTF-8.

Параметри

haystack

String to search.

size

Maximum number items - based on the type - to return.

type

Defines the type of units referred to by the size parameter:

GRAPHEME_EXTR_COUNT (default) -size is the number of default grapheme clusters to extract.
GRAPHEME_EXTR_MAXBYTES -size is the maximum number of bytes returned.
GRAPHEME_EXTR_MAXCHARS - size is the maximum number of UTF-8 characters returned.

offset

Starting position in haystack in bytes - if given, it must be zero or a positive value that is less than or equal to the length of haystack in bytes, or a negative value that counts from the end of haystack. If offset does not point to the first byte of a UTF-8 character, the start position is moved to the next character boundary.

next

Reference to a value that will be set to the next starting position. When the call returns, this may point to the first byte position past the end of the string.

Значення, що повертаються

A string starting at offset offset and ending on a default grapheme cluster boundary that conforms to the size and type specified, або false в разі помилки.

Журнал змін

Версія	Опис
7.1.0	Support for negative `offset`s has been added.

Приклади

Приклад #1 grapheme_extract() example

<?php

$char_a_ring_nfd = "a\xCC\x8A";  // 'LATIN SMALL LETTER A WITH RING ABOVE' (U+00E5) normalization form "D"
$char_o_diaeresis_nfd = "o\xCC\x88"; // 'LATIN SMALL LETTER O WITH DIAERESIS' (U+00F6) normalization form "D"

print urlencode(grapheme_extract( $char_a_ring_nfd . $char_o_diaeresis_nfd, 1, GRAPHEME_EXTR_COUNT, 2));

?>

Поданий вище приклад виведе:

o%CC%88

Прогляньте також

grapheme_substr() - Return part of a string
» Unicode Text Segmentation: Grapheme Cluster Boundaries

Found A Problem?

Learn How To Improve This Page • Submit a Pull Request • Report a Bug

＋add a note

User Contributed Notes 3 notes

down

AJH ¶

13 years ago

Here's how to use grapheme_extract() to loop across a UTF-8 string character by character.

<?php

$str = "سabcक’…";
// if the previous line didn't come through, the string contained:
//U+0633,U+0061,U+0062,U+0063,U+0915,U+2019,U+2026

$n = 0;

for (    $start = 0, $next = 0, $maxbytes = strlen($str), $c = '';
        $start < $maxbytes;
        $c = grapheme_extract($str, 1, GRAPHEME_EXTR_MAXCHARS , ($start = $next), $next)
    )
{
    if (empty($c))
        continue;
    echo "This utf8 character is " . strlen($c) . " bytes long and its first byte is " . ord($c[0]) . "\n";
    $n++;
}
echo "$n UTF-8 characters in a string of $maxbytes bytes!\n";
// Should print: 7 UTF8 characters in a string of 14 bytes!
?>

down

Philo ¶

1 year ago

The other comments on this page were helpful for me.
However, consider using something better than empty($value) when checking the value returned by grapheme_extract since it could as well return something like "0" (which of course evaluates to false).

down

yevgen dot grytsay at gmail dot com ¶

4 years ago

Looping through grapheme clusters:

<?php

// Example taken from Rust documentation: https://doc.rust-lang.org/book/ch08-02-strings.html#bytes-and-scalar-values-and-grapheme-clusters-oh-my
$str = "नमस्ते";
// Alternatively:
//$str = pack('C*', ...[224, 164, 168, 224, 164, 174, 224, 164, 184, 224, 165, 141, 224, 164, 164, 224, 165, 135]);
$next = 0;
$maxbytes = strlen($str);

var_dump($str);

while ($next < $maxbytes) {
    $char = grapheme_extract($str, 1, GRAPHEME_EXTR_COUNT, $next, $next);
    if (empty($char)) {
        continue;
    }
    echo "{$char} - This utf8 character is " . strlen($char) . ' bytes long', PHP_EOL;
}

//string(18) "नमस्ते"
//न - This utf8 character is 3 bytes long
//म - This utf8 character is 3 bytes long
//स् - This utf8 character is 6 bytes long
//ते - This utf8 character is 6 bytes long
?>

＋add a note