php[tek] 2018 : Call for Speakers

Unicode字符属性

自从 PHP 4.4.0 和 5.1.0, 三个额外的转义序列在选用 UTF-8模式时用于匹配通用字符类型。他们是:

\p{xx}
一个有属性 xx 的字符
\P{xx}
一个没有属性 xx 的字符
\X
一个扩展的 Unicode 字符

上面 xx 代表的属性名用于限制 Unicode 通常的类别属性。 每个字符都有一个这样的确定的属性,通过两个缩写的字母指定。 为了与 perl 兼容, 可以在左花括号 { 后面增加 ^ 表示取反。比如: \p{^Lu} 就等同于 \P{Lu}

如果通过 \p\P 仅指定了一个字母,它包含所有以这个字母开头的属性。 在这种情况下,花括号的转义序列是可选的。

\p{L}
\pL
支持的Unicode属性
Property Matches Notes
C Other  
Cc Control  
Cf Format  
Cn Unassigned  
Co Private use  
Cs Surrogate  
L Letter Includes the following properties: Ll, Lm, Lo, Lt and Lu.
Ll Lower case letter  
Lm Modifier letter  
Lo Other letter  
Lt Title case letter  
Lu Upper case letter  
M Mark  
Mc Spacing mark  
Me Enclosing mark  
Mn Non-spacing mark  
N Number  
Nd Decimal number  
Nl Letter number  
No Other number  
P Punctuation  
Pc Connector punctuation  
Pd Dash punctuation  
Pe Close punctuation  
Pf Final punctuation  
Pi Initial punctuation  
Po Other punctuation  
Ps Open punctuation  
S Symbol  
Sc Currency symbol  
Sk Modifier symbol  
Sm Mathematical symbol  
So Other symbol  
Z Separator  
Zl Line separator  
Zp Paragraph separator  
Zs Space separator  

InMusicalSymbols 等扩展属性在 PCRE 中不支持

指定大小写不敏感匹配对这些转义序列不会产生影响,比如, \p{Lu} 始终匹配大写字母。

Unicode 字符集在具体文字中定义。使用文字名可以匹配这些字符集中的一个字符。例如:

  • \p{Greek}
  • \P{Han}

不在确定文字中的则被集中到 Common。当前的文字列表中有:

支持的文字
Arabic Armenian Avestan Balinese Bamum
Batak Bengali Bopomofo Brahmi Braille
Buginese Buhid Canadian_Aboriginal Carian Chakma
Cham Cherokee Common Coptic Cuneiform
Cypriot Cyrillic Deseret Devanagari Egyptian_Hieroglyphs
Ethiopic Georgian Glagolitic Gothic Greek
Gujarati Gurmukhi Han Hangul Hanunoo
Hebrew Hiragana Imperial_Aramaic Inherited Inscriptional_Pahlavi
Inscriptional_Parthian Javanese Kaithi Kannada Katakana
Kayah_Li Kharoshthi Khmer Lao Latin
Lepcha Limbu Linear_B Lisu Lycian
Lydian Malayalam Mandaic Meetei_Mayek Meroitic_Cursive
Meroitic_Hieroglyphs Miao Mongolian Myanmar New_Tai_Lue
Nko Ogham Old_Italic Old_Persian Old_South_Arabian
Old_Turkic Ol_Chiki Oriya Osmanya Phags_Pa
Phoenician Rejang Runic Samaritan Saurashtra
Sharada Shavian Sinhala Sora_Sompeng Sundanese
Syloti_Nagri Syriac Tagalog Tagbanwa Tai_Le
Tai_Tham Tai_Viet Takri Tamil Telugu
Thaana Thai Tibetan Tifinagh Ugaritic
Vai Yi        

\X 转义匹配任意数量的 Unicode 字符。 \X 等价于 (?>\PM\pM*)

也就是说,它匹配一个没有 ”mark” 属性的字符,紧接着任意多个由 ”mark” 属性的字符。 并将这个序列认为是一个原子组(详见下文)。 典型的有 ”mark” 属性的字符是影响到前面的字符的重音符。

用 Unicode 属性来匹配字符并不快, 因为 PCRE 需要去搜索一个包含超过 15000 字符的数据结构。 这就是为什么在 PCRE中 要使用传统的转义序列\d\w 而不使用 Unicode 属性的原因。

add a note add a note

User Contributed Notes 8 notes

up
4
huhwatnouDONTspamPLEASE at hotmail dot com
1 year ago
To select UTF-8 mode for the additional escape sequences (\p{xx}, \P{xx}, and \X) , use the "u" modifier (see http://php.net/manual/en/reference.pcre.pattern.modifiers.php).

I wondered why a German sharp S (ß) was marked as a control character by \p{Cc} and it took me a while to properly read the first sentence: "Since 5.1.0, three additional escape sequences to match generic character types are available when UTF-8 mode is selected. " :-$ and then to find out how to do so.
up
10
xuantoaiph at gmail dot com
3 years ago
My country, Vietnam, have our own alphabet table:
http://en.wikipedia.org/wiki/Vietnamese_alphabet
I hope PHP will support better than in Vietnamese.
up
6
mercury at caucasus dot net
7 years ago
An excellent article explaining all these properties can be found here: http://www.regular-expressions.info/unicode.html
up
4
o_shes01 at uni-muenster dot de
6 years ago
For those who wonder: 'letter_titlecase' applies to digraphs/trigraphs, where capitalization involves only the first letter.
For example, there are three codepoints for the "LJ" digraph in Unicode:
  (*) uppercase "LJ": U+01C7
  (*) titlecase "Lj": U+01C8
  (*) lowercase "lj": U+01C9
up
1
Yzmir Ramirez
4 years ago
If you are working with older environments you will need to first check to see if the version of PCRE will work with unicode directives described above:

<?php

// Need to check PCRE version because some environments are
// running older versions of the PCRE library
// (run in *nix environment `pcretest -C`)

$allowInternational = false;
if (
defined('PCRE_VERSION')) {
    if (
intval(PCRE_VERSION) >= 7) { // constant available since PHP 5.2.4
       
$allowInternational = true;
    }
}
?>

Now you can do a fallback regex (e.g. use "/[a-z]/i"), when the PCRE library version is too old or not available.
up
0
php at lnx-bsp dot net
1 month ago
Not made clear in the top of page explanation, but these escaped character classes can be included within square brackets to make a broader character class. For example:

<?php preg_match( '/[\p{N}\p{L}]+/', $data ) ?>

Will match any combination of letters and numbers.
up
-1
suit at rebell dot at
7 years ago
these properties are usualy only available if PCRE is compiled with "--enable-unicode-properties"

if you want to match any word but want to provide a fallback, you can do something like that:

<?php
if(@preg_match_all('/\p{L}+/u', $str, $arr) {
 
// fallback goes here
  // for example just '/\w+/u' for a less acurate match
}
?>
up
-3
o_shes01 at uni-muenster dot de
6 years ago
For those who wonder: 'letter_titlecase' applies to digraphs/trigraphs, where capitalization involves only the first letter.
For example, there are three codepoints for the "LJ" digraph in Unicode:
  (*) uppercase "LJ": U+01C7
  (*) titlecase "Lj": U+01C8
  (*) lowercase "lj": U+01C9
To Top