
html_entity_decode

(PHP 4 >= 4.3.0, PHP 5)

html_entity_decode: Converts HTML entities to their corresponding characters

Description

string html_entity_decode ( string $string [, int $quote_style [, string $charset ]] )

html_entity_decode() is the opposite of htmlentities(): it converts all HTML entities in the string parameter to their corresponding characters.

The optional second parameter, quote_style, specifies how 'single' and "double" quotes are handled. It takes one of three constants, with ENT_COMPAT as the default:

Available quote_style constants
Constant name   Description
ENT_COMPAT      Converts double quotes and leaves single quotes unchanged.
ENT_QUOTES      Converts both double and single quotes.
ENT_NOQUOTES    Leaves both double and single quotes unchanged.
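
The effect of the three constants can be illustrated with a minimal sketch. The sample string is made up for illustration, and the charset is passed explicitly as UTF-8 so that the result does not depend on the configured default.

```php
<?php
// Input containing an encoded double quote and an encoded single quote.
$encoded = '&quot;double&quot; and &#039;single&#039;';

$compat   = html_entity_decode($encoded, ENT_COMPAT, 'UTF-8');
$quotes   = html_entity_decode($encoded, ENT_QUOTES, 'UTF-8');
$noquotes = html_entity_decode($encoded, ENT_NOQUOTES, 'UTF-8');

echo $compat, "\n";   // "double" and &#039;single&#039;
echo $quotes, "\n";   // "double" and 'single'
echo $noquotes, "\n"; // &quot;double&quot; and &#039;single&#039;
```

Only ENT_QUOTES decodes both kinds of quote; with ENT_NOQUOTES the sample string comes back unchanged.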

The optional third parameter, charset, defaults to the ISO-8859-1 character set. It specifies which character set to use for the conversion.

List of supported character sets:

Supported character sets
Charset      Aliases                       Description
ISO-8859-1   ISO8859-1                     Western European, Latin-1.
ISO-8859-5   ISO8859-5                     Little-used Cyrillic charset (Latin/Cyrillic).
ISO-8859-15  ISO8859-15                    Western European, Latin-9. Adds the Euro sign and the French and Finnish letters missing from Latin-1 (ISO-8859-1).
UTF-8                                      ASCII-compatible multi-byte 8-bit Unicode.
cp866        ibm866, 866                   DOS-specific Cyrillic charset.
cp1251       Windows-1251, win-1251, 1251  Windows-specific Cyrillic charset.
cp1252       Windows-1252, 1252            Windows-specific charset for Western Europe.
KOI8-R       koi8-ru, koi8r                Russian.
BIG5         950                           Traditional Chinese, mainly used in Taiwan.
GB2312       936                           Simplified Chinese, national standard character set.
BIG5-HKSCS                                 Big5 with Hong Kong extensions, Traditional Chinese.
Shift_JIS    SJIS, SJIS-win, cp932, 932    Japanese.
EUC-JP       EUCJP, eucJP-win              Japanese.
MacRoman                                   Charset that was used by Mac OS.
''                                         An empty string activates detection from the script encoding (Zend multibyte), default_charset and the current locale (see nl_langinfo() and setlocale()), in this order. Not recommended.

Note: Any other character set is not recognized. The default encoding is used instead and a warning is emitted.
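
How the charset affects the result can be seen in a short sketch. The sample string is invented for illustration; the expected bytes assume a PHP version in which entities that cannot be represented in the target charset are left encoded.

```php
<?php
$encoded = 'caf&eacute; costs 2&euro;';

// UTF-8 can represent both characters, so both entities are decoded.
$utf8 = html_entity_decode($encoded, ENT_QUOTES, 'UTF-8');

// ISO-8859-1 has no euro sign, so &euro; is left as-is, while
// &eacute; becomes the single byte 0xE9.
$latin1 = html_entity_decode($encoded, ENT_QUOTES, 'ISO-8859-1');

echo bin2hex($utf8), "\n";
echo bin2hex($latin1), "\n";
```

Comparing the two hex dumps shows that the same entity can decode to different byte sequences, or not be decoded at all, depending on the charset argument.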

Example #1 Decoding HTML entities

<?php
$orig = "I'll \"walk\" the <b>dog</b> now";

$a = htmlentities($orig);

$b = html_entity_decode($a);

echo $a; // I'll &quot;walk&quot; the &lt;b&gt;dog&lt;/b&gt; now

echo $b; // I'll "walk" the <b>dog</b> now


// For users of PHP versions prior to 4.3.0:
function unhtmlentities($string)
{
    $trans_tbl = get_html_translation_table(HTML_ENTITIES);
    $trans_tbl = array_flip($trans_tbl);
    return strtr($string, $trans_tbl);
}

$c = unhtmlentities($a);

echo $c; // I'll "walk" the <b>dog</b> now

?>

Note:

You might wonder why trim(html_entity_decode('&nbsp;')); does not produce an empty string. This happens because the '&nbsp;' entity does not correspond to ASCII code 32 (which trim() would strip) but, in the default ISO-8859-1 encoding, to character code 160 (0xa0).
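
A minimal sketch of this behavior and two possible ways around it. The character list and the regular expression are illustrative choices, not part of the function's API, and the charset is passed explicitly in each call.

```php
<?php
// Decode explicitly as ISO-8859-1 so the result is the single byte 0xA0.
$latin1 = html_entity_decode('&nbsp;', ENT_QUOTES, 'ISO-8859-1');
echo bin2hex($latin1), "\n";                          // a0
var_dump(trim($latin1) === '');                       // bool(false)

// Adding 0xA0 to trim()'s character list removes it.
// This is only safe for single-byte charsets.
var_dump(trim($latin1, " \t\n\r\0\x0B\xA0") === ''); // bool(true)

// In UTF-8 the same entity decodes to two bytes (0xC2 0xA0), so a
// Unicode-aware pattern is a safer way to strip it from both ends.
$utf8  = html_entity_decode('&nbsp;text&nbsp;', ENT_QUOTES, 'UTF-8');
$clean = preg_replace('/^[\s\x{A0}]+|[\s\x{A0}]+$/u', '', $utf8);
echo $clean, "\n";                                    // text
```

The byte-based trim() list is avoided for UTF-8 because 0xA0 also occurs as a continuation byte inside other multi-byte characters.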

See also htmlentities(), htmlspecialchars(), get_html_translation_table(), and urldecode().


User Contributed Notes 16 notes

up
8
Martin
3 years ago
If you need something that converts &#[0-9]+ entities to UTF-8, this is simple and works:

<?php
/* Entity crap. */
$input = "Fovi&#269;";

$output = preg_replace_callback("/(&#[0-9]+;)/", function($m) { return mb_convert_encoding($m[1], "UTF-8", "HTML-ENTITIES"); }, $input);

/* Plain UTF-8. */
echo $output;
?>
up
5
me at richardsnazell dot com
6 years ago
I had a problem getting the 'TM' trademark symbol to display correctly in an email subject line. Using html_entity_decode() with different charsets didn't work, but directly replacing the entity with its single-byte equivalent (code 153 in Windows-1252) did:

$subject = str_replace('&trade;', chr(153), $subject);
up
8
daniel at brightbyte dot de
9 years ago
This function seems to have two limitations (at least in PHP 4.3.8):

a) it does not work with multibyte character codings, such as UTF-8
b) it does not decode numeric entity references

a) can be solved by using iconv to convert to ISO-8859-1, then decoding the entities, then converting to UTF-8 again. But that's quite ugly and destroys all characters not present in Latin-1.

b) can be solved rather nicely using the following code:

<?php
function decode_entities($text) {
    $text = html_entity_decode($text, ENT_QUOTES, "ISO-8859-1"); # NOTE: UTF-8 does not work!
    $text = preg_replace('/&#(\d+);/me', "chr(\\1)", $text); # decimal notation
    $text = preg_replace('/&#x([a-f0-9]+);/mei', "chr(0x\\1)", $text); # hex notation
    return $text;
}
?>

HTH
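
For comparison, a minimal sketch suggesting that on current PHP versions neither limitation applies once the charset is passed explicitly: named, decimal and hexadecimal references are all decoded straight to UTF-8. The sample string is made up for illustration.

```php
<?php
// Mixed decimal, hexadecimal and named references.
$encoded = 'Fovi&#269; &#x5d0; &eacute;';

$decoded = html_entity_decode($encoded, ENT_QUOTES, 'UTF-8');

// Print the raw bytes so the UTF-8 sequences are visible.
echo bin2hex($decoded), "\n";
```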
up
6
neurotic dot neu at gmail dot com
4 years ago
This is a safe rawurldecode with utf8 detection:

<?php
function utf8_rawurldecode($raw_url_encoded){
    $enc = rawurldecode($raw_url_encoded);
    if(utf8_encode(utf8_decode($enc))==$enc){
        return rawurldecode($raw_url_encoded);
    }else{
        return utf8_encode(rawurldecode($raw_url_encoded));
    }
}
?>
up
4
Benjamin
1 year ago
The following function decodes named and numeric HTML entities and works on UTF-8. Requires iconv.

function decodeHtmlEnt($str) {
    $ret = html_entity_decode($str, ENT_COMPAT, 'UTF-8');
    $p2 = -1;
    for(;;) {
        $p = strpos($ret, '&#', $p2+1);
        if ($p === FALSE)
            break;
        $p2 = strpos($ret, ';', $p);
        if ($p2 === FALSE)
            break;
           
        if (substr($ret, $p+2, 1) == 'x')
            $char = hexdec(substr($ret, $p+3, $p2-$p-3));
        else
            $char = intval(substr($ret, $p+2, $p2-$p-2));
           
        //echo "$char\n";
        $newchar = iconv(
            'UCS-4', 'UTF-8',
            chr(($char>>24)&0xFF).chr(($char>>16)&0xFF).chr(($char>>8)&0xFF).chr($char&0xFF)
        );
        //echo "$newchar<$p<$p2<<\n";
        $ret = substr_replace($ret, $newchar, $p, 1+$p2-$p);
        $p2 = $p + strlen($newchar);
    }
    return $ret;
}
up
5
Free at Key dot no
4 years ago
Handy function to convert remaining HTML-entities into human readable chars (for entities which do not exist in target charset):

<?php
function cleanString($in, $offset = 0)
{
    $out = trim($in);
    if (!empty($out))
    {
        $entity_start = strpos($out, '&', $offset);
        if ($entity_start === false)
        {
            // ideal
            return $out;
        }
        else
        {
            $entity_end = strpos($out, ';', $entity_start);
            if ($entity_end === false)
            {
                return $out;
            }
            // too long to be an entity
            else if ($entity_end > $entity_start + 7)
            {
                // keep going
                $out = cleanString($out, $entity_start + 1);
            }
            // gotcha!
            else
            {
                $clean = substr($out, 0, $entity_start);
                $subst = substr($out, $entity_start + 1, 1);
                // &scaron; => "s" / &#353; => "_"
                $clean .= ($subst != "#") ? $subst : "_";
                $clean .= substr($out, $entity_end + 1);
                // keep going
                $out = cleanString($clean, $entity_start + 1);
            }
        }
    }
    return $out;
}
?>
up
2
jojo
7 years ago
This function decodes strings that were encoded with JavaScript's escape() function.
It is useful when the page uses a multi-byte character set.

javascript escape('aaああaa') ..... 'aa%u3042%u3042aa'
php  jsEscape_decode('aa%u3042%u3042aa')..'aaああaa'

<?php
function jsEscape_decode($jsEscaped,$outCharCode='SJIS'){
    $arrMojis = explode("%u",$jsEscaped);
    for ($i = 1;$i < count($arrMojis);$i++){
        $c = substr($arrMojis[$i],0,4);
        $cc = mb_convert_encoding(pack('H*',$c),$outCharCode,'UTF-16');
        $arrMojis[$i] = substr_replace($arrMojis[$i],$cc,0,4);
    }
    return implode('',$arrMojis);
}
?>
up
5
aidan at php dot net
10 years ago
This functionality is now implemented in the PEAR package PHP_Compat.

More information about using this function without upgrading your version of PHP can be found at the link below:

http://pear.php.net/package/PHP_Compat
up
4
florianborn (at) yahoo (dot) de
9 years ago
Note that

<?php

echo urlencode(html_entity_decode("&nbsp;"));

?>

will output "%A0" instead of "+".
up
3
php dot net at c dash ovidiu dot tk
9 years ago
Quick & dirty code that translates numeric entities to UTF-8.

<?php

    function replace_num_entity($ord)
    {
        $ord = $ord[1];
        if (preg_match('/^x([0-9a-f]+)$/i', $ord, $match))
        {
            $ord = hexdec($match[1]);
        }
        else
        {
            $ord = intval($ord);
        }

        $no_bytes = 0;
        $byte = array();

        if ($ord < 128)
        {
            return chr($ord);
        }
        elseif ($ord < 2048)
        {
            $no_bytes = 2;
        }
        elseif ($ord < 65536)
        {
            $no_bytes = 3;
        }
        elseif ($ord < 1114112)
        {
            $no_bytes = 4;
        }
        else
        {
            return;
        }

        switch($no_bytes)
        {
            case 2:
            {
                $prefix = array(31, 192);
                break;
            }
            case 3:
            {
                $prefix = array(15, 224);
                break;
            }
            case 4:
            {
                $prefix = array(7, 240);
            }
        }

        for ($i = 0; $i < $no_bytes; $i++)
        {
            $byte[$no_bytes - $i - 1] = (($ord & (63 * pow(2, 6 * $i))) / pow(2, 6 * $i)) & 63 | 128;
        }

        $byte[0] = ($byte[0] & $prefix[0]) | $prefix[1];

        $ret = '';
        for ($i = 0; $i < $no_bytes; $i++)
        {
            $ret .= chr($byte[$i]);
        }

        return $ret;
    }

    $test = 'This is a &#269;&#x5d0; test&#39;';

    echo $test . "<br />\n";
    echo preg_replace_callback('/&#([0-9a-fx]+);/mi', 'replace_num_entity', $test);

?>
up
1
Matt Robinson
5 years ago
I wrote in a previous comment that html_entity_decode() only handled about 100 characters. That's not quite true; it only handles entities that exist in the output character set (the third argument). If you want to get ALL HTML entities, make sure you use ENT_QUOTES and set the third argument to 'UTF-8'.

If you don't want a UTF-8 string, you'll need to convert it afterward with something like utf8_decode(), iconv(), or mb_convert_encoding().

If you're producing XML, which doesn't recognise most HTML entities:

When producing a UTF-8 document (the default), use htmlspecialchars(html_entity_decode($string, ENT_QUOTES, 'UTF-8'), ENT_NOQUOTES, 'UTF-8') (because you only need to escape < and > and & unless you're printing inside the XML tags themselves).

Otherwise, either convert all the named entities to numeric ones, or declare the named entities in the document's DTD. The full list of 252 entities can be found in the HTML 4.01 Spec, or you can cut and paste the function from my site (http://inanimatt.com/php-convert-entities.php).
up
-3
marion at figmentthinking dot com
5 years ago
I just ran into the:
Bug #27626 html_entity_decode bug - cannot yet handle MBCS in html_entity_decode()!

The simple solution if you're still running PHP 4 is to wrap the html_entity_decode() function with the utf8_encode() function.

<?php
$string = '&nbsp;';
$utf8_encode = utf8_encode(html_entity_decode($string));
?>

By default html_entity_decode() returns the ISO-8859-1 character set, and by default utf8_decode()...

http://us.php.net/manual/en/function.utf8-decode.php
"Converts a string with ISO-8859-1 characters encoded with UTF-8 to single-byte ISO-8859-1"
up
-4
Victor
2 years ago
We were having very peculiar behavior regarding foreign characters such as e-acute.

However, it was only showing up as a problem when extracting those characters out of our mysql database and when being displayed through a proxy server of ours that handles dns issues.

As other users have noted, the default character set wasn't what they expected when they left theirs blank.

When we changed our default_charset to "UTF-8", our problems went away and we no longer needed functions like these to handle foreign characters such as e-acute. Good enough for us!
up
-4
kae at verens dot com
6 years ago
The references to 'chr()' in the example unhtmlentities() function should be changed to unichr, using the example unichr() function described in the 'chr' reference (http://php.net/chr).

The reason for this is characters such as &#x20AC;, which do not break down into an ASCII number (that's the Euro, by the way).
up
-5
jl dot garcia at gmail dot com
5 years ago
I created this function to filter all the text that goes in or comes out of the database.

<?php
function filter_string($string, $nohtml='', $save='') {
    if(!empty($nohtml)) {
        $string = trim($string);
        if(!empty($save)) $string = htmlentities(trim($string), ENT_QUOTES, 'ISO-8859-15');
        else $string = html_entity_decode($string, ENT_QUOTES, 'ISO-8859-15');
    }
    if(!empty($save)) $string = mysql_real_escape_string($string);
    else $string = stripslashes($string);
    return($string);
}
?>
up
-8
grvg (at) free (dot) fr
8 years ago
Here are the ultimate functions to convert HTML entities to UTF-8:
The main function is htmlentities2utf8.
The others are helper functions.

<?php
    function chr_utf8($code)
    {
        if ($code < 0) return false;
        elseif ($code < 128) return chr($code);
        elseif ($code < 160) // Remove Windows illegal chars
        {
            if ($code==128) $code=8364;
            elseif ($code==129) $code=160; // not affected
            elseif ($code==130) $code=8218;
            elseif ($code==131) $code=402;
            elseif ($code==132) $code=8222;
            elseif ($code==133) $code=8230;
            elseif ($code==134) $code=8224;
            elseif ($code==135) $code=8225;
            elseif ($code==136) $code=710;
            elseif ($code==137) $code=8240;
            elseif ($code==138) $code=352;
            elseif ($code==139) $code=8249;
            elseif ($code==140) $code=338;
            elseif ($code==141) $code=160; // not affected
            elseif ($code==142) $code=381;
            elseif ($code==143) $code=160; // not affected
            elseif ($code==144) $code=160; // not affected
            elseif ($code==145) $code=8216;
            elseif ($code==146) $code=8217;
            elseif ($code==147) $code=8220;
            elseif ($code==148) $code=8221;
            elseif ($code==149) $code=8226;
            elseif ($code==150) $code=8211;
            elseif ($code==151) $code=8212;
            elseif ($code==152) $code=732;
            elseif ($code==153) $code=8482;
            elseif ($code==154) $code=353;
            elseif ($code==155) $code=8250;
            elseif ($code==156) $code=339;
            elseif ($code==157) $code=160; // not affected
            elseif ($code==158) $code=382;
            elseif ($code==159) $code=376;
        }
        if ($code < 2048) return chr(192 | ($code >> 6)) . chr(128 | ($code & 63));
        elseif ($code < 65536) return chr(224 | ($code >> 12)) . chr(128 | (($code >> 6) & 63)) . chr(128 | ($code & 63));
        else return chr(240 | ($code >> 18)) . chr(128 | (($code >> 12) & 63)) . chr(128 | (($code >> 6) & 63)) . chr(128 | ($code & 63));
    }

    // Callback for preg_replace_callback('~&(#(x?))?([^;]+);~', 'html_entity_replace', $str);
    function html_entity_replace($matches)
    {
        if ($matches[2])
        {
            return chr_utf8(hexdec($matches[3]));
        } elseif ($matches[1])
        {
            return chr_utf8($matches[3]);
        }
        switch ($matches[3])
        {
            case "nbsp": return chr_utf8(160);
            case "iexcl": return chr_utf8(161);
            case "cent": return chr_utf8(162);
            case "pound": return chr_utf8(163);
            case "curren": return chr_utf8(164);
            case "yen": return chr_utf8(165);
            //... etc with all named HTML entities
        }
        return false;
    }

    function htmlentities2utf8 ($string) // because of the html_entity_decode() bug with UTF-8
    {
        $string = preg_replace_callback('~&(#(x?))?([^;]+);~', 'html_entity_replace', $string);
        return $string;
    }
?>