Here is a function I wrote to build on the previous remarks about charset problems (UTF-8, etc.) when using loadHTML and then DOM functions.
It adds a charset meta tag right after <head> to help automatic encoding detection and converts every non-ASCII character to an HTML entity, so that PHP DOM functions and attributes return correct values.
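To see the effect of the meta tag in isolation, here is a minimal sketch. The sample markup and the variable names are made up for illustration, and the exact mis-decoded result of the first parse may vary with the libxml2 version.

```php
<?php
// UTF-8 bytes for "café" inside a page that declares no charset.
$html = "<html><head></head><body><p>caf\xC3\xA9</p></body></html>";
$meta = '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">';

// Collect parser warnings instead of printing them.
libxml_use_internal_errors(true);

// Parse 1: no charset hint, so the parser has to guess the encoding.
$plain = new DOMDocument;
$plain->loadHTML($html);

// Parse 2: same markup with the meta tag inserted right after <head>.
$withMeta = new DOMDocument;
$withMeta->loadHTML(str_replace('<head>', '<head>' . $meta, $html));

$plainText    = $plain->getElementsByTagName('p')->item(0)->textContent;
$withMetaText = $withMeta->getElementsByTagName('p')->item(0)->textContent;

// $plainText is typically mis-decoded (the two UTF-8 bytes read as two characters);
// $withMetaText holds the expected UTF-8 string "café".
var_dump($plainText === $withMetaText);
?>
```

loadNprepare() below applies the same idea to a fetched page, using the detected encoding in place of the hard-coded UTF-8.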
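<?php
// Order in which mb_detect_encoding() tries candidate encodings.
mb_detect_order("ASCII,UTF-8,ISO-8859-1,windows-1252,iso-8859-15");

// Fetches $url and returns it as a DOMDocument, or FALSE on failure.
// $encod: source encoding; auto-detected when empty.
function loadNprepare($url, $encod = '') {
    $content = file_get_contents($url);
    // Fetch failed or empty body: loadHTML() cannot parse an empty string.
    if (empty($content)) return FALSE;
    if (empty($encod))
        $encod = mb_detect_encoding($content);
    // mb_detect_encoding() returns FALSE when no candidate matches.
    if (empty($encod)) return FALSE;
    // Only a bare <head> or <HEAD> is matched, not mixed case or a tag with attributes.
    $headpos = mb_strpos($content, '<head>', 0, $encod);
    if (FALSE === $headpos)
        $headpos = mb_strpos($content, '<HEAD>', 0, $encod);
    if (FALSE !== $headpos) {
        $headpos += 6; // strlen('<head>')
        // Declare the charset right after <head> so the parser does not have to guess.
        $content = mb_substr($content, 0, $headpos, $encod)
            . '<meta http-equiv="Content-Type" content="text/html; charset=' . $encod . '">'
            . mb_substr($content, $headpos, null, $encod);
    }
    // Turn non-ASCII characters into entities ('HTML-ENTITIES' is deprecated as of PHP 8.2).
    $content = mb_convert_encoding($content, 'HTML-ENTITIES', $encod);
    $dom = new DOMDocument;
    $res = $dom->loadHTML($content);
    if (!$res) return FALSE;
    return $dom;
}
?>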
NB: the function uses mb_strpos/mb_substr rather than mb_ereg_replace because that seemed more efficient on very large HTML pages.
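Note for recent PHP versions: the 'HTML-ENTITIES' target passed to mb_convert_encoding() above is deprecated as of PHP 8.2. One possible replacement, sketched below, is to convert to UTF-8 first and then call mb_encode_numericentity(). The helper name toNumericEntities() is hypothetical and not part of the original function, and it emits numeric entities (&#233;) instead of named ones (&eacute;), which the DOM parser should treat the same way.

```php
<?php
// Hypothetical helper: replaces every non-ASCII character with a numeric entity.
function toNumericEntities($content, $encod) {
    // Normalise to UTF-8 so the code point ranges below apply.
    $utf8 = mb_convert_encoding($content, 'UTF-8', $encod);
    // Map format: [first code point, last code point, offset, mask].
    // U+0080..U+10FFFF covers everything outside ASCII, so markup stays untouched.
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($utf8, $map, 'UTF-8');
}

// "\xE9" is "é" in ISO-8859-1.
echo toNumericEntities("<p>caf\xE9</p>", 'ISO-8859-1'), "\n"; // prints <p>caf&#233;</p>
?>
```

Inside loadNprepare(), the call would take the place of the mb_convert_encoding($content, 'HTML-ENTITIES', $encod) line.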