PHP - Finding and replacing keywords with a link using DOMDocument
I've been researching a way to find keywords that are inside a 'p', 'span' or 'blockquote' and replace them with a link, using DOMDocument. I've written a piece of regex that achieves this, but I'd rather use DOMDocument, as it should result in a better solution.
The code below has 2 main issues. If I place an & in $html, it crashes because the & isn't escaped, and I can't find a way to correctly escape the &.
A smaller issue, and not that important: if the HTML is invalid, DOMDocument tries to correct it, and I seem unable to prevent this.
The preg_replace uses arrays because it will be loaded dynamically with multiple keywords.
    $html = '
    <blockquote>random random text</blockquote>
    <p>we match text</p>
    <p>this sample text</p>';

    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->strictErrorChecking = false;
    $dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', "UTF-8"));

    $xpath = new DOMXPath($dom);

    foreach ($xpath->query('//text()[not(ancestor::a)][(ancestor::p|ancestor::blockquote)]') as $node) {
        $replaced = preg_replace(
            array('/(^|\s)'.preg_quote('we', '/').'(\s|$)/msi'),
            array('<a href="#wrapped">we</a>'),
            $node->wholeText
        );
        $newNode = $dom->createDocumentFragment();
        $newNode->appendXML($replaced);
        $node->parentNode->replaceChild($newNode, $node);
    }

    $result = mb_substr($dom->saveXML($xpath->query('//body')->item(0)), 6, -7, "UTF-8");
    libxml_clear_errors();

    echo $result;
The problem with the ampersand comes from the fact that you inject HTML with appendXML($replaced) and do not escape <, > or & in the text parts.
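To make the failure concrete, here is a minimal sketch that is separate from the question's code. The sample string 'fish & chips' is made up for illustration. It compares handing raw text to appendXML(), handing it escaped text, and creating a text node instead:

```php
<?php
libxml_use_internal_errors(true);   // collect parser errors instead of printing warnings

$dom = new DOMDocument();
$p = $dom->appendChild($dom->createElement('p'));

$text = 'fish & chips';             // illustrative sample text, not from the question

// appendXML() parses its argument as XML, so a bare "&" is a syntax error
// and the call reports failure.
$fragment = $dom->createDocumentFragment();
$ok = $fragment->appendXML($text);

// Escaping the text first turns it into well-formed XML.
$fragment = $dom->createDocumentFragment();
$escapedOk = $fragment->appendXML(htmlspecialchars($text, ENT_XML1 | ENT_QUOTES, 'UTF-8'));

// createTextNode() takes literal text; escaping happens when the tree is serialized.
$p->appendChild($dom->createTextNode($text));
$serialized = $dom->saveXML($p);

libxml_clear_errors();

var_dump($ok, $escapedOk);
echo $serialized, "\n";
```

If you keep the regex-plus-appendXML() approach, every text part would have to be escaped like this before the link markup is added, which is easy to get wrong.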
The main issue, though, is that you use DOMDocument to avoid regex manipulation of HTML, but you still manipulate HTML that way on a smaller scale, and so you bump into similar problems.
Here is a way to avoid that. I did not maintain the array style of the replace, so as not to make it over-complicated. I'm sure you will manage to extend it to other types of replacements when needed:
    foreach ($xpath->query(
            '//text()[not(ancestor::a)][(ancestor::p|ancestor::blockquote)]') as $node) {
        // Keep a reference to the parent node:
        $parent = $node->parentNode;
        // Split the text (e.g. "random we random text") into parts so we
        // can isolate the parts that must be modified,
        // e.g. into: ["random ", "we", " random text"]
        $parts = preg_split('/\b('.preg_quote('we', '/').')\b/msi',
                            $node->textContent, 0, PREG_SPLIT_DELIM_CAPTURE);
        foreach ($parts as $index => $part) {
            // Skip empty parts. Compare with '' rather than using empty(),
            // which would also drop a part consisting of just "0":
            if ($part === '') continue;
            // Parts corresponding to the captured expression in the
            // split delimiter (e.g. "we") occur at odd indexes:
            if ($index % 2) {
                // Create the anchor the DOM way. The value that is passed
                // should not be interpreted as HTML, so escape it:
                $el = $dom->createElement('a', htmlentities($part));
                $el->setAttribute('href', '#wrapped');
            } else {
                // Create the text node the DOM way. The text is escaped
                // by the library, which knows it should not be interpreted
                // as HTML:
                $el = $dom->createTextNode($part);
            }
            // Insert this part before the node we are processing
            $parent->insertBefore($el, $node);
        }
        // When all parts are inserted, delete the node we split
        $parent->removeChild($node);
    }
This way you'll not have the ampersand problem.
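Since the question mentions that the keywords are loaded dynamically, one possible way to extend the split-and-rebuild loop to several keywords is sketched below. The function name linkKeywords, the keyword => URL map and the sample markup are assumptions made for illustration. The sketch also assumes ASCII keywords and a document that is already loaded:

```php
<?php
/**
 * Hypothetical helper: wraps every keyword found in <p>/<blockquote> text
 * (outside existing links) in an anchor. $links maps keyword => URL.
 */
function linkKeywords(DOMDocument $dom, array $links): void
{
    if ($links === []) {
        return;
    }
    // Lower-case the keys so the URL lookup matches the case-insensitive regex.
    $urls = array_change_key_case($links, CASE_LOWER);

    // One alternation for all keywords, longest first so that a longer
    // keyword wins over a shorter one that is its prefix.
    $keywords = array_map('strval', array_keys($urls));
    usort($keywords, fn($a, $b) => strlen($b) <=> strlen($a));
    $quoted = array_map(fn($k) => preg_quote($k, '/'), $keywords);
    $pattern = '/\b(' . implode('|', $quoted) . ')\b/i';

    $xpath = new DOMXPath($dom);
    $query = '//text()[not(ancestor::a)][(ancestor::p|ancestor::blockquote)]';

    foreach ($xpath->query($query) as $node) {
        $parts = preg_split($pattern, $node->textContent, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) === 1) {
            continue;                       // no keyword in this text node
        }
        $parent = $node->parentNode;
        foreach ($parts as $index => $part) {
            if ($part === '') {
                continue;
            }
            if ($index % 2) {               // odd index: a captured keyword
                $el = $dom->createElement('a');
                $el->appendChild($dom->createTextNode($part));
                $el->setAttribute('href', $urls[strtolower($part)]);
            } else {                        // even index: plain text in between
                $el = $dom->createTextNode($part);
            }
            $parent->insertBefore($el, $node);
        }
        $parent->removeChild($node);
    }
}

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML('<p>We like cats & dogs</p><p>No match <a href="#">we</a></p>');
linkKeywords($dom, ['we' => '/about', 'dogs' => '/dogs']);
$out = $dom->saveHTML($dom->getElementsByTagName('body')->item(0));
libxml_clear_errors();

echo $out, "\n";
```

Here the anchor text is added as a text node rather than passed as the second argument of createElement(), so no manual escaping is needed for it.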
NB: There is no way that I know of to prevent DOMDocument from trying to "fix" invalid HTML.
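As a small illustration of that repair behaviour, the sketch below loads a made-up snippet with an unclosed <b> element and prints what DOMDocument serializes back. The exact repairs may differ between libxml2 versions, so treat the output as indicative only:

```php
<?php
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them

$dom = new DOMDocument();
// The <b> element is never closed in this illustrative input.
$dom->loadHTML('<p>unclosed <b>bold</p>');

$body = $dom->getElementsByTagName('body')->item(0);
$repaired = $dom->saveHTML($body);
libxml_clear_errors();

// The parser closes the <b> element for us, so the markup that comes
// back is not identical to what went in.
echo $repaired, "\n";
```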