php - Finding and replacing keywords with a link using DOMDocument


I've been researching a way to find keywords inside 'p', 'span' or 'blockquote' elements and replace them with a link, using DOMDocument. I've written a piece of regex that achieves this, but I'd rather use DOMDocument, as it should result in a better solution.

The code below has two issues. First, if I place an & in $html, it crashes because the & isn't escaped, and I can't find a way to escape it correctly.

A smaller issue, and not that important: if the HTML is invalid, DOMDocument tries to correct it, and I seem unable to prevent this.

The preg_replace call uses arrays because the keywords are loaded dynamically and there can be several of them.

    $html = '
    <blockquote>random random text</blockquote>
    <p>we match text</p>
    <p>this sample text</p>';

    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->strictErrorChecking = false;

    $dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', "UTF-8"));

    $xpath = new DOMXPath($dom);

    foreach ($xpath->query('//text()[not(ancestor::a)][(ancestor::p|ancestor::blockquote)]') as $node) {
        $replaced = preg_replace(
            array('/(^|\s)'.preg_quote('we', '/').'(\s|$)/msi'),
            array('<a href="#wrapped">we</a>'),
            $node->wholeText
        );
        $newNode  = $dom->createDocumentFragment();
        $newNode->appendXML($replaced);
        $node->parentNode->replaceChild($newNode, $node);
    }

    $result = mb_substr($dom->saveXML($xpath->query('//body')->item(0)), 6, -7, "UTF-8");

    libxml_clear_errors();

    echo $result;

The problem with the ampersand comes from the fact that you inject HTML with appendXML($replaced), but do not escape the <, > or & characters in the text parts.
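To make the difference concrete, here is a minimal sketch that feeds the same string to appendXML() and to createTextNode(). The sample text and the variable names are made up for this illustration; they do not come from the question.

```php
<?php
$dom = new DOMDocument();
$p = $dom->createElement('p');
$dom->appendChild($p);

// Illustrative text with a bare ampersand (an assumption for this demo).
$text = 'Fish & chips';

// appendXML() parses its argument as XML, so a bare "&" is a syntax
// error: the call returns false and the fragment stays empty.
libxml_use_internal_errors(true);
$fragment = $dom->createDocumentFragment();
$ok = $fragment->appendXML($text);
libxml_clear_errors();
var_dump($ok);

// createTextNode() treats the string as plain text; the serializer
// escapes it when the document is written out.
$p->appendChild($dom->createTextNode($text));
echo $dom->saveXML($p), "\n";
```

The empty fragment is what later breaks replaceChild() in the question's loop, while the text node route never needs manual escaping.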

The main issue, though, is that you use DOMDocument to avoid regex manipulation of HTML, yet you still manipulate HTML that way on a smaller scale, and so you bump into similar problems.

Here is a way to avoid that. I did not keep the array style of the replacement, so as not to make it over-complicated. I am sure you will manage to extend it to other types of replacements when needed:

    foreach ($xpath->query(
            '//text()[not(ancestor::a)][(ancestor::p|ancestor::blockquote)]')
            as $node) {
        // Keep a reference to the parent node:
        $parent = $node->parentNode;
        // Split the text (e.g. "random we random text") into parts so we
        // can isolate the parts that must be modified,
        // e.g. into: ["random ", "we", " random text"]
        $parts = preg_split('/\b('.preg_quote('we', '/').')\b/msi',
                            $node->textContent, 0, PREG_SPLIT_DELIM_CAPTURE);
        foreach ($parts as $index => $part) {
            // Skip empty parts; compare strictly, since empty("0") is true:
            if ($part === '') continue;
            // Parts corresponding to the captured expression in the
            // split delimiter (e.g. "we") occur at odd indexes:
            if ($index % 2) {
                // Create the anchor the DOM way. The value passed
                // should not be interpreted as HTML, so escape it:
                $el = $dom->createElement('a', htmlentities($part));
                $el->setAttribute('href', '#wrapped');
            } else {
                // Create the text node the DOM way. The text is escaped by
                // the library, which knows it should not be interpreted as
                // HTML:
                $el = $dom->createTextNode($part);
            }
            // Insert this part before the node we are processing
            $parent->insertBefore($el, $node);
        }
        // When all parts are inserted, delete the node we split
        $parent->removeChild($node);
    }

This way you will not have the ampersand problem.
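Since the question loads several keywords dynamically, one possible way to extend the same idea is sketched below. The function name linkKeywords() and the keyword-to-URL map are hypothetical, introduced only for this example. The sketch assumes UTF-8 input, PHP 7.4 or later with the mbstring extension, and uses the common `<?xml encoding="UTF-8">` prefix as an encoding hint for loadHTML().

```php
<?php
// Hypothetical helper for illustration; not part of the original answer.
// $links maps each keyword to the href it should link to.
function linkKeywords(string $html, array $links): string
{
    if (!$links) {
        return $html;
    }
    // Index hrefs by lower-cased keyword for case-insensitive lookup.
    $hrefs = [];
    foreach ($links as $keyword => $href) {
        $hrefs[mb_strtolower((string) $keyword)] = $href;
    }
    // One alternation pattern covering every keyword.
    $quoted = array_map(fn($k) => preg_quote($k, '/'), array_keys($hrefs));
    $pattern = '/\b(' . implode('|', $quoted) . ')\b/iu';

    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    // Assumption: the input is UTF-8; the prefix hints that to the parser.
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $query = '//text()[not(ancestor::a)][(ancestor::p|ancestor::blockquote)]';
    // Copy the result to an array before changing the tree.
    foreach (iterator_to_array($xpath->query($query)) as $node) {
        $parts = preg_split($pattern, $node->textContent, -1,
                            PREG_SPLIT_DELIM_CAPTURE);
        if ($parts === false || count($parts) < 2) {
            continue; // no keyword in this text node
        }
        $parent = $node->parentNode;
        foreach ($parts as $index => $part) {
            if ($part === '') {
                continue;
            }
            if ($index % 2) {
                // Captured keyword: build the anchor from DOM nodes only.
                $el = $dom->createElement('a');
                $el->appendChild($dom->createTextNode($part));
                $el->setAttribute('href', $hrefs[mb_strtolower($part)]);
            } else {
                $el = $dom->createTextNode($part);
            }
            $parent->insertBefore($el, $node);
        }
        $parent->removeChild($node);
    }

    // Serialize only the children of <body>, without string slicing.
    $out = '';
    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body !== null) {
        foreach ($body->childNodes as $child) {
            $out .= $dom->saveHTML($child);
        }
    }
    return $out;
}

echo linkKeywords('<p>We like fish & chips</p>', ['we' => '#wrapped']), "\n";
```

Serializing the children of body one by one is a design choice that replaces the mb_substr() slicing of the saveXML() output; whether it suits your markup depends on how much of the document you need back.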

NB: there is no way that I know of to prevent DOMDocument from "fixing" invalid HTML.

