A regular lightweight cleanup

Post by **StephenTaylor** » Thu May 08, 2014 1:35 pm

Here's a happy thing I stumbled across while cleaning up a thousand HTMs. The HTML is a mashup from multiple sources, including fragments of Word documents pasted in through a WYSIWYG editor. The target is clean structural HTML.

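Part of the task is to remove empty HTML elements. As the source has all kinds of nesting, e.g. H2>P>STRONG>H3, it shouldn't really be a job for regex: better to parse with ⎕XML and apply some recursive test. But modifying ⎕R with the power operator is a handy lightweight solution. With ⍣≡ the replacement is repeated until the result stops changing, so an element that only becomes empty once its empty children have gone is caught on a later pass (OPTIONS is defined in the second listing):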

      html←('<(EM|H\d|OL|LI|P|SPAN|STRONG|UL)([^>]*)>\s*</\1>'⎕R''⍠OPTIONS)⍣≡html
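
For anyone unsure what the ⍣≡ buys you, here is a small session sketch. The pattern is cut down to three tags, and the name strip and the sample string are made up for illustration; OPTIONS is the same option list used in the wrapping code. A single pass removes only the innermost empty element; running to a fixed point also removes the parent that became empty as a result.

```apl
      OPTIONS←('IC' 1)('Mode' 'D')('DotAll' 1)('Greedy' 0)
      strip←'<(EM|P|STRONG)([^>]*)>\s*</\1>'⎕R''⍠OPTIONS   ⍝ hypothetical name: one pass of the cleanup
      strip '<p><strong> </strong></p><p>keep</p>'          ⍝ one pass: only the inner STRONG goes
<p></p><p>keep</p>
      strip⍣≡ '<p><strong> </strong></p><p>keep</p>'        ⍝ repeat until two successive results match
<p>keep</p>
```

The number of passes is bounded by the nesting depth of the empty elements, so on real pages the loop ends after a handful of iterations.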

HTMLTidy is available to indent the result and wrap long lines when we're done. Or, staying lightweight, one can use ⎕R with a function operand:

      NL←⎕UCS 13 10
      OPTIONS←('IC' 1)('Mode' 'D')('DotAll' 1)('Greedy' 0)
      cutat←{⍵/⍨,∘0>⌿↑¯1 1↓¨⊂⍺|⍵}                                   ⍝ items of ⍵ after which ⍺|⍵ drops, ie the last in each band of width ⍺
      join←{(⊃⍴⍺)↓⊃,/⍺∘,¨⍵}                                         ⍝ join strings with ⍺
      subpat←{¯1∊l o←⍵.(Lengths Offsets)⊃⍨¨⍺+1:'' ⋄ l↑o↓⍵.Block}    ⍝ subpattern ⍺ from ⎕R
      wrap←{
          r←-/l i←⍺                                                 ⍝ usable width: line length less indent
          ind←⍳⍴⍵
          NL join(i↑'')∘,¨⍵⊂⍨1,¯1↓ind∊r cutat ind/⍨⍵=' '            ⍝ break after the last space in each band
      }                                                             ⍝ eg 80 6∘wrap ←→ wrap to 80-char lines with 6-sp indent

      wrapp←{'<p>',NL,(72 8 wrap 2∘subpat ⍵),NL,'    </p>'}
      html←('<p( .+)*>\s*?(.+)\s*?</p>'⎕R wrapp ⍠OPTIONS)html
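
The two small helpers are easier to follow in isolation. A sketch with made-up arguments: join puts its left argument between the strings on the right, and cutat keeps the items of ⍵ that are the last ones before each multiple of ⍺, which is how wrap picks the final space in each band of columns.

```apl
      join←{(⊃⍴⍺)↓⊃,/⍺∘,¨⍵}
      cutat←{⍵/⍨,∘0>⌿↑¯1 1↓¨⊂⍺|⍵}
      ', ' join 'ab' 'cd' 'ef'
ab, cd, ef
      10 cutat 3 7 12 18 25       ⍝ 7 is the last item below 10, 18 the last below 20
7 18
```

Note that the final item (25 here) is never selected, because nothing follows it to show that its band has ended. In wrap that simply means the tail of the paragraph stays on the last line.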

Post by **StephenTaylor** » Fri May 09, 2014 10:21 am

A correction: the ⎕R with the wrapp function captures attributes of the P but fails to preserve them. It should of course be:

      wrapp←{'<p',(1 subpat ⍵),'>',NL,(72 8 wrap 2 subpat ⍵),NL,'    </p>'}
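
To see what 1 subpat ⍵ and 2 subpat ⍵ pick out, here is a cut-down sketch using the same helper on a made-up pattern and string; the swap function exists only for this illustration. It assumes ⎕IO←1, since subpat indexes Lengths and Offsets with ⍺+1 to skip the entry for the whole match.

```apl
      subpat←{¯1∊l o←⍵.(Lengths Offsets)⊃⍨¨⍺+1:'' ⋄ l↑o↓⍵.Block}
      swap←{(2 subpat ⍵),'=',1 subpat ⍵}        ⍝ hypothetical: exchange the two captured groups
      ('(\w+)=(\w+)'⎕R swap)'key=val'
val=key
```

In the corrected wrapp, subpattern 1 is the attribute text of the P and subpattern 2 is its content, so both end up in the rebuilt element. The guard in subpat returns an empty vector when a group did not take part in the match, which is what happens for a P with no attributes.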

Dyalog Forums
