A regular lightweight cleanup

Posted: Thu May 08, 2014 1:35 pm
by StephenTaylor
Here's a happy thing I stumbled across while cleaning up a thousand HTML files. The HTML is a mashup from multiple sources, including fragments of Word documents pasted in through a WYSIWYG editor. The target is clean structural HTML.

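Part of the task is to remove empty HTML elements. As the source has all kinds of nesting, e.g. H2>P>STRONG>H3, it shouldn't really be a job for regex; it would be better to parse with ⎕XML and apply some recursive test. But modifying ⎕R with the power operator is a handy lightweight solution: each pass deletes the innermost empty elements, and ⍣≡ repeats the pass until the result stops changing. (OPTIONS is defined in the second listing.)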

      html←('<(EM|H\d|OL|LI|P|SPAN|STRONG|UL)([^>]*)>\s*</\1>'⎕R''⍠OPTIONS)⍣≡html
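
A small made-up case shows why the repetition matters. This is only a sketch: the tag list is shortened, and strip, sample, once and done are names introduced here for illustration, not part of the original cleanup.

```apl
      OPTIONS←('IC' 1)('Mode' 'D')('DotAll' 1)('Greedy' 0)
      ⍝ one pass: delete elements that contain nothing but white space
      strip←'<(EM|P|STRONG)([^>]*)>\s*</\1>'⎕R''⍠OPTIONS
      sample←'<h2>Title</h2><p><strong> </strong></p>'   ⍝ hypothetical input
      once←strip sample     ⍝ removes the STRONG, leaving <h2>Title</h2><p></p>
      done←strip⍣≡sample    ⍝ repeats until two successive results match: <h2>Title</h2>
```

A single pass cannot remove the P, because it only becomes empty once the STRONG inside it has gone; the next pass picks it up.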


HTMLTidy is available to indent the markup and wrap long lines when we're done. Or, for something lightweight, one can use ⎕R with a function operand:

      NL←⎕UCS 13 10
      OPTIONS←('IC' 1)('Mode' 'D')('DotAll' 1)('Greedy' 0)
      cutat←{⍵/⍨,∘0>⌿↑¯1 1↓¨⊂⍺|⍵}
      join←{(⊃⍴⍺)↓⊃,/⍺∘,¨⍵} ⍝ join strings with ⍺
      subpat←{¯1∊l o←⍵.(Lengths Offsets)⊃⍨¨⍺+1:'' ⋄ l↑o↓⍵.Block} ⍝ subpattern ⍺ from ⎕R
      wrap←{
          r←-/l i←⍺
          ind←⍳⍴⍵
          NL join(i↑'')∘,¨⍵⊂⍨1,¯1↓ind∊r cutat ind/⍨⍵=' '
      } ⍝ eg 80 6∘wrap ←→ wrap to 80-char lines with 6-sp indent

      wrapp←{'<p>',NL,(72 8 wrap 2∘subpat ⍵),NL,' </p>'}
      html←('<p( .+)*>\s*?(.+)\s*?</p>'⎕R wrapp ⍠OPTIONS)html
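
To see what wrap produces on its own, here is a minimal sketch on a made-up string. It assumes the Dyalog defaults (⎕IO←1, ⎕ML←1); text and out are names introduced for illustration, and the helper definitions are repeated so the fragment runs by itself.

```apl
      NL←⎕UCS 13 10
      cutat←{⍵/⍨,∘0>⌿↑¯1 1↓¨⊂⍺|⍵}
      join←{(⊃⍴⍺)↓⊃,/⍺∘,¨⍵}
      wrap←{
          r←-/l i←⍺
          ind←⍳⍴⍵
          NL join(i↑'')∘,¨⍵⊂⍨1,¯1↓ind∊r cutat ind/⍨⍵=' '
      }
      text←'one two three four five six'   ⍝ hypothetical sample
      out←12 2 wrap text                     ⍝ text width 12-2=10, indent 2
      ⍝ out is three indented pieces joined by NL:
      ⍝   '  one two '   '  three four '   '  five six'
```

As far as I can tell from cutat, the breaks fall at the last space before each multiple of the text width, counted along the original string rather than from the previous break. Line lengths are therefore approximate (the middle line above runs one character over), and each broken line keeps its trailing space.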

Re: A regular lightweight cleanup

Posted: Fri May 09, 2014 10:21 am
by StephenTaylor
A correction: the ⎕R with the wrapp function captures attributes of the P but fails to preserve them. It should of course be:

      wrapp←{'<p',(1 subpat ⍵),'>',NL,(72 8 wrap 2 subpat ⍵),NL,'    </p>'}