A regular lightweight cleanup

StephenTaylor
Posts: 31
Joined: Thu May 28, 2009 8:20 am

A regular lightweight cleanup

Post by StephenTaylor »

Here's a happy thing I stumbled across while cleaning up a thousand HTMs. The HTML is a mashup from multiple sources, including fragments of Word documents pasted in through a WYSIWYG editor. The target is clean structural HTML.

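Part of the task is to remove empty HTML elements. Because the source has all kinds of nesting, e.g. H2>P>STRONG>H3, this shouldn't be a job for regex; it would be better to parse with ⎕XML and apply some recursive test. But modifying ⎕R with the power operator is a handy lightweight solution: ⍣≡ reapplies the replacement until the text stops changing, so an element that becomes empty only after its empty children are removed is stripped on a later pass. (OPTIONS is defined with the helper functions further down.)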

      html←('<(EM|H\d|OL|LI|P|SPAN|STRONG|UL)([^>]*)>\s*</\1>'⎕R''⍠OPTIONS)⍣≡html
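
To see why the repetition matters, here is a small sketch on a made-up fragment. The name strip and the sample string are illustrative only, the tag list is cut down for brevity, and the options are the ones given with the helper functions in this post.

      OPTIONS←('IC' 1)('Mode' 'D')('DotAll' 1)('Greedy' 0)
      strip←('<(EM|P|STRONG)([^>]*)>\s*</\1>'⎕R''⍠OPTIONS)  ⍝ a single pass over the text
      sample←'<p><strong> </strong></p><p>kept</p>'
      strip sample        ⍝ removes the empty STRONG, leaving an empty P behind
<p></p><p>kept</p>
      (strip⍣≡)sample     ⍝ repeat until a pass changes nothing
<p>kept</p>

A single pass cannot remove the outer P because it is not yet empty when the pass scans it; the second pass picks it up, and the third confirms there is nothing left to do.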


HTMLTidy is available to indent and wrap lines when we're done. Or, as a lightweight alternative, one can use ⎕R with a function operand:

      NL←⎕UCS 13 10   ⍝ CR LF
      ⍝ ignore case; document mode; . also matches line endings; lazy quantifiers
      OPTIONS←('IC' 1)('Mode' 'D')('DotAll' 1)('Greedy' 0)
      cutat←{⍵/⍨,∘0>⌿↑¯1 1↓¨⊂⍺|⍵} ⍝ items of ⍵ that are last before each multiple of ⍺
      join←{(⊃⍴⍺)↓⊃,/⍺∘,¨⍵} ⍝ join strings with ⍺
      subpat←{¯1∊l o←⍵.(Lengths Offsets)⊃⍨¨⍺+1:'' ⋄ l↑o↓⍵.Block} ⍝ subpattern ⍺ from ⎕R
      wrap←{
          r←-/l i←⍺         ⍝ text width: line length less indent
          ind←⍳⍴⍵
          NL join(i↑'')∘,¨⍵⊂⍨1,¯1↓ind∊r cutat ind/⍨⍵=' '
      } ⍝ eg 80 6∘wrap ←→ wrap to 80-char lines with 6-sp indent

      wrapp←{'<p>',NL,(72 8 wrap 2∘subpat ⍵),NL,' </p>'}
      html←('<p( .+)*>\s*?(.+)\s*?</p>'⎕R wrapp ⍠OPTIONS)html
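
The two smallest helpers are easy to try in isolation. A minimal sketch with made-up arguments, repeating the definitions of cutat and join from above so that it stands alone:

      cutat←{⍵/⍨,∘0>⌿↑¯1 1↓¨⊂⍺|⍵}
      join←{(⊃⍴⍺)↓⊃,/⍺∘,¨⍵}
      ', 'join'one' 'two' 'three'
one, two, three
      10 cutat 3 8 12 17 23 29     ⍝ e.g. positions of the spaces in some text
8 17

Here 8 and 17 are the last positions before each multiple of 10, which is where wrap chooses to break a line.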

Re: A regular lightweight cleanup

Post by StephenTaylor »

A correction: the ⎕R with the wrapp function captures attributes of the P but fails to preserve them. It should of course be:

      wrapp←{'<p',(1 subpat ⍵),'>',NL,(72 8 wrap 2 subpat ⍵),NL,'    </p>'}