Part of the task is to remove empty HTML elements. Because the source has all kinds of nesting, e.g. H2>P>STRONG>H3, this isn't really a job for a regex; it would be better to parse with ⎕XML and apply a recursive test. But applying ⎕R repeatedly with the power operator (⍣≡, i.e. until the result stops changing) is a handy lightweight solution, using the OPTIONS defined further down:
html←('<(EM|H\d|OL|LI|P|SPAN|STRONG|UL)([^>]*)>\s*</\1>'⎕R''⍠OPTIONS)⍣≡html
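To see why the power operator matters here, the following is a minimal sketch on a made-up fragment. The names strip, sample, once and done are illustrative only, the tag list is deliberately shortened, and default ⎕IO and ⎕ML are assumed.

```apl
OPTIONS←('IC' 1)('Mode' 'D')('DotAll' 1)('Greedy' 0)
⍝ shortened, hypothetical tag list; same shape as the pattern above
strip←'<(H\d|P|STRONG)([^>]*)>\s*</\1>'⎕R''⍠OPTIONS
sample←'<h2><p><strong></strong></p></h2><p>kept</p>'
once←strip sample     ⍝ one pass removes only the innermost empty element
done←strip⍣≡sample    ⍝ repeat until two successive results match
```

A single pass cannot match the outer P or H2 because each still contains a tag. Once the innermost STRONG pair is gone, the next pass sees an empty P, then an empty H2, and the iteration stops when nothing more is removed.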
HTMLTidy is available to indent and wrap lines when we're done. Or, as a lightweight alternative, one can use ⎕R with a function operand:
NL←⎕UCS 13 10 ⍝ CR LF
OPTIONS←('IC' 1)('Mode' 'D')('DotAll' 1)('Greedy' 0)
cutat←{⍵/⍨,∘0>⌿↑¯1 1↓¨⊂⍺|⍵} ⍝ items of ⍵ that are last before each multiple of ⍺
join←{(⊃⍴⍺)↓⊃,/⍺∘,¨⍵} ⍝ join strings with ⍺
subpat←{¯1∊l o←⍵.(Lengths Offsets)⊃⍨¨⍺+1:'' ⋄ l↑o↓⍵.Block} ⍝ subpattern ⍺ from ⎕R
wrap←{
r←-/l i←⍺
ind←⍳⍴⍵
NL join(i↑'')∘,¨⍵⊂⍨1,¯1↓ind∊r cutat ind/⍨⍵=' '
} ⍝ eg 80 6∘wrap ←→ wrap to 80-char lines with 6-sp indent
wrapp←{'<p>',NL,(72 8 wrap 2∘subpat ⍵),NL,' </p>'}
html←('<p( .+)*>\s*?(.+)\s*?</p>'⎕R wrapp ⍠OPTIONS)html
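As a quick check of the wrapping helper on its own, here is a hedged usage sketch on a short made-up sentence. The definitions are repeated so the snippet stands alone; text and out are illustrative names, and ⎕IO←1 with ⎕ML<3 (so that dyadic ⊂ is partitioned enclose) is assumed.

```apl
⎕IO←1 ⋄ ⎕ML←1          ⍝ assumed settings (the Dyalog defaults)
NL←⎕UCS 13 10
cutat←{⍵/⍨,∘0>⌿↑¯1 1↓¨⊂⍺|⍵}
join←{(⊃⍴⍺)↓⊃,/⍺∘,¨⍵}
wrap←{
r←-/l i←⍺
ind←⍳⍴⍵
NL join(i↑'')∘,¨⍵⊂⍨1,¯1↓ind∊r cutat ind/⍨⍵=' '
}
text←'the quick brown fox jumps over the lazy dog'
out←20 4 wrap text     ⍝ nominal width 20, indent 4
```

Working through this by hand, the breaks fall after "quick " and "over ", giving three indented lines. Because the cut points are taken at the last space before each multiple of width minus indent, counted from the start of the text, an individual line can run longer than the nominal width (the middle line here does), which is part of what keeps the approach lightweight.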