retrieve zip file from a web site programmatically

tclviiidyalog
Posts: 17
Joined: Tue Apr 26, 2011 1:03 pm

Post by tclviiidyalog »

When I type the following string into a browser
(N.B. if trying this at home, be sure to substitute today's date for the dates in the string below):

'http://oasis.caiso.com/mrtu-oasis/SingleZip?resultformat=6&queryname=PRC_CURR_LMP&node=NP15SLAK_5_N001&startdate=20110425&enddate=20110425'

the browser pops up a window asking me if I want to save a zip file. After saving it, the file can be unzipped to reveal a comma-delimited file (i.e. a .csv file) with the latest 5-minute block price of California electricity.

So I tried the following steps to retrieve that zip file from APL:

)load conga
C:\Program Files\Dyalog\Dyalog APL 13.0 Unicode\ws\conga saved Fri Apr 01 00:38:00 2011

DRC.Init ''
0 Conga loaded from: C:\Program Files\Dyalog\Dyalog APL 13.0 Unicode\conga21Uni

stuff← Samples.HTTPGet'http://oasis.caiso.com/mrtu-oasis/SingleZip?resultformat=6&queryname=PRC_CURR_LMP&node=NP15SLAK_5_N001&startdate=20110425&enddate=20110425'

stuff[1]
100 ⍝ oops, error code 100, looked but couldn't find an explanation in the Dyalog manuals


stuff[2] ⍝ this looks ok
http/1.1 200 ok
server Sun-Java-System-Web-Server/7.0
date Mon, 25 Apr 2011 19:13:05 GMT
content-disposition inline; filename=20110425_20110425_PRC_CURR_LMP_N_20110425_12_13_03.zip;
content-type application/x-zip-compressed
via 1.1 https-oasis
proxy-agent Sun-Java-System-Web-Server/7.0
transfer-encoding chunked


⍴ (⊃stuff[3])
388 ⍝ uh oh, the browser returns a 376 byte zip file, but this component is bigger

tie←¯1+⌊/0,⎕NNUMS ⍝ With next available number

'APL created.zip' ⎕NCREATE tie ⍝ ... create file.
⎕nsize¯1
0
(⊃stuff[3]) ⎕nappend tie
⎕nuntie tie

There should now be a file called "APL created.zip" on my drive, and in fact there is . . .
. . . but when I try to unzip it, I am told the file is corrupted.


Any ideas why this might be happening?
Is APL adding some extra bytes?
And is error code 100 a Dyalog error or a California Electricity Transmission System Operator error?
gil
Posts: 72
Joined: Mon Feb 15, 2010 12:42 am

Re: retrieve zip file from a web site programmatically

Post by gil »

Hi

Looking at the header information, the first thing that strikes me is that no content-length has been defined. The sample code for Conga does not currently support chunked transfer encoding (see http://en.wikipedia.org/wiki/Chunked_transfer_encoding).
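
Chunked framing would account exactly for the size difference reported in the question (388 bytes received against a 376-byte zip). The following is only a sketch of that arithmetic: it assumes the server sent the whole zip as a single chunk with no chunk extensions or trailers (the thread does not show the raw bytes), and it uses a dummy payload in place of the real zip.

```apl
NL←⎕UCS 13 10                ⍝ CR LF, the line ending used by the framing
payload←376⍴'x'              ⍝ dummy stand-in for the 376-byte zip file
⍝ One chunk: size in hex (376 is hex 178), CR LF, the data, CR LF,
⍝ followed by a zero-length chunk and a final CR LF that end the body
body←'178',NL,payload,NL,'0',NL,NL
⍴body                        ⍝ 388, i.e. 12 bytes of framing around the payload
```

If that assumption holds, the file written to disk started with the characters 178 and a CR LF rather than with the zip signature, which would be enough for an unzip tool to report it as corrupted.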

You could modify Samples.HTTPGet to handle this with a few extra lines of code, by checking the HTTP header for a transfer-encoding entry and parsing the data accordingly if it is chunked.

Below is a modified version of HTTPGet with the additional lines marked out with comments:

Code:

 r←{certs}HTTPGet url;U;DRC;protocol;wr;key;flags;pars;secure;data;z;header;datalen;host;port;done;cmd;b;page;auth;p;x509;priority;chunked
⍝ Get an HTTP page, format [HTTP[S]://][user:pass@]url[:port][/page]
⍝ Optional Left argument: PublicCert PrivateKey SSLValidation
⍝ Makes secure connection if left arg provided or URL begins with https:

⍝ Result: (return code) (HTTP headers) (HTTP body) [PeerCert if secure]

 (U DRC)←##.(HTTPUtils DRC) ⍝ Uses utils from here
 {}DRC.Init''

 p←(∨/b)×1+(b←'//'⍷url)⍳1
 secure←(2=⎕NC'certs')∨(U.lc(p-2)↑url)≡'https:'
 port←(1+secure)⊃80 443 ⍝ Default HTTP/HTTPS port
 url←p↓url              ⍝ Remove HTTP[s]:// if present
 host page←'/'split url,(~'/'∊url)/'/'    ⍝ Extract host and page from url

 :If 0=⎕NC'certs' ⋄ certs←'' ⋄ :EndIf

 :If secure
     x509 flags priority←3↑certs,(⍴,certs)↓(⎕NEW ##.DRC.X509Cert)32 'NORMAL:!CTYPE-OPENPGP'  ⍝ 32=Do not validate Certs
     pars←('x509'x509)('SSLValidation'flags)('Priority'priority)
 :Else ⋄ pars←''
 :EndIf

 :If '@'∊host ⍝ Handle user:password@host...
     auth←NL,'Authorization: Basic ',(U.Encode(¯1+p←host⍳'@')↑host)
     host←p↓host
 :Else ⋄ auth←''
 :EndIf

 host port←port U.HostPort host ⍝ Check for override of port number

 :If 0=⊃(r cmd)←2↑DRC.Clt''host port'Text' 100000,pars ⍝ 100,000 is max receive buffer size
 :AndIf 0=⊃r←DRC.Send cmd('GET ',page,' HTTP/1.1',NL,'Host: ',host,':',(⍕port),NL,'Accept: */*',auth,NL,NL)

     done data header←0 ⍬(0 ⍬)
     chunked←0          ⍝ gil: initialise chunked flag

     :Repeat
         :If ~done←0≠1⊃wr←DRC.Wait cmd 5000            ⍝ Wait up to 5 secs

             :If wr[3]∊'Block' 'BlockLast'                ⍝ If we got some data
                 :If 0<⍴data,←4⊃wr
                 :AndIf 0=1⊃header
                     header←U.DecodeHeader data
                     :If 0<1⊃header
                         data←(1⊃header)↓data
                         datalen←⊃((2⊃header)U.GetValue'Content-Length' 'Numeric'),¯1 ⍝ ¯1 if content length not specified
                         ⍝ ↓↓↓ gil: set flag if header entry found
                         chunked←'chunked'≡' '~⍨(2⊃header)U.GetValue'transfer-encoding' 'char'
                     :EndIf
                 :EndIf
             :Else
                 ⎕←wr ⍝ Error?
                 ∘
             :EndIf

             done←'BlockLast'≡3⊃wr                        ⍝ Done if socket was closed
             :If datalen>0
                 done←done∨datalen≤⍴data ⍝ ... or if declared amount of data rcvd
                 ⍝ ↓↓↓ gil: parse data if chunked
             :ElseIf chunked
                 data done←ReadChunks data
                 ⍝ ↑↑↑ gil
             :Else
                 done←done∨(∨/'</html>'⍷data)∨(∨/'</HTML>'⍷data)
             :EndIf
         :EndIf
     :Until done

     :Trap 0 ⍝ If any errors occur, abandon conversion
         :If ∨/'charset=utf-8'⍷(2⊃header)U.GetValue'content-type' ''
             data←'UTF-8'⎕UCS ⎕UCS data ⍝ Convert from UTF-8
         :EndIf
     :EndTrap

     r←(1⊃wr)(2⊃header)data
     :If secure ⋄ r←r,⊂DRC.GetProp cmd'PeerCert' ⋄ :EndIf
 :Else
     'Connection failed ',,⍕r
 :EndIf

 z←DRC.Close cmd


ReadChunks is just a simple example of a parser. It doesn't implement all features of the chunked transfer encoding protocol, but it works for your example case.

Code:

 (res done)←ReadChunks data;header;octets;extensions;partition;hexToDec;length
 partition←{⎕ML←3
     head tail←(1+∨\⍺⍷⍵,⍺)⊂⍵,⍺
     tail←(-⍴,⍺)↓(⍴,⍺)↓tail
     head tail}
 hexToDec←{16⊥16|(1⌽⎕D,'abcdef')⍳⍵}∘U.lc
 res done←'' 0

 :While (~done)∧(~0∊⍴data)
     header data←NL partition data
     octets extensions←';'partition header
     length←hexToDec octets
     done∨←length=0
     res,←length↑data
     data←(length+⍴NL)↓data
 :EndWhile
 ⍝


Hope this helps.
tclviiidyalog
Posts: 17
Joined: Tue Apr 26, 2011 1:03 pm

Re: retrieve zip file from a web site programmatically

Post by tclviiidyalog »

Gil,

Many thanks for going way beyond anything I could have expected

I appear to have a different version of the conga workspace than you do, as the HTTPUtils namespace doesn't seem to contain the U or DRC fns

thanks again
Tony
gil
Posts: 72
Joined: Mon Feb 15, 2010 12:42 am

Re: retrieve zip file from a web site programmatically

Post by gil »

Hi Tony

Not to worry, I'm glad to help.

I think we are using the same version of the conga ws as I followed your instructions and loaded it from Dyalog v13.0 Unicode.

U and DRC are references to the namespaces HTTPUtils and DRC respectively. They are defined as local names in the Samples.HTTPGet function, and I lazily made use of them in the ReadChunks function I wrote. This works fine as long as ReadChunks is placed in the same namespace as HTTPGet (i.e. Samples.ReadChunks) and is only called from HTTPGet, because the names are then within scope.

Let me know how you fare.

Cheers

Gil
tclviiidyalog
Posts: 17
Joined: Tue Apr 26, 2011 1:03 pm

Re: retrieve zip file from a web site programmatically

Post by tclviiidyalog »

I'm still having a problem. I tried the following and got an error:
------------------------------------------------------------------
)clear
clear ws
)load conga
C:\Program Files\Dyalog\Dyalog APL 13.0 Unicode\ws\conga saved Fri Apr 01 00:38:00 2011
)cs #.Samples
#.Samples
)copy chunkread ReadChunks callstring
.\chunkread saved Tue May 10 08:28:34 2011
stuff← ReadChunks callstring
VALUE ERROR
ReadChunks[8] hexToDec←{16⊥16|(1⌽⎕D,'abcdef')⍳⍵}∘U.lc

)fns
CertPath ClientServerTest DisplayCert Foo GetSimpleServiceData GetStats GetUserFromCerts
HTTPGet RPCGet ReadCert ReadChunks ResetTest SaveResult Say SecureCallback StatCalc Style
TestAll TestAllSecure TestFTPClient TestRPCServer TestSecureConnection TestSecureConnectionBHC
TestSecureConnectionTimeouts TestSecureServer TestSecureTelnetServer TestSecureWebClient TestSecureWebServer
TestSecureWebServerCB TestSimpleServices TestTelnetServer TestWebClient TestWebFunctionServer TestWebServer
TestX509Certs split
------------------------------------------------------------------

I don't see the U and DRC fns in the namespace . . . How am I being stupid?
gil
Posts: 72
Joined: Mon Feb 15, 2010 12:42 am

Re: retrieve zip file from a web site programmatically

Post by gil »

It was really unnecessary of me to use HTTPUtils in this case, and the way I used a reference not defined within the ReadChunks function has clearly got you confused. My bad.

As I wrote it, ReadChunks depends on two external names. The first is the local name U, which is set up by HTTPGet and is a reference to the HTTPUtils namespace (assumed to exist as a sibling namespace to Samples, in this case in the root, i.e. #.HTTPUtils). The second is a global variable NL that exists in the Samples namespace. If you place ReadChunks in the Samples namespace it will find the NL variable within its scope, but the name U is not defined. This is why you get a VALUE ERROR when you try executing ReadChunks on its own.
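
The scoping behaviour described here can be reproduced in isolation. The sketch below uses two made-up functions, Outer and Inner (hypothetical names, not part of the conga workspace), and assumes there is no global U in the current namespace: a name localised in a calling function is visible to the functions it calls, but not when the called function is run directly.

```apl
⍝ Inner uses the name U without localising or defining it
{}⎕FX 'r←Inner' 'r←U+1'
⍝ Outer localises U (listed after the semicolon), sets it, then calls Inner
{}⎕FX 'r←Outer;U' 'U←41' 'r←Inner'

Outer     ⍝ 42: Inner sees Outer's local U while Outer is on the stack
Inner     ⍝ VALUE ERROR: run on its own, Inner finds no U
```

ReadChunks called from HTTPGet is in the position of Inner called from Outer; ReadChunks run directly from the session is in the position of Inner run on its own.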

To be able to run ReadChunks on its own you need to make it self-contained by removing any references to external/global names. First of all, you can modify the ReadChunks function to get rid of the reference to the HTTPUtils namespace by replacing the line:

Code:

hexToDec←{16⊥16|(1⌽⎕D,'abcdef')⍳⍵}∘U.lc

with:

Code:

hexToDec←{16⊥16|'123456789abcdef0123456789ABCDEF'⍳⍵}
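
A quick way to convince yourself that this lookup string handles both cases is to try a few values in the session. The sketch assumes index origin 1 (the default): the digits 1-9 and a-f sit at positions 1 to 15, 0 sits at position 16 (which 16| turns into 0), and the upper-case A-F sit at positions 26 to 31, which leave remainders 10 to 15.

```apl
⎕IO←1        ⍝ the index arithmetic relies on index origin 1
hexToDec←{16⊥16|'123456789abcdef0123456789ABCDEF'⍳⍵}
hexToDec '178'    ⍝ 376
hexToDec 'FF'     ⍝ 255
hexToDec '1a'     ⍝ 26
hexToDec '0'      ⍝ 0
```

One thing to be aware of: a character that is not a hex digit is "found" at position 32 and therefore silently counts as 0, so the function does no validation of its input.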


You should also add a local definition of the carriage return + linefeed variable NL like so:

Code:

NL←⎕UCS 13 10


What you end up with is something along these lines:

Code:

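(res done)←ReadChunks data;header;octets;extensions;partition;hexToDec;length;NL
⍝ Decode a chunked HTTP body; done is 1 once the zero-length chunk has been seen
partition←{⎕ML←3                         ⍝ Split ⍵ at the first occurrence of ⍺
     head tail←(1+∨\⍺⍷⍵,⍺)⊂⍵,⍺
     tail←(-⍴,⍺)↓(⍴,⍺)↓tail              ⍝ Drop the delimiter and the appended copy of ⍺
     head tail}
hexToDec←{16⊥16|'123456789abcdef0123456789ABCDEF'⍳⍵}   ⍝ Hex digits (either case) to number
NL←⎕UCS 13 10                            ⍝ CR LF
res done←'' 0

:While (~done)∧(~0∊⍴data)
     header data←NL partition data           ⍝ Chunk-size line, then the rest
     octets extensions←';'partition header   ⍝ Chunk extensions are ignored
     length←hexToDec octets
     done∨←length=0                          ⍝ A zero-length chunk ends the body
     res,←length↑data
     data←(length+⍴NL)↓data                  ⍝ Skip the chunk data and its trailing CR LF
:EndWhile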
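
With the self-contained ReadChunks fixed in the workspace, it can be tried on a small hand-built body before pointing it at real data. This is only an illustrative sketch: the body and the chunk extension ext=1 are made up, and index origin 1 is assumed.

```apl
NL←⎕UCS 13 10
⍝ Two chunks (the second carrying a made-up chunk extension), then the terminator
body←'5',NL,'hello',NL,'6;ext=1',NL,' world',NL,'0',NL,NL
(res done)←ReadChunks body
res       ⍝ hello world
done      ⍝ 1
```

A result with done equal to 0 means the terminating zero-length chunk was not found, so res holds only the chunks decoded so far.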


You can read more about Namespaces and Localisation in the online help files.

Good luck!