retrieve zip file from a web site programmatically
Posted: Mon May 09, 2011 6:07 pm
by tclviiidyalog
When I type the following string into a browser
(N.B. if you try this at home, be sure to substitute today's date for the dates in the string below):
'http://oasis.caiso.com/mrtu-oasis/SingleZip?resultformat=6&queryname=PRC_CURR_LMP&node=NP15SLAK_5_N001&startdate=20110425&enddate=200110425'
the browser pops up a window asking whether I want to save a zip file. Once saved, the file unzips to a comma-delimited (.csv) file containing the latest 5-minute block price of California electricity.
So I tried the following steps to retrieve that zip file from APL:
)load conga
C:\Program Files\Dyalog\Dyalog APL 13.0 Unicode\ws\conga saved Fri Apr 01 00:38:00 2011
DRC.Init ''
0 Conga loaded from: C:\Program Files\Dyalog\Dyalog APL 13.0 Unicode\conga21Uni
stuff← Samples.HTTPGet'http://oasis.caiso.com/mrtu-oasis/SingleZip?resultformat=6&queryname=PRC_CURR_LMP&node=NP15SLAK_5_N001&startdate=20110425&enddate=200110425'
stuff[1]
100 ⍝ oops, error code 100, looked but couldn't find an explanation in the Dyalog manuals
stuff[2] ⍝ this looks ok
http/1.1 200 ok
server Sun-Java-System-Web-Server/7.0
date Mon, 25 Apr 2011 19:13:05 GMT
content-disposition inline; filename=20110425_20110425_PRC_CURR_LMP_N_20110425_12_13_03.zip;
content-type application/x-zip-compressed
via 1.1 https-oasis
proxy-agent Sun-Java-System-Web-Server/7.0
transfer-encoding chunked
⍴ (⊃stuff[3])
388 ⍝ uh oh, the browser returns a 376 byte zip file, but this component is bigger
tie←¯1+⌊/0,⎕NNUMS ⍝ With next available number
'APL created.zip' ⎕NCREATE tie ⍝ ... create file.
⎕nsize¯1
0
(⊃stuff[3]) ⎕nappend tie
⎕nuntie tie
There should now be a file called "APL created.zip" on my drive, and in fact there is such a file . . .
. . . but when I try to unzip it I am told the file is corrupted.
Any ideas why this might be happening?
Is APL adding some extra bytes?
And is error code 100 a Dyalog error or a California Electricity Transmission System Operator error?
Re: retrieve zip file from a web site programmatically
Posted: Tue May 10, 2011 7:29 am
by gil
Hi
Looking at the header information, the first thing that strikes me is that no content-length has been defined. The sample code for Conga does not currently support chunked transfer encoding (see http://en.wikipedia.org/wiki/Chunked_transfer_encoding).
You could modify Samples.HTTPGet to handle this with a few extra lines of code: check the HTTP header for a transfer-encoding entry, and parse the data accordingly if it is chunked.
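As a rough check that chunking accounts for the size difference reported above: in chunked encoding each chunk is preceded by its length in hexadecimal plus CRLF and followed by CRLF, and the body ends with a zero-length chunk. Assuming the server sent the whole zip as a single chunk with no chunk extensions (an assumption; the raw bytes are not shown in this thread), a 376-byte payload would be framed like this:

```apl
      16⊥1 7 8             ⍝ a chunk-size line of '178' is hex for the payload length
376
      ⍝ '178' CRLF, 376 payload bytes, CRLF, then '0' CRLF CRLF to terminate
      +/3 2 376 2 1 2 2
388
```

Under that assumption the total matches the 388 bytes in stuff[3], so the 12 extra bytes would be chunk framing rather than anything APL added.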
Below is a modified version of HTTPGet with the additional lines marked out with comments:
Code: Select all
r←{certs}HTTPGet url;U;DRC;protocol;wr;key;flags;pars;secure;data;z;header;datalen;host;port;done;cmd;b;page;auth;p;x509;priority;chunked
⍝ Get an HTTP page, format [HTTP[S]://][user:pass@]url[:port][/page]
⍝ Optional left argument: PublicCert PrivateKey SSLValidation
⍝ Makes secure connection if left arg provided or URL begins with https:
⍝ Result: (return code) (HTTP headers) (HTTP body) [PeerCert if secure]
(U DRC)←##.(HTTPUtils DRC) ⍝ Uses utils from here
{}DRC.Init''
p←(∨/b)×1+(b←'//'⍷url)⍳1
secure←(2=⎕NC'certs')∨(U.lc(p-2)↑url)≡'https:'
port←(1+secure)⊃80 443 ⍝ Default HTTP/HTTPS port
url←p↓url ⍝ Remove HTTP[s]:// if present
host page←'/'split url,(~'/'∊url)/'/' ⍝ Extract host and page from url
:If 0=⎕NC'certs' ⋄ certs←'' ⋄ :EndIf
:If secure
    x509 flags priority←3↑certs,(⍴,certs)↓(⎕NEW ##.DRC.X509Cert)32 'NORMAL:!CTYPE-OPENPGP' ⍝ 32=Do not validate Certs
    pars←('x509'x509)('SSLValidation'flags)('Priority'priority)
:Else ⋄ pars←''
:EndIf
:If '@'∊host ⍝ Handle user:password@host...
    auth←NL,'Authorization: Basic ',(U.Encode(¯1+p←host⍳'@')↑host)
    host←p↓host
:Else ⋄ auth←''
:EndIf
host port←port U.HostPort host ⍝ Check for override of port number
:If 0=⊃(r cmd)←2↑DRC.Clt''host port'Text' 100000,pars ⍝ 100,000 is max receive buffer size
:AndIf 0=⊃r←DRC.Send cmd('GET ',page,' HTTP/1.1',NL,'Host: ',host,':',(⍕port),NL,'Accept: */*',auth,NL,NL)
    done data header←0 ⍬(0 ⍬)
    chunked←0 ⍝ gil: initialise chunked flag
    :Repeat
        :If ~done←0≠1⊃wr←DRC.Wait cmd 5000 ⍝ Wait up to 5 secs
            :If wr[3]∊'Block' 'BlockLast' ⍝ If we got some data
                :If 0<⍴data,←4⊃wr
                :AndIf 0=1⊃header
                    header←U.DecodeHeader data
                    :If 0<1⊃header
                        data←(1⊃header)↓data
                        datalen←⊃((2⊃header)U.GetValue'Content-Length' 'Numeric'),¯1 ⍝ ¯1 if content length not specified
                        ⍝ ↓↓↓ gil: set flag if header entry found
                        chunked←'chunked'≡' '~⍨(2⊃header)U.GetValue'transfer-encoding' 'char'
                    :EndIf
                :EndIf
            :Else
                ⎕←wr ⍝ Error?
                ∘
            :EndIf
            done←'BlockLast'≡3⊃wr ⍝ Done if socket was closed
            :If datalen>0
                done←done∨datalen≤⍴data ⍝ ... or if declared amount of data rcvd
            ⍝ ↓↓↓ gil: parse data if chunked
            :ElseIf chunked
                data done←ReadChunks data
            ⍝ ↑↑↑ gil
            :Else
                done←done∨(∨/'</html>'⍷data)∨(∨/'</HTML>'⍷data)
            :EndIf
        :EndIf
    :Until done
    :Trap 0 ⍝ If any errors occur, abandon conversion
        :If ∨/'charset=utf-8'⍷(2⊃header)U.GetValue'content-type' ''
            data←'UTF-8'⎕UCS ⎕UCS data ⍝ Convert from UTF-8
        :EndIf
    :EndTrap
    r←(1⊃wr)(2⊃header)data
    :If secure ⋄ r←r,⊂DRC.GetProp cmd'PeerCert' ⋄ :EndIf
:Else
    'Connection failed ',,⍕r
:EndIf
z←DRC.Close cmd
ReadChunks is just a simple example of a parser. It doesn't implement all features of the chunked transfer encoding protocol, but it works for your example case.
Code: Select all
(res done)←ReadChunks data;header;octets;extensions;partition;hexToDec;length
partition←{⎕ML←3
    head tail←(1+∨\⍺⍷⍵,⍺)⊂⍵,⍺
    tail←(-⍴,⍺)↓(⍴,⍺)↓tail
    head tail}
hexToDec←{16⊥16|(1⌽⎕D,'abcdef')⍳⍵}∘U.lc
res done←'' 0
:While (~done)∧(~0∊⍴data)
    header data←NL partition data
    octets extensions←';'partition header
    length←hexToDec octets
    done∨←length=0
    res,←length↑data
    data←(length+⍴NL)↓data
:EndWhile
⍝
Hope this helps.
Re: retrieve zip file from a web site programmatically
Posted: Tue May 10, 2011 12:35 pm
by tclviiidyalog
Gil,
Many thanks for going way beyond anything I could have expected.
I appear to have a different version of the conga workspace than you do, as the HTTPUtils namespace doesn't seem to contain the U or DRC fns.
Thanks again
Tony
Re: retrieve zip file from a web site programmatically
Posted: Tue May 10, 2011 5:02 pm
by gil
Hi Tony
Not to worry, I'm glad to help.
I think we are using the same version of the conga ws as I followed your instructions and loaded it from Dyalog v13.0 Unicode.
U and DRC are references to the namespaces HTTPUtils and DRC respectively. They are defined as local names in the Samples.HTTPGet function, and I lazily made use of them in the ReadChunks function I wrote. This works fine as long as ReadChunks is placed in the same namespace as HTTPGet (i.e. Samples.ReadChunks) and is only called from HTTPGet, since the names are then within scope.
Let me know how you fare.
Cheers
Gil
Re: retrieve zip file from a web site programmatically
Posted: Tue May 10, 2011 11:16 pm
by tclviiidyalog
I'm still having a problem. I tried the following and got this error:
------------------------------------------------------------------
)clear
clear ws
)load conga
C:\Program Files\Dyalog\Dyalog APL 13.0 Unicode\ws\conga saved Fri Apr 01 00:38:00 2011
)cs #.Samples
#.Samples
)copy chunkread ReadChunks callstring
.\chunkread saved Tue May 10 08:28:34 2011
stuff← ReadChunks callstring
VALUE ERROR
ReadChunks[8] hexToDec←{16⊥16|(1⌽⎕D,'abcdef')⍳⍵}∘U.lc
∧
)fns
CertPath ClientServerTest DisplayCert Foo GetSimpleServiceData GetStats GetUserFromCerts
HTTPGet RPCGet ReadCert ReadChunks ResetTest SaveResult Say SecureCallback StatCalc Style
TestAll TestAllSecure TestFTPClient TestRPCServer TestSecureConnection TestSecureConnectionBHC
TestSecureConnectionTimeouts TestSecureServer TestSecureTelnetServer TestSecureWebClient TestSecureWebServer
TestSecureWebServerCB TestSimpleServices TestTelnetServer TestWebClient TestWebFunctionServer TestWebServer
TestX509Certs split
------------------------------------------------------------------
I don't see the U and DRC fns in the namespace . . . How am I being stupid?
Re: retrieve zip file from a web site programmatically
Posted: Wed May 11, 2011 7:05 am
by gil
It was really unnecessary of me to use HTTPUtils in this case, and the way I used a reference not defined within the ReadChunks function has clearly got you confused. My bad.
The way I wrote ReadChunks, it depends on two external names: the local name U set up by HTTPGet (a reference to the HTTPUtils namespace, which is assumed to exist as a sibling of Samples, in this case in the root, i.e. #.HTTPUtils), and a global variable NL that exists in the Samples namespace. If you place ReadChunks in the Samples namespace it will find NL within its scope, but U is not defined unless HTTPGet is the caller. This is why you get a VALUE ERROR when you execute ReadChunks on its own.
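The scoping behaviour can be reproduced in miniature. In the sketch below, Outer and Inner are hypothetical functions (they are not part of the conga workspace): Outer localises U and calls Inner, which uses U without localising or defining it. The sketch assumes there is no global U in the current namespace.

```apl
∇ r←Outer;U
  U←'set by Outer'   ⍝ U is local to Outer...
  r←Inner            ⍝ ...yet visible to any function Outer calls
∇

∇ r←Inner
  r←U                ⍝ Inner neither localises nor defines U
∇
```

Calling Outer returns the text assigned to U, whereas calling Inner directly from the session should signal a VALUE ERROR on the line r←U, which is the same situation as calling ReadChunks without going through HTTPGet.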
To run ReadChunks on its own you need to make it self-contained by removing any references to external/global names. First, get rid of the reference to the HTTPUtils namespace by replacing the line:
Code: Select all
hexToDec←{16⊥16|(1⌽⎕D,'abcdef')⍳⍵}∘U.lc
with:
Code: Select all
hexToDec←{16⊥16|'123456789abcdef0123456789ABCDEF'⍳⍵}
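A quick way to check the replacement in the session, assuming ⎕IO is 1 (the lookup relies on '0' sitting at position 16 so that 16| maps it to zero):

```apl
      hexToDec←{16⊥16|'123456789abcdef0123456789ABCDEF'⍳⍵}
      hexToDec '178'
376
      hexToDec '1a'
26
      hexToDec '1A'
26
```

Note that characters which are not hex digits index to 32 and are therefore silently treated as 0, so this version does no validation of its input.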
You should also add a local definition of the carriage return + linefeed variable NL, like so:
Code: Select all
NL←⎕UCS 13 10
What you end up with is something along these lines:
Code: Select all
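(res done)←ReadChunks data;header;octets;extensions;partition;hexToDec;length;NL
⍝ Decode a chunked HTTP body: res is the payload, done is 1 once the zero-length chunk is seen
partition←{⎕ML←3 ⍝ Split ⍵ at the first occurrence of ⍺
    head tail←(1+∨\⍺⍷⍵,⍺)⊂⍵,⍺
    tail←(-⍴,⍺)↓(⍴,⍺)↓tail
    head tail}
hexToDec←{16⊥16|'123456789abcdef0123456789ABCDEF'⍳⍵}
NL←⎕UCS 13 10 ⍝ CR LF
res done←'' 0
:While (~done)∧(~0∊⍴data)
    header data←NL partition data ⍝ Chunk-size line, then the rest
    octets extensions←';'partition header ⍝ Extensions are ignored
    length←hexToDec octets
    done∨←length=0 ⍝ Zero-length chunk ends the body
    res,←length↑data
    data←(length+⍴NL)↓data ⍝ Skip payload and its trailing CR LF
:EndWhile
⍝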
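To try the self-contained version without a network connection, you can feed it a hand-built chunked body. The body below is made up for illustration (two chunks of 5 and 6 bytes followed by the terminating zero-length chunk), and CRLF is just a session variable introduced for this test. It assumes ReadChunks is defined in the current namespace with ⎕IO←1.

```apl
      CRLF←⎕UCS 13 10
      body←'5',CRLF,'hello',CRLF,'6',CRLF,' world',CRLF,'0',CRLF,CRLF
      (text done)←ReadChunks body
      text
hello world
      done
1
      ⍴text
11
```

If done comes back as 0, the terminating zero-length chunk was not found in the data passed in.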
You can read more about Namespaces and Localisation in the online help files.
Good luck!