Source Code Example - Scraping a Web Page
Posted: Thu Apr 17, 2014 4:52 am
by neeraj
The following may be of help to other people
TEST;⎕USING;srcUriString;srcUri;client;str
⎕USING←'System,mscorlib.dll'
⎕USING,←⊂'System.IO,mscorlib.dll'
⎕USING,←⊂'System.Net,System.dll'
srcUriString←⎕NEW String(⊂'http://www.cayugafamilydental.com')
srcUri←⎕NEW Uri srcUriString
client←⎕NEW WebClient ⍬
str←client.DownloadString srcUri
⍴str
Re: Source Code Example - Scraping a Web Page
Posted: Thu Apr 17, 2014 7:36 am
by Dick Bowman
Any advantage over the example in the .NET Interface Guide?
Always interesting to see different ways to skin a rat. Descriptions of pros and cons can help take what we learn into unknown territory.
Re: Source Code Example - Scraping a Web Page
Posted: Thu Apr 17, 2014 3:01 pm
by jGoff
Always good to have a simple "scraper" on hand. Using v12.1, it worked the first time after a copy and a paste. (Careful not to let the first ⎕NEW line wrap.) Thanks for sharing.
P.S. Not to mention that if I ever have a toothache in Ithaca, I'll know where to go.
Re: Source Code Example - Scraping a Web Page
Posted: Thu Apr 17, 2014 6:03 pm
by neeraj
This version is shorter. I tried using the conga workspace and HTTPGet in the Samples namespace, but it was not returning the desired result, so I used .NET instead, which does return the correct result.
Re: Source Code Example - Scraping a Web Page
Posted: Thu Apr 17, 2014 7:24 pm
by PGilbert
Thanks for sharing your short version. Here is what we did recently, inspired by Dyalog's .NET Interface Guide:
Code: Select all
TEST2;dataStream;reader;request;response;responseFromServer;url;⎕USING
⎕USING←'System.Net,System.dll' 'System.IO,mscorlib.dll' 'System.Text,mscorlib.dll'
url←'http://www.cayugafamilydental.com'
request←WebRequest.Create(⊂url)
response←request.GetResponse
dataStream←response.GetResponseStream
reader←⎕NEW StreamReader(dataStream,Encoding.GetEncoding(⊂response.CharacterSet))
responseFromServer←reader.ReadToEnd
How do you extract the information you are looking for from the HTML? (We ended up transforming the HTML into XHTML and searching it as if it were XML.)
Re: Source Code Example - Scraping a Web Page
Posted: Thu Apr 17, 2014 9:05 pm
by DanB|Dyalog
neeraj: what did you do to get the page and how did you do it?
I tried
Code: Select all
)load conga
C:\Program Files\Dyalog\V14U\ws\conga saved Mon Apr 07 17:20:16 2014
⍴¨r←Samples.HTTPGet'www.dyalog.com'
11 2 17192
r∊⊂p
0 0 1
'p' is the result from your fn. It matches the 3rd element returned by Samples.HTTPGet.
/Dan
Re: Source Code Example - Scraping a Web Page
Posted: Fri Apr 18, 2014 3:39 am
by neeraj
Samples.HTTPGet 'http://finance.google.com/finance/info?%20client=ig&q=NASDAQ:GOOG,NYSE:IBM'
The above should come back with something like
// [ { "id": "304466804484872" ,"t" : "GOOG" ,"e" : "NASDAQ" ,"l" : "536.10" ,"l_fix" : "536.10" ,"l_cur" : "536.10" ,"s": "2" ,"ltt":"4:00PM EDT" ,"lt" : "Apr 17, 4:00PM EDT" ,"lt_dts" : "2014-04-17T16:00:00Z" ,"c" : "-20.44" ,"c_fix" : "-20.44" ,"cp" : "-3.67" ,"cp_fix" : "-3.67" ,"ccol" : "chr" ,"pcls_fix" : "556.54" ,"el": "538.16" ,"el_fix": "538.16" ,"el_cur": "538.16" ,"elt" : "Apr 17, 7:59PM EDT" ,"ec" : "+2.06" ,"ec_fix" : "2.06" ,"ecp" : "0.38" ,"ecp_fix" : "0.38" ,"eccol" : "chg" ,"div" : "" ,"yld" : "" } ,{ "id": "18241" ,"t" : "IBM" ,"e" : "NYSE" ,"l" : "190.01" ,"l_fix" : "190.01" ,"l_cur" : "190.01" ,"s": "2" ,"ltt":"4:02PM EDT" ,"lt" : "Apr 17, 4:02PM EDT" ,"lt_dts" : "2014-04-17T16:02:08Z" ,"c" : "-6.39" ,"c_fix" : "-6.39" ,"cp" : "-3.25" ,"cp_fix" : "-3.25" ,"ccol" : "chr" ,"pcls_fix" : "196.4" ,"el": "190.37" ,"el_fix": "190.37" ,"el_cur": "190.37" ,"elt" : "Apr 17, 7:57PM EDT" ,"ec" : "+0.36" ,"ec_fix" : "0.36" ,"ecp" : "0.19" ,"ecp_fix" : "0.19" ,"eccol" : "chg" ,"div" : "0.95" ,"yld" : "2.00" } ]
which is JSON containing a two-element array. I do not get the above result. My original post was a contrived example, chosen to avoid JSON issues since the focus was on HTTPGet.
Re: Source Code Example - Scraping a Web Page
Posted: Fri Apr 18, 2014 3:50 am
by neeraj
PGilbert:
Here is how I have been dealing with HTML. It is a snippet but will give you a flavor. I am just looking for specific information in the HTML.
:Case 2
⍝ Schwab A Rated Stocks
C←NFILE∆READ(∆FILEPATH,'SCHWAB\SCHWABA1.WEBARCHIVE')
S←7↓¨2000↑¨(' symbol="'⍷C)⊂C ⍝ All stock names are of the form ' symbol="IBM"'
AGRADE←STK¨S
C←NFILE∆READ(∆FILEPATH,'SCHWAB\SCHWABA2.WEBARCHIVE')
S←20↑¨(' symbol="'⍷C)⊂C
AGRADE←AGRADE,STK¨S
C←NFILE∆READ(∆FILEPATH,'SCHWAB\SCHWABA3.WEBARCHIVE')
S←20↑¨(' symbol="'⍷C)⊂C
AGRADE←fIXNAME¨AGRADE,STK¨S
∆MTX[(∆IN ∆MTX[;3]≡¨⊂'A');3]←⊂'--' ⍝ All previous A Grades are reset to --
I←∆MTX[;2]⍳AGRADE
existing←(I<1↑⍴∆MTX)/I
∆MTX[existing;3]←'A'
∆MTX[NR;3]←⊂date 0
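For readers who want to try the marker-based approach in the snippet above on its own, here is a minimal, self-contained sketch. The sample HTML, the variable names and the way the name is cut at the closing quote are assumptions made for illustration; NFILE∆READ, STK and fIXNAME from the original are not shown in the thread and are not reproduced here. It assumes the default ⎕ML, where dyadic ⊂ is partitioned enclose.

```apl
html←'<td symbol="IBM">x</td><td symbol="GOOG">y</td>'   ⍝ made-up sample text
marker←' symbol="'
parts←(marker⍷html)⊂html          ⍝ partitioned enclose: a new segment starts at each marker
tail←(⊃⍴marker)↓¨parts            ⍝ drop the marker from the front of each segment
names←{(¯1+⍵⍳'"')↑⍵}¨tail         ⍝ keep the text up to the closing quote
names                             ⍝ two names: 'IBM' and 'GOOG'
```

Cutting at the closing quote avoids having to guess a fixed width such as 20↑, at the cost of assuming every marker is followed by a quoted value.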
Re: Source Code Example - Scraping a Web Page
Posted: Fri Apr 18, 2014 7:36 am
by Morten|Dyalog
If you try this with most of the HTTPGet functions that are out there, it will fail for two reasons. First because the URL has been redirected (the www has been removed from the address):
Code: Select all
Samples.HTTPGet 'http://www.cayugafamilydental.com'
0 http/1.1 301 moved permanently
date Thu, 17 Apr 2014 06:19:02 GMT
server Apache
x-powered-by PHP/5.4.27
expires Thu, 19 Nov 1981 08:52:00 GMT
cache-control no-store, no-cache, must-revalidate, post-check=0, pre-check=0
pragma no-cache
x-pingback http://cayugafamilydental.com/xmlrpc.php
set-cookie PHPSESSID=a1eca9c1da2b358b76b2708acf53b07b; path=/
location http://cayugafamilydental.com/
content-length 0
content-type text/html; charset=UTF-8
Secondly, if you switch to the correct address, it fails because the content is sent using "chunked" transfer encoding, which we did not support. The attached file contains source for the Samples.HTTPGet function that will be distributed with v14.0; it handles both the redirection and the chunking. The advantage of HTTPGet over the .NET solution is that it is cross-platform: it works under Windows, AIX and Linux (including the Raspberry Pi), and will work on the future versions of Dyalog APL that we are currently working on (Mac OS and Android, release dates still not set).
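To illustrate what handling a chunked body involves: the body arrives as a series of chunks, each preceded by its length in hexadecimal on a line of its own, and a chunk of length zero marks the end. The dfn below is only a rough sketch of that decoding, not the code in the attached HTTPGet; the name Dechunk and its helper hex are made up for this example, and it assumes a complete, well-formed body held in a character vector.

```apl
Dechunk←{
    ⍝ ⍵: chunked HTTP body (character vector); ⍺: text decoded so far
    ⍺←''
    crlf←⎕UCS 13 10
    ⍝ hex digits to number; accepts upper- and lower-case digits
    hex←{16⊥16|¯1+'0123456789abcdef0123456789ABCDEF'⍳⍵}
    i←(crlf⍷⍵)⍳1                  ⍝ index of the CRLF that ends the size line
    size←(i-1)↑⍵                  ⍝ size line, e.g. '1a' or '1a;ext=1'
    n←hex(∧\size≠';')/size        ⍝ ignore any chunk extension after ';'
    n=0:⍺                         ⍝ a zero-length chunk ends the body
    rest←(i+1)↓⍵                  ⍝ skip the size line and its CRLF
    (⍺,n↑rest)∇(n+2)↓rest         ⍝ keep n characters, skip the trailing CRLF
}

crlf←⎕UCS 13 10
body←'4',crlf,'Wiki',crlf,'5',crlf,'pedia',crlf,'0',crlf,crlf
Dechunk body                      ⍝ Wikipedia
```

A real client also has to cope with chunks that arrive split across several network reads and with optional trailer headers after the last chunk, which this sketch ignores.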
Re: Source Code Example - Scraping a Web Page
Posted: Fri Apr 18, 2014 1:55 pm
by Brian|Dyalog
Using the HTTPGet that Morten supplied, you can retrieve the JSON result you want...
rc hdrs response←Samples.HTTPGet 'http://finance.google.com/finance/info?%20client=ig&q=NASDAQ:GOOG,NYSE:IBM'
response~⎕ucs 13 10 ⍝ remove carriage returns and linefeeds (wrapping is due to ⎕PW)
// [{"id": "304466804484872","t" : "GOOG","e" : "NASDAQ","l" : "536.10","l_fix" : "536.10","l_cur" : "536.10","s": "0","lt
t":"4:00PM EDT","lt" : "Apr 17, 4:00PM EDT","lt_dts" : "2014-04-17T16:00:00Z","c" : "-20.44","c_fix" : "-20.44","cp"
: "-3.67","cp_fix" : "-3.67","ccol" : "chr","pcls_fix" : "556.54"},{"id": "18241","t" : "IBM","e" : "NYSE","l" : "1
90.01","l_fix" : "190.01","l_cur" : "190.01","s": "0","ltt":"4:02PM EDT","lt" : "Apr 17, 4:02PM EDT","lt_dts" : "201
4-04-17T16:02:08Z","c" : "-6.39","c_fix" : "-6.39","cp" : "-3.25","cp_fix" : "-3.25","ccol" : "chr","pcls_fix" : "19
6.4"}]
The leading // is not valid JSON - you can verify this by pasting the result into an online JSON validator like the one found at http://jsonlint.com/
Then you can use the JSON namespace that's attached below to convert the JSON to a form more usable from APL.
The JSON namespace was developed as a part of the MiServer project, but is a useful standalone utility as well.
stocks←JSON.JSONtoNS 2↓response ⍝ drop off the leading // and convert JSON to namespace format
stocks ⍝ each stock symbol is its own namespace
#.JSON.[Namespace] #.JSON.[Namespace]
]disp (⊃stocks).⎕nl -2 ⍝ each namespace contains variables corresponding to the JSON elements
┌→┬─────┬────┬──┬──────┬─┬──┬─┬─────┬─────┬──┬──────┬───┬────────┬─┬─┐
│c│c_fix│ccol│cp│cp_fix│e│id│l│l_cur│l_fix│lt│lt_dts│ltt│pcls_fix│s│t│
└→┴────→┴───→┴─→┴─────→┴→┴─→┴→┴────→┴────→┴─→┴─────→┴──→┴───────→┴→┴→┘
]disp stocks.(t l lt_dts) ⍝ now you've got something you can manipulate from APL
┌→─────────────────────────────────┬─────────────────────────────────┐
│┌→───┬──────┬────────────────────┐│┌→──┬──────┬────────────────────┐│
││GOOG│536.10│2014-04-17T16:00:00Z│││IBM│190.01│2014-04-17T16:02:08Z││
│└───→┴─────→┴───────────────────→┘│└──→┴─────→┴───────────────────→┘│
└─────────────────────────────────→┴────────────────────────────────→┘
↑stocks.(t l lt_dts)
GOOG 536.10 2014-04-17T16:00:00Z
IBM 190.01 2014-04-17T16:02:08Z
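As a side note for anyone reading this thread on a later interpreter: more recent Dyalog versions (16.0 onward, to the best of my knowledge) include a built-in ⎕JSON system function, which can stand in for the attached JSON namespace in simple cases like this one. The sketch below assumes such a version and uses a shortened, made-up sample string rather than the live response.

```apl
sample←'// [{"t":"GOOG","l":"536.10"},{"t":"IBM","l":"190.01"}]'
stocks←⎕JSON 2↓sample             ⍝ drop the leading // and import; objects become namespaces
↑stocks.(t l)                     ⍝ one row per stock: symbol and last price, both as text
```

As with JSONtoNS above, the values come back as character vectors because the source quotes them, so numeric work would still need a conversion step.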