Source Code Example - Scraping a Web Page

Using (or providing) Microsoft.NET Classes

Post by neeraj »

The following may be of help to other people:

 TEST;⎕USING;srcUriString;srcUri;client;str
 ⍝ Download a web page as text using the .NET WebClient class
 ⎕USING←'System,mscorlib.dll'
 ⎕USING,←⊂'System.IO,mscorlib.dll'
 ⎕USING,←⊂'System.Net,System.dll'
 srcUriString←⎕NEW String(⊂'http://www.cayugafamilydental.com')
 srcUri←⎕NEW Uri srcUriString
 client←⎕NEW WebClient ⍬
 str←client.DownloadString srcUri   ⍝ page content as a character vector
 ⍴str                               ⍝ show how many characters were downloaded

Post by Dick Bowman »

Any advantage over the example in the .NET Interface Guide?

Always interesting to see different ways to skin a rat. Descriptions of pros and cons can help take what we learn into unknown territory.

Post by jGoff »

Always good to have a simple "scraper" on hand. Using v12.1, it worked the first time after a copy and a paste. (Careful not to let the first ⎕NEW line wrap.) Thanks for sharing.

P.S. Not to mention that if I ever have a toothache in Ithaca, I'll know where to go.

Post by neeraj »

This version is shorter. I tried using the Conga workspace and HTTPGet in the Samples namespace, but it was not returning the desired result, so I used .NET instead, which does return the correct result.

Post by PGilbert »

Thanks for sharing your short version. Here is what we did recently, inspired by the Dyalog .NET Interface Guide:

 TEST2;dataStream;reader;request;response;responseFromServer;url;⎕USING
 ⎕USING←'System.Net,System.dll' 'System.IO,mscorlib.dll' 'System.Text,mscorlib.dll'
 url←'http://www.cayugafamilydental.com'
 request←WebRequest.Create(⊂url)
 response←request.GetResponse
 dataStream←response.GetResponseStream
 reader←⎕NEW StreamReader(dataStream,Encoding.GetEncoding(⊂response.CharacterSet))
 responseFromServer←reader.ReadToEnd


What do you do to extract the information you are looking for from the HTML? (We ended up transforming the HTML into XHTML and searching it as if it were XML.)

Post by DanB|Dyalog »

neeraj: what did you do to get the page and how did you do it?
I tried

      )load conga
C:\Program Files\Dyalog\V14U\ws\conga saved Mon Apr 07 17:20:16 2014
      ⍴¨r←Samples.HTTPGet'www.dyalog.com'
   11 2  17192
      r∊⊂p
0 0 1

'p' is the result from your fn. It matches the 3rd element returned by Samples.HTTPGet.
/Dan

Post by neeraj »

Samples.HTTPGet 'http://finance.google.com/finance/info?%20client=ig&q=NASDAQ:GOOG,NYSE:IBM'

The above should come back with something like:

// [ { "id": "304466804484872" ,"t" : "GOOG" ,"e" : "NASDAQ" ,"l" : "536.10" ,"l_fix" : "536.10" ,"l_cur" : "536.10" ,"s": "2" ,"ltt":"4:00PM EDT" ,"lt" : "Apr 17, 4:00PM EDT" ,"lt_dts" : "2014-04-17T16:00:00Z" ,"c" : "-20.44" ,"c_fix" : "-20.44" ,"cp" : "-3.67" ,"cp_fix" : "-3.67" ,"ccol" : "chr" ,"pcls_fix" : "556.54" ,"el": "538.16" ,"el_fix": "538.16" ,"el_cur": "538.16" ,"elt" : "Apr 17, 7:59PM EDT" ,"ec" : "+2.06" ,"ec_fix" : "2.06" ,"ecp" : "0.38" ,"ecp_fix" : "0.38" ,"eccol" : "chg" ,"div" : "" ,"yld" : "" } ,{ "id": "18241" ,"t" : "IBM" ,"e" : "NYSE" ,"l" : "190.01" ,"l_fix" : "190.01" ,"l_cur" : "190.01" ,"s": "2" ,"ltt":"4:02PM EDT" ,"lt" : "Apr 17, 4:02PM EDT" ,"lt_dts" : "2014-04-17T16:02:08Z" ,"c" : "-6.39" ,"c_fix" : "-6.39" ,"cp" : "-3.25" ,"cp_fix" : "-3.25" ,"ccol" : "chr" ,"pcls_fix" : "196.4" ,"el": "190.37" ,"el_fix": "190.37" ,"el_cur": "190.37" ,"elt" : "Apr 17, 7:57PM EDT" ,"ec" : "+0.36" ,"ec_fix" : "0.36" ,"ecp" : "0.19" ,"ecp_fix" : "0.19" ,"eccol" : "chg" ,"div" : "0.95" ,"yld" : "2.00" } ]

which is JSON containing a two-element array. I do not get the above result. My original post was a contrived example to avoid JSON issues when the focus was on HTTPGet.

Post by neeraj »

PGilbert:

Here is how I have been dealing with HTML. It is a snippet but will give you a flavor. I am just looking for specific information in the HTML.

      :Case 2
⍝ Schwab A Rated Stocks
C←NFILE∆READ(∆FILEPATH,'SCHWAB\SCHWABA1.WEBARCHIVE')
S←7↓¨2000↑¨(' symbol="'⍷C)⊂C ⍝ All stock names are of the form ' symbol="IBM"'
AGRADE←STK¨S
C←NFILE∆READ(∆FILEPATH,'SCHWAB\SCHWABA2.WEBARCHIVE')
S←20↑¨(' symbol="'⍷C)⊂C
AGRADE←AGRADE,STK¨S
C←NFILE∆READ(∆FILEPATH,'SCHWAB\SCHWABA3.WEBARCHIVE')
S←20↑¨(' symbol="'⍷C)⊂C
AGRADE←fIXNAME¨AGRADE,STK¨S
∆MTX[(∆IN ∆MTX[;3]≡¨⊂'A');3]←⊂'--' ⍝ All previous A Grades are reset to --
I←∆MTX[;2]⍳AGRADE
existing←(I<1↑⍴∆MTX)/I
∆MTX[existing;3]←'A'
∆MTX[NR;3]←⊂date 0
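
The partitioned-enclose idiom in the snippet above can be tried on a small made-up string. This is only a sketch: the sample text in C and the STK function below are hypothetical stand-ins (the real STK is not shown in this thread), and it assumes the default ⎕ML←1, under which dyadic ⊂ is partitioned enclose.

```apl
⍝ Hypothetical sample text standing in for the saved web page
C←'<td symbol="IBM">x</td><td symbol="GOOG">y</td>'

⍝ Start a new segment wherever ' symbol="' begins; text before the first match is dropped
S←(' symbol="'⍷C)⊂C

⍝ Hypothetical stand-in for STK: drop the 9-character prefix, keep text up to the closing quote
STK←{n←9↓⍵ ⋄ (∧\n≠'"')/n}

STK¨S     ⍝ a two-element nested vector holding IBM and GOOG
```

Taking a fixed number of characters from each segment first (as with 20↑¨ above) keeps the per-segment work small when the page is large.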

Post by Morten|Dyalog »

If you try this with most of the HTTPGet functions that are out there, it will fail for two reasons. First, the URL is redirected (the www has been removed from the address):

      Samples.HTTPGet 'http://www.cayugafamilydental.com'
0   http/1.1 301 moved permanently                                                                     
    date                             Thu, 17 Apr 2014 06:19:02 GMT                                     
    server                           Apache                                                           
    x-powered-by                     PHP/5.4.27                                                       
    expires                          Thu, 19 Nov 1981 08:52:00 GMT                                     
    cache-control                    no-store, no-cache, must-revalidate, post-check=0, pre-check=0   
    pragma                           no-cache                                                         
    x-pingback                       http://cayugafamilydental.com/xmlrpc.php                         
    set-cookie                       PHPSESSID=a1eca9c1da2b358b76b2708acf53b07b; path=/               
    location                         http://cayugafamilydental.com/                                   
    content-length                   0                                                                 
    content-type                     text/html; charset=UTF-8                                         

Secondly, if you switch to the correct address, it fails because the content is sent using "chunked" transfer encoding, which we did not support. The attached file contains source for the Samples.HTTPGet function that will be distributed with v14.0: it handles both the redirection and the chunking. The advantage of HTTPGet over the .NET solution is that it is cross-platform: it will work under Windows, AIX, Linux (including the Raspberry Pi), and the future versions of Dyalog APL that we are currently working on (MacOS and Android; release dates still not set).
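
For readers curious about what handling a "chunked" body involves, here is a minimal sketch. It is not the code from the attached HTTPGet; the name Dechunk is hypothetical, and it assumes a well-formed body held in a character vector, with no chunk extensions or trailer headers. Each chunk is a hexadecimal size line, CRLF, that many characters of data, and another CRLF; a size of 0 ends the body.

```apl
Dechunk←{
    ⎕IO←1
    ⍺←''                                     ⍝ decoded text accumulated so far
    crlf←⎕UCS 13 10
    hex←'0123456789abcdef0123456789ABCDEF'
    n←+/∧\~crlf⍷⍵                            ⍝ number of characters before the first CRLF
    size←16⊥16|¯1+hex⍳n↑⍵                    ⍝ hexadecimal size line → number
    size=0:⍺                                 ⍝ a zero-length chunk ends the body
    rest←(n+2)↓⍵                             ⍝ skip the size line and its CRLF
    (⍺,size↑rest)∇(size+2)↓rest              ⍝ keep the data, skip its CRLF, recurse
}

crlf←⎕UCS 13 10
Dechunk '4',crlf,'Wiki',crlf,'5',crlf,'pedia',crlf,'0',crlf,crlf   ⍝ gives Wikipedia
```

A production version would also need to cope with chunks that arrive split across several network reads, which this sketch ignores.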
Attachments

[The extension dyalog has been deactivated and can no longer be displayed.]


Post by Brian|Dyalog »

Using the HTTPGet that Morten supplied, you can retrieve the JSON result you want...

      rc hdrs response←Samples.HTTPGet 'http://finance.google.com/finance/info?%20client=ig&q=NASDAQ:GOOG,NYSE:IBM' 

      response~⎕ucs 13 10 ⍝ remove carriage returns and linefeeds (wrapping is due to ⎕PW)
// [{"id": "304466804484872","t" : "GOOG","e" : "NASDAQ","l" : "536.10","l_fix" : "536.10","l_cur" : "536.10","s": "0","lt
t":"4:00PM EDT","lt" : "Apr 17, 4:00PM EDT","lt_dts" : "2014-04-17T16:00:00Z","c" : "-20.44","c_fix" : "-20.44","cp"
: "-3.67","cp_fix" : "-3.67","ccol" : "chr","pcls_fix" : "556.54"},{"id": "18241","t" : "IBM","e" : "NYSE","l" : "1
90.01","l_fix" : "190.01","l_cur" : "190.01","s": "0","ltt":"4:02PM EDT","lt" : "Apr 17, 4:02PM EDT","lt_dts" : "201
4-04-17T16:02:08Z","c" : "-6.39","c_fix" : "-6.39","cp" : "-3.25","cp_fix" : "-3.25","ccol" : "chr","pcls_fix" : "19
6.4"}]


The leading // is not valid JSON - you can verify this by pasting the result into an online JSON validator like the one found at http://jsonlint.com/

Then you can use the JSON namespace that's attached below to convert the JSON to a form more usable from APL.
The JSON namespace was developed as a part of the MiServer project, but is a useful standalone utility as well.

      stocks←JSON.JSONtoNS 2↓response  ⍝ drop off the leading // and convert JSON to namespace format
      stocks ⍝ each stock symbol is its own namespace
#.JSON.[Namespace] #.JSON.[Namespace]

      ]disp (⊃stocks).⎕nl -2 ⍝ each namespace contains variables corresponding to the JSON elements
┌→┬─────┬────┬──┬──────┬─┬──┬─┬─────┬─────┬──┬──────┬───┬────────┬─┬─┐
│c│c_fix│ccol│cp│cp_fix│e│id│l│l_cur│l_fix│lt│lt_dts│ltt│pcls_fix│s│t│
└→┴────→┴───→┴─→┴─────→┴→┴─→┴→┴────→┴────→┴─→┴─────→┴──→┴───────→┴→┴→┘

      ]disp stocks.(t l lt_dts) ⍝ now you've got something you can manipulate from APL
┌→─────────────────────────────────┬─────────────────────────────────┐
│┌→───┬──────┬────────────────────┐│┌→──┬──────┬────────────────────┐│
││GOOG│536.10│2014-04-17T16:00:00Z│││IBM│190.01│2014-04-17T16:02:08Z││
│└───→┴─────→┴───────────────────→┘│└──→┴─────→┴───────────────────→┘│
└─────────────────────────────────→┴────────────────────────────────→┘

      ↑stocks.(t l lt_dts)
GOOG 536.10 2014-04-17T16:00:00Z
IBM 190.01 2014-04-17T16:02:08Z
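
As an aside, the same conversion can be sketched without the attached namespace, assuming a later interpreter (Dyalog 16.0 or newer) where the ⎕JSON system function is built in. The short raw string below is a made-up stand-in for the real response, not data returned by the service.

```apl
⍝ Assumes Dyalog 16.0 or later, where ⎕JSON is available
raw←'// [{"t":"GOOG","l":"536.10"},{"t":"IBM","l":"190.01"}]'

stocks←⎕JSON 2↓raw      ⍝ drop the leading // and parse; each JSON object becomes a namespace
↑stocks.(t l)           ⍝ one row per stock: symbol and last price (both still character vectors)
```

The prices arrive as text because the service quotes them; something like ⍎¨stocks.l or ⎕VFI would be needed to turn them into numbers.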
Attachments

[The extension dyalog has been deactivated and can no longer be displayed.]
