Receiving HTML from AJAX appl

nah
Posts: 2
Joined: Thu May 20, 2010 6:45 am

Receiving HTML from AJAX appl

Post by nah »

I have been running an application that fetched a web page and extracted data from its HTML.
The web page has now been redesigned and uses AJAX.
This means that the HTML source is no longer updated when the displayed content changes.
Does anyone here have experience with this?

Best regards
Niels
Morten|Dyalog
Posts: 460
Joined: Tue Sep 09, 2008 3:52 pm

Re: Receiving HTML from AJAX appl

Post by Morten|Dyalog »

I don't think there is an easy solution to this, as the web page now expects some client code to run in the web browser. Writing your own JavaScript interpreter (or similar) is probably more work than you would like to do :-).

You might be able to get away with using a tool like "Fiddler" to spy on the HTTP communication and see what the AJAX client-side sends to the server and reverse engineer that - this MIGHT give you the information that you need, depending on what is going on. But this is probably a very long shot.

Can you let us know which page you are trying to "scrape"?
nah
Posts: 2
Joined: Thu May 20, 2010 6:45 am

Re: Receiving HTML from AJAX appl

Post by nah »

A nice example is "http://www.soccerway.com/national/sweden/allsvenskan/2010/regular-season/",
which delivers Swedish soccer results. When I view the source of the displayed page I can extract the content,
but after pressing "Previous" I see the previous results on the screen while the page source is not updated.
harsman
Posts: 27
Joined: Thu Nov 26, 2009 12:21 pm

Re: Receiving HTML from AJAX appl

Post by harsman »

That the page is using AJAX means it is retrieving data from the server in a more data-oriented format than HTML, usually JSON or XML. This might actually make it easier to extract data compared to scraping it from HTML.

If you look at the JavaScript source or watch the network traffic (either via an external tool like Fiddler, as Morten suggested, or with a browser-integrated tool like Firebug for Firefox), you should be able to reverse engineer which HTTP requests to make to get the data.
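
As a rough illustration of that last step, here is a minimal sketch in Python (not APL) of replaying a request once you have seen it in the network trace. The endpoint URL and the parameter names are made up for the example; the real ones have to come from what Fiddler or Firebug shows. The `X-Requested-With` header is something many AJAX libraries send, and whether a given server needs it is an assumption to check against the trace.

```python
import json
import urllib.parse
import urllib.request


def build_ajax_request(endpoint, params):
    """Build a GET request that imitates the one the page's script sends."""
    query = urllib.parse.urlencode(params)
    url = f"{endpoint}?{query}"
    headers = {
        # Many AJAX libraries add this header; some servers check for it.
        "X-Requested-With": "XMLHttpRequest",
        "Accept": "application/json",
    }
    return urllib.request.Request(url, headers=headers)


def fetch_json(request, timeout=10):
    """Send the request and decode the response body as JSON."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return json.loads(response.read().decode(charset))


# Hypothetical endpoint and parameters, for illustration only.
request = build_ajax_request(
    "http://example.com/ajax/results",
    {"round": 7, "direction": "previous"},
)
print(request.full_url)
```

Calling `fetch_json(request)` would then send the request; it is left out of the example because the endpoint above does not exist.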
alexbalako
Posts: 16
Joined: Mon Nov 30, 2009 8:58 pm

Re: Receiving HTML from AJAX appl

Post by alexbalako »

Niels,

You could try the Internet Explorer ActiveX control, which will execute the JavaScript on the page for you.
Then pull the HTML from it.
Dick Bowman
Posts: 235
Joined: Thu Jun 18, 2009 4:55 pm

Re: Receiving HTML from AJAX appl

Post by Dick Bowman »

Have there been any further developments on this topic in the past year?

I find myself in a similar situation - a little application that scraped HTML pages is now broken because the site owner (the British Met Office) now generates the pages seen in the browser with JavaScript. Obviously (?) the data I want to bring into APL is reaching my computer, but the browser seems to hide it from me.

Any specific suggestions about tools to look at? I'm not sure whether the last post is talking about general principles or something specific.
Visit http://apl.dickbowman.com to read more from Dick Bowman
Morten|Dyalog
Posts: 460
Joined: Tue Sep 09, 2008 3:52 pm

Re: Receiving HTML from AJAX appl

Post by Morten|Dyalog »

Dick Bowman wrote: Have there been any further developments on this topic in the past year?

Not directly, but the MiServer team has a prototype of a tool to encode and decode JSON, which will be used for AJAX-style interaction with MiServer applications.

However, unless the data supplier documents the format of the required HTTP transactions, the only "solution" to the problem of extracting data from web applications which use AJAX is to snoop on the communication between the JavaScript application running in the browser and the server, use Conga to send a similar request to the server, and then use either ⎕XML or the JSON-decoding tools (or something else, depending on the format) to take the result apart.
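
The "take the result apart" step can be sketched as below. This is Python rather than APL, so the standard `json` and `xml.etree.ElementTree` modules stand in for the JSON-decoding tools and ⎕XML; the function name `decode_payload` is invented for the example, and choosing the decoder from the `Content-Type` header is an assumption that only holds if the server labels its responses correctly.

```python
import json
import xml.etree.ElementTree as ET


def decode_payload(body, content_type):
    """Decode a response body according to its Content-Type header."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";")[0].strip().lower()
    if media_type.endswith("json"):
        return json.loads(body)
    if media_type.endswith("xml"):
        return ET.fromstring(body)
    raise ValueError(f"unsupported media type: {media_type!r}")


# Made-up payloads, for illustration only.
data = decode_payload(b'{"a": [1, 2]}', "application/json; charset=utf-8")
root = decode_payload(b"<r><v>3</v></r>", "text/xml")
```

Anything that is neither JSON nor XML raises an error here, which matches the "or something else, depending on the format" caveat: such formats need their own decoder.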
Dick Bowman
Posts: 235
Joined: Thu Jun 18, 2009 4:55 pm

Re: Receiving HTML from AJAX appl

Post by Dick Bowman »

Quick update to confirm that this thread has shown me what I needed to do...

0⊃ Firebug revealed that the JavaScript was pulling files with the .json extension from the remote server
1⊃ Put together a quick-and-dirty decoder for the .json files

Which has put the broken part of the application back into action.
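
For anyone following the same route, a quick decoder of this kind might look roughly like the sketch below. It is Python rather than APL and is not the decoder I wrote; the field names and sample data are invented, and it assumes the .json file holds an array of flat objects, which may not match what your site sends.

```python
import json


def json_to_table(text, fields):
    """Turn a JSON array of objects into a header row plus data rows."""
    records = json.loads(text)
    # Missing fields become None so that every row has the same length.
    rows = [[record.get(name) for name in fields] for record in records]
    return [list(fields)] + rows


# Invented sample data, for illustration only.
sample = '[{"site": "A", "temp": 11.5}, {"site": "B", "temp": 9.0, "wind": 4}]'
table = json_to_table(sample, ["site", "temp"])
```

The result is a rectangular list of rows, which is close to the matrix shape one would normally want on the APL side.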

Thanks to all.