Benedict's Soapbox

24 August 2010

HTML Parsing/Screen Scraping with UIWebView

I asked a question on Stack Overflow about how to do screen scraping in iOS. The outcome was that running some javascript with UIWebView’s stringByEvaluatingJavaScriptFromString: to extract and serialise the data is the best approach. However, there are a few gotchas.

Firstly, it’s worth noting what stringByEvaluatingJavaScriptFromString: actually ‘returns’: the value of the last statement evaluated, as a string. It’s a little strange but some examples make it clear. The comment at the end of each line is the output of NSLog:

NSLog(@"%@", [webView stringByEvaluatingJavaScriptFromString:@"'hello';"]);   // hello
NSLog(@"%@", [webView stringByEvaluatingJavaScriptFromString:@"'hello';'goodbye';"]);   // goodbye
NSLog(@"%@", [webView stringByEvaluatingJavaScriptFromString:@"return 'hello';"]);   // (empty: a top-level return is not valid javascript)
NSLog(@"%@", [webView stringByEvaluatingJavaScriptFromString:@"var greeting = function(){return 'hello';}; greeting();"]);   // hello
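
The results above look a lot like what plain javascript eval reports as the completion value of a script. The sketch below is only an analogy under that assumption: completionValue is a hypothetical helper, not part of UIWebView, and treating every failure as an empty string is a guess at how the web view behaves.

```javascript
// Hypothetical helper: evaluate a script as global code and report its
// completion value as a string, or '' if it fails to parse or throws.
function completionValue(script) {
    try {
        var value = (0, eval)(script); // indirect eval runs in the global scope
        return (value === undefined || value === null) ? '' : String(value);
    } catch (e) {
        return ''; // e.g. SyntaxError for a top-level return
    }
}

var lastStatementWins = completionValue("'hello';'goodbye';");
var topLevelReturn = completionValue("return 'hello';");
// Wrapping the body in an immediately invoked function makes return legal.
var wrappedReturn = completionValue("(function(){ return 'hello'; })();");
```

If the analogy holds, wrapping a longer script in a function like this is a convenient way to use return and keep temporary variables out of the page's global scope.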

State persists between calls, so we can use one stringByEvaluatingJavaScriptFromString: call to inject javascript into the page and a later call to fetch the result:

NSLog(@"%@", [webView stringByEvaluatingJavaScriptFromString:@"document.wordOfTheDay = 'discorporate';"]);   //
NSLog(@"%@", [webView stringByEvaluatingJavaScriptFromString:@"document.wordOfTheDay;"]);   //discorporate
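
Because only a string comes back, anything more structured than a single word probably needs serialising before it is fetched. Here is a minimal sketch of that idea; scrapeResult is a made-up property name, and the doc stand-in exists only so the snippet also runs outside a browser.

```javascript
// Use the real document inside a web view, or a plain object elsewhere.
var doc = (typeof document !== 'undefined') ? document : {};

// What a first injected script might do: stash a structured result.
doc.scrapeResult = { word: 'discorporate', length: 'discorporate'.length };

// What a later script could evaluate to.
var asPlainString = String(doc.scrapeResult);  // objects coerce to "[object Object]"
var asJson = JSON.stringify(doc.scrapeResult); // a string the native side can parse
```

The JSON form is the one worth returning to the native side, since the plain string conversion throws the data away.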

It’s worth noting that there are other ways to communicate with the javascript in a UIWebView.

OK, on to the scraping.

The biggest problem is being certain that the DOM has loaded before the script is run. The UIWebViewDelegate protocol includes webViewDidFinishLoad: which at first glance seems perfect. If only life was that simple. I’ve encountered quite a few pages that trigger webViewDidFinishLoad: multiple times before the DOM is actually ready (presumably this is due to iframes or javascript).

The solution is to combine webViewDidFinishLoad: with the standard javascript approach of detecting when the DOM is ready. On the first invocation of webViewDidFinishLoad: we inject code to check the DOM for readiness (injecting this in webViewDidStartLoad: has unpredictable results):

if (/loaded|complete/.test(document.readyState))
{
    document.UIWebViewDocumentIsReady = true;
}
else
{
    document.addEventListener('DOMContentLoaded', function(){document.UIWebViewDocumentIsReady = true;}, false);
}
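
To see both branches of that check without a web view, the same logic can be wrapped in a function that takes a document-like object. This is just a sketch for experimenting: installReadyFlag and makeFakeDocument are hypothetical names, and the fake document implements only the two members the check touches.

```javascript
// The readiness check from above, parameterised on the document.
function installReadyFlag(doc) {
    if (/loaded|complete/.test(doc.readyState)) {
        doc.UIWebViewDocumentIsReady = true;
    } else {
        doc.addEventListener('DOMContentLoaded', function () {
            doc.UIWebViewDocumentIsReady = true;
        }, false);
    }
}

// Minimal stand-in for a document: a readyState plus event listeners.
function makeFakeDocument(readyState) {
    var listeners = {};
    return {
        readyState: readyState,
        addEventListener: function (type, fn) {
            (listeners[type] = listeners[type] || []).push(fn);
        },
        fire: function (type) {
            (listeners[type] || []).forEach(function (fn) { fn(); });
        }
    };
}

// Page that had already finished loading when the script was injected.
var alreadyLoaded = makeFakeDocument('complete');
installReadyFlag(alreadyLoaded);

// Page still loading: the flag appears only once DOMContentLoaded fires.
var stillLoading = makeFakeDocument('loading');
installReadyFlag(stillLoading);
var flagBeforeEvent = stillLoading.UIWebViewDocumentIsReady === true;
stillLoading.fire('DOMContentLoaded');
var flagAfterEvent = stillLoading.UIWebViewDocumentIsReady === true;
```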

We then poll the UIWebView to determine when the DOM is ready:

-(void)pollDocumentReadyState
{
    // The flag is set by the readiness script injected in webViewDidFinishLoad:
    NSString *isReady = [webView stringByEvaluatingJavaScriptFromString:@"document.UIWebViewDocumentIsReady;"];
    if ([@"true" caseInsensitiveCompare:isReady] == NSOrderedSame)
    {
        NSString *json = [webView stringByEvaluatingJavaScriptFromString:myFancyParsingAndSerializationScript];
        //Do something with json
    }
    else
    {
        // Not ready yet: check again in one second
        [self performSelector:@selector(pollDocumentReadyState) withObject:nil afterDelay:1];
    }
}
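
For a sense of what myFancyParsingAndSerializationScript could contain, here is one possible shape: a function that collects the page's links and returns them as JSON. Everything in it is illustrative. scrapeLinks, the choice of fields, and the fake document used to exercise it are assumptions, not the script from the original project.

```javascript
// Hypothetical scraping function: gather every link's text and href,
// then serialise the lot so it can travel back as a single string.
function scrapeLinks(doc) {
    var anchors = doc.querySelectorAll('a[href]');
    var links = Array.prototype.map.call(anchors, function (a) {
        return {
            text: (a.textContent || '').trim(),
            href: a.getAttribute('href')
        };
    });
    return JSON.stringify({ title: doc.title, links: links });
}

// Inside the web view the script would end with the expression whose
// value should be returned:
//     scrapeLinks(document);

// Outside a browser, a tiny fake document is enough to try it.
var fakeDoc = {
    title: 'Example',
    querySelectorAll: function () {
        return [{
            textContent: ' Home ',
            getAttribute: function () { return '/'; }
        }];
    }
};
var scrapedJson = scrapeLinks(fakeDoc);
```

Taking the document as a parameter keeps the function testable away from the device, and ending the injected script with a bare call means its JSON string is the value handed back to the native side.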

That’s it!

I’ve created a class, EMKJavascriptEvaluation (zip archive), to handle all of this. Here’s a usage example:

-(void)beginScrape
{
    [[NSNotificationCenter defaultCenter] addObserver:self selector:@selector(jsEvaluationCompleted:) name:EMKJavascriptEvaluationComplete object:nil];

    NSString *scriptPath = [[NSBundle mainBundle] pathForResource:@"myFancyParsingAndSerializationScript" ofType:@"js"];
    NSString *script = [NSString stringWithContentsOfFile:scriptPath encoding:NSUTF8StringEncoding error:NULL];
    NSURL* url = [NSURL URLWithString: @"http://example.com"];

    EMKJavascriptEvaluation *evaluation = [EMKJavascriptEvaluation evaluateScript:script withHtmlAtURL:url];

    [evaluation injectLibraryAtPath:[[NSBundle mainBundle] pathForResource:@"jquery" ofType:@"js"]];
    [evaluation injectLibraryAtPath:[[NSBundle mainBundle] pathForResource:@"json2" ofType:@"js"]];

    [evaluation evaluate];
}

-(void)jsEvaluationCompleted:(NSNotification*)notification
{
    NSLog(@"result: %@", [[notification object] result]);
}

Take a look at the .h for details.

The code is completely free and comes with no warranty whatsoever. I haven’t used this code in a finished app yet so there’s probably a bug or two.

Update: The easiest way to parse HTML is to treat it as XML and use NSXMLParser. iOS comes with LibTidy, which is capable of fixing a multitude of markup sins. Use LibTidy to create clean XML and pass that XML to NSXMLParser. Only use the approach outlined above if it’s not possible to use NSXMLParser.

