Benedict's Soapbox

HTML Parsing/Screen Scraping with UIWebView

I asked a question on Stack Overflow about how to do screen scraping in iOS. The outcome was that running some JavaScript with UIWebView’s stringByEvaluatingJavaScriptFromString: to extract and serialise the data is the best approach. However, there are a few gotchas.

Firstly, it’s worth noting what stringByEvaluatingJavaScriptFromString: actually ‘returns’. It’s a little strange, but some examples make it clear. The comment at the end of each line is the output of NSLog:

NSLog(@"%@", [webView stringByEvaluatingJavaScriptFromString:@"'hello';"]); // hello  
NSLog(@"%@", [webView stringByEvaluatingJavaScriptFromString:@"'hello';'goodbye';"]); // goodbye  
NSLog(@"%@", [webView stringByEvaluatingJavaScriptFromString:@"return 'hello';"]); //  
NSLog(@"%@", [webView stringByEvaluatingJavaScriptFromString:@"var greeting = function(){return 'hello';}; greeting();"]); //hello  
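
The four results above are consistent with how JavaScript's own eval reports a script's completion value: the value of the last expression statement comes back, and a top-level return is a syntax error. The sketch below reproduces the same cases with plain eval outside a web view. evaluateLikeWebView is a hypothetical helper, and mapping a failed script to an empty string is an assumption inferred from the output above, not documented UIWebView behaviour.

```javascript
// Hypothetical helper: mimic the behaviour seen in the NSLog examples above.
// Returns the script's completion value as a string, or '' if it fails.
function evaluateLikeWebView(script) {
  try {
    // Indirect eval runs the script as global code, like a top-level script.
    var value = (0, eval)(script);
    return value === undefined || value === null ? '' : String(value);
  } catch (e) {
    // Assumption: a script that fails to parse or throws yields an empty string.
    return '';
  }
}

console.log(evaluateLikeWebView("'hello';"));            // hello
console.log(evaluateLikeWebView("'hello';'goodbye';"));  // goodbye
console.log(evaluateLikeWebView("return 'hello';"));     // (empty: illegal top-level return)
console.log(evaluateLikeWebView("var greeting = function(){return 'hello';}; greeting();")); // hello
```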

However, we can use stringByEvaluatingJavaScriptFromString: to inject JavaScript into the DOM and make additional stringByEvaluatingJavaScriptFromString: calls to fetch the result:

NSLog(@"%@", [webView stringByEvaluatingJavaScriptFromString:@"document.wordOfTheDay = 'discorporate';"]); //  
NSLog(@"%@", [webView stringByEvaluatingJavaScriptFromString:@"document.wordOfTheDay;"]); //discorporate  
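
Because only a string comes back from each call, anything more structured than a single value has to be serialised before it crosses into Objective-C. Here is a minimal sketch of the store-then-fetch pattern, assuming JSON.stringify is available in the page (older WebKit builds may need json2.js injected first). storeResult, fetchResult and the scrapeResult property are hypothetical names, and the doc argument stands in for the page's document so the snippet can run outside a browser.

```javascript
// Step 1: what a first injected script might do: stash data on the document.
function storeResult(doc, data) {
  doc.scrapeResult = data;
}

// Step 2: what a later call might evaluate: return the stashed data as JSON.
// An empty string signals that nothing has been stored yet.
function fetchResult(doc) {
  return doc.scrapeResult === undefined ? '' : JSON.stringify(doc.scrapeResult);
}

// Stand-in for the page's document object.
var fakeDocument = {};
storeResult(fakeDocument, { word: 'discorporate' });
console.log(fetchResult(fakeDocument)); // {"word":"discorporate"}
```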

It’s worth noting that there are other ways to communicate with the JavaScript in a UIWebView.

OK, on to the scraping.

The biggest problem is being certain that the DOM has loaded before the script is run. The UIWebViewDelegate protocol includes webViewDidFinishLoad:, which at first glance seems perfect. If only life were that simple. I’ve encountered quite a few pages that trigger webViewDidFinishLoad: multiple times before the DOM is actually ready (presumably this is due to iframes or JavaScript).

The solution is to combine webViewDidFinishLoad: with the standard JavaScript approach of detecting when the DOM is ready. On the first invocation of webViewDidFinishLoad: we inject code to check the DOM for readiness (injecting this in webViewDidStartLoad: has unpredictable results):

if (/loaded|complete/.test(document.readyState))
{
    document.UIWebViewDocumentIsReady = true;
}
else
{
    document.addEventListener('DOMContentLoaded', function(){document.UIWebViewDocumentIsReady = true;}, false);
}
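
Since webViewDidFinishLoad: may fire more than once, it can help to wrap the same check in a function that is safe to inject repeatedly. This is a sketch under that assumption, not the code inside EMKJavascriptEvaluation. markWhenReady and the UIWebViewReadyListenerAdded property are hypothetical names, and the doc parameter exists only so the logic can be exercised with a stand-in object outside a browser.

```javascript
// Hypothetical wrapper around the readiness check shown above.
function markWhenReady(doc) {
  // Already flagged, or a listener is already pending: nothing to do.
  if (doc.UIWebViewDocumentIsReady || doc.UIWebViewReadyListenerAdded) {
    return;
  }
  if (/loaded|complete/.test(doc.readyState)) {
    doc.UIWebViewDocumentIsReady = true;
  } else {
    doc.UIWebViewReadyListenerAdded = true;
    doc.addEventListener('DOMContentLoaded', function () {
      doc.UIWebViewDocumentIsReady = true;
    }, false);
  }
}

// Stand-in document that is still loading.
var listeners = [];
var fakeDoc = {
  readyState: 'loading',
  addEventListener: function (name, fn) { listeners.push(fn); }
};
markWhenReady(fakeDoc);
markWhenReady(fakeDoc); // a second injection adds no second listener
console.log(listeners.length, Boolean(fakeDoc.UIWebViewDocumentIsReady)); // 1 false
listeners[0](); // simulate DOMContentLoaded firing
console.log(fakeDoc.UIWebViewDocumentIsReady); // true
```

Inside a real page the injected script would simply end with markWhenReady(document);.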

We then poll the UIWebView to determine when the DOM is ready:

-(void)pollDocumentReadyState
{
    // Ask the page whether the injected readiness flag has been set yet.
    NSString *ready = [webView stringByEvaluatingJavaScriptFromString:@"document.UIWebViewDocumentIsReady;"];
    if ([@"true" caseInsensitiveCompare:ready] == NSOrderedSame)
    {
        NSString *json = [webView stringByEvaluatingJavaScriptFromString:myFancyParsingAndSerializationScript];
        //Do something with json
    }
    else
    {
        // Not ready yet: check again in one second.
        [self performSelector:@selector(pollDocumentReadyState) withObject:nil afterDelay:1];
    }
}
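
For illustration, myFancyParsingAndSerializationScript could be as small as the sketch below, which collects every link on the page and returns the list as JSON. The function name scrapeLinks and the choice of fields are assumptions rather than part of the original code, and doc stands in for document so the example can run outside a browser.

```javascript
// Hypothetical scraping script: collect the text and target of each link.
function scrapeLinks(doc) {
  var anchors = doc.querySelectorAll('a');
  var links = [];
  // A plain loop works on NodeLists, which are array-like but not arrays.
  for (var i = 0; i < anchors.length; i++) {
    links.push({
      text: (anchors[i].textContent || '').replace(/^\s+|\s+$/g, ''), // trim whitespace
      href: anchors[i].href
    });
  }
  return JSON.stringify(links);
}

// Stand-in for a loaded page containing two links.
var samplePage = {
  querySelectorAll: function (selector) {
    return selector === 'a' ? [
      { textContent: ' Home ', href: 'http://example.com/' },
      { textContent: 'About', href: 'http://example.com/about' }
    ] : [];
  }
};
console.log(scrapeLinks(samplePage));
// [{"text":"Home","href":"http://example.com/"},{"text":"About","href":"http://example.com/about"}]
```

When injected into a page, the script would end with the expression scrapeLinks(document); so that its value becomes the string handed back to Objective-C.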

That’s it!

I’ve created a class, EMKJavascriptEvaluation (zip archive), to handle all of this. Here’s a usage example:

-(void)beginScrape
{
    [[NSNotificationCenter defaultCenter] addObserver:self selector:@selector(jsEvaluationCompleted:) name:EMKJavascriptEvaluationComplete object:nil];

    NSString *scriptPath = [[NSBundle mainBundle] pathForResource:@"myFancyParsingAndSerializationScript" ofType:@"js"];
    NSString *script = [NSString stringWithContentsOfFile:scriptPath encoding:NSUTF8StringEncoding error:NULL];
    NSURL *url = [NSURL URLWithString:@"http://example.com"];

    EMKJavascriptEvaluation *evaluation = [EMKJavascriptEvaluation evaluateScript:script withHtmlAtURL:url];

    [evaluation injectLibraryAtPath:[[NSBundle mainBundle] pathForResource:@"jquery" ofType:@"js"]];
    [evaluation injectLibraryAtPath:[[NSBundle mainBundle] pathForResource:@"json2" ofType:@"js"]];

    [evaluation evaluate];
}

-(void)jsEvaluationCompleted:(NSNotification*)notification
{
    NSLog(@"result: %@", [[notification object] result]);
}

Take a look at the .h for details.

The code is completely free and comes with no warranty whatsoever. I haven’t used this code in a finished app yet, so there’s probably a bug or two.

Update: The easiest way to parse HTML is to treat it as XML and use NSXMLParser. iOS comes with LibTidy, which is capable of fixing a multitude of markup sins. Use LibTidy to create clean XML and pass this XML to NSXMLParser. Only use the approach outlined above if it’s not possible to use NSXMLParser.