I asked a question on Stack Overflow about how to do screen scraping in iOS. The outcome was that running some JavaScript with UIWebView stringByEvaluatingJavaScriptFromString: to extract and serialise the data is the best approach. However, there are a few gotchas.
Firstly, it’s worth noting what stringByEvaluatingJavaScriptFromString: actually ‘returns’. It’s a little strange, but some examples make it clear. The comment at the end of each line is the output of NSLog:
NSLog(@"%@", [webView stringByEvaluatingJavaScriptFromString:@"'hello';"]); // hello
NSLog(@"%@", [webView stringByEvaluatingJavaScriptFromString:@"'hello';'goodbye';"]); // goodbye
NSLog(@"%@", [webView stringByEvaluatingJavaScriptFromString:@"return 'hello';"]); //
NSLog(@"%@", [webView stringByEvaluatingJavaScriptFromString:@"var greeting = function(){return 'hello';}; greeting();"]); //hello
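The pattern in these results matches how JavaScript's own eval treats a script: the value handed back is the value of the last statement evaluated, and a bare return outside a function is a syntax error. The sketch below reproduces the four cases with eval in a plain JavaScript engine. It is an analogy rather than a description of UIWebView's internals; the helper name evaluateLikeWebView and the choice to map errors to an empty string are assumptions made for illustration.

```javascript
// Hypothetical helper: mimics the observed behaviour by returning the
// script's last value as a string, or '' when the script fails to run.
function evaluateLikeWebView(script) {
  try {
    // Indirect eval runs the script in the global scope.
    var value = (0, eval)(script);
    return value === undefined || value === null ? '' : String(value);
  } catch (e) {
    return ''; // e.g. a SyntaxError caused by a top-level `return`
  }
}

console.log(evaluateLikeWebView("'hello';"));            // hello
console.log(evaluateLikeWebView("'hello';'goodbye';"));  // goodbye
console.log(evaluateLikeWebView("return 'hello';"));     // (empty string)
console.log(evaluateLikeWebView(
  "var greeting = function(){return 'hello';}; greeting();")); // hello
```

So to get a value out of a script, end it with an expression (or a function call) rather than a return statement.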
However, we can use stringByEvaluatingJavaScriptFromString: to inject JavaScript into the DOM and then make additional stringByEvaluatingJavaScriptFromString: calls to fetch the result:
NSLog(@"%@", [webView stringByEvaluatingJavaScriptFromString:@"document.wordOfTheDay = 'discorporate';"]); //
NSLog(@"%@", [webView stringByEvaluatingJavaScriptFromString:@"document.wordOfTheDay;"]); //discorporate
It’s worth noting that there are other ways to communicate with the JavaScript in a UIWebView.
OK, on to the scraping.
The biggest problem is being certain that the DOM has loaded before the script is run. The UIWebViewDelegate protocol includes webViewDidFinishLoad:, which at first glance seems perfect. If only life were that simple. I’ve encountered quite a few pages that trigger webViewDidFinishLoad: multiple times before the DOM is actually ready (presumably due to iframes or JavaScript).
The solution is to combine webViewDidFinishLoad: with the standard JavaScript approach of detecting when the DOM is ready. On the first invocation of webViewDidFinishLoad: we inject code to check the DOM for readiness (injecting this in webViewDidStartLoad: has unpredictable results):
if (/loaded|complete/.test(document.readyState))
{
    document.UIWebViewDocumentIsReady = true;
}
else
{
    document.addEventListener('DOMContentLoaded', function(){document.UIWebViewDocumentIsReady = true;}, false);
}
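Because webViewDidFinishLoad: can fire more than once, it may help to make the injected check safe to run repeatedly. The following is a minimal sketch of one way to do that, not the code from EMKJavascriptEvaluation. The function name installReadyFlag and the UIWebViewReadyCheckInstalled marker are hypothetical, and the document is passed in as a parameter only so the logic can be exercised outside a browser.

```javascript
// Hypothetical wrapper around the readiness check that ignores repeat injections.
function installReadyFlag(doc) {
  if (doc.UIWebViewReadyCheckInstalled) {
    return; // an earlier injection already set things up
  }
  doc.UIWebViewReadyCheckInstalled = true;

  if (/loaded|complete/.test(doc.readyState)) {
    doc.UIWebViewDocumentIsReady = true;
  } else {
    doc.addEventListener('DOMContentLoaded', function () {
      doc.UIWebViewDocumentIsReady = true;
    }, false);
  }
}

// In the page this would be called as installReadyFlag(document).
// A stand-in document object is used here for demonstration only.
var fakeDoc = {
  readyState: 'loading',
  listeners: [],
  addEventListener: function (type, fn) { this.listeners.push(fn); }
};

installReadyFlag(fakeDoc);
installReadyFlag(fakeDoc); // second call is a no-op
fakeDoc.listeners.forEach(function (fn) { fn(); }); // simulate DOMContentLoaded
console.log(fakeDoc.UIWebViewDocumentIsReady); // true
```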
We then poll the UIWebView to determine when the DOM is ready:
-(void)pollDocumentReadyState
{
    // Ask the page whether the injected script has flagged the DOM as ready.
    NSString *ready = [webView stringByEvaluatingJavaScriptFromString:@"document.UIWebViewDocumentIsReady;"];
    if (ready != nil && [ready caseInsensitiveCompare:@"true"] == NSOrderedSame)
    {
        NSString *json = [webView stringByEvaluatingJavaScriptFromString:myFancyParsingAndSerializationScript];
        //Do something with json
    }
    else
    {
        // Not ready yet: check again in one second.
        [self performSelector:@selector(pollDocumentReadyState) withObject:nil afterDelay:1];
    }
}
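The contents of myFancyParsingAndSerializationScript are not shown here, but since the result comes back as a string, a script of this kind would typically end with an expression that evaluates to serialised JSON. Here is a minimal sketch, assuming the goal is to collect the text and target of every link on the page. The function name collectLinks and the output shape are illustrative only, and JSON.stringify is assumed to be available, either natively or through an injected library such as json2.

```javascript
// Hypothetical scraping function: gathers every link and serialises the result.
function collectLinks(doc) {
  var anchors = doc.getElementsByTagName('a');
  var links = [];
  for (var i = 0; i < anchors.length; i++) {
    links.push({
      text: anchors[i].textContent,
      href: anchors[i].getAttribute('href')
    });
  }
  return JSON.stringify(links);
}

// In the web view the script would end with the expression: collectLinks(document);
// A stand-in document object is used here for demonstration only.
var sampleDoc = {
  getElementsByTagName: function (tag) {
    if (tag !== 'a') { return []; }
    return [{
      textContent: 'Example',
      getAttribute: function () { return 'http://example.com'; }
    }];
  }
};

var json = collectLinks(sampleDoc);
console.log(json); // [{"text":"Example","href":"http://example.com"}]
```

Ending the script with a call such as collectLinks(document); rather than a return statement keeps it in line with the return-value behaviour shown earlier.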
That’s it!
I’ve created a class, EMKJavascriptEvaluation (zip archive), to handle all of this. Here’s a usage example:
-(void)beginScrape
{
    [[NSNotificationCenter defaultCenter] addObserver:self selector:@selector(jsEvaluationCompleted:) name:EMKJavascriptEvaluationComplete object:nil];
    NSString *scriptPath = [[NSBundle mainBundle] pathForResource:@"myFancyParsingAndSerializationScript" ofType:@"js"];
    NSString *script = [NSString stringWithContentsOfFile:scriptPath encoding:NSUTF8StringEncoding error:NULL];
    NSURL *url = [NSURL URLWithString:@"http://example.com"];
    EMKJavascriptEvaluation *evaluation = [EMKJavascriptEvaluation evaluateScript:script withHtmlAtURL:url];
    [evaluation injectLibraryAtPath:[[NSBundle mainBundle] pathForResource:@"jquery" ofType:@"js"]];
    [evaluation injectLibraryAtPath:[[NSBundle mainBundle] pathForResource:@"json2" ofType:@"js"]];
    [evaluation evaluate];
}

-(void)jsEvaluationCompleted:(NSNotification*)notification
{
    NSLog(@"result: %@", [[notification object] result]);
}
Take a look at the .h for details.
The code is completely free and comes with no warranty whatsoever. I haven’t used this code in a finished app yet, so there’s probably a bug or two.
Update: The easiest way to parse HTML is to treat it as XML and use NSXMLParser. iOS comes with LibTidy, which is capable of fixing a multitude of markup sins. Use LibTidy to create clean XML and pass this XML to NSXMLParser. Only use the approach outlined above if it’s not possible to use NSXMLParser.