Recently I spent a day being Mark. By that I mean I spent a day setting up and working through the examples from Mark Watson’s latest book, Scripting Intelligence: Web 3.0 Information Gathering and Processing, from Apress. Mark has been a close friend for a very long time and has written or co-authored more than 15 books on artificial intelligence, user agents, and Linux minutiae, using Java, Common Lisp, Scheme and C/C++ to illustrate his material. In his latest book, he uses Ruby to illustrate how one can get on board with Web 3.0 applications. Where Web 2.0 was about personalization and social media, Web 3.0 is about distributing application services into the cloud and navigating the semantic web to gather and mash up information and knowledge.
This would normally be a daunting task given the scope of the material that has to be covered, starting with natural language processing and running through semantic web standards and concepts, information gathering and storage, and finally information mashups and publishing. However, Mark has an approach and quality that I have found unique and superlative in my professional career. His forte is that he can put together examples that illustrate the essence of what needs to be understood, which allows one to pick up all that theory stuff later, once he or she has developed some hands-on street smarts about the technology. Whenever engineers are confronted with a new language, framework, or system, they typically want to run “Hello, World”, which is the simplest implementation that can be put together for that new technology. In that same vein, Mark has the knack of putting together the simplest illustration of a complex concept: the “Hello, World” of artificial intelligence, if you will.
For example, before one can do the really big stuff, one needs tools and agility to process text. Remember that at its core Google is essentially a very large text processing company, so this is not trivial. To process text one has to deal not only with various text formats and feeds, but also with all the blemishes and idiosyncrasies of natural language. This would be a book in itself! Mark’s approach is to first get the reader set up with a Ruby development environment and then bring in Ruby gems (which are modules that can be plugged in) that do the heavy technical work, including summarizing and classifying text and determining the sentiment of the article being parsed. All this in 65 pages and fewer than a dozen example scripts. In the end you may still not understand the details of, or differences between, Bayesian classifiers and Latent Semantic Analysis, but you have a toolbox of routines that apply these to any text you want to parse and a framework for getting into the details when you are ready to go deeper. Though the code examples are not enterprise ready, the pieces have been brought together simply and nicely, so that if that is where you want to go, you have at least made the first step. This is a place you would not be if you had gone through a complete textbook on natural language processing and had an in-depth understanding of all the technology. You would still have to sit down and put what you know into code.
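To give a flavor of what a Bayesian classifier does under the hood, here is a minimal sketch in plain Ruby. It is not code from the book and does not use any of the gems the book relies on; the class name `TinyBayes`, the labels, and the training sentences are all made up for illustration, and the tokenizer is deliberately naive.

```ruby
# A tiny multinomial naive Bayes text classifier using only the Ruby
# standard library. Hypothetical illustration, not the book's code.
class TinyBayes
  def initialize
    # label => { word => count }
    @word_counts = Hash.new { |hash, label| hash[label] = Hash.new(0) }
    # label => number of training documents
    @doc_counts = Hash.new(0)
    # set of every word seen during training
    @vocabulary = {}
  end

  # Naive tokenizer: lowercase, keep runs of letters and apostrophes.
  def tokenize(text)
    text.downcase.scan(/[a-z']+/)
  end

  def train(label, text)
    @doc_counts[label] += 1
    tokenize(text).each do |word|
      @word_counts[label][word] += 1
      @vocabulary[word] = true
    end
  end

  # Returns the label with the highest log-probability for the text.
  def classify(text)
    total_docs = @doc_counts.values.sum.to_f
    words = tokenize(text)
    scores = @doc_counts.keys.map do |label|
      label_total = @word_counts[label].values.sum
      log_prob = Math.log(@doc_counts[label] / total_docs)
      words.each do |word|
        # Laplace (add-one) smoothing so unseen words do not zero out a label.
        count = @word_counts[label][word] + 1
        log_prob += Math.log(count.to_f / (label_total + @vocabulary.size))
      end
      [label, log_prob]
    end
    scores.max_by { |_, score| score }.first
  end
end

classifier = TinyBayes.new
classifier.train(:positive, "a clear, useful and enjoyable book")
classifier.train(:positive, "great examples, enjoyable to read")
classifier.train(:negative, "confusing and dull examples")
classifier.train(:negative, "a dull, poorly written book")

puts classifier.classify("an enjoyable and useful read")
puts classifier.classify("dull and confusing")
```

With four training sentences this is a toy, but the shape is the same as the real thing: count words per category, then pick the category whose word statistics best explain the new text.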
At this point, I am thrilled, thinking about all the extensions and applications where I could apply my newly found capability. But then maybe I should hold those plans and move on through the book. Mark’s style is simple and to the point, with very little selling or cheerleading. After all, you already bought the book, so why continue selling it? At times there may be sage advice that says “Grasshopper, once you have got this technique down, here is where you can go next” or “This is what I found works well.” The excitement and enthusiasm of the reader come from their own realization that they now have a capability they did not have before. It is this self-realization that propels the reader through the book.
The reader will not be disappointed. He will continue to build his capability throughout the book. In Part 2 he develops the apparatus for working with the Semantic Web. If you think of the semantic web as a universe of knowledge, the galaxies in that universe are different ontologies, where facts and concepts cluster around a particular knowledge domain. The ontology defines the relationships among the concepts within the domain. As one moves through the World Wide Web, one encounters clusters of these galaxies at each point. (What can I say? I am an astrophysicist.) How does one represent and share knowledge in this universe?
In less than 80 pages, one believes they know the answer. Not only are the Resource Description Framework (RDF) and RDF Schema, which are one way knowledge domains can be described, standardized and accessed, introduced and manipulated, but one has also downloaded and set up Protégé from Stanford University, which provides a free, open source “ontology editor and knowledge representation framework” for working with RDF and OWL. From there you can start running SPARQL (pronounced “sparkle”) queries against World Factbook SPARQL endpoints on the web. Enough to make your head spin, but uber cool. This is then all brought back to Ruby and JRuby, so that one has further expanded their toolbox and development environment before proceeding.
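For readers who have not met RDF before, the core idea is that every fact is a subject-predicate-object triple, and a SPARQL query is essentially a triple pattern with variables in it. The following is a rough sketch of that idea in plain Ruby, not the book’s code and not a real SPARQL engine; the `ex:` names, the facts, and the `match_pattern` helper are all hypothetical.

```ruby
# Hypothetical facts stored as [subject, predicate, object] triples.
TRIPLES = [
  ["ex:France", "ex:capital",   "ex:Paris"],
  ["ex:France", "ex:continent", "ex:Europe"],
  ["ex:Japan",  "ex:capital",   "ex:Tokyo"],
  ["ex:Japan",  "ex:continent", "ex:Asia"]
].freeze

# Match a single triple pattern against the triples.
# Symbols in the pattern act as variables; strings must match exactly.
# Returns one hash of variable bindings per matching triple.
def match_pattern(pattern, triples)
  triples.each_with_object([]) do |triple, results|
    bindings = {}
    matched = pattern.zip(triple).all? do |term, value|
      if term.is_a?(Symbol)
        # A variable used twice in one pattern must bind to the same value.
        bindings[term] ||= value
        bindings[term] == value
      else
        term == value
      end
    end
    results << bindings if matched
  end
end

# Roughly what this SPARQL query asks for:
#   SELECT ?country ?city WHERE { ?country ex:capital ?city }
capitals = match_pattern([:country, "ex:capital", :city], TRIPLES)
capitals.each { |row| puts "#{row[:country]} -> #{row[:city]}" }
```

A real endpoint adds joins across several patterns, filters, and namespaces, but thinking of a query as “fill in the blanks in a triple” goes a long way.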
Part 3 covers Information Gathering and Storage: performing free-text searches using relational databases and moving data in and out of these databases, including scraping web sites for knowledge. Part 4 wraps this up with Information Publishing, which also covers the large-scale distributed data processing needed to support it (i.e., cloud computing). By the end, one will be set up on Amazon EC2 and using Amazon’s Elastic MapReduce to one’s heart’s content. Boom!
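If MapReduce is new to you, the pattern itself is small enough to sketch in a few lines of single-process Ruby. This is only an illustration of the map, shuffle and reduce steps on a word count; it is not the book’s code and it does not touch Elastic MapReduce or any Amazon API. The method names and sample documents are made up.

```ruby
# Map phase: emit a [word, 1] pair for every word in one document.
def map_phase(document)
  document.downcase.scan(/[a-z']+/).map { |word| [word, 1] }
end

# Shuffle phase: group the emitted values by key (the word).
def shuffle_phase(pairs)
  pairs.group_by(&:first).transform_values { |group| group.map(&:last) }
end

# Reduce phase: sum the grouped counts for each word.
def reduce_phase(grouped)
  grouped.transform_values(&:sum)
end

documents = ["the semantic web", "the web of data"]
pairs = documents.flat_map { |doc| map_phase(doc) }
counts = reduce_phase(shuffle_phase(pairs))
counts.each { |word, count| puts "#{word}: #{count}" }
```

In a real cluster the map and reduce steps run on many machines and the framework handles the shuffle, but the division of labor is the same.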
I am looking forward to getting through the remainder of the book. Unfortunately, I had only one day of being Mark. That was a day that re-introduced me to the thrill of learning and doing something new and different, and reacquainted me with a way of looking at problems and challenges in their simplest form. As an architect, I often get so wrapped up in dealing with all the issues and details that must be related and worked out to form a “complete enterprise” solution that I can at times overlook how a simple prototype or experiment can work wonders in revealing the overall path.
Of course, I HIGHLY recommend the book. You might think I’m prejudiced, but think of this as more like insider information. Though the Apress roadmap places this book at the advanced level of their Ruby on Rails series, I would say that anyone who is interested in the topic can pick up Ruby as they go along. Mark does a great job helping one get set up with the various open source tools available on the web. If you are already proficient in Ruby and Rails, then this book is a must read and do.