In the late ’90s and early ’00s I was spending a lot of my professional time writing Java code and attempting to process large XML and SGML files. At some point within that work, I was introduced to a fairly obscure language called OmniMark.
At the time, OmniMark was in its early days and was a bit more freely available than it is today (it’s proprietary, and expensive, now)…but it was still pretty obscure. Remember, these were the days before StackExchange, and Google was still just a baby…so getting help and finding information about obscure languages was still mostly the domain of ‘physical books’. There were only two books I ever found related to OmniMark (I bought both, and read each multiple times):
Those books - along with countless hours of trial and error - were my introduction to large (and mostly unstructured) data processing via streaming technology.
I’ll be honest, it took me a while to wrap my head around the concepts (and the syntax), but once I got it, it was pretty cool…and very powerful (for our needs).
Skip to now, and big data (and processing it in real time) is a mainstream topic/challenge.
Now there are lots of different approaches and moving parts around the topic…everything from Hadoop/MapReduce, to ZeroMQ, to one of the newest and most interesting approaches, Apache Spark.
Honestly I’ve only started to scratch the surface with Spark so far (been reading through the Learning Spark book from O'Reilly)…but there’s already a lot to like and be impressed with. For me personally, here are some of the really key things I’m liking so far:
1. You can play with it directly in Python (it also has full support for Scala and Java if that’s more your speed).
2. The streaming concepts map pretty closely to my old OmniMark knowledge/thoughts (which I thought I had blocked out of my head long ago).
3. The actual syntax and specifics are fairly small and simple to learn (i.e. there’s not a ton to have to know to get started).
4. Setting it up is fairly painless (compared to what you’ve had to do historically to start playing with ‘big data’).
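To give a rough flavor of points 1 and 3, here is a minimal sketch of counting log levels in a big log file with Spark from Python. It is only an illustration under some assumptions: the log format (date, time, level, message), the function names, and the idea of counting levels are all made up for this example, and the Spark part assumes you have the `pyspark` package installed and are running in local mode.

```python
from collections import Counter


def parse_level(line):
    """Return the log level from a line shaped like 'DATE TIME LEVEL message'.

    The log format is an assumption made for this example.
    """
    fields = line.split()
    return fields[2] if len(fields) >= 3 else "UNKNOWN"


def count_levels_locally(lines):
    """Plain-Python version of the same idea, handy for checking the parsing."""
    return dict(Counter(parse_level(line) for line in lines))


def count_levels_with_spark(path):
    """Count log levels in the file at `path` using Spark in local mode."""
    # Imported here so the rest of the file still works without pyspark.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[*]")       # use all local cores, no cluster needed
        .appName("log-levels")    # hypothetical app name
        .getOrCreate()
    )
    try:
        lines = spark.sparkContext.textFile(path)  # one record per line
        pairs = lines.map(parse_level).map(lambda level: (level, 1))
        counts = pairs.reduceByKey(lambda a, b: a + b)  # sum the 1s per level
        return dict(counts.collect())
    finally:
        spark.stop()
```

Calling `count_levels_with_spark("server.log")` (with whatever path you actually have) should give back a dictionary of level to count. The whole thing is a handful of chained calls (`textFile`, `map`, `reduceByKey`, `collect`), which is roughly what I mean by there not being a ton to learn before you can get started.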
So…if you’re fighting with large log or data files, especially if you’ve got a desire to do something with them in as near real-time as possible, I would highly recommend you take a look into Apache Spark…
Kevin also talks in more depth about many of these things around twice a month via his drip campaign and has a day job as CTO of Veritonic. You can also check out some of his open source code on GitHub or connect with him on Twitter @falicon or via email at kevin at falicon.com.
If you have comments, thoughts, or want to respond to something you see here I would encourage you to respond via a post on your own blog (and then let me know about the link via one of the routes mentioned above).