We’ve been on this big data adventure for a while. Not everything is shiny and new anymore. In fact, some technologies may be holding you back. Remember, this is the fastest-moving area of enterprise tech, so much so that some software acts as a placeholder until better bits arrive.
Those upgrades (or replacements) can make the difference between a successful big data initiative and one you’ll be living down for the next few years. Here are some elements of the stack you should start to think about replacing:
1. MapReduce. MapReduce is slow. It’s rarely the best way to go about a problem. There are other execution models to choose from; the most common is the DAG (directed acyclic graph), of which MapReduce can be considered a subset. If you’ve done a bunch of custom MapReduce jobs, the performance difference compared to Spark is worth the cost and trouble of switching.
2. Storm. I’m not saying Spark will eat the streaming world, although it might, but if you want a lower-latency alternative to Spark, technologies like Apex and Flink are better choices than Storm. Besides, you should probably evaluate your latency tolerance and whether the bugs you have in your lower-level, more complicated code are worth a few extra milliseconds. Storm doesn’t have the support that it could, with Hortonworks as the only real backer, and with Hortonworks facing increasing market pressure, Storm is unlikely to get more attention.
3. Pig. Pig kind of blows. You can do anything it does with Spark or other technologies. At first Pig seems like a nice “PL/SQL for big data,” but you quickly find out it’s a little bizarre.
4. Java. No, not the JVM, but the language. The syntax is clunky for big data jobs. Plus, newer constructs like lambdas have been bolted onto the side in a somewhat awkward manner. The big data world has largely moved to Scala and Python (the latter when you can afford the performance hit and need Python libraries or are infested with Python developers). Of course, you can use R for stats, until you rewrite it in Python because R doesn’t have all the fun scale features.
5. Tez. This is another Hortonworks pet project. It’s a DAG implementation, but unlike Spark, Tez is described by one of its developers as like writing in “assembly language.” At the moment, with a Hortonworks distribution, you’ll end up using Tez behind Hive and other tools — but you can already use Spark as the engine in other distributions. Tez has always been kind of buggy anyhow. Again, this is one vendor’s project and doesn’t have the industry or community support of other technologies. It doesn’t have any runaway advantages over other solutions. This is an engine I’d look to consolidate out.
6. Oozie. I’ve long hated on Oozie. It isn’t much of a workflow engine or much of a scheduler, yet it’s both and neither at the same time! It is, however, a collection of bugs for a piece of software that shouldn’t be that hard to write. Between StreamSets, DAG implementations, and the like, you should have ways to do most of what Oozie does.
7. Flume. Between StreamSets and Kafka and other solutions, you probably have an alternative to Flume. Its release of May 20, 2015, is looking a bit rusty. You can track the year-on-year activity level and watch it drop. Hearts and minds have left. It’s probably time to move on.
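
To make the point in item 1 a little more concrete, here is a rough sketch in plain Python (not Spark, and not any real engine's API) of why MapReduce can be viewed as one fixed shape of a more general stage graph. Every name in it (`map_stage`, `shuffle_stage`, `reduce_stage`, `run_pipeline`, the sample lines) is hypothetical and exists only for illustration, and the "DAG" here is just a linear chain, the simplest case.

```python
from collections import defaultdict


def map_stage(records, mapper):
    """Apply mapper to every record and flatten the (key, value) pairs."""
    return [pair for record in records for pair in mapper(record)]


def shuffle_stage(pairs):
    """Group values by key, as the shuffle between map and reduce would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_stage(groups, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(values) for key, values in groups.items()}


def run_pipeline(data, stages):
    """Run a linear chain of stages, feeding each output to the next stage."""
    for stage in stages:
        data = stage(data)
    return data


def tokenize(line):
    """Hypothetical mapper: emit (word, 1) for each word in a line."""
    return [(word.lower(), 1) for word in line.split()]


# MapReduce is the fixed three-stage shape: map -> shuffle -> reduce.
MAPREDUCE_JOB = [
    lambda lines: map_stage(lines, tokenize),
    shuffle_stage,
    lambda groups: reduce_stage(groups, sum),
]

# A more general chain simply keeps adding stages to the same job,
# instead of launching a second MapReduce job for the follow-up work.
LONGER_JOB = MAPREDUCE_JOB + [
    lambda counts: {w: c for w, c in counts.items() if c > 1},
    lambda counts: sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])),
]

lines = ["big data big plans", "data moves fast", "big bets"]
word_counts = run_pipeline(lines, MAPREDUCE_JOB)
frequent = run_pipeline(lines, LONGER_JOB)
print(word_counts)
print(frequent)
```

In this toy, the classic job is the first three stages and nothing more; the longer job tacks a filter and a sort onto the same chain. Real DAG engines also allow branching and joining stages, which this sketch leaves out.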