Datameer kicked off their first Big Data & Brews on the East Coast at Strata + Hadoop World New York. Datameer sat down with Tony Baer, principal analyst at Ovum Research, to discuss where Spark fits into the Hadoop ecosystem.
Watch Part 1 of Big Data & Brews with Tony Baer here.
TRANSCRIPT:
Andrew: All right. Here we are at another Strata + Hadoop World. I’m Andrew Brust, the senior director of technical product marketing and evangelism at Datameer. With me is Tony Baer, and he is the senior mucky-muck. No, principal analyst for Information Management at Ovum.
Tony: I think so. That’s what they tell me.
Andrew: A good pal as well. We are here to talk about what we’re seeing in the industry, what you’re seeing in the industry, and what new stuff this show has brought up in the world of Hadoop and big data. We also need to drink to that.
Tony: It wouldn’t be possible without this.
Andrew: That’s sort of a prerequisite for Big Data and Brews.
Tony: It’s about the free beer.
Andrew: First of all, let me ask you a question.
Tony: Ask.
Andrew: This show does have the word Hadoop in the title, right?
Tony: Yeah.
Andrew: Is it just about Hadoop anymore?
Tony: It’s really about a whole ecosystem. The fact is that it started getting into this contemplation of what is Hadoop? The fact is that what you’re really looking at is a big data platform and ecosystem of technologies, and hopefully you’re working with technology providers that will simplify all this, because the result is that you want to take advantage of innovations in scale-out clusters, commodity hardware, commodity software, so you can get results that are not commodity. How’s that?
Andrew: I like the irony there.
Tony: Yeah.
Andrew: Cheers. At least on the storage side of Hadoop, HDFS, the Hadoop Distributed File System, is that a unifying aspect or is even that …
Tony: Oh, man.
Andrew: Melting a little bit.
Tony: The thing is I’ve been having this discussion. I probably should be wearing all three lanyards. We started off with a MapR lanyard. I got the Cloudera. I got the Hortonworks, so I probably should’ve worn all three. The thing is what makes Hadoop Hadoop? Once upon a time you could say it was HDFS and MapReduce. You really can’t say that anymore. If you’re going to get down to the technical side of it, it’s basically a very compatible API, so that what should theoretically work on one Hadoop platform should work on another. Obviously, reality is never quite so black or white or perfect.
Andrew: You know why I’m asking all this because of the S word. Spark.
Tony: You gave me a great platform for this a few months ago during Spark Summit. There’s been I think a lot of confusion, a lot of FUD out there, that it’s basically that, oh, Spark is going to replace Hadoop. I had a wonderful dinner with the Databricks folks last night. They’re very good friends of mine.
Andrew: Databricks, of course. The principals of that company are the creators of Spark.
Tony: Right. Right. Exactly.
Andrew: Their cloud platform is all about Spark.
Tony: The idea of simplifying Spark and making Spark accessible, so you can use it with tools, such as Datameer for instance. Their survey basically said that roughly about half, 49 percent I think was the exact figure, were using Spark basically stand alone and that I think maybe…
Andrew: Stand alone meaning not running on a Hadoop cluster.
Tony: Exactly, it’s on a cluster that’s equipped with a JVM and Linux. Technically not bare metal, but as far as data platform, it’s bare metal.
Andrew: Okay. What did you think of that statistic?
Tony: What I thought of that statistic is that if you’re running it in a cloud and you’re running it in an environment like that or Spark-as-a-service, of course, it’s going to be run on its own because you’re not using it for anything else. My contention is that that model makes sense if you’re working on a specialized cloud service, Spark-as-a-service, you’re solving a specific problem, the data is coming from external sources, such as let’s say Internet of Things, and/or you’re doing this basically as a proof of concept.
Andrew: The idea, though, is that the premise is you’re bringing data to Spark, so the idea of running it on an existing cluster is almost beside the point in these specific use cases.
Tony: In this case, yeah, because the thing is do you need all the other services that Hadoop would provide if that’s the only thing you were running?
Andrew: Sure.
Tony: We’ve seen a lot of that. You can see repetitions or I should say similarities in the data warehousing world, let’s say, like Teradata, Oracle. They have basically workload optimized platforms. Okay. It’s not a new idea to put something on a platform that’s basically tailored for the workload. On the other hand, I’m thinking from a standpoint that ultimately if you’re going to bring this in house into production, will you have the IT staff resources to maintain a whole separate compute silo and a whole separate data governance silo and data management silo?
Andrew: Sure.
Tony: Yes, there are going to be some organizations that have very deep resources that will be able to do that. I don’t think that’s going to make sense for the enterprise mainstream.
Andrew: Those are the same companies that used Hadoop in the very early days.
Tony: Exactly. You’re talking about the classic early adopters, and those that are going to push Spark to its limits, so in that instance, yes. I could very much see a future for stand alone, but I think that’s going to be for basically the bleeding edge. I don’t see it, as I said, for when Hadoop and Spark get into the mainstream. The other thing also is that a lot of organizations are going to be using Spark and not realizing that they are because let’s say tools like Datameer, let’s say, package Spark under the hood.
Andrew: Right.
Andrew: That’s actually our whole value proposition…
Tony: Yeah. We’re getting real-time operational results. Real-time operational analytics.
Andrew: The engine that brings it to you should not have to be your concern.
Tony: Exactly. You just care about the results. That buttresses even more that ultimately most Spark workloads are going to run within a data platform and not stand alone. It’s not to say there won’t be Spark stand alone, but it won’t be the majority.