Stories by Christopher Nguyen on Medium

Did you enjoy “Algorithms of the Mind”?

Christopher Nguyen — Sun, 13 Sep 2015 18:54:03 GMT

If you’re reading this on a leisurely Sunday, perfect!

This is Christopher Nguyen, CEO of Adatao and one of the editors/contributors to Deep Learning 101.

It occurs to me that most everyone arrives here via Algorithms of the Mind, and may not have seen What You Must Know About Big Data and Machine Learning, the podcast I did with Sonal Chokshi of Andreessen Horowitz back in June.

Yes? Take a listen to Big Data/Machine Learning. It’s a good bridge to connect the dots from where we are today to where Algorithms of the Mind will lead us 50 years from now.

Enjoy & I look forward to hearing from you.

Did you enjoy “Algorithms of the Mind”? was originally published in Deep Learning 101 on Medium, where people are continuing the conversation by highlighting and responding to this story.

Inceptionism: Google Brain Imagination

Christopher Nguyen — Sun, 21 Jun 2015 17:35:26 GMT

Image Credit: Googler Mike Tyka’s Inceptionism Library

If you liked Algorithms of the Mind, you’ll love this: The “dreams” of Google’s AI are equal parts amazing and disturbing. Also read the original Google Research blog post.

Related, here’s a compelling example of “seeing with our brains and not with our eyes”, Now Blind Americans Can See with Device Atop Their Tongues.

And yes, Google is involved.

Christopher Nguyen — https://medium.com/@ctn

Inceptionism: Google Brain Imagination was originally published in Deep Learning 101 on Medium, where people are continuing the conversation by highlighting and responding to this story.

What You Must Know About Big Data and Machine Learning

Christopher Nguyen — Wed, 10 Jun 2015 08:45:18 GMT

Why “Volume, Velocity, and Variety” are wrong

Originally published at blog.adatao.com

A few weeks ago, Sonal Chokshi of Andreessen Horowitz and I chatted on an a16z podcast. Here’s her summary of that conversation:

On this episode of the a16z Podcast, Nguyen puts on his former computer science professor hat to describe “Big Data” in relation to “Machine Learning”— as well as what comes next with “Deep Learning”. Finally, the former Google exec shares how Hadoop and Spark evolved from the efforts of companies dealing with massive amounts of real-time information; what we need to make machine learning a property of every application (Why would we even want to?); and how we can make all this intelligence accessible to everyone.

“Machine Learning is to Big Data as
Human Learning is to Life Experiences”

We’ve heard from many people that this made so much more sense of Big Data and Machine Learning for them. So, I hope you’ll enjoy listening to this conversation about Big Data, Machine Learning, and the future of Deep Learning.

https://medium.com/media/237621aa956930f6670d3a756c74c1f3/href

Follow me on Twitter to keep informed of interesting developments on these topics.

Transcript

Sonal Chokshi: Hi everyone, welcome to the A16Z Podcast. This is Sonal and I’m here today with Christopher Nguyen from Adatao, which is a big data company and its mission is to democratize data intelligence and help people collaborate across the enterprise.

The best way to describe him is he’s an entrepreneurial scientist. He got his PhD from Stanford in Device Physics. He’s a former Google executive and as a professor, he started a computer engineering program at the Hong Kong University of Science and Technology. He’s basically an entrepreneurial scientist who’s merged the worlds of academia and doing a lot of startups.

Welcome, Christopher.

Christopher N.: Thank you, Sonal.

Sonal Chokshi: Actually, Christopher, maybe the way to kick this off is … I actually just want to start with big data. That’s a term that people throw around all the time; it’s completely overloaded, it’s a buzzword, it means so many things to so many different people. Could you start by just telling me what your definition of and take on big data is?

Christopher N.: There are two ways you can think of the term big data. There is what I think most of the world thinks about when people talk about big data, they think of the V’s, starting out as three V’s, volume, variety, velocity and so on. Then I think it’s now up to seven or eight different V’s, veracity, variance, and so on.

I actually don’t like that definition. I think that definition is functionally correct but it focuses on the problems of big data. These are the challenges that you have to deal with when you deal with big data but the definition skips or misses the part where it says you ask the question, “Why do you want to deal with these problems?” It turns out the reason for big data is machine learning.

“The reason for Big Data is Machine Learning”

Sonal Chokshi: The reason for big data is machine learning. That’s actually kind of counter-intuitive because I’ve actually heard it the other way around, that big data exists because of machine learning.

Christopher N.: I like something that Peter Norvig, the Director of Research at Google said when he referred to big data is, “Big data is not just quantitatively different but it’s qualitatively different.” In other words, there’s something that happens when you have enough data, it crosses a certain threshold.

For example, if you want to learn whether it hurts to hit your head against a brick wall, about five samples is probably very big data. If you want to learn how to classify images on the Internet, maybe two million samples are not big enough. It’s not so much a matter of how much data you have but how much is enough to learn from.

When companies like Google, I would say it’s one of the original big data companies, when they started their life the very first batch of data they dealt with was big data. The term big data does not exist in these companies. They’ve always learned to take advantage of this data to make a lot of decisions.

Sonal Chokshi: The way I’ve heard it is that big data … that machine learning is one of many uses for big data.

Christopher N.: Right.

Sonal Chokshi: But you’re basically arguing for something different, can you describe what that is and why?

Christopher N.: Sure. If you think about the V’s definition of big data they are all problematic. We tend not to want to have problems unless there’s a reason for it, there’s a greater benefit to pay that cost. The benefit of big data is really that we can unleash algorithms on it, and these algorithms can automatically detect patterns.

I want to jump into that right away, because a lot of us in machine learning say this all the time, “What does it mean to detect patterns and so on?” People take that for granted but then it’s a little fuzzy. The way I think about it is that machines learn from big data very much like human beings learn from life experiences.

“Machines learn from Big Data like
Humans learn from Life Experiences”

Sonal Chokshi: That’s actually interesting. I just want to hear more about why you make that analogy. You’re basically saying that machine learning is the way humans learn from life experiences, do you mean the way like a kid learns to navigate the world for the first time?

Christopher N.: Absolutely, that’s exactly right. For example, let’s turn, let’s flip that around and imagine would you like to have a child develop without any experiences and after 20 years what would that child, that person be like? Then why is it that we ascribe wisdom generally to older people than younger people?

Our brain capacity essentially remains about constant after a certain age, 16, 18, 20, whatever research you read, and yet wisdom continues to grow and accumulate and that’s … as the brain incorporates life experiences it is taking in a lot of big data, just like what machine learning algorithms do with data.

The opposite of that is rule-based computing or rule-based expert systems. You can come up with 10, 20, 30 rules and so on, but you can never come up with enough rules to handle the exceptions.

Sonal Chokshi: Exactly. Is machine learning for the exception handling then, or for everything? How does that work when you’re talking about computing?

Christopher N.: In a very real sense it is. You can think of it as for exception handling, but I like to think of it in terms of analogy as wisdom. You do have the rules but then you know when the rules don’t apply. The reason you know when the rules don’t apply is because you’ve seen three or four or five corner cases before. Somehow “intuitively” you find that in this situation that rule doesn’t apply, but what we think of as intuition are actually, you can think of as parameters inside a machine-learning model.

Sonal Chokshi: That’s interesting but how does … Just to be more concrete about that, that makes a lot of sense logically but concretely for businesses, like when you think about the business intelligence space and where we’ve been and where we are now, what’s different here? What’s happening? What do we get out of it? Basically, I guess I’m asking.

Christopher N.: Right, that’s a great question. Even the term business intelligence, sometimes we’re captured by what we meant in the past, and so what we said in the past was BI.

Sonal Chokshi: BI, being business intelligence.

Christopher N.: Exactly, business intelligence, can be self-limiting. In other words what business intelligence was, was limited by what was available. What was available was the ability to essentially look backward. You can ask a lot of questions using what we call aggregations.

Sonal Chokshi: Aggregations.

Christopher N.: You have a whole bunch of transactions that come in from all over the world and you can say, “Well, how much revenue did we make yesterday from that particular region of the world?” This is backward-looking information, because that’s all we were capable of doing and because there was a particular lack of something, and that something was big data. With enough data from all of those experiences, what we can do is build a model out of that and project into the future.

You can think of business intelligence going forward as the ability to apply machine learning algorithms to big data and not just look at past questions but also future questions. We’re asking to predict the unknowns from the knowns.
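As a rough illustration of that contrast, here is a minimal Python sketch. The transaction records, region names, and revenue figures are invented for this example, and the "model" is an ordinary least-squares line, chosen only because it is the simplest thing that projects forward; the conversation does not name a specific algorithm.

```python
from collections import defaultdict

# Hypothetical transactions: (day, region, revenue). All values are made up.
transactions = [
    (1, "Europe", 100.0), (1, "Asia", 80.0),
    (2, "Europe", 120.0), (2, "Asia", 90.0),
    (3, "Europe", 140.0), (3, "Asia", 100.0),
]

def revenue_by_day(rows, region):
    """Backward-looking aggregation: total revenue per day for one region."""
    totals = defaultdict(float)
    for day, row_region, revenue in rows:
        if row_region == region:
            totals[day] += revenue
    return dict(totals)

def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# The knowns: what already happened in Europe on days 1 to 3.
europe = revenue_by_day(transactions, "Europe")

# The unknown: a projection for day 4 from the fitted trend.
slope, intercept = fit_line(sorted(europe.items()))
forecast_day_4 = slope * 4 + intercept
```

The first function answers "what happened yesterday?"; the last two lines answer "what is likely to happen tomorrow?" from the same data. A real system would use far more data and a richer model, but the shift from aggregation to projection is the same.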

“Business Intelligence will become predicting the Unknowns from the Knowns”

Sonal Chokshi: What’s changed to make that possible? Because in the days of business intelligence, I think of stuff, the products that SAP and similar companies put out. What’s changed to make big data possible? I know the big obvious things are just more computing power, but more concretely like what’s physically making this possible to be able to parse and get all these, get these insights out of this data?

Christopher N.: Right. If you think about it, a lot of people have pointed out that big data has always existed. It’s always been there, we just didn’t collect it. Then the second insight that I think about is that we don’t necessarily get smarter over time. It’s just that certain technologies get cheaper, they become more available.

Machine learning algorithms have always been around. The data that exists that you could collect has always been around, but it wasn’t until the advent of things like the Hadoop Project, and the launch of companies like Cloudera and MapR back in 2009, that it became affordable for many, many more companies to begin acquiring and storing a lot of this data.

Sonal Chokshi: I’m actually glad you brought up Hadoop, Christopher, because one of the things that I see a lot in reading about the big data space is a lot of myths and misconceptions around what Hadoop is, what Spark is. Because now we talk a lot about Apache Spark and we have a lot of, at A16Z full disclosure, we have investments in every level of the BDAS, the Berkeley Data Analytics Stack coming out of the AMPLab.

Can you talk to us a little bit more about what exactly Hadoop does, and what Spark does, and how they all live together, and then, how that actually fits in to big data? For people who don’t actually crunch those numbers behind the scenes?

Christopher N.: Sure. I think we can look at it from two perspectives. I think that there is a top-down view and there is the bottom-up view. Let me start with the bottom up view because that’s how technology is always developed. We always build things from the bottom up and then we realize there’s a pattern here, and then we look top-down again.

From the bottom-up view, Hadoop is primarily a storage layer. There is the HDFS, the Hadoop file system. The distinction between that particular file system and other file systems in the past, I think the essential difference is that it is highly parallel.

Sonal Chokshi: Parallel, in terms of parallel processing?

Christopher N.: Parallel with storage, replication, and so on, so that you can have a lot of resiliency, and then it also is capable of running on commodity hardware. For the first time people can afford to buy many terabytes of storage and store it reliably, and still pay only a little amount for that.

Sonal Chokshi: Sorry, just to take a step back for a moment, the reason Hadoop and its ilk were able to run on commodity hardware is because the hardware has gotten cheap enough, or because the way that it processes and the way it’s architected it’s optimized for that? They could be the same, in fact in the end, but I do think it’s important to understand what the driver of that is.

Christopher N.: I think it’s both. It’s a supply and demand thing where sometimes the demand creates supply or sometimes the supply creates the demand. I think you can trace back, again, to companies like Google that started in the late 90's and early 2000's, and that started to use a lot of this commodity hardware.

Then there’s also Moore’s Law making everything cheaper, essentially doubling the capacity that you can afford every 18 months. Add to that the companies that have gone down this path, proving that there is something valuable about accumulating all this data and making decisions from it.

It’s all that intuition, as well as the actual economics of hardware prices going down, and the availability of open source projects. I think all of these things, the elements come together to essentially create the big data movement.

Sonal Chokshi: Where does Spark fit into that?

Christopher N.: Spark, if you go, continuing with this bottom-up view, if you start from the storage level, and you know that storage is not enough. You can’t just store things …

Sonal Chokshi: Right, you’re not going to get insights out of just collecting them.

Christopher N.: Exactly, interestingly lots of database implementations in companies, people actually do put data in and never get anything out. In any computing stack you need more than just storage. You need a compute layer.

Sonal Chokshi: The first layer that you’re describing is the big data layer. That’s how you’re describing big data, it’s like storage.

Christopher N.: That’s exactly right.

Sonal Chokshi: I think right now people think about big data as actually getting the insights and analytics out of it, but you’re actually saying big data is just getting the big data, that many signals, and saving them in a certain place.

Christopher N.: For the purpose of being precise, I’m going to slice this up into levels so that we can refer to them more accurately. At the bottom layer we’ve got this big data and then above that we need big compute, in order to process all of this big data.

Sonal Chokshi: The storage layer, the processing layer, and what is big compute?

Christopher N.: Big compute, the first example of big compute you can think of is MapReduce. MapReduce, I don’t mean in terms of the algorithm but I mean the actual implementation with the Hadoop Project, the Hadoop MapReduce. That’s a parallelized computing system that can take all this data, do some computation with it, and then put it back. Then maybe an aggregation, for example asking the same question, the example that I gave earlier, “How much money did we make off of this widget out of Europe yesterday?” is an aggregation question.

If you have a thousand transactions you can do it with one machine, but if you have somehow stored 100 billion of these rows and you want to ask the same question, maybe you have to parallelize it. That’s what MapReduce allows you to do. Unfortunately, MapReduce was actually not designed originally to handle queries.
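To give a feel for how such an aggregation is split up, here is a minimal single-process Python sketch of the map, shuffle, and reduce steps. It only imitates the programming model; it is not the Hadoop MapReduce API, and the rows, the two-way partitioning, and the function names are all assumptions made for illustration.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical rows: (widget, region, revenue). All values are made up.
rows = [
    ("widget", "Europe", 10.0),
    ("widget", "Asia", 7.0),
    ("gadget", "Europe", 3.0),
    ("widget", "Europe", 5.0),
]

def map_phase(row):
    """Emit (key, value) pairs; here the key is (widget, region)."""
    widget, region, revenue = row
    return [((widget, region), revenue)]

def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(values):
    """Combine all values for a single key into one total."""
    return reduce(lambda a, b: a + b, values, 0.0)

def map_reduce(partitions):
    # Each partition stands in for the slice of data held by one machine.
    pairs = [pair
             for part in partitions
             for row in part
             for pair in map_phase(row)]
    return {key: reduce_phase(values) for key, values in shuffle(pairs).items()}

# Split the rows across two pretend machines and aggregate.
totals = map_reduce([rows[:2], rows[2:]])
```

Because the map step looks at one row at a time and the reduce step looks at one key at a time, both can in principle be spread over as many machines as the data requires, which is the property the passage above relies on.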

Sonal Chokshi: First they only have two functions, map and reduce. Is that the reason, or is it because …

Christopher N.: Actually the reason is a little deeper and more pragmatic than that. Interestingly a lot of people may not realize that MapReduce was designed to be slow.

Sonal Chokshi: That is interesting. I didn’t know that.

Christopher N.: Let me unpack that a little bit. MapReduce was implemented at Google by Jeff Dean and Sanjay Ghemawat back in the early 2000's, and then they published their work in 2004. That MapReduce engine at Google was intended to do one particular job, and that job was to crawl and index the web. To do that, Google’s approach was to parallelize it over thousands of machines.

When you have thousands of commodity machines doing a task that may last half a day, the probability of one of those machines going down is approaching 1. In fact it is about 1, any single machine could go down. When a machine goes down a question comes up, “Do we start the job over?” Certainly, you don’t want to have to do that because then it will never finish. It’s designed in such a way that if any single machine goes down another machine can be brought up and pick up where it left off.
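The arithmetic behind "approaching 1" can be sketched in a few lines of Python. The per-machine failure probability of 0.1% per job is an assumed figure for illustration, not a number from the conversation, and the calculation assumes machines fail independently.

```python
def prob_any_failure(machines, p_single):
    """Probability that at least one of `machines` independent machines fails."""
    return 1.0 - (1.0 - p_single) ** machines

# Assumed: each machine has a 0.1% chance of failing during the job.
P_SINGLE = 0.001

for n in (1, 100, 1000, 5000):
    print(n, round(prob_any_failure(n, P_SINGLE), 3))
```

Under these assumptions a single machine almost never fails, a thousand machines fail more often than not, and at five thousand machines a failure somewhere in the cluster is close to certain. That is why restarting the whole job on every failure is not an option.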

Sonal Chokshi: Hence it’s slow enough to be able to do that.

Christopher N.: Right, and the way you ensure that reliability is to write down everything every step of the way. If you do job A, and then you write out the results of job A, and then you do job B, write out the results of job B.
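A toy Python sketch of that "write everything down" idea follows. It is not how Hadoop MapReduce is implemented; the file layout, step names, and the `run_with_checkpoints` helper are hypothetical and exist only to show why persisting each step buys restartability at the cost of extra disk writes.

```python
import json
import os
import tempfile

def run_with_checkpoints(steps, checkpoint_dir, data):
    """Run steps in order, persisting each result so a restart can resume."""
    executed = []
    for index, (name, func) in enumerate(steps):
        path = os.path.join(checkpoint_dir, f"{index}_{name}.json")
        if os.path.exists(path):
            # A previous run already finished this step: reload, don't redo.
            with open(path) as handle:
                data = json.load(handle)
            continue
        data = func(data)
        # Write the result to disk before moving on to the next step.
        with open(path, "w") as handle:
            json.dump(data, handle)
        executed.append(name)
    return data, executed

steps = [
    ("job_a", lambda xs: [x * 2 for x in xs]),
    ("job_b", lambda xs: sum(xs)),
]

with tempfile.TemporaryDirectory() as workdir:
    first_result, first_run = run_with_checkpoints(steps, workdir, [1, 2, 3])
    # Simulate a replacement machine taking over: nothing is recomputed.
    second_result, second_run = run_with_checkpoints(steps, workdir, [1, 2, 3])
```

On the second call every step finds its result already on disk, so no work is repeated. The price is one disk write per step, which is the source of the slowness described above.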

Sonal Chokshi: What does Spark do differently?

Christopher N.: Spark takes a different approach and as I said earlier, it’s not that we get smarter, it’s just that the constraints have changed. Spark’s goal is to be able to do a lot of these queries very, very fast. We’ve always known, independent of the economics of hardware and software, we know that the speed of access to RAM is a lot faster than accessing disk. In fact from CPU to RAM, you’re talking about 40 nanoseconds.

Sonal Chokshi: Sorry, just to be clear, when you say the speed of accessing RAM is a lot faster than accessing disk, you’re just talking about how to get to the memory functions.

Christopher N.: That’s right, but generally the machinery knows … will feel the speed. Getting to information stored in RAM is about six orders of magnitude faster than getting to information stored on disk. Spark’s approach is to use memory. Now Spark, if it was created five, six years before its time would have completely failed because memory was so much more expensive.

Sonal Chokshi: Right, the hardware constraints were lifted there, that’s right.

Christopher N.: Exactly, and if it had been a few years after that, of course, something else would have come in before Spark. The timing of Spark has a lot to do with its success. What Spark does for you is give you very fast query processing that the MapReduce implementation of Hadoop doesn’t give you.

Sonal Chokshi: That helps us understand a little bit more of the difference between Hadoop and Spark. You’re basically talking about the in-memory aspect of it, being able to do things a lot faster.

Christopher N.: That’s right.

Sonal Chokshi: What does that give us concretely for big data and machine learning?

Christopher N.: That takes me to the top-down view, which is, from the top down, we know that we always want things fast but then we also want them cheap.

Sonal Chokshi: Sorry, just to be … cantankerous for a second, why do we want things to be fast? Actually, like why do we always want them to be fast? What do we actually get out of that?

Christopher N.: Fast is competitiveness. If you can get your answer five minutes before I can, you’ll make decisions and then you’ll make that purchase, you make that buy, to supply whatever it is, that will happen before I get there, and you win.

Sometimes it’s implicitly obvious that we want everything faster because fast is competitive, but it turns out the difference between fast and slow is very, very critical. When you can get something in real time or you can get something in five seconds as opposed to five minutes, you will actually change your workflow.

You will actually do something, that’s what I learned from the consumer perspective with things like Gmail and so on. We had a phrase we call the “five-second barrier.” If the user can’t get something done within five seconds they won’t ever do it. It’s not like they’ll do it at twice the latency. Fast enables new, different use cases that may otherwise not happen.

“We had something called ‘The Five-Second Barrier’”

Sonal Chokshi: That’s actually helpful because I think we tend to take it for granted that fast is better. I know obviously we want that information faster, but you’re basically talking about enabling entirely new workflows and use cases. Going back to what you were saying about the top down approach and where this fits …

Christopher N.: From the top-down perspective now we have the capability. We have this big compute layer and then we have this big data layer. So we have the capabilities of actually doing things very fast on massive amounts of data. We can apply algorithms. We can bring all these algorithms to bear on big data and get a lot of insight, but that’s still not enough because we haven’t put the human in this place yet. It’s still machines and that’s the problem with the bottom-up approach.

When you look from the top down there are humans sitting at the command and control, at the insight layer, and they’ve got to make decisions. So far, our industry has not built the bridge from all of this machinery to that human user. There’s a layer missing.

Sonal Chokshi: The learning basically.

Christopher N.: The learning, as well as the application layer. The interfaces, the user experience, all of that put together can be thought of as the Big Apps layer. I’d like to think of things in terms of big apps, on top of big compute, on top of big data. When you have these three working effectively, harmoniously together then you have a very, very good big data stack.

“Big Apps, on top of Big Compute, on top of Big Data”

Sonal Chokshi: Taking a step back for a moment, it seems obvious why the natural interfaces are pretty important for people to be able to interact with it. Let’s face it, the reason we are able to do any kind of computing is because of the GUI, like having a graphical user interface that allows us to not have to see the plumbing behind the scene. That seems pretty obvious that we need that.

What did that actually get you in the big data world? Sure, you can more easily read your data and get some insights from it, but I just feel like we throw that term around too much that we need a better interface to our data. What does it really get us?

Christopher N.: I think one way to understand it is to look back into the past. We went from the typewriter to the computer keyboard, and to the mouse, and now to touch screen and so on. You could ask the same question, what do touch screens get us, and why didn’t we do it before?

The reason touch screens and finger gestures and so on are valuable is because they’re much more natural than using a keyboard, but the reason we didn’t have that before is because the hardware and the software to make that happen were not available, or too expensive to do so.

The same analogy applies with big data machine learning. We could imagine all those capabilities before but they were too expensive. We didn’t have the storage capabilities for all of the data and we didn’t have the big compute capacity to do all of this. But now that we do and they’re affordable what you will see is that all of this machine learning will be a property of every application.

“Machine Learning will be a property of every application”

Sonal Chokshi: What does that mean? I’ve heard Peter [Levine] say that as well. He makes the argument as well, that machine learning will be a property of every application as opposed to a stand-alone, isolated function, what does that actually mean?

Christopher N.: Imagine a world where, let’s say … We work with a lot of people and we expect our colleagues to remember what we say and learn from the interactions and so on and so forth. Can you imagine a world where your colleagues are just simple automatons and they don’t understand what you’re saying, and you told them something and they don’t remember it the next day, and their actions don’t change as a result of that?

I claim that there will be a day very soon when you will feel that way about the machines you work with. In other words, you would expect that to be a property of all of these machines.

“We would expect Learning to be a property of all our machines”

Sonal Chokshi: You’re right. I think we already do expect it because we carry mobile phones with us all around, and it’s frustrating when you have a certain experience there that you can’t have with an application you’re using at work, on your desktop or anything else.

Christopher N.: Exactly.

Sonal Chokshi: I definitely think you’re right, that we might already even be there in some way, or that we need to be expecting that, but what does that really give us? Because when I think about big data, I think about it in the abstract. It’s still not clear to me what machine learning being a property of every application, what does that do for us?

Christopher N.: I’m going to give you an example by a story from one of the … There’s something called TGIF at Google, which actually we do at our company, Adatao, today as well, which is every Friday the execs basically come out and talk about almost every company secret possible to the whole company. People can ask any kind of questions that they want.

I remember there was one time when, at Google we were dealing with the problem of latency. Google cares a lot about speed. Larry was pushing everyone to make their services a lot faster. There was a question people asked: “Hey Larry, we went from one second search delay to 500 milliseconds, and 300 milliseconds, 100 milliseconds. What do you want? I mean, what happens when we get to zero?” Then what Larry said was, “Why stop at zero? Why can’t it be negative latency?”

“Why can’t it be negative latency?”

Sonal Chokshi: Right.

Christopher N.: Essentially what he meant was, “Why can’t our machines anticipate what we need? What we want to do.” That’s actually not as ridiculous as it may sound. Certainly, as human beings we do anticipate each other. Maybe if you see that I’m coughing or something, that you just go help me with a cup of water. Right now, I still have to tell the machine, even if we have a robot today to do that, I have to still ask that robot to do that.

Sonal Chokshi: You have to specify …

Christopher N.: What we get with predictive algorithms, what we get with machine learning, what we get with big data, remember big data is just life experiences, what we get with that is that our machines will be able to learn. They will be able to anticipate. They will be able to predict. They will have behaviors that we normally expect of humans, of intelligent beings.

Sonal Chokshi: When we take that to every application though, because why is it not okay to have it be an isolated, stand-alone thing? What do we get out of it when it becomes a part of every application?

Christopher N.: I think when it becomes part of every application then every component of the application will be receiving data all the time, maybe the screen is receiving my gestures, maybe my calendar is receiving appointments that I’m making, maybe even the location that I am at. Then they will be able to learn from all of this and make intelligent decisions about what calendar events to insert, what gestures to accept, and maybe I don’t even have to say that, it will just do that ahead of time for me.

In my view, in that world things will happen a lot better for me. It will become a lot easier for me to move around. It will become a lot easier for me to make decisions, and maybe a lot of decisions will also be suggested to me before I even have to think about it too much.

Sonal Chokshi: A lot of what you’re talking about is machines inferring and really aiding, learning like humans and helping augment human intelligence. What happens next?

Christopher N.: I think that’s a great question. I think if you back up and think about human evolution, there’s one variable that’s inexorably increasing. We may get taller, shorter, we may go from one continent to the next and so on, but one thing that’s been a single variable that’s constant or changing in one direction, that’s human intelligence. In fact our species intelligence. There’s absolutely no reason to think that we’re at the end of that. I think we’re just at the very beginning of that increasing intelligence.

“Intelligence is the inexorably increasing property in Evolution”

A lot of the things that we’re learning about machine learning itself, I’m really excited about that. If you look at the research in deep learning, what’s happening there, really in just the last 12 months, 24 months, to me, the exciting thing is that we’re learning so much about how our brains might work. It’s not just what the machines can do, but what they teach us about ourselves.

If you think about it from that perspective and think about how these algorithms are evolving, you actually see this very near future where human intelligence is going to be boosted by all this machine intelligence. That will actually change how we think about Evolution.

Sonal Chokshi: It’s interesting because people treat deep learning sometimes as just great for machine learning but you’re basically putting it on the same continuum, and saying it’s just more machine learning?

Christopher N.: I think deep learning just happens to be one moniker of today, but it is a very important one because it’s showing some of us glimpses of the future, more so than at any time in the past. I think that’s the exciting thing.

What we’re doing, coming back, is the software that we’re building is essentially machine intelligence aiding human intelligence. Today, I would say in very primitive ways, I think it’s very helpful to enterprise but we’re just at the beginning of it. The next set of products we’re going to be building in deep-learning capabilities. We already have machines inside the company that can talk to each other.

It’s happening a lot faster than people are realizing, and I see it as our job to make sure that we, as the human species continue to leverage that power, as opposed to maybe one day be subjugated by it.

“We want our species to leverage that power, not be subjugated by it”

Sonal Chokshi: No, totally. Just one last question then, concretely what do we get out of that? It’s interesting academically and clearly it’s interesting beyond academically, because companies are investing in it left and right. In fact, more so in the corporate sphere than even in the university sphere, but what do we get out of that deep learning? What concretely comes out of that?

Christopher N.: I like to think of it in two ways, and I think they’re both concrete but perhaps one is more concrete than the other to some people’s views. Certainly, companies are helped when they have more intelligence about their data. People talk about, in the past you didn’t even know what was going on at the company level, let alone make decisions based on it.

We’re coming to an age where you know what’s going on and the machines are also helping you make decisions. What you get out of it is competitiveness. Companies that invest in this and are good at this, that are data-intelligent, that are data-driven will win. That’s a competitive edge. That’s inevitable, but I think the larger picture is also that, as a species, we’re explorers. It’s built into our genes, and you can count on that as being inevitable.

“Like space exploration, this is exploration of the mind

Left alone, we’ll figure out that these are exciting frontiers that we will explore. We will always want to build intelligence. We will always want to build images of ourselves, if you will. Maybe the intelligence that emerges will not be the same as human intelligence, but we will attempt all of this. It’s just like space exploration; this is exploration of the mind.

Sonal Chokshi: Using the computer. That’s great. Thank you, Christopher from Adatao, and that’s another episode of the a16z Podcast. Thanks everyone.

What You Must Know About Big Data and Machine Learning was originally published in Deep Learning 101 on Medium, where people are continuing the conversation by highlighting and responding to this story.

Algorithms of the Mind

Christopher Nguyen — Fri, 22 May 2015 09:27:31 GMT

What Machine Learning Teaches Us About Ourselves

Originally published at blog.arimo.com.
Follow me on Twitter to keep informed of interesting developments on these topics.

“Science often follows technology, because inventions give us new ways to think about the world and new phenomena in need of explanation.”

Or so Aram Harrow, an MIT physics professor, counter-intuitively argues in “Why now is the right time to study quantum computing”.

He suggests that the scientific idea of entropy could not really be conceived until steam engine technology necessitated understanding of thermodynamics. Quantum computing similarly arose from attempts to simulate quantum mechanics on ordinary computers.

So what does all this have to do with machine learning?

Much like steam engines, machine learning is a technology intended to solve specific classes of problems. Yet results from the field are indicating intriguing—possibly profound—scientific clues about how our own brains might operate, perceive, and learn. The technology of machine learning is giving us new ways to think about the science of human thought … and imagination.

Not Computer Vision, But Computer Imagination

Five years ago, deep learning pioneer Geoff Hinton (who currently splits his time between the University of Toronto and Google) published the following demo.

https://medium.com/media/682bf3bb68b933897ceeded34e9713ee/href

Hinton had trained a five-layer neural network to recognize handwritten digits when given their bitmapped images. It was a form of computer vision, one that made handwriting machine-readable.

But unlike previous works on the same topic, where the main objective is simply to recognize digits, Hinton’s network could also run in reverse. That is, given the concept of a digit, it can regenerate images corresponding to that very concept.

https://medium.com/media/8dee0a9682fbb15b4d7e11f460b049ee/href

We are seeing, quite literally, a machine imagining an image of the concept of “8”.

The magic is encoded in the layers between inputs and outputs. These layers act as a kind of associative memory, mapping back-and-forth from image and concept, from concept to image, all in one neural network.
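To make the idea of one set of weights mapping in both directions concrete, here is a minimal sketch. It is not Hinton's five-layer network: it swaps in a much simpler technique, a single-layer bidirectional associative memory with Hebbian weights. The 3x5 bitmaps, the two-element concept codes, and the function names `recognize` and `imagine` are all assumptions made up for illustration.

```python
# Toy bidirectional associative memory (an assumed stand-in, NOT Hinton's network).
# One weight matrix is used forward (image -> concept) and backward (concept -> image).

ONE = [".#.", ".#.", ".#.", ".#.", ".#."]      # hypothetical 3x5 bitmap of "1"
EIGHT = ["###", "#.#", "###", "#.#", "###"]    # hypothetical 3x5 bitmap of "8"


def to_vector(rows):
    """Flatten a bitmap into a bipolar vector: '#' -> +1, '.' -> -1."""
    return [1 if ch == "#" else -1 for row in rows for ch in row]


def to_rows(vector, width=3):
    """Turn a bipolar vector back into bitmap rows."""
    chars = ["#" if v > 0 else "." for v in vector]
    return ["".join(chars[i:i + width]) for i in range(0, len(chars), width)]


def sign(values):
    return [1 if v >= 0 else -1 for v in values]


# Concept codes are chosen to be orthogonal so that recall in reverse is exact.
CONCEPTS = {"1": [1, 1], "8": [1, -1]}
IMAGES = {"1": to_vector(ONE), "8": to_vector(EIGHT)}


def train(images, concepts):
    """Hebbian rule: add the outer product of each (image, concept) pair."""
    n = len(next(iter(images.values())))
    m = len(next(iter(concepts.values())))
    weights = [[0] * m for _ in range(n)]
    for name, x in images.items():
        y = concepts[name]
        for i in range(n):
            for j in range(m):
                weights[i][j] += x[i] * y[j]
    return weights


def recognize(weights, image):
    """Forward pass: image vector -> concept code."""
    m = len(weights[0])
    return sign([sum(weights[i][j] * image[i] for i in range(len(image)))
                 for j in range(m)])


def imagine(weights, concept):
    """Reverse pass: concept code -> image vector, using the same weights."""
    return sign([sum(w * c for w, c in zip(row, concept)) for row in weights])


WEIGHTS = train(IMAGES, CONCEPTS)
print("\n".join(to_rows(imagine(WEIGHTS, CONCEPTS["8"]))))
```

Running the reverse pass on the code for "8" prints the stored bitmap of an 8, and the forward pass still recovers the right concept when a pixel of the input is flipped. A real generative network learns far richer, layered encodings, but the back-and-forth use of one set of weights is the same basic move.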

“Is this how human imagination might work?

But beyond the simplistic, brain-inspired machine vision technology here, the broader scientific question is whether this is how human imagination — visualization — works. If so, there’s a huge a-ha moment here.

After all, isn’t this something our brains do quite naturally? When we see the digit 4, we think of the concept “4”. Conversely, when someone says “8”, we can conjure up in our mind’s eye an image of the digit 8.

Is it all a kind of “running backwards” by the brain from concept to images (or sound, smell, feel, etc.) through the information encoded in the layers? Aren’t we watching this network create new pictures — and perhaps in a more advanced version, even new internal connections — as it does so?

On Concepts and Intuitions

If visual recognition and imagination are indeed just back-and-forth mapping between images and concepts, what’s happening between those layers? Do deep neural networks have some insight or analogies to offer us here?

Let’s first go back 234 years, to Immanuel Kant’s Critique of Pure Reason, in which he argues that “Intuition is nothing but the representation of phenomena”.

Kant railed against the idea that human knowledge could be explained purely as empirical and rational thought. It is necessary, he argued, to consider intuitions. In his definitions, “intuitions” are representations left in a person’s mind by sensory perceptions, whereas “concepts” are descriptions of empirical objects or sensory data. Together, these make up human knowledge.

Fast-forward two centuries: Berkeley CS professor Alyosha Efros, who specializes in Visual Understanding, pointed out that “there are many more things in our visual world than we have words to describe them with”. Using word labels to train models, Efros argues, exposes our techniques to a language bottleneck. There are many more un-namable intuitions than we have words for.

There is an intriguing mapping between ML Labels and human Concepts, and between ML Encodings and human Intuitions.

In training deep networks, such as the seminal “cat-recognition” work led by Quoc Le at Google/Stanford, we’re discovering that the activations in successive layers appear to go from lower to higher conceptual levels. An image recognition network encodes bitmaps at the lowest layer, then apparent corners and edges at the next layer, common shapes at the next, and so on. These intermediate layers don’t necessarily have any activations corresponding to explicit high-level concepts, like “cat” or “dog”, yet they do encode a distributed representation of the sensory inputs. Only the final, output layer has such a mapping to human-defined labels, because it is constrained to match those labels.

“Is this Intuition staring at us in the face?

Therefore, the above encodings and labels seem to correspond to exactly what Kant referred to as “intuitions” and “concepts”.

In yet another example of machine learning technology revealing insights about human thought, the network diagram above makes you wonder whether this is how the architecture of Intuition — albeit vastly simplified — is being expressed.

The Sapir-Whorf Controversy

If — as Efros has pointed out — there are a lot more conceptual patterns than words can describe, then do words constrain our thoughts? This question is at the heart of the Sapir-Whorf or Linguistic Relativity Hypothesis, and the debate about whether language completely determines the boundaries of our cognition, or whether we are unconstrained to conceptualize anything — regardless of the languages we speak.

In its strongest form, the hypothesis posits that the structure and lexicon of languages constrain how one perceives and conceptualizes the world.

Can you pick the odd one out? The Himba — who have distinct words for the two shades of green — can pick it out instantly. Credit: Mark Frauenfelder, How Language Affects Color Perception, and Randy MacDonald for verifying the RGB’s.

One of the most striking effects of this is demonstrated in the color test shown here. When asked to pick out the one square with a shade of green that’s distinct from all the others, the Himba people of northern Namibia — who have distinct words for the two shades of green — can find it almost instantly.

The rest of us, however, have a much harder time doing so.

The theory is that once we have words to distinguish one shade from another, our brains will train themselves to discriminate between the shades, so the difference becomes more and more “obvious” over time. In seeing with our brain, not with our eyes, language drives perception.

“We see with our brains, not with our eyes.

With machine learning, we also observe something similar. In supervised learning, we train our models to best match images (or text, audio, etc.) against provided labels or categories. By definition, these models are trained to discriminate much more effectively between categories that have provided labels, than between other possible categories for which we have not provided labels. When viewed from the perspective of supervised machine learning, this outcome is not at all surprising. So perhaps we shouldn’t be too surprised by the results of the color experiment above, either. Language does indeed influence our perception of the world, in the same way that labels in supervised machine learning influence the model’s ability to discriminate among categories.

And yet, we also know that labels are not strictly required to discriminate between cues. In Google’s “cat-recognizing brain”, the network eventually discovers the concept of “cat”, “dog”, etc. all by itself — even without training the algorithm against explicit labels. After this unsupervised training, whenever the network is fed an image belonging to a certain category like “Cats”, the same corresponding set of “Cat” neurons always gets fired up. Simply by looking at the vast set of training images, this network has discovered the essential patterns of each category, as well as the differences of one category vs. another.

In the same way, an infant who is repeatedly shown a paper cup would soon recognize the visual pattern of such a thing, even before it ever learns the words “paper cup” to attach that pattern to a name. In this sense, the strong form of the Sapir-Whorf hypothesis cannot be entirely correct — we can, and do, discover concepts even without the words to describe them.

Supervised and unsupervised machine learning turn out to represent the two sides of the controversy’s coin. And if we recognized them as such, perhaps Sapir-Whorf would be less of a controversy and more of a reflection of supervised and unsupervised human learning.
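The two sides of that coin can be sketched in a few lines. This is only an illustration under stated assumptions: the one-dimensional "shade" numbers are invented, the label names are hypothetical, and the nearest-centroid classifier and k-means clustering used here are simple stand-ins, not the models from the color experiment or from Google's network.

```python
# Toy data: made-up one-dimensional "shade" values. Two kinds of green, one blue.
SHADES = [98, 100, 102, 118, 120, 122, 198, 200, 202]
COARSE = ["green"] * 6 + ["blue"] * 3                      # one word for both greens
FINE = ["green-a"] * 3 + ["green-b"] * 3 + ["blue"] * 3    # distinct words


def centroid_classifier(samples, labels):
    """Supervised: learn one mean per provided label, predict the nearest mean."""
    groups = {}
    for value, label in zip(samples, labels):
        groups.setdefault(label, []).append(value)
    means = {label: sum(vals) / len(vals) for label, vals in groups.items()}

    def predict(value):
        return min(means, key=lambda label: abs(means[label] - value))

    return predict


def kmeans_1d(samples, k, rounds=20):
    """Unsupervised: group values into k clusters without any labels (k >= 2)."""
    ordered = sorted(samples)
    # Deterministic start: spread the initial centers across the sorted data.
    centers = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    clusters = []
    for _ in range(rounds):
        clusters = [[] for _ in range(k)]
        for value in samples:
            nearest = min(range(k), key=lambda c: abs(centers[c] - value))
            clusters[nearest].append(value)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return [sorted(c) for c in clusters]


coarse = centroid_classifier(SHADES, COARSE)
fine = centroid_classifier(SHADES, FINE)
print(coarse(100), coarse(120))   # same word for both shades
print(fine(100), fine(120))       # different words, so the shades are told apart
print(kmeans_1d(SHADES, 3))       # three groups found with no labels at all
```

With coarse labels the classifier has no reason to separate the two greens; with finer labels it does. The clustering step finds all three groups with no labels at all, which is the unsupervised side of the argument.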

I find these correspondences deeply fascinating — and we’ve only scratched the surface. Philosophers, psychologists, linguists, and neuroscientists have studied these topics for a long time. The connection to machine learning and computer science is more recent, especially with the advances in big data and deep learning. When fed with huge amounts of text, images, or audio data, the latest deep learning architectures are demonstrating near or even better-than-human performance in language translation, image classification, and speech recognition.

Every new discovery in machine learning demystifies a bit more of what may be going on in our brains. We’re increasingly able to borrow from the vocabulary of machine learning to talk about our minds.

Democratizing Data Intelligence for All

Christopher Nguyen — Sat, 02 May 2015 03:04:57 GMT

A Major Progress Report

April 30th, 2015 (Originally published at http://arimo.com/blog/)

Just as accessibility of the automobile transformed our mobility in the 20th century, accessibility of data intelligence will transform our minds in the 21st. Over the next 10 years, machine learning is going to be pervasively present in our everyday lives.

For the past decade, the Big Data conversation has been about HDFS, MapReduce, monads and monoids. Today, we are changing the conversation, from bottom-up to top-down.

I’m excited to announce the general availability of the Arimo Data Intelligence Platform. The platform comprises a set of web applications that combine beautiful, consumer-grade user experience with enterprise-grade power and sophistication, for both business users and data scientists. Just as the iPhone democratized the mobile Internet, Arimo Apps are here to democratize data intelligence.

An Evolutionary Perspective

Twelve years ago, Sanjay Ghemawat et al. published the seminal paper on Google’s distributed filesystem, GFS, that enabled the company to store multiple copies of the entire Internet on inexpensive commodity hardware. A year later, Jeff Dean et al. revealed details of Google’s implementation of MapReduce, the Big-Compute engine that scaled to process all this data. Soon, Hadoop was launched by Doug Cutting, and in 2009, MapR and Cloudera were born to provide commercial support for it. About the same time, Matei Zaharia began work at the Berkeley AMPLab on what became Apache Spark, a fast in-memory Big-Compute engine on top of Big-Data storage.

When we started Arimo with a focus on Big Data and Machine Learning, we knew that Big Compute on Big Data was not enough. We wanted to create the missing Big Apps layer to have all the power and sophistication of big data and machine learning, but also connect all this to human users, with beautiful user experiences. While Steve Jobs’s iPhone was far from being the first mobile phone, it completely changed the conversation from technology to user experience, from bottom-up to top-down, from power to productivity.

The Arimo Data Intelligence Platform (ADIP)

The Arimo Data Intelligence Platform (ADIP) is an execution milestone. It comprises three key components: (1) Arimo Apps: user-facing applications with Google-Docs-like ease of use, yet with the predictive power and sophistication of a state-of-the-art Big Data system; (2) Arimo AppBuilder: a toolset that allows you to easily build your own custom applications using our patent-pending SmartQuery technology and application templates; and (3) Arimo PredictiveEngine: a powerful, distributed server with sophisticated algorithms that scale to petabytes on top of virtually any data source.

The familiarity of office productivity tools

If you are in Sales, Marketing or HR, you will experience the familiarity of office productivity tools and be able to ask questions of huge amounts of data without dependence on IT. Arimo Apps blend the simplicity and ease of a consumer application, with the power of Big Compute.

Use the same languages, directly on all of your data

If you are a data scientist familiar with R, Python or SQL, you continue to use the same languages, directly on all of your data and not just a sample, then easily share your analyses with a single URL. Arimo helps you eliminate the barriers between advanced analytics and large datasets, between building and deploying models, and facilitates collaboration with your business counterparts.

The overview of Adatao Data Intelligence Platform

Arimo’s Data Intelligence Platform can be deployed on-premise or in the AWS cloud. Arimo has partnered with Databricks, the team that created and continues to drive Apache Spark, to provide fully managed Spark clusters, enabling an end-to-end integrated stack for our end users. Arimo has also partnered with Altiscale, the leading Hadoop-as-a-Service provider, to offer an end-to-end Big Data stack to our customers.

Our Product Philosophy

Our product philosophy is based on three key pillars:

Natural Interfaces — Instead of having to deal with clumsy interfaces and stodgy database concepts, business users simply ask questions with Arimo SmartQuery, “How much would revenue change if marketing spend rises by 1%?”, without the need for programming skills.
Collaboration — Can you imagine a SWAT team planning a mission by email? Then why would you accept running your hyper-competitive, data-driven business that way? Business users and data scientists collaborate on the same data, at the same time, using tools and languages familiar to them: plain English, R, Python, SQL, Java, or Scala.
Machine Learning — Instead of putting Machine Learning (ML) on some intimidating pedestal far removed from business users, Arimo applications are built with inherent predictive capabilities to bring the real value of Big Data to all users. We are at an inflection point in the intelligence of our computing systems; now ML is to software what software was to machines.

So why Natural Interfaces, Collaboration, and Machine Learning?

Huge end-user value will be realized when the power of Big Data systems is intuitively accessible, through natural human interfaces. And as users of Google Apps testify, when people with diverse skills collaborate in the same space on the same data at the same time, time-to-decision is shortened and team productivity is boosted by 10x-100x.

The real reason for Big Data is Machine Learning. Big data has existed for a long time, but now data is retained so that our algorithms can automatically learn from it and do useful, intelligent things. ML will become an inherent property in all the tools and devices in our lives. We will demand that all of our software and devices be intelligent enablers for us, through learning from data.

Today, I welcome you to try the Arimo Data Intelligence Platform and experience the power and ease of use of Natural Interfaces, Collaboration, and Machine Learning in your everyday workflows with your colleagues and friends.

To learn more, please visit our product page.

Democratizing Data Intelligence for All was originally published in Arimo Narratives™ on Medium, where people are continuing the conversation by highlighting and responding to this story.

2015: The Year of Big Apps

Christopher Nguyen — Thu, 09 Apr 2015 21:23:33 GMT

Big Intelligence, from Big Apps on Big Compute on Big Data

March 17, 2015 (Originally published at blog.adatao.com)

Technology revolutions play out in familiar patterns. And almost always from the bottom up.

Remember Web 1.0? That was about browsers and JavaScript and web server infrastructure. Then in Web 2.0, we shifted the focus to top-down user experience. How about when relational databases were introduced to the market 30 years ago? Much of the focus then was on bottom-up data engines and relational algebra. Then came SQL, 4GL, GUIs, and the cycle finally completed with user-facing business applications.

In Web 2.0, we shifted the focus to top-down user experience.

The Big Data story follows the same arc. Big Data 1.0 has been about storage (e.g., HDFS) and computation (e.g., Apache Spark). We are now at the threshold of Big Data 2.0. It’s time to change the conversation and focus on end-user applications. These are Big-Data-native applications, which business users and data scientists can use to interact directly with their Big Data.

We’ll call them “Big Apps”.

Big Apps on Big Compute on Big Data

Big-Data applications (Big Apps) can’t happen before the arrival of Big Compute, sitting on top of Big Data. In that sense, Apache Spark and Tachyon are key pieces of this larger puzzle. They play the role of the Big Compute engine that fills the gaps between Big Data and Big Apps. As I have written elsewhere, in Spark and Tachyon we have the perfect architectural timing. These engines correctly anticipate the cross-over between rising business value and dropping hardware (memory) costs.

But there’s another significant property of Spark that separates it from all other in-memory Big Compute engines. And this is something most of us do not fully appreciate.

In the Spark framework, data is stored in RDDs (Resilient Distributed Datasets), which are first-class citizens. RDDs have life cycles that transcend compute cycles.

This is very different from, say, Hadoop MapReduce, which holds data in memory only temporarily as an internal part of each Map-Reduce stage. When each stage completes, the data, transformed and summarized, is written out to disk. All persistence happens at the disk level.

What do RDDs buy us, besides memory-speed iterations between compute stages? They buy us something very significant: the ability to have long-lived applications that can access high-level data structures without having to go back to disk, and without having to recompute them. If you look at other architectures that are Spark’s rivals, most will lack one property or another in this very important dimension.
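The difference in data lifetime can be illustrated with a small toy model. This is plain Python, not Spark or Hadoop code; the `Disk`, `run_stagewise`, and `InMemoryDataset` names are hypothetical and exist only to count how often slow storage is touched. In Spark itself, keeping an RDD around in memory is requested with calls such as `cache()` or `persist()`.

```python
class Disk:
    """Stand-in for slow storage; counts how often it is read."""

    def __init__(self, records):
        self._records = list(records)
        self.reads = 0

    def read(self):
        self.reads += 1
        return list(self._records)


def run_stagewise(disk, queries):
    """Each job reloads its input from disk, as in a disk-backed pipeline."""
    return [query(disk.read()) for query in queries]


class InMemoryDataset:
    """Loaded once, then reused by any number of later queries."""

    def __init__(self, disk):
        self._disk = disk
        self._cache = None

    def records(self):
        if self._cache is None:          # first access pays the disk cost
            self._cache = self._disk.read()
        return self._cache               # later accesses stay in memory

    def run(self, query):
        return query(self.records())


queries = [sum, max, len]

stage_disk = Disk([3, 1, 4, 1, 5])
print(run_stagewise(stage_disk, queries), "disk reads:", stage_disk.reads)

shared_disk = Disk([3, 1, 4, 1, 5])
dataset = InMemoryDataset(shared_disk)
print([dataset.run(q) for q in queries], "disk reads:", shared_disk.reads)
```

Both runs return the same answers, but the stage-by-stage version reads from disk once per query while the long-lived dataset reads once in total. Because the dataset object outlives any single query, several users or applications could also hold a reference to it, which is the property the collaborative applications described here rely on.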

Likewise, Tachyon provides us with the facility of persistent in-memory data structures. And once we have such structures, we can not only access them, we can also share them. Thus, Spark and Tachyon make it possible to create collaborative Big Apps.

Spark RDDs have life cycles that transcend compute cycles, making it possible to create collaborative Big Apps

Collaboration-Led Productivity: The Most Important Feature/Benefit for Users

When I was working on Google Apps, we would often hear people ask, “Why launch Google Spreadsheets? It’s 20 years behind Microsoft Excel and 200 features short!” They didn’t realize that a driving mantra for Google Apps was “It’s the collaboration, People!” I have seen metrics, and still experience daily, how Google Apps’ real-time collaboration features boost team task productivity by a factor of 10x or more. It is collaboration among team members with diverse skillsets and points of view that yields these large gains in organizational smarts.

It’s the Collaboration, People!

When something can increase productivity by such a huge amount, arguments about data engines that run 20% faster just pale in comparison.

Adatao Big Apps for Business Analysts and Data Scientists

Therefore, to us at Adatao, the fact that Spark and Tachyon enable deep collaboration over Big Data is very significant. We have built a full suite of user-facing applications that exploit these collaboration capabilities. For example, Adatao Narratives is an interactive document that allows business analysts and data scientists to collaborate on creating data narratives, complete with text, interactive charts & maps, in real time, on the same huge datasets. They can use different access languages, one using plain English, the other using R. They can even collaborate across different client applications, one using a web browser, the other using RStudio.

The brain power, insights, and productivity supported by these capabilities are phenomenal.

What’s Coming Next?

What else can we do to help people and businesses become smarter and more productive? From the very first days when companies such as Google and Facebook started accumulating data at scale, it was about applying algorithms to learn from that data, to build better systems and to drive decisions. So if you think about it, the driving rationale for Big Data is really to marry it with Machine Learning to produce wisdom and insights. Thanks to Big Data and Big Compute, recent major advances in Deep Learning and related areas indicate that we are at the threshold of a significant acceleration in machine intelligence.

Big Data is really about Machine Learning … We are at the threshold of a significant acceleration in machine intelligence.

At Adatao, we are working to ensure that we can all enjoy the benefits of this acceleration. The power of predictive analytics can equip business systems to learn via examples. It will help us discover the unknown by interpolating from the knowns. It will help us forecast future outcomes by extrapolating from the past. Every one of Adatao’s Big Apps has such predictive capabilities built-in as native features, from natural human interfaces to advanced machine-learning algorithms in the engine. Machine Learning is strong in our team’s DNA, and we envision a future in which machine intelligence is increasingly leveraged to aid and boost our human intelligence.

I am optimistic about this future and excited about delivering Adatao Big Apps, towards a future with Data Intelligence for All.

2015: The Year of Big Apps was originally published in Arimo Narratives™ on Medium, where people are continuing the conversation by highlighting and responding to this story.

On Deep Learning — A Tweeted Bibliography

Christopher Nguyen — Sat, 04 Apr 2015 05:32:34 GMT

On Deep Learning

A Tweeted Bibliography

Here’s a collection of my tweets on interesting/exciting developments in Deep Learning or Machine Learning in general.

It’s in no grand order, but does serve as a convenient reference & provides some context.

Christopher Nguyen on Twitter

Ever wonder what Stochastic in SGD really means? From 1951, Robbins & Monro http://bit.ly/1DIM0hq #MachineLearning pic.twitter.com/sJqDKsvFUt

Adatao [ah-DAY-tao] on Twitter

Here: our favorite reference on Discriminative vs. Generative classifiers, by @AndrewYNg http://buff.ly/1wA50bj pic.twitter.com/BbJoCIytYo

Christopher Nguyen on Twitter

Why do neural networks with more layers perform better than a single layer MLP with same # of params? http://bit.ly/1sGzYOj

Christopher Nguyen on Twitter

Universal function approximators: widely known, often misunderstood in #DeepLearning. Review http://bit.ly/1GTcWxz . pic.twitter.com/d2tJtn0eRM

Christopher Nguyen on Twitter

Geoff Hinton's Very Cool #DeepLearning demo. Run in reverse to see how Imagination might work http://bit.ly/1IWlcgE pic.twitter.com/DQJS8EjJGV

Christopher Nguyen on Twitter

Good 12/14 arXiv review of object recog w #DeepLearning. Amazingly it's also already outdated! http://bit.ly/1GRue08 pic.twitter.com/DaES8JFoPT

Christopher Nguyen on Twitter

A classic. Profoundly interesting in suggesting how we might encode concepts in our brains. http://bit.ly/1H0a0Be pic.twitter.com/SgIbdv7DBv

Christopher Nguyen on Twitter

Distributed-representation interpretation will prove largely correct #NeuralScience cf YBengio http://bit.ly/19147EK pic.twitter.com/cmZoB6hKOj

Christopher Nguyen on Twitter

Must-read: classic RHW Nature Letter, casually introducing backprop, "that triggered a boom in neural net research". pic.twitter.com/hhjGEaOMhG

Christopher Nguyen on Twitter

MSFT working on custom FPGA sys for #DeepLearning http://bit.ly/1LWs7XP . Should also look into @ylecun's NeuFlow. pic.twitter.com/qJECGBaS5Z

Christopher Nguyen on Twitter

MSFT Asia #DeepLearning team just demonstrated better-than-human visual recog. +1 @harryshum! http://bit.ly/1A03iFa pic.twitter.com/Xc28rPW596

Christopher Nguyen on Twitter

Microsoft team demonstrates shallow nets can rival #DeepLearning nets, suggesting alternative... http://bit.ly/1yEqdIA pic.twitter.com/0IWkwIz4mN

Christopher Nguyen on Twitter

New @MSFTResearch algorithm helps scale ad predictions & #DeepLearning to billions of vars http://bit.ly/1GGS1KF pic.twitter.com/VmiABaT7iB

Andrew Ng on Twitter

Deep Speech improves speech recognition; outperforms Bing/Google/Apple APIs in noisy environments! http://onforb.es/1x2BpOu

Christopher Nguyen on Twitter

DeepLearning UMich group has developed RL technique to best Google's $400MM DeepMind at Real-Time Atari Games http://bit.ly/1AgRjmx

Christopher Nguyen on Twitter

GOOG +1 over MSFT! Input norm >Param init, 1/10x training steps #DeepLearning HT @annodomini80 http://bit.ly/1E6dZXq pic.twitter.com/2QARCdZdoQ

Christopher Nguyen on Twitter

How do you find the best way to visualize a 784-dimensional dataset? http://bit.ly/1DOnE7Y pic.twitter.com/qVBUrUnv9V

Christopher Nguyen on Twitter

This is profound. Why wait for ASICs given our already very powerful #DeepLearning machine? #OneAlgorithm #Evolution http://bit.ly/1NFqEUK

Christopher Nguyen on Twitter

Google team's #DeepLearning sentence translator outperforms statistical machine translation http://bit.ly/15qdvjX pic.twitter.com/e6ZjdodL5s

Adatao [ah-DAY-tao] on Twitter

DeepLearning is figuring out how to tell stories from pictures #DataNarratives #DataViz http://buff.ly/1Dv9UxA pic.twitter.com/ayzu4yWl8k

Christopher Nguyen on Twitter

Neural Turing Machine creates its own algorithms (#NotAI, because I know exactly how it works) http://arxiv.org/pdf/1410.5401v2.pdf ... pic.twitter.com/nragAASyvg

Christopher Nguyen on Twitter

From Phil Colella's 7 dwarfs to Dave Patterson's 13: do get that MapReduce is just 1 of many http://bit.ly/1DgZ77Z pic.twitter.com/FTq3lbrh86

Christopher Nguyen on Twitter

This is the remarkable #DeepLearning Q&A work at @facebook demo'ed by @schrep at recent F8 http://buff.ly/1IQykEP pic.twitter.com/XkPEdMZPzn

Christopher Nguyen on Twitter

Introducing iRNNs, in a co-pub of Quoc Le and Geoff Hinton (and Navdeep Jaitly) http://bit.ly/1FediQX #DeepLearning pic.twitter.com/0KXgPz3xEK

On Deep Learning — A Tweeted Bibliography was originally published in Deep Learning 101 on Medium, where people are continuing the conversation by highlighting and responding to this story.

Big Data 2.0 Has Arrived

Christopher Nguyen — Tue, 17 Feb 2015 19:36:51 GMT

When We Can Process Data Flexibly At-Scale, What Will We Want To Do? — Provost & Fawcett

Aug 7, 2014 (Originally published at blog.adatao.com)

Two years ago, we set out on a journey. The hype around Big Data was fast outpacing its tangible business value. Yet, we were at the threshold of a fundamental shift in our industry — just as Web 1.0 was about technologies and functionalities of the web, while Web 2.0 was about the focus on the user; we’re seeing the same fundamental shift from Big Data 1.0 to Big Data 2.0.

We’re seeing a fundamental shift from Big Data 1.0 to Big Data 2.0

Adatao is the missing puzzle piece that bridges the gap between Big Data 1.0 of the past 5 years, and Big Data 2.0 going forward, with the arrival of “Big Compute” capabilities. We had seen, first-hand, Big Compute in action on Wall Street in the early 2000s, and then at places like Google in the mid-2000s. We’d learned that fast, powerful Big Compute engines would enable businesses to use Big Data fluidly and interactively, at the speed of thought. But accessibility was limited due to cost. The Adatao founding team knew that without Big Compute, Big Data could never deliver on its hype.

Without Big Compute, Big Data could never deliver on its hype

Two years ago, we confidently bet on Apache Spark because we knew where the trends were heading. We set out to build Big Data applications that assumed the mass availability of Big Compute. Today, we are delivering on the promise of Big Data 2.0 with our current products:

pInsights — the “beauty layer” that enables business analysts and data scientists to easily and fluidly interact with Big Data in an easy to consume, interactive format. Similar to a Facebook or Google Search engine, predictive SmartQuery was built into a Google Doc type document that allows users to instantly and collaboratively produce embedded analytics within seconds to assist with decision making.
pAnalytics — the “power layer” that enables data scientists and data engineers to analyze massive amounts of data in seconds. pAnalytics sifts through the data by representing it as one large, simple table, hiding all the data complexities, enabling data scientists and engineers to work with Big Data analytics in a very simple, powerful way. Data can be pulled in from Cassandra, analyzed in Spark, and the results saved back to S3 — all using one familiar API. This allows data scientists and engineers to focus on data analysis, and multiply their productivity by 10 times.

Ease-of-use + Collaboration + Predictive Capabilities = Data Intelligence

Since we debuted our products this past December, we’ve seen incredible demand from companies across every sector eager to finally unlock the potential of their investments in Hadoop, and reap the long promised benefits of Big Data.

As we look to build our team to scale and meet this demand, I’m excited to announce our Series A financing with our new partners at Andreessen Horowitz, who are leading the investment with participation from Lightspeed Venture Partners and Bloomberg Beta.

Peter Levine, from Andreessen Horowitz, will join our board. Peter has extensive ground-up operating leadership experience and teaches a class on building a professional enterprise sales infrastructure at Stanford GSB and MIT Sloan School. Marc Andreessen will also support us as a Board Observer; his excitement about the potential of Adatao is summed up in the following:

As a successful two-time company builder and former Engineering Director of Google Apps, Christopher has the technical company builder background that we love to bet on. Plus, he’s assembled a strong team of engineers and PhDs in parallel systems and machine learning that collectively have a unique and powerful vision.

But we were really sold when we saw what Adatao has built — we were blown away. Christopher and team see a future convergence of human and machine intelligence that we believe in, and they have the technology roadmap and engineering experience to get there.

We can’t wait to help as Adatao designs the future of Big Data.

- Marc Andreessen

We have found each firm to be a true partner who not only shares our vision, but is also incredibly supportive and committed to being there with us every step of the way on our challenging but exciting journey. From Lyon Wong and John Vrionis of Lightspeed, who recognized our potential in the Big Data ecosystem quite early on, to James Cham at Bloomberg Beta who understood our value and strength as soon as he experienced the product, we are incredibly excited about what the future will bring.

In conversation after conversation with enterprises, we’re seeing that business needs are driving the convergence of business intelligence and data science/machine learning, directly on top of big data. This convergence is creating an entirely new kind of business value, at a scale that has never been seen before.

Big Data + Big Compute = Big Data Intelligence

We can’t wait for the journey we’re embarking on to make Data Intelligence truly available to all.

Originally published at blog.adatao.com.

Big Data 2.0 Has Arrived was originally published in Arimo Narratives™ on Medium, where people are continuing the conversation by highlighting and responding to this story.

Why and How We Bet on Apache Spark

Christopher Nguyen — Tue, 17 Feb 2015 12:28:06 GMT

The Story of Apache Spark, From Our Perspective

June 18, 2014 (Originally published at blog.adatao.com)

In early 2012, a group of engineers with backgrounds in distributed systems and machine learning came together to form Adatao (ah-DAY-tao). We saw a major unsolved problem in the nascent Hadoop ecosystem: it was largely a storage play. Data was sitting passively on HDFS, with very little value being extracted. To be sure, there was MapReduce, Hive, Pig, etc., but value is a strong function of (a) speed of computation, (b) sophistication of logic, and (c) ease of use. While the Hadoop ecosystem was being developed well at the substrate, enormous opportunities above it were left uncaptured.

Data was sitting passively on HDFS, with little value being extracted.

In-memory computing was key to the solution.

On speed: we had seen data move at-scale and at enormously faster rates in systems like Dremel and PowerDrill at Google. It enabled interactive behavior simply not available to Hadoop users. Without doubt, we knew that interactive speed was necessary, and that in-memory computing was key to the solution. As Cloudera’s Mike Olson has quipped, “We’re lucky to live in an age where there’s a Google. They live about 5 years in the future, and occasionally send messages back to the rest of us.” Google does indeed “live in the future”, in terms of the demands of scale and the value it is extracting from data.

On sophistication: for Adatao, the essential difference between “small” and “big” data is whether data is big enough to learn from. For some questions, such as “Does it hurt to hit my head against a brick wall?”, 5 samples suffice. To classify large images, a million samples aren’t enough. We knew this was the second missing key in Big Data: aggregates and descriptives were necessary but insufficient. The Big-Data world needed the sophistication of machine learning. Big Data needed Big Compute. “Predictive” isn’t just another adjective in a long string of X-analytics; it is the quantum change, separating the value of big from small.

The difference between “small” and “big” data is whether data is big enough to learn from

Thus Adatao was born as a “Big Data/Machine Learning” company. Our exact product features would be driven by customer conversations, but the core thesis was clear. We wanted to bring “Data Intelligence for All”, specifically with the speed and sophistication discussed above.

If in-memory compute and machine-learning logic were the key to unlocking the value of Big Data, why hadn’t this been solved already in 2012? Because cost/benefit trade-offs matter in any technology transition. In the chart below, the crossover points happened at different times for different endeavors: in-memory computing hit critical mass on Wall Street about 2000–2005 and at Google c. 2006–2010, and we project that the enterprise world at large is reaching it about now (2013–2015).

Cross-over points for transitions to in-memory computing

The future increasingly favors RAM

If this isn’t clearly happening for your organization or industry yet, relax. It will, soon. Because as the latency and bandwidth trend charts below show, the future increasingly favors RAM.

The future increasingly favors a shift to RAM

As the Adatao team set out to build this big-compute analytic stack on Hadoop, we wanted our solution to reach all the way to the business users, while also exposing convenient APIs for data engineers and scientists. This required a great collaborative user interface and solid data-mining and machine-learning logic, all backed by a powerful big-compute engine. We did a survey of the in-memory landscape, and found a small number of teams also working in the same direction. But virtually all were either too incremental or too aggressive. Some were developing work-arounds such as caching data between MR iterations, or maintaining a low-level memory cache with no persistent, high-level data structures. Others promoted still slow and expensive “virtualized memory” architectures that were too early for prime time.

The AMPLab team made the right architectural decisions for the times

Then we came across Spark and the Berkeley AMPLab team. Immediately, we knew they had identified the right problem statements, and made the right architectural decisions for the times. Here are some key design choices correctly made for widespread adoption c. 2012:

Data model: Spark was the only architecture that supported the concept of a high-level, persistent, distributed in-memory dataset. Not all “in-memory” systems are equivalent. Spark’s RDDs exist independently of any given compute step, which allows for speedy iterative algorithms, since high-level datasets are readily available to each iteration without delay. Equally important, they make long-running, interactive, memory-speed applications possible.
Resiliency by recomputation: with replication being the other option, Spark made the timely choice to prefer recomputation. Memory had gotten cheaper, but not yet cheap enough for replication to be the order of the day, as it is with HDFS disks.
General DAG support: while it was possible to build dedicated SQL query engines to overcome Hadoop MapReduce’s limitations (and others did choose this path), Spark’s general DAG model meant we could build arbitrary algorithms and applications on it.
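The first two design choices above, persistent in-memory datasets and resiliency by recomputation, can be illustrated with a toy sketch in plain Python. This is not Spark’s implementation or API. `LineageDataset` and all of its methods are hypothetical names, and the sketch assumes a single process with hand-made partitions. It shows only the core idea: each dataset remembers how it was derived from its parent, so a lost partition can be rebuilt on demand instead of being stored in replicated form.

```python
from typing import Any, Callable, Dict, List, Optional


class LineageDataset:
    """Toy partitioned dataset that caches partitions and remembers its lineage."""

    def __init__(
        self,
        num_partitions: int,
        compute: Callable[[int], List[Any]],
        parent: Optional["LineageDataset"] = None,
    ) -> None:
        self.num_partitions = num_partitions
        self._compute = compute          # partition index -> records
        self.parent = parent             # lineage link to the source dataset
        self._cache: Dict[int, List[Any]] = {}
        self.computations = 0            # how many partitions were (re)built

    @classmethod
    def from_partitions(cls, partitions: List[List[Any]]) -> "LineageDataset":
        data = [list(p) for p in partitions]
        return cls(len(data), lambda i: list(data[i]))

    def map(self, fn: Callable[[Any], Any]) -> "LineageDataset":
        return LineageDataset(
            self.num_partitions,
            lambda i: [fn(x) for x in self.partition(i)],
            parent=self,
        )

    def filter(self, pred: Callable[[Any], bool]) -> "LineageDataset":
        return LineageDataset(
            self.num_partitions,
            lambda i: [x for x in self.partition(i) if pred(x)],
            parent=self,
        )

    def partition(self, i: int) -> List[Any]:
        # Serve from memory when possible; otherwise rebuild from lineage.
        if i not in self._cache:
            self._cache[i] = self._compute(i)
            self.computations += 1
        return self._cache[i]

    def collect(self) -> List[Any]:
        return [x for i in range(self.num_partitions) for x in self.partition(i)]

    def lose_partition(self, i: int) -> None:
        # Simulate losing the in-memory copy of one partition.
        self._cache.pop(i, None)


base = LineageDataset.from_partitions([[1, 2, 3], [4, 5, 6]])
squares = base.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

first = evens.collect()      # builds and caches both partitions
evens.lose_partition(1)      # simulated failure
second = evens.collect()     # only partition 1 is recomputed
```

After the simulated loss, only the missing partition of `evens` is rebuilt, and it reuses the still-cached partition of `squares`. No second copy of the data was ever stored, which is the trade-off the text describes: pay with recomputation rather than with extra memory.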

We were seriously betting the company on Spark, promoting its goodness in every relevant conversation.

We were ecstatic. Spark represented years of R&D we didn’t have to spend building an engine before building sophisticated, user-facing applications. When we made the decision to support the AMPLab Spark effort, there were only 1 or 2 others that had made similar commitments. We were seriously betting the company on Spark.

But thanks to Spark, we were able to move ahead quietly and quickly on Adatao pInsights and pAnalytics, iterating on customer feedback while passing our inputs and market data along to the Spark team. We promoted Spark’s goodness in every relevant conversation. By late summer 2013, Databricks was about to be born, further increasing our confidence in the Spark-on-Hadoop ecosystem. There was now going to be an official, commercial entity whose existence was predicated on growing the ecosystem and maintaining its health. And the team at Databricks is doing an excellent job at that stewardship.

Apache Spark will have a bright future

Today, Adatao is one of the first applications to be Certified on Spark. We’re seeing remarkable enterprise adoption speeds for Adatao-on-Spark. The most sophisticated customers tend to be companies that have already deployed Hadoop, who are all too familiar with the failed promises of Big Data. We see immediate excitement in customers the moment they see the Adatao solution: a user-facing analytics application that is interactive, easy-to-use, supports both basic analytics and machine learning, and is actually running in seconds of real time over large Hadoop datasets. Finally, users are truly able to extract data intelligence from data storage. Value creation is no longer just about Big Data. It’s about Big Compute, and Spark has delivered that capability for us.

Spark has made it as a top-level Apache project, going from incubation to graduation in record time. It is also one of Apache’s most active projects with hundreds of contributors. This is because of its superior architecture and timeliness of engineering choices, as discussed above. With that plus appropriate care and feeding, Apache Spark will have a bright future even as it evolves and adapts to changing technology and business drivers.

Originally published at blog.adatao.com.

Why and How We Bet on Apache Spark was originally published in Arimo Narratives™ on Medium, where people are continuing the conversation by highlighting and responding to this story.
