Abstract: "In this paper we argue that systems for numerical computing are stuck in a local basin of performance and programmability. Systems researchers are doing an excellent job improving the performance of 5-year-old benchmarks, but gradually making it harder to explore innovative machine learning research ideas. We explain how the evolution of hardware accelerators favors compiler back ends that hyper-optimize large monolithic kernels, show how this reliance on high-performance but inflexible kernels reinforces the dominant style of programming model, and argue these programming abstractions lack expressiveness, maintainability, and modularity; all of which hinders research progress. We conclude by noting promising directions in the field, and advocate steps to advance progress towards high-performance general purpose numerical computing systems on modern accelerators."
The main example the authors use to illustrate these issues is capsule networks, first proposed two years ago.[a]
To date, no one has been able to develop a high-performance implementation of capsule networks. At present, the best-performing implementations in Tensorflow and PyTorch must copy, rearrange, and materialize to memory two orders of magnitude more data than necessary, due to the issues raised by the authors. See sections 1 and 2 of the paper for the gory details.
This is something we run into even when pushing transformer-style models (which can be viewed as a partial step towards some of these capsule ideas) into new areas like computer vision. There are places where you can beat convolutions in terms of parameters+FLOPs but wall-clock time is poor due to the same issues presented in this paper - there aren't good optimized self-attention kernels so you're forced to implement things in terms of existing ops - depthwise convs, etc. It's definitely a real pain when trying to bring in new primitives to the deep learning world, only compounded by the increasing diversity of accelerators in use.
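To make the pain concrete: a single self-attention head is nothing more than a few matmuls and a softmax, but composing it from generic ops means materializing every intermediate to memory — exactly what a fused, hand-optimized kernel would avoid. A minimal numpy sketch (shapes and names are illustrative, not any framework's API):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the chosen axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv):
    """One attention head built from generic ops: three projections,
    a scaled dot product, a softmax, and one more matmul. Every
    intermediate (q, k, v, scores) is written out to memory, which is
    the overhead a fused kernel would eliminate."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))                    # 8 tokens, 16-dim embeddings
wq, wk, wv = (rng.normal(size=(16, 16)) for _ in range(3))
out = self_attention(x, wq, wk, wv)
print(out.shape)
```

Each line maps to an existing op in TensorFlow or PyTorch, which is why the wall-clock cost shows up in memory traffic rather than FLOPs.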
Note: I am not an expert on ML, ANNs, etc - most of what I have done has been mainly "hobby level" and MOOC learning.
From what I understand, though, what GH is trying to accomplish with capsule networks (which I have tried, and so far failed, to understand) is optimizing backpropagation, or possibly removing it altogether.
He has noted in the past how - so far (from what I know) - there isn't a biological equivalent to backpropagation for learning. Backprop is a completely artificial mathematical construct that doesn't happen in natural systems. It is also extremely energy intensive - at least in the manner that it is currently done.
So the question is - that he's hoping to answer I think - what is a proper working framework for implementing an artificial neural network that can learn, without using backpropagation (or using it differently)?
I think whoever can solve this will fundamentally remake the field of ML/AI - much in the same way that backprop (and later "deep learning") did.
It's not the backprop that is holding back ML - it takes about as much time as the forward pass and requires 3x the memory. Backprop is necessary for almost all the deep neural nets that have state of the art results, attempting to replace it would push us back a decade or more.
The main problem with ML is the separation of data and compute. It takes a lot of time and energy to move data around. We need 'in memory compute'.
I've heard the argument made about "in memory compute" - and I do know that various companies have made hardware toward that direction (and some of it has very low power requirements).
I just find it compelling that Hinton - at least as of 2 years ago - has this viewpoint that we should be rethinking the concept of backprop...
I'd say this is a difference of scale. Hinton's comments are meant to inspire and encourage _The Next Great Thing_. We could read the paper that broadly, but they seem to be addressing a more immediate local minimum.
Isn’t backpropagation just an efficient method for getting a gradient? It sounds like what he’s looking for an alternative to is the idea of adjusting parameters to match labeled training data.
I thought this was going to be about this "stuck in a rut":
GH: One big challenge the community faces is that if you want to get a paper published in machine learning now it's got to have a table in it, with all these different data sets across the top, and all these different methods along the side, and your method has to look like the best one. If it doesn’t look like that, it’s hard to get published. I don't think that's encouraging people to think about radically new ideas.
Now if you send in a paper that has a radically new idea, there's no chance in hell it will get accepted, because it's going to get some junior reviewer who doesn't understand it. Or it’s going to get a senior reviewer who's trying to review too many papers and doesn't understand it first time round and assumes it must be nonsense. Anything that makes the brain hurt is not going to get accepted. And I think that's really bad.
What we should be going for, particularly in the basic science conferences, is radically new ideas. Because we know a radically new idea in the long run is going to be much more influential than a tiny improvement. That's I think the main downside of the fact that we've got this inversion now, where you've got a few senior guys and a gazillion young guys.
Which is the granddaddy of the other one. You don't really need better frameworks that make it easier to explore innovative ideas if you can't ever hope to publish those innovative ideas, even if you manage to make them work.
Both GH and TFA are raising a pretty classic "structure of scientific progress" trade-off: communities that are well organized around specific {theories, datasets, metrics, hardware, software} greatly accelerate progress within the box, and exclude progress outside it.
The way around has always been the hard work of creating your own box - ideally starting with questions, then theories, then methods, and so on.
What's not available is instant attention/accolades/funding for work that presents beautiful methods without first building a community that cares about the problems those methods solve and agrees on a definition of success. So, folks with radically new ideas should focus on rallying others to agree on that definition. I see a lot of good work these days introducing new problem definitions and datasets for that purpose.
This is applicable to almost every research area where publication of peer-reviewed (PR) papers is the de facto standard. I have a background in physics and a few papers published in Q1 journals, so I have heard these complaints about "peer-review pressure" many times. But the truth is somewhere in between: 1. yes, sometimes you get a stubborn reviewer and it is extremely difficult to get through, I've experienced this several times; 2. but PR still delivers a necessary filter against a flood of low-quality papers.
In essence, I found the note above too emotional, since currently we have nothing better than PR of papers, review of code, etc., to guarantee a minimal level of quality.
You get that "flood of low-quality papers" on Arxiv but most researchers filter through papers daily and are generally capable of isolating the relevant and interesting from the dull and unimportant. Probably a lot of citations to papers are somewhat dishonest with the Arxiv version being the only version that was read. From another researcher's standpoint, the peer review is not very important: you already know if you got scooped, you already know if you find something interesting enough to study as well.
Peer review seems to have a purpose but it does not seem to be a quality gate keeping purpose. Perhaps enforcing orthodoxy is the actual purpose.
The majority doesn't have to but journals should. The majority will pick it up if and when someone has proven its utility above and beyond current techniques.
However, it should be noted that most "radically new" ideas are neither radical nor new. And it would be more or less accurate to label a vast swathe of them as "nonsense" although I'd prefer to use a term with a less pejorative connotation.
The difficulty comes in sorting out the "nonsense" from the genuinely new ideas. This takes expertise and time. It's easier to use a quick heuristic to weed out those most likely to be "nonsense" so you can spend the time on better prospects. Of course that heuristic is unlikely to be perfect.
Admittedly I can't speak to ML specifically but I assume the same applies to it as it does to many other areas.
Ideas are cheap. I have ideas, you have ideas, everyone has ideas. What counts is beating the state of the art, that turns heads and raises eyebrows (such as ResNet, AlphaZero, WaveNet, BigGAN and Transformer). Usually those results are based on lots of compute and training data, though.
Ideas are cheap, but the skills to engineer ML pipelines and the knowledge required to grok models are not. And the cost of testing some ideas is not cheap either.
Lots of people have legit skills, high level math, pro software engineering, actual ML chops (I got mine), and probably significant learning in other sciences. But there is no door to walk through on that alone, only strange VC corridors and FAANGY career mazes.
There should be a wide door to support people with real skills taking chances. If 100K for so many silly startups is worth the gamble, so is the gamble on people who have gone the extra distance with their learning and experience.
You can train a weak, small baseline Transformer on your workstation GPU, and then check if your creative idea outperforms it.
If it does, then it would be worth trying to pursue larger calculations.
Don't people just read the arXiv? I think Hinton is being a bit old fashioned here. A paper doesn't need to be published to a journal at all to be influential.
How do you filter arXiv? There are way too many papers on arXiv; it's impossible to keep track of everything, especially since the quality on arXiv is ... heterogeneous.
That's interesting, this method has Google take the place of peer review. Are there concerns with the Google algorithm being unsuited to this purpose, or SEO?
Personally, I only trust arxiv preprints when I know something about the authors.
Google searches all the journals I need. I try searching journal databases (like web of science) and they always miss papers I know exist. I type the title into Google and I usually find it, and if I don't I ALWAYS find it on Google scholar. They even have nifty little links to download the citations for endnote etc.
It's quite easy to produce a catchy-looking title and abstract. Unlike mere Googling, top conferences and journals add a layer of peer review that reduces the probability of low-quality content. I've been burned too often by random papers on the internet.
If you have an idea about machine learning that doesn't fit the mold, you're pretty much forced to write your own mini-framework before trying it. And I'm not even talking about CUDA compilation.
Mainstream languages don't have syntax for trivial math like graphs and matrices. Even 2d array support is horrible everywhere.
Try to write the following algorithm:
1. Given an image in greyscale, choose two n*n regions at random.
2. Compute the median difference between pixels in those regions.
3. Add both regions to a graph (unless they are already there) and set the edge value to that difference.
Trivial to visualize in your head, a nightmare to implement in most languages.
And what if you want to parallelize this to work on large amounts of data? I have hopes that AMD's monster processors will make experimentation a little bit more practical, since you wouldn't be in such a desperate need to push everything onto GPU.
Isn't this the kind of application where Julia would shine?
If you're trying to develop something new or cutting edge it makes sense that frameworks don't necessarily suit you. Frameworks try to make common actions easy (provide a 'happy path'), very much the opposite of developing something new.
Something like Julia which offers useful primitives, but can also let you compile general purpose code for CUDA seems like a better fit, rather than expecting frameworks to cater to you.
The authors of the paper discuss how Julia fares with capsule networks in section 4.1:[a]
> There are frameworks, such as Julia, which nominally use the same language to represent both the graph of operators and their implementations, but back-end designs can diminish the effectiveness of such a front end. In Julia, while 2D convolution is provided as a native Julia library, there is an overloaded conv2d function for GPU inputs which calls NVidia’s cuDNN kernel. Bypassing this custom implementation in favor of the generic code essentially hits a “not implemented” case and falls back to a path that is many orders of magnitude slower.
Julia makes this substantially easier, but yes, in many cases you might need to come up with a cache-optimized tensor operation kernel. The Julia Lab is building some tooling for the automatic construction of this stuff which is compatible with the AD frameworks, kind of like Halide, but in the Julia language and compatible with all of its generics. The real key, though, is that packages play nicely in the Julia sphere, so if someone writes good Julia code for the operation and puts a package up, then you can take that package and use it in Flux even if the person did not write it for ML purposes. There are a lot of quantum-physics folks writing esoteric tensor operations in Julia that I personally have been pulling stuff from.
> ...in many cases you might need to come up with a cache-optimized tensor operation kernel...
Yes, exactly. That's one of the issues raised by the authors of the paper.
They note that in practice, most AI researchers will not do that. They iterate and test code very quickly and cannot invest the time/effort necessary (a) to figure out how to write a cache-optimized kernel every time they might need one, or even (b) wait for an automated kernel-writing tool to finish searching for and compiling an optimized kernel (which in practice often ends up being slower than manipulating the data in inefficient ways in order to use existing, inflexible, prebuilt kernels).
>They note that in practice, most AI researchers will not do that.
I think the main issue there is that these kernel compilers are not well-integrated into the ML libraries. If it's a standard part of it, well-documented, easy to build (usually the difficult part is compiling someone's custom compiler...), and if the compiled results can be "stored" for future ML models to just use, I think it would see more adoption. As of now, tensor compilers for ML frameworks are more of a (good but shaky) research tool. I think a Julia approach of "do it on Julia code" (even the compiler, so it's easy to just ]add the package) could definitely break down some of these barriers to getting it done in a way that garners widespread adoption, but a lot of work will need to be done in order to get a good enough tensor compiler for people to care. If anything, this is a fun space with many opportunities.
Clearly, that would help. And clearly, there are opportunities for improvement -- as much as two orders of magnitude (!) in the case of capsule networks.
That said, I think the authors are right about the challenge here: "...current frameworks excel at workloads where it makes sense to manually tune the small set of computations used by a particular model or family of models. Unfortunately, frameworks become poorly suited to research, because there is a performance cliff when experimenting with computations that haven’t previously been identified as important. While a few hours of search may be acceptable before production deployment, it is unrealistic to expect researchers to put up with such compilation times (recall this is just one kernel in what may be a large overall computation); and even if optimized kernels were routinely cached locally, it would be a major barrier to disseminating research if anyone who downloaded a model’s source code had to spend hours or days compiling it for their hardware before being able to experiment with it."
I guess "easier" is a relative term; welding something together in your garage is never going to be as easy & efficient as buying the Toyota... but what the binding constraints on the field's progress are I don't claim to know.
But since you seem to understand the paper a bit, may I ask: I thought convolution was done by FFT, shaving N^2 to NlogN and all that. Are these automated compilers they discuss smart enough to find that, or are they literally re-arranging (1)? Or is it all so memory-bound that this question makes no sense?
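For reference, the FFT route is just the convolution theorem: zero-pad both signals, multiply their transforms elementwise, and invert. A minimal 1-D numpy sketch (whether a search-based kernel compiler would ever rediscover this algorithmic shortcut, rather than just reorder the direct loops, is exactly the open question):

```python
import numpy as np

def conv_fft(x, k):
    """Full 1-D convolution via the convolution theorem:
    pad both signals to the output length, multiply their FFTs
    elementwise, and invert -- O(N log N) vs O(N^2) direct."""
    n = len(x) + len(k) - 1
    return np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(k, n), n)

x = np.array([1.0, 2.0, 3.0, 4.0])
k = np.array([1.0, 0.0, -1.0])
print(np.allclose(conv_fft(x, k), np.convolve(x, k)))  # True
```

In practice cuDNN chooses between FFT-based, Winograd, and direct/implicit-GEMM convolution algorithms depending on shapes, so for small kernels the memory-bound direct forms often win anyway.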
Maybe I'm not understanding the problem they hit, but couldn't they just modify the conv2d that calls cuDNN to do what they want? In Julia, even CUDA kernels can be written in Julia. I realize that CUDA and cuDNN programming is not easy, but perhaps given Julia's powerful macro system and compiler tools this could be automated to some extent?
Part of the problem is that cuDNN is closed source. If cuDNN almost does what you want, you have to reimplement it, and your implementation will not be as fast as NVIDIA's, because they have put a lot of work into it and also understand the subtleties of their GPUs' architecture better. A lot of performance details of NVIDIA's GPUs are not documented.
And the other problem is that tensor kernels are just a bear to optimize. Even on the CPU, if you write a 3 loop implementation of matrix multiplication you're leaving behind a ton of performance. You want to use a blocked algorithm which specifically knows your cache sizes, and it comes out to like a 12 nested loop implementation which can be architecture-dependent and maybe have a tiny bit of assembly magic in the middle. That's just a CPU, and that's just a dense matmul. This is why things like Halide exist for trying to generate optimized kernels to things like stencil operations, since "just write the loop" really isn't a good strategy. Couple that with the fact that you need to specialize it to the architecture, and indeed a lot of it isn't documented, and you have a hard time getting something as fast as CuDNN.
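The blocked structure described above looks roughly like this (a toy numpy sketch of the loop nesting only — real BLAS kernels add register blocking, packing, and hand-tuned vector inner kernels on top, and pick the block size from the actual cache sizes):

```python
import numpy as np

def matmul_naive(A, B):
    # The "just write the loop" version: three nested loops,
    # terrible locality on B for any non-trivial size.
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            for p in range(k):
                C[i, j] += A[i, p] * B[p, j]
    return C

def matmul_blocked(A, B, bs=32):
    """Loop over tiles so each tile of A and B is reused while it is
    still cache-resident; numpy slicing handles the ragged edges."""
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    for i0 in range(0, n, bs):
        for j0 in range(0, m, bs):
            for p0 in range(0, k, bs):
                C[i0:i0+bs, j0:j0+bs] += (
                    A[i0:i0+bs, p0:p0+bs] @ B[p0:p0+bs, j0:j0+bs])
    return C

rng = np.random.default_rng(1)
A, B = rng.normal(size=(64, 48)), rng.normal(size=(48, 80))
assert np.allclose(matmul_blocked(A, B), A @ B)
```

And again, this is the easy case: a dense matmul on a CPU. Stencils, convolutions, and the irregular contractions in something like capsule routing are each a fresh optimization problem.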
If you haven't seen this stuff, take a look at: https://www.cs.cmu.edu/afs/cs/project/pscico-guyb/realworld/... . That's cache-oblivious, so it still leaves some performance on the table (which is why BLAS doesn't use a cache-oblivious form), but gets the point across that you can do a ton of algorithmic work here.
What Julia gives you is something where you can directly write the loops and compile to CUDA kernels, and it will run fast. The issue there isn't making the CUDA kernels, but coming up with the algorithm. CuDNN is proprietary so you can't just copy their tensor kernels into Julia. The Julia Lab has some individuals interested in building tensor compilers for this case though, where you can abstractly define such an operation and it will try to come up with a fast version. And then hopefully build binaries from it so it's "compile once"? But this stuff is just hard and there's no easy answer, though Julia is probably the best approach for building an answer IMO :).
I definitely agree that if you stray outside the "traditional" CNN, LSTM, etc. it is very hard to implement custom layers in most frameworks. Especially if they are recurrent. I tried to implement a custom RNN in Tensorflow and gave up. All of the documentation is "just call tf.Lstm()" or whatever.
The exception is CNTK. That makes it really really easy to describe any network including recurrent ones. Sadly it doesn't seem to have caught on at all.
I agree that it is super painful and confusing to implement RNNs in TensorFlow and had a similar experience starting out. You should take a look at https://www.tensorflow.org/api_docs/python/tf/scan, that is one of the easier ways of implementing one I think. Other than that Pytorch is also great for experimentation and even has a nice example https://pytorch.org/tutorials/advanced/cpp_extension.html of implementing a variant of an LSTM.
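If it helps demystify it, the pattern tf.scan implements is just a fold over the time axis that carries hidden state and stacks the outputs. A framework-free numpy sketch (all names here are mine; the cell is a plain tanh RNN, but any custom cell slots into `step`):

```python
import numpy as np

def scan(step, xs, h0):
    # Minimal analogue of tf.scan: fold `step` over the leading axis
    # of `xs`, carrying hidden state and stacking every output.
    h, outs = h0, []
    for x in xs:
        h = step(h, x)
        outs.append(h)
    return np.stack(outs)

rng = np.random.default_rng(0)
d_in, d_h, T = 4, 8, 10
Wx = rng.normal(size=(d_in, d_h)) * 0.1
Wh = rng.normal(size=(d_h, d_h)) * 0.1

def rnn_step(h, x):
    # A vanilla tanh RNN cell; replace this with your custom update.
    return np.tanh(x @ Wx + h @ Wh)

xs = rng.normal(size=(T, d_in))          # one sequence of T inputs
hs = scan(rnn_step, xs, np.zeros(d_h))
print(hs.shape)                          # (10, 8): one state per step
```

The frustration in the parent comments is that expressing this loop *and* having it run fast on an accelerator are two very different things — which is the paper's point.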
Contributing to this problem is that there is a ton of low quality blog posts on all of these topics.
Tensorflow has pretty poor documentation. It looks like you've been reading the higher-level stuff. It's pretty easy to make custom layers and such with TF if you read some of the source. Though, honestly, numpy might be the simplest way to create a custom layer inside an architecture. If you use TF 2.0 this is simple, because it doesn't even have to be part of the graph, per se.
Is the fix here really just, let's build more infrastructure? If so, who is going to fund it? AI/ML is already mostly a marketing term at this stage; it's all just people who want results now now now. Maybe this is the reason that AI winters have occurred: companies expect all this stuff to be plug and play, but it really isn't. You've got to put in the hard yards, and that means the dollars.
Considering the difficulty outlined in the article is in optimising the more 'cutting-edge' approaches (i.e. those without manually optimised kernels), there are numerous AI research centres (in industry and academia) that would be quite keen to reduce the compute cost of exploring new approaches / architectures.
The vast majority of companies applying ML are unlikely to step outside of the "mainstream" in terms of model architecture (at least currently), and thus the conclusions aren't particularly relevant to them.
If you are lucky you can still see them for yourself by searching Google for mat-table. (It is a component of Angular Material, which the AI/ML thing kind of recognizes, though not without fuzzing the results to include an actual table with a mat on it as well. :-)
Edit: meanwhile, for the first time I can remember for years Google isn't pushing some dumb dating site front and center but rather text ads for some things that could be relevant for a software engineer with a lovely wife and small kids: mule integration, robotics stuff and holiday suggestions. Congrats to whoever managed to convince their boss (or the AI) to try some other options. I’d also be delighted if next year I’d get a well placed ad for a couple of the conferences I missed out on this year.
It is not just academia that is stuck with systems that are fast at some things and slow at other things.
Branching, pointer indirection, and variable length data structures all kill performance. That is where the serialization tax comes from. Arrays of numbers can be loaded into RAM, even memory mapped, and be used right away.
Commercial products are very much limited by what we know how to make fast.
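A small Python illustration of that serialization tax (names and sizes are arbitrary): a flat numeric array's bytes on disk *are* its in-memory representation, so it can be memory-mapped and used immediately, while a pointer-rich structure must be fully parsed and re-allocated before any element can be touched.

```python
import numpy as np
import os, pickle, tempfile

# Flat array of numbers: no parsing on load, can be lazily paged in.
a = np.arange(1_000_000, dtype=np.float64)
path = os.path.join(tempfile.mkdtemp(), "a.npy")
np.save(path, a)
b = np.load(path, mmap_mode="r")        # usable right away, memory-mapped
assert b[123456] == 123456.0

# Pointer-rich structure (list of tuples): every element must be
# decoded and re-allocated before any access is possible.
objs = [(i, str(i)) for i in range(1000)]
blob = pickle.dumps(objs)
assert pickle.loads(blob) == objs       # full decode, then use
```

This is one reason numerical computing keeps gravitating toward flat arrays of numbers even when the problem (graphs, ragged structures) doesn't naturally look like one.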
I think languages like Unison (http://unisonweb.org/posts/) are a bit early, but in a couple of decades there could be a strong institutional push toward languages with features like it.
This does seem to be a big problem. More than just frameworks not being amenable to alternative approaches, people just don't know than non-mainstream approaches exist. But if they find out about something different and try to use it, good luck.
It's not impossible though. For example, I have been thinking about learning OpenCL. And there are other GPGPU approaches like CUDA and a lot of less popular stuff.
https://dl.acm.org/citation.cfm?id=3321441 (click on "PDF" link to read)
Two orders of magnitude. That is pathetic.
[a] https://arxiv.org/abs/1710.09829