Dear Google, here's my rant about your compute libraries
I stopped ranting about software libraries and programming languages a while ago. But given how much is happening in AI right now, in both research and end products, I felt like ranting again. Everything here comes from my personal experience of using Google's libraries.
Disclaimer: again, this is all my personal opinion, and yours doesn't have to align with mine. That's totally fine. And no, you don't suddenly have to swap out a library in the middle of your project because I'm ranting about it.
So back in 2018, Google released Jax, because both Pytorch and Tensorflow were pretty cumbersome to use with TPUs. Cloud is one of Google's key money makers, and it doesn't look good when their in-house Tensorflow is difficult to use on Cloud TPUs. Jax solved that problem. But it also had features that neither of the other two libraries had at the time: function transforms (jit, vmap, ...) and a functional interface. Jax exposes a NumPy-like API, but compiles through XLA, so you can run the same code on accelerators such as TPUs or GPUs.
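To make that concrete, here's a minimal sketch of those two transforms; the tiny `predict` function and its shapes are made up purely for illustration:

```python
import jax
import jax.numpy as jnp

def predict(w, x):
    # A plain function of arrays -- Jax's functional style: no hidden state.
    return jnp.tanh(jnp.dot(x, w))

# jit: compile the function through XLA for CPU/GPU/TPU.
fast_predict = jax.jit(predict)

# vmap: map over a batch dimension without writing a loop;
# in_axes says "don't map over w, map over axis 0 of x".
batched = jax.vmap(predict, in_axes=(None, 0))

w = jnp.ones((3,))
xs = jnp.arange(6.0).reshape(2, 3)
print(batched(w, xs).shape)  # (2,)
```

The point is that `jit` and `vmap` are ordinary function-to-function transforms you compose at will, which is what neither Tensorflow nor Pytorch offered back then.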
I started using Jax in the latter part of 2021. It's fun to use, and it's really fast when you manage to "jit" your functions right. But it's all fun and games until you actually have to do something for which no documentation or example exists. Jax and its related ecosystem (Flax, dm-haiku, Optax, Trax, etc.) probably have the most obnoxious and cumbersome documentation among all the compute libraries out there. That's sad, especially when you consider that it comes from a company like Google and that most of their recent research is implemented in Jax. There are hardly any examples you can look at to get a hint about anything; often I had to dig into the source code just to get a rough understanding. They still don't have an official forum like Pytorch's, other than the discussion section on their GitHub repo. It's been five years since 2018. Holy!
Tensorflow isn't better off either
Tensorflow is really old. I did my entire bachelor's thesis on it, and I'd say Tensorflow 1.x was actually good: easy to use, with documentation that was serviceable enough during 2017-2018. Things went haywire after Tensorflow 2.x, when they frantically started replicating features from Pytorch. Pytorch uses a dynamic computation graph! On what Earth did it seem reasonable that a static-graph compiler could emulate a dynamic graph without a huge performance penalty? Sure, Keras still has a very no-nonsense interface for defining and training models, and it gets tangled if you want to tweak too much, but that's an issue you can work around (if you're patient enough to dig through the docs or source code). The reason I stopped using Tensorflow altogether and moved to Pytorch and Jax was one single thing crucial to deep learning: accelerator setup. So what actually happened?
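For readers who haven't used both: here's a toy sketch of what "dynamic computation graph" means in Pytorch (the function itself is made up for illustration). Ordinary Python control flow decides the graph on every forward pass, which is exactly what a static-graph compiler has to trace and re-trace:

```python
import torch

def f(x):
    # Data-dependent branching: the graph differs depending on the input.
    if x.sum() > 0:
        return (x * 2).sum()
    return (x ** 3).sum()

x = torch.randn(4, requires_grad=True)
y = f(x)
y.backward()        # autograd follows whichever branch actually ran
print(x.grad.shape)  # torch.Size([4])
```

In Pytorch this just works; emulating it on top of a static-graph engine means tracing, retracing, and the performance cliffs that come with them.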
TF 2.3 shipped pinned to a CUDA version older than the minimum version my GPU supported. Sure, it was a new GPU and required the latest CUDA, but you shouldn't be telling people there's nothing you can do about it. Look at Pytorch! They bundle their own CUDA and cuDNN binaries so that people don't have to worry about it. What's stopping you from doing that? Licensing? Then ask people to use Anaconda and install the cudatoolkit package from there. (Well, as of last week, the conda instructions on their official site don't work either: TF installs fine but doesn't find the CUDA files even when an explicit path flag has been provided.)
So yeah, there I was, with a project deadline, a new GPU, and no way in TF to use that GPU. I somehow managed to finish the project on Colab. That isn't ideal: not everything can run inside a Jupyter notebook environment (multiprocessing, for example). So I moved to Pytorch. Now, Pytorch isn't without its quirks, but with its documentation, the number of examples, and a very helpful and active user forum, you can expect to get your problem solved, or at the bare minimum get some direction on getting things done. And please don't get me started on all the optimization and memory-saving libraries Pytorch has.
So, Jax again, and where all this leads
Despite all its shortcomings, I very much like using Jax for certain tasks. You can't beat the speed-up Jax can provide when you use jit (all hail XLA). But I wish it had documentation like Pytorch's, or a discussion forum (whatever they have right now) that didn't feel like a desert.
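As a rough illustration of that speed-up claim, here's a toy micro-benchmark sketch; the workload is arbitrary and the numbers depend entirely on your hardware, but the shape of the measurement matters: you need a warm-up call (compilation happens on first call) and `block_until_ready()`, because Jax dispatches work asynchronously:

```python
import time
import jax
import jax.numpy as jnp

def step(x):
    # Arbitrary chain of array math that XLA can fuse into fewer kernels.
    for _ in range(10):
        x = jnp.tanh(x @ x.T).mean() * x
    return x

x = jnp.ones((200, 200))
jit_step = jax.jit(step)
jit_step(x).block_until_ready()  # warm-up: triggers XLA compilation

t0 = time.perf_counter()
step(x).block_until_ready()      # op-by-op dispatch
eager = time.perf_counter() - t0

t0 = time.perf_counter()
jit_step(x).block_until_ready()  # one fused, compiled computation
compiled = time.perf_counter() - t0
print(f"eager {eager:.4f}s vs jit {compiled:.4f}s")
```

Forget the warm-up or the `block_until_ready()` and the benchmark lies to you, which is precisely the kind of thing good documentation should warn about up front.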
ChatGPT is all the rage these days, right? Media outlets are gossiping about whether a dialogue system like ChatGPT could dethrone an information-retrieval system like Google Search, and about how much pressure Google is under because of it. OpenAI built ChatGPT. Do you know what they use? Pytorch. Who backs them now? Microsoft. What does Microsoft use? Pytorch. And not just them: Nvidia, Apple, Intel, and Tesla use it as well. With all these companies actively using a library, the documentation and related tools are bound to get better. Google is perfectly capable of doing the same; what holds them back, I don't know. So anyway, where does this lead?
People who work on building or researching such AI-centric systems (I personally don't like the term, but so be it) want their work process to be as frictionless as possible, so they can focus on the main problem instead of chasing mice under the table. Google's tools, which they market for exactly such purposes, only bring friction to the table: they're clunky, slow, the documentation is unhelpful, and setting up a minimal work environment is a chore. So if I, as someone on the research side, get fed up over and over with some company's tools, I'll begin losing my trust in them. It's a small domino effect that compounds into something humongous in the long run.
People at both Deepmind and Google Brain are doing wonderful work. The source code is there, the papers are on arXiv. But when they claim that library X was built by researchers to make research easier, that all their recent work is done with X, and then the public documentation looks like that of Jax or Tensorflow, it doesn't inspire much trust. Rather, it raises questions I don't want to ask here.
(And if you're a web developer, I'm pretty sure you remember what Google in their infinite wisdom did with Angular.)