Machine Learning Models
Kernl is an open source project
that optimizes and accelerates your PyTorch model
```python
from transformers import AutoModel
from kernl.model_optimization import optimize_model

model = AutoModel.from_pretrained(model_name).eval().cuda()

# model optimization in one line of code 🙂
optimize_model(model)
```
What is kernl?
Kernl's goal is to optimize the most common models in one line and simplify the way you work.
Its philosophy is to remain simple and accessible.
No need to rewrite your PyTorch model, you stay in the comfort of Python to train and infer.
For advanced cases, Kernl provides resources such as a debugger, tutorials, etc., to allow everyone to tweak and optimize their own models with OpenAI Triton.
Triton is more accessible than CUDA: there is no need to relearn everything, as we remain in the world of PyTorch.
Performant & efficient solution
Kernl is based on kernel fusion and relies on open source technologies such as CUDA Graphs, OpenAI Triton, and TorchDynamo.
This combination drastically reduces memory accesses, eliminates CPU overhead, and ultimately makes models significantly faster.
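To see why fusion reduces memory accesses, here is a pure-Python sketch (illustrative only, not Kernl code): each unfused operation reads its whole input from memory and writes its whole output back, while a fused operation makes a single pass.

```python
# Illustrative sketch (not Kernl code): why fusing elementwise ops
# reduces memory traffic. Each unfused op reads its input and writes
# its output; a fused op reads once and writes once.

def unfused(x):
    y = [max(v, 0.0) for v in x]   # relu: n reads + n writes
    z = [v * 2 for v in y]         # scale: n reads + n writes
    return z                       # total: ~4n element accesses

def fused(x):
    # relu and scale in one pass: n reads + n writes = ~2n accesses
    return [max(v, 0.0) * 2 for v in x]

x = [-1.0, 0.5, 3.0]
assert unfused(x) == fused(x)  # same result, roughly half the traffic
```

On a GPU the same idea applies to tensors: the intermediate result of the unfused version never has to round-trip through global memory once the two kernels are fused into one.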
At Lefebvre Sarrut, we dedicate our innovation and R&D initiatives to making knowledge in law, tax, and regulation more accessible.
We already run several large language models to make law more accessible.
We need to explore and iterate quickly, at low cost, to train and run inference on our own models without depending on solutions we previously used, such as TensorRT or ONNX.
Our goal with Kernl is to be able to optimize any model, simply and efficiently, while remaining autonomous and independent of complex CUDA code.
Open source and ethics
Providing educational materials to help you is one of our goals because sharing is part of our DNA.
Kernl is part of an Open Source approach because we firmly believe in its virtues of sharing and exchange.
We are working to make the project as accessible as possible and we encourage everyone to contribute in their own way.
Please feel free to consult the contribution guide.
The purpose of kernl is to make the latest models more accessible to a wider audience of developers, with time and cost efficiency at heart.
By doing so, we not only democratize large language models but also contribute to a more resilient AI ecosystem.
How is it efficient?
Kernel fusion is based on a simple recipe:
- Make a graph of the model with PyTorch FX and TorchDynamo
- Identify the costly operations (e.g. attention, linear layers)
- Dynamically replace them with an OpenAI Triton operation that fuses them
- And keep the pre-existing optimizations
This simple recipe drastically reduces GPU memory bandwidth bottleneck and accelerates inference and training.
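The recipe above can be sketched in plain Python (a hypothetical toy, not Kernl's actual API): walk a captured graph of operations, spot a costly pattern, and swap it for a single fused node while leaving everything else untouched.

```python
# Hypothetical sketch of the recipe (names are illustrative, not
# Kernl's API). A model graph, as captured by TorchDynamo/FX, is
# modeled here as a simple list of op names.

def fuse_pattern(graph, pattern, fused_name):
    """Replace each occurrence of `pattern` (a run of op names)
    with one fused op, keeping all other nodes as-is."""
    out, i = [], 0
    while i < len(graph):
        if graph[i:i + len(pattern)] == pattern:
            out.append(fused_name)   # one fused kernel instead of several
            i += len(pattern)
        else:
            out.append(graph[i])     # pre-existing node kept unchanged
            i += 1
    return out

# Toy graph with an attention-like subgraph inside it:
graph = ["embed", "matmul", "softmax", "matmul", "layernorm"]
attention = ["matmul", "softmax", "matmul"]
print(fuse_pattern(graph, attention, "fused_attention"))
# -> ['embed', 'fused_attention', 'layernorm']
```

In the real project, the graph comes from PyTorch FX / TorchDynamo and the fused replacement is an OpenAI Triton kernel, but the structure of the rewrite is the same: find the pattern, substitute one node, keep the rest.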
Pretty crazy performance gains
Kernel fusion is a part of our optimizations. By fusing kernels, GPU memory accesses are significantly reduced and CPU overhead is eliminated, which reduces inference latency and increases training speed.
For example, BERT is up to 12 times faster than the Hugging Face baseline.
T5 is also 6 times faster (and we are still halfway through the optimizations!).