Why Mojo🔥

A backstory and rationale for why we created the Mojo language.

When we started Modular, we had no intention of building a new programming language. But as we were building our platform with the intent to unify the world’s ML/AI infrastructure, we realized that programming across the entire stack was too complicated. Plus, we were writing a lot of MLIR by hand and not having a good time.

What we wanted was an innovative and scalable programming model that could target accelerators and other heterogeneous systems that are pervasive in machine learning. This meant a programming language with powerful compile-time metaprogramming, integration of adaptive compilation techniques, caching throughout the compilation flow, and other things that are not supported by existing languages.

And although accelerators are important, one of the most prevalent and sometimes overlooked “accelerators” is the host CPU. Today, CPUs have lots of tensor-core-like accelerator blocks and other AI acceleration units, but they also serve as the “fall back” for operations that specialized accelerators don’t handle, such as data loading, pre- and post-processing, and integrations with foreign systems. So it was clear that we couldn’t lift AI with an “accelerator language” that worked with only specific processors.

Applied AI systems need to address all these issues and we decided there was no reason it couldn’t be done with just one language. So Mojo was born.

We decided that our mission for Mojo would include innovations in compiler internals and support for current and emerging accelerators, but we didn’t see any need to innovate in language syntax or community. So we chose to embrace the Python ecosystem because it is so widely used, it is loved by the AI ecosystem, and because it is really nice!

Mojo as a member of the Python family

The Mojo language has lofty goals - we want full compatibility with the Python ecosystem, we would like predictable low-level performance and low-level control, and we need the ability to deploy subsets of code to accelerators. We also don’t want ecosystem fragmentation - we hope that people find our work to be useful over time, and don’t want something like the Python 2 => Python 3 migration to happen again. These are no small goals!

Fortunately, while Mojo is a brand new code base, we aren’t really starting from scratch conceptually. Embracing Python massively simplifies our design efforts, because most of the syntax is already specified. We can instead focus our efforts on building the compilation model and designing specific systems programming features. We also benefit from tremendous work on other languages and compilers (e.g. Clang, Rust, Swift, Julia, Zig, and Nim), and we leverage the MLIR compiler ecosystem. Finally, we draw on experience with the Swift programming language, which migrated most of a massive Objective-C community over to a new language.

Further, we decided that the right long-term goal for Mojo is to provide a superset of Python (i.e. be compatible with existing programs) and to embrace CPython immediately for long-tail ecosystem enablement. To a Python programmer, we expect and hope that Mojo will be immediately familiar, while also providing new tools for developing systems-level code, so that you can do the things Python currently falls back to C and C++ for. We aren’t trying to convince the world that “static is good” or “dynamic is good”. Our belief is that both are good when used for the right applications, and that the language should enable the programmer to make the call.

How compatible is Mojo with Python really?

Mojo already supports many core features of Python, including async/await, error handling, and variadics, but it is still very early and missing many features, so today it isn’t very compatible. Mojo doesn’t even support classes yet!

That said, we have experience with two major but different compatibility journeys. The first is Clang, a C, C++ and Objective-C (and CUDA, OpenCL, …) compiler that is part of LLVM. A major goal of Clang was to be a “compatible replacement” for GCC, MSVC and other existing compilers. It is hard to make a direct comparison, but the complexity of the Clang problem appears to be an order of magnitude bigger than implementing a compatible replacement for Python. The journey there gives us good confidence that we can do this right for the Python community.

Another example is the Swift programming language, which embraced the Objective-C runtime and language ecosystem and progressively shifted millions of programmers (and huge amounts of code) incrementally over to a completely different programming language. With Swift, we learned lessons about how to be “run-time compatible” and cooperate with a legacy runtime. In the case of Python and Mojo, we expect Mojo to cooperate directly with the CPython runtime and have similar support for integrating with CPython classes and objects without having to compile the code itself. This will allow us to talk to a massive ecosystem of existing code, but provide a progressive migration approach where incremental work put in for migration will yield incremental benefit.

Overall, we believe that the north star of compatibility, continued vigilance on design, and incremental progress towards full compatibility will get us to where we need to be in time.

Intentional differences from Python

While compatibility and migratability are key to success, we also want Mojo to be a first-class language on its own, and it cannot be hobbled by an inability to introduce new keywords or add a few grammar productions. As such, our approach to compatibility is twofold:

1. We utilize CPython to run all existing Python 3 code “out of the box” without modification and use its runtime, unmodified, for full compatibility with the entire ecosystem. Running code this way will get no benefit from Mojo, but the sheer existence and availability of this ecosystem will rapidly accelerate the bring-up of Mojo, and it leverages the fact that Python is already really great for high-level programming.
2. We will provide a mechanical migrator that provides very good compatibility for people who want to move Python code to Mojo. For example, Mojo provides a backtick feature that allows use of any keyword as an identifier, providing a trivial mechanical migration path for code that uses those keywords as identifiers or keyword arguments. Code that migrates to Mojo can then utilize the advanced systems programming features.

Together, this allows Mojo to integrate well in a mostly-CPython world, while letting Mojo programmers progressively move code (a module or file at a time) to Mojo. This approach was used and proven by the Objective-C to Swift migration that Apple performed. Swift code is able to subclass and utilize Objective-C classes, and programmers were able to adopt Swift incrementally in their applications. Swift also supports building APIs that are useful for Objective-C programmers, and we expect Mojo to be a great way to implement APIs for CPython as well.

It will take some time to build Mojo and the migration support, but we feel confident that this will allow us to focus our energies and avoid distractions. We also think the relationship with CPython can build from both directions - wouldn’t it be cool if the CPython team eventually reimplemented the interpreter in Mojo instead of C? 🔥

Detailed motivation

Mojo started with the goal of bringing an innovative programming model to accelerators and other heterogeneous systems that are pervasive in machine learning. That said, one of the most important and prevalent “accelerators” is actually the host CPU. These CPUs are getting lots of tensor-core-like accelerator blocks and other dedicated AI acceleration units, but they also serve as the “fall back” for operations the accelerators don’t support. This includes tasks like data loading, pre- and post-processing, and integrations with foreign systems written in languages such as C++.

As such, it became clear that we couldn’t build a limited accelerator language that targets a narrow subset of the problem (e.g. one that works only for tensors). We needed to support the full gamut of general purpose programming. At the same time, we didn’t see a need to innovate in syntax or community, and so we decided to embrace and complete the Python ecosystem.

Why Python?

Python is the dominant force in both the field of ML and countless other fields. It is easy to learn, known by important cohorts of programmers (e.g. data scientists), has an amazing community, has tons of valuable packages, and has a wide variety of good tooling. Python supports development of beautiful and expressive APIs through its dynamic programming features, which led machine learning frameworks like TensorFlow and PyTorch to embrace Python as a frontend to their high-performance runtimes implemented in C++.

For Modular today, Python is a non-negotiable part of our API surface stack - this is dictated by our customers. Given that everything else in our stack is negotiable, it stands to reason that we should start from a “Python First” approach.

More subjectively, we feel that Python is a beautiful language: it is designed with simple and composable abstractions, it eschews needless punctuation that is redundant-in-practice with indentation, and it is built with powerful (dynamic) metaprogramming features that are a runway to extend to what we need for Modular. We hope that those in the Python ecosystem see our new direction as taking Python ahead to the next level, completing it instead of trying to compete with it.

What’s wrong with Python?

Python has well-known problems: most obviously, poor low-level performance and CPython implementation decisions like the GIL. While there are many active projects underway to address these challenges, the issues brought by Python go deeper and particularly impact the AI field. Instead of talking about those technical limitations, we’ll talk about the implications of these limitations here in 2023.

Note that every reference to Python in this section means the CPython implementation. We’ll talk about other implementations in a bit.

The two-world problem

For a variety of reasons, Python isn’t suitable for systems programming. Fortunately, Python has amazing strengths as a glue layer, and low-level bindings to C and C++ allow building libraries in C, C++ and many other languages with better performance characteristics. This is what has enabled things like NumPy, TensorFlow and PyTorch and a vast number of other libraries in the ecosystem.
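As a rough illustration of the glue-layer pattern, the sketch below calls the C standard library's `strlen` from Python using the standard `ctypes` module. It assumes a POSIX system, where `ctypes.CDLL(None)` exposes the symbols already loaded into the running process. The wrapper name `c_strlen` is hypothetical, and real libraries typically use compiled extension modules rather than `ctypes`.

```python
import ctypes

# Assumption: a POSIX system, where CDLL(None) exposes the symbols already
# loaded into the running process, including the C standard library.
libc = ctypes.CDLL(None)

# The Python side has to restate the C signature by hand:
#   size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(data: bytes) -> int:
    """Return the length of a NUL-terminated byte string, computed in C."""
    return libc.strlen(data)


if __name__ == "__main__":
    print(c_strlen(b"mojo"))
```

Even in this tiny case, the Python author needs to know the C signature and the matching `ctypes` types, which hints at how much C knowledge a larger binding layer demands.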

Unfortunately, while this approach is an effective way to build high-performance Python libraries, it comes with a cost: building these hybrid libraries is very complicated. It requires a low-level understanding of CPython internals and knowledge of C/C++/… programming (undermining one of the original goals of using Python in the first place), makes it difficult to evolve large frameworks, and (in the case of ML) pushes the world towards “graph based” programming models, which have worse fundamental usability than “eager mode” systems. TensorFlow was an exemplar of this, but much of the effort in PyTorch 2 is focused around discovering graphs to enable more aggressive compilation methods.
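To make the “graph based” versus “eager mode” distinction concrete, here is a minimal toy sketch in plain Python. All class and function names (`Node`, `Placeholder`, `Constant`, `Add`, `Mul`, `wrap`, `eager_scale_and_shift`) are hypothetical and are not taken from TensorFlow or PyTorch. The sketch only shows the difference between computing a value immediately and recording operations for later execution.

```python
from dataclasses import dataclass


# Eager mode: each operation runs as soon as its line executes, so
# intermediate values can be printed or inspected in a debugger.
def eager_scale_and_shift(x: float) -> float:
    scaled = x * 2.0
    return scaled + 1.0


# Graph mode: operators only record nodes; nothing is computed until
# evaluate() is called with concrete inputs.
class Node:
    def __add__(self, other):
        return Add(self, wrap(other))

    def __mul__(self, other):
        return Mul(self, wrap(other))


@dataclass
class Placeholder(Node):
    name: str

    def evaluate(self, feeds):
        return feeds[self.name]


@dataclass
class Constant(Node):
    value: float

    def evaluate(self, feeds):
        return self.value


@dataclass
class Add(Node):
    left: Node
    right: Node

    def evaluate(self, feeds):
        return self.left.evaluate(feeds) + self.right.evaluate(feeds)


@dataclass
class Mul(Node):
    left: Node
    right: Node

    def evaluate(self, feeds):
        return self.left.evaluate(feeds) * self.right.evaluate(feeds)


def wrap(value):
    """Turn plain numbers into Constant nodes; pass existing nodes through."""
    return value if isinstance(value, Node) else Constant(float(value))


# Building the graph performs no arithmetic; it only records structure.
graph = Placeholder("x") * 2.0 + 1.0

if __name__ == "__main__":
    print(eager_scale_and_shift(3.0))
    print(graph.evaluate({"x": 3.0}))
```

In the graph version, an error or unexpected value surfaces only when `evaluate` runs, away from the line that built the offending node, which is one reason such models tend to be harder to use than eager execution.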

Beyond the fundamental nature of the two-world problem in terms of system complexity, it makes everything else in the ecosystem more complicated. Debuggers generally can’t step across Python and C code, and those that can aren’t widely accepted. It is a pain for the package ecosystems to deal with C/C++ code instead of a single world. Projects like PyTorch with significant C++ investments are intentionally trying to move more of their codebase to Python because they know it gains usability.

The three-world and N-world problem

The two-world problem is commonly felt across the Python ecosystem, but things are even worse for developers of machine learning frameworks. AI is pervasively accelerated, and those accelerators use bespoke programming languages like CUDA. While CUDA is a relative of C++, it has its own special problems and limitations, and does not have consistent tools like debuggers or profilers. It is also effectively locked to a single hardware maker!

The AI world has an incredible amount of innovation on the hardware front, and as a consequence, complexity is spiraling out of control. There are now many attempts to build limited programming systems for accelerators (OpenCL, SYCL, oneAPI, …). This complexity explosion continues, and none of these systems solve the fundamental fragmentation in tools and ecosystem that is hurting the industry so badly.

Mobile and server deployment

Another challenge for the Python ecosystem is deployment. There are many facets to this: some folks want to carefully control dependencies, some prefer to deploy hermetically compiled “a.out” files, and multithreading and performance are also very important. These are areas where we would like to see the Python ecosystem take steps forward.

