So, you register a company. In order to do that, you need to deposit some money. Then, you come up with a name for your company, which is rejected. Ultimately, you either give up or realize that someone else built a shopping mall in the meantime. But we are not genuine architects.
We are software engineers, and we have the privilege of bringing our ideas to life with no permits or bureaucracy. The only things we need are spare time and the will to spend it on programming instead of sudoku puzzles. If you still insist that the motivation for programming cannot be pure curiosity, let me just mention that the first programming language I designed was developed out of necessity, not mere curiosity.
I am going to develop the programming language as we go through the series, so there is a possibility that I will develop some bugs as well. They should be ironed out as we approach the end of the series. The complete source code for everything described here is freely available on GitHub.

I need a key-value container that retrieves values quickly, in logarithmic time. I am completely against unnecessary optimization, but in this case, I feel that a lot of memory would be wasted for nothing.
Not only that, but later we will need to implement a maximal munch algorithm on top of such a container, and a map is not the best container for that, so I will use a sorted vector instead. The only problem with that approach is lower code readability, as we need to keep in mind that the vector is sorted, so I developed a small class that enforces that constraint. All functions, classes, and so on are declared in a dedicated namespace, which I will omit for readability. As you can see, the implementation of this class is quite simple.

The tokenizer will read characters from the input stream.
At this stage of the project, it is hard for me to decide what the input stream will be, so I will use std::function instead. I will follow well-known conventions from C-style stream functions such as getc, which returns an int instead of a char and uses a negative number to signal the end of a file. However, it is really convenient to read a couple of characters in advance, before the tokenizer commits to a token type.
To that end, I implemented a stream that will allow us to unread some characters. To save space, I will omit the implementation details, which you can find on my GitHub page.

The tokenizer is the lowest-level compiler component. There are four types of tokens that are of interest to us. To ensure the appeal and planetary popularity of this language, we will use well-known C-like syntax. The ones without a prefix are operators.
Almost all of them follow the C convention. I am a firm believer that single-statement blocks without curly braces are not worth it. Therefore, elif can be used to eliminate braceless statements completely. Whether or not we allow it is a decision that can wait for now. Otherwise, the tokenizer will assume that it is an identifier. If it reads more, it will unread all extra characters it has read after the longest recognized operator.
Basically, we treat all operators as candidates at the beginning. Whenever the range is non-empty, we check whether the first element in the range has no more characters (if such an element exists, it is always at the beginning of the range, as the lookup is sorted). In that case, we have a legal operator matched. The longest such operator is the one that we return.
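The candidate-narrowing loop described above might be sketched as follows (assuming a lexicographically sorted list of operator spellings and getc-style `get`/`unread` callbacks; the function name and signature are illustrative):

```cpp
#include <algorithm>
#include <functional>
#include <string>
#include <vector>

// Maximal munch over a sorted list of operator spellings. We keep
// narrowing the range of candidates that start with the characters
// read so far; whenever the first element of the range equals the
// prefix itself, it is a complete (legal) operator. At the end we
// unread everything past the longest one matched.
std::string match_operator(const std::vector<std::string>& sorted_ops,
                           const std::function<int()>& get,
                           const std::function<void(int)>& unread) {
    std::string prefix;  // characters consumed so far
    std::string best;    // longest complete operator seen so far
    size_t extra = 0;    // characters consumed past `best`

    auto first = sorted_ops.begin();
    auto last = sorted_ops.end();

    for (int c = get(); c >= 0; c = get()) {
        prefix.push_back(static_cast<char>(c));
        ++extra;
        // Narrow to candidates whose first prefix.size() characters
        // match the prefix; the list is sorted, so this is a pair of
        // binary searches.
        auto range = std::equal_range(
            first, last, prefix,
            [n = prefix.size()](const std::string& a, const std::string& b) {
                return a.compare(0, n, b, 0, n) < 0;
            });
        first = range.first;
        last = range.second;
        if (first == last) {
            break;  // no operator starts with this prefix
        }
        if (*first == prefix) {
            // The shortest candidate sits at the front of the sorted
            // range; if it equals the prefix, we have a legal operator.
            best = prefix;
            extra = 0;
        }
    }
    // Unread, in reverse, everything past the longest recognized
    // operator (a LIFO push-back stream then replays them in order).
    for (size_t i = prefix.size(); i > prefix.size() - extra; --i) {
        unread(prefix[i - 1]);
    }
    return best;  // empty string if nothing matched
}
```

For example, with the input `<<=1` and C-like operators, the range narrows from all operators starting with `<`, to those starting with `<<`, to exactly `<<=`; the trailing `1` is then unread.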
We will unread all the characters that we eventually read after that. Since tokens are heterogeneous, a token is a convenience class that wraps a std::variant of the different token types. The identifier is just a class with a single member of the std::string type. It is there for convenience as, in my opinion, std::variant is cleaner if all of its alternatives are different types.
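Put together, a token wrapper along these lines might look like this (the exact set of variant alternatives and the member names are illustrative):

```cpp
#include <string>
#include <utility>
#include <variant>

// An identifier gets its own single-member struct instead of being a
// bare std::string, so that every std::variant alternative is a
// distinct type and std::holds_alternative/std::get stay unambiguous.
struct identifier {
    std::string name;
};

struct eof {};  // marker type for the end of input

// Illustrative stand-in for the reserved keywords and operators.
enum class reserved_token { op_plus, kw_if };

class token {
public:
    using value_type =
        std::variant<reserved_token, double, std::string, identifier, eof>;

    explicit token(value_type value) : _value(std::move(value)) {}

    bool is_identifier() const {
        return std::holds_alternative<identifier>(_value);
    }
    const identifier& get_identifier() const {
        return std::get<identifier>(_value);
    }
    bool is_eof() const { return std::holds_alternative<eof>(_value); }

private:
    value_type _value;
};
```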
So you may trade off the ease of parsing with execution complexity.

Before you write either an interpreter or a compiler, you need to design a language. You can spend time studying category theory and type systems and programming paradigms, but the best thing you can do for yourself at this stage is to study ideas from a wide breadth of programming languages.
Semantics are also more difficult to change after the fact than syntax. There are some questions to ask yourself to start thinking about language semantics and ergonomics. In this phase, I usually keep a text document where I experiment with syntax by writing small programs in the new, as-yet-nonexistent language.
I write notes and questions for myself in comments in the code, and try to implement small programs, like recursive mathematical functions, sorting algorithms, and small web servers. I usually hold myself to this stage until most of my questions about syntax and semantics are answered. I also use this phase to test my hypotheses about the core ideas behind the language design.
At their core, interpreters and compilers are layers and layers of small transformations of your source code, a string of text, into some form executable by your computer. The art of designing one boils down to picking good intermediate formats for these transformations to work with, and efficient runtime data structures to encode the values and functions of your language within a running program.
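As a toy illustration of such layering, here is a two-stage pipeline, text to tokens to value, for expressions like `1 + 2 + 3`. It is entirely illustrative (single-digit numbers, addition only): the point is that each layer is just a function from one intermediate format to the next.

```cpp
#include <cctype>
#include <string>
#include <vector>

// The token stream is the intermediate format between the two layers.
struct tok {
    char kind;  // 'n' for number, '+' for the plus operator
    int value;  // meaningful only when kind == 'n'
};

// Layer 1: characters -> tokens.
std::vector<tok> tokenize(const std::string& src) {
    std::vector<tok> out;
    for (char c : src) {
        if (std::isdigit(static_cast<unsigned char>(c))) {
            out.push_back({'n', c - '0'});
        } else if (c == '+') {
            out.push_back({'+', 0});
        }  // anything else (whitespace) is skipped
    }
    return out;
}

// Layer 2: tokens -> value. For this tiny grammar, evaluation is a
// simple left fold over the numbers; a real language would build an
// AST or bytecode here instead.
int evaluate(const std::vector<tok>& tokens) {
    int result = 0;
    for (const tok& t : tokens) {
        if (t.kind == 'n') {
            result += t.value;
        }
    }
    return result;
}
```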
Both series of articles take you from a complete beginner in programming languages to designing and building a simple interpreter step by step. While many people have followed these guides and built versions of their interpreter line by line from the book, I skimmed them and used the chapters as references throughout my own journey rather than following the instructions step by step. Nonetheless, these books provide very solid groundwork on top of which you can study additional topics in how interpreters and compilers work.
Lastly, I learned a lot during my work on interpreters by reading the implementations of well-designed and well-documented interpreters and compilers. Of those, I enjoyed several in particular. The list here is just the tip of the iceberg of the blog articles, case studies, and GitHub issue threads I pored over to add meat to the skeleton I had in my mind of how interpreters and compilers worked.
Programming languages is a deep and broad topic! As I mentioned above, even after an initial implementation, a toy programming language gives you all the flexibility and creative headroom in the world to implement more advanced features, concepts, and optimizations to tinker away at in your project. One such direction is more sophisticated type systems.
Studying type systems piqued my interest in two related topics: category theory, and implementations of elegant data structures with advanced type systems in languages like Haskell and Elm. Another is performance: how can you make your interpreter or compiler faster? And what does it even mean for it to be faster?