Creating a programming language from scratch part 1: Understanding syntax trees

Posted by Thomas Neumüller on 2022-04-26

In this new series I want to give a step by step instruction on how to build your own programming language. It has taken me multiple time-consuming attempts to acquire the experience necessary to build something like this and I have now successfully implemented my own programming language with features like namespaces, classes, constructors, operator-overloading and more.

I will explain in detail how to go about something like this. You can use this series as a tutorial and follow along or just give it a read for fun. If you have any interesting thoughts or requests, please leave them in the comments below!

Preamble

This is probably not how you would go about creating a programming language if you were actually aming to create one for productive use. Actually, I have no idea. I haven’t done any research whatsoever about the topic. For me, the fun was about finding out how to do it completely on my own. I have figured it out and thought some of you might be interested in how I did it. So take this for what it’s worth.

The general approach

How does one create a programming language? How is that even possible?

Source code is just text, inherently it has no meaning. What gives it meaning and makes it do something is another program written by someone else that reads that text and pulls it apart to make sense of it. There needs to be a specification and rules on how the code has to be structured so that the program reading the text and the developer can agree on what the code should look like and what can or cannot be done and how.

There are generally two classes of programs like that: Compilers and interpreters.

Both start by reading the source code and pulling it apart, but what happens afterwards is where they differ: The compiler then takes the destructured source code and converts it into an executable, essentially a sequence of instructions that are executable by the hardware without any additional steps. An interpreter reads the instructions and executes them directly. This may be a little bit harder to understand, so consider the following example:

Compilers vs interpreters - a real world example

There are two people, Clara and Ian. Both are given a sheet of paper with source code. Both start reading and understanding it.

After understanding what the code does, Clara takes another piece of paper and writes down very simple instructions, one after the other, that are so easy to understand and execute that anyone without any knowledge about programming can follow them. Clara did the job of a compiler.

Ian, after understanding the code, begins immediately doing what the code tells him to and loudly announces the result of the code on the piece of paper. He does not write down any additional code. Ian has interpreted the code.

Compilers vs interpreters - a technical example

More technically, imagine the following code:

1
print 1 + 2 + 3

A compiler reads the code and outputs another file that contains the following instructions:

1
2
3
add 1 and 2
take the result and add 3
print the result

An interpreter looks at the instructions and outputs the number 6.

Tokenization

The first important step towards making your program understand the text that represents source code is to destructure it into tokens. A token is the smallest meaningful unit in your programming language. Every token has a meaning, e.g. an operator like + or -, a keyword like return, a string literal like "Hello World!" or the name of a variable, like height. Tokenization is where you need to make the first important decisions about your programming language. You can of course make changes later, but making fundamental changes may break large parts of the rest of your code, so be careful. Don’t overthink it tho, this is just a fun project.

The way I went about it was the following: I defined which characters I wanted to define the end of a token and the beginning of a new one. One good example would be the semicolon; a semicolon is it’s own token and ends whatever comes before it, so “abcde;fg” will never be a single token, but be split into three tokens: “abcde”, “;” and “fg”. I want consecutive letters to belong to the same token, as well as consecutive numbers. Characters like ., ,, *, (, {, etc. should be where tokens are split. Also, spaces (including tabs, new lines, etc.) should split a token in two, but this time without becoming their own token. In my language, a space has no meaning. Languages like python use whitespace for deciding what line belongs to what block of code, if you want to replicate something like that, preserving whitespace during tokenization is important.

What I came up with was the following regular expression:

1
string rxSeparators = $"(\\s*([\(\)\[\].,:;{}<>""])\\s*|\\s+([\(\)\[\].,:;{}<>""])?\\s*)";

It specifies which characters are separators, specifically (, ), [, ], ., ,, :, ;, {, }, <, > and ". It then makes sure that leading spaces or spaces themselves are also being captured as a separator. By later trimming the captured strings and checking for their length, discarding zero-length strings, we can eliminate spaces from the tokens altogether.

You can start by implementing an algorithm that reads a text file and splits it into tokens and check if the result looks the way you want it to.

Strings

There is a special case that needs to be taken into account: In case you want your language to support them, string literals need to be treated differently. A string literal is a word in quotes, something like "Hello World!". This literal will currently be split into the tokens " Hello World ! ", which is not what we want. We want it to remain one unit.

What you need to during tokenization is to constantly check for quotes ". Once you find one, you walk all the way to the next one and consider everything in between a single token, including the quotation marks. Those are needed to later be able to identify the token as a string literal. If you don’t include the quotation marks and end up having a token int, you can’t tell whether it means the type int or whether it was a string containing the word "int". Also, usually there is a way to escape quotes within strings, otherwise there would be no way to include the quote character in a string. Normally that can be achieved by adding a back slash \ in front of the quote " to mark it as “not the end of the string”, but you can implement escaping in any way you want to, or leave it out for now.

Error messages

Disclaimer: Normally, syntax error messages would be generated by a tokenizer that uses a grammar to check the code before semantic parsing starts. I am not using a grammar in this series.

In order to output useful error messages to the user, like “Error in path/to/file.txt(27, 15): Variable “x” is not defined”, we need to remember the file name, line and column for each token. A token can therefore only be represented by an appropriate data structure, such as the following:

1
2
3
4
class Token {
public string Value, Filename;
public int Line, Column;
}

Building a syntax tree

After tokenization, the next step is to look at each token and decide in which way to build a syntax tree.

What is a syntax tree?

Computers work by executing one operation after the other from a big list of instructions. A list of instructions must therefore be the result of our parser, but how do we get there? A syntax tree is a neat way to put code into a meaningful structure that can then simply be executed from the top down. A syntax tree consists of nodes, which may or may not have children.

Look at this simple calculation: 4 + 3 * 2. In order to calculate the correct result, we need to add 4 to the result of the multiplication of 3 and 2. In other words, we need to first multiply 3 and 2, remember the result, and add it to 4.

In a syntax tree, the first instruction sits at the bottom. This is, because the last operation to be executed, in this case the + operation, asks it’s children, in this case 4 and 3 * 2 for their respective results. Because each syntax tree node may return a result, instead of execution we will use the term evaluation. Each node can be evaluated, producing a result. The inner workings of an evaluation are transparent to the parent node, all it knows is the result and how to use it for it’s own respective evaluation.

The above calculation, following the “multiplication before addition” rule, will result in the following syntax tree:

The root node is therefore the + operator. If we ask the root node to evaluate itself, it will ask it’s child nodes for the results of their respective evaluations and add both to produce a result. The evaluation of 4 will simply yield the value 4, the evaluation of * will trigger both the evaluation of 3 and 2, which will subsequently be multiplied and returned. + then has the values 4 and 6 as the results of the evaluation of it’s child nodes and returns 4 + 6 = 10 as it’s respective result. If + had a parent node, that node would now use this result to work out it’s own evaluation result.

In modern programming languages with a call stack, this logic is easy to implement. If you want to build an interpreter with a language that doesn’t feature modern call-and-return, you’re up for a task. A call stack is of vital importance for the rest of this series. Unless you’re some kind of masochist, I would strongly recommend to use a language that has the following features:

  • subroutines
  • classes
  • inheritance
  • generics / templates

I will use C#, because it’s the language I’m most familiar with, but many other languages like C++ or Java are perfectly adequate as well. Doing this in C will be a pain in the ass, unless you have lots of experience, in which case you probably don’t need to read this tutorial.

Expressions

Our first task will be to take a list of tokens and build a syntax tree from it. The easiest and most comfortable way to achive this is to make use of recursion. We will first take a look at what we call an expression: A single statement like a variable declaration, assignment or function call. The declaration of a datatype, control flow structure (while, if, for, …) or other, complex concepts are not an expression. Take a look at this:

✔️ Is an expression:

  • int x = 3 + 14;
  • parent.Execute();
  • 4 + 3 * 2;
  • DataType d;

Is not an expression:

  • class DataType {
  • if (a == true)
  • while (a == true)

Whether or not you allow expressions like 4 + 3 * 2; is up to you; in itself it doesn’t make much sense, because while it is syntactically correct and produces a result, the result is not used and goes to waste. Many languages forbid expressions like these, because they are 1. useless and 2. indicative of a mistake: No one in their right mind would put such an expression in a reasonable application, because it has no use. The programmer likely forgot to assign the result of the expression to a variable.

Recursive parsing

Expressions can be parsed recursively, because every part of a valid expression is in itself again a valid expression. Take this for example: int x = 3 + 14;

int x is a valid expression declaring a variable of type integer with name “x”. 3 + 14 is a valid expression, returning the result of the addition of two integer numbers, 3 and 14. 3 and 14 are themselves valid expressions, returning their respective values. This allows us to look at an expression and ask the following questions to decide how to proceed with parsing:

  • Does this expression contain an operator?
  • What is the return type of this expression?
  • Is this expression minimal?

Operator precedence

As you know from school, mathematical operators have precedence. Precedence is the order in which operators are evaluated. 3 + 14 * 3 is equivalent to 3 + (14 * 3), because * has higher precedence than +, the multiplication is evaluated first. When parsing an expression with a single operator, precedence is irrelevant. The expression 14 * 3 is being split in the middle and results in an OperatorMultiplyNode with two children, an IntegerValueNode(14) and an IntegerValueNode(3).

When multiple operators are involved, precedence must be observed. Because * has higher precedence than +, the expression must first be split at the + operator. 3 + 14 * 3 is therefore split into 3 and 14 * 3 first, resulting in the following syntax tree:

Operators must always be parsed from low to high precedence. Quick question to check if you’ve understood precedence correctly: Does = have higher or lower precedence than -?

The statement int x = 3 + 14 would be split into int x and 3 + 14 first, because the result of 3 + 14 must be assigned to the variable x. Adding int x = 3 and 14 together wouldn’t make much sense, as it would result in the number 17 which would then not be used for anything and be discarded, as well as a new variable x with value 3, which is not what one would intuitively expect to be the result of this expression.

The correct syntax tree for int x = 3 + 14 would therefore be

Thus, to answer the question, = has lower precedence than any mathematical operator.

Parsing an expression is a matter of searching for the operator with the lowest precedence and splitting the expression at that point, creating a new operator node (e.g. OperatorPlusNode). The two children of the new operator node will be the results of parsing the left and right hand side of the split expression, e.g. int x and 3 + 14.

Each node class will inherit from an abstract base class Node, which will define an abstract method Evaluate(). Each derived node class will then have to implement it’s very own evaluation. The implementation for OperatorPlusNode could look like this (C#):

1
2
3
4
5
6
7
8
9
10
11
abstract class Node {
protected Node Left, Right;

public abstract Value Evaluate();
}

class OperatorPlusNode : Node {
public override Value Evaluate() {
return Left.Evaluate() + Right.Evaluate();
}
}

Terminator characters

While some languages like JavaScript don’t require semicolons to terminate a statement, writing a parser to deal with optional semicolons is much harder than strictly requiring them. The parser needs some way of determining the end of a statement and a semicolon (or any other character) is the easiest. We will ignore identifying distinct expressions for now and focus on that later, because it ties closely to parsing data types and functions, which is not the topic of this article.

Conclusion

With what I’ve covered in this article, parsing simple mathematical expressions is now super easy:

  1. tokenize
  2. if more than one token, split by operator
  3. if exactly one token, convert to value node

Of course this list is simplyfied because it doesn’t deal with syntax errors. It would totally fail with parsing something like 1 + 2 +, or even with a valid expression like -1 + 4. This simple parser cannot yet deal with negative numbers, because the minus sign in this case is actually an operator with a single argument.

In the next article I will discuss negative numbers, operators with a different number of arguments than two, and parentheses. Stay tuned!


Thoughts or questions? Leave me a comment!