
While working on plans for a Commodore 64-style sci-fi video game, I came up with the idea of having the ship's computer act like a real '80s home computer.

Since I knew I wanted to add mini-games to the computer, I thought it would be cool to have a real programming language that players could use to make their own games on it.

The Commodore 64 uses a programming language called BASIC. BASIC is, well, a basic language, so I thought making my own dialect of it would be a fun challenge.

So how do you make your own programming language?

There are three steps required for plain text to be interpreted as functional code (sketched in code just after this list):

  • The first is a Lexer, which turns the provided text into 'tokens' that make the code easier to understand later in the process.
  • Next is a Parser, which takes the tokens and constructs what's called an 'abstract syntax tree': a data structure that describes how, and in what order, the instructions should be performed.
  • Lastly, the Interpreter, which reads the instructions and performs the corresponding actions.
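
In code, you can picture the pipeline something like this. This is a rough sketch only; these types are stand-ins I'm using to show the data flow, not classes from the actual project:

public record Token;      // produced by the Lexer (the real version is defined below)
public record SyntaxNode; // produced by the Parser

public interface IPipeline
{
	List<Token> Lex(string sourceLine);   // text -> tokens
	SyntaxNode Parse(List<Token> tokens); // tokens -> abstract syntax tree
	void Interpret(SyntaxNode root);      // tree -> performed actions
}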

This post will go over the creation of the Lexer. I will be writing it in C#, with the goal of the code running in a virtual machine inside of Unity. You can see the final file here.

A line of BASIC code could look something like this:

10 A = 5*C : PRINT A

In a tokenised form, it might look something like this:

Token: "LineNumber", Value: "10"
Token: "Identifier", Value: "A"
Token: "Equals", Value: "="
Token: "Number", Value: "5"
Token: "Multiply", Value: "*"
Token: "Identifier", Value: "C"
Token: "Colon", Value: ":"
Token: "Print", Value: "PRINT"
Token: "Identifier", Value: "A"

The first step in creating this lexer is defining what a token is. Currently, my tokens store two values: their type and their value.

public struct Token(TokenType Type, string Value)
{
	public TokenType tokenType = Type;
	public string value = Value ?? "";
}

A token's 'type' is stored in an enum, which is easier to parse in code than the string value.

public enum TokenType
{
	Identifier, Number, Text,
	Csr, Fre, Asc, Chr, Slc, Abs, Rnd, Cos, Tan, Sgn, Sqr, Flr, Cel, Exp,
	Print, Get, Key, Parse, Reset, Goto, Gosub, Return, End, Write, On, Time, Read, Com,
	If, Then, For, To, Step, Next, Or, Not, And,
	Save, Load, List, New, Dir, Run, Delete, AutoRun,
	EqualTo, NotEqualTo, Less, Greater, LessEqual, GreaterEqual,
	OpenBracket, CloseBracket, Comma, Colon, SemiColon, Percent, Dollar, At, Plus, Minus, Multiply, Divide, Equals,
	LineNumber,
	ILLEGAL
}

These token types include all commands, methods, operators, and punctuation in the SIMPLE language. It also includes data types such as text, numbers, and variables, as well as an illegal token for anything not recognised.

The goal of the lexer is to create a list of these tokens, based on the code inputted by the user. Note that the lexer does not care if the code is grammatically correct, or even if it contains illegal tokens, as these errors will be handled by the parser in the next step.
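
For example, a nonsense line like:

) PRINT (

would still happily tokenise as:

Token: "CloseBracket", Value: ")"
Token: "Print", Value: "PRINT"
Token: "OpenBracket", Value: "("

It's the parser's job, in the next step, to decide that this sequence is meaningless.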


To get started we'll need some variables. The Lexer will read through the text one character at a time, so we'll need an array of all the characters in the inputted code. We'll also need to remember which position in the array we're currently looking at. And lastly, we'll need a list of tokens that will represent a single line of source code. (A list of these lists will represent the entire program.)

List<Token> tokenLine = new(); // tokens for a single line of code

string lineText = ""; // literal source code
char[] lineCharacters = Array.Empty<char>(); // array of characters in lineText

int lineposition = 0; // position of current character
char C; // character being read

The easiest tokens to parse are the ones that are a single character, e.g. +, =, or (. To tokenise these, we can use a switch statement to check if the character we are currently reading is one of our tokens. Since SIMPLE ignores white-space, reading a space will move on to the next character.

switch(C)
{
	case '+':
		CreateToken(TokenType.Plus, "+");
		break;
	case '-':
		CreateToken(TokenType.Minus, "-");
		break;
	case '*':
		CreateToken(TokenType.Multiply, "*");
		break;
	case '/':
		CreateToken(TokenType.Divide, "/");
		break;
	case ' ': // white-space is skipped
		break;
}

These statements will be filled out for every single-character token. The CreateToken() method used simply makes a token, fills out its values, then adds it to the list of tokens for the line.

void CreateToken(TokenType type, string value)
{
	Token newToken = new(type, value);
	tokenLine.Add(newToken);
}

The next tokens to parse are numbers and text.

Text is anything enclosed in a pair of " characters, therefore we can check for it in the switch statement by adding:

switch(C)
{
	case '\"':
		ParseText();
		break;
	...

	...
}

The ParseText() method adds new characters to a string until it finds the closing ", then creates a token of type Text with the value of the string.

void ParseText()
{
	string newToken = "";

	lineposition++; // move past the "

	for (; lineposition < lineCharacters.Length; lineposition++)
	{
		if (!lineCharacters[lineposition].Equals('\"')) newToken += lineCharacters[lineposition];
		else break;
	}

	CreateToken(TokenType.Text, newToken);
}
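
One side effect of writing ParseText() this way: if the closing " is missing, the loop simply runs off the end of the line, and everything after the opening quote becomes the text token. In keeping with the lexer's hands-off approach, this isn't treated as an error here.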

For numbers, we can check if our character is a digit with char.IsNumber(C), and add new characters to a string until that check fails. This logic is a bit more complicated, as it allows for floating point numbers, and as such handles edge cases where the number inputted might be .12, 23., or even just a lone . on its own.

void ParseNumber()
{
	string newToken = "";

	if (lineCharacters[lineposition].Equals('.')) // if first character is '.', then assume 0.XYZ
	{
		newToken = "0"; // left as 0 because the loop will still evaluate the full stop

		if (lineposition + 1 >= lineCharacters.Length || !char.IsNumber(lineCharacters[lineposition + 1])) // only '.' was inputted, or the line ends here, therefore assumed number is 0.0
		{
			newToken = "0.0";
			CreateToken(TokenType.Number, newToken);
			return;
		}
	}

	bool stopLast = false; // true while the most recent character read was a '.'

	for (; lineposition < lineCharacters.Length; lineposition++)
	{
		char C = lineCharacters[lineposition];

		if (char.IsNumber(C) || C.Equals('.'))
		{
			if (stopLast) stopLast = false;
			if (C.Equals('.')) stopLast = true;
			newToken += C;
		}
		else
		{
			lineposition--;
			break;
		}
	}

	if (stopLast) newToken += "0"; // the number ended with '.', so pad it with a trailing zero (e.g. "23." becomes "23.0")

	CreateToken(TokenType.Number, newToken);
}
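
To trace a few inputs through this method: .12 produces a token value of 0.12, 23. produces 23.0, and a lone . produces 0.0.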

Since SIMPLE uses line numbers at the beginning of a line of code, a modified version of this method, one which doesn't allow for floating point numbers, is called before the rest of the line is parsed.
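
I won't reproduce the full method, but a minimal sketch of it could look like the following (ParseLineNumber is a stand-in name; the real method in the project may differ):

void ParseLineNumber()
{
	string newToken = "";

	for (; lineposition < lineCharacters.Length; lineposition++)
	{
		char C = lineCharacters[lineposition];

		if (char.IsNumber(C)) newToken += C; // digits only, no '.' allowed in line numbers
		else
		{
			lineposition--;
			break;
		}
	}

	CreateToken(TokenType.LineNumber, newToken);
}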


Lastly, there are identifiers. These are words not encased in quotes, which are either a language command, such as PRINT, or a variable name.

To parse these, we loop through the characters and check if they are letters. We also note whether we found any capital letters, as these are not allowed in SIMPLE.

void ParseIdentifier()
{
	string newToken = "";

	bool hasCaps = false;

	for (; lineposition < lineCharacters.Length; lineposition++)
	{
		char C = lineCharacters[lineposition];

		if (char.IsUpper(C)) hasCaps = true;
		if (char.IsLetter(C)) newToken += C;
		else
		{
			lineposition--;
			break;
		}
	}
	if (hasCaps)
	{
		CreateToken(TokenType.ILLEGAL, newToken); // commands and variables cannot have capital letters
		return;
	}

	if (Commands.TryGetValue(newToken.ToLower(), out TokenType value)) CreateToken(value, newToken); // identifier is a command, set the tokentype accordingly
	else CreateToken(TokenType.Identifier, newToken); // identifier is not recognised, must be a variable
}

Once the word has been parsed, we have to check if it's a command or a variable. To do this, we use the word as a key in a dictionary. If it matches an entry, it is one of the listed commands, and the corresponding token type is used to create the command token. If it isn't a match, the word is assumed to be a variable name and is created as an 'Identifier' token. And, as shown above, if the word contains capital letters, it becomes an illegal token.

Our keyword-to-token-type dictionary looks something like this. It is needed so that the lexer knows what text corresponds to what TokenType enum value.

Dictionary<string, TokenType> Commands = new() // command strings and their corresponding token type
{
	// Keywords
	{"csr", TokenType.Csr}, {"fre", TokenType.Fre}, {"asc", TokenType.Asc}, {"chr", TokenType.Chr},
	{"slc", TokenType.Slc}, {"abs", TokenType.Abs}, {"rnd", TokenType.Rnd}, {"cos", TokenType.Cos},
	{"tan", TokenType.Tan}, {"sgn", TokenType.Sgn}, {"sqr", TokenType.Sqr}, {"flr", TokenType.Flr},
	{"cel", TokenType.Cel}, {"exp", TokenType.Exp}, ... , ... , ... 
}
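
As a quick illustration of the lookup (the words here are hypothetical, and I'm assuming "print" sits among the dictionary's elided entries):

Commands.TryGetValue("print", out TokenType type); // returns true, type is TokenType.Print
Commands.TryGetValue("score", out _);              // returns false: not a command, so it would become an Identifier token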

So overall, our switch statement checks for all the single-character tokens (and the " character), and if the character isn't any of those, it must be a number or an identifier. If it is none of these, an illegal token is created.

switch(C)
{
	...

	...

	default: // not a symbol, must be a number or letter, else illegal
		if (char.IsNumber(C) || C.Equals('.')) ParseNumber();
		else if (char.IsLetter(C)) ParseIdentifier();
		else CreateToken(TokenType.ILLEGAL, C.ToString());
		break;
}
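
For completeness, here is roughly how everything above might be driven. This is a sketch under my own assumptions; the method name Tokenise and the exact wiring are mine, and I'm ignoring the leading line-number step for brevity:

List<Token> Tokenise(string source)
{
	tokenLine.Clear(); // start a fresh line of tokens
	lineText = source;
	lineCharacters = lineText.ToCharArray();

	for (lineposition = 0; lineposition < lineCharacters.Length; lineposition++)
	{
		C = lineCharacters[lineposition];

		switch (C)
		{
			// ... all the cases shown above ...
			default:
				if (char.IsNumber(C) || C.Equals('.')) ParseNumber();
				else if (char.IsLetter(C)) ParseIdentifier();
				else CreateToken(TokenType.ILLEGAL, C.ToString());
				break;
		}
	}

	return tokenLine;
}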

To conclude, with the outlined code, an example input of:

10 A = A*(2+5) : PRINT "RESULT" ; A

Would output as:

Token: "LineNumber", Value: "10"
Token: "Identitifer", Value: "A"
Token: "Equals", Value: "="
Token: "Identitifer", Value: "A"
Token: "Multiply", Value: "*"
Token: "OpenBracket", Value: "("
Token: "Number", Value: "2"
Token: "Plus", Value: "+"
Token: "Number", Value: "5"
Token: "CloseBracket", Value: ")"
Token: "Colon", Value: ":"
Token: "Print", Value: "PRINT"
Token: "Text", Value: "RESULT"
Token: "SemiColon", Value: ";"
Token: "Identitifer", Value: "A"

Specifically, the output would be a series of Token structs stored inside of the tokenLine list. The above 'result' is the output of this code:

foreach (Token token in tokenLine)
{
	Console.WriteLine($"Token: \"{token.tokenType}\", Value: \"{token.value}\"");
}

So that was the outline of my Lexer! If you want to see the whole program file, click here.

I couldn't find a lot of information about creating a lexer in C#, so that's why I decided to share my program. Logan Patterson's video series about creating a lexer was incredibly helpful in making mine; his was written in Go.

Next up is the parser, which is much more complicated to write. I'll write a post about it once it's done.
Thanks for reading!