Writing a simple Compiler on my own - Abstract Syntax Tree Management [C][Flex][Bison]

[Custom Thumbnail]

All the Code of the series can be found at the Github repository:


    Hello it's a me again @drifter1! Today we continue with my Compiler Series, a series where we implement a complete compiler for a simple C-like language by using the C-tools Flex and Bison. In this article we continue the AST implementation, by writing the needed functions for creating nodes and traversing the tree (at first for testing purposes only).

The topics that we will cover today are:

  1. Tweaking the Nodes
  2. Writing the remaining node creation functions
  3. Implementing a AST traversal function
  4. Testing out the AST tree "manually"


    Actually you need to read and understand all the topics that I covered in the series as a whole, as these articles will give you access to knowledge about:
  • What Compiler Design is (mainly the steps)
  • For which exact Language the Compiler is build for (Tokens and Grammar)
  • How to use Flex and Bison
  • How to implement a lexer and parser for the language using those tools
  • What the Symbol Table is and how we implement it
  • How we combine Flex and Bison together
  • How we can pass information from the Lexer to the Parser
  • How we define operator priorities, precedencies and associativity
  • What Semantic Analysis is (Attributes, SDT etc.)
  • How we do the so called "Scope Resolution"
  • How we declare types and check the type inter-compatibility for different cases, including function parameters
  • How we check function calls later on using a "Revisit queue", as function declarations mostly happen after functions get used (in our Language)
  • Intermediate Code Representations, including AST's
  • How we implement an AST


Talking about the series in general this series can be rated:
  • Intermediate to Advanced
Today's topic(s) can be rated:
  • Medium
So, without further ado, let's now finally start with the actual Tutorial...

Actual Tutorial Content

Tweaking the Nodes

    The first thing that we will do in today's follow-up article is fixing the nodes. It's not much that changes. Some of the nodes just have dynamic arrays for which we don't declared an item count variable. So, now:
  • The Declaration node will contain a "names_count" variable, so that we know the number of entries in the "names" array.
  • The If Node will contain a "elseif_count" variable, so that we know the number of else if branches stored in the "elsif_branches" array.
  • The Function call node will contain a "num_of_pars" variables, so that we know the number of parameters in the "params" array. It's the same name that we use in the symbol table entry!
    I don't know if you noticed, but I don't write else if, but elsif, which is something that is used in the hardware programming language VHDL. This is something that I also did last time. I guess it's quite fun! :P

So, in Code these structures now are:
typedef struct AST_Node_Decl{ enum Node_Type type; // node type
// data type int data_type;
// symbol table entries of the variables list_t **names; int names_count; }AST_Node_Decl;

typedef struct AST_Node_If{ enum Node_Type type; // node type
// condition struct AST_Node *condition;
// if branch struct AST_Node *if_branch;
// else if branches struct AST_Node **elsif_branches; int elseif_count;
// else branch struct AST_Node *else_branch; }AST_Node_If;

typedef struct AST_Node_Func_Call{ enum Node_Type type; // node type
// function identifier list_t *entry;
// call parameters AST_Node **params; int num_of_pars; }AST_Node_Func_Call;

You will understand why we did these changes later on!

The remaining Node Creation Functions

    By making a change in the Declaration node definition we of course also have to change the creation function for this node. So, let's do this first! It's pretty simple, we only have to add a new parameter and set the new entry to the value of this parameter. The new function is:
AST_Node *new_ast_decl_node(int data_type, list_t **names, int names_count){
    // allocate memory
    AST_Node_Decl *v = malloc (sizeof (AST_Node_Decl));
// set entries v->type = DECL_NODE; v->data_type = data_type; v->names = names; v->names_count = names_count;
// return type-casted result return (struct AST_Node *) v; }

    This is something that we actually do for every single node case! We just have to add the entries of each node as function parameters and then set all the corresponding entries to the right parameter "value".

So, let's do this:
/* Statements */
AST_Node *new_ast_if_node(AST_Node *condition, AST_Node *if_branch, AST_Node **elsif_branches, int elseif_count, AST_Node *else_branch){ // allocate memory AST_Node_If *v = malloc (sizeof (AST_Node_If));
// set entries v->type = IF_NODE; v->condition = condition; v->if_branch = if_branch; v->elsif_branches = elsif_branches; v->elseif_count = elseif_count; v->else_branch = else_branch;
// return type-casted result return (struct AST_Node *) v; }
AST_Node *new_ast_elsif_node(AST_Node *condition, AST_Node *elsif_branch){ // allocate memory AST_Node_Elsif *v = malloc (sizeof (AST_Node_Elsif));
// set entries v->type = ELSIF_NODE; v->condition = condition; v->elsif_branch = elsif_branch;
// return type-casted result return (struct AST_Node *) v; }
AST_Node *new_ast_for_node(AST_Node *initialize, AST_Node *condition, AST_Node *increment, AST_Node *for_branch){ // allocate memory AST_Node_For *v = malloc (sizeof (AST_Node_For));
// set entries v->type = FOR_NODE; v->initialize = initialize; v->condition = condition; v->increment = increment; v->for_branch = for_branch;
// return type-casted result return (struct AST_Node *) v; }
AST_Node *new_ast_while_node(AST_Node *condition, AST_Node *while_branch){ // allocate memory AST_Node_While *v = malloc (sizeof (AST_Node_While));
// set entries v->type = WHILE_NODE; v->condition = condition; v->while_branch = while_branch;
// return type-casted result return (struct AST_Node *) v; }
AST_Node *new_ast_assign_node(list_t *entry, AST_Node *assign_val){ // allocate memory AST_Node_Assign *v = malloc (sizeof (AST_Node_Assign));
// set entries v->type = ASSIGN_NODE; v->entry = entry; v->assign_val = assign_val;
// return type-casted result return (struct AST_Node *) v; }
AST_Node *new_ast_simple_node(int statement_type){ // allocate memory AST_Node_Simple *v = malloc (sizeof (AST_Node_Simple));
// set entries v->type = SIMPLE_NODE; v->statement_type = statement_type;
// return type-casted result return (struct AST_Node *) v; }
AST_Node *new_ast_incr_node(list_t *entry, int incr_type, int pf_type){ // allocate memory AST_Node_Incr *v = malloc (sizeof (AST_Node_Incr));
// set entries v->type = INCR_NODE; v->entry = entry; v->incr_type = incr_type; v->pf_type = pf_type;
// return type-casted result return (struct AST_Node *) v; }
AST_Node *new_ast_func_call_node(list_t *entry, AST_Node **params, int num_of_pars){ // allocate memory AST_Node_Func_Call *v = malloc (sizeof (AST_Node_Func_Call));
// set entries v->type = FUNC_CALL; v->entry = entry; v->params = params; v->num_of_pars = num_of_pars;
// return type-casted result return (struct AST_Node *) v; }
/* Expressions */
AST_Node *new_ast_arithm_node(enum Arithm_op op, AST_Node *left, AST_Node *right){ // allocate memory AST_Node_Arithm *v = malloc (sizeof (AST_Node_Arithm));
// set entries v->type = ARITHM_NODE; v->op = op; v->left = left; v->right = right;
// return type-casted result return (struct AST_Node *) v; }
AST_Node *new_ast_bool_node(enum Bool_op op, AST_Node *left, AST_Node *right){ // allocate memory AST_Node_Bool *v = malloc (sizeof (AST_Node_Bool));
// set entries v->type = BOOL_NODE; v->op = op; v->left = left; v->right = right;
// return type-casted result return (struct AST_Node *) v; }
AST_Node *new_ast_rel_node(enum Rel_op op, AST_Node *left, AST_Node *right){ // allocate memory AST_Node_Rel *v = malloc (sizeof (AST_Node_Rel));
// set entries v->type = REL_NODE; v->op = op; v->left = left; v->right = right;
// return type-casted result return (struct AST_Node *) v; }
AST_Node *new_ast_equ_node(enum Equ_op op, AST_Node *left, AST_Node *right){ // allocate memory AST_Node_Equ *v = malloc (sizeof (AST_Node_Equ));
// set entries v->type = EQU_NODE; v->op = op; v->left = left; v->right = right;
// return type-casted result return (struct AST_Node *) v; }
/* Functions */
AST_Node *new_ast_func_decl_node(int ret_type, list_t *entry){ // allocate memory AST_Node_Func_Decl *v = malloc (sizeof (AST_Node_Func_Decl));
// set entries v->type = FUNC_DECL; v->ret_type = ret_type; v->entry = entry;
// return type-casted result return (struct AST_Node *) v; }
AST_Node *new_ast_return_node(int ret_type, AST_Node *ret_val){ // allocate memory AST_Node_Return *v = malloc (sizeof (AST_Node_Return));
// set entries v->type = FUNC_DECL; v->ret_type = ret_type; v->ret_val = ret_val;
// return type-casted result return (struct AST_Node *) v; }

AST Traversal Function

    The most important topic is not how we create these nodes, but how we use them! We type-cast all the nodes to the type "AST_Node", but can go back to the previous type, by just type-casting back to the original type. This original type is being stored in the enum variable "type". By doing a simple switch case statement with the "type" as parameter, we can easily separate the types.

Print Function

    So, let's create a print function that gives out the type of node and also some information about the none-Node entries. This function will then be used in the AST traversal function, so that we can easily print out the information of a node It will take an AST_Node as parameter and then type-cast it to the correct type depending on the node type. So, the first thing that we have to declare is a temporary node pointer for each of the ast node structures that store none-Node information. In the end we can then easily access the entries of this type.
The code looks as following:
void ast_print_node(AST_Node *node){
    /* temp nodes */
    AST_Node_Decl *temp_decl;
    AST_Node_Const *temp_const;
    AST_Node_If *temp_if;
    AST_Node_Assign *temp_assign;
    AST_Node_Simple *temp_simple;
    AST_Node_Incr *temp_incr;
    AST_Node_Func_Call *temp_func_call;
    AST_Node_Arithm *temp_arithm;
    AST_Node_Bool *temp_bool;
    AST_Node_Rel *temp_rel;
    AST_Node_Equ *temp_equ;
    AST_Node_Func_Decl *temp_func_decl;
    AST_Node_Return *temp_return;
switch(node->type){ case BASIC_NODE: printf("Basic Node\n"); break; case DECL_NODE: temp_decl = (struct AST_Node_Decl *) node; printf("Declaration Node of data-type %d for %d names\n", temp_decl->data_type, temp_decl->names_count); break; case CONST_NODE: temp_const = (struct AST_Node_Const *) node; printf("Constant Node of const-type %d\n", temp_const->const_type); break; case IF_NODE: temp_if = (struct AST_Node_If *) node; printf("If Node with %d elseifs\n", temp_if->elseif_count); break; case ELSIF_NODE: printf("Elsif Node\n"); break; case FOR_NODE: printf("For Node\n"); break; case WHILE_NODE: printf("While Node\n"); break; case ASSIGN_NODE: temp_assign = (struct AST_Node_Assign *) node; printf("Assign Node of entry %s\n", temp_assign->entry->st_name); break; case SIMPLE_NODE: temp_simple = (struct AST_Node_Simple *) node; printf("Simple Node of statement %d\n", temp_simple->statement_type); break; case INCR_NODE: temp_incr = (struct AST_Node_Incr *) node; printf("Increment Node of entry %s being %d %d\n", temp_incr->entry->st_name, temp_incr->incr_type, temp_incr->pf_type); break; case FUNC_CALL: temp_func_call = (struct AST_Node_Func_Call *) node; printf("Function Call Node with %d parameters\n", temp_func_call->num_of_pars); break; case ARITHM_NODE: temp_arithm = (struct AST_Node_Arithm *) node; printf("Arithmetic Node of operator %d\n", temp_arithm->op); break; case BOOL_NODE: temp_bool = (struct AST_Node_Bool *) node; printf("Boolean Node of operator %d\n", temp_bool->op); break; case REL_NODE: temp_rel = (struct AST_Node_Rel *) node; printf("Relational Node of operator %d\n", temp_rel->op); break; case EQU_NODE: temp_equ = (struct AST_Node_Equ *) node; printf("Equality Node of operator %d\n", temp_equ->op); break; case FUNC_DECL: temp_func_decl = (struct AST_Node_Func_Decl *) node; printf("Function Declaration Node of %s with ret_type %d\n", temp_func_decl->entry->st_name, temp_func_decl->ret_type); break; case RETURN_NODE: temp_return = (struct AST_Node_Return *) node; printf("Return Node of ret_type %d\n", temp_return->ret_type); break; default: /* wrong choice case */ fprintf(stderr, "Error in node selection!\n"); exit(1); } }

    Now we need to write a function that traverses through the tree. There are of course lots and lots of ways to do this! As we already mentioned in the "AST Principle" article, one of the intermediate code forms is the so called Post-fix code, for which we also did some examples. It so happens that the most common way of traversing through trees is using pre-order, in-order or post-order traversal. Thinking about the Post-fix code, which also makes sense with the Bottom-Up parser that we are writing, we will of course try to use a type of post-order traversal. But, not all nodes have left and right children! Some have no children at all, other nodes might even have more than 2 children.

    Of course the problem is quite recursive! What we have to do first is check if the node is empty, which is the exit condition of the recursion. After that we will call the same function for all the children nodes, depending on the type of node. In the end, we will just print out the node (post fix) using the previously declared "ast_print_node" function.
The code is:
void ast_traversal(AST_Node *node){
    int i;
/* check if empty */ if(node == NULL){ return; }
/* left and right child cases */ if(node->type == BASIC_NODE || node->type == ARITHM_NODE || node->type == BOOL_NODE || node->type == REL_NODE || node->type == EQU_NODE){ ast_traversal(node->left); ast_traversal(node->right); ast_print_node(node); // postfix } /* the if case */ else if(node->type == IF_NODE){ AST_Node_If *temp_if = (struct AST_Node_If *) node; ast_traversal(temp_if->condition); ast_traversal(temp_if->if_branch); for(i = 0; i < temp_if->elseif_count; i++){ ast_traversal(temp_if->elsif_branches[i]); } ast_traversal(temp_if->else_branch); ast_print_node(node); } /* the else if case */ else if(node->type == ELSIF_NODE){ AST_Node_Elsif *temp_elsif = (struct AST_Node_Elsif *) node; ast_traversal(temp_elsif->condition); ast_traversal(temp_elsif->elsif_branch); ast_print_node(node); } /* for case */ else if(node->type == FOR_NODE){ AST_Node_For *temp_for = (struct AST_Node_For *) node; ast_traversal(temp_for->condition); ast_traversal(temp_for->for_branch); ast_print_node(node); } /* while case */ else if(node->type == WHILE_NODE){ AST_Node_While *temp_while = (struct AST_Node_While *) node; ast_traversal(temp_while->condition); ast_traversal(temp_while->while_branch); ast_print_node(node); } /* assign case */ else if(node->type == ASSIGN_NODE){ AST_Node_Assign *temp_assign = (struct AST_Node_Assign *) node; ast_traversal(temp_assign->assign_val); ast_print_node(node); } /* function call case */ else if(node->type == FUNC_CALL){ AST_Node_Func_Call *temp_func_call = (struct AST_Node_Func_Call *) node; for(i = 0; i < temp_func_call->num_of_pars; i++){ ast_traversal(temp_func_call->params[i]); } ast_print_node(node); } /* return case */ else if(node->type == RETURN_NODE){ AST_Node_Return *temp_return = (struct AST_Node_Return *) node; ast_traversal(temp_return->ret_val); ast_print_node(node); } /* others */ else{ ast_print_node(node); } }

    And this was actually it! Not that complicated right? If you know trees and how to implement a traversal for them, it's just an extension of that to work for different types of tree nodes! We don't simply have left and right children nodes, but each node is quite "unique".

Testing out the AST structure "manually"

    Before we actually use this structure and it's functions we should first test it out, to see what it prints out! Let's try to create the following if expression tree:

Starting from the children, the Code looks as following:
Value val1, val2;
val1.ival = 0;
val2.ival = 1;
AST_Node *const_node1 = new_ast_const_node(INT_TYPE, val1);
AST_Node *const_node2 = new_ast_const_node(INT_TYPE, val2);
AST_Node *bool_node = new_ast_bool_node(OR, const_node1, const_node2);
AST_Node *simple_node = new_ast_simple_node(0);
AST_Node *if_node = new_ast_if_node(bool_node, simple_node, NULL, 0, NULL);

Running the Code we get:

I guess it's self-explanatory that we are truly traversing with Post-order!

Explanation of some values:
  • The integer type has the value 1 and so the constants are of const-type 1
  • The Boolean operator OR has the value 0
  • The Simple Node of continue has a statement-value of 0
    So, what remains now? Well, all these "new_ast_node" functions have to be added to the action rules of the Bison parser, so that we create a syntax tree for the program. So, the parse tree will now be "replaced" somewhat by our AST!



No references, just continuing on with the implementation on my own!


All of the images are custom-made!

Previous parts of the series

Final words | Next up on the project

     And this is actually it for today's post! I hope that I explained everything as much as I needed to, meaning that you learned something out of it.
Next up on this series are:
  • Semantic analysis (using semantic, and so action rules in Bison)
  • Intermediate Code generation (actually using the AST)
  • Machine Code generation (MIPS Assembly)
     Which are all topics that will need more than one article to complete. Also, note that we might also get into Optimizations later on, or could even extend the Language by adding complex datatypes (structs and unions), more rules etc.
So, see ya next time!

GitHub Account:


Keep on drifting! ;)

3 columns
2 columns
1 column