etisserant@0: 
etisserant@0: 
etisserant@0: 
etisserant@0:   IEC 61131-3 IL and ST compiler
etisserant@0: 
etisserant@0: 
etisserant@0:   The following compiler has been based on the
etisserant@0:   FINAL DRAFT - IEC 61131-3, 2nd Ed. (2001-12-10)
etisserant@0: 
etisserant@0: 
etisserant@0:   (c) 2003 Mario de Sousa
etisserant@0: 
etisserant@0: 
etisserant@0: 
etisserant@0: ****************************************************************
etisserant@0: ****************************************************************
etisserant@0: ****************************************************************
etisserant@0: *********                                              *********
etisserant@0: *********                                              *********
etisserant@0: *********   O V E R A L L    A R C H I T E C T U R E   *********
etisserant@0: *********                                              *********
etisserant@0: *********                                              *********
etisserant@0: ****************************************************************
etisserant@0: ****************************************************************
etisserant@0: ****************************************************************
etisserant@0: 
etisserant@0:  The compiler works in 4(+1) stages:
etisserant@0:  Stage 1   - Lexical analyser      - implemented with flex (iec.flex)
etisserant@0:  Stage 2   - Syntax parser         - implemented with bison (iec.y)
etisserant@0:  Stage 3   - Semantics analyser    - not yet implemented
etisserant@0:  Stage 4   - Code generator        - implemented in C++
etisserant@0:  Stage 4+1 - Binary code generator - gcc, javac, etc...
etisserant@0: 
etisserant@0: 
etisserant@0:  Data structures passed between stages, in global variables:
etisserant@0:  1->2   : tokens (int), and token values (char *)
etisserant@0:  2->1   : symbol tables (defined in symtable.hh)
etisserant@0:  2->3   : abstract syntax tree (tree of C++ classes, in absyntax.hh file)
etisserant@0:  3->4   : Same as 2->3
etisserant@0:  4->4+1 : file with program in c, java, etc...
etisserant@0: 
etisserant@0: 
etisserant@0:  The compiler works in several passes:
etisserant@0:  Pass 1: executes stages 1 and 2 simultaneously
etisserant@0:  Pass 2: executes stage 3
etisserant@0:  Pass 3: executes stage 4
etisserant@0:  Pass 4: executes stage 4+1
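
 The pass structure above can be pictured as a small driver. This is
only a sketch: ast_node, pass1_parse(), pass2_check(), pass3_generate()
and compile() are hypothetical names, and results are handed over as
return values for brevity, whereas the compiler described here passes
them in global variables.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-in for the root of the abstract syntax tree.
struct ast_node {
  std::string source;
};

// Pass 1: stages 1 and 2 run together (the parser pulls tokens from
// the lexer on demand). Here we only wrap the source text in a node.
std::unique_ptr<ast_node> pass1_parse(const std::string &source) {
  std::unique_ptr<ast_node> root(new ast_node);
  root->source = source;
  return root;
}

// Pass 2: stage 3 (semantic analysis). Not yet implemented in the
// compiler, so this placeholder accepts every tree.
bool pass2_check(const ast_node &) { return true; }

// Pass 3: stage 4 (code generation). Emits a dummy comment instead of
// real C++ code.
std::string pass3_generate(const ast_node &root) {
  return "/* generated from: " + root.source + " */";
}

// Runs passes 1 to 3 in order. Pass 4 (gcc, javac, etc...) would be an
// external tool run on the returned text.
std::string compile(const std::string &source) {
  std::unique_ptr<ast_node> tree = pass1_parse(source);
  if (!tree || !pass2_check(*tree)) return "";
  return pass3_generate(*tree);
}
```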
etisserant@0: 
etisserant@0: 
etisserant@0:  NOTE 1
etisserant@0:  ======
etisserant@0:  Note that stage 2 passes data back to stage 1. This is only
etisserant@0: possible because both stages are executed in the same pass.
etisserant@0: 
etisserant@0: 
etisserant@0: 
etisserant@0:  NOTE 2
etisserant@0:  ======
etisserant@0:   I (Mario) have a feeling that the abstract syntax may be
etisserant@0: considerably simplified without any drawbacks to semantic checking
etisserant@0: and code generation. I have nevertheless opted to keep as much 
etisserant@0: info as possible in the abstract syntax tree, in case it may become
etisserant@0: necessary further on.
etisserant@0:  Once we start coding the next stages (semantic checking and code
etisserant@0: generation) I will get a better understanding of what is required
etisserant@0: of the abstract syntax tree. At that stage I will be better
etisserant@0: positioned to make a more informed decision on how best to structure
etisserant@0: the abstract syntax tree.
etisserant@0:  For now, we play conservative and keep as much info as possible.
etisserant@0: 
etisserant@0:  
etisserant@0: 
etisserant@0:  NOTE 3
etisserant@0:  ======
etisserant@0:  It would be nice to get this parser integrated into the gcc
etisserant@0: group of compilers. We would then be able to compile our st/il
etisserant@0: programs directly into executable binaries, for all the processor
etisserant@0: architectures gcc currently supports.
etisserant@0:  The gcc compilers are divided into a frontend and backend. The
etisserant@0: data structure between these two stages is called the syntax
etisserant@0: tree. In essence, we would need to create a new frontend that
etisserant@0: would parse the st/il program and build the syntax tree.
etisserant@0: Unfortunately the gcc syntax tree is not very well documented,
etisserant@0: and doing semantic checking on this tree would probably be a
etisserant@0: nightmare.
etisserant@0:  We therefore chose to follow the same route as the gnat (ada 95)
etisserant@0: and cobol compilers, i.e. generate our own abstract syntax tree,
etisserant@0: do semantic checking on our tree, do whatever optimisation
etisserant@0: we can at this level on our own tree, and only then build
etisserant@0: the gcc syntax tree from our abstract syntax tree.
etisserant@0:  All this may still be integrated with the gcc backend to generate
etisserant@0: a new gnu compiler for the st and il programming languages.
etisserant@0: Since generating the gcc syntax tree will probably involve some
etisserant@0: trial and error effort due to the sparseness of documentation,
etisserant@0: we chose to start off by coding a C++ code generator for
etisserant@0: our stage 4. We may later implement a gcc syntax tree generator
etisserant@0: as an alternative stage 4 process, and then integrate it with
etisserant@0: the gcc toplev.c file (command line parsing, etc...).
etisserant@0: 
etisserant@0: 
etisserant@0: 
etisserant@0: ****************************************************************
etisserant@0: ****************************************************************
etisserant@0: ****************************************************************
etisserant@0: *********                                              *********
etisserant@0: *********                                              *********
etisserant@0: *********               S T A G E      1               *********
etisserant@0: *********                                              *********
etisserant@0: *********                                              *********
etisserant@0: ****************************************************************
etisserant@0: ****************************************************************
etisserant@0: ****************************************************************
etisserant@0: 
etisserant@0: 
etisserant@0: 
etisserant@0: Issue 1
etisserant@0: =======
etisserant@0: 
etisserant@0:  The syntax defines the common_character_representation as:
etisserant@0: <any printable character except '$', '"' or "'"> | <escape sequences>
etisserant@0: 
etisserant@0:  Flex includes the function print_char() that defines
etisserant@0: all printable characters portably (i.e. whatever character
etisserant@0: encoding is currently being used, ASCII, EBCDIC, etc...).
etisserant@0: Unfortunately, we cannot generate the definition of
etisserant@0: common_character_representation portably, since flex
etisserant@0: does not allow definition of sets by subtracting
etisserant@0: elements in one set from another set (Note how
etisserant@0: common_character_representation could be defined by
etisserant@0: subtracting '$' '"' and "'" from print_char() ).
etisserant@0: This means we must build up the definition of
etisserant@0: common_character_representation using only set addition,
etisserant@0: which leaves us with the only choice of defining the
etisserant@0: characters non-portably...
etisserant@0: 
etisserant@0:  In short, the definition we use for common_character_representation
etisserant@0: only works for ASCII character encoding!
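
 To illustrate what "set addition only" means, the sketch below builds
the ASCII character set out of ranges that step around '$', '"' and
"'". It is written in C++ rather than as a flex definition, the names
add_range() and common_character_set() are hypothetical, and treating
the space character as printable is an assumption. Escape sequences
are not covered.

```cpp
#include <cassert>
#include <set>

// Adds every character in [first, last] to the set (set addition only).
static void add_range(std::set<char> &s, char first, char last) {
  for (int c = first; c <= last; ++c) s.insert(static_cast<char>(c));
}

// Printable ASCII characters except '$', '"' and '\''.
// The hard-coded ranges are only valid for the ASCII encoding.
std::set<char> common_character_set() {
  std::set<char> s;
  add_range(s, ' ', '!');  // 0x20-0x21, stops before '"'  (0x22)
  add_range(s, '#', '#');  // 0x23,      stops before '$'  (0x24)
  add_range(s, '%', '&');  // 0x25-0x26, stops before '\'' (0x27)
  add_range(s, '(', '~');  // 0x28-0x7E, the rest of printable ASCII
  return s;
}
```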
etisserant@0: 
etisserant@0: 
etisserant@0: 
etisserant@0: 
etisserant@0: Issue 2
etisserant@0: =======
etisserant@0: 
etisserant@0: We extend the IEC 61131-3 standard syntax to allow inclusion of 
etisserant@0: other files. The accepted syntax is:
etisserant@0: 
etisserant@0:    (*#include "<filename>" *)
etisserant@0: 
etisserant@0: Note how this would be ignored by other standard-compliant compilers 
etisserant@0: as a simple comment!
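
As an illustration, recognising such a comment and extracting the file
name could look like the sketch below. The function name
parse_include_comment() is hypothetical, and the exact whitespace
rules (blanks or tabs allowed around the quoted name, none between
"(*" and "#include") are assumptions, not something the compiler is
known to enforce.

```cpp
#include <cassert>
#include <string>

// Returns true and stores the file name in `filename` if `comment` has
// the form (*#include "<filename>" *); otherwise returns false.
bool parse_include_comment(const std::string &comment, std::string &filename) {
  const std::string prefix = "(*#include";
  const std::string suffix = "*)";
  if (comment.size() < prefix.size() + suffix.size()) return false;
  if (comment.compare(0, prefix.size(), prefix) != 0) return false;
  const std::string::size_type tail = comment.size() - suffix.size();
  if (comment.compare(tail, suffix.size(), suffix) != 0) return false;

  // Locate the quoted file name.
  const std::string::size_type open = comment.find('"', prefix.size());
  if (open == std::string::npos) return false;
  const std::string::size_type close = comment.find('"', open + 1);
  if (close == std::string::npos || close == open + 1) return false;

  // Only blanks or tabs may surround the quoted name.
  for (std::string::size_type i = prefix.size(); i < open; ++i)
    if (comment[i] != ' ' && comment[i] != '\t') return false;
  for (std::string::size_type i = close + 1; i < tail; ++i)
    if (comment[i] != ' ' && comment[i] != '\t') return false;

  filename = comment.substr(open + 1, close - open - 1);
  return true;
}
```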
etisserant@0: 
etisserant@0: 
etisserant@0: 
etisserant@0: ****************************************************************
etisserant@0: ****************************************************************
etisserant@0: ****************************************************************
etisserant@0: *********                                              *********
etisserant@0: *********                                              *********
etisserant@0: *********               S T A G E      2               *********
etisserant@0: *********                                              *********
etisserant@0: *********                                              *********
etisserant@0: ****************************************************************
etisserant@0: ****************************************************************
etisserant@0: ****************************************************************
etisserant@0: 
etisserant@0:  Overall Comments
etisserant@0:  ================
etisserant@0: 
etisserant@0:  
etisserant@0:  Comment 1
etisserant@0:  ---------
etisserant@0:  We have augmented the syntax the specification defines to include
etisserant@0: restrictions defined in the semantics of the languages.
etisserant@0: 
etisserant@0:  This is required because the syntax cannot be parsed by a LALR(1)
etisserant@0: parser as it is presented in the specification. Many reduce/reduce
etisserant@0: and shift/reduce conflicts arise. This is mainly because the parser
etisserant@0: cannot discern how to reduce an identifier. Identifiers show up in
etisserant@0: many places in the syntax, and it is not entirely possible to
etisserant@0: figure out if the identifier is a variable_name, enumeration
etisserant@0: value, function block name, etc... only from the context in
etisserant@0: which it appears.
etisserant@0: 
etisserant@0:  A more detailed example of why we need symbol tables is
etisserant@0: the type definitions...  In the definition of new types
etisserant@0: (section B 1.3.3) the parser needs to figure out the class of
etisserant@0: the new type being defined (enumerated, structure, array, etc...).
etisserant@0: This works well when the base classes are elementary types
etisserant@0: (or structures, enumeration, arrays, etc. thereof). It becomes
etisserant@0: confusing to the parser when the new_type is based on a previously
etisserant@0: user defined type.
etisserant@0: 
etisserant@0: TYPE
etisserant@0:   new_type_1 : INT := 99;
etisserant@0:   new_type_2 : new_type_1 := 100;
etisserant@0: END_TYPE
etisserant@0: 
etisserant@0:  When parsing new_type_1, the parser can figure out that the
etisserant@0: identifier new_type_1 is a simple_type_name, because it is
etisserant@0: based on an elementary type without structure, arrays, etc...
etisserant@0:  While parsing new_type_2, it becomes confused how to reduce
etisserant@0: the new_type_2 identifier, as it is based on the identifier
etisserant@0: new_type_1, of which it does not know the class (remember, at this
etisserant@0: stage new_type_1 is a simple identifier!).
etisserant@0:  We therefore need to keep track of the class of the user
etisserant@0: defined types as they are declared, so that the lexical analyser
etisserant@0: can tell the syntax parser what class the type belongs to. We
etisserant@0: cannot use the abstract syntax tree itself to search for the
etisserant@0: declaration of new_type_1 as we only get a handle to the root
etisserant@0: of the tree at the end of the parsing.
etisserant@0: 
etisserant@0:  We therefore maintain an independent and parallel table of symbols,
etisserant@0: that is filled as we come across the type declarations in the code.
etisserant@0: Actually, we ended up also needing to store variable names in
etisserant@0: the symbol table. Since variable names come and go out of scope
etisserant@0: depending on what portion of code we are parsing, we sometimes
etisserant@0: need to remove the variable names from the symbol table.
etisserant@0: Since the ST and IL languages only have a single level of scope,
etisserant@0: I (Mario) found it easier to simply use a second symbol table for
etisserant@0: the variable names that is completely cleared when the parser
etisserant@0: reaches the end of a function (function block or program).
etisserant@0: 
etisserant@0: What I mean when I say that these languages have a single level
etisserant@0: of scope is that all variables used in a function (function block
etisserant@0: or program) must be declared inside that function (function block
etisserant@0: or program). Even global variables must be re-declared as EXTERN
etisserant@0: before a function may access them! This means that it is easy
etisserant@0: to simply load up the variable name symbol table when we start
etisserant@0: parsing a function (function block or program), and to clear it
etisserant@0: when we reach the end. Checking whether variables declared
etisserant@0: as EXTERN really exist inside a RESOURCE or a CONFIGURATION
etisserant@0: is left to stage 3 (semantic checking) where we can use the
etisserant@0: abstract tree itself to search for the variables (NOTE: semantic
etisserant@0: checking at stage 3 has not yet been implemented, so we may yet
etisserant@0: end up using a symbol table too at that stage!).
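
 The single level of scope described above keeps the variable name
table very simple. A minimal sketch, assuming a hypothetical class
variable_name_table (the real interface in symtable.hh may well look
different):

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical variable name table with a single level of scope.
class variable_name_table {
 public:
  // Called for each variable declared inside the function, function
  // block or program currently being parsed (EXTERN ones included).
  void declare(const std::string &name) { names_.insert(name); }

  // Called when the parser reaches the end of the function, function
  // block or program: every name goes out of scope at once.
  void clear() { names_.clear(); }

  bool contains(const std::string &name) const {
    return names_.count(name) != 0;
  }

 private:
  std::set<std::string> names_;
};
```

 Because nothing survives the call to clear(), no stack of nested
scopes is needed.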
etisserant@0: 
etisserant@0:  Because of the symbol tables, the lexical analyser returns a
etisserant@0: special identifier token that depends on how the identifier was
etisserant@0: previously declared in the code being parsed. The syntax was
etisserant@0: therefore slightly changed regarding the definition of variable
etisserant@0: names, derived function names, etc... For example, FROM:
etisserant@0: variable_name: identifier;
etisserant@0: TO
etisserant@0: variable_name: variable_name_token;
etisserant@0: 
etisserant@0:  Flex first looks at the symbol tables when it finds an identifier,
etisserant@0: and returns the correct token corresponding to the identifier
etisserant@0: type in question. Only if the identifier is not currently stored
etisserant@0: in any symbol table, does flex return a simple identifier_token.
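
 The lookup performed for each identifier can be sketched as follows.
The token names variable_name_token, derived_function_name_token and
identifier_token come from the text above; simple_type_name_token, the
symbol_tables structure, classify_identifier() and the order in which
the two tables are consulted are assumptions made for this example.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

enum token_kind {
  identifier_token,             // not found in any symbol table
  variable_name_token,
  simple_type_name_token,       // hypothetical name
  derived_function_name_token
};

struct symbol_tables {
  // Types, functions, etc... with the token to return for each name.
  std::map<std::string, token_kind> library_elements;
  // Variables of the function (function block or program) being parsed.
  std::set<std::string> variables;
};

// Returns the token the lexical analyser would hand to the parser.
token_kind classify_identifier(const symbol_tables &t,
                               const std::string &name) {
  if (t.variables.count(name) != 0) return variable_name_token;
  std::map<std::string, token_kind>::const_iterator it =
      t.library_elements.find(name);
  if (it != t.library_elements.end()) return it->second;
  return identifier_token;
}
```

 With the TYPE example given earlier, new_type_1 is a plain
identifier_token while its own declaration is parsed, and is returned
with its class token once it has been entered in the table, which is
what lets the parser reduce the declaration of new_type_2.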
etisserant@0: 
etisserant@0:  This means that the declarations of variables, functions etc...
etisserant@0: were changed FROM:
etisserant@0: function_declaration: FUNCTION derived_function_name ...
etisserant@0: TO
etisserant@0: function_declaration: FUNCTION identifier ...
etisserant@0: since the initial definition of derived_function_name had been
etisserant@0: changed FROM
etisserant@0: derived_function_name: identifier;
etisserant@0: TO
etisserant@0: derived_function_name: derived_function_name_token;
etisserant@0: 
etisserant@0: 
etisserant@0: 
etisserant@0: 
etisserant@0:  Comment 2
etisserant@0:  ---------
etisserant@0:  Since the ST and IL languages share a lot of common syntax,
etisserant@0: I have decided to write a single parser to handle both languages
etisserant@0: simultaneously. This approach has the advantage that the user
etisserant@0: may mix the language used in the same file, as long as each function
etisserant@0: is written in a single language.
etisserant@0: 
etisserant@0:  This approach also assumes that all the IL language operators are
etisserant@0: keywords, which means that it is not possible to define variables
etisserant@0: using names such as "LD", "ST", etc...
etisserant@0: Note that the spec does not consider these operators to be keywords,
etisserant@0: so it means that they should be available for variable names! On the
etisserant@0: other hand, all implementations of the ST and IL languages seem to
etisserant@0: treat them as keywords, so there is not much harm in doing the same.
etisserant@0: 
etisserant@0:  If it ever becomes necessary to allow variables with names of IL 
etisserant@0: operators, either the syntax will have to be augmented, or we can 
etisserant@0: break up the parser in two: one for ST and another for IL.
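
 The consequence of treating IL operators as keywords can be shown
with a small check. The function name is_il_operator_keyword() is
hypothetical, the operator list is deliberately partial, and matching
the exact upper-case spelling only is a simplification.

```cpp
#include <cassert>
#include <set>
#include <string>

// Returns true if `word` is one of the IL operators that this sketch
// reserves as a keyword. Only a few operators are listed.
bool is_il_operator_keyword(const std::string &word) {
  static const char *const list[] = {
      "LD", "LDN", "ST",  "STN", "S",   "R",   "AND", "OR",
      "XOR", "ADD", "SUB", "MUL", "DIV", "JMP", "CAL", "RET"};
  static const std::set<std::string> operators(
      list, list + sizeof(list) / sizeof(list[0]));
  return operators.count(word) != 0;
}
```

 A lexical analyser built this way never returns an identifier token
for "LD" or "ST", so a variable declaration using one of these names
is rejected by the parser even though the spec would allow it.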
etisserant@0: 
etisserant@0: 
etisserant@0: 
etisserant@0: /********************************/
etisserant@0: /* B 1.3.3 - Derived data types */
etisserant@0: /********************************/
etisserant@0: 
etisserant@0: Issue 1
etisserant@0: =======
etisserant@0: 
etisserant@0: According to the spec, the valid construct
etisserant@0: TYPE new_str_type : STRING := "hello!"; END_TYPE
etisserant@0: has two possible routes to type_declaration...
etisserant@0: 
etisserant@0: Route 1:
etisserant@0: type_declaration: single_element_type_declaration
etisserant@0: single_element_type_declaration: simple_type_declaration
etisserant@0: simple_type_declaration: identifier ':' simple_spec_init
etisserant@0: simple_spec_init: simple_specification ASSIGN constant
etisserant@0: (shift:  identifier <- 'new_str_type')
etisserant@0: simple_specification: elementary_type_name
etisserant@0: elementary_type_name: STRING
etisserant@0: (shift: elementary_type_name <- STRING)
etisserant@0: (reduce: simple_specification <- elementary_type_name)
etisserant@0: (shift: constant <- "hello!")
etisserant@0: (reduce: simple_spec_init: simple_specification ASSIGN constant)
etisserant@0: (reduce: ...)
etisserant@0: 
etisserant@0: 
etisserant@0: Route 2:
etisserant@0: type_declaration: string_type_declaration
etisserant@0: string_type_declaration: identifier ':' elementary_string_type_name string_type_declaration_size string_type_declaration_init
etisserant@0: (shift:  identifier <- 'new_str_type')
etisserant@0: elementary_string_type_name: STRING
etisserant@0: (shift: elementary_string_type_name <- STRING)
etisserant@0: (shift: string_type_declaration_size <- /* empty */)
etisserant@0: string_type_declaration_init: ASSIGN character_string
etisserant@0: (shift: character_string <- "hello!")
etisserant@0: (reduce: string_type_declaration_init <- ASSIGN character_string)
etisserant@0: (reduce: string_type_declaration <- identifier ':' elementary_string_type_name string_type_declaration_size string_type_declaration_init )
etisserant@0: (reduce: type_declaration <- string_type_declaration)
etisserant@0: 
etisserant@0: 
etisserant@0:  At first glance it seems that removing route 1 would make
etisserant@0: the most sense. Unfortunately the construct 'simple_spec_init'
etisserant@0: shows up multiple times in other rules, so changing this construct
etisserant@0: would mean changing all the rules in which it appears.
etisserant@0: I (Mario) therefore chose to remove route 2 instead. This means
etisserant@0: that the above declaration gets stored in a
etisserant@0: simple_type_declaration_c, and not in a string_type_declaration_c
etisserant@0: as would be expected!
etisserant@0: 
etisserant@0: 
etisserant@0: /***********************/
etisserant@0: /* B 1.5.1 - Functions */
etisserant@0: /***********************/
etisserant@0: 
etisserant@0: 
etisserant@0: Issue 1
etisserant@0: =======
etisserant@0: 
etisserant@0:  Due to reduce/reduce conflicts between identifiers
etisserant@0: being reduced to either a variable or an enumerator value,
etisserant@0: we were forced to keep a symbol table of the names
etisserant@0: of all declared variables. Variables are no longer
etisserant@0: created from simple identifier_token, but from
etisserant@0: variable_name_token.
etisserant@0: 
etisserant@0:  BUT, in functions the function name may be used as
etisserant@0: a variable! In order to be able to parse this correctly,
etisserant@0: the token parser (flex) must return a variable_name_token
etisserant@0: when it comes across the function name, while parsing
etisserant@0: the function itself.
etisserant@0: We do this by inserting the function name into the variable
etisserant@0: symbol table, and having flex return a variable_name_token
etisserant@0: whenever it comes across it.
etisserant@0: When we finish parsing the function the variable name
etisserant@0: symbol table is cleared of all entries, and the function
etisserant@0: name is inserted into the library element symbol table. This
etisserant@0: means that from then onwards flex will return a
etisserant@0: derived_function_name_token whenever it comes across the
etisserant@0: function name.
etisserant@0: 
etisserant@0: In order to insert the function name into the variable_name
etisserant@0: symbol table BEFORE the function body gets parsed, we
etisserant@0: need the parser to reduce a construct that contains the
etisserant@0: function name. That is why we created the extra
etisserant@0: construct 'function_name_declaration', i.e. to force
etisserant@0: the parser to reduce it, before parsing the function body,
etisserant@0: and therefore get an opportunity to insert the function name
etisserant@0: into the variable name symbol table!
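
The life cycle of a function name in the two symbol tables can be
sketched as below. The class parser_symbols and its method names are
hypothetical; in the real parser the corresponding steps would sit in
the bison actions for 'function_name_declaration' and for the complete
function declaration.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

enum token_kind {
  identifier_token,
  variable_name_token,
  derived_function_name_token
};

class parser_symbols {
 public:
  // Action run when 'function_name_declaration' is reduced, i.e.
  // before the function body is parsed.
  void on_function_name_declaration(const std::string &name) {
    current_function_ = name;
    variables_.insert(name);
  }

  // Action run once the whole function declaration has been parsed.
  void on_function_end() {
    variables_.clear();
    library_elements_[current_function_] = derived_function_name_token;
    current_function_.clear();
  }

  // Token the lexical analyser returns for `name` at this point.
  token_kind classify(const std::string &name) const {
    if (variables_.count(name) != 0) return variable_name_token;
    std::map<std::string, token_kind>::const_iterator it =
        library_elements_.find(name);
    if (it != library_elements_.end()) return it->second;
    return identifier_token;
  }

 private:
  std::string current_function_;
  std::set<std::string> variables_;
  std::map<std::string, token_kind> library_elements_;
};
```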
etisserant@0: 
etisserant@0: 
etisserant@0: /********************************/
etisserant@0: /* B 3.2.4 Iteration Statements */
etisserant@0: /********************************/
etisserant@0: 
etisserant@0: Issue 1
etisserant@0: =======
etisserant@0: 
etisserant@0: For the 'FOR' iteration loop
etisserant@0: 
etisserant@0:   FOR control_variable ASSIGN expression TO expression BY expression DO statement_list END_FOR
etisserant@0: 
etisserant@0: The spec declares the control variable in the syntax as
etisserant@0: 
etisserant@0:   control_variable: identifier;
etisserant@0: 
etisserant@0: but then defines the semantics of control_variable (Section 3.3.2.4)
etisserant@0: as being of an integer type (e.g., SINT, INT, or DINT).
etisserant@0: Obviously this presupposes that the control_variable must have been
etisserant@0: declared in some 'VAR .. END_VAR' construct, so I (Mario) changed
etisserant@0: the syntax to read
etisserant@0: 
etisserant@0:   control_variable: variable_name;
etisserant@0: 
etisserant@0: 
etisserant@0: 
etisserant@0: 
etisserant@0: 
etisserant@0: **************************************************************************
etisserant@0: 
etisserant@0:   (c) 2003 Mario de Sousa