MATIEC - IEC 61131-3 compiler


The following compiler has been based on the
FINAL DRAFT - IEC 61131-3, 2nd Ed. (2001-12-10)


Copyright (C) 2003-2011   Mario de Sousa (msousa@fe.up.pt)



****************************************************************
****************************************************************
****************************************************************
*********                                              *********
*********                                              *********
*********   O V E R A L L    A R C H I T E C T U R E   *********
*********                                              *********
*********                                              *********
****************************************************************
****************************************************************
****************************************************************

The compiler works in 4(+1) stages:
Stage 1   - Lexical analyser      - implemented with flex (iec.flex)
Stage 2   - Syntax parser         - implemented with bison (iec.y)
Stage 3   - Semantics analyser    - currently in its early stages
Stage 4   - Code generator        - implemented in C++
Stage 4+1 - Binary code generator - gcc, javac, etc...


Data structures passed between stages, in global variables:
1->2   : tokens (int), and token values (char *)
2->1   : symbol tables (defined in symtable.hh)
2->3   : abstract syntax tree (tree of C++ classes, in absyntax.hh file)
3->4   : Same as 2->3
4->4+1 : file with program in c, java, etc...


The compiler works in several passes:
Pass 1: executes stages 1 and 2 simultaneously
Pass 2: executes stage 3
Pass 3: executes stage 4
Pass 4: executes stage 4+1


NOTE 1
======
Note that stage 2 passes data back to stage 1. This is only
possible because both stages are executed in the same pass.



NOTE 2
======
It would be nice to get this parser integrated into the gcc
group of compilers. We would then be able to compile our st/il
programs directly into executable binaries, for all the processor
architectures gcc currently supports.
The gcc compilers are divided into a frontend and backend. The
data structure between these two stages is called the syntax
tree. In essence, we would need to create a new frontend that
would parse the st/il program and build the syntax tree.
Unfortunately the gcc syntax tree is not very well documented,
and doing semantic checking on this tree would probably be a
nightmare.
We therefore chose to follow the same route as the gnat (ada 95)
and cobol compilers, i.e. generate our own abstract syntax tree,
do semantic checking on our tree, do whatever optimisation
we can at this level on our own tree, and only then build
the gcc syntax tree from our abstract syntax tree.
All this may still be integrated with the gcc backend to generate
a new gnu compiler for the st and il programming languages.
Since generating the gcc syntax tree will probably involve some
trial and error effort due to the sparseness of documentation,
we chose to start off by coding a C++ code generator for
our stage 4. We may later implement a gcc syntax tree generator
as an alternative stage 4 process, and then integrate it with
the gcc toplevel.c file (command line parsing, etc...).



****************************************************************
****************************************************************
****************************************************************
*********                                              *********
*********                                              *********
*********                S T A G E   1                 *********
*********                                              *********
*********                                              *********
****************************************************************
****************************************************************
****************************************************************



Issue 1
=======

The syntax defines the common_character_representation as:
<any printable character except '$', '"' or "'"> | <escape sequences>

Flex includes the function print_char() that defines
all printable characters portably (i.e. whatever character
encoding is currently being used: ASCII, EBCDIC, etc...).
Unfortunately, we cannot generate the definition of
common_character_representation portably, since flex
does not allow definition of sets by subtracting
elements in one set from another set (note how
common_character_representation could be defined by
subtracting '$', '"' and "'" from print_char() ).
This means we must build up the definition of
common_character_representation using only set addition,
which leaves us with the only choice of defining the
characters non-portably...

In short, the definition we use for common_character_representation
only works for ASCII character encoding!




Issue 2
=======

We extend the IEC 61131-3 standard syntax to allow inclusion of
other files. The accepted syntax is:

{#include "<filename>" }

We use a pragma directive for this (allowed by the standard itself),
since it is an extension of the standard. In principle, this would
be ignored by other standard-compliant compilers!
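
As an illustration of the include pragma accepted in Issue 2, the
following is a minimal sketch of how such a pragma could be recognised.
It is not the code used in iec.flex: the function name
parse_include_pragma and its tolerance of extra white space are
assumptions made for this example only.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of matiec): if `text` is a pragma of the
// form {#include "<filename>" }, store the file name in `filename` and
// return true. Any other pragma returns false and leaves `filename` alone.
bool parse_include_pragma(const std::string &text, std::string &filename) {
    std::size_t pos = 0;

    // Advance past any white space at the current position.
    auto skip_spaces = [&]() {
        while (pos < text.size() &&
               std::isspace(static_cast<unsigned char>(text[pos]))) {
            ++pos;
        }
    };
    // Consume `literal` if it appears at the current position.
    auto expect = [&](const std::string &literal) {
        if (text.compare(pos, literal.size(), literal) != 0) return false;
        pos += literal.size();
        return true;
    };

    if (!expect("{")) return false;
    skip_spaces();  // assumption: white space after '{' is tolerated
    if (!expect("#include")) return false;
    skip_spaces();
    if (!expect("\"")) return false;

    const std::size_t end_quote = text.find('"', pos);
    if (end_quote == std::string::npos) return false;
    const std::string name = text.substr(pos, end_quote - pos);
    pos = end_quote + 1;

    skip_spaces();
    if (!expect("}")) return false;

    filename = name;
    return true;
}
```

A pragma that does not match this shape is simply reported as "not an
include", which mirrors the idea that other pragmas are left untouched.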



****************************************************************
****************************************************************
****************************************************************
*********                                              *********
*********                                              *********
*********                S T A G E   2                 *********
*********                                              *********
*********                                              *********
****************************************************************
****************************************************************
****************************************************************

Overall Comments
================


Comment 1
---------
We have augmented the syntax the specification defines to include
restrictions defined in the semantics of the languages.

This is required because the syntax cannot be parsed by a LALR(1)
parser as it is presented in the specification. Many reduce/reduce
and shift/reduce conflicts arise. This is mainly because the parser
cannot discern how to reduce an identifier. Identifiers show up in
many places in the syntax, and it is not always possible to
figure out if the identifier is a variable_name, enumeration
value, function block name, etc... only from the context in
which it appears.

A more detailed example of why we need symbol tables is given by
the type definitions. In the definition of new types
(section B 1.3.3) the parser needs to figure out the class of
the new type being defined (enumerated, structure, array, etc...).
This works well when the base classes are elementary types
(or structures, enumerations, arrays, etc. thereof). It becomes
confusing to the parser when the new_type is based on a previously
user defined type.

TYPE
  new_type_1 : INT := 99;
  new_type_2 : new_type_1 := 100;
END_TYPE

When parsing new_type_1, the parser can figure out that the
identifier new_type_1 is a simple_type_name, because it is
based on an elementary type without structure, arrays, etc...
While parsing new_type_2, it becomes confused about how to reduce
the new_type_2 identifier, as it is based on the identifier
new_type_1, of which it does not know the class (remember, at this
stage new_type_1 is a simple identifier!).
We therefore need to keep track of the class of the user
defined types as they are declared, so that the lexical analyser
can tell the syntax parser what class the type belongs to. We
cannot use the abstract syntax tree itself to search for the
declaration of new_type_1 as we only get a handle to the root
of the tree at the end of the parsing.

We therefore maintain an independent and parallel table of symbols,
that is filled as we come across the type declarations in the code.
Actually, we ended up also needing to store variable names in
the symbol table. Since variable names come and go out of scope
depending on what portion of code we are parsing, we sometimes
need to remove the variable names from the symbol table.
Since the ST and IL languages only have a single level of scope,
I (Mario) found it easier to simply use a second symbol table for
the variable names that is completely cleared when the parser
reaches the end of a function (function block or program).

What I mean when I say that these languages have a single level
of scope is that all variables used in a function (function block
or program) must be declared inside that function (function block
or program). Even global variables must be re-declared as EXTERN
before a function may access them! This means that it is easy
to simply load up the variable name symbol table when we start
parsing a function (function block or program), and to clear it
when we reach the end. Checking whether variables declared
as EXTERN really exist inside a RESOURCE or a CONFIGURATION
is left to stage 3 (semantic checking) where we can use the
abstract tree itself to search for the variables (NOTE: semantic
checking at stage 3 has not yet been implemented, so we may yet
end up using a symbol table too at that stage!).

Due to the use of the symbol tables, and of special identifier
tokens that depend on how the identifier was previously
declared in the code being parsed, the syntax was slightly
changed regarding the definition of variable names, derived
function names, etc... FROM, e.g.:
  variable_name: identifier;
TO
  variable_name: variable_name_token;

Flex first looks at the symbol tables when it finds an identifier,
and returns the correct token corresponding to the identifier
type in question.
Only if the identifier is not currently stored
in any symbol table does flex return a simple identifier_token.

This means that the declarations of variables, functions etc...
were changed FROM:
  function_declaration: FUNCTION derived_function_name ...
TO
  function_declaration: FUNCTION identifier ...
since the initial definition of derived_function_name had been
changed FROM
  derived_function_name: identifier;
TO
  derived_function_name: derived_function_name_token;




Comment 2
---------
Since the ST and IL languages share a lot of common syntax,
I have decided to write a single parser to handle both languages
simultaneously. This approach has the advantage that the user
may mix the languages used in the same file, as long as each function
is written in a single language.

This approach also assumes that all the IL language operators are
keywords, which means that it is not possible to define variables
using names such as "LD", "ST", etc...
Note that the spec does not consider these operators to be keywords,
so it means that they should be available for variable names! On the
other hand, all implementations of the ST and IL languages seem to
treat them as keywords, so there is not much harm in doing the same.

If it ever becomes necessary to allow variables with names of IL
operators, either the syntax will have to be augmented, or we can
break up the parser in two: one for ST and another for IL.
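
The symbol table lookup described in Comment 1 can be sketched as
follows. This is only an illustration: the class symbol_tables_c, the
token names in token_t and the lookup order (variables before library
elements) are assumptions made for the example, and do not reproduce
the declarations in symtable.hh or the token codes generated from iec.y.

```cpp
#include <map>
#include <string>

// Hypothetical token codes; the real ones are generated by bison.
enum token_t {
    identifier_token,
    variable_name_token,
    derived_function_name_token,
    simple_type_name_token
};

// Hypothetical stand-in for the two symbol tables described above.
class symbol_tables_c {
public:
    // Types, functions, etc. stay declared until the end of parsing.
    void declare_library_element(const std::string &name, token_t token) {
        library_elements_[name] = token;
    }
    // Variables only live until the enclosing function ends.
    void declare_variable(const std::string &name) {
        variables_[name] = variable_name_token;
    }
    // Called at the end of a function, function block or program.
    void clear_variables() { variables_.clear(); }

    // What the lexical analyser would ask for every identifier it scans.
    token_t classify(const std::string &name) const {
        std::map<std::string, token_t>::const_iterator it;
        it = variables_.find(name);
        if (it != variables_.end()) return it->second;
        it = library_elements_.find(name);
        if (it != library_elements_.end()) return it->second;
        // Not declared anywhere yet: a plain identifier.
        return identifier_token;
    }

private:
    std::map<std::string, token_t> library_elements_;
    std::map<std::string, token_t> variables_;
};
```

With such a table, once new_type_1 has been declared as a simple type,
classify("new_type_1") no longer yields identifier_token, which is what
allows the parser to reduce the declaration of new_type_2.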



/********************************/
/* B 1.3.3 - Derived data types */
/********************************/

Issue 1
=======

According to the spec, the valid construct
TYPE new_str_type : STRING := "hello!"; END_TYPE
has two possible routes to type_declaration...

Route 1:
  type_declaration: single_element_type_declaration
  single_element_type_declaration: simple_type_declaration
  simple_type_declaration: identifier ':' simple_spec_init
  simple_spec_init: simple_specification ASSIGN constant
  (shift:  identifier <- 'new_str_type')
  simple_specification: elementary_type_name
  elementary_type_name: STRING
  (shift:  elementary_type_name <- STRING)
  (reduce: simple_specification <- elementary_type_name)
  (shift:  constant <- "hello!")
  (reduce: simple_spec_init: simple_specification ASSIGN constant)
  (reduce: ...)


Route 2:
  type_declaration: string_type_declaration
  string_type_declaration: identifier ':' elementary_string_type_name string_type_declaration_size string_type_declaration_init
  (shift:  identifier <- 'new_str_type')
  elementary_string_type_name: STRING
  (shift:  elementary_string_type_name <- STRING)
  (shift:  string_type_declaration_size <- /* empty */)
  string_type_declaration_init: ASSIGN character_string
  (shift:  character_string <- "hello!")
  (reduce: string_type_declaration_init <- ASSIGN character_string)
  (reduce: string_type_declaration <- identifier ':' elementary_string_type_name string_type_declaration_size string_type_declaration_init )
  (reduce: type_declaration <- string_type_declaration)


At first glance it seems that removing route 1 would make
the most sense. Unfortunately the construct 'simple_spec_init'
shows up multiple times in other rules, so changing this construct
would mean changing all the rules in which it appears.
I (Mario) therefore chose to remove route 2 instead. This means
that the above declaration gets stored in a
simple_type_declaration_c, and not in a string_type_declaration_c
as would be expected!


/***********************/
/* B 1.5.1 - Functions */
/***********************/


Issue 1
=======

Due to reduce/reduce conflicts between identifiers
being reduced to either a variable or an enumerator value,
we were forced to keep a symbol table of the names
of all declared variables.
Variables are no longer
created from a simple identifier_token, but from a
variable_name_token.

BUT, in functions the function name may be used as
a variable! In order to be able to parse this correctly,
the token parser (flex) must return a variable_name_token
when it comes across the function name, while parsing
the function itself.
We do this by inserting the function name into the variable
symbol table, and having flex return a variable_name_token
whenever it comes across it.
When we finish parsing the function the variable name
symbol table is cleared of all entries, and the function
name is inserted into the library element symbol table. This
means that from then onwards flex will return a
derived_function_name_token whenever it comes across the
function name.

In order to insert the function name into the variable_name
symbol table BEFORE the function body gets parsed, we
need the parser to reduce a construct that contains
the function name. That is why we created the extra
construct 'function_name_declaration', i.e. to force
the parser to reduce it, before parsing the function body,
and therefore get an opportunity to insert the function name
into the variable name symbol table!
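
The life cycle of a function name described above can be sketched as
follows. The class function_scope_tracker_c, its method names and the
token_kind enumeration are hypothetical and exist only for this
illustration; in the real parser the equivalent steps would be carried
out in the semantic actions of the grammar rules in iec.y.

```cpp
#include <set>
#include <string>

// Hypothetical token kinds, standing in for the bison token codes.
enum class token_kind { identifier, variable_name, derived_function_name };

// Hypothetical tracker showing when the function name changes class.
class function_scope_tracker_c {
public:
    // Would run when 'function_name_declaration' is reduced, i.e.
    // BEFORE the function body is parsed.
    void on_function_name_declaration(const std::string &name) {
        variable_names_.insert(name);
    }
    // Would run when the complete function declaration is reduced.
    void on_function_end(const std::string &name) {
        variable_names_.clear();         // all local names go out of scope
        library_elements_.insert(name);  // the function is now callable
    }
    // What the lexical analyser would return for an identifier.
    token_kind classify(const std::string &name) const {
        if (variable_names_.count(name) != 0) {
            return token_kind::variable_name;
        }
        if (library_elements_.count(name) != 0) {
            return token_kind::derived_function_name;
        }
        return token_kind::identifier;
    }

private:
    std::set<std::string> variable_names_;
    std::set<std::string> library_elements_;
};
```

Inside the body the name is classified as a variable (so that the
function's return value can be assigned), and only after the end of the
function is it classified as a derived function name.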


/********************************/
/* B 3.2.4 Iteration Statements */
/********************************/

Issue 1
=======

For the 'FOR' iteration loop

  FOR control_variable ASSIGN expression TO expression BY expression DO statement_list END_FOR

The spec declares the control variable in the syntax as

  control_variable: identifier;

but then defines the semantics of control_variable (Section 3.3.2.4)
as being of an integer type (e.g., SINT, INT, or DINT).
Obviously this presupposes that the control_variable must have been
declared in some 'VAR ... END_VAR' construct, so I (Mario) changed
the syntax to read

  control_variable: variable_name;