


    MATIEC - IEC 61131-3 compiler


    The following compiler has been based on the
    FINAL DRAFT - IEC 61131-3, 2nd Ed. (2001-12-10)


    Copyright (C) 2003-2012  Mario de Sousa (msousa@fe.up.pt)


  ****************************************************************
  ****************************************************************
  ****************************************************************
  *********                                              *********
  *********                                              *********
  *********          O V E R A L L    G O A L S          *********
  *********                                              *********
  *********                                              *********
  ****************************************************************
  ****************************************************************
  ****************************************************************



 This project has the goal of producing an open source compiler for the programming languages defined
 in the IEC 61131-3 standard. These programming languages are mostly used in the industrial automation
 domain, to program PLCs (Programmable Logic Controllers).

 This standard defines 5 programming languages:
     - IL : Instruction List
            A textual programming language, somewhat similar to assembly.
     - ST : Structured Text
            A textual programming language, somewhat similar to Pascal.
     - FBD: Function Block Diagram
            A graphical programming language, somewhat similar to an electrical circuit diagram based on small
            scale integration ICs (Integrated Circuits) (counters, AND/OR/XOR/... logic gates, timers, ...).
     - LD : Ladder Diagram
            A graphical programming language, somewhat similar to an electrical circuit diagram based on
            relays (used for basic cabled logic controllers).
     - SFC: Sequential Function Chart
            A graphical programming language that defines a state machine, based largely on Grafcet
            (may also be expressed in textual format).

 Of the above 5 languages, the standard defines textual representations for IL, ST and SFC.
 It is these 3 languages that we target, and we currently support all three, as long as they are
 expressed in the textual format as defined in the standard.

 Currently the matiec project generates two compilers (more correctly, code translators, but we like
 to call them compilers :-O ): iec2c and iec2iec

 Both compilers accept the same input: a text file with ST, IL and/or SFC code.

 The iec2c compiler generates ANSI C code which is equivalent to the IEC 61131-3 code expressed in the input file.

 The iec2iec compiler generates IEC 61131-3 code which is equivalent to the IEC 61131-3 code expressed in the input file.
 This last compiler should generate an output file which should be almost identical to the input file (some formatting
 may change, as well as the case of letters, etc.). This 'compiler' is mostly used by the matiec project contributors
 to help debug the lexical and syntax portions of the compilers.



 To compile/build these compilers, just
 $ ./configure; make




  ****************************************************************
  ****************************************************************
  ****************************************************************
  *********                                              *********
  *********                                              *********
  *********   O V E R A L L    A R C H I T E C T U R E   *********
  *********                                              *********
  *********                                              *********
  ****************************************************************
  ****************************************************************
  ****************************************************************

 The compiler works in 4(+1) stages:
 ==================================
 Stage 1    - Lexical analyser       - implemented with flex (stage1_2/iec_flex.ll)
 Stage 2    - Syntax parser          - implemented with bison (stage1_2/iec_bison.yy)
 Stage pre3 - Populate symbol tables - Symbol tables that will ease searching for symbols in the abstract syntax tree.
 Stage 3    - Semantics analyser     - currently does type checking only
 Stage 4    - Code generator         - generates ANSI C code

 Stage 5    - Binary code generator  - gcc, javac, etc... (Not integrated into matiec compiler. Must be called explicitly by the user.)



 Data structures passed between stages, in global variables:
 ==========================================================
 1->2      : tokens (int), and token values (char *) (defined in stage1_2/stage1_2_priv.hh)
 2->1      : symbol tables (implemented in util/symtable.[hh|cc], and defined in stage1_2/stage1_2_priv.hh)
 2->3      : abstract syntax tree (tree of C++ objects, whose classes are defined in absyntax/absyntax.hh)
 pre3->3,4 : global symbol tables (defined in util/[d]symtable.[hh|cc] and declared in absyntax_utils/absyntax_utils.hh)
 3->4      : abstract syntax tree (same as 2->3), but now annotated (i.e. some extra data inserted into the absyntax tree)

 4->5      : file with program in c, java, etc...




 The compiler works in several passes:
 ====================================

 Stage 1 and Stage 2
 -------------------
   Executed in one single pass. This pass will:
   - Do lexical analysis
   - Do syntax analysis
   - Execute the absyntax_utils/add_en_eno_param_decl_c visitor class
     This class will add the EN and ENO parameter declarations to all
     functions that do not have them already explicitly declared by the user.
     This will let us handle these parameters in the remaining compiler just as if
     they were standard input/output parameters.


 Stage Pre3
 ----------
   Executed in one single pass. This pass will populate the following symbol tables:
    - function_symtable;            /* A symbol table with all globally declared function POUs. */
    - function_block_type_symtable; /* A symbol table with all globally declared function block POUs. */
    - program_type_symtable;        /* A symbol table with all globally declared program POUs. */
    - type_symtable;                /* A symbol table with all user declared (non elementary) data type definitions. */
    - enumerated_value_symtable;    /* A symbol table with all identifiers (values) declared for enumerated types. */


 Stage 3
 -------
   Executes two algorithms (flow control analysis, and data type analysis) in several passes.

   Flow control:
      Pass 1: Does flow control analysis (for now only of IL code)
              Implemented in -> stage3/flow_control_analysis_c
              This will annotate the abstract syntax tree
              (Every object of the class il_instruction_c that is in the abstract syntax tree will have the variable
               'prev_il_instruction' correctly filled in.)

   Data Type Analysis
      Pass 1: Analyses the possible data types each expression/literal/IL instruction/etc. may take
              Implemented in -> stage3/fill_candidate_datatypes_c
              This will annotate the abstract syntax tree
              (Every object in the abstract syntax tree that may have a data type will have the variable
               'candidate_datatypes' correctly filled in.
              Additionally, objects in the abstract syntax tree that represent function invocations will have the variable
               'candidate_functions' correctly filled in.)
      Pass 2: Narrows all the possible data types each expression/literal/IL instruction/etc. may take down to a single data type
              Implemented in -> stage3/narrow_candidate_datatypes_c
              This will annotate the abstract syntax tree
              (Every object in the abstract syntax tree that may have a data type will have the variable
               'datatype' correctly filled in.
              Additionally, objects in the abstract syntax tree that represent function invocations will have the variables
               'called_function_declaration' and 'extensible_param_count' correctly filled in.
              Additionally, objects in the abstract syntax tree that represent function block (FB) invocations will have the variable
               'called_fb_declaration' correctly filled in.)
      Pass 3: Prints error messages when the IEC 61131-3 source code being analysed contains semantic data type incompatibility errors.
              Implemented in -> stage3/print_datatype_errors_c


 Stage 4
 -------
   Has 2 possible implementations.

   iec2c  : Generates C source code in a single pass (stage4/generate_c).
   iec2iec: Generates IEC 61131-3 source code in a single pass (stage4/generate_iec).




  ****************************************************************
  ****************************************************************
  ****************************************************************
  *********                                              *********
  *********                                              *********
  *********                  N O T E S                   *********
  *********                                              *********
  *********                                              *********
  ****************************************************************
  ****************************************************************
  ****************************************************************




 NOTE 1
 ======
 Note that stage 2 passes data back to stage 1. This is only
 possible because both stages are executed in the same pass.



 NOTE 2
 ======
 It would be nice to get this parser integrated into the gcc
 group of compilers. We would then be able to compile our ST/IL
 programs directly into executable binaries, for all the processor
 architectures gcc currently supports.
 The gcc compilers are divided into a frontend and a backend. The
 data structure between these two stages is called the syntax
 tree. In essence, we would need to create a new frontend that
 would parse the ST/IL program and build the syntax tree.
 Unfortunately the gcc syntax tree is not very well documented,
 and doing semantic checking on this tree would probably be a
 nightmare.
 We therefore chose to follow the same route as the GNAT (Ada 95)
 and COBOL compilers, i.e. generate our own abstract syntax tree,
 do semantic checking on our tree, do whatever optimisation
 we can at this level on our own tree, and only then build
 the gcc syntax tree from our abstract syntax tree.
 All this may still be integrated with the gcc backend to generate
 a new GNU compiler for the ST and IL programming languages.
 Since generating the gcc syntax tree will probably involve some
 trial and error effort due to the sparseness of documentation,
 we chose to start off by coding a C++ code generator for
 our stage 4. We may later implement a gcc syntax tree generator
 as an alternative stage 4 process, and then integrate it with
 the gcc toplevel.c file (command line parsing, etc...).



  ****************************************************************
  ****************************************************************
  ****************************************************************
  *********                                              *********
  *********                                              *********
  *********                S T A G E    1                *********
  *********                                              *********
  *********                                              *********
  ****************************************************************
  ****************************************************************
  ****************************************************************



 Issue 1
 =======

 The syntax defines the common_character_representation as:
 <any printable character except '$', '"' or "'"> | <escape sequences>

 Flex includes the function print_char() that defines
 all printable characters portably (i.e. whatever character
 encoding is currently being used: ASCII, EBCDIC, etc...).
 Unfortunately, we cannot generate the definition of
 common_character_representation portably, since flex
 does not allow definition of sets by subtracting
 elements in one set from another set (note how
 common_character_representation could be defined by
 subtracting '$', '"' and "'" from print_char() ).
 This means we must build up the definition of
 common_character_representation using only set addition,
 which leaves us with the only choice of defining the
 characters non-portably...

 In short, the definition we use for common_character_representation
 only works for ASCII character encoding!
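
 As an illustration only, the "set addition" approach can be sketched in C++
 as below. The function names are hypothetical and are not taken from the
 matiec sources. The sketch assumes ASCII, with the printable characters
 taken to be 0x20 (space) to 0x7E ('~'), and accepts a character only if it
 falls in one of the explicitly listed ranges, which is the same shape a flex
 character class such as [\x20\x21\x23\x25\x26\x28-\x7E] would have.

```cpp
#include <cassert>

// Hypothetical helper, not part of matiec.
// Assumes ASCII: printable characters are 0x20 (space) to 0x7E ('~').
// The set is built by addition only: each accepted range is listed
// explicitly, and the three excluded characters are simply left out.
bool is_common_character(unsigned char u) {
    return u == 0x20 || u == 0x21      // ' ' and '!'   (0x22 '"' is left out)
        || u == 0x23                   // '#'           (0x24 '$' is left out)
        || u == 0x25 || u == 0x26      // '%' and '&'   (0x27 '\'' is left out)
        || (u >= 0x28 && u <= 0x7E);   // '(' up to '~'
}

// Counts how many of the 256 possible byte values are accepted.
int count_common_characters() {
    int n = 0;
    for (int u = 0; u < 256; ++u) {
        if (is_common_character(static_cast<unsigned char>(u))) ++n;
    }
    return n;
}

// Usage example: the three special characters are rejected, the rest of
// printable ASCII is accepted.
void common_character_demo() {
    assert(is_common_character('a'));
    assert(!is_common_character('$'));
    assert(!is_common_character('"'));
    assert(!is_common_character('\''));
}
```

 On an EBCDIC system the same numeric ranges would select different
 characters, which is exactly the portability problem described above.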




 Issue 2
 =======

 We extend the IEC 61131-3 standard syntax to allow inclusion of
 other files. The accepted syntax is:

 {#include "<filename>" }

 We use a pragma directive for this (allowed by the standard itself),
 since it is an extension of the standard. In principle, this would
 be ignored by other standard compliant compilers!



  ****************************************************************
  ****************************************************************
  ****************************************************************
  *********                                              *********
  *********                                              *********
  *********                S T A G E    2                *********
  *********                                              *********
  *********                                              *********
  ****************************************************************
  ****************************************************************
  ****************************************************************

 Overall Comments
 ================


 Comment 1
 ---------
 We have augmented the syntax the specification defines to include
 restrictions defined in the semantics of the languages.

 This is required because the syntax cannot be parsed by a LALR(1)
 parser as it is presented in the specification. Many reduce/reduce
 and shift/reduce conflicts arise. This is mainly because the parser
 cannot discern how to reduce an identifier.
 Identifiers show up in
 many places in the syntax, and it is not always possible to
 figure out if the identifier is a variable_name, enumeration
 value, function block name, etc... only from the context in
 which it appears.

 A more detailed example of why we need symbol tables is given by
 the type definitions... In the definition of new types
 (section B 1.3.3) the parser needs to figure out the class of
 the new type being defined (enumerated, structure, array, etc...).
 This works well when the base classes are elementary types
 (or structures, enumerations, arrays, etc. thereof). It becomes
 confusing to the parser when the new_type is based on a previously
 user defined type.

 TYPE
   new_type_1 : INT := 99;
   new_type_2 : new_type_1 := 100;
 END_TYPE

 When parsing new_type_1, the parser can figure out that the
 identifier new_type_1 is a simple_type_name, because it is
 based on an elementary type without structure, arrays, etc...
 While parsing new_type_2, it becomes confused how to reduce
 the new_type_2 identifier, as it is based on the identifier
 new_type_1, of which it does not know the class (remember, at this
 stage new_type_1 is a simple identifier!).
 We therefore need to keep track of the class of the user
 defined types as they are declared, so that the lexical analyser
 can tell the syntax parser what class the type belongs to. We
 cannot use the abstract syntax tree itself to search for the
 declaration of new_type_1 as we only get a handle to the root
 of the tree at the end of the parsing.

 We therefore maintain an independent and parallel table of symbols,
 that is filled as we come across the type declarations in the code.
 Actually, we ended up also needing to store variable names in
 the symbol table. Since variable names come and go out of scope
 depending on what portion of code we are parsing, we sometimes
 need to remove the variable names from the symbol table.
 Since the ST and IL languages only have a single level of scope,
 I (Mario) found it easier to simply use a second symbol table for
 the variable names that is completely cleared when the parser
 reaches the end of a function (function block or program).

 What I mean when I say that these languages have a single level
 of scope is that all variables used in a function (function block
 or program) must be declared inside that function (function block
 or program). Even global variables must be re-declared as EXTERN
 before a function may access them! This means that it is easy
 to simply load up the variable name symbol table when we start
 parsing a function (function block or program), and to clear it
 when we reach the end. Checking whether variables declared
 as EXTERN really exist inside a RESOURCE or a CONFIGURATION
 is left to stage 3 (semantic checking) where we can use the
 abstract tree itself to search for the variables (NOTE: semantic
 checking at stage 3 has not yet been implemented, so we may yet
 end up using a symbol table too at that stage!).
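
 The two-table arrangement described above might be pictured with the
 following minimal C++ sketch. It is not the code in util/symtable.[hh|cc];
 the class SymbolTable, the enum IdClass and the function scope_demo are
 hypothetical names used only for illustration. It shows a type table that
 lives for the whole file, a variable table that is cleared at the end of
 each function (function block or program), and how the class of new_type_2
 can be taken from the already recorded class of new_type_1.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical classes an identifier may belong to (illustration only).
enum class IdClass { unknown, simple_type, variable };

// Minimal symbol table: maps an identifier to the class it was declared with.
class SymbolTable {
public:
    void insert(const std::string& name, IdClass cls) { entries_[name] = cls; }
    void clear() { entries_.clear(); }
    // Returns IdClass::unknown when the name has not been declared.
    IdClass find(const std::string& name) const {
        auto it = entries_.find(name);
        return it == entries_.end() ? IdClass::unknown : it->second;
    }
private:
    std::map<std::string, IdClass> entries_;
};

// Usage example following the TYPE ... END_TYPE declaration in the text.
void scope_demo() {
    SymbolTable types;      // filled as type declarations are parsed, never cleared here
    SymbolTable variables;  // holds the variables of the POU currently being parsed

    // new_type_1 : INT := 99;  -> based on an elementary type, so a simple type.
    types.insert("new_type_1", IdClass::simple_type);
    // new_type_2 : new_type_1 := 100;  -> class looked up from the base type.
    types.insert("new_type_2", types.find("new_type_1"));
    assert(types.find("new_type_2") == IdClass::simple_type);

    // Variables are loaded while parsing a POU...
    variables.insert("counter", IdClass::variable);
    assert(variables.find("counter") == IdClass::variable);

    // ...and dropped all at once when the end of the POU is reached.
    variables.clear();
    assert(variables.find("counter") == IdClass::unknown);
    assert(types.find("new_type_1") == IdClass::simple_type);
}
```

 Because there is only one level of scope, a plain clear() is enough; no
 stack of nested scopes is needed.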

 Due to the use of the symbol tables, and of special identifier
 tokens that depend on how the identifier was previously
 declared in the code being parsed, the syntax was slightly
 changed regarding the definition of variable names, derived
 function names, etc... FROM for e.g.:
      variable_name: identifier;
 TO
      variable_name: variable_name_token;

 Flex first looks at the symbol tables when it finds an identifier,
 and returns the correct token corresponding to the identifier
 type in question. Only if the identifier is not currently stored
 in any symbol table does flex return a simple identifier_token.

 This means that the declarations of variables, functions etc...
 were changed FROM:
      function_declaration: FUNCTION derived_function_name ...
 TO
      function_declaration: FUNCTION identifier ...
 since the initial definition of derived_function_name had been
 changed FROM
      derived_function_name: identifier;
 TO
      derived_function_name: derived_function_name_token;




 Comment 2
 ---------
 Since the ST and IL languages share a lot of common syntax,
 I have decided to write a single parser to handle both languages
 simultaneously. This approach has the advantage that the user
 may mix the languages used in the same file, as long as each function
 is written in a single language.

 This approach also assumes that all the IL language operators are
 keywords, which means that it is not possible to define variables
 using names such as "LD", "ST", etc...
 Note that the spec does not consider these operators to be keywords,
 so it means that they should be available for variable names! On the
 other hand, all implementations of the ST and IL languages seem to
 treat them as keywords, so there is not much harm in doing the same.

 If it ever becomes necessary to allow variables with names of IL
 operators, either the syntax will have to be augmented, or we can
 break up the parser in two: one for ST and another for IL.



 /********************************/
 /* B 1.3.3 - Derived data types */
 /********************************/

 Issue 1
 =======

 According to the spec, the valid construct
 TYPE new_str_type : STRING := "hello!"; END_TYPE
 has two possible routes to type_declaration...

 Route 1:
   type_declaration: single_element_type_declaration
   single_element_type_declaration: simple_type_declaration
   simple_type_declaration: identifier ':' simple_spec_init
   simple_spec_init: simple_specification ASSIGN constant
   (shift: identifier <- 'new_str_type')
   simple_specification: elementary_type_name
   elementary_type_name: STRING
   (shift: elementary_type_name <- STRING)
   (reduce: simple_specification <- elementary_type_name)
   (shift: constant <- "hello!")
   (reduce: simple_spec_init <- simple_specification ASSIGN constant)
   (reduce: ...)


 Route 2:
   type_declaration: string_type_declaration
   string_type_declaration: identifier ':' elementary_string_type_name string_type_declaration_size string_type_declaration_init
   (shift: identifier <- 'new_str_type')
   elementary_string_type_name: STRING
   (shift: elementary_string_type_name <- STRING)
   (shift: string_type_declaration_size <- /* empty */)
   string_type_declaration_init: ASSIGN character_string
   (shift: character_string <- "hello!")
   (reduce: string_type_declaration_init <- ASSIGN character_string)
   (reduce: string_type_declaration <- identifier ':' elementary_string_type_name string_type_declaration_size string_type_declaration_init )
   (reduce: type_declaration <- string_type_declaration)


 At first glance it seems that removing route 1 would make
 the most sense. Unfortunately the construct 'simple_spec_init'
 shows up multiple times in other rules, so changing this construct
 would mean changing all the rules in which it appears.
 I (Mario) therefore chose to remove route 2 instead. This means
 that the above declaration gets stored in a
 simple_type_declaration_c, and not in a string_type_declaration_c
 as would be expected!


 /***********************/
 /* B 1.5.1 - Functions */
 /***********************/


 Issue 1
 =======

 Due to reduce/reduce conflicts between identifiers
 being reduced to either a variable or an enumerator value,
 we were forced to keep a symbol table of the names
 of all declared variables.
 Variables are no longer
 created from a simple identifier_token, but from a
 variable_name_token.

 BUT, in functions the function name may be used as
 a variable! In order to be able to parse this correctly,
 the token parser (flex) must return a variable_name_token
 when it comes across the function name, while parsing
 the function itself.
 We do this by inserting the function name into the variable
 symbol table, and having flex return a variable_name_token
 whenever it comes across it.
 When we finish parsing the function the variable name
 symbol table is cleared of all entries, and the function
 name is inserted into the library element symbol table. This
 means that from then onwards flex will return a
 derived_function_name_token whenever it comes across the
 function name.

 In order to insert the function name into the variable_name
 symbol table BEFORE the function body gets parsed, we
 need the parser to reduce a construct that contains
 the function name. That is why we created the extra
 construct 'function_name_declaration', i.e. to force
 the parser to reduce it before parsing the function body,
 and therefore get an opportunity to insert the function name
 into the variable name symbol table!
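
 The life cycle of a function name described above can be sketched in C++
 as follows. This is only an illustration under stated assumptions: the
 struct SharedTables and its member functions are hypothetical, the Token
 enum merely reuses the token names mentioned in the text, and the order in
 which the two tables are consulted is an assumption of the sketch, not
 something taken from the matiec lexer.

```cpp
#include <cassert>
#include <set>
#include <string>

// Token names as used in the text; the enum itself is hypothetical.
enum class Token { identifier_token, variable_name_token, derived_function_name_token };

// Hypothetical stand-in for the tables written by the parser (stage 2)
// and read by the lexer (stage 1).
struct SharedTables {
    std::set<std::string> variable_names;    // cleared at the end of every function
    std::set<std::string> library_elements;  // kept for the rest of the input

    // Would be called when 'function_name_declaration' is reduced,
    // i.e. before the function body is parsed.
    void begin_function(const std::string& name) { variable_names.insert(name); }

    // Would be called once the whole function declaration has been parsed.
    void end_function(const std::string& name) {
        variable_names.clear();
        library_elements.insert(name);
    }

    // Token the lexer would return for an identifier-shaped lexeme.
    // Assumption: the variable table is consulted first.
    Token classify(const std::string& lexeme) const {
        if (variable_names.count(lexeme)) return Token::variable_name_token;
        if (library_elements.count(lexeme)) return Token::derived_function_name_token;
        return Token::identifier_token;
    }
};

// Usage example: the token returned for "my_func" changes as parsing advances.
void function_name_demo() {
    SharedTables t;
    assert(t.classify("my_func") == Token::identifier_token);             // before the declaration
    t.begin_function("my_func");
    assert(t.classify("my_func") == Token::variable_name_token);          // inside the body
    t.end_function("my_func");
    assert(t.classify("my_func") == Token::derived_function_name_token);  // after END_FUNCTION
}
```

 The sketch also shows why the name must be registered before the body is
 read: until begin_function runs, an assignment to the function name inside
 the body would be seen as a plain identifier_token.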


 /********************************/
 /* B 3.2.4 Iteration Statements */
 /********************************/

 Issue 1
 =======

 For the 'FOR' iteration loop

   FOR control_variable ASSIGN expression TO expression BY expression DO statement_list END_FOR

 The spec declares the control variable in the syntax as

   control_variable: identifier;

 but then defines the semantics of control_variable (Section 3.3.2.4)
 as being of an integer type (e.g., SINT, INT, or DINT).
 Obviously this presupposes that the control_variable must have been
 declared in some 'VAR ... END_VAR' construct, so I (Mario) changed
 the syntax to read

   control_variable: variable_name;




 **************************************************************************

    Copyright (C) 2003-2012  Mario de Sousa (msousa@fe.up.pt)