diff -r 000000000000 -r fb772792efd1 readme --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/readme Wed Jan 31 15:32:38 2007 +0100 @@ -0,0 +1,407 @@ + + + + IEC 61131-3 IL and ST compiler + + + The following compiler has been based on the + FINAL DRAFT - IEC 61131-3, 2nd Ed. (2001-12-10) + + + (c) 2003 Mario de Sousa + + + +**************************************************************** +**************************************************************** +**************************************************************** +********* ********* +********* ********* +********* O V E R A L L A R C H I T E C T U R E ********* +********* ********* +********* ********* +**************************************************************** +**************************************************************** +**************************************************************** + + The compiler works in 4(+1) stages: + Stage 1 - Lexical analyser - implemented with flex (iec.flex) + Stage 2 - Syntax parser - implemented with bison (iec.y) + Stage 3 - Semantics analyser - not yet implemented + Stage 4 - Code generator - implemented in C++ + Stage 4+1 - Binary code generator - gcc, javac, etc... + + + Data structures passed between stages, in global variables: + 1->2 : tokens (int), and token values (char *) + 2->1 : symbol tables (defined in symtable.hh) + 2->3 : abstract syntax tree (tree of C++ classes, in absyntax.hh file) + 3->4 : Same as 2->3 + 4->4+1 : file with program in c, java, etc... + + + The compiler works in several passes: + Pass 1: executes stages 1 and 2 simultaneously + Pass 2: executes stage 3 + Pass 3: executes stage 4 + Pass 4: executes stage 4+1 + + + NOTE 1 + ====== + Note that stage 2 passes data back to stage 1. This is only +possible because both stages are executed in the same pass. + + + + NOTE 2 + ====== + I (Mario) have a feeling that the abstract syntax may be +considerably simplified without any drawbacks to semantic checking +and code generation. I have nevertheless opted to keep as much +info as possible in the abstract syntax tree, in case it may become +necessary further on. + Once we start coding the next stages (semantic checking and code +generation) I will get a better understanding of what is required +of the abstract syntax tree. At that stage I will be better +positioned to make a more informed decision on how best to structure +the abstract syntax tree. + For now, we play conservative and keep as much info as possible. + + + + NOTE 3 + ====== + It would be nice to get this parser integrated into the gcc +group of compilers. We would then be able to compile our st/il +programs directly into executable binaries, for all the processor +architectures gcc currently supports. + The gcc compilers are divided into a frontend and backend. The +data structure between these two stages is called the syntax +tree. In essence, we would need to create a new frontend that +would parse the st/il program and build the syntax tree. +Unfortunately the gcc syntax tree is not very well documented, +and doing semantic checking on this tree would probably be a +nightmare. + We therefore chose to follow the same route as the gnat (ada 95) +and cobol compilers, i.e. generate our own abstract syntax tree, +do semantic checking on our tree, do whatever optimisation +we can at this level on our own tree, and only then build +the gcc syntax tree from our abstract syntax tree. + All this may still be integrated with the gcc backend to generate +a new gnu compiler for the st and il programming languages. 
 Since generating the gcc syntax tree will probably involve some
trial and error due to the sparseness of the documentation,
we chose to start off by coding a C++ code generator for
our stage 4. We may later implement a gcc syntax tree generator
as an alternative stage 4 process, and then integrate it with
the gcc toplevel.c file (command line parsing, etc...).



****************************************************************
****************************************************************
****************************************************************
*********                                              *********
*********                                              *********
*********            S T A G E     1                   *********
*********                                              *********
*********                                              *********
****************************************************************
****************************************************************
****************************************************************



Issue 1
=======

 The syntax defines the common_character_representation as:
   <any printable character except '$', '"' or "'">
   | <the two-character '$'-escape combinations>

 Flex includes the function print_char() that defines
all printable characters portably (i.e. whatever character
encoding is currently being used: ASCII, EBCDIC, etc...).
Unfortunately, we cannot generate the definition of
common_character_representation portably, since flex
does not allow a set to be defined by subtracting the
elements of one set from another set (note how
common_character_representation could be defined by
subtracting '$', '"' and "'" from print_char()).
This means we must build up the definition of
common_character_representation using only set addition,
which leaves us with no choice but to define the
characters non-portably...

 In short, the definition we use for common_character_representation
only works for the ASCII character encoding!




Issue 2
=======

We extend the IEC 61131-3 standard syntax to allow the inclusion of
other files. The accepted syntax is:

 (*#include "<filename>" *)

Note how this would be ignored by other standard-compliant compilers
as a simple comment!
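 As a quick illustration, an ST source file could use this extension
to pull in a shared declaration, while any other compiler would skip
the first line as an ordinary comment. (The file name common_types.st
and the identifiers below are invented for this sketch only; they are
not taken from the compiler's sources.)

(*#include "common_types.st" *)
(* The line above is an include directive for this compiler, but just
   an ordinary comment to any other IEC 61131-3 compiler.             *)

PROGRAM example_prog
  VAR
    (* temperature_t is assumed to be declared in the included file,
       e.g.:  TYPE temperature_t : INT := 0; END_TYPE                 *)
    room_temp : temperature_t;
  END_VAR

  room_temp := 21;
END_PROGRAM
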
****************************************************************
****************************************************************
****************************************************************
*********                                              *********
*********                                              *********
*********            S T A G E     2                   *********
*********                                              *********
*********                                              *********
****************************************************************
****************************************************************
****************************************************************

 Overall Comments
 ================


 Comment 1
 ---------
 We have augmented the syntax defined in the specification to include
restrictions that the specification only defines in the semantics of
the languages.

 This is required because the syntax, as presented in the
specification, cannot be parsed by a LALR(1) parser: many
reduce/reduce and shift/reduce conflicts arise. This is mainly
because the parser cannot discern how to reduce an identifier.
Identifiers show up in many places in the syntax, and it is not
always possible to figure out whether an identifier is a
variable_name, an enumeration value, a function block name, etc...
only from the context in which it appears.

 A more detailed example of why we need symbol tables is the
definition of new types. In a type definition (section B 1.3.3) the
parser needs to figure out the class of the new type being defined
(enumerated, structure, array, etc...). This works well when the
base type is an elementary type (or a structure, enumeration, array,
etc. thereof). It becomes confusing to the parser when the new type
is based on a previously user-defined type.

TYPE
  new_type_1 : INT := 99;
  new_type_2 : new_type_1 := 100;
END_TYPE

 When parsing new_type_1, the parser can figure out that the
identifier new_type_1 is a simple_type_name, because it is
based on an elementary type without structure, arrays, etc...
While parsing new_type_2, it cannot tell how to reduce
the new_type_2 identifier, as it is based on the identifier
new_type_1, whose class it does not know (remember, at this
stage new_type_1 is a simple identifier!).
 We therefore need to keep track of the class of the user
defined types as they are declared, so that the lexical analyser
can tell the syntax parser what class each type belongs to. We
cannot use the abstract syntax tree itself to search for the
declaration of new_type_1, as we only get a handle to the root
of the tree at the end of parsing.

 We therefore maintain an independent and parallel symbol table
that is filled in as we come across the type declarations in the
code. Actually, we ended up also needing to store variable names
in a symbol table. Since variable names come into and go out of
scope depending on what portion of code we are parsing, we
sometimes need to remove the variable names from the symbol table.
Since the ST and IL languages only have a single level of scope,
I (Mario) found it easier to simply use a second symbol table for
the variable names, one that is completely cleared when the parser
reaches the end of a function (function block or program).

 What I mean when I say that these languages have a single level
of scope is that all variables used in a function (function block
or program) must be declared inside that function (function block
or program). Even global variables must be re-declared as EXTERN
before a function may access them! This means that it is easy
to simply load up the variable name symbol table when we start
parsing a function (function block or program), and to clear it
when we reach the end. Checking whether variables declared
as EXTERN really exist inside a RESOURCE or a CONFIGURATION
is left to stage 3 (semantic checking), where we can use the
abstract syntax tree itself to search for the variables (NOTE:
semantic checking at stage 3 has not yet been implemented, so we
may yet end up using a symbol table at that stage too!).
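 The following sketch shows what this single level of scope looks
like in practice (all names are invented for this example, and the
configuration is reduced to the bare minimum). The function block may
only touch the global variable because it re-declares it in a
VAR_EXTERNAL block; when the parser reaches END_FUNCTION_BLOCK, all
of these variable names are dropped from the variable name symbol
table, and checking that the external really exists in the
configuration is deferred to stage 3.

FUNCTION_BLOCK count_up
  VAR_EXTERNAL
    global_counter : INT;       (* global must be re-declared before use *)
  END_VAR
  VAR
    step : INT := 1;            (* local to this function block          *)
  END_VAR

  global_counter := global_counter + step;
END_FUNCTION_BLOCK

CONFIGURATION example_config
  VAR_GLOBAL
    global_counter : INT := 0;  (* the actual declaration of the global  *)
  END_VAR
  (* resource, task and program instance declarations would follow here *)
END_CONFIGURATION
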
 Due to the use of the symbol tables, and of special identifier
tokens that depend on how an identifier was previously declared in
the code being parsed, the syntax was slightly changed regarding the
definition of variable names, derived function names, etc.
For example, FROM:
variable_name: identifier;
TO
variable_name: variable_name_token;

 Flex first looks at the symbol tables when it finds an identifier,
and returns the correct token corresponding to the identifier
type in question. Only if the identifier is not currently stored
in any symbol table does flex return a simple identifier_token.

 This means that the declarations of variables, functions, etc.
were changed FROM:
function_declaration: FUNCTION derived_function_name ...
TO
function_declaration: FUNCTION identifier ...
since the initial definition of derived_function_name had been
changed FROM
derived_function_name: identifier;
TO
derived_function_name: derived_function_name_token;




 Comment 2
 ---------
 Since the ST and IL languages share a lot of common syntax,
I have decided to write a single parser to handle both languages
simultaneously. This approach has the advantage that the user
may mix the languages used in the same file, as long as each
function is written in a single language.

 This approach also assumes that all the IL language operators are
keywords, which means that it is not possible to define variables
with names such as "LD", "ST", etc...
Note that the spec does not consider these operators to be keywords,
which means they should be available for use as variable names! On
the other hand, all implementations of the ST and IL languages seem
to treat them as keywords, so there is not much harm in doing the
same.

 If it ever becomes necessary to allow variables with the names of
IL operators, either the syntax will have to be augmented, or we can
break up the parser in two: one for ST and another for IL.



/********************************/
/* B 1.3.3 - Derived data types */
/********************************/

Issue 1
=======

According to the spec, the valid construct
TYPE new_str_type : STRING := "hello!"; END_TYPE
has two possible routes to type_declaration...

Route 1:
type_declaration: single_element_type_declaration
single_element_type_declaration: simple_type_declaration
simple_type_declaration: identifier ':' simple_spec_init
simple_spec_init: simple_specification ASSIGN constant
(shift: identifier <- 'new_str_type')
simple_specification: elementary_type_name
elementary_type_name: STRING
(shift: elementary_type_name <- STRING)
(reduce: simple_specification <- elementary_type_name)
(shift: constant <- "hello!")
(reduce: simple_spec_init <- simple_specification ASSIGN constant)
(reduce: ...)


Route 2:
type_declaration: string_type_declaration
string_type_declaration: identifier ':' elementary_string_type_name string_type_declaration_size string_type_declaration_init
(shift: identifier <- 'new_str_type')
elementary_string_type_name: STRING
(shift: elementary_string_type_name <- STRING)
(shift: string_type_declaration_size <- /* empty */)
string_type_declaration_init: ASSIGN character_string
(shift: character_string <- "hello!")
(reduce: string_type_declaration_init <- ASSIGN character_string)
(reduce: string_type_declaration <- identifier ':' elementary_string_type_name string_type_declaration_size string_type_declaration_init)
(reduce: type_declaration <- string_type_declaration)


 At first glance it seems that removing route 1 would make
the most sense. Unfortunately the construct 'simple_spec_init'
shows up multiple times in other rules, so changing this construct
would mean changing all the rules in which it appears.
I (Mario) therefore chose to remove route 2 instead. This means
that the above declaration gets stored in a
simple_type_declaration_c, and not in a string_type_declaration_c
as would be expected!


/***********************/
/* B 1.5.1 - Functions */
/***********************/


Issue 1
=======

 Due to reduce/reduce conflicts between identifiers
being reduced to either a variable or an enumeration value,
we were forced to keep a symbol table of the names
of all declared variables. Variables are no longer
created from a simple identifier_token, but from a
variable_name_token.

 BUT, in functions the function name may be used as
a variable! In order to parse this correctly,
the lexical analyser (flex) must return a variable_name_token
when it comes across the function name while parsing
the function itself.
We do this by inserting the function name into the variable
symbol table, and having flex return a variable_name_token
whenever it comes across it.
When we finish parsing the function, the variable name
symbol table is cleared of all entries, and the function
name is inserted into the library element symbol table. This
means that from then onwards flex will return a
derived_function_name_token whenever it comes across the
function name.

 In order to insert the function name into the variable name
symbol table BEFORE the function body gets parsed, we
need the parser to reduce a construct that contains the
function name. That is why we created the extra
construct 'function_name_declaration', i.e. to force
the parser to reduce it before parsing the function body,
and therefore get an opportunity to insert the function name
into the variable name symbol table!
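 A small example (the function name add_one is made up for
illustration) shows why this is needed: inside the body, the
function name is assigned to as if it were a variable, since it
holds the return value. While the body below is being parsed, flex
must therefore return a variable_name_token for add_one; only after
END_FUNCTION is reached does add_one start producing a
derived_function_name_token.

FUNCTION add_one : INT
  VAR_INPUT
    x : INT;
  END_VAR

  add_one := x + 1;    (* the function name used as a variable *)
END_FUNCTION
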
/********************************/
/* B 3.2.4 Iteration Statements */
/********************************/

Issue 1
=======

For the 'FOR' iteration loop

 FOR control_variable ASSIGN expression TO expression BY expression DO statement_list END_FOR

the spec declares the control variable in the syntax as

 control_variable: identifier;

but then defines the semantics of the control_variable (Section 3.3.2.4)
as being of an integer type (e.g., SINT, INT, or DINT).
Obviously this presupposes that the control_variable must have been
declared in some 'VAR ... END_VAR' construct, so I (Mario) changed
the syntax to read

 control_variable: variable_name;





**************************************************************************

 (c) 2003 Mario de Sousa