matiec: comparison readme

equal deleted inserted replaced

--1:000000000000
+:fb772792efd1
+IEC 61131-3 IL and ST compiler
+The following compiler has been based on the
+FINAL DRAFT - IEC 61131-3, 2nd Ed. (2001-12-10)
+(c) 2003 Mario de Sousa
+****************************************************************
+****************************************************************
+****************************************************************
+*********                                              *********
+*********                                              *********
+*********   O V E R A L L    A R C H I T E C T U R E   *********
+*********                                              *********
+*********                                              *********
+****************************************************************
+****************************************************************
+****************************************************************
+The compiler works in 4(+1) stages:
+Stage 1   - Lexical analyser      - implemented with flex (iec.flex)
+Stage 2   - Syntax parser         - implemented with bison (iec.y)
+Stage 3   - Semantics analyser    - not yet implemented
+Stage 4   - Code generator        - implemented in C++
+Stage 4+1 - Binary code generator - gcc, javac, etc...
+Data structures passed between stages, in global variables:
+1->2   : tokens (int), and token values (char *)
+2->1   : symbol tables (defined in symtable.hh)
+2->3   : abstract syntax tree (tree of C++ classes, in absyntax.hh file)
+3->4   : Same as 2->3
+4->4+1 : file with program in c, java, etc...
+The compiler works in several passes:
+Pass 1: executes stages 1 and 2 simultaneously
+Pass 2: executes stage 3
+Pass 3: executes stage 4
+Pass 4: executes stage 4+1
+NOTE 1
+======
+Note that stage 2 passes data back to stage 1. This is only
+possible because both stages are executed in the same pass.
+NOTE 2
+======
+I (Mario) have a feeling that the abstract syntax may be
+considerably simplified without any drawbacks to semantic checking
+and code generation. I have nevertheless opted to keep as much
+info as possible in the abstract syntax tree, in case it may become
+necessary further on.
+Once we start coding the next stages (semantic checking and code
+generation) I will get a better understanding of what is required
+of the abstract syntax tree. At that stage I will be better
+positioned to make a more informed decision on how best to structure
+the abstract syntax tree.
+For now, we play conservative and keep as much info as possible.
+NOTE 3
+======
+It would be nice to get this parser integrated into the gcc
+group of compilers. We would then be able to compile our st/il
+programs directly into executable binaries, for all the processor
+architectures gcc currently supports.
+The gcc compilers are divided into a frontend and backend. The
+data structure between these two stages is called the syntax
+tree. In essence, we would need to create a new frontend that
+would parse the st/il program and build the syntax tree.
+Unfortunately the gcc syntax tree is not very well documented,
+and doing semantic checking on this tree would probably be a
+nightmare.
+We therefore chose to follow the same route as the gnat (ada 95)
+and cobol compilers, i.e. generate our own abstract syntax tree,
+do semantic checking on our tree, do whatever optimisation
+we can at this level on our own tree, and only then build
+the gcc syntax tree from our abstract syntax tree.
+All this may still be integrated with the gcc backend to generate
+a new gnu compiler for the st and il programming languages.
+Since generating the gcc syntax tree will probably envolve some
+trial and error effort due to the sparseness of documentation,
+we chose to start off by coding a C++ code generator for
+our stage 4. We may later implement a gcc syntax tree generator
+as an alternative stage 4 process, and then integrate it with
+the gcc toplevel.c file (command line parsing, etc...).
+****************************************************************
+****************************************************************
+****************************************************************
+*********                                              *********
+*********                                              *********
+*********               S T A G E      1               *********
+*********                                              *********
+*********                                              *********
+****************************************************************
+****************************************************************
+****************************************************************
+Issue 1
+=======
+The syntax defines the common_character_representation as:
+<any printable character except '$', '"' or "'"> | <escape sequences>
+Flex includes the function print_char() that defines
+all printable characters portably (i.e. whatever character
+encoding is currently being used , ASCII, EBCDIC, etc...)
+Unfortunately, we cannot generate the definition of
+common_character_representation portably, since flex
+does not allow definition of sets by subtracting
+elements in one set from another set (Note how
+common_character_representation could be defined by
+subtracting '$' '"' and "'" from print_char() ).
+This means we must build up the defintion of
+common_character_representation using only set addition,
+which leaves us with the only choice of defining the
+characters non-portably...
+In short, the definition we use for common_character_representation
+only works for ASCII character encoding!
+Issue 2
+=======
+We extend the IEC 61131-3 standard syntax to allow inclusion of
+other files. The accepted syntax is:
+(*#include "<filename>" *)
+Note how this would be ignored by other standard complient compilers
+as a simple comment!
+****************************************************************
+****************************************************************
+****************************************************************
+*********                                              *********
+*********                                              *********
+*********               S T A G E      2               *********
+*********                                              *********
+*********                                              *********
+****************************************************************
+****************************************************************
+****************************************************************
+Overall Comments
+================
+Comment 1
+---------
+We have augmented the syntax the specification defines to include
+restrictions defined in the semantics of the languages.
+This is required because the syntax cannot be parsed by a LALR(1)
+parser as it is presented in the specification. Many reduce/reduce
+and shift/reduce conflicts arise. This is mainly because the parser
+cannot discern how to reduce an identifier. Identifiers show up in
+many places in the syntax, and it is not entirely possible to
+figure out if the identifier is a variable_name, enumeration
+value, function block name, etc... only from the context in
+which it appears.
+A more detailed example of why we need symbol tables are
+the type definitions...  In definition of new types
+(section B 1.3.3) the parser needs to figure out the class of
+the new type being defined (enumerated, structure, array, etc...).
+This works well when the base classes are elementary types
+(or structures, enumeration, arrays, etc. thereof). It becomes
+confusing to the parser when the new_type is based on a previously
+user defined type.
+TYPE
+new_type_1 : INT := 99;
+new_type_2 : new_type_1 := 100;
+END_TYPE
+When parsing new_type_1, the parser can figure out that the
+identifier new_type_1 is a simple_type_name, because it is
+based on a elementary type without structure, arrays, etc...
+While parsing new_type_2, it becomes confused how to reduce
+the new_type_2 identifier, as it is based on the identifier
+new_type_1, of which it does not know the class (remember, at this
+stage new_type_1 is a simple identifier!).
+We therefore need to keep track of the class of the user
+defined types as they are declared, so that the lexical analyser
+can tell the syntax parser what class the type belongs to. We
+cannot use the abstract syntax tree itself to search for the
+declaration of new_type_1 as we only get a handle to the root
+of the tree at the end of the parsing.
+We therefore maintain an independent and parallel table of symbols,
+that is filled as we come across the type delcarations in the code.
+Actually, we ended up also needing to store variable names in
+the symbol table. Since variable names come and go out of scope
+depending on what portion of code we are parsing, we sometimes
+need to remove the variable names from the symbol table.
+Since the ST and IL languages only have a single level of scope,
+I (Mario) found it easier to simply use a second symbol table for
+the variable names that is completely cleared when the parser
+reaches the end of a function (function block or program).
+What I mean when I say that these languages have a single level
+of scope is that all variables used in a function (function block
+or program) must be declared inside that function (function block
+or program). Even global variables must be re-declared as EXTERN
+before a function may access them! This means that it is easy
+to simply load up the variable name symbol table when we start
+parsing a function (function block or program), and to clear it
+when we reach the end. Checking whether variables declared
+as EXTERN really exist inside a RESOURCE or a CONFIGURATION
+is left to stage 3 (semantic checking) where we can use the
+abstract tree itself to search for the variables (NOTE: semantic
+cheching at stage 3 has not yet been implemented, so we may yet
+end up using a symbol table too at that stage!).
+Due to the use of the symbol tables, and special identifier
+tokens depending on the type of identifier it had previously
+been declared in the code being parsed, the syntax was slightly
+changed regarding the definition of variable names, derived
+function names, etc... FROM for e.g.:
+variable_name: identifier;
+TO
+variable_name: variable_name_token;
+Flex first looks at the symbol tables when it finds an identifier,
+and returns the correct token corresponding to the identifier
+type in question. Only if the identifier is not currently stored
+in any symbol table, does flex return a simple identifier_token.
+This means that the declarations of variables, functions etc...
+were changed FROM:
+function_declaration: FUNCTION derived_function_name ...
+TO
+function_declaration: FUNCTION identifier ...
+since the initial definition of derived_function_name had been
+changed FROM
+derived_function_name: identifier;
+TO
+derived_function_name: derived_function_name_token;
+Comment 2
+---------
+Since the ST and IL languages share a lot of common syntax,
+I have decided to write a single parser to handle both languages
+simultaneously. This approach has the advantage that the user
+may mix the language used in the same file, as long as each function
+is written in a single lanuage.
+This approach also assumes that all the IL language operators are
+keywords, which means that it is not possible to define variables
+using names such as "LD", "ST", etc...
+Note that the spec does not consider these operators to be keywords,
+so it means that they should be available for variable names! On the
+other hand, all implementations of the ST and IL languages seems to
+treat them as keywords, so there is not much harm in doing the same.
+If it ever becomes necessary to allow variables with names of IL
+operators, either the syntax will have to be augmented, or we can
+brake up the parser in two: one for ST and another for IL.
+/********************************/
+/* B 1.3.3 - Derived data types */
+/********************************/
+Issue 1
+=======
+According to the spec, the valid construct
+TYPE new_str_type : STRING := "hello!"; END_TYPE
+has two possible routes to type_declaration...
+Route 1:
+type_declaration: single_element_type_declaration
+single_element_type_declaration: simple_type_declaration
+simple_type_declaration: identifier ':' simple_spec_init
+simple_spec_init: simple_specification ASSIGN constant
+(shift:  identifier <- 'new_str_type')
+simple_specification: elementary_type_name
+elementary_type_name: STRING
+(shift: elementary_type_name <- STRING)
+(reduce: simple_specification <- elementary_type_name)
+(shift: constant <- "hello!")
+(reduce: simple_spec_init: simple_specification ASSIGN constant)
+(reduce: ...)
+Route 2:
+type_declaration: string_type_declaration
+string_type_declaration: identifier ':' elementary_string_type_name string_type_declaration_size string_type_declaration_init
+(shift:  identifier <- 'new_str_type')
+elementary_string_type_name: STRING
+(shift: elementary_string_type_name <- STRING)
+(shift: string_type_declaration_size <- /* empty */)
+string_type_declaration_init: ASSIGN character_string
+(shift: character_string <- "hello!")
+(reduce: string_type_declaration_init <- ASSIGN character_string)
+(reduce: string_type_declaration <- identifier ':' elementary_string_type_name string_type_declaration_size string_type_declaration_init )
+(reduce: type_declaration <- string_type_declaration)
+At first glance it seems that removing route 1 would make
+the most sense. Unfortunately the construct 'simple_spec_init'
+shows up multiple times in other rules, so changing this construct
+would mean changing all the rules in which it appears.
+I (Mario) therefore chose to remove route 2 instead. This means
+that the above declaration gets stored in a
+simple_type_declaration_c, and not in a string_type_declaration_c
+as would be expected!
+/***********************/
+/* B 1.5.1 - Functions */
+/***********************/
+Issue 1
+=======
+Due to reduce/reduce conflicts between identifiers
+being reduced to either a variable or an enumerator value,
+we were forced to keep a symbol table of the names
+of all declared variables. Variables are no longer
+created from simple identifier_token, but from
+variable_name_token.
+BUT, in functions the function name may be used as
+a variable! In order to be able to parse this correctly,
+the token parser (flex) must return a variable_name_token
+when it comes across the function name, while parsing
+the function itself.
+We do this by inserting the function name into the variable
+symbol table, and having flex return a variable_name_token
+whenever it comes across it.
+When we finish parsing the function the variable name
+symbol table is cleared of all entries, and the function
+name is inserted into the library element symbol table. This
+means that from then onwards flex will return a
+derived_function_name_token whenever it comes across the
+function name.
+In order to insert the function name into the variable_name
+symbol table BEFORE the function body gets parsed, we
+need the parser to reduce a construct that contains the
+the function name. That is why we created the extra
+construct 'function_name_declaration', i.e. to force
+the parser to reduce it, before parsing the function body,
+and therefore get an oportunity to insert the function name
+into the variable name symbol table!
+/********************************/
+/* B 3.2.4 Iteration Statements */
+/********************************/
+Issue 1
+=======
+For the 'FOR' iteration loop
+FOR control_variable ASSIGN expression TO expression BY expression DO statement_list END_FOR
+The spec declares the control variable in the syntax as
+control_variable: identifier;
+but then defines the semantics of control_variable (Section 3.3.2.4)
+as being of an integer type (e.g., SINT, INT, or DINT).
+Obviously this presuposes that the control_variable must have been
+declared in some 'VAR .. VAR_END' construct, so I (Mario) changed
+the syntax to read
+control_variable: variable_name;
+**************************************************************************
+(c) 2003 Mario de Sousa

changeset 0	fb772792efd1
child 67	08097122a922