|
1 |
|
2 |
|
3 |
|
4 IEC 61131-3 IL and ST compiler |
|
5 |
|
6 |
|
7 The following compiler has been based on the |
|
8 FINAL DRAFT - IEC 61131-3, 2nd Ed. (2001-12-10) |
|
9 |
|
10 |
|
11 (c) 2003 Mario de Sousa |
|
12 |
|
13 |
|
14 |
|
15 **************************************************************** |
|
16 **************************************************************** |
|
17 **************************************************************** |
|
18 ********* ********* |
|
19 ********* ********* |
|
20 ********* O V E R A L L A R C H I T E C T U R E ********* |
|
21 ********* ********* |
|
22 ********* ********* |
|
23 **************************************************************** |
|
24 **************************************************************** |
|
25 **************************************************************** |
|
26 |
|
27 The compiler works in 4(+1) stages: |
|
28 Stage 1 - Lexical analyser - implemented with flex (iec.flex) |
|
29 Stage 2 - Syntax parser - implemented with bison (iec.y) |
|
30 Stage 3 - Semantics analyser - not yet implemented |
|
31 Stage 4 - Code generator - implemented in C++ |
|
32 Stage 4+1 - Binary code generator - gcc, javac, etc... |
|
33 |
|
34 |
|
35 Data structures passed between stages, in global variables: |
|
36 1->2 : tokens (int), and token values (char *) |
|
37 2->1 : symbol tables (defined in symtable.hh) |
|
38 2->3 : abstract syntax tree (tree of C++ classes, in absyntax.hh file) |
|
39 3->4 : Same as 2->3 |
|
40 4->4+1 : file with program in c, java, etc... |
|
41 |
|
42 |
|
43 The compiler works in several passes: |
|
44 Pass 1: executes stages 1 and 2 simultaneously |
|
45 Pass 2: executes stage 3 |
|
46 Pass 3: executes stage 4 |
|
47 Pass 4: executes stage 4+1 |
|
48 |
|
49 |
|
50 NOTE 1 |
|
51 ====== |
|
52 Note that stage 2 passes data back to stage 1. This is only |
|
53 possible because both stages are executed in the same pass. |
|
54 |
|
55 |
|
56 |
|
57 NOTE 2 |
|
58 ====== |
|
59 I (Mario) have a feeling that the abstract syntax may be |
|
60 considerably simplified without any drawbacks to semantic checking |
|
61 and code generation. I have nevertheless opted to keep as much |
|
62 info as possible in the abstract syntax tree, in case it may become |
|
63 necessary further on. |
|
64 Once we start coding the next stages (semantic checking and code |
|
65 generation) I will get a better understanding of what is required |
|
66 of the abstract syntax tree. At that stage I will be better |
|
67 positioned to make a more informed decision on how best to structure |
|
68 the abstract syntax tree. |
|
69 For now, we play conservative and keep as much info as possible. |
|
70 |
|
71 |
|
72 |
|
73 NOTE 3 |
|
74 ====== |
|
75 It would be nice to get this parser integrated into the gcc |
|
76 group of compilers. We would then be able to compile our st/il |
|
77 programs directly into executable binaries, for all the processor |
|
78 architectures gcc currently supports. |
|
79 The gcc compilers are divided into a frontend and backend. The |
|
80 data structure between these two stages is called the syntax |
|
81 tree. In essence, we would need to create a new frontend that |
|
82 would parse the st/il program and build the syntax tree. |
|
83 Unfortunately the gcc syntax tree is not very well documented, |
|
84 and doing semantic checking on this tree would probably be a |
|
85 nightmare. |
|
86 We therefore chose to follow the same route as the gnat (ada 95) |
|
87 and cobol compilers, i.e. generate our own abstract syntax tree, |
|
88 do semantic checking on our tree, do whatever optimisation |
|
89 we can at this level on our own tree, and only then build |
|
90 the gcc syntax tree from our abstract syntax tree. |
|
91 All this may still be integrated with the gcc backend to generate |
|
92 a new gnu compiler for the st and il programming languages. |
|
93 Since generating the gcc syntax tree will probably envolve some |
|
94 trial and error effort due to the sparseness of documentation, |
|
95 we chose to start off by coding a C++ code generator for |
|
96 our stage 4. We may later implement a gcc syntax tree generator |
|
97 as an alternative stage 4 process, and then integrate it with |
|
98 the gcc toplevel.c file (command line parsing, etc...). |
|
99 |
|
100 |
|
101 |
|
102 **************************************************************** |
|
103 **************************************************************** |
|
104 **************************************************************** |
|
105 ********* ********* |
|
106 ********* ********* |
|
107 ********* S T A G E 1 ********* |
|
108 ********* ********* |
|
109 ********* ********* |
|
110 **************************************************************** |
|
111 **************************************************************** |
|
112 **************************************************************** |
|
113 |
|
114 |
|
115 |
|
116 Issue 1 |
|
117 ======= |
|
118 |
|
119 The syntax defines the common_character_representation as: |
|
120 <any printable character except '$', '"' or "'"> | <escape sequences> |
|
121 |
|
122 Flex includes the function print_char() that defines |
|
123 all printable characters portably (i.e. whatever character |
|
124 encoding is currently being used , ASCII, EBCDIC, etc...) |
|
125 Unfortunately, we cannot generate the definition of |
|
126 common_character_representation portably, since flex |
|
127 does not allow definition of sets by subtracting |
|
128 elements in one set from another set (Note how |
|
129 common_character_representation could be defined by |
|
130 subtracting '$' '"' and "'" from print_char() ). |
|
131 This means we must build up the defintion of |
|
132 common_character_representation using only set addition, |
|
133 which leaves us with the only choice of defining the |
|
134 characters non-portably... |
|
135 |
|
136 In short, the definition we use for common_character_representation |
|
137 only works for ASCII character encoding! |
|
138 |
|
139 |
|
140 |
|
141 |
|
142 Issue 2 |
|
143 ======= |
|
144 |
|
145 We extend the IEC 61131-3 standard syntax to allow inclusion of |
|
146 other files. The accepted syntax is: |
|
147 |
|
148 (*#include "<filename>" *) |
|
149 |
|
150 Note how this would be ignored by other standard complient compilers |
|
151 as a simple comment! |
|
152 |
|
153 |
|
154 |
|
155 **************************************************************** |
|
156 **************************************************************** |
|
157 **************************************************************** |
|
158 ********* ********* |
|
159 ********* ********* |
|
160 ********* S T A G E 2 ********* |
|
161 ********* ********* |
|
162 ********* ********* |
|
163 **************************************************************** |
|
164 **************************************************************** |
|
165 **************************************************************** |
|
166 |
|
167 Overall Comments |
|
168 ================ |
|
169 |
|
170 |
|
171 Comment 1 |
|
172 --------- |
|
173 We have augmented the syntax the specification defines to include |
|
174 restrictions defined in the semantics of the languages. |
|
175 |
|
176 This is required because the syntax cannot be parsed by a LALR(1) |
|
177 parser as it is presented in the specification. Many reduce/reduce |
|
178 and shift/reduce conflicts arise. This is mainly because the parser |
|
179 cannot discern how to reduce an identifier. Identifiers show up in |
|
180 many places in the syntax, and it is not entirely possible to |
|
181 figure out if the identifier is a variable_name, enumeration |
|
182 value, function block name, etc... only from the context in |
|
183 which it appears. |
|
184 |
|
185 A more detailed example of why we need symbol tables are |
|
186 the type definitions... In definition of new types |
|
187 (section B 1.3.3) the parser needs to figure out the class of |
|
188 the new type being defined (enumerated, structure, array, etc...). |
|
189 This works well when the base classes are elementary types |
|
190 (or structures, enumeration, arrays, etc. thereof). It becomes |
|
191 confusing to the parser when the new_type is based on a previously |
|
192 user defined type. |
|
193 |
|
194 TYPE |
|
195 new_type_1 : INT := 99; |
|
196 new_type_2 : new_type_1 := 100; |
|
197 END_TYPE |
|
198 |
|
199 When parsing new_type_1, the parser can figure out that the |
|
200 identifier new_type_1 is a simple_type_name, because it is |
|
201 based on a elementary type without structure, arrays, etc... |
|
202 While parsing new_type_2, it becomes confused how to reduce |
|
203 the new_type_2 identifier, as it is based on the identifier |
|
204 new_type_1, of which it does not know the class (remember, at this |
|
205 stage new_type_1 is a simple identifier!). |
|
206 We therefore need to keep track of the class of the user |
|
207 defined types as they are declared, so that the lexical analyser |
|
208 can tell the syntax parser what class the type belongs to. We |
|
209 cannot use the abstract syntax tree itself to search for the |
|
210 declaration of new_type_1 as we only get a handle to the root |
|
211 of the tree at the end of the parsing. |
|
212 |
|
213 We therefore maintain an independent and parallel table of symbols, |
|
214 that is filled as we come across the type delcarations in the code. |
|
215 Actually, we ended up also needing to store variable names in |
|
216 the symbol table. Since variable names come and go out of scope |
|
217 depending on what portion of code we are parsing, we sometimes |
|
218 need to remove the variable names from the symbol table. |
|
219 Since the ST and IL languages only have a single level of scope, |
|
220 I (Mario) found it easier to simply use a second symbol table for |
|
221 the variable names that is completely cleared when the parser |
|
222 reaches the end of a function (function block or program). |
|
223 |
|
224 What I mean when I say that these languages have a single level |
|
225 of scope is that all variables used in a function (function block |
|
226 or program) must be declared inside that function (function block |
|
227 or program). Even global variables must be re-declared as EXTERN |
|
228 before a function may access them! This means that it is easy |
|
229 to simply load up the variable name symbol table when we start |
|
230 parsing a function (function block or program), and to clear it |
|
231 when we reach the end. Checking whether variables declared |
|
232 as EXTERN really exist inside a RESOURCE or a CONFIGURATION |
|
233 is left to stage 3 (semantic checking) where we can use the |
|
234 abstract tree itself to search for the variables (NOTE: semantic |
|
235 cheching at stage 3 has not yet been implemented, so we may yet |
|
236 end up using a symbol table too at that stage!). |
|
237 |
|
238 Due to the use of the symbol tables, and special identifier |
|
239 tokens depending on the type of identifier it had previously |
|
240 been declared in the code being parsed, the syntax was slightly |
|
241 changed regarding the definition of variable names, derived |
|
242 function names, etc... FROM for e.g.: |
|
243 variable_name: identifier; |
|
244 TO |
|
245 variable_name: variable_name_token; |
|
246 |
|
247 Flex first looks at the symbol tables when it finds an identifier, |
|
248 and returns the correct token corresponding to the identifier |
|
249 type in question. Only if the identifier is not currently stored |
|
250 in any symbol table, does flex return a simple identifier_token. |
|
251 |
|
252 This means that the declarations of variables, functions etc... |
|
253 were changed FROM: |
|
254 function_declaration: FUNCTION derived_function_name ... |
|
255 TO |
|
256 function_declaration: FUNCTION identifier ... |
|
257 since the initial definition of derived_function_name had been |
|
258 changed FROM |
|
259 derived_function_name: identifier; |
|
260 TO |
|
261 derived_function_name: derived_function_name_token; |
|
262 |
|
263 |
|
264 |
|
265 |
|
266 Comment 2 |
|
267 --------- |
|
268 Since the ST and IL languages share a lot of common syntax, |
|
269 I have decided to write a single parser to handle both languages |
|
270 simultaneously. This approach has the advantage that the user |
|
271 may mix the language used in the same file, as long as each function |
|
272 is written in a single lanuage. |
|
273 |
|
274 This approach also assumes that all the IL language operators are |
|
275 keywords, which means that it is not possible to define variables |
|
276 using names such as "LD", "ST", etc... |
|
277 Note that the spec does not consider these operators to be keywords, |
|
278 so it means that they should be available for variable names! On the |
|
279 other hand, all implementations of the ST and IL languages seems to |
|
280 treat them as keywords, so there is not much harm in doing the same. |
|
281 |
|
282 If it ever becomes necessary to allow variables with names of IL |
|
283 operators, either the syntax will have to be augmented, or we can |
|
284 brake up the parser in two: one for ST and another for IL. |
|
285 |
|
286 |
|
287 |
|
288 /********************************/ |
|
289 /* B 1.3.3 - Derived data types */ |
|
290 /********************************/ |
|
291 |
|
292 Issue 1 |
|
293 ======= |
|
294 |
|
295 According to the spec, the valid construct |
|
296 TYPE new_str_type : STRING := "hello!"; END_TYPE |
|
297 has two possible routes to type_declaration... |
|
298 |
|
299 Route 1: |
|
300 type_declaration: single_element_type_declaration |
|
301 single_element_type_declaration: simple_type_declaration |
|
302 simple_type_declaration: identifier ':' simple_spec_init |
|
303 simple_spec_init: simple_specification ASSIGN constant |
|
304 (shift: identifier <- 'new_str_type') |
|
305 simple_specification: elementary_type_name |
|
306 elementary_type_name: STRING |
|
307 (shift: elementary_type_name <- STRING) |
|
308 (reduce: simple_specification <- elementary_type_name) |
|
309 (shift: constant <- "hello!") |
|
310 (reduce: simple_spec_init: simple_specification ASSIGN constant) |
|
311 (reduce: ...) |
|
312 |
|
313 |
|
314 Route 2: |
|
315 type_declaration: string_type_declaration |
|
316 string_type_declaration: identifier ':' elementary_string_type_name string_type_declaration_size string_type_declaration_init |
|
317 (shift: identifier <- 'new_str_type') |
|
318 elementary_string_type_name: STRING |
|
319 (shift: elementary_string_type_name <- STRING) |
|
320 (shift: string_type_declaration_size <- /* empty */) |
|
321 string_type_declaration_init: ASSIGN character_string |
|
322 (shift: character_string <- "hello!") |
|
323 (reduce: string_type_declaration_init <- ASSIGN character_string) |
|
324 (reduce: string_type_declaration <- identifier ':' elementary_string_type_name string_type_declaration_size string_type_declaration_init ) |
|
325 (reduce: type_declaration <- string_type_declaration) |
|
326 |
|
327 |
|
328 At first glance it seems that removing route 1 would make |
|
329 the most sense. Unfortunately the construct 'simple_spec_init' |
|
330 shows up multiple times in other rules, so changing this construct |
|
331 would mean changing all the rules in which it appears. |
|
332 I (Mario) therefore chose to remove route 2 instead. This means |
|
333 that the above declaration gets stored in a |
|
334 simple_type_declaration_c, and not in a string_type_declaration_c |
|
335 as would be expected! |
|
336 |
|
337 |
|
338 /***********************/ |
|
339 /* B 1.5.1 - Functions */ |
|
340 /***********************/ |
|
341 |
|
342 |
|
343 Issue 1 |
|
344 ======= |
|
345 |
|
346 Due to reduce/reduce conflicts between identifiers |
|
347 being reduced to either a variable or an enumerator value, |
|
348 we were forced to keep a symbol table of the names |
|
349 of all declared variables. Variables are no longer |
|
350 created from simple identifier_token, but from |
|
351 variable_name_token. |
|
352 |
|
353 BUT, in functions the function name may be used as |
|
354 a variable! In order to be able to parse this correctly, |
|
355 the token parser (flex) must return a variable_name_token |
|
356 when it comes across the function name, while parsing |
|
357 the function itself. |
|
358 We do this by inserting the function name into the variable |
|
359 symbol table, and having flex return a variable_name_token |
|
360 whenever it comes across it. |
|
361 When we finish parsing the function the variable name |
|
362 symbol table is cleared of all entries, and the function |
|
363 name is inserted into the library element symbol table. This |
|
364 means that from then onwards flex will return a |
|
365 derived_function_name_token whenever it comes across the |
|
366 function name. |
|
367 |
|
368 In order to insert the function name into the variable_name |
|
369 symbol table BEFORE the function body gets parsed, we |
|
370 need the parser to reduce a construct that contains the |
|
371 the function name. That is why we created the extra |
|
372 construct 'function_name_declaration', i.e. to force |
|
373 the parser to reduce it, before parsing the function body, |
|
374 and therefore get an oportunity to insert the function name |
|
375 into the variable name symbol table! |
|
376 |
|
377 |
|
378 /********************************/ |
|
379 /* B 3.2.4 Iteration Statements */ |
|
380 /********************************/ |
|
381 |
|
382 Issue 1 |
|
383 ======= |
|
384 |
|
385 For the 'FOR' iteration loop |
|
386 |
|
387 FOR control_variable ASSIGN expression TO expression BY expression DO statement_list END_FOR |
|
388 |
|
389 The spec declares the control variable in the syntax as |
|
390 |
|
391 control_variable: identifier; |
|
392 |
|
393 but then defines the semantics of control_variable (Section 3.3.2.4) |
|
394 as being of an integer type (e.g., SINT, INT, or DINT). |
|
395 Obviously this presuposes that the control_variable must have been |
|
396 declared in some 'VAR .. VAR_END' construct, so I (Mario) changed |
|
397 the syntax to read |
|
398 |
|
399 control_variable: variable_name; |
|
400 |
|
401 |
|
402 |
|
403 |
|
404 |
|
405 ************************************************************************** |
|
406 |
|
407 (c) 2003 Mario de Sousa |