=======================================================================
List of Implemented Fixes and Changes for Maintenance Releases of PCCTS

For a summary of the most significant changes see CHANGES_SUMMARY.TXT
=======================================================================

                               DISCLAIMER

    The software and these notes are provided "as is".  They may include
    typographical or technical errors and their authors disclaim all
    liability of any kind or nature for damages due to error, fault,
    defect, or deficiency regardless of cause.  All warranties of any
    kind, either express or implied, including, but not limited to, the
    implied warranties of merchantability and fitness for a particular
    purpose are disclaimed.

    -------------------------------------------------------
    Note:  Items #153 to #1 are now in a separate file named
           CHANGES_FROM_133_BEFORE_MR13.txt
    -------------------------------------------------------

#261. (Changed in MR19) Defer token fetch for C++ mode

    Item #216 has been revised to indicate that use of the defer fetch
    option (ZZDEFER_FETCH) requires dlg option -i.

#260. (MR22) Raise default lex buffer size from 8,000 to 32,000 bytes.

    ZZLEXBUFSIZE is the size (in bytes) of the buffer used by dlg
    generated lexers.  The default value has been raised to 32,000 and
    the value used by antlr, dlg, and sorcerer has also been raised to
    32,000.

#259. (MR22) Default function arguments in C++ mode.

    If a rule is declared:

        rr [int i = 0] : ....

    then the declaration generated by pccts resembles:

        void rr(int i = 0);

    however, the definition must omit the default argument:

        void rr(int i) {...}

    In the past the default value was not omitted.  In MR22 the
    generated code resembles:

        void rr(int i /* = 0 */ ) {...}

    Implemented by Volker H. Simonis (simonis@informatik.uni-tuebingen.de)

#258. (MR22) Using a base class for your parser

    In item #102 (MR10) the class statement was extended to allow one
    to specify a base class other than ANTLRParser for the generated
    parser.
    It turned out that this was less than useful because the
    constructor still specified ANTLRParser as the base class.

    The class statement now uses the first identifier appearing after
    the ":" as the name of the base class.  For example:

        class MyParser : public FooParser {

    Generates in MyParser.h:

        class MyParser : public FooParser {

    Generates in MyParser.cpp something that resembles:

        MyParser::MyParser(ANTLRTokenBuffer *input) :
            FooParser(input,1,0,0,4)
        {
            token_tbl = _token_tbl;
            traceOptionValueDefault=1;    // MR10 turn trace ON
        }

    The base class constructor must have a signature similar to that
    of ANTLRParser.

#257. (MR21a) Removed dlg statement that -i has no effect in C++ mode.

    This was incorrect.

#256. (MR21a) Malformed syntax graph causes crash after error message.

    In the past, certain kinds of errors in the very first grammar
    element could cause the construction of a malformed graph
    representing the grammar.  This would eventually result in a fatal
    internal error.  The code has been changed to be more resistant to
    this particular error.

#255. (MR21a) ParserBlackBox(FILE* f)

    This constructor set openByBlackBox to the wrong value.

    Reported by Kees Bakker (kees_bakker@tasking.nl).

#254. (MR21a) Reporting syntax error at end-of-file

    When there was a syntax error at the end-of-file the syntax error
    routine would substitute "" for the programmer's end-of-file
    symbol.  This substitution is now done only when the programmer
    does not define his own end-of-file symbol or the symbol begins
    with the character "@".

    Reported by Kees Bakker (kees_bakker@tasking.nl).

#253. (MR21) Generation of block preamble (-preamble and -preamble_first)

    The antlr option -preamble causes antlr to insert the code
    BLOCK_PREAMBLE at the start of each rule and block.  It does not
    insert code before rule references, token references, or actions.
    By properly defining the macro BLOCK_PREAMBLE the user can generate
    code which is specific to the start of blocks.
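As a rough illustration of the -preamble paragraph above, the sketch below defines BLOCK_PREAMBLE so that it bumps a counter. The counter blocksEntered and the function rr_sketch are hypothetical; rr_sketch is a hand-written stand-in for generated code, and the real output of antlr will look different. Only the idea of a user-supplied macro expanded at the start of a rule and of a block is shown.

```cpp
#include <cassert>

// Hypothetical counter, not part of pccts.
static int blocksEntered = 0;

// User-supplied definition of the macro that antlr -preamble emits.
// It must be visible to the generated code (an assumption about setup).
#define BLOCK_PREAMBLE ++blocksEntered;

// Hand-written stand-in for code generated from:  rr : ( A | B ) ;
// Token numbers 1 (A) and 2 (B) are made up for this sketch.
void rr_sketch(int lookahead) {
    BLOCK_PREAMBLE                /* start of the rule */
    {
        BLOCK_PREAMBLE            /* start of the (...) block */
        if (lookahead == 1) {
            /* match A */
        } else {
            /* match B */
        }
    }
}
```

Because the macro is an ordinary preprocessor definition, it can be defined as empty in builds that do not need the hook.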
    The antlr option -preamble_first is similar, but inserts the code
    BLOCK_PREAMBLE_FIRST(PreambleFirst_123) where the symbol
    PreambleFirst_123 is equivalent to the first set defined by the
    #FirstSetSymbol described in Item #248.

    I have not investigated how these options interact with guess mode
    (syntactic predicates).

#252. (MR21) Check for null pointer in trace routine

    When some trace options are used when the parser is generated
    without the trace enabled, the current rule name may be a NULL
    pointer.  A guard was added to check for this in restoreState.

    Reported by Douglas E. Forester (dougf@projtech.com).

#251. (MR21) Changes to #define zzTRACE_RULES

    The macro zzTRACE_RULES was being used to pass information to
    AParser.h.  If this preprocessor symbol was not properly set the
    first time AParser.h was #included, the declaration of zzTRACEdata
    would be omitted (it is used by the -gd option).  Subsequent
    #includes of AParser.h would be skipped because of the #ifdef
    guard, so the declaration of zzTracePrevRuleName would never be
    made.  The result was that proper compilation was very order
    dependent.

    The declaration of zzTRACEdata was made unconditional and the
    problem of removing unused declarations will be left to optimizers.

    Diagnosed by Douglas E. Forester (dougf@projtech.com).

#250. (MR21) Option for EXPERIMENTAL change to error sets for blocks

    The antlr option -mrblkerr turns on an experimental feature which
    is supposed to provide more accurate syntax error messages for
    k=1, ck=1 grammars.  When used with k>1 or ck>1 grammars the
    behavior should be no worse than the current behavior.

    There is no problem with the matching of elements or the
    computation of prediction expressions in pccts.  The task is only
    one of listing the most appropriate tokens in the error message.
    The error sets used in pccts error messages are approximations of
    the exact error set when optional elements in (...)* or (...)+ are
    involved.
    While entirely correct, the error messages are sometimes not 100%
    accurate.

    There is also a minor philosophical issue.  For example, suppose
    the grammar expects the token to be an optional A followed by Z,
    and it is X.  X, of course, is neither A nor Z, so an error message
    is appropriate.  Is it appropriate to say "Expected Z" ?  It is
    correct, it is accurate, but it is not complete.

    When k>1 or ck>1 the problem of providing the exactly correct list
    of tokens for the syntax error messages ends up becoming equivalent
    to evaluating the prediction expression for the alternatives twice.
    However, for k=1 ck=1 grammars the prediction expression can be
    computed easily and evaluated cheaply, so I decided to try
    implementing it to satisfy a particular application.  This
    application uses the error set in an interactive command language
    to provide prompts which list the alternatives available at that
    point in the parser.  The user can then enter additional tokens to
    complete the command line.  To do this required more accurate error
    sets than previously provided by pccts.

    In some cases the default pccts behavior may lead to more robust
    error recovery or clearer error messages than having the exact set
    of tokens.  This is because (a) features like -ge allow the use of
    symbolic names for certain sets of tokens, so having extra tokens
    may simply obscure things and (b) the error set is used to
    resynchronize the parser, so a good choice is sometimes more
    important than having the exact set.

    Consider the following example:

    Note:  All example code has been abbreviated to the absolute
           minimum in order to make the examples concise.

        star1 : (A)* Z;

    The generated code resembles:

        old                         new (with -mrblkerr)
        -------------               --------------------
        for (;;) {                  for (;;) {
            match(A);                   match(A);
        }                           }
        match(Z);                   if (! A and ! Z) then
                                        FAIL(...{A,Z}...);
                                    }
                                    match(Z);

        With input X
            old message: Found X, expected Z
            new message: Found X, expected A, Z

    For the example:

        star2 : (A|B)* Z;

        old                         new (with -mrblkerr)
        -------------               --------------------
        for (;;) {                  for (;;) {
            if (!A and !B) break;       if (!A and !B) break;
            if (...) {                  if (...) {
            }                           }
            else {                      else {
                FAIL(...{A,B,Z}...)         FAIL(...{A,B}...);
            }                           }
        }                           }
        match(Z);                   if (! A and ! B and !Z) then
                                        FAIL(...{A,B,Z}...);
                                    }
                                    match(Z);

        With input X
            old message: Found X, expected Z
            new message: Found X, expected A, B, Z
        With input A X
            old message: Found X, expected Z
            new message: Found X, expected A, B, Z

        This includes the choice of looping back to the star block.

    The code for plus blocks:

        plus1 : (A)+ Z;

    The generated code resembles:

        old                         new (with -mrblkerr)
        -------------               --------------------
        do {                        do {
            match(A);                   match(A);
        } while (A)                 } while (A)
        match(Z);                   if (! A and ! Z) then
                                        FAIL(...{A,Z}...);
                                    }
                                    match(Z);

        With input A X
            old message: Found X, expected Z
            new message: Found X, expected A, Z

        This includes the choice of looping back to the plus block.

    For the example:

        plus2 : (A|B)+ Z;

        old                         new (with -mrblkerr)
        -------------               --------------------
        do {                        do {
            if (A) {                    (same)
                match(A);               (same)
            } else if (B) {             (same)
                match(B);               (same)
            } else {                    (same)
                if (cnt > 1) break;     (same)
                FAIL(...{A,B,Z}...)         FAIL(...{A,B}...);
            }                           }
            cnt++;                      (same)
        }                           }
        match(Z);                   if (! A and ! B and !Z) then
                                        FAIL(...{A,B,Z}...);
                                    }
                                    match(Z);

        With input X
            old message: Found X, expected A, B, Z
            new message: Found X, expected A, B
        With input A X
            old message: Found X, expected Z
            new message: Found X, expected A, B, Z

        This includes the choice of looping back to the plus block.

#249. (MR21) Changes for DEC/VMS systems

    Jean-François Piéronne (jfp@altavista.net) has updated some VMS
    related command files and fixed some minor problems related to
    building pccts under the DEC/VMS operating system.  For DEC/VMS
    users the most important differences are:

        a.  Revised makefile.vms
        b.  Revised genMMS for generating VMS style makefiles.

#248.
(MR21) Generate symbol for first set of an alternative pccts can generate a symbol which represents the tokens which may appear at the start of a block: rr : #FirstSetSymbol(rr_FirstSet) ( Foo | Bar ) ; This will generate the symbol rr_FirstSet of type SetWordType with elements Foo and Bar set. The bits can be tested using code similar to the following: if (set_el(Foo, &rr_FirstSet)) { ... This can be combined with the C array zztokens[] or the C++ routine tokenName() to get the print name of the token in the first set. The size of the set is given by the newly added enum SET_SIZE, a protected member of the generated parser's class. The number of elements in the generated set will not be exactly equal to the value of SET_SIZE because of synthetic tokens created by #tokclass, #errclass, the -ge option, and meta-tokens such as epsilon, and end-of-file. The #FirstSetSymbol must appear immediately before a block such as (...)+, (...)*, and {...}, and (...). It may not appear immediately before a token, a rule reference, or action. However a token or rule reference can be enclosed in a (...) in order to make the use of #pragma FirstSetSymbol legal. rr_bad : #FirstSetSymbol(rr_bad_FirstSet) Foo; // Illegal rr_ok : #FirstSetSymbol(rr_ok_FirstSet) (Foo); // Legal Do not confuse FirstSetSymbol sets with the sets used for testing lookahead. The sets used for FirstSetSymbol have one element per bit, so the number of bytes is approximately the largest token number divided by 8. The sets used for testing lookahead store 8 lookahead sets per byte, so the length of the array is approximately the largest token number. If there is demand, a similar routine for follow sets can be added. #247. (MR21) Misleading error message on syntax error for optional elements. Prior to MR21, tokens which were optional did not appear in syntax error messages if the block which immediately followed detected a syntax error. 
    Consider the following grammar which accepts Number, Word, and
    Other:

        rr : {Number} Word;

    For this rule the code resembles:

        if (LA(1) == Number) {
            match(Number);
            consume();
        }
        match(Word);

    Prior to MR21, the error message for input "$ a" would be:

        line 1: syntax error at "$" missing Word

    With MR21 the message will be:

        line 1: syntax error at "$" expecting Word, Number.

    The generated code resembles:

        if ( (LA(1)==Number) ) {
            zzmatch(Number);
            consume();
        }
        else {
            if ( (LA(1)==Word) ) {
                /* nothing */
            }
            else {
                FAIL(... message for both Number and Word ...);
            }
        }
        match(Word);

    The code generated for optional blocks in MR21 is slightly longer
    than the previous versions, but it should give better error
    messages.

    The code generated for:

        { a | b | c }

    should now be *identical* to:

        ( a | b | c | )

    which was not the case prior to MR21.

    Reported by Sue Marvin (sue@siara.com).

#246. (Changed in MR21) Use of $(MAKE) for calls to make

    Calls to make from the makefiles were replaced with $(MAKE)
    because of problems when using gmake.

    Reported with fix by Sunil K. Vallamkonda (sunil@siara.com).

#245. (Changed in MR21) Changes to genmk

    The following command line options have been added to genmk:

        -cfiles ...

            To add a user's C or C++ files into makefile automatically.
            The list of files must be enclosed in apostrophes.  This
            option may be specified multiple times.

        -compiler ...

            The name of the compiler to use for $(CCC) or $(CC).  The
            default in C++ mode is "CC".  The default in C mode is "cc".

        -pccts_path ...

            The value for $(PCCTS), the pccts directory.  The default
            is /usr/local/pccts.

    Contributed by Tomasz Babczynski (t.babczynski@ict.pwr.wroc.pl).

#244. (Changed in MR21) Rename variable "not" in antlr.g

    When antlr.g is compiled with a C++ compiler, a variable named
    "not" causes problems.

    Reported by Sinan Karasu (sinan.karasu@boeing.com).

#243  (Changed in MR21) Replace recursion with iteration in zzfree_ast

    Another refinement to zzfree_ast in ast.c to limit recursion.
    NAKAJIMA Mutsuki (muc@isr.co.jp).

#242. (Changed in MR21) LineInfoFormatStr

    Added an #ifndef/#endif around LineInfoFormatStr in pcctscfg.h.

#241. (Changed in MR21) Changed macro PURIFY to a no-op

    ***********************
    *** NOT IMPLEMENTED ***
    ***********************

    The PURIFY macro was changed to a no-op because it was causing
    problems when passing C++ objects.

    The old definition:

        #define PURIFY(r,s) memset((char *) &(r),'\\0',(s));

    The new definition:

        #define PURIFY(r,s) /* nothing */
    #endif

#240. (Changed in MR21) sorcerer/h/sorcerer.h _MATCH and _MATCHRANGE

    Added test for NULL token pointer.

    Suggested by Peter Keller (keller@ebi.ac.uk)

#239. (Changed in MR21) C++ mode AParser::traceGuessFail

    If tracing is turned on when the code has been generated without
    trace code, a failed guess generates a trace report even though
    there are no other trace reports.  This change makes the behavior
    consistent with other parts of the trace system.

    Reported by David Wigg (wiggjd@sbu.ac.uk).

#238. (Changed in MR21) Namespace version #include files

    Changed reference from CStdio to cstdio (and other #include file
    names) in the namespace version of pccts.  Should have known
    better.

#237. (Changed in MR21) ParserBlackBox(FILE*)

    In the past, ParserBlackBox would close the FILE in the dtor even
    though it was not opened by ParserBlackBox.  The problem is that
    there were two constructors, one which accepted a file name and
    did an fopen, the other which accepted a FILE and did not do an
    fopen.  There is now an extra member variable which remembers
    whether ParserBlackBox did the open or not.

    Suggested by Mike Percy (mpercy@scires.com).

#236. (Changed in MR21) tmake now reports down pointer problem

    When ASTBase::tmake attempts to update the down pointer of an AST
    it checks to see if the down pointer is NULL.  If it is not NULL
    it does not do the update and returns NULL.  An attempt to update
    the down pointer is almost always a result of a user error.
    This can lead to difficult-to-find problems during tree
    construction.

    With this change, the routine calls a virtual function
    reportOverwriteOfDownPointer() which calls panic to report the
    problem.  Users who want the old behavior can redefine the virtual
    function in their AST class.

    Suggested by Sinan Karasu (sinan.karasu@boeing.com)

#235. (Changed in MR21) Made ANTLRParser::resynch() virtual

    Suggested by Jerry Evans (jerry@swsl.co.uk).

#234. (Changed in MR21) Implicit int for function return value

    ATokenBuffer::bufferSize() did not specify a type for the return
    value.

    Reported by Hai Vo-Ba (hai@fc.hp.com).

#233. (Changed in MR20) Converted to MSVC 6.0

    Due to external circumstances I have had to convert to MSVC 6.0.
    The MSVC 5.0 project files (.dsw and .dsp) have been retained as
    xxx50.dsp and xxx50.dsw.  The MSVC 6.0 files are named xxx60.dsp
    and xxx60.dsw (where xxx is related to the directory/project).

#232. (Changed in MR20) Make setwd bit vectors protected in parser.h

    The access for the setwd array in the parser header was not
    specified.  As a result, it would depend on the code which
    preceded it.  In MR20 it will always have access "protected".

    Reported by Piotr Eljasiak (eljasiak@zt.gdansk.tpsa.pl).

#231. (Changed in MR20) Error in token buffer debug code.

    When token buffer debugging is selected via the pre-processor
    symbol DEBUG_TOKENBUFFER there is an erroneous check in
    AParser.cpp:

        #ifdef DEBUG_TOKENBUFFER
            if (i >= inputTokens->bufferSize() ||
                inputTokens->minTokens() < LLk )  /* MR20 Was "<=" */
            ...
        #endif

    Reported by David Wigg (wiggjd@sbu.ac.uk).

#230. (Changed in MR20) Fixed problem with #define for -gd option

    There was an error in setting zzTRACE_RULES for the -gd (trace)
    option.

    Reported by Gary Funck (gary@intrepid.com).

#229. (Changed in MR20) Additional "const" for literals

    "const" was added to the token name literal table.
    "const" was added to some panic() and similar routines.

#228.
(Changed in MR20) dlg crashes on "()"

    The following token definition will cause DLG to crash.

        #token "()"

    When there is a syntax error in a regular expression many of the
    dlg routines return a structure which has null pointers.  When
    this is accessed by callers it generates the crash.

    I have attempted to fix the more common cases.

    Reported by Mengue Olivier (dolmen@bigfoot.com).

#227. (Changed in MR20) Array overwrite

    Steveh Hand (sassth@unx.sas.com) reported a problem which was
    traced to a temporary array which was not properly resized for
    deeply nested blocks.  This has been fixed.

#226. (Changed in MR20) -pedantic conformance

    G. Hobbelt (i_a@mbh.org) and THM made many, many minor changes to
    create prototypes for all the functions and bring antlr, dlg, and
    sorcerer into conformance with the gcc -pedantic option.

    This may require users to add pccts/h/pcctscfg.h to some files or
    makefiles in order to have __USE_PROTOS defined.

#225  (Changed in MR20) AST stack adjustment in C mode

    The fix in #214 for AST stack adjustment in C mode missed some
    cases.

    Reported with fix by Ger Hobbelt (i_a@mbh.org).

#224  (Changed in MR20) LL(1) and LL(2) with #pragma approx

    This may take a record for the oldest, most trivial, lexical error
    in pccts.  The regular expressions for LL(1) and LL(2) lacked an
    escape for the left and right parenthesis.

    Reported by Ger Hobbelt (i_a@mbh.org).

#223  (Changed in MR20) Addition of IBM_VISUAL_AGE directory

    Build files for antlr, dlg, and sorcerer under IBM Visual Age have
    been contributed by Anton Sergeev (ags@mlc.ru).  They have been
    placed in the pccts/IBM_VISUAL_AGE directory.

#222  (Changed in MR20) Replace __STDC__ with __USE_PROTOS

    Most occurrences of __STDC__ replaced with __USE_PROTOS due to
    complaints from several users.

#221  (Changed in MR20) Added #include for DLexerBase.h to PBlackBox.

    Added #include for DLexerBase.h to PBlackBox.
#220  (Changed in MR19) strcat arguments reversed in #pred parse

    The arguments to strcat are reversed when creating a print name
    for a hash table entry for use with #pred feature.

    Problem diagnosed and fix reported by Scott Harrington
    (seh4@ix.netcom.com).

#219. (Changed in MR19) C Mode routine zzfree_ast

    Changes to reduce use of recursion for AST trees with only right
    links or only left links in the C mode routine zzfree_ast.

    Implemented by SAKAI Kiyotaka (ksakai@isr.co.jp).

#218. (Changed in MR19) Changes to support unsigned char in C mode

    Changes to antlr.h and err.h to fix omissions in use of zzchar_t

    Implemented by SAKAI Kiyotaka (ksakai@isr.co.jp).

#217. (Changed in MR19) Error message when dlg -i and -CC options selected

    *** This change was rescinded by item #257 ***

    The parsers generated by pccts in C++ mode are not able to support
    the interactive lexer option (except, perhaps, when using the
    deferred fetch parser option; see Item #216).

    DLG now warns when both -i and -CC are selected.

    This warning was suggested by David Venditti
    (07751870267-0001@t-online.de).

#216. (Changed in MR19) Defer token fetch for C++ mode

    Implemented by Volker H. Simonis (simonis@informatik.uni-tuebingen.de)

    Normally, pccts keeps the lookahead token buffer completely
    filled.  This requires max(k,ck) tokens of lookahead.  For some
    applications this can cause deadlock problems.  For example, there
    may be cases when the parser can't tell when the input has been
    completely consumed until the parse is complete, but the parse
    can't be completed because the input routines are waiting for
    additional tokens to fill the lookahead buffer.

    When the ANTLRParser class is built with the pre-processor option
    ZZDEFER_FETCH defined, the fetch of new tokens by consume() is
    deferred until LA(i) or LT(i) is called.

    To test whether this option has been built into the ANTLRParser
    class use "isDeferFetchEnabled()".
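The idea behind deferred fetch can be sketched with a toy model. This is not the ANTLRParser implementation: the names TokenSource and DeferredLookahead are hypothetical, tokens are plain ints, and -1 stands in for end-of-file. The sketch only illustrates that consume() touches no input and that a token is read from the source only when LA(i) asks for it.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical token source that counts how often it is asked for input.
struct TokenSource {
    std::vector<int> tokens;
    std::size_t next = 0;
    int fetches = 0;
    int getToken() {
        ++fetches;
        return next < tokens.size() ? tokens[next++] : -1;  // -1: end-of-file
    }
};

// Toy lookahead buffer with deferred fetch (names are assumptions).
class DeferredLookahead {
public:
    explicit DeferredLookahead(TokenSource &src) : src_(src) {}

    // consume() only discards the current token; it never reads ahead.
    void consume() {
        if (!buf_.empty()) buf_.pop_front();
        else ++pendingDiscards_;   // token not fetched yet: remember to skip it
    }

    // LA(i) fetches just enough tokens to answer the query (i is 1-based).
    int LA(std::size_t i) {
        while (pendingDiscards_ > 0) { src_.getToken(); --pendingDiscards_; }
        while (buf_.size() < i) buf_.push_back(src_.getToken());
        return buf_[i - 1];
    }

private:
    TokenSource &src_;
    std::deque<int> buf_;
    int pendingDiscards_ = 0;
};
```

In this model a parser that consumes its last token and returns never blocks on the input, which is the deadlock scenario the item describes.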
    Using the -gd trace option with the default tracein() and
    traceout() routines will defeat the effort to defer the fetch
    because the trace routines print out information about the
    lookahead token at the start of the rule.

    Because the tracein and traceout routines are virtual it is easy
    to redefine them in your parser:

        class MyParser {
        <<
            virtual void tracein(ANTLRChar * ruleName)
                { fprintf(stderr,"Entering: %s\n", ruleName); }
            virtual void traceout(ANTLRChar * ruleName)
                { fprintf(stderr,"Leaving: %s\n", ruleName); }
        >>

    The originals for those routines are in pccts/h/AParser.cpp

    This requires use of the dlg option -i (interactive lexer).

    This is experimental.  The interaction with guess mode (syntactic
    predicates) is not known.

#215. (Changed in MR19) Addition of reset() to DLGLexerBase

    There was no obvious way to reset the lexer for reuse.  The
    reset() method now does this.

    Suggested by David Venditti (07751870267-0001@t-online.de).

#214. (Changed in MR19) C mode: Adjust AST stack pointer at exit

    In C mode the AST stack pointer needs to be reset if there will be
    multiple calls to the ANTLRx macros.

    Reported with fix by Paul D. Smith (psmith@baynetworks.com).

#213. (Changed in MR18) Fatal error with -mrhoistk (k>1 hoisting)

    When rearranging code I forgot to un-comment a critical line of
    code that handles hoisting of predicates with k>1 lookahead.  This
    is now fixed.

    Reported by Reinier van den Born (reinier@vnet.ibm.com).

#212. (Changed in MR17) Mac related changes by Kenji Tanaka

    Kenji Tanaka (kentar@osa.att.ne.jp) has made a number of changes
    for Macintosh users.

    a.  The following Macintosh MPW files aid in installing pccts on
        Mac:

            pccts/MPW_Read_Me

            pccts/install68K.mpw
            pccts/installPPC.mpw

            pccts/antlr/antlr.r
            pccts/antlr/antlr68K.make
            pccts/antlr/antlrPPC.make

            pccts/dlg/dlg.r
            pccts/dlg/dlg68K.make
            pccts/dlg/dlgPPC.make

            pccts/sorcerer/sor.r
            pccts/sorcerer/sor68K.make
            pccts/sorcerer/sorPPC.make

        They completely replace the previous Mac installation files.

    b.
        The most significant is a change in the MAC_FILE_CREATOR
        symbol in pcctscfg.h:

        old: #define MAC_FILE_CREATOR 'MMCC' /* Metrowerks C/C++ Text files */
        new: #define MAC_FILE_CREATOR 'CWIE' /* Metrowerks C/C++ Text files */

    c.  Added calls to special_fopen_actions() where necessary.

#211. (Changed in MR16a) C++ style comment in dlg

    This has been fixed.

#210. (Changed in MR16a) Sor accepts \r\n, \r, or \n for end-of-line

    A user requested that Sorcerer be changed to accept other forms of
    end-of-line.

#209. (Changed in MR16) Name of files changed.

    Old:  CHANGES_FROM_1.33
    New:  CHANGES_FROM_133.txt

    Old:  KNOWN_PROBLEMS
    New:  KNOWN_PROBLEMS.txt

#208. (Changed in MR16) Change in use of pccts #include files

    There were problems with MS DevStudio when mixing Sorcerer and
    PCCTS in the same source file.  The problem is caused by the
    redefinition of setjmp in the MS header file setjmp.h.  In
    setjmp.h the pre-processor symbol setjmp was redefined to be
    _setjmp.  A later effort to execute #include <setjmp.h> resulted
    in an effort to #include <_setjmp.h>.  I'm not sure whether this
    is a bug or a feature.  In any case, I decided to fix it by
    avoiding the use of pre-processor symbols in #include statements
    altogether.  This has the added benefit of making pre-compiled
    headers work again.

    I've replaced statements:

        old: #include PCCTS_SETJMP_H
        new: #include "pccts_setjmp.h"

    Where pccts_setjmp.h contains:

        #ifndef __PCCTS_SETJMP_H__
        #define __PCCTS_SETJMP_H__

        #ifdef PCCTS_USE_NAMESPACE_STD
        #include <csetjmp>
        #else
        #include <setjmp.h>
        #endif

        #endif

    A similar change has been made for other standard header files
    required by pccts and sorcerer: stdlib.h, stdarg.h, stdio.h, etc.

    Reported by Jeff Vincent (JVincent@novell.com) and Dale Davis
    (DalDavis@spectrace.com).

#207. (Changed in MR16) dlg reports an invalid range for: [\0x00-\0xff]

    dlg will report that this is an invalid range.
    Diagnosed by Piotr Eljasiak (eljasiak@no-spam.zt.gdansk.tpsa.pl):

        I think this problem is not specific to unsigned chars because
        dlg reports no error for the range [\0x00-\0xfe].

        I've found that information on range is kept in field letter
        (unsigned char) of Attrib struct.  Unfortunately the letter
        value internally is for some reason increased by 1, so \0xff
        is represented here as 0.

        That's why dlg complains about the range [\0x00-\0xff] in
        dlg_p.g:

            if ($$.letter > $2.letter) {
                error("invalid range ", zzline);
            }

        The fix is:

            if ($$.letter > $2.letter && 255 != $$2.letter) {
                error("invalid range ", zzline);
            }

#206. (Changed in MR16) Free zzFAILtext in ANTLRParser destructor

    The ANTLRParser destructor now frees zzFAILtext.

    Problem and fix reported by Manfred Kogler (km@cast.uni-linz.ac.at).

#205. (Changed in MR16) DLGStringReset argument now const

    Changed:  void DLGStringReset(DLGChar *s) {...}
    To:       void DLGStringReset(const DLGChar *s) {...}

    Suggested by Dale Davis (daldavis@spectrace.com)

#204. (Changed in MR15a) Change __WATCOM__ to __WATCOMC__ in pcctscfg.h

    Reported by Oleg Dashevskii (olegdash@my-dejanews.com).

#203. (Changed in MR15) Addition of sorcerer to distribution kit

    I have finally caved in to popular demand.  The pccts 1.33mr15 kit
    will include sorcerer.  The separate sorcerer kit will be
    discontinued.

#202. (Changed in MR15) Organization of MS Dev Studio Projects in Kit

    Previously there was one workspace that contained projects for all
    three parts of pccts: antlr, dlg, and sorcerer.  Now each part
    (and directory) has its own workspace/project and there is an
    additional workspace/project to build a library from the .cpp
    files in the pccts/h directory.

    The library build will create pccts_debug.lib or pccts_release.lib
    according to the configuration selected.

    If you don't want to build pccts 1.33MR15 you can download a
    ready-to-run kit for win32 from http://www.polhode.com/win32.zip.
    The ready-to-run for win32 includes executables, a pre-built
    static library for the .cpp files in the pccts/h directory, and a
    sample application.

    You will need to define the environment variable PCCTS to point to
    the root of the pccts directory hierarchy.

#201. (Changed in MR15) Several fixes by K.J. Cummings (cummings@peritus.com)

    Generation of SETJMP rather than SETJMP_H in gen.c.

    (Sor B19) Declaration of ref_vars_inits for ref_var_inits in
    pccts/sorcerer/sorcerer.h.

#200. (Changed in MR15) Remove operator=() in AToken.h

    User reported that WatCom couldn't handle use of explicit
    operator =().  Replace with equivalent using cast operator.

#199. (Changed in MR15) Don't allow use of empty #tokclass

    Change antlr.g to disallow empty #tokclass sets.

    Reported by Manfred Kogler (km@cast.uni-linz.ac.at).

#198. Revised ANSI C grammar due to efforts by Manuel Kessler

    Manuel Kessler (mlkessler@cip.physik.uni-wuerzburg.de)

        Allow trailing ... in function parameter lists.
        Add bit fields.
        Allow old-style function declarations.
        Support cv-qualified pointers.
        Better checking of combinations of type specifiers.
        Release of memory for local symbols on scope exit.
        Allow input file name on command line as well as by
          redirection.

        and other miscellaneous tweaks.

    This is not part of the pccts distribution kit.  It must be
    downloaded separately from:

        http://www.polhode.com/ansi_mr15.zip

#197. (Changed in MR14) Resetting the lookahead buffer of the parser

    Explanation and fix by Sinan Karasu (sinan.karasu@boeing.com)

    Consider the code used to prime the lookahead buffer LA(i) of the
    parser when init() is called:

        void ANTLRParser::
        prime_lookahead()
        {
            int i;
            for(i=1;i<=LLk; i++) consume();
            dirty=0;
            //lap = 0;      // MR14 - Sinan Karasu (sinan.karusu@boeing.com)
            //labase = 0;   // MR14
            labase=lap;     // MR14
        }

    When the parser is instantiated, lap=0,labase=0 is set.

    The "for" loop runs LLk times.  In consume(), lap = lap +1 (mod
    LLk) is computed.  Therefore, lap(before the loop) == lap (after
    the loop).
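The wrap-around argument above can be checked with a few lines of arithmetic. The helper names advanceLap and lapAfterPriming are hypothetical and model only the index update lap = lap + 1 (mod LLk); no tokens are read.

```cpp
#include <cassert>

// One step of the ring-buffer index, as in consume(): lap + 1 (mod LLk).
int advanceLap(int lap, int LLk) { return (lap + 1) % LLk; }

// Index after priming, i.e. after LLk calls to consume().
int lapAfterPriming(int lap, int LLk) {
    for (int i = 1; i <= LLk; i++) lap = advanceLap(lap, LLk);
    return lap;
}
```

Whatever value lap has before priming, it has the same value afterwards, so assigning labase = lap keeps the two in step, while forcing both to 0 is only harmless when lap was already 0.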
    Now the only problem comes in when one does an init() of the
    parser after an Eof has been seen.  At that time, lap could be non
    zero.  Assume it was lap==1.  Now we do a prime_lookahead().  If
    LLk is 2, then

        consume()
        {
            NLA = inputTokens->getToken()->getType();
            dirty--;
            lap = (lap+1)&(LLk-1);
        }

    or expanding NLA,

        token_type[lap&(LLk-1)]) = inputTokens->getToken()->getType();
        dirty--;
        lap = (lap+1)&(LLk-1);

    so now we prime locations 1 and 2.  In prime_lookahead it used to
    set lap=0 and labase=0.  Now, the next token will be read from
    location 0, NOT 1 as it should have been.

    This was never caught before, because if a parser is just
    instantiated, then lap and labase are 0, the offending assignment
    lines are basically no-ops, since the for loop wraps around back
    to 0.

#196. (Changed in MR14) Problems with "(alpha)? beta" guess

    Consider the following syntactic predicate in a grammar with 2
    tokens of lookahead (k=2 or ck=2):

        rule  : ( alpha )? beta ;
        alpha : S t ;
        t     : T U
              | T
              ;
        beta  : S t Z ;

    When antlr computes the prediction expression with one token of
    lookahead for alts 1 and 2 of rule t it finds an ambiguity.

    Because the grammar has a lookahead of 2 it tries to compute two
    tokens of lookahead for alts 1 and 2 of t.  Alt 1 clearly has a
    lookahead of (T U).  Alt 2 is one token long so antlr tries to
    compute the follow set of alt 2, which means finding the things
    which can follow rule t in the context of (alpha)?.  This cannot
    be computed, because alpha is only part of a rule, and antlr can't
    tell what part of beta is matched by alpha and what part remains
    to be matched.  Thus it is impossible for antlr to properly
    determine the follow set of rule t.

    Prior to 1.33MR14, the follow of (alpha)? was computed as
    FIRST(beta) as a result of the internal representation of guess
    blocks.

    With MR14 the follow set will be the empty set for that context.

    Normally, one expects a rule appearing in a guess block to also
    appear elsewhere.
    When the follow context for this other use is "ored" with the
    empty set, the context from the other use results, and a
    reasonable follow context results.  However if there is *no* other
    use of the rule, or it is used in a different manner then the
    follow context will be inaccurate - it was inaccurate even before
    MR14, but it will be inaccurate in a different way.

    For the example given earlier, a reasonable way to rewrite the
    grammar:

        rule  : ( alpha )? beta ;
        alpha : S t ;
        t     : T U
              | T
              ;
        beta  : alpha Z ;

    If there are no other uses of the rule appearing in the guess
    block it will generate a test for EOF - a workaround for
    representing a null set in the lookahead tests.

    If you encounter such a problem you can use the -alpha option to
    get additional information:

        line 2: error: not possible to compute follow set for alpha
                in an "(alpha)? beta" block.

    With the antlr -alpha command line option the following
    information is inserted into the generated file:

        #if 0

          Trace of references leading to attempt to compute the follow
          set of alpha in an "(alpha)? beta" block.  It is not possible
          for antlr to compute this follow set because it is not known
          what part of beta has already been matched by alpha and what
          part remains to be matched.

          Rules which make use of the incorrect follow set will also be
          incorrect

             1 #token T              alpha/2   line 7   brief.g
             2 end alpha             alpha/3   line 8   brief.g
             2 end (...)? block at   start/1   line 2   brief.g

        #endif

    At the moment, with the -alpha option selected the program marks
    any rules which appear in the trace back chain (above) as rules
    with possible problems computing follow set.

    Reported by Greg Knapen (gregory.knapen@bell.ca).

#195. (Changed in MR14) #line directive not at column 1

    Under certain circumstances a predicate test could generate a
    #line directive which was not at column 1.

    Reported with fix by David Kågedal (davidk@lysator.liu.se)
    (http://www.lysator.liu.se/~davidk/).

#194.
(Changed in MR14) (C Mode only) Demand lookahead with #tokclass In C mode with the demand lookahead option there is a bug in the code which handles matches for #tokclass (zzsetmatch and zzsetmatch_wsig). The bug causes the lookahead pointer to get out of synchronization with the current token pointer. The problem was reported with a fix by Ger Hobbelt (hobbelt@axa.nl). #193. (Changed in MR14) Use of PCCTS_USE_NAMESPACE_STD The pcctscfg.h now contains the following definitions: #ifdef PCCTS_USE_NAMESPACE_STD #define PCCTS_STDIO_H #define PCCTS_STDLIB_H #define PCCTS_STDARG_H #define PCCTS_SETJMP_H #define PCCTS_STRING_H #define PCCTS_ASSERT_H #define PCCTS_ISTREAM_H #define PCCTS_IOSTREAM_H #define PCCTS_NAMESPACE_STD namespace std {}; using namespace std; #else #define PCCTS_STDIO_H #define PCCTS_STDLIB_H #define PCCTS_STDARG_H #define PCCTS_SETJMP_H #define PCCTS_STRING_H #define PCCTS_ASSERT_H #define PCCTS_ISTREAM_H #define PCCTS_IOSTREAM_H #define PCCTS_NAMESPACE_STD #endif The runtime support in pccts/h uses these pre-processor symbols consistently. Also, antlr and dlg have been changed to generate code which uses these pre-processor symbols rather than having the names of the #include files hard-coded in the generated code. This required the addition of "#include pcctscfg.h" to a number of files in pccts/h. It appears that this sometimes causes problems for MSVC 5 in combination with the "automatic" option for pre-compiled headers. In such cases disable the "automatic" pre-compiled headers option. Suggested by Hubert Holin (Hubert.Holin@Bigfoot.com). #192. (Changed in MR14) Change setText() to accept "const ANTLRChar *" Changed ANTLRToken::setText(ANTLRChar *) to setText(const ANTLRChar *). This allows literal strings to be used to initialize tokens. Since the usual token implementation (ANTLRCommonToken) makes a copy of the input string, this was an unnecessary limitation. Suggested by Bob McWhirter (bob@netwrench.com). #191. 
(Changed in MR14) HP/UX aCC compiler compatibility problem Needed to explicitly declare zzINF_DEF_TOKEN_BUFFER_SIZE and zzINF_BUFFER_TOKEN_CHUNK_SIZE as ints in pccts/h/AParser.cpp. Reported by David Cook (dcook@bmc.com). #190. (Changed in MR14) IBM OS/2 CSet compiler compatibility problem Name conflict with "_cs" in pccts/h/ATokenBuffer.cpp Reported by David Cook (dcook@bmc.com). #189. (Changed in MR14) -gxt switch in C mode The -gxt switch in C mode didn't work because of incorrect initialization. Reported by Sinan Karasu (sinan@boeing.com). #188. (Changed in MR14) Added pccts/h/DLG_stream_input.h This is a DLG stream class based on C++ istreams. Contributed by Hubert Holin (Hubert.Holin@Bigfoot.com). #187. (Changed in MR14) Rename config.h to pcctscfg.h The PCCTS configuration file has been renamed from config.h to pcctscfg.h. The problem with the original name is that it led to name collisions when pccts parsers were combined with other software. All of the runtime support routines in pccts/h/* have been changed to use the new name. Existing software can continue to use pccts/h/config.h. The contents of pccts/h/config.h is now just "#include "pcctscfg.h". I don't have a record of the user who suggested this. #186. (Changed in MR14) Pre-processor symbol DllExportPCCTS class modifier Classes in the C++ runtime support routines are now declared: class DllExportPCCTS className .... By default, the pre-processor symbol is defined as the empty string. This if for use by MSVC++ users to create DLL classes. Suggested by Manfred Kogler (km@cast.uni-linz.ac.at). #185. (Changed in MR14) Option to not use PCCTS_AST base class for ASTBase Normally, the ASTBase class is derived from PCCTS_AST which contains functions useful to Sorcerer. If these are not necessary then the user can define the pre-processor symbol "PCCTS_NOT_USING_SOR" which will cause the ASTBase class to replace references to PCCTS_AST with references to ASTBase where necessary. 
    The class ASTDoublyLinkedBase will contain a pure virtual function
    shallowCopy() that was formerly defined in class PCCTS_AST.

    Suggested by Bob McWhirter (bob@netwrench.com).

#184. (Changed in MR14) Grammars with no tokens generate invalid tokens.h

    Reported by Hubert Holin (Hubert.Holin@bigfoot.com).

#183. (Changed in MR14) -f to specify file with names of grammar files

    In DEC/VMS it is difficult to specify very long command lines.
    The -f option allows one to place the names of the grammar files
    in a data file in order to bypass limitations of the DEC/VMS
    command language interpreter.

    Addition supplied by Bernard Giroud (b_giroud@decus.ch).

#182. (Changed in MR14) Output directory option for DEC/VMS

    Fix some problems with the -o option under DEC/VMS.

    Fix supplied by Bernard Giroud (b_giroud@decus.ch).

#181. (Changed in MR14) Allow chars > 127 in DLGStringInput::nextChar()

    Changed DLGStringInput to cast the character using (unsigned char)
    so that languages with character codes greater than 127 work
    without changes.

    Suggested by Manfred Kogler (km@cast.uni-linz.ac.at).

#180. (Added in MR14) ANTLRParser::getEofToken()

    Added "ANTLRToken ANTLRParser::getEofToken() const" to match the
    setEofToken routine.

    Requested by Manfred Kogler (km@cast.uni-linz.ac.at).

#179. (Fixed in MR14) Memory leak for BufFileInput subclass of DLGInputStream

    The BufFileInput class described in Item #142 neglected to release
    the allocated buffer when an instance was destroyed.

    Reported by Manfred Kogler (km@cast.uni-linz.ac.at).

#178. (Fixed in MR14) Bug in "(alpha)? beta" guess blocks first sets

    In 1.33 vanilla, and all maintenance releases prior to MR14
    there is a bug in the handling of guess blocks which use the
    "long" form:

        (alpha)? beta

    inside a (...)*, (...)+, or {...} block.

    This problem does *not* apply to the case where beta is omitted
    or when the syntactic predicate is on the leading edge of an
    alternative.
    The problem is that both alpha and beta are stored in the
    syntax diagram, and that some analysis routines would fail
    to skip the alpha portion when it was not on the leading edge.
    Consider the following grammar with -ck 2:

        r : ( (A)? B )* C D

          | A B      /* forces -ck 2 computation for old antlr    */
                     /*   reports ambig for alts 1 & 2            */

          | B C      /* forces -ck 2 computation for new antlr    */
                     /*   reports ambig for alts 1 & 3            */
          ;

    The prediction expression for the first alternative should be
    LA(1)={B C} LA(2)={B C D}, but previous versions of antlr
    would compute the prediction expression as LA(1)={A C} LA(2)={B D}

    Reported by Arpad Beszedes (beszedes@inf.u-szeged.hu) who provided
    a very clear example of the problem and identified the probable cause.

#177. (Changed in MR14) #tokdefs and #token with regular expression

    In MR13 the change described by Item #162 caused an existing
    feature of antlr to fail.  Prior to the change it was possible
    to give regular expression definitions and actions to tokens
    which were defined via the #tokdefs directive.

    This now works again.

    Reported by Manfred Kogler (km@cast.uni-linz.ac.at).

#176. (Changed in MR14) Support for #line in antlr source code

    Note: this was implemented by Arpad Beszedes (beszedes@inf.u-szeged.hu).

    In 1.33MR14 it is possible for a pre-processor to generate #line
    directives in the antlr source and have those line numbers and file
    names used in antlr error messages and in the #line directives
    generated by antlr.

    The #line directive may appear in the following forms:

        #line ll "sss" xx xx ...

    where ll represents a line number, "sss" represents the name of a
    file enclosed in quotation marks, and xx are arbitrary integers.

    The following form (without "line") is not supported at the moment:

        # ll "sss" xx xx ...
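
    The accepted and rejected forms can be illustrated with a small
    C++ sketch.  This is not antlr's own code: the names LineDirective
    and parseLineDirective are hypothetical, and the sketch assumes
    the directive starts at the first character of the text it is
    given.  Any trailing integers are simply ignored.

```cpp
#include <cassert>
#include <cctype>
#include <cstdlib>
#include <string>

// Hypothetical result type for this sketch; not part of PCCTS.
struct LineDirective {
    bool ok = false;     // true when the text matched:  #line ll "sss" ...
    int line = 0;        // the ll field
    std::string file;    // contents of "sss" (empty when no string follows)
};

// Parse one source line of the supported form:  #line ll "sss" xx xx ...
// The form without the word "line" (# ll "sss" ...) is rejected.
// Assumption: the directive begins at the first character of `text`.
inline LineDirective parseLineDirective(const std::string& text) {
    LineDirective d;
    const std::string key = "#line";
    if (text.compare(0, key.size(), key) != 0) return d;

    // Require at least one blank between "#line" and the line number.
    std::size_t i = key.size();
    if (i >= text.size() || !std::isspace((unsigned char)text[i])) return d;
    while (i < text.size() && std::isspace((unsigned char)text[i])) ++i;

    // Read the decimal line number ll.
    const std::size_t start = i;
    while (i < text.size() && std::isdigit((unsigned char)text[i])) ++i;
    if (i == start) return d;
    d.line = std::atoi(text.substr(start, i - start).c_str());

    // Read the optional quoted file name "sss".
    while (i < text.size() && std::isspace((unsigned char)text[i])) ++i;
    if (i < text.size() && text[i] == '"') {
        const std::size_t close = text.find('"', i + 1);
        if (close == std::string::npos) return d;   // unterminated string
        d.file = text.substr(i + 1, close - i - 1);
    }

    d.ok = true;
    return d;
}
```

    A caller would test the ok field and, when it is set, take the
    line number and (if non-empty) the file name from the result.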
    The result:

        zzline
            is replaced with ll from the # or #line directive

        FileStr[CurFile]
            is updated with the contents of the string (if any)
            following the line number

    Note
    ----
    The file-name string following the line number can be a complete
    name with a directory-path.  Antlr generates the output files from
    the input file name (by replacing the extension from the file-name
    with .c or .cpp).

    If the input file (or the file-name from the line-info) contains
    a path:

        "../grammar.g"

    the generated source code will be placed in "../grammar.cpp" (i.e.
    in the parent directory).  This is inconvenient in some cases
    (even the -o switch cannot be used) so the path information is
    removed from the #line directive.  Thus, if the line-info was

        #line 2 "../grammar.g"

    then the current file-name will become "grammar.g"

    In this way, the source code generated from the grammar file will
    always be placed in the current directory, except when the -o
    switch is used.

#175. (Changed in MR14) Bug when guess block appears at start of (...)*

    In 1.33 vanilla and all maintenance releases prior to 1.33MR14
    there is a bug when a guess block appears at the start of a (...)*.
    Consider the following k=1 (ck=1) grammar:

        rule : ( (STAR)? ZIP )* ID ;

    Prior to 1.33MR14, the generated code resembled:

        ...
        zzGUESS_BLOCK
        while ( 1 ) {
            if ( ! LA(1)==STAR) break;
            zzGUESS
            if ( !zzrv ) {
                zzmatch(STAR);
                zzCONSUME;
                zzGUESS_DONE
                zzmatch(ZIP);
                zzCONSUME;
            ...

    Note that the routine uses STAR for the prediction expression
    rather than ZIP.

    With 1.33MR14 the generated code resembles:

        ...
        while ( 1 ) {
            if ( ! LA(1)==ZIP) break;
        ...

    This problem existed only with (...)* blocks and was caused by
    the slightly more complicated graph which represents (...)*
    blocks.  This caused the analysis routine to compute the first
    set for the alpha part of the "(alpha)? beta" rather than the
    beta part.

    Both (...)+ and {...} blocks handled the guess block correctly.
    Reported by Arpad Beszedes (beszedes@inf.u-szeged.hu) who provided
    a very clear example of the problem and identified the probable cause.

#174. (Changed in MR14) Bug when action precedes syntactic predicate

    In 1.33 vanilla, and all maintenance releases prior to MR14,
    there was a bug when a syntactic predicate was immediately
    preceded by an action.  Consider the following -ck 2 grammar:

        rule :
               <> (alpha)? beta C
             | A B
             ;

        alpha : A ;
        beta  : A B;

    Prior to MR14, the code generated for the first alternative
    resembled:

        ...
        zzGUESS
        if ( !zzrv && LA(1)==A && LA(2)==A) {
            alpha();
            zzGUESS_DONE
            beta();
            zzmatch(C);
            zzCONSUME;
        } else {
        ...

    The prediction expression (i.e. LA(1)==A && LA(2)==A) is clearly
    wrong because LA(2) should be matched to B (first[2] of beta is {B}).

    With 1.33MR14 the prediction expression is:

        ...
        if ( !zzrv && LA(1)==A && LA(2)==B) {
            alpha();
            zzGUESS_DONE
            beta();
            zzmatch(C);
            zzCONSUME;
        } else {
        ...

    This only affects grammars in which alpha is shorter than
    max(k,ck) and there is an action immediately preceding the
    syntactic predicate.

    This problem was reported by Arpad Beszedes
    (beszedes@inf.u-szeged.hu) who provided a very clear example of
    the problem and identified the presence of the init-action as
    the likely culprit.

#173. (Changed in MR13a) -glms for Microsoft style filenames with -gl

    With the -gl option antlr generates #line directives using the
    exact name of the input files specified on the command line.
    An oddity of the Microsoft C and C++ compilers is that they
    don't accept file names in #line directives containing "\"
    even though these are names from the native file system.

    With the -glms option, the "\" in file names appearing in #line
    directives is replaced with a "/" in order to conform to
    Microsoft compiler requirements.

    Reported by Erwin Achermann (erwin.achermann@switzerland.org).

#172. (Changed in MR13) \r\n in antlr source counted as one line

    Some MS software uses \r\n to indicate a new line.
    Antlr now recognizes this in counting lines.

    Reported by Edward L. Hepler (elh@ece.vill.edu).

#171. (Changed in MR13) #tokclass L..U now allowed

    The following is now allowed:

        #tokclass ABC { A..B C }

    Reported by Dave Watola (dwatola@amtsun.jpl.nasa.gov)

#170. (Changed in MR13) Suppression for predicates with lookahead depth >1

    In MR12 the capability for suppression of predicates with lookahead
    depth=1 was introduced.  With MR13 this has been extended to
    predicates with lookahead depth > 1 and released for use by users
    on an experimental basis.

    Consider the following grammar with -ck 2 and the predicate in rule
    "a" with depth 2:

        r1  : (ab)* "@"
            ;

        ab  : a
            | b
            ;

        a   : (A B)? => <<p(LATEXT(2))>>? A B C
            ;

        b   : A B C
            ;

    Normally, the predicate would be hoisted into rule r1 in order to
    determine whether to call rule "ab".  However it should *not* be
    hoisted because, even if p is false, there is a valid alternative
    in rule b.  With "-mrhoistk on" the predicate will be suppressed.

    If "-info p" command line option is present the following
    information will appear in the generated code:

        while ( (LA(1)==A)
        #if 0

        Part (or all) of predicate with depth > 1 suppressed by
            alternative without predicate

        pred  << p(LATEXT(2))>>?
                  depth=k=2  ("=>" guard)  rule a  line 8  t1.g
          tree context:
            (root = A
               B
            )

        The token sequence which is suppressed: ( A B )
        The sequence of references which generate that sequence of tokens:

           1 to ab          r1/1       line 1     t1.g
           2 ab             ab/1       line 4     t1.g
           3 to b           ab/2       line 5     t1.g
           4 b              b/1        line 11    t1.g
           5 #token A       b/1        line 11    t1.g
           6 #token B       b/1        line 11    t1.g

        #endif

    A slightly more complicated example:

        r1  : (ab)* "@"
            ;

        ab  : a
            | b
            ;

        a   : (A B)? => <<p(LATEXT(2))>>? (A B | D E)
            ;

        b   : <<q(LATEXT(2))>>? D E
            ;

    In this case, the sequence (D E) in rule "a" which lies behind
    the guard is used to suppress the predicate with context (D E)
    in rule b.

        while ( (LA(1)==A || LA(1)==D)
        #if 0

        Part (or all) of predicate with depth > 1 suppressed by
            alternative without predicate

        pred  << q(LATEXT(2))>>?
                  depth=k=2  rule b  line 11  t2.g
          tree context:
            (root = D
               E
            )

        The token sequence which is suppressed: ( D E )
        The sequence of references which generate that sequence of tokens:

           1 to ab          r1/1       line 1     t2.g
           2 ab             ab/1       line 4     t2.g
           3 to a           ab/1       line 4     t2.g
           4 a              a/1        line 8     t2.g
           5 #token D       a/1        line 8     t2.g
           6 #token E       a/1        line 8     t2.g

        #endif
        &&
        #if 0

        pred  << p(LATEXT(2))>>?
                  depth=k=2  ("=>" guard)  rule a  line 8  t2.g
          tree context:
            (root = A
               B
            )

        #endif

        (! ( LA(1)==A && LA(2)==B ) || p(LATEXT(2)) )  {
            ab();
            ...

#169. (Changed in MR13) Predicate test optimization for depth=1 predicates

    When MR12 generated a test of a predicate which had depth 1
    it would use the depth >1 routines, resulting in correct but
    inefficient behavior.  In MR13, a bit test is used.

#168. (Changed in MR13) Token expressions in context guards

    The token expressions appearing in context guards such as:

        (A B)? => <>?  someRule

    are computed during an early phase of antlr processing.  As
    a result, prior to MR13, complex expressions such as:

        ~B
        L..U
        ~L..U
        TokClassName
        ~TokClassName

    were not computed properly.  This resulted in incorrect
    context being computed for such expressions.

    In MR13 these context guards are verified for proper semantics
    in the initial phase and then re-evaluated after complex token
    expressions have been computed in order to produce the correct
    behavior.

    Reported by Arpad Beszedes (beszedes@inf.u-szeged.hu).

#167. (Changed in MR13) ~L..U

    Prior to MR13, the complement of a token range was
    not properly computed.

#166. (Changed in MR13) token expression L..U

    The token U was represented as an unsigned char, restricting
    the use of L..U to cases where U was assigned a token number
    less than 256.  This is corrected in MR13.

#165. (Changed in MR13) option -newAST

    To create ASTs from an ANTLRTokenPtr antlr usually calls
    "new AST(ANTLRTokenPtr)".  This option generates a call
    to "newAST(ANTLRTokenPtr)" instead.  This allows a user
    to define a parser member function to create an AST object.
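
    The idea behind the option can be sketched in C++ as follows.
    This is an illustration only, not generated antlr output: Token,
    TokenPtr, AST and SketchParser are hypothetical stand-ins for
    ANTLRTokenPtr, the user's AST class and the generated parser, and
    the two makeNode functions merely mimic what a rule body might do
    without and with the option.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins, reduced to what the sketch needs.
struct Token { std::string text; };
typedef Token* TokenPtr;               // stands in for ANTLRTokenPtr

struct AST {                           // stands in for the user's AST class
    std::string text;
    bool fromFactory;                  // records which path created the node
    AST(TokenPtr t, bool viaFactory = false)
        : text(t->text), fromFactory(viaFactory) {}
};

class SketchParser {                   // stands in for a generated parser
public:
    int nodesCreated;
    SketchParser() : nodesCreated(0) {}

    // User-defined parser member function.  With the option selected the
    // node is obtained through this call, so the parser can count nodes,
    // pool them, or return a subclass of AST.
    AST* newAST(TokenPtr tok) {
        ++nodesCreated;
        return new AST(tok, true);
    }

    // Mimics node creation without the option: a direct "new AST(tok)".
    AST* makeNodeDefault(TokenPtr tok) { return new AST(tok); }

    // Mimics node creation with the option: a call to newAST(tok).
    AST* makeNodeWithOption(TokenPtr tok) { return newAST(tok); }
};
```

    Routing creation through one member function gives a single place
    to change how nodes are allocated, at the cost of the user having
    to supply that function.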
    Similar changes for ASTBase::tmake and ASTBase::link were not
    thought necessary since they do not create AST objects, only
    use existing ones.

#164. (Changed in MR13) Unused variable _astp

    For many compilations, we have lived with warnings about
    the unused variable _astp.  It turns out that this variable
    can *never* be used because the code which references it was
    commented out.

    This investigation was sparked by a note from Erwin Achermann
    (erwin.achermann@switzerland.org).

#163. (Changed in MR13) Incorrect makefiles for testcpp examples

    All the examples in pccts/testcpp/* had incorrect definitions
    in the makefiles for the symbol "CCC".  Instead of CCC=CC they
    had CC=$(CCC).

    There was an additional problem in testcpp/1/test.g due to the
    change in ANTLRToken::getText() to a const member function
    (Item #137).

    Reported by Maurice Mass (maas@cuci.nl).

#162. (Changed in MR13) Combining #token with #tokdefs

    When it became possible to change the print-name of a
    #token (Item #148) it became useful to give a #token
    statement whose only purpose was to give a print name
    to the #token.  Prior to this change this could not be
    combined with the #tokdefs feature.

#161. (Changed in MR13) Switch -gxt inhibits generation of tokens.h

#160. (Changed in MR13) Omissions in list of names for remap.h

    When a user selects the -gp option antlr creates a list
    of macros in remap.h to rename some of the standard
    antlr routines from zzXXX to userprefixXXX.

    There were a number of omissions from the remap.h name
    list related to the new trace facility.  This was reported,
    along with a fix, by Bernie Solomon (bernard@ug.eds.com).

#159. (Changed in MR13) Violations of classic C rules

    There were a number of violations of classic C style in
    the distribution kit.  This was reported, along with fixes,
    by Bernie Solomon (bernard@ug.eds.com).

#158. (Changed in MR13) #header causes problem for pre-processors

    A user who runs the C pre-processor on antlr source suggested
    that another syntax be allowed.
    With MR13 directives such as #header, #pragma, etc. may be
    written as "\#header", "\#pragma", etc.  For escaping
    pre-processor directives inside a #header use something like
    the following:

        \#header
        <<
            \#include
        >>

#157. (Fixed in MR13) empty error sets for rules with infinite recursion

    When the first set for a rule cannot be computed due to infinite
    left recursion and it is the only alternative for a block then
    the error set for the block would be empty.  This would result
    in a fatal error.

    Reported by Darin Creason (creason@genedax.com)

#156. (Changed in MR13) DLGLexerBase::getToken() now public

#155. (Changed in MR13) Context behind predicates can suppress

    With -mrhoist enabled the context behind a guarded predicate can
    be used to suppress other predicates.  Consider the following
    grammar:

        r0 : (r1)+;

        r1 : rp
           | rq
           ;
        rp : <<p>>? B ;
        rq : (A)? => <<q>>? (A|B);

    In earlier versions both predicates "p" and "q" would be hoisted
    into rule r0.  With MR12c predicate p is suppressed because the
    context which follows predicate q includes "B" which can "cover"
    predicate "p".  In other words, in trying to decide in r0 whether
    to call r1, it doesn't really matter whether p is false or true
    because, either way, there is a valid choice within r1.

#154. (Changed in MR13) Making hoist suppression explicit using <<nohoist>>

    A common error, even among experienced pccts users, is to code
    an init-action to inhibit hoisting rather than a leading action.
    An init-action does not inhibit hoisting.

    This was coded:

        rule1 : <<;>> rule2

    This is what was meant:

        rule1 : <<;>> <<;>> rule2

    With MR13, the user can code:

        rule1 : <<;>> <<nohoist>> rule2

    The following will give an error message:

        rule1 : <<nohoist>> rule2

    If the <<nohoist>> appears as an init-action rather than a leading
    action an error message is issued.  The meaning of an init-action
    containing "nohoist" is unclear: does it apply to just one
    alternative or to all alternatives?

    -------------------------------------------------------
    Note:

        Items #153 to #1 are now in a separate file named
        CHANGES_FROM_133_BEFORE_MR13.txt
    -------------------------------------------------------