Nested multiline comments extend ordinary multiline comments and are handled during lexical analysis. Interestingly, most programming languages do not support them. I guess their designers consider nesting unnecessary and an added complication for the lexer; they may also have inherited the style from C, or kept it for historical reasons. Still, nestable multiline comments can be very useful in some cases, for example when commenting out a large region of code that already contains multiline comments.
Implementations of nested multiline comments usually (perhaps always) use an accumulator or a stack to track the nesting depth, check that every comment is properly terminated, and report errors to the programmer.
This write-up describes the basic principles and implementation of nested multiline comments, and discusses how to design user-friendly, intuitive handling of their exceptions. I am also using it to practice my writing.
A minimal implementation of nested multiline comments can be very simple and clear. Look at the following C code:
#include <stdio.h>

typedef struct {
    FILE *in;
    int ch;     // current character; int rather than char so it can hold EOF
    int ne;     // next character (one-character lookahead)
    size_t ln;
    size_t col;
} lex_t;

void adv(lex_t *lex) { ... }

void skip_multiline_comment(lex_t *lex) {
    adv(lex); // skip /
    adv(lex); // skip *
    size_t depth = 1;
    while (depth > 0) {
        if (lex->ch == EOF) {
            // TODO: error handling here
            return; // stop scanning; without braces, this `if` would swallow the chain below
        }
        if (lex->ch == '/' && lex->ne == '*') {
            adv(lex); // skip /
            adv(lex); // skip *
            depth++;
        } else if (lex->ch == '*' && lex->ne == '/') {
            adv(lex); // skip *
            adv(lex); // skip /
            depth--;
        } else {
            adv(lex);
        }
    }
}
Because this function assumes that the beginning of the multiline comment has
already been matched, we first advance the lexical analyzer twice to skip the
two opening characters '/' and '*'.
After advancing, we declare a variable depth to record the nesting depth of
the comment being scanned; this is simply our “accumulator”. The while loop
that follows updates the accumulator by these rules: if the analyzer
encounters a beginning marker (/* in this example), the accumulator is
increased by 1; if the analyzer encounters an ending marker (*/), the
accumulator is decreased by 1; otherwise, we just ignore the content of the
comment and keep advancing.
The loop terminates when the accumulator reaches zero, which means that the
comment is terminated normally. If the comment is not terminated normally, the
accumulator never drops to zero, and the loop keeps running until the analyzer
encounters “EOF”, meaning “the end of file”.
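To try the accumulator rules in isolation, here is a minimal sketch that works on an in-memory string instead of the lex_t reader above. The function name skip_nested_comment and its return convention are assumptions made for this illustration; they are not part of the lexer.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical helper: `src` must start with an opening marker.
// Returns the index just past the matching closing marker, or -1 if the
// string ends while the comment is still open.
long skip_nested_comment(const char *src) {
    assert(src != NULL && src[0] == '/' && src[1] == '*');
    size_t i = 2;     // skip the opening marker
    size_t depth = 1; // the accumulator
    while (depth > 0) {
        if (src[i] == '\0')
            return -1; // end of input: unterminated comment
        if (src[i] == '/' && src[i + 1] == '*') {
            i += 2;
            depth++;   // one level deeper
        } else if (src[i] == '*' && src[i + 1] == '/') {
            i += 2;
            depth--;   // one level closed
        } else {
            i++;       // ordinary comment content
        }
    }
    return (long)i;
}
```

For example, skip_nested_comment("/* /* */ */x") returns 11, the index of the 'x' that follows the comment, while skip_nested_comment("/* /* */") returns -1 because one level is still open when the input ends.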
You may have noticed that this example does not show the scanner's error handling. That is the topic of the next section: the ways to handle the exceptions.
The main exception is the unterminated multiline comment. Because of nesting, however, its error handling requires more thought. Here, the core question is: how do we best locate and report these errors?
Well, we could record the most recent comment opening encountered by the analyzer. To implement this, we need to record the line and column of every comment we encounter. Let's incorporate this into the code:
size_t ln = lex->ln, col = lex->col;
// ...
if (lex->ch == EOF) {
    LOG(ERRO, "unterminated multiline comment occurred at %zu:%zu", ln, col);
    return;
}
Now we have two variables, ln and col, for recording
the location of the comment. Accordingly, we need to update them in the loop:
if (lex->ch == '/' && lex->ne == '*') {
    ln = lex->ln;
    col = lex->col;
    // ...
}
Assume there is a source file rem.txt; then this code will work like this:
$ cat rem.txt
/*
/*
*/
$ ./lexer rem.txt
[ERRO] unterminated multiline comment occurred at 2:1
There is a problem: in rem.txt, the unterminated comment actually starts on
the first line, but the logger reports the second line! In fact, we only need
to record the outermost comment of the whole nested structure, because
whenever a nested multiline comment is unterminated, its outermost comment is
always unclosed. So we do not need to record every beginning marker; recording
the outermost one is enough:
if (lex->ch == '/' && lex->ne == '*') {
    // delete the updating here.
    // ...
}
Now, the error message is much clearer:
$ cat rem.txt
/*
/*
*/
$ ./lexer rem.txt
[ERRO] unterminated multiline comment occurred at 1:1
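The difference between the two reporting strategies can be reproduced with a small self-contained sketch. It scans a string rather than a file, and the names pos_t, reported_position and the track_latest flag are assumptions made for this illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct { size_t ln, col; } pos_t; // hypothetical location type

// `src` must start with an opening marker. Returns the position that would
// be reported for an unterminated comment, or {0, 0} if it is terminated.
// track_latest == true  : remember the most recent opening marker
// track_latest == false : remember only the outermost opening marker
pos_t reported_position(const char *src, bool track_latest) {
    assert(src != NULL && src[0] == '/' && src[1] == '*');
    pos_t cur = {1, 1};
    pos_t reported = cur; // location of the outermost opening marker
    size_t depth = 1;
    size_t i = 2;         // skip the outermost opening marker
    cur.col = 3;
    while (depth > 0) {
        if (src[i] == '\0')
            return reported;
        if (src[i] == '/' && src[i + 1] == '*') {
            if (track_latest)
                reported = cur; // overwrite with the inner marker
            depth++;
            i += 2;
            cur.col += 2;
        } else if (src[i] == '*' && src[i + 1] == '/') {
            depth--;
            i += 2;
            cur.col += 2;
        } else {
            if (src[i] == '\n') { cur.ln++; cur.col = 1; }
            else cur.col++;
            i++;
        }
    }
    return (pos_t){0, 0};
}
```

With the contents of rem.txt as input ("/*\n/*\n*/\n"), passing track_latest = true yields 2:1 and track_latest = false yields 1:1, which matches the two logger outputs shown above.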
In fact, this implementation still has a defect: the reporter does not handle multiple unterminated comments well. Consider this example:
/*
/*
Given our current implementation, it will only report the beginning of this structure:
[ERRO] unterminated multiline comment occurred at 1:1
That is not very helpful. Ideally, the reporter should walk through the comments, find every unterminated one, and report them one by one. To solve this problem, we need a stack to record the entire structure of the comment. You need to weigh the trade-off: is the current solution enough? Will introducing a stack add unnecessary complexity? I think the former solution is enough in most cases, but in this write-up, I will still show you my implementation of the “ideal solution”.
As mentioned before, we need a stack to implement this solution. The implementation is shown in the following C code:
#define COMMENT_MAX_DEPTH 8

typedef struct {
    size_t ln;
    size_t col;
} loc_t;

void skip_multiline_comment(lex_t *lex) {
    loc_t stack[COMMENT_MAX_DEPTH];
    size_t depth = 0;
    // record the outermost beginning marker before skipping it,
    // so that the reported column points at the '/'
    stack[depth].ln = lex->ln;
    stack[depth].col = lex->col;
    depth++;
    adv(lex); // skip /
    adv(lex); // skip *
    while (depth > 0) {
        if (lex->ch == EOF) {
            // every entry still on the stack is an unterminated comment
            for (size_t i = 0; i < depth; i++)
                LOG(ERRO, "unterminated multiline comment occurred at %zu:%zu",
                    stack[i].ln, stack[i].col);
            return;
        }
        if (lex->ch == '/' && lex->ne == '*') {
            // check for room before pushing, so that exactly
            // COMMENT_MAX_DEPTH levels are allowed
            if (depth >= COMMENT_MAX_DEPTH) {
                LOG(ERRO, "comment nesting too deep, maximum depth is %d",
                    COMMENT_MAX_DEPTH);
                return;
            }
            stack[depth].ln = lex->ln;
            stack[depth].col = lex->col;
            depth++;
            adv(lex); // skip /
            adv(lex); // skip *
        } else if (lex->ch == '*' && lex->ne == '/') {
            adv(lex); // skip *
            adv(lex); // skip /
            depth--;
        } else {
            adv(lex);
        }
    }
}
In this implementation, we declare a structure loc_t for recording
the location of a comment and a variable stack for recording the locations
of the comments we have encountered. We use the variable depth
to index the elements of the stack.
Beyond recording the locations of the beginning markers, we do not need any
further work to maintain the stack, because the stack is indexed by depth.
When the analyzer encounters a beginning marker, its location is stored at
index depth and depth is increased by 1. When the analyzer encounters an
ending marker, depth is decreased by 1, so the next beginning marker
overwrites the entry of the comment that was just closed, which is known to
be well terminated.
After this process, the elements remaining in the stack are always unterminated comments. So, we can simply iterate through the stack and report their locations in the error messages.
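The stack mechanism can also be exercised without the lexer. The following sketch scans a string and hands the surviving stack entries back to the caller instead of logging them; the names pos_t, MAX_DEPTH and unterminated_comments are assumptions for this illustration, with MAX_DEPTH mirroring COMMENT_MAX_DEPTH.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEPTH 8 // mirrors COMMENT_MAX_DEPTH

typedef struct { size_t ln, col; } pos_t; // hypothetical location type

// `src` must start with an opening marker. Returns the number of comments
// still open at the end of input (0 means well terminated), or -1 if the
// nesting exceeds MAX_DEPTH. open[0..n-1] holds their locations, outermost
// first.
int unterminated_comments(const char *src, pos_t open[MAX_DEPTH]) {
    assert(src != NULL && open != NULL);
    assert(src[0] == '/' && src[1] == '*');
    pos_t cur = {1, 1};
    size_t depth = 0;
    size_t i = 0;
    do {
        if (src[i] == '\0')
            return (int)depth;       // whatever is left was never closed
        if (src[i] == '/' && src[i + 1] == '*') {
            if (depth == MAX_DEPTH)
                return -1;           // no room left on the stack
            open[depth++] = cur;     // push, overwriting any closed entry
            i += 2;
            cur.col += 2;
        } else if (src[i] == '*' && src[i + 1] == '/') {
            depth--;                 // pop: this level is well terminated
            i += 2;
            cur.col += 2;
        } else {
            if (src[i] == '\n') { cur.ln++; cur.col = 1; }
            else cur.col++;
            i++;
        }
    } while (depth > 0);
    return 0;
}
```

For the two-line input "/*\n/*\n" this returns 2 with the locations 1:1 and 2:1, which is the one-by-one report described as the ideal output.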
Note that because we use a fixed-size array on the call stack to record the
locations of comments, it introduces a depth limit, defined by the macro
COMMENT_MAX_DEPTH, which is 8 in this implementation. In most cases,
comments are not nested very deeply, so this limit is enough for
most uses. Still, compared with the previous implementation, which allows
almost unlimited depth, it is kind of annoying.
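One possible way around the fixed limit, at the cost of heap allocation, is a stack that grows on demand. This is only a sketch of that alternative, not what the implementation above does; pos_t, pos_stack_t and the three helper functions are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { size_t ln, col; } pos_t; // hypothetical location type

typedef struct {
    pos_t *items; // heap-allocated storage, NULL while empty
    size_t len;   // number of open comments currently recorded
    size_t cap;   // allocated capacity
} pos_stack_t;

// Pushes a location, doubling the capacity when the stack is full.
// Returns 0 on success and -1 if memory could not be allocated.
int pos_stack_push(pos_stack_t *s, pos_t p) {
    assert(s != NULL);
    if (s->len == s->cap) {
        size_t cap = s->cap ? s->cap * 2 : 8;
        pos_t *items = realloc(s->items, cap * sizeof *items);
        if (items == NULL)
            return -1; // the old storage is still valid
        s->items = items;
        s->cap = cap;
    }
    s->items[s->len++] = p;
    return 0;
}

// Drops the innermost entry once its comment has been closed.
void pos_stack_pop(pos_stack_t *s) {
    assert(s != NULL && s->len > 0);
    s->len--;
}

// Releases the storage and resets the stack to its empty state.
void pos_stack_free(pos_stack_t *s) {
    assert(s != NULL);
    free(s->items);
    s->items = NULL;
    s->len = s->cap = 0;
}
```

A stack starts out as pos_stack_t s = {0}; a push then replaces the store at index depth, a pop replaces depth--, and s.len plays the role of depth. Whether the extra allocation and the new out-of-memory failure path are worth it is the same kind of trade-off weighed earlier.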
In this write-up, we have explained the principles of handling nested multiline comments and built reasonable handling of the exceptions.
We first used an accumulator to record the depth of the comments and check
that they are terminated. Then, we introduced two ways to report unterminated
comments, both of which hinge on recording the locations of those comments.
The first simply records the outermost comment, because the outermost comment
of any unterminated nested multiline comment is itself unterminated. But we
soon found its limitation, so we introduced the second: using a stack, indexed
by the accumulator depth, to record the locations of all unterminated
comments.