This module applies a couple of “syntactic sugar” pre-processing steps to Kal code before it goes to the compiler. These steps would be onerous to do during the parsing stage, but are generally easier to do on a token stream. Each function in this module takes an input token stream and returns a new, possibly modified one.
Some sugar functions use the keyword list from the grammar, most notably the implicit parentheses for function calls.
grammar = require './grammar'
KEYWORDS = grammar.KEYWORDS
RVALUE_OK = grammar.RVALUE_OK
The entry point for this module is the translate_sugar
function, which takes an input token stream and returns a modified token stream for use with the parser. It also takes an optional options
parameter; its show_tokens property enables debug printing of the token stream (described below).
The function also takes a tokenizer
argument, which is a function that, given a code string, returns an array whose first element is a token array and whose second element is a comment token array. The Kal compiler uses the tokenize
function in the lexer
module for this argument. tokenizer
, if present, is used to tokenize code embedded in double-quoted strings. If this argument is missing, double-quoted strings with embedded code blocks will be left as strings.
function translate_sugar (tokens, options, tokenizer)
The current sugar stages are:
- code_in_strings - for double-quoted strings with embedded code blocks (like "1 + 1 = #{1 + 1}"), this function tokenizes the code blocks and converts the string to the equivalent of "1 + 1 = " + (1 + 1).
- clean - removes whitespace tokens, marking tokens that were trailed by whitespace.
- multiline_lists - allows list and explicit map definitions to span multiple lines.
- multiline_statements - allows lines that end with a comma to continue onto the next line.
- noparen_function_calls - converts implicit function calls like my_function 1, 2 to explicit ones.
- print_statement - converts print to console.log.
- coffee_style_functions - converts CoffeeScript-style functions (like (a,b) -> return a + b) to standard Kal function syntax.
The output is a new token stream (array).
out_tokens = coffee_style_functions print_statement noparen_function_calls multiline_statements multiline_lists clean code_in_strings tokens, tokenizer
Debug printing of the token stream is enabled with the show_tokens
option.
if options?.show_tokens
debug = []
for t in out_tokens
if t.type is 'NEWLINE'
debug.push '\n'
else
debug.push t.value or t.type
console.log debug.join ' '
return out_tokens
exports.translate_sugar = translate_sugar
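As a rough sketch of how a caller might use this (the code string and variable names are illustrative; lexer.tokenize and translate_sugar are the pieces described above, and the './lexer' path is an assumption):
lexer = require './lexer'  # assumed path to the lexer module
code = 'print "1 + 1 = #{1 + 1}"'
token_stream = lexer.tokenize(code)[0]
sugared_tokens = translate_sugar token_stream, {show_tokens: yes}, lexer.tokenize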
This function adds support for double-quoted strings with embedded code, like “x is #{x}”. It uses the tokenizer
argument (a function that converts a code string into a token array, like lexer.tokenize
) to run the code blocks in the string through the lexer. The return value is the merged stream of tokens.
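For example (a sketch of the intent, not literal compiler output), the string from the introduction:
print "1 + 1 = #{1 + 1}"
ends up tokenized as if the source had been:
print ("1 + 1 = " + (1 + 1))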
function code_in_strings (tokens, tokenizer)
We abort if there is no tokenizer
provided and just don't translate the strings.
return tokens when tokenizer doesnt exist
The output is a new token array (we don't modify the original).
out_tokens = []
for token in tokens
For double-quoted strings, we search for code blocks like "#{code}"
. The regex uses the non-greedy operator to avoid parsing "#{block1} #{block2}"
as a single block.
if token.type is 'STRING' and token.value[0] is '"'
rv = token.value
r = /#{.*?}/g
m = r.exec rv
We generally must add parentheses around any string that gets broken up for code blocks (and it is always safe to do so). soft
indicates that this was added by the sugar
module, not the user. This flag is passed forward to the no-paren function call stage.
add_parens = yes if m otherwise no
out_tokens.push({text:'(', line:token.line, value:'(', type:'LITERAL', soft:yes}) when add_parens
For each code block match, we first add a string token to the stream for all the constant text before the block start, then a +
.
while m
new_token_text = rv.slice(0,m.index) + '"'
out_tokens.push {text:new_token_text, line:token.line, value:new_token_text, type:'STRING'}
out_tokens.push {text:'+', line:token.line, value:'+', type:'LITERAL'}
Next we add the parsed version of the code block (a token array) generated by running the code through the lexer. If there is more than one token, this also needs to be in parentheses.
new_tokens = tokenizer(rv.slice(m.index+2,m.index+m[0].length-1))[0]
out_tokens.push({text:'(', line:token.line, value:'(', type:'LITERAL'}) when new_tokens.length isnt 1
out_tokens = out_tokens.concat new_tokens
out_tokens.push({text:')', line:token.line, value:')', type:'LITERAL'}) when new_tokens.length isnt 1
Next we make a string out of any remaining text after the block in case this is the last match. If the loop exits here, it gets added to the token stream; otherwise we ignore it since the next iteration will take care of it. If the remaining string is empty, we drop it since we don't want things like "a is #{a}" turning into ("a is " + a + "") for aesthetic reasons.
rv = '"' + rv.slice(m.index+m[0].length)
if rv is '""'
rv = ''
else
out_tokens.push {text:'+', line:token.line, value:'+', type:'LITERAL'}
Find the next code block if there is one.
r = /#{.*?}/g
m = r.exec rv
If there wasn't a next code block, add the remaining string (if any) and close paren.
out_tokens.push({text:rv, line:token.line, value:rv, type:'STRING'}) when rv isnt ''
out_tokens.push({text:')', line:token.line, value:')', type:'LITERAL', soft:yes}) when add_parens
else
For anything other than a double-quoted string, just pass it through.
out_tokens.push token
return out_tokens
Removes whitespace. It marks tokens that were followed by whitespace so that the later stages can detect the difference between things like my_function(a) ->
and my_function (a) ->
.
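A quick illustration of why the marker matters (based on the checks in the no-paren stage below; x and y are made-up names):
y = x[1]   # no trailing whitespace after x: plain indexing, left alone
y = x [1]  # whitespace after x: treated as a call, roughly x([1])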
function clean (tokens)
out_tokens = []
for token in tokens
if token.type isnt 'WHITESPACE'
out_tokens.push token
else if out_tokens.length > 0
out_tokens[out_tokens.length - 1].trailed_by_white = yes
return out_tokens
This function removes newlines and indentation after commas, allowing long lines of code to be broken up into multiple lines. Token line numbers are preserved for error reporting.
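For instance (an illustrative sketch; the function and argument names are made up):
my_function first_arg,
            second_arg,
            third_arg
is reassembled into the equivalent of:
my_function first_arg, second_arg, third_arg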
function multiline_statements (tokens)
out_tokens = []
last_token = null
We keep track of whether or not we are on a continued line and how many indents we ignored.
continue_line = no
reduce_dedent = 0
for token in tokens
skip_token = no
If we see a newline after a comma, remove it from the stream and mark that we are in line continuation mode.
if last_token?.value in [','] and token.type is 'NEWLINE'
continue_line = yes
skip_token = yes
In line continuation mode, ignore indents and dedents, but keep track of them. We exit line continuation mode when we see a DEDENT
that brings us back even with the original line.
else if continue_line
if token.type is 'INDENT'
skip_token = yes
reduce_dedent += 1
else if token.type is 'NEWLINE'
skip_token = yes
else if token.type is 'DEDENT'
if reduce_dedent > 0
reduce_dedent -= 1
skip_token = yes
if reduce_dedent is 0
out_tokens.push {text:'\n', line:token.line, value:'',type:'NEWLINE'}
else
When exiting line continuation mode, we have to add back in the last NEWLINE
.
out_tokens.push last_token
Add the token to the new stream unless we decided to skip it.
out_tokens.push(token) unless skip_token
last_token = token
return out_tokens
This stage converts implicit function calls (my_function a, b
) to explicit ones (my_function(a,b)
). NOPAREN_WORDS
specifies keywords that should not be considered as a first argument to a function call. For example, we don't want x is a
to turn into x(is(a))
, but we do want x y z
to become x(y(z))
.
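A rough before/after sketch using the examples above:
x y z              # becomes x(y(z))
x is a             # 'is' is a no-paren word, so nothing is added
my_function 1, 2   # becomes my_function(1, 2)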
NOPAREN_WORDS = ['is','otherwise','except','else','doesnt','exist','exists','isnt','inherits',
'from','and','or','xor','in','when','instanceof','of','nor','if','unless',
'except','for','with','wait','task','fail','parallel','series','safe','but',
'bitwise','mod','second','seconds','while','until']
This function is admittedly messy and in need of a rewrite. But it's not broken, so…
function noparen_function_calls (tokens)
out_tokens = []
close_paren_count = 0
last_token = null
triggers = []
closures = []
ignore_next_indent = no
We need a token counter because sometimes we look back two or three tokens.
i = 0
while i < tokens.length
token = tokens[i]
Check that the previous token is not a reserved word. This can happen if the last token is not a keyword, two tokens ago was a .
(like x.for a
), or the last token is a keyword but a valid r-value (me x
).
last_token_isnt_reserved = not (last_token?.value in KEYWORDS) or tokens[i-2]?.value is '.' or (last_token?.value in RVALUE_OK)
Check if the previous token was callable. This is only true if it is an IDENTIFIER
(not reserved) or a ]
like x[1] a
.
last_token_callable = (last_token?.type is 'IDENTIFIER' and last_token_isnt_reserved) or last_token?.value is ']'
Check that the current token isn't a no-paren word (not looking at something like x for
).
token_isnt_reserved = not (token.value in NOPAREN_WORDS)
Check that the current token is not a literal (don't want my_function * 2
to become my_function(* 2)
).
non_literal = (token.type in ['IDENTIFIER','NUMBER','STRING','REGEX'])
There are some exceptions for callable literals, for things like f {x:1}
, f [1]
, and ->
.
callable_literal = (token.value is '{' or (token.value is '[' and last_token?.trailed_by_white) or (token.value is '-' and tokens[i+1]?.value is '>'))
Combining previous checks, we check that this token is not an operator.
this_token_not_operator = ((non_literal or callable_literal) and token_isnt_reserved)
Check if this is a function declaration.
declaring_a_function = tokens[i-2]?.value in ['function','task','method','class'] and last_token?.type is 'IDENTIFIER'
Check if a parenthesis is soft
, meaning added by the sugar and not the user.
soft_paren = (token.value is '(' and token.soft but not declaring_a_function)
Don't want to add parentheses around bitwise left
or bitwise right
, but we also really don't want left
and right
to be no-paren words, otherwise x left
would not translate to x(left)
. These are really useful words, so we handle them in this special case to avoid this issue.
bitwise_shift = (last_token?.value in ['left','right']) and tokens[i-2]?.value is 'bitwise'
If the previous token is callable and the current token is not an operator (or it's a parenthesis that the user didn't add) and we're not in the special bitwise
case, then we add an open paren. We add a trigger to close the parentheses on the next NEWLINE
.
if last_token_callable and (this_token_not_operator or soft_paren) but not bitwise_shift
triggers.push 'NEWLINE'
out_tokens.push {text:'(', line:token.line, value:'(', type:'LITERAL'}
closures.push ')'
If we're passing a function as an argument, we want to change the close trigger to a DEDENT
and ignore the next INDENT
.
else if (token.value is 'function' or (token.value is '>' and last_token?.value is '-')) and triggers[triggers.length-1] is 'NEWLINE'
triggers[triggers.length-1] = 'DEDENT'
ignore_next_indent = yes
Keep track of indents so that streams like: x = myfunct function () NEWLINE INDENT ... DEDENT
will not close out parentheses early.
else if token.type is 'INDENT'
if ignore_next_indent
ignore_next_indent = no
else
triggers.push 'DEDENT'
closures.push ''
Reset the ignore_next_indent
flag if necessary.
else if token.type is 'NEWLINE' and tokens[i+1]?.type isnt 'INDENT'
ignore_next_indent = no
Check if we hit a “closure” (end of implied parentheses) when we are looking for a NEWLINE
. This can happen on an actual NEWLINE
or when we hit a tail conditional.
if (token.type is 'NEWLINE' or token.value in ['if','unless','when','except']) and closures.length > 0 and triggers[triggers.length - 1] is 'NEWLINE'
If so, pop all NEWLINE
closures and add in the implied tokens. NEWLINE
s can close out multiple parentheses (x = a b c
).
while closures.length > 0 and triggers[triggers.length - 1] is 'NEWLINE'
triggers.pop()
closure = closures.pop()
out_tokens.push({text:closure, line:token.line, value:closure, type:'LITERAL'}) if closure isnt ''
out_tokens.push token
If our closure had a DEDENT
trigger, pop it and add the token.
else if token.type is 'DEDENT' and closures.length > 0 and triggers[triggers.length - 1] is 'DEDENT'
out_tokens.push token
triggers.pop()
closure = closures.pop()
out_tokens.push({text:closure, line:token.line, value:closure, type:'LITERAL'}) if closure isnt ''
If no trigger was matched, just pass the token through.
else if closures.length is 0 or token.type isnt triggers[triggers.length - 1]
out_tokens.push token
last_token = token
i += 1
If we hit EOF, pop out all the remaining closures.
while closures.length > 0
closure = closures.pop()
out_tokens.push({text:closure, line:token.line, value:closure, type:'LITERAL'}) if closure isnt ''
return out_tokens
This function converts CoffeeScript-style functions (() ->
) to Kal syntax.
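A rough before/after sketch (only the (args) -> header is rewritten; add and greet are made-up names):
add = (a, b) ->
  return a + b
greet = ->
  print 'hello'
are rewritten as the equivalent of:
add = function (a, b)
  return a + b
greet = function ()
  print 'hello'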
function coffee_style_functions (tokens)
out_tokens = []
last_token = null
We need to track the token index since we look back several tokens in this stage.
i = 0
while i < tokens.length
token = tokens[i]
Look for a ->
.
if last_token?.value is '-' and token?.value is '>'
If we see the ->
, that means the current token is >
and we already added the -
to the new stream. We have to pop the -
off the stream.
out_tokens.pop()
We create a new token stream fragment for this function header.
new_tokens = []
Next we examine the last token in the stream. Since we just popped the -
, this will either be a )
if the definition is in the form (args) ->
or something else if it doesn't specify arguments.
t = out_tokens.pop()
if t?.value is ')'
If there are arguments here, keep popping until we hit the (
, adding the argument tokens to the new_tokens
stream. At the end of this loop, new_tokens
will contain the argument tokens (if any) followed by the closing paren.
while t?.value isnt '('
new_tokens.unshift t
t = out_tokens.pop()
Add back the opening paren.
new_tokens.unshift t
else
If no arguments were specified, let new_tokens be ()
out_tokens.push t
new_tokens.push {text:'(', line:token.line, value:'(', type:'LITERAL'}
new_tokens.push {text:')', line:token.line, value:')', type:'LITERAL'}
Prepend the function
token to new_tokens
, which currently has the arguments (if any) in parentheses. Then add it to the out_tokens
stream.
f_token = {text:'function', line:token.line, value:'function', type:'IDENTIFIER'}
new_tokens.unshift f_token
out_tokens = out_tokens.concat new_tokens
else
If we're not handling a Coffee-Style function, just pass tokens through.
out_tokens.push token
last_token = token
i += 1
return out_tokens
This function converts list definitions that span multiple lines into a single line. Tokens retain their original line numbers. This supports lists and explicit map definitions ({}
).
This function is admittedly awful and needs rework.
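As a rough illustration of the intent (the names and values are made up):
primes = [
  2
  3
  5
]
options = {
  debug: yes
  level: 2
}
are treated roughly like the single-line forms:
primes = [2, 3, 5]
options = {debug: yes, level: 2}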
function multiline_lists (tokens)
out_tokens = []
We need to track nested lists.
list_depth = 0
last_token_was_separator = no
indent_depths = []
indent_depth = 0
leftover_indent = 0
for token in tokens
skip_this_token = no
We need to keep track of whether or not this token is eligible as a list item separator.
token_is_separator = (token.type in ['NEWLINE','INDENT', 'DEDENT'] or token.value is ',')
When we see a list start, we push to the list stack.
if token.value is '[' or token.value is '{'
list_depth += 1
indent_depths.push indent_depth
indent_depth = 0
Likewise for a list end, we pop the stack.
else if token.value is ']' or token.value is '}'
list_depth -= 1
leftover_indent = indent_depth
indent_depth = indent_depths.pop()
Keep track of the indentation level, looking for a token that returns us to the original indent. We continue to skip indents/dedents until this happens. Basically, we want to ignore indentation inside these multi-line definitions. Once back to the original indent level, we push in a NEWLINE
.
Note that none of this happens unless we are inside a list definition (all these flags are ignored).
else if token.type is 'INDENT'
indent_depth += 1
if leftover_indent isnt 0
leftover_indent += 1
skip_this_token = yes
out_tokens.push({text:'', line:token.line, value:'\n', type:'NEWLINE'}) if leftover_indent is 0
else if token.type is 'DEDENT'
indent_depth -= 1
if leftover_indent isnt 0
leftover_indent -= 1
out_tokens.push({text:'', line:token.line, value:'\n', type:'NEWLINE'}) if leftover_indent is 0
skip_this_token = yes
Skip newlines inside of list definitions.
else if token.type is 'NEWLINE'
if leftover_indent isnt 0
skip_this_token = yes
else
leftover_indent = 0
if list_depth > 0
The first token in a newline stretch gets turned into a comma.
if token_is_separator and not last_token_was_separator
out_tokens.push {text:',', line:token.line, value:',', type:'LITERAL'}
else
out_tokens.push token unless token_is_separator or skip_this_token
else
out_tokens.push token unless skip_this_token
last_token_was_separator = token_is_separator and (list_depth > 0)
return out_tokens
Convert print tokens to console.log tokens.
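For example, a line like:
print 'hello'
is emitted as though it read:
console.log 'hello'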
function print_statement (tokens)
new_tokens = []
for token in tokens
if token.value is 'print' and token.type is 'IDENTIFIER'
new_tokens.push {text:'print', line:token.line, value:'console', type:'IDENTIFIER'}
new_tokens.push {text:'print', line:token.line, value:'.', type:'LITERAL'}
new_tokens.push {text:'print', line:token.line, value:'log', type:'IDENTIFIER'}
else
new_tokens.push token
return new_tokens