
    Kal Sugar


    This module applies several “syntactic sugar” pre-processing steps to Kal code before it goes to the compiler. These steps would be onerous to do during the parsing stage, but are generally easier to do on a token stream. Each function in this module takes an input token stream and returns a new, possibly modified one.

    Some sugar functions use the keyword list from the grammar, most notably the implicit parentheses for function calls.

    grammar  = require './grammar'
    KEYWORDS  = grammar.KEYWORDS
    RVALUE_OK = grammar.RVALUE_OK

    The entry point for this module is the translate_sugar function, which takes an input token stream and returns a modified token stream for use with the parser. It also takes an optional options parameter which may contain the following properties:

    • show_tokens - if true, this module will print the translated token stream to the console. This is useful for debugging the compiler.

    The function also takes a tokenizer argument, which is a function that, given a code string, returns an array whose first element is a token array and whose second is a comment token array. The Kal compiler uses the tokenize function from the lexer module for this argument. If present, tokenizer is used to tokenize code embedded in double-quoted strings; if it is missing, double-quoted strings with embedded code blocks are left as plain strings.
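
    For illustration, a caller might use this module roughly as follows. This is a sketch only: it assumes the compiled lexer and sugar modules are required as './lexer' and './sugar', and the source string is arbitrary.

      lexer = require './lexer'
      sugar = require './sugar'

      # tokenize returns [token_array, comment_token_array]; we only need the tokens
      raw_tokens = lexer.tokenize('print "x is #{x}"')[0]

      # returns a new token stream ready for the parser, printing it if show_tokens is set
      sugared = sugar.translate_sugar(raw_tokens, {show_tokens: yes}, lexer.tokenize)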

    function translate_sugar (tokens, options, tokenizer)

    The current sugar stages, in the order they are applied, are:

    1. code_in_strings - for double-quoted strings with embedded code blocks ("1 + 1 = #{1 + 1}"), this function tokenizes the code blocks and converts the string to the equivalent of "1 + 1 = " + (1 + 1).
    2. clean - removes whitespace tokens, marking any token that was trailed by whitespace.
    3. multiline_lists - collapses list definitions that span multiple lines into a single line, though the tokens do still retain their original line numbers.
    4. multiline_statements - removes line breaks after commas on long statements.
    5. noparen_function_calls - adds parentheses around implicit function calls like my_function 1, 2.
    6. print_statement - converts calls to print to console.log.
    7. coffee_style_functions - converts functions with CoffeeScript syntax ((a,b) -> return a + b) to standard Kal function syntax.
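
    As a rough sketch of the combined effect at the source level (x here is just some variable in scope), a single line can exercise the string, no-paren, and print stages at once:

      print "x is #{x}"

    is compiled as if it had been written

      console.log(("x is " + x))

    (the doubled parentheses are harmless).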

    The output is a new token stream (array).

      out_tokens = coffee_style_functions print_statement noparen_function_calls multiline_statements multiline_lists clean code_in_strings tokens, tokenizer

    Debug printing of the translated token stream is enabled with the show_tokens option.

      if options?.show_tokens
        debug = []
        for t in out_tokens
          if t.type is 'NEWLINE'
            debug.push '\n'
          else
            debug.push t.value or t.type
        console.log debug.join ' '
      return out_tokens
    exports.translate_sugar = translate_sugar

    Code In Strings


    This function adds support for double-quoted strings with embedded code, like "x is #{x}". It uses the tokenizer argument (a function that converts a code string into a token array, like lexer.tokenize) to run the code blocks in the string through the lexer. The return value is the merged stream of tokens.
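
    For example, the single string token on the first line below is replaced by the token sequence for the expression on the second line (a sketch of the resulting stream, not a literal source rewrite):

      "1 + 1 is #{1 + 1}!"

      ("1 + 1 is " + (1 + 1) + "!")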

    function code_in_strings (tokens, tokenizer)

    If no tokenizer was provided, we abort and leave the strings untranslated.

      return tokens when tokenizer doesnt exist

    The output is a new token array (we don't modify the original).

      out_tokens = []
      for token in tokens

    For double-quoted strings, we search for code blocks like "#{code}". The regex uses the non-greedy operator to avoid parsing "#{block1} #{block2}" as a single block.

        if token.type is 'STRING' and token.value[0] is '"'
          rv = token.value
          r = /#{.*?}/g
          m = r.exec rv

    We generally must add parentheses around any string that gets broken up for code blocks (and it is always safe to do so). The soft flag indicates that the parenthesis was added by the sugar module, not the user; it is passed forward to the no-paren function call stage.

          add_parens = yes if m otherwise no
          out_tokens.push({text:'(', line:token.line, value:'(', type:'LITERAL', soft:yes}) when add_parens

    For each code block match, we first add a string token to the stream for all the constant text before the block start, then a +.

          while m
            new_token_text = rv.slice(0,m.index) + '"'
            out_tokens.push {text:new_token_text, line:token.line, value:new_token_text, type:'STRING'}
            out_tokens.push {text:'+', line:token.line, value:'+', type:'LITERAL'}

    Next we add the parsed version of the code block (a token array) generated by running the code through the lexer. If there is more than one token, this also needs to be in parentheses.

            new_tokens = tokenizer(rv.slice(m.index+2,m.index+m[0].length-1))[0]
            out_tokens.push({text:'(', line:token.line, value:'(', type:'LITERAL'}) when new_tokens.length isnt 1
            out_tokens = out_tokens.concat new_tokens
            out_tokens.push({text:')', line:token.line, value:')', type:'LITERAL'}) when new_tokens.length isnt 1

    Next we make a string out of any remaining text after the block in case this was the last match. If the loop exits here, that string gets added to the token stream; otherwise we ignore it, since the next iteration will take care of it. If the remaining string is just an empty string literal ('""'), we drop it, since we don't want things like "a is #{a}" turning into ("a is " + a + "") for aesthetic reasons.

            rv = '"' + rv.slice(m.index+m[0].length)
            if rv is '""'
              rv = ''
            else
              out_tokens.push {text:'+', line:token.line, value:'+', type:'LITERAL'}

    Find the next code block if there is one.

            r = /#{.*?}/g
            m = r.exec rv

    If there wasn't a next code block, add the remaining string (if any) and close paren.

          out_tokens.push({text:rv, line:token.line, value:rv, type:'STRING'}) when rv isnt ''
          out_tokens.push({text:')', line:token.line, value:')', type:'LITERAL', soft:yes}) when add_parens
        else

    For anything other than a double-quoted string, just pass it through.

          out_tokens.push token
      return out_tokens

    Clean


    Removes whitespace. It marks tokens that were followed by whitespace so that the later stages can detect the difference between things like my_function(a) -> and my_function (a) ->.

    function clean (tokens)
      out_tokens = []
      for token in tokens
        if token.type isnt 'WHITESPACE'
          out_tokens.push token
        else if out_tokens.length > 0
          out_tokens[out_tokens.length - 1].trailed_by_white = yes
      return out_tokens

    Multiline Statements


    This function removes newlines and indentation after commas, allowing long lines of code to be broken up into multiple lines. Token line numbers are preserved for error reporting.
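
    For example, a call whose arguments continue on the following lines after commas (build_row and its arguments are hypothetical names):

      result = build_row first_name,
                         last_name,
                         age

    is reassembled into the single logical line

      result = build_row first_name, last_name, age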

    function multiline_statements (tokens)
      out_tokens = []
      last_token = null

    We keep track of whether or not we are on a continued line and how many indents we ignored.

      continue_line = no
      reduce_dedent = 0
    
      for token in tokens
        skip_token = no

    If we see a newline after a comma, remove it from the stream and mark that we are in line continuation mode.

        if last_token?.value in [','] and token.type is 'NEWLINE'
          continue_line = yes
          skip_token = yes

    In line continuation mode, ignore indents and dedents, but keep track of them. We exit line continuation mode when we see a DEDENT that brings us back to the indentation of the original line.

        else if continue_line
          if token.type is 'INDENT'
            skip_token = yes
            reduce_dedent += 1
          else if token.type is 'NEWLINE'
            skip_token = yes
          else if token.type is 'DEDENT'
            if reduce_dedent > 0
              reduce_dedent -= 1
              skip_token = yes
              if reduce_dedent is 0
                out_tokens.push {text:'\n', line:token.line, value:'',type:'NEWLINE'}
            else

    When exiting line continuation mode, we have to add back in the last NEWLINE.

              out_tokens.push last_token

    Add the token to the new stream unless we decided to skip it.

        out_tokens.push(token) unless skip_token
        last_token = token
      return out_tokens

    No-Paren Function Calls


    This stage converts implicit function calls (my_function a, b) to explicit ones (my_function(a,b)). NOPAREN_WORDS lists keywords that should not be treated as the first argument of a function call. For example, we don't want x is a to turn into x(is(a)), but we do want x y z to become x(y(z)) (see the examples after the word list).

    NOPAREN_WORDS = ['is','otherwise','except','else','doesnt','exist','exists','isnt','inherits',
                     'from','and','or','xor','in','when','instanceof','of','nor','if','unless',
                     'except','for','with','wait','task','fail','parallel','series','safe','but',
                     'bitwise','mod','second','seconds','while','until']
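
    A few examples of the effect (my_function, x, y, z, a, and b are hypothetical identifiers):

      my_function a, b     # becomes my_function(a, b)
      x y z                # becomes x(y(z))
      x is a               # 'is' is a no-paren word, so this stays x is a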

    This function is admittedly messy and in need of a rewrite. But it's not broken, so…

    function noparen_function_calls (tokens)
      out_tokens = []
      close_paren_count = 0
      last_token = null
      triggers = []
      closures = []
      ignore_next_indent = no

    We need a token counter because sometimes we look back two or three tokens.

      i = 0
      while i < tokens.length
        token = tokens[i]

    Check that the previous token is not a reserved word. This is the case if the last token is not a keyword, if the token two back was a . (as in x.for a), or if the last token is a keyword but a valid r-value (me x).

        last_token_isnt_reserved = not (last_token?.value in KEYWORDS) or tokens[i-2]?.value is '.' or (last_token?.value in RVALUE_OK)

    Check if the previous token was callable. This is only true if it is an IDENTIFIER (not reserved) or a ] like x[1] a.

        last_token_callable = (last_token?.type is 'IDENTIFIER' and last_token_isnt_reserved) or last_token?.value is ']'

    Check that the current token isn't a no-paren word (not looking at something like x for).

        token_isnt_reserved = not (token.value in NOPAREN_WORDS)

    Check that the current token is not a LITERAL token such as an operator (we don't want my_function * 2 to become my_function(* 2)).

        non_literal = (token.type in ['IDENTIFIER','NUMBER','STRING','REGEX'])

    There are some exceptions for callable literals, for things like f {x:1}, f [1], and ->.

        callable_literal = (token.value is '{' or (token.value is '[' and last_token?.trailed_by_white) or (token.value is '-' and tokens[i+1]?.value is '>'))

    Combining previous checks, we check that this token is not an operator.

        this_token_not_operator = ((non_literal or callable_literal) and token_isnt_reserved)

    Check if this is a function declaration.

        declaring_a_function = tokens[i-2]?.value in ['function','task','method','class'] and last_token?.type is 'IDENTIFIER'

    Check if a parenthesis is soft, meaning added by the sugar and not the user.

        soft_paren = (token.value is '(' and token.soft but not declaring_a_function)

    We don't want to add parentheses around bitwise left or bitwise right, but we also don't want left and right to be no-paren words; otherwise x left would not translate to x(left). These are really useful words, so we handle this special case here to avoid the issue.

        bitwise_shift = (last_token?.value in ['left','right']) and tokens[i-2]?.value is 'bitwise'

    If the previous token is callable and the current token is not an operator (or it's a parenthesis that the user didn't add) and we're not in the special bitwise case, then we add an open paren. We add a trigger to close the parentheses on the next NEWLINE.

        if last_token_callable and (this_token_not_operator or soft_paren) but not bitwise_shift
          triggers.push 'NEWLINE'
          out_tokens.push {text:'(', line:token.line, value:'(', type:'LITERAL'}
          closures.push ')'

    If we're passing a function as an argument, we want to change the close trigger to a DEDENT and ignore the next INDENT.

        else if (token.value is 'function' or (token.value is '>' and last_token?.value is '-')) and triggers[triggers.length-1] is 'NEWLINE'
          triggers[triggers.length-1] = 'DEDENT'
          ignore_next_indent = yes

    Keep track of indents so that streams like: x = myfunct function () NEWLINE INDENT ... DEDENT will not close out parentheses early.

        else if token.type is 'INDENT'
          if ignore_next_indent
            ignore_next_indent = no
          else
            triggers.push 'DEDENT'
            closures.push ''

    Reset the ignore_next_indent flag if necessary.

        else if token.type is 'NEWLINE' and tokens[i+1]?.type isnt 'INDENT'
          ignore_next_indent = no

    Check if we hit a “closure” (end of implied parentheses) when we are looking for a NEWLINE. This can happen on an actual NEWLINE or when we hit a tail conditional.

        if (token.type is 'NEWLINE' or token.value in ['if','unless','when','except']) and closures.length > 0 and triggers[triggers.length - 1] is 'NEWLINE'

    If so, pop all NEWLINE closures and add in the implied tokens. NEWLINEs can close out multiple parentheses (x = a b c).

          while closures.length > 0 and triggers[triggers.length - 1] is 'NEWLINE'
            triggers.pop()
            closure = closures.pop()
            out_tokens.push({text:closure, line:token.line, value:closure, type:'LITERAL'}) if closure isnt ''
          out_tokens.push token

    If our closure had a DEDENT trigger, pop it and add the token.

        else if token.type is 'DEDENT' and closures.length > 0 and triggers[triggers.length - 1] is 'DEDENT'
          out_tokens.push token
          triggers.pop()
          closure = closures.pop()
          out_tokens.push({text:closure, line:token.line, value:closure, type:'LITERAL'}) if closure isnt ''

    If no trigger was matched, just pass the token through.

        else if closures.length is 0 or token.type isnt triggers[triggers.length - 1]
          out_tokens.push token
        last_token = token
        i += 1

    If we hit EOF, pop out all the remaining closures.

      while closures.length > 0
        closure = closures.pop()
        out_tokens.push({text:closure, line:token.line, value:closure, type:'LITERAL'}) if closure isnt ''
      return out_tokens

    Coffee-Style Functions


    This function converts CoffeeScript-style functions (() ->) to Kal syntax.
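
    For example (double and x are hypothetical names), a CoffeeScript-style function definition:

      double = (x) ->
        return x * 2

    has its header rewritten into the equivalent Kal form, leaving the body untouched:

      double = function (x)
        return x * 2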

    function coffee_style_functions (tokens)
      out_tokens = []
      last_token = null

    We need to track the token index since we look back several tokens in this stage.

      i = 0
      while i < tokens.length
        token = tokens[i]

    Look for a ->.

        if last_token?.value is '-' and token?.value is '>'

    If we see the ->, that means the current token is > and we already added the - to the new stream. We have to pop the - off the stream.

          out_tokens.pop()

    We create a new token stream fragment for this function header.

          new_tokens = []

    Next we examine the last token in the stream. Since we just popped the -, this will either be a ) if the definition is in the form (args) -> or something else if it doesn't specify arguments.

          t = out_tokens.pop()
          if t?.value is ')'

    If there are arguments here, keep popping until we hit the (, adding the popped tokens to the new_tokens stream. At the end of this loop, new_tokens will contain the argument tokens (if any) and the closing paren; only the opening paren is missing.

            while t?.value isnt '('
              new_tokens.unshift t
              t = out_tokens.pop()

    Prepend the opening paren.

            new_tokens.unshift t
          else

    If no arguments were specified, push the popped token back and let new_tokens be ().

            out_tokens.push t
            new_tokens.push {text:'(', line:token.line, value:'(', type:'LITERAL'}
            new_tokens.push {text:')', line:token.line, value:')', type:'LITERAL'}

    Prepend the function token to new_tokens, which currently has the arguments (if any) in parentheses. Then add it to the out_tokens stream.

          f_token = {text:'function', line:token.line, value:'function', type:'IDENTIFIER'}
          new_tokens.unshift f_token
          out_tokens = out_tokens.concat new_tokens
        else

    If we're not handling a Coffee-Style function, just pass tokens through.

          out_tokens.push token
        last_token = token
        i += 1
      return out_tokens

    Multiline Lists


    This function converts list definitions that span multiple lines into a single line. Tokens retain their original line numbers. This supports lists and explicit map definitions ({}).
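
    For example, a list literal written across several lines (values is a hypothetical name):

      values = [1,
                2,
                3]

    is collapsed back into the single-line equivalent

      values = [1, 2, 3]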

    This function is admittedly awful and needs rework.

    function multiline_lists (tokens)
      out_tokens = []

    We need to track nested lists.

      list_depth = 0
      last_token_was_separator = no
      indent_depths = []
      indent_depth = 0
      leftover_indent = 0
      for token in tokens
        skip_this_token = no

    We need to keep track of whether or not this token is eligible as a list item separator.

        token_is_separator = (token.type in ['NEWLINE','INDENT', 'DEDENT'] or token.value is ',')

    When we see a list start, we push to the list stack.

        if token.value is '[' or token.value is '{'
          list_depth += 1
          indent_depths.push indent_depth
          indent_depth = 0

    Likewise for a list end, we pop the stack.

        else if token.value is ']' or token.value is '}'
          list_depth -= 1
          leftover_indent = indent_depth
          indent_depth = indent_depths.pop()

    Keep track of the indentation level, looking for a token that returns us to the original indent. We continue to skip indents/dedents until this happens; basically, we want to ignore indentation inside these multi-line definitions. Once back to the original indent level, we push in a NEWLINE.

    Note that none of this happens unless we are inside a list definition (otherwise all of these flags are ignored).

        else if token.type is 'INDENT'
          indent_depth += 1
          if leftover_indent isnt 0
            leftover_indent += 1
            skip_this_token = yes
            out_tokens.push({text:'', line:token.line, value:'\n', type:'NEWLINE'}) if leftover_indent is 0
        else if token.type is 'DEDENT'
          indent_depth -= 1
          if leftover_indent isnt 0
            leftover_indent -= 1
            out_tokens.push({text:'', line:token.line, value:'\n', type:'NEWLINE'}) if leftover_indent is 0
            skip_this_token = yes

    Skip newlines inside of list definitions.

        else if token.type is 'NEWLINE'
          if leftover_indent isnt 0
            skip_this_token = yes
        else
          leftover_indent = 0
    
        if list_depth > 0

    The first token in a run of separator tokens gets turned into a comma.

          if token_is_separator and not last_token_was_separator
            out_tokens.push {text:',', line:token.line, value:',', type:'LITERAL'}
          else
            out_tokens.push token unless token_is_separator or skip_this_token
        else
          out_tokens.push token unless skip_this_token
        last_token_was_separator = token_is_separator and (list_depth > 0)
      return out_tokens

    Print Statements


    Convert print tokens to console . log tokens.
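
    For example (the string is arbitrary), a line like

      print "hello world"

    ends up in the output stream as

      console.log("hello world")

    (the parentheses having already been added by the earlier no-paren stage).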

    function print_statement (tokens)
      new_tokens = []
      for token in tokens
        if token.value is 'print' and token.type is 'IDENTIFIER'
          new_tokens.push {text:'print', line:token.line, value:'console', type:'IDENTIFIER'}
          new_tokens.push {text:'print', line:token.line, value:'.', type:'LITERAL'}
          new_tokens.push {text:'print', line:token.line, value:'log', type:'IDENTIFIER'}
        else
          new_tokens.push token
      return new_tokens