This file is responsible for translating raw code (a long string) into an array of tokens that can be more easily parsed by the parser.
Comments are tokens that are not part of the code. These are returned separately from the token array, with line markers to indicate where they belong. Comments can be single-line, starting with a # and ending at the end of the line. They do not need to be the first thing on the line (for example, x = 1 #comment is valid). Comments can also be multiline, starting with a ### and ending with either a ### or the end of the file.
token_types = [[/^###([^#][\s\S]*?)(?:###[^\n\S]*|(?:###)?$)|^(?:[^\n\S]*#(?!##[^#]).*)+/, 'COMMENT'],
Regex tokens are regular expressions. Kal does not do much with the regex contents at this time; it just passes the raw regex through to JavaScript.
[/^(\/(?![\s=])[^[\/\n\\]*(?:(?:\\[\s\S]|\[[^\]\n\\]*(?:\\[\s\S][^\]\n\\]*)*])[^[\/\n\\]*)*\/)([imgy]{0,4})(?!\w)/,'REGEX'],
Numbers can be either in hex format (like 0xa5b) or decimal/scientific format (10, 3.14159, or 10.02e23). There is no distinction between floating point and integer numbers.
[/^0x[a-f0-9]+/i, 'NUMBER'],
[/^[0-9]+(\.[0-9]+)?(e[+-]?[0-9]+)?/i, 'NUMBER'],
Strings can be either single or double quoted.
[/^'(?:[^'\\]|\\.)*'/, 'STRING'],
[/^"(?:[^"\\]|\\.)*"/, 'STRING'],
Identifiers (generally variable names) must start with a letter, $, or underscore. Subsequent characters can also be numbers. Unicode characters are supported in variable names.
[/^[$A-Za-z_\x7f-\uffff][$\w\x7f-\uffff]*/, 'IDENTIFIER'],
A newline is used to mark the end of a statement. Whitespace without other content between newlines is ignored. Multiple newlines in a row generate just one token.
[/^\n([\f\r\t\v\u00A0\u2028\u2029 ]*\n)*\r*/, 'NEWLINE'],
Whitespace is ignored by the parser, but needs to exist in the token stream to break up other tokens.
[/^[\f\r\t\v\u00A0\u2028\u2029 ]+/, 'WHITESPACE'],
A literal is a symbol (like = or +) used as an operator. Some literals, like +=, can be two characters.
[/^[\<\>\!\=]\=/, 'LITERAL'],
[/^[\+\-\*\/\^\=\.><\(\)\[\]\,\.\{\}\:\?]/, 'LITERAL']]
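To make the matching concrete, here is a minimal sketch of how this table gets used (illustrative only, not part of the lexer; the sample string and console.log call are ours). It mirrors the matching loop in the tokenize method below:

sample = '0xff + 2'
for tt in token_types
  sample_match = tt[0].exec(sample)?[0]
  if sample_match exists
    # the COMMENT and REGEX patterns fail on this input, so the hex NUMBER pattern wins
    console.log tt[1], sample_match  # prints: NUMBER 0xff
    break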
Token objects also contain semantic values that are used during parsing. For example, + = and += are both equivalent to +=. This object maps token types to functions that return their semantic values.
parse_token =
NUMBER tokens are represented by their numeric value. This is used in constant folding in some cases.
NUMBER: (text) ->
return Number(text)
STRING and IDENTIFIER tokens are just the same as their content.
STRING: (text) ->
return text
IDENTIFIER: (text) ->
return text
NEWLINE tokens have only one possible semantic value, so they are just set to empty to make printouts with showTokens cleaner.
NEWLINE: (text) ->
return ''
WHITESPACE tokens all have the same semantic value, so they are just replaced with a single space.
WHITESPACE: (text) ->
return ' '
COMMENT tokens are trimmed and have their #s removed. Any JavaScript comment markers are also escaped here.
COMMENT: (text) ->
rv = text.trim()
rv = rv.replace /^#*\s*|#*$/g, ""
rv = rv.replace /\n[\f\r\t\v\u00A0\u2028\u2029 ]*#*[\f\r\t\v\u00A0\u2028\u2029 ]*/g, '\n * '
return rv.replace(/(\/\*)|(\*\/)/g, '**')
LITERAL tokens have their spaces removed as noted above.
LITERAL: (text) ->
return text.replace(/[\f\r\t\v\u00A0\u2028\u2029 ]/, '')
REGEX tokens just pass through.
REGEX: (text) ->
return text
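As a quick illustration (again just a sketch, not part of the module), the table can be exercised directly. The sample inputs and console.log calls below are ours:

console.log parse_token['NUMBER']('10.02e23')   # 1.002e+24
console.log parse_token['LITERAL']('+ =')       # '+='
console.log parse_token['COMMENT']('# a note')  # 'a note'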
This class is initialized with a code string. Its tokens and comments members are then populated using the tokenize method (called automatically during initialization).
class Lexer
The line_number argument is optional and allows you to specify a line offset to start on.
method initialize(code, line_number=1)
me.code = code
me.line = line_number
me.indent = 0
me.indents = []
me.tokenize()
method tokenize()
me.tokens = []
me.comments = []
last_token_type = null
index = 0
This loop will try each regular expression in token_types against the current head of the code string until one matches.
while index < me.code.length
chunk = me.code.slice(index)
for tt in token_types
regex = tt[0]
type = tt[1]
text = regex.exec(chunk)?[0]
if text exists
me.type = type
break
If there was no match, this is a bad token and we will abort compilation here. We only report up to the first 16 characters of the token in case it is very long.
if text doesnt exist
  context_len = 16 when chunk.length >= 16 otherwise chunk.length
  fail with "invalid token '#{chunk.slice(0,context_len)}...' on line #{me.line}"
Parse the semantic value of the token.
val = parse_token[me.type](text)
If we previously saw a NEWLINE, we check to see if the current indent level has changed. If so, we generate INDENT or DEDENT tokens as appropriate in the handleIndentation method. COMMENT-only lines are ignored, since we want to allow arbitrary indentation on non-code lines.
if last_token_type is 'NEWLINE' and type isnt 'COMMENT'
me.handleIndentation type, text
For comments, we create a special comment token and put it in me.comments. We mark if it is postfix (after code on a line) or prefix (alone before a line of code). This will matter when we generate JavaScript code and try to place these tokens back as JavaScript comments. Multiline comments are also marked for this same reason.
if type is 'COMMENT'
comment_token = {text:text, line:me.line, value:val, type:type}
if last_token_type is 'NEWLINE' or last_token_type doesnt exist
comment_token.post_fix = no
else
comment_token.post_fix = yes
if val.match(/\n/)
comment_token.multiline = yes
else
comment_token.multiline = no
me.comments.push comment_token
For non-comment tokens, we generally just add the token to me.tokens. We skip a NEWLINE token if it would cause multiple NEWLINEs in a row.

The soft attribute indicates that the token was separated from the previous token by whitespace. This is used by the sugar module in some cases to determine whether this is a function call or not. For example, my_function () -> could mean my_function()(->) or my_function(->). Marking whether or not there was whitespace allows us to translate my_function () -> differently from my_function() -> (the usage sketch at the end of this file shows the soft flag in action).
else
unless type is 'NEWLINE' and me.tokens[me.tokens.length - 1]?.type is 'NEWLINE'
me.tokens.push {text:text, line:me.line, value:val, type:type, soft: last_token_type is 'WHITESPACE'}
Update our current index and the line number we are looking at. Line numbers are used for source maps and error messages.
index += text.length
me.line += text.match(/\n/g)?.length or 0
last_token_type = type
Add a trailing newline in case the user didn't. The parser needs this in some cases.
me.tokens.push {text:'\n',line:me.line, value:'', type:'NEWLINE'}
Clear up any remaining indents at the end of the file.
me.handleIndentation 'NEWLINE', ''
Remove the newline we added if it wasn't needed.
me.tokens.pop() if me.tokens[me.tokens.length-1].type is 'NEWLINE'
The handleIndentation method adds INDENT and DEDENT tokens as necessary.
method handleIndentation(type, text)
Get the current line's indentation.
indentation = text.length if type is 'WHITESPACE' otherwise 0
If indentation has changed, push tokens as appropriate. Note that we treat multiple indents (multiples of two spaces) as a single indent/dedent pair. We keep track of the indentation level of each indent separately in the me.indents array in case the code is inconsistent.
if indentation > me.indent
me.indents.push me.indent
me.indent = indentation
me.tokens.push {text:text, line:me.line, value:'', type:'INDENT'}
else if indentation < me.indent
We allow for multiple dedents on a single line by looping until indentation matches.
while me.indents.length > 0 and indentation < me.indent
me.indent = me.indents.pop()
fail with 'indentation is misaligned on line ' + me.line if indentation > me.indent
me.tokens.push {text:text, line:me.line, value:'', type:'DEDENT'}
A misalignment is not parseable, so we throw an error.
fail with 'indentation is misaligned' if indentation isnt me.indent
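To see the effect (an illustrative sketch; the sample snippet and console.log call are ours), lexing a block indented by two spaces produces an INDENT right after the first NEWLINE and a DEDENT right after the second:

lex = new Lexer('a\n  b = 1\nc = 2')
for t in lex.tokens
  console.log t.type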
This function is the entry point for the compiler. It parses a code string using the lexer and returns the tokens and comments separately.
function tokenize(code)
lex = new Lexer(code)
return [lex.tokens, lex.comments]
exports.tokenize = tokenize
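Finally, a brief usage sketch (illustrative only; the sample string and console.log calls are not part of the module):

result = tokenize 'x = 1 # set x\nmy_function () ->'
tokens = result[0]
comments = result[1]
console.log comments[0].value     # 'set x', with post_fix set since it followed code
for t in tokens
  if t.text is '('
    console.log t.soft            # true: whitespace separated '(' from my_function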