| Commit message | Author | Age | Files | Lines |
| |
* parser.c (lisp_parse_impl): If parsing from string, check
for trailing junk and diagnose. JSON parsing doesn't use
lookahead because it doesn't have a.b syntax, so the
recent_tok gives the last token that actually went into the
syntax, and not a lookahead token. So in the case of JSON,
we call yylex to see if there is any trailing token.
* tests/010/json.tl: Extend get-json tests to more kinds of
objects, and then replicate with trailing whitespace and
trailing junk to provide coverage for these cases.
* tests/012/parse.tl: Slew of new read tests, and iread tests also.
* txr.1: Documented.
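
The trailing-junk check described above can be illustrated with a much simpler character-level sketch. This is not the code in parser.c, which works on tokens (recent_tok, or an extra yylex call for JSON); the function name has_trailing_junk is hypothetical, and treating only whitespace as acceptable trailing material is an assumption made for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: given the unconsumed remainder of a string
   after one object has been parsed, report whether anything other
   than whitespace is left over. Returns 1 for junk, 0 otherwise. */
static int has_trailing_junk(const char *rest)
{
    assert(rest != NULL);

    while (*rest != '\0') {
        if (!isspace((unsigned char) *rest))
            return 1;   /* a non-whitespace character remains */
        rest++;
    }

    return 0;           /* only whitespace, or nothing, remains */
}
```

A caller would diagnose an error when this returns 1, and accept the parsed object when it returns 0.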
| |
* tests/012/parse.tl: All the tests in this file blow up on
systems that don't have a full-blown character type.
| |
* parser.l (grammar): Just like we do in SREGEX, allow an
arbitrary byte in REGEX, mapping it to the DCxx range.
Do the same inside string literals of all types.
* lex.yy.c.shipped: Updated.
* tests/012/parse.tl: New tests.
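
A minimal sketch of the byte mapping mentioned above, assuming that an arbitrary byte is represented as the code point 0xDC00 plus the byte value. Both function names are hypothetical and are not taken from parser.l; the point is only that the mapping is reversible, so the original byte can be recovered.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: map one raw byte to a code point in the
   U+DC00..U+DCFF range (assumed mapping: 0xDC00 + byte). */
static wchar_t byte_to_dcxx(unsigned char byte)
{
    return (wchar_t) (0xDC00 + byte);
}

/* Inverse mapping. Returns 1 and stores the byte in *out if wc lies
   in the DCxx range; returns 0 and leaves *out alone otherwise. */
static int dcxx_to_byte(wchar_t wc, unsigned char *out)
{
    assert(out != NULL);

    if (wc < 0xDC00 || wc > 0xDCFF)
        return 0;

    *out = (unsigned char) (wc - 0xDC00);
    return 1;
}
```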
|
|
The main idea in this commit is to change a behavior of the
lexer, and take advantage of it in the parser. Currently, the
lexer recognizes a {UANYN} pattern in two places. That
pattern matches a UTF-8 character. The lexeme is passed to
the decoder, which is expected to produce exactly one wide
character. If the UTF-8 is bad (for instance, a code in the
surrogate pair range U+DCxx) then the decoder will produce
multiple characters. In that case, these rules return ERRTOK
instead of a LITCHAR or REGCHAR. The idea is: why don't we
just return those characters as a TEXT token? Then we can
just incorporate that into the literal or regex.
* parser.l (grammar): If a UANYN lexeme decodes to multiple
characters instead of the expected one, then produce a
TEXT token instead of complaining about invalid UTF-8 bytes.
* parser.y (regterm): Recognize a TEXT item as a regterm,
converting its string value to a compound node in the regex
AST, so it will be correctly treated as a fixed pattern.
(chrlit): If a hash-backslash is followed by a TEXT token,
which can happen now, that is invalid; we diagnose that
as invalid UTF-8.
(quasi_item): Remove TEXT rule, because the litchars
constituent now generates TEXT.
(litchars, restlistchar): Recognize TEXT item, similarly to
regterm.
* tests/012/parse.tl: New file.
* tests/012/parse.expected: Likewise.
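
The lexer decision described in this commit can be sketched as follows. This is an illustration under stated assumptions, not the code in parser.l: decode_utf8 is a simplified stand-in for the real decoder, classify and the enum names TOK_LITCHAR and TOK_TEXT are hypothetical, and invalid bytes are assumed to map to 0xDC00 plus the byte value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum token { TOK_LITCHAR, TOK_TEXT };   /* hypothetical stand-ins */

/* Simplified decoder: each valid UTF-8 sequence yields one code point;
   each byte that cannot start a valid sequence yields 0xDC00 + byte.
   Returns the number of code points stored (at most cap). */
static size_t decode_utf8(const unsigned char *s, size_t len,
                          uint32_t *out, size_t cap)
{
    size_t i = 0, n = 0;

    assert(s != NULL && out != NULL);

    while (i < len && n < cap) {
        unsigned char b = s[i];
        size_t need = b < 0x80 ? 1 : (b & 0xE0) == 0xC0 ? 2 :
                      (b & 0xF0) == 0xE0 ? 3 : (b & 0xF8) == 0xF0 ? 4 : 0;
        static const uint32_t min_cp[] = { 0, 0, 0x80, 0x800, 0x10000 };
        uint32_t cp = need == 1 ? b : b & (0xFF >> (need + 1));
        int ok = need != 0 && i + need <= len;
        size_t k;

        /* Accumulate continuation bytes, each of the form 10xxxxxx. */
        for (k = 1; ok && k < need; k++) {
            ok = (s[i + k] & 0xC0) == 0x80;
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }

        /* Reject overlong forms, out-of-range values and surrogates. */
        if (ok && (cp < min_cp[need] || cp > 0x10FFFF ||
                   (cp >= 0xD800 && cp <= 0xDFFF)))
            ok = 0;

        if (ok) {
            out[n++] = cp;
            i += need;
        } else {
            out[n++] = 0xDC00 + b;   /* assumed mapping for a bad byte */
            i += 1;
        }
    }

    return n;
}

/* One decoded character is an ordinary literal character; anything
   else is handed to the parser as a TEXT token rather than an error. */
static enum token classify(const unsigned char *lexeme, size_t len)
{
    uint32_t buf[8];
    size_t n = decode_utf8(lexeme, len, buf, 8);

    return n == 1 ? TOK_LITCHAR : TOK_TEXT;
}
```

For example, the bytes ED B0 80 are the UTF-8 form of the surrogate U+DC00. The sketch rejects them as a unit and decodes them as three separate characters, so classify reports TOK_TEXT.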
|