path: root/tests/012/parse.tl
Commit message | Author | Age | Files | Lines
* read/get-json: reject trailing junk in string input. | Kaz Kylheku | 2021-06-20 | 1 | -0/+52
* parser.c (lisp_parse_impl): If parsing from a string, check for trailing junk and diagnose. JSON parsing doesn't use lookahead because it doesn't have a.b syntax, so recent_tok gives the last token that actually went into the syntax, and not a lookahead token. So in the case of JSON, we call yylex to see if there is any trailing token.
* tests/010/json.tl: Extend get-json tests to more kinds of objects, then replicate with trailing whitespace and trailing junk to provide coverage for these cases.
* tests/012/parse.tl: Slew of new read tests, and iread also.
* txr.1: Documented.
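As a hedged illustration of the behavior being tested (not the actual contents of tests/012/parse.tl; the test macro and the :error convention are assumed to come from the suite's common.tl, and get-json's array-to-vector, number-to-float mapping is assumed):

  (load "../common")

  ;; Trailing whitespace after the object is tolerated; trailing junk
  ;; is now diagnosed when parsing from a string.
  (test (read "123 ") 123)
  (test (read "123 junk") :error)

  ;; Same treatment for JSON string input.
  (test (get-json "[1, 2, 3]  ") #(1.0 2.0 3.0))
  (test (get-json "[1, 2, 3] x") :error)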
* tests: disable some UTF-8 tests on 16 bit wchar_t. | Kaz Kylheku | 2021-04-20 | 1 | -8/+9
* tests/012/parse.tl: All the tests in this file blow up on systems that don't have a full-blown character type.
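A minimal sketch of how such a guard might look, assuming the width of wchar_t is probed with the FFI sizeof operator; the actual guard used in parse.tl may differ:

  ;; Skip the affected UTF-8 cases when wchar_t is narrower than 32 bits.
  (unless (>= (sizeof wchar) 4)
    (exit 0))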
* parser: allow non-UTF-8 bytes in literals and regexes. | Kaz Kylheku | 2021-04-08 | 1 | -0/+6
* parser.l (grammar): Just like we do in SREGEX, allow an arbitrary byte in REGEX, mapping it to the DCxx range. Do the same inside string literals of all types.
* lex.yy.c.shipped: Updated.
* tests/012/parse.tl: New tests.
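A hedged example of what the relaxed lexer accepts, relying on TXR's convention that a character in the U+DCxx range round-trips to the corresponding raw byte when the input string is re-encoded for the parser; the expected value is illustrative:

  ;; The #\xDCFF character stands for a raw 0xFF byte reaching the lexer;
  ;; rather than a syntax error, the byte now reappears as that same
  ;; DCxx-range character inside the string literal (and likewise in regexes).
  (test (read "\"@\xDCFF;@\"") "@\xDCFF;@")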
* parser: allow funny UTF-8 in regexes and literals. | Kaz Kylheku | 2021-04-08 | 1 | -0/+7
The main idea in this commit is to change a behavior of the lexer, and take advantage of it in the parser. Currently, the lexer recognizes a {UANYN} pattern in two places. That pattern matches a UTF-8 character. The lexeme is passed to the decoder, which is expected to produce exactly one wide character. If the UTF-8 is bad (for instance, a code in the surrogate pair range U+DCxx), the decoder will produce multiple characters. In that case, these rules return ERRTOK instead of a LITCHAR or REGCHAR. The idea is: why don't we just return those characters as a TEXT token? Then we can simply incorporate that into the literal or regex.

* parser.l (grammar): If a UANYN lexeme decodes to multiple characters instead of the expected one, then produce a TEXT token instead of complaining about invalid UTF-8 bytes.
* parser.y (regterm): Recognize a TEXT item as a regterm, converting its string value to a compound node in the regex AST, so it will be correctly treated as a fixed pattern.
(chrlit): If a hash-backslash is followed by a TEXT token, which can happen now, that is invalid; we diagnose it as invalid UTF-8.
(quasi_item): Remove TEXT rule, because the litchars constituent now generates TEXT.
(litchars, restlistchar): Recognize TEXT item, similarly to regterm.
* tests/012/parse.tl: New file.
* tests/012/parse.expected: Likewise.
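A hedged sketch of the resulting surface behavior, using the same test conventions and U+DCxx round-trip assumption as above; the byte sequence ED A0 80 matches UANYN structurally but decodes to three U+DCxx characters:

  ;; The funny sequence comes back as a TEXT token and is folded into
  ;; the string literal instead of provoking ERRTOK.
  (test (read "\"\xDCED;\xDCA0;\xDC80;\"") "\xDCED;\xDCA0;\xDC80;")

  ;; But hash-backslash followed by such a sequence is diagnosed
  ;; as invalid UTF-8.
  (test (read "#\\\xDCED;\xDCA0;\xDC80;") :error)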