author | Kaz Kylheku <kaz@kylheku.com> | 2024-07-03 18:21:17 -0700 |
---|---|---|
committer | Kaz Kylheku <kaz@kylheku.com> | 2024-07-03 18:21:17 -0700 |
commit | 0ccf4483d8ed2ad5805116df5dba4a858e5c373a (patch) | |
tree | f8e2a595c4a52afb55f3025c767ae790de231357 /struct.c | |
parent | 77deceded0e5c9143e01d07a19eb219b3273151b (diff) | |
regex: don't consume input past final match.
The read-until-match functions and the two others in the same
family always read a character beyond the characters matched
by the regex. This causes blocking behavior in cases where
a TTY or network socket has already provided a matching record
delimiter, even when the regex is trivial and fixed-length.
Similar behavior is seen in GNU Awk also, with its RS (record
separator); let's fix it in our world.
We introduce a REGM_MATCH_DONE result code, which, like
REGM_MATCH, indicates that the state machine is in an acceptance
state. Unlike REGM_MATCH it also indicates that no more
transitions are possible.
For instance, for a regex like #/ab|c/, the REGM_MATCH_DONE
code will be indicated when the input "ab" is seen, or the
input "c" is seen. Any additional characters will cause a
mismatch. This indication makes it possible for the caller to
avoid reading more characters from an input source.
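The distinction can be sketched with a hand-built machine for ab|c.
This is an illustration only, not the code in regex.c: the toy_ names,
the state enum and the numeric values of the result codes are all
assumptions made for the example. The point is that both accepting
paths land in a state with no outgoing transitions, so the machine can
report "matched, and nothing further can match" at once.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes modeled on the names in this commit message. */
typedef enum {
  TOY_INCOMPLETE,  /* no match yet, but more input could produce one */
  TOY_FAIL,        /* no match is possible any more */
  TOY_MATCH,       /* accepting, and further transitions exist */
  TOY_MATCH_DONE   /* accepting, and no further transitions exist */
} toy_result_t;

typedef enum { S_START, S_A, S_ACCEPT, S_DEAD } toy_state_t;

/* Feed one character to a hand-written DFA for ab|c. */
static toy_result_t toy_feed(toy_state_t *st, char ch)
{
  assert(st != NULL);

  switch (*st) {
  case S_START:
    if (ch == 'a') { *st = S_A; return TOY_INCOMPLETE; }
    if (ch == 'c') { *st = S_ACCEPT; return TOY_MATCH_DONE; }
    break;
  case S_A:
    if (ch == 'b') { *st = S_ACCEPT; return TOY_MATCH_DONE; }
    break;
  default:
    /* S_ACCEPT and S_DEAD have no outgoing transitions. */
    break;
  }

  *st = S_DEAD;
  return TOY_FAIL;
}

/* Feed a whole string; return the result of the last character fed.
   Feeding stops early only on failure. */
static toy_result_t toy_run(const char *s)
{
  toy_state_t st = S_START;
  toy_result_t res = TOY_INCOMPLETE;

  for (; *s != 0; s++) {
    res = toy_feed(&st, *s);
    if (res == TOY_FAIL)
      break;
  }

  return res;
}
```

In this sketch TOY_MATCH is never produced, because ab|c has no
accepting state with outgoing transitions; a regex such as ab* would
need it, since after "a" the match can still grow.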
* regex.c (enum regm_result, regm_result_t): New
REGM_MATCH_DONE enum member.
(nfa_has_transitions): New macro.
(nfa_closure, nfa_move_closure): New pointer-to-int parameter
more. This is set to true only if one or more states in
the output state set have transitions.
(nfa_run): Initialize new local variable more and pass to
nfa_closure and nfa_move_closure. Break out of the character
feeding loop if more is zero.
(regex_machine_reset): Pass more parameter to nfa_closure.
(regex_machine_feed): Pass more parameter to nfa_move_closure.
When returning REGM_MATCH, if more is false, return
REGM_MATCH_DONE. In the derivatives implementation, we report
REGM_MATCH_DONE when the derivative we have calculated is
null.
(search_regex, match_regex): Break loop on REGM_MATCH_DONE,
and avoid feeding the null character in that case.
(match_regex_right): Likewise, and also handle the
REGM_MATCH_DONE case specially at the end. We need to check
whether the match reached the end of the string (is anchored
to the right). If not, we continue the search.
(regex_prefix_match): Break loop on REGM_MATCH_DONE.
(scan_until_common): If we hit REGM_MATCH_DONE, break out
of the loop and proceed straight to the out_match block,
indicating that no characters need to be pushed back from
the stack.
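
The effect on a caller that reads records from a stream can be
sketched as follows. This is a simplified model, not the actual
scan_until_common: the sk_ names, the string-backed source and the
one-character delimiter regex are assumptions made for the example.
The reads counter stands in for calls that could block on a TTY or
socket. With done_aware set, the loop stops as soon as the delimiter
is complete; without it, the loop probes one more character to learn
that the match cannot grow, then pushes that character back.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { SK_INCOMPLETE, SK_FAIL, SK_MATCH, SK_MATCH_DONE } sk_result_t;

/* String-backed stand-in for a TTY or socket. */
typedef struct {
  const char *data;  /* pending input */
  size_t pos;        /* index of the next character */
  size_t reads;      /* read attempts; each could block on a real device */
} sk_source_t;

static int sk_getc(sk_source_t *src)
{
  src->reads++;
  if (src->data[src->pos] == 0)
    return -1;  /* end of input; a real device might block here instead */
  return (unsigned char) src->data[src->pos++];
}

static void sk_ungetc(sk_source_t *src)
{
  src->pos--;
}

/* Toy machine for a delimiter regex matching one newline, fed from its
   start state. The done_aware flag selects the new or the old reporting. */
static sk_result_t sk_feed(int ch, int done_aware)
{
  if (ch != '\n')
    return SK_FAIL;
  return done_aware ? SK_MATCH_DONE : SK_MATCH;
}

/* Read one record into buf, up to but excluding the delimiter.
   Returns the record length; size must be at least 1. */
static size_t sk_read_record(sk_source_t *src, char *buf, size_t size,
                             int done_aware)
{
  size_t len = 0;
  int ch;

  assert(size > 0);

  while ((ch = sk_getc(src)) != -1) {
    sk_result_t res = sk_feed(ch, done_aware);

    if (res == SK_MATCH_DONE)
      break;  /* delimiter complete and cannot extend: no further read */

    if (res == SK_MATCH) {
      /* Old behavior: probe one more character, then push it back. */
      if (sk_getc(src) != -1)
        sk_ungetc(src);
      break;
    }

    if (len + 1 < size)
      buf[len++] = (char) ch;
  }

  buf[len] = 0;
  return len;
}
```

For the input "rec1\n", the done-aware path makes five read attempts,
while the old path makes a sixth, which is the one that would block
when the peer has sent the delimiter and nothing else.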