summaryrefslogtreecommitdiffstats
path: root/regex.c
Commit message (Collapse)AuthorAgeFilesLines
...
* Optimization in derivative-based regex engine.Kaz Kylheku2010-01-261-1/+54
| | | | | | | | Exponential memory consumption behavior was observed when matching the input aaaaaa.... against the regex a?a?a?a?....aaaa.... The fix is to eliminate common subexpressions from the derivative for the or operator.
* * regex.c (reg_derivative_list, reg_derivative): RecognitionKaz Kylheku2010-01-181-6/+29
| | | | | | | | | of cases to reduce consing. In reg_derivative_list, we avoid consing the full or expression if either branch is t, and also save a cons when the first element has a null derivative. In reg_derivative the oneplus and zeroplus cases are split, since zeroplus can re-use the input expression, when it's just a one-character match, deriving nil.
* Adjust semantics of non-greedy operator R%S, to avoid the brokenKaz Kylheku2010-01-181-3/+9
| | | | | | | | case whereby R%S matches nothing at all when S is not empty but equivalent to empty, or more generally when S is nullable. A much nicer definition is ``the intersection of R* and the set of all strings that do not contain a non-empty substring that matches S, followed by S''.
* Implemented non-greedy operator.Kaz Kylheku2010-01-151-1/+20
|
* * regex.c (reg_derivative_list): Bugfix: wrong algebra,Kaz Kylheku2010-01-151-1/+1
| | | | taking a double derivative of the first item.
* * regex.c (reg_derivative): Bugfix: remove invalidKaz Kylheku2010-01-141-9/+1
| | | | algebraic reductions in the derivative for the operator.
* Dynamically determine which regex implementation to use:Kaz Kylheku2010-01-131-2/+30
| | | | | | | NFA or derivatives. The default behavior is NFA, with derivatives used if the regular expression contains uses of complement or intersection. The --dv-regex option forces derivatives always.
* Impelement derivative-based regular expressions.Kaz Kylheku2010-01-131-248/+557
|
* Remove incorrect implementation of extendedKaz Kylheku2010-01-061-273/+32
| | | | | regex operations (complement, intersection). The syntax extensions documentation are retained.
* Implemented the regular expression ~ and & operators.Kaz Kylheku2010-01-051-32/+273
| | | | | | | | | | | | | | | This turns out to be easy to do in NFA land. The complement of an NFA has exactly the same number and configuration of states and transitions, except that the states have an inverted meaning; and furthermore, failed character transitions are routed to an extra state (which in this impelmentation is permanently allocated and shared by all regexes). The regex & is implemented trivially using DeMorgan's. Also, bugfix: regular expressions like A|B|C are allowed now by the syntax, rather than constituting syntax error. Previously, this would have been entered as (A|B)|C.
* All COBJ operations have default implementations now;Kaz Kylheku2009-12-081-6/+5
| | | | | | no null pointer check over struct cobj_ops operations. New typechecking function for COBJ objects.
* Eliminate the void * disease. Generic pointers are of mem_t *Kaz Kylheku2009-12-041-1/+1
| | | | | from now on, which is compatible with unsigned char *. No implicit conversion to or from this type, in C or C++.
* Code cleanup. All private functions static. Private stuffKaz Kylheku2009-11-281-36/+136
| | | | in regex module not exposed in header. Etc.
* Changes to make the code portable to C++ compilers, whichKaz Kylheku2009-11-241-9/+9
| | | | can be taken advantage of for better diagnostics.
* Renaming global variables that denote symbols, such that theyKaz Kylheku2009-11-241-16/+16
| | | | have a _s suffix.
* Improving portability. It is no longer assumed that pointersKaz Kylheku2009-11-231-5/+6
| | | | | | | | can be converted to a type long and vice versa. The configure script tries to detect the appropriate type to use. Also, some run-time checking is performed in the streams module to detect which conversions specifier strings to use for printing numbers.
* Introducing symbol packages. Internal symbols are now inKaz Kylheku2009-11-211-1/+2
| | | | | | | | | | a system package instead of being hacked with the $ prefix. Keyword symbols are provided. In the matcher, evaluation is tightened up. Keywords, nil and t are not bindeable, and errors are thrown if attempts are made to bind them. Destructuring in dest_bind is strict in the number of items. String streams are exploited to print bindings to objects that are not strings or characters. Numerous bugfixes.
* Changing ``obj_t *'' occurences to a ``val'' typedef. (Ideally,Kaz Kylheku2009-11-201-22/+22
| | | | | we wouldn't have to declare object variables at all, so why use an obtuse syntax to do so?)
* Following-up on diagnostics obtained by running code through C++Kaz Kylheku2009-11-181-8/+8
| | | | | | compiler. Idea: allocator functions return char * instead of void *, like malloc did in classic pre-ANSI C. That way we are forced to use a cast except when the target pointer is char * already.
* Warning fixes.Kaz Kylheku2009-11-171-1/+1
|
* * regex.c (nfa_all_states, nfa_closure): visited parameterKaz Kylheku2009-11-171-2/+2
| | | | should be unsigned.
* Regular expression module updated to do unicode character sets.Kaz Kylheku2009-11-121-49/+433
| | | | | | | | | | | Most of the changes are in the area of representing sets. Also, a bug was found in the compilation of regex character sets: ranges straddling two adjacent blocks of 32 characters were not being added to the character set. However, ranges falling within a single 32 block, or spanning three or more such blocks, worked properly. This bug is not tickled by common ranges such as A-Z, or 0-9, which land within a 32 block.
* Big conversion to wide characters and UTF-8 support.Kaz Kylheku2009-11-111-3/+3
| | | | | | | | | This is incomplete. There are too many dependencies on wide character support from the C stream I/O library, and implicit use of some encoding which may not be UTF-8. The regex code does not handle wide characters properly. Character type is still int in some places, rather than wchar_t. Test suite passes though.
* Version 019txr-019Kaz Kylheku2009-11-031-7/+7
| | | | | | Regexps can be bound to variables. New freeform directive.
* Got regex working over lazy strings from freeform.Kaz Kylheku2009-11-021-25/+82
| | | | Bugfixes.
* Start of implementation for freestyle matching.Kaz Kylheku2009-11-021-0/+76
| | | | | | | | | | | Lazy strings implemented, incompletely. Changed string function to implicitly strdup; non-strdup version changed to string_own. Fixed wrong uses of strdup rather than chk_strdup. Functions added to regex module to provide regex matching as a state machine to which characters are fed.
* Trivial change allows regexps to be bound to variables,Kaz Kylheku2009-10-301-0/+5
| | | | | and used for matching. This Just Works because of the way match_line treats variables.
* txr-015 2009-10-15txr-015Kaz Kylheku2017-07-311-7/+10
|
* txr-011 2009-09-25txr-011Kaz Kylheku2017-07-311-0/+631