-rw-r--r-- | ChangeLog     |    7
-rw-r--r-- | doc/ChangeLog |    7
-rw-r--r-- | doc/gawk.info | 1102
-rw-r--r-- | doc/gawk.texi |  256
-rw-r--r-- | re.c          |   19
5 files changed, 753 insertions, 638 deletions
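
The behaviour this commit settles on can be checked with a short sketch. It is not part of the patch. It reuses the `sub()` example from the documentation changes below and forces the `C` locale, where POSIX still defines range expressions by the native character set. The name `awk` is an assumption standing in for any POSIX awk; substitute `gawk` to test gawk itself.

```shell
# Illustrative check, not part of the patch: in the C locale, [A-Z] covers
# only the uppercase ASCII letters, so the trailing "abc" is left alone.
# `awk` is assumed to be any POSIX awk; use `gawk` to exercise gawk itself.
out=$(echo something1234abc | LC_ALL=C awk '{ sub("[A-Z]*$", ""); print }')
echo "$out"   # prints: something1234abc
```

According to the documentation added in this commit, gawk 4.0 and later should give the same result in every locale, because ranges are always interpreted using the machine's character set.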
@@ -1,3 +1,10 @@ +Sun Jun 12 23:43:06 2011 Arnold D. Robbins <arnold@skeeve.com> + + * re.c (resetup): Always turn on RE_RANGES_IGNORE_LOCALES. + Add justifying comment with URLs for the relevant portions of + POSIX. Thanks to Paul Eggert for pointing out the happy change + to the rules and supplying the URLs. + Wed Jun 8 22:41:30 2011 Arnold D. Robbins <arnold@skeeve.com> * regcomp.c (build_range_exp): Add check for RE_NO_EMPTY_RANGES diff --git a/doc/ChangeLog b/doc/ChangeLog index eb54d5cf..8b32325b 100644 --- a/doc/ChangeLog +++ b/doc/ChangeLog @@ -1,3 +1,10 @@ +Mon Jun 13 22:28:02 2011 Arnold D. Robbins <arnold@skeeve.com> + + * gawk.texi: Document that POSIX now says [a-z] is undefined outside + the C and POSIX locales, so gawk treats it as the Good Lord intended + in all cases. Thanks to Paul Eggert for letting me know about this + and providing URLs to cite. + Fri May 27 09:59:38 2011 Arnold D. Robbins <arnold@skeeve.com> * gawk.1, gawk.texi: Minor edits w.r.t. the bug reporting address. diff --git a/doc/gawk.info b/doc/gawk.info index 39e2f76e..3dd9d731 100644 --- a/doc/gawk.info +++ b/doc/gawk.info @@ -171,7 +171,6 @@ texts being (a) (see below), and with the Back-Cover Texts being (b) * Case-sensitivity:: How to do case-insensitive matching. * Leftmost Longest:: How much text matches. * Computed Regexps:: Using Dynamic Regexps. -* Locales:: How the locale affects things. * Records:: Controlling how data is split into records. * Fields:: An introduction to fields. @@ -270,6 +269,7 @@ texts being (a) (see below), and with the Back-Cover Texts being (b) third subexpression. * Function Calls:: A function call is an expression. * Precedence:: How various operators nest. +* Locales:: How the locale affects things. * Pattern Overview:: What goes into a pattern. * Regexp Patterns:: Using regexps as patterns. 
* Expression Patterns:: Any expression can be used as a @@ -476,6 +476,8 @@ texts being (a) (see below), and with the Back-Cover Texts being (b) * POSIX/GNU:: The extensions in `gawk' not in POSIX `awk'. * Common Extensions:: Common Extensions Summary. +* Ranges and Locales:: How locales used to affect regexp + ranges. * Contributors:: The major contributors to `gawk'. * Gawk Distribution:: What is in the `gawk' @@ -2849,7 +2851,6 @@ you specify more complicated classes of strings. * Case-sensitivity:: How to do case-insensitive matching. * Leftmost Longest:: How much text matches. * Computed Regexps:: Using Dynamic Regexps. -* Locales:: How the locale affects things. File: gawk.info, Node: Regexp Usage, Next: Escape Sequences, Up: Regexp @@ -3235,15 +3236,16 @@ File: gawk.info, Node: Bracket Expressions, Next: GNU Regexp Operators, Prev: 3.4 Using Bracket Expressions ============================= -Within a bracket expression, a "range expression" consists of two -characters separated by a hyphen. It matches any single character that -sorts between the two characters, using the locale's collating sequence -and character set. For example, `[0-9]' is equivalent to -`[0123456789]'. +As mentioned earlier, a bracket expression matches any character amongst +those listed between the opening and closing square brackets. - Unfortunately, providing simple character ranges such as `[a-z]' -usually does not work like you might expect, due to locale-related -issues. This is discussed more fully, in *note Locales::. + Within a bracket expression, a "range expression" consists of two +characters separated by a hyphen. It matches any single character that +sorts between the two characters, based upon the system's native +character set. For example, `[0-9]' is equivalent to `[0123456789]'. +(See *note Ranges and Locales::, for an explanation of how the POSIX +standard and `gawk' have changed over time. This is mainly of +historical interest.) 
To include one of the characters `\', `]', `-', or `^' in a bracket expression, put a `\' in front of it. For example: @@ -3293,11 +3295,9 @@ Table 3.1: POSIX Character Classes For example, before the POSIX standard, you had to write `/[A-Za-z0-9]/' to match alphanumeric characters. If your character -set had other alphabetic characters in it, this would not match them, -and if your character set collated differently from ASCII, this might -not even match the ASCII alphanumeric characters. With the POSIX -character classes, you can write `/[[:alnum:]]/' to match the alphabetic -and numeric characters in your character set. +set had other alphabetic characters in it, this would not match them. +With the POSIX character classes, you can write `/[[:alnum:]]/' to +match the alphabetic and numeric characters in your character set. Two additional special sequences can appear in bracket expressions. These apply to non-ASCII character sets, which can have single symbols @@ -3528,7 +3528,7 @@ this principle is also important for regexp-based record and field splitting (*note Records::, and also *note Field Separators::). -File: gawk.info, Node: Computed Regexps, Next: Locales, Prev: Leftmost Longest, Up: Regexp +File: gawk.info, Node: Computed Regexps, Prev: Leftmost Longest, Up: Regexp 3.8 Using Dynamic Regexps ========================= @@ -3607,86 +3607,6 @@ be used inside a bracket expression for a dynamic regexp: often in practice, but it's worth noting for future reference. -File: gawk.info, Node: Locales, Prev: Computed Regexps, Up: Regexp - -3.9 Where You Are Makes A Difference -==================================== - -Modern systems support the notion of "locales": a way to tell the -system about the local character set and language. The current locale -setting can affect the way regexp matching works, often in surprising -ways. - - For example, in the default `"C"' locale, `[a-dx-z]' is equivalent to -`[abcdxyz]'. 
Many locales sort characters in dictionary order, and in -these locales, `[a-dx-z]' is typically not equivalent to `[abcdxyz]'; -instead it might be equivalent to `[aBbCcdXxYyz]', for example. - - This point needs to be emphasized: Much literature teaches that you -should use `[a-z]' to match a lowercase character. But on systems with -non-ASCII locales, this also matches all of the uppercase characters -except `Z'! This is a continuous cause of confusion, even well into -the twenty-first century. - - NOTE: In an attempt to end the confusion once and for all, when - not in POSIX mode (*note Options::), `gawk' expands ranges into - the characters they include, based only on the machine character - set. This restores the traditional, pre-POSIX, pre-locales - behavior. However, you should read the rest of this section so - that you can write portable scripts, instead of relying on - behavior specific to `gawk'. - - To obtain the traditional interpretation of bracket expressions, you -can use the `"C"' locale by setting the `LC_ALL' environment variable -to the value `C'. However, it is best to just use POSIX character -classes, such as `[[:lower:]]' to match specific classes of characters. - - To demonstrate these issues, the following example uses the `sub()' -function, which does text replacement (*note String Functions::). Here, -the intent is to remove trailing uppercase characters: - - $ echo something1234abc | gawk --posix '{ sub("[A-Z]*$", ""); print }' - -| something1234a - -This output is unexpected, since the `bc' at the end of -`something1234abc' should not normally match `[A-Z]*'. This result is -due to the locale setting (and thus you may not see it on your system). -There are two fixes. The first is to use the POSIX character class -`[[:upper:]]', instead of `[A-Z]'. (This is preferred, since then your -program will work everywhere.) 
- - The second is to change the locale setting in the environment, before -running `gawk', by using the shell statements: - - LANG=C LC_ALL=C - export LANG LC_ALL - - The setting `C' forces `gawk' to behave in the traditional Unix -manner, where case distinctions do matter. You may wish to put these -statements into your shell startup file, e.g., `$HOME/.profile'. - - Similar considerations apply to other ranges. For example, `["-/]' -is perfectly valid in ASCII, but is not valid in many Unicode locales, -such as `en_US.UTF-8'. (In general, such ranges should be avoided; -either list the characters individually, or use a POSIX character class -such as `[[:punct:]]'.) - - An additional factor relates to splitting records. For the normal -case of `RS = "\n"', the locale is largely irrelevant. For other -single-character record separators, using `LC_ALL=C' will give you much -better performance when reading records. Otherwise, `gawk' has to make -several function calls, _per input character_, to find the record -terminator. - - According to POSIX, string comparison is also affected by locales -(similar to regular expressions). The details are presented in *note -POSIX String Comparison::. - - Finally, the locale affects the value of the decimal point character -used when `gawk' parses input data. This is discussed in detail in -*note Conversion::. - - File: gawk.info, Node: Reading Files, Next: Printing, Prev: Regexp, Up: Top 4 Reading Input Files @@ -6451,6 +6371,7 @@ operators. * Truth Values and Conditions:: Testing for true and false. * Function Calls:: A function call is an expression. * Precedence:: How various operators nest. +* Locales:: How the locale affects things. 
File: gawk.info, Node: Values, Next: All Operators, Up: Expressions @@ -7897,7 +7818,7 @@ Here is a sample run: -| 5 1 -File: gawk.info, Node: Precedence, Prev: Function Calls, Up: Expressions +File: gawk.info, Node: Precedence, Next: Locales, Prev: Function Calls, Up: Expressions 6.5 Operator Precedence (How Operators Nest) ============================================ @@ -7998,6 +7919,33 @@ precedence: POSIX. For maximum portability, do not use them. +File: gawk.info, Node: Locales, Prev: Precedence, Up: Expressions + +6.6 Where You Are Makes A Difference +==================================== + +Modern systems support the notion of "locales": a way to tell the +system about the local character set and language. + + Once upon a time, the locale setting used to affect regexp matching +(*note Ranges and Locales::), but this is no longer true. + + Locales can affect record splitting. For the normal case of `RS = +"\n"', the locale is largely irrelevant. For other single-character +record separators, setting `LC_ALL=C' in the environment will give you +much better performance when reading records. Otherwise, `gawk' has to +make several function calls, _per input character_, to find the record +terminator. + + According to POSIX, string comparison is also affected by locales +(similar to regular expressions). The details are presented in *note +POSIX String Comparison::. + + Finally, the locale affects the value of the decimal point character +used when `gawk' parses input data. This is discussed in detail in +*note Conversion::. + + File: gawk.info, Node: Patterns and Actions, Next: Arrays, Prev: Expressions, Up: Top 7 Patterns, Actions, and Variables @@ -19753,6 +19701,7 @@ you can find more information. * POSIX/GNU:: The extensions in `gawk' not in POSIX `awk'. * Common Extensions:: Common Extensions Summary. +* Ranges and Locales:: How locales used to affect regexp ranges. * Contributors:: The major contributors to `gawk'. 
@@ -20066,7 +20015,7 @@ the current version of `gawk'. -File: gawk.info, Node: Common Extensions, Next: Contributors, Prev: POSIX/GNU, Up: Language History +File: gawk.info, Node: Common Extensions, Next: Ranges and Locales, Prev: POSIX/GNU, Up: Language History A.6 Common Extensions Summary ============================= @@ -20092,9 +20041,108 @@ Feature BWK Awk Mawk GNU Awk `BINMODE' variable X X -File: gawk.info, Node: Contributors, Prev: Common Extensions, Up: Language History +File: gawk.info, Node: Ranges and Locales, Next: Contributors, Prev: Common Extensions, Up: Language History + +A.7 Regexp Ranges and Locales: A Long Sad Story +=============================================== + +This minor node describes the confusing history of ranges within +regular expressions and their interactions with locales, and how this +affected different versions of `gawk'. + + The original Unix tools that worked with regular expressions defined +character ranges (such as `[a-z]') to match any character between the +first character in the range and the last character in the range, +inclusive. Ordering was based on the numeric value of each character +in the machine's native character set. Thus, on ASCII-based systems, +`[a-z]' matched all the lowercase letters, and only the lowercase +letters, since the numeric values for the letters from `a' through `z' +were contigous. (On an EBCDIC system, the range `[a-z]' includes +additional, non-alphabetic characters as well.) + + Almost all introductory Unix literature explained range expressions +as working in this fashion, and in particular, would teach that the +"correct" way to match lowercase letters was with `[a-z]', and that +`[A-Z]' was the the "correct" way to match uppercase letters. And +indeed, this was true. + + The 1993 POSIX standard introduced the idea of locales (*note +Locales::). 
Since many locales include other letters besides the plain +twenty-six letters of the American English alphabet, the POSIX standard +added character classes (*note Bracket Expressions::) as a way to match +different kinds of characters besides the traditional ones in the ASCII +character set. + + However, the standard _changed_ the interpretation of range +expressions. In the `"C"' and `"POSIX"' locales, a range expression +like `[a-dx-z]' is still equivalent to `[abcdxyz]', as in ASCII. But +outside those locales, the ordering was defined to be based on +"collation order". + + In many locales, `A' and `a' are both less than `B'. In other +words, these locales sort characters in dictionary order, and +`[a-dx-z]' is typically not equivalent to `[abcdxyz]'; instead it might +be equivalent to `[aBbCcdXxYyz]', for example. + + This point needs to be emphasized: Much literature teaches that you +should use `[a-z]' to match a lowercase character. But on systems with +non-ASCII locales, this also matched all of the uppercase characters +except `Z'! This was a continuous cause of confusion, even well into +the twenty-first century. + + To demonstrate these issues, the following example uses the `sub()' +function, which does text replacement (*note String Functions::). Here, +the intent is to remove trailing uppercase characters: + + $ echo something1234abc | gawk-3.1.8 '{ sub("[A-Z]*$", ""); print }' + -| something1234a + +This output is unexpected, since the `bc' at the end of +`something1234abc' should not normally match `[A-Z]*'. This result is +due to the locale setting (and thus you may not see it on your system). + + Similar considerations apply to other ranges. For example, `["-/]' +is perfectly valid in ASCII, but is not valid in many Unicode locales, +such as `en_US.UTF-8'. + + Early versions of `gawk' used regexp matching code that was not +locale aware, so ranges had their traditional interpretation. 
+ + When `gawk' switched to using locale-aware regexp matchers, the +problems began; especially as both GNU/Linux and commercial Unix +vendors started implementing non-ASCII locales, _and making them the +default_. Perhaps the most frequently asked question became something +like "why does `[A-Z]' match lowercase letters?!?" + + This situation existed for close to 10 years, if not more, and the +`gawk' maintainer grew weary of trying to explain that `gawk' was being +nicely standards-compliant, and that the issue was in the user's +locale. During the development of version 4.0, he modified `gawk' to +always treat ranges in the original, pre-POSIX fashion, unless +`--posix' was used (*note Options::). + + Fortunately, shortly before the final release of `gawk' 4.0, the +maintainer learned that the 2008 standard had changed the definition of +ranges, such that outside the `"C"' and `"POSIX"' locales, the meaning +of range expressions was _undefined_.(1) + + By using this lovely technical term, the standard gives license to +implementors to implement ranges in whatever way they choose. The +`gawk' maintainer chose to apply the pre-POSIX meaning in all cases: +the default regexp matching; with `--traditional', and with `--posix'; +in all cases, `gawk' remains POSIX compliant. + + ---------- Footnotes ---------- + + (1) See the standard +(http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03_05) +and its rationale +(http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap09.html#tag_21_09_03_05). + + +File: gawk.info, Node: Contributors, Prev: Ranges and Locales, Up: Language History -A.7 Major Contributors to `gawk' +A.8 Major Contributors to `gawk' ================================ Always give credit where credit is due. @@ -24595,7 +24643,7 @@ Index * - (hyphen), -= operator <1>: Precedence. (line 95) * - (hyphen), -= operator: Assignment Ops. (line 129) * - (hyphen), filenames beginning with: Options. 
(line 59) -* - (hyphen), in bracket expressions: Bracket Expressions. (line 16) +* - (hyphen), in bracket expressions: Bracket Expressions. (line 17) * --assign option: Options. (line 32) * --c option: Options. (line 78) * --characters-as-bytes option: Options. (line 68) @@ -24765,7 +24813,7 @@ Index (line 44) * \ (backslash), gsub()/gensub()/sub() functions and: Gory Details. (line 6) -* \ (backslash), in bracket expressions: Bracket Expressions. (line 16) +* \ (backslash), in bracket expressions: Bracket Expressions. (line 17) * \ (backslash), in escape sequences: Escape Sequences. (line 6) * \ (backslash), in escape sequences, POSIX and: Escape Sequences. (line 113) @@ -24776,7 +24824,7 @@ Index * ^ (caret), ^ operator: Precedence. (line 49) * ^ (caret), ^= operator <1>: Precedence. (line 95) * ^ (caret), ^= operator: Assignment Ops. (line 129) -* ^ (caret), in bracket expressions: Bracket Expressions. (line 16) +* ^ (caret), in bracket expressions: Bracket Expressions. (line 17) * ^, in FS: Regexp Field Splitting. (line 59) * _ (underscore), _ C macro: Explaining gettext. (line 70) @@ -25024,7 +25072,7 @@ Index (line 44) * backslash (\), gsub()/gensub()/sub() functions and: Gory Details. (line 6) -* backslash (\), in bracket expressions: Bracket Expressions. (line 16) +* backslash (\), in bracket expressions: Bracket Expressions. (line 17) * backslash (\), in escape sequences: Escape Sequences. (line 6) * backslash (\), in escape sequences, POSIX and: Escape Sequences. (line 113) @@ -25084,15 +25132,15 @@ Index * bracket expressions <1>: Bracket Expressions. (line 6) * bracket expressions: Regexp Operators. (line 55) * bracket expressions, character classes: Bracket Expressions. - (line 29) + (line 30) * bracket expressions, collating elements: Bracket Expressions. - (line 70) + (line 69) * bracket expressions, collating symbols: Bracket Expressions. - (line 77) + (line 76) * bracket expressions, complemented: Regexp Operators. 
(line 63) * bracket expressions, equivalence classes: Bracket Expressions. - (line 83) -* bracket expressions, non-ASCII: Bracket Expressions. (line 70) + (line 82) +* bracket expressions, non-ASCII: Bracket Expressions. (line 69) * bracket expressions, range expressions: Bracket Expressions. (line 6) * break debugger command: Breakpoint Control. (line 11) @@ -25135,7 +25183,7 @@ Index * caret (^), ^ operator: Precedence. (line 49) * caret (^), ^= operator <1>: Precedence. (line 95) * caret (^), ^= operator: Assignment Ops. (line 129) -* caret (^), in bracket expressions: Bracket Expressions. (line 16) +* caret (^), in bracket expressions: Bracket Expressions. (line 17) * case keyword: Switch Statement. (line 6) * case sensitivity, array indices and: Array Intro. (line 92) * case sensitivity, converting case: String Functions. (line 522) @@ -25175,8 +25223,8 @@ Index * Close, Diane <1>: Contributors. (line 21) * Close, Diane: Manual History. (line 41) * close_func() input method: Internals. (line 160) -* collating elements: Bracket Expressions. (line 70) -* collating symbols: Bracket Expressions. (line 77) +* collating elements: Bracket Expressions. (line 69) +* collating symbols: Bracket Expressions. (line 76) * Colombo, Antonio: Acknowledgments. (line 60) * columns, aligning: Print Examples. (line 70) * columns, cutting: Cut Program. (line 6) @@ -25552,7 +25600,7 @@ Index * e debugger command (alias for enable): Breakpoint Control. (line 72) * EBCDIC: Ordinal Functions. (line 45) * egrep utility <1>: Egrep Program. (line 6) -* egrep utility: Bracket Expressions. (line 23) +* egrep utility: Bracket Expressions. (line 24) * egrep.awk program: Egrep Program. (line 54) * elements in arrays: Reference to Elements. (line 6) @@ -25596,7 +25644,7 @@ Index * equals sign (=), == operator <1>: Precedence. (line 65) * equals sign (=), == operator: Comparison Operators. (line 11) -* EREs (Extended Regular Expressions): Bracket Expressions. 
(line 23) +* EREs (Extended Regular Expressions): Bracket Expressions. (line 24) * ERRNO variable <1>: Internals. (line 139) * ERRNO variable <2>: TCP/IP Networking. (line 54) * ERRNO variable <3>: Auto-set. (line 72) @@ -25645,7 +25693,7 @@ Index * expressions, matching, See comparison expressions: Typing and Comparison. (line 9) * expressions, selecting: Conditional Exp. (line 6) -* Extended Regular Expressions (EREs): Bracket Expressions. (line 23) +* Extended Regular Expressions (EREs): Bracket Expressions. (line 24) * eXtensible Markup Language (XML): Internals. (line 160) * extension() function (gawk): Using Internal File Ops. (line 15) @@ -25888,7 +25936,7 @@ Index * gawk, bitwise operations in: Bitwise Functions. (line 39) * gawk, break statement in: Break Statement. (line 51) * gawk, built-in variables and: Built-in Variables. (line 14) -* gawk, character classes and: Bracket Expressions. (line 91) +* gawk, character classes and: Bracket Expressions. (line 90) * gawk, coding style in: Adding Code. (line 38) * gawk, command-line options: GNU Regexp Operators. (line 70) @@ -26074,7 +26122,7 @@ Index * hyphen (-), -= operator <1>: Precedence. (line 95) * hyphen (-), -= operator: Assignment Ops. (line 129) * hyphen (-), filenames beginning with: Options. (line 59) -* hyphen (-), in bracket expressions: Bracket Expressions. (line 16) +* hyphen (-), in bracket expressions: Bracket Expressions. (line 17) * i debugger command (alias for info): Dgawk Info. (line 12) * id utility: Id Program. (line 6) * id.awk program: Id Program. (line 30) @@ -26177,7 +26225,7 @@ Index (line 13) * internationalization, localization: User-modified. (line 153) * internationalization, localization, character classes: Bracket Expressions. - (line 91) + (line 90) * internationalization, localization, gawk and: Internationalization. (line 13) * internationalization, localization, locale categories: Explaining gettext. 
@@ -26618,9 +26666,9 @@ Index * POSIX awk, backslashes in string constants: Escape Sequences. (line 113) * POSIX awk, BEGIN/END patterns: I/O And BEGIN/END. (line 16) -* POSIX awk, bracket expressions and: Bracket Expressions. (line 23) +* POSIX awk, bracket expressions and: Bracket Expressions. (line 24) * POSIX awk, bracket expressions and, character classes: Bracket Expressions. - (line 29) + (line 30) * POSIX awk, break statement and: Break Statement. (line 51) * POSIX awk, changes in awk versions: POSIX. (line 6) * POSIX awk, continue statement and: Continue Statement. (line 43) @@ -27314,411 +27362,413 @@ Index Tag Table: Node: Top1346 -Node: Foreword33320 -Node: Preface37665 -Ref: Preface-Footnote-140632 -Ref: Preface-Footnote-240738 -Node: History40970 -Node: Names43361 -Ref: Names-Footnote-144838 -Node: This Manual44910 -Ref: This Manual-Footnote-149857 -Node: Conventions49957 -Node: Manual History52091 -Ref: Manual History-Footnote-155361 -Ref: Manual History-Footnote-255402 -Node: How To Contribute55476 -Node: Acknowledgments56620 -Node: Getting Started60951 -Node: Running gawk63330 -Node: One-shot64516 -Node: Read Terminal65741 -Ref: Read Terminal-Footnote-167391 -Ref: Read Terminal-Footnote-267667 -Node: Long67838 -Node: Executable Scripts69214 -Ref: Executable Scripts-Footnote-171083 -Ref: Executable Scripts-Footnote-271185 -Node: Comments71636 -Node: Quoting74103 -Node: DOS Quoting78726 -Node: Sample Data Files79401 -Node: Very Simple82433 -Node: Two Rules87032 -Node: More Complex89179 -Ref: More Complex-Footnote-192109 -Node: Statements/Lines92194 -Ref: Statements/Lines-Footnote-196656 -Node: Other Features96921 -Node: When97849 -Node: Invoking Gawk99996 -Node: Command Line101381 -Node: Options102164 -Ref: Options-Footnote-1115442 -Node: Other Arguments115467 -Node: Naming Standard Input118125 -Node: Environment Variables119219 -Node: AWKPATH Variable119663 -Ref: AWKPATH Variable-Footnote-1122260 -Node: Other Environment Variables122520 -Node: Exit 
Status124860 -Node: Include Files125535 -Node: Obsolete129020 -Node: Undocumented129706 -Node: Regexp129947 -Node: Regexp Usage131399 -Node: Escape Sequences133425 -Node: Regexp Operators139188 -Ref: Regexp Operators-Footnote-1146385 -Ref: Regexp Operators-Footnote-2146532 -Node: Bracket Expressions146630 -Ref: table-char-classes148433 -Node: GNU Regexp Operators151077 -Node: Case-sensitivity154800 -Ref: Case-sensitivity-Footnote-1157768 -Ref: Case-sensitivity-Footnote-2158003 -Node: Leftmost Longest158111 -Node: Computed Regexps159312 -Node: Locales162738 -Node: Reading Files166445 -Node: Records168386 -Ref: Records-Footnote-1177060 -Node: Fields177097 -Ref: Fields-Footnote-1180130 -Node: Nonconstant Fields180216 -Node: Changing Fields182418 -Node: Field Separators188396 -Node: Default Field Splitting191025 -Node: Regexp Field Splitting192142 -Node: Single Character Fields195484 -Node: Command Line Field Separator196543 -Node: Field Splitting Summary199984 -Ref: Field Splitting Summary-Footnote-1203176 -Node: Constant Size203277 -Node: Splitting By Content207861 -Ref: Splitting By Content-Footnote-1211587 -Node: Multiple Line211627 -Ref: Multiple Line-Footnote-1217474 -Node: Getline217653 -Node: Plain Getline219881 -Node: Getline/Variable221970 -Node: Getline/File223111 -Node: Getline/Variable/File224433 -Ref: Getline/Variable/File-Footnote-1226032 -Node: Getline/Pipe226119 -Node: Getline/Variable/Pipe228679 -Node: Getline/Coprocess229786 -Node: Getline/Variable/Coprocess231029 -Node: Getline Notes231743 -Node: Getline Summary233685 -Ref: table-getline-variants234028 -Node: Command line directories234884 -Node: Printing235509 -Node: Print237140 -Node: Print Examples238477 -Node: Output Separators241261 -Node: OFMT243021 -Node: Printf244379 -Node: Basic Printf245285 -Node: Control Letters246824 -Node: Format Modifiers250636 -Node: Printf Examples256645 -Node: Redirection259360 -Node: Special Files266344 -Node: Special FD266877 -Ref: Special FD-Footnote-1270502 
-Node: Special Network270576 -Node: Special Caveats271426 -Node: Close Files And Pipes272222 -Ref: Close Files And Pipes-Footnote-1279245 -Ref: Close Files And Pipes-Footnote-2279393 -Node: Expressions279543 -Node: Values280612 -Node: Constants281288 -Node: Scalar Constants281968 -Ref: Scalar Constants-Footnote-1282827 -Node: Nondecimal-numbers283009 -Node: Regexp Constants286068 -Node: Using Constant Regexps286543 -Node: Variables289598 -Node: Using Variables290253 -Node: Assignment Options291977 -Node: Conversion293849 -Ref: table-locale-affects299225 -Ref: Conversion-Footnote-1299849 -Node: All Operators299958 -Node: Arithmetic Ops300588 -Node: Concatenation303093 -Ref: Concatenation-Footnote-1305886 -Node: Assignment Ops306006 -Ref: table-assign-ops310994 -Node: Increment Ops312402 -Node: Truth Values and Conditions315872 -Node: Truth Values316955 -Node: Typing and Comparison318004 -Node: Variable Typing318793 -Ref: Variable Typing-Footnote-1322690 -Node: Comparison Operators322812 -Ref: table-relational-ops323222 -Node: POSIX String Comparison326771 -Ref: POSIX String Comparison-Footnote-1327727 -Node: Boolean Ops327865 -Ref: Boolean Ops-Footnote-1331943 -Node: Conditional Exp332034 -Node: Function Calls333766 -Node: Precedence337360 -Node: Patterns and Actions341013 -Node: Pattern Overview342067 -Node: Regexp Patterns343733 -Node: Expression Patterns344276 -Node: Ranges347850 -Node: BEGIN/END350816 -Node: Using BEGIN/END351578 -Ref: Using BEGIN/END-Footnote-1354309 -Node: I/O And BEGIN/END354415 -Node: BEGINFILE/ENDFILE356697 -Node: Empty359530 -Node: Using Shell Variables359846 -Node: Action Overview362131 -Node: Statements364488 -Node: If Statement366342 -Node: While Statement367841 -Node: Do Statement369885 -Node: For Statement371041 -Node: Switch Statement374193 -Node: Break Statement376290 -Node: Continue Statement378280 -Node: Next Statement380067 -Node: Nextfile Statement382457 -Node: Exit Statement384754 -Node: Built-in Variables387170 -Node: 
User-modified388265 -Ref: User-modified-Footnote-1396291 -Node: Auto-set396353 -Ref: Auto-set-Footnote-1405644 -Node: ARGC and ARGV405849 -Node: Arrays409700 -Node: Array Basics411205 -Node: Array Intro411916 -Node: Reference to Elements416234 -Node: Assigning Elements418504 -Node: Array Example418995 -Node: Scanning an Array420727 -Node: Delete423393 -Ref: Delete-Footnote-1425828 -Node: Numeric Array Subscripts425885 -Node: Uninitialized Subscripts428068 -Node: Multi-dimensional429696 -Node: Multi-scanning432790 -Node: Arrays of Arrays434374 -Node: Functions438951 -Node: Built-in439773 -Node: Calling Built-in440851 -Node: Numeric Functions442839 -Ref: Numeric Functions-Footnote-1446604 -Ref: Numeric Functions-Footnote-2446961 -Ref: Numeric Functions-Footnote-3447009 -Node: String Functions447278 -Ref: String Functions-Footnote-1470775 -Ref: String Functions-Footnote-2470904 -Ref: String Functions-Footnote-3471152 -Node: Gory Details471239 -Ref: table-sub-escapes472918 -Ref: table-posix-sub474232 -Ref: table-gensub-escapes475145 -Node: I/O Functions476316 -Ref: I/O Functions-Footnote-1482971 -Node: Time Functions483118 -Ref: Time Functions-Footnote-1494010 -Ref: Time Functions-Footnote-2494078 -Ref: Time Functions-Footnote-3494236 -Ref: Time Functions-Footnote-4494347 -Ref: Time Functions-Footnote-5494459 -Ref: Time Functions-Footnote-6494686 -Node: Bitwise Functions494952 -Ref: table-bitwise-ops495510 -Ref: Bitwise Functions-Footnote-1499670 -Node: Type Functions499854 -Node: I18N Functions500324 -Node: User-defined501951 -Node: Definition Syntax502755 -Ref: Definition Syntax-Footnote-1507665 -Node: Function Example507734 -Node: Function Caveats510328 -Node: Calling A Function510749 -Node: Variable Scope511864 -Node: Pass By Value/Reference513839 -Node: Return Statement517279 -Node: Dynamic Typing520260 -Node: Indirect Calls520995 -Node: Internationalization530680 -Node: I18N and L10N532106 -Node: Explaining gettext532792 -Ref: Explaining gettext-Footnote-1537858 
-Ref: Explaining gettext-Footnote-2538042 -Node: Programmer i18n538207 -Node: Translator i18n542407 -Node: String Extraction543200 -Ref: String Extraction-Footnote-1544161 -Node: Printf Ordering544247 -Ref: Printf Ordering-Footnote-1547031 -Node: I18N Portability547095 -Ref: I18N Portability-Footnote-1549544 -Node: I18N Example549607 -Ref: I18N Example-Footnote-1552242 -Node: Gawk I18N552314 -Node: Advanced Features552931 -Node: Nondecimal Data554444 -Node: Array Sorting556027 -Node: Controlling Array Traversal556727 -Node: Controlling Scanning With A Function557474 -Node: Controlling Scanning565177 -Ref: Controlling Scanning-Footnote-1568978 -Node: Array Sorting Functions569294 -Ref: Array Sorting Functions-Footnote-1572810 -Ref: Array Sorting Functions-Footnote-2572903 -Node: Two-way I/O573097 -Ref: Two-way I/O-Footnote-1578529 -Node: TCP/IP Networking578599 -Node: Profiling581443 -Node: Library Functions588917 -Ref: Library Functions-Footnote-1591924 -Node: Library Names592095 -Ref: Library Names-Footnote-1595566 -Ref: Library Names-Footnote-2595786 -Node: General Functions595872 -Node: Strtonum Function596825 -Node: Assert Function599755 -Node: Round Function603081 -Node: Cliff Random Function604624 -Node: Ordinal Functions605640 -Ref: Ordinal Functions-Footnote-1608710 -Ref: Ordinal Functions-Footnote-2608962 -Node: Join Function609171 -Ref: Join Function-Footnote-1610942 -Node: Gettimeofday Function611142 -Node: Data File Management614857 -Node: Filetrans Function615489 -Node: Rewind Function619628 -Node: File Checking621015 -Node: Empty Files622109 -Node: Ignoring Assigns624339 -Node: Getopt Function625892 -Ref: Getopt Function-Footnote-1637196 -Node: Passwd Functions637399 -Ref: Passwd Functions-Footnote-1646374 -Node: Group Functions646462 -Node: Walking Arrays654546 -Node: Sample Programs656115 -Node: Running Examples656780 -Node: Clones657508 -Node: Cut Program658732 -Node: Egrep Program668577 -Ref: Egrep Program-Footnote-1676350 -Node: Id Program676460 
-Node: Split Program680076 -Ref: Split Program-Footnote-1683595 -Node: Tee Program683723 -Node: Uniq Program686526 -Node: Wc Program693955 -Ref: Wc Program-Footnote-1698221 -Ref: Wc Program-Footnote-2698421 -Node: Miscellaneous Programs698513 -Node: Dupword Program699701 -Node: Alarm Program701732 -Node: Translate Program706481 -Ref: Translate Program-Footnote-1710868 -Ref: Translate Program-Footnote-2711096 -Node: Labels Program711230 -Ref: Labels Program-Footnote-1714601 -Node: Word Sorting714685 -Node: History Sorting718569 -Node: Extract Program720408 -Ref: Extract Program-Footnote-1727891 -Node: Simple Sed728019 -Node: Igawk Program731081 -Ref: Igawk Program-Footnote-1746238 -Ref: Igawk Program-Footnote-2746439 -Node: Anagram Program746577 -Node: Signature Program749645 -Node: Debugger750745 -Node: Debugging751656 -Node: Debugging Concepts752069 -Node: Debugging Terms753925 -Node: Awk Debugging756548 -Node: Sample dgawk session757440 -Node: dgawk invocation757932 -Node: Finding The Bug759114 -Node: List of Debugger Commands765600 -Node: Breakpoint Control766911 -Node: Dgawk Execution Control770547 -Node: Viewing And Changing Data773898 -Node: Dgawk Stack777235 -Node: Dgawk Info778695 -Node: Miscellaneous Dgawk Commands782643 -Node: Readline Support788071 -Node: Dgawk Limitations788909 -Node: Language History791098 -Node: V7/SVR3.1792536 -Node: SVR4794857 -Node: POSIX796299 -Node: BTL797307 -Node: POSIX/GNU798041 -Node: Common Extensions803192 -Node: Contributors804293 -Node: Installation808554 -Node: Gawk Distribution809448 -Node: Getting809932 -Node: Extracting810758 -Node: Distribution contents812450 -Node: Unix Installation817672 -Node: Quick Installation818289 -Node: Additional Configuration Options820251 -Node: Configuration Philosophy821728 -Node: Non-Unix Installation824070 -Node: PC Installation824528 -Node: PC Binary Installation825827 -Node: PC Compiling827675 -Node: PC Testing830619 -Node: PC Using831795 -Node: Cygwin835980 -Node: MSYS836980 -Node: 
VMS Installation837494 -Node: VMS Compilation838097 -Ref: VMS Compilation-Footnote-1839104 -Node: VMS Installation Details839162 -Node: VMS Running840797 -Node: VMS Old Gawk842404 -Node: Bugs842878 -Node: Other Versions846731 -Node: Notes852012 -Node: Compatibility Mode852704 -Node: Additions853487 -Node: Accessing The Source854299 -Node: Adding Code855724 -Node: New Ports861691 -Node: Dynamic Extensions865804 -Node: Internals867180 -Node: Plugin License876283 -Node: Sample Library876917 -Node: Internal File Description877603 -Node: Internal File Ops881318 -Ref: Internal File Ops-Footnote-1886099 -Node: Using Internal File Ops886239 -Node: Future Extensions888616 -Node: Basic Concepts891120 -Node: Basic High Level891877 -Ref: Basic High Level-Footnote-1895912 -Node: Basic Data Typing896097 -Node: Floating Point Issues900622 -Node: String Conversion Precision901705 -Ref: String Conversion Precision-Footnote-1903405 -Node: Unexpected Results903514 -Node: POSIX Floating Point Problems905340 -Ref: POSIX Floating Point Problems-Footnote-1909045 -Node: Glossary909083 -Node: Copying934059 -Node: GNU Free Documentation License971616 -Node: Index996753 +Node: Foreword33440 +Node: Preface37785 +Ref: Preface-Footnote-140752 +Ref: Preface-Footnote-240858 +Node: History41090 +Node: Names43481 +Ref: Names-Footnote-144958 +Node: This Manual45030 +Ref: This Manual-Footnote-149977 +Node: Conventions50077 +Node: Manual History52211 +Ref: Manual History-Footnote-155481 +Ref: Manual History-Footnote-255522 +Node: How To Contribute55596 +Node: Acknowledgments56740 +Node: Getting Started61071 +Node: Running gawk63450 +Node: One-shot64636 +Node: Read Terminal65861 +Ref: Read Terminal-Footnote-167511 +Ref: Read Terminal-Footnote-267787 +Node: Long67958 +Node: Executable Scripts69334 +Ref: Executable Scripts-Footnote-171203 +Ref: Executable Scripts-Footnote-271305 +Node: Comments71756 +Node: Quoting74223 +Node: DOS Quoting78846 +Node: Sample Data Files79521 +Node: Very Simple82553 +Node: 
Two Rules87152 +Node: More Complex89299 +Ref: More Complex-Footnote-192229 +Node: Statements/Lines92314 +Ref: Statements/Lines-Footnote-196776 +Node: Other Features97041 +Node: When97969 +Node: Invoking Gawk100116 +Node: Command Line101501 +Node: Options102284 +Ref: Options-Footnote-1115562 +Node: Other Arguments115587 +Node: Naming Standard Input118245 +Node: Environment Variables119339 +Node: AWKPATH Variable119783 +Ref: AWKPATH Variable-Footnote-1122380 +Node: Other Environment Variables122640 +Node: Exit Status124980 +Node: Include Files125655 +Node: Obsolete129140 +Node: Undocumented129826 +Node: Regexp130067 +Node: Regexp Usage131456 +Node: Escape Sequences133482 +Node: Regexp Operators139245 +Ref: Regexp Operators-Footnote-1146442 +Ref: Regexp Operators-Footnote-2146589 +Node: Bracket Expressions146687 +Ref: table-char-classes148577 +Node: GNU Regexp Operators151100 +Node: Case-sensitivity154823 +Ref: Case-sensitivity-Footnote-1157791 +Ref: Case-sensitivity-Footnote-2158026 +Node: Leftmost Longest158134 +Node: Computed Regexps159335 +Node: Reading Files162745 +Node: Records164686 +Ref: Records-Footnote-1173360 +Node: Fields173397 +Ref: Fields-Footnote-1176430 +Node: Nonconstant Fields176516 +Node: Changing Fields178718 +Node: Field Separators184696 +Node: Default Field Splitting187325 +Node: Regexp Field Splitting188442 +Node: Single Character Fields191784 +Node: Command Line Field Separator192843 +Node: Field Splitting Summary196284 +Ref: Field Splitting Summary-Footnote-1199476 +Node: Constant Size199577 +Node: Splitting By Content204161 +Ref: Splitting By Content-Footnote-1207887 +Node: Multiple Line207927 +Ref: Multiple Line-Footnote-1213774 +Node: Getline213953 +Node: Plain Getline216181 +Node: Getline/Variable218270 +Node: Getline/File219411 +Node: Getline/Variable/File220733 +Ref: Getline/Variable/File-Footnote-1222332 +Node: Getline/Pipe222419 +Node: Getline/Variable/Pipe224979 +Node: Getline/Coprocess226086 +Node: Getline/Variable/Coprocess227329 
+Node: Getline Notes228043 +Node: Getline Summary229985 +Ref: table-getline-variants230328 +Node: Command line directories231184 +Node: Printing231809 +Node: Print233440 +Node: Print Examples234777 +Node: Output Separators237561 +Node: OFMT239321 +Node: Printf240679 +Node: Basic Printf241585 +Node: Control Letters243124 +Node: Format Modifiers246936 +Node: Printf Examples252945 +Node: Redirection255660 +Node: Special Files262644 +Node: Special FD263177 +Ref: Special FD-Footnote-1266802 +Node: Special Network266876 +Node: Special Caveats267726 +Node: Close Files And Pipes268522 +Ref: Close Files And Pipes-Footnote-1275545 +Ref: Close Files And Pipes-Footnote-2275693 +Node: Expressions275843 +Node: Values276975 +Node: Constants277651 +Node: Scalar Constants278331 +Ref: Scalar Constants-Footnote-1279190 +Node: Nondecimal-numbers279372 +Node: Regexp Constants282431 +Node: Using Constant Regexps282906 +Node: Variables285961 +Node: Using Variables286616 +Node: Assignment Options288340 +Node: Conversion290212 +Ref: table-locale-affects295588 +Ref: Conversion-Footnote-1296212 +Node: All Operators296321 +Node: Arithmetic Ops296951 +Node: Concatenation299456 +Ref: Concatenation-Footnote-1302249 +Node: Assignment Ops302369 +Ref: table-assign-ops307357 +Node: Increment Ops308765 +Node: Truth Values and Conditions312235 +Node: Truth Values313318 +Node: Typing and Comparison314367 +Node: Variable Typing315156 +Ref: Variable Typing-Footnote-1319053 +Node: Comparison Operators319175 +Ref: table-relational-ops319585 +Node: POSIX String Comparison323134 +Ref: POSIX String Comparison-Footnote-1324090 +Node: Boolean Ops324228 +Ref: Boolean Ops-Footnote-1328306 +Node: Conditional Exp328397 +Node: Function Calls330129 +Node: Precedence333723 +Node: Locales337392 +Node: Patterns and Actions338481 +Node: Pattern Overview339535 +Node: Regexp Patterns341201 +Node: Expression Patterns341744 +Node: Ranges345318 +Node: BEGIN/END348284 +Node: Using BEGIN/END349046 +Ref: Using 
BEGIN/END-Footnote-1351777 +Node: I/O And BEGIN/END351883 +Node: BEGINFILE/ENDFILE354165 +Node: Empty356998 +Node: Using Shell Variables357314 +Node: Action Overview359599 +Node: Statements361956 +Node: If Statement363810 +Node: While Statement365309 +Node: Do Statement367353 +Node: For Statement368509 +Node: Switch Statement371661 +Node: Break Statement373758 +Node: Continue Statement375748 +Node: Next Statement377535 +Node: Nextfile Statement379925 +Node: Exit Statement382222 +Node: Built-in Variables384638 +Node: User-modified385733 +Ref: User-modified-Footnote-1393759 +Node: Auto-set393821 +Ref: Auto-set-Footnote-1403112 +Node: ARGC and ARGV403317 +Node: Arrays407168 +Node: Array Basics408673 +Node: Array Intro409384 +Node: Reference to Elements413702 +Node: Assigning Elements415972 +Node: Array Example416463 +Node: Scanning an Array418195 +Node: Delete420861 +Ref: Delete-Footnote-1423296 +Node: Numeric Array Subscripts423353 +Node: Uninitialized Subscripts425536 +Node: Multi-dimensional427164 +Node: Multi-scanning430258 +Node: Arrays of Arrays431842 +Node: Functions436419 +Node: Built-in437241 +Node: Calling Built-in438319 +Node: Numeric Functions440307 +Ref: Numeric Functions-Footnote-1444072 +Ref: Numeric Functions-Footnote-2444429 +Ref: Numeric Functions-Footnote-3444477 +Node: String Functions444746 +Ref: String Functions-Footnote-1468243 +Ref: String Functions-Footnote-2468372 +Ref: String Functions-Footnote-3468620 +Node: Gory Details468707 +Ref: table-sub-escapes470386 +Ref: table-posix-sub471700 +Ref: table-gensub-escapes472613 +Node: I/O Functions473784 +Ref: I/O Functions-Footnote-1480439 +Node: Time Functions480586 +Ref: Time Functions-Footnote-1491478 +Ref: Time Functions-Footnote-2491546 +Ref: Time Functions-Footnote-3491704 +Ref: Time Functions-Footnote-4491815 +Ref: Time Functions-Footnote-5491927 +Ref: Time Functions-Footnote-6492154 +Node: Bitwise Functions492420 +Ref: table-bitwise-ops492978 +Ref: Bitwise Functions-Footnote-1497138 +Node: 
Type Functions497322 +Node: I18N Functions497792 +Node: User-defined499419 +Node: Definition Syntax500223 +Ref: Definition Syntax-Footnote-1505133 +Node: Function Example505202 +Node: Function Caveats507796 +Node: Calling A Function508217 +Node: Variable Scope509332 +Node: Pass By Value/Reference511307 +Node: Return Statement514747 +Node: Dynamic Typing517728 +Node: Indirect Calls518463 +Node: Internationalization528148 +Node: I18N and L10N529574 +Node: Explaining gettext530260 +Ref: Explaining gettext-Footnote-1535326 +Ref: Explaining gettext-Footnote-2535510 +Node: Programmer i18n535675 +Node: Translator i18n539875 +Node: String Extraction540668 +Ref: String Extraction-Footnote-1541629 +Node: Printf Ordering541715 +Ref: Printf Ordering-Footnote-1544499 +Node: I18N Portability544563 +Ref: I18N Portability-Footnote-1547012 +Node: I18N Example547075 +Ref: I18N Example-Footnote-1549710 +Node: Gawk I18N549782 +Node: Advanced Features550399 +Node: Nondecimal Data551912 +Node: Array Sorting553495 +Node: Controlling Array Traversal554195 +Node: Controlling Scanning With A Function554942 +Node: Controlling Scanning562645 +Ref: Controlling Scanning-Footnote-1566446 +Node: Array Sorting Functions566762 +Ref: Array Sorting Functions-Footnote-1570278 +Ref: Array Sorting Functions-Footnote-2570371 +Node: Two-way I/O570565 +Ref: Two-way I/O-Footnote-1575997 +Node: TCP/IP Networking576067 +Node: Profiling578911 +Node: Library Functions586385 +Ref: Library Functions-Footnote-1589392 +Node: Library Names589563 +Ref: Library Names-Footnote-1593034 +Ref: Library Names-Footnote-2593254 +Node: General Functions593340 +Node: Strtonum Function594293 +Node: Assert Function597223 +Node: Round Function600549 +Node: Cliff Random Function602092 +Node: Ordinal Functions603108 +Ref: Ordinal Functions-Footnote-1606178 +Ref: Ordinal Functions-Footnote-2606430 +Node: Join Function606639 +Ref: Join Function-Footnote-1608410 +Node: Gettimeofday Function608610 +Node: Data File Management612325 
+Node: Filetrans Function612957 +Node: Rewind Function617096 +Node: File Checking618483 +Node: Empty Files619577 +Node: Ignoring Assigns621807 +Node: Getopt Function623360 +Ref: Getopt Function-Footnote-1634664 +Node: Passwd Functions634867 +Ref: Passwd Functions-Footnote-1643842 +Node: Group Functions643930 +Node: Walking Arrays652014 +Node: Sample Programs653583 +Node: Running Examples654248 +Node: Clones654976 +Node: Cut Program656200 +Node: Egrep Program666045 +Ref: Egrep Program-Footnote-1673818 +Node: Id Program673928 +Node: Split Program677544 +Ref: Split Program-Footnote-1681063 +Node: Tee Program681191 +Node: Uniq Program683994 +Node: Wc Program691423 +Ref: Wc Program-Footnote-1695689 +Ref: Wc Program-Footnote-2695889 +Node: Miscellaneous Programs695981 +Node: Dupword Program697169 +Node: Alarm Program699200 +Node: Translate Program703949 +Ref: Translate Program-Footnote-1708336 +Ref: Translate Program-Footnote-2708564 +Node: Labels Program708698 +Ref: Labels Program-Footnote-1712069 +Node: Word Sorting712153 +Node: History Sorting716037 +Node: Extract Program717876 +Ref: Extract Program-Footnote-1725359 +Node: Simple Sed725487 +Node: Igawk Program728549 +Ref: Igawk Program-Footnote-1743706 +Ref: Igawk Program-Footnote-2743907 +Node: Anagram Program744045 +Node: Signature Program747113 +Node: Debugger748213 +Node: Debugging749124 +Node: Debugging Concepts749537 +Node: Debugging Terms751393 +Node: Awk Debugging754016 +Node: Sample dgawk session754908 +Node: dgawk invocation755400 +Node: Finding The Bug756582 +Node: List of Debugger Commands763068 +Node: Breakpoint Control764379 +Node: Dgawk Execution Control768015 +Node: Viewing And Changing Data771366 +Node: Dgawk Stack774703 +Node: Dgawk Info776163 +Node: Miscellaneous Dgawk Commands780111 +Node: Readline Support785539 +Node: Dgawk Limitations786377 +Node: Language History788566 +Node: V7/SVR3.1790078 +Node: SVR4792399 +Node: POSIX793841 +Node: BTL794849 +Node: POSIX/GNU795583 +Node: Common 
Extensions800734 +Node: Ranges and Locales801841 +Ref: Ranges and Locales-Footnote-1806448 +Node: Contributors806669 +Node: Installation810931 +Node: Gawk Distribution811825 +Node: Getting812309 +Node: Extracting813135 +Node: Distribution contents814827 +Node: Unix Installation820049 +Node: Quick Installation820666 +Node: Additional Configuration Options822628 +Node: Configuration Philosophy824105 +Node: Non-Unix Installation826447 +Node: PC Installation826905 +Node: PC Binary Installation828204 +Node: PC Compiling830052 +Node: PC Testing832996 +Node: PC Using834172 +Node: Cygwin838357 +Node: MSYS839357 +Node: VMS Installation839871 +Node: VMS Compilation840474 +Ref: VMS Compilation-Footnote-1841481 +Node: VMS Installation Details841539 +Node: VMS Running843174 +Node: VMS Old Gawk844781 +Node: Bugs845255 +Node: Other Versions849108 +Node: Notes854389 +Node: Compatibility Mode855081 +Node: Additions855864 +Node: Accessing The Source856676 +Node: Adding Code858101 +Node: New Ports864068 +Node: Dynamic Extensions868181 +Node: Internals869557 +Node: Plugin License878660 +Node: Sample Library879294 +Node: Internal File Description879980 +Node: Internal File Ops883695 +Ref: Internal File Ops-Footnote-1888476 +Node: Using Internal File Ops888616 +Node: Future Extensions890993 +Node: Basic Concepts893497 +Node: Basic High Level894254 +Ref: Basic High Level-Footnote-1898289 +Node: Basic Data Typing898474 +Node: Floating Point Issues902999 +Node: String Conversion Precision904082 +Ref: String Conversion Precision-Footnote-1905782 +Node: Unexpected Results905891 +Node: POSIX Floating Point Problems907717 +Ref: POSIX Floating Point Problems-Footnote-1911422 +Node: Glossary911460 +Node: Copying936436 +Node: GNU Free Documentation License973993 +Node: Index999130 End Tag Table diff --git a/doc/gawk.texi b/doc/gawk.texi index b9190a62..a74773ca 100644 --- a/doc/gawk.texi +++ b/doc/gawk.texi @@ -20,7 +20,7 @@ @c applies to and all the info about who's publishing this edition @c 
These apply across the board. -@set UPDATE-MONTH May, 2011 +@set UPDATE-MONTH June, 2011 @set VERSION 4.0 @set PATCHLEVEL 0 @@ -368,7 +368,6 @@ particular records in a file and perform operations upon them. * Case-sensitivity:: How to do case-insensitive matching. * Leftmost Longest:: How much text matches. * Computed Regexps:: Using Dynamic Regexps. -* Locales:: How the locale affects things. * Records:: Controlling how data is split into records. * Fields:: An introduction to fields. @@ -467,6 +466,7 @@ particular records in a file and perform operations upon them. third subexpression. * Function Calls:: A function call is an expression. * Precedence:: How various operators nest. +* Locales:: How the locale affects things. * Pattern Overview:: What goes into a pattern. * Regexp Patterns:: Using regexps as patterns. * Expression Patterns:: Any expression can be used as a @@ -673,6 +673,8 @@ particular records in a file and perform operations upon them. * POSIX/GNU:: The extensions in @command{gawk} not in POSIX @command{awk}. * Common Extensions:: Common Extensions Summary. +* Ranges and Locales:: How locales used to affect regexp + ranges. * Contributors:: The major contributors to @command{gawk}. * Gawk Distribution:: What is in the @command{gawk} @@ -4003,7 +4005,6 @@ regular expressions work, we present more complicated instances. * Case-sensitivity:: How to do case-insensitive matching. * Leftmost Longest:: How much text matches. * Computed Regexps:: Using Dynamic Regexps. -* Locales:: How the locale affects things. @end menu @node Regexp Usage @@ -4530,15 +4531,14 @@ As in arithmetic, parentheses can change how operators are grouped. @cindex POSIX @command{awk}, regular expressions and @cindex @command{gawk}, regular expressions, precedence -In POSIX @command{awk} and @command{gawk}, the @samp{*}, @samp{+}, and @samp{?} operators -stand for themselves when there is nothing in the regexp that precedes them. 
-For example, @code{/+/} matches a literal plus sign. However, many other versions of -@command{awk} treat such a usage as a syntax error. - -If @command{gawk} is in compatibility mode -(@pxref{Options}), -interval expressions are not available in -regular expressions. +In POSIX @command{awk} and @command{gawk}, the @samp{*}, @samp{+}, and +@samp{?} operators stand for themselves when there is nothing in the +regexp that precedes them. For example, @code{/+/} matches a literal +plus sign. However, many other versions of @command{awk} treat such a +usage as a syntax error. + +If @command{gawk} is in compatibility mode (@pxref{Options}), interval +expressions are not available in regular expressions. @c ENDOFRANGE regexpo @node Bracket Expressions @@ -4548,15 +4548,16 @@ regular expressions. @cindex bracket expressions, range expressions @cindex range expressions (regexps) +As mentioned earlier, a bracket expression matches any character amongst +those listed between the opening and closing square brackets. + Within a bracket expression, a @dfn{range expression} consists of two characters separated by a hyphen. It matches any single character that -sorts between the two characters, using the locale's -collating sequence and character set. -For example, @samp{[0-9]} is equivalent to @samp{[0123456789]}. - -Unfortunately, providing simple character ranges such as @samp{[a-z]} -usually does not work like you might expect, due to locale-related issues. -This is discussed more fully, in @ref{Locales}. +sorts between the two characters, based upon the system's native character +set. For example, @samp{[0-9]} is equivalent to @samp{[0123456789]}. +(See @ref{Ranges and Locales}, for an explanation of how the POSIX +standard and @command{gawk} have changed over time. This is mainly +of historical interest.) @cindex @code{\} (backslash), in bracket expressions @cindex backslash (@code{\}), in bracket expressions @@ -4625,8 +4626,7 @@ control characters, or space characters). 
For example, before the POSIX standard, you had to write @code{/[A-Za-z0-9]/} to match alphanumeric characters. If your character set had other alphabetic characters in it, this would not -match them, and if your character set collated differently from -ASCII, this might not even match the ASCII alphanumeric characters. +match them. With the POSIX character classes, you can write @code{/[[:alnum:]]/} to match the alphabetic and numeric characters in your character set. @@ -5105,94 +5105,6 @@ occur often in practice, but it's worth noting for future reference. @c ENDOFRANGE regexpd @c ENDOFRANGE regexp -@node Locales -@section Where You Are Makes A Difference -@cindex locale, definition of - -Modern systems support the notion of @dfn{locales}: a way to tell -the system about the local character set and language. The current -locale setting can affect the way regexp matching works, often -in surprising ways. - -For example, in the default @code{"C"} locale, @samp{[a-dx-z]} is equivalent to -@samp{[abcdxyz]}. Many locales sort characters in dictionary order, -and in these locales, @samp{[a-dx-z]} is typically not equivalent to -@samp{[abcdxyz]}; instead it might be equivalent to @samp{[aBbCcdXxYyz]}, -for example. - -This point needs to be emphasized: Much literature teaches that you should -use @samp{[a-z]} to match a lowercase character. But on systems with -non-ASCII locales, this also matches all of the uppercase characters -except @samp{Z}! This is a continuous cause of confusion, even well -into the twenty-first century. - -@quotation NOTE -In an attempt to end the confusion once and for all, -when not in POSIX mode (@pxref{Options}), -@command{gawk} expands ranges into the characters they -include, based only on the machine character set. -This restores the traditional, pre-POSIX, pre-locales -behavior. However, you should read the rest of this section -so that you can write portable scripts, instead of relying -on behavior specific to @command{gawk}. 
-@end quotation - -To obtain the traditional interpretation of bracket expressions, you can -use the @code{"C"} locale by setting the @env{LC_ALL} environment variable to the -value @samp{C}. However, it is best to just use POSIX character classes, -such as @samp{[[:lower:]]} to match specific classes of characters. - -To demonstrate these issues, the following example uses the @code{sub()} -function, which does text replacement (@pxref{String Functions}). Here, -the intent is to remove trailing uppercase characters: - -@example -$ @kbd{echo something1234abc | gawk --posix '@{ sub("[A-Z]*$", ""); print @}'} -@print{} something1234a -@end example - -@noindent -This output is unexpected, since the @samp{bc} at the end of -@samp{something1234abc} should not normally match @samp{[A-Z]*}. -This result is due to the locale setting (and thus you may not see -it on your system). There are two fixes. The first is to use the -POSIX character class @samp{[[:upper:]]}, instead of @samp{[A-Z]}. -(This is preferred, since then your program will work everywhere.) - -The second is to change the locale setting in the environment, before -running @command{gawk}, by using the shell statements: - -@example -LANG=C LC_ALL=C -export LANG LC_ALL -@end example - -The setting @samp{C} forces @command{gawk} to behave in the traditional -Unix manner, where case distinctions do matter. -You may wish to put these statements into your shell startup file, -e.g., @file{$HOME/.profile}. - -Similar considerations apply to other ranges. For example, -@samp{["-/]} is perfectly valid in ASCII, but is not valid in many -Unicode locales, such as @samp{en_US.UTF-8}. (In general, such -ranges should be avoided; either list the characters individually, -or use a POSIX character class such as @samp{[[:punct:]]}.) - -An additional factor relates to splitting records. -For the normal case of @samp{RS = "\n"}, the locale is largely irrelevant. 
-For other single-character record separators, using @samp{LC_ALL=C} -will give you much better performance when reading records. Otherwise, -@command{gawk} has to make several function calls, @emph{per input -character}, to find the record terminator. - -According to POSIX, string comparison is also affected by locales -(similar to regular expressions). The details are presented in -@ref{POSIX String Comparison}. - -Finally, the locale affects the value of the decimal point character -used when @command{gawk} parses input data. This is discussed in -detail in @ref{Conversion}. - @node Reading Files @chapter Reading Input Files @@ -8773,6 +8685,7 @@ combinations of these with various operators. * Truth Values and Conditions:: Testing for true and false. * Function Calls:: A function call is an expression. * Precedence:: How various operators nest. +* Locales:: How the locale affects things. @end menu @node Values @@ -10933,6 +10846,33 @@ For maximum portability, do not use them. @end quotation @c ENDOFRANGE prec @c ENDOFRANGE oppr + +@node Locales +@section Where You Are Makes A Difference +@cindex locale, definition of + +Modern systems support the notion of @dfn{locales}: a way to tell +the system about the local character set and language. + +Once upon a time, the locale setting used to affect regexp matching +(@pxref{Ranges and Locales}), but this is no longer true. + +Locales can affect record splitting. +For the normal case of @samp{RS = "\n"}, the locale is largely irrelevant. +For other single-character record separators, setting @samp{LC_ALL=C} +in the environment +will give you much better performance when reading records. Otherwise, +@command{gawk} has to make several function calls, @emph{per input +character}, to find the record terminator. + +According to POSIX, string comparison is also affected by locales +(similar to regular expressions). The details are presented in +@ref{POSIX String Comparison}. 
+Finally, the locale affects the value of the decimal point character +used when @command{gawk} parses input data. This is discussed in +detail in @ref{Conversion}. + @c ENDOFRANGE exps @node Patterns and Actions @@ -26434,6 +26374,7 @@ of the @value{DOCUMENT} where you can find more information. * POSIX/GNU:: The extensions in @command{gawk} not in POSIX @command{awk}. * Common Extensions:: Common Extensions Summary. +* Ranges and Locales:: How locales used to affect regexp ranges. * Contributors:: The major contributors to @command{gawk}. @end menu @@ -26977,6 +26918,103 @@ the three most widely-used freely available versions of @command{awk} @item @code{BINMODE} variable @tab @tab X @tab X @end multitable +@node Ranges and Locales +@appendixsec Regexp Ranges and Locales: A Long Sad Story + +This @value{SECTION} describes the confusing history of ranges within +regular expressions and their interactions with locales, and how this +affected different versions of @command{gawk}. + +The original Unix tools that worked with regular expressions defined +character ranges (such as @samp{[a-z]}) to match any character between +the first character in the range and the last character in the range, +inclusive. Ordering was based on the numeric value of each character +in the machine's native character set. Thus, on ASCII-based systems, +@code{[a-z]} matched all the lowercase letters, and only the lowercase +letters, since the numeric values for the letters from @samp{a} through +@samp{z} were contiguous. (On an EBCDIC system, the range @samp{[a-z]} +includes additional, non-alphabetic characters as well.) + +Almost all introductory Unix literature explained range expressions +as working in this fashion, and in particular, would teach that the +``correct'' way to match lowercase letters was with @samp{[a-z]}, and +that @samp{[A-Z]} was the ``correct'' way to match uppercase letters. +And indeed, this was true. 
+ +The 1993 POSIX standard introduced the idea of locales (@pxref{Locales}). +Since many locales include other letters besides the plain twenty-six +letters of the American English alphabet, the POSIX standard added +character classes (@pxref{Bracket Expressions}) as a way to match +different kinds of characters besides the traditional ones in the ASCII +character set. + +However, the standard @emph{changed} the interpretation of range expressions. +In the @code{"C"} and @code{"POSIX"} locales, a range expression like +@samp{[a-dx-z]} is still equivalent to @samp{[abcdxyz]}, as in ASCII. +But outside those locales, the ordering was defined to be based on +@dfn{collation order}. + +In many locales, @samp{A} and @samp{a} are both less than @samp{B}. +In other words, these locales sort characters in dictionary order, +and @samp{[a-dx-z]} is typically not equivalent to @samp{[abcdxyz]}; +instead it might be equivalent to @samp{[aBbCcdXxYyz]}, for example. + +This point needs to be emphasized: Much literature teaches that you should +use @samp{[a-z]} to match a lowercase character. But on systems with +non-ASCII locales, this also matched all of the uppercase characters +except @samp{Z}! This was a continuous cause of confusion, even well +into the twenty-first century. + +To demonstrate these issues, the following example uses the @code{sub()} +function, which does text replacement (@pxref{String Functions}). Here, +the intent is to remove trailing uppercase characters: + +@example +$ @kbd{echo something1234abc | gawk-3.1.8 '@{ sub("[A-Z]*$", ""); print @}'} +@print{} something1234a +@end example + +@noindent +This output is unexpected, since the @samp{bc} at the end of +@samp{something1234abc} should not normally match @samp{[A-Z]*}. +This result is due to the locale setting (and thus you may not see +it on your system). + +Similar considerations apply to other ranges. 
For example, @samp{["-/]} +is perfectly valid in ASCII, but is not valid in many Unicode locales, +such as @samp{en_US.UTF-8}. + +Early versions of @command{gawk} used regexp matching code that was not +locale aware, so ranges had their traditional interpretation. + +When @command{gawk} switched to using locale-aware regexp matchers, +the problems began; especially as both GNU/Linux and commercial Unix +vendors started implementing non-ASCII locales, @emph{and making them +the default}. Perhaps the most frequently asked question became something +like ``why does @code{[A-Z]} match lowercase letters?!?'' + +This situation existed for close to 10 years, if not more, and +the @command{gawk} maintainer grew weary of trying to explain that +@command{gawk} was being nicely standards-compliant, and that the issue +was in the user's locale. During the development of version 4.0, +he modified @command{gawk} to always treat ranges in the original, +pre-POSIX fashion, unless @option{--posix} was used (@pxref{Options}). + +Fortunately, shortly before the final release of @command{gawk} 4.0, +the maintainer learned that the 2008 standard had changed the +definition of ranges, such that outside the @code{"C"} and @code{"POSIX"} +locales, the meaning of range expressions was +@emph{undefined}.@footnote{See +@uref{http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03_05, the standard} +and +@uref{http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap09.html#tag_21_09_03_05, its rationale}.} + +By using this lovely technical term, the standard gives license +to implementors to implement ranges in whatever way they choose. +The @command{gawk} maintainer chose to apply the pre-POSIX meaning in all +cases: the default regexp matching; with @option{--traditional}, and with +@option{--posix}; in all cases, @command{gawk} remains POSIX compliant. 
+ @node Contributors @appendixsec Major Contributors to @command{gawk} @cindex @command{gawk}, list of contributors to @@ -382,13 +382,26 @@ resetup() { if (do_posix) syn = RE_SYNTAX_POSIX_AWK; /* strict POSIX re's */ - else if (do_traditional) { + else if (do_traditional) syn = RE_SYNTAX_AWK; /* traditional Unix awk re's */ - syn |= RE_RANGES_IGNORE_LOCALES; - } else + else syn = RE_SYNTAX_GNU_AWK; /* POSIX re's + GNU ops */ /* + * As of POSIX 1003.1-2008 (see rule 7 of + * http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03_05 + * and the rationale, at http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap09.html#tag_21_09_03_05) + * POSIX changed ranges outside the POSIX locale from requiring + * Collation Element Order to being "undefined". This gives an + * implementation, like gawk, the freedom to do ranges as it + * pleases. + * + * We very much please to always use numeric ordering, as + * the Good Lord intended. + */ + syn |= RE_RANGES_IGNORE_LOCALES; + + /* * Interval expressions are now on by default, as POSIX is * wide-spread enough that people want it. The do_intervals * variable remains for use with --traditional. |