Diffstat (limited to 'doc')
-rw-r--r-- doc/ChangeLog | 5
-rw-r--r-- doc/gawk.info | 659
-rw-r--r-- doc/gawk.texi | 123
-rw-r--r-- doc/gawktexi.in | 123
4 files changed, 574 insertions, 336 deletions
diff --git a/doc/ChangeLog b/doc/ChangeLog index 0fbba4ee..a6ad9a76 100644 --- a/doc/ChangeLog +++ b/doc/ChangeLog @@ -2,6 +2,11 @@ * gawktexi.in: Minor edits. + Unrelated: + + * gawktexi (Wc Program): Update to POSIX, support both bytes + and characters via the gawkextlib mbs extension. + 2020-10-01 Arnold D. Robbins <arnold@skeeve.com> * gawktexi.in (Split Program): Rewrite split to be POSIX diff --git a/doc/gawk.info b/doc/gawk.info index cf00ed11..320e0657 100644 --- a/doc/gawk.info +++ b/doc/gawk.info @@ -19011,10 +19011,76 @@ File: gawk.info, Node: Wc Program, Prev: Uniq Program, Up: Clones 11.2.7 Counting Things ---------------------- -The 'wc' (word count) utility counts lines, words, and characters in one -or more input files. Its usage is as follows: +The 'wc' (word count) utility counts lines, words, characters and bytes +in one or more input files. - 'wc' ['-lwc'] [FILES ...] +* Menu: + +* Bytes vs. Characters:: Modern character sets. +* Using extensions:: A brief intro to extensions. +* wc program:: Code for 'wc.awk'. + + +File: gawk.info, Node: Bytes vs. Characters, Next: Using extensions, Up: Wc Program + +11.2.7.1 Modern Character Sets +.............................. + +In the early days of computing, single bytes were used for storing +characters. The most common character sets were ASCII and EBCDIC, which +each provided all the English upper- and lowercase letters, the 10 +Hindu-Arabic numerals from 0 through 9, and a number of other standard +punctuation and control characters. + + Today, the most popular character set in use is Unicode (of which +ASCII is a pure subset). Unicode provides tens of thousands of unique +characters (called "code points") to cover most existing human languages +(living and dead) and a number of nonhuman ones as well (such as Klingon +and J.R.R. Tolkien's elvish languages). + + To save space in files, Unicode code points are "encoded", where each +character takes from one to four bytes in the file. 
UTF-8 is possibly +the most popular of such "multibyte encodings". + + The POSIX standard requires that 'awk' function in terms of +characters, not bytes. Thus in 'gawk', 'length()', 'substr()', +'split()', 'match()' and the other string functions (*note String +Functions::) all work in terms of characters in the local character set, +and not in terms of bytes. (Not all 'awk' implementations do so, +though). + + There is no standard, built-in way to distinguish characters from +bytes in an 'awk' program. For an 'awk' implementation of 'wc', which +needs to make such a distinction, we will have to use an external +extension. + + +File: gawk.info, Node: Using extensions, Next: wc program, Prev: Bytes vs. Characters, Up: Wc Program + +11.2.7.2 A Brief Introduction To Extensions +........................................... + +Loadable extensions are presented in full detail in *note Dynamic +Extensions::. They provide a way to add functions to 'gawk' which can +call out to other facilities written in C or C++. + + For the purposes of 'wc.awk', it's enough to know that the extension +is loaded with the '@load' directive, and the additional function we +will use is called 'mbs_length()'. This function returns the number of +bytes in a string, and not the number of characters. + + The '"mbs"' extension comes from the 'gawkextlib' project. *Note +gawkextlib:: for more information. + + +File: gawk.info, Node: wc program, Prev: Using extensions, Up: Wc Program + +11.2.7.3 Code for 'wc.awk' +.......................... + +The usage for 'wc' is as follows: + + 'wc' ['-lwcm'] [FILES ...] If no files are specified on the command line, 'wc' reads its standard input. If there are multiple files, it also prints total @@ -19031,21 +19097,27 @@ follows: data. '-c' + Count only bytes. Once upon a time, the 'c' in this option stood + for "characters." But, as explained earlier, bytes and character + are no longer synonymous with each other. + +'-m' Count only characters. 
Implementing 'wc' in 'awk' is particularly elegant, because 'awk' does a lot of the work for us; it splits lines into words (i.e., fields) and counts them, it counts lines (i.e., records), and it can easily tell -us how long a line is. +us how long a line is in characters. This program uses the 'getopt()' library function (*note Getopt Function::) and the file-transition functions (*note Filetrans Function::). - This version has one notable difference from traditional versions of -'wc': it always prints the counts in the order lines, words, and -characters. Traditional versions note the order of the '-l', '-w', and -'-c' options on the command line, and print the counts in that order. + This version has one notable difference from older versions of 'wc': +it always prints the counts in the order lines, words, characters and +bytes. Older versions note the order of the '-l', '-w', and '-c' +options on the command line, and print the counts in that order. POSIX +does not mandate this behavior, though. The 'BEGIN' rule does the argument processing. The variable 'print_total' is true if more than one file is named on the command @@ -19056,40 +19128,46 @@ line: # Options: # -l only count lines # -w only count words - # -c only count characters + # -c only count bytes + # -m only count characters # - # Default is to count lines, words, characters + # Default is to count lines, words, bytes # # Requires getopt() and file transition library functions + # Requires mbs extension from gawkextlib + + @load "mbs" BEGIN { # let getopt() print a message about # invalid options. we ignore them - while ((c = getopt(ARGC, ARGV, "lwc")) != -1) { + while ((c = getopt(ARGC, ARGV, "lwcm")) != -1) { if (c == "l") do_lines = 1 else if (c == "w") do_words = 1 else if (c == "c") + do_bytes = 1 + else if (c == "m") do_chars = 1 } for (i = 1; i < Optind; i++) ARGV[i] = "" - # if no options, do all - if (! do_lines && ! do_words && ! 
do_chars) - do_lines = do_words = do_chars = 1 + # if no options, do lines, words, bytes + if (! do_lines && ! do_words && ! do_chars && ! do_bytes) + do_lines = do_words = do_bytes = 1 print_total = (ARGC - i > 1) } The 'beginfile()' function is simple; it just resets the counts of -lines, words, and characters to zero, and saves the current file name in -'fname': +lines, words, characters and bytes to zero, and saves the current file +name in 'fname': function beginfile(file) { - lines = words = chars = 0 + lines = words = chars = bytes = 0 fname = FILENAME } @@ -19103,26 +19181,31 @@ those numbers for the file that was just read. It relies on tlines += lines twords += words tchars += chars + tbytes += bytes if (do_lines) printf "\t%d", lines if (do_words) printf "\t%d", words if (do_chars) printf "\t%d", chars + if (do_bytes) + printf "\t%d", bytes printf "\t%s\n", fname } There is one rule that is executed for each line. It adds the length -of the record, plus one, to 'chars'.(1) Adding one plus the record -length is needed because the newline character separating records (the -value of 'RS') is not part of the record itself, and thus not included -in its length. Next, 'lines' is incremented for each line read, and +of the record, plus one, to 'chars'. Adding one plus the record length +is needed because the newline character separating records (the value of +'RS') is not part of the record itself, and thus not included in its +length. Similarly, it adds the length of the record in bytes, plus one, +to 'bytes'. Next, 'lines' is incremented for each line read, and 'words' is incremented by the value of 'NF', which is the number of "words" on this line: # do per line { chars += length($0) + 1 # get newline + bytes += mbs_length($0) + 1 lines++ words += NF } @@ -19137,15 +19220,12 @@ in its length. 
Next, 'lines' is incremented for each line read, and printf "\t%d", twords if (do_chars) printf "\t%d", tchars + if (do_bytes) + printf "\t%d", tbytes print "\ttotal" } } - ---------- Footnotes ---------- - - (1) Because 'gawk' understands multibyte locales, this code counts -characters, not bytes. - File: gawk.info, Node: Miscellaneous Programs, Next: Programs Summary, Prev: Clones, Up: Sample Programs @@ -35111,6 +35191,7 @@ Index * built-in functions: Functions. (line 6) * built-in functions, evaluation order: Calling Built-in. (line 30) * BusyBox Awk: Other Versions. (line 88) +* bytes, counting: Wc Program. (line 6) * C library functions, assert(): Assert Function. (line 6) * C library functions, getopt(): Getopt Function. (line 15) * C library functions, getpwent(): Passwd Functions. (line 16) @@ -35303,7 +35384,7 @@ Index * coprocesses <1>: Two-way I/O. (line 27) * cos: Numeric Functions. (line 16) * cosine: Numeric Functions. (line 16) -* counting words, lines, and characters: Wc Program. (line 6) +* counting words, lines, characters, and bytese: Wc Program. (line 6) * csh utility: Statements/Lines. (line 45) * csh utility, POSIXLY_CORRECT environment variable: Options. (line 405) * csh utility, |& operator, comparison with: Two-way I/O. (line 27) @@ -37765,7 +37846,7 @@ Index * watchpoint (debugger): Debugging Terms. (line 42) * watchpoints, show in debugger: Debugger Info. (line 51) * wc utility: Wc Program. (line 6) -* wc.awk program: Wc Program. (line 46) +* wc.awk program: wc program. (line 51) * Weinberger, Peter: History. (line 17) * Weinberger, Peter <1>: Contributors. (line 12) * where debugger command (alias for backtrace): Execution Stack. 
@@ -38140,266 +38221,268 @@ Ref: Split Program-Footnote-1766907 Node: Tee Program767080 Node: Uniq Program769870 Node: Wc Program777434 -Ref: Wc Program-Footnote-1781689 -Node: Miscellaneous Programs781783 -Node: Dupword Program782996 -Node: Alarm Program785026 -Node: Translate Program789881 -Ref: Translate Program-Footnote-1794446 -Node: Labels Program794716 -Ref: Labels Program-Footnote-1798067 -Node: Word Sorting798151 -Node: History Sorting802223 -Node: Extract Program804448 -Node: Simple Sed812502 -Node: Igawk Program815576 -Ref: Igawk Program-Footnote-1829907 -Ref: Igawk Program-Footnote-2830109 -Ref: Igawk Program-Footnote-3830231 -Node: Anagram Program830346 -Node: Signature Program833408 -Node: Programs Summary834655 -Node: Programs Exercises835869 -Ref: Programs Exercises-Footnote-1839999 -Node: Advanced Features840085 -Node: Nondecimal Data842075 -Node: Array Sorting843666 -Node: Controlling Array Traversal844366 -Ref: Controlling Array Traversal-Footnote-1852734 -Node: Array Sorting Functions852852 -Ref: Array Sorting Functions-Footnote-1857943 -Node: Two-way I/O858139 -Ref: Two-way I/O-Footnote-1865860 -Ref: Two-way I/O-Footnote-2866047 -Node: TCP/IP Networking866129 -Node: Profiling869247 -Node: Advanced Features Summary878561 -Node: Internationalization880405 -Node: I18N and L10N881885 -Node: Explaining gettext882572 -Ref: Explaining gettext-Footnote-1888464 -Ref: Explaining gettext-Footnote-2888649 -Node: Programmer i18n888814 -Ref: Programmer i18n-Footnote-1893763 -Node: Translator i18n893812 -Node: String Extraction894606 -Ref: String Extraction-Footnote-1895738 -Node: Printf Ordering895824 -Ref: Printf Ordering-Footnote-1898610 -Node: I18N Portability898674 -Ref: I18N Portability-Footnote-1901130 -Node: I18N Example901193 -Ref: I18N Example-Footnote-1904468 -Ref: I18N Example-Footnote-2904541 -Node: Gawk I18N904650 -Node: I18N Summary905299 -Node: Debugger906640 -Node: Debugging907640 -Node: Debugging Concepts908081 -Node: Debugging Terms909890 
-Node: Awk Debugging912465 -Ref: Awk Debugging-Footnote-1913410 -Node: Sample Debugging Session913542 -Node: Debugger Invocation914076 -Node: Finding The Bug915462 -Node: List of Debugger Commands921936 -Node: Breakpoint Control923269 -Node: Debugger Execution Control926963 -Node: Viewing And Changing Data930325 -Node: Execution Stack933866 -Node: Debugger Info935503 -Node: Miscellaneous Debugger Commands939574 -Node: Readline Support944636 -Node: Limitations945532 -Node: Debugging Summary948086 -Node: Namespaces949365 -Node: Global Namespace950476 -Node: Qualified Names951874 -Node: Default Namespace952873 -Node: Changing The Namespace953614 -Node: Naming Rules955228 -Node: Internal Name Management957076 -Node: Namespace Example958118 -Node: Namespace And Features960680 -Node: Namespace Summary962115 -Node: Arbitrary Precision Arithmetic963592 -Node: Computer Arithmetic965079 -Ref: table-numeric-ranges968845 -Ref: table-floating-point-ranges969338 -Ref: Computer Arithmetic-Footnote-1969996 -Node: Math Definitions970053 -Ref: table-ieee-formats973369 -Ref: Math Definitions-Footnote-1973972 -Node: MPFR features974077 -Node: FP Math Caution975795 -Ref: FP Math Caution-Footnote-1976867 -Node: Inexactness of computations977236 -Node: Inexact representation978196 -Node: Comparing FP Values979556 -Node: Errors accumulate980797 -Node: Getting Accuracy982230 -Node: Try To Round984940 -Node: Setting precision985839 -Ref: table-predefined-precision-strings986536 -Node: Setting the rounding mode988366 -Ref: table-gawk-rounding-modes988740 -Ref: Setting the rounding mode-Footnote-1992671 -Node: Arbitrary Precision Integers992850 -Ref: Arbitrary Precision Integers-Footnote-1996025 -Node: Checking for MPFR996174 -Node: POSIX Floating Point Problems997648 -Ref: POSIX Floating Point Problems-Footnote-11001933 -Node: Floating point summary1001971 -Node: Dynamic Extensions1004161 -Node: Extension Intro1005714 -Node: Plugin License1006980 -Node: Extension Mechanism Outline1007777 
-Ref: figure-load-extension1008216 -Ref: figure-register-new-function1009781 -Ref: figure-call-new-function1010873 -Node: Extension API Description1012935 -Node: Extension API Functions Introduction1014648 -Ref: table-api-std-headers1016484 -Node: General Data Types1020733 -Ref: General Data Types-Footnote-11029363 -Node: Memory Allocation Functions1029662 -Ref: Memory Allocation Functions-Footnote-11034163 -Node: Constructor Functions1034262 -Node: API Ownership of MPFR and GMP Values1037728 -Node: Registration Functions1039041 -Node: Extension Functions1039741 -Node: Exit Callback Functions1045063 -Node: Extension Version String1046313 -Node: Input Parsers1046976 -Node: Output Wrappers1059697 -Node: Two-way processors1064209 -Node: Printing Messages1066474 -Ref: Printing Messages-Footnote-11067645 -Node: Updating ERRNO1067798 -Node: Requesting Values1068537 -Ref: table-value-types-returned1069274 -Node: Accessing Parameters1070210 -Node: Symbol Table Access1071447 -Node: Symbol table by name1071959 -Ref: Symbol table by name-Footnote-11074983 -Node: Symbol table by cookie1075111 -Ref: Symbol table by cookie-Footnote-11079296 -Node: Cached values1079360 -Ref: Cached values-Footnote-11082896 -Node: Array Manipulation1083049 -Ref: Array Manipulation-Footnote-11084140 -Node: Array Data Types1084177 -Ref: Array Data Types-Footnote-11086835 -Node: Array Functions1086927 -Node: Flattening Arrays1091425 -Node: Creating Arrays1098401 -Node: Redirection API1103168 -Node: Extension API Variables1106001 -Node: Extension Versioning1106712 -Ref: gawk-api-version1107141 -Node: Extension GMP/MPFR Versioning1108872 -Node: Extension API Informational Variables1110500 -Node: Extension API Boilerplate1111573 -Node: Changes from API V11115547 -Node: Finding Extensions1117119 -Node: Extension Example1117678 -Node: Internal File Description1118476 -Node: Internal File Ops1122556 -Ref: Internal File Ops-Footnote-11133906 -Node: Using Internal File Ops1134046 -Ref: Using Internal File 
Ops-Footnote-11136429 -Node: Extension Samples1136703 -Node: Extension Sample File Functions1138232 -Node: Extension Sample Fnmatch1145881 -Node: Extension Sample Fork1147368 -Node: Extension Sample Inplace1148586 -Node: Extension Sample Ord1152212 -Node: Extension Sample Readdir1153048 -Ref: table-readdir-file-types1153937 -Node: Extension Sample Revout1155004 -Node: Extension Sample Rev2way1155593 -Node: Extension Sample Read write array1156333 -Node: Extension Sample Readfile1158275 -Node: Extension Sample Time1159370 -Node: Extension Sample API Tests1161122 -Node: gawkextlib1161614 -Node: Extension summary1164532 -Node: Extension Exercises1168234 -Node: Language History1169476 -Node: V7/SVR3.11171132 -Node: SVR41173284 -Node: POSIX1174718 -Node: BTL1176099 -Node: POSIX/GNU1176828 -Node: Feature History1182606 -Node: Common Extensions1198925 -Node: Ranges and Locales1200208 -Ref: Ranges and Locales-Footnote-11204824 -Ref: Ranges and Locales-Footnote-21204851 -Ref: Ranges and Locales-Footnote-31205086 -Node: Contributors1205309 -Node: History summary1211306 -Node: Installation1212686 -Node: Gawk Distribution1213630 -Node: Getting1214114 -Node: Extracting1215077 -Node: Distribution contents1216715 -Node: Unix Installation1223195 -Node: Quick Installation1223877 -Node: Shell Startup Files1226291 -Node: Additional Configuration Options1227380 -Node: Configuration Philosophy1229695 -Node: Non-Unix Installation1232064 -Node: PC Installation1232524 -Node: PC Binary Installation1233362 -Node: PC Compiling1233797 -Node: PC Using1234914 -Node: Cygwin1238467 -Node: MSYS1239691 -Node: VMS Installation1240293 -Node: VMS Compilation1241084 -Ref: VMS Compilation-Footnote-11242313 -Node: VMS Dynamic Extensions1242371 -Node: VMS Installation Details1244056 -Node: VMS Running1246309 -Node: VMS GNV1250588 -Node: VMS Old Gawk1251323 -Node: Bugs1251794 -Node: Bug address1252457 -Node: Usenet1255439 -Node: Maintainers1256443 -Node: Other Versions1257628 -Node: Installation 
summary1264716 -Node: Notes1265925 -Node: Compatibility Mode1266719 -Node: Additions1267501 -Node: Accessing The Source1268426 -Node: Adding Code1269863 -Node: New Ports1276082 -Node: Derived Files1280457 -Ref: Derived Files-Footnote-11286117 -Ref: Derived Files-Footnote-21286152 -Ref: Derived Files-Footnote-31286750 -Node: Future Extensions1286864 -Node: Implementation Limitations1287522 -Node: Extension Design1288732 -Node: Old Extension Problems1289876 -Ref: Old Extension Problems-Footnote-11291394 -Node: Extension New Mechanism Goals1291451 -Ref: Extension New Mechanism Goals-Footnote-11294815 -Node: Extension Other Design Decisions1295004 -Node: Extension Future Growth1297117 -Node: Notes summary1297723 -Node: Basic Concepts1298881 -Node: Basic High Level1299562 -Ref: figure-general-flow1299844 -Ref: figure-process-flow1300529 -Ref: Basic High Level-Footnote-11303830 -Node: Basic Data Typing1304015 -Node: Glossary1307343 -Node: Copying1339228 -Node: GNU Free Documentation License1376771 -Node: Index1401891 +Node: Bytes vs. 
Characters777831 +Node: Using extensions779379 +Node: wc program780137 +Node: Miscellaneous Programs784995 +Node: Dupword Program786208 +Node: Alarm Program788238 +Node: Translate Program793093 +Ref: Translate Program-Footnote-1797658 +Node: Labels Program797928 +Ref: Labels Program-Footnote-1801279 +Node: Word Sorting801363 +Node: History Sorting805435 +Node: Extract Program807660 +Node: Simple Sed815714 +Node: Igawk Program818788 +Ref: Igawk Program-Footnote-1833119 +Ref: Igawk Program-Footnote-2833321 +Ref: Igawk Program-Footnote-3833443 +Node: Anagram Program833558 +Node: Signature Program836620 +Node: Programs Summary837867 +Node: Programs Exercises839081 +Ref: Programs Exercises-Footnote-1843211 +Node: Advanced Features843297 +Node: Nondecimal Data845287 +Node: Array Sorting846878 +Node: Controlling Array Traversal847578 +Ref: Controlling Array Traversal-Footnote-1855946 +Node: Array Sorting Functions856064 +Ref: Array Sorting Functions-Footnote-1861155 +Node: Two-way I/O861351 +Ref: Two-way I/O-Footnote-1869072 +Ref: Two-way I/O-Footnote-2869259 +Node: TCP/IP Networking869341 +Node: Profiling872459 +Node: Advanced Features Summary881773 +Node: Internationalization883617 +Node: I18N and L10N885097 +Node: Explaining gettext885784 +Ref: Explaining gettext-Footnote-1891676 +Ref: Explaining gettext-Footnote-2891861 +Node: Programmer i18n892026 +Ref: Programmer i18n-Footnote-1896975 +Node: Translator i18n897024 +Node: String Extraction897818 +Ref: String Extraction-Footnote-1898950 +Node: Printf Ordering899036 +Ref: Printf Ordering-Footnote-1901822 +Node: I18N Portability901886 +Ref: I18N Portability-Footnote-1904342 +Node: I18N Example904405 +Ref: I18N Example-Footnote-1907680 +Ref: I18N Example-Footnote-2907753 +Node: Gawk I18N907862 +Node: I18N Summary908511 +Node: Debugger909852 +Node: Debugging910852 +Node: Debugging Concepts911293 +Node: Debugging Terms913102 +Node: Awk Debugging915677 +Ref: Awk Debugging-Footnote-1916622 +Node: Sample Debugging 
Session916754 +Node: Debugger Invocation917288 +Node: Finding The Bug918674 +Node: List of Debugger Commands925148 +Node: Breakpoint Control926481 +Node: Debugger Execution Control930175 +Node: Viewing And Changing Data933537 +Node: Execution Stack937078 +Node: Debugger Info938715 +Node: Miscellaneous Debugger Commands942786 +Node: Readline Support947848 +Node: Limitations948744 +Node: Debugging Summary951298 +Node: Namespaces952577 +Node: Global Namespace953688 +Node: Qualified Names955086 +Node: Default Namespace956085 +Node: Changing The Namespace956826 +Node: Naming Rules958440 +Node: Internal Name Management960288 +Node: Namespace Example961330 +Node: Namespace And Features963892 +Node: Namespace Summary965327 +Node: Arbitrary Precision Arithmetic966804 +Node: Computer Arithmetic968291 +Ref: table-numeric-ranges972057 +Ref: table-floating-point-ranges972550 +Ref: Computer Arithmetic-Footnote-1973208 +Node: Math Definitions973265 +Ref: table-ieee-formats976581 +Ref: Math Definitions-Footnote-1977184 +Node: MPFR features977289 +Node: FP Math Caution979007 +Ref: FP Math Caution-Footnote-1980079 +Node: Inexactness of computations980448 +Node: Inexact representation981408 +Node: Comparing FP Values982768 +Node: Errors accumulate984009 +Node: Getting Accuracy985442 +Node: Try To Round988152 +Node: Setting precision989051 +Ref: table-predefined-precision-strings989748 +Node: Setting the rounding mode991578 +Ref: table-gawk-rounding-modes991952 +Ref: Setting the rounding mode-Footnote-1995883 +Node: Arbitrary Precision Integers996062 +Ref: Arbitrary Precision Integers-Footnote-1999237 +Node: Checking for MPFR999386 +Node: POSIX Floating Point Problems1000860 +Ref: POSIX Floating Point Problems-Footnote-11005145 +Node: Floating point summary1005183 +Node: Dynamic Extensions1007373 +Node: Extension Intro1008926 +Node: Plugin License1010192 +Node: Extension Mechanism Outline1010989 +Ref: figure-load-extension1011428 +Ref: figure-register-new-function1012993 +Ref: 
figure-call-new-function1014085 +Node: Extension API Description1016147 +Node: Extension API Functions Introduction1017860 +Ref: table-api-std-headers1019696 +Node: General Data Types1023945 +Ref: General Data Types-Footnote-11032575 +Node: Memory Allocation Functions1032874 +Ref: Memory Allocation Functions-Footnote-11037375 +Node: Constructor Functions1037474 +Node: API Ownership of MPFR and GMP Values1040940 +Node: Registration Functions1042253 +Node: Extension Functions1042953 +Node: Exit Callback Functions1048275 +Node: Extension Version String1049525 +Node: Input Parsers1050188 +Node: Output Wrappers1062909 +Node: Two-way processors1067421 +Node: Printing Messages1069686 +Ref: Printing Messages-Footnote-11070857 +Node: Updating ERRNO1071010 +Node: Requesting Values1071749 +Ref: table-value-types-returned1072486 +Node: Accessing Parameters1073422 +Node: Symbol Table Access1074659 +Node: Symbol table by name1075171 +Ref: Symbol table by name-Footnote-11078195 +Node: Symbol table by cookie1078323 +Ref: Symbol table by cookie-Footnote-11082508 +Node: Cached values1082572 +Ref: Cached values-Footnote-11086108 +Node: Array Manipulation1086261 +Ref: Array Manipulation-Footnote-11087352 +Node: Array Data Types1087389 +Ref: Array Data Types-Footnote-11090047 +Node: Array Functions1090139 +Node: Flattening Arrays1094637 +Node: Creating Arrays1101613 +Node: Redirection API1106380 +Node: Extension API Variables1109213 +Node: Extension Versioning1109924 +Ref: gawk-api-version1110353 +Node: Extension GMP/MPFR Versioning1112084 +Node: Extension API Informational Variables1113712 +Node: Extension API Boilerplate1114785 +Node: Changes from API V11118759 +Node: Finding Extensions1120331 +Node: Extension Example1120890 +Node: Internal File Description1121688 +Node: Internal File Ops1125768 +Ref: Internal File Ops-Footnote-11137118 +Node: Using Internal File Ops1137258 +Ref: Using Internal File Ops-Footnote-11139641 +Node: Extension Samples1139915 +Node: Extension Sample File 
Functions1141444 +Node: Extension Sample Fnmatch1149093 +Node: Extension Sample Fork1150580 +Node: Extension Sample Inplace1151798 +Node: Extension Sample Ord1155424 +Node: Extension Sample Readdir1156260 +Ref: table-readdir-file-types1157149 +Node: Extension Sample Revout1158216 +Node: Extension Sample Rev2way1158805 +Node: Extension Sample Read write array1159545 +Node: Extension Sample Readfile1161487 +Node: Extension Sample Time1162582 +Node: Extension Sample API Tests1164334 +Node: gawkextlib1164826 +Node: Extension summary1167744 +Node: Extension Exercises1171446 +Node: Language History1172688 +Node: V7/SVR3.11174344 +Node: SVR41176496 +Node: POSIX1177930 +Node: BTL1179311 +Node: POSIX/GNU1180040 +Node: Feature History1185818 +Node: Common Extensions1202137 +Node: Ranges and Locales1203420 +Ref: Ranges and Locales-Footnote-11208036 +Ref: Ranges and Locales-Footnote-21208063 +Ref: Ranges and Locales-Footnote-31208298 +Node: Contributors1208521 +Node: History summary1214518 +Node: Installation1215898 +Node: Gawk Distribution1216842 +Node: Getting1217326 +Node: Extracting1218289 +Node: Distribution contents1219927 +Node: Unix Installation1226407 +Node: Quick Installation1227089 +Node: Shell Startup Files1229503 +Node: Additional Configuration Options1230592 +Node: Configuration Philosophy1232907 +Node: Non-Unix Installation1235276 +Node: PC Installation1235736 +Node: PC Binary Installation1236574 +Node: PC Compiling1237009 +Node: PC Using1238126 +Node: Cygwin1241679 +Node: MSYS1242903 +Node: VMS Installation1243505 +Node: VMS Compilation1244296 +Ref: VMS Compilation-Footnote-11245525 +Node: VMS Dynamic Extensions1245583 +Node: VMS Installation Details1247268 +Node: VMS Running1249521 +Node: VMS GNV1253800 +Node: VMS Old Gawk1254535 +Node: Bugs1255006 +Node: Bug address1255669 +Node: Usenet1258651 +Node: Maintainers1259655 +Node: Other Versions1260840 +Node: Installation summary1267928 +Node: Notes1269137 +Node: Compatibility Mode1269931 +Node: Additions1270713 
+Node: Accessing The Source1271638 +Node: Adding Code1273075 +Node: New Ports1279294 +Node: Derived Files1283669 +Ref: Derived Files-Footnote-11289329 +Ref: Derived Files-Footnote-21289364 +Ref: Derived Files-Footnote-31289962 +Node: Future Extensions1290076 +Node: Implementation Limitations1290734 +Node: Extension Design1291944 +Node: Old Extension Problems1293088 +Ref: Old Extension Problems-Footnote-11294606 +Node: Extension New Mechanism Goals1294663 +Ref: Extension New Mechanism Goals-Footnote-11298027 +Node: Extension Other Design Decisions1298216 +Node: Extension Future Growth1300329 +Node: Notes summary1300935 +Node: Basic Concepts1302093 +Node: Basic High Level1302774 +Ref: figure-general-flow1303056 +Ref: figure-process-flow1303741 +Ref: Basic High Level-Footnote-11307042 +Node: Basic Data Typing1307227 +Node: Glossary1310555 +Node: Copying1342440 +Node: GNU Free Documentation License1379983 +Node: Index1405103 End Tag Table diff --git a/doc/gawk.texi b/doc/gawk.texi index 9446e696..91625d06 100644 --- a/doc/gawk.texi +++ b/doc/gawk.texi @@ -26761,19 +26761,76 @@ as fast.'' Consider how to rewrite the logic to follow this suggestion. @node Wc Program @subsection Counting Things -@c FIXME: One day, update to current POSIX version of wc - -@cindex counting words, lines, and characters +@cindex counting words, lines, characters, and bytese @cindex input files @subentry counting elements in @cindex words @subentry counting @cindex characters @subentry counting @cindex lines @subentry counting +@cindex bytes @subentry counting @cindex @command{wc} utility -The @command{wc} (word count) utility counts lines, words, and characters in -one or more input files. Its usage is as follows: +The @command{wc} (word count) utility counts lines, words, characters +and bytes in one or more input files. + +@menu +* Bytes vs. Characters:: Modern character sets. +* Using extensions:: A brief intro to extensions. +* @command{wc} program:: Code for @file{wc.awk}. 
+@end menu + +@node Bytes vs. Characters +@subsubsection Modern Character Sets + +In the early days of computing, single bytes were used for storing +characters. The most common character sets were ASCII and EBCDIC, +which each provided all the English upper- and lowercase letters, the 10 +Hindu-Arabic numerals from 0 through 9, and a number of other standard +punctuation and control characters. + +Today, the most popular character set in use is Unicode (of which ASCII +is a pure subset). Unicode provides tens of thousands of unique characters +(called @dfn{code points}) to cover most existing human languages (living +and dead) and a number of nonhuman ones as well (such as Klingon and +J.R.R.@: Tolkien's elvish languages). + +To save space in files, Unicode code points are @dfn{encoded}, where each +character takes from one to four bytes in the file. UTF-8 is possibly +the most popular of such @dfn{multibyte encodings}. + +The POSIX standard requires that @command{awk} function in terms +of characters, not bytes. Thus in @command{gawk}, @code{length()}, +@code{substr()}, @code{split()}, @code{match()} and the other string +functions (@pxref{String Functions}) all work in terms of characters in +the local character set, and not in terms of bytes. (Not all @command{awk} +implementations do so, though). + +There is no standard, built-in way to distinguish characters from bytes +in an @command{awk} program. For an @command{awk} implementation of +@command{wc}, which needs to make such a distinction, we will have to +use an external extension. + +@node Using extensions +@subsubsection A Brief Introduction To Extensions + +Loadable extensions are presented in full detail in @ref{Dynamic Extensions}. +They provide a way to add functions to @command{gawk} which can call +out to other facilities written in C or C++. 
+ +For the purposes of +@file{wc.awk}, it's enough to know that the extension is loaded +with the @code{@@load} directive, and the additional function we +will use is called @code{mbs_length()}. This function returns the +number of bytes in a string, and not the number of characters. + +The @code{"mbs"} extension comes from the @code{gawkextlib} +project. @xref{gawkextlib} for more information. + +@node @command{wc} program +@subsubsection Code for @file{wc.awk} + +The usage for @command{wc} is as follows: @display -@command{wc} [@option{-lwc}] [@var{files} @dots{}] +@command{wc} [@option{-lwcm}] [@var{files} @dots{}] @end display If no files are specified on the command line, @command{wc} reads its standard @@ -26791,24 +26848,30 @@ by spaces and/or TABs. Luckily, this is the normal way @command{awk} separates fields in its input data. @item -c +Count only bytes. +Once upon a time, the @samp{c} in this option stood for ``characters.'' +But, as explained earlier, bytes and character are no longer synonymous +with each other. + +@item -m Count only characters. @end table Implementing @command{wc} in @command{awk} is particularly elegant, because @command{awk} does a lot of the work for us; it splits lines into words (i.e., fields) and counts them, it counts lines (i.e., records), -and it can easily tell us how long a line is. +and it can easily tell us how long a line is in characters. This program uses the @code{getopt()} library function (@pxref{Getopt Function}) and the file-transition functions (@pxref{Filetrans Function}). -This version has one notable difference from traditional versions of +This version has one notable difference from older versions of @command{wc}: it always prints the counts in the order lines, words, -and characters. Traditional versions note the order of the @option{-l}, +characters and bytes. Older versions note the order of the @option{-l}, @option{-w}, and @option{-c} options on the command line, and print the -counts in that order. 
+counts in that order. POSIX does not mandate this behavior, though. The @code{BEGIN} rule does the argument processing. The variable @code{print_total} is true if more than one file is named on the @@ -26824,6 +26887,7 @@ command line: # # Arnold Robbins, arnold@@skeeve.com, Public Domain # May 1993 +# Revised September 2020 @c endfile @end ignore @c file eg/prog/wc.awk @@ -26831,29 +26895,35 @@ command line: # Options: # -l only count lines # -w only count words -# -c only count characters +# -c only count bytes +# -m only count characters # -# Default is to count lines, words, characters +# Default is to count lines, words, bytes # # Requires getopt() and file transition library functions +# Requires mbs extension from gawkextlib + +@@load "mbs" BEGIN @{ # let getopt() print a message about # invalid options. we ignore them - while ((c = getopt(ARGC, ARGV, "lwc")) != -1) @{ + while ((c = getopt(ARGC, ARGV, "lwcm")) != -1) @{ if (c == "l") do_lines = 1 else if (c == "w") do_words = 1 else if (c == "c") + do_bytes = 1 + else if (c == "m") do_chars = 1 @} for (i = 1; i < Optind; i++) ARGV[i] = "" - # if no options, do all - if (! do_lines && ! do_words && ! do_chars) - do_lines = do_words = do_chars = 1 + # if no options, do lines, words, bytes + if (! do_lines && ! do_words && ! do_chars && ! 
do_bytes) + do_lines = do_words = do_bytes = 1 print_total = (ARGC - i > 1) @} @@ -26861,14 +26931,14 @@ BEGIN @{ @end example The @code{beginfile()} function is simple; it just resets the counts of lines, -words, and characters to zero, and saves the current @value{FN} in +words, characters and bytes to zero, and saves the current @value{FN} in @code{fname}: @example @c file eg/prog/wc.awk function beginfile(file) @{ - lines = words = chars = 0 + lines = words = chars = bytes = 0 fname = FILENAME @} @c endfile @@ -26886,6 +26956,7 @@ function endfile(file) tlines += lines twords += words tchars += chars + tbytes += bytes if (do_lines) printf "\t%d", lines @group @@ -26894,26 +26965,28 @@ function endfile(file) @end group if (do_chars) printf "\t%d", chars + if (do_bytes) + printf "\t%d", bytes printf "\t%s\n", fname @} @c endfile @end example There is one rule that is executed for each line. It adds the length of -the record, plus one, to @code{chars}.@footnote{Because @command{gawk} -understands multibyte locales, this code counts characters, not bytes.} -Adding one plus the record length +the record, plus one, to @code{chars}. Adding one plus the record length is needed because the newline character separating records (the value of @code{RS}) is not part of the record itself, and thus not included -in its length. Next, @code{lines} is incremented for each line read, -and @code{words} is incremented by the value of @code{NF}, which is the -number of ``words'' on this line: +in its length. Similarly, it adds the length of the record in bytes, +plus one, to @code{bytes}. 
Next, @code{lines} is incremented for each +line read, and @code{words} is incremented by the value of @code{NF}, +which is the number of ``words'' on this line: @example @c file eg/prog/wc.awk # do per line @{ chars += length($0) + 1 # get newline + bytes += mbs_length($0) + 1 lines++ words += NF @} @@ -26932,6 +27005,8 @@ END @{ printf "\t%d", twords if (do_chars) printf "\t%d", tchars + if (do_bytes) + printf "\t%d", tbytes print "\ttotal" @} @} diff --git a/doc/gawktexi.in b/doc/gawktexi.in index f96ff861..f982ae8b 100644 --- a/doc/gawktexi.in +++ b/doc/gawktexi.in @@ -25771,19 +25771,76 @@ as fast.'' Consider how to rewrite the logic to follow this suggestion. @node Wc Program @subsection Counting Things -@c FIXME: One day, update to current POSIX version of wc - -@cindex counting words, lines, and characters +@cindex counting words, lines, characters, and bytes @cindex input files @subentry counting elements in @cindex words @subentry counting @cindex characters @subentry counting @cindex lines @subentry counting +@cindex bytes @subentry counting @cindex @command{wc} utility -The @command{wc} (word count) utility counts lines, words, and characters in -one or more input files. Its usage is as follows: +The @command{wc} (word count) utility counts lines, words, characters +and bytes in one or more input files. + +@menu +* Bytes vs. Characters:: Modern character sets. +* Using extensions:: A brief intro to extensions. +* @command{wc} program:: Code for @file{wc.awk}. +@end menu + +@node Bytes vs. Characters +@subsubsection Modern Character Sets + +In the early days of computing, single bytes were used for storing +characters. The most common character sets were ASCII and EBCDIC, +which each provided all the English upper- and lowercase letters, the 10 +Hindu-Arabic numerals from 0 through 9, and a number of other standard +punctuation and control characters. + +Today, the most popular character set in use is Unicode (of which ASCII +is a pure subset). 
Unicode provides tens of thousands of unique characters +(called @dfn{code points}) to cover most existing human languages (living +and dead) and a number of nonhuman ones as well (such as Klingon and +J.R.R.@: Tolkien's elvish languages). + +To save space in files, Unicode code points are @dfn{encoded}, where each +character takes from one to four bytes in the file. UTF-8 is possibly +the most popular of such @dfn{multibyte encodings}. + +The POSIX standard requires that @command{awk} function in terms +of characters, not bytes. Thus in @command{gawk}, @code{length()}, +@code{substr()}, @code{split()}, @code{match()} and the other string +functions (@pxref{String Functions}) all work in terms of characters in +the local character set, and not in terms of bytes. (Not all @command{awk} +implementations do so, though). + +There is no standard, built-in way to distinguish characters from bytes +in an @command{awk} program. For an @command{awk} implementation of +@command{wc}, which needs to make such a distinction, we will have to +use an external extension. + +@node Using extensions +@subsubsection A Brief Introduction To Extensions + +Loadable extensions are presented in full detail in @ref{Dynamic Extensions}. +They provide a way to add functions to @command{gawk} which can call +out to other facilities written in C or C++. + +For the purposes of +@file{wc.awk}, it's enough to know that the extension is loaded +with the @code{@@load} directive, and the additional function we +will use is called @code{mbs_length()}. This function returns the +number of bytes in a string, and not the number of characters. + +The @code{"mbs"} extension comes from the @code{gawkextlib} +project. @xref{gawkextlib} for more information. 
+ +@node @command{wc} program +@subsubsection Code for @file{wc.awk} + +The usage for @command{wc} is as follows: @display -@command{wc} [@option{-lwc}] [@var{files} @dots{}] +@command{wc} [@option{-lwcm}] [@var{files} @dots{}] @end display If no files are specified on the command line, @command{wc} reads its standard @@ -25801,24 +25858,30 @@ by spaces and/or TABs. Luckily, this is the normal way @command{awk} separates fields in its input data. @item -c +Count only bytes. +Once upon a time, the @samp{c} in this option stood for ``characters.'' +But, as explained earlier, bytes and characters are no longer synonymous +with each other. + +@item -m Count only characters. @end table Implementing @command{wc} in @command{awk} is particularly elegant, because @command{awk} does a lot of the work for us; it splits lines into words (i.e., fields) and counts them, it counts lines (i.e., records), -and it can easily tell us how long a line is. +and it can easily tell us how long a line is in characters. This program uses the @code{getopt()} library function (@pxref{Getopt Function}) and the file-transition functions (@pxref{Filetrans Function}). -This version has one notable difference from traditional versions of +This version has one notable difference from older versions of @command{wc}: it always prints the counts in the order lines, words, -and characters. Traditional versions note the order of the @option{-l}, +characters and bytes. Older versions note the order of the @option{-l}, @option{-w}, and @option{-c} options on the command line, and print the -counts in that order. +counts in that order. POSIX does not mandate this behavior, though. The @code{BEGIN} rule does the argument processing. 
The variable @code{print_total} is true if more than one file is named on the @@ -25834,6 +25897,7 @@ command line: # # Arnold Robbins, arnold@@skeeve.com, Public Domain # May 1993 +# Revised September 2020 @c endfile @end ignore @c file eg/prog/wc.awk @@ -25841,29 +25905,35 @@ command line: # Options: # -l only count lines # -w only count words -# -c only count characters +# -c only count bytes +# -m only count characters # -# Default is to count lines, words, characters +# Default is to count lines, words, bytes # # Requires getopt() and file transition library functions +# Requires mbs extension from gawkextlib + +@@load "mbs" BEGIN @{ # let getopt() print a message about # invalid options. we ignore them - while ((c = getopt(ARGC, ARGV, "lwc")) != -1) @{ + while ((c = getopt(ARGC, ARGV, "lwcm")) != -1) @{ if (c == "l") do_lines = 1 else if (c == "w") do_words = 1 else if (c == "c") + do_bytes = 1 + else if (c == "m") do_chars = 1 @} for (i = 1; i < Optind; i++) ARGV[i] = "" - # if no options, do all - if (! do_lines && ! do_words && ! do_chars) - do_lines = do_words = do_chars = 1 + # if no options, do lines, words, bytes + if (! do_lines && ! do_words && ! do_chars && ! 
do_bytes) + do_lines = do_words = do_bytes = 1 print_total = (ARGC - i > 1) @} @@ -25871,14 +25941,14 @@ BEGIN @{ @end example The @code{beginfile()} function is simple; it just resets the counts of lines, -words, and characters to zero, and saves the current @value{FN} in +words, characters and bytes to zero, and saves the current @value{FN} in @code{fname}: @example @c file eg/prog/wc.awk function beginfile(file) @{ - lines = words = chars = 0 + lines = words = chars = bytes = 0 fname = FILENAME @} @c endfile @@ -25896,6 +25966,7 @@ function endfile(file) tlines += lines twords += words tchars += chars + tbytes += bytes if (do_lines) printf "\t%d", lines @group @@ -25904,26 +25975,28 @@ function endfile(file) @end group if (do_chars) printf "\t%d", chars + if (do_bytes) + printf "\t%d", bytes printf "\t%s\n", fname @} @c endfile @end example There is one rule that is executed for each line. It adds the length of -the record, plus one, to @code{chars}.@footnote{Because @command{gawk} -understands multibyte locales, this code counts characters, not bytes.} -Adding one plus the record length +the record, plus one, to @code{chars}. Adding one plus the record length is needed because the newline character separating records (the value of @code{RS}) is not part of the record itself, and thus not included -in its length. Next, @code{lines} is incremented for each line read, -and @code{words} is incremented by the value of @code{NF}, which is the -number of ``words'' on this line: +in its length. Similarly, it adds the length of the record in bytes, +plus one, to @code{bytes}. 
Next, @code{lines} is incremented for each +line read, and @code{words} is incremented by the value of @code{NF}, +which is the number of ``words'' on this line: @example @c file eg/prog/wc.awk # do per line @{ chars += length($0) + 1 # get newline + bytes += mbs_length($0) + 1 lines++ words += NF @} @@ -25942,6 +26015,8 @@ END @{ printf "\t%d", twords if (do_chars) printf "\t%d", tchars + if (do_bytes) + printf "\t%d", tbytes print "\ttotal" @} @}