Diffstat (limited to 'doc')
-rw-r--r--  doc/ChangeLog      5
-rw-r--r--  doc/gawk.info    659
-rw-r--r--  doc/gawk.texi    123
-rw-r--r--  doc/gawktexi.in  123
4 files changed, 574 insertions, 336 deletions
diff --git a/doc/ChangeLog b/doc/ChangeLog
index 0fbba4ee..a6ad9a76 100644
--- a/doc/ChangeLog
+++ b/doc/ChangeLog
@@ -2,6 +2,11 @@
* gawktexi.in: Minor edits.
+ Unrelated:
+
+ * gawktexi.in (Wc Program): Update to POSIX, support both bytes
+ and characters via the gawkextlib mbs extension.
+
2020-10-01 Arnold D. Robbins <arnold@skeeve.com>
* gawktexi.in (Split Program): Rewrite split to be POSIX
diff --git a/doc/gawk.info b/doc/gawk.info
index cf00ed11..320e0657 100644
--- a/doc/gawk.info
+++ b/doc/gawk.info
@@ -19011,10 +19011,76 @@ File: gawk.info, Node: Wc Program, Prev: Uniq Program, Up: Clones
11.2.7 Counting Things
----------------------
-The 'wc' (word count) utility counts lines, words, and characters in one
-or more input files. Its usage is as follows:
+The 'wc' (word count) utility counts lines, words, characters and bytes
+in one or more input files.
- 'wc' ['-lwc'] [FILES ...]
+* Menu:
+
+* Bytes vs. Characters:: Modern character sets.
+* Using extensions:: A brief intro to extensions.
+* wc program:: Code for 'wc.awk'.
+
+
+File: gawk.info, Node: Bytes vs. Characters, Next: Using extensions, Up: Wc Program
+
+11.2.7.1 Modern Character Sets
+..............................
+
+In the early days of computing, single bytes were used for storing
+characters. The most common character sets were ASCII and EBCDIC, which
+each provided all the English upper- and lowercase letters, the 10
+Hindu-Arabic numerals from 0 through 9, and a number of other standard
+punctuation and control characters.
+
+ Today, the most popular character set in use is Unicode (of which
+ASCII is a pure subset). Unicode provides tens of thousands of unique
+characters (called "code points") to cover most existing human languages
+(living and dead) and a number of nonhuman ones as well (such as Klingon
+and J.R.R. Tolkien's elvish languages).
+
+ To save space in files, Unicode code points are "encoded", where each
+character takes from one to four bytes in the file. UTF-8 is possibly
+the most popular of such "multibyte encodings".
+
+ The POSIX standard requires that 'awk' function in terms of
+characters, not bytes. Thus in 'gawk', 'length()', 'substr()',
+'split()', 'match()' and the other string functions (*note String
+Functions::) all work in terms of characters in the local character set,
+and not in terms of bytes. (Not all 'awk' implementations do so,
+though.)
+
+ There is no standard, built-in way to distinguish characters from
+bytes in an 'awk' program. For an 'awk' implementation of 'wc', which
+needs to make such a distinction, we will have to use an external
+extension.
+
+
+File: gawk.info, Node: Using extensions, Next: wc program, Prev: Bytes vs. Characters, Up: Wc Program
+
+11.2.7.2 A Brief Introduction To Extensions
+...........................................
+
+Loadable extensions are presented in full detail in *note Dynamic
+Extensions::. They provide a way to add functions to 'gawk' which can
+call out to other facilities written in C or C++.
+
+ For the purposes of 'wc.awk', it's enough to know that the extension
+is loaded with the '@load' directive, and the additional function we
+will use is called 'mbs_length()'. This function returns the number of
+bytes in a string, and not the number of characters.
+
+ The '"mbs"' extension comes from the 'gawkextlib' project. *Note
+gawkextlib:: for more information.
+
+
+File: gawk.info, Node: wc program, Prev: Using extensions, Up: Wc Program
+
+11.2.7.3 Code for 'wc.awk'
+..........................
+
+The usage for 'wc' is as follows:
+
+ 'wc' ['-lwcm'] [FILES ...]
If no files are specified on the command line, 'wc' reads its
standard input. If there are multiple files, it also prints total
@@ -19031,21 +19097,27 @@ follows:
data.
'-c'
+ Count only bytes. Once upon a time, the 'c' in this option stood
+ for "characters." But, as explained earlier, bytes and characters
+ are no longer synonymous with each other.
+
+'-m'
Count only characters.
Implementing 'wc' in 'awk' is particularly elegant, because 'awk'
does a lot of the work for us; it splits lines into words (i.e., fields)
and counts them, it counts lines (i.e., records), and it can easily tell
-us how long a line is.
+us how long a line is in characters.
This program uses the 'getopt()' library function (*note Getopt
Function::) and the file-transition functions (*note Filetrans
Function::).
- This version has one notable difference from traditional versions of
-'wc': it always prints the counts in the order lines, words, and
-characters. Traditional versions note the order of the '-l', '-w', and
-'-c' options on the command line, and print the counts in that order.
+ This version has one notable difference from older versions of 'wc':
+it always prints the counts in the order lines, words, characters and
+bytes. Older versions note the order of the '-l', '-w', and '-c'
+options on the command line, and print the counts in that order. POSIX
+does not mandate this behavior, though.
The 'BEGIN' rule does the argument processing. The variable
'print_total' is true if more than one file is named on the command
@@ -19056,40 +19128,46 @@ line:
# Options:
# -l only count lines
# -w only count words
- # -c only count characters
+ # -c only count bytes
+ # -m only count characters
#
- # Default is to count lines, words, characters
+ # Default is to count lines, words, bytes
#
# Requires getopt() and file transition library functions
+ # Requires mbs extension from gawkextlib
+
+ @load "mbs"
BEGIN {
# let getopt() print a message about
# invalid options. we ignore them
- while ((c = getopt(ARGC, ARGV, "lwc")) != -1) {
+ while ((c = getopt(ARGC, ARGV, "lwcm")) != -1) {
if (c == "l")
do_lines = 1
else if (c == "w")
do_words = 1
else if (c == "c")
+ do_bytes = 1
+ else if (c == "m")
do_chars = 1
}
for (i = 1; i < Optind; i++)
ARGV[i] = ""
- # if no options, do all
- if (! do_lines && ! do_words && ! do_chars)
- do_lines = do_words = do_chars = 1
+ # if no options, do lines, words, bytes
+ if (! do_lines && ! do_words && ! do_chars && ! do_bytes)
+ do_lines = do_words = do_bytes = 1
print_total = (ARGC - i > 1)
}
The 'beginfile()' function is simple; it just resets the counts of
-lines, words, and characters to zero, and saves the current file name in
-'fname':
+lines, words, characters and bytes to zero, and saves the current file
+name in 'fname':
function beginfile(file)
{
- lines = words = chars = 0
+ lines = words = chars = bytes = 0
fname = FILENAME
}
@@ -19103,26 +19181,31 @@ those numbers for the file that was just read. It relies on
tlines += lines
twords += words
tchars += chars
+ tbytes += bytes
if (do_lines)
printf "\t%d", lines
if (do_words)
printf "\t%d", words
if (do_chars)
printf "\t%d", chars
+ if (do_bytes)
+ printf "\t%d", bytes
printf "\t%s\n", fname
}
There is one rule that is executed for each line. It adds the length
-of the record, plus one, to 'chars'.(1) Adding one plus the record
-length is needed because the newline character separating records (the
-value of 'RS') is not part of the record itself, and thus not included
-in its length. Next, 'lines' is incremented for each line read, and
+of the record, plus one, to 'chars'. Adding one plus the record length
+is needed because the newline character separating records (the value of
+'RS') is not part of the record itself, and thus not included in its
+length. Similarly, it adds the length of the record in bytes, plus one,
+to 'bytes'. Next, 'lines' is incremented for each line read, and
'words' is incremented by the value of 'NF', which is the number of
"words" on this line:
# do per line
{
chars += length($0) + 1 # get newline
+ bytes += mbs_length($0) + 1
lines++
words += NF
}
@@ -19137,15 +19220,12 @@ in its length. Next, 'lines' is incremented for each line read, and
printf "\t%d", twords
if (do_chars)
printf "\t%d", tchars
+ if (do_bytes)
+ printf "\t%d", tbytes
print "\ttotal"
}
}
- ---------- Footnotes ----------
-
- (1) Because 'gawk' understands multibyte locales, this code counts
-characters, not bytes.
-

File: gawk.info, Node: Miscellaneous Programs, Next: Programs Summary, Prev: Clones, Up: Sample Programs
@@ -35111,6 +35191,7 @@ Index
* built-in functions: Functions. (line 6)
* built-in functions, evaluation order: Calling Built-in. (line 30)
* BusyBox Awk: Other Versions. (line 88)
+* bytes, counting: Wc Program. (line 6)
* C library functions, assert(): Assert Function. (line 6)
* C library functions, getopt(): Getopt Function. (line 15)
* C library functions, getpwent(): Passwd Functions. (line 16)
@@ -35303,7 +35384,7 @@ Index
* coprocesses <1>: Two-way I/O. (line 27)
* cos: Numeric Functions. (line 16)
* cosine: Numeric Functions. (line 16)
-* counting words, lines, and characters: Wc Program. (line 6)
+* counting words, lines, characters, and bytes: Wc Program. (line 6)
* csh utility: Statements/Lines. (line 45)
* csh utility, POSIXLY_CORRECT environment variable: Options. (line 405)
* csh utility, |& operator, comparison with: Two-way I/O. (line 27)
@@ -37765,7 +37846,7 @@ Index
* watchpoint (debugger): Debugging Terms. (line 42)
* watchpoints, show in debugger: Debugger Info. (line 51)
* wc utility: Wc Program. (line 6)
-* wc.awk program: Wc Program. (line 46)
+* wc.awk program: wc program. (line 51)
* Weinberger, Peter: History. (line 17)
* Weinberger, Peter <1>: Contributors. (line 12)
* where debugger command (alias for backtrace): Execution Stack.
@@ -38140,266 +38221,268 @@ Ref: Split Program-Footnote-1766907
Node: Tee Program767080
Node: Uniq Program769870
Node: Wc Program777434
-Ref: Wc Program-Footnote-1781689
-Node: Miscellaneous Programs781783
-Node: Dupword Program782996
-Node: Alarm Program785026
-Node: Translate Program789881
-Ref: Translate Program-Footnote-1794446
-Node: Labels Program794716
-Ref: Labels Program-Footnote-1798067
-Node: Word Sorting798151
-Node: History Sorting802223
-Node: Extract Program804448
-Node: Simple Sed812502
-Node: Igawk Program815576
-Ref: Igawk Program-Footnote-1829907
-Ref: Igawk Program-Footnote-2830109
-Ref: Igawk Program-Footnote-3830231
-Node: Anagram Program830346
-Node: Signature Program833408
-Node: Programs Summary834655
-Node: Programs Exercises835869
-Ref: Programs Exercises-Footnote-1839999
-Node: Advanced Features840085
-Node: Nondecimal Data842075
-Node: Array Sorting843666
-Node: Controlling Array Traversal844366
-Ref: Controlling Array Traversal-Footnote-1852734
-Node: Array Sorting Functions852852
-Ref: Array Sorting Functions-Footnote-1857943
-Node: Two-way I/O858139
-Ref: Two-way I/O-Footnote-1865860
-Ref: Two-way I/O-Footnote-2866047
-Node: TCP/IP Networking866129
-Node: Profiling869247
-Node: Advanced Features Summary878561
-Node: Internationalization880405
-Node: I18N and L10N881885
-Node: Explaining gettext882572
-Ref: Explaining gettext-Footnote-1888464
-Ref: Explaining gettext-Footnote-2888649
-Node: Programmer i18n888814
-Ref: Programmer i18n-Footnote-1893763
-Node: Translator i18n893812
-Node: String Extraction894606
-Ref: String Extraction-Footnote-1895738
-Node: Printf Ordering895824
-Ref: Printf Ordering-Footnote-1898610
-Node: I18N Portability898674
-Ref: I18N Portability-Footnote-1901130
-Node: I18N Example901193
-Ref: I18N Example-Footnote-1904468
-Ref: I18N Example-Footnote-2904541
-Node: Gawk I18N904650
-Node: I18N Summary905299
-Node: Debugger906640
-Node: Debugging907640
-Node: Debugging Concepts908081
-Node: Debugging Terms909890
-Node: Awk Debugging912465
-Ref: Awk Debugging-Footnote-1913410
-Node: Sample Debugging Session913542
-Node: Debugger Invocation914076
-Node: Finding The Bug915462
-Node: List of Debugger Commands921936
-Node: Breakpoint Control923269
-Node: Debugger Execution Control926963
-Node: Viewing And Changing Data930325
-Node: Execution Stack933866
-Node: Debugger Info935503
-Node: Miscellaneous Debugger Commands939574
-Node: Readline Support944636
-Node: Limitations945532
-Node: Debugging Summary948086
-Node: Namespaces949365
-Node: Global Namespace950476
-Node: Qualified Names951874
-Node: Default Namespace952873
-Node: Changing The Namespace953614
-Node: Naming Rules955228
-Node: Internal Name Management957076
-Node: Namespace Example958118
-Node: Namespace And Features960680
-Node: Namespace Summary962115
-Node: Arbitrary Precision Arithmetic963592
-Node: Computer Arithmetic965079
-Ref: table-numeric-ranges968845
-Ref: table-floating-point-ranges969338
-Ref: Computer Arithmetic-Footnote-1969996
-Node: Math Definitions970053
-Ref: table-ieee-formats973369
-Ref: Math Definitions-Footnote-1973972
-Node: MPFR features974077
-Node: FP Math Caution975795
-Ref: FP Math Caution-Footnote-1976867
-Node: Inexactness of computations977236
-Node: Inexact representation978196
-Node: Comparing FP Values979556
-Node: Errors accumulate980797
-Node: Getting Accuracy982230
-Node: Try To Round984940
-Node: Setting precision985839
-Ref: table-predefined-precision-strings986536
-Node: Setting the rounding mode988366
-Ref: table-gawk-rounding-modes988740
-Ref: Setting the rounding mode-Footnote-1992671
-Node: Arbitrary Precision Integers992850
-Ref: Arbitrary Precision Integers-Footnote-1996025
-Node: Checking for MPFR996174
-Node: POSIX Floating Point Problems997648
-Ref: POSIX Floating Point Problems-Footnote-11001933
-Node: Floating point summary1001971
-Node: Dynamic Extensions1004161
-Node: Extension Intro1005714
-Node: Plugin License1006980
-Node: Extension Mechanism Outline1007777
-Ref: figure-load-extension1008216
-Ref: figure-register-new-function1009781
-Ref: figure-call-new-function1010873
-Node: Extension API Description1012935
-Node: Extension API Functions Introduction1014648
-Ref: table-api-std-headers1016484
-Node: General Data Types1020733
-Ref: General Data Types-Footnote-11029363
-Node: Memory Allocation Functions1029662
-Ref: Memory Allocation Functions-Footnote-11034163
-Node: Constructor Functions1034262
-Node: API Ownership of MPFR and GMP Values1037728
-Node: Registration Functions1039041
-Node: Extension Functions1039741
-Node: Exit Callback Functions1045063
-Node: Extension Version String1046313
-Node: Input Parsers1046976
-Node: Output Wrappers1059697
-Node: Two-way processors1064209
-Node: Printing Messages1066474
-Ref: Printing Messages-Footnote-11067645
-Node: Updating ERRNO1067798
-Node: Requesting Values1068537
-Ref: table-value-types-returned1069274
-Node: Accessing Parameters1070210
-Node: Symbol Table Access1071447
-Node: Symbol table by name1071959
-Ref: Symbol table by name-Footnote-11074983
-Node: Symbol table by cookie1075111
-Ref: Symbol table by cookie-Footnote-11079296
-Node: Cached values1079360
-Ref: Cached values-Footnote-11082896
-Node: Array Manipulation1083049
-Ref: Array Manipulation-Footnote-11084140
-Node: Array Data Types1084177
-Ref: Array Data Types-Footnote-11086835
-Node: Array Functions1086927
-Node: Flattening Arrays1091425
-Node: Creating Arrays1098401
-Node: Redirection API1103168
-Node: Extension API Variables1106001
-Node: Extension Versioning1106712
-Ref: gawk-api-version1107141
-Node: Extension GMP/MPFR Versioning1108872
-Node: Extension API Informational Variables1110500
-Node: Extension API Boilerplate1111573
-Node: Changes from API V11115547
-Node: Finding Extensions1117119
-Node: Extension Example1117678
-Node: Internal File Description1118476
-Node: Internal File Ops1122556
-Ref: Internal File Ops-Footnote-11133906
-Node: Using Internal File Ops1134046
-Ref: Using Internal File Ops-Footnote-11136429
-Node: Extension Samples1136703
-Node: Extension Sample File Functions1138232
-Node: Extension Sample Fnmatch1145881
-Node: Extension Sample Fork1147368
-Node: Extension Sample Inplace1148586
-Node: Extension Sample Ord1152212
-Node: Extension Sample Readdir1153048
-Ref: table-readdir-file-types1153937
-Node: Extension Sample Revout1155004
-Node: Extension Sample Rev2way1155593
-Node: Extension Sample Read write array1156333
-Node: Extension Sample Readfile1158275
-Node: Extension Sample Time1159370
-Node: Extension Sample API Tests1161122
-Node: gawkextlib1161614
-Node: Extension summary1164532
-Node: Extension Exercises1168234
-Node: Language History1169476
-Node: V7/SVR3.11171132
-Node: SVR41173284
-Node: POSIX1174718
-Node: BTL1176099
-Node: POSIX/GNU1176828
-Node: Feature History1182606
-Node: Common Extensions1198925
-Node: Ranges and Locales1200208
-Ref: Ranges and Locales-Footnote-11204824
-Ref: Ranges and Locales-Footnote-21204851
-Ref: Ranges and Locales-Footnote-31205086
-Node: Contributors1205309
-Node: History summary1211306
-Node: Installation1212686
-Node: Gawk Distribution1213630
-Node: Getting1214114
-Node: Extracting1215077
-Node: Distribution contents1216715
-Node: Unix Installation1223195
-Node: Quick Installation1223877
-Node: Shell Startup Files1226291
-Node: Additional Configuration Options1227380
-Node: Configuration Philosophy1229695
-Node: Non-Unix Installation1232064
-Node: PC Installation1232524
-Node: PC Binary Installation1233362
-Node: PC Compiling1233797
-Node: PC Using1234914
-Node: Cygwin1238467
-Node: MSYS1239691
-Node: VMS Installation1240293
-Node: VMS Compilation1241084
-Ref: VMS Compilation-Footnote-11242313
-Node: VMS Dynamic Extensions1242371
-Node: VMS Installation Details1244056
-Node: VMS Running1246309
-Node: VMS GNV1250588
-Node: VMS Old Gawk1251323
-Node: Bugs1251794
-Node: Bug address1252457
-Node: Usenet1255439
-Node: Maintainers1256443
-Node: Other Versions1257628
-Node: Installation summary1264716
-Node: Notes1265925
-Node: Compatibility Mode1266719
-Node: Additions1267501
-Node: Accessing The Source1268426
-Node: Adding Code1269863
-Node: New Ports1276082
-Node: Derived Files1280457
-Ref: Derived Files-Footnote-11286117
-Ref: Derived Files-Footnote-21286152
-Ref: Derived Files-Footnote-31286750
-Node: Future Extensions1286864
-Node: Implementation Limitations1287522
-Node: Extension Design1288732
-Node: Old Extension Problems1289876
-Ref: Old Extension Problems-Footnote-11291394
-Node: Extension New Mechanism Goals1291451
-Ref: Extension New Mechanism Goals-Footnote-11294815
-Node: Extension Other Design Decisions1295004
-Node: Extension Future Growth1297117
-Node: Notes summary1297723
-Node: Basic Concepts1298881
-Node: Basic High Level1299562
-Ref: figure-general-flow1299844
-Ref: figure-process-flow1300529
-Ref: Basic High Level-Footnote-11303830
-Node: Basic Data Typing1304015
-Node: Glossary1307343
-Node: Copying1339228
-Node: GNU Free Documentation License1376771
-Node: Index1401891
+Node: Bytes vs. Characters777831
+Node: Using extensions779379
+Node: wc program780137
+Node: Miscellaneous Programs784995
+Node: Dupword Program786208
+Node: Alarm Program788238
+Node: Translate Program793093
+Ref: Translate Program-Footnote-1797658
+Node: Labels Program797928
+Ref: Labels Program-Footnote-1801279
+Node: Word Sorting801363
+Node: History Sorting805435
+Node: Extract Program807660
+Node: Simple Sed815714
+Node: Igawk Program818788
+Ref: Igawk Program-Footnote-1833119
+Ref: Igawk Program-Footnote-2833321
+Ref: Igawk Program-Footnote-3833443
+Node: Anagram Program833558
+Node: Signature Program836620
+Node: Programs Summary837867
+Node: Programs Exercises839081
+Ref: Programs Exercises-Footnote-1843211
+Node: Advanced Features843297
+Node: Nondecimal Data845287
+Node: Array Sorting846878
+Node: Controlling Array Traversal847578
+Ref: Controlling Array Traversal-Footnote-1855946
+Node: Array Sorting Functions856064
+Ref: Array Sorting Functions-Footnote-1861155
+Node: Two-way I/O861351
+Ref: Two-way I/O-Footnote-1869072
+Ref: Two-way I/O-Footnote-2869259
+Node: TCP/IP Networking869341
+Node: Profiling872459
+Node: Advanced Features Summary881773
+Node: Internationalization883617
+Node: I18N and L10N885097
+Node: Explaining gettext885784
+Ref: Explaining gettext-Footnote-1891676
+Ref: Explaining gettext-Footnote-2891861
+Node: Programmer i18n892026
+Ref: Programmer i18n-Footnote-1896975
+Node: Translator i18n897024
+Node: String Extraction897818
+Ref: String Extraction-Footnote-1898950
+Node: Printf Ordering899036
+Ref: Printf Ordering-Footnote-1901822
+Node: I18N Portability901886
+Ref: I18N Portability-Footnote-1904342
+Node: I18N Example904405
+Ref: I18N Example-Footnote-1907680
+Ref: I18N Example-Footnote-2907753
+Node: Gawk I18N907862
+Node: I18N Summary908511
+Node: Debugger909852
+Node: Debugging910852
+Node: Debugging Concepts911293
+Node: Debugging Terms913102
+Node: Awk Debugging915677
+Ref: Awk Debugging-Footnote-1916622
+Node: Sample Debugging Session916754
+Node: Debugger Invocation917288
+Node: Finding The Bug918674
+Node: List of Debugger Commands925148
+Node: Breakpoint Control926481
+Node: Debugger Execution Control930175
+Node: Viewing And Changing Data933537
+Node: Execution Stack937078
+Node: Debugger Info938715
+Node: Miscellaneous Debugger Commands942786
+Node: Readline Support947848
+Node: Limitations948744
+Node: Debugging Summary951298
+Node: Namespaces952577
+Node: Global Namespace953688
+Node: Qualified Names955086
+Node: Default Namespace956085
+Node: Changing The Namespace956826
+Node: Naming Rules958440
+Node: Internal Name Management960288
+Node: Namespace Example961330
+Node: Namespace And Features963892
+Node: Namespace Summary965327
+Node: Arbitrary Precision Arithmetic966804
+Node: Computer Arithmetic968291
+Ref: table-numeric-ranges972057
+Ref: table-floating-point-ranges972550
+Ref: Computer Arithmetic-Footnote-1973208
+Node: Math Definitions973265
+Ref: table-ieee-formats976581
+Ref: Math Definitions-Footnote-1977184
+Node: MPFR features977289
+Node: FP Math Caution979007
+Ref: FP Math Caution-Footnote-1980079
+Node: Inexactness of computations980448
+Node: Inexact representation981408
+Node: Comparing FP Values982768
+Node: Errors accumulate984009
+Node: Getting Accuracy985442
+Node: Try To Round988152
+Node: Setting precision989051
+Ref: table-predefined-precision-strings989748
+Node: Setting the rounding mode991578
+Ref: table-gawk-rounding-modes991952
+Ref: Setting the rounding mode-Footnote-1995883
+Node: Arbitrary Precision Integers996062
+Ref: Arbitrary Precision Integers-Footnote-1999237
+Node: Checking for MPFR999386
+Node: POSIX Floating Point Problems1000860
+Ref: POSIX Floating Point Problems-Footnote-11005145
+Node: Floating point summary1005183
+Node: Dynamic Extensions1007373
+Node: Extension Intro1008926
+Node: Plugin License1010192
+Node: Extension Mechanism Outline1010989
+Ref: figure-load-extension1011428
+Ref: figure-register-new-function1012993
+Ref: figure-call-new-function1014085
+Node: Extension API Description1016147
+Node: Extension API Functions Introduction1017860
+Ref: table-api-std-headers1019696
+Node: General Data Types1023945
+Ref: General Data Types-Footnote-11032575
+Node: Memory Allocation Functions1032874
+Ref: Memory Allocation Functions-Footnote-11037375
+Node: Constructor Functions1037474
+Node: API Ownership of MPFR and GMP Values1040940
+Node: Registration Functions1042253
+Node: Extension Functions1042953
+Node: Exit Callback Functions1048275
+Node: Extension Version String1049525
+Node: Input Parsers1050188
+Node: Output Wrappers1062909
+Node: Two-way processors1067421
+Node: Printing Messages1069686
+Ref: Printing Messages-Footnote-11070857
+Node: Updating ERRNO1071010
+Node: Requesting Values1071749
+Ref: table-value-types-returned1072486
+Node: Accessing Parameters1073422
+Node: Symbol Table Access1074659
+Node: Symbol table by name1075171
+Ref: Symbol table by name-Footnote-11078195
+Node: Symbol table by cookie1078323
+Ref: Symbol table by cookie-Footnote-11082508
+Node: Cached values1082572
+Ref: Cached values-Footnote-11086108
+Node: Array Manipulation1086261
+Ref: Array Manipulation-Footnote-11087352
+Node: Array Data Types1087389
+Ref: Array Data Types-Footnote-11090047
+Node: Array Functions1090139
+Node: Flattening Arrays1094637
+Node: Creating Arrays1101613
+Node: Redirection API1106380
+Node: Extension API Variables1109213
+Node: Extension Versioning1109924
+Ref: gawk-api-version1110353
+Node: Extension GMP/MPFR Versioning1112084
+Node: Extension API Informational Variables1113712
+Node: Extension API Boilerplate1114785
+Node: Changes from API V11118759
+Node: Finding Extensions1120331
+Node: Extension Example1120890
+Node: Internal File Description1121688
+Node: Internal File Ops1125768
+Ref: Internal File Ops-Footnote-11137118
+Node: Using Internal File Ops1137258
+Ref: Using Internal File Ops-Footnote-11139641
+Node: Extension Samples1139915
+Node: Extension Sample File Functions1141444
+Node: Extension Sample Fnmatch1149093
+Node: Extension Sample Fork1150580
+Node: Extension Sample Inplace1151798
+Node: Extension Sample Ord1155424
+Node: Extension Sample Readdir1156260
+Ref: table-readdir-file-types1157149
+Node: Extension Sample Revout1158216
+Node: Extension Sample Rev2way1158805
+Node: Extension Sample Read write array1159545
+Node: Extension Sample Readfile1161487
+Node: Extension Sample Time1162582
+Node: Extension Sample API Tests1164334
+Node: gawkextlib1164826
+Node: Extension summary1167744
+Node: Extension Exercises1171446
+Node: Language History1172688
+Node: V7/SVR3.11174344
+Node: SVR41176496
+Node: POSIX1177930
+Node: BTL1179311
+Node: POSIX/GNU1180040
+Node: Feature History1185818
+Node: Common Extensions1202137
+Node: Ranges and Locales1203420
+Ref: Ranges and Locales-Footnote-11208036
+Ref: Ranges and Locales-Footnote-21208063
+Ref: Ranges and Locales-Footnote-31208298
+Node: Contributors1208521
+Node: History summary1214518
+Node: Installation1215898
+Node: Gawk Distribution1216842
+Node: Getting1217326
+Node: Extracting1218289
+Node: Distribution contents1219927
+Node: Unix Installation1226407
+Node: Quick Installation1227089
+Node: Shell Startup Files1229503
+Node: Additional Configuration Options1230592
+Node: Configuration Philosophy1232907
+Node: Non-Unix Installation1235276
+Node: PC Installation1235736
+Node: PC Binary Installation1236574
+Node: PC Compiling1237009
+Node: PC Using1238126
+Node: Cygwin1241679
+Node: MSYS1242903
+Node: VMS Installation1243505
+Node: VMS Compilation1244296
+Ref: VMS Compilation-Footnote-11245525
+Node: VMS Dynamic Extensions1245583
+Node: VMS Installation Details1247268
+Node: VMS Running1249521
+Node: VMS GNV1253800
+Node: VMS Old Gawk1254535
+Node: Bugs1255006
+Node: Bug address1255669
+Node: Usenet1258651
+Node: Maintainers1259655
+Node: Other Versions1260840
+Node: Installation summary1267928
+Node: Notes1269137
+Node: Compatibility Mode1269931
+Node: Additions1270713
+Node: Accessing The Source1271638
+Node: Adding Code1273075
+Node: New Ports1279294
+Node: Derived Files1283669
+Ref: Derived Files-Footnote-11289329
+Ref: Derived Files-Footnote-21289364
+Ref: Derived Files-Footnote-31289962
+Node: Future Extensions1290076
+Node: Implementation Limitations1290734
+Node: Extension Design1291944
+Node: Old Extension Problems1293088
+Ref: Old Extension Problems-Footnote-11294606
+Node: Extension New Mechanism Goals1294663
+Ref: Extension New Mechanism Goals-Footnote-11298027
+Node: Extension Other Design Decisions1298216
+Node: Extension Future Growth1300329
+Node: Notes summary1300935
+Node: Basic Concepts1302093
+Node: Basic High Level1302774
+Ref: figure-general-flow1303056
+Ref: figure-process-flow1303741
+Ref: Basic High Level-Footnote-11307042
+Node: Basic Data Typing1307227
+Node: Glossary1310555
+Node: Copying1342440
+Node: GNU Free Documentation License1379983
+Node: Index1405103

End Tag Table
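The per-line rule in the `wc program` hunk above is small enough to try on its own. The following is a minimal sketch, not the patched `wc.awk`: it assumes only a POSIX `awk` on the PATH, leaves out `getopt()`, the file-transition functions and the `mbs` extension, and uses a hypothetical wrapper name, `wc_sketch`. Because it has no `mbs_length()`, it reports a single size column, which is characters in an implementation that follows POSIX and bytes in one that does not; for ASCII-only input the two agree.

```shell
# wc_sketch: hypothetical helper, not part of gawk or of this patch.
# Reads standard input and prints "lines words chars" on one line.
wc_sketch() {
    awk '
    {
        chars += length($0) + 1   # + 1 for the newline that RS strips
        lines++
        words += NF               # NF is the number of fields ("words")
    }
    END {
        # %d prints 0 for counters that were never touched (empty input)
        printf "%d %d %d\n", lines, words, chars
    }'
}

printf 'hello world\nfoo\n' | wc_sketch   # prints: 2 3 16
```

Adding a separate byte column would need `@load "mbs"` and `mbs_length($0)`, as the patch itself does.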
diff --git a/doc/gawk.texi b/doc/gawk.texi
index 9446e696..91625d06 100644
--- a/doc/gawk.texi
+++ b/doc/gawk.texi
@@ -26761,19 +26761,76 @@ as fast.'' Consider how to rewrite the logic to follow this suggestion.
@node Wc Program
@subsection Counting Things
-@c FIXME: One day, update to current POSIX version of wc
-
-@cindex counting words, lines, and characters
+@cindex counting words, lines, characters, and bytes
@cindex input files @subentry counting elements in
@cindex words @subentry counting
@cindex characters @subentry counting
@cindex lines @subentry counting
+@cindex bytes @subentry counting
@cindex @command{wc} utility
-The @command{wc} (word count) utility counts lines, words, and characters in
-one or more input files. Its usage is as follows:
+The @command{wc} (word count) utility counts lines, words, characters
+and bytes in one or more input files.
+
+@menu
+* Bytes vs. Characters:: Modern character sets.
+* Using extensions:: A brief intro to extensions.
+* @command{wc} program:: Code for @file{wc.awk}.
+@end menu
+
+@node Bytes vs. Characters
+@subsubsection Modern Character Sets
+
+In the early days of computing, single bytes were used for storing
+characters. The most common character sets were ASCII and EBCDIC,
+which each provided all the English upper- and lowercase letters, the 10
+Hindu-Arabic numerals from 0 through 9, and a number of other standard
+punctuation and control characters.
+
+Today, the most popular character set in use is Unicode (of which ASCII
+is a pure subset). Unicode provides tens of thousands of unique characters
+(called @dfn{code points}) to cover most existing human languages (living
+and dead) and a number of nonhuman ones as well (such as Klingon and
+J.R.R.@: Tolkien's elvish languages).
+
+To save space in files, Unicode code points are @dfn{encoded}, where each
+character takes from one to four bytes in the file. UTF-8 is possibly
+the most popular of such @dfn{multibyte encodings}.
+
+The POSIX standard requires that @command{awk} function in terms
+of characters, not bytes. Thus in @command{gawk}, @code{length()},
+@code{substr()}, @code{split()}, @code{match()} and the other string
+functions (@pxref{String Functions}) all work in terms of characters in
+the local character set, and not in terms of bytes. (Not all @command{awk}
+implementations do so, though.)
+
+There is no standard, built-in way to distinguish characters from bytes
+in an @command{awk} program. For an @command{awk} implementation of
+@command{wc}, which needs to make such a distinction, we will have to
+use an external extension.
+
+@node Using extensions
+@subsubsection A Brief Introduction To Extensions
+
+Loadable extensions are presented in full detail in @ref{Dynamic Extensions}.
+They provide a way to add functions to @command{gawk} which can call
+out to other facilities written in C or C++.
+
+For the purposes of
+@file{wc.awk}, it's enough to know that the extension is loaded
+with the @code{@@load} directive, and the additional function we
+will use is called @code{mbs_length()}. This function returns the
+number of bytes in a string, and not the number of characters.
+
+The @code{"mbs"} extension comes from the @code{gawkextlib}
+project. @xref{gawkextlib} for more information.
+
+@node @command{wc} program
+@subsubsection Code for @file{wc.awk}
+
+The usage for @command{wc} is as follows:
@display
-@command{wc} [@option{-lwc}] [@var{files} @dots{}]
+@command{wc} [@option{-lwcm}] [@var{files} @dots{}]
@end display
If no files are specified on the command line, @command{wc} reads its standard
@@ -26791,24 +26848,30 @@ by spaces and/or TABs. Luckily, this is the normal way @command{awk} separates
fields in its input data.
@item -c
+Count only bytes.
+Once upon a time, the @samp{c} in this option stood for ``characters.''
+But, as explained earlier, bytes and characters are no longer synonymous
+with each other.
+
+@item -m
Count only characters.
@end table
Implementing @command{wc} in @command{awk} is particularly elegant,
because @command{awk} does a lot of the work for us; it splits lines into
words (i.e., fields) and counts them, it counts lines (i.e., records),
-and it can easily tell us how long a line is.
+and it can easily tell us how long a line is in characters.
This program uses the @code{getopt()} library function
(@pxref{Getopt Function})
and the file-transition functions
(@pxref{Filetrans Function}).
-This version has one notable difference from traditional versions of
+This version has one notable difference from older versions of
@command{wc}: it always prints the counts in the order lines, words,
-and characters. Traditional versions note the order of the @option{-l},
+characters and bytes. Older versions note the order of the @option{-l},
@option{-w}, and @option{-c} options on the command line, and print the
-counts in that order.
+counts in that order. POSIX does not mandate this behavior, though.
The @code{BEGIN} rule does the argument processing. The variable
@code{print_total} is true if more than one file is named on the
@@ -26824,6 +26887,7 @@ command line:
#
# Arnold Robbins, arnold@@skeeve.com, Public Domain
# May 1993
+# Revised September 2020
@c endfile
@end ignore
@c file eg/prog/wc.awk
@@ -26831,29 +26895,35 @@ command line:
# Options:
# -l only count lines
# -w only count words
-# -c only count characters
+# -c only count bytes
+# -m only count characters
#
-# Default is to count lines, words, characters
+# Default is to count lines, words, bytes
#
# Requires getopt() and file transition library functions
+# Requires mbs extension from gawkextlib
+
+@@load "mbs"
BEGIN @{
# let getopt() print a message about
# invalid options. we ignore them
- while ((c = getopt(ARGC, ARGV, "lwc")) != -1) @{
+ while ((c = getopt(ARGC, ARGV, "lwcm")) != -1) @{
if (c == "l")
do_lines = 1
else if (c == "w")
do_words = 1
else if (c == "c")
+ do_bytes = 1
+ else if (c == "m")
do_chars = 1
@}
for (i = 1; i < Optind; i++)
ARGV[i] = ""
- # if no options, do all
- if (! do_lines && ! do_words && ! do_chars)
- do_lines = do_words = do_chars = 1
+ # if no options, do lines, words, bytes
+ if (! do_lines && ! do_words && ! do_chars && ! do_bytes)
+ do_lines = do_words = do_bytes = 1
print_total = (ARGC - i > 1)
@}
@@ -26861,14 +26931,14 @@ BEGIN @{
@end example
The @code{beginfile()} function is simple; it just resets the counts of lines,
-words, and characters to zero, and saves the current @value{FN} in
+words, characters and bytes to zero, and saves the current @value{FN} in
@code{fname}:
@example
@c file eg/prog/wc.awk
function beginfile(file)
@{
- lines = words = chars = 0
+ lines = words = chars = bytes = 0
fname = FILENAME
@}
@c endfile
@@ -26886,6 +26956,7 @@ function endfile(file)
tlines += lines
twords += words
tchars += chars
+ tbytes += bytes
if (do_lines)
printf "\t%d", lines
@group
@@ -26894,26 +26965,28 @@ function endfile(file)
@end group
if (do_chars)
printf "\t%d", chars
+ if (do_bytes)
+ printf "\t%d", bytes
printf "\t%s\n", fname
@}
@c endfile
@end example
There is one rule that is executed for each line. It adds the length of
-the record, plus one, to @code{chars}.@footnote{Because @command{gawk}
-understands multibyte locales, this code counts characters, not bytes.}
-Adding one plus the record length
+the record, plus one, to @code{chars}. Adding one plus the record length
is needed because the newline character separating records (the value
of @code{RS}) is not part of the record itself, and thus not included
-in its length. Next, @code{lines} is incremented for each line read,
-and @code{words} is incremented by the value of @code{NF}, which is the
-number of ``words'' on this line:
+in its length. Similarly, it adds the length of the record in bytes,
+plus one, to @code{bytes}. Next, @code{lines} is incremented for each
+line read, and @code{words} is incremented by the value of @code{NF},
+which is the number of ``words'' on this line:
@example
@c file eg/prog/wc.awk
# do per line
@{
chars += length($0) + 1 # get newline
+ bytes += mbs_length($0) + 1
lines++
words += NF
@}
@@ -26932,6 +27005,8 @@ END @{
printf "\t%d", twords
if (do_chars)
printf "\t%d", tchars
+ if (do_bytes)
+ printf "\t%d", tbytes
print "\ttotal"
@}
@}
diff --git a/doc/gawktexi.in b/doc/gawktexi.in
index f96ff861..f982ae8b 100644
--- a/doc/gawktexi.in
+++ b/doc/gawktexi.in
@@ -25771,19 +25771,76 @@ as fast.'' Consider how to rewrite the logic to follow this suggestion.
@node Wc Program
@subsection Counting Things
-@c FIXME: One day, update to current POSIX version of wc
-
-@cindex counting words, lines, and characters
+@cindex counting words, lines, characters, and bytes
@cindex input files @subentry counting elements in
@cindex words @subentry counting
@cindex characters @subentry counting
@cindex lines @subentry counting
+@cindex bytes @subentry counting
@cindex @command{wc} utility
-The @command{wc} (word count) utility counts lines, words, and characters in
-one or more input files. Its usage is as follows:
+The @command{wc} (word count) utility counts lines, words, characters
+and bytes in one or more input files.
+
+@menu
+* Bytes vs. Characters:: Modern character sets.
+* Using extensions:: A brief intro to extensions.
+* @command{wc} program:: Code for @file{wc.awk}.
+@end menu
+
+@node Bytes vs. Characters
+@subsubsection Modern Character Sets
+
+In the early days of computing, single bytes were used for storing
+characters. The most common character sets were ASCII and EBCDIC,
+which each provided all the English upper- and lowercase letters, the 10
+Hindu-Arabic numerals from 0 through 9, and a number of other standard
+punctuation and control characters.
+
+Today, the most popular character set in use is Unicode (of which ASCII
+is a pure subset). Unicode provides tens of thousands of unique characters
+(called @dfn{code points}) to cover most existing human languages (living
+and dead) and a number of nonhuman ones as well (such as Klingon and
+J.R.R.@: Tolkien's elvish languages).
+
+To save space in files, Unicode code points are @dfn{encoded}, where each
+character takes from one to four bytes in the file. UTF-8 is possibly
+the most popular of such @dfn{multibyte encodings}.
+
+The POSIX standard requires that @command{awk} function in terms
+of characters, not bytes. Thus in @command{gawk}, @code{length()},
+@code{substr()}, @code{split()}, @code{match()} and the other string
+functions (@pxref{String Functions}) all work in terms of characters in
+the local character set, and not in terms of bytes. (Not all @command{awk}
+implementations do so, though.)
+
+There is no standard, built-in way to distinguish characters from bytes
+in an @command{awk} program. For an @command{awk} implementation of
+@command{wc}, which needs to make such a distinction, we will have to
+use an external extension.
+
+@node Using extensions
+@subsubsection A Brief Introduction To Extensions
+
+Loadable extensions are presented in full detail in @ref{Dynamic Extensions}.
+They provide a way to add functions to @command{gawk} which can call
+out to other facilities written in C or C++.
+
+For the purposes of
+@file{wc.awk}, it's enough to know that the extension is loaded
+with the @code{@@load} directive, and the additional function we
+will use is called @code{mbs_length()}. This function returns the
+number of bytes in a string, and not the number of characters.
+
+The @code{"mbs"} extension comes from the @code{gawkextlib}
+project. @xref{gawkextlib}, for more information.
+
+@node @command{wc} program
+@subsubsection Code for @file{wc.awk}
+
+The usage for @command{wc} is as follows:
@display
-@command{wc} [@option{-lwc}] [@var{files} @dots{}]
+@command{wc} [@option{-lwcm}] [@var{files} @dots{}]
@end display
If no files are specified on the command line, @command{wc} reads its standard
@@ -25801,24 +25858,30 @@ by spaces and/or TABs. Luckily, this is the normal way @command{awk} separates
fields in its input data.
@item -c
+Count only bytes.
+Once upon a time, the @samp{c} in this option stood for ``characters.''
+But, as explained earlier, bytes and characters are no longer synonymous
+with each other.
+
+@item -m
Count only characters.
@end table
Implementing @command{wc} in @command{awk} is particularly elegant,
because @command{awk} does a lot of the work for us; it splits lines into
words (i.e., fields) and counts them, it counts lines (i.e., records),
-and it can easily tell us how long a line is.
+and it can easily tell us how long a line is in characters.
This program uses the @code{getopt()} library function
(@pxref{Getopt Function})
and the file-transition functions
(@pxref{Filetrans Function}).
-This version has one notable difference from traditional versions of
+This version has one notable difference from older versions of
@command{wc}: it always prints the counts in the order lines, words,
-and characters. Traditional versions note the order of the @option{-l},
+characters and bytes. Older versions note the order of the @option{-l},
@option{-w}, and @option{-c} options on the command line, and print the
-counts in that order.
+counts in that order. POSIX does not mandate this behavior, though.
The @code{BEGIN} rule does the argument processing. The variable
@code{print_total} is true if more than one file is named on the
@@ -25834,6 +25897,7 @@ command line:
#
# Arnold Robbins, arnold@@skeeve.com, Public Domain
# May 1993
+# Revised September 2020
@c endfile
@end ignore
@c file eg/prog/wc.awk
@@ -25841,29 +25905,35 @@ command line:
# Options:
# -l only count lines
# -w only count words
-# -c only count characters
+# -c only count bytes
+# -m only count characters
#
-# Default is to count lines, words, characters
+# Default is to count lines, words, bytes
#
# Requires getopt() and file transition library functions
+# Requires mbs extension from gawkextlib
+
+@@load "mbs"
BEGIN @{
# let getopt() print a message about
# invalid options. we ignore them
- while ((c = getopt(ARGC, ARGV, "lwc")) != -1) @{
+ while ((c = getopt(ARGC, ARGV, "lwcm")) != -1) @{
if (c == "l")
do_lines = 1
else if (c == "w")
do_words = 1
else if (c == "c")
+ do_bytes = 1
+ else if (c == "m")
do_chars = 1
@}
for (i = 1; i < Optind; i++)
ARGV[i] = ""
- # if no options, do all
- if (! do_lines && ! do_words && ! do_chars)
- do_lines = do_words = do_chars = 1
+ # if no options, do lines, words, bytes
+ if (! do_lines && ! do_words && ! do_chars && ! do_bytes)
+ do_lines = do_words = do_bytes = 1
print_total = (ARGC - i > 1)
@}
@@ -25871,14 +25941,14 @@ BEGIN @{
@end example
The @code{beginfile()} function is simple; it just resets the counts of lines,
-words, and characters to zero, and saves the current @value{FN} in
+words, characters and bytes to zero, and saves the current @value{FN} in
@code{fname}:
@example
@c file eg/prog/wc.awk
function beginfile(file)
@{
- lines = words = chars = 0
+ lines = words = chars = bytes = 0
fname = FILENAME
@}
@c endfile
@@ -25896,6 +25966,7 @@ function endfile(file)
tlines += lines
twords += words
tchars += chars
+ tbytes += bytes
if (do_lines)
printf "\t%d", lines
@group
@@ -25904,26 +25975,28 @@ function endfile(file)
@end group
if (do_chars)
printf "\t%d", chars
+ if (do_bytes)
+ printf "\t%d", bytes
printf "\t%s\n", fname
@}
@c endfile
@end example
There is one rule that is executed for each line. It adds the length of
-the record, plus one, to @code{chars}.@footnote{Because @command{gawk}
-understands multibyte locales, this code counts characters, not bytes.}
-Adding one plus the record length
+the record, plus one, to @code{chars}. Adding one plus the record length
is needed because the newline character separating records (the value
of @code{RS}) is not part of the record itself, and thus not included
-in its length. Next, @code{lines} is incremented for each line read,
-and @code{words} is incremented by the value of @code{NF}, which is the
-number of ``words'' on this line:
+in its length. Similarly, it adds the length of the record in bytes,
+plus one, to @code{bytes}. Next, @code{lines} is incremented for each
+line read, and @code{words} is incremented by the value of @code{NF},
+which is the number of ``words'' on this line:
@example
@c file eg/prog/wc.awk
# do per line
@{
chars += length($0) + 1 # get newline
+ bytes += mbs_length($0) + 1
lines++
words += NF
@}
@@ -25942,6 +26015,8 @@ END @{
printf "\t%d", twords
if (do_chars)
printf "\t%d", tchars
+ if (do_bytes)
+ printf "\t%d", tbytes
print "\ttotal"
@}
@}