1 files changed, 99 insertions, 24 deletions
diff --git a/doc/gawktexi.in b/doc/gawktexi.in
index f96ff861..f982ae8b 100644
--- a/doc/gawktexi.in
+++ b/doc/gawktexi.in
@@ -25771,19 +25771,76 @@ as fast.''  Consider how to rewrite the logic to follow this suggestion.
 @node Wc Program
 @subsection Counting Things
 
-@c FIXME: One day, update to current POSIX version of wc
-
-@cindex counting words, lines, and characters
+@cindex counting words, lines, characters, and bytes
 @cindex input files @subentry counting elements in
 @cindex words @subentry counting
 @cindex characters @subentry counting
 @cindex lines @subentry counting
+@cindex bytes @subentry counting
 @cindex @command{wc} utility
-The @command{wc} (word count) utility counts lines, words, and characters in
-one or more input files. Its usage is as follows:
+The @command{wc} (word count) utility counts lines, words, characters,
+and bytes in one or more input files.
+
+@menu
+* Bytes vs. Characters::        Modern character sets.
+* Using extensions::            A brief intro to extensions.
+* @command{wc} program::                  Code for @file{wc.awk}.
+@end menu
+
+@node Bytes vs. Characters
+@subsubsection Modern Character Sets
+
+In the early days of computing, single bytes were used for storing
+characters.  The most common character sets were ASCII and EBCDIC,
+which each provided all the English upper- and lowercase letters, the 10
+Hindu-Arabic numerals from 0 through 9, and a number of other standard
+punctuation and control characters.
+
+Today, the most popular character set in use is Unicode (of which ASCII
+is a pure subset). Unicode provides tens of thousands of unique characters
+(called @dfn{code points}) to cover most existing human languages (living
+and dead) and a number of nonhuman ones as well (such as Klingon and
+J.R.R.@: Tolkien's elvish languages).
+
+To save space in files, Unicode code points are @dfn{encoded}, where each
+character takes from one to four bytes in the file.  UTF-8 is possibly
+the most popular of such @dfn{multibyte encodings}.
+
+The POSIX standard requires that @command{awk} function in terms
+of characters, not bytes.  Thus in @command{gawk}, @code{length()},
+@code{substr()}, @code{split()}, @code{match()} and the other string
+functions (@pxref{String Functions}) all work in terms of characters in
+the local character set, and not in terms of bytes. (Not all @command{awk}
+implementations do so, though.)
+
+There is no standard, built-in way to distinguish characters from bytes
+in an @command{awk} program.  For an @command{awk} implementation of
+@command{wc}, which needs to make such a distinction, we will have to
+use an external extension.
+
+@node Using extensions
+@subsubsection A Brief Introduction To Extensions
+
+Loadable extensions are presented in full detail in @ref{Dynamic Extensions}.
+They provide a way to add functions to @command{gawk} which can call
+out to other facilities written in C or C++.
+
+For the purposes of
+@file{wc.awk}, it's enough to know that the extension is loaded
+with the @code{@@load} directive, and the additional function we
+will use is called @code{mbs_length()}.  This function returns the
+number of bytes in a string, and not the number of characters.
+
+The @code{"mbs"} extension comes from the @code{gawkextlib}
+project. @xref{gawkextlib}, for more information.
+
+@node @command{wc} program
+@subsubsection Code for @file{wc.awk}
+
+The usage for @command{wc} is as follows:
 
 @display
-@command{wc} [@option{-lwc}] [@var{files} @dots{}]
+@command{wc} [@option{-lwcm}] [@var{files} @dots{}]
 @end display
 
 If no files are specified on the command line, @command{wc} reads its standard
@@ -25801,24 +25858,30 @@ by spaces and/or TABs.  Luckily, this is the normal way @command{awk} separates
 fields in its input data.
 
 @item -c
+Count only bytes.
+Once upon a time, the @samp{c} in this option stood for ``characters.''
+But, as explained earlier, bytes and characters are no longer synonymous
+with each other.
+
+@item -m
 Count only characters.
 @end table
 
 Implementing @command{wc} in @command{awk} is particularly elegant,
 because @command{awk} does a lot of the work for us; it splits lines into
 words (i.e., fields) and counts them, it counts lines (i.e., records),
-and it can easily tell us how long a line is.
+and it can easily tell us how long a line is in characters.
 
 This program uses the @code{getopt()} library function
 (@pxref{Getopt Function})
 and the file-transition functions
 (@pxref{Filetrans Function}).
 
-This version has one notable difference from traditional versions of
+This version has one notable difference from older versions of
 @command{wc}: it always prints the counts in the order lines, words,
-and characters.  Traditional versions note the order of the @option{-l},
+characters and bytes.  Older versions note the order of the @option{-l},
 @option{-w}, and @option{-c} options on the command line, and print the
-counts in that order.
+counts in that order.  POSIX does not mandate this behavior, though.
 
 The @code{BEGIN} rule does the argument processing.  The variable
 @code{print_total} is true if more than one file is named on the
@@ -25834,6 +25897,7 @@ command line:
 #
 # Arnold Robbins, arnold@@skeeve.com, Public Domain
 # May 1993
+# Revised September 2020
 @c endfile
 @end ignore
 @c file eg/prog/wc.awk
@@ -25841,29 +25905,35 @@ command line:
 # Options:
 #    -l    only count lines
 #    -w    only count words
-#    -c    only count characters
+#    -c    only count bytes
+#    -m    only count characters
 #
-# Default is to count lines, words, characters
+# Default is to count lines, words, bytes
 #
 # Requires getopt() and file transition library functions
+# Requires mbs extension from gawkextlib
+
+@@load "mbs"
 
 BEGIN @{
     # let getopt() print a message about
     # invalid options. we ignore them
-    while ((c = getopt(ARGC, ARGV, "lwc")) != -1) @{
+    while ((c = getopt(ARGC, ARGV, "lwcm")) != -1) @{
         if (c == "l")
             do_lines = 1
         else if (c == "w")
             do_words = 1
         else if (c == "c")
+            do_bytes = 1
+        else if (c == "m")
             do_chars = 1
     @}
     for (i = 1; i < Optind; i++)
         ARGV[i] = ""
 
-    # if no options, do all
-    if (! do_lines && ! do_words && ! do_chars)
-        do_lines = do_words = do_chars = 1
+    # if no options, do lines, words, bytes
+    if (! do_lines && ! do_words && ! do_chars && ! do_bytes)
+        do_lines = do_words = do_bytes = 1
 
     print_total = (ARGC - i > 1)
 @}
@@ -25871,14 +25941,14 @@ BEGIN @{
 @end example
 
 The @code{beginfile()} function is simple; it just resets the counts of lines,
-words, and characters to zero, and saves the current @value{FN} in
+words, characters and bytes to zero, and saves the current @value{FN} in
 @code{fname}:
 
 @example
 @c file eg/prog/wc.awk
 function beginfile(file)
 @{
-    lines = words = chars = 0
+    lines = words = chars = bytes = 0
     fname = FILENAME
 @}
 @c endfile
@@ -25896,6 +25966,7 @@ function endfile(file)
     tlines += lines
     twords += words
     tchars += chars
+    tbytes += bytes
     if (do_lines)
         printf "\t%d", lines
 @group
@@ -25904,26 +25975,28 @@ function endfile(file)
 @end group
     if (do_chars)
         printf "\t%d", chars
+    if (do_bytes)
+        printf "\t%d", bytes
     printf "\t%s\n", fname
 @}
 @c endfile
 @end example
 
 There is one rule that is executed for each line. It adds the length of
-the record, plus one, to @code{chars}.@footnote{Because @command{gawk}
-understands multibyte locales, this code counts characters, not bytes.}
-Adding one plus the record length
+the record, plus one, to @code{chars}.  Adding one plus the record length
 is needed because the newline character separating records (the value
 of @code{RS}) is not part of the record itself, and thus not included
-in its length.  Next, @code{lines} is incremented for each line read,
-and @code{words} is incremented by the value of @code{NF}, which is the
-number of ``words'' on this line:
+in its length.  Similarly, it adds the length of the record in bytes,
+plus one, to @code{bytes}.  Next, @code{lines} is incremented for each
+line read, and @code{words} is incremented by the value of @code{NF},
+which is the number of ``words'' on this line:
 
 @example
 @c file eg/prog/wc.awk
 # do per line
 @{
     chars += length($0) + 1    # get newline
+    bytes += mbs_length($0) + 1
     lines++
     words += NF
 @}
@@ -25942,6 +26015,8 @@ END @{
             printf "\t%d", twords
         if (do_chars)
             printf "\t%d", tchars
+        if (do_bytes)
+            printf "\t%d", tbytes
         print "\ttotal"
     @}
 @}