Diffstat (limited to 'doc/gawktexi.in')
-rw-r--r-- | doc/gawktexi.in | 123 |
1 files changed, 99 insertions, 24 deletions
diff --git a/doc/gawktexi.in b/doc/gawktexi.in
index f96ff861..f982ae8b 100644
--- a/doc/gawktexi.in
+++ b/doc/gawktexi.in
@@ -25771,19 +25771,76 @@ as fast.'' Consider how to rewrite the logic to follow this suggestion.
 
 @node Wc Program
 @subsection Counting Things
-@c FIXME: One day, update to current POSIX version of wc
-
-@cindex counting words, lines, and characters
+@cindex counting words, lines, characters, and bytese
 @cindex input files @subentry counting elements in
 @cindex words @subentry counting
 @cindex characters @subentry counting
 @cindex lines @subentry counting
+@cindex bytes @subentry counting
 @cindex @command{wc} utility
-The @command{wc} (word count) utility counts lines, words, and characters in
-one or more input files. Its usage is as follows:
+The @command{wc} (word count) utility counts lines, words, characters
+and bytes in one or more input files.
+
+@menu
+* Bytes vs. Characters:: Modern character sets.
+* Using extensions:: A brief intro to extensions.
+* @command{wc} program:: Code for @file{wc.awk}.
+@end menu
+
+@node Bytes vs. Characters
+@subsubsection Modern Character Sets
+
+In the early days of computing, single bytes were used for storing
+characters. The most common character sets were ASCII and EBCDIC,
+which each provided all the English upper- and lowercase letters, the 10
+Hindu-Arabic numerals from 0 through 9, and a number of other standard
+punctuation and control characters.
+
+Today, the most popular character set in use is Unicode (of which ASCII
+is a pure subset). Unicode provides tens of thousands of unique characters
+(called @dfn{code points}) to cover most existing human languages (living
+and dead) and a number of nonhuman ones as well (such as Klingon and
+J.R.R.@: Tolkien's elvish languages).
+
+To save space in files, Unicode code points are @dfn{encoded}, where each
+character takes from one to four bytes in the file. UTF-8 is possibly
+the most popular of such @dfn{multibyte encodings}.
+
+The POSIX standard requires that @command{awk} function in terms
+of characters, not bytes. Thus in @command{gawk}, @code{length()},
+@code{substr()}, @code{split()}, @code{match()} and the other string
+functions (@pxref{String Functions}) all work in terms of characters in
+the local character set, and not in terms of bytes. (Not all @command{awk}
+implementations do so, though).
+
+There is no standard, built-in way to distinguish characters from bytes
+in an @command{awk} program. For an @command{awk} implementation of
+@command{wc}, which needs to make such a distinction, we will have to
+use an external extension.
+
+@node Using extensions
+@subsubsection A Brief Introduction To Extensions
+
+Loadable extensions are presented in full detail in @ref{Dynamic Extensions}.
+They provide a way to add functions to @command{gawk} which can call
+out to other facilities written in C or C++.
+
+For the purposes of
+@file{wc.awk}, it's enough to know that the extension is loaded
+with the @code{@@load} directive, and the additional function we
+will use is called @code{mbs_length()}. This function returns the
+number of bytes in a string, and not the number of characters.
+
+The @code{"mbs"} extension comes from the @code{gawkextlib}
+project. @xref{gawkextlib} for more information.
+
+@node @command{wc} program
+@subsubsection Code for @file{wc.awk}
+
+The usage for @command{wc} is as follows:
 
 @display
-@command{wc} [@option{-lwc}] [@var{files} @dots{}]
+@command{wc} [@option{-lwcm}] [@var{files} @dots{}]
 @end display
 
 If no files are specified on the command line, @command{wc} reads its standard
@@ -25801,24 +25858,30 @@ by spaces and/or TABs. Luckily, this is the normal way @command{awk}
 separates fields in its input data.
 
 @item -c
+Count only bytes.
+Once upon a time, the @samp{c} in this option stood for ``characters.''
+But, as explained earlier, bytes and character are no longer synonymous
+with each other.
+
+@item -m
 Count only characters.
 @end table
 
 Implementing @command{wc} in @command{awk} is particularly elegant,
 because @command{awk} does a lot of the work for us; it splits lines into
 words (i.e., fields) and counts them, it counts lines (i.e., records),
-and it can easily tell us how long a line is.
+and it can easily tell us how long a line is in characters.
 
 This program uses the @code{getopt()} library function
 (@pxref{Getopt Function})
 and the file-transition functions
 (@pxref{Filetrans Function}).
 
-This version has one notable difference from traditional versions of
+This version has one notable difference from older versions of
 @command{wc}: it always prints the counts in the order lines, words,
-and characters. Traditional versions note the order of the @option{-l},
+characters and bytes. Older versions note the order of the @option{-l},
 @option{-w}, and @option{-c} options on the command line, and print the
-counts in that order.
+counts in that order. POSIX does not mandate this behavior, though.
 
 The @code{BEGIN} rule does the argument processing.
 The variable @code{print_total} is true if more than one file is named on the
@@ -25834,6 +25897,7 @@ command line:
 #
 # Arnold Robbins, arnold@@skeeve.com, Public Domain
 # May 1993
+# Revised September 2020
 @c endfile
 @end ignore
 @c file eg/prog/wc.awk
@@ -25841,29 +25905,35 @@ command line:
 # Options:
 # -l only count lines
 # -w only count words
-# -c only count characters
+# -c only count bytes
+# -m only count characters
 #
-# Default is to count lines, words, characters
+# Default is to count lines, words, bytes
 #
 # Requires getopt() and file transition library functions
+# Requires mbs extension from gawkextlib
+
+@@load "mbs"
 
 BEGIN @{
     # let getopt() print a message about
     # invalid options. we ignore them
-    while ((c = getopt(ARGC, ARGV, "lwc")) != -1) @{
+    while ((c = getopt(ARGC, ARGV, "lwcm")) != -1) @{
         if (c == "l")
             do_lines = 1
         else if (c == "w")
             do_words = 1
         else if (c == "c")
+            do_bytes = 1
+        else if (c == "m")
             do_chars = 1
     @}
     for (i = 1; i < Optind; i++)
         ARGV[i] = ""
 
-    # if no options, do all
-    if (! do_lines && ! do_words && ! do_chars)
-        do_lines = do_words = do_chars = 1
+    # if no options, do lines, words, bytes
+    if (! do_lines && ! do_words && ! do_chars && ! do_bytes)
+        do_lines = do_words = do_bytes = 1
 
     print_total = (ARGC - i > 1)
 @}
@@ -25871,14 +25941,14 @@ BEGIN @{
 @end example
 
 The @code{beginfile()} function is simple; it just resets the counts of lines,
-words, and characters to zero, and saves the current @value{FN} in
+words, characters and bytes to zero, and saves the current @value{FN} in
 @code{fname}:
 
 @example
 @c file eg/prog/wc.awk
 function beginfile(file)
 @{
-    lines = words = chars = 0
+    lines = words = chars = bytes = 0
     fname = FILENAME
 @}
 @c endfile
@@ -25896,6 +25966,7 @@ function endfile(file)
     tlines += lines
     twords += words
     tchars += chars
+    tbytes += bytes
     if (do_lines)
         printf "\t%d", lines
 @group
@@ -25904,26 +25975,28 @@ function endfile(file)
 @end group
     if (do_chars)
         printf "\t%d", chars
+    if (do_bytes)
+        printf "\t%d", bytes
     printf "\t%s\n", fname
 @}
 @c endfile
 @end example
 
 There is one rule that is executed for each line. It adds the length of
-the record, plus one, to @code{chars}.@footnote{Because @command{gawk}
-understands multibyte locales, this code counts characters, not bytes.}
-Adding one plus the record length
+the record, plus one, to @code{chars}. Adding one plus the record length
 is needed because the newline character separating records (the value
 of @code{RS}) is not part of the record itself, and thus not included
-in its length. Next, @code{lines} is incremented for each line read,
-and @code{words} is incremented by the value of @code{NF}, which is the
-number of ``words'' on this line:
+in its length. Similarly, it adds the length of the record in bytes,
+plus one, to @code{bytes}. Next, @code{lines} is incremented for each
+line read, and @code{words} is incremented by the value of @code{NF},
+which is the number of ``words'' on this line:
 
 @example
 @c file eg/prog/wc.awk
 # do per line
 @{
     chars += length($0) + 1 # get newline
+    bytes += mbs_length($0) + 1
     lines++
     words += NF
 @}
@@ -25942,6 +26015,8 @@ END @{
             printf "\t%d", twords
         if (do_chars)
             printf "\t%d", tchars
+        if (do_bytes)
+            printf "\t%d", tbytes
         print "\ttotal"
     @}
 @}
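
Note on the byte count added by this patch: the following is a minimal sketch, not part of the commit, for checking the byte-versus-character distinction without installing the mbs extension. It assumes a POSIX shell and any awk on the PATH. The helper name count_bytes is hypothetical. Under LC_ALL=C every byte is a single character, so length($0) reports bytes, which should correspond to what the patch obtains from mbs_length($0).

```shell
# Hypothetical helper (not from the patch): count the bytes on standard input.
# LC_ALL=C forces a single-byte locale, so length($0) counts bytes.
# One is added per record for the newline, exactly as wc.awk does.
count_bytes() {
    LC_ALL=C awk '{ bytes += length($0) + 1 } END { print bytes + 0 }'
}

# "h", a two-byte UTF-8 "é" (octal 303 251), "llo", and a newline
printf 'h\303\251llo\n' | count_bytes    # prints 7
```

Like the per-line rule in wc.awk, this sketch counts one newline per record, so it overcounts by one if the input does not end with a newline.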