Diffstat (limited to 'doc/gawktexi.in')
-rw-r--r-- doc/gawktexi.in 123
1 files changed, 99 insertions, 24 deletions
diff --git a/doc/gawktexi.in b/doc/gawktexi.in
index f96ff861..f982ae8b 100644
--- a/doc/gawktexi.in
+++ b/doc/gawktexi.in
@@ -25771,19 +25771,76 @@ as fast.'' Consider how to rewrite the logic to follow this suggestion.
@node Wc Program
@subsection Counting Things
-@c FIXME: One day, update to current POSIX version of wc
-
-@cindex counting words, lines, and characters
+@cindex counting words, lines, characters, and bytes
@cindex input files @subentry counting elements in
@cindex words @subentry counting
@cindex characters @subentry counting
@cindex lines @subentry counting
+@cindex bytes @subentry counting
@cindex @command{wc} utility
-The @command{wc} (word count) utility counts lines, words, and characters in
-one or more input files. Its usage is as follows:
+The @command{wc} (word count) utility counts lines, words, characters
+and bytes in one or more input files.
+
+@menu
+* Bytes vs. Characters:: Modern character sets.
+* Using extensions:: A brief intro to extensions.
+* @command{wc} program:: Code for @file{wc.awk}.
+@end menu
+
+@node Bytes vs. Characters
+@subsubsection Modern Character Sets
+
+In the early days of computing, single bytes were used for storing
+characters. The most common character sets were ASCII and EBCDIC,
+which each provided all the English upper- and lowercase letters, the 10
+Hindu-Arabic numerals from 0 through 9, and a number of other standard
+punctuation and control characters.
+
+Today, the most popular character set in use is Unicode (of which ASCII
+is a pure subset). Unicode provides tens of thousands of unique characters
+(called @dfn{code points}) to cover most existing human languages (living
+and dead) and a number of nonhuman ones as well (such as Klingon and
+J.R.R.@: Tolkien's elvish languages).
+
+To save space in files, Unicode code points are @dfn{encoded}, where each
+character takes from one to four bytes in the file. UTF-8 is possibly
+the most popular of such @dfn{multibyte encodings}.
+
+The POSIX standard requires that @command{awk} function in terms
+of characters, not bytes. Thus in @command{gawk}, @code{length()},
+@code{substr()}, @code{split()}, @code{match()} and the other string
+functions (@pxref{String Functions}) all work in terms of characters in
+the local character set, and not in terms of bytes. (Not all @command{awk}
+implementations do so, though.)
+
+There is no standard, built-in way to distinguish characters from bytes
+in an @command{awk} program. For an @command{awk} implementation of
+@command{wc}, which needs to make such a distinction, we will have to
+use an external extension.
+
+@node Using extensions
+@subsubsection A Brief Introduction To Extensions
+
+Loadable extensions are presented in full detail in @ref{Dynamic Extensions}.
+They provide a way to add functions to @command{gawk} which can call
+out to other facilities written in C or C++.
+
+For the purposes of
+@file{wc.awk}, it's enough to know that the extension is loaded
+with the @code{@@load} directive, and the additional function we
+will use is called @code{mbs_length()}. This function returns the
+number of bytes in a string, and not the number of characters.
+
+The @code{"mbs"} extension comes from the @code{gawkextlib}
+project. @xref{gawkextlib}, for more information.
+
+@node @command{wc} program
+@subsubsection Code for @file{wc.awk}
+
+The usage for @command{wc} is as follows:
@display
-@command{wc} [@option{-lwc}] [@var{files} @dots{}]
+@command{wc} [@option{-lwcm}] [@var{files} @dots{}]
@end display
If no files are specified on the command line, @command{wc} reads its standard
@@ -25801,24 +25858,30 @@ by spaces and/or TABs. Luckily, this is the normal way @command{awk} separates
fields in its input data.
@item -c
+Count only bytes.
+Once upon a time, the @samp{c} in this option stood for ``characters.''
+But, as explained earlier, bytes and characters are no longer synonymous
+with each other.
+
+@item -m
Count only characters.
@end table
Implementing @command{wc} in @command{awk} is particularly elegant,
because @command{awk} does a lot of the work for us; it splits lines into
words (i.e., fields) and counts them, it counts lines (i.e., records),
-and it can easily tell us how long a line is.
+and it can easily tell us how long a line is in characters.
This program uses the @code{getopt()} library function
(@pxref{Getopt Function})
and the file-transition functions
(@pxref{Filetrans Function}).
-This version has one notable difference from traditional versions of
+This version has one notable difference from older versions of
@command{wc}: it always prints the counts in the order lines, words,
-and characters. Traditional versions note the order of the @option{-l},
+characters and bytes. Older versions note the order of the @option{-l},
@option{-w}, and @option{-c} options on the command line, and print the
-counts in that order.
+counts in that order. POSIX does not mandate this behavior, though.
The @code{BEGIN} rule does the argument processing. The variable
@code{print_total} is true if more than one file is named on the
@@ -25834,6 +25897,7 @@ command line:
#
# Arnold Robbins, arnold@@skeeve.com, Public Domain
# May 1993
+# Revised September 2020
@c endfile
@end ignore
@c file eg/prog/wc.awk
@@ -25841,29 +25905,35 @@ command line:
# Options:
# -l only count lines
# -w only count words
-# -c only count characters
+# -c only count bytes
+# -m only count characters
#
-# Default is to count lines, words, characters
+# Default is to count lines, words, bytes
#
# Requires getopt() and file transition library functions
+# Requires mbs extension from gawkextlib
+
+@@load "mbs"
BEGIN @{
# let getopt() print a message about
# invalid options. we ignore them
- while ((c = getopt(ARGC, ARGV, "lwc")) != -1) @{
+ while ((c = getopt(ARGC, ARGV, "lwcm")) != -1) @{
if (c == "l")
do_lines = 1
else if (c == "w")
do_words = 1
else if (c == "c")
+ do_bytes = 1
+ else if (c == "m")
do_chars = 1
@}
for (i = 1; i < Optind; i++)
ARGV[i] = ""
- # if no options, do all
- if (! do_lines && ! do_words && ! do_chars)
- do_lines = do_words = do_chars = 1
+ # if no options, do lines, words, bytes
+ if (! do_lines && ! do_words && ! do_chars && ! do_bytes)
+ do_lines = do_words = do_bytes = 1
print_total = (ARGC - i > 1)
@}
@@ -25871,14 +25941,14 @@ BEGIN @{
@end example
The @code{beginfile()} function is simple; it just resets the counts of lines,
-words, and characters to zero, and saves the current @value{FN} in
+words, characters and bytes to zero, and saves the current @value{FN} in
@code{fname}:
@example
@c file eg/prog/wc.awk
function beginfile(file)
@{
- lines = words = chars = 0
+ lines = words = chars = bytes = 0
fname = FILENAME
@}
@c endfile
@@ -25896,6 +25966,7 @@ function endfile(file)
tlines += lines
twords += words
tchars += chars
+ tbytes += bytes
if (do_lines)
printf "\t%d", lines
@group
@@ -25904,26 +25975,28 @@ function endfile(file)
@end group
if (do_chars)
printf "\t%d", chars
+ if (do_bytes)
+ printf "\t%d", bytes
printf "\t%s\n", fname
@}
@c endfile
@end example
There is one rule that is executed for each line. It adds the length of
-the record, plus one, to @code{chars}.@footnote{Because @command{gawk}
-understands multibyte locales, this code counts characters, not bytes.}
-Adding one plus the record length
+the record, plus one, to @code{chars}. Adding one plus the record length
is needed because the newline character separating records (the value
of @code{RS}) is not part of the record itself, and thus not included
-in its length. Next, @code{lines} is incremented for each line read,
-and @code{words} is incremented by the value of @code{NF}, which is the
-number of ``words'' on this line:
+in its length. Similarly, it adds the length of the record in bytes,
+plus one, to @code{bytes}. Next, @code{lines} is incremented for each
+line read, and @code{words} is incremented by the value of @code{NF},
+which is the number of ``words'' on this line:
@example
@c file eg/prog/wc.awk
# do per line
@{
chars += length($0) + 1 # get newline
+ bytes += mbs_length($0) + 1
lines++
words += NF
@}
@@ -25942,6 +26015,8 @@ END @{
printf "\t%d", twords
if (do_chars)
printf "\t%d", tchars
+ if (do_bytes)
+ printf "\t%d", tbytes
print "\ttotal"
@}
@}
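
Note on the byte/character split introduced by this patch: the sketch below is not part of the commit and does not use the mbs extension. It illustrates the same distinction under one assumption, namely that running awk with LC_ALL=C makes length() treat every byte as one character, so the per-record arithmetic from wc.awk (record length plus one for the newline) yields a byte count. The helper names count_bytes and count_chars are hypothetical and exist only for this illustration.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only; neither is part of wc.awk.

# count_bytes: assumes that in the C locale awk's length() sees one
# character per byte, so "length($0) + 1" counts the record's bytes
# plus its terminating newline.
count_bytes() {
    printf '%s\n' "$1" |
        LC_ALL=C awk '{ n += length($0) + 1 } END { print n + 0 }'
}

# count_chars: the same awk program, run in the caller's locale.
# The result for non-ASCII input depends on the awk implementation
# and the locale, so no expected value is claimed for that case.
count_chars() {
    printf '%s\n' "$1" |
        awk '{ n += length($0) + 1 } END { print n + 0 }'
}

# An e-acute encoded in UTF-8 is the two bytes 0xC3 0xA9 (octal 303 251).
count_bytes "$(printf '\303\251')"    # prints 3: two bytes plus the newline
count_bytes 'abc'                     # prints 4: three bytes plus the newline
```

Forcing the locale is a blunt instrument, since it changes byte/character handling for the whole awk process; the patch's approach of calling mbs_length() alongside length() lets a single program report both counts in one pass.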