aboutsummaryrefslogtreecommitdiffstats
path: root/doc/gawk.texi
diff options
context:
space:
mode:
Diffstat (limited to 'doc/gawk.texi')
-rw-r--r--doc/gawk.texi248
1 files changed, 247 insertions, 1 deletions
diff --git a/doc/gawk.texi b/doc/gawk.texi
index 08fc7faf..78c4f198 100644
--- a/doc/gawk.texi
+++ b/doc/gawk.texi
@@ -20,7 +20,7 @@
@c applies to and all the info about who's publishing this edition
@c These apply across the board.
-@set UPDATE-MONTH December, 2010
+@set UPDATE-MONTH January, 2011
@set VERSION 4.0
@set PATCHLEVEL 0
@@ -607,6 +607,7 @@ particular records in a file and perform operations upon them.
* Simple Sed:: A Simple Stream Editor.
* Igawk Program:: A wrapper for @command{awk} that includes
files.
+* Anagram Program:: Finding anagrams from a dictionary.
* Signature Program:: People do amazing things with too much time
on their hands.
* Debugging:: Introduction to @command{dgawk}.
@@ -12716,6 +12717,7 @@ If this element exists in @code{PROCINFO},
@emph{no matter what its value},
then @command{gawk} will cause @samp{for(i in arr) @dots{}} loops
to traverse the array indices in sorted order.
+@xref{Scanning an Array}, for more information.
@item PROCINFO["strftime"]
The default time format string for @code{strftime()}.
@@ -13379,6 +13381,39 @@ the loop body; it is not predictable whether the @code{for} loop will
reach them. Similarly, changing @var{var} inside the loop may produce
strange results. It is best to avoid such things.
+As an extension, @command{gawk} makes it possible for you to
+loop over the elements of an array in order, sorted by index.
+Sorting is based on string comparison (since all array indices are
+strings), and you cannot control the style of sorting; it is always
+from lowest to highest (@code{"A"} before @code{"B"}).
+To enable this feature, create the array element
+@code{PROCINFO["sorted_in"]} (@pxref{Auto-set}).
+The value of this element does not
+matter; @command{gawk} only tests if the element with this index
+exists in @code{PROCINFO} or not.
+This extension is disabled in POSIX mode, since the @code{PROCINFO}
+array is not special in that case. For example:
+
+@example
+$ @kbd{gawk 'BEGIN @{}
+> @kbd{ a[4] = 4}
+> @kbd{ a[3] = 3}
+> @kbd{ for (i in a)}
+> @kbd{ print i, a[i]}
+> @kbd{@}'}
+@print{} 4 4
+@print{} 3 3
+$ @kbd{gawk 'BEGIN @{}
+> @kbd{ PROCINFO["sorted_in"]++}
+> @kbd{ a[4] = 4}
+> @kbd{ a[3] = 3}
+> @kbd{ for (i in a)}
+> @kbd{ print i, a[i]}
+> @kbd{@}'}
+@print{} 3 3
+@print{} 4 4
+@end example
+
@node Delete
@section The @code{delete} Statement
@cindex @code{delete} statement
@@ -13960,6 +13995,19 @@ for (i in array)
print array[i][j]
@end example
+The @code{isarray()} function (@pxref{Type Functions})
+lets you test if an array element is itself an array:
+
+@example
+for (i in array) @{
+ if (isarray(array[i]) @{
+ for (j in array[i]) @{
+ print array[i][j]
+ @}
+ @}
+@}
+@end example
+
If the structure of a jagged array of arrays is known in advance,
you can often devise workarounds using control statements. For example,
the following code prints the elements of our main array @code{a}:
@@ -13976,6 +14024,10 @@ for (i in a) @{
@}
@end example
+@noindent
+@xref{Walking Arrays}, for a user-defined function that will ``walk'' an
+arbitrarily-dimensioned array of arrays.
+
Recall that a reference to an uninitialized array element yields a value
of @code{""}, the null string. This has one important implication when you
intend to use a subarray as an argument to a function, as illustrated by
@@ -18807,6 +18859,7 @@ comparisons use only lowercase letters.
arguments.
* Passwd Functions:: Functions for getting user information.
* Group Functions:: Functions for getting group information.
+* Walking Arrays:: A function to walk arrays of arrays.
@end menu
@node Library Names
@@ -21191,6 +21244,68 @@ simple, relying on @command{awk}'s associative arrays to do work.
The @command{id} program in @ref{Id Program},
uses these functions.
+
+@node Walking Arrays
+@section Traversing Arrays of Arrays
+
+@ref{Arrays of Arrays}, described how @command{gawk}
+provides arrays of arrays. In particular, any element of
+an array may be either a scalar, or another array. The
+@code{isarray()} function (@pxref{Type Functions})
+lets you distinguish an array
+from a scalar.
+The following function, @code{walk_array}, recursively traverses
+an array, printing each element's indices and value.
+You call it with the array and a string representing the name
+of the array:
+
+@cindex @code{walk_array()} user-defined function
+@example
+@c file eg/lib/walkarray.awk
+function walk_array(arr, name, i)
+@{
+ for (i in arr) @{
+ if (isarray(arr[i]))
+ walk_array(arr[i], (name "[" i "]"))
+ else
+ printf("%s[%s] = %s\n", name, i, arr[i])
+ @}
+@}
+@c endfile
+@end example
+
+@noindent
+It works by looping over each element of the array. If any given
+element is itself an array, the function calls itself recursively,
+passing the subarray and a new string representing the current index.
+Otherwise, the function simply prints the element's name, index, and value.
+Here is a main program to demonstrate
+
+@example
+BEGIN @{
+ a[1] = 1
+ a[2][1] = 21
+ a[2][2] = 22
+ a[3] = 3
+ a[4][1][1] = 411
+ a[4][2] = 42
+
+ walk_array(a, "a")
+@}
+@end example
+
+When run, the program produces the following output:
+
+@example
+$ @kbd{gawk -f walk_array.awk}
+@print{} a[4][1][1] = 411
+@print{} a[4][2] = 42
+@print{} a[1] = 1
+@print{} a[2][1] = 21
+@print{} a[2][2] = 22
+@print{} a[3] = 3
+@end example
+
@c ENDOFRANGE libfgdata
@c ENDOFRANGE flibgdata
@c ENDOFRANGE gdatar
@@ -22806,6 +22921,7 @@ We hope you find them both interesting and enjoyable.
* Simple Sed:: A Simple Stream Editor.
* Igawk Program:: A wrapper for @command{awk} that includes
files.
+* Anagram Program:: Finding anagrams from a dictionary.
* Signature Program:: People do amazing things with too much time on
their hands.
@end menu
@@ -24415,6 +24531,136 @@ statements for the desired library functions.
@c ENDOFRANGE flibex
@c ENDOFRANGE awkpex
+@node Anagram Program
+@subsection Finding Anagrams From A Dictionary
+
+An interesting programming challenge is to
+read a word list (such as
+@file{/usr/share/dict/words} on many GNU/Linux systems)
+and find words that are @dfn{anagrams} of each other.
+One word is an anagram of another if both words contain
+the same letters
+(for example, ``babbling'' and ``blabbing'').
+
+An elegant algorithm is presented in Column 2, Problem C of
+Jon Bentley's @cite{Programming Pearls}, second edition.
+The idea is to give words that are anagrams a common signature,
+sort all the words together by their signature, and then print them.
+Dr.@: Bentley observes that taking the letters in each word and
+sorting them produces that common signature.
+
+The following program uses arrays of arrays to bring together
+words with the same signature and array sorting to print the words
+in sorted order.
+
+@cindex @code{anagram.awk} program
+@example
+@c file eg/prog/anagram.awk
+# anagram.awk --- An implementation of the anagram finding algorithm
+# from Jon Bentley's "Programming Pearls", 2nd edition.
+# Addison Wesley, 2000, ISBN 0-201-65788-0.
+# Column 2, Problem C, section 2.8, pp 18-20.
+@c endfile
+@ignore
+@c file eg/prog/anagram.awk
+#
+# This program requires gawk 4.0 or newer.
+# Required gawk-specific features:
+# - True multidimensional arrays
+# - split() with "" as separater splits out individual characters
+# - asort() and asorti() functions
+#
+# See http://savannah.gnu.org/projects/gawk.
+#
+# Arnold Robbins
+# arnold@@skeeve.com
+# Public Domain
+# January, 2011
+@c endfile
+@end ignore
+@c file eg/prog/anagram.awk
+
+/'s$/ @{ next @} # Skip possessives
+@c endfile
+@end example
+
+The program starts with a header, and then a rule to skip
+possessives in the dictionary file. The next rule builds
+up the data structure. The first dimension of the array
+is indexed by the signature; the second dimension is the word
+itself:
+
+@example
+@c file eg/prog/anagram.awk
+@{
+ key = word2key($1) # Build signature
+ data[key][$1] = $1 # Store word with signature
+@}
+@c endfile
+@end example
+
+The @code{word2key()} function creates the signature.
+It splits the word apart into individual letters,
+sorts the letters, and then joins them back together:
+
+@example
+@c file eg/prog/anagram.awk
+# word2key --- split word apart into letters, sort, joining back together
+
+function word2key(word, a, i, n, result)
+@{
+ n = split(word, a, "")
+ asort(a)
+
+ for (i = 1; i <= n; i++)
+ result = result a[i]
+
+ return result
+@}
+@c endfile
+@end example
+
+Finally, the @code{END} rule traverses the array
+and prints out the anagram lists. It sends the output
+to the system @command{sort} command, since otherwise
+the anagrams would appear in arbitrary order:
+
+@example
+@c file eg/prog/anagram.awk
+END @{
+ sort = "sort"
+ for (key in data) @{
+ # Sort words with same key
+ nwords = asorti(data[key], words)
+ if (nwords == 1)
+ continue
+
+ # And print. Minor glitch: trailing space at end of each line
+ for (j = 1; j <= nwords; j++)
+ printf("%s ", words[j]) | sort
+ print "" | sort
+ @}
+ close(sort)
+@}
+@c endfile
+@end example
+
+Here is some partial output when the program is run:
+
+@example
+$ @kbd{gawk -f anagram.awk /usr/share/dict/words | grep '^b'}
+@dots{}
+babbled blabbed
+babbler blabber brabble
+babblers blabbers brabbles
+babbling blabbing
+babbly blabby
+babel bable
+babels beslab
+babery yabber
+@dots{}
+@end example
+
@node Signature Program
@subsection And Now For Something Completely Different