Diffstat (limited to 'doc/gawktexi.in')
-rw-r--r-- | doc/gawktexi.in | 9975 |
1 files changed, 6344 insertions, 3631 deletions
diff --git a/doc/gawktexi.in b/doc/gawktexi.in index aac8c2af..00d544b4 100644 --- a/doc/gawktexi.in +++ b/doc/gawktexi.in @@ -14,6 +14,31 @@ * awk: (gawk)Invoking gawk. Text scanning and processing. @end direntry +@ifset FOR_PRINT +@tex +\gdef\xrefprintnodename#1{``#1''} +@end tex +@end ifset + +@ifclear FOR_PRINT +@c With early 2014 texinfo.tex, restore PDF links and colors +@tex +\gdef\linkcolor{0.5 0.09 0.12} % Dark Red +\gdef\urlcolor{0.5 0.09 0.12} % Also +\global\urefurlonlylinktrue +@end tex +@end ifclear + +@ifnotdocbook +@set BULLET @bullet{} +@set MINUS @minus{} +@end ifnotdocbook + +@ifdocbook +@set BULLET +@set MINUS +@end ifdocbook + @set xref-automatic-section-title @c The following information should be updated here only! @@ -21,11 +46,9 @@ @c applies to and all the info about who's publishing this edition @c These apply across the board. -@set UPDATE-MONTH May, 2013 +@set UPDATE-MONTH June, 2014 @set VERSION 4.1 -@set PATCHLEVEL 0 - -@set FSF +@set PATCHLEVEL 1 @set TITLE GAWK: Effective AWK Programming @set SUBTITLE A User's Guide for GNU Awk @@ -39,6 +62,7 @@ @set SUBSECTION subsection @set DARKCORNER @inmargin{@image{lflashlight,1cm}, @image{rflashlight,1cm}} @set COMMONEXT (c.e.) +@set PAGE page @end iftex @ifinfo @set DOCUMENT Info file @@ -48,6 +72,7 @@ @set SUBSECTION node @set DARKCORNER (d.c.) @set COMMONEXT (c.e.) +@set PAGE screen @end ifinfo @ifhtml @set DOCUMENT Web page @@ -57,6 +82,7 @@ @set SUBSECTION subsection @set DARKCORNER (d.c.) @set COMMONEXT (c.e.) +@set PAGE screen @end ifhtml @ifdocbook @set DOCUMENT book @@ -66,6 +92,7 @@ @set SUBSECTION subsection @set DARKCORNER (d.c.) @set COMMONEXT (c.e.) +@set PAGE page @end ifdocbook @ifxml @set DOCUMENT book @@ -75,6 +102,7 @@ @set SUBSECTION subsection @set DARKCORNER (d.c.) @set COMMONEXT (c.e.) +@set PAGE page @end ifxml @ifplaintext @set DOCUMENT book @@ -84,24 +112,69 @@ @set SUBSECTION subsection @set DARKCORNER (d.c.) @set COMMONEXT (c.e.) 
+@set PAGE page @end ifplaintext +@ifdocbook +@c empty on purpose +@set PART1 +@set PART2 +@set PART3 +@set PART4 +@end ifdocbook + +@ifnotdocbook +@set PART1 Part I:@* +@set PART2 Part II:@* +@set PART3 Part III:@* +@set PART4 Part IV:@* +@end ifnotdocbook + @c some special symbols @iftex @set LEQ @math{@leq} @set PI @math{@pi} @end iftex +@ifdocbook +@set LEQ @inlineraw{docbook, &le;} +@set PI @inlineraw{docbook, &pgr;} +@end ifdocbook @ifnottex +@ifnotdocbook @set LEQ <= @set PI @i{pi} +@end ifnotdocbook @end ifnottex @ifnottex +@ifnotdocbook @macro ii{text} @i{\text\} @end macro +@end ifnotdocbook @end ifnottex +@ifdocbook +@macro ii{text} +@inlineraw{docbook,<lineannotation>\text\</lineannotation>} +@end macro +@end ifdocbook + +@ifclear FOR_PRINT +@set FN file name +@set FFN File Name +@set DF data file +@set DDF Data File +@set PVERSION version +@end ifclear +@ifset FOR_PRINT +@set FN filename +@set FFN Filename +@set DF datafile +@set DDF Datafile +@set PVERSION Version +@end ifset + @c For HTML, spell out email addresses, to avoid problems with @c address harvesters for spammers. @ifhtml @@ -115,12 +188,36 @@ @end macro @end ifnothtml +@c Indexing macros +@ifinfo + +@macro cindexawkfunc{name} +@cindex @code{\name\} +@end macro + +@macro cindexgawkfunc{name} +@cindex @code{\name\} +@end macro + +@end ifinfo + +@ifnotinfo + +@macro cindexawkfunc{name} +@cindex @code{\name\()} function +@end macro + +@macro cindexgawkfunc{name} +@cindex @code{\name\()} function (@command{gawk}) +@end macro +@end ifnotinfo + @ignore Some comments on the layout for TeX. -1. Use at least texinfo.tex 2000-09-06.09 -2. I have done A LOT of work to make this look good. There are `@page' commands - and use of `@group ... @end group' in a number of places. If you muck - with anything, it's your responsibility not to break the layout. +1. Use at least texinfo.tex 2014-01-30.15 +2. 
When using @docbook, if the last line is part of a paragraph, end +it with a space and @c so that the lines won't run together. This is a +quirk of the language / makeinfo, and isn't going to change. @end ignore @c merge the function and variable indexes into the concept index @@ -136,6 +233,10 @@ Some comments on the layout for TeX. @syncodeindex fn cp @syncodeindex vr cp @end ifxml +@ifdocbook +@synindex fn cp +@synindex vr cp +@end ifdocbook @c If "finalout" is commented out, the printed output will show @c black boxes that mark lines that are too long. Thus, it is @@ -147,9 +248,30 @@ Some comments on the layout for TeX. @end iftex @copying -Copyright @copyright{} 1989, 1991, 1992, 1993, 1996, 1997, 1998, 1999, -2000, 2001, 2002, 2003, 2004, 2005, 2007, 2009, 2010, 2011, 2012, 2013 +@docbook +<para> +“To boldly go where no man has gone before” is a +Registered Trademark of Paramount Pictures Corporation.</para> + +<para>Published by:</para> + +<literallayout class="normal">Free Software Foundation +51 Franklin Street, Fifth Floor +Boston, MA 02110-1301 USA +Phone: +1-617-542-5942 +Fax: +1-617-542-2652 +Email: <email>gnu@@gnu.org</email> +URL: <ulink url="http://www.gnu.org">http://www.gnu.org/</ulink></literallayout> + +<literallayout class="normal">Copyright © 1989, 1991, 1992, 1993, 1996–2005, 2007, 2009–2014 +Free Software Foundation, Inc. +All Rights Reserved.</literallayout> +@end docbook + +@ifnotdocbook +Copyright @copyright{} 1989, 1991, 1992, 1993, 1996--2005, 2007, 2009--2014 @* Free Software Foundation, Inc. +@end ifnotdocbook @sp 2 This is Edition @value{EDITION} of @cite{@value{TITLE}: @value{SUBTITLE}}, @@ -197,6 +319,7 @@ supports it in developing GNU and promoting software freedom.'' @subtitle @value{UPDATE-MONTH} @author Arnold D. Robbins +@ifnotdocbook @c Include the Distribution inside the titlepage environment so @c that headings are turned off. Headings on and off do not work. 
@@ -221,6 +344,7 @@ URL: @uref{http://www.gnu.org/} @* ISBN 1-882114-28-0 @* @sp 2 @insertcopying +@end ifnotdocbook @end titlepage @c Thanks to Bob Chassell for directions on doing dedications. @@ -229,15 +353,13 @@ ISBN 1-882114-28-0 @* @page @w{ } @sp 9 -@center @i{To Miriam, for making me complete.} -@sp 1 -@center @i{To Chana, for the joy you bring us.} +@center @i{To my parents, for their love, and for the wonderful example they set for me.} @sp 1 -@center @i{To Rivka, for the exponential increase.} +@center @i{To my wife Miriam, for making me complete. +Thank you for building your life together with me.} @sp 1 -@center @i{To Nachum, for the added dimension.} +@center @i{To our children Chana, Rivka, Nachum and Malka, for enrichening our lives in innumerable ways.} @sp 1 -@center @i{To Malka, for the new beginning.} @w{ } @page @w{ } @@ -245,6 +367,17 @@ ISBN 1-882114-28-0 @* @headings on @end iftex +@docbook +<dedication> +<para>To my parents, for their love, and for the wonderful +example they set for me.</para> +<para>To my wife Miriam, for making me complete. +Thank you for building your life together with me.</para> +<para>To our children Chana, Rivka, Nachum and Malka, +for enrichening our lives in innumerable ways.</para> +</dedication> +@end docbook + @iftex @headings off @evenheading @thispage@ @ @ @strong{@value{TITLE}} @| @| @@ -253,6 +386,7 @@ ISBN 1-882114-28-0 @* @ifnottex @ifnotxml +@ifnotdocbook @node Top @top General Introduction @c Preface node should come right after the Top @@ -264,6 +398,7 @@ particular records in a file and perform operations upon them. @insertcopying +@end ifnotdocbook @end ifnotxml @end ifnottex @@ -331,8 +466,8 @@ particular records in a file and perform operations upon them. includes command-line syntax. * One-shot:: Running a short throwaway @command{awk} program. -* Read Terminal:: Using no input files (input from - terminal instead). +* Read Terminal:: Using no input files (input from the + keyboard instead). 
* Long:: Putting permanent @command{awk} programs in files. * Executable Scripts:: Making self-contained @command{awk} @@ -354,6 +489,7 @@ particular records in a file and perform operations upon them. * Other Features:: Other Features of @command{awk}. * When:: When to use @command{gawk} and when to use other things. +* Intro Summary:: Summary of the introduction. * Command Line:: How to run @command{awk}. * Options:: Command-line options and their meanings. @@ -375,6 +511,7 @@ particular records in a file and perform operations upon them. program. * Obsolete:: Obsolete Options and/or features. * Undocumented:: Undocumented Options and Features. +* Invoking Summary:: Invocation summary. * Regexp Usage:: How to Use Regular Expressions. * Escape Sequences:: How to write nonprinting characters. * Regexp Operators:: Regular Expression Operators. @@ -383,8 +520,12 @@ particular records in a file and perform operations upon them. * Case-sensitivity:: How to do case-insensitive matching. * Leftmost Longest:: How much text matches. * Computed Regexps:: Using Dynamic Regexps. +* Regexp Summary:: Regular expressions summary. * Records:: Controlling how data is split into records. +* awk split records:: How standard @command{awk} splits + records. +* gawk split records:: How @command{gawk} splits records. * Fields:: An introduction to fields. * Nonconstant Fields:: Nonconstant Field Numbers. * Changing Fields:: Changing the Contents of a Field. @@ -396,6 +537,8 @@ particular records in a file and perform operations upon them. field. * Command Line Field Separator:: Setting @code{FS} from the command-line. +* Full Line Fields:: Making the full line be a single + field. * Field Splitting Summary:: Some final points and a summary table. * Constant Size:: Reading constant width data. * Splitting By Content:: Defining Fields By Content @@ -421,6 +564,8 @@ particular records in a file and perform operations upon them. * Read Timeout:: Reading input with a timeout. 
* Command line directories:: What happens if you put a directory on the command line. +* Input Summary:: Input summary. +* Input Exercises:: Exercises. * Print:: The @code{print} statement. * Print Examples:: Simple examples of @code{print} statements. @@ -444,6 +589,8 @@ particular records in a file and perform operations upon them. * Special Caveats:: Things to watch out for. * Close Files And Pipes:: Closing Input and Output Files and Pipes. +* Output Summary:: Output summary. +* Output exercises:: Exercises. * Values:: Constants, Variables, and Regular Expressions. * Constants:: String, numeric and regexp constants. @@ -459,6 +606,9 @@ particular records in a file and perform operations upon them. This is an advanced method of input. * Conversion:: The conversion of strings to numbers and vice versa. +* Strings And Numbers:: How @command{awk} Converts Between + Strings And Numbers. +* Locale influences conversions:: How the locale may affect conversions. * All Operators:: @command{gawk}'s operators. * Arithmetic Ops:: Arithmetic operations (@samp{+}, @samp{-}, etc.) @@ -486,6 +636,7 @@ particular records in a file and perform operations upon them. * Function Calls:: A function call is an expression. * Precedence:: How various operators nest. * Locales:: How the locale affects things. +* Expressions Summary:: Expressions summary. * Pattern Overview:: What goes into a pattern. * Regexp Patterns:: Using regexps as patterns. * Expression Patterns:: Any expression can be used as a @@ -532,6 +683,7 @@ particular records in a file and perform operations upon them. gives you information. * ARGC and ARGV:: Ways to use @code{ARGC} and @code{ARGV}. +* Pattern Action Summary:: Patterns and Actions summary. * Array Basics:: The basics of arrays. * Array Intro:: Introduction to Arrays * Reference to Elements:: How to examine one element of an @@ -554,6 +706,7 @@ particular records in a file and perform operations upon them. @command{awk}. 
* Multiscanning:: Scanning multidimensional arrays. * Arrays of Arrays:: True multidimensional arrays. +* Arrays Summary:: Summary of arrays. * Built-in:: Summarizes the built-in functions. * Calling Built-in:: How to call built-in functions. * Numeric Functions:: Functions that work with numbers, @@ -588,6 +741,7 @@ particular records in a file and perform operations upon them. runtime. * Indirect Calls:: Choosing the function to call at runtime. +* Functions Summary:: Summary of functions. * Library Names:: How to best name private global variables in library functions. * General Functions:: Functions that are of general use. @@ -622,6 +776,8 @@ particular records in a file and perform operations upon them. * Group Functions:: Functions for getting group information. * Walking Arrays:: A function to walk arrays of arrays. +* Library Functions Summary:: Summary of library functions. +* Library exercises:: Exercises. * Running Examples:: How to run these examples. * Clones:: Clones of common utilities. * Cut Program:: The @command{cut} utility. @@ -651,6 +807,8 @@ particular records in a file and perform operations upon them. * Anagram Program:: Finding anagrams from a dictionary. * Signature Program:: People do amazing things with too much time on their hands. +* Programs Summary:: Summary of programs. +* Programs Exercises:: Exercises. * Nondecimal Data:: Allowing nondecimal input data. * Array Sorting:: Facilities for controlling array traversal and sorting arrays. @@ -662,8 +820,9 @@ particular records in a file and perform operations upon them. * TCP/IP Networking:: Using @command{gawk} for network programming. * Profiling:: Profiling your @command{awk} programs. +* Advanced Features Summary:: Summary of advanced features. * I18N and L10N:: Internationalization and Localization. -* Explaining gettext:: How GNU @code{gettext} works. +* Explaining gettext:: How GNU @command{gettext} works. * Programmer i18n:: Features for the programmer. 
* Translator i18n:: Features for the translator. * String Extraction:: Extracting marked strings. @@ -673,6 +832,7 @@ particular records in a file and perform operations upon them. * I18N Example:: A simple i18n example. * Gawk I18N:: @command{gawk} is also internationalized. +* I18N Summary:: Summary of I18N stuff. * Debugging:: Introduction to @command{gawk} debugger. * Debugging Concepts:: Debugging in General. @@ -691,31 +851,23 @@ particular records in a file and perform operations upon them. * Miscellaneous Debugger Commands:: Miscellaneous Commands. * Readline Support:: Readline support. * Limitations:: Limitations and future plans. -* General Arithmetic:: An introduction to computer - arithmetic. -* Floating Point Issues:: Stuff to know about floating-point - numbers. -* String Conversion Precision:: The String Value Can Lie. -* Unexpected Results:: Floating Point Numbers Are Not - Abstract Numbers. -* POSIX Floating Point Problems:: Standards Versus Existing Practice. -* Integer Programming:: Effective integer programming. -* Floating-point Programming:: Effective Floating-point Programming. -* Floating-point Representation:: Binary floating-point representation. -* Floating-point Context:: Floating-point context. -* Rounding Mode:: Floating-point rounding mode. -* Gawk and MPFR:: How @command{gawk} provides - arbitrary-precision arithmetic. -* Arbitrary Precision Floats:: Arbitrary Precision Floating-point - Arithmetic with @command{gawk}. -* Setting Precision:: Setting the working precision. -* Setting Rounding Mode:: Setting the rounding mode. -* Floating-point Constants:: Representing floating-point constants. -* Changing Precision:: Changing the precision of a number. -* Exact Arithmetic:: Exact arithmetic with floating-point - numbers. +* Debugging Summary:: Debugging summary. +* Computer Arithmetic:: A quick intro to computer math. +* Math Definitions:: Defining terms used. +* MPFR features:: The MPFR features in @command{gawk}. 
+* FP Math Caution:: Things to know. +* Inexactness of computations:: Floating point math is not exact. +* Inexact representation:: Numbers are not exactly represented. +* Comparing FP Values:: How to compare floating point values. +* Errors accumulate:: Errors get bigger as they go. +* Getting Accuracy:: Getting more accuracy takes some work. +* Try To Round:: Add digits and round. +* Setting precision:: How to set the precision. +* Setting the rounding mode:: How to set the rounding mode. * Arbitrary Precision Integers:: Arbitrary Precision Integer Arithmetic with @command{gawk}. +* POSIX Floating Point Problems:: Standards Versus Existing Practice. +* Floating point summary:: Summary of floating point discussion. * Extension Intro:: What is an extension. * Plugin License:: A note about licensing. * Extension Mechanism Outline:: An outline of how it works. @@ -723,6 +875,7 @@ particular records in a file and perform operations upon them. * Extension API Functions Introduction:: Introduction to the API functions. * General Data Types:: The data types. * Requesting Values:: How to get a value. +* Memory Allocation Functions:: Functions for allocating memory. * Constructor Functions:: Functions for creating values. * Registration Functions:: Functions to register things with @command{gawk}. @@ -776,6 +929,8 @@ particular records in a file and perform operations upon them. * Extension Sample Time:: An interface to @code{gettimeofday()} and @code{sleep()}. * gawkextlib:: The @code{gawkextlib} project. +* Extension summary:: Extension summary. +* Extension Exercises:: Exercises. * V7/SVR3.1:: The major changes between V7 and System V Release 3.1. * SVR4:: Minor changes between System V @@ -785,11 +940,14 @@ particular records in a file and perform operations upon them. version of @command{awk}. * POSIX/GNU:: The extensions in @command{gawk} not in POSIX @command{awk}. +* Feature History:: The history of the features in + @command{gawk}. 
* Common Extensions:: Common Extensions Summary. * Ranges and Locales:: How locales used to affect regexp ranges. * Contributors:: The major contributors to @command{gawk}. +* History summary:: History summary. * Gawk Distribution:: What is in the @command{gawk} distribution. * Getting:: How to get the distribution. @@ -817,14 +975,18 @@ particular records in a file and perform operations upon them. * VMS Installation:: Installing @command{gawk} on VMS. * VMS Compilation:: How to compile @command{gawk} under VMS. +* VMS Dynamic Extensions:: Compiling @command{gawk} dynamic + extensions on VMS. * VMS Installation Details:: How to install @command{gawk} under VMS. * VMS Running:: How to run @command{gawk} under VMS. +* VMS GNV:: The VMS GNV Project. * VMS Old Gawk:: An old version comes with some VMS systems. * Bugs:: Reporting Problems and Bugs. * Other Versions:: Other freely available @command{awk} implementations. +* Installation summary:: Summary of installation. * Compatibility Mode:: How to disable certain @command{gawk} extensions. * Additions:: Making Additions To @command{gawk}. @@ -833,8 +995,8 @@ particular records in a file and perform operations upon them. @command{gawk}. * New Ports:: Porting @command{gawk} to a new operating system. -* Derived Files:: Why derived files are kept in the - @command{git} repository. +* Derived Files:: Why derived files are kept in the Git + repository. * Future Extensions:: New features that may be implemented one day. * Implementation Limitations:: Some limitations of the @@ -845,6 +1007,7 @@ particular records in a file and perform operations upon them. * Extension Other Design Decisions:: Some other design decisions. * Extension Future Growth:: Some room for future growth. * Old Extension Mechanism:: Some compatibility for old extensions. +* Notes summary:: Summary of implementation notes. * Basic High Level:: The high level view. * Basic Data Typing:: A very quick intro to data types. 
@end detailmenu @@ -852,15 +1015,14 @@ particular records in a file and perform operations upon them. @c dedication for Info file @ifinfo -@center To Miriam, for making me complete. -@sp 1 -@center To Chana, for the joy you bring us. +To my parents, for their love, and for the wonderful +example they set for me. @sp 1 -@center To Rivka, for the exponential increase. +To my wife Miriam, for making me complete. +Thank you for building your life together with me. @sp 1 -@center To Nachum, for the added dimension. -@sp 1 -@center To Malka, for the new beginning. +To our children Chana, Rivka, Nachum and Malka, +for enrichening our lives in innumerable ways. @end ifinfo @summarycontents @@ -869,6 +1031,21 @@ particular records in a file and perform operations upon them. @node Foreword @unnumbered Foreword +@c This bit is post-processed by a script which turns the chapter +@c tag into a preface tag, and moves this stuff to before the title. +@c Bleah. +@docbook + <prefaceinfo> + <author> + <firstname>Michael</firstname> + <surname>Brennan</surname> + <!-- can't put mawk into command tags. sigh. --> + <affiliation><jobtitle>Author of mawk</jobtitle></affiliation> + </author> + <date>March, 2001</date> + </prefaceinfo> +@end docbook + Arnold Robbins and I are good friends. We were introduced @c 11 years ago in 1990 @@ -953,21 +1130,37 @@ and the AWK prototype becomes the product. The new @command{pgawk} (profiling @command{gawk}), produces program execution counts. I recently experimented with an algorithm that for -@math{n} lines of input, exhibited +@ifnotdocbook +@math{n} +@end ifnotdocbook +@ifdocbook +@i{n} +@end ifdocbook +lines of input, exhibited @tex $\sim\! Cn^2$ @end tex @ifnottex +@ifnotdocbook ~ C n^2 +@end ifnotdocbook @end ifnottex +@docbook +<emphasis>∼ Cn<superscript>2</superscript></emphasis> @c +@end docbook performance, while theory predicted @tex $\sim\! 
Cn\log n$ @end tex @ifnottex +@ifnotdocbook ~ C n log n +@end ifnotdocbook @end ifnottex +@docbook +<emphasis>∼ Cn log n</emphasis> @c +@end docbook behavior. A few minutes poring over the @file{awkprof.out} profile pinpointed the problem to a single line of code. @command{pgawk} is a welcome addition to @@ -977,11 +1170,14 @@ Arnold has distilled over a decade of experience writing and using AWK programs, and developing @command{gawk}, into this book. If you use AWK or want to learn how, then read this book. +@ifnotdocbook +@cindex Brennan, Michael @display Michael Brennan Author of @command{mawk} March, 2001 @end display +@end ifnotdocbook @node Preface @unnumbered Preface @@ -990,6 +1186,21 @@ March, 2001 @c @c 12/2000: Chuck wants the preface & intro combined. +@c This bit is post-processed by a script which turns the chapter +@c tag into a preface tag, and moves this stuff to before the title. +@c Bleah. +@docbook + <prefaceinfo> + <author> + <firstname>Arnold</firstname> + <surname>Robbins</surname> + <affiliation><jobtitle>Nof Ayalon</jobtitle></affiliation> + <affiliation><jobtitle>ISRAEL</jobtitle></affiliation> + </author> + <date>June, 2014</date> + </prefaceinfo> +@end docbook + Several kinds of tasks occur repeatedly when working with text files. You might want to extract certain lines and discard the rest. @@ -1001,12 +1212,13 @@ Such jobs are often easier with @command{awk}. The @command{awk} utility interprets a special-purpose programming language that makes it easy to handle simple data-reformatting jobs. 
+@cindex Brian Kernighan's @command{awk} The GNU implementation of @command{awk} is called @command{gawk}; if you invoke it with the proper options or environment variables (@pxref{Options}), it is fully compatible with -the POSIX@footnote{The 2008 POSIX standard is online at -@url{http://www.opengroup.org/onlinepubs/9699919799/}.} +the POSIX@footnote{The 2008 POSIX standard is accessable online at +@w{@url{http://www.opengroup.org/onlinepubs/9699919799/}.}} specification of the @command{awk} language and with the Unix version of @command{awk} maintained by Brian Kernighan. @@ -1023,7 +1235,7 @@ Thus, we usually don't distinguish between @command{gawk} and other @cindex @command{awk}, uses for Using @command{awk} allows you to: -@itemize @bullet +@itemize @value{BULLET} @item Manage small, personal databases @@ -1048,7 +1260,7 @@ In addition, @command{gawk} provides facilities that make it easy to: -@itemize @bullet +@itemize @value{BULLET} @item Extract bits and pieces of data for processing @@ -1057,6 +1269,12 @@ Sort data @item Perform simple network communications + +@item +Profile and debug @command{awk} programs. + +@item +Extend the language with functions written in C or C++. @end itemize This @value{DOCUMENT} teaches you about the @command{awk} language and @@ -1072,12 +1290,18 @@ Implementations of the @command{awk} language are available for many different computing environments. This @value{DOCUMENT}, while describing the @command{awk} language in general, also describes the particular implementation of @command{awk} called @command{gawk} (which stands for -``GNU awk''). @command{gawk} runs on a broad range of Unix systems, +``GNU @command{awk}''). @command{gawk} runs on a broad range of Unix systems, ranging from Intel@registeredsymbol{}-architecture PC-based computers -up through large-scale systems, -such as Crays. @command{gawk} has also been ported to Mac OS X, -Microsoft Windows (all versions) and OS/2 PCs, -and VMS. 
+up through large-scale systems. +@command{gawk} has also been ported to Mac OS X, +Microsoft Windows +@ifset FOR_PRINT +(all versions), +@end ifset +@ifclear FOR_PRINT +(all versions) and OS/2 PCs, +@end ifclear +and OpenVMS. (Some other, obsolete systems to which @command{gawk} was once ported are no longer supported and the code for those systems has been removed.) @@ -1151,11 +1375,11 @@ wrote the bulk of @cite{TCP/IP Internetworking with @command{gawk}} (a separate document, available as part of the @command{gawk} distribution). His code finally became part of the main @command{gawk} distribution -with @command{gawk} version 3.1. +with @command{gawk} @value{PVERSION} 3.1. John Haque rewrote the @command{gawk} internals, in the process providing an @command{awk}-level debugger. This version became available as -@command{gawk} version 4.0, in 2011. +@command{gawk} @value{PVERSION} 4.0, in 2011. @xref{Contributors}, for a complete list of those who made important contributions to @command{gawk}. @@ -1170,26 +1394,26 @@ The language described in this @value{DOCUMENT} is often referred to as ``new @command{awk}'' (@command{nawk}). @cindex @command{awk}, versions of -Because of this, there are systems with multiple -versions of @command{awk}. -Some systems have an @command{awk} utility that implements the -original version of the @command{awk} language and a @command{nawk} utility -for the new version. -Others have an @command{oawk} version for the ``old @command{awk}'' -language and plain @command{awk} for the new one. Still others only -have one version, which is usually the new one.@footnote{Often, these systems -use @command{gawk} for their @command{awk} implementation!} - @cindex @command{nawk} utility @cindex @command{oawk} utility -All in all, this makes it difficult for you to know which version of -@command{awk} you should run when writing your programs. The best advice -we can give here is to check your local documentation. 
Look for @command{awk}, -@command{oawk}, and @command{nawk}, as well as for @command{gawk}. -It is likely that you already -have some version of new @command{awk} on your system, which is what -you should use when running your programs. (Of course, if you're reading -this @value{DOCUMENT}, chances are good that you have @command{gawk}!) +For some time after new @command{awk} was introduced, there were +systems with multiple versions of @command{awk}. Some systems had +an @command{awk} utility that implemented the original version of the +@command{awk} language and a @command{nawk} utility for the new version. +Others had an @command{oawk} version for the ``old @command{awk}'' +language and plain @command{awk} for the new one. Still others only +had one version, which is usually the new one. + +Today, only Solaris systems still use an old @command{awk} for the +default @command{awk} utility. (A more modern @command{awk} lives in +@file{/usr/xpg6/bin} on these systems.) All other modern systems use +some version of new @command{awk}.@footnote{Many of these systems use +@command{gawk} for their @command{awk} implementation!} + +It is likely that you already have some version of new @command{awk} on +your system, which is what you should use when running your programs. +(Of course, if you're reading this @value{DOCUMENT}, chances are good +that you have @command{gawk}!) Throughout this @value{DOCUMENT}, whenever we refer to a language feature that should be available in any complete implementation of POSIX @command{awk}, @@ -1207,7 +1431,7 @@ and the program ``the @command{awk} utility.'' This @value{DOCUMENT} explains both how to write programs in the @command{awk} language and how to run the @command{awk} utility. -The term @dfn{@command{awk} program} refers to a program written by you in +The term ``@command{awk} program'' refers to a program written by you in the @command{awk} programming language. 
@cindex @command{gawk}, @command{awk} and @@ -1217,9 +1441,15 @@ Primarily, this @value{DOCUMENT} explains the features of @command{awk} as defined in the POSIX standard. It does so in the context of the @command{gawk} implementation. While doing so, it also attempts to describe important differences between @command{gawk} -and other @command{awk} implementations.@footnote{All such differences +and other @command{awk} +@ifclear FOR_PRINT +implementations.@footnote{All such differences appear in the index under the entry ``differences in @command{awk} and @command{gawk}.''} +@end ifclear +@ifset FOR_PRINT +implementations. +@end ifset Finally, any @command{gawk} features that are not in the POSIX standard for @command{awk} are noted. @@ -1227,7 +1457,7 @@ the POSIX standard for @command{awk} are noted. This @value{DOCUMENT} has the difficult task of being both a tutorial and a reference. If you are a novice, feel free to skip over details that seem too complex. You should also ignore the many cross-references; they are for the -expert user and for the online Info and HTML versions of the document. +expert user and for the online Info and HTML versions of the @value{DOCUMENT}. @end ifnotinfo There are sidebars @@ -1251,6 +1481,8 @@ should be of interest. This @value{DOCUMENT} is split into several parts, as follows: +@c FULLXREF ON + Part I describes the @command{awk} language and @command{gawk} program in detail. It starts with the basics, and continues through all of the features of @command{awk}. It contains the following chapters: @@ -1334,9 +1566,14 @@ describes advanced arithmetic facilities provided by @ref{Dynamic Extensions}, describes how to add new variables and functions to @command{gawk} by writing extensions in C or C++. +@ifclear FOR_PRINT Part IV provides the appendices, the Glossary, and two licenses that cover the @command{gawk} source code and this @value{DOCUMENT}, respectively. 
It contains the following appendices: +@end ifclear +@ifset FOR_PRINT +Part IV provides the following appendices: +@end ifset @ref{Language History}, describes how the @command{awk} language has evolved since @@ -1351,6 +1588,36 @@ non-POSIX systems. It also describes how to report bugs in @command{gawk} and where to get other freely available @command{awk} implementations. +@ifset FOR_PRINT +The version of this @value{DOCUMENT} distributed with @command{gawk} +contains additional appendices and other end material. +To save space, we have omitted them from the +printed edition. You may find them online, as follows: + +@uref{http://www.gnu.org/software/gawk/manual/html_node/Notes.html, +The appendix on implementation notes} +describes how to disable @command{gawk}'s extensions, as +well as how to contribute new code to @command{gawk}, +and some possible future directions for @command{gawk} development. + +@uref{http://www.gnu.org/software/gawk/manual/html_node/Basic-Concepts.html, +The appendix on basic concepts} +provides some very cursory background material for those who +are completely unfamiliar with computer programming. + +@uref{http://www.gnu.org/software/gawk/manual/html_node/Glossary.html, +The Glossary} +defines most, if not all, the significant terms used +throughout the @value{DOCUMENT}. If you find terms that you aren't familiar with, +try looking them up here. + +@uref{http://www.gnu.org/software/gawk/manual/html_node/Copying.html, The GNU GPL} and +@uref{http://www.gnu.org/software/gawk/manual/html_node/GNU-Free-Documentation-License.html, the GNU FDL} +are the licenses that cover the @command{gawk} source code +and this @value{DOCUMENT}, respectively. +@end ifset + +@ifclear FOR_PRINT @ref{Notes}, describes how to disable @command{gawk}'s extensions, as well as how to contribute new code to @command{gawk}, @@ -1361,13 +1628,16 @@ provides some very cursory background material for those who are completely unfamiliar with computer programming. 
The @ref{Glossary}, defines most, if not all, the significant terms used -throughout the book. If you find terms that you aren't familiar with, +throughout the @value{DOCUMENT}. If you find terms that you aren't familiar with, try looking them up here. @ref{Copying}, and @ref{GNU Free Documentation License}, present the licenses that cover the @command{gawk} source code and this @value{DOCUMENT}, respectively. +@end ifclear + +@c FULLXREF OFF @node Conventions @unnumberedsec Typographical Conventions @@ -1409,7 +1679,7 @@ emphasized @emph{like this}, and if a point needs to be made strongly, it is done @strong{like this}. The first occurrence of a new term is usually its @dfn{definition} and appears in the same font as the previous occurrence of ``definition'' in this sentence. -Finally, file names are indicated like this: @file{/path/to/ourfile}. +Finally, @value{FN}s are indicated like this: @file{/path/to/ourfile}. @end ifnotinfo Characters that you type at the keyboard look @kbd{like this}. In particular, @@ -1441,16 +1711,22 @@ the picture of a flashlight in the margin, as shown here. @ifnottex ``(d.c.)''. @end ifnottex +@ifclear FOR_PRINT They also appear in the index under the heading ``dark corner.'' +@end ifclear -As noted by the opening quote, though, any -coverage of dark corners -is, by definition, incomplete. +As noted by the opening quote, though, any coverage of dark corners is, +by definition, incomplete. Extensions to the standard @command{awk} language that are supported by more than one @command{awk} implementation are marked +@ifclear FOR_PRINT ``@value{COMMONEXT},'' and listed in the index under ``common extensions'' and ``extensions, common.'' +@end ifclear +@ifset FOR_PRINT +``@value{COMMONEXT}.'' +@end ifset @node Manual History @unnumberedsec The GNU Project and This Book @@ -1473,13 +1749,15 @@ Foundation to create a complete, freely distributable, POSIX-compliant computing environment. 
The FSF uses the ``GNU General Public License'' (GPL) to ensure that their software's -source code is always available to the end user. A -copy of the GPL is included +source code is always available to the end user. +@ifclear FOR_PRINT +A copy of the GPL is included @ifnotinfo in this @value{DOCUMENT} @end ifnotinfo for your reference (@pxref{Copying}). +@end ifclear The GPL applies to the C language source code for @command{gawk}. To find out more about the FSF and the GNU Project online, see @uref{http://www.gnu.org, the GNU Project's home page}. @@ -1502,8 +1780,13 @@ consider using GNU/Linux, a freely distributable, Unix-like operating system for Intel@registeredsymbol{}, Power Architecture, Sun SPARC, IBM S/390, and other +@ifclear FOR_PRINT systems.@footnote{The terminology ``GNU/Linux'' is explained in the @ref{Glossary}.} +@end ifclear +@ifset FOR_PRINT +systems. +@end ifset Many GNU/Linux distributions are available for download from the Internet. @@ -1523,53 +1806,13 @@ The @value{DOCUMENT} you are reading is actually free---at least, the information in it is free to anyone. The machine-readable source code for the @value{DOCUMENT} comes with @command{gawk}; anyone may take this @value{DOCUMENT} to a copying machine and make as many -copies as they like. (Take a moment to check the Free Documentation +copies as they like. +@ifclear FOR_PRINT +(Take a moment to check the Free Documentation License in @ref{GNU Free Documentation License}.) +@end ifclear @end ifnotinfo -@ignore -@cindex Close, Diane -The @value{DOCUMENT} itself has gone through several previous, -preliminary editions. -Paul Rubin wrote the very first draft of @cite{The GAWK Manual}; -it was around 40 pages in size. -Diane Close and Richard Stallman improved it, yielding the -version which I started working with in the fall of 1988. -It was around 90 pages long and barely described the original, ``old'' -version of @command{awk}. 
After substantial revision, the first version of -the @cite{The GAWK Manual} to be released was Edition 0.11 Beta in -October of 1989. The manual then underwent more substantial revision -for Edition 0.13 of December 1991. -David Trueman, Pat Rankin and Michal Jaegermann contributed sections -of the manual for Edition 0.13. -That edition was published by the -FSF as a bound book early in 1992. Since then there were several -minor revisions, notably Edition 0.14 of November 1992 that was published -by the FSF in January of 1993 and Edition 0.16 of August 1993. - -Edition 1.0 of @cite{GAWK: The GNU Awk User's Guide} represented a significant re-working -of @cite{The GAWK Manual}, with much additional material. -The FSF and I agreed that I was now the primary author. -@c I also felt that the manual needed a more descriptive title. - -In January 1996, SSC published Edition 1.0 under the title @cite{Effective AWK Programming}. -In February 1997, they published Edition 1.0.3 which had minor changes -as a ``second edition.'' -In 1999, the FSF published this same version as Edition 2 -of @cite{GAWK: The GNU Awk User's Guide}. - -Edition @value{EDITION} maintains the basic structure of Edition 1.0, -but with significant additional material, reflecting the host of new features -in @command{gawk} version @value{VERSION}. -Of particular note is -@ref{Array Sorting}, -@ref{Bitwise Functions}, -@ref{Internationalization}, -@ref{Advanced Features}, -and -@ref{Dynamic Extensions}. -@end ignore - @cindex Close, Diane The @value{DOCUMENT} itself has gone through a number of previous editions. Paul Rubin wrote the very first draft of @cite{The GAWK Manual}; @@ -1585,24 +1828,50 @@ the FSF published several preliminary versions (numbered 0.@var{x}). In 1996, Edition 1.0 was released with @command{gawk} 3.0.0. The FSF published the first two editions under the title @cite{The GNU Awk User's Guide}. 
+@ifset FOR_PRINT +SSC published two editions of the @value{DOCUMENT} under the +title @cite{Effective awk Programming}, and in O'Reilly published +the third edition in 2001. +@end ifset This edition maintains the basic structure of the previous editions. -For Edition 4.0, the content has been thoroughly reviewed +For FSF edition 4.0, the content has been thoroughly reviewed and updated. All references to @command{gawk} versions prior to 4.0 have been removed. Of significant note for this edition was @ref{Debugger}. -For edition @value{EDITION}, the content has been reorganized into parts, +For FSF edition +@ifclear FOR_PRINT +@value{EDITION}, +@end ifclear +@ifset FOR_PRINT +@value{EDITION} +(the fourth edition as published by O'Reilly), +@end ifset +the content has been reorganized into parts, and the major new additions are @ref{Arbitrary Precision Arithmetic}, and @ref{Dynamic Extensions}. -@cite{@value{TITLE}} will undoubtedly continue to evolve. -An electronic version -comes with the @command{gawk} distribution from the FSF. -If you find an error in this @value{DOCUMENT}, please report it! -@xref{Bugs}, for information on submitting -problem reports electronically. +This @value{DOCUMENT} will undoubtedly continue to evolve. An electronic +version comes with the @command{gawk} distribution from the FSF. If you +find an error in this @value{DOCUMENT}, please report it! @xref{Bugs}, +for information on submitting problem reports electronically. + +@ifset FOR_PRINT +@c fakenode --- for prepinfo +@unnumberedsec How to Stay Current + +It may be you have a version of @command{gawk} which is newer than the +one described in this @value{DOCUMENT}. To find out what has changed, +you should first look at the @file{NEWS} file in the @command{gawk} +distribution, which provides a high level summary of what changed in +each release. 
+ +You can then look at the @uref{http://www.gnu.org/software/gawk/manual/, +online version} of this @value{DOCUMENT} to read about any new features. +@end ifset +@ifclear FOR_PRINT @node How To Contribute @unnumberedsec How to Contribute @@ -1619,7 +1888,7 @@ However, I found that I could not dedicate enough time to managing contributed code: the archive did not grow and the domain went unused for several years. -Fortunately, late in 2008, a volunteer took on the task of setting up +Late in 2008, a volunteer took on the task of setting up an @command{awk}-related web site---@uref{http://awk.info}---and did a very nice job. @@ -1628,11 +1897,15 @@ a @command{gawk} extension that you would like to share with the rest of the world, please see @uref{http://awk.info/?contribute} for how to contribute it to the web site. +As of this writing, this website is in search of a maintainer; please +contact me if you are interested. + @ignore Other links: http://www.reddit.com/r/linux/comments/dtect/composing_music_in_awk/ @end ignore +@end ifclear @node Acknowledgments @unnumberedsec Acknowledgments @@ -1723,7 +1996,7 @@ significant editorial help for this @value{DOCUMENT} for the 3.1 release of @command{gawk}. @end quotation -@cindex Beebe, Nelson +@cindex Beebe, Nelson H.F.@: @cindex Buening, Andreas @cindex Collado, Manuel @cindex Colombo, Antonio @@ -1740,7 +2013,6 @@ significant editorial help for this @value{DOCUMENT} for the @cindex Rankin, Pat @cindex Schorr, Andrew @cindex Vinschen, Corinna -@cindex Wallin, Anders @cindex Zaretskii, Eli Dr.@: Nelson Beebe, @@ -1760,7 +2032,6 @@ Chet Ramey, Pat Rankin, Andrew Schorr, Corinna Vinschen, -Anders Wallin, and Eli Zaretskii (in alphabetical order) make up the current @@ -1772,6 +2043,10 @@ people. Notable code and documentation contributions were made by a number of people. @xref{Contributors}, for the full list. +Thanks to Patrice Dumas for the new @command{makeinfo} program. 
+Thanks to Karl Berry who continues to work to keep +the Texinfo markup language sane. + @cindex Kernighan, Brian I would like to thank Brian Kernighan for invaluable assistance during the testing and debugging of @command{gawk}, and for ongoing @@ -1791,26 +2066,28 @@ which they raised and educated me. Finally, I also must acknowledge my gratitude to G-d, for the many opportunities He has sent my way, as well as for the gifts He has given me with which to take advantage of those opportunities. +@iftex @sp 2 @noindent Arnold Robbins @* Nof Ayalon @* ISRAEL @* -May, 2013 - -@iftex -@part Part I:@* The @command{awk} Language +May, 2014 @end iftex -@ignore +@ifnotinfo +@part @value{PART1}The @command{awk} Language +@end ifnotinfo + @ifdocbook -@part Part I:@* The @command{awk} Language -Part I describes the @command{awk} language and @command{gawk} program in detail. -It starts with the basics, and continues through all of the features of @command{awk} -and @command{gawk}. It contains the following chapters: +Part I describes the @command{awk} language and @command{gawk} program +in detail. It starts with the basics, and continues through all of +the features of @command{awk}. Included also are many, but not all, +of the features of @command{gawk}. This part contains the +following chapters: -@itemize @bullet +@itemize @value{BULLET} @item @ref{Getting Started}. @@ -1839,7 +2116,6 @@ and @command{gawk}. It contains the following chapters: @ref{Functions}. @end itemize @end ifdocbook -@end ignore @node Getting Started @chapter Getting Started with @command{awk} @@ -1879,7 +2155,7 @@ pattern to search for and one action to perform upon finding the pattern. Syntactically, a rule consists of a pattern followed by an action. The -action is enclosed in curly braces to separate it from the pattern. +action is enclosed in braces to separate it from the pattern. Newlines usually separate rules. 
Therefore, an @command{awk} program looks like this: @@ -1903,6 +2179,7 @@ program looks like this: * Other Features:: Other Features of @command{awk}. * When:: When to use @command{gawk} and when to use other things. +* Intro Summary:: Summary of the introduction. @end menu @node Running gawk @@ -1931,7 +2208,7 @@ variations of each. @menu * One-shot:: Running a short throwaway @command{awk} program. -* Read Terminal:: Using no input files (input from terminal +* Read Terminal:: Using no input files (input from the keyboard instead). * Long:: Putting permanent @command{awk} programs in files. @@ -1995,10 +2272,15 @@ awk '@var{program}' @noindent @command{awk} applies the @var{program} to the @dfn{standard input}, -which usually means whatever you type on the terminal. This continues +which usually means whatever you type on the keyboard. This continues until you indicate end-of-file by typing @kbd{Ctrl-d}. +@ifset FOR_PRINT +(On other operating systems, the end-of-file character may be different.) +@end ifset +@ifclear FOR_PRINT (On other operating systems, the end-of-file character may be different. For example, on OS/2, it is @kbd{Ctrl-z}.) +@end ifclear @cindex files, input, See input files @cindex input files, running @command{awk} without @@ -2018,11 +2300,11 @@ $ @kbd{awk "BEGIN @{ print \"Don't Panic!\" @}"} @print{} Don't Panic! @end example -@cindex quoting -@cindex double quote (@code{"}) -@cindex @code{"} (double quote) -@cindex @code{\} (backslash) -@cindex backslash (@code{\}) +@cindex shell quoting, double quote +@cindex double quote (@code{"}) in shell commands +@cindex @code{"} (double quote) in shell commands +@cindex @code{\} (backslash) in shell commands +@cindex backslash (@code{\}) in shell commands This program does not read any input. 
The @samp{\} before each of the inner double quotes is necessary because of the shell's quoting rules---in particular because it mixes both single quotes and @@ -2061,11 +2343,10 @@ more convenient to put the program into a separate file. In order to tell awk -f @var{source-file} @var{input-file1} @var{input-file2} @dots{} @end example -@cindex @code{-f} option -@cindex command line, options -@cindex options, command-line +@cindex @option{-f} option +@cindex command line, option @option{-f} The @option{-f} instructs the @command{awk} utility to get the @command{awk} program -from the file @var{source-file}. Any file name can be used for +from the file @var{source-file}. Any @value{FN} can be used for @var{source-file}. For example, you could put the program: @example @@ -2086,22 +2367,22 @@ does the same thing as this one: awk "BEGIN @{ print \"Don't Panic!\" @}" @end example -@cindex quoting +@cindex quoting in @command{gawk} command lines @noindent This was explained earlier (@pxref{Read Terminal}). -Note that you don't usually need single quotes around the file name that you -specify with @option{-f}, because most file names don't contain any of the shell's +Note that you don't usually need single quotes around the @value{FN} that you +specify with @option{-f}, because most @value{FN}s don't contain any of the shell's special characters. Notice that in @file{advice}, the @command{awk} program did not have single quotes around it. The quotes are only needed for programs that are provided on the @command{awk} command line. @c STARTOFRANGE sq1x -@cindex single quote (@code{'}) +@cindex single quote (@code{'}) in @command{gawk} command lines @c STARTOFRANGE qs2x -@cindex @code{'} (single quote) +@cindex @code{'} (single quote) in @command{gawk} command lines If you want to clearly identify your @command{awk} program files as such, -you can add the extension @file{.awk} to the file name. This doesn't +you can add the extension @file{.awk} to the @value{FN}. 
This doesn't affect the execution of the @command{awk} program but it does make ``housekeeping'' easier. @@ -2128,13 +2409,13 @@ BEGIN @{ print "Don't Panic!" @} After making this file executable (with the @command{chmod} utility), simply type @samp{advice} at the shell and the system arranges to run @command{awk}@footnote{The -line beginning with @samp{#!} lists the full file name of an interpreter +line beginning with @samp{#!} lists the full @value{FN} of an interpreter to run and an optional initial command-line argument to pass to that interpreter. The operating system then runs the interpreter with the given argument and the full argument list of the executed program. The first argument -in the list is the full file name of the @command{awk} program. +in the list is the full @value{FN} of the @command{awk} program. The rest of the -argument list contains either options to @command{awk}, or data files, +argument list contains either options to @command{awk}, or @value{DF}s, or both. Note that on many systems @command{awk} may be found in @file{/usr/bin} instead of in @file{/bin}. Caveat Emptor.} as if you had typed @samp{awk -f advice}: @@ -2209,7 +2490,7 @@ programs, but this usually isn't very useful; the purpose of a comment is to help you or another person understand the program when reading it at a later time. -@cindex quoting +@cindex quoting, for small awk programs @cindex single quote (@code{'}), vs.@: apostrophe @cindex @code{'} (single quote), vs.@: apostrophe @quotation CAUTION @@ -2225,7 +2506,7 @@ runs, it will probably print strange messages about syntax errors. For example, look at the following: @example -$ @kbd{awk '@{ print "hello" @} # let's be cute'} +$ @kbd{awk 'BEGIN @{ print "hello" @} # let's be cute'} > @end example @@ -2250,7 +2531,7 @@ The next @value{SUBSECTION} describes the shell's quoting rules. 
@node Quoting @subsection Shell-Quoting Issues -@cindex quoting, rules for +@cindex shell quoting, rules for @menu * DOS Quoting:: Quoting in Windows Batch Files. @@ -2273,7 +2554,28 @@ knowledge of shell quoting rules. The following rules apply only to POSIX-compliant, Bourne-style shells (such as Bash, the GNU Bourne-Again Shell). If you use the C shell, you're on your own. -@itemize @bullet +Before diving into the rules, we introduce a concept that appears +throughout this @value{DOCUMENT}, which is that of the @dfn{null}, +or empty, string. + +The null string is character data that has no value. +In other words, it is empty. It is written in @command{awk} programs +like this: @code{""}. In the shell, it can be written using single +or double quotes: @code{""} or @code{''}. While the null string has +no characters in it, it does exist. Consider this command: + +@example +$ @kbd{echo ""} +@end example + +@noindent +Here, the @command{echo} utility receives a single argument, even +though that argument has no characters in it. In the rest of this +@value{DOCUMENT}, we use the terms @dfn{null string} and @dfn{empty string} +interchangeably. Now, on to the quoting rules. + + +@itemize @value{BULLET} @item Quoted items can be concatenated with nonquoted items as well as with other quoted items. The shell turns everything into one argument for @@ -2285,10 +2587,10 @@ that character. The shell removes the backslash and passes the quoted character on to the command. @item -@cindex @code{\} (backslash) -@cindex backslash (@code{\}) -@cindex single quote (@code{'}) -@cindex @code{'} (single quote) +@cindex @code{\} (backslash), in shell commands +@cindex backslash (@code{\}), in shell commands +@cindex single quote (@code{'}), in shell commands +@cindex @code{'} (single quote), in shell commands Single quotes protect everything between the opening and closing quotes. The shell does no interpretation of the quoted text, passing it on verbatim to the command. 
@@ -2298,8 +2600,8 @@ Refer back to for an example of what happens if you try. @item -@cindex double quote (@code{"}) -@cindex @code{"} (double quote) +@cindex double quote (@code{"}), in shell commands +@cindex @code{"} (double quote), in shell commands Double quotes protect most things between the opening and closing quotes. The shell does at least variable and command substitution on the quoted text. Different shells may do additional kinds of processing on double-quoted text. @@ -2336,7 +2638,7 @@ awk -F "" '@var{program}' @var{files} # correct @end example @noindent -@cindex null strings, quoting and +@cindex null strings in @command{gawk} arguments, quoting and Don't use this: @example @@ -2345,11 +2647,11 @@ awk -F"" '@var{program}' @var{files} # wrong! @noindent In the second case, @command{awk} will attempt to use the text of the program -as the value of @code{FS}, and the first file name as the text of the program! +as the value of @code{FS}, and the first @value{FN} as the text of the program! This results in syntax errors at best, and confusing behavior at worst. @end itemize -@cindex quoting, tricks for +@cindex quoting in @command{gawk} command lines, tricks for Mixing single and double quotes is difficult. You have to resort to shell quoting tricks, like this: @@ -2448,6 +2750,7 @@ Although this @value{DOCUMENT} generally only worries about POSIX systems and th POSIX shell, the following issue arises often enough for many users that it is worth addressing. +@cindex Brink, Jeroen The ``shells'' on Microsoft Windows systems use the double-quote character for quoting, and make it difficult or impossible to include an escaped double-quote character in a command-line script. @@ -2460,49 +2763,47 @@ gawk "@{ print \"\042\" $0 \"\042\" @}" @var{file} @node Sample Data Files -@section Data Files for the Examples -@c For gawk >= 4.0, update these data files. No-one has such slow modems! 
+@section @value{DDF}s for the Examples @cindex input files, examples -@cindex @code{BBS-list} file +@cindex @code{mail-list} file Many of the examples in this @value{DOCUMENT} take their input from two sample -data files. The first, @file{BBS-list}, represents a list of -computer bulletin board systems together with information about those systems. -The second data file, called @file{inventory-shipped}, contains +@value{DF}s. The first, @file{mail-list}, represents a list of peoples' names +together with their email addresses and information about those people. +The second @value{DF}, called @file{inventory-shipped}, contains information about monthly shipments. In both files, each line is considered to be one @dfn{record}. -In the data file @file{BBS-list}, each record contains the name of a computer -bulletin board, its phone number, the board's baud rate(s), and a code for -the number of hours it is operational. An @samp{A} in the last column -means the board operates 24 hours a day. A @samp{B} in the last -column means the board only operates on evening and weekend hours. -A @samp{C} means the board operates only on weekends: +In the @value{DF} @file{mail-list}, each record contains the name of a person, +his/her phone number, his/her email-address, and a code for their relationship +with the author of the list. An @samp{A} in the last column +means that the person is an acquaintance. An @samp{F} in the last +column means that the person is a friend. +An @samp{R} means that the person is a relative: -@c 2e: Update the baud rates to reflect today's faster modems @example @c system if test ! -d eg ; then mkdir eg ; fi @c system if test ! -d eg/lib ; then mkdir eg/lib ; fi @c system if test ! -d eg/data ; then mkdir eg/data ; fi @c system if test ! -d eg/prog ; then mkdir eg/prog ; fi @c system if test ! 
-d eg/misc ; then mkdir eg/misc ; fi -@c file eg/data/BBS-list -aardvark 555-5553 1200/300 B -alpo-net 555-3412 2400/1200/300 A -barfly 555-7685 1200/300 A -bites 555-1675 2400/1200/300 A -camelot 555-0542 300 C -core 555-2912 1200/300 C -fooey 555-1234 2400/1200/300 B -foot 555-6699 1200/300 B -macfoo 555-6480 1200/300 A -sdace 555-3430 2400/1200/300 A -sabafoo 555-2127 1200/300 C +@c file eg/data/mail-list +Amelia 555-5553 amelia.zodiacusque@@gmail.com F +Anthony 555-3412 anthony.asserturo@@hotmail.com A +Becky 555-7685 becky.algebrarum@@gmail.com A +Bill 555-1675 bill.drowning@@hotmail.com A +Broderick 555-0542 broderick.aliquotiens@@yahoo.com R +Camilla 555-2912 camilla.infusarum@@skynet.be R +Fabius 555-1234 fabius.undevicesimus@@ucb.edu F +Julie 555-6699 julie.perscrutabor@@skeeve.com F +Martin 555-6480 martin.codicibus@@hotmail.com A +Samuel 555-3430 samuel.lanceolis@@shu.edu A +Jean-Paul 555-2127 jeanpaul.campanorum@@nyu.edu R @c endfile @end example @cindex @code{inventory-shipped} file -The data file @file{inventory-shipped} represents +The @value{DF} @file{inventory-shipped} represents information about shipments during the year. Each record contains the month, the number of green crates shipped, the number of red boxes shipped, the number of @@ -2532,45 +2833,30 @@ Apr 21 70 74 514 @c endfile @end example -@ifinfo -If you are reading this in GNU Emacs using Info, you can copy the regions -of text showing these sample files into your own test files. This way you -can try out the examples shown in the remainder of this document. You do -this by using the command @kbd{M-x write-region} to copy text from the Info -file into a file for use with @command{awk} -(@xref{Misc File Ops, , Miscellaneous File Operations, emacs, GNU Emacs Manual}, -for more information). Using this information, create your own -@file{BBS-list} and @file{inventory-shipped} files and practice what you -learn in this @value{DOCUMENT}. 
- -@cindex Texinfo -If you are using the stand-alone version of Info, -see @ref{Extract Program}, -for an @command{awk} program that extracts these data files from -@file{gawk.texi}, the (generated) Texinfo source file for this Info file. -@end ifinfo +The sample files are included in the @command{gawk} distribution, +in the directory @file{awklib/eg/data}. @node Very Simple @section Some Simple Examples The following command runs a simple @command{awk} program that searches the -input file @file{BBS-list} for the character string @samp{foo} (a +input file @file{mail-list} for the character string @samp{li} (a grouping of characters is usually called a @dfn{string}; the term @dfn{string} is based on similar usage in English, such as ``a string of pearls,'' or ``a string of cars in a train''): @example -awk '/foo/ @{ print $0 @}' BBS-list +awk '/li/ @{ print $0 @}' mail-list @end example @noindent -When lines containing @samp{foo} are found, they are printed because +When lines containing @samp{li} are found, they are printed because @w{@samp{print $0}} means print the current line. (Just @samp{print} by itself means the same thing, so we could have written that instead.) -You will notice that slashes (@samp{/}) surround the string @samp{foo} -in the @command{awk} program. The slashes indicate that @samp{foo} +You will notice that slashes (@samp{/}) surround the string @samp{li} +in the @command{awk} program. The slashes indicate that @samp{li} is the pattern to search for. This type of pattern is called a @dfn{regular expression}, which is covered in more detail later (@pxref{Regexp}). @@ -2582,11 +2868,11 @@ interpret any of it as special shell characters. 
Here is what this program prints: @example -$ @kbd{awk '/foo/ @{ print $0 @}' BBS-list} -@print{} fooey 555-1234 2400/1200/300 B -@print{} foot 555-6699 1200/300 B -@print{} macfoo 555-6480 1200/300 A -@print{} sabafoo 555-2127 1200/300 C +$ @kbd{awk '/li/ @{ print $0 @}' mail-list} +@print{} Amelia 555-5553 amelia.zodiacusque@@gmail.com F +@print{} Broderick 555-0542 broderick.aliquotiens@@yahoo.com R +@print{} Julie 555-6699 julie.perscrutabor@@skeeve.com F +@print{} Samuel 555-3430 samuel.lanceolis@@shu.edu A @end example @cindex actions, default @@ -2597,10 +2883,10 @@ for @emph{every} input line. If the action is omitted, the default action is to print all lines that match the pattern. @cindex actions, empty -Thus, we could leave out the action (the @code{print} statement and the curly +Thus, we could leave out the action (the @code{print} statement and the braces) in the previous example and the result would be the same: -@command{awk} prints all lines matching the pattern @samp{foo}. By comparison, -omitting the @code{print} statement but retaining the curly braces makes an +@command{awk} prints all lines matching the pattern @samp{li}. By comparison, +omitting the @code{print} statement but retaining the braces makes an empty action that does nothing (i.e., no lines are printed). @cindex @command{awk} programs, one-line examples @@ -2609,15 +2895,15 @@ collection of useful, short programs to get you started. Some of these programs contain constructs that haven't been covered yet. (The description of the program will give you a good idea of what is going on, but please read the rest of the @value{DOCUMENT} to become an @command{awk} expert!) -Most of the examples use a data file named @file{data}. This is just a +Most of the examples use a @value{DF} named @file{data}. This is just a placeholder; if you use these programs yourself, substitute -your own file names for @file{data}. +your own @value{FN}s for @file{data}. 
For future reference, note that there is often more than one way to do things in @command{awk}. At some point, you may want to look back at these examples and see if you can come up with different ways to do the same things shown here: -@itemize @bullet +@itemize @value{BULLET} @item Print the length of the longest input line: @@ -2634,7 +2920,7 @@ awk 'length($0) > 80' data @end example The sole rule has a relational expression as its pattern and it has no -action---so the default action, printing the record, is used. +action---so it uses the default action, printing the record. @cindex @command{expand} utility @item @@ -2701,7 +2987,7 @@ awk 'END @{ print NR @}' data @end example @item -Print the even-numbered lines in the data file: +Print the even-numbered lines in the @value{DF}: @example awk 'NR % 2 == 0' data @@ -2717,9 +3003,9 @@ the program would print the odd-numbered lines. The @command{awk} utility reads the input files one line at a time. For each line, @command{awk} tries the patterns of each of the rules. -If several patterns match, then several actions are run in the order in +If several patterns match, then several actions execute in the order in which they appear in the @command{awk} program. If no patterns match, then -no actions are run. +no actions run. After processing all the rules that match the line (and perhaps there are none), @command{awk} reads the next line. (However, @@ -2743,30 +3029,24 @@ This program prints every line that contains the string @samp{12} @emph{or} the string @samp{21}. If a line contains both strings, it is printed twice, once by each rule. 
-This is what happens if we run this program on our two sample data files, -@file{BBS-list} and @file{inventory-shipped}: +This is what happens if we run this program on our two sample @value{DF}s, +@file{mail-list} and @file{inventory-shipped}: @example $ @kbd{awk '/12/ @{ print $0 @}} -> @kbd{/21/ @{ print $0 @}' BBS-list inventory-shipped} -@print{} aardvark 555-5553 1200/300 B -@print{} alpo-net 555-3412 2400/1200/300 A -@print{} barfly 555-7685 1200/300 A -@print{} bites 555-1675 2400/1200/300 A -@print{} core 555-2912 1200/300 C -@print{} fooey 555-1234 2400/1200/300 B -@print{} foot 555-6699 1200/300 B -@print{} macfoo 555-6480 1200/300 A -@print{} sdace 555-3430 2400/1200/300 A -@print{} sabafoo 555-2127 1200/300 C -@print{} sabafoo 555-2127 1200/300 C +> @kbd{/21/ @{ print $0 @}' mail-list inventory-shipped} +@print{} Anthony 555-3412 anthony.asserturo@@hotmail.com A +@print{} Camilla 555-2912 camilla.infusarum@@skynet.be R +@print{} Fabius 555-1234 fabius.undevicesimus@@ucb.edu F +@print{} Jean-Paul 555-2127 jeanpaul.campanorum@@nyu.edu R +@print{} Jean-Paul 555-2127 jeanpaul.campanorum@@nyu.edu R @print{} Jan 21 36 64 620 @print{} Apr 21 70 74 514 @end example @noindent -Note how the line beginning with @samp{sabafoo} -in @file{BBS-list} was printed twice, once for each rule. +Note how the line beginning with @samp{Jean-Paul} +in @file{mail-list} was printed twice, once for each rule. @node More Complex @section A More Complex Example @@ -2809,7 +3089,7 @@ the file. The fourth field identifies the group of the file. The fifth field contains the size of the file in bytes. The sixth, seventh, and eighth fields contain the month, day, and time, respectively, that the file was last modified. 
Finally, the ninth field -contains the file name.@footnote{The @samp{LC_ALL=C} is +contains the @value{FN}.@footnote{The @samp{LC_ALL=C} is needed to produce this traditional-style output from @command{ls}.} @c @cindex automatic initialization @@ -2817,8 +3097,8 @@ needed to produce this traditional-style output from @command{ls}.} The @samp{$6 == "Nov"} in our @command{awk} program is an expression that tests whether the sixth field of the output from @w{@samp{ls -l}} matches the string @samp{Nov}. Each time a line has the string -@samp{Nov} for its sixth field, the action @samp{sum += $5} is -performed. This adds the fifth field (the file's size) to the variable +@samp{Nov} for its sixth field, @command{awk} performs the action +@samp{sum += $5}. This adds the fifth field (the file's size) to the variable @code{sum}. As a result, when @command{awk} has finished reading all the input lines, @code{sum} is the total of the sizes of the files whose lines matched the pattern. (This works because @command{awk} variables @@ -2845,7 +3125,7 @@ separate rule, like this: @example awk '/12/ @{ print $0 @} - /21/ @{ print $0 @}' BBS-list inventory-shipped + /21/ @{ print $0 @}' mail-list inventory-shipped @end example @cindex @command{gawk}, newlines in @@ -2885,7 +3165,7 @@ We have generally not used backslash continuation in our sample programs. @command{gawk} places no limit on the length of a line, so backslash continuation is never strictly necessary; it just makes programs more readable. For this same reason, as well as -for clarity, we have kept most statements short in the sample programs +for clarity, we have kept most statements short in the programs presented throughout the @value{DOCUMENT}. Backslash continuation is most useful when your @command{awk} program is in a separate source file instead of entered from the command line. 
You should also note that @@ -2950,7 +3230,7 @@ $ gawk 'BEGIN @{ print "dont panic" # a friendly \ > BEGIN rule > @}' @error{} gawk: cmd. line:2: BEGIN rule -@error{} gawk: cmd. line:2: ^ parse error +@error{} gawk: cmd. line:2: ^ syntax error @end example @noindent @@ -2960,8 +3240,8 @@ noticed because it is ``hidden'' inside the comment. Thus, the @code{BEGIN} is noted as a syntax error. @cindex statements, multiple -@cindex @code{;} (semicolon) -@cindex semicolon (@code{;}) +@cindex @code{;} (semicolon), separating statements in actions +@cindex semicolon (@code{;}), separating statements in actions When @command{awk} statements within one rule are short, you might want to put more than one of them on a line. This is accomplished by separating the statements with a semicolon (@samp{;}). @@ -3021,9 +3301,16 @@ used once, and thrown away. Because @command{awk} programs are interpreted, you can avoid the (usually lengthy) compilation part of the typical edit-compile-test-debug cycle of software development. +@cindex Brian Kernighan's @command{awk} Complex programs have been written in @command{awk}, including a complete -retargetable assembler for eight-bit microprocessors (@pxref{Glossary}, for -more information), and a microcode assembler for a special-purpose Prolog +retargetable assembler for +@ifclear FOR_PRINT +eight-bit microprocessors (@pxref{Glossary}, for more information), +@end ifclear +@ifset FOR_PRINT +eight-bit microprocessors, +@end ifset +and a microcode assembler for a special-purpose Prolog computer. While the original @command{awk}'s capabilities were strained by tasks of such complexity, modern versions are more capable. Even Brian Kernighan's @@ -3033,19 +3320,55 @@ that it has are much larger than they used to be. @cindex @command{awk} programs, complex If you find yourself writing @command{awk} scripts of more than, say, a few hundred lines, you might consider using a different programming -language. 
Emacs Lisp is a good choice if you need sophisticated string -or pattern matching capabilities. The shell is also good at string and +language. +The shell is good at string and pattern matching; in addition, it allows powerful use of the system utilities. More conventional languages, such as C, C++, and Java, offer better facilities for system programming and for managing the complexity -of large programs. Programs in these languages may require more lines +of large programs. +Python offers a nice balance between high-level ease of programming and +access to system facilities. +Programs in these languages may require more lines of source code than the equivalent @command{awk} programs, but they are easier to maintain and usually run more efficiently. +@node Intro Summary +@section Summary + +@itemize @value{BULLET} +@item +Programs in @command{awk} consist of @var{pattern}-@var{action} pairs. + +@item +Use either +@samp{awk '@var{program}' @var{files}} +or +@samp{awk -f @var{program-file} @var{files}} +to run @command{awk}. + +@item +You may use the special @samp{#!} header line to create @command{awk} +programs that are directly executable. + +@item +Comments in @command{awk} programs start with @samp{#} and continue to +the end of the same line. + +@item +Be aware of quoting issues when writing @command{awk} programs as +part of a larger shell script (or MS-Windows batch file). + +@item +You may use backslash continuation to continue a source line. +Lines are automatically continued after +a comma, open brace, question mark, colon, +@samp{||}, @samp{&&}, @code{do} and @code{else}. +@end itemize + @node Invoking Gawk @chapter Running @command{awk} and @command{gawk} -This @value{CHAPTER} covers how to run awk, both POSIX-standard +This @value{CHAPTER} covers how to run @command{awk}, both POSIX-standard and @command{gawk}-specific command-line options, and what @command{awk} and @command{gawk} do with non-option arguments. 
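The two invocation forms listed in the summary hunk above can be compared side by side. This is a minimal sketch, assuming a POSIX awk and mktemp are on the PATH; the file names (list.txt, prog.awk) and the sample records are hypothetical, made up for illustration and not taken from the manual.

```shell
#!/bin/sh
# Hypothetical scratch files; none of this data comes from the gawk manual.
tmpdir=$(mktemp -d)
data="$tmpdir/list.txt"
prog="$tmpdir/prog.awk"

# Three sample records: a name and a number.
printf '%s\n' 'alpha 12' 'beta 7' 'gamma 21' > "$data"

# Form 1: program text given directly on the command line.
inline_out=$(awk '$2 > 10 { print $1 }' "$data")

# Form 2: the same program read from a file with -f.
# The first line of the file is a '#' comment, which runs to end of line.
cat > "$prog" <<'EOF'
# print the first field of records whose second field exceeds 10
$2 > 10 { print $1 }
EOF
file_out=$(awk -f "$prog" "$data")

printf 'inline:\n%s\n-f:\n%s\n' "$inline_out" "$file_out"
rm -rf "$tmpdir"
```

Both forms run the same pattern-action pair, so the two captured outputs should be identical. Keeping the program in a file mainly avoids the shell quoting issues the summary warns about.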
@@ -3070,6 +3393,7 @@ things in this @value{CHAPTER} that don't interest you right now. * Loading Shared Libraries:: Loading shared libraries into your program. * Obsolete:: Obsolete Options and/or features. * Undocumented:: Undocumented Options and Features. +* Invoking Summary:: Invocation summary. @end menu @node Command Line @@ -3083,10 +3407,10 @@ There are two ways to run @command{awk}---with an explicit program or with one or more program files. Here are templates for both of them; items enclosed in [@dots{}] in these templates are optional: -@example -awk @r{[@var{options}]} -f progfile @r{[@code{--}]} @var{file} @dots{} -awk @r{[@var{options}]} @r{[@code{--}]} '@var{program}' @var{file} @dots{} -@end example +@display +@command{awk} [@var{options}] @option{-f} @var{progfile} [@option{--}] @var{file} @dots{} +@command{awk} [@var{options}] [@option{--}] @code{'@var{program}'} @var{file} @dots{} +@end display @cindex GNU long options @cindex long options @@ -3102,7 +3426,7 @@ It is possible to invoke @command{awk} with an empty program: awk '' datafile1 datafile2 @end example -@cindex @code{--lint} option +@cindex @option{--lint} option @noindent Doing so makes little sense, though; @command{awk} exits silently when given an empty program. @@ -3142,43 +3466,27 @@ The following list describes options mandated by the POSIX standard: @table @code @item -F @var{fs} @itemx --field-separator @var{fs} -@cindex @code{-F} option -@cindex @code{--field-separator} option +@cindex @option{-F} option +@cindex @option{--field-separator} option @cindex @code{FS} variable, @code{--field-separator} option and Set the @code{FS} variable to @var{fs} (@pxref{Field Separators}). 
@item -f @var{source-file} @itemx --file @var{source-file} -@cindex @code{-f} option -@cindex @code{--file} option +@cindex @option{-f} option +@cindex @option{--file} option @cindex @command{awk} programs, location of Read @command{awk} program source from @var{source-file} instead of in the first non-option argument. This option may be given multiple times; the @command{awk} -program consists of the concatenation the contents of +program consists of the concatenation of the contents of each specified @var{source-file}. -@item -i @var{source-file} -@itemx --include @var{source-file} -@cindex @code{-i} option -@cindex @code{--include} option -@cindex @command{awk} programs, location of -Read @command{awk} source library from @var{source-file}. This option is -completely equivalent to using the @samp{@@include} directive inside -your program. This option is very -similar to the @option{-f} option, but there are two important differences. -First, when @option{-i} is used, the program source will not be loaded if it has -been previously loaded, whereas the @option{-f} will always load the file. -Second, because this option is intended to be used with code libraries, -@command{gawk} does not recognize such files as constituting main program -input. Thus, after processing an @option{-i} argument, @command{gawk} still expects to -find the main source code via the @option{-f} option or on the command-line. - @item -v @var{var}=@var{val} @itemx --assign @var{var}=@var{val} -@cindex @code{-v} option -@cindex @code{--assign} option +@cindex @option{-v} option +@cindex @option{--assign} option @cindex variables, setting Set the variable @var{var} to the value @var{val} @emph{before} execution of the program begins. Such variable values are available @@ -3199,7 +3507,7 @@ predefined value you may have given. @end quotation @item -W @var{gawk-opt} -@cindex @code{-W} option +@cindex @option{-W} option Provide an implementation-specific option. 
This is the POSIX convention for providing implementation-specific options. These options @@ -3218,8 +3526,8 @@ conventions. @cindex @code{-} (hyphen), filenames beginning with @cindex hyphen (@code{-}), filenames beginning with -This is useful if you have file names that start with @samp{-}, -or in shell scripts, if you have file names that will be specified +This is useful if you have @value{FN}s that start with @samp{-}, +or in shell scripts, if you have @value{FN}s that will be specified by the user that could start with @samp{-}. It is also useful for passing options on to the @command{awk} program; see @ref{Getopt Function}. @@ -3229,47 +3537,52 @@ program; see @ref{Getopt Function}. The following list describes @command{gawk}-specific options: -@table @code -@item -b -@itemx --characters-as-bytes -@cindex @code{-b} option -@cindex @code{--characters-as-bytes} option +@c Have to use @asis here to get docbook to come out right. +@table @asis +@item @option{-b} +@itemx @option{--characters-as-bytes} +@cindex @option{-b} option +@cindex @option{--characters-as-bytes} option Cause @command{gawk} to treat all input data as single-byte characters. In addition, all output written with @code{print} or @code{printf} are treated as single-byte characters. Normally, @command{gawk} follows the POSIX standard and attempts to process -its input data according to the current locale. This can often involve +its input data according to the current locale (@pxref{Locales}). This can often involve converting multibyte characters into wide characters (internally), and can lead to problems or confusion if the input data does not contain valid multibyte characters. This option is an easy way to tell @command{gawk}: ``hands off my data!''. 
-@item -c -@itemx --traditional -@cindex @code{--c} option -@cindex @code{--traditional} option +@item @option{-c} +@itemx @option{--traditional} +@cindex @option{-c} option +@cindex @option{--traditional} option @cindex compatibility mode (@command{gawk}), specifying Specify @dfn{compatibility mode}, in which the GNU extensions to the @command{awk} language are disabled, so that @command{gawk} behaves just like Brian Kernighan's version @command{awk}. @xref{POSIX/GNU}, -which summarizes the extensions. Also see +which summarizes the extensions. +@ifclear FOR_PRINT +Also see @ref{Compatibility Mode}. +@end ifclear -@item -C -@itemx --copyright -@cindex @code{-C} option -@cindex @code{--copyright} option +@item @option{-C} +@itemx @option{--copyright} +@cindex @option{-C} option +@cindex @option{--copyright} option @cindex GPL (General Public License), printing Print the short version of the General Public License and then exit. -@item -d@r{[}@var{file}@r{]} -@itemx --dump-variables@r{[}=@var{file}@r{]} -@cindex @code{-d} option -@cindex @code{--dump-variables} option -@cindex @code{awkvars.out} file -@cindex files, @code{awkvars.out} +@item @option{-d}[@var{file}] +@itemx @option{--dump-variables}[@code{=}@var{file}] +@cindex @option{-d} option +@cindex @option{--dump-variables} option +@cindex dump all variables of a program +@cindex @file{awkvars.out} file +@cindex files, @file{awkvars.out} @cindex variables, global, printing list of Print a sorted list of global variables, their types, and final values to @var{file}. If no @var{file} is provided, print this @@ -3286,23 +3599,23 @@ inadvertently use global variables that you meant to be local. (This is a particularly easy mistake to make with simple variable names like @code{i}, @code{j}, etc.) 
-@item -D@r{[}@var{file}@r{]} -@itemx --debug=@r{[}@var{file}@r{]} -@cindex @code{-D} option -@cindex @code{--debug} option +@item @option{-D}[@var{file}] +@itemx @option{--debug}[@code{=}@var{file}] +@cindex @option{-D} option +@cindex @option{--debug} option @cindex @command{awk} debugging, enabling Enable debugging of @command{awk} programs (@pxref{Debugging}). -By default, the debugger reads commands interactively from the terminal. +By default, the debugger reads commands interactively from the keyboard. The optional @var{file} argument allows you to specify a file with a list of commands for the debugger to execute non-interactively. No space is allowed between the @option{-D} and @var{file}, if @var{file} is supplied. -@item -e @var{program-text} -@itemx --source @var{program-text} -@cindex @code{-e} option -@cindex @code{--source} option +@item @option{-e} @var{program-text} +@itemx @option{--source} @var{program-text} +@cindex @option{-e} option +@cindex @option{--source} option @cindex source code, mixing Provide program source code in the @var{program-text}. This option allows you to mix source code in files with source @@ -3311,16 +3624,16 @@ This is particularly useful when you have library functions that you want to use from your command-line programs (@pxref{AWKPATH Variable}). -@item -E @var{file} -@itemx --exec @var{file} -@cindex @code{-E} option -@cindex @code{--exec} option +@item @option{-E} @var{file} +@itemx @option{--exec} @var{file} +@cindex @option{-E} option +@cindex @option{--exec} option @cindex @command{awk} programs, location of @cindex CGI, @command{awk} scripts for Similar to @option{-f}, read @command{awk} program text from @var{file}. There are two differences from @option{-f}: -@itemize @bullet +@itemize @value{BULLET} @item This option terminates option processing; anything else on the command line is passed on directly to the @command{awk} program. 
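The --source entry in the hunk above says the option lets you mix library files with program text typed on the command line. This is a minimal sketch of that combination, assuming gawk is installed (the option is gawk-specific, so the script falls back to a message otherwise); the library file name and the double() function are hypothetical.

```shell
#!/bin/sh
tmpdir=$(mktemp -d)
lib="$tmpdir/lib.awk"

# A one-function "library" file (hypothetical, for illustration only).
cat > "$lib" <<'EOF'
function double(x) { return 2 * x }
EOF

if command -v gawk >/dev/null 2>&1; then
    # -f loads the library; --source supplies the main program text.
    mixed_out=$(gawk -f "$lib" --source 'BEGIN { print double(21) }')
else
    # Assumption: gawk may be absent; report that rather than fail.
    mixed_out="gawk not installed"
fi

printf '%s\n' "$mixed_out"
rm -rf "$tmpdir"
```

The --source option is needed here because -f is already present: gawk takes the first non-option argument as program text only when neither -f nor --source is given.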
@@ -3342,48 +3655,69 @@ with @samp{#!} scripts (@pxref{Executable Scripts}), like so: @var{awk program here @dots{}} @end example -@item -g -@itemx --gen-pot -@cindex @code{-g} option -@cindex @code{--gen-pot} option +@item @option{-g} +@itemx @option{--gen-pot} +@cindex @option{-g} option +@cindex @option{--gen-pot} option @cindex portable object files, generating @cindex files, portable object, generating Analyze the source program and -generate a GNU @code{gettext} Portable Object Template file on standard +generate a GNU @command{gettext} Portable Object Template file on standard output for all string constants that have been marked for translation. @xref{Internationalization}, for information about this option. -@item -h -@itemx --help -@cindex @code{-h} option -@cindex @code{--help} option +@item @option{-h} +@itemx @option{--help} +@cindex @option{-h} option +@cindex @option{--help} option @cindex GNU long options, printing list of @cindex options, printing list of @cindex printing, list of options Print a ``usage'' message summarizing the short and long style options that @command{gawk} accepts and then exit. -@item -l @var{lib} -@itemx --load @var{lib} -@cindex @code{-l} option -@cindex @code{--load} option -@cindex loading, library -Load a shared library @var{lib}. This searches for the library using the @env{AWKLIBPATH} +@item @option{-i} @var{source-file} +@itemx @option{--include} @var{source-file} +@cindex @option{-i} option +@cindex @option{--include} option +@cindex @command{awk} programs, location of +Read @command{awk} source library from @var{source-file}. This option +is completely equivalent to using the @code{@@include} directive inside +your program. This option is very similar to the @option{-f} option, +but there are two important differences. First, when @option{-i} is +used, the program source is not loaded if it has been previously +loaded, whereas with @option{-f}, @command{gawk} always loads the file. 
+Second, because this option is intended to be used with code libraries, +@command{gawk} does not recognize such files as constituting main program +input. Thus, after processing an @option{-i} argument, @command{gawk} +still expects to find the main source code via the @option{-f} option +or on the command-line. + +@item @option{-l} @var{ext} +@itemx @option{--load} @var{ext} +@cindex @option{-l} option +@cindex @option{--load} option +@cindex loading, extensions +Load a dynamic extension named @var{ext}. Extensions +are stored as system shared libraries. +This option searches for the library using the @env{AWKLIBPATH} environment variable. The correct library suffix for your platform will be -supplied by default, so it need not be specified in the library name. -The library initialization routine should be named @code{dl_load()}. -An alternative is to use the @samp{@@load} keyword inside the program to load -a shared library. - -@item -L @r{[}value@r{]} -@itemx --lint@r{[}=value@r{]} -@cindex @code{-l} option -@cindex @code{--lint} option +supplied by default, so it need not be specified in the extension name. +The extension initialization routine should be named @code{dl_load()}. +An alternative is to use the @code{@@load} keyword inside the program to load +a shared library. This feature is described in detail in @ref{Dynamic Extensions}. + +@item @option{-L}[@var{value}] +@itemx @option{--lint}[@code{=}@var{value}] +@cindex @option{-l} option +@cindex @option{--lint} option @cindex lint checking, issuing warnings @cindex warnings, issuing Warn about constructs that are dubious or nonportable to other @command{awk} implementations. +No space is allowed between the @option{-D} and @var{value}, if +@var{value} is supplied. Some warnings are issued when @command{gawk} first reads your program. Others are issued at runtime, as your program executes. 
With an optional argument of @samp{fatal}, @@ -3399,18 +3733,18 @@ when eliminating problems pointed out by @option{--lint}, you should take care to search for all occurrences of each inappropriate construct. As @command{awk} programs are usually short, doing so is not burdensome. -@item -M -@itemx --bignum -@cindex @code{-M} option -@cindex @code{--bignum} option +@item @option{-M} +@itemx @option{--bignum} +@cindex @option{-M} option +@cindex @option{--bignum} option Force arbitrary precision arithmetic on numbers. This option has no effect if @command{gawk} is not compiled to use the GNU MPFR and MP libraries (@pxref{Arbitrary Precision Arithmetic}). -@item -n -@itemx --non-decimal-data -@cindex @code{-n} option -@cindex @code{--non-decimal-data} option +@item @option{-n} +@itemx @option{--non-decimal-data} +@cindex @option{-n} option +@cindex @option{--non-decimal-data} option @cindex hexadecimal values@comma{} enabling interpretation of @cindex octal values@comma{} enabling interpretation of @cindex troubleshooting, @code{--non-decimal-data} option @@ -3423,52 +3757,59 @@ This option can severely break old programs. Use with care. @end quotation -@item -N -@itemx --use-lc-numeric -@cindex @code{-N} option -@cindex @code{--use-lc-numeric} option +@item @option{-N} +@itemx @option{--use-lc-numeric} +@cindex @option{-N} option +@cindex @option{--use-lc-numeric} option Force the use of the locale's decimal point character when parsing numeric input data (@pxref{Locales}). -@item -o@r{[}@var{file}@r{]} -@itemx --pretty-print@r{[}=@var{file}@r{]} -@cindex @code{-o} option -@cindex @code{--pretty-print} option +@item @option{-o}[@var{file}] +@itemx @option{--pretty-print}[@code{=}@var{file}] +@cindex @option{-o} option +@cindex @option{--pretty-print} option Enable pretty-printing of @command{awk} programs. -By default, output program is created in a file named @file{awkprof.out}. 
+By default, output program is created in a file named @file{awkprof.out} +(@pxref{Profiling}). The optional @var{file} argument allows you to specify a different -file name for the output. +@value{FN} for the output. No space is allowed between the @option{-o} and @var{file}, if @var{file} is supplied. -@item -O -@itemx --optimize -@cindex @code{--optimize} option -@cindex @code{-O} option +@quotation NOTE +Due to the way @command{gawk} has evolved, with this option +your program is still executed. This will change in the +next major release such that @command{gawk} will only +pretty-print the program and not run it. +@end quotation + +@item @option{-O} +@itemx @option{--optimize} +@cindex @option{--optimize} option +@cindex @option{-O} option Enable some optimizations on the internal representation of the program. -At the moment this includes just simple constant folding. The @command{gawk} -maintainer hopes to add more optimizations over time. +At the moment this includes just simple constant folding. -@item -p@r{[}@var{file}@r{]} -@itemx --profile@r{[}=@var{file}@r{]} -@cindex @code{-p} option -@cindex @code{--profile} option +@item @option{-p}[@var{file}] +@itemx @option{--profile}[@code{=}@var{file}] +@cindex @option{-p} option +@cindex @option{--profile} option @cindex @command{awk} profiling, enabling Enable profiling of @command{awk} programs (@pxref{Profiling}). By default, profiles are created in a file named @file{awkprof.out}. The optional @var{file} argument allows you to specify a different -file name for the profile file. +@value{FN} for the profile file. No space is allowed between the @option{-p} and @var{file}, if @var{file} is supplied. The profile contains execution counts for each statement in the program in the left margin, and function call counts for each function. 
-@item -P -@itemx --posix -@cindex @code{-P} option -@cindex @code{--posix} option +@item @option{-P} +@itemx @option{--posix} +@cindex @option{-P} option +@cindex @option{--posix} option @cindex POSIX mode @cindex @command{gawk}, extensions@comma{} disabling Operate in strict POSIX mode. This disables all @command{gawk} @@ -3480,7 +3821,7 @@ Also, the following additional restrictions apply: -@itemize @bullet +@itemize @value{BULLET} @cindex newlines @cindex whitespace, newlines as @@ -3509,28 +3850,28 @@ data (@pxref{Locales}). @c @cindex automatic warnings @c @cindex warnings, automatic -@cindex @code{--traditional} option, @code{--posix} option and -@cindex @code{--posix} option, @code{--traditional} option and +@cindex @option{--traditional} option, @code{--posix} option and +@cindex @option{--posix} option, @code{--traditional} option and If you supply both @option{--traditional} and @option{--posix} on the command line, @option{--posix} takes precedence. @command{gawk} -also issues a warning if both options are supplied. +issues a warning if both options are supplied. -@item -r -@itemx --re-interval -@cindex @code{-r} option -@cindex @code{--re-interval} option +@item @option{-r} +@itemx @option{--re-interval} +@cindex @option{-r} option +@cindex @option{--re-interval} option @cindex regular expressions, interval expressions and Allow interval expressions (@pxref{Regexp Operators}) in regexps. This is now @command{gawk}'s default behavior. Nevertheless, this option remains both for backward compatibility, -and for use in combination with the @option{--traditional} option. +and for use in combination with @option{--traditional}. 
-@item -S -@itemx --sandbox -@cindex @code{-S} option -@cindex @code{--sandbox} option +@item @option{-S} +@itemx @option{--sandbox} +@cindex @option{-S} option +@cindex @option{--sandbox} option @cindex sandbox mode Disable the @code{system()} function, input redirections with @code{getline}, @@ -3538,20 +3879,20 @@ output redirections with @code{print} and @code{printf}, and dynamic extensions. This is particularly useful when you want to run @command{awk} scripts from questionable sources and need to make sure the scripts -can't access your system (other than the specified input data file). +can't access your system (other than the specified input @value{DF}). -@item -t -@itemx --lint-old -@cindex @code{--L} option -@cindex @code{--lint-old} option +@item @option{-t} +@itemx @option{--lint-old} +@cindex @option{-L} option +@cindex @option{--lint-old} option Warn about constructs that are not available in the original version of @command{awk} from Version 7 Unix (@pxref{V7/SVR3.1}). -@item -V -@itemx --version -@cindex @code{-V} option -@cindex @code{--version} option +@item @option{-V} +@itemx @option{--version} +@cindex @option{-V} option +@cindex @option{--version} option @cindex @command{gawk}, versions of, information about@comma{} printing Print version information for this particular copy of @command{gawk}. This allows you to determine if your copy of @command{gawk} is up to date @@ -3565,14 +3906,14 @@ As long as program text has been supplied, any other options are flagged as invalid with a warning message but are otherwise ignored. -@cindex @code{-F} option, @code{-Ft} sets @code{FS} to TAB +@cindex @option{-F} option, @option{-Ft} sets @code{FS} to TAB In compatibility mode, as a special case, if the value of @var{fs} supplied to the @option{-F} option is @samp{t}, then @code{FS} is set to the TAB character (@code{"\t"}). This is true only for @option{--traditional} and not for @option{--posix} (@pxref{Field Separators}). 
-@cindex @code{-f} option, multiple uses +@cindex @option{-f} option, multiple uses The @option{-f} option may be used more than once on the command line. If it is, @command{awk} reads its program source from all of the named files, as if they had been concatenated together into one big file. This is @@ -3584,22 +3925,22 @@ of having to be included into each individual program. function names must be unique.) With standard @command{awk}, library functions can still be used, even -if the program is entered at the terminal, +if the program is entered at the keyboard, by specifying @samp{-f /dev/tty}. After typing your program, type @kbd{Ctrl-d} (the end-of-file character) to terminate it. (You may also use @samp{-f -} to read program source from the standard input but then you will not be able to also use the standard input as a source of data.) -Because it is clumsy using the standard @command{awk} mechanisms to mix source -file and command-line @command{awk} programs, @command{gawk} provides the -@option{--source} option. This does not require you to pre-empt the standard -input for your source code; it allows you to easily mix command-line -and library source code -(@pxref{AWKPATH Variable}). -The @option{--source} option may also be used multiple times on the command line. +Because it is clumsy using the standard @command{awk} mechanisms to mix +source file and command-line @command{awk} programs, @command{gawk} +provides the @option{--source} option. This does not require you to +pre-empt the standard input for your source code; it allows you to easily +mix command-line and library source code (@pxref{AWKPATH Variable}). +As with @option{-f}, the @option{--source} and @option{--include} +options may also be used multiple times on the command line. 
-@cindex @code{--source} option +@cindex @option{--source} option If no @option{-f} or @option{--source} option is specified, then @command{gawk} uses the first non-option command-line argument as the text of the program source code. @@ -3609,7 +3950,7 @@ program source code. @cindex POSIX mode If the environment variable @env{POSIXLY_CORRECT} exists, then @command{gawk} behaves in strict POSIX mode, exactly as if -you had supplied the @option{--posix} command-line option. +you had supplied @option{--posix}. Many GNU programs look for this environment variable to suppress extensions that conflict with POSIX, but @command{gawk} behaves differently: it suppresses all extensions, even those that do not @@ -3658,6 +3999,7 @@ file at all. @cindex @command{gawk}, @code{ARGIND} variable in @cindex @code{ARGIND} variable, command-line arguments +@cindex @code{ARGV} array, indexing into @cindex @code{ARGC}/@code{ARGV} variables, command-line arguments All these arguments are made available to your @command{awk} program in the @code{ARGV} array (@pxref{Built-in Variables}). Command-line options @@ -3668,9 +4010,10 @@ sets the variable @code{ARGIND} to the index in @code{ARGV} of the current element. @cindex input files, variable assignments and -The distinction between file name arguments and variable-assignment +@cindex variable assignments and input files +The distinction between @value{FN} arguments and variable-assignment arguments is made when @command{awk} is about to open the next input file. -At that point in execution, it checks the file name to see whether +At that point in execution, it checks the @value{FN} to see whether it is really a variable assignment; if so, @command{awk} sets the variable instead of reading a file. @@ -3686,8 +4029,8 @@ The variable values given on the command line are processed for escape sequences (@pxref{Escape Sequences}). 
@value{DARKCORNER} -In some earlier implementations of @command{awk}, when a variable assignment -occurred before any file names, the assignment would happen @emph{before} +In some very early implementations of @command{awk}, when a variable assignment +occurred before any @value{FN}s, the assignment would happen @emph{before} the @code{BEGIN} rule was executed. @command{awk}'s behavior was thus inconsistent; some command-line assignments were available inside the @code{BEGIN} rule, while others were not. Unfortunately, @@ -3698,8 +4041,8 @@ upon the old behavior. The variable assignment feature is most useful for assigning to variables such as @code{RS}, @code{OFS}, and @code{ORS}, which control input and -output formats before scanning the data files. It is also useful for -controlling state if multiple passes are needed over a data file. For +output formats, before scanning the @value{DF}s. It is also useful for +controlling state if multiple passes are needed over a @value{DF}. For example: @cindex files, multiple passes over @@ -3735,16 +4078,17 @@ You may also use @code{"-"} to name standard input when reading files with @code{getline} (@pxref{Getline/File}). In addition, @command{gawk} allows you to specify the special -file name @file{/dev/stdin}, both on the command line and +@value{FN} @file{/dev/stdin}, both on the command line and with @code{getline}. Some other versions of @command{awk} also support this, but it is not standard. (Some operating systems provide a @file{/dev/stdin} file -in the file system, however, @command{gawk} always processes -this file name itself.) +in the file system; however, @command{gawk} always processes +this @value{FN} itself.) @node Environment Variables @section The Environment Variables @command{gawk} Uses +@cindex environment variables used by @command{gawk} A number of environment variables influence how @command{gawk} behaves. @@ -3760,8 +4104,7 @@ behaves. 
@node AWKPATH Variable @subsection The @env{AWKPATH} Environment Variable @cindex @env{AWKPATH} environment variable -@cindex directories, searching -@cindex search paths +@cindex directories, searching for source files @cindex search paths, for source files @cindex differences in @command{awk} and @command{gawk}, @code{AWKPATH} environment variable @ifinfo @@ -3771,14 +4114,14 @@ on the command-line with the @option{-f} option. In most @command{awk} implementations, you must supply a precise path name for each program file, unless the file is in the current directory. -But in @command{gawk}, if the file name supplied to the @option{-f} +But in @command{gawk}, if the @value{FN} supplied to the @option{-f} or @option{-i} options -does not contain a @samp{/}, then @command{gawk} searches a list of +does not contain a directory separator @samp{/}, then @command{gawk} searches a list of directories (called the @dfn{search path}), one by one, looking for a file with the specified name. The search path is a string consisting of directory names -separated by colons. @command{gawk} gets its search path from the +separated by colons@footnote{Semicolons on MS-Windows and MS-DOS.}. @command{gawk} gets its search path from the @env{AWKPATH} environment variable. If that variable does not exist, @command{gawk} uses a default path, @samp{.:/usr/local/share/awk}.@footnote{Your version of @command{gawk} @@ -3788,10 +4131,10 @@ directory is the value of @samp{$(datadir)} generated when @command{gawk} was configured. You probably don't need to worry about this, though.} -The search path feature is particularly useful for building libraries +The search path feature is particularly helpful for building libraries of useful @command{awk} functions. The library files can be placed in a standard directory in the default path and then specified on -the command line with a short file name. Otherwise, the full file name +the command line with a short @value{FN}. 
Otherwise, the full @value{FN} would have to be typed for each file. By using the @option{-i} option, or the @option{--source} and @option{-f} options, your command-line @@ -3802,17 +4145,20 @@ This is true for both @option{--traditional} and @option{--posix}. @xref{Options}. If the source code is not found after the initial search, the path is searched -again after adding the default @samp{.awk} suffix to the filename. +again after adding the default @samp{.awk} suffix to the @value{FN}. @quotation NOTE +@c 4/2014: +@c using @samp{.} to get quotes, since @file{} no longer supplies them. To include the current directory in the path, either place -@file{.} explicitly in the path or write a null entry in the +@samp{.} explicitly in the path or write a null entry in the path. (A null entry is indicated by starting or ending the path with a -colon or by placing two colons next to each other (@samp{::}).) +colon or by placing two colons next to each other [@samp{::}].) This path search mechanism is similar to the shell's. -@c someday, @cite{The Bourne Again Shell}.... +(See @uref{http://www.gnu.org/software/bash/manual/, +@cite{The Bourne-Again SHell manual}.}) However, @command{gawk} always looks in the current directory @emph{before} searching @env{AWKPATH}, so there is no real reason to include @@ -3824,7 +4170,7 @@ the current directory in the search path. If @env{AWKPATH} is not defined in the environment, @command{gawk} places its default search path into @code{ENVIRON["AWKPATH"]}. This makes it easy to determine -the actual search path that @command{gawk} will use +the actual search path that @command{gawk} used from within an @command{awk} program. While you can change @code{ENVIRON["AWKPATH"]} within your @command{awk} @@ -3836,19 +4182,18 @@ found, and @command{gawk} no longer needs to use @env{AWKPATH}. 
@node AWKLIBPATH Variable @subsection The @env{AWKLIBPATH} Environment Variable @cindex @env{AWKLIBPATH} environment variable -@cindex directories, searching -@cindex search paths -@cindex search paths, for shared libraries +@cindex directories, searching for loadable extensions +@cindex search paths, for loadable extensions @cindex differences in @command{awk} and @command{gawk}, @code{AWKLIBPATH} environment variable The @env{AWKLIBPATH} environment variable is similar to the @env{AWKPATH} -variable, but it is used to search for shared libraries specified -with the @option{-l} option rather than for source files. If the library -is not found, the path is searched again after adding the appropriate -shared library suffix for the platform. For example, on GNU/Linux systems, -the suffix @samp{.so} is used. -The search path specified is also used for libraries loaded via the -@samp{@@load} keyword (@pxref{Loading Shared Libraries}). +variable, but it is used to search for loadable extensions (stored as +system shared libraries) specified with the @option{-l} option rather +than for source files. If the extension is not found, the path is +searched again after adding the appropriate shared library suffix for +the platform. For example, on GNU/Linux systems, the suffix @samp{.so} +is used. The search path specified is also used for extensions loaded +via the @code{@@load} keyword (@pxref{Loading Shared Libraries}). @node Other Environment Variables @subsection Other Environment Variables @@ -3864,7 +4209,7 @@ mode, disabling all traditional and GNU extensions. @xref{Options}. @item GAWK_SOCK_RETRIES -Controls the number of time @command{gawk} will attempt to +Controls the number of times @command{gawk} attempts to retry a two-way TCP/IP (socket) connection before giving up. @xref{TCP/IP Networking}. @@ -3885,9 +4230,18 @@ for use by the @command{gawk} developers for testing and tuning. They are subject to change. 
The variables are: @table @env +@item AWKBUFSIZE +This variable only affects @command{gawk} on POSIX-compliant systems. +With a value of @samp{exact}, @command{gawk} uses the size of each input +file as the size of the memory buffer to allocate for I/O. Otherwise, +the value should be a number, and @command{gawk} uses that number as +the size of the buffer to allocate. (When this variable is not set, +@command{gawk} uses the smaller of the file's size and the ``default'' +blocksize, which is usually the file system's I/O blocksize.) + @item AWK_HASH If this variable exists with a value of @samp{gst}, @command{gawk} -will switch to using the hash function from GNU Smalltalk for +switches to using the hash function from GNU Smalltalk for managing arrays. This function may be marginally faster than the standard function. @@ -3912,6 +4266,11 @@ two regexp matchers that @command{gawk} uses internally. (There aren't supposed to be differences, but occasionally theory and practice don't coordinate with each other.) +@item GAWK_NO_PP_RUN +If this variable exists, then when invoked with the @option{--pretty-print} +option, @command{gawk} skips running the program. This variable will +not survive into the next major release. + @item GAWK_STACKSIZE This specifies the amount by which @command{gawk} should grow its internal evaluation stack, when needed. @@ -3956,13 +4315,13 @@ to @code{EXIT_FAILURE}. This @value{SECTION} describes a feature that is specific to @command{gawk}. -The @samp{@@include} keyword can be used to read external @command{awk} source +The @code{@@include} keyword can be used to read external @command{awk} source files. This gives you the ability to split large @command{awk} source files into smaller, more manageable pieces, and also lets you reuse common @command{awk} code from various @command{awk} scripts. In other words, you can group together @command{awk} functions, used to carry out specific tasks, into external files.
These files can be used just like function libraries, -using the @samp{@@include} keyword in conjunction with the @env{AWKPATH} +using the @code{@@include} keyword in conjunction with the @env{AWKPATH} environment variable. Note that source files may also be included using the @option{-i} option. @@ -3996,14 +4355,14 @@ $ @kbd{gawk -f test2} @end example @code{gawk} runs the @file{test2} script which includes @file{test1} -using the @samp{@@include} +using the @code{@@include} keyword. So, to include external @command{awk} source files you just -use @samp{@@include} followed by the name of the file to be included, +use @code{@@include} followed by the name of the file to be included, enclosed in double quotes. @quotation NOTE -Keep in mind that this is a language construct and the file name cannot -be a string variable, but rather just a literal string in double quotes. +Keep in mind that this is a language construct and the @value{FN} cannot +be a string variable, but rather just a literal string constant in double quotes. @end quotation The files to be included may be nested; e.g., given a third @@ -4027,7 +4386,7 @@ $ @kbd{gawk -f test3} @print{} This is file test3. @end example -The file name can, of course, be a pathname. For example: +The @value{FN} can, of course, be a pathname. For example: @example @@include "../io_funcs" @@ -4042,49 +4401,50 @@ or: @noindent are valid. The @code{AWKPATH} environment variable can be of great -value when using @samp{@@include}. The same rules for the use +value when using @code{@@include}. The same rules for the use of the @code{AWKPATH} variable in command-line file searches (@pxref{AWKPATH Variable}) apply to -@samp{@@include} also. +@code{@@include} also. This is very helpful in constructing @command{gawk} function libraries. If you have a large script with useful, general purpose @command{awk} functions, you can break it down into library files and put those files in a special directory. 
You can then include those ``libraries,'' using either the full pathnames of the files, or by setting the @code{AWKPATH} -environment variable accordingly and then using @samp{@@include} with +environment variable accordingly and then using @code{@@include} with just the file part of the full pathname. Of course you can have more than one directory to keep library files; the more complex the working environment is, the more directories you may need to organize the files to be included. Given the ability to specify multiple @option{-f} options, the -@samp{@@include} mechanism is not strictly necessary. -However, the @samp{@@include} keyword +@code{@@include} mechanism is not strictly necessary. +However, the @code{@@include} keyword can help you in constructing self-contained @command{gawk} programs, thus reducing the need for writing complex and tedious command lines. -In particular, @samp{@@include} is very useful for writing CGI scripts +In particular, @code{@@include} is very useful for writing CGI scripts to be run from web pages. As mentioned in @ref{AWKPATH Variable}, the current directory is always searched first for source files, before searching in @env{AWKPATH}, -and this also applies to files named with @samp{@@include}. +and this also applies to files named with @code{@@include}. @node Loading Shared Libraries -@section Loading Shared Libraries Into Your Program +@section Loading Dynamic Extensions Into Your Program This @value{SECTION} describes a feature that is specific to @command{gawk}. -The @samp{@@load} keyword can be used to read external @command{awk} shared -libraries. This allows you to link in compiled code that may offer superior +The @code{@@load} keyword can be used to read external @command{awk} extensions +(stored as system shared libraries). +This allows you to link in compiled code that may offer superior performance and/or give you access to extended capabilities not supported by the @command{awk} language. 
The @env{AWKLIBPATH} variable is used to -search for the shared library. Using @samp{@@load} is completely equivalent +search for the extension. Using @code{@@load} is completely equivalent to using the @option{-l} command-line option. -If the shared library is not initially found in @env{AWKLIBPATH}, another +If the extension is not initially found in @env{AWKLIBPATH}, another search is conducted after appending the platform's default shared library -suffix to the filename. For example, on GNU/Linux systems, the suffix +suffix to the @value{FN}. For example, on GNU/Linux systems, the suffix @samp{.so} is used. @example @@ -4102,16 +4462,17 @@ $ @kbd{gawk -lordchr 'BEGIN @{print chr(65)@}'} @noindent For command-line usage, the @option{-l} option is more convenient, -but @samp{@@load} is useful for embedding inside an @command{awk} source file -that requires access to a shared library. +but @code{@@load} is useful for embedding inside an @command{awk} source file +that requires access to an extension. @ref{Dynamic Extensions}, describes how to write extensions (in C or C++) -that can be loaded with either @samp{@@load} or the @option{-l} option. +that can be loaded with either @code{@@load} or the @option{-l} option. @node Obsolete @section Obsolete Options and/or Features -@cindex features, advanced, See advanced features +@c update this section for each release! + @cindex options, deprecated @cindex features, deprecated @cindex obsolete features @@ -4120,12 +4481,9 @@ previous releases of @command{gawk} that are either not available in the current version or that are still supported but deprecated (meaning that they will @emph{not} be in the next release). -@c update this section for each release! - -@cindex @code{PROCINFO} array The process-related special files @file{/dev/pid}, @file{/dev/ppid}, @file{/dev/pgrpid}, and @file{/dev/user} were deprecated in @command{gawk} -3.1, but still worked. As of version 4.0, they are no longer +3.1, but still worked. 
As of @value{PVERSION} 4.0, they are no longer interpreted specially by @command{gawk}. (Use @code{PROCINFO} instead; see @ref{Auto-set}.) @@ -4148,6 +4506,7 @@ in case some option becomes obsolete in a future version of @command{gawk}. @author Obi-Wan @end quotation +@cindex shells, sea This @value{SECTION} intentionally left blank. @@ -4160,7 +4519,7 @@ blank. @table @code @item -W nostalgia @itemx --nostalgia -Print the message @code{"awk: bailing out near line 1"} and dump core. +Print the message @samp{awk: bailing out near line 1} and dump core. This option was inspired by the common behavior of very early versions of Unix @command{awk} and by a t--shirt. The message is @emph{not} subject to translation in non-English locales. @@ -4204,9 +4563,61 @@ long-undocumented ``feature'' of Unix @code{awk}. @end ignore +@node Invoking Summary +@section Summary + +@itemize @value{BULLET} +@item +Use either +@samp{awk '@var{program}' @var{files}} +or +@samp{awk -f @var{program-file} @var{files}} +to run @command{awk}. + +@item +The three standard @command{awk} options are @option{-f}, @option{-F} +and @option{-v}. @command{gawk} supplies these and many others, as well +as corresponding GNU-style long options. + +@item +Non-option command-line arguments are usually treated as @value{FN}s, +unless they have the form @samp{@var{var}=@var{value}}, in which case +they are taken as variable assignments to be performed at that point +in processing the input. + +@item +All non-option command-line arguments, excluding the program text, +are placed in the @code{ARGV} array. Adjusting @code{ARGC} and @code{ARGV} +affects how @command{awk} processes input. + +@item +You can use a single minus sign (@samp{-}) to refer to standard input +on the command line. + +@item +@command{gawk} pays attention to a number of environment variables. +@env{AWKPATH}, @env{AWKLIBPATH}, and @env{POSIXLY_CORRECT} are the +most important ones. 
+ +@item +@command{gawk}'s exit status conveys information to the program +that invoked it. Use the @code{exit} statement from within +an @command{awk} program to set the exit status. + +@item +@command{gawk} allows you to include other @command{awk} source files into +your program using the @code{@@include} statement and/or the @option{-i} +and @option{-f} command-line options. + +@item +@command{gawk} allows you to load additional functions written in C +or C++ using the @code{@@load} statement and/or the @option{-l} option. +(This advanced feature is described later on in @ref{Dynamic Extensions}.) +@end itemize + @node Regexp @chapter Regular Expressions -@cindex regexp, See regular expressions +@cindex regexp @c STARTOFRANGE regexp @cindex regular expressions @@ -4215,8 +4626,8 @@ set of strings. Because regular expressions are such a fundamental part of @command{awk} programming, their format and use deserve a separate @value{CHAPTER}. -@cindex forward slash (@code{/}) -@cindex @code{/} (forward slash) +@cindex forward slash (@code{/}) to enclose regular expressions +@cindex @code{/} (forward slash) to enclose regular expressions A regular expression enclosed in slashes (@samp{/}) is an @command{awk} pattern that matches every input record whose text belongs to that set. @@ -4242,6 +4653,7 @@ regular expressions work, we present more complicated instances. * Case-sensitivity:: How to do case-insensitive matching. * Leftmost Longest:: How much text matches. * Computed Regexps:: Using Dynamic Regexps. +* Regexp Summary:: Regular expressions summary. @end menu @node Regexp Usage @@ -4252,15 +4664,15 @@ A regular expression can be used as a pattern by enclosing it in slashes. Then the regular expression is tested against the entire text of each record. (Normally, it only needs to match some part of the text in order to succeed.) 
For example, the -following prints the second field of each record that contains the string -@samp{foo} anywhere in it: +following prints the second field of each record where the string +@samp{li} appears anywhere in the record: @example -$ @kbd{awk '/foo/ @{ print $2 @}' BBS-list} -@print{} 555-1234 +$ @kbd{awk '/li/ @{ print $2 @}' mail-list} +@print{} 555-5553 +@print{} 555-0542 @print{} 555-6699 -@print{} 555-6480 -@print{} 555-2127 +@print{} 555-3430 @end example @cindex regular expressions, operators @@ -4272,9 +4684,9 @@ $ @kbd{awk '/foo/ @{ print $2 @}' BBS-list} @cindex @code{!} (exclamation point), @code{!~} operator @cindex exclamation point (@code{!}), @code{!~} operator @c @cindex operators, @code{!~} -@cindex @code{if} statement -@cindex @code{while} statement -@cindex @code{do}-@code{while} statement +@cindex @code{if} statement, use of regexps in +@cindex @code{while} statement, use of regexps in +@cindex @code{do}-@code{while} statement, use of regexps in @c @cindex statements, @code{if} @c @cindex statements, @code{while} @c @cindex statements, @code{do} @@ -4333,6 +4745,7 @@ $ @kbd{awk '$1 !~ /J/' inventory-shipped} @end example @cindex regexp constants +@cindex constant regexps @cindex regular expressions, constants, See regexp constants When a regexp is enclosed in slashes, such as @code{/foo/}, we call it a @dfn{regexp constant}, much like @code{5.27} is a numeric constant and @@ -4341,7 +4754,7 @@ a @dfn{regexp constant}, much like @code{5.27} is a numeric constant and @node Escape Sequences @section Escape Sequences -@cindex escape sequences +@cindex escape sequences, in strings @cindex backslash (@code{\}), in escape sequences @cindex @code{\} (backslash), in escape sequences Some characters cannot be included literally in string constants @@ -4382,7 +4795,7 @@ A literal backslash, @samp{\}. @cindex backslash (@code{\}), @code{\a} escape sequence @item \a The ``alert'' character, @kbd{Ctrl-g}, ASCII code 7 (BEL). 
-(This usually makes some sort of audible noise.) +(This often makes some sort of audible noise.) @cindex @code{\} (backslash), @code{\b} escape sequence @cindex backslash (@code{\}), @code{\b} escape sequence @@ -4476,7 +4889,7 @@ shown in the previous list. To summarize: -@itemize @bullet +@itemize @value{BULLET} @item The escape sequences in the table above are always processed first, for both string constants and regexp constants. This happens very early, @@ -4506,6 +4919,7 @@ leaves what happens as undefined. There are two choices: @c @cindex automatic warnings @c @cindex warnings, automatic +@cindex Brian Kernighan's @command{awk} @table @asis @item Strip the backslash out This is what Brian Kernighan's @command{awk} and @command{gawk} both do. @@ -4519,6 +4933,7 @@ two backslashes in the string: @samp{FS = @w{"[ \t]+\\|[ \t]+"}}.) @cindex @command{gawk}, escape sequences @cindex Unix @command{awk}, backslashes in escape sequences +@cindex @command{mawk} utility @item Leave the backslash alone Some other @command{awk} implementations do this. In such implementations, typing @code{"a\qc"} is the same as typing @@ -4550,6 +4965,7 @@ escape sequences literally when used in regexp constants. Thus, @section Regular Expression Operators @c STARTOFRANGE regexpo @cindex regular expressions, operators +@cindex metacharacters in regular expressions You can combine regular expressions with special characters, called @dfn{regular expression operators} or @dfn{metacharacters}, to @@ -4567,10 +4983,11 @@ the very first step in processing regexps. Here is a list of metacharacters. All characters that are not escape sequences and that are not listed in the table stand for themselves: -@table @code -@cindex backslash (@code{\}) -@cindex @code{\} (backslash) -@item \ +@c Use @asis so the docbook comes out ok. Sigh. 
+@table @asis +@cindex backslash (@code{\}), regexp operator +@cindex @code{\} (backslash), regexp operator +@item @code{\} This is used to suppress the special meaning of a character when matching. For example, @samp{\$} matches the character @samp{$}. @@ -4579,7 +4996,7 @@ matches the character @samp{$}. @cindex Texinfo, chapter beginnings in files @cindex @code{^} (caret), regexp operator @cindex caret (@code{^}), regexp operator -@item ^ +@item @code{^} This matches the beginning of a string. For example, @samp{^@@chapter} matches @samp{@@chapter} at the beginning of a string and can be used to identify chapter beginnings in Texinfo source files. @@ -4587,29 +5004,31 @@ The @samp{^} is known as an @dfn{anchor}, because it anchors the pattern to match only at the beginning of the string. It is important to realize that @samp{^} does not match the beginning of -a line embedded in a string. +a line (the point right after a @samp{\n} newline character) embedded in a string. The condition is not true in the following example: @example if ("line1\nLINE 2" ~ /^L/) @dots{} @end example -@cindex @code{$} (dollar sign) -@cindex dollar sign (@code{$}) -@item $ +@cindex @code{$} (dollar sign), regexp operator +@cindex dollar sign (@code{$}), regexp operator +@item @code{$} This is similar to @samp{^}, but it matches only at the end of a string. For example, @samp{p$} matches a record that ends with a @samp{p}. The @samp{$} is an anchor -and does not match the end of a line embedded in a string. +and does not match the end of a line +(the point right before a @samp{\n} newline character) +embedded in a string. The condition in the following example is not true: @example if ("line1\nLINE 2" ~ /1$/) @dots{} @end example -@cindex @code{.} (period) -@cindex period (@code{.}) -@item . 
@r{(period)} +@cindex @code{.} (period), regexp operator +@cindex period (@code{.}), regexp operator +@item @code{.} (period) This matches any single character, @emph{including} the newline character. For example, @samp{.P} matches any single character followed by a @samp{P} in a string. Using @@ -4624,12 +5043,13 @@ character, which is a character with all bits equal to zero. Otherwise, @sc{nul} is just another character. Other versions of @command{awk} may not be able to match the @sc{nul} character. -@cindex @code{[]} (square brackets) -@cindex square brackets (@code{[]}) +@cindex @code{[]} (square brackets), regexp operator +@cindex square brackets (@code{[]}), regexp operator @cindex bracket expressions @cindex character sets, See Also bracket expressions @cindex character lists, See bracket expressions -@item [@dots{}] +@cindex character classes, See bracket expressions +@item @code{[}@dots{}@code{]} This is called a @dfn{bracket expression}.@footnote{In other literature, you may see a bracket expression referred to as either a @dfn{character set}, a @dfn{character class}, or a @dfn{character list}.} @@ -4641,7 +5061,7 @@ is given in @ref{Bracket Expressions}. @cindex bracket expressions, complemented -@item [^ @dots{}] +@item @code{[^}@dots{}@code{]} This is a @dfn{complemented bracket expression}. The first character after the @samp{[} @emph{must} be a @samp{^}. It matches any characters @emph{except} those in the square brackets. For example, @samp{[^awk]} @@ -4650,7 +5070,7 @@ or @samp{k}. @cindex @code{|} (vertical bar) @cindex vertical bar (@code{|}) -@item | +@item @code{|} This is the @dfn{alternation operator} and it is used to specify alternatives. The @samp{|} has the lowest precedence of all the regular @@ -4661,9 +5081,9 @@ means it matches any string that starts with @samp{P} or contains a digit. The alternation applies to the largest possible regexps on either side. 
-@cindex @code{()} (parentheses) -@cindex parentheses @code{()} -@item (@dots{}) +@cindex @code{()} (parentheses), regexp operator +@cindex parentheses @code{()}, regexp operator +@item @code{(}@dots{}@code{)} Parentheses are used for grouping in regular expressions, as in arithmetic. They can be used to concatenate regular expressions containing the alternation operator, @samp{|}. For example, @@ -4674,7 +5094,7 @@ explained further on in this list.) @cindex @code{*} (asterisk), @code{*} operator, as regexp operator @cindex asterisk (@code{*}), @code{*} operator, as regexp operator -@item * +@item @code{*} This symbol means that the preceding regular expression should be repeated as many times as necessary to find a match. For example, @samp{ph*} applies the @samp{*} symbol to the preceding @samp{h} and looks for matches @@ -4690,13 +5110,13 @@ prints every record in @file{sample} containing a string of the form Notice the escaping of the parentheses by preceding them with backslashes. -@cindex @code{+} (plus sign) -@cindex plus sign (@code{+}) -@item + +@cindex @code{+} (plus sign), regexp operator +@cindex plus sign (@code{+}), regexp operator +@item @code{+} This symbol is similar to @samp{*}, except that the preceding expression must be matched at least once. This means that @samp{wh+y} would match @samp{why} and @samp{whhy}, but not @samp{wy}, whereas -@samp{wh*y} would match all three of these strings. +@samp{wh*y} would match all three. The following is a simpler way of writing the last @samp{*} example: @@ -4704,17 +5124,17 @@ way of writing the last @samp{*} example: awk '/\(c[ad]+r x\)/ @{ print @}' sample @end example -@cindex @code{?} (question mark) regexp operator -@cindex question mark (@code{?}) regexp operator -@item ? +@cindex @code{?} (question mark), regexp operator +@cindex question mark (@code{?}), regexp operator +@item @code{?} This symbol is similar to @samp{*}, except that the preceding expression can be matched either once or not at all. 
For example, @samp{fe?d} matches @samp{fed} and @samp{fd}, but nothing else. -@cindex interval expressions -@item @{@var{n}@} -@itemx @{@var{n},@} -@itemx @{@var{n},@var{m}@} +@cindex interval expressions, regexp operator +@item @code{@{}@var{n}@code{@}} +@itemx @code{@{}@var{n}@code{,@}} +@itemx @code{@{}@var{n}@code{,}@var{m}@code{@}} One or two numbers inside braces denote an @dfn{interval expression}. If there is one number in the braces, the preceding regexp is repeated @var{n} times. @@ -4745,7 +5165,7 @@ constants, @command{gawk} did @emph{not} match interval expressions in regexps. -However, beginning with version 4.0, +However, beginning with @value{PVERSION} 4.0, @command{gawk} does match interval expressions by default. This is because compatibility with POSIX has become more important to most @command{gawk} users than compatibility with @@ -4788,6 +5208,7 @@ expressions are not available in regular expressions. @cindex bracket expressions @cindex bracket expressions, range expressions @cindex range expressions (regexps) +@cindex character lists in regular expression As mentioned earlier, a bracket expression matches any character amongst those listed between the opening and closing square brackets. @@ -4889,8 +5310,8 @@ These sequences are: @item Collating symbols Multicharacter collating elements enclosed between @samp{[.} and @samp{.]}. For example, if @samp{ch} is a collating element, -then @code{[[.ch.]]} is a regexp that matches this collating element, whereas -@code{[ch]} is a regexp that matches either @samp{c} or @samp{h}. +then @samp{[[.ch.]]} is a regexp that matches this collating element, whereas +@samp{[ch]} is a regexp that matches either @samp{c} or @samp{h}. @cindex bracket expressions, equivalence classes @item Equivalence classes @@ -4898,7 +5319,7 @@ Locale-specific names for a list of characters that are equal. The name is enclosed between @samp{[=} and @samp{=]}. 
For example, the name @samp{e} might be used to represent all of -``e,'' ``@`e,'' and ``@'e.'' In this case, @code{[[=e=]]} is a regexp +``e,'' ``@`e,'' and ``@'e.'' In this case, @samp{[[=e=]]} is a regexp that matches any of @samp{e}, @samp{@'e}, or @samp{@`e}. @end table @@ -4942,7 +5363,7 @@ or underscores (@samp{_}): @item \s Matches any whitespace character. Think of it as shorthand for -@w{@code{[[:space:]]}}. +@w{@samp{[[:space:]]}}. @c @cindex operators, @code{\S} (@command{gawk}) @cindex backslash (@code{\}), @code{\S} operator (@command{gawk}) @@ -4950,7 +5371,7 @@ Think of it as shorthand for @item \S Matches any character that is not whitespace. Think of it as shorthand for -@w{@code{[^[:space:]]}}. +@w{@samp{[^[:space:]]}}. @c @cindex operators, @code{\w} (@command{gawk}) @cindex backslash (@code{\}), @code{\w} operator (@command{gawk}) @@ -4958,7 +5379,7 @@ Think of it as shorthand for @item \w Matches any word-constituent character---that is, it matches any letter, digit, or underscore. Think of it as shorthand for -@w{@code{[[:alnum:]_]}}. +@w{@samp{[[:alnum:]_]}}. @c @cindex operators, @code{\W} (@command{gawk}) @cindex backslash (@code{\}), @code{\W} operator (@command{gawk}) @@ -4966,7 +5387,7 @@ letter, digit, or underscore. Think of it as shorthand for @item \W Matches any character that is not word-constituent. Think of it as shorthand for -@w{@code{[^[:alnum:]_]}}. +@w{@samp{[^[:alnum:]_]}}. @c @cindex operators, @code{\<} (@command{gawk}) @cindex backslash (@code{\}), @code{\<} operator (@command{gawk}) @@ -5027,10 +5448,10 @@ Matches the empty string at the end of a buffer (string). 
@end table -@cindex @code{^} (caret) -@cindex caret (@code{^}) -@cindex @code{?} (question mark) regexp operator -@cindex question mark (@code{?}) regexp operator +@cindex @code{^} (caret), regexp operator +@cindex caret (@code{^}), regexp operator +@cindex @code{?} (question mark), regexp operator +@cindex question mark (@code{?}), regexp operator Because @samp{^} and @samp{$} always work in terms of the beginning and end of strings, these operators don't add any new capabilities for @command{awk}. They are provided for compatibility with other @@ -5047,11 +5468,8 @@ GNU operators, but this was deemed too confusing. The current method of using @samp{\y} for the GNU @samp{\b} appears to be the lesser of two evils. -@c NOTE!!! Keep this in sync with the same table in the summary appendix! -@c -@c Should really do this with file inclusion. @cindex regular expressions, @command{gawk}, command-line options -@cindex @command{gawk}, command-line options +@cindex @command{gawk}, command-line options, and regular expressions The various command-line options (@pxref{Options}) control how @command{gawk} interprets characters in regexps: @@ -5065,8 +5483,10 @@ previously described GNU regexp operators. @end ifnotinfo @ifnottex +@ifnotdocbook GNU regexp operators described in @ref{Regexp Operators}. +@end ifnotdocbook @end ifnottex @item @code{--posix} @@ -5074,10 +5494,11 @@ Only POSIX regexps are supported; the GNU operators are not special (e.g., @samp{\w} matches a literal @samp{w}). Interval expressions are allowed. +@cindex Brian Kernighan's @command{awk} @item @code{--traditional} Traditional Unix @command{awk} regexps are matched. The GNU operators are not special, and interval expressions are not available. -The POSIX character classes (@code{[[:alnum:]]}, etc.) are supported, +The POSIX character classes (@samp{[[:alnum:]]}, etc.) are supported, as Brian Kernighan's @command{awk} does support them. 
Characters described by octal and hexadecimal escape sequences are treated literally, even if they represent regexp metacharacters. @@ -5129,16 +5550,18 @@ This works in any POSIX-compliant @command{awk}. @cindex tilde (@code{~}), @code{~} operator @cindex @code{!} (exclamation point), @code{!~} operator @cindex exclamation point (@code{!}), @code{!~} operator -@cindex @code{IGNORECASE} variable +@cindex @code{IGNORECASE} variable, with @code{~} and @code{!~} operators @cindex @command{gawk}, @code{IGNORECASE} variable in @c @cindex variables, @code{IGNORECASE} Another method, specific to @command{gawk}, is to set the variable @code{IGNORECASE} to a nonzero value (@pxref{Built-in Variables}). When @code{IGNORECASE} is not zero, @emph{all} regexp and string -operations ignore case. Changing the value of -@code{IGNORECASE} dynamically controls the case-sensitivity of the -program as it runs. Case is significant by default because -@code{IGNORECASE} (like most variables) is initialized to zero: +operations ignore case. + +Changing the value of @code{IGNORECASE} dynamically controls the +case-sensitivity of the program as it runs. Case is significant by +default because @code{IGNORECASE} (like most variables) is initialized +to zero: @example x = "aB" @@ -5168,9 +5591,6 @@ case-sensitivity on or off for all the rules at once. Setting @code{IGNORECASE} from the command line is a way to make a program case-insensitive without having to edit it. -Both regexp and string comparison -operations are affected by @code{IGNORECASE}. - @c @cindex ISO 8859-1 @c @cindex ISO Latin-1 In multibyte locales, @@ -5248,7 +5668,7 @@ regexp constant (i.e., a string of characters between slashes). It may be any expression. The expression is evaluated and converted to a string if necessary; the contents of the string are then used as the regexp. 
A regexp computed in this way is called a @dfn{dynamic -regexp}: +regexp} or a @dfn{computed regexp}: @example BEGIN @{ digits_regexp = "[[:digit:]]+" @} @@ -5294,7 +5714,7 @@ Given that you can use both regexp and string constants to describe regular expressions, which should you use? The answer is ``regexp constants,'' for several reasons: -@itemize @bullet +@itemize @value{BULLET} @item String constants are more complicated to write and more difficult to read. Using regexp constants makes your programs @@ -5317,7 +5737,7 @@ intend a regexp match. @cindex regular expressions, dynamic, with embedded newlines @cindex newlines, in dynamic regexps -Some commercial versions of @command{awk} do not allow the newline +Some versions of @command{awk} do not allow the newline character to be used inside a bracket expression for a dynamic regexp: @example @@ -5344,12 +5764,59 @@ occur often in practice, but it's worth noting for future reference. @end sidebar @c ENDOFRANGE dregexp @c ENDOFRANGE regexpd + +@node Regexp Summary +@section Summary + +@itemize @value{BULLET} +@item +Regular expressions describe sets of strings to be matched. +In @command{awk}, regular expression constants are written enclosed +between slashes: @code{/}@dots{}@code{/}. + +@item +Regexp constants may be used standalone in patterns and +in conditional expressions, or as part of matching expressions +using the @samp{~} and @samp{!~} operators. + +@item +Escape sequences let you represent non-printable characters and +also let you represent regexp metacharacters as literal characters +to be matched. + +@item +Regexp operators provide grouping, alternation and repetition. + +@item +Bracket expressions give you a shorthand for specifying sets +of characters that can match at a particular point in a regexp. +Within bracket expressions, POSIX character classes let you specify +certain groups of characters in a locale-independent fashion.
+ +@item +@command{gawk}'s @code{IGNORECASE} variable lets you control the +case sensitivity of regexp matching. In other @command{awk} +versions, use @code{tolower()} or @code{toupper()}. + +@item +Regular expressions match the leftmost longest text in the string being +matched. This matters for cases where you need to know the extent of +the match, such as for text substitution and when the record separator +is a regexp. + +@item +Matching expressions may use dynamic regexps; that is, string values +treated as regular expressions. + +@end itemize + @c ENDOFRANGE regexp @node Reading Files @chapter Reading Input Files @c STARTOFRANGE infir +@cindex reading input files @cindex input files, reading @cindex input files @cindex @code{FILENAME} variable @@ -5392,6 +5859,8 @@ used with it do not have to be named on the @command{awk} command line * Read Timeout:: Reading input with a timeout. * Command line directories:: What happens if you put a directory on the command line. +* Input Summary:: Input summary. +* Input Exercises:: Exercises. @end menu @node Records @@ -5411,9 +5880,17 @@ so far from the current input file. This value is stored in a built-in variable called @code{FNR}. It is reset to zero when a new file is started. Another built-in variable, @code{NR}, records the total -number of input records read so far from all data files. It starts at zero, +number of input records read so far from all @value{DF}s. It starts at zero, but is never automatically reset to zero. +@menu +* awk split records:: How standard @command{awk} splits records. +* gawk split records:: How @command{gawk} splits records. +@end menu + +@node awk split records +@subsection Record Splitting With Standard @command{awk} + @cindex separators, for records @cindex record separators Records are separated by a character called the @dfn{record separator}. @@ -5436,69 +5913,80 @@ To do this, use the special @code{BEGIN} pattern (@pxref{BEGIN/END}).
For example: -@cindex @code{BEGIN} pattern @example -awk 'BEGIN @{ RS = "/" @} - @{ print $0 @}' BBS-list +awk 'BEGIN @{ RS = "u" @} + @{ print $0 @}' mail-list @end example @noindent -changes the value of @code{RS} to @code{"/"}, before reading any input. -This is a string whose first character is a slash; as a result, records -are separated by slashes. Then the input file is read, and the second +changes the value of @code{RS} to @samp{u}, before reading any input. +This is a string whose first character is the letter ``u;'' as a result, records +are separated by the letter ``u.'' Then the input file is read, and the second rule in the @command{awk} program (the action with no pattern) prints each record. Because each @code{print} statement adds a newline at the end of its output, this @command{awk} program copies the input -with each slash changed to a newline. Here are the results of running -the program on @file{BBS-list}: - -@example -$ @kbd{awk 'BEGIN @{ RS = "/" @}} -> @kbd{@{ print $0 @}' BBS-list} -@print{} aardvark 555-5553 1200 -@print{} 300 B -@print{} alpo-net 555-3412 2400 -@print{} 1200 -@print{} 300 A -@print{} barfly 555-7685 1200 -@print{} 300 A -@print{} bites 555-1675 2400 -@print{} 1200 -@print{} 300 A -@print{} camelot 555-0542 300 C -@print{} core 555-2912 1200 -@print{} 300 C -@print{} fooey 555-1234 2400 -@print{} 1200 -@print{} 300 B -@print{} foot 555-6699 1200 -@print{} 300 B -@print{} macfoo 555-6480 1200 -@print{} 300 A -@print{} sdace 555-3430 2400 -@print{} 1200 -@print{} 300 A -@print{} sabafoo 555-2127 1200 -@print{} 300 C -@print{} +with each @samp{u} changed to a newline. 
Here are the results of running +the program on @file{mail-list}: + +@example +$ @kbd{awk 'BEGIN @{ RS = "u" @}} +> @kbd{@{ print $0 @}' mail-list} +@print{} Amelia 555-5553 amelia.zodiac +@print{} sq +@print{} e@@gmail.com F +@print{} Anthony 555-3412 anthony.assert +@print{} ro@@hotmail.com A +@print{} Becky 555-7685 becky.algebrar +@print{} m@@gmail.com A +@print{} Bill 555-1675 bill.drowning@@hotmail.com A +@print{} Broderick 555-0542 broderick.aliq +@print{} otiens@@yahoo.com R +@print{} Camilla 555-2912 camilla.inf +@print{} sar +@print{} m@@skynet.be R +@print{} Fabi +@print{} s 555-1234 fabi +@print{} s. +@print{} ndevicesim +@print{} s@@ +@print{} cb.ed +@print{} F +@print{} J +@print{} lie 555-6699 j +@print{} lie.perscr +@print{} tabor@@skeeve.com F +@print{} Martin 555-6480 martin.codicib +@print{} s@@hotmail.com A +@print{} Sam +@print{} el 555-3430 sam +@print{} el.lanceolis@@sh +@print{} .ed +@print{} A +@print{} Jean-Pa +@print{} l 555-2127 jeanpa +@print{} l.campanor +@print{} m@@ny +@print{} .ed +@print{} R +@print{} @end example @noindent -Note that the entry for the @samp{camelot} BBS is not split. -In the original data file +Note that the entry for the name @samp{Bill} is not split. +In the original @value{DF} (@pxref{Sample Data Files}), the line looks like this: @example -camelot 555-0542 300 C +Bill 555-1675 bill.drowning@@hotmail.com A @end example @noindent -It has one baud rate only, so there are no slashes in the record, -unlike the others which have two or more baud rates. -In fact, this record is treated as part of the record -for the @samp{core} BBS; the newline separating them in the output -is the original newline in the data file, not the one added by +It contains no @samp{u} so there is no reason to split the record, +unlike the others which have one or more occurrences of the @samp{u}. 
+In fact, this record is treated as part of the previous record; +the newline separating them in the output +is the original newline in the @value{DF}, not the one added by @command{awk} when it printed the record! @cindex record separators, changing @@ -5508,14 +5996,17 @@ using the variable-assignment feature (@pxref{Other Arguments}): @example -awk '@{ print $0 @}' RS="/" BBS-list +awk '@{ print $0 @}' RS="u" mail-list @end example @noindent -This sets @code{RS} to @samp{/} before processing @file{BBS-list}. +This sets @code{RS} to @samp{u} before processing @file{mail-list}. -Using an unusual character such as @samp{/} for the record separator -produces correct behavior in the vast majority of cases. +Using an alphabetic character such as @samp{u} for the record separator +is highly likely to produce strange results. +Using an unusual character such as @samp{/} is more likely to +produce correct behavior in the majority of cases, but there +are no guarantees. The moral is: Know Your Data. There is one unusual case, that occurs when @command{gawk} is being fully POSIX-compliant (@pxref{Options}). @@ -5537,6 +6028,7 @@ Reaching the end of an input file terminates the current input record, even if the last character in the file is not the character in @code{RS}. @value{DARKCORNER} +@cindex empty strings @cindex null strings @cindex strings, empty, See null strings The empty string @code{""} (a string without any characters) @@ -5562,6 +6054,9 @@ After the end of the record has been determined, @command{gawk} sets the variable @code{RT} to the text in the input that matched @code{RS}. 
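The record-splitting rules described above can be tried out directly. The following is a minimal sketch that uses only standard @command{awk} features; the three-record sample input is invented for illustration. It sets @code{RS} to a slash and supplies input whose last record is not followed by a separator, so reaching the end of the input is what terminates that record:

```shell
# Sample input (made up): three records separated by "/",
# with no separator and no newline after the last one.
records=$(printf 'a/b/c' | awk 'BEGIN { RS = "/" } { print NR ": " $0 }')

# Show each record together with its record number.
printf '%s\n' "$records"
```

Each output line carries the value of @code{NR} for that record, which makes it easy to see where @command{awk} decided one record ended and the next began.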
+@node gawk split records +@subsection Record Splitting With @command{gawk} + @cindex common extensions, @code{RS} as a regexp @cindex extensions, common@comma{} @code{RS} as a regexp When using @command{gawk}, @@ -5635,12 +6130,11 @@ In compatibility mode, only the first character of the value of @sidebar @code{RS = "\0"} Is Not Portable @cindex portability, data files as single record -There are times when you might want to treat an entire data file as a +There are times when you might want to treat an entire @value{DF} as a single record. The only way to make this happen is to give @code{RS} a value that you know doesn't occur in the input file. This is hard to do in a general way, such that a program always works for arbitrary input files. -@c can you say `understatement' boys and girls? You might think that for text files, the @sc{nul} character, which consists of a character with all bits equal to zero, is a good @@ -5653,21 +6147,27 @@ BEGIN @{ RS = "\0" @} # whole file becomes one record? @cindex differences in @command{awk} and @command{gawk}, strings, storing @command{gawk} in fact accepts this, and uses the @sc{nul} character for the record separator. +This works for certain special files, such as @file{/proc/environ} on +GNU/Linux systems, where the @sc{nul} character is in fact the record separator. However, this usage is @emph{not} portable -to other @command{awk} implementations. +to most other @command{awk} implementations. @cindex dark corner, strings, storing -All other @command{awk} implementations@footnote{At least that we know +Almost all other @command{awk} implementations@footnote{At least that we know about.} store strings internally as C-style strings. C strings use the @sc{nul} character as the string terminator. In effect, this means that @samp{RS = "\0"} is the same as @samp{RS = ""}. @value{DARKCORNER} +It happens that recent versions of @command{mawk} can use the @sc{nul} +character as a record separator. 
However, this is a special case: +@command{mawk} does not allow embedded @sc{nul} characters in strings. + @cindex records, treating files as -@cindex files, as single records -The best way to treat a whole file as a single record is to -simply read the file in, one record at a time, concatenating each -record onto the end of the previous ones. +@cindex treating files, as single records +@xref{Readfile Function}, for an interesting, portable way to read +whole files. If you are using @command{gawk}, see @ref{Extension Sample +Readfile}, for another option. @end sidebar @c ENDOFRANGE inspl @c ENDOFRANGE recspl @@ -5703,7 +6203,7 @@ simple @command{awk} programs so powerful. @cindex @code{$} (dollar sign), @code{$} field operator @cindex dollar sign (@code{$}), @code{$} field operator @cindex field operators@comma{} dollar sign as -A dollar-sign (@samp{$}) is used +You use a dollar-sign (@samp{$}) to refer to a field in an @command{awk} program, followed by the number of the field you want. Thus, @code{$1} refers to the first field, @code{$2} to the second, and so on. @@ -5734,36 +6234,34 @@ one (such as @code{$8} when the record has only seven fields), you get the empty string. (If used in a numeric operation, you get zero.) The use of @code{$0}, which looks like a reference to the ``zero-th'' field, is -a special case: it represents the whole input record +a special case: it represents the whole input record. Use it when you are not interested in specific fields. 
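As a small illustration of the field references described above, the following sketch runs a few one-line programs against a made-up record of exactly seven fields. The sample sentence and the shell variable names are assumptions chosen for the demonstration:

```shell
# Made-up sample record with exactly seven whitespace-separated fields.
line='This seems like a pretty nice example.'

# $1 is the first field.
first=$(printf '%s\n' "$line" | awk '{ print $1 }')
# $0 is the whole record, regardless of how it splits into fields.
whole=$(printf '%s\n' "$line" | awk '{ print $0 }')
# $8 is past the last field, so it is the empty string.
missing=$(printf '%s\n' "$line" | awk '{ print "[" $8 "]" }')
# The same nonexistent field used in arithmetic acts as zero.
as_number=$(printf '%s\n' "$line" | awk '{ print $8 + 0 }')

printf '%s\n' "$first" "$whole" "$missing" "$as_number"
```

The square brackets around @code{$8} are there only to make the empty string visible in the output.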
Here are some more examples: @example -$ @kbd{awk '$1 ~ /foo/ @{ print $0 @}' BBS-list} -@print{} fooey 555-1234 2400/1200/300 B -@print{} foot 555-6699 1200/300 B -@print{} macfoo 555-6480 1200/300 A -@print{} sabafoo 555-2127 1200/300 C +$ @kbd{awk '$1 ~ /li/ @{ print $0 @}' mail-list} +@print{} Amelia 555-5553 amelia.zodiacusque@@gmail.com F +@print{} Julie 555-6699 julie.perscrutabor@@skeeve.com F @end example @noindent -This example prints each record in the file @file{BBS-list} whose first -field contains the string @samp{foo}. The operator @samp{~} is called a +This example prints each record in the file @file{mail-list} whose first +field contains the string @samp{li}. The operator @samp{~} is called a @dfn{matching operator} (@pxref{Regexp Usage}); it tests whether a string (here, the field @code{$1}) matches a given regular expression. By contrast, the following example -looks for @samp{foo} in @emph{the entire record} and prints the first +looks for @samp{li} in @emph{the entire record} and prints the first field and the last field for each matching input record: @example -$ @kbd{awk '/foo/ @{ print $1, $NF @}' BBS-list} -@print{} fooey B -@print{} foot B -@print{} macfoo A -@print{} sabafoo C +$ @kbd{awk '/li/ @{ print $1, $NF @}' mail-list} +@print{} Amelia F +@print{} Broderick R +@print{} Julie F +@print{} Samuel A @end example @c ENDOFRANGE fiex @@ -5772,7 +6270,7 @@ $ @kbd{awk '/foo/ @{ print $1, $NF @}' BBS-list} @cindex fields, numbers @cindex field numbers -The number of a field does not need to be a constant. Any expression in +A field number need not be a constant. Any expression in the @command{awk} language can be used after a @samp{$} to refer to a field. The value of the expression specifies the field number. If the value is a string, rather than a number, it is converted to a number. @@ -5791,7 +6289,7 @@ the record has fewer than 20 fields, so this prints a blank line. 
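The rule that any expression may follow the @samp{$} can be sketched as follows. The five-field input and the variable names @code{n} and @code{idx} are invented for this illustration:

```shell
picked=$(echo 'a b c d e' | awk '{
    n = 2          # a variable holding a field number
    idx = "5"      # a string value is converted to a number after $
    # Fields 2, 3, 5, and 4 (NF is 5 here, so NF - 1 is 4).
    print $n, $(n + 1), $idx, $(NF - 1)
}')

printf '%s\n' "$picked"
```

Here @code{$n} and @code{$idx} use a plain variable as the field number, while the other two references compute the field number with arithmetic inside parentheses.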
Here is another example of using expressions as field numbers: @example -awk '@{ print $(2*2) @}' BBS-list +awk '@{ print $(2*2) @}' mail-list @end example @command{awk} evaluates the expression @samp{(2*2)} and uses @@ -5799,9 +6297,13 @@ its value as the number of the field to print. The @samp{*} sign represents multiplication, so the expression @samp{2*2} evaluates to four. The parentheses are used so that the multiplication is done before the @samp{$} operation; they are necessary whenever there is a binary -operator in the field-number expression. This example, then, prints the -hours of operation (the fourth field) for every line of the file -@file{BBS-list}. (All of the @command{awk} operators are listed, in +operator@footnote{A @dfn{binary operator}, such as @samp{*} for +multiplication, is one that takes two operands. The distinction +is required, since @command{awk} also has unary (one-operand) +and ternary (three-operand) operators.} +in the field-number expression. This example, then, prints the +type of relationship (the fourth field) for every line of the file +@file{mail-list}. (All of the @command{awk} operators are listed, in order of decreasing precedence, in @ref{Precedence}.) @@ -5849,7 +6351,7 @@ Then it prints the original and new values for field three. (Someone in the warehouse made a consistent mistake while inventorying the red boxes.) -For this to work, the text in field @code{$3} must make sense +For this to work, the text in @code{$3} must make sense as a number; the string of characters must be converted to a number for the computer to do arithmetic on it. The number resulting from the subtraction is converted back to a string of characters that @@ -5940,7 +6442,7 @@ $ @kbd{echo a b c d | awk '@{ OFS = ":"; $2 = ""} @end example @noindent -The field is still there; it just has an empty value, denoted by +The field is still there; it just has an empty value, delimited by the two colons between @samp{a} and @samp{c}. 
This example shows what happens if you create a new field: @@ -6024,6 +6526,7 @@ with a statement such as @samp{$1 = $1}, as described earlier. * Regexp Field Splitting:: Using regexps as the field separator. * Single Character Fields:: Making each character a separate field. * Command Line Field Separator:: Setting @code{FS} from the command-line. +* Full Line Fields:: Making the full line be a single field. * Field Splitting Summary:: Some final points and a summary table. @end menu @@ -6191,7 +6694,7 @@ $ @kbd{echo ' a b c d ' | awk 'BEGIN @{ FS = "[ \t\n]+" @}} @cindex null strings @cindex strings, null @cindex empty strings, See null strings -In this case, the first field is @dfn{null} or empty. +In this case, the first field is null, or empty. The stripping of leading and trailing whitespace also comes into play whenever @code{$0} is recomputed. For instance, study this pipeline: @@ -6211,7 +6714,7 @@ was ignored when finding @code{$1}, it is not part of the new @code{$0}. Finally, the last @code{print} statement prints the new @code{$0}. @cindex @code{FS}, containing @code{^} -@cindex @code{^}, in @code{FS} +@cindex @code{^} (caret), in @code{FS} @cindex dark corner, @code{^}, in @code{FS} There is an additional subtlety to be aware of when using regular expressions for field splitting. @@ -6222,6 +6725,7 @@ different @command{awk} versions answer this question differently, and you should not rely on any specific behavior in your programs. @value{DARKCORNER} +@cindex Brian Kernighan's @command{awk} As a point of information, Brian Kernighan's @command{awk} allows @samp{^} to match only at the beginning of the record. @command{gawk} also works this way. For example: @@ -6265,7 +6769,7 @@ $ @kbd{echo a b | gawk 'BEGIN @{ FS = "" @}} @end example @cindex dark corner, @code{FS} as null string -@cindex FS variable, as null string +@cindex @code{FS} variable, as null string Traditionally, the behavior of @code{FS} equal to @code{""} was not defined. 
In this case, most versions of Unix @command{awk} simply treat the entire record as only having one field. @@ -6277,10 +6781,8 @@ behaves this way. @node Command Line Field Separator @subsection Setting @code{FS} from the Command Line -@cindex @code{-F} option -@cindex options, command-line -@cindex command line, options -@cindex field separators, on command line +@cindex @option{-F} option, command line +@cindex field separator, on command line @cindex command line, @code{FS} on@comma{} setting @cindex @code{FS} variable, setting from command line @@ -6330,68 +6832,75 @@ figures that you really want your fields to be separated with TABs and not @samp{t}s. Use @samp{-v FS="t"} or @samp{-F"[t]"} on the command line if you really do want to separate your fields with @samp{t}s. -As an example, let's use an @command{awk} program file called @file{baud.awk} -that contains the pattern @code{/300/} and the action @samp{print $1}: +As an example, let's use an @command{awk} program file called @file{edu.awk} +that contains the pattern @code{/edu/} and the action @samp{print $1}: @example -/300/ @{ print $1 @} +/edu/ @{ print $1 @} @end example Let's also set @code{FS} to be the @samp{-} character and run the -program on the file @file{BBS-list}. The following command prints a -list of the names of the bulletin boards that operate at 300 baud and +program on the file @file{mail-list}. 
The following command prints a +list of the names of the people that work at or attend a university, and the first three digits of their phone numbers: -@c tweaked to make the tex output look better in @smallbook @example -$ @kbd{awk -F- -f baud.awk BBS-list} -@print{} aardvark 555 -@print{} alpo -@print{} barfly 555 -@print{} bites 555 -@print{} camelot 555 -@print{} core 555 -@print{} fooey 555 -@print{} foot 555 -@print{} macfoo 555 -@print{} sdace 555 -@print{} sabafoo 555 +$ @kbd{awk -F- -f edu.awk mail-list} +@print{} Fabius 555 +@print{} Samuel 555 +@print{} Jean @end example @noindent -Note the second line of output. The second line +Note the third line of output. The third line in the original file looked like this: @example -alpo-net 555-3412 2400/1200/300 A +Jean-Paul 555-2127 jeanpaul.campanorum@@nyu.edu R @end example -The @samp{-} as part of the system's name was used as the field +The @samp{-} as part of the person's name was used as the field separator, instead of the @samp{-} in the phone number that was originally intended. This demonstrates why you have to be careful in choosing your field and record separators. @cindex Unix @command{awk}, password files@comma{} field separators and -Perhaps the most common use of a single character as the field -separator occurs when processing the Unix system password file. -On many Unix systems, each user has a separate entry in the system password -file, one line per user. The information in these lines is separated -by colons. The first field is the user's login name and the second is -the user's (encrypted or shadow) password. A password file entry might look -like this: +Perhaps the most common use of a single character as the field separator +occurs when processing the Unix system password file. On many Unix +systems, each user has a separate entry in the system password file, one +line per user. The information in these lines is separated by colons. 
+The first field is the user's login name and the second is the user's +encrypted or shadow password. (A shadow password is indicated by the +presence of a single @samp{x} in the second field.) A password file +entry might look like this: @cindex Robbins, Arnold @example -arnold:xyzzy:2076:10:Arnold Robbins:/home/arnold:/bin/bash +arnold:x:2076:10:Arnold Robbins:/home/arnold:/bin/bash @end example The following program searches the system password file and prints -the entries for users who have no password: +the entries for users whose full name is not indicated: @example -awk -F: '$2 == ""' /etc/passwd +awk -F: '$5 == ""' /etc/passwd @end example +@node Full Line Fields +@subsection Making The Full Line Be A Single Field + +Occasionally, it's useful to treat the whole input line as a +single field. This can be done easily and portably simply by +setting @code{FS} to @code{"\n"} (a newline).@footnote{Thanks to +Andrew Schorr for this tip.} + +@example +awk -F'\n' '@var{program}' @var{files @dots{}} +@end example + +@noindent +When you do this, @code{$1} is the same as @code{$0}. + @node Field Splitting Summary @subsection Field-Splitting Summary @@ -6432,7 +6941,7 @@ POSIX standard.) @sidebar Changing @code{FS} Does Not Affect the Fields @cindex POSIX @command{awk}, field separators and -@cindex field separators, POSIX and +@cindex field separator, POSIX and According to the POSIX standard, @command{awk} is supposed to behave as if each record is split into fields at the time it is read. In particular, this means that if you change the value of @code{FS} @@ -6502,19 +7011,11 @@ will take effect. @node Constant Size @section Reading Fixed-Width Data -@ifnotinfo @quotation NOTE This @value{SECTION} discusses an advanced feature of @command{gawk}. If you are a novice @command{awk} user, you might want to skip it on the first reading. @end quotation -@end ifnotinfo - -@ifinfo -(This @value{SECTION} discusses an advanced feature of @command{awk}. 
-If you are a novice @command{awk} user, you might want to skip it on -the first reading.) -@end ifinfo @cindex data, fixed-width @cindex fixed-width data @@ -6612,10 +7113,6 @@ program for processing such data could use the @code{FIELDWIDTHS} feature to simplify reading the data. (Of course, getting @command{gawk} to run on a system with card readers is another story!) -@ignore -Exercise: Write a ballot card reading program -@end ignore - @cindex @command{gawk}, splitting fields and Assigning a value to @code{FS} causes @command{gawk} to use @code{FS} for field splitting again. Use @samp{FS = FS} to make this happen, @@ -6632,7 +7129,7 @@ if (PROCINFO["FS"] == "FS") else if (PROCINFO["FS"] == "FIELDWIDTHS") @var{fixed-width field splitting} @dots{} else - @var{content-based field splitting} @dots{} (see next @value{SECTION}) + @var{content-based field splitting} @dots{} @ii{(see next @value{SECTION})} @end example This information is useful when writing a function @@ -6644,19 +7141,11 @@ for an example of such a function). @node Splitting By Content @section Defining Fields By Content -@ifnotinfo @quotation NOTE This @value{SECTION} discusses an advanced feature of @command{gawk}. If you are a novice @command{awk} user, you might want to skip it on the first reading. @end quotation -@end ifnotinfo - -@ifinfo -(This @value{SECTION} discusses an advanced feature of @command{awk}. -If you are a novice @command{awk} user, you might want to skip it on -the first reading.) -@end ifinfo @cindex advanced features, specifying field content Normally, when using @code{FS}, @command{gawk} defines the fields as the @@ -6754,7 +7243,7 @@ the double quotes. @command{gawk} provides no way to deal with this. Since there is no formal specification for CSV data, there isn't much more to be done; the @code{FPAT} mechanism provides an elegant solution for the majority -of cases, and the @command{gawk} maintainer is satisfied with that. 
+of cases, and the @command{gawk} developers are satisfied with that. @end quotation As written, the regexp used for @code{FPAT} requires that each field @@ -6771,6 +7260,7 @@ available for splitting regular strings (@pxref{String Functions}). @node Multiple Line @section Multiple-Line Records +@cindex multiple-line records @c STARTOFRANGE recm @cindex records, multiline @c STARTOFRANGE imr @@ -6815,14 +7305,15 @@ the first nonblank line that follows---no matter how many blank lines appear in a row, they are considered one record separator. @cindex dark corner, multiline records -There is an important difference between @samp{RS = ""} and +However, there is an important difference between @samp{RS = ""} and @samp{RS = "\n\n+"}. In the first case, leading newlines in the input -data file are ignored, and if a file ends without extra blank lines +@value{DF} are ignored, and if a file ends without extra blank lines after the last record, the final newline is removed from the record. In the second case, this special processing is not done. @value{DARKCORNER} -@cindex field separators, in multiline records +@cindex field separator, in multiline records +@cindex @code{FS}, in multiline records Now that the input is separated into records, the second step is to separate the fields in the record. One way to do this is to divide each of the lines into fields in the normal manner. This happens by default @@ -6852,7 +7343,7 @@ Another way to separate fields is to put each field on a separate line: to do this, just set the variable @code{FS} to the string @code{"\n"}. (This single character separator matches a single newline.) -A practical example of a data file organized this way might be a mailing +A practical example of a @value{DF} organized this way might be a mailing list, where each entry is separated by blank lines. 
Consider a mailing list in a file named @file{addresses}, which looks like this: @@ -6917,7 +7408,7 @@ value of @table @code @item RS == "\n" Records are separated by the newline character (@samp{\n}). In effect, -every line in the data file is a separate record, including blank lines. +every line in the @value{DF} is a separate record, including blank lines. This is the default. @item RS == @var{any single character} @@ -6953,9 +7444,10 @@ then @command{gawk} sets @code{RT} to the null string. @c STARTOFRANGE getl @cindex @code{getline} command, explicit input with +@c STARTOFRANGE inex @cindex input, explicit So far we have been getting our input data from @command{awk}'s main -input stream---either the standard input (usually your terminal, sometimes +input stream---either the standard input (usually your keyboard, sometimes the output from another program) or from the files specified on the command line. The @command{awk} language has a special built-in command called @code{getline} that @@ -6966,13 +7458,25 @@ The @code{getline} command is used in several different ways and should The examples that follow the explanation of the @code{getline} command include material that has not been covered yet. Therefore, come back and study the @code{getline} command @emph{after} you have reviewed the -rest of this @value{DOCUMENT} and have a good knowledge of how @command{awk} works. +rest of +@ifinfo +this @value{DOCUMENT} +@end ifinfo +@ifhtml +this @value{DOCUMENT} +@end ifhtml +@ifnotinfo +@ifnothtml +Parts I and II +@end ifnothtml +@end ifnotinfo +and have a good knowledge of how @command{awk} works. 
@cindex @command{gawk}, @code{ERRNO} variable in -@cindex @code{ERRNO} variable +@cindex @code{ERRNO} variable, with @command{getline} command @cindex differences in @command{awk} and @command{gawk}, @code{getline} command @cindex @code{getline} command, return values -@cindex @code{--sandbox} option, input redirection with @command{getline} +@cindex @option{--sandbox} option, input redirection with @code{getline} The @code{getline} command returns one if it finds a record and zero if it encounters the end of the file. If there is some error in getting @@ -7045,11 +7549,6 @@ decommented input, such as searching for matches of a regular expression. (This program has a subtle problem---it does not work if one comment ends and another begins on the same line.) -@ignore -Exercise, -write a program that does handle multiple comments on the line. -@end ignore - This form of the @code{getline} command sets @code{NF}, @code{NR}, @code{FNR}, @code{RT}, and the value of @code{$0}. @@ -7065,6 +7564,7 @@ rule in the program. @xref{Next Statement}. @node Getline/Variable @subsection Using @code{getline} into a Variable +@cindex @code{getline} into a variable @cindex variables, @code{getline} command into@comma{} using You can use @samp{getline @var{var}} to read the next record from @@ -7116,6 +7616,7 @@ the value of @code{NF} do not change. @node Getline/File @subsection Using @code{getline} from a File +@cindex @code{getline} from a file @cindex input redirection @cindex redirection of input @cindex @code{<} (left angle bracket), @code{<} operator (I/O) @@ -7123,7 +7624,7 @@ the value of @code{NF} do not change. @cindex operators, input/output Use @samp{getline < @var{file}} to read the next record from @var{file}. Here @var{file} is a string-valued expression that -specifies the file name. @samp{< @var{file}} is called a @dfn{redirection} +specifies the @value{FN}. @samp{< @var{file}} is called a @dfn{redirection} because it directs input to come from a different place. 
For example, the following program reads its input record from the file @file{secondary.input} when it @@ -7151,9 +7652,9 @@ changed, resulting in a new value of @code{NF}. According to POSIX, @samp{getline < @var{expression}} is ambiguous if @var{expression} contains unparenthesized operators other than @samp{$}; for example, @samp{getline < dir "/" file} is ambiguous -because the concatenation operator is not parenthesized. You should -write it as @samp{getline < (dir "/" file)} if you want your program -to be portable to all @command{awk} implementations. +because the concatenation operator (not discussed yet; @pxref{Concatenation}) +is not parenthesized. You should write it as @samp{getline < (dir "/" file)} if +you want your program to be portable to all @command{awk} implementations. @node Getline/Variable/File @subsection Using @code{getline} into a Variable from a File @@ -7164,8 +7665,6 @@ from the file @var{file}, and put it in the variable @var{var}. As above, @var{file} is a string-valued expression that specifies the file from which to read. -@cindex @command{gawk}, @code{RT} variable in -@cindex @code{RT} variable In this version of @code{getline}, none of the built-in variables are changed and the record is not split into fields. The only variable changed is @var{var}.@footnote{This is not quite true. @code{RT} could @@ -7188,25 +7687,25 @@ Such a record is replaced by the contents of the file Note here how the name of the extra input file is not built into the program; it is taken directly from the data, specifically from the second field on -the @samp{@@include} line. +the @code{@@include} line. -@cindex @code{close()} function The @code{close()} function is called to ensure that if two identical -@samp{@@include} lines appear in the input, the entire specified file is +@code{@@include} lines appear in the input, the entire specified file is included twice. @xref{Close Files And Pipes}. 
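The effect of calling @code{close()} can be seen in a self-contained sketch along the following lines. The file names @file{main.txt} and @file{extra.txt} are hypothetical, and the program is an illustrative reconstruction of the technique, not necessarily identical to the example in the manual. The main input names the same file on two @code{@@include} lines, and because the file is closed after each use, its contents appear twice:

```shell
# Work in a scratch directory so the demonstration leaves nothing behind.
tmpdir=$(mktemp -d)
printf 'included line 1\nincluded line 2\n' > "$tmpdir/extra.txt"
printf 'before\n@include extra.txt\n@include extra.txt\nafter\n' > "$tmpdir/main.txt"

expanded=$(cd "$tmpdir" && awk '
    NF == 2 && $1 == "@include" {
        # Read the named file one record at a time into the variable "line";
        # the current record and its fields are left untouched.
        while ((getline line < $2) > 0)
            print line
        # Close the file so that a repeated @include rereads it from the start.
        close($2)
        next
    }
    { print }
' main.txt)

printf '%s\n' "$expanded"
rm -rf "$tmpdir"
```

Without the call to @code{close()}, the second @code{@@include} line would find the file already open and positioned at end-of-file, so nothing would be printed for it.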
One deficiency of this program is that it does not process nested -@samp{@@include} statements -(i.e., @samp{@@include} statements in included files) +@code{@@include} statements +(i.e., @code{@@include} statements in included files) the way a true macro preprocessor would. @xref{Igawk Program}, for a program -that does handle nested @samp{@@include} statements. +that does handle nested @code{@@include} statements. @node Getline/Pipe @subsection Using @code{getline} from a Pipe @c From private email, dated October 2, 1988. Used by permission, March 2013. +@cindex Kernighan, Brian @quotation @i{Omniscience has much to recommend it. Failing that, attention to details would be useful.} @@ -7216,7 +7715,7 @@ Failing that, attention to details would be useful.} @cindex @code{|} (vertical bar), @code{|} operator (I/O) @cindex vertical bar (@code{|}), @code{|} operator (I/O) @cindex input pipeline -@cindex pipes, input +@cindex pipe, input @cindex operators, input/output The output of a command can also be piped into @code{getline}, using @samp{@var{command} | getline}. In @@ -7240,14 +7739,14 @@ produced by running the rest of the line as a shell command: @end example @noindent -@cindex @code{close()} function The @code{close()} function is called to ensure that if two identical @samp{@@execute} lines appear in the input, the command is run for each one. @ifnottex +@ifnotdocbook @xref{Close Files And Pipes}. +@end ifnotdocbook @end ifnottex -@c Exercise!! @c This example is unrealistic, since you could just use system Given the input: @@ -7294,6 +7793,8 @@ because the concatenation operator is not parenthesized. You should write it as @samp{(@w{"echo "} "date") | getline} if you want your program to be portable to all @command{awk} implementations. +@cindex Brian Kernighan's @command{awk} +@cindex @command{mawk} utility @quotation NOTE Unfortunately, @command{gawk} has not been consistent in its treatment of a construct like @samp{@w{"echo "} "date" | getline}. 
@@ -7405,7 +7906,7 @@ where coprocesses are discussed in more detail. Here are some miscellaneous points about @code{getline} that you should bear in mind: -@itemize @bullet +@itemize @value{BULLET} @item When @code{getline} changes the value of @code{$0} and @code{NF}, @command{awk} does @emph{not} automatically jump to the start of the @@ -7417,7 +7918,7 @@ However, the new record is tested against any subsequent rules. @cindex @command{awk}, implementations, limits @cindex @command{gawk}, implementation issues, limits @item -Many @command{awk} implementations limit the number of pipelines that an @command{awk} +Some very old @command{awk} implementations limit the number of pipelines that an @command{awk} program may have open to just one. In @command{gawk}, there is no such limit. You can open as many pipelines (and coprocesses) as the underlying operating system permits. @@ -7430,10 +7931,10 @@ system permits. @item An interesting side effect occurs if you use @code{getline} without a redirection inside a @code{BEGIN} rule. Because an unredirected @code{getline} -reads from the command-line data files, the first @code{getline} command +reads from the command-line @value{DF}s, the first @code{getline} command causes @command{awk} to set the value of @code{FILENAME}. Normally, @code{FILENAME} does not have a value inside @code{BEGIN} rules, because you -have not yet started to process the command-line data files. +have not yet started to process the command-line @value{DF}s. @value{DARKCORNER} (@xref{BEGIN/END}, also @pxref{Auto-set}.) @@ -7456,6 +7957,7 @@ can cause @code{FILENAME} to be updated if they cause @command{awk} to start reading a new input file. @item +@cindex Moore, Duncan If the variable being assigned is an expression with side effects, different versions of @command{awk} behave differently upon encountering end-of-file. 
Some versions don't evaluate the expression; many versions @@ -7480,7 +7982,7 @@ end of file is encountered, before the element in @code{a} is assigned? @command{gawk} treats @code{getline} like a function call, and evaluates the expression @samp{a[++c]} before attempting to read from @file{f}. -Other versions of @command{awk} only evaluate the expression once they +However, some versions of @command{awk} only evaluate the expression once they know that there is a string value to be assigned. Caveat Emptor. @end itemize @@ -7516,10 +8018,13 @@ Note: for each variant, @command{gawk} sets the @code{RT} built-in variable. @section Reading Input With A Timeout @cindex timeout, reading input -You may specify a timeout in milliseconds for reading input from a terminal, -pipe or two-way communication including, TCP/IP sockets. This can be done +@cindex differences in @command{awk} and @command{gawk}, read timeouts +This @value{SECTION} describes a feature that is specific to @command{gawk}. + +You may specify a timeout in milliseconds for reading input from the keyboard, +a pipe, or two-way communication, including TCP/IP sockets. 
This can be done on a per input, command or connection basis, by setting a special element -in the @code{PROCINFO} array: +in the @code{PROCINFO} (@pxref{Auto-set}) array: @example PROCINFO["input_name", "READ_TIMEOUT"] = @var{timeout in milliseconds} @@ -7539,8 +8044,8 @@ else if (ERRNO != "") print ERRNO @end example -Here is how to read interactively from the terminal@footnote{This assumes -that standard input is the keyboard} without waiting +Here is how to read interactively from the user@footnote{This assumes +that standard input is the keyboard.} without waiting for more than five seconds: @example @@ -7549,13 +8054,13 @@ while ((getline < "/dev/stdin") > 0) print $0 @end example -@command{gawk} will terminate the read operation if input does not -arrive after waiting for the timeout period, return failure -and set the @code{ERRNO} variable to an appropriate string value. +@command{gawk} terminates the read operation if input does not +arrive after waiting for the timeout period, returns failure +and sets the @code{ERRNO} variable to an appropriate string value. A negative or zero value for the timeout is the same as specifying no timeout at all. -A timeout can also be set for reading from the terminal in the implicit +A timeout can also be set for reading from the keyboard in the implicit loop that reads input records and matches them against patterns, like so: @@ -7618,19 +8123,120 @@ indefinitely until some other process opens it for writing. @node Command line directories @section Directories On The Command Line +@cindex differences in @command{awk} and @command{gawk}, command line directories @cindex directories, command line @cindex command line, directories on According to the POSIX standard, files named on the @command{awk} -command line must be text files. It is a fatal error if they are not. +command line must be text files; it is a fatal error if they are not. Most versions of @command{awk} treat a directory on the command line as a fatal error. 
By default, @command{gawk} produces a warning for a directory on the -command line, but otherwise ignores it. If either of the @option{--posix} +command line, but otherwise ignores it. This makes it easier to use +shell wildcards with your @command{awk} program: + +@example +$ @kbd{gawk -f whizprog.awk *} @ii{Directories could kill this program} +@end example + +If either of the @option{--posix} or @option{--traditional} options is given, then @command{gawk} reverts to treating a directory on the command line as a fatal error. +@xref{Extension Sample Readdir}, for a way to treat directories +as usable data from an @command{awk} program. + +@node Input Summary +@section Summary + +@itemize @value{BULLET} +@item +Input is split into records based on the value of @code{RS}. +The possibilities are as follows: + +@multitable @columnfractions .25 .35 .40 +@headitem Value of @code{RS} @tab Records are split on @tab @command{awk} / @command{gawk} +@item Any single character @tab That character @tab @command{awk} +@item The empty string (@code{""}) @tab Runs of two or more newlines @tab @command{awk} +@item A regexp @tab Text that matches the regexp @tab @command{gawk} +@end multitable + +@item +@command{gawk} sets @code{RT} to the text matched by @code{RS}. + +@item +After splitting the input into records, @command{awk} further splits +the record into individual fields, named @code{$1}, @code{$2} and so +on. @code{$0} is the whole record, and @code{NF} indicates how many +fields there are. The default way to split fields is between whitespace +characters. + +@item +Fields may be referenced using a variable, as in @samp{$NF}. Fields +may also be assigned values, which causes the value of @code{$0} to be +recomputed when it is later referenced. Assigning to a field with a number +greater than @code{NF} creates the field and rebuilds the record, using +@code{OFS} to separate the fields. Incrementing @code{NF} does the same +thing. 
Decrementing @code{NF} throws away fields and rebuilds the record. + +@item +Field splitting is more complicated than record splitting. + +@multitable @columnfractions .40 .40 .20 +@headitem Field separator value @tab Fields are split @dots{} @tab @command{awk} / @command{gawk} +@item @code{FS == " "} @tab On runs of whitespace @tab @command{awk} +@item @code{FS == @var{any single character}} @tab On that character @tab @command{awk} +@item @code{FS == @var{regexp}} @tab On text matching the regexp @tab @command{awk} +@item @code{FS == ""} @tab Each individual character is a separate field @tab @command{gawk} +@item @code{FIELDWIDTHS == @var{list of columns}} @tab Based on character position @tab @command{gawk} +@item @code{FPAT == @var{regexp}} @tab On text around text matching the regexp @tab @command{gawk} +@end multitable + +Using @samp{FS = "\n"} causes the entire record to be a single field +(assuming that newlines separate records). + +@item +@code{FS} may be set from the command line using the @option{-F} option. +This can also be done using command-line variable assignment. + +@item +@code{PROCINFO["FS"]} can be used to see how fields are being split. + +@item +Use @code{getline} in its various forms to read additional records, +from the default input stream, from a file, or from a pipe or co-process. + +@item +Use @code{PROCINFO[@var{file}, "READ_TIMEOUT"]} to cause reads to timeout +for @var{file}. + +@item +Directories on the command line are fatal for standard @command{awk}; +@command{gawk} ignores them if not in POSIX mode. + +@end itemize + +@node Input Exercises +@section Exercises + +@enumerate +@item +Using the @code{FIELDWIDTHS} variable (@pxref{Constant Size}), +write a program to read election data, where each record represents +one voter's votes. Come up with a way to define which columns are +associated with each ballot item, and print the total votes, +including abstentions, for each item. 
+ +@item +@ref{Plain Getline}, presented a program to remove C-style +comments (@samp{/* @dots{} */}) from the input. That program +does not work if one comment ends on one line and another one +starts later on the same line. +Write a program that does handle multiple comments on the line. + +@end enumerate + @node Printing @chapter Printing Output @@ -7655,7 +8261,7 @@ For printing with specifications, you need the @code{printf} statement @cindex @code{printf} statement Besides basic and formatted printing, this @value{CHAPTER} also covers I/O redirections to files and pipes, introduces -the special file names that @command{gawk} processes internally, +the special @value{FN}s that @command{gawk} processes internally, and discusses the @code{close()} built-in function. @menu @@ -7670,13 +8276,15 @@ and discusses the @code{close()} built-in function. @command{gawk} allows access to inherited file descriptors. * Close Files And Pipes:: Closing Input and Output Files and Pipes. +* Output Summary:: Output summary. +* Output exercises:: Exercises. @end menu @node Print @section The @code{print} Statement The @code{print} statement is used for producing output with simple, standardized -formatting. Specify only the strings or numbers to print, in a +formatting. You specify only the strings or numbers to print, in a list separated by commas. They are output, separated by single spaces, followed by a newline. The statement looks like this: @@ -7759,10 +8367,9 @@ $ @kbd{awk '@{ print $1 $2 @}' inventory-shipped} To someone unfamiliar with the @file{inventory-shipped} file, neither example's output makes much sense. A heading line at the beginning would make it clearer. Let's add some headings to our table of months -(@code{$1}) and green crates shipped (@code{$2}). We do this using the -@code{BEGIN} pattern -(@pxref{BEGIN/END}) -so that the headings are only printed once: +(@code{$1}) and green crates shipped (@code{$2}). 
We do this using +a @code{BEGIN} rule (@pxref{BEGIN/END}) so that the headings are only +printed once: @example awk 'BEGIN @{ print "Month Crates" @@ -7848,26 +8455,32 @@ The following example prints the first and second fields of each input record, separated by a semicolon, with a blank line added after each newline: -@ignore -Exercise, -Rewrite the -@example -awk 'BEGIN @{ print "Month Crates" - print "----- ------" @} - @{ print $1, " ", $2 @}' inventory-shipped -@end example -program by using a new value of @code{OFS}. -@end ignore @example $ @kbd{awk 'BEGIN @{ OFS = ";"; ORS = "\n\n" @}} -> @kbd{@{ print $1, $2 @}' BBS-list} -@print{} aardvark;555-5553 -@print{} -@print{} alpo-net;555-3412 -@print{} -@print{} barfly;555-7685 -@dots{} +> @kbd{@{ print $1, $2 @}' mail-list} +@print{} Amelia;555-5553 +@print{} +@print{} Anthony;555-3412 +@print{} +@print{} Becky;555-7685 +@print{} +@print{} Bill;555-1675 +@print{} +@print{} Broderick;555-0542 +@print{} +@print{} Camilla;555-2912 +@print{} +@print{} Fabius;555-1234 +@print{} +@print{} Julie;555-6699 +@print{} +@print{} Martin;555-6480 +@print{} +@print{} Samuel;555-3430 +@print{} +@print{} Jean-Paul;555-2127 +@print{} @end example If the value of @code{ORS} does not contain a newline, the program's output @@ -7889,7 +8502,7 @@ numbers can be formatted. The different format specifications are discussed more fully in @ref{Control Letters}. -@cindex @code{sprintf()} function +@cindexawkfunc{sprintf} @cindex @code{OFMT} variable @cindex output, format specifier@comma{} @code{OFMT} The built-in variable @code{OFMT} contains the default format specification @@ -7955,7 +8568,7 @@ parentheses are necessary if any of the item expressions use the @samp{>} relational operator; otherwise, it can be confused with an output redirection (@pxref{Redirection}). -@cindex format strings +@cindex format specifiers The difference between @code{printf} and @code{print} is the @var{format} argument. 
This is an expression whose value is taken as a string; it specifies how to output each of the other arguments. It is called the @@ -7998,8 +8611,9 @@ of value to print. The rest of the format specifier is made up of optional @dfn{modifiers} that control @emph{how} to print the value, such as the field width. Here is a list of the format-control letters: -@table @code -@item %c +@c @asis for docbook to come out right +@table @asis +@item @code{%c} Print a number as an ASCII character; thus, @samp{printf "%c", 65} outputs the letter @samp{A}. The output for a string value is the first character of the string. @@ -8007,16 +8621,6 @@ the first character of the string. @cindex dark corner, format-control characters @cindex @command{gawk}, format-control characters @quotation NOTE -@ignore -The @samp{%c} format does @emph{not} handle values outside the range -0--255. On most systems, values from 0--127 are within the range of -ASCII and will yield an ASCII character. Values in the range 128--255 -may format as characters in some extended character set, or they may not. -System 390 (IBM architecture mainframe) systems use 8-bit characters, -and thus values from 0--255 yield the corresponding EBCDIC character. -Any value above 255 is treated as modulo 255; i.e., the lowest eight bits -of the value are used. The locale and character set are always ignored. -@end ignore The POSIX standard says the first character of a string is printed. In locales with multibyte characters, @command{gawk} attempts to convert the leading bytes of the string into a valid wide character @@ -8031,12 +8635,12 @@ a single byte (0--255). @end quotation -@item %d@r{,} %i +@item @code{%d}, @code{%i} Print a decimal integer. The two control letters are equivalent. (The @samp{%i} specification is for compatibility with ISO C.) -@item %e@r{,} %E +@item @code{%e}, @code{%E} Print a number in scientific (exponential) notation; for example: @@ -8051,7 +8655,7 @@ which follow the decimal point. 
discussed in the next @value{SUBSECTION}.) @samp{%E} uses @samp{E} instead of @samp{e} in the output. -@item %f +@item @code{%f} Print a number in floating-point notation. For example: @@ -8071,39 +8675,40 @@ infinity are formatted as @samp{-inf} or @samp{-infinity}, and positive infinity as @samp{inf} and @samp{infinity}. -The special ``not a number'' value formats as @samp{-nan} or @samp{nan}. +The special ``not a number'' value formats as @samp{-nan} or @samp{nan} +(@pxref{Math Definitions}). -@item %F +@item @code{%F} Like @samp{%f} but the infinity and ``not a number'' values are spelled using uppercase letters. The @samp{%F} format is a POSIX extension to ISO C; not all systems support it. On those that don't, @command{gawk} uses @samp{%f} instead. -@item %g@r{,} %G +@item @code{%g}, @code{%G} Print a number in either scientific notation or in floating-point notation, whichever uses fewer characters; if the result is printed in scientific notation, @samp{%G} uses @samp{E} instead of @samp{e}. -@item %o +@item @code{%o} Print an unsigned octal integer (@pxref{Nondecimal-numbers}). -@item %s +@item @code{%s} Print a string. -@item %u +@item @code{%u} Print an unsigned decimal integer. (This format is of marginal use, because all numbers in @command{awk} are floating-point; it is provided primarily for compatibility with C.) -@item %x@r{,} %X +@item @code{%x}, @code{%X} Print an unsigned hexadecimal integer; @samp{%X} uses the letters @samp{A} through @samp{F} instead of @samp{a} through @samp{f} (@pxref{Nondecimal-numbers}). -@item %% +@item @code{%%} Print a single @samp{%}. This does not consume an argument and it ignores any modifiers. 
@@ -8138,7 +8743,7 @@ which they may appear: @table @code @cindex differences in @command{awk} and @command{gawk}, @code{print}/@code{printf} statements @cindex @code{printf} statement, positional specifiers -@c the command does NOT start a secondary +@c the code{} does NOT start a secondary @cindex positional specifiers, @code{printf} statement @item @var{N}$ An integer constant followed by a @samp{$} is a @dfn{positional specifier}. @@ -8214,7 +8819,7 @@ For example: $ @kbd{cat thousands.awk} @ii{Show source program} @print{} BEGIN @{ printf "%'d\n", 1234567 @} $ @kbd{LC_ALL=C gawk -f thousands.awk} -@print{} 1234567 @ii{Results in "C" locale} +@print{} 1234567 @ii{Results in} "C" @ii{locale} $ @kbd{LC_ALL=en_US.UTF-8 gawk -f thousands.awk} @print{} 1,234,567 @ii{Results in US English UTF locale} @end example @@ -8324,14 +8929,12 @@ This is not particularly easy to read but it does work. @c @cindex lint checks @cindex troubleshooting, fatal errors, @code{printf} format strings @cindex POSIX @command{awk}, @code{printf} format strings and -C programmers may be used to supplying additional -@samp{l}, @samp{L}, and @samp{h} -modifiers in @code{printf} format strings. These are not valid in @command{awk}. -Most @command{awk} implementations silently ignore them. -If @option{--lint} is provided on the command line -(@pxref{Options}), -@command{gawk} warns about their use. If @option{--posix} is supplied, -their use is a fatal error. +C programmers may be used to supplying additional modifiers (@samp{h}, +@samp{j}, @samp{l}, @samp{L}, @samp{t}, and @samp{z}) in @code{printf} +format strings. These are not valid in @command{awk}. Most @command{awk} +implementations silently ignore them. If @option{--lint} is provided +on the command line (@pxref{Options}), @command{gawk} warns about their +use. If @option{--posix} is supplied, their use is a fatal error. 
@c ENDOFRANGE pfm @node Printf Examples @@ -8341,30 +8944,30 @@ The following simple example shows how to use @code{printf} to make an aligned table: @example -awk '@{ printf "%-10s %s\n", $1, $2 @}' BBS-list +awk '@{ printf "%-10s %s\n", $1, $2 @}' mail-list @end example @noindent This command -prints the names of the bulletin boards (@code{$1}) in the file -@file{BBS-list} as a string of 10 characters that are left-justified. It also +prints the names of the people (@code{$1}) in the file +@file{mail-list} as a string of 10 characters that are left-justified. It also prints the phone numbers (@code{$2}) next on the line. This produces an aligned two-column table of names and phone numbers, as shown here: @example -$ @kbd{awk '@{ printf "%-10s %s\n", $1, $2 @}' BBS-list} -@print{} aardvark 555-5553 -@print{} alpo-net 555-3412 -@print{} barfly 555-7685 -@print{} bites 555-1675 -@print{} camelot 555-0542 -@print{} core 555-2912 -@print{} fooey 555-1234 -@print{} foot 555-6699 -@print{} macfoo 555-6480 -@print{} sdace 555-3430 -@print{} sabafoo 555-2127 +$ @kbd{awk '@{ printf "%-10s %s\n", $1, $2 @}' mail-list} +@print{} Amelia 555-5553 +@print{} Anthony 555-3412 +@print{} Becky 555-7685 +@print{} Bill 555-1675 +@print{} Broderick 555-0542 +@print{} Camilla 555-2912 +@print{} Fabius 555-1234 +@print{} Julie 555-6699 +@print{} Martin 555-6480 +@print{} Samuel 555-3430 +@print{} Jean-Paul 555-2127 @end example In this case, the phone numbers had to be printed as strings because @@ -8377,7 +8980,7 @@ they are last on their lines. They don't need to have spaces after them. The table could be made to look even nicer by adding headings to the -tops of the columns. This is done using the @code{BEGIN} pattern +tops of the columns. 
This is done using a @code{BEGIN} rule (@pxref{BEGIN/END}) so that the headers are only printed once, at the beginning of the @command{awk} program: @@ -8385,7 +8988,7 @@ the @command{awk} program: @example awk 'BEGIN @{ print "Name Number" print "---- ------" @} - @{ printf "%-10s %s\n", $1, $2 @}' BBS-list + @{ printf "%-10s %s\n", $1, $2 @}' mail-list @end example The above example mixes @code{print} and @code{printf} statements in @@ -8395,7 +8998,7 @@ same results: @example awk 'BEGIN @{ printf "%-10s %s\n", "Name", "Number" printf "%-10s %s\n", "----", "------" @} - @{ printf "%-10s %s\n", $1, $2 @}' BBS-list + @{ printf "%-10s %s\n", $1, $2 @}' mail-list @end example @noindent @@ -8410,23 +9013,19 @@ emphasized by storing it in a variable, like this: awk 'BEGIN @{ format = "%-10s %s\n" printf format, "Name", "Number" printf format, "----", "------" @} - @{ printf format, $1, $2 @}' BBS-list + @{ printf format, $1, $2 @}' mail-list @end example -@c !!! exercise -At this point, it would be a worthwhile exercise to use the -@code{printf} statement to line up the headings and table data for the -@file{inventory-shipped} example that was covered earlier in the @value{SECTION} -on the @code{print} statement -(@pxref{Print}). @c ENDOFRANGE printfs @node Redirection @section Redirecting Output of @code{print} and @code{printf} +@c STARTOFRANGE outre @cindex output redirection +@c STARTOFRANGE reout @cindex redirection of output -@cindex @code{--sandbox} option, output redirection with @code{print}, @code{printf} +@cindex @option{--sandbox} option, output redirection with @code{print}, @code{printf} So far, the output from @code{print} and @code{printf} has gone to the standard output, usually the screen. Both @code{print} and @code{printf} can @@ -8443,11 +9042,11 @@ Redirections in @command{awk} are written just like redirections in shell commands, except that they are written inside the @command{awk} program. 
@c the commas here are part of the see also -@cindex @code{print} statement, See Also redirection, of output -@cindex @code{printf} statement, See Also redirection, of output +@cindex @code{print} statement, See Also redirection@comma{} of output +@cindex @code{printf} statement, See Also redirection@comma{} of output There are four forms of output redirection: output to a file, output appended to a file, output through a pipe to another command, and output -to a coprocess. They are all shown for the @code{print} statement, +to a coprocess. We show them all for the @code{print} statement, but they work identically for @code{printf}: @table @code @@ -8456,29 +9055,29 @@ but they work identically for @code{printf}: @cindex operators, input/output @item print @var{items} > @var{output-file} This redirection prints the items into the output file named -@var{output-file}. The file name @var{output-file} can be any +@var{output-file}. The @value{FN} @var{output-file} can be any expression. Its value is changed to a string and then used as a -file name (@pxref{Expressions}). +@value{FN} (@pxref{Expressions}). When this type of redirection is used, the @var{output-file} is erased before the first output is written to it. Subsequent writes to the same @var{output-file} do not erase @var{output-file}, but append to it. (This is different from how you use redirections in shell scripts.) If @var{output-file} does not exist, it is created. 
For example, here -is how an @command{awk} program can write a list of BBS names to one +is how an @command{awk} program can write a list of people's names to one file named @file{name-list}, and a list of phone numbers to another file named @file{phone-list}: @example $ @kbd{awk '@{ print $2 > "phone-list"} -> @kbd{print $1 > "name-list" @}' BBS-list} +> @kbd{print $1 > "name-list" @}' mail-list} $ @kbd{cat phone-list} @print{} 555-5553 @print{} 555-3412 @dots{} $ @kbd{cat name-list} -@print{} aardvark -@print{} alpo-net +@print{} Amelia +@print{} Anthony @dots{} @end example @@ -8496,7 +9095,7 @@ appended to the file. If @var{output-file} does not exist, then it is created. @cindex @code{|} (vertical bar), @code{|} operator (I/O) -@cindex pipes, output +@cindex pipe, output @cindex output, pipes @item print @var{items} | @var{command} It is possible to send output to another program through a pipe @@ -8507,7 +9106,7 @@ to another process created to execute @var{command}. The redirection argument @var{command} is actually an @command{awk} expression. Its value is converted to a string whose contents give the shell command to be run. For example, the following produces two -files, one unsorted list of BBS names, and one list sorted in reverse +files, one unsorted list of people's names, and one list sorted in reverse alphabetical order: @ignore @@ -8520,7 +9119,7 @@ alone for now and let's hope no-one notices. @example awk '@{ print $1 > "names.unsorted" command = "sort -r > names.sorted" - print $1 | command @}' BBS-list + print $1 | command @}' mail-list @end example The unsorted list is written with an ordinary redirection, while @@ -8552,7 +9151,7 @@ This example also illustrates the use of a variable to represent a @var{file} or @var{command}---it is not necessary to always use a string constant. 
Using a variable is generally a good idea, because (if you mean to refer to that same file or command) -@command{awk} requires that the string value be spelled identically +@command{awk} requires that the string value be written identically every time. @cindex coprocesses @@ -8611,7 +9210,9 @@ As mentioned earlier many @end ifnotinfo @ifnottex +@ifnotdocbook Many +@end ifnotdocbook @end ifnottex older @command{awk} implementations limit the number of pipelines that an @command{awk} @@ -8624,7 +9225,7 @@ open as many pipelines as the underlying operating system permits. A particularly powerful way to use redirection is to build command lines and pipe them into the shell, @command{sh}. For example, suppose you -have a list of files brought over from a system where all the file names +have a list of files brought over from a system where all the @value{FN}s are stored in uppercase, and you wish to rename them to have names in all lowercase. The following program is both simple and efficient: @@ -8646,12 +9247,12 @@ It then sends the list to the shell for execution. @c ENDOFRANGE reout @node Special Files -@section Special File Names in @command{gawk} +@section Special @value{FFN} in @command{gawk} @c STARTOFRANGE gfn @cindex @command{gawk}, file names in -@command{gawk} provides a number of special file names that it interprets -internally. These file names provide access to standard file descriptors +@command{gawk} provides a number of special @value{FN}s that it interprets +internally. These @value{FN}s provide access to standard file descriptors and TCP/IP networking. @menu @@ -8715,12 +9316,12 @@ that happens, writing to the screen is not correct. In fact, if terminal at all. Then opening @file{/dev/tty} fails. -@command{gawk} provides special file names for accessing the three standard -streams. @value{COMMONEXT}. It also provides syntax for accessing -any other inherited open files. 
If the file name matches +@command{gawk} provides special @value{FN}s for accessing the three standard +streams. @value{COMMONEXT} It also provides syntax for accessing +any other inherited open files. If the @value{FN} matches one of these special names when @command{gawk} redirects input or output, -then it directly uses the stream that the file name stands for. -These special file names work for all operating systems that @command{gawk} +then it directly uses the stream that the @value{FN} stands for. +These special @value{FN}s work for all operating systems that @command{gawk} has been ported to, not just those that are POSIX-compliant: @cindex common extensions, @code{/dev/stdin} special file @@ -8730,9 +9331,9 @@ has been ported to, not just those that are POSIX-compliant: @cindex extensions, common@comma{} @code{/dev/stdout} special file @cindex extensions, common@comma{} @code{/dev/stderr} special file @cindex file names, standard streams in @command{gawk} -@cindex @code{/dev/@dots{}} special files (@command{gawk}) +@cindex @code{/dev/@dots{}} special files @cindex files, @code{/dev/@dots{}} special files -@cindex @code{/dev/fd/@var{N}} special files +@cindex @code{/dev/fd/@var{N}} special files (@command{gawk}) @table @file @item /dev/stdin The standard input (file descriptor 0). @@ -8750,7 +9351,7 @@ the shell). Unless special pains are taken in the shell from which @command{gawk} is invoked, only descriptors 0, 1, and 2 are available. @end table -The file names @file{/dev/stdin}, @file{/dev/stdout}, and @file{/dev/stderr} +The @value{FN}s @file{/dev/stdin}, @file{/dev/stdout}, and @file{/dev/stderr} are aliases for @file{/dev/fd/0}, @file{/dev/fd/1}, and @file{/dev/fd/2}, respectively. However, they are more self-explanatory. The proper way to write an error message in a @command{gawk} program @@ -8761,13 +9362,12 @@ print "Serious error detected!" 
> "/dev/stderr" @end example @cindex troubleshooting, quotes with file names -Note the use of quotes around the file name. +Note the use of quotes around the @value{FN}. Like any other redirection, the value must be a string. It is a common error to omit the quotes, which leads to confusing results. -@c Exercise: What does it do? :-) -Finally, using the @code{close()} function on a file name of the +Finally, using the @code{close()} function on a @value{FN} of the form @code{"/dev/fd/@var{N}"}, for file descriptor numbers above two, does actually close the given file descriptor. @@ -8783,7 +9383,7 @@ versions of @command{awk}. @command{gawk} programs can open a two-way TCP/IP connection, acting as either a client or a server. -This is done using a special file name of the form: +This is done using a special @value{FN} of the form: @example @file{/@var{net-type}/@var{protocol}/@var{local-port}/@var{remote-host}/@var{remote-port}} @@ -8793,7 +9393,7 @@ The @var{net-type} is one of @samp{inet}, @samp{inet4} or @samp{inet6}. The @var{protocol} is one of @samp{tcp} or @samp{udp}, and the other fields represent the other essential pieces of information for making a networking connection. -These file names are used with the @samp{|&} operator for communicating +These @value{FN}s are used with the @samp{|&} operator for communicating with a coprocess (@pxref{Two-way I/O}). This is an advanced feature, mentioned here only for completeness. @@ -8801,21 +9401,21 @@ Full discussion is delayed until @ref{TCP/IP Networking}. 
@node Special Caveats -@subsection Special File Name Caveats +@subsection Special @value{FFN} Caveats Here is a list of things to bear in mind when using the -special file names that @command{gawk} provides: +special @value{FN}s that @command{gawk} provides: -@itemize @bullet +@itemize @value{BULLET} @cindex compatibility mode (@command{gawk}), file names @cindex file names, in compatibility mode @item -Recognition of these special file names is disabled if @command{gawk} is in +Recognition of these special @value{FN}s is disabled if @command{gawk} is in compatibility mode (@pxref{Options}). @item @command{gawk} @emph{always} -interprets these special file names. +interprets these special @value{FN}s. For example, using @samp{/dev/fd/4} for output actually writes on file descriptor 4, and not on a new file descriptor that is @code{dup()}'ed from file descriptor 4. Most of @@ -8833,12 +9433,12 @@ Doing so results in unpredictable behavior. @c STARTOFRANGE ofc @cindex output, files@comma{} closing @c STARTOFRANGE pc -@cindex pipes, closing +@cindex pipe, closing @c STARTOFRANGE cc @cindex coprocesses, closing @cindex @code{getline} command, coprocesses@comma{} using from -If the same file name or the same shell command is used with @code{getline} +If the same @value{FN} or the same shell command is used with @code{getline} more than once during the execution of an @command{awk} program (@pxref{Getline}), the file is opened (or the command is executed) the first time only. @@ -8847,11 +9447,11 @@ The next time the same file or command is used with @code{getline}, another record is read from it, and so on. Similarly, when a file or pipe is opened for output, @command{awk} remembers -the file name or command associated with it, and subsequent +the @value{FN} or command associated with it, and subsequent writes to the same file or command are appended to the previous writes. The file or pipe stays open until @command{awk} exits. 
-@cindex @code{close()} function +@cindexawkfunc{close} This implies that special steps are necessary in order to read the same file again from the beginning, or to rerun a shell command (rather than reading more output from the same command). The @code{close()} function @@ -8889,7 +9489,7 @@ file or command, or the next @code{print} or @code{printf} to that file or command, reopens the file or reruns the command. Because the expression that you use to close a file or pipeline must exactly match the expression used to open the file or run the command, -it is good practice to use a variable to store the file name or command. +it is good practice to use a variable to store the @value{FN} or command. The previous example becomes the following: @example @@ -8903,7 +9503,7 @@ close(sortcom) This helps avoid hard-to-find typographical errors in your @command{awk} programs. Here are some of the reasons for closing an output file: -@itemize @bullet +@itemize @value{BULLET} @item To write a file and read it back later on in the same @command{awk} program. Close the file after writing it, then @@ -8936,9 +9536,10 @@ a separate message. @cindex differences in @command{awk} and @command{gawk}, @code{close()} function @cindex portability, @code{close()} function and +@cindex @code{close()} function, portability If you use more files than the system allows you to have open, @command{gawk} attempts to multiplex the available open files among -your data files. @command{gawk}'s ability to do this depends upon the +your @value{DF}s. @command{gawk}'s ability to do this depends upon the facilities of your operating system, so it may not always work. It is therefore both good practice and good portability advice to always use @code{close()} on your files when you are done with them. @@ -8971,15 +9572,16 @@ more importantly, the file descriptor for the pipe is not closed and released until @code{close()} is called or @command{awk} exits. 
-@code{close()} will silently do nothing if given an argument that +@code{close()} silently does nothing if given an argument that does not represent a file, pipe or coprocess that was opened with -a redirection. +a redirection. In such a case, it returns a negative value, +indicating an error. In addition, @command{gawk} sets @code{ERRNO} +to a string indicating the error. -Note also that @samp{close(FILENAME)} has no -``magic'' effects on the implicit loop that reads through the -files named on the command line. It is, more likely, a close -of a file that was never opened, so @command{awk} silently -does nothing. +Note also that @samp{close(FILENAME)} has no ``magic'' effects on the +implicit loop that reads through the files named on the command line. +It is, more likely, a close of a file that was never opened with a +redirection, so @command{awk} silently does nothing. @cindex @code{|} (vertical bar), @code{|&} operator (I/O), pipes@comma{} closing When using the @samp{|&} operator to communicate with a coprocess, @@ -9003,7 +9605,7 @@ which discusses it in more detail and gives an example. @cindex differences in @command{awk} and @command{gawk}, @code{close()} function @cindex Unix @command{awk}, @code{close()} function and -In many versions of Unix @command{awk}, the @code{close()} function +In many older versions of Unix @command{awk}, the @code{close()} function is actually a statement. It is a syntax error to try and use the return value from @code{close()}: @value{DARKCORNER} @@ -9015,7 +9617,7 @@ retval = close(command) # syntax error in many Unix awks @end example @cindex @command{gawk}, @code{ERRNO} variable in -@cindex @code{ERRNO} variable +@cindex @code{ERRNO} variable, with @command{close()} function @command{gawk} treats @code{close()} as a function. The return value is @minus{}1 if the argument names something that was never opened with a redirection, or if there is @@ -9048,6 +9650,67 @@ when closing a pipe. 
@c ENDOFRANGE ofc @c ENDOFRANGE pc @c ENDOFRANGE cc + +@node Output Summary +@section Summary + +@itemize @value{BULLET} +@item +The @code{print} statement prints comma-separated expressions. Each +expression is separated by the value of @code{OFS} and terminated by +the value of @code{ORS}. @code{OFMT} provides the conversion format +for numeric values for the @code{print} statement. + +@item +The @code{printf} statement provides finer-grained control over output, +with format control letters for different data types and various flags +that modify the behavior of the format control letters. + +@item +Output from both @code{print} and @code{printf} may be redirected to +files, pipes, and co-processes. + +@item +@command{gawk} provides special file names for access to standard input, +output and error, and for network communications. + +@item +Use @code{close()} to close open file, pipe and co-process redirections. +For co-processes, it is possible to close only one direction of the +communications. + +@end itemize + +@node Output exercises +@section Exercises + +@enumerate +@item +Rewrite the program: + +@example +awk 'BEGIN @{ print "Month Crates" + print "----- ------" @} + @{ print $1, " ", $2 @}' inventory-shipped +@end example + +@noindent +from @ref{Output Separators}, by using a new value of @code{OFS}. + +@item +Use the @code{printf} statement to line up the headings and table data +for the @file{inventory-shipped} example that was covered in @ref{Print}. + +@item +What happens if you forget the double quotes when redirecting +output, as follows: + +@example +BEGIN @{ print "Serious error detected!" > /dev/stderr @} +@end example + +@end enumerate + @c ENDOFRANGE prnt @node Expressions @@ -9074,6 +9737,7 @@ combinations of these with various operators. * Function Calls:: A function call is an expression. * Precedence:: How various operators nest. * Locales:: How the locale affects things. +* Expressions Summary:: Expressions summary. 
@end menu @node Values @@ -9093,6 +9757,8 @@ which provide the values used in expressions. @node Constants @subsection Constant Expressions + +@c STARTOFRANGE cnst @cindex constants, types of The simplest type of expression is the @dfn{constant}, which always has @@ -9112,13 +9778,14 @@ have different forms, but are stored identically internally. @node Scalar Constants @subsubsection Numeric and String Constants -@cindex numeric, constants +@cindex constants, numeric +@cindex numeric constants A @dfn{numeric constant} stands for a number. This number can be an integer, a decimal fraction, or a number in scientific (exponential) notation.@footnote{The internal representation of all numbers, -including integers, uses double precision -floating-point numbers. -On most modern systems, these are in IEEE 754 standard format.} +including integers, uses double precision floating-point numbers. +On most modern systems, these are in IEEE 754 standard format. +@xref{Arbitrary Precision Arithmetic}, for much more information.} Here are some examples of numeric constants that all have the same value: @@ -9138,7 +9805,7 @@ double-quotation marks. For example: @noindent @cindex differences in @command{awk} and @command{gawk}, strings -@cindex strings, length of +@cindex strings, length limitations represents the string whose contents are @samp{parrot}. Strings in @command{gawk} can be of any length, and they can contain any of the possible eight-bit ASCII characters including ASCII @sc{nul} (character code zero). @@ -9325,13 +9992,13 @@ upon the contents of the current input record. 
@cindex differences in @command{awk} and @command{gawk}, regexp constants @cindex dark corner, regexp constants, as arguments to user-defined functions -@cindex @code{gensub()} function (@command{gawk}) -@cindex @code{sub()} function -@cindex @code{gsub()} function +@cindexgawkfunc{gensub} +@cindexawkfunc{sub} +@cindexawkfunc{gsub} Constant regular expressions are also used as the first argument for the @code{gensub()}, @code{sub()}, and @code{gsub()} functions, as the second argument of the @code{match()} function, -and as the third argument of the @code{patsplit()} function +and as the third argument of the @code{split()} and @code{patsplit()} functions (@pxref{String Functions}). Modern implementations of @command{awk}, including @command{gawk}, allow the third argument of @code{split()} to be a regexp constant, but some @@ -9438,7 +10105,7 @@ Such an assignment has the following form: @var{variable}=@var{text} @end example -@cindex @code{-v} option +@cindex @option{-v} option @noindent With it, a variable is set either at the beginning of the @command{awk} run or in between input files. @@ -9452,7 +10119,7 @@ as in the following: @noindent the variable is set at the very beginning, even before the @code{BEGIN} rules execute. The @option{-v} option and its assignment -must precede all the file name arguments, as well as the program text. +must precede all the @value{FN} arguments, as well as the program text. (@xref{Options}, for more information about the @option{-v} option.) Otherwise, the variable assignment is performed at a time determined by @@ -9460,7 +10127,7 @@ its position among the input file arguments---after the processing of the preceding input file argument. For example: @example -awk '@{ print $n @}' n=4 inventory-shipped n=2 BBS-list +awk '@{ print $n @}' n=4 inventory-shipped n=2 mail-list @end example @noindent @@ -9469,10 +10136,10 @@ the first file is read, the command line sets the variable @code{n} equal to four. 
This causes the fourth field to be printed in lines from @file{inventory-shipped}. After the first file has finished, but before the second file is started, @code{n} is set to two, so that the -second field is printed in lines from @file{BBS-list}: +second field is printed in lines from @file{mail-list}: @example -$ @kbd{awk '@{ print $n @}' n=4 inventory-shipped n=2 BBS-list} +$ @kbd{awk '@{ print $n @}' n=4 inventory-shipped n=2 mail-list} @print{} 15 @print{} 24 @dots{} @@ -9493,6 +10160,19 @@ sequences @node Conversion @subsection Conversion of Strings and Numbers +Number to string and string to number conversion are generally +straightforward. There can be subtleties to be aware of; +this @value{SECTION} discusses this important facet of @command{awk}. + +@menu +* Strings And Numbers:: How @command{awk} Converts Between Strings And + Numbers. +* Locale influences conversions:: How the locale may affect conversions. +@end menu + +@node Strings And Numbers +@subsubsection How @command{awk} Converts Between Strings And Numbers + @cindex converting, strings to numbers @cindex strings, converting @cindex numbers, converting @@ -9562,6 +10242,7 @@ b = a "" @code{b} has the value @code{"12"}, not @code{"12.00"}. @value{DARKCORNER} +@sidebar Pre-POSIX @command{awk} Used @code{OFMT} For String Conversion @cindex POSIX @command{awk}, @code{OFMT} variable and @cindex @code{OFMT} variable @cindex portability, new @command{awk} vs.@: old @command{awk} @@ -9573,32 +10254,32 @@ specifies the output format to use when printing numbers with @code{print}. conversion from the semantics of printing. Both @code{CONVFMT} and @code{OFMT} have the same default value: @code{"%.6g"}. In the vast majority of cases, old @command{awk} programs do not change their behavior. -However, these semantics for @code{OFMT} are something to keep in mind if you must -port your new-style program to older implementations of @command{awk}. 
-We recommend -that instead of changing your programs, just port @command{gawk} itself. -@xref{Print}, -for more information on the @code{print} statement. - -And, once again, where you are can matter when it comes to converting -between numbers and strings. In @ref{Locales}, we mentioned that -the local character set and language (the locale) can affect how -@command{gawk} matches characters. The locale also affects numeric -formats. In particular, for @command{awk} programs, it affects the -decimal point character. The @code{"C"} locale, and most English-language -locales, use the period character (@samp{.}) as the decimal point. -However, many (if not most) European and non-English locales use the comma -(@samp{,}) as the decimal point character. +@xref{Print}, for more information on the @code{print} statement. +@end sidebar + +@node Locale influences conversions +@subsubsection Locales Can Influence Conversion + +Where you are can matter when it comes to converting between numbers and +strings. The local character set and language---the @dfn{locale}---can +affect numeric formats. In particular, for @command{awk} programs, +it affects the decimal point character and the thousands-separator +character. The @code{"C"} locale, and most English-language locales, +use the period character (@samp{.}) as the decimal point and don't +have a thousands separator. However, many (if not most) European and +non-English locales use the comma (@samp{,}) as the decimal point +character. European locales often use either a space or a period as +the thousands separator, if they have one. @cindex dark corner, locale's decimal point character The POSIX standard says that @command{awk} always uses the period as the decimal -point when reading the @command{awk} program source code, and for command-line -variable assignments (@pxref{Other Arguments}). 
-However, when interpreting input data, for @code{print} and @code{printf} output, -and for number to string conversion, the local decimal point character is used. -@value{DARKCORNER}. -Here are some examples indicating the difference in behavior, -on a GNU/Linux system: +point when reading the @command{awk} program source code, and for +command-line variable assignments (@pxref{Other Arguments}). However, +when interpreting input data, for @code{print} and @code{printf} output, +and for number to string conversion, the local decimal point character +is used. @value{DARKCORNER} In all cases, numbers in source code and +in input data cannot have a thousands separator. Here are some examples +indicating the difference in behavior, on a GNU/Linux system: @example $ @kbd{export POSIXLY_CORRECT=1} @ii{Force POSIX behavior} @@ -9613,7 +10294,7 @@ $ @kbd{echo 4,321 | LC_ALL=en_DK.utf-8 gawk '@{ print $1 + 1 @}'} @end example @noindent -The @samp{en_DK.utf-8} locale is for English in Denmark, where the comma acts as +The @code{en_DK.utf-8} locale is for English in Denmark, where the comma acts as the decimal point separator. In the normal @code{"C"} locale, @command{gawk} treats @samp{4,321} as @samp{4}, while in the Danish locale, it's treated as the full number, 4.321. @@ -9760,7 +10441,7 @@ b * int(a / b) + (a % b) == a @end example One possibly undesirable effect of this definition of remainder is that -@code{@var{x} % @var{y}} is negative if @var{x} is negative. Thus: +@samp{@var{x} % @var{y}} is negative if @var{x} is negative. Thus: @example -17 % 8 = -1 @@ -9768,7 +10449,7 @@ One possibly undesirable effect of this definition of remainder is that In other @command{awk} implementations, the signedness of the remainder may be machine-dependent. -@c !!! what does posix say? +@c FIXME !!! what does posix say? @cindex portability, @code{**} operator and @cindex @code{*} (asterisk), @code{**} operator @@ -9795,9 +10476,9 @@ specific operator to represent it. 
Instead, concatenation is performed by writing expressions next to one another, with no operator. For example: @example -$ @kbd{awk '@{ print "Field number one: " $1 @}' BBS-list} -@print{} Field number one: aardvark -@print{} Field number one: alpo-net +$ @kbd{awk '@{ print "Field number one: " $1 @}' mail-list} +@print{} Field number one: Amelia +@print{} Field number one: Anthony @dots{} @end example @@ -9805,9 +10486,9 @@ Without the space in the string constant after the @samp{:}, the line runs together. For example: @example -$ @kbd{awk '@{ print "Field number one:" $1 @}' BBS-list} -@print{} Field number one:aardvark -@print{} Field number one:alpo-net +$ @kbd{awk '@{ print "Field number one:" $1 @}' mail-list} +@print{} Field number one:Amelia +@print{} Field number one:Anthony @dots{} @end example @@ -9824,6 +10505,8 @@ name = "name" print "something meaningful" > file name @end example +@cindex Brian Kernighan's @command{awk} +@cindex @command{mawk} utility @noindent This produces a syntax error with some versions of Unix @command{awk}.@footnote{It happens that Brian Kernighan's @@ -9852,7 +10535,7 @@ BEGIN @{ @end example @noindent -It is not defined whether the assignment to @code{a} happens +It is not defined whether the second assignment to @code{a} happens before or after the value of @code{a} is retrieved for producing the concatenated value. The result could be either @samp{don't panic}, or @samp{panic panic}. @@ -9974,8 +10657,8 @@ element. (Such values are called @dfn{rvalues}.) @cindex variables, types of It is important to note that variables do @emph{not} have permanent types. -A variable's type is simply the type of whatever value it happens -to hold at the moment. In the following program fragment, the variable +A variable's type is simply the type of whatever value was last assigned +to it. 
In the following program fragment, the variable @code{foo} has a numeric value at first, and a string value later on: @example @@ -10076,6 +10759,7 @@ The indices of @code{bar} are practically guaranteed to be different, because and see @ref{Numeric Functions}, for more information). This example illustrates an important fact about assignment operators: the lefthand expression is only evaluated @emph{once}. + It is up to the implementation as to which expression is evaluated first, the lefthand or the righthand. Consider this example: @@ -10108,17 +10792,17 @@ to a number. @caption{Arithmetic Assignment Operators} @multitable @columnfractions .30 .70 @headitem Operator @tab Effect -@item @var{lvalue} @code{+=} @var{increment} @tab Adds @var{increment} to the value of @var{lvalue}. -@item @var{lvalue} @code{-=} @var{decrement} @tab Subtracts @var{decrement} from the value of @var{lvalue}. -@item @var{lvalue} @code{*=} @var{coefficient} @tab Multiplies the value of @var{lvalue} by @var{coefficient}. -@item @var{lvalue} @code{/=} @var{divisor} @tab Divides the value of @var{lvalue} by @var{divisor}. -@item @var{lvalue} @code{%=} @var{modulus} @tab Sets @var{lvalue} to its remainder by @var{modulus}. +@item @var{lvalue} @code{+=} @var{increment} @tab Add @var{increment} to the value of @var{lvalue}. +@item @var{lvalue} @code{-=} @var{decrement} @tab Subtract @var{decrement} from the value of @var{lvalue}. +@item @var{lvalue} @code{*=} @var{coefficient} @tab Multiply the value of @var{lvalue} by @var{coefficient}. +@item @var{lvalue} @code{/=} @var{divisor} @tab Divide the value of @var{lvalue} by @var{divisor}. +@item @var{lvalue} @code{%=} @var{modulus} @tab Set @var{lvalue} to its remainder by @var{modulus}. 
@cindex common extensions, @code{**=} operator @cindex extensions, common@comma{} @code{**=} operator @cindex @command{awk} language, POSIX version @cindex POSIX @command{awk} @item @var{lvalue} @code{^=} @var{power} @tab -@item @var{lvalue} @code{**=} @var{power} @tab Raises @var{lvalue} to the power @var{power}. @value{COMMONEXT} +@item @var{lvalue} @code{**=} @var{power} @tab Raise @var{lvalue} to the power @var{power}. @value{COMMONEXT} @end multitable @end float @@ -10163,10 +10847,8 @@ A workaround is: awk '/[=]=/' /dev/null @end example -@command{gawk} does not have this problem, -nor do the other -freely available versions described in -@ref{Other Versions}. +@command{gawk} does not have this problem; Brian Kernighan's @command{awk} +and @command{mawk} also do not (@pxref{Other Versions}). @end sidebar @c ENDOFRANGE exas @c ENDOFRANGE opas @@ -10190,11 +10872,10 @@ are convenient abbreviations for very common operations. @cindex side effects, decrement/increment operators The operator used for adding one is written @samp{++}. It can be used to increment a variable either before or after taking its value. -To pre-increment a variable @code{v}, write @samp{++v}. This adds +To @dfn{pre-increment} a variable @code{v}, write @samp{++v}. This adds one to the value of @code{v}---that new value is also the value of the -expression. (The assignment expression @samp{v += 1} is completely -equivalent.) -Writing the @samp{++} after the variable specifies post-increment. This +expression. (The assignment expression @samp{v += 1} is completely equivalent.) +Writing the @samp{++} after the variable specifies @dfn{post-increment}. This increments the variable value just the same; the difference is that the value of the increment expression itself is the variable's @emph{old} value. Thus, if @code{foo} has the value four, then the expression @samp{foo++} @@ -10206,7 +10887,18 @@ The post-increment @samp{foo++} is nearly the same as writing @samp{(foo += 1) - 1}. 
It is not perfectly equivalent because all numbers in @command{awk} are floating-point---in floating-point, @samp{foo + 1 - 1} does not necessarily equal @code{foo}. But the difference is minute as -long as you stick to numbers that are fairly small (less than 10e12). +long as you stick to numbers that are fairly small (less than +@iftex +@math{10^12}). +@end iftex +@ifnottex +@ifnotdocbook +10e12). +@end ifnotdocbook +@end ifnottex +@docbook +10<superscript>12</superscript>). @c +@end docbook @cindex @code{$} (dollar sign), incrementing fields and arrays @cindex dollar sign (@code{$}), incrementing fields and arrays @@ -10215,6 +10907,7 @@ just like variables. (Use @samp{$(i++)} when you want to do a field reference and a variable increment at the same time. The parentheses are necessary because of the precedence of the field reference operator @samp{$}.) +@c STARTOFRANGE deop @cindex decrement operators The decrement operator @samp{--} works just like @samp{++}, except that it subtracts one instead of adding it. As with @samp{++}, it can be used before @@ -10393,6 +11086,7 @@ like a number---for example, @code{@w{" +2"}}. This concept is used for determining the type of a variable. The type of the variable is important because the types of two variables determine how they are compared. + The various versions of the POSIX standard did not get the rules quite right for several editions. Fortunately, as of at least the 2008 standard (and possibly earlier), the standard has been fixed, @@ -10400,7 +11094,7 @@ and variable typing follows these rules:@footnote{@command{gawk} has followed these rules for many years, and it is gratifying that the POSIX standard is also now correct.} -@itemize @bullet +@itemize @value{BULLET} @item A numeric constant or the result of a numeric operation has the @var{numeric} attribute. 
@@ -10486,6 +11180,7 @@ STRNUM &&string &numeric &numeric\cr }}} @end tex @ifnottex +@ifnotdocbook @display +---------------------------------------------- | STRING NUMERIC STRNUM @@ -10498,7 +11193,51 @@ NUMERIC | string numeric numeric STRNUM | string numeric numeric --------+---------------------------------------------- @end display +@end ifnotdocbook @end ifnottex +@docbook +<informaltable> +<tgroup cols="4"> +<colspec colname="1" align="left"/> +<colspec colname="2" align="left"/> +<colspec colname="3" align="left"/> +<colspec colname="4" align="left"/> +<thead> +<row> +<entry/> +<entry>STRING</entry> +<entry>NUMERIC</entry> +<entry>STRNUM</entry> +</row> +</thead> + +<tbody> +<row> +<entry><emphasis role="bold">STRING</emphasis></entry> +<entry>string</entry> +<entry>string</entry> +<entry>string</entry> +</row> + +<row> +<entry><emphasis role="bold">NUMERIC</emphasis></entry> +<entry>string</entry> +<entry>numeric</entry> +<entry>numeric</entry> +</row> + +<row> +<entry><emphasis role="bold">STRNUM</emphasis></entry> +<entry>string</entry> +<entry>numeric</entry> +<entry>numeric</entry> +</row> + +</tbody> +</tgroup> +</informaltable> + +@end docbook The basic idea is that user input that looks numeric---and @emph{only} user input---should be treated as numeric, even though it is actually @@ -10517,8 +11256,8 @@ This point bears additional emphasis: All user input is made of characters, and so is first and foremost of @var{string} type; input strings that look numeric are additionally given the @var{strnum} attribute. Thus, the six-character input string @w{@samp{ +3.14}} receives the -@var{strnum} attribute. In contrast, the eight-character literal -@w{@code{" +3.14"}} appearing in program text is a string constant. +@var{strnum} attribute. In contrast, the eight characters +@w{@code{" +3.14"}} appearing in program text comprise a string constant. 
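A short experiment contrasts the two cases just described. It is a sketch: the shell variable names `from_input` and `from_const` are invented for this example, and the results shown are what `gawk` is expected to print; other `awk` implementations may differ in how they treat leading blanks.

```shell
# The characters arrive as input data, so they get the strnum attribute
# and are compared numerically with 3.14.
from_input=$(echo ' +3.14' | awk '{ print ($0 == 3.14) }')

# The same characters written as a string constant in the program text
# force a string comparison of " +3.14" against "3.14".
from_const=$(awk 'BEGIN { print (" +3.14" == 3.14) }')

echo "input: $from_input  constant: $from_const"   # input: 1  constant: 0
```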
The following examples print @samp{1} when the comparison between the two different constants is true, @samp{0} otherwise: @@ -10633,7 +11372,7 @@ string comparison (true) string comparison (true) @item a = 2; b = " +2" -@item a == b +@itemx a == b string comparison (false) @end table @@ -10679,7 +11418,7 @@ has the value one if @code{x} contains @samp{foo}, such as @cindex @code{!} (exclamation point), @code{!~} operator @cindex exclamation point (@code{!}), @code{!~} operator The righthand operand of the @samp{~} and @samp{!~} operators may be -either a regexp constant (@code{/@dots{}/}) or an ordinary +either a regexp constant (@code{/}@dots{}@code{/}) or an ordinary expression. In the latter case, the value of the expression as a string is used as a dynamic regexp (@pxref{Regexp Usage}; also @pxref{Computed Regexps}). @@ -10704,7 +11443,9 @@ where this is discussed in more detail. @subsubsection String Comparison With POSIX Rules The POSIX standard says that string comparison is performed based -on the locale's collating order. This is usually very different +on the locale's @dfn{collating order}. This is the order in which +characters sort, as defined by the locale (for more discussion, +@pxref{Ranges and Locales}). This order is usually very different from the results obtained when doing straight character-by-character comparison.@footnote{Technically, string comparison is supposed to behave the same way as if the strings are compared with the C @@ -10712,7 +11453,7 @@ to behave the same way as if the strings are compared with the C Because this behavior differs considerably from existing practice, @command{gawk} only implements it when in POSIX mode (@pxref{Options}). 
-Here is an example to illustrate the difference, in an @samp{en_US.UTF-8} +Here is an example to illustrate the difference, in an @code{en_US.UTF-8} locale: @example @@ -10767,10 +11508,10 @@ The Boolean operators are: @item @var{boolean1} && @var{boolean2} True if both @var{boolean1} and @var{boolean2} are true. For example, the following statement prints the current input record if it contains -both @samp{2400} and @samp{foo}: +both @samp{edu} and @samp{li}: @example -if ($0 ~ /2400/ && $0 ~ /foo/) print +if ($0 ~ /edu/ && $0 ~ /li/) print @end example @cindex side effects, Boolean operators @@ -10783,11 +11524,11 @@ no substring @samp{foo} in the record. @item @var{boolean1} || @var{boolean2} True if at least one of @var{boolean1} or @var{boolean2} is true. For example, the following statement prints all records in the input -that contain @emph{either} @samp{2400} or -@samp{foo} or both: +that contain @emph{either} @samp{edu} or +@samp{li} or both: @example -if ($0 ~ /2400/ || $0 ~ /foo/) print +if ($0 ~ /edu/ || $0 ~ /li/) print @end example The subexpression @var{boolean2} is evaluated only if @var{boolean1} @@ -10928,7 +11669,7 @@ However, putting a newline in front of either character does not work without using backslash continuation (@pxref{Statements/Lines}). If @option{--posix} is specified -(@pxref{Options}), then this extension is disabled. +(@pxref{Options}), this extension is disabled. @node Function Calls @section Function Calls @@ -10947,6 +11688,8 @@ functions and their descriptions. In addition, you can define functions for use in your program. @xref{User-defined}, for instructions on how to do this. +Finally, @command{gawk} lets you write functions in C or C++ +that may be called from your program: see @ref{Dynamic Extensions}. @cindex arguments, in function calls The way to use a function is with a @dfn{function call} expression, @@ -10988,7 +11731,9 @@ If those arguments are not supplied, the functions use a reasonable default value. 
@xref{Built-in}, for full details. If arguments are omitted in calls to user-defined functions, then those arguments are -treated as local variables and initialized to the empty string +treated as local variables. Such local variables act like the +empty string if referenced where a string value is required, +and like zero if referenced where a numeric value is required (@pxref{User-defined}). As an advanced feature, @command{gawk} provides indirect function calls, @@ -10997,12 +11742,12 @@ when you write the source code to your program. We defer discussion of this feature until later; see @ref{Indirect Calls}. @cindex side effects, function calls -Like every other expression, the function call has a value, which is -computed by the function based on the arguments you give it. In this -example, the value of @samp{sqrt(@var{argument})} is the square root of -@var{argument}. -The following program reads numbers, one number per line, and prints the -square root of each one: +Like every other expression, the function call has a value, often +called the @dfn{return value}, which is computed by the function +based on the arguments you give it. In this example, the return value +of @samp{sqrt(@var{argument})} is the square root of @var{argument}. +The following program reads numbers, one number per line, and prints +the square root of each one: @example $ @kbd{awk '@{ print "The square root of", $1, "is", sqrt($1) @}'} @@ -11090,28 +11835,28 @@ expression because the first @samp{$} has higher precedence than the This table presents @command{awk}'s operators, in order of highest to lowest precedence: -@c use @code in the items, looks better in TeX w/o all the quotes -@table @code -@item (@dots{}) +@c @asis for docbook to come out right +@table @asis +@item @code{(}@dots{}@code{)} Grouping. @cindex @code{$} (dollar sign), @code{$} field operator @cindex dollar sign (@code{$}), @code{$} field operator -@item $ +@item @code{$} Field reference. 
@cindex @code{+} (plus sign), @code{++} operator @cindex plus sign (@code{+}), @code{++} operator @cindex @code{-} (hyphen), @code{--} operator @cindex hyphen (@code{-}), @code{--} operator -@item ++ -- +@item @code{++ --} Increment, decrement. @cindex @code{^} (caret), @code{^} operator @cindex caret (@code{^}), @code{^} operator @cindex @code{*} (asterisk), @code{**} operator @cindex asterisk (@code{*}), @code{**} operator -@item ^ ** +@item @code{^ **} Exponentiation. These operators group right-to-left. @cindex @code{+} (plus sign), @code{+} operator @@ -11120,7 +11865,7 @@ Exponentiation. These operators group right-to-left. @cindex hyphen (@code{-}), @code{-} operator @cindex @code{!} (exclamation point), @code{!} operator @cindex exclamation point (@code{!}), @code{!} operator -@item + - ! +@item @code{+ - !} Unary plus, minus, logical ``not.'' @cindex @code{*} (asterisk), @code{*} operator, as multiplication operator @@ -11129,17 +11874,17 @@ Unary plus, minus, logical ``not.'' @cindex forward slash (@code{/}), @code{/} operator @cindex @code{%} (percent sign), @code{%} operator @cindex percent sign (@code{%}), @code{%} operator -@item * / % +@item @code{* / %} Multiplication, division, remainder. @cindex @code{+} (plus sign), @code{+} operator @cindex plus sign (@code{+}), @code{+} operator @cindex @code{-} (hyphen), @code{-} operator @cindex hyphen (@code{-}), @code{-} operator -@item + - +@item @code{+ -} Addition, subtraction. -@item @r{String Concatenation} +@item String Concatenation There is no special symbol for concatenation. The operands are simply written side by side (@pxref{Concatenation}). @@ -11165,7 +11910,7 @@ The operands are simply written side by side @cindex @code{|} (vertical bar), @code{|&} operator (I/O) @cindex vertical bar (@code{|}), @code{|&} operator (I/O) @cindex operators, input/output -@item < <= == != > >= >> | |& +@item @code{< <= == != > >= >> | |&} Relational and redirection. 
The relational operators and the redirections have the same precedence level. Characters such as @samp{>} serve both as relationals and as @@ -11186,26 +11931,26 @@ The correct way to write this statement is @samp{print foo > (a ? b : c)}. @cindex tilde (@code{~}), @code{~} operator @cindex @code{!} (exclamation point), @code{!~} operator @cindex exclamation point (@code{!}), @code{!~} operator -@item ~ !~ +@item @code{~ !~} Matching, nonmatching. @cindex @code{in} operator -@item in +@item @code{in} Array membership. @cindex @code{&} (ampersand), @code{&&} operator @cindex ampersand (@code{&}), @code{&&} operator -@item && +@item @code{&&} Logical ``and''. @cindex @code{|} (vertical bar), @code{||} operator @cindex vertical bar (@code{|}), @code{||} operator -@item || +@item @code{||} Logical ``or''. @cindex @code{?} (question mark), @code{?:} operator @cindex question mark (@code{?}), @code{?:} operator -@item ?: +@item @code{?:} Conditional. This operator groups right-to-left. @cindex @code{+} (plus sign), @code{+=} operator @@ -11222,7 +11967,7 @@ Conditional. This operator groups right-to-left. @cindex percent sign (@code{%}), @code{%=} operator @cindex @code{^} (caret), @code{^=} operator @cindex caret (@code{^}), @code{^=} operator -@item = += -= *= /= %= ^= **= +@item @code{= += -= *= /= %= ^= **=} Assignment. These operators group right-to-left. @end table @@ -11239,27 +11984,102 @@ For maximum portability, do not use them. @section Where You Are Makes A Difference @cindex locale, definition of -Modern systems support the notion of @dfn{locales}: a way to tell -the system about the local character set and language. +Modern systems support the notion of @dfn{locales}: a way to tell the +system about the local character set and language. The ISO C standard +defines a default @code{"C"} locale, which is an environment that is +typical of what many C programmers are used to. 
Once upon a time, the locale setting used to affect regexp matching (@pxref{Ranges and Locales}), but this is no longer true. -Locales can affect record splitting. -For the normal case of @samp{RS = "\n"}, the locale is largely irrelevant. -For other single-character record separators, setting @samp{LC_ALL=C} -in the environment -will give you much better performance when reading records. Otherwise, +Locales can affect record splitting. For the normal case of @samp{RS = +"\n"}, the locale is largely irrelevant. For other single-character +record separators, setting @samp{LC_ALL=C} in the environment will +give you much better performance when reading records. Otherwise, @command{gawk} has to make several function calls, @emph{per input character}, to find the record terminator. -According to POSIX, string comparison is also affected by locales -(similar to regular expressions). The details are presented in -@ref{POSIX String Comparison}. +Locales can affect how dates and times are formatted (@pxref{Time +Functions}). For example, a common way to abbreviate the date September +4, 2015 in the United States is ``9/4/15.'' In many countries in +Europe, however, it is abbreviated ``4.9.15.'' Thus, the @samp{%x} +specification in a @code{"US"} locale might produce @samp{9/4/15}, +while in a @code{"EUROPE"} locale, it might produce @samp{4.9.15}. + +According to POSIX, string comparison is also affected by locales (similar +to regular expressions). The details are presented in @ref{POSIX String +Comparison}. Finally, the locale affects the value of the decimal point character -used when @command{gawk} parses input data. This is discussed in -detail in @ref{Conversion}. +used when @command{gawk} parses input data. This is discussed in detail +in @ref{Conversion}. + +@node Expressions Summary +@section Summary + +@itemize @value{BULLET} +@item +Expressions are the basic elements of computation in programs. 
They are +built from constants, variables, function calls and combinations of the +various kinds of values with operators. + +@item +@command{awk} supplies three kinds of constants: numeric, string, and +regexp. @command{gawk} lets you specify numeric constants in octal +and hexadecimal (bases 8 and 16) in addition to decimal (base 10). +In certain contexts, a standalone regexp constant such as @code{/foo/} +has the same meaning as @samp{$0 ~ /foo/}. + +@item +Variables hold values between uses in computations. A number of built-in +variables provide information to your @command{awk} program, and a number +of others let you control how @command{awk} behaves. + +@item +Numbers are automatically converted to strings, and strings to numbers, +as needed by @command{awk}. Numeric values are converted as if they were +formatted with @code{sprintf()} using the format in @code{CONVFMT}. +Locales can influence the conversions. + +@item +@command{awk} provides the usual arithmetic operators (addition, +subtraction, multiplication, division, modulus), and unary plus and minus. +It also provides comparison operators, boolean operators, and regexp +matching operators. String concatenation is accomplished by placing +two expressions next to each other; there is no explicit operator. +The three-operand @samp{?:} operator provides an ``if-else'' test within +expressions. + +@item +Assignment operators provide convenient shorthands for common arithmetic +operations. + +@item +In @command{awk}, a value is considered to be true if it is non-zero +@emph{or} non-null. Otherwise, the value is false. + +@item +A value's type is set upon each assignment and may change over its +lifetime. The type determines how it behaves in comparisons (string +or numeric). + +@item +Function calls return a value which may be used as part of a larger +expression. Expressions used to pass parameter values are fully +evaluated before the function is called. 
@command{awk} provides +built-in and user-defined functions; this is described later on in this +@value{DOCUMENT}. + +@item +Operator precedence specifies the order in which operations are performed, +unless explicitly overridden by parentheses. @command{awk}'s operator +precedence is compatible with that of C. + +@item +Locales can affect the format of data as output by an @command{awk} +program, and occasionally the format for data read as input. + +@end itemize @c ENDOFRANGE exps @@ -11287,6 +12107,7 @@ building something useful. * Statements:: Describes the various control statements in detail. * Built-in Variables:: Summarizes the built-in variables. +* Pattern Action Summary:: Patterns and Actions summary. @end menu @node Pattern Overview @@ -11317,10 +12138,10 @@ A single expression. It matches when its value is nonzero (if a number) or non-null (if a string). (@xref{Expression Patterns}.) -@item @var{pat1}, @var{pat2} +@item @var{begpat}, @var{endpat} A pair of patterns separated by a comma, specifying a range of records. -The range includes both the initial record that matches @var{pat1} and -the final record that matches @var{pat2}. +The range includes both the initial record that matches @var{begpat} and +the final record that matches @var{endpat}. (@xref{Ranges}.) @item BEGIN @@ -11332,7 +12153,7 @@ Special patterns for you to supply startup or cleanup actions for your @item BEGINFILE @itemx ENDFILE Special patterns for you to supply startup or cleanup actions to be -done on a per file basis. +done on a per-file basis. (@xref{BEGINFILE/ENDFILE}.) @item @var{empty} @@ -11382,7 +12203,7 @@ slashes (@code{/@var{regexp}/}), or any expression whose string value is used as a dynamic regular expression (@pxref{Computed Regexps}). 
The following example prints the second field of each input record -whose first field is precisely @samp{foo}: +whose first field is precisely @samp{li}: @cindex @code{/} (forward slash), patterns and @cindex forward slash (@code{/}), patterns and @@ -11391,68 +12212,65 @@ whose first field is precisely @samp{foo}: @cindex @code{!} (exclamation point), @code{!~} operator @cindex exclamation point (@code{!}), @code{!~} operator @example -$ @kbd{awk '$1 == "foo" @{ print $2 @}' BBS-list} +$ @kbd{awk '$1 == "li" @{ print $2 @}' mail-list} @end example @noindent -(There is no output, because there is no BBS site with the exact name @samp{foo}.) +(There is no output, because there is no person with the exact name @samp{li}.) Contrast this with the following regular expression match, which -accepts any record with a first field that contains @samp{foo}: +accepts any record with a first field that contains @samp{li}: @example -$ @kbd{awk '$1 ~ /foo/ @{ print $2 @}' BBS-list} -@print{} 555-1234 +$ @kbd{awk '$1 ~ /foo/ @{ print $2 @}' mail-list} +@print{} 555-5553 @print{} 555-6699 -@print{} 555-6480 -@print{} 555-2127 @end example @cindex regexp constants, as patterns @cindex patterns, regexp constants as A regexp constant as a pattern is also a special case of an expression -pattern. The expression @code{/foo/} has the value one if @samp{foo} -appears in the current input record. Thus, as a pattern, @code{/foo/} -matches any record containing @samp{foo}. +pattern. The expression @code{/li/} has the value one if @samp{li} +appears in the current input record. Thus, as a pattern, @code{/li/} +matches any record containing @samp{li}. @cindex Boolean expressions, as patterns Boolean expressions are also commonly used as patterns. Whether the pattern matches an input record depends on whether its subexpressions match. 
For example, the following command prints all the records in -@file{BBS-list} that contain both @samp{2400} and @samp{foo}: +@file{mail-list} that contain both @samp{edu} and @samp{li}: @example -$ @kbd{awk '/2400/ && /foo/' BBS-list} -@print{} fooey 555-1234 2400/1200/300 B +$ @kbd{awk '/edu/ && /li/' mail-list} +@print{} Samuel 555-3430 samuel.lanceolis@@shu.edu A @end example The following command prints all records in -@file{BBS-list} that contain @emph{either} @samp{2400} or @samp{foo} +@file{mail-list} that contain @emph{either} @samp{edu} or @samp{li} (or both, of course): @example -$ @kbd{awk '/2400/ || /foo/' BBS-list} -@print{} alpo-net 555-3412 2400/1200/300 A -@print{} bites 555-1675 2400/1200/300 A -@print{} fooey 555-1234 2400/1200/300 B -@print{} foot 555-6699 1200/300 B -@print{} macfoo 555-6480 1200/300 A -@print{} sdace 555-3430 2400/1200/300 A -@print{} sabafoo 555-2127 1200/300 C +$ @kbd{awk '/edu/ || /li/' mail-list} +@print{} Amelia 555-5553 amelia.zodiacusque@@gmail.com F +@print{} Broderick 555-0542 broderick.aliquotiens@@yahoo.com R +@print{} Fabius 555-1234 fabius.undevicesimus@@ucb.edu F +@print{} Julie 555-6699 julie.perscrutabor@@skeeve.com F +@print{} Samuel 555-3430 samuel.lanceolis@@shu.edu A +@print{} Jean-Paul 555-2127 jeanpaul.campanorum@@nyu.edu R @end example The following command prints all records in -@file{BBS-list} that do @emph{not} contain the string @samp{foo}: +@file{mail-list} that do @emph{not} contain the string @samp{li}: @example -$ @kbd{awk '! /foo/' BBS-list} -@print{} aardvark 555-5553 1200/300 B -@print{} alpo-net 555-3412 2400/1200/300 A -@print{} barfly 555-7685 1200/300 A -@print{} bites 555-1675 2400/1200/300 A -@print{} camelot 555-0542 300 C -@print{} core 555-2912 1200/300 C -@print{} sdace 555-3430 2400/1200/300 A +$ @kbd{awk '! 
/li/' mail-list} +@print{} Anthony 555-3412 anthony.asserturo@@hotmail.com A +@print{} Becky 555-7685 becky.algebrarum@@gmail.com A +@print{} Bill 555-1675 bill.drowning@@hotmail.com A +@print{} Camilla 555-2912 camilla.infusarum@@skynet.be R +@print{} Fabius 555-1234 fabius.undevicesimus@@ucb.edu F +@print{} Martin 555-6480 martin.codicibus@@hotmail.com A +@print{} Jean-Paul 555-2127 jeanpaul.campanorum@@nyu.edu R @end example @cindex @code{BEGIN} pattern, Boolean patterns and @@ -11496,7 +12314,7 @@ input record. When a record matches @var{begpat}, the range pattern is @dfn{turned on} and the range pattern matches this record as well. As long as the range pattern stays turned on, it automatically matches every input record read. The range pattern also matches @var{endpat} against every -input record; when this succeeds, the range pattern is turned off again +input record; when this succeeds, the range pattern is @dfn{turned off} again for the following record. Then the range pattern goes back to checking @var{begpat} against each record. @@ -11556,6 +12374,11 @@ $ @kbd{echo Yes | gawk '(/1/,/2/) || /Yes/'} @error{} gawk: cmd. line:1: ^ syntax error @end example +@cindex range patterns, line continuation and +As a minor point of interest, although it is poor style, +POSIX allows you to put a newline after the comma in +a range pattern. @value{DARKCORNER} + @node BEGIN/END @subsection The @code{BEGIN} and @code{END} Special Patterns @@ -11580,28 +12403,30 @@ programmers. @node Using BEGIN/END @subsubsection Startup and Cleanup Actions +@cindex @code{BEGIN} pattern +@cindex @code{END} pattern A @code{BEGIN} rule is executed once only, before the first input record is read. Likewise, an @code{END} rule is executed once only, after all the input is read. For example: @example $ @kbd{awk '} -> @kbd{BEGIN @{ print "Analysis of \"foo\"" @}} -> @kbd{/foo/ @{ ++n @}} -> @kbd{END @{ print "\"foo\" appears", n, "times." 
@}' BBS-list} -@print{} Analysis of "foo" -@print{} "foo" appears 4 times. +> @kbd{BEGIN @{ print "Analysis of \"li\"" @}} +> @kbd{/li/ @{ ++n @}} +> @kbd{END @{ print "\"li\" appears in", n, "records." @}' mail-list} +@print{} Analysis of "li" +@print{} "li" appears in 4 records. @end example @cindex @code{BEGIN} pattern, operators and @cindex @code{END} pattern, operators and -This program finds the number of records in the input file @file{BBS-list} -that contain the string @samp{foo}. The @code{BEGIN} rule prints a title +This program finds the number of records in the input file @file{mail-list} +that contain the string @samp{li}. The @code{BEGIN} rule prints a title for the report. There is no need to use the @code{BEGIN} rule to initialize the counter @code{n} to zero, since @command{awk} does this automatically (@pxref{Variables}). The second rule increments the variable @code{n} every time a -record containing the pattern @samp{foo} is read. The @code{END} rule +record containing the pattern @samp{li} is read. The @code{END} rule prints the value of @code{n} at the end of the run. The special patterns @code{BEGIN} and @code{END} cannot be used in ranges @@ -11643,7 +12468,7 @@ rule checks the @code{FNR} and @code{NR} variables. @subsubsection Input/Output from @code{BEGIN} and @code{END} Rules @cindex input/output, from @code{BEGIN} and @code{END} -There are several (sometimes subtle) points to remember when doing I/O +There are several (sometimes subtle) points to be aware of when doing I/O from a @code{BEGIN} or @code{END} rule. The first has to do with the value of @code{$0} in a @code{BEGIN} rule. Because @code{BEGIN} rules are executed before any input is read, @@ -11654,6 +12479,7 @@ to give @code{$0} a real value is to execute a @code{getline} command without a variable (@pxref{Getline}). Another way is simply to assign a value to @code{$0}. 
+@cindex Brian Kernighan's @command{awk} @cindex differences in @command{awk} and @command{gawk}, @code{BEGIN}/@code{END} patterns @cindex POSIX @command{awk}, @code{BEGIN}/@code{END} patterns @cindex @code{print} statement, @code{BEGIN}/@code{END} patterns and @@ -11703,8 +12529,19 @@ This @value{SECTION} describes a @command{gawk}-specific feature. Two special kinds of rule, @code{BEGINFILE} and @code{ENDFILE}, give you ``hooks'' into @command{gawk}'s command-line file processing loop. -As with the @code{BEGIN} and @code{END} rules (@pxref{BEGIN/END}), all -@code{BEGINFILE} rules in a program are merged, in the order they are +As with the @code{BEGIN} and @code{END} rules +@ifnottex +@ifnotdocbook +(@pxref{BEGIN/END}), +@end ifnotdocbook +@end ifnottex +@iftex +(see the previous section), +@end iftex +@ifdocbook +(see the previous section), +@end ifdocbook +all @code{BEGINFILE} rules in a program are merged, in the order they are read by @command{gawk}, and all @code{ENDFILE} rules are merged as well. The body of the @code{BEGINFILE} rules is executed just before @@ -11714,7 +12551,7 @@ is set to the name of the current file, and @code{FNR} is set to zero. The @code{BEGINFILE} rule provides you the opportunity to accomplish two tasks that would otherwise be difficult or impossible to perform: -@itemize @bullet +@itemize @value{BULLET} @item You can test if the file is readable. Normally, it is a fatal error if a file named on the command line cannot be opened for reading. However, @@ -11722,7 +12559,7 @@ you can bypass the fatal error and move on to the next file on the command line. @cindex @command{gawk}, @code{ERRNO} variable in -@cindex @code{ERRNO} variable +@cindex @code{ERRNO} variable, with @code{BEGINFILE} pattern @cindex @code{nextfile} statement, @code{BEGINFILE}/@code{ENDFILE} patterns and You do this by checking if the @code{ERRNO} variable is not the empty string; if so, then @command{gawk} was not able to open the file. 
In @@ -11732,10 +12569,11 @@ the file entirely. Otherwise, @command{gawk} exits with the usual fatal error. @item -If you have written extensions that modify the record handling (by inserting -an ``input parser''), you can invoke them at this point, before @command{gawk} -has started processing the file. (This is a @emph{very} advanced feature, -currently used only by the @uref{http://gawkextlib.sourceforge.net, @code{gawkextlib} project}.) +If you have written extensions that modify the record handling (by +inserting an ``input parser,'' @pxref{Input Parsers}), you can invoke +them at this point, before @command{gawk} has started processing the file. +(This is a @emph{very} advanced feature, currently used only by the +@uref{http://gawkextlib.sourceforge.net, @code{gawkextlib} project}.) @end itemize The @code{ENDFILE} rule is called when @command{gawk} has finished processing @@ -11757,14 +12595,14 @@ statement (@pxref{Nextfile Statement}) is allowed only inside a @cindex @code{getline} statement, @code{BEGINFILE}/@code{ENDFILE} patterns and The @code{getline} statement (@pxref{Getline}) is restricted inside -both @code{BEGINFILE} and @code{ENDFILE}. Only the @samp{getline -@var{variable} < @var{file}} form is allowed. +both @code{BEGINFILE} and @code{ENDFILE}: only redirected +forms of @code{getline} are allowed. @code{BEGINFILE} and @code{ENDFILE} are @command{gawk} extensions. In most other @command{awk} implementations, or if @command{gawk} is in compatibility mode (@pxref{Options}), they are not special. -@c FIXME: For 4.1 maybe deal with this? +@c FIXME: For 4.2 maybe deal with this? @ignore Date: Tue, 17 May 2011 02:06:10 PDT From: rankin@pactechdata.com (Pat Rankin) @@ -11795,7 +12633,7 @@ An empty (i.e., nonexistent) pattern is considered to match @emph{every} input record. 
For example, the program: @example -awk '@{ print $1 @}' BBS-list +awk '@{ print $1 @}' mail-list @end example @noindent @@ -11818,7 +12656,7 @@ into the body of the @command{awk} program. @cindex shells, quoting The most common method is to use shell quoting to substitute the variable's value into the program inside the script. -For example, in the following program: +For example, consider the following program: @example printf "Enter search pattern: " @@ -11828,7 +12666,7 @@ awk "/$pattern/ "'@{ nmatches++ @} @end example @noindent -the @command{awk} program consists of two pieces of quoted text +The @command{awk} program consists of two pieces of quoted text that are concatenated together to form the program. The first part is double-quoted, which allows substitution of the @code{pattern} shell variable inside the quotes. @@ -11842,8 +12680,8 @@ match up the quotes when reading the program. A better method is to use @command{awk}'s variable assignment feature (@pxref{Assignment Options}) -to assign the shell variable's value to an @command{awk} variable's -value. Then use dynamic regexps to match the pattern +to assign the shell variable's value to an @command{awk} variable. +Then use dynamic regexps to match the pattern (@pxref{Computed Regexps}). The following shows how to redo the previous example using this technique: @@ -11881,13 +12719,13 @@ both) may be omitted. The purpose of the @dfn{action} is to tell @command{awk} what to do once a match for the pattern is found. 
Thus, in outline, an @command{awk} program generally looks like this: -@example -@r{[}@var{pattern}@r{]} @{ @var{action} @} - @var{pattern} @r{[}@{ @var{action} @}@r{]} +@display +[@var{pattern}] @code{@{ @var{action} @}} + @var{pattern} [@code{@{ @var{action} @}}] @dots{} -function @var{name}(@var{args}) @{ @dots{} @} +@code{function @var{name}(@var{args}) @{ @dots{} @}} @dots{} -@end example +@end display @cindex @code{@{@}} (braces), actions and @cindex braces (@code{@{@}}), actions and @@ -11896,11 +12734,11 @@ function @var{name}(@var{args}) @{ @dots{} @} @cindex @code{;} (semicolon), separating statements in actions @cindex semicolon (@code{;}), separating statements in actions An action consists of one or more @command{awk} @dfn{statements}, enclosed -in curly braces (@samp{@{@dots{}@}}). Each statement specifies one +in braces (@samp{@{@r{@dots{}}@}}). Each statement specifies one thing to do. The statements are separated by newlines or semicolons. -The curly braces around an action must be used even if the action +The braces around an action must be used even if the action contains only one statement, or if it contains no statements at -all. However, if you omit the action entirely, omit the curly braces as +all. However, if you omit the action entirely, omit the braces as well. An omitted action is equivalent to @samp{@{ print $0 @}}: @example @@ -11926,10 +12764,9 @@ programs. The @command{awk} language gives you C-like constructs special ones (@pxref{Statements}). @item Compound statements -Consist of one or more statements enclosed in -curly braces. A compound statement is used in order to put several -statements together in the body of an @code{if}, @code{while}, @code{do}, -or @code{for} statement. +Enclose one or more statements in braces. A compound statement +is used in order to put several statements together in the body of an +@code{if}, @code{while}, @code{do}, or @code{for} statement. 
@item Input statements Use the @code{getline} command @@ -11975,7 +12812,7 @@ Many control statements contain other statements. For example, the @code{if} statement contains another statement that may or may not be executed. The contained statement is called the @dfn{body}. To include more than one statement in the body, group them into a -single @dfn{compound statement} with curly braces, separating them with +single @dfn{compound statement} with braces, separating them with newlines or semicolons. @menu @@ -12003,9 +12840,9 @@ newlines or semicolons. The @code{if}-@code{else} statement is @command{awk}'s decision-making statement. It looks like this: -@example -if (@var{condition}) @var{then-body} @r{[}else @var{else-body}@r{]} -@end example +@display +@code{if (@var{condition}) @var{then-body}} [@code{else @var{else-body}}] +@end display @noindent The @var{condition} is an expression that controls what the rest of the @@ -12029,7 +12866,7 @@ if the value of @code{x} is evenly divisible by two), then the first statement is executed. If the @code{else} keyword appears on the same line as @var{then-body} and @var{then-body} is not a compound statement (i.e., not surrounded by -curly braces), then a semicolon must separate @var{then-body} from +braces), then a semicolon must separate @var{then-body} from the @code{else}. To illustrate this, the previous example can be rewritten as: @@ -12048,6 +12885,7 @@ the first thing on its line. @subsection The @code{while} Statement @cindex @code{while} statement @cindex loops +@cindex loops, @code{while} @cindex loops, See Also @code{while} statement In programming, a @dfn{loop} is a part of a program that can @@ -12108,6 +12946,7 @@ program is harder to read without it. @node Do Statement @subsection The @code{do}-@code{while} Statement @cindex @code{do}-@code{while} statement +@cindex loops, @code{do}-@code{while} The @code{do} loop is a variation of the @code{while} looping statement. 
The @code{do} loop executes the @var{body} once and then repeats the @@ -12153,6 +12992,7 @@ occasionally is there a real use for a @code{do} statement. @node For Statement @subsection The @code{for} Statement @cindex @code{for} statement +@cindex loops, @code{for}, iterative The @code{for} statement makes it more convenient to count iterations of a loop. The general form of the @code{for} statement looks like this: @@ -12259,6 +13099,10 @@ for more information on this version of the @code{for} loop. @cindex @code{case} keyword @cindex @code{default} keyword +This @value{SECTION} describes a @command{gawk}-specific feature. +If @command{gawk} is in compatibility mode (@pxref{Options}), +it is not available. + The @code{switch} statement allows the evaluation of an expression and the execution of statements based on a @code{case} match. Case statements are checked for a match in the order they are defined. If no suitable @@ -12314,15 +13158,11 @@ the @code{print} statement is executed and then falls through into the the @minus{}1 case will also be executed since the @code{default} does not halt execution. -This @code{switch} statement is a @command{gawk} extension. -If @command{gawk} is in compatibility mode -(@pxref{Options}), -it is not available. - @node Break Statement @subsection The @code{break} Statement @cindex @code{break} statement @cindex loops, exiting +@cindex loops, @code{break} statement and The @code{break} statement jumps out of the innermost @code{for}, @code{while}, or @code{do} loop that encloses it. 
The following example @@ -12332,15 +13172,15 @@ numbers: @example # find smallest divisor of num @{ - num = $1 - for (div = 2; div * div <= num; div++) @{ - if (num % div == 0) - break - @} - if (num % div == 0) - printf "Smallest divisor of %d is %d\n", num, div - else - printf "%d is prime\n", num + num = $1 + for (div = 2; div * div <= num; div++) @{ + if (num % div == 0) + break + @} + if (num % div == 0) + printf "Smallest divisor of %d is %d\n", num, div + else + printf "%d is prime\n", num @} @end example @@ -12358,17 +13198,17 @@ an @code{if}: @example # find smallest divisor of num @{ - num = $1 - for (div = 2; ; div++) @{ - if (num % div == 0) @{ - printf "Smallest divisor of %d is %d\n", num, div - break - @} - if (div * div > num) @{ - printf "%d is prime\n", num - break + num = $1 + for (div = 2; ; div++) @{ + if (num % div == 0) @{ + printf "Smallest divisor of %d is %d\n", num, div + break + @} + if (div * div > num) @{ + printf "%d is prime\n", num + break + @} @} - @} @} @end example @@ -12382,6 +13222,7 @@ This is discussed in @ref{Switch Statement}. @cindex POSIX @command{awk}, @code{break} statement and @cindex dark corner, @code{break} statement @cindex @command{gawk}, @code{break} statement in +@cindex Brian Kernighan's @command{awk} The @code{break} statement has no meaning when used outside the body of a loop or @code{switch}. However, although it was never documented, @@ -12446,6 +13287,7 @@ This program loops forever once @code{x} reaches 5. @cindex POSIX @command{awk}, @code{continue} statement and @cindex dark corner, @code{continue} statement @cindex @command{gawk}, @code{continue} statement in +@cindex Brian Kernighan's @command{awk} The @code{continue} statement has no special meaning with respect to the @code{switch} statement, nor does it have any meaning when used outside the body of a loop. 
Historical versions of @command{awk} treated a @code{continue} @@ -12515,16 +13357,14 @@ The @code{next} statement is not allowed inside @code{BEGINFILE} and @cindex POSIX @command{awk}, @code{next}/@code{nextfile} statements and @cindex @code{next} statement, user-defined functions and @cindex functions, user-defined, @code{next}/@code{nextfile} statements and -According to the POSIX standard, the behavior is undefined if -the @code{next} statement is used in a @code{BEGIN} or @code{END} rule. -@command{gawk} treats it as a syntax error. -Although POSIX permits it, -some other @command{awk} implementations don't allow the @code{next} -statement inside function bodies -(@pxref{User-defined}). -Just as with any other @code{next} statement, a @code{next} statement inside a -function body reads the next record and starts processing it with the -first rule in the program. +According to the POSIX standard, the behavior is undefined if the +@code{next} statement is used in a @code{BEGIN} or @code{END} rule. +@command{gawk} treats it as a syntax error. Although POSIX permits it, +most other @command{awk} implementations don't allow the @code{next} +statement inside function bodies (@pxref{User-defined}). Just as with any +other @code{next} statement, a @code{next} statement inside a function +body reads the next record and starts processing it with the first rule +in the program. @node Nextfile Statement @subsection The @code{nextfile} Statement @@ -12534,11 +13374,11 @@ The @code{nextfile} statement is similar to the @code{next} statement. However, instead of abandoning processing of the current record, the @code{nextfile} statement instructs @command{awk} to stop processing the -current data file. +current @value{DF}. 
Upon execution of the @code{nextfile} statement, @code{FILENAME} is -updated to the name of the next data file listed on the command line, +updated to the name of the next @value{DF} listed on the command line, @code{FNR} is reset to one, and processing starts over with the first rule in the program. @@ -12547,10 +13387,10 @@ then the code in any @code{END} rules is executed. An exception to this is when @code{nextfile} is invoked during execution of any statement in an @code{END} rule; In this case, it causes the program to stop immediately. @xref{BEGIN/END}. -The @code{nextfile} statement is useful when there are many data files +The @code{nextfile} statement is useful when there are many @value{DF}s to process but it isn't necessary to process every record in every file. Without @code{nextfile}, -in order to move on to the next data file, a program +in order to move on to the next @value{DF}, a program would have to continue scanning the unwanted records. The @code{nextfile} statement accomplishes this much more efficiently. @@ -12583,8 +13423,10 @@ See @uref{http://austingroupbugs.net/view.php?id=607, the Austin Group website}. @cindex functions, user-defined, @code{next}/@code{nextfile} statements and @cindex @code{nextfile} statement, user-defined functions and -The current version of the Brian Kernighan's @command{awk} (@pxref{Other -Versions}) also supports @code{nextfile}. However, it doesn't allow the +@cindex Brian Kernighan's @command{awk} +@cindex @command{mawk} utility +The current version of the Brian Kernighan's @command{awk}, and @command{mawk} (@pxref{Other +Versions}) also support @code{nextfile}. However, they don't allow the @code{nextfile} statement inside function bodies (@pxref{User-defined}). 
@command{gawk} does; a @code{nextfile} inside a function body reads the next record and starts processing it with the first rule in the program, @@ -12598,9 +13440,9 @@ The @code{exit} statement causes @command{awk} to immediately stop executing the current rule and to stop processing input; any remaining input is ignored. The @code{exit} statement is written as follows: -@example -exit @r{[}@var{return code}@r{]} -@end example +@display +@code{exit} [@var{return code}] +@end display @cindex @code{BEGIN} pattern, @code{exit} statement and @cindex @code{END} pattern, @code{exit} statement and @@ -12633,8 +13475,7 @@ status code for the @command{awk} process. If no argument is supplied, In the case where an argument is supplied to a first @code{exit} statement, and then @code{exit} is called a second time from an @code{END} rule with no argument, -@command{awk} uses the previously supplied exit value. -@value{DARKCORNER} +@command{awk} uses the previously supplied exit value. @value{DARKCORNER} @xref{Exit Status}, for more information. @cindex programming conventions, @code{exit} statement @@ -12646,12 +13487,12 @@ in the following example: @example BEGIN @{ - if (("date" | getline date_now) <= 0) @{ - print "Can't get system date" > "/dev/stderr" - exit 1 - @} - print "current date is", date_now - close("date") + if (("date" | getline date_now) <= 0) @{ + print "Can't get system date" > "/dev/stderr" + exit 1 + @} + print "current date is", date_now + close("date") @} @end example @@ -12682,9 +13523,9 @@ automatically by @command{awk}, so that they carry information from the internal workings of @command{awk} to your program. @cindex @command{gawk}, built-in variables and -This @value{SECTION} documents all the built-in variables of -@command{gawk}, most of which are also documented in the chapters -describing their areas of activity. 
+This @value{SECTION} documents all of @command{gawk}'s built-in variables, +most of which are also documented in the @value{CHAPTER}s describing +their areas of activity. @menu * User-modified:: Built-in variables that you change to control @@ -12702,44 +13543,38 @@ describing their areas of activity. @cindex user-modifiable variables The following is an alphabetical list of variables that you can change to -control how @command{awk} does certain things. The variables that are -specific to @command{gawk} are marked with a pound sign@w{ (@samp{#}).} +control how @command{awk} does certain things. + +The variables that are specific to @command{gawk} are marked with a pound +sign (@samp{#}). These variables are @command{gawk} extensions. In other +@command{awk} implementations or if @command{gawk} is in compatibility +mode (@pxref{Options}), they are not special. (Any exceptions are noted +in the description of each variable.) @table @code @cindex @code{BINMODE} variable @cindex binary input/output @cindex input/output, binary -@item BINMODE # -On non-POSIX systems, this variable specifies use of binary mode for all I/O. -Numeric values of one, two, or three specify that input files, output files, or -all files, respectively, should use binary I/O. -A numeric value less than zero is treated as zero, and a numeric value greater than -three is treated as three. -Alternatively, -string values of @code{"r"} or @code{"w"} specify that input files and -output files, respectively, should use binary I/O. -A string value of @code{"rw"} or @code{"wr"} indicates that all -files should use binary I/O. -Any other string value is treated the same as @code{"rw"}, -but causes @command{gawk} -to generate a warning message. -@code{BINMODE} is described in more detail in -@ref{PC Using}. - @cindex differences in @command{awk} and @command{gawk}, @code{BINMODE} variable -This variable is a @command{gawk} extension. 
-In other @command{awk} implementations -(except @command{mawk}, -@pxref{Other Versions}), -or if @command{gawk} is in compatibility mode -(@pxref{Options}), -it is not special. +@item BINMODE # +On non-POSIX systems, this variable specifies use of binary mode +for all I/O. Numeric values of one, two, or three specify that input +files, output files, or all files, respectively, should use binary I/O. +A numeric value less than zero is treated as zero, and a numeric value +greater than three is treated as three. Alternatively, string values +of @code{"r"} or @code{"w"} specify that input files and output files, +respectively, should use binary I/O. A string value of @code{"rw"} or +@code{"wr"} indicates that all files should use binary I/O. Any other +string value is treated the same as @code{"rw"}, but causes @command{gawk} +to generate a warning message. @code{BINMODE} is described in more +detail in @ref{PC Using}. @command{mawk} (@pxref{Other Versions}) +also supports this variable, but only using numeric values. @cindex @code{CONVFMT} variable @cindex POSIX @command{awk}, @code{CONVFMT} variable and @cindex numbers, converting, to strings @cindex strings, converting, numbers to -@item CONVFMT +@item @code{CONVFMT} This string controls conversion of numbers to strings (@pxref{Conversion}). It works by being passed, in effect, as the first argument to the @@ -12754,40 +13589,29 @@ Its default value is @code{"%.6g"}. @cindex field separators, @code{FIELDWIDTHS} variable and @cindex separators, field, @code{FIELDWIDTHS} variable and @item FIELDWIDTHS # -This is a space-separated list of columns that tells @command{gawk} +A space-separated list of columns that tells @command{gawk} how to split input with fixed columnar boundaries. Assigning a value to @code{FIELDWIDTHS} overrides the use of @code{FS} and @code{FPAT} for field splitting. @xref{Constant Size}, for more information. 
-If @command{gawk} is in compatibility mode -(@pxref{Options}), then @code{FIELDWIDTHS} -has no special meaning, and field-splitting operations occur based -exclusively on the value of @code{FS}. - @cindex @command{gawk}, @code{FPAT} variable in @cindex @code{FPAT} variable @cindex differences in @command{awk} and @command{gawk}, @code{FPAT} variable @cindex field separators, @code{FPAT} variable and @cindex separators, field, @code{FPAT} variable and @item FPAT # -This is a regular expression (as a string) that tells @command{gawk} +A regular expression (as a string) that tells @command{gawk} to create the fields based on text that matches the regular expression. Assigning a value to @code{FPAT} overrides the use of @code{FS} and @code{FIELDWIDTHS} for field splitting. @xref{Splitting By Content}, for more information. -If @command{gawk} is in compatibility mode -(@pxref{Options}), then @code{FPAT} -has no special meaning, and field-splitting operations occur based -exclusively on the value of @code{FS}. - @cindex @code{FS} variable @cindex separators, field @cindex field separators @item FS -This is the input field separator -(@pxref{Field Separators}). +The input field separator (@pxref{Field Separators}). The value is a single-character string or a multicharacter regular expression that matches the separations between fields in an input record. If the value is the null string (@code{""}), then each @@ -12821,8 +13645,8 @@ is to simply say @samp{FS = FS}, perhaps with an explanatory comment. 
@cindex @command{gawk}, @code{IGNORECASE} variable in @cindex @code{IGNORECASE} variable @cindex differences in @command{awk} and @command{gawk}, @code{IGNORECASE} variable -@cindex case sensitivity, string comparisons and -@cindex case sensitivity, regexps and +@cindex case sensitivity, and string comparisons +@cindex case sensitivity, and regexps @cindex regular expressions, case sensitivity @item IGNORECASE # If @code{IGNORECASE} is nonzero or non-null, then all string comparisons @@ -12837,18 +13661,13 @@ and it does not affect field splitting when using a single-character field separator. @xref{Case-sensitivity}. -If @command{gawk} is in compatibility mode -(@pxref{Options}), -then @code{IGNORECASE} has no special meaning. Thus, string -and regexp operations are always case-sensitive. - @cindex @command{gawk}, @code{LINT} variable in @cindex @code{LINT} variable @cindex differences in @command{awk} and @command{gawk}, @code{LINT} variable @cindex lint checking @item LINT # When this variable is true (nonzero or non-null), @command{gawk} -behaves as if the @option{--lint} command-line option is in effect. +behaves as if the @option{--lint} command-line option is in effect (@pxref{Options}). With a value of @code{"fatal"}, lint warnings become fatal errors. With a value of @code{"invalid"}, only warnings about things that are @@ -12869,7 +13688,7 @@ of @command{awk} being executed. @cindex numbers, converting, to strings @cindex strings, converting, numbers to @item OFMT -This string controls conversion of numbers to +Controls conversion of numbers to strings (@pxref{Conversion}) for printing with the @code{print} statement. It works by being passed as the first argument to the @code{sprintf()} function @@ -12890,27 +13709,26 @@ default value is @w{@code{" "}}, a string consisting of a single space. @cindex @code{ORS} variable @item ORS -This is the output record separator. It is output at the end of every +The output record separator. 
It is output at the end of every @code{print} statement. Its default value is @code{"\n"}, the newline character. (@xref{Output Separators}.) @cindex @code{PREC} variable @item PREC # The working precision of arbitrary precision floating-point numbers, -53 bits by default (@pxref{Setting Precision}). +53 bits by default (@pxref{Setting precision}). @cindex @code{ROUNDMODE} variable @item ROUNDMODE # The rounding mode to use for arbitrary precision arithmetic on numbers, by default @code{"N"} (@samp{roundTiesToEven} in -the IEEE-754 standard) -(@pxref{Setting Rounding Mode}). +the IEEE 754 standard; @pxref{Setting the rounding mode}). @cindex @code{RS} variable @cindex separators, for records @cindex record separators -@item RS -This is @command{awk}'s input record separator. Its default value is a string +@item @code{RS} +The input record separator. Its default value is a string containing a single newline character, which means that an input record consists of a single line of text. It can also be the null string, in which case records are separated by @@ -12929,8 +13747,8 @@ just the first character of @code{RS}'s value is used. @cindex @code{SUBSEP} variable @cindex separators, subscript @cindex subscript separators -@item SUBSEP -This is the subscript separator. It has the default value of +@item @code{SUBSEP} +The subscript separator. It has the default value of @code{"\034"} and is used to separate the parts of the indices of a multidimensional array. Thus, the expression @code{@w{foo["A", "B"]}} really accesses @code{foo["A\034B"]} @@ -12941,18 +13759,12 @@ really accesses @code{foo["A\034B"]} @cindex differences in @command{awk} and @command{gawk}, @code{TEXTDOMAIN} variable @cindex internationalization, localization @item TEXTDOMAIN # -This variable is used for internationalization of programs at the +Used for internationalization of programs at the @command{awk} level. 
It sets the default text domain for specially marked string constants in the source text, as well as for the @code{dcgettext()}, @code{dcngettext()} and @code{bindtextdomain()} functions (@pxref{Internationalization}). The default value of @code{TEXTDOMAIN} is @code{"messages"}. - -This variable is a @command{gawk} extension. -In other @command{awk} implementations, -or if @command{gawk} is in compatibility mode -(@pxref{Options}), -it is not special. @end table @c ENDOFRANGE bvar @c ENDOFRANGE varb @@ -12968,14 +13780,19 @@ it is not special. @cindex variables, built-in, conveying information The following is an alphabetical list of variables that @command{awk} sets automatically on certain occasions in order to provide -information to your program. The variables that are specific to -@command{gawk} are marked with a pound sign@w{ (@samp{#}).} +information to your program. -@table @code +The variables that are specific to @command{gawk} are marked with a pound +sign (@samp{#}). These variables are @command{gawk} extensions. In other +@command{awk} implementations or if @command{gawk} is in compatibility +mode (@pxref{Options}), they are not special. + +@c @asis for docbook +@table @asis @cindex @code{ARGC}/@code{ARGV} variables @cindex arguments, command-line @cindex command line, arguments -@item ARGC@r{,} ARGV +@item @code{ARGC}, @code{ARGV} The command-line arguments available to @command{awk} programs are stored in an array called @code{ARGV}. @code{ARGC} is the number of command-line arguments present. @xref{Other Arguments}. 
@@ -12987,16 +13804,16 @@ In the following example: $ @kbd{awk 'BEGIN @{} > @kbd{for (i = 0; i < ARGC; i++)} > @kbd{print ARGV[i]} -> @kbd{@}' inventory-shipped BBS-list} +> @kbd{@}' inventory-shipped mail-list} @print{} awk @print{} inventory-shipped -@print{} BBS-list +@print{} mail-list @end example @noindent @code{ARGV[0]} contains @samp{awk}, @code{ARGV[1]} contains @samp{inventory-shipped}, and @code{ARGV[2]} contains -@samp{BBS-list}. The value of @code{ARGC} is three, one more than the +@samp{mail-list}. The value of @code{ARGC} is three, one more than the index of the last element in @code{ARGV}, because the elements are numbered from zero. @@ -13015,36 +13832,30 @@ about how @command{awk} uses these variables. @cindex @code{ARGIND} variable @cindex differences in @command{awk} and @command{gawk}, @code{ARGIND} variable -@item ARGIND # +@item @code{ARGIND #} The index in @code{ARGV} of the current file being processed. -Every time @command{gawk} opens a new data file for processing, it sets -@code{ARGIND} to the index in @code{ARGV} of the file name. +Every time @command{gawk} opens a new @value{DF} for processing, it sets +@code{ARGIND} to the index in @code{ARGV} of the @value{FN}. When @command{gawk} is processing the input files, @samp{FILENAME == ARGV[ARGIND]} is always true. @cindex files, processing@comma{} @code{ARGIND} variable and This variable is useful in file processing; it allows you to tell how far -along you are in the list of data files as well as to distinguish between -successive instances of the same file name on the command line. +along you are in the list of @value{DF}s as well as to distinguish between +successive instances of the same @value{FN} on the command line. @cindex file names, distinguishing While you can change the value of @code{ARGIND} within your @command{awk} program, @command{gawk} automatically sets it to a new value when the next file is opened. -This variable is a @command{gawk} extension. 
-In other @command{awk} implementations, -or if @command{gawk} is in compatibility mode -(@pxref{Options}), -it is not special. - @cindex @code{ENVIRON} array -@cindex environment variables -@item ENVIRON +@cindex environment variables, in @code{ENVIRON} array +@item @code{ENVIRON} An associative array containing the values of the environment. The array indices are the environment variable names; the elements are the values of the particular environment variables. For example, -@code{ENVIRON["HOME"]} might be @file{/home/arnold}. +@code{ENVIRON["HOME"]} might be @code{/home/arnold}. For POSIX @command{awk}, changing this array does not affect the environment passed on to any programs that @command{awk} may spawn via @@ -13059,80 +13870,60 @@ executable programs. Some operating systems may not have environment variables. On such systems, the @code{ENVIRON} array is empty (except for -@w{@code{ENVIRON["AWKPATH"]}}, -@pxref{AWKPATH Variable} and -@w{@code{ENVIRON["AWKLIBPATH"]}}, +@w{@code{ENVIRON["AWKPATH"]}} and +@w{@code{ENVIRON["AWKLIBPATH"]}}; +@pxref{AWKPATH Variable}, and @pxref{AWKLIBPATH Variable}). @cindex @command{gawk}, @code{ERRNO} variable in @cindex @code{ERRNO} variable @cindex differences in @command{awk} and @command{gawk}, @code{ERRNO} variable @cindex error handling, @code{ERRNO} variable and -@item ERRNO # -If a system error occurs during a redirection for @code{getline}, -during a read for @code{getline}, or during a @code{close()} operation, -then @code{ERRNO} contains a string describing the error. - -In addition, @command{gawk} clears @code{ERRNO} -before opening each command-line input file. This enables checking if -the file is readable inside a @code{BEGINFILE} pattern (@pxref{BEGINFILE/ENDFILE}). - -Otherwise, -@code{ERRNO} works similarly to the C variable @code{errno}. -Except for the case just mentioned, -@command{gawk} @emph{never} clears it (sets it -to zero or @code{""}). 
Thus, you should only expect its value -to be meaningful when an I/O operation returns a failure -value, such as @code{getline} returning @minus{}1. -You are, of course, free to clear it yourself before doing an -I/O operation. - -This variable is a @command{gawk} extension. -In other @command{awk} implementations, -or if @command{gawk} is in compatibility mode -(@pxref{Options}), -it is not special. +@item @code{ERRNO #} +If a system error occurs during a redirection for @code{getline}, during +a read for @code{getline}, or during a @code{close()} operation, then +@code{ERRNO} contains a string describing the error. + +In addition, @command{gawk} clears @code{ERRNO} before opening each +command-line input file. This enables checking if the file is readable +inside a @code{BEGINFILE} pattern (@pxref{BEGINFILE/ENDFILE}). + +Otherwise, @code{ERRNO} works similarly to the C variable @code{errno}. +Except for the case just mentioned, @command{gawk} @emph{never} clears +it (sets it to zero or @code{""}). Thus, you should only expect its +value to be meaningful when an I/O operation returns a failure value, +such as @code{getline} returning @minus{}1. You are, of course, free +to clear it yourself before doing an I/O operation. @cindex @code{FILENAME} variable @cindex dark corner, @code{FILENAME} variable -@item FILENAME -The name of the file that @command{awk} is currently reading. -When no data files are listed on the command line, @command{awk} reads -from the standard input and @code{FILENAME} is set to @code{"-"}. -@code{FILENAME} is changed each time a new file is read -(@pxref{Reading Files}). -Inside a @code{BEGIN} rule, the value of @code{FILENAME} is -@code{""}, since there are no input files being processed -yet.@footnote{Some early implementations of Unix @command{awk} initialized -@code{FILENAME} to @code{"-"}, even if there were data files to be -processed. 
This behavior was incorrect and should not be relied -upon in your programs.} -@value{DARKCORNER} -Note, though, that using @code{getline} -(@pxref{Getline}) -inside a @code{BEGIN} rule can give -@code{FILENAME} a value. +@item @code{FILENAME} +The name of the current input file. When no @value{DF}s are listed +on the command line, @command{awk} reads from the standard input and +@code{FILENAME} is set to @code{"-"}. @code{FILENAME} changes each +time a new file is read (@pxref{Reading Files}). Inside a @code{BEGIN} +rule, the value of @code{FILENAME} is @code{""}, since there are no input +files being processed yet.@footnote{Some early implementations of Unix +@command{awk} initialized @code{FILENAME} to @code{"-"}, even if there +were @value{DF}s to be processed. This behavior was incorrect and should +not be relied upon in your programs.} @value{DARKCORNER} Note, though, +that using @code{getline} (@pxref{Getline}) inside a @code{BEGIN} rule +can give @code{FILENAME} a value. @cindex @code{FNR} variable -@item FNR +@item @code{FNR} The current record number in the current file. @code{FNR} is incremented each time a new record is read (@pxref{Records}). It is reinitialized to zero each time a new input file is started. @cindex @code{NF} variable -@item NF +@item @code{NF} The number of fields in the current input record. @code{NF} is set each time a new record is read, when a new field is created or when @code{$0} changes (@pxref{Fields}). -Unlike most of the variables described in this -@ifnotinfo -section, -@end ifnotinfo -@ifinfo -node, -@end ifinfo +Unlike most of the variables described in this @value{SUBSECTION}, assigning a value to @code{NF} has the potential to affect @command{awk}'s internal workings. In particular, assignments to @code{NF} can be used to create or remove fields from the @@ -13141,18 +13932,18 @@ current record. @xref{Changing Fields}. 
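The effect of assigning to @code{NF} can be seen with a short sketch. It is illustrative only and not part of the patch; the sample input strings and the choice of @samp{-} for @code{OFS} are assumptions made for the demonstration. Decreasing @code{NF} discards the trailing fields, and increasing it creates empty ones; in both cases @code{$0} is rebuilt using @code{OFS}.

```shell
# Decrease NF: fields after the second are thrown away and $0 is rebuilt.
echo "a b c d" | awk '{ NF = 2; print }'

# Increase NF: the new fields are empty; OFS ("-" here, chosen only for
# visibility) joins the fields when $0 is rebuilt.
echo "a b" | awk 'BEGIN { OFS = "-" } { NF = 4; print }'
```

With @command{gawk}, the first command should print @samp{a b} and the second @samp{a-b--}. Other @command{awk} implementations may differ in how they treat a decreased @code{NF}.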
@cindex @code{FUNCTAB} array @cindex @command{gawk}, @code{FUNCTAB} array in @cindex differences in @command{awk} and @command{gawk}, @code{FUNCTAB} variable -@item FUNCTAB # +@item @code{FUNCTAB #} An array whose indices and corresponding values are the names of all the user-defined or extension functions in the program. @quotation NOTE Attempting to use the @code{delete} statement with the @code{FUNCTAB} -array will cause a fatal error. Any attempt to assign to an element of -the @code{FUNCTAB} array will also cause a fatal error. +array causes a fatal error. Any attempt to assign to an element of +@code{FUNCTAB} also causes a fatal error. @end quotation @cindex @code{NR} variable -@item NR +@item @code{NR} The number of input records @command{awk} has processed since the beginning of the program's execution (@pxref{Records}). @@ -13161,17 +13952,19 @@ the beginning of the program's execution @cindex @command{gawk}, @code{PROCINFO} array in @cindex @code{PROCINFO} array @cindex differences in @command{awk} and @command{gawk}, @code{PROCINFO} array -@item PROCINFO # +@item @code{PROCINFO #} The elements of this array provide access to information about the running @command{awk} program. The following elements (listed alphabetically) are guaranteed to be available: @table @code +@cindex effective group ID of @command{gawk} user @item PROCINFO["egid"] The value of the @code{getegid()} system call. @item PROCINFO["euid"] +@cindex effective user ID of @command{gawk} user The value of the @code{geteuid()} system call. @item PROCINFO["FS"] @@ -13181,6 +13974,7 @@ This is or @code{"FPAT"} if field matching with @code{FPAT} is in effect. @item PROCINFO["identifiers"] +@cindex program identifiers A subarray, indexed by the names of all identifiers used in the text of the AWK program. For each identifier, the value of the element is one of the following: @@ -13209,21 +14003,25 @@ after it has finished parsing the program; they are @emph{not} updated while the program runs. 
@item PROCINFO["gid"] +@cindex group ID of @command{gawk} user The value of the @code{getgid()} system call. @item PROCINFO["pgrpid"] +@cindex process group ID of @command{gawk} process The process group ID of the current process. @item PROCINFO["pid"] +@cindex process ID of @command{gawk} process The process ID of the current process. @item PROCINFO["ppid"] +@cindex parent process ID of @command{gawk} process The parent process ID of the current process. @item PROCINFO["sorted_in"] If this element exists in @code{PROCINFO}, its value controls the order in which array indices will be processed by -@samp{for (index in array) @dots{}} loops. +@samp{for (@var{index} in @var{array})} loops. Since this is an advanced feature, we defer the full description until later; see @ref{Scanning an Array}. @@ -13237,6 +14035,8 @@ Assigning a new value to this element changes the default. The value of the @code{getuid()} system call. @item PROCINFO["version"] +@cindex version of @command{gawk} +@cindex @command{gawk} version The version of @command{gawk}. @end table @@ -13246,16 +14046,20 @@ if your version of @command{gawk} supports arbitrary precision numbers (@pxref{Arbitrary Precision Arithmetic}): @table @code +@cindex version of GNU MPFR library @item PROCINFO["mpfr_version"] The version of the GNU MPFR library. @item PROCINFO["gmp_version"] +@cindex version of GNU MP library The version of the GNU MP library. @item PROCINFO["prec_max"] +@cindex maximum precision supported by MPFR library The maximum precision supported by MPFR. @item PROCINFO["prec_min"] +@cindex minimum precision supported by MPFR library The minimum precision required by MPFR. @end table @@ -13266,12 +14070,15 @@ of @command{gawk} supports dynamic loading of extension functions @table @code @item PROCINFO["api_major"] +@cindex version of @command{gawk} extension API +@cindex extension API, version number The major version of the extension API. 
@item PROCINFO["api_minor"] The minor version of the extension API. @end table +@cindex supplementary groups of @command{gawk} process On some systems, there may be elements in the array, @code{"group1"} through @code{"group@var{N}"} for some @var{N}. @var{N} is the number of supplementary groups that the process has. Use the @code{in} operator @@ -13279,15 +14086,14 @@ to test for these elements (@pxref{Reference to Elements}). @cindex @command{gawk}, @code{PROCINFO} array in -@cindex @code{PROCINFO} array +@cindex @code{PROCINFO} array, uses The @code{PROCINFO} array has the following additional uses: -@itemize @bullet +@itemize @value{BULLET} @item -It may be -used to cause coprocesses -to communicate over pseudo-ttys instead of through two-way pipes; -this is discussed further in @ref{Two-way I/O}. +It may be used to cause coprocesses to communicate over pseudo-ttys +instead of through two-way pipes; this is discussed further in +@ref{Two-way I/O}. @item It may be used to provide a timeout when reading from any @@ -13295,14 +14101,8 @@ open input file, pipe, or coprocess. @xref{Read Timeout}, for more information. @end itemize -This array is a @command{gawk} extension. -In other @command{awk} implementations, -or if @command{gawk} is in compatibility mode -(@pxref{Options}), -it is not special. - @cindex @code{RLENGTH} variable -@item RLENGTH +@item @code{RLENGTH} The length of the substring matched by the @code{match()} function (@pxref{String Functions}). @@ -13310,7 +14110,7 @@ The length of the substring matched by the is the length of the matched string, or @minus{}1 if no match is found. @cindex @code{RSTART} variable -@item RSTART +@item @code{RSTART} The start-index in characters of the substring that is matched by the @code{match()} function (@pxref{String Functions}). @@ -13321,20 +14121,14 @@ if no match was found. 
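As a brief illustration of how @code{RSTART} and @code{RLENGTH} are set by @code{match()}, consider the following sketch. It is not part of the patch, and the string @code{"foobar"} and the two regexps are arbitrary choices for the demonstration.

```shell
awk 'BEGIN {
    # Successful match: "oo" starts at character 2 and is 2 characters long.
    if (match("foobar", /o+/))
        print RSTART, RLENGTH

    # Failed match: RSTART becomes 0 and RLENGTH becomes -1.
    match("foobar", /z/)
    print RSTART, RLENGTH
}'
```

This prints @samp{2 2} followed by @samp{0 -1}, matching the descriptions of the two variables given above.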
@cindex @command{gawk}, @code{RT} variable in @cindex @code{RT} variable @cindex differences in @command{awk} and @command{gawk}, @code{RT} variable -@item RT # -This is set each time a record is read. It contains the input text -that matched the text denoted by @code{RS}, the record separator. - -This variable is a @command{gawk} extension. -In other @command{awk} implementations, -or if @command{gawk} is in compatibility mode -(@pxref{Options}), -it is not special. +@item @code{RT #} +The input text that matched the text denoted by @code{RS}, +the record separator. It is set every time a record is read. @cindex @command{gawk}, @code{SYMTAB} array in @cindex @code{SYMTAB} array @cindex differences in @command{awk} and @command{gawk}, @code{SYMTAB} variable -@item SYMTAB # +@item @code{SYMTAB #} An array whose indices are the names of all currently defined global variables and arrays in the program. The array may be used for indirect access to read or write the value of a variable: @@ -13351,7 +14145,7 @@ if an element in @code{SYMTAB} is an array. Also, you may not use the @code{delete} statement with the @code{SYMTAB} array. -You may use an index for @code{SYMTAB} that is not a predefined identifer: +You may use an index for @code{SYMTAB} that is not a predefined identifier: @example SYMTAB["xxx"] = 5 @@ -13363,6 +14157,7 @@ This works as expected: in this case @code{SYMTAB} acts just like a regular array. The only difference is that you can't then delete @code{SYMTAB["xxx"]}. +@cindex Schorr, Andrew The @code{SYMTAB} array is more interesting than it looks. Andrew Schorr points out that it effectively gives @command{awk} data pointers. 
Consider his example: @@ -13377,8 +14172,8 @@ function multiply(variable, amount) @end example @quotation NOTE -In order to avoid severe time-travel paradoxes@footnote{Not to mention difficult -implementation issues.}, neither @code{FUNCTAB} nor @code{SYMTAB} +In order to avoid severe time-travel paradoxes,@footnote{Not to mention difficult +implementation issues.} neither @code{FUNCTAB} nor @code{SYMTAB} are available as elements within the @code{SYMTAB} array. @end quotation @end table @@ -13419,7 +14214,7 @@ changed. @node ARGC and ARGV @subsection Using @code{ARGC} and @code{ARGV} -@cindex @code{ARGC}/@code{ARGV} variables +@cindex @code{ARGC}/@code{ARGV} variables, how to use @cindex arguments, command-line @cindex command line, arguments @@ -13431,16 +14226,16 @@ and @code{ARGV}: $ @kbd{awk 'BEGIN @{} > @kbd{for (i = 0; i < ARGC; i++)} > @kbd{print ARGV[i]} -> @kbd{@}' inventory-shipped BBS-list} +> @kbd{@}' inventory-shipped mail-list} @print{} awk @print{} inventory-shipped -@print{} BBS-list +@print{} mail-list @end example @noindent In this example, @code{ARGV[0]} contains @samp{awk}, @code{ARGV[1]} contains @samp{inventory-shipped}, and @code{ARGV[2]} contains -@samp{BBS-list}. +@samp{mail-list}. Notice that the @command{awk} program is not entered in @code{ARGV}. The other command-line options, with their arguments, are also not entered. This includes variable assignments done with the @option{-v} @@ -13481,11 +14276,11 @@ additional files to be read. If the value of @code{ARGC} is decreased, that eliminates input files from the end of the list. By recording the old value of @code{ARGC} elsewhere, a program can treat the eliminated arguments as -something other than file names. +something other than @value{FN}s. To eliminate a file from the middle of the list, store the null string (@code{""}) into @code{ARGV} in place of the file's name. 
As a -special feature, @command{awk} ignores file names that have been +special feature, @command{awk} ignores @value{FN}s that have been replaced with the null string. Another option is to use the @code{delete} statement to remove elements from @@ -13544,6 +14339,65 @@ are passed on to the @command{awk} program. (@xref{Getopt Function}, for an @command{awk} library function that parses command-line options.) +@node Pattern Action Summary +@section Summary + +@itemize @value{BULLET} +@item +Pattern-action pairs make up the basic elements of an @command{awk} +program. Patterns are either normal expressions, range expressions, +regexp constants, one of the special keywords @code{BEGIN}, @code{END}, +@code{BEGINFILE}, @code{ENDFILE}, or empty. The action executes if +the current record matches the pattern. Empty (missing) patterns match +all records. + +@item +I/O from @code{BEGIN} and @code{END} rules has certain constraints. +This is also true, only more so, for @code{BEGINFILE} and @code{ENDFILE} +rules. The latter two give you ``hooks'' into @command{gawk}'s file +processing, allowing you to recover from a file that otherwise would +cause a fatal error (such as a file that cannot be opened). + +@item +Shell variables can be used in @command{awk} programs by careful +use of shell quoting. It is easier to pass a shell variable into +@command{awk} by using the @option{-v} option and an @command{awk} +variable. + +@item +Actions consist of statements enclosed in curly braces. Statements +are built up from expressions, control statements, compound statements, +input and output statements, and deletion statements. + +@item +The control statements in @command{awk} are @code{if}-@code{else}, +@code{while}, @code{for}, and @code{do}-@code{while}. @command{gawk} +adds the @code{switch} statement. There are two flavors of @code{for} +statement: one for performing general looping, and the other for iterating +through an array. 
+ +@item +@code{break} and @code{continue} let you exit early or start the next +iteration of a loop (or get out of a @code{switch}). + +@item +@code{next} and @code{nextfile} let you read the next record and start +over at the top of your program, or skip to the next input file and +start over, respectively. + +@item +The @code{exit} statement terminates your program. When executed +from an action (or function body) it transfers control to the +@code{END} statements. From an @code{END} statement body, it exits +immediately. You may pass an optional numeric value to be used +as @command{awk}'s exit status. + +@item +Some built-in variables provide control over @command{awk}, mainly for I/O. +Other variables convey information from @command{awk} to your program. + +@end itemize + @node Arrays @chapter Arrays in @command{awk} @c STARTOFRANGE arrs @@ -13560,11 +14414,11 @@ It also describes how @command{awk} simulates multidimensional arrays, as well as some of the less obvious points about array usage. The @value{CHAPTER} moves on to discuss @command{gawk}'s facility for sorting arrays, and ends with a brief description of @command{gawk}'s -ability to support true multidimensional arrays. +ability to support true arrays of arrays. @cindex variables, names of @cindex functions, names of -@cindex arrays, names of +@cindex arrays, names of, and names of functions/variables @cindex names, arrays/variables @cindex namespace issues @command{awk} maintains a single set @@ -13583,6 +14437,7 @@ same @command{awk} program. * Multidimensional:: Emulating multidimensional arrays in @command{awk}. * Arrays of Arrays:: True multidimensional arrays. +* Arrays Summary:: Summary of arrays. @end menu @node Array Basics @@ -13644,35 +14499,34 @@ the array is declared.) 
A contiguous array of four elements might look like the following example, conceptually, if the element values are 8, @code{"foo"}, -@code{""}, and 30: +@code{""}, and 30 +@ifnotdocbook +as shown in @ref{figure-array-elements}: +@end ifnotdocbook +@ifdocbook +as shown in @inlineraw{docbook, <xref linkend="figure-array-elements"/>}: +@end ifdocbook -@c @strong{FIXME: NEXT ED:} Use real images here, and an @float -@iftex -@c from Karl Berry, much thanks for the help. -@tex -\bigskip % space above the table (about 1 linespace) -\offinterlineskip -\newdimen\width \width = 1.5cm -\newdimen\hwidth \hwidth = 4\width \advance\hwidth by 2pt % 5 * 0.4pt -\centerline{\vbox{ -\halign{\strut\hfil\ignorespaces#&&\vrule#&\hbox to\width{\hfil#\unskip\hfil}\cr -\noalign{\hrule width\hwidth} - &&{\tt 8} &&{\tt "foo"} &&{\tt ""} &&{\tt 30} &&\quad Value\cr -\noalign{\hrule width\hwidth} -\noalign{\smallskip} - &\omit&0&\omit &1 &\omit&2 &\omit&3 &\omit&\quad Index\cr -} -}} -@end tex -@end iftex -@ifnottex -@example -+---------+---------+--------+---------+ -| 8 | "foo" | "" | 30 | @r{Value} -+---------+---------+--------+---------+ - 0 1 2 3 @r{Index} -@end example -@end ifnottex +@ifnotdocbook +@float Figure,figure-array-elements +@caption{A Contiguous Array} +@ifinfo +@center @image{array-elements, , , Basic Program Stages, txt} +@end ifinfo +@ifnotinfo +@center @image{array-elements, , , Basic Program Stages} +@end ifnotinfo +@end float +@end ifnotdocbook + +@docbook +<figure id="figure-array-elements" float="0"> +<title>A Contiguous Array</title> +<mediaobject> +<imageobject role="web"><imagedata fileref="array-elements.png" format="PNG"/></imageobject> +</mediaobject> +</figure> +@end docbook @noindent Only the values are stored; the indices are implicit from the order of @@ -13689,12 +14543,53 @@ Arrays in @command{awk} are different---they are @dfn{associative}. 
This means that each array is a collection of pairs: an index and its corresponding array element value: +@ifnotdocbook @example @r{Index} 3 @r{Value} 30 @r{Index} 1 @r{Value} "foo" @r{Index} 0 @r{Value} 8 @r{Index} 2 @r{Value} "" @end example +@end ifnotdocbook + +@docbook +<informaltable> +<tgroup cols="2"> +<colspec colname="1" align="center"/> +<colspec colname="2" align="center"/> +<thead> +<row> +<entry>Index</entry> +<entry>Value</entry> +</row> +</thead> + +<tbody> +<row> +<entry><literal>3</literal></entry> +<entry><literal>30</literal></entry> +</row> + +<row> +<entry><literal>1</literal></entry> +<entry><literal>"foo"</literal></entry> +</row> + +<row> +<entry><literal>0</literal></entry> +<entry><literal>8</literal></entry> +</row> + +<row> +<entry><literal>2</literal></entry> +<entry><literal>""</literal></entry> +</row> + +</tbody> +</tgroup> +</informaltable> + +@end docbook @noindent The pairs are shown in jumbled order because their order is irrelevant. @@ -13703,6 +14598,7 @@ One advantage of associative arrays is that new pairs can be added at any time. For example, suppose a tenth element is added to the array whose value is @w{@code{"number ten"}}. The result is: +@ifnotdocbook @example @r{Index} 10 @r{Value} "number ten" @r{Index} 3 @r{Value} 30 @@ -13710,6 +14606,51 @@ whose value is @w{@code{"number ten"}}. 
The result is: @r{Index} 0 @r{Value} 8 @r{Index} 2 @r{Value} "" @end example +@end ifnotdocbook + +@docbook +<informaltable> +<tgroup cols="2"> +<colspec colname="1" align="center"/> +<colspec colname="2" align="center"/> +<thead> +<row> +<entry>Index</entry> +<entry>Value</entry> +</row> +</thead> +<tbody> + +<row> +<entry><literal>10</literal></entry> +<entry><literal>"number ten"</literal></entry> +</row> + +<row> +<entry><literal>3</literal></entry> +<entry><literal>30</literal></entry> +</row> + +<row> +<entry><literal>1</literal></entry> +<entry><literal>"foo"</literal></entry> +</row> + +<row> +<entry><literal>0</literal></entry> +<entry><literal>8</literal></entry> +</row> + +<row> +<entry><literal>2</literal></entry> +<entry><literal>""</literal></entry> +</row> + +</tbody> +</tgroup> +</informaltable> + +@end docbook @noindent @cindex sparse arrays @@ -13722,28 +14663,67 @@ have to be positive integers. Any number, or even a string, can be an index. For example, the following is an array that translates words from English to French: +@ifnotdocbook @example @r{Index} "dog" @r{Value} "chien" @r{Index} "cat" @r{Value} "chat" @r{Index} "one" @r{Value} "un" @r{Index} 1 @r{Value} "un" @end example +@end ifnotdocbook + +@docbook +<informaltable> +<tgroup cols="2"> +<colspec colname="1" align="center"/> +<colspec colname="2" align="center"/> +<thead> +<row> +<entry>Index</entry> +<entry>Value</entry> +</row> +</thead> +<tbody> +<row> +<entry><literal>"dog"</literal></entry> +<entry><literal>"chien"</literal></entry> +</row> + +<row> +<entry><literal>"cat"</literal></entry> +<entry><literal>"chat"</literal></entry> +</row> + +<row> +<entry><literal>"one"</literal></entry> +<entry><literal>"un"</literal></entry> +</row> + +<row> +<entry><literal>1</literal></entry> +<entry><literal>"un"</literal></entry> +</row> + +</tbody> +</tgroup> +</informaltable> + +@end docbook @noindent Here we decided to translate the number one in both spelled-out and numeric form---thus 
illustrating that a single array can have both numbers and strings as indices. -In fact, array subscripts are always strings; this is discussed +(In fact, array subscripts are always strings; this is discussed in more detail in -@ref{Numeric Array Subscripts}. +@ref{Numeric Array Subscripts}.) Here, the number @code{1} isn't double-quoted, since @command{awk} automatically converts it to a string. @cindex @command{gawk}, @code{IGNORECASE} variable in -@cindex @code{IGNORECASE} variable @cindex case sensitivity, array indices and -@cindex arrays, @code{IGNORECASE} variable and -@cindex @code{IGNORECASE} variable, array subscripts and +@cindex arrays, and @code{IGNORECASE} variable +@cindex @code{IGNORECASE} variable, and array indices The value of @code{IGNORECASE} has no effect upon array subscripting. The identical string value used to store an array element must be used to retrieve it. @@ -13759,8 +14739,9 @@ is independent of the number of elements in the array. @node Reference to Elements @subsection Referring to an Array Element -@cindex arrays, elements, referencing -@cindex elements in arrays +@cindex arrays, referencing elements +@cindex array members +@cindex elements of arrays The principal way to use an array is to refer to one of its elements. An array reference is an expression as follows: @@ -13777,11 +14758,16 @@ The value of the array reference is the current value of that array element. For example, @code{foo[4.3]} is an expression for the element of array @code{foo} at index @samp{4.3}. +@cindex arrays, unassigned elements +@cindex unassigned array elements +@cindex empty array elements A reference to an array element that has no recorded value yields a value of @code{""}, the null string. This includes elements that have not been assigned any value as well as elements that have been deleted (@pxref{Delete}). 
+@cindex non-existent array elements +@cindex arrays, elements that don't exist @quotation NOTE A reference to an element that does not exist @emph{automatically} creates that array element, with the null string as its value. (In some cases, @@ -13801,19 +14787,19 @@ if it didn't exist before! @end quotation @c @cindex arrays, @code{in} operator and -@cindex @code{in} operator +@cindex @code{in} operator, testing if array element exists To determine whether an element exists in an array at a certain index, use the following expression: @example -@var{ind} in @var{array} +@var{indx} in @var{array} @end example @cindex side effects, array indexing @noindent -This expression tests whether the particular index @var{ind} exists, +This expression tests whether the particular index @var{indx} exists, without the side effect of creating that element if it is not present. -The expression has the value one (true) if @code{@var{array}[@var{ind}]} +The expression has the value one (true) if @code{@var{array}[@var{indx}]} exists and zero (false) if it does not exist. For example, this statement tests whether the array @code{frequencies} contains the index @samp{2}: @@ -13836,8 +14822,8 @@ if (frequencies[2] != "") @node Assigning Elements @subsection Assigning Array Elements -@cindex arrays, elements, assigning -@cindex elements in arrays, assigning +@cindex arrays, elements, assigning values +@cindex elements in arrays, assigning values Array elements can be assigned values just like @command{awk} variables: @@ -13854,6 +14840,7 @@ assign to that element of the array. @node Array Example @subsection Basic Array Example +@cindex arrays, an example of using The following program takes a list of lines, each beginning with a line number, and prints them out in order of line number. 
The line numbers @@ -13923,7 +14910,9 @@ END @{ @node Scanning an Array @subsection Scanning All Elements of an Array @cindex elements in arrays, scanning +@cindex scanning arrays @cindex arrays, scanning +@cindex loops, @code{for}, array scanning In programs that use arrays, it is often necessary to use a loop that executes once for each element of an array. In other languages, where @@ -13940,7 +14929,7 @@ for (@var{var} in @var{array}) @end example @noindent -@cindex @code{in} operator +@cindex @code{in} operator, use in loops This loop executes @var{body} once for each index in @var{array} that the program has previously used, with the variable @var{var} set to that index. @@ -13979,18 +14968,61 @@ END @{ @xref{Word Sorting}, for a more detailed example of this type. -@cindex arrays, elements, order of -@cindex elements in arrays, order of +@cindex arrays, elements, order of access by @code{in} operator +@cindex elements in arrays, order of access by @code{in} operator +@cindex @code{in} operator, order of array access The order in which elements of the array are accessed by this statement is determined by the internal arrangement of the array elements within -@command{awk} and normally cannot be controlled or changed. This can lead to -problems if new elements are added to @var{array} by statements in -the loop body; it is not predictable whether the @code{for} loop will -reach them. Similarly, changing @var{var} inside the loop may produce -strange results. It is best to avoid such things. +@command{awk} and in standard @command{awk} cannot be controlled +or changed. This can lead to problems if new elements are added to +@var{array} by statements in the loop body; it is not predictable whether +the @code{for} loop will reach them. Similarly, changing @var{var} inside +the loop may produce strange results. It is best to avoid such things. 
+ +As a point of information, @command{gawk} sets up the list of elements +to be iterated over before the loop starts, and does not change it. +But not all @command{awk} versions do so. Consider this program, named +@file{loopcheck.awk}: + +@example +BEGIN @{ + a["here"] = "here" + a["is"] = "is" + a["a"] = "a" + a["loop"] = "loop" + for (i in a) @{ + j++ + a[j] = j + print i + @} +@} +@end example + +Here is what happens when run with @command{gawk}: + +@example +$ @kbd{gawk -f loopcheck.awk} +@print{} here +@print{} loop +@print{} a +@print{} is +@end example + +Contrast this to Brian Kernighan's @command{awk}: + +@example +$ @kbd{nawk -f loopcheck.awk} +@print{} loop +@print{} here +@print{} is +@print{} a +@print{} 1 +@end example @node Controlling Scanning -@subsection Using Predefined Array Scanning Orders +@subsection Using Predefined Array Scanning Orders With @command{gawk} + +This @value{SUBSECTION} describes a feature that is specific to @command{gawk}. By default, when a @code{for} loop traverses an array, the order is undefined, meaning that the @command{awk} implementation @@ -13998,12 +15030,14 @@ determines the order in which the array is traversed. This order is usually based on the internal implementation of arrays and will vary from one version of @command{awk} to the next. +@cindex array scanning order, controlling +@cindex controlling array scanning order Often, though, you may wish to do something simple, such as ``traverse the array by comparing the indices in ascending order,'' or ``traverse the array by comparing the values in descending order.'' @command{gawk} provides two mechanisms which give you this control. -@itemize @bullet +@itemize @value{BULLET} @item Set @code{PROCINFO["sorted_in"]} to one of a set of predefined values. We describe this now. @@ -14014,6 +15048,7 @@ to use for comparison of array elements. This advanced feature is described later, in @ref{Array Sorting}. 
@end itemize +@cindex @code{PROCINFO}, values of @code{sorted_in} The following special values for @code{PROCINFO["sorted_in"]} are available: @table @code @@ -14109,7 +15144,7 @@ order relative to each other is determined by their index strings. Here are some additional things to bear in mind about sorted array traversal. -@itemize @bullet +@itemize @value{BULLET} @item The value of @code{PROCINFO["sorted_in"]} is global. That is, it affects all array traversal @code{for} loops. If you need to change it within your @@ -14174,7 +15209,7 @@ if (4 in foo) print "This will never be printed" @end example -@cindex null strings, array elements and +@cindex null strings, and deleting array elements It is important to note that deleting an element is @emph{not} the same as assigning it a null value (the empty string, @code{""}). For example: @@ -14196,6 +15231,7 @@ is not in the array is deleted. @cindex extensions, common@comma{} @code{delete} to delete entire arrays @cindex arrays, deleting entire contents @cindex deleting entire arrays +@cindex @code{delete} @var{array} @cindex differences in @command{awk} and @command{gawk}, array elements, deleting All the elements of an array may be deleted with a single statement by leaving off the subscript in the @code{delete} statement, @@ -14210,6 +15246,7 @@ Using this version of the @code{delete} statement is about three times more efficient than the equivalent loop that deletes each element one at a time. +@cindex Brian Kernighan's @command{awk} @quotation NOTE For many years, using @code{delete} without a subscript was a @command{gawk} extension. 
@@ -14252,9 +15289,9 @@ a = 3 @section Using Numbers to Subscript Arrays @cindex numbers, as array subscripts -@cindex arrays, subscripts +@cindex arrays, numeric subscripts @cindex subscripts in arrays, numbers as -@cindex @code{CONVFMT} variable, array subscripts and +@cindex @code{CONVFMT} variable, and array subscripts An important aspect to remember about arrays is that @emph{array subscripts are always strings}. When a numeric value is used as a subscript, it is converted to a string value before being used for subscripting @@ -14284,7 +15321,8 @@ string value from @code{xyz}---this time @code{"12.15"}---because the value of @code{CONVFMT} only allows two significant digits. This test fails, since @code{"12.15"} is different from @code{"12.153"}. -@cindex converting, during subscripting +@cindex converting integer array subscripts +@cindex integer array indices According to the rules for conversions (@pxref{Conversion}), integer values are always converted to strings as integers, no matter what the @@ -14338,7 +15376,7 @@ $ @kbd{echo 'line 1} @print{} line 2 @end example -Unfortunately, the very first line of input data did not come out in the +Unfortunately, the very first line of input data did not appear in the output! Upon first glance, we would think that this program should have worked. @@ -14378,7 +15416,7 @@ on the command line (@pxref{Options}). @section Multidimensional Arrays @menu -* Multiscanning:: Scanning multidimensional arrays. +* Multiscanning:: Scanning multidimensional arrays. @end menu @cindex subscripts in arrays, multidimensional @@ -14390,7 +15428,7 @@ languages, including @command{awk}) to refer to an element of a two-dimensional array named @code{grid} is with @code{grid[@var{x},@var{y}]}. -@cindex @code{SUBSEP} variable, multidimensional arrays +@cindex @code{SUBSEP} variable, and multidimensional arrays Multidimensional arrays are supported in @command{awk} through concatenation of indices into one string. 
@command{awk} converts the indices into strings @@ -14422,6 +15460,7 @@ combined strings that are ambiguous. Suppose that @code{SUBSEP} is "b@@c"]}} are indistinguishable because both are actually stored as @samp{foo["a@@b@@c"]}. +@cindex @code{in} operator, index existence in multidimensional arrays To test whether a particular index sequence exists in a multidimensional array, use the same operator (@code{in}) that is used for single dimensional arrays. Write the whole sequence of indices @@ -14487,6 +15526,7 @@ multidimensional @emph{way of accessing} an array. @cindex subscripts in arrays, multidimensional, scanning @cindex arrays, multidimensional, scanning +@cindex scanning multidimensional arrays However, if your program has an array that is always accessed as multidimensional, you can get the effect of scanning it by combining the scanning @code{for} statement @@ -14528,12 +15568,13 @@ separate indices is recovered. @node Arrays of Arrays @section Arrays of Arrays +@cindex arrays of arrays @command{gawk} goes beyond standard @command{awk}'s multidimensional array access and provides true arrays of arrays. Elements of a subarray are referred to by their own indices enclosed in square brackets, just like the elements of the main array. -For example, the following creates a two-element subarray at index @samp{1} +For example, the following creates a two-element subarray at index @code{1} of the main array @code{a}: @example @@ -14557,7 +15598,7 @@ Each subarray and the main array can be of different length. In fact, the elements of an array or its subarray do not all have to have the same type. This means that the main array and any of its subarrays can be non-rectangular, or jagged in structure. 
One can assign a scalar value to -the index @samp{4} of the main array @code{a}: +the index @code{4} of the main array @code{a}: @example a[4] = "An element in a jagged array" @@ -14578,7 +15619,7 @@ a[4][5][6][7] = "An element in a four-dimensional array" @end example @noindent -This removes the scalar value from index @samp{4} and then inserts a +This removes the scalar value from index @code{4} and then inserts a subarray of subarray of subarray containing a scalar. You can also delete an entire subarray or subarray of subarrays: @@ -14679,6 +15720,63 @@ creating an arbitrary index: $ @kbd{gawk 'BEGIN @{ b[1][1] = ""; split("a b c d", b[1]); print b[1][1] @}'} @print{} a @end example + +@node Arrays Summary +@section Summary + +@itemize @value{BULLET} +@item +Standard @command{awk} provides one-dimensional associative arrays +(arrays indexed by string values). All arrays are associative; numeric +indices are converted automatically to strings. + +@item +Array elements are referenced as @code{@var{array}[@var{indx}]}. +Referencing an element creates it if it did not exist previously. + +@item +The proper way to see if an array has an element with a given index +is to use the @code{in} operator: @samp{@var{indx} in @var{array}}. + +@item +Use @samp{for (@var{indx} in @var{array}) @dots{}} to scan through all the +individual elements of an array. In the body of the loop, @var{indx} takes +on the value of each element's index in turn. + +@item +The order in which a @samp{for (@var{indx} in @var{array})} loop +traverses an array is undefined in POSIX @command{awk} and varies among +implementations. @command{gawk} lets you control the order by assigning +special predefined values to @code{PROCINFO["sorted_in"]}. + +@item +Use @samp{delete @var{array}[@var{indx}]} to delete an individual element. +You may also use @samp{delete @var{array}} to delete all of the elements +in the array. 
This latter feature has been a common extension for many +years and is now standard, but may not be supported by all commercial +versions of @command{awk}. + +@item +Standard @command{awk} simulates multidimensional arrays by separating +subscript values with a comma. The values are concatenated into a +single string, separated by the value of @code{SUBSEP}. The fact +that such a subscript was created in this way is not retained; thus +changing @code{SUBSEP} may have unexpected consequences. You can use +@samp{(@var{sub1}, @var{sub2}, @dots{}) in @var{array}} to see if such +a multidimensional subscript exists in @var{array}. + +@item +@command{gawk} provides true arrays of arrays. You use a separate +set of square brackets for each dimension in such an array: +@code{data[row][col]}, for example. Array elements may thus be either +scalar values (number or string) or another array. + +@item +Use the @code{isarray()} built-in function to determine if an array +element is itself a subarray. + +@end itemize + @c ENDOFRANGE arrs @node Functions @@ -14703,6 +15801,7 @@ The second half of this @value{CHAPTER} describes these * Built-in:: Summarizes the built-in functions. * User-defined:: Describes User-defined functions in detail. * Indirect Calls:: Choosing the function to call at runtime. +* Functions Summary:: Summary of functions. @end menu @node Built-in @@ -14787,42 +15886,50 @@ two arguments 11 and 10. @node Numeric Functions @subsection Numeric Functions +@cindex numeric functions The following list describes all of the built-in functions that work with numbers. Optional parameters are enclosed in square brackets@w{ ([ ]):} -@table @code -@item atan2(@var{y}, @var{x}) -@cindex @code{atan2()} function +@c @asis for docbook +@table @asis +@item @code{atan2(@var{y}, @var{x})} +@cindexawkfunc{atan2} +@cindex arctangent Return the arctangent of @code{@var{y} / @var{x}} in radians. -You can use @samp{pi = atan2(0, -1)} to retrieve the value of @value{PI}. 
+You can use @samp{pi = atan2(0, -1)} to retrieve the value of +@value{PI}. -@item cos(@var{x}) -@cindex @code{cos()} function +@item @code{cos(@var{x})} +@cindexawkfunc{cos} +@cindex cosine Return the cosine of @var{x}, with @var{x} in radians. -@item exp(@var{x}) -@cindex @code{exp()} function +@item @code{exp(@var{x})} +@cindexawkfunc{exp} +@cindex exponent Return the exponential of @var{x} (@code{e ^ @var{x}}) or report an error if @var{x} is out of range. The range of values @var{x} can have depends on your machine's floating-point representation. -@item int(@var{x}) -@cindex @code{int()} function +@item @code{int(@var{x})} +@cindexawkfunc{int} +@cindex round to nearest integer Return the nearest integer to @var{x}, located between @var{x} and zero and truncated toward zero. For example, @code{int(3)} is 3, @code{int(3.9)} is 3, @code{int(-3.9)} is @minus{}3, and @code{int(-3)} is @minus{}3 as well. -@item log(@var{x}) -@cindex @code{log()} function +@item @code{log(@var{x})} +@cindexawkfunc{log} +@cindex logarithm Return the natural logarithm of @var{x}, if @var{x} is positive; otherwise, report an error. -@item rand() -@cindex @code{rand()} function +@item @code{rand()} +@cindexawkfunc{rand} @cindex random numbers, @code{rand()}/@code{srand()} functions Return a random number. The values of @code{rand()} are uniformly distributed between zero and one. @@ -14864,7 +15971,7 @@ function roll(n) @{ return 1 + int(rand() * n) @} @} @end example -@cindex numbers, random +@cindex seeding random number generator @cindex random numbers, seed of @quotation CAUTION In most @command{awk} implementations, including @command{gawk}, @@ -14879,18 +15986,20 @@ the seed to a value that is different in each run. To do this, use @code{srand()}. @end quotation -@item sin(@var{x}) -@cindex @code{sin()} function +@item @code{sin(@var{x})} +@cindexawkfunc{sin} +@cindex sine Return the sine of @var{x}, with @var{x} in radians. 
-@item sqrt(@var{x}) -@cindex @code{sqrt()} function +@item @code{sqrt(@var{x})} +@cindexawkfunc{sqrt} +@cindex square root Return the positive square root of @var{x}. @command{gawk} prints a warning message if @var{x} is negative. Thus, @code{sqrt(4)} is 2. -@item srand(@r{[}@var{x}@r{]}) -@cindex @code{srand()} function +@item @code{srand(}[@var{x}]@code{)} +@cindexawkfunc{srand} Set the starting point, or seed, for generating random numbers to the value @var{x}. @@ -14920,6 +16029,7 @@ sequences of random numbers. @node String Functions @subsection String-Manipulation Functions +@cindex string-manipulation functions The functions in this @value{SECTION} look at or change the text of one or more strings. @@ -14932,12 +16042,23 @@ example, @code{length()} returns the number of characters in a string, and not the number of bytes used to represent those characters. Similarly, @code{index()} works with character indices, and not byte indices. +@quotation CAUTION +A number of functions deal with indices into strings. For these +functions, the first character of a string is at position (index) one. +This is different from C and the languages descended from it, where the +first character is at position zero. You need to remember this when +doing index calculations, particularly if you are used to C. +@end quotation + In the following list, optional parameters are enclosed in square brackets@w{ ([ ]).} Several functions perform string substitution; the full discussion is provided in the description of the @code{sub()} function, which comes towards the end since the list is presented in alphabetic order. + Those functions that are specific to @command{gawk} are marked with a -pound sign@w{ (@samp{#}):} +pound sign (@samp{#}). They are not available in compatibility mode +(@pxref{Options}): + @menu * Gory Details:: More than you want to know about @samp{\} and @@ -14945,14 +16066,15 @@ pound sign@w{ (@samp{#}):} @code{gensub()}. 
@end menu -@table @code -@item asort(@var{source} @r{[}, @var{dest} @r{[}, @var{how} @r{]} @r{]}) # -@itemx asorti(@var{source} @r{[}, @var{dest} @r{[}, @var{how} @r{]} @r{]}) # -@cindex @code{asorti()} function (@command{gawk}) +@c @asis for docbook +@table @asis +@item @code{asort(}@var{source} [@code{,} @var{dest} [@code{,} @var{how} ] ]@code{) #} +@itemx @code{asorti(}@var{source} [@code{,} @var{dest} [@code{,} @var{how} ] ]@code{) #} +@cindexgawkfunc{asorti} +@cindex sort array @cindex arrays, elements, retrieving number of -@cindex @code{asort()} function (@command{gawk}) -@cindex @command{gawk}, @code{IGNORECASE} variable in -@cindex @code{IGNORECASE} variable +@cindexgawkfunc{asort} +@cindex sort array indices These two functions are similar in behavior, so they are described together. @@ -14970,7 +16092,9 @@ sequential integers starting with one. If the optional array @var{dest} is specified, then @var{source} is duplicated into @var{dest}. @var{dest} is then sorted, leaving the indices of @var{source} unchanged. -When comparing strings, @code{IGNORECASE} affects the sorting. If the +@cindex @command{gawk}, @code{IGNORECASE} variable in +When comparing strings, @code{IGNORECASE} affects the sorting +(@pxref{Array Sorting Functions}). If the @var{source} array contains subarrays as values (@pxref{Arrays of Arrays}), they will come last, after all scalar values. @@ -15009,11 +16133,10 @@ a[2] = "last" a[3] = "middle" @end example -@code{asort()} and @code{asorti()} are @command{gawk} extensions; they -are not available in compatibility mode (@pxref{Options}). 
- -@item gensub(@var{regexp}, @var{replacement}, @var{how} @r{[}, @var{target}@r{]}) # -@cindex @code{gensub()} function (@command{gawk}) +@item @code{gensub(@var{regexp}, @var{replacement}, @var{how}} [@code{, @var{target}}]@code{) #} +@cindexgawkfunc{gensub} +@cindex search and replace in strings +@cindex substitute in string Search the target string @var{target} for matches of the regular expression @var{regexp}. If @var{how} is a string beginning with @samp{g} or @samp{G} (short for ``global''), then replace all matches of @var{regexp} with @@ -15022,7 +16145,7 @@ which match of @var{regexp} to replace. If no @var{target} is supplied, use @code{$0}. It returns the modified string as the result of the function and the original target string is @emph{not} changed. -@code{gensub()} is a general substitution function. It's purpose is +@code{gensub()} is a general substitution function. Its purpose is to provide more features than the standard @code{sub()} and @code{gsub()} functions. @@ -15072,11 +16195,8 @@ a warning message. If @var{regexp} does not match @var{target}, @code{gensub()}'s return value is the original unchanged value of @var{target}. -@code{gensub()} is a @command{gawk} extension; it is not available -in compatibility mode (@pxref{Options}). - -@item gsub(@var{regexp}, @var{replacement} @r{[}, @var{target}@r{]}) -@cindex @code{gsub()} function +@item @code{gsub(@var{regexp}, @var{replacement}} [@code{, @var{target}}]@code{)} +@cindexawkfunc{gsub} Search @var{target} for @emph{all} of the longest, leftmost, @emph{nonoverlapping} matching substrings it can find and replace them with @var{replacement}. @@ -15097,9 +16217,10 @@ omitted, then the entire input record (@code{$0}) is used. As in @code{sub()}, the characters @samp{&} and @samp{\} are special, and the third argument must be assignable. 
-@item index(@var{in}, @var{find}) -@cindex @code{index()} function -@cindex searching +@item @code{index(@var{in}, @var{find})} +@cindexawkfunc{index} +@cindex search in string +@cindex find substring in string Search the string @var{in} for the first occurrence of the string @var{find}, and return the position in characters where that occurrence begins in the string @var{in}. Consider the following example: @@ -15111,19 +16232,33 @@ $ @kbd{awk 'BEGIN @{ print index("peanut", "an") @}'} @noindent If @var{find} is not found, @code{index()} returns zero. -(Remember that string indices in @command{awk} start at one.) It is a fatal error to use a regexp constant for @var{find}. -@item length(@r{[}@var{string}@r{]}) -@cindex @code{length()} function +@item @code{length(}[@var{string}]@code{)} +@cindexawkfunc{length} +@cindex string length +@cindex length of string Return the number of characters in @var{string}. If @var{string} is a number, the length of the digit string representing that number is returned. For example, @code{length("abcde")} is five. By -contrast, @code{length(15 * 35)} works out to three. In this example, 15 * 35 = -525, and 525 is then converted to the string @code{"525"}, which has +contrast, @code{length(15 * 35)} works out to three. In this example, +@iftex +@math{15 @cdot 35 = 525}, +@end iftex +@ifnottex +@ifnotdocbook +15 * 35 = 525, +@end ifnotdocbook +@end ifnottex +@docbook +15 ⋅ 35 = 525, @c +@end docbook +and 525 is then converted to the string @code{"525"}, which has three characters. +@cindex length of input record +@cindex input record, length of If no argument is supplied, @code{length()} returns the length of @code{$0}. @c @cindex historical features @@ -15162,6 +16297,8 @@ warning about this. 
@cindex common extensions, @code{length()} applied to an array @cindex extensions, common@comma{} @code{length()} applied to an array @cindex differences between @command{gawk} and @command{awk} +@cindex number of array elements +@cindex array, number of elements With @command{gawk} and several other @command{awk} implementations, when given an array argument, the @code{length()} function returns the number of elements in the array. @value{COMMONEXT} @@ -15174,16 +16311,18 @@ If @option{--lint} is provided on the command line If @option{--posix} is supplied, using an array argument is a fatal error (@pxref{Arrays}). -@item match(@var{string}, @var{regexp} @r{[}, @var{array}@r{]}) -@cindex @code{match()} function +@item @code{match(@var{string}, @var{regexp}} [@code{, @var{array}}]@code{)} +@cindexawkfunc{match} +@cindex string, regular expression match +@cindex match regexp in string Search @var{string} for the longest, leftmost substring matched by the regular expression, -@var{regexp} and return the character position, or @dfn{index}, +@var{regexp} and return the character position (index) at which that substring begins (one, if it starts at the beginning of @var{string}). If no match is found, return zero. The @var{regexp} argument may be either a regexp constant -(@code{/@dots{}/}) or a string constant (@code{"@dots{}"}). +(@code{/}@dots{}@code{/}) or a string constant (@code{"}@dots{}@code{"}). In the latter case, the string is treated as a regexp to be matched. @xref{Computed Regexps}, for a discussion of the difference between the two forms, and the @@ -15289,8 +16428,9 @@ The @var{array} argument to @code{match()} is a (@pxref{Options}), using a third argument is a fatal error. 
-@item patsplit(@var{string}, @var{array} @r{[}, @var{fieldpat} @r{[}, @var{seps} @r{]} @r{]}) # -@cindex @code{patsplit()} function (@command{gawk}) +@item @code{patsplit(@var{string}, @var{array}} [@code{, @var{fieldpat}} [@code{, @var{seps}} ] ]@code{) #} +@cindexgawkfunc{patsplit} +@cindex split string into array Divide @var{string} into pieces defined by @var{fieldpat} and store the pieces in @var{array} and the separator strings in the @@ -15314,14 +16454,8 @@ manner similar to the way input lines are split into fields using @code{FPAT} Before splitting the string, @code{patsplit()} deletes any previously existing elements in the arrays @var{array} and @var{seps}. -@cindex troubleshooting, @code{patsplit()} function -The @code{patsplit()} function is a -@command{gawk} extension. In compatibility mode -(@pxref{Options}), -it is not available. - -@item split(@var{string}, @var{array} @r{[}, @var{fieldsep} @r{[}, @var{seps} @r{]} @r{]}) -@cindex @code{split()} function +@item @code{split(@var{string}, @var{array}} [@code{, @var{fieldsep}} [@code{, @var{seps}} ] ]@code{)} +@cindexawkfunc{split} Divide @var{string} into pieces separated by @var{fieldsep} and store the pieces in @var{array} and the separator strings in the @var{seps} array. The first piece is stored in @@ -15350,7 +16484,7 @@ split("cul-de-sac", a, "-", seps) @end example @noindent -@cindex strings, splitting +@cindex strings splitting, example splits the string @samp{cul-de-sac} into three fields using @samp{-} as the separator. It sets the contents of the array @code{a} as follows: @@ -15405,8 +16539,11 @@ If @var{string} does not match @var{fieldsep} at all (but is not null), @var{array} has one element only. The value of that element is the original @var{string}. -@item sprintf(@var{format}, @var{expression1}, @dots{}) -@cindex @code{sprintf()} function +In POSIX mode (@pxref{Options}), the fourth argument is not allowed. 
+ +@item @code{sprintf(@var{format}, @var{expression1}, @dots{})} +@cindexawkfunc{sprintf} +@cindex formatting strings Return (without printing) the string that @code{printf} would have printed out with the same arguments (@pxref{Printf}). @@ -15419,8 +16556,9 @@ pival = sprintf("pi = %.2f (approx.)", 22/7) @noindent assigns the string @w{@samp{pi = 3.14 (approx.)}} to the variable @code{pival}. -@cindex @code{strtonum()} function (@command{gawk}) -@item strtonum(@var{str}) # +@cindexgawkfunc{strtonum} +@cindex convert string to number +@item @code{strtonum(@var{str}) #} Examine @var{str} and return its numeric value. If @var{str} begins with a leading @samp{0}, @code{strtonum()} assumes that @var{str} is an octal number. If @var{str} begins with a leading @samp{0x} or @@ -15442,12 +16580,9 @@ you use the @option{--non-decimal-data} option, which isn't recommended. Note also that @code{strtonum()} uses the current locale's decimal point for recognizing numbers (@pxref{Locales}). -@cindex differences in @command{awk} and @command{gawk}, @code{strtonum()} function (@command{gawk}) -@code{strtonum()} is a @command{gawk} extension; it is not available -in compatibility mode (@pxref{Options}). - -@item sub(@var{regexp}, @var{replacement} @r{[}, @var{target}@r{]}) -@cindex @code{sub()} function +@item @code{sub(@var{regexp}, @var{replacement}} [@code{, @var{target}}]@code{)} +@cindexawkfunc{sub} +@cindex replace in string Search @var{target}, which is treated as a string, for the leftmost, longest substring matched by the regular expression @var{regexp}. Modify the entire string @@ -15456,7 +16591,7 @@ The modified string becomes the new value of @var{target}. Return the number of substitutions made (zero or one). The @var{regexp} argument may be either a regexp constant -(@code{/@dots{}/}) or a string constant (@code{"@dots{}"}). +(@code{/}@dots{}@code{/}) or a string constant (@code{"}@dots{}@code{"}). 
In the latter case, the string is treated as a regexp to be matched. @xref{Computed Regexps}, for a discussion of the difference between the two forms, and the @@ -15546,8 +16681,9 @@ will not run. Finally, if the @var{regexp} is not a regexp constant, it is converted into a string, and then the value of that string is treated as the regexp to match. -@item substr(@var{string}, @var{start} @r{[}, @var{length}@r{]}) -@cindex @code{substr()} function +@item @code{substr(@var{string}, @var{start}} [@code{, @var{length}} ]@code{)} +@cindexawkfunc{substr} +@cindex substring Return a @var{length}-character-long substring of @var{string}, starting at character number @var{start}. The first character of a string is character number one.@footnote{This is different from @@ -15561,6 +16697,7 @@ suffix is also returned if @var{length} is greater than the number of characters remaining in the string, counting from character @var{start}. +@cindex Brian Kernighan's @command{awk} If @var{start} is less than one, @code{substr()} treats it as if it was one. (POSIX doesn't specify what to do in this case: Brian Kernighan's @command{awk} acts this way, and therefore @command{gawk} @@ -15603,16 +16740,18 @@ string = substr(string, 1, 2) "CDE" substr(string, 6) @end example @cindex case sensitivity, converting case -@cindex converting, case -@item tolower(@var{string}) -@cindex @code{tolower()} function +@cindex strings, converting letter case +@item @code{tolower(@var{string})} +@cindexawkfunc{tolower} +@cindex convert string to lower case Return a copy of @var{string}, with each uppercase character in the string replaced with its corresponding lowercase character. Nonalphabetic characters are left unchanged. For example, @code{tolower("MiXeD cAsE 123")} returns @code{"mixed case 123"}. 
-@item toupper(@var{string}) -@cindex @code{toupper()} function +@item @code{toupper(@var{string})} +@cindexawkfunc{toupper} +@cindex convert string to upper case Return a copy of @var{string}, with each lowercase character in the string replaced with its corresponding uppercase character. Nonalphabetic characters are left unchanged. For example, @@ -15636,10 +16775,11 @@ that there are several levels of @dfn{escape processing} going on. First, there is the @dfn{lexical} level, which is when @command{awk} reads your program -and builds an internal copy of it that can be executed. +and builds an internal copy of it to execute. Then there is the runtime level, which is when @command{awk} actually scans the replacement string to determine what to generate. +@cindex Brian Kernighan's @command{awk} At both levels, @command{awk} looks for a defined set of characters that can come after a backslash. At the lexical level, it looks for the escape sequences listed in @ref{Escape Sequences}. @@ -15778,7 +16918,7 @@ says, in effect, that @samp{\} turns off the special meaning of any following character, but for anything other than @samp{\} and @samp{&}, such special meaning is undefined. This wording leads to two problems: -@itemize @bullet +@itemize @value{BULLET} @item Backslashes must now be doubled in the @var{replacement} string, breaking historical @command{awk} programs. @@ -15909,17 +17049,17 @@ _bigskip} The only case where the difference is noticeable is the last one: @samp{\\\\} is seen as @samp{\\} and produces @samp{\} instead of @samp{\\}. -Starting with version 3.1.4, @command{gawk} followed the POSIX rules +Starting with @value{PVERSION} 3.1.4, @command{gawk} followed the POSIX rules when @option{--posix} is specified (@pxref{Options}). Otherwise, it continued to follow the 1996 proposed rules, since that had been its behavior for many years. 
-When version 4.0.0 was released, the @command{gawk} maintainer +When @value{PVERSION} 4.0.0 was released, the @command{gawk} maintainer made the POSIX rules the default, breaking well over a decade's worth of backwards compatibility.@footnote{This was rather naive of him, despite there being a note in this section indicating that the next major version would move to the POSIX rules.} Needless to say, this was a bad idea, -and as of version 4.0.1, @command{gawk} resumed its historical +and as of @value{PVERSION} 4.0.1, @command{gawk} resumed its historical behavior, and only follows the POSIX rules when @option{--posix} is given. The rules for @code{gensub()} are considerably simpler. At the runtime @@ -16004,14 +17144,16 @@ Although this makes a certain amount of sense, it can be surprising. @node I/O Functions @subsection Input/Output Functions +@cindex input/output functions The following functions relate to input/output (I/O). Optional parameters are enclosed in square brackets ([ ]): -@table @code -@item close(@var{filename} @r{[}, @var{how}@r{]}) -@cindex @code{close()} function +@table @asis +@item @code{close(}@var{filename} [@code{,} @var{how}]@code{)} +@cindexawkfunc{close} @cindex files, closing +@cindex close file or coprocess Close the file @var{filename} for input or output. Alternatively, the argument may be a shell command that was used for creating a coprocess, or for redirecting to or from a pipe; then the coprocess or pipe is closed. @@ -16027,8 +17169,12 @@ not matter. @xref{Two-way I/O}, which discusses this feature in more detail and gives an example. -@item fflush(@r{[}@var{filename}@r{]}) -@cindex @code{fflush()} function +Note that the second argument to @code{close()} is a @command{gawk} +extension; it is not available in compatibility mode (@pxref{Options}). 
+ +@item @code{fflush(}[@var{filename}]@code{)} +@cindexawkfunc{fflush} +@cindex flush buffered output Flush any buffered output associated with @var{filename}, which is either a file opened for writing or a shell command for redirecting output to a pipe or coprocess. @@ -16046,11 +17192,12 @@ This is the purpose of the @code{fflush()} function---@command{gawk} also buffers its output and the @code{fflush()} function forces @command{gawk} to flush its buffers. -@code{fflush()} was added to Brian Kernighan's -version of @command{awk} in 1994. -For over two decades, it was not part of the POSIX standard. -As of December, 2012, it was accepted for -inclusion into the POSIX standard. +@cindex extensions, common@comma{} @code{fflush()} function +@cindex Brian Kernighan's @command{awk} +@code{fflush()} was added to Brian Kernighan's @command{awk} in +April of 1992. For two decades, it was not part of the POSIX standard. +As of December, 2012, it was accepted for inclusion into the POSIX +standard. See @uref{http://austingroupbugs.net/view.php?id=634, the Austin Group website}. POSIX standardizes @code{fflush()} as follows: If there @@ -16059,7 +17206,7 @@ then @command{awk} flushes the buffers for @emph{all} open output files and pipes. @quotation NOTE -Prior to version 4.0.2, @command{gawk} +Prior to @value{PVERSION} 4.0.2, @command{gawk} would flush only the standard output if there was no argument, and flush all output files and pipes if the argument was the null string. This was changed in order to be compatible with Brian @@ -16075,7 +17222,7 @@ only the standard output. @c @cindex warnings, automatic @cindex troubleshooting, @code{fflush()} function @code{fflush()} returns zero if the buffer is successfully flushed; -otherwise, it returns non-zero (@command{gawk} returns @minus{}1). +otherwise, it returns non-zero. (@command{gawk} returns @minus{}1.) In the case where all buffers are flushed, the return value is zero only if all buffers were flushed successfully. 
Otherwise, it is @minus{}1, and @command{gawk} warns about the problem @var{filename}. @@ -16085,8 +17232,9 @@ a file or pipe that was opened for reading (such as with @code{getline}), or if @var{filename} is not an open file, pipe, or coprocess. In such a case, @code{fflush()} returns @minus{}1, as well. -@item system(@var{command}) -@cindex @code{system()} function +@item @code{system(@var{command})} +@cindexawkfunc{system} +@cindex invoke shell command @cindex interacting with other programs Execute the operating-system command @var{command} and then return to the @command{awk} program. @@ -16117,7 +17265,7 @@ close("/bin/sh") @noindent @cindex troubleshooting, @code{system()} function -@cindex @code{--sandbox} option, disabling @code{system()} function +@cindex @option{--sandbox} option, disabling @code{system()} function However, if your @command{awk} program is interactive, @code{system()} is useful for running large self-contained programs, such as a shell or an editor. @@ -16233,6 +17381,7 @@ you would see the latter (undesirable) output. @node Time Functions @subsection Time Functions +@cindex time functions @c STARTOFRANGE tst @cindex timestamps @@ -16249,10 +17398,26 @@ particular log record was written. Many programs log their timestamp in the form returned by the @code{time()} system call, which is the number of seconds since a particular epoch. On POSIX-compliant systems, it is the number of seconds since -1970-01-01 00:00:00 UTC, not counting leap seconds.@footnote{@xref{Glossary}, -especially the entries ``Epoch'' and ``UTC.''} +1970-01-01 00:00:00 UTC, not counting leap +@ifclear FOR_PRINT +seconds.@footnote{@xref{Glossary}, especially the entries ``Epoch'' and ``UTC.''} +@end ifclear +@ifset FOR_PRINT +seconds. 
+@end ifset All known POSIX-compliant systems support timestamps from 0 through -@math{2^{31} - 1}, which is sufficient to represent times through +@iftex +@math{2^{31} - 1}, +@end iftex +@ifnottex +@ifnotdocbook +2^31 - 1, +@end ifnotdocbook +@end ifnottex +@docbook +2<superscript>31</superscript> − 1, @c +@end docbook +which is sufficient to represent times through 2038-01-19 03:14:07 UTC. Many systems support a wider range of timestamps, including negative timestamps that represent times before the epoch. @@ -16269,9 +17434,11 @@ However, recent versions of @command{mawk} (@pxref{Other Versions}) also support these functions. Optional parameters are enclosed in square brackets ([ ]): -@table @code -@item mktime(@var{datespec}) -@cindex @code{mktime()} function (@command{gawk}) +@c @asis for docbook +@table @asis +@item @code{mktime(@var{datespec})} +@cindexgawkfunc{mktime} +@cindex generate time values Turn @var{datespec} into a timestamp in the same form as is returned by @code{systime()}. It is similar to the function of the same name in ISO C. The argument, @var{datespec}, is a string of the form @@ -16299,9 +17466,10 @@ is out of range, @code{mktime()} returns @minus{}1. @cindex @command{gawk}, @code{PROCINFO} array in @cindex @code{PROCINFO} array -@item strftime(@r{[}@var{format} @r{[}, @var{timestamp} @r{[}, @var{utc-flag}@r{]]]}) +@item @code{strftime(} [@var{format} [@code{,} @var{timestamp} [@code{,} @var{utc-flag}] ] ]@code{)} @c STARTOFRANGE strf -@cindex @code{strftime()} function (@command{gawk}) +@cindexgawkfunc{strftime} +@cindex format time string Format the time specified by @var{timestamp} based on the contents of the @var{format} string and return the result. It is similar to the function of the same name in ISO C. @@ -16318,11 +17486,12 @@ The default string value is @code{@w{"%a %b %e %H:%M:%S %Z %Y"}}. This format string produces output that is equivalent to that of the @command{date} utility. 
You can assign a new value to @code{PROCINFO["strftime"]} to -change the default format. +change the default format; see below for the various format directives. -@item systime() -@cindex @code{systime()} function (@command{gawk}) +@item @code{systime()} +@cindexgawkfunc{systime} @cindex timestamps +@cindex current system time Return the current time as the number of seconds since the system epoch. On POSIX systems, this is the number of seconds since 1970-01-01 00:00:00 UTC, not counting leap seconds. @@ -16394,10 +17563,10 @@ This is the ISO 8601 date format. @item %g The year modulo 100 of the ISO 8601 week number, as a decimal number (00--99). -For example, January 1, 1993 is in week 53 of 1992. Thus, the year -of its ISO 8601 week number is 1992, even though its year is 1993. -Similarly, December 31, 1973 is in week 1 of 1974. Thus, the year -of its ISO week number is 1974, even though its year is 1973. +For example, January 1, 2012 is in week 53 of 2011. Thus, the year +of its ISO 8601 week number is 2011, even though its year is 2012. +Similarly, December 31, 2012 is in week 1 of 2013. Thus, the year +of its ISO week number is 2013, even though its year is 2012. @item %G The full year of the ISO week number, as a decimal number. @@ -16478,7 +17647,7 @@ The locale's ``appropriate'' time representation. The year modulo 100 as a decimal number (00--99). @item %Y -The full year as a decimal number (e.g., 2011). +The full year as a decimal number (e.g., 2015). @c @cindex RFC 822 @c @cindex RFC 1036 @@ -16512,17 +17681,6 @@ uses the system's version of @code{strftime()} if it's there. Typically, the conversion specifier either does not appear in the returned string or appears literally.} -@c @cindex locale, definition of -Informally, a @dfn{locale} is the geographic place in which a program -is meant to run. 
For example, a common way to abbreviate the date -September 4, 2012 in the United States is ``9/4/12.'' -In many countries in Europe, however, it is abbreviated ``4.9.12.'' -Thus, the @samp{%x} specification in a @code{"US"} locale might produce -@samp{9/4/12}, while in a @code{"EUROPE"} locale, it might produce -@samp{4.9.12}. The ISO C standard defines a default @code{"C"} -locale, which is an environment that is typical of what many C programmers -are used to. - For systems that are not yet fully standards-compliant, @command{gawk} supplies a copy of @code{strftime()} from the GNU C Library. @@ -16575,7 +17733,7 @@ the string. For example: @example $ date '+Today is %A, %B %d, %Y.' -@print{} Today is Wednesday, March 30, 2011. +@print{} Today is Monday, May 05, 2014. @end example Here is the @command{gawk} version of the @command{date} utility. @@ -16595,7 +17753,7 @@ case $1 in esac gawk 'BEGIN @{ - format = "%a %b %e %H:%M:%S %Z %Y" + format = PROCINFO["strftime"] exitval = 0 if (ARGC > 2) @@ -16616,6 +17774,7 @@ gawk 'BEGIN @{ @node Bitwise Functions @subsection Bit-Manipulation Functions +@cindex bit-manipulation functions @c STARTOFRANGE bit @cindex bitwise, operations @c STARTOFRANGE and @@ -16682,9 +17841,7 @@ Operands | 0 | 1 | 0 | 1 | 0 | 1 @end tex @docbook -<!-- FIXME: Fix ID and add xref in text. --> -<table id="table-bitwise-ops"> -<title>Bitwise Operations</title> +<informaltable> <tgroup cols="7" colsep="1"> <colspec colname="c1"/> @@ -16744,7 +17901,7 @@ Operands | 0 | 1 | 0 | 1 | 0 | 1 </tbody> </tgroup> -</table> +</informaltable> @end docbook @end float @@ -16778,28 +17935,34 @@ bitwise operations just described. They are: @cindex @command{gawk}, bitwise operations in @table @code -@cindex @code{and()} function (@command{gawk}) -@item and(@var{v1}, @var{v2} @r{[}, @r{@dots{}]}) +@cindexgawkfunc{and} +@cindex bitwise AND +@item @code{and(@var{v1}, @var{v2}} [@code{,} @dots{}]@code{)} Return the bitwise AND of the arguments. 
There must be at least two. -@cindex @code{compl()} function (@command{gawk}) -@item compl(@var{val}) +@cindexgawkfunc{compl} +@cindex bitwise complement +@item @code{compl(@var{val})} Return the bitwise complement of @var{val}. -@cindex @code{lshift()} function (@command{gawk}) -@item lshift(@var{val}, @var{count}) +@cindexgawkfunc{lshift} +@cindex left shift +@item @code{lshift(@var{val}, @var{count})} Return the value of @var{val}, shifted left by @var{count} bits. -@cindex @code{or()} function (@command{gawk}) -@item or(@var{v1}, @var{v2} @r{[}, @r{@dots{}]}) +@cindexgawkfunc{or} +@cindex bitwise OR +@item @code{or(@var{v1}, @var{v2}} [@code{,} @dots{}]@code{)} Return the bitwise OR of the arguments. There must be at least two. -@cindex @code{rshift()} function (@command{gawk}) -@item rshift(@var{val}, @var{count}) +@cindexgawkfunc{rshift} +@cindex right shift +@item @code{rshift(@var{val}, @var{count})} Return the value of @var{val}, shifted right by @var{count} bits. -@cindex @code{xor()} function (@command{gawk}) -@item xor(@var{v1}, @var{v2} @r{[}, @r{@dots{}]}) +@cindexgawkfunc{xor} +@cindex bitwise XOR +@item @code{xor(@var{v1}, @var{v2}} [@code{,} @dots{}]@code{)} Return the bitwise XOR of the arguments. There must be at least two. @end table @@ -16890,6 +18053,7 @@ $ @kbd{gawk -f testbits.awk} @cindex strings, converting @cindex numbers, converting @cindex converting, numbers to strings +@cindex number as string of bits The @code{bits2str()} function turns a binary number into a string. The number @code{1} represents a binary value where the rightmost bit is set to 1. Using this mask, @@ -16921,11 +18085,12 @@ results of the @code{compl()}, @code{lshift()}, and @code{rshift()} functions. @command{gawk} provides a single function that lets you distinguish an array from a scalar variable. This is necessary for writing code -that traverses every element of a true multidimensional array +that traverses every element of an array of arrays. 
(@pxref{Arrays of Arrays}). @table @code -@cindex @code{isarray()} function (@command{gawk}) +@cindexgawkfunc{isarray} +@cindex scalar or array @item isarray(@var{x}) Return a true value if @var{x} is an array. Otherwise return false. @end table @@ -16933,7 +18098,7 @@ Return a true value if @var{x} is an array. Otherwise return false. @code{isarray()} is meant for use in two circumstances. The first is when traversing a multidimensional array: you can test if an element is itself an array or not. The second is inside the body of a user-defined function -(not discussed yet; @pxref{User-defined}), to test if a paramater is an +(not discussed yet; @pxref{User-defined}), to test if a parameter is an array or not. Note, however, that using @code{isarray()} at the global level to test @@ -16947,6 +18112,7 @@ will end up turning it into a scalar. @subsection String-Translation Functions @cindex @command{gawk}, string-translation functions @cindex functions, string-translation +@cindex string-translation functions @cindex internationalization @cindex @command{awk} programs, internationalizing @@ -16957,9 +18123,10 @@ The descriptions here are purposely brief. for the full story. Optional parameters are enclosed in square brackets ([ ]): -@table @code -@cindex @code{bindtextdomain()} function (@command{gawk}) -@item bindtextdomain(@var{directory} @r{[}, @var{domain}@r{]}) +@table @asis +@cindexgawkfunc{bindtextdomain} +@cindex set directory of message catalogs +@item @code{bindtextdomain(@var{directory}} [@code{,} @var{domain}]@code{)} Set the directory in which @command{gawk} will look for message translation files, in case they will not or cannot be placed in the ``standard'' locations @@ -16971,15 +18138,16 @@ If @var{directory} is the null string (@code{""}), then @code{bindtextdomain()} returns the current binding for the given @var{domain}. 
-@cindex @code{dcgettext()} function (@command{gawk}) -@item dcgettext(@var{string} @r{[}, @var{domain} @r{[}, @var{category}@r{]]}) +@cindexgawkfunc{dcgettext} +@cindex translate string +@item @code{dcgettext(@var{string}} [@code{,} @var{domain} [@code{,} @var{category}] ]@code{)} Return the translation of @var{string} in text domain @var{domain} for locale category @var{category}. The default value for @var{domain} is the current value of @code{TEXTDOMAIN}. The default value for @var{category} is @code{"LC_MESSAGES"}. -@cindex @code{dcngettext()} function (@command{gawk}) -@item dcngettext(@var{string1}, @var{string2}, @var{number} @r{[}, @var{domain} @r{[}, @var{category}@r{]]}) +@cindexgawkfunc{dcngettext} +@item @code{dcngettext(@var{string1}, @var{string2}, @var{number}} [@code{,} @var{domain} [@code{,} @var{category}] ]@code{)} Return the plural form used for @var{number} of the translation of @var{string1} and @var{string2} in text domain @var{domain} for locale category @var{category}. @var{string1} is the @@ -16995,7 +18163,7 @@ The default value for @var{category} is @code{"LC_MESSAGES"}. @section User-Defined Functions @c STARTOFRANGE udfunc -@cindex user-defined, functions +@cindex user-defined functions @c STARTOFRANGE funcud @cindex functions, user-defined Complicated @command{awk} programs can often be simplified by defining @@ -17027,12 +18195,12 @@ entire program before starting to execute any of it. The definition of a function named @var{name} looks like this: -@example -function @var{name}(@r{[}@var{parameter-list}@r{]}) -@{ +@display +@code{function} @var{name}@code{(}[@var{parameter-list}]@code{)} +@code{@{} @var{body-of-function} -@} -@end example +@code{@}} +@end display @cindex names, functions @cindex functions, names of @@ -17047,14 +18215,20 @@ used as a variable, array, or function. @var{parameter-list} is an optional list of the function's arguments and local variable names, separated by commas. 
When the function is called, the argument names are used to hold the argument values given in -the call. The local variables are initialized to the empty string. +the call. + A function cannot have two parameters with the same name, nor may it have a parameter with the same name as the function itself. +In addition, according to the POSIX standard, function parameters +cannot have the same name as one of the special built-in variables +(@pxref{Built-in Variables}). Not all versions of @command{awk} enforce +this restriction.) -In addition, according to the POSIX standard, function parameters cannot have the same -name as one of the special built-in variables -(@pxref{Built-in Variables}. Not all versions of @command{awk} -enforce this restriction. +Local variables act like the empty string if referenced where a string +value is required, and like zero if referenced where a numeric value +is required. This is the same as regular variables that have never been +assigned a value. (There is more to understand about local variables; +@pxref{Dynamic Typing}.) The @var{body-of-function} consists of @command{awk} statements. It is the most important part of the definition, because it says what the function @@ -17081,6 +18255,7 @@ conventional to place some extra space between the arguments and the local variables, in order to document how your function is supposed to be used. @cindex variables, shadowing +@cindex shadowing of variable values During execution of the function body, the arguments and local variable values hide, or @dfn{shadow}, any variables of the same names used in the rest of the program. The shadowed variables are not accessible in the @@ -17101,7 +18276,7 @@ function. When this happens, we say the function is @dfn{recursive}. The act of a function calling itself is called @dfn{recursion}. All the built-in functions return a value to their caller. 
-User-defined functions can do also, using the @code{return} statement, +User-defined functions can do so also, using the @code{return} statement, which is described in detail in @ref{Return Statement}. Many of the subsequent examples in this @value{SECTION} use the @code{return} statement. @@ -17139,6 +18314,7 @@ keyword @code{function} when defining a function. @node Function Example @subsection Function Definition Examples +@cindex function definition example Here is an example of a user-defined function, called @code{myprint()}, that takes a number and prints it in a specific format: @@ -17193,7 +18369,8 @@ Instead of having to repeat this loop everywhere that you need to clear out an array, your program can just call @code{delarray}. (This guarantees portability. The use of @samp{delete @var{array}} to delete -the contents of an entire array is a nonstandard extension.) +the contents of an entire array is a recent@footnote{Late in 2012.} +addition to the POSIX standard.) The following is an example of a recursive function. It takes a string as an input parameter and returns the string in backwards order. @@ -17236,7 +18413,7 @@ to create an @command{awk} version of @code{ctime()}: function ctime(ts, format) @{ - format = "%a %b %e %H:%M:%S %Z %Y" + format = PROCINFO["strftime"] if (ts == 0) ts = systime() # use current time as default return strftime(format, ts) @@ -17249,7 +18426,10 @@ function ctime(ts, format) @subsection Calling User-Defined Functions @c STARTOFRANGE fudc -This section describes how to call a user-defined function. +@cindex functions, user-defined, calling +@dfn{Calling a function} means causing the function to run and do its job. +A function call is an expression and its value is the value returned by +the function. @menu * Calling A Function:: Don't use spaces. @@ -17260,11 +18440,6 @@ This section describes how to call a user-defined function. 
@node Calling A Function @subsubsection Writing A Function Call -@cindex functions, user-defined, calling -@dfn{Calling a function} means causing the function to run and do its job. -A function call is an expression and its value is the value returned by -the function. - A function call consists of the function name followed by the arguments in parentheses. @command{awk} expressions are what you write in the call for the arguments. Each time the call is executed, these @@ -17288,9 +18463,10 @@ an error. @node Variable Scope @subsubsection Controlling Variable Scope -@cindex local variables -@cindex variables, local -There is no way to make a variable local to a @code{@{ @dots{} @}} block in +@cindex local variables, in a function +@cindex variables, local to a function +Unlike many languages, +there is no way to make a variable local to a @code{@{} @dots{} @code{@}} block in @command{awk}, but you can make a variable local to a function. It is good practice to do so whenever a variable is needed only in that function. @@ -17552,14 +18728,14 @@ This statement returns control to the calling part of the @command{awk} program. can also be used to return a value for use in the rest of the @command{awk} program. It looks like this: -@example -return @r{[}@var{expression}@r{]} -@end example +@display +@code{return} [@var{expression}] +@end display The @var{expression} part is optional. Due most likely to an oversight, POSIX does not define what the return value is if you omit the @var{expression}. Technically speaking, this -make the returned value undefined, and therefore, unpredictable. +makes the returned value undefined, and therefore, unpredictable. In practice, though, all versions of @command{awk} simply return the null string, which acts like zero if used in a numeric context. @@ -17662,9 +18838,9 @@ BEGIN @{ @end example In this example, the first call to @code{foo()} generates -a fatal error, so @command{gawk} will not report the second -error. 
If you comment out that call, though, then @command{gawk} -will report the second error. +a fatal error, so @command{awk} will not report the second +error. If you comment out that call, though, then @command{awk} +does report the second error. Usually, such things aren't a big issue, but it's worth being aware of them. @@ -17734,7 +18910,7 @@ character: @example the_func = "sum" -result = @@the_func() # calls the `sum' function +result = @@the_func() # calls the sum() function @end example Here is a full program that processes the previously shown data, @@ -17855,8 +19031,9 @@ We can do something similar using @command{gawk}, like this: @ignore @c file eg/lib/quicksort.awk # -# Arnold Robbins, arnold@skeeve.com, Public Domain +# Arnold Robbins, arnold@@skeeve.com, Public Domain # January 2009 + @c endfile @end ignore @@ -17929,7 +19106,7 @@ or equal to), which yields data sorted in descending order. Next comes a sorting function. It is parameterized with the starting and ending field numbers and the comparison function. It builds an array with -the data and calls @code{quicksort} appropriately, and then formats the +the data and calls @code{quicksort()} appropriately, and then formats the results as a single string: @example @@ -17977,7 +19154,7 @@ function rsort(first, last) @c endfile @end example -Here is an extended version of the data file: +Here is an extended version of the @value{DF}: @example @c file eg/data/class_data2 @@ -18028,21 +19205,80 @@ for (i = 1; i <= n; i++) @noindent @code{gawk} will look up the actual function to call only once. +@node Functions Summary +@section Summary + +@itemize @value{BULLET} +@item +@command{awk} provides built-in functions and lets you define your own +functions. + +@item +POSIX @command{awk} provides three kinds of built-in functions: numeric, +string, and I/O. @command{gawk} provides functions that work with values +representing time, do bit manipulation, sort arrays, and internationalize +and localize programs. 
@command{gawk} also provides several extensions to +some of standard functions, typically in the form of additional arguments. + +@item +Functions accept zero or more arguments and return a value. The +expressions that provide the argument values are completely evaluated +before the function is called. Order of evaluation is not defined. +The return value can be ignored. + +@item +The handling of backslash in @code{sub()} and @code{gsub()} is not simple. +It is more straightforward in @command{gawk}'s @code{gensub()} function, +but that function still requires care in its use. + +@item +User-defined functions provide important capabilities but come with +some syntactic inelegancies. In a function call, there cannot be any +space between the function name and the opening left parenthesis of the +argument list. Also, there is no provision for local variables, so the +convention is to add extra parameters, and to separate them visually +from the real parameters by extra whitespace. + +@item +User-defined functions may call other user-defined (and built-in) +functions and may call themselves recursively. Function parameters +``hide'' any global variables of the same names. + +@item +Scalar values are passed to user-defined functions by value. Array +parameters are passed by reference; any changes made by the function to +array parameters are thus visible after the function has returned. + +@item +Use the @code{return} statement to return from a user-defined function. +An optional expression becomes the function's return value. Only scalar +values may be returned by a function. + +@item +If a variable that has never been used is passed to a user-defined +function, how that function treats the variable can set its nature: +either scalar or array. + +@item +@command{gawk} provides indirect function calls using a special syntax. 
+By setting a variable to the name of a user-defined function, you can +determine at runtime what function will be called at that point in the +program. This is equivalent to function pointers in C and C++. + +@end itemize + @c ENDOFRANGE funcud -@iftex -@part Part II:@* Problem Solving With @command{awk} -@end iftex +@ifnotinfo +@part @value{PART2}Problem Solving With @command{awk} +@end ifnotinfo -@ignore @ifdocbook -@part Part II:@* Problem Solving With @command{awk} - Part II shows how to use @command{awk} and @command{gawk} for problem solving. There is lots of code here for you to read and learn from. It contains the following chapters: -@itemize @bullet +@itemize @value{BULLET} @item @ref{Library Functions}. @@ -18050,7 +19286,6 @@ It contains the following chapters: @ref{Sample Programs}. @end itemize @end ifdocbook -@end ignore @node Library Functions @chapter A Library of @command{awk} Functions @@ -18067,6 +19302,8 @@ it allows you to encapsulate algorithms and program tasks in a single place. It simplifies programming, making program development more manageable, and making programs more readable. +@cindex Kernighan, Brian +@cindex Plauger, P.J.@: In their seminal 1976 book, @cite{Software Tools},@footnote{Sadly, over 35 years later, many of the lessons taught by this book have yet to be learned by a vast number of practicing programmers.} Brian Kernighan @@ -18086,7 +19323,6 @@ that their statement is correct, this @value{CHAPTER} and @ref{Sample Programs}, provide a good-sized body of code for you to read, and we hope, to learn from. -@c 2e: USE TEXINFO-2 FUNCTION DEFINITION STUFF!!!!!!!!!!!!! This @value{CHAPTER} presents a library of useful @command{awk} functions. Many of the sample programs presented later in this @value{DOCUMENT} use these functions. @@ -18099,9 +19335,11 @@ these example library functions and programs from the Texinfo source for this @value{DOCUMENT}. (This has already been done as part of the @command{gawk} distribution.) 
+@ifclear FOR_PRINT If you have written one or more useful, general-purpose @command{awk} functions and would like to contribute them to the @command{awk} user community, see @ref{How To Contribute}, for more information. +@end ifclear @cindex portability, example programs The programs in this @value{CHAPTER} and in @@ -18110,7 +19348,7 @@ freely use features that are @command{gawk}-specific. Rewriting these programs for different implementations of @command{awk} is pretty straightforward. -@itemize @bullet +@itemize @value{BULLET} @item Diagnostic error messages are sent to @file{/dev/stderr}. Use @samp{| "cat 1>&2"} instead of @samp{> "/dev/stderr"} if your system @@ -18154,6 +19392,8 @@ comparisons use only lowercase letters. * Passwd Functions:: Functions for getting user information. * Group Functions:: Functions for getting group information. * Walking Arrays:: A function to walk arrays of arrays. +* Library Functions Summary:: Summary of library functions. +* Library exercises:: Exercises. @end menu @node Library Names @@ -18196,7 +19436,7 @@ with the user's program. @cindex underscore (@code{_}), in names of private variables In addition, several of the library functions use a prefix that helps indicate what function or set of functions use the variables---for example, -@code{_pw_byname} in the user database routines +@code{_pw_byname()} in the user database routines (@pxref{Passwd Functions}). This convention is recommended, since it even further decreases the chance of inadvertent conflict among variable names. Note that this @@ -18215,7 +19455,7 @@ The leading capital letter indicates that it is global, while the fact that the variable name is not all capital letters indicates that the variable is not one of @command{awk}'s built-in variables, such as @code{FS}. 
-@cindex @code{--dump-variables} option +@cindex @option{--dump-variables} option, using for library functions It is also important that @emph{all} variables in library functions that do not need to save state are, in fact, declared local.@footnote{@command{gawk}'s @option{--dump-variables} command-line @@ -18288,11 +19528,12 @@ provides an implementation for other versions of @command{awk}: # # Arnold Robbins, arnold@@skeeve.com, Public Domain # February, 2004 +# Revised June, 2014 @c endfile @end ignore @c file eg/lib/strtonum.awk -function mystrtonum(str, ret, chars, n, i, k, c) +function mystrtonum(str, ret, n, i, k, c) @{ if (str ~ /^0[0-7]*$/) @{ # octal @@ -18305,7 +19546,7 @@ function mystrtonum(str, ret, chars, n, i, k, c) ret = ret * 8 + k @} - @} else if (str ~ /^0[xX][[:xdigit:]]+/) @{ + @} else if (str ~ /^0[xX][[:xdigit:]]+$/) @{ # hexadecimal str = substr(str, 3) # lop off leading 0x n = length(str) @@ -18313,10 +19554,7 @@ function mystrtonum(str, ret, chars, n, i, k, c) for (i = 1; i <= n; i++) @{ c = substr(str, i, 1) c = tolower(c) - if ((k = index("0123456789", c)) > 0) - k-- # adjust for 1-basing in awk - else if ((k = index("abcdef", c)) > 0) - k += 9 + k = index("123456789abcdef", c) ret = ret * 16 + k @} @@ -18484,7 +19722,7 @@ An @code{END} rule is automatically added to the program calling @code{assert()}. Normally, if a program consists of just a @code{BEGIN} rule, the input files and/or standard input are not read. However, now that the program has an @code{END} rule, @command{awk} -attempts to read the input data files or standard input +attempts to read the input @value{DF}s or standard input (@pxref{Using BEGIN/END}), most likely causing the program to hang as it waits for input. @@ -18510,9 +19748,9 @@ with an @code{exit} statement. The way @code{printf} and @code{sprintf()} (@pxref{Printf}) perform rounding often depends upon the system's C @code{sprintf()} -subroutine. 
On many machines, @code{sprintf()} rounding is ``unbiased,'' -which means it doesn't always round a trailing @samp{.5} up, contrary -to naive expectations. In unbiased rounding, @samp{.5} rounds to even, +subroutine. On many machines, @code{sprintf()} rounding is @dfn{unbiased}, +which means it doesn't always round a trailing .5 up, contrary +to naive expectations. In unbiased rounding, .5 rounds to even, rather than always up, so 1.5 rounds to 2 but 4.5 rounds to 4. This means that if you are using a format that does rounding (e.g., @code{"%.0f"}), you should check what your system does. The following function does @@ -18561,7 +19799,7 @@ function round(x, ival, aval, fraction) @c don't include test harness in the file that gets installed # test harness -@{ print $0, round($0) @} +# @{ print $0, round($0) @} @end example @node Cliff Random Function @@ -18628,6 +19866,7 @@ reason to build them into the @command{awk} interpreter: @cindex @code{ord()} user-defined function @cindex @code{chr()} user-defined function +@cindex @code{_ord_init()} user-defined function @example @c file eg/lib/ord.awk # ord.awk --- do ord and chr @@ -18674,8 +19913,9 @@ function _ord_init( low, high, i, t) @cindex character sets (machine character encodings) @cindex ASCII @cindex EBCDIC +@cindex Unicode @cindex mark parity -Some explanation of the numbers used by @code{chr} is worthwhile. +Some explanation of the numbers used by @code{_ord_init()} is worthwhile. The most prominent character set in use today is ASCII.@footnote{This is changing; many systems use Unicode, a very large character set that includes ASCII as a subset. On systems with full Unicode support, @@ -18686,7 +19926,7 @@ Although an defines characters that use the values from 0 to 127.@footnote{ASCII has been extended in many countries to use the values from 128 to 255 for country-specific characters. 
If your system uses these extensions, -you can simplify @code{_ord_init} to loop from 0 to 255.} +you can simplify @code{_ord_init()} to loop from 0 to 255.} In the now distant past, at least one minicomputer manufacturer @c Pr1me, blech @@ -18853,7 +20093,7 @@ function getlocaltime(time, ret, now, i) now = systime() # return date(1)-style output - ret = strftime("%a %b %e %H:%M:%S %Z %Y", now) + ret = strftime(PROCINFO["strftime"], now) # clear out target array delete time @@ -18969,7 +20209,7 @@ This tests the result to see if it is empty or not. An equivalent test would be @samp{contents == ""}. @node Data File Management -@section Data File Management +@section @value{DDF} Management @c STARTOFRANGE dataf @cindex files, managing @@ -18978,7 +20218,7 @@ test would be @samp{contents == ""}. @c STARTOFRANGE flibdataf @cindex functions, library, managing data files This @value{SECTION} presents functions that are useful for managing -command-line data files. +command-line @value{DF}s. @menu * Filetrans Function:: A function for handling data file transitions. @@ -18989,7 +20229,7 @@ command-line data files. @end menu @node Filetrans Function -@subsection Noting Data File Boundaries +@subsection Noting @value{DDF} Boundaries @cindex files, managing, data file boundaries @cindex files, initialization and cleanup @@ -18997,8 +20237,8 @@ The @code{BEGIN} and @code{END} rules are each executed exactly once at the beginning and end of your @command{awk} program, respectively (@pxref{BEGIN/END}). We (the @command{gawk} authors) once had a user who mistakenly thought that the -@code{BEGIN} rule is executed at the beginning of each data file and the -@code{END} rule is executed at the end of each data file. +@code{BEGIN} rule is executed at the beginning of each @value{DF} and the +@code{END} rule is executed at the end of each @value{DF}. 
When informed that this was not the case, the user requested that we add new special @@ -19009,7 +20249,7 @@ Adding these special patterns to @command{gawk} wasn't necessary; the job can be done cleanly in @command{awk} itself, as illustrated by the following library program. It arranges to call two user-supplied functions, @code{beginfile()} and -@code{endfile()}, at the beginning and end of each data file. +@code{endfile()}, at the beginning and end of each @value{DF}. Besides solving the problem in only nine(!) lines of code, it does so @emph{portably}; this works with any implementation of @command{awk}: @@ -19040,17 +20280,17 @@ This file must be loaded before the user's ``main'' program, so that the rule it supplies is executed first. This rule relies on @command{awk}'s @code{FILENAME} variable that -automatically changes for each new data file. The current file name is +automatically changes for each new @value{DF}. The current @value{FN} is saved in a private variable, @code{_oldfilename}. If @code{FILENAME} does -not equal @code{_oldfilename}, then a new data file is being processed and +not equal @code{_oldfilename}, then a new @value{DF} is being processed and it is necessary to call @code{endfile()} for the old file. Because @code{endfile()} should only be called if a file has been processed, the program first checks to make sure that @code{_oldfilename} is not the null -string. The program then assigns the current file name to +string. The program then assigns the current @value{FN} to @code{_oldfilename} and calls @code{beginfile()} for the file. Because, like all @command{awk} variables, @code{_oldfilename} is initialized to the null string, this rule executes correctly even for the -first data file. +first @value{DF}. The program also supplies an @code{END} rule to do the final processing for the last file. 
Because this @code{END} rule comes before any @code{END} rules @@ -19059,7 +20299,7 @@ again the value of multiple @code{BEGIN} and @code{END} rules should be clear. @cindex @code{beginfile()} user-defined function @cindex @code{endfile()} user-defined function -If the same data file occurs twice in a row on the command line, then +If the same @value{DF} occurs twice in a row on the command line, then @code{endfile()} and @code{beginfile()} are not executed at the end of the first pass and at the beginning of the second pass. The following version solves the problem: @@ -19174,16 +20414,18 @@ The @code{rewind()} function also relies on the @code{nextfile} keyword (@pxref{Nextfile Statement}). @node File Checking -@subsection Checking for Readable Data Files +@subsection Checking for Readable @value{DDF}s @cindex troubleshooting, readable data files @cindex readable data files@comma{} checking @cindex files, skipping -Normally, if you give @command{awk} a data file that isn't readable, -it stops with a fatal error. There are times when you -might want to just ignore such files and keep going. You can -do this by prepending the following program to your @command{awk} -program: +Normally, if you give @command{awk} a @value{DF} that isn't readable, +it stops with a fatal error. There are times when you might want to +just ignore such files and keep going.@footnote{The @code{BEGINFILE} +special pattern (@pxref{BEGINFILE/ENDFILE}) provides an alternative +mechanism for dealing with files that can't be opened. However, the +code here provides a portable solution.} You can do this by prepending +the following program to your @command{awk} program: @cindex @code{readable.awk} program @example @@ -19221,22 +20463,22 @@ skips the file (since it's no longer in the list). See also @ref{ARGC and ARGV}. 
@node Empty Files -@subsection Checking For Zero-length Files +@subsection Checking for Zero-length Files All known @command{awk} implementations silently skip over zero-length files. This is a by-product of @command{awk}'s implicit read-a-record-and-match-against-the-rules loop: when @command{awk} tries to read a record from an empty file, it immediately receives an end of file indication, closes the file, and proceeds on to the next -command-line data file, @emph{without} executing any user-level +command-line @value{DF}, @emph{without} executing any user-level @command{awk} program code. Using @command{gawk}'s @code{ARGIND} variable (@pxref{Built-in Variables}), it is possible to detect when an empty -data file has been skipped. Similar to the library file presented +@value{DF} has been skipped. Similar to the library file presented in @ref{Filetrans Function}, the following library file calls a function named @code{zerofile()} that the user must provide. The arguments passed are -the file name and the position in @code{ARGV} where it was found: +the @value{FN} and the position in @code{ARGV} where it was found: @cindex @code{zerofile.awk} program @example @@ -19283,56 +20525,16 @@ the end of the command-line arguments. Note that the test in the condition of the @code{for} loop uses the @samp{<=} operator, not @samp{<}. -As an exercise, you might consider whether this same problem can -be solved without relying on @command{gawk}'s @code{ARGIND} variable. - -As a second exercise, revise this code to handle the case where -an intervening value in @code{ARGV} is a variable assignment. 
- -@ignore -# zerofile2.awk --- same thing, portably - -BEGIN @{ - ARGIND = Argind = 0 - for (i = 1; i < ARGC; i++) - Fnames[ARGV[i]]++ - -@} -FNR == 1 @{ - while (ARGV[ARGIND] != FILENAME) - ARGIND++ - Seen[FILENAME]++ - if (Seen[FILENAME] == Fnames[FILENAME]) - do - ARGIND++ - while (ARGV[ARGIND] != FILENAME) -@} -ARGIND > Argind + 1 @{ - for (Argind++; Argind < ARGIND; Argind++) - zerofile(ARGV[Argind], Argind) -@} -ARGIND != Argind @{ - Argind = ARGIND -@} -END @{ - if (ARGIND < ARGC - 1) - ARGIND = ARGC - 1 - if (ARGIND > Argind) - for (Argind++; Argind <= ARGIND; Argind++) - zerofile(ARGV[Argind], Argind) -@} -@end ignore - @node Ignoring Assigns -@subsection Treating Assignments as File Names +@subsection Treating Assignments as @value{FFN}s @cindex assignments as filenames @cindex filenames, assignments as Occasionally, you might not want @command{awk} to process command-line variable assignments (@pxref{Assignment Options}). -In particular, if you have a file name that contain an @samp{=} character, -@command{awk} treats the file name as an assignment, and does not process it. +In particular, if you have a @value{FN} that contains an @samp{=} character, +@command{awk} treats the @value{FN} as an assignment, and does not process it. Some users have suggested an additional command-line option for @command{gawk} to disable command-line assignments. However, some simple programming with @@ -19376,7 +20578,7 @@ awk -v No_command_assign=1 -f noassign.awk -f yourprog.awk * The function works by looping through the arguments. It prepends @samp{./} to any argument that matches the form -of a variable assignment, turning that argument into a file name. +of a variable assignment, turning that argument into a @value{FN}. The use of @code{No_command_assign} allows you to disable command-line assignments at invocation time, by giving the variable a true value. @@ -19460,7 +20662,6 @@ application might want to print its own error message.) 
@item optopt The letter representing the command-line option. -@c While not usually documented, most versions supply this variable. @end table The following C fragment shows how @code{getopt()} might process command-line @@ -19511,7 +20712,6 @@ necessary for accessing individual characters function was written before @command{gawk} acquired the ability to split strings into single characters using @code{""} as the separator. We have left it alone, since using @code{substr()} is more portable.} -@c FIXME: could use split(str, a, "") to do it more easily. The discussion that follows walks through the code a bit at a time: @@ -19694,7 +20894,7 @@ BEGIN @{ # test program if (_getopt_test) @{ while ((_go_c = getopt(ARGC, ARGV, "ab:cd")) != -1) - printf("c = <%c>, optarg = <%s>\n", + printf("c = <%c>, Optarg = <%s>\n", _go_c, Optarg) printf("non-option arguments:\n") for (; Optind < ARGC; Optind++) @@ -19710,32 +20910,31 @@ result of two sample runs of the test program: @example $ @kbd{awk -f getopt.awk -v _getopt_test=1 -- -a -cbARG bax -x} -@print{} c = <a>, optarg = <> -@print{} c = <c>, optarg = <> -@print{} c = <b>, optarg = <ARG> +@print{} c = <a>, Optarg = <> +@print{} c = <c>, Optarg = <> +@print{} c = <b>, Optarg = <ARG> @print{} non-option arguments: @print{} ARGV[3] = <bax> @print{} ARGV[4] = <-x> $ @kbd{awk -f getopt.awk -v _getopt_test=1 -- -a -x -- xyz abc} -@print{} c = <a>, optarg = <> +@print{} c = <a>, Optarg = <> @error{} x -- invalid option -@print{} c = <?>, optarg = <> +@print{} c = <?>, Optarg = <> @print{} non-option arguments: @print{} ARGV[4] = <xyz> @print{} ARGV[5] = <abc> @end example -In both runs, -the first @option{--} terminates the arguments to @command{awk}, so that it does -not try to interpret the @option{-a}, etc., as its own options. +In both runs, the first @option{--} terminates the arguments to +@command{awk}, so that it does not try to interpret the @option{-a}, +etc., as its own options. 
@quotation NOTE -After @code{getopt()} is through, it is the responsibility of the user level -code to -clear out all the elements of @code{ARGV} from 1 to @code{Optind}, -so that @command{awk} does not try to process the command-line options -as file names. +After @code{getopt()} is through, it is the responsibility of the +user level code to clear out all the elements of @code{ARGV} from 1 +to @code{Optind}, so that @command{awk} does not try to process the +command-line options as @value{FN}s. @end quotation Several of the sample programs presented in @@ -19752,7 +20951,7 @@ use @code{getopt()} to process their arguments. @c STARTOFRANGE libfudata @cindex libraries of @command{awk} functions, user database, reading @c STARTOFRANGE flibudata -@cindex functions, library, user database, reading +@cindex functions, library, user database@comma{} reading @c STARTOFRANGE udatar @cindex user database@comma{} reading @c STARTOFRANGE dataur @@ -19797,14 +20996,12 @@ no more entries, it returns @code{NULL}, the null pointer. When this happens, the C program should call @code{endpwent()} to close the database. Following is @command{pwcat}, a C program that ``cats'' the password database: -@c Use old style function header for portability to old systems (SunOS, HP/UX). - @example @c file eg/lib/pwcat.c /* * pwcat.c * - * Generate a printable version of the password database + * Generate a printable version of the password database. */ @c endfile @ignore @@ -20001,7 +21198,7 @@ from anywhere within a user's program, and the user may have his or her own way of splitting records and fields. -@cindex @code{PROCINFO} array +@cindex @code{PROCINFO} array, testing the field splitting The @code{using_fw} variable checks @code{PROCINFO["FS"]}, which is @code{"FIELDWIDTHS"} if field splitting is being done with @code{FIELDWIDTHS}. This makes it possible to restore the correct @@ -20010,7 +21207,7 @@ field-splitting mechanism later. 
The test can only be true for or on some other @command{awk} implementation. The code that checks for using @code{FPAT}, using @code{using_fpat} -and @code{PROCINFO["FS"]} is similar. +and @code{PROCINFO["FS"]}, is similar. The main part of the function uses a loop to read database lines, split the line into fields, and then store the line into each array as necessary. @@ -20040,10 +21237,9 @@ function getpwnam(name) @end example @cindex @code{getpwuid()} function (C library) -Similarly, -the @code{getpwuid} function takes a user ID number argument. If that -user number is in the database, it returns the appropriate line. Otherwise, it -returns the null string: +Similarly, the @code{getpwuid()} function takes a user ID number +argument. If that user number is in the database, it returns the +appropriate line. Otherwise, it returns the null string: @cindex @code{getpwuid()} user-defined function @example @@ -20120,12 +21316,12 @@ uses these functions. @c STARTOFRANGE libfgdata @cindex libraries of @command{awk} functions, group database, reading @c STARTOFRANGE flibgdata -@cindex functions, library, group database, reading +@cindex functions, library, group database@comma{} reading @c STARTOFRANGE gdatar @cindex group database, reading @c STARTOFRANGE datagr @cindex database, group, reading -@cindex @code{PROCINFO} array +@cindex @code{PROCINFO} array, and group membership @cindex @code{getgrent()} function (C library) @cindex @code{getgrent()} user-defined function @cindex groups@comma{} information about @@ -20151,7 +21347,7 @@ is as follows: /* * grcat.c * - * Generate a printable version of the group database + * Generate a printable version of the group database. */ @c endfile @ignore @@ -20238,7 +21434,7 @@ it is usually empty or set to @samp{*}. @item Group ID Number The group's numeric group ID number; -this number must be unique within the file. +the association of name to number must be unique within the file. 
(On some systems it's a C @code{long}, and not an @code{int}. Thus we cast it to @code{long} for all cases.) @@ -20368,16 +21564,16 @@ database for the same group. This is common when a group has a large number of members. A pair of such entries might look like the following: @example -tvpeople:*:101:johnny,jay,arsenio +tvpeople:*:101:johny,jay,arsenio tvpeople:*:101:david,conan,tom,joan @end example For this reason, @code{_gr_init()} looks to see if a group name or group ID number is already seen. If it is, then the user names are -simply concatenated onto the previous list of users. (There is actually a +simply concatenated onto the previous list of users.@footnote{There is actually a subtle problem with the code just presented. Suppose that the first time there were no names. This code adds the names with -a leading comma. It also doesn't check that there is a @code{$4}.) +a leading comma. It also doesn't check that there is a @code{$4}.} Finally, @code{_gr_init()} closes the pipeline to @command{grcat}, restores @code{FS} (and @code{FIELDWIDTHS} or @code{FPAT} if necessary), @code{RS}, and @code{$0}, @@ -20537,24 +21733,121 @@ $ @kbd{gawk -f walk_array.awk} @print{} a[3] = 3 @end example -Walking an array and processing each element is a general-purpose -operation. You might want to consider generalizing the @code{walk_array()} -function by adding an additional parameter named @code{process}. - -Then, inside the loop, instead of simply printing the array element's -index and value, use the indirect function call syntax -(@pxref{Indirect Calls}) on @code{process}, passing it the index -and the value. - -When calling @code{walk_array()}, you would pass the name of a user-defined -function that expects to receive and index and a value, and then processes -the element. 
- - @c ENDOFRANGE libfgdata @c ENDOFRANGE flibgdata @c ENDOFRANGE gdatar @c ENDOFRANGE libf + +@node Library Functions Summary +@section Summary + +@itemize @value{BULLET} +@item +Reading programs is an excellent way to learn Good Programming. +The functions provided in this @value{CHAPTER} and the next are intended +to serve that purpose. + +@item +When writing general-purpose library functions, put some thought into how +to name any global variables so that they won't conflict with variables +from a user's program. + +@item +The functions presented here fit into the following categories: + +@c nested list +@table @asis +@item General problems +Number to string conversion, assertions, rounding, random number +generation, converting characters to numbers, joining strings, getting +easily usable time-of-day information, and reading a whole file in +one shot. + +@item Managing @value{DF}s +Noting @value{DF} boundaries, rereading the current file, checking for +readable files, checking for zero-length files, and treating assignments +as @value{FN}s. + +@item Processing command-line options +An @command{awk} version of the standard C @code{getopt()} function. + +@item Reading the user and group databases +Two sets of routines that parallel the C library versions. + +@item Traversing arrays of arrays +A simple function to traverse an array of arrays to any depth. +@end table +@c end nested list + +@end itemize + +@node Library exercises +@section Exercises + +@enumerate +@item +In @ref{Empty Files}, we presented the @file{zerofile.awk} program, +which made use of @command{gawk}'s @code{ARGIND} variable. Can this +problem be solved without relying on @code{ARGIND}? If so, how? 
+ +@ignore +# zerofile2.awk --- same thing, portably + +BEGIN @{ + ARGIND = Argind = 0 + for (i = 1; i < ARGC; i++) + Fnames[ARGV[i]]++ + +@} +FNR == 1 @{ + while (ARGV[ARGIND] != FILENAME) + ARGIND++ + Seen[FILENAME]++ + if (Seen[FILENAME] == Fnames[FILENAME]) + do + ARGIND++ + while (ARGV[ARGIND] != FILENAME) +@} +ARGIND > Argind + 1 @{ + for (Argind++; Argind < ARGIND; Argind++) + zerofile(ARGV[Argind], Argind) +@} +ARGIND != Argind @{ + Argind = ARGIND +@} +END @{ + if (ARGIND < ARGC - 1) + ARGIND = ARGC - 1 + if (ARGIND > Argind) + for (Argind++; Argind <= ARGIND; Argind++) + zerofile(ARGV[Argind], Argind) +@} +@end ignore + +@item +As a related challenge, revise that code to handle the case where +an intervening value in @code{ARGV} is a variable assignment. + +@item +@ref{Walking Arrays}, presented a function that walked a multidimensional +array to print it out. However, walking an array and processing +each element is a general-purpose operation. Generalize the +@code{walk_array()} function by adding an additional parameter named +@code{process}. + +Then, inside the loop, instead of printing the array element's index and +value, use the indirect function call syntax (@pxref{Indirect Calls}) +on @code{process}, passing it the index and the value. + +When calling @code{walk_array()}, you would pass the name of a +user-defined function that expects to receive an index and a value, +and then processes the element. + +Test your new version by printing the array; you should end up with +output identical to that of the original version. + +@end enumerate + @c ENDOFRANGE flib @c ENDOFRANGE fudlib @c ENDOFRANGE datagr @@ -20564,11 +21857,13 @@ the element. @c STARTOFRANGE awkpex @cindex @command{awk} programs, examples of +@c FULLXREF ON @ref{Library Functions}, presents the idea that reading programs in a language contributes to learning that language. 
This @value{CHAPTER} continues that theme, presenting a potpourri of @command{awk} programs for your reading enjoyment. +@c FULLXREF OFF @ifnotinfo There are three sections. The first describes how to run the programs presented @@ -20595,6 +21890,8 @@ Many of these programs use library functions presented in * Running Examples:: How to run these examples. * Clones:: Clones of common utilities. * Miscellaneous Programs:: Some interesting @command{awk} programs. +* Programs Summary:: Summary of programs. +* Programs Exercises:: Exercises. @end menu @node Running Examples @@ -20609,7 +21906,7 @@ awk -f @var{program} -- @var{options} @var{files} @noindent Here, @var{program} is the name of the @command{awk} program (such as @file{cut.awk}), @var{options} are any command-line options for the -program that start with a @samp{-}, and @var{files} are the actual data files. +program that start with a @samp{-}, and @var{files} are the actual @value{DF}s. If your system supports the @samp{#!} executable interpreter mechanism (@pxref{Executable Scripts}), @@ -20747,13 +22044,7 @@ function usage( e1, e2) @noindent The variables @code{e1} and @code{e2} are used so that the function -fits nicely on the -@ifnotinfo -page. -@end ifnotinfo -@ifnottex -screen. -@end ifnottex +fits nicely on the @value{PAGE}. @cindex @code{BEGIN} pattern, running @command{awk} programs and @cindex @code{FS} variable, running @command{awk} programs and @@ -20783,7 +22074,7 @@ BEGIN \ OFS = "" @} else if (c == "d") @{ if (length(Optarg) > 1) @{ - printf("Using first character of %s" \ + printf("cut: using first character of %s" \ " for delimiter\n", Optarg) > "/dev/stderr" Optarg = substr(Optarg, 1, 1) @} @@ -20792,7 +22083,7 @@ BEGIN \ if (FS == " ") # defeat awk semantics FS = "[ ]" @} else if (c == "s") - suppress++ + suppress = 1 else usage() @} @@ -20814,7 +22105,7 @@ spaces. 
Also remember that after @code{getopt()} is through we have to clear out all the elements of @code{ARGV} from 1 to @code{Optind}, so that @command{awk} does not try to process the command-line options -as file names. +as @value{FN}s. After dealing with the command-line options, the program verifies that the options make sense. Only one or the other of @option{-c} and @option{-f} @@ -20864,7 +22155,7 @@ function set_fieldlist( n, m, i, j, k, f, g) m = split(f[i], g, "-") @group if (m != 2 || g[1] >= g[2]) @{ - printf("bad field list: %s\n", + printf("cut: bad field list: %s\n", f[i]) > "/dev/stderr" exit 1 @} @@ -20901,7 +22192,7 @@ complete field list, including filler fields: @example @c file eg/prog/cut.awk -function set_charlist( field, i, j, f, g, t, +function set_charlist( field, i, j, f, g, n, m, t, filler, last, len) @{ field = 1 # count total fields @@ -20911,7 +22202,7 @@ function set_charlist( field, i, j, f, g, t, if (index(f[i], "-") != 0) @{ # range m = split(f[i], g, "-") if (m != 2 || g[1] >= g[2]) @{ - printf("bad character list: %s\n", + printf("cut: bad character list: %s\n", f[i]) > "/dev/stderr" exit 1 @} @@ -20987,7 +22278,6 @@ of picking the input line apart by characters. @c ENDOFRANGE ficut @c ENDOFRANGE colcut -@c Exercise: Rewrite using split with "". @node Egrep Program @subsection Searching for Regular Expressions in Files @@ -20998,20 +22288,21 @@ of picking the input line apart by characters. @cindex searching, files for regular expressions @c STARTOFRANGE fsregexp @cindex files, searching for regular expressions +@c STARTOFRANGE egrep @cindex @command{egrep} utility The @command{egrep} utility searches files for patterns. It uses regular expressions that are almost identical to those available in @command{awk} (@pxref{Regexp}). 
You invoke it as follows: -@example -egrep @r{[} @var{options} @r{]} '@var{pattern}' @var{files} @dots{} -@end example +@display +@command{egrep} [@var{options}] @code{'@var{pattern}'} @var{files} @dots{} +@end display The @var{pattern} is a regular expression. In typical usage, the regular expression is quoted to prevent the shell from expanding any of the -special characters as file name wildcards. Normally, @command{egrep} -prints the lines that matched. If multiple file names are provided on +special characters as @value{FN} wildcards. Normally, @command{egrep} +prints the lines that matched. If multiple @value{FN}s are provided on the command line, each output line is preceded by the name of the file and a colon. @@ -21102,7 +22393,7 @@ pattern is supplied with @option{-e}, the first nonoption on the command line is used. The @command{awk} command-line arguments up to @code{ARGV[Optind]} are cleared, so that @command{awk} won't try to process them as files. If no files are specified, the standard input is used, and if multiple files are -specified, we make sure to note this so that the file names can precede the +specified, we make sure to note this so that the @value{FN}s can precede the matched lines in the output: @example @@ -21136,8 +22427,6 @@ if a match happens, we output the translated line, not the original.} The rule is commented out since it is not necessary with @command{gawk}: -@c Exercise: Fix this, w/array and new line as key to original line - @example @c file eg/prog/egrep.awk #@{ @@ -21188,6 +22477,11 @@ function endfile(file) @c endfile @end example +The @code{BEGINFILE} and @code{ENDFILE} special patterns +(@pxref{BEGINFILE/ENDFILE}) could be used, but then the program would be +@command{gawk}-specific. Additionally, this example was written before +@command{gawk} acquired @code{BEGINFILE} and @code{ENDFILE}. + The following rule does most of the work of matching lines. The variable @code{matches} is true if the line matched the pattern. 
If the user wants lines that did not match, the sense of @code{matches} is inverted @@ -21200,9 +22494,9 @@ A number of additional tests are made, but they are only done if we are not counting lines. First, if the user only wants exit status (@code{no_print} is true), then it is enough to know that @emph{one} line in this file matched, and we can skip on to the next file with -@code{nextfile}. Similarly, if we are only printing file names, we can -print the file name, and then skip to the next file with @code{nextfile}. -Finally, each line is printed, with a leading file name and colon +@code{nextfile}. Similarly, if we are only printing @value{FN}s, we can +print the @value{FN}, and then skip to the next file with @code{nextfile}. +Finally, each line is printed, with a leading @value{FN} and colon if necessary: @cindex @code{!} (exclamation point), @code{!} operator @@ -21244,9 +22538,7 @@ there are no matches, the exit status is one; otherwise it is zero: @c file eg/prog/egrep.awk END \ @{ - if (total == 0) - exit 1 - exit 0 + exit (total == 0) @} @c endfile @end example @@ -21283,12 +22575,14 @@ or not. @c ENDOFRANGE regexps @c ENDOFRANGE sfregexp @c ENDOFRANGE fsregexp +@c ENDOFRANGE egrep @node Id Program @subsection Printing out User Information @cindex printing, user information @cindex users, information about, printing +@c STARTOFRANGE id @cindex @command{id} utility The @command{id} utility lists a user's real and effective user ID numbers, real and effective group ID numbers, and the user's group set, if any. @@ -21298,10 +22592,10 @@ corresponding user and group names. 
The output might look like this: @example $ @kbd{id} -@print{} uid=500(arnold) gid=500(arnold) groups=6(disk),7(lp),19(floppy) +@print{} uid=1000(arnold) gid=1000(arnold) groups=1000(arnold),4(adm),7(lp),27(sudo) @end example -@cindex @code{PROCINFO} array +@cindex @code{PROCINFO} array, and user and group ID numbers This information is part of what is provided by @command{gawk}'s @code{PROCINFO} array (@pxref{Built-in Variables}). However, the @command{id} utility provides a more palatable output than just @@ -21334,6 +22628,7 @@ numbers: # Arnold Robbins, arnold@@skeeve.com, Public Domain # May 1993 # Revised February 1996 +# Revised May 2014 @c endfile @end ignore @@ -21353,34 +22648,26 @@ BEGIN \ printf("uid=%d", uid) pw = getpwuid(uid) - if (pw != "") @{ - split(pw, a, ":") - printf("(%s)", a[1]) - @} + if (pw != "") + pr_first_field(pw) if (euid != uid) @{ printf(" euid=%d", euid) pw = getpwuid(euid) - if (pw != "") @{ - split(pw, a, ":") - printf("(%s)", a[1]) - @} + if (pw != "") + pr_first_field(pw) @} printf(" gid=%d", gid) pw = getgrgid(gid) - if (pw != "") @{ - split(pw, a, ":") - printf("(%s)", a[1]) - @} + if (pw != "") + pr_first_field(pw) if (egid != gid) @{ printf(" egid=%d", egid) pw = getgrgid(egid) - if (pw != "") @{ - split(pw, a, ":") - printf("(%s)", a[1]) - @} + if (pw != "") + pr_first_field(pw) @} for (i = 1; ("group" i) in PROCINFO; i++) @{ @@ -21389,20 +22676,23 @@ BEGIN \ group = PROCINFO["group" i] printf("%d", group) pw = getgrgid(group) - if (pw != "") @{ - split(pw, a, ":") - printf("(%s)", a[1]) - @} + if (pw != "") + pr_first_field(pw) if (("group" (i+1)) in PROCINFO) printf(",") @} print "" @} + +function pr_first_field(str, a) +@{ + split(str, a, ":") + printf("(%s)", a[1]) +@} @c endfile @end example -@cindex @code{in} operator The test in the @code{for} loop is worth noting. 
Any supplementary groups in the @code{PROCINFO} array have the indices @code{"group1"} through @code{"group@var{N}"} for some @@ -21412,19 +22702,18 @@ there are. This loop works by starting at one, concatenating the value with @code{"group"}, and then using @code{in} to see if that value is -in the array. Eventually, @code{i} is incremented past +in the array (@pxref{Reference to Elements}). Eventually, @code{i} is incremented past the last group in the array and the loop exits. The loop is also correct if there are @emph{no} supplementary groups; then the condition is false the first time it's tested, and the loop body never executes. -@c exercise!!! -@ignore -The POSIX version of @command{id} takes arguments that control which -information is printed. Modify this version to accept the same -arguments and perform in the same way. -@end ignore +The @code{pr_first_field()} function simply isolates out some +code that is used repeatedly, making the whole program +slightly shorter and cleaner. + +@c ENDOFRANGE id @node Split Program @subsection Splitting a Large File into Pieces @@ -21433,15 +22722,16 @@ arguments and perform in the same way. @c STARTOFRANGE filspl @cindex files, splitting +@c STARTOFRANGE split @cindex @code{split} utility The @command{split} program splits large text files into smaller pieces. Usage is as follows:@footnote{This is the traditional usage. The POSIX usage is different, but not relevant for what the program aims to demonstrate.} -@example -split @r{[}-@var{count}@r{]} file @r{[} @var{prefix} @r{]} -@end example +@display +@command{split} [@code{-@var{count}}] [@var{file}] [@var{prefix}] +@end display By default, the output files are named @file{xaa}, @file{xab}, and so on. Each file has @@ -21450,7 +22740,7 @@ number of lines in each file, supply a number on the command line preceded with a minus; e.g., @samp{-500} for files with 500 lines in them instead of 1000. 
To change the name of the output files to something like @file{myfileaa}, @file{myfileab}, and so on, supply an additional -argument that specifies the file name prefix. +argument that specifies the @value{FN} prefix. Here is a version of @command{split} in @command{awk}. It uses the @code{ord()} and @code{chr()} functions presented in @@ -21460,8 +22750,8 @@ The program first sets its defaults, and then tests to make sure there are not too many arguments. It then looks at each argument in turn. The first argument could be a minus sign followed by a number. If it is, this happens to look like a negative number, so it is made positive, and that is the -count of lines. The data file name is skipped over and the final argument -is used as the prefix for the output file names: +count of lines. The @value{DF} name is skipped over and the final argument +is used as the prefix for the output @value{FN}s: @cindex @code{split.awk} program @example @@ -21475,11 +22765,12 @@ is used as the prefix for the output file names: # # Arnold Robbins, arnold@@skeeve.com, Public Domain # May 1993 +# Revised slightly, May 2014 @c endfile @end ignore @c file eg/prog/split.awk -# usage: split [-num] [file] [outname] +# usage: split [-count] [file] [outname] BEGIN @{ outfile = "x" # default @@ -21488,7 +22779,7 @@ BEGIN @{ usage() i = 1 - if (ARGV[i] ~ /^-[[:digit:]]+$/) @{ + if (i in ARGV && ARGV[i] ~ /^-[[:digit:]]+$/) @{ count = -ARGV[i] ARGV[i] = "" i++ @@ -21510,7 +22801,7 @@ BEGIN @{ The next rule does most of the work. @code{tcount} (temporary count) tracks how many lines have been printed to the output file so far. If it is greater than @code{count}, it is time to close the current file and start a new one. -@code{s1} and @code{s2} track the current suffixes for the file name. If +@code{s1} and @code{s2} track the current suffixes for the @value{FN}. If they are both @samp{z}, the file is just too big. 
Otherwise, @code{s1} moves to the next letter in the alphabet and @code{s2} starts over again at @samp{a}: @@ -21542,8 +22833,6 @@ moves to the next letter in the alphabet and @code{s2} starts over again at @c endfile @end example -@c Exercise: do this with just awk builtin functions, index("abc..."), substr, etc. - @noindent The @code{usage()} function simply prints an error message and exits: @@ -21560,36 +22849,30 @@ function usage( e) @noindent The variable @code{e} is used so that the function -fits nicely on the -@ifinfo -screen. -@end ifinfo -@ifnotinfo -page. -@end ifnotinfo +fits nicely on the @value{PAGE}. This program is a bit sloppy; it relies on @command{awk} to automatically close the last file instead of doing it in an @code{END} rule. It also assumes that letters are contiguous in the character set, which isn't true for EBCDIC systems. -@c Exercise: Fix these problems. -@c BFD... @c ENDOFRANGE filspl +@c ENDOFRANGE split @node Tee Program @subsection Duplicating Output into Multiple Files @cindex files, multiple@comma{} duplicating output into @cindex output, duplicating into files +@c STARTOFRANGE tee @cindex @code{tee} utility The @code{tee} program is known as a ``pipe fitting.'' @code{tee} copies its standard input to its standard output and also duplicates it to the files named on the command line. Its usage is as follows: -@example -tee @r{[}-a@r{]} file @dots{} -@end example +@display +@command{tee} [@option{-a}] @var{file} @dots{} +@end display The @option{-a} option tells @code{tee} to append to the named files, instead of truncating them and starting over. @@ -21598,13 +22881,13 @@ The @code{BEGIN} rule first makes a copy of all the command-line arguments into an array named @code{copy}. @code{ARGV[0]} is not copied, since it is not needed. @code{tee} cannot use @code{ARGV} directly, since @command{awk} attempts to -process each file name in @code{ARGV} as input data. +process each @value{FN} in @code{ARGV} as input data. 
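As a rough illustration of why the copy is needed, the sketch below saves the command-line arguments in a separate array and then blanks them in @code{ARGV}, so that @command{awk} falls back to reading the standard input instead of trying to open the arguments as input files. It is not the manual's @file{tee.awk}: the function name @code{show_args} and the sample arguments are hypothetical, and it assumes an @command{awk} that skips empty @code{ARGV} elements, as POSIX-style implementations do.

```shell
#!/bin/sh
# Hypothetical helper (not part of tee.awk): save the arguments, blank them
# in ARGV, and show that awk then reads standard input instead.
show_args() {
    awk '
    BEGIN {
        # Save every argument except ARGV[0], which is only the program name.
        for (i = 1; i < ARGC; i++)
            copy[i] = ARGV[i]

        # Blank the originals so awk does not treat them as input file names.
        for (i = 1; i < ARGC; i++)
            ARGV[i] = ""

        for (i = 1; i < ARGC; i++)
            printf("saved argument %d: %s\n", i, copy[i])
    }

    # With every ARGV element empty, input comes from standard input.
    { lines++ }

    END { printf("%d line(s) read from standard input\n", lines) }
    ' "$@"
}

# Illustrative names only; neither file is opened or created.
printf 'a\nb\n' | show_args out1.txt out2.txt
```

The saved names remain available in @code{copy} for later use, which is the point of the copy: the program keeps its list of output files while @command{awk} itself never sees them as input.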
@cindex flag variables If the first argument is @option{-a}, then the flag variable @code{append} is set to true, and both @code{ARGV[1]} and @code{copy[1]} are deleted. If @code{ARGC} is less than two, then no -file names were supplied and @code{tee} prints a usage message and exits. +@value{FN}s were supplied and @code{tee} prints a usage message and exits. Finally, @command{awk} is forced to read the standard input by setting @code{ARGV[1]} to @code{"-"} and @code{ARGC} to two: @@ -21696,6 +22979,7 @@ END \ @} @c endfile @end example +@c ENDOFRANGE tee @node Uniq Program @subsection Printing Nonduplicated Lines of Text @@ -21706,15 +22990,16 @@ END \ @cindex printing, unduplicated lines of text @c STARTOFRANGE tpul @cindex text@comma{} printing, unduplicated lines of +@c STARTOFRANGE uniq @cindex @command{uniq} utility The @command{uniq} utility reads sorted lines of data on its standard input, and by default removes duplicate lines. In other words, it only prints unique lines---hence the name. @command{uniq} has a number of options. The usage is as follows: -@example -uniq @r{[}-udc @r{[}-@var{n}@r{]]} @r{[}+@var{n}@r{]} @r{[} @var{input file} @r{[} @var{output file} @r{]]} -@end example +@display +@command{uniq} [@option{-udc} [@code{-@var{n}}]] [@code{+@var{n}}] [@var{inputfile} [@var{outputfile}]] +@end display The options for @command{uniq} are: @@ -21738,11 +23023,11 @@ by runs of spaces and/or TABs. Skip @var{n} characters before comparing lines. Any fields specified with @samp{-@var{n}} are skipped first. -@item @var{input file} +@item @var{inputfile} Data is read from the input file named on the command line, instead of from the standard input. -@item @var{output file} +@item @var{outputfile} The generated output is sent to the named output file, instead of to the standard output. 
@end table @@ -21957,6 +23242,7 @@ END @{ @end example @c ENDOFRANGE prunt @c ENDOFRANGE tpul +@c ENDOFRANGE uniq @node Wc Program @subsection Counting Things @@ -21973,13 +23259,14 @@ END @{ @cindex characters, counting @c STARTOFRANGE lico @cindex lines, counting +@c STARTOFRANGE wc @cindex @command{wc} utility The @command{wc} (word count) utility counts lines, words, and characters in one or more input files. Its usage is as follows: -@example -wc @r{[}-lwc@r{]} @r{[} @var{files} @dots{} @r{]} -@end example +@display +@command{wc} [@option{-lwc}] [@var{files} @dots{}] +@end display If no files are specified on the command line, @command{wc} reads its standard input. If there are multiple files, it also prints total counts for all @@ -22066,7 +23353,7 @@ BEGIN @{ @end example The @code{beginfile()} function is simple; it just resets the counts of lines, -words, and characters to zero, and saves the current file name in +words, and characters to zero, and saves the current @value{FN} in @code{fname}: @example @@ -22079,18 +23366,10 @@ function beginfile(file) @c endfile @end example -The @code{endfile()} function adds the current file's numbers to the running -totals of lines, words, and characters.@footnote{@command{wc} can't just use the value of -@code{FNR} in @code{endfile()}. If you examine -the code in -@ref{Filetrans Function}, -you will see that -@code{FNR} has already been reset by the time -@code{endfile()} is called.} It then prints out those numbers -for the file that was just read. It relies on @code{beginfile()} to reset the -numbers for the following data file: -@c FIXME: ONE DAY: make the above footnote an exercise, -@c instead of giving away the answer. +The @code{endfile()} function adds the current file's numbers to the +running totals of lines, words, and characters. It then prints out those +numbers for the file that was just read. 
It relies on @code{beginfile()} +to reset the numbers for the following @value{DF}: @example @c file eg/prog/wc.awk @@ -22155,6 +23434,7 @@ END @{ @c ENDOFRANGE lico @c ENDOFRANGE woco @c ENDOFRANGE chco +@c ENDOFRANGE wc @c ENDOFRANGE posimawk @node Miscellaneous Programs @@ -22259,6 +23539,32 @@ word, comparing it to the previous one: @i{Nothing cures insomnia like a ringing alarm clock.} @author Arnold Robbins @end quotation +@cindex Quanstrom, Erik +@ignore +Date: Sat, 15 Feb 2014 16:47:09 -0500 +Subject: Re: 9atom install question +Message-ID: <l2jcvx6j6mey60xnrkb0hhob.1392500829294@email.android.com> +From: Erik Quanstrom <quanstro@quanstro.net> +To: Aharon Robbins <arnold@skeeve.com> + +yes. + +- erik + +Aharon Robbins <arnold@skeeve.com> wrote: + +>> sleep is for web developers. +> +>Can I quote you, in the gawk manual? +> +>Thanks, +> +>Arnold +@end ignore +@quotation +@i{Sleep is for web developers.} +@author Erik Quanstrom +@end quotation @c STARTOFRANGE tialarm @cindex time, alarm clock example program @@ -22380,7 +23686,7 @@ is how long to wait before setting off the alarm: # how long to sleep for naptime = target - current if (naptime <= 0) @{ - print "time is in the past!" > "/dev/stderr" + print "alarm: time is in the past!" > "/dev/stderr" exit 1 @} @c endfile @@ -22423,6 +23729,7 @@ seconds are necessary: @c STARTOFRANGE chtra @cindex characters, transliterating +@c STARTOFRANGE tr @cindex @command{tr} utility The system @command{tr} utility transliterates characters. 
For example, it is often used to map uppercase letters into lowercase for further processing: @@ -22432,19 +23739,18 @@ often used to map uppercase letters into lowercase for further processing: @end example @command{tr} requires two lists of characters.@footnote{On some older -systems, -including Solaris, -@command{tr} may require that the lists be written as -range expressions enclosed in square brackets (@samp{[a-z]}) and quoted, -to prevent the shell from attempting a file name expansion. This is -not a feature.} When processing the input, the first character in the -first list is replaced with the first character in the second list, -the second character in the first list is replaced with the second -character in the second list, and so on. If there are more characters -in the ``from'' list than in the ``to'' list, the last character of the -``to'' list is used for the remaining characters in the ``from'' list. - -Some time ago, +systems, including Solaris, the system version of @command{tr} may require +that the lists be written as range expressions enclosed in square brackets +(@samp{[a-z]}) and quoted, to prevent the shell from attempting a file +name expansion. This is not a feature.} When processing the input, the +first character in the first list is replaced with the first character +in the second list, the second character in the first list is replaced +with the second character in the second list, and so on. If there are +more characters in the ``from'' list than in the ``to'' list, the last +character of the ``to'' list is used for the remaining characters in the +``from'' list. + +Once upon a time, @c early or mid-1989! a user proposed that a transliteration function should be added to @command{gawk}. 
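The mapping rule described above, including the reuse of the last ``to'' character when the ``to'' list is shorter, can be sketched in a few lines of @command{awk}. This is only an illustration under assumed names (@code{translit}, @code{from}, @code{to}, @code{map}). It is not the manual's own transliteration program, and it does not expand ranges such as @samp{a-z}.

```shell
#!/bin/sh
# Hypothetical wrapper: translit FROM TO
# Reads standard input and maps each character in FROM to the character at
# the same position in TO.  Note that awk's -v processes backslash escapes
# in the values, so the lists should not contain backslashes.
translit() {
    awk -v from="$1" -v to="$2" '
    BEGIN {
        if (to == "") {
            print "translit: empty \"to\" list" > "/dev/stderr"
            exit 1
        }
        flen = length(from)
        tlen = length(to)
        last = substr(to, tlen, 1)
        # Characters of "from" beyond the end of "to" map to the last
        # character of "to".
        for (i = 1; i <= flen; i++)
            map[substr(from, i, 1)] = (i <= tlen) ? substr(to, i, 1) : last
    }

    {
        out = ""
        n = length($0)
        for (i = 1; i <= n; i++) {
            c = substr($0, i, 1)
            out = out ((c in map) ? map[c] : c)
        }
        print out
    }'
}

# "e" maps to "i"; "l" maps to "p"; "o" has no partner, so it also maps to "p".
echo "hello world" | translit "elo" "ip"    # prints "hippp wprpd"
```

Building the lookup table once in the @code{BEGIN} rule keeps the per-line work to a single pass over the characters; the trade-off is that the lists cannot change while the program runs.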
@@ -22462,7 +23768,6 @@ and @code{gsub()} built-in functions (@pxref{String Functions}).@footnote{This program was written before @command{gawk} acquired the ability to split each character in a string into separate array elements.} -@c Exercise: How might you use this new feature to simplify the program? There are two functions. The first, @code{stranslate()}, takes three arguments: @@ -22558,19 +23863,19 @@ BEGIN @{ While it is possible to do character transliteration in a user-level function, it is not necessarily efficient, and we (the @command{gawk} authors) started to consider adding a built-in function. However, -shortly after writing this program, we learned that the System V Release 4 -@command{awk} had added the @code{toupper()} and @code{tolower()} functions -(@pxref{String Functions}). -These functions handle the vast majority of the -cases where character transliteration is necessary, and so we chose to -simply add those functions to @command{gawk} as well and then leave well -enough alone. +shortly after writing this program, we learned that Brian Kernighan +had added the @code{toupper()} and @code{tolower()} functions to his +@command{awk} (@pxref{String Functions}). These functions handle the +vast majority of the cases where character transliteration is necessary, +and so we chose to simply add those functions to @command{gawk} as well +and then leave well enough alone. An obvious improvement to this program would be to set up the @code{t_ar} array only once, in a @code{BEGIN} rule. However, this assumes that the ``from'' and ``to'' lists will never change throughout the lifetime of the program. @c ENDOFRANGE chtra +@c ENDOFRANGE tr @node Labels Program @subsection Printing Mailing Labels @@ -22596,7 +23901,18 @@ The @code{BEGIN} rule simply sets @code{RS} to the empty string, so that @command{awk} splits records at blank lines (@pxref{Records}). It sets @code{MAXLINES} to 100, since 100 is the maximum number -of lines on the page (20 * 5 = 100). 
+of lines on the page +@iftex +(@math{20 @cdot 5 = 100}). +@end iftex +@ifnottex +@ifnotdocbook +(20 * 5 = 100). +@end ifnotdocbook +@end ifnottex +@docbook +(20 ⋅ 5 = 100). @c +@end docbook Most of the work is done in the @code{printpage()} function. The label lines are stored sequentially in the @code{line} array. But they @@ -22630,6 +23946,7 @@ that there are two blank lines at the top and two blank lines at the bottom. The @code{END} rule arranges to flush the final page of labels; there may not have been an even multiple of 20 labels in the data: +@c STARTOFRANGE labels @cindex @code{labels.awk} program @example @c file eg/prog/labels.awk @@ -22697,6 +24014,7 @@ END \ @end example @c ENDOFRANGE prml @c ENDOFRANGE mlprint +@c ENDOFRANGE labels @node Word Sorting @subsection Generating Word-Usage Counts @@ -22706,7 +24024,7 @@ END \ When working with large amounts of text, it can be interesting to know how often different words appear. For example, an author may overuse -certain words, in which case she might wish to find synonyms to substitute +certain words, in which case he or she might wish to find synonyms to substitute for words that appear too often. This @value{SUBSECTION} develops a program for counting words and presenting the frequency information in a useful format. @@ -22736,7 +24054,7 @@ it prints the counts. This program has several problems that prevent it from being useful on real text files: -@itemize @bullet +@itemize @value{BULLET} @item The @command{awk} language considers upper- and lowercase characters to be distinct. Therefore, ``bartender'' and ``Bartender'' are not treated @@ -22763,6 +24081,7 @@ to remove punctuation characters. Finally, we solve the third problem by using the system @command{sort} utility to process the output of the @command{awk} script. 
Here is the new version of the program: +@c STARTOFRANGE wordfreq @cindex @code{wordfreq.awk} program @example @c file eg/prog/wordfreq.awk @@ -22783,6 +24102,10 @@ END @{ @} @end example +The regexp @samp{/[^[:alnum:]_[:blank:]]/} might have been written +@samp{/[[:punct:]]/}, but then underscores would also be removed, +and we want to keep them. + Assuming we have saved this program in a file named @file{wordfreq.awk}, and that the data is in @file{file1}, the following pipeline: @@ -22824,6 +24147,7 @@ have true pipes at the command-line (or batch-file) level. See the general operating system documentation for more information on how to use the @command{sort} program. @c ENDOFRANGE worus +@c ENDOFRANGE wordfreq @node History Sorting @subsection Removing Duplicates from Unsorted Text @@ -22834,7 +24158,7 @@ The @command{uniq} program (@pxref{Uniq Program}), removes duplicate lines from @emph{sorted} data. -Suppose, however, you need to remove duplicate lines from a data file but +Suppose, however, you need to remove duplicate lines from a @value{DF} but that you want to preserve the order the lines are in. A good example of this might be a shell history file. The history file keeps a copy of all the commands you have entered, and it is not unusual to repeat a command @@ -22853,6 +24177,7 @@ Each element of @code{lines} is a unique command, and the indices of The @code{END} rule simply prints out the lines, in order: @cindex Rakitzis, Byron +@c STARTOFRANGE histsort @cindex @code{histsort.awk} program @example @c file eg/prog/histsort.awk @@ -22892,9 +24217,11 @@ information. For example, using the following @code{print} statement in the print data[lines[i]], lines[i] @end example +@noindent This works because @code{data[$0]} is incremented each time a line is seen. @c ENDOFRANGE lidu +@c ENDOFRANGE histsort @node Extract Program @subsection Extracting Programs from Texinfo Source Files @@ -22926,7 +24253,8 @@ printed and online documentation. 
@ifnotinfo Texinfo is fully documented in the book @cite{Texinfo---The GNU Documentation Format}, -available from the Free Software Foundation. +available from the Free Software Foundation, +and also available @uref{http://www.gnu.org/software/texinfo/manual/texinfo/, online}. @end ifnotinfo @ifinfo The Texinfo language is described fully, starting with @@ -22936,7 +24264,7 @@ The Texinfo language is described fully, starting with For our purposes, it is enough to know three things about Texinfo input files: -@itemize @bullet +@itemize @value{BULLET} @item The ``at'' symbol (@samp{@@}) is special in Texinfo, much as the backslash (@samp{\}) is in C @@ -23004,6 +24332,7 @@ The first rule handles calling @code{system()}, checking that a command is given (@code{NF} is at least three) and also checking that the command exits with a zero exit status, signifying OK: +@c STARTOFRANGE extract @cindex @code{extract.awk} program @example @c file eg/prog/extract.awk @@ -23025,7 +24354,7 @@ BEGIN @{ IGNORECASE = 1 @} /^@@c(omment)?[ \t]+system/ \ @{ if (NF < 3) @{ - e = (FILENAME ":" FNR) + e = ("extract: " FILENAME ":" FNR) e = (e ": badly formed `system' line") print e > "/dev/stderr" next @@ -23034,7 +24363,7 @@ BEGIN @{ IGNORECASE = 1 @} $2 = "" stat = system($0) if (stat != 0) @{ - e = (FILENAME ":" FNR) + e = ("extract: " FILENAME ":" FNR) e = (e ": warning: system returned " stat) print e > "/dev/stderr" @} @@ -23044,16 +24373,10 @@ BEGIN @{ IGNORECASE = 1 @} @noindent The variable @code{e} is used so that the rule -fits nicely on the -@ifnotinfo -page. -@end ifnotinfo -@ifnottex -screen. -@end ifnottex +fits nicely on the @value{PAGE}. The second rule handles moving data into files. It verifies that a -file name is given in the directive. If the file named is not the +@value{FN} is given in the directive. If the file named is not the current file, then the current file is closed. 
Keeping the current file open until a new file is encountered allows the use of the @samp{>} redirection for printing the contents, keeping open file management @@ -23077,12 +24400,11 @@ the array @code{a}, using the @code{split()} function The @samp{@@} symbol is used as the separator character. Each element of @code{a} that is empty indicates two successive @samp{@@} symbols in the original line. For each two empty elements (@samp{@@@@} in -the original file), we have to add a single @samp{@@} symbol back -in.@footnote{This program was written before @command{gawk} had the -@code{gensub()} function. Consider how you might use it to simplify the code.} +the original file), we have to add a single @samp{@@} symbol back in. When the processing of the array is finished, @code{join()} is called with the -value of @code{SUBSEP}, to rejoin the pieces back into a single +value of @code{SUBSEP} (@pxref{Multidimensional}), +to rejoin the pieces back into a single line. That line is then printed to the output file: @example @@ -23090,7 +24412,7 @@ line. That line is then printed to the output file: /^@@c(omment)?[ \t]+file/ \ @{ if (NF != 3) @{ - e = (FILENAME ":" FNR ": badly formed `file' line") + e = ("extract: " FILENAME ":" FNR ": badly formed `file' line") print e > "/dev/stderr" next @} @@ -23135,20 +24457,19 @@ subsequent output is appended to the file (@pxref{Redirection}). This makes it easy to mix program text and explanatory prose for the same sample source file (as has been done here!) without any hassle. The file is -only closed when a new data file name is encountered or at the end of the +only closed when a new @value{DF} name is encountered or at the end of the input file. Finally, the function @code{@w{unexpected_eof()}} prints an appropriate error message and then exits. The @code{END} rule handles the final cleanup, closing the open file: -@c function lb put on same line for page breaking. 
sigh @example @c file eg/prog/extract.awk @group function unexpected_eof() @{ - printf("%s:%d: unexpected EOF or error\n", + printf("extract: %s:%d: unexpected EOF or error\n", FILENAME, FNR) > "/dev/stderr" exit 1 @} @@ -23162,6 +24483,7 @@ END @{ @end example @c ENDOFRANGE texse @c ENDOFRANGE fitex +@c ENDOFRANGE extract @node Simple Sed @subsection A Simple Stream Editor @@ -23187,10 +24509,11 @@ Here, @samp{s/old/new/g} tells @command{sed} to look for the regexp The following program, @file{awksed.awk}, accepts at least two command-line arguments: the pattern to look for and the text to replace it with. Any -additional arguments are treated as data file names to process. If none +additional arguments are treated as @value{DF} names to process. If none are provided, the standard input is used: @cindex Brennan, Michael +@c STARTOFRANGE awksed @cindex @command{awksed.awk} program @c @cindex simple stream editor @c @cindex stream editor, simple @@ -23260,33 +24583,14 @@ The @code{BEGIN} rule handles the setup, checking for the right number of arguments and calling @code{usage()} if there is a problem. Then it sets @code{RS} and @code{ORS} from the command-line arguments and sets @code{ARGV[1]} and @code{ARGV[2]} to the null string, so that they are -not treated as file names +not treated as @value{FN}s (@pxref{ARGC and ARGV}). The @code{usage()} function prints an error message and exits. Finally, the single rule handles the printing scheme outlined above, using @code{print} or @code{printf} as appropriate, depending upon the value of @code{RT}. - -@ignore -Exercise, compare the performance of this version with the more -straightforward: - -BEGIN { - pat = ARGV[1] - repl = ARGV[2] - ARGV[1] = ARGV[2] = "" -} - -{ gsub(pat, repl); print } - -Exercise: what are the advantages and disadvantages of this version versus sed? - Advantage: egrep regexps - speed (?) - Disadvantage: no & in replacement text - -Others? 
-@end ignore +@c ENDOFRANGE awksed @node Igawk Program @subsection An Easy Way to Use Library Functions @@ -23328,7 +24632,7 @@ BEGIN @{ The following program, @file{igawk.sh}, provides this service. It simulates @command{gawk}'s searching of the @env{AWKPATH} variable and also allows @dfn{nested} includes; i.e., a file that is included -with @samp{@@include} can contain further @samp{@@include} statements. +with @code{@@include} can contain further @code{@@include} statements. @command{igawk} makes an effort to only include files once, so that nested includes don't accidentally include a library function twice. @@ -23358,7 +24662,7 @@ Literal text, provided with @option{--source} or @option{--source=}. This text is just appended directly. @item -Source file names, provided with @option{-f}. We use a neat trick and append +Source @value{FN}s, provided with @option{-f}. We use a neat trick and append @samp{@@include @var{filename}} to the shell variable's contents. Since the file-inclusion program works the way @command{gawk} does, this gets the text of the file included into the program at the correct point. @@ -23366,12 +24670,12 @@ of the file included into the program at the correct point. @item Run an @command{awk} program (naturally) over the shell variable's contents to expand -@samp{@@include} statements. The expanded program is placed in a second +@code{@@include} statements. The expanded program is placed in a second shell variable. @item Run the expanded program with @command{gawk} and any other original command-line -arguments that the user supplied (such as the data file names). +arguments that the user supplied (such as the @value{DF} names). @end enumerate This program uses shell variables extensively: for storing command-line arguments, @@ -23386,24 +24690,25 @@ argument is @samp{debug}. The next part loops through all the command-line arguments. 
There are several cases of interest: -@table @code -@item -- +@c @asis for docbook +@table @asis +@item @option{--} This ends the arguments to @command{igawk}. Anything else should be passed on to the user's @command{awk} program without being evaluated. -@item -W +@item @option{-W} This indicates that the next option is specific to @command{gawk}. To make argument processing easier, the @option{-W} is appended to the front of the remaining arguments and the loop continues. (This is an @command{sh} programming trick. Don't worry about it if you are not familiar with @command{sh}.) -@item -v@r{,} -F +@item @option{-v}, @option{-F} These are saved and passed on to @command{gawk}. -@item -f@r{,} --file@r{,} --file=@r{,} -Wfile= -The file name is appended to the shell variable @code{program} with an -@samp{@@include} statement. +@item @option{-f}, @option{--file}, @option{--file=}, @option{-Wfile=} +The @value{FN} is appended to the shell variable @code{program} with an +@code{@@include} statement. The @command{expr} utility is used to remove the leading option part of the argument (e.g., @samp{--file=}). (Typical @command{sh} usage would be to use the @command{echo} and @command{sed} @@ -23411,10 +24716,10 @@ utilities to do this work. Unfortunately, some versions of @command{echo} evalu escape sequences in their arguments, possibly mangling the program text. Using @command{expr} avoids this problem.) -@item --source@r{,} --source=@r{,} -Wsource= +@item @option{--source}, @option{--source=}, @option{-Wsource=} The source text is appended to @code{program}. -@item --version@r{,} -Wversion +@item @option{--version}, @option{-Wversion} @command{igawk} prints its version number, runs @samp{gawk --version} to get the @command{gawk} version information, and then exits. @end table @@ -23430,6 +24735,7 @@ program. 
The program is as follows: +@c STARTOFRANGE igawk @cindex @code{igawk.sh} program @example @c file eg/prog/igawk.sh @@ -23521,15 +24827,15 @@ fi @c endfile @end example -The @command{awk} program to process @samp{@@include} directives +The @command{awk} program to process @code{@@include} directives is stored in the shell variable @code{expand_prog}. Doing this keeps the shell script readable. The @command{awk} program reads through the user's program, one line at a time, using @code{getline} (@pxref{Getline}). The input -file names and @samp{@@include} statements are managed using a stack. -As each @samp{@@include} is encountered, the current file name is -``pushed'' onto the stack and the file named in the @samp{@@include} -directive becomes the current file name. As each file is finished, +@value{FN}s and @code{@@include} statements are managed using a stack. +As each @code{@@include} is encountered, the current @value{FN} is +``pushed'' onto the stack and the file named in the @code{@@include} +directive becomes the current @value{FN}. As each file is finished, the stack is ``popped,'' and the previous input file becomes the current input file again. The process is started by making the original file the first one on the stack. @@ -23538,16 +24844,16 @@ The @code{pathto()} function does the work of finding the full path to a file. It simulates @command{gawk}'s behavior when searching the @env{AWKPATH} environment variable (@pxref{AWKPATH Variable}). -If a file name has a @samp{/} in it, no path search is done. -Similarly, if the file name is @code{"-"}, then that string is +If a @value{FN} has a @samp{/} in it, no path search is done. +Similarly, if the @value{FN} is @code{"-"}, then that string is used as-is. Otherwise, -the file name is concatenated with the name of each directory in -the path, and an attempt is made to open the generated file name. 
+the @value{FN} is concatenated with the name of each directory in +the path, and an attempt is made to open the generated @value{FN}. The only way to test if a file can be read in @command{awk} is to go ahead and try to read it with @code{getline}; this is what @code{pathto()} does.@footnote{On some very old versions of @command{awk}, the test @samp{getline junk < t} can loop forever if the file exists but is empty. -Caveat emptor.} If the file can be read, it is closed and the file name +Caveat emptor.} If the file can be read, it is closed and the @value{FN} is returned: @ignore @@ -23602,17 +24908,17 @@ BEGIN @{ @c endfile @end example -The stack is initialized with @code{ARGV[1]}, which will be @file{/dev/stdin}. +The stack is initialized with @code{ARGV[1]}, which will be @code{"/dev/stdin"}. The main loop comes next. Input lines are read in succession. Lines that -do not start with @samp{@@include} are printed verbatim. -If the line does start with @samp{@@include}, the file name is in @code{$2}. +do not start with @code{@@include} are printed verbatim. +If the line does start with @code{@@include}, the @value{FN} is in @code{$2}. @code{pathto()} is called to generate the full path. If it cannot, then the program prints an error message and continues. The next thing to check is if the file is included already. The -@code{processed} array is indexed by the full file name of each included +@code{processed} array is indexed by the full @value{FN} of each included file and it tracks this information for us. If the file is -seen again, a warning message is printed. Otherwise, the new file name is +seen again, a warning message is printed. Otherwise, the new @value{FN} is pushed onto the stack and processing continues. 
Finally, when @code{getline} encounters the end of the input file, the file @@ -23633,7 +24939,7 @@ the program is done: fpath = pathto($2) @group if (fpath == "") @{ - printf("igawk:%s:%d: cannot find %s\n", + printf("igawk: %s:%d: cannot find %s\n", input[stackptr], FNR, $2) > "/dev/stderr" continue @} @@ -23673,7 +24979,7 @@ It's done in these steps: @enumerate @item -Run @command{gawk} with the @samp{@@include}-processing program (the +Run @command{gawk} with the @code{@@include}-processing program (the value of the @code{expand_prog} shell variable) on standard input. @item @@ -23690,10 +24996,10 @@ options and command-line arguments that the user supplied. @c this causes more problems than it solves, so leave it out. @ignore -The special file @file{/dev/null} is passed as a data file to @command{gawk} +The special file @file{/dev/null} is passed as a @value{DF} to @command{gawk} to handle an interesting case. Suppose that the user's program only has -a @code{BEGIN} rule and there are no data files to read. -The program should exit without reading any data files. +a @code{BEGIN} rule and there are no @value{DF}s to read. +The program should exit without reading any @value{DF}s. However, suppose that an included library file defines an @code{END} rule of its own. In this case, @command{gawk} will hang, reading standard input. In order to avoid this, @file{/dev/null} is explicitly added to the @@ -23712,27 +25018,25 @@ eval gawk $opts -- '"$processed_program"' '"$@@"' The @command{eval} command is a shell construct that reruns the shell's parsing process. This keeps things properly quoted. -This version of @command{igawk} represents my fifth version of this program. +This version of @command{igawk} represents the fifth version of this program. 
There are four key simplifications that make the program work better: -@itemize @bullet +@itemize @value{BULLET} @item -Using @samp{@@include} even for the files named with @option{-f} makes building +Using @code{@@include} even for the files named with @option{-f} makes building the initial collected @command{awk} program much simpler; all the -@samp{@@include} processing can be done once. +@code{@@include} processing can be done once. @item Not trying to save the line read with @code{getline} in the @code{pathto()} function when testing for the file's accessibility for use with the main program simplifies things considerably. -@c what problem does this engender though - exercise -@c answer, reading from "-" or /dev/stdin @item Using a @code{getline} loop in the @code{BEGIN} rule does it all in one place. It is not necessary to call out to a separate loop for processing -nested @samp{@@include} statements. +nested @code{@@include} statements. @item Instead of saving the expanded program in a temporary file, putting it in a shell variable @@ -23752,47 +25056,18 @@ Finally, @command{igawk} shows that it is not always necessary to add new features to a program; they can often be layered on top. @ignore With @command{igawk}, -there is no real reason to build @samp{@@include} processing into +there is no real reason to build @code{@@include} processing into @command{gawk} itself. @end ignore - -@cindex search paths -@cindex search paths, for source files -@cindex source files@comma{} search path for -@cindex files, source@comma{} search path for -@cindex directories, searching -As an additional example of this, consider the idea of having two -files in a directory in the search path: - -@table @file -@item default.awk -This file contains a set of default library functions, such -as @code{getopt()} and @code{assert()}. - -@item site.awk -This file contains library functions that are specific to a site or -installation; i.e., locally developed functions. 
-Having a separate file allows @file{default.awk} to change with -new @command{gawk} releases, without requiring the system administrator to -update it each time by adding the local functions. -@end table - -One user -@c Karl Berry, karl@ileaf.com, 10/95 -suggested that @command{gawk} be modified to automatically read these files -upon startup. Instead, it would be very simple to modify @command{igawk} -to do this. Since @command{igawk} can process nested @samp{@@include} -directives, @file{default.awk} could simply contain @samp{@@include} -statements for the desired library functions. - -@c Exercise: make this change @c ENDOFRANGE libfex @c ENDOFRANGE flibex @c ENDOFRANGE awkpex +@c ENDOFRANGE igawk @node Anagram Program @subsection Finding Anagrams From A Dictionary +@cindex anagrams, finding An interesting programming challenge is to search for @dfn{anagrams} in a word list (such as @@ -23812,6 +25087,7 @@ The following program uses arrays of arrays to bring together words with the same signature and array sorting to print the words in sorted order. +@c STARTOFRANGE anagram @cindex @code{anagram.awk} program @example @c file eg/prog/anagram.awk @@ -23920,9 +25196,13 @@ babery yabber @dots{} @end example +@c ENDOFRANGE anagram + @node Signature Program @subsection And Now For Something Completely Different +@cindex signature program +@cindex Brini, Davide The following program was written by Davide Brini @c (@email{dave_br@@gmx.com}) and is published on @uref{http://backreference.org/2011/02/03/obfuscated-awk/, @@ -23947,7 +25227,10 @@ X*(X-x)-o*o,(x+X)*o*o+o,x*(X-x)-O-O,x-O+(O+o+X+x)*(o+O),X*X-X*(x-O)-x+O, O+X*(o*(o+O)+O),+x+O+X*o,x*(x-o),(o+X+x)*o*o-(x-O-O),O+(X-x)*(X+O),x-O@}' @end example -We leave it to you to determine what the program does. +@cindex Johansen, Chris +We leave it to you to determine what the program does. 
(If you are +truly desperate to understand it, see Chris Johansen's explanation, +which is embedded in the Texinfo source file for this @value{DOCUMENT}.) @ignore To: "Arnold Robbins" <arnold@skeeve.com> @@ -24027,19 +25310,181 @@ BEGIN { } @end ignore -@iftex -@part Part III:@* Moving Beyond Standard @command{awk} With @command{gawk} -@end iftex +@node Programs Summary +@section Summary + +@itemize @value{BULLET} +@item +The functions provided in this @value{CHAPTER} and the previous one +continue on the theme that reading programs is an excellent way to learn +Good Programming. + +@item +Using @samp{#!} to make @command{awk} programs directly runnable makes +them easier to use. Otherwise, invoke the program using @samp{awk +-f @dots{}}. + +@item +Reimplementing standard POSIX programs in @command{awk} is a pleasant +exercise; @command{awk}'s expressive power lets you write such programs +in relatively few lines of code, yet they are functionally complete +and usable. + +@item +One of standard @command{awk}'s weaknesses is working with individual +characters. The ability to use @code{split()} with the empty string as +the separator can considerably simplify such tasks. + +@item +The library functions from @ref{Library Functions}, proved their +usefulness for a number of real (if small) programs. + +@item +Besides reinventing POSIX wheels, other programs solved a selection of +interesting problems, such as finding duplicate words in text, printing +mailing labels, and finding anagrams. + +@end itemize + +@node Programs Exercises +@section Exercises + +@enumerate +@item +Rewrite @file{cut.awk} (@pxref{Cut Program}) +using @code{split()} with @code{""} as the separator. + +@item +In @ref{Egrep Program}, we mentioned that @samp{egrep -i} could be +simulated in versions of @command{awk} without @code{IGNORECASE} by +using @code{tolower()} on the line and the pattern.
In a footnote there, +we also mentioned that this solution has a bug: the translated line is +output, and not the original one. Fix this problem. +@c Exercise: Fix this, w/array and new line as key to original line + +@item +The POSIX version of @command{id} takes options that control which +information is printed. Modify the @command{awk} version +(@pxref{Id Program}) to accept the same arguments and perform in the +same way. + +@item +The @code{split.awk} program (@pxref{Split Program}) uses +the @code{chr()} and @code{ord()} functions to move through the +letters of the alphabet. +Modify the program to instead use only the @command{awk} +built-in functions, such as @code{index()} and @code{substr()}. + +@item +The @code{split.awk} program (@pxref{Split Program}) assumes +that letters are contiguous in the character set, +which isn't true for EBCDIC systems. +Fix this problem. + +@item +Why can't the @file{wc.awk} program (@pxref{Wc Program}) just +use the value of @code{FNR} in @code{endfile()}? +Hint: Examine the code in @ref{Filetrans Function}. @ignore -@ifdocbook +@command{wc} can't just use the value of @code{FNR} in +@code{endfile()}. If you examine the code in @ref{Filetrans Function}, +you will see that @code{FNR} has already been reset by the time +@code{endfile()} is called. +@end ignore + +@item +Manipulation of individual characters in the @command{translate} program +(@pxref{Translate Program}) is painful using standard @command{awk} +functions. Given that @command{gawk} can split strings into individual +characters using @code{""} as the separator, how might you use this +feature to simplify the program? + +@item +The @file{extract.awk} program (@pxref{Extract Program}) was written +before @command{gawk} had the @code{gensub()} function. Use it +to simplify the code. 
+ +@item +Compare the performance of the @file{awksed.awk} program +(@pxref{Simple Sed}) with the more straightforward: + +@example +BEGIN @{ + pat = ARGV[1] + repl = ARGV[2] + ARGV[1] = ARGV[2] = "" +@} + +@{ gsub(pat, repl); print @} +@end example + +@item +What are the advantages and disadvantages of @file{awksed.awk} versus +the real @command{sed} utility? + +@ignore + Advantage: egrep regexps + speed (?) + Disadvantage: no & in replacement text + +Others? +@end ignore + +@item +In @ref{Igawk Program}, we mentioned that not trying to save the line +read with @code{getline} in the @code{pathto()} function when testing +for the file's accessibility for use with the main program simplifies +things considerably. What problem does this engender though? +@c answer, reading from "-" or /dev/stdin + +@cindex search paths +@cindex search paths, for source files +@cindex source files@comma{} search path for +@cindex files, source@comma{} search path for +@cindex directories, searching +@item +As an additional example of the idea that it is not always necessary to +add new features to a program, consider the idea of having two files in +a directory in the search path: + +@table @file +@item default.awk +This file contains a set of default library functions, such +as @code{getopt()} and @code{assert()}. + +@item site.awk +This file contains library functions that are specific to a site or +installation; i.e., locally developed functions. +Having a separate file allows @file{default.awk} to change with +new @command{gawk} releases, without requiring the system administrator to +update it each time by adding the local functions. +@end table + +One user +@c Karl Berry, karl@ileaf.com, 10/95 +suggested that @command{gawk} be modified to automatically read these files +upon startup. Instead, it would be very simple to modify @command{igawk} +to do this. 
Since @command{igawk} can process nested @code{@@include} +directives, @file{default.awk} could simply contain @code{@@include} +statements for the desired library functions. +Make this change. + +@item +Modify @file{anagram.awk} (@pxref{Anagram Program}), to avoid +the use of the external @command{sort} utility. -@part Part III:@* Moving Beyond Standard @command{awk} With @command{gawk} +@end enumerate +@ifnotinfo +@part @value{PART3}Moving Beyond Standard @command{awk} With @command{gawk} +@end ifnotinfo + +@ifdocbook Part III focuses on features specific to @command{gawk}. It contains the following chapters: -@itemize @bullet +@itemize @value{BULLET} @item @ref{Advanced Features}. @@ -24054,12 +25499,11 @@ It contains the following chapters: @item @ref{Dynamic Extensions}. +@end itemize @end ifdocbook -@end ignore @node Advanced Features @chapter Advanced Features of @command{gawk} -@cindex advanced features, network connections, See Also networks, connections @c STARTOFRANGE gawadv @cindex @command{gawk}, features, advanced @c STARTOFRANGE advgaw @@ -24072,6 +25516,8 @@ Contributed by: Peter Langston <pud!psl@bellcore.bellcore.com> "Write documentation as if whoever reads it is a violent psychopath who knows where you live." @end ignore +@cindex Langston, Peter +@cindex English, Steve @quotation @i{Write documentation as if whoever reads it is a violent psychopath who knows where you live.} @@ -24091,10 +25537,11 @@ of TCP/IP networking. Finally, @command{gawk} can @dfn{profile} an @command{awk} program, making it possible to tune it for performance. +@c FULLXREF ON A number of advanced features require separate @value{CHAPTER}s of their own: -@itemize @bullet +@itemize @value{BULLET} @item @ref{Internationalization}, discusses how to internationalize your @command{awk} programs, so that they can speak multiple @@ -24113,6 +25560,7 @@ debugger for debugging @command{awk} programs. 
discusses the ability to dynamically add new built-in functions to @command{gawk}. @end itemize +@c FULLXREF OFF @menu * Nondecimal Data:: Allowing nondecimal input data. @@ -24121,11 +25569,12 @@ discusses the ability to dynamically add new built-in functions to * Two-way I/O:: Two-way communications with another process. * TCP/IP Networking:: Using @command{gawk} for network programming. * Profiling:: Profiling your @command{awk} programs. +* Advanced Features Summary:: Summary of advanced features. @end menu @node Nondecimal Data @section Allowing Nondecimal Input Data -@cindex @code{--non-decimal-data} option +@cindex @option{--non-decimal-data} option @cindex advanced features, nondecimal input data @cindex input, data@comma{} nondecimal @cindex constants, nondecimal @@ -24153,7 +25602,7 @@ $ @kbd{echo 0123 123 0x123 | gawk '@{ print $1, $2, $3 @}'} The @code{print} statement treats its expressions as strings. Although the fields can act as numbers when necessary, they are still strings, so @code{print} does not try to treat them -numerically. You may need to add zero to a field to force it to +numerically. You need to add zero to a field to force it to be treated as a number. For example: @example @@ -24169,13 +25618,13 @@ using this facility could lead to surprising results, the default is to leave it disabled. If you want it, you must explicitly request it. @cindex programming conventions, @code{--non-decimal-data} option -@cindex @code{--non-decimal-data} option, @code{strtonum()} function and +@cindex @option{--non-decimal-data} option, @code{strtonum()} function and @cindex @code{strtonum()} function (@command{gawk}), @code{--non-decimal-data} option and @quotation CAUTION @emph{Use of this option is not recommended.} It can break old programs very badly. Instead, use the @code{strtonum()} function to convert your data -(@pxref{Nondecimal-numbers}). +(@pxref{String Functions}). 
This makes your programs easier to write and easier to read, and leads to less surprising results. @end quotation @@ -24209,7 +25658,7 @@ lets you do this. @ref{Controlling Scanning}, describes how you can assign special, pre-defined values to @code{PROCINFO["sorted_in"]} in order to -control the order in which @command{gawk} will traverse an array +control the order in which @command{gawk} traverses an array during a @code{for} loop. In addition, the value of @code{PROCINFO["sorted_in"]} can be a function name. @@ -24466,9 +25915,9 @@ sorted array traversal is not the default. @subsection Sorting Array Values and Indices with @command{gawk} @cindex arrays, sorting -@cindex @code{asort()} function (@command{gawk}) +@cindexgawkfunc{asort} @cindex @code{asort()} function (@command{gawk}), arrays@comma{} sorting -@cindex @code{asorti()} function (@command{gawk}) +@cindexgawkfunc{asorti} @cindex @code{asorti()} function (@command{gawk}), arrays@comma{} sorting @cindex sort function, arrays, sorting In most @command{awk} implementations, sorting an array requires writing @@ -24533,9 +25982,9 @@ END @{ So far, so good. Now it starts to get interesting. Both @code{asort()} and @code{asorti()} accept a third string argument to control comparison -of array elements. In @ref{String Functions}, we ignored this third -argument; however, the time has now come to describe how this argument -affects these two functions. +of array elements. When we introduced @code{asort()} and @code{asorti()} +in @ref{String Functions}, we ignored this third argument; however, +now is the time to describe how this argument affects these two functions. Basically, the third argument specifies how the array is to be sorted. There are two possibilities. As with @code{PROCINFO["sorted_in"]}, @@ -24563,9 +26012,8 @@ both arrays use the values. @c Document It And Call It A Feature. Sigh. 
@cindex @command{gawk}, @code{IGNORECASE} variable in -@cindex @code{IGNORECASE} variable -@cindex arrays, sorting, @code{IGNORECASE} variable and -@cindex @code{IGNORECASE} variable, array sorting and +@cindex arrays, sorting, and @code{IGNORECASE} variable +@cindex @code{IGNORECASE} variable, and array sorting functions Because @code{IGNORECASE} affects string comparisons, the value of @code{IGNORECASE} also affects sorting for both @code{asort()} and @code{asorti()}. Note also that the locale's sorting order does @emph{not} @@ -24644,7 +26092,7 @@ open a @emph{two-way} pipe to another process. The second process is termed a @dfn{coprocess}, since it runs in parallel with @command{gawk}. The two-way connection is created using the @samp{|&} operator (borrowed from the Korn shell, @command{ksh}):@footnote{This is very -different from the same operator in the C shell.} +different from the same operator in the C shell and in Bash.} @example do @{ @@ -24666,7 +26114,7 @@ the shell. There are some cautionary items to be aware of: -@itemize @bullet +@itemize @value{BULLET} @item As the code inside @command{gawk} currently stands, the coprocess's standard error goes to the same place that the parent @command{gawk}'s @@ -24732,9 +26180,10 @@ has been read, @command{gawk} terminates the coprocess and exits. As a side note, the assignment @samp{LC_ALL=C} in the @command{sort} command ensures traditional Unix (ASCII) sorting from @command{sort}. +This is not strictly necessary here, but it's good to know how to do this. @cindex @command{gawk}, @code{PROCINFO} array in -@cindex @code{PROCINFO} array +@cindex @code{PROCINFO} array, and communications via ptys You may also use pseudo-ttys (ptys) for two-way communication instead of pipes, if your system supports them. 
This is done on a per-command basis, by setting a special element @@ -24750,7 +26199,7 @@ print @dots{} |& command # start two-way pipe @end example @noindent -Using ptys avoids the buffer deadlock issues described earlier, at some +Using ptys usually avoids the buffer deadlock issues described earlier, at some loss in performance. If your system does not have ptys, or if all the system's ptys are in use, @command{gawk} automatically falls back to using regular pipes. @@ -24785,10 +26234,10 @@ another process on another system across an IP network connection. You can think of this as just a @emph{very long} two-way pipeline to a coprocess. The way @command{gawk} decides that you want to use TCP/IP networking is -by recognizing special file names that begin with one of @samp{/inet/}, +by recognizing special @value{FN}s that begin with one of @samp{/inet/}, @samp{/inet4/} or @samp{/inet6/}. -The full syntax of the special file name is +The full syntax of the special @value{FN} is @file{/@var{net-type}/@var{protocol}/@var{local-port}/@var{remote-host}/@var{remote-port}}. The components are: @@ -24854,7 +26303,9 @@ See @inforef{Top, , General Introduction, gawkinet, TCP/IP Internetworking with @command{gawk}}, @end ifinfo @ifnotinfo -See @cite{TCP/IP Internetworking with @command{gawk}}, +See +@uref{http://www.gnu.org/software/gawk/manual/gawkinet/, +@cite{TCP/IP Internetworking with @command{gawk}}}, which comes as part of the @command{gawk} distribution, @end ifnotinfo for a much more complete introduction and discussion, as well as @@ -24877,7 +26328,7 @@ When @command{gawk} has finished running, it creates a profile of your program i named @file{awkprof.out}. Because it is profiling, it also executes up to 45% slower than @command{gawk} normally does.
-@cindex @code{--profile} option +@cindex @option{--profile} option As shown in the following example, the @option{--profile} option can be used to change the name of the file where @command{gawk} will write the profile: @@ -24932,68 +26383,77 @@ foo junk @end example -Here is the @file{awkprof.out} that results from running the @command{gawk} -profiler on this program and data (this example also illustrates that @command{awk} -programmers sometimes have to work late): +Here is the @file{awkprof.out} that results from running the +@command{gawk} profiler on this program and data. (This example also +illustrates that @command{awk} programmers sometimes get up very early +in the morning to work.) -@cindex @code{BEGIN} pattern -@cindex @code{END} pattern +@cindex @code{BEGIN} pattern, and profiling +@cindex @code{END} pattern, and profiling @example - # gawk profile, created Sun Aug 13 00:00:15 2000 + # gawk profile, created Thu Feb 27 05:16:21 2014 - # BEGIN block(s) + # BEGIN block(s) - BEGIN @{ - 1 print "First BEGIN rule" - 1 print "Second BEGIN rule" - @} + BEGIN @{ + 1 print "First BEGIN rule" + @} - # Rule(s) + BEGIN @{ + 1 print "Second BEGIN rule" + @} - 5 /foo/ @{ # 2 - 2 print "matched /foo/, gosh" - 6 for (i = 1; i <= 3; i++) @{ - 6 sing() - @} - @} + # Rule(s) - 5 @{ - 5 if (/foo/) @{ # 2 - 2 print "if is true" - 3 @} else @{ - 3 print "else is true" - @} - @} + 5 /foo/ @{ # 2 + 2 print "matched /foo/, gosh" + 6 for (i = 1; i <= 3; i++) @{ + 6 sing() + @} + @} - # END block(s) + 5 @{ + 5 if (/foo/) @{ # 2 + 2 print "if is true" + 3 @} else @{ + 3 print "else is true" + @} + @} - END @{ - 1 print "First END rule" - 1 print "Second END rule" - @} + # END block(s) - # Functions, listed alphabetically + END @{ + 1 print "First END rule" + @} - 6 function sing(dummy) - @{ - 6 print "I gotta be me!" - @} + END @{ + 1 print "Second END rule" + @} + + + # Functions, listed alphabetically + + 6 function sing(dummy) + @{ + 6 print "I gotta be me!" 
+ @} @end example This example illustrates many of the basic features of profiling output. They are as follows: -@itemize @bullet +@itemize @value{BULLET} @item -The program is printed in the order @code{BEGIN} rule, -@code{BEGINFILE} rule, +The program is printed in the order @code{BEGIN} rules, +@code{BEGINFILE} rules, pattern/action rules, -@code{ENDFILE} rule, @code{END} rule and functions, listed +@code{ENDFILE} rules, @code{END} rules and functions, listed alphabetically. -Multiple @code{BEGIN} and @code{END} rules are merged together, -as are multiple @code{BEGINFILE} and @code{ENDFILE} rules. +Multiple @code{BEGIN} and @code{END} rules retain their +separate identities, as do +multiple @code{BEGINFILE} and @code{ENDFILE} rules. -@cindex patterns, counts +@cindex patterns, counts, in a profile @item Pattern-action rules have two counts. The first count, to the left of the rule, shows how many times @@ -25013,7 +26473,7 @@ is a count showing how many times the condition was true. The count for the @code{else} indicates how many times the test failed. -@cindex loops, count for header +@cindex loops, count for header, in a profile @item The count for a loop header (such as @code{for} or @code{while}) shows how many times the loop test was executed. @@ -25021,8 +26481,8 @@ or @code{while}) shows how many times the loop test was executed. statement in a rule to determine how many times the rule was executed. If the first statement is a loop, the count is misleading.) -@cindex functions, user-defined, counts -@cindex user-defined, functions, counts +@cindex functions, user-defined, counts, in a profile +@cindex user-defined, functions, counts, in a profile @item For user-defined functions, the count next to the @code{function} keyword indicates how many times the function was called. @@ -25036,12 +26496,11 @@ The layout uses ``K&R'' style with TABs. Braces are used everywhere, even when the body of an @code{if}, @code{else}, or loop is only a single statement. 
-@cindex @code{()} (parentheses) -@cindex parentheses @code{()} +@cindex @code{()} (parentheses), in a profile +@cindex parentheses @code{()}, in a profile @item Parentheses are used only where needed, as indicated by the structure of the program and the precedence rules. -@c extra verbiage here satisfies the copyeditor. ugh. For example, @samp{(3 + 5) * 4} means add three plus five, then multiply the total by four. However, @samp{3 + 5 * 4} has no parentheses, and means @samp{3 + (5 * 4)}. @@ -25072,8 +26531,8 @@ typed when you wrote it. This is because @command{gawk} creates the profiled version by ``pretty printing'' its internal representation of the program. The advantage to this is that @command{gawk} can produce a standard representation. The disadvantage is that all source-code -comments are lost, as are the distinctions among multiple @code{BEGIN}, -@code{END}, @code{BEGINFILE}, and @code{ENDFILE} rules. Also, things such as: +comments are lost. +Also, things such as: @example /foo/ @@ -25093,6 +26552,7 @@ which is correct, but possibly surprising. @cindex profiling @command{awk} programs, dynamically @cindex @command{gawk} program, dynamic profiling +@cindex dynamic profiling Besides creating profiles when a program has completed, @command{gawk} can produce a profile while it is running. This is useful if your @command{awk} program goes into an @@ -25106,9 +26566,9 @@ $ @kbd{gawk --profile -f myprog &} @end example @cindex @command{kill} command@comma{} dynamic profiling -@cindex @code{USR1} signal -@cindex @code{SIGUSR1} signal -@cindex signals, @code{USR1}/@code{SIGUSR1} +@cindex @code{USR1} signal, for dynamic profiling +@cindex @code{SIGUSR1} signal, for dynamic profiling +@cindex signals, @code{USR1}/@code{SIGUSR1}, for profiling @noindent The shell prints a job number and process ID number; in this case, 13992. 
Use the @command{kill} command to send the @code{USR1} signal @@ -25123,7 +26583,7 @@ As usual, the profiled version of the program is written to @file{awkprof.out}, or to a different file if one was specified with the @option{--profile} option. -Along with the regular profile, as shown earlier, the profile +Along with the regular profile, as shown earlier, the profile file includes a trace of any active functions: @example @@ -25139,9 +26599,9 @@ You may send @command{gawk} the @code{USR1} signal as many times as you like. Each time, the profile and function call trace are appended to the output profile file. -@cindex @code{HUP} signal -@cindex @code{SIGHUP} signal -@cindex signals, @code{HUP}/@code{SIGHUP} +@cindex @code{HUP} signal, for dynamic profiling +@cindex @code{SIGHUP} signal, for dynamic profiling +@cindex signals, @code{HUP}/@code{SIGHUP}, for profiling If you use the @code{HUP} signal instead of the @code{USR1} signal, @command{gawk} produces the profile and the function call trace and then exits. @@ -25163,11 +26623,61 @@ keyboard. The @code{INT} signal is generated by the Finally, @command{gawk} also accepts another option, @option{--pretty-print}. When called this way, @command{gawk} ``pretty prints'' the program into @file{awkprof.out}, without any execution counts. -@c ENDOFRANGE advgaw -@c ENDOFRANGE gawadv + +@quotation NOTE +The @option{--pretty-print} option still runs your program. +This will change in the next major release. +@end quotation @c ENDOFRANGE awkp @c ENDOFRANGE proawk +@node Advanced Features Summary +@section Summary + +@itemize @value{BULLET} +@item +The @option{--non-decimal-data} option causes @command{gawk} to treat +octal- and hexadecimal-looking input data as octal and hexadecimal. +This option should be used with caution or not at all; use of @code{strtonum()} +is preferable.
+ +@item +You can take over complete control of sorting in @samp{for (@var{indx} in @var{array})} +array traversal by setting @code{PROCINFO["sorted_in"]} to the name of a user-defined +function that does the comparison of array elements based on index and value. + +@item +Similarly, you can supply the name of a user-defined comparison function as the +third argument to either @code{asort()} or @code{asorti()} to control how +those functions sort arrays. Or you may provide one of the predefined control +strings that work for @code{PROCINFO["sorted_in"]}. + +@item +You can use the @samp{|&} operator to create a two-way pipe to a co-process. +You read from the co-process with @code{getline} and write to it with @code{print} +or @code{printf}. Use @code{close()} to close off the co-process completely, or +optionally, close off one side of the two-way communications. + +@item +By using special ``@value{FN}s'' with the @samp{|&} operator, you can open a +TCP/IP (or UDP/IP) connection to remote hosts in the Internet. @command{gawk} +supports both IPv4 and IPv6. + +@item +You can generate statement count profiles of your program. This can help you +determine which parts of your program may be taking the most time and let +you tune them more easily. Sending the @code{USR1} signal while profiling causes +@command{gawk} to dump the profile and keep going, including a function call stack. + +@item +You can also just ``pretty print'' the program. This currently also runs +the program, but that will change in the next major release. + +@end itemize + +@c ENDOFRANGE advgaw +@c ENDOFRANGE gawadv + @node Internationalization @chapter Internationalization with @command{gawk} @@ -25196,11 +26706,12 @@ a requirement. @menu * I18N and L10N:: Internationalization and Localization. -* Explaining gettext:: How GNU @code{gettext} works. +* Explaining gettext:: How GNU @command{gettext} works. * Programmer i18n:: Features for the programmer.
* Translator i18n:: Features for the translator. * I18N Example:: A simple i18n example. * Gawk I18N:: @command{gawk} is also internationalized. +* I18N Summary:: Summary of I18N stuff. @end menu @node I18N and L10N @@ -25220,20 +26731,22 @@ responses, and information related to how numerical and monetary values are printed and read. @node Explaining gettext -@section GNU @code{gettext} +@section GNU @command{gettext} @cindex internationalizing a program @c STARTOFRANGE gettex -@cindex @code{gettext} library -The facilities in GNU @code{gettext} focus on messages; strings printed +@cindex @command{gettext} library +@command{gawk} uses GNU @command{gettext} to provide its internationalization +features. +The facilities in GNU @command{gettext} focus on messages; strings printed by a program, either directly or via formatting with @code{printf} or @code{sprintf()}.@footnote{For some operating systems, the @command{gawk} -port doesn't support GNU @code{gettext}. +port doesn't support GNU @command{gettext}. Therefore, these features are not available if you are using one of those operating systems. Sorry.} -@cindex portability, @code{gettext} library and -When using GNU @code{gettext}, each application has its own +@cindex portability, @command{gettext} library and +When using GNU @command{gettext}, each application has its own @dfn{text domain}. This is a unique name, such as @samp{kpilot} or @samp{gawk}, that identifies the application. A complete application may have multiple components---programs written @@ -25257,7 +26770,7 @@ language). @cindex @code{textdomain()} function (C library) @item The programmer indicates the application's text domain -(@code{"guide"}) to the @code{gettext} library, +(@command{"guide"}) to the @command{gettext} library, by calling the @code{textdomain()} function. @cindex @code{.pot} files @@ -25274,6 +26787,7 @@ lookup of the translations. 
@cindex @code{.po} files @cindex files, @code{.po} +@c STARTOFRANGE portobfi @cindex portable object files @cindex files, portable object @item @@ -25285,6 +26799,7 @@ For example, there might be a @file{fr.po} for a French translation. @cindex @code{.gmo} files @cindex files, @code{.gmo} @cindex message object files +@c STARTOFRANGE portmsgfi @cindex files, message object @item Each language's @file{.po} file is converted into a binary @@ -25299,7 +26814,7 @@ are installed in a standard place. @cindex @code{bindtextdomain()} function (C library) @item -For testing and development, it is possible to tell @code{gettext} +For testing and development, it is possible to tell @command{gettext} to use @file{.gmo} files in a different directory than the standard one by using the @code{bindtextdomain()} function. @@ -25332,7 +26847,7 @@ strings enclosed in calls to @code{gettext()}. @cindex @code{_} (underscore), C macro @cindex underscore (@code{_}), C macro -The GNU @code{gettext} developers, recognizing that typing +The GNU @command{gettext} developers, recognizing that typing @samp{gettext(@dots{})} over and over again is both painful and ugly to look at, use the macro @samp{_} (an underscore) to make things easier: @@ -25345,7 +26860,7 @@ printf("%s", _("Don't Panic!\n")); @end example @cindex internationalization, localization, locale categories -@cindex @code{gettext} library, locale categories +@cindex @command{gettext} library, locale categories @cindex locale categories @noindent This reduces the typing overhead to just three extra characters per string @@ -25353,12 +26868,12 @@ and is considerably easier to read as well. There are locale @dfn{categories} for different types of locale-related information. -The defined locale categories that @code{gettext} knows about are: +The defined locale categories that @command{gettext} knows about are: @table @code @cindex @code{LC_MESSAGES} locale category @item LC_MESSAGES -Text messages. 
This is the default category for @code{gettext} +Text messages. This is the default category for @command{gettext} operations, but it is possible to supply a different one explicitly, if necessary. (It is almost never necessary to supply a different category.) @@ -25406,7 +26921,7 @@ before or after the day in a date, local month abbreviations, and so on. @cindex @code{LC_ALL} locale category @item LC_ALL -All of the above. (Not too useful in the context of @code{gettext}.) +All of the above. (Not too useful in the context of @command{gettext}.) @end table @c ENDOFRANGE gettex @@ -25422,7 +26937,7 @@ internationalization: @cindex @code{TEXTDOMAIN} variable @item TEXTDOMAIN This variable indicates the application's text domain. -For compatibility with GNU @code{gettext}, the default +For compatibility with GNU @command{gettext}, the default value is @code{"messages"}. @cindex internationalization, localization, marked strings @@ -25432,8 +26947,8 @@ String constants marked with a leading underscore are candidates for translation at runtime. String constants without a leading underscore are not translated. -@cindex @code{dcgettext()} function (@command{gawk}) -@item dcgettext(@var{string} @r{[}, @var{domain} @r{[}, @var{category}@r{]]}) +@cindexgawkfunc{dcgettext} +@item @code{dcgettext(@var{string}} [@code{,} @var{domain} [@code{,} @var{category}]]@code{)} Return the translation of @var{string} in text domain @var{domain} for locale category @var{category}. The default value for @var{domain} is the current value of @code{TEXTDOMAIN}. @@ -25458,8 +26973,8 @@ chosen to be simple and to allow for reasonable @command{awk}-style default arguments. 
@end quotation -@cindex @code{dcngettext()} function (@command{gawk}) -@item dcngettext(@var{string1}, @var{string2}, @var{number} @r{[}, @var{domain} @r{[}, @var{category}@r{]]}) +@cindexgawkfunc{dcngettext} +@item @code{dcngettext(@var{string1}, @var{string2}, @var{number}} [@code{,} @var{domain} [@code{,} @var{category}]]@code{)} Return the plural form used for @var{number} of the translation of @var{string1} and @var{string2} in text domain @var{domain} for locale category @var{category}. @var{string1} is the @@ -25474,10 +26989,10 @@ The same remarks about argument order as for the @code{dcgettext()} function app @cindex files, @code{.gmo}, specifying directory of @cindex message object files, specifying directory of @cindex files, message object, specifying directory of -@cindex @code{bindtextdomain()} function (@command{gawk}) -@item bindtextdomain(@var{directory} @r{[}, @var{domain}@r{]}) +@cindexgawkfunc{bindtextdomain} +@item @code{bindtextdomain(@var{directory}} [@code{,} @var{domain} ]@code{)} Change the directory in which -@code{gettext} looks for @file{.gmo} files, in case they +@command{gettext} looks for @file{.gmo} files, in case they will not or cannot be placed in the standard locations (e.g., during testing). Return the directory in which @var{domain} is ``bound.'' @@ -25576,7 +27091,7 @@ and use translations from @command{awk}. @cindex portable object files @cindex files, portable object Once a program's translatable strings have been marked, they must -be extracted to create the initial @file{.po} file. +be extracted to create the initial @file{.pot} file. As part of translation, it is often helpful to rearrange the order in which arguments to @code{printf} are output. @@ -25596,13 +27111,13 @@ is covered. 
@subsection Extracting Marked Strings @cindex strings, extracting @cindex marked strings@comma{} extracting -@cindex @code{--gen-pot} option +@cindex @option{--gen-pot} option @cindex command-line options, string extraction @cindex string extraction (internationalization) @cindex marked string extraction (internationalization) @cindex extraction, of marked strings (internationalization) -@cindex @code{--gen-pot} option +@cindex @option{--gen-pot} option Once your @command{awk} program is working, and all the strings have been marked and you've set (and perhaps bound) the text domain, it is time to produce translations. @@ -25616,15 +27131,17 @@ $ @kbd{gawk --gen-pot -f guide.awk > guide.pot} @cindex @code{xgettext} utility When run with @option{--gen-pot}, @command{gawk} does not execute your program. Instead, it parses it as usual and prints all marked strings -to standard output in the format of a GNU @code{gettext} Portable Object +to standard output in the format of a GNU @command{gettext} Portable Object file. Also included in the output are any constant strings that appear as the first argument to @code{dcgettext()} or as the first and second argument to @code{dcngettext()}.@footnote{The @command{xgettext} utility that comes with GNU -@code{gettext} can handle @file{.awk} files.} +@command{gettext} can handle @file{.awk} files.} @xref{I18N Example}, for the full list of steps to go through to create and test translations for @command{guide}. +@c ENDOFRANGE portobfi +@c ENDOFRANGE portmsgfi @node Printf Ordering @subsection Rearranging @code{printf} Arguments @@ -25635,9 +27152,8 @@ Format strings for @code{printf} and @code{sprintf()} (@pxref{Printf}) present a special problem for translation. 
Consider the following:@footnote{This example is borrowed -from the GNU @code{gettext} manual.} +from the GNU @command{gettext} manual.} -@c line broken here only for smallbook format @example printf(_"String `%s' has %d characters\n", string, length(string))) @@ -25671,7 +27187,7 @@ example, @samp{string} is the first argument and @samp{length(string)} is the se @example $ @kbd{gawk 'BEGIN @{} > @kbd{string = "Dont Panic"} -> @kbd{printf _"%2$d characters live in \"%1$s\"\n",} +> @kbd{printf "%2$d characters live in \"%1$s\"\n",} > @kbd{string, length(string)} > @kbd{@}'} @print{} 10 characters live in "Dont Panic" @@ -25705,7 +27221,7 @@ This is somewhat counterintuitive. and those with positional specifiers in the same string: @example -$ @kbd{gawk 'BEGIN @{ printf _"%d %3$s\n", 1, 2, "hi" @}'} +$ @kbd{gawk 'BEGIN @{ printf "%d %3$s\n", 1, 2, "hi" @}'} @error{} gawk: cmd. line:1: fatal: must use `count$' on all formats or none @end example @@ -25745,7 +27261,7 @@ As written, it won't work on other versions of @command{awk}. However, it is actually almost portable, requiring very little change: -@itemize @bullet +@itemize @value{BULLET} @cindex @code{TEXTDOMAIN} variable, portability and @item Assignments to @code{TEXTDOMAIN} won't have any effect, @@ -25885,33 +27401,34 @@ msgstr "Like, the scoop is" @cindex Linux @cindex GNU/Linux The next step is to make the directory to hold the binary message object -file and then to create the @file{guide.gmo} file. -The directory layout shown here is standard for GNU @code{gettext} on -GNU/Linux systems. Other versions of @code{gettext} may use a different +file and then to create the @file{guide.mo} file. +We pretend that our file is to be used in the @code{en_US.UTF-8} locale. +The directory layout shown here is standard for GNU @command{gettext} on +GNU/Linux systems. 
Other versions of @command{gettext} may use a different layout: @example -$ @kbd{mkdir en_US en_US/LC_MESSAGES} +$ @kbd{mkdir en_US.UTF-8 en_US.UTF-8/LC_MESSAGES} @end example -@cindex @code{.po} files, converting to @code{.gmo} -@cindex files, @code{.po}, converting to @code{.gmo} -@cindex @code{.gmo} files, converting from @code{.po} -@cindex files, @code{.gmo}, converting from @code{.po} +@cindex @code{.po} files, converting to @code{.mo} +@cindex files, @code{.po}, converting to @code{.mo} +@cindex @code{.mo} files, converting from @code{.po} +@cindex files, @code{.mo}, converting from @code{.po} @cindex portable object files, converting to message object files @cindex files, portable object, converting to message object files @cindex message object files, converting from portable object files @cindex files, message object, converting from portable object files @cindex @command{msgfmt} utility The @command{msgfmt} utility does the conversion from human-readable -@file{.po} file to machine-readable @file{.gmo} file. +@file{.po} file to machine-readable @file{.mo} file. By default, @command{msgfmt} creates a file named @file{messages}. This file must be renamed and placed in the proper directory so that @command{gawk} can find it: @example $ @kbd{msgfmt guide-mellow.po} -$ @kbd{mv messages en_US/LC_MESSAGES/guide.gmo} +$ @kbd{mv messages en_US.UTF-8/LC_MESSAGES/guide.mo} @end example Finally, we run the program to test it: @@ -25940,30 +27457,71 @@ $ @kbd{gawk --posix -f guide.awk -f libintl.awk} @section @command{gawk} Can Speak Your Language @command{gawk} itself has been internationalized -using the GNU @code{gettext} package. -(GNU @code{gettext} is described in +using the GNU @command{gettext} package. +(GNU @command{gettext} is described in complete detail in @ifinfo -@inforef{Top, , GNU @code{gettext} utilities, gettext, GNU gettext tools}.) +@inforef{Top, , GNU @command{gettext} utilities, gettext, GNU gettext tools}.) 
@end ifinfo @ifnotinfo -@cite{GNU gettext tools}.) +@uref{http://www.gnu.org/software/gettext/manual/, +@cite{GNU gettext tools}}.) @end ifnotinfo -As of this writing, the latest version of GNU @code{gettext} is -@uref{ftp://ftp.gnu.org/gnu/gettext/gettext-0.18.2.1.tar.gz, version 0.18.2.1}. +As of this writing, the latest version of GNU @command{gettext} is +@uref{ftp://ftp.gnu.org/gnu/gettext/gettext-0.19.1.tar.gz, +@value{PVERSION} 0.19.1}. If a translation of @command{gawk}'s messages exists, then @command{gawk} produces usage messages, warnings, and fatal errors in the local language. -@c ENDOFRANGE inloc -@c The original text for this chapter was contributed by Efraim Yawitz. -@c FIXME: Add more indexing. +@node I18N Summary +@section Summary + +@itemize @value{BULLET} +@item +Internationalization means writing a program such that it can use multiple +languages without requiring source-code changes. Localization means +providing the data necessary for an internationalized program to work +in a particular language. + +@item +@command{gawk} uses GNU @command{gettext} to let you internationalize +and localize @command{awk} programs. A program's text domain identifies +the program for grouping all messages and other data together. + +@item +You mark a program's strings for translation by preceding them with +an underscore. Once that is done, the strings are extracted into a +@file{.pot} file. This file is copied for each language into a @file{.po} +file, and the @file{.po} files are compiled into @file{.gmo} files for +use at runtime. + +@item +You can use position specifications with @code{sprintf()} and +@code{printf} to rearrange the placement of argument values in formatted +strings and output. This is useful for the translations of format +control strings. + +@item +The internationalization features have been designed so that they +can be easily worked around in a standard @command{awk}. 
+ +@item +@command{gawk} itself has been internationalized and ships with +a number of translations for its messages. + +@end itemize + +@c ENDOFRANGE inloc @node Debugger @chapter Debugging @command{awk} Programs @cindex debugging @command{awk} programs +@c The original text for this chapter was contributed by Efraim Yawitz. +@c FIXME: Add more indexing. + It would be nice if computer programs worked perfectly the first time they were run, but in real life, this rarely happens for programs of any complexity. Thus, most programming languages have facilities available @@ -25980,10 +27538,11 @@ how to use @command{gawk} for debugging your program is easy. * List of Debugger Commands:: Main debugger commands. * Readline Support:: Readline support. * Limitations:: Limitations and future plans. +* Debugging Summary:: Debugging summary. @end menu @node Debugging -@section Introduction to @command{gawk} Debugger +@section Introduction to the @command{gawk} Debugger This @value{SECTION} introduces debugging in general and begins the discussion of debugging in @command{gawk}. @@ -26008,7 +27567,7 @@ In that case, what can you expect from such a tool? The answer to that depends on the language being debugged, but in general, you can expect at least the following: -@itemize @bullet +@itemize @value{BULLET} @item The ability to watch a program execute its instructions one by one, giving you, the programmer, the opportunity to think about what is happening @@ -26046,6 +27605,7 @@ The following list defines terms used throughout the rest of this @value{CHAPTER}. @table @dfn +@cindex stack frame @item Stack Frame Programs generally call functions during the course of their execution. One function can call another, or a function can call itself (recursion). @@ -26067,6 +27627,7 @@ invoked. Commands that print the call stack print information about each stack frame (as detailed later on). 
@item Breakpoint +@cindex breakpoint During debugging, you often wish to let the program run until it reaches a certain point, and then continue execution from there one statement (or instruction) at a time. The way to do this is to set @@ -26076,6 +27637,7 @@ take over control of the program's execution. You can add and remove as many breakpoints as you like. @item Watchpoint +@cindex watchpoint A watchpoint is similar to a breakpoint. The difference is that breakpoints are oriented around the code: stop when a certain point in the code is reached. A watchpoint, however, specifies that program execution @@ -26107,6 +27669,7 @@ by the higher-level @command{awk} commands. @node Sample Debugging Session @section Sample Debugging Session +@cindex sample debugging session In order to illustrate the use of @command{gawk} as a debugger, let's look at a sample debugging session. We will use the @command{awk} implementation of the @@ -26120,13 +27683,16 @@ as our example. @node Debugger Invocation @subsection How to Start the Debugger +@cindex starting the debugger +@cindex debugger, how to start -Starting the debugger is almost exactly like running @command{awk}, except you have to -pass an additional option @option{--debug} or the corresponding short option @option{-D}. -The file(s) containing the program and any supporting code are given on the command -line as arguments to one or more @option{-f} options. (@command{gawk} is not designed -to debug command-line programs, only programs contained in files.) In our case, -we invoke the debugger like this: +Starting the debugger is almost exactly like running @command{gawk}, +except you have to pass an additional option @option{--debug} or the +corresponding short option @option{-D}. The file(s) containing the +program and any supporting code are given on the command line as arguments +to one or more @option{-f} options. (@command{gawk} is not designed +to debug command-line programs, only programs contained in files.) 
+In our case, we invoke the debugger like this: @example $ @kbd{gawk -D -f getopt.awk -f join.awk -f uniq.awk inputfile} @@ -26259,7 +27825,7 @@ gawk> @kbd{p NR} @noindent So we can see that @code{are_equal()} was only called for the second record -of the file. Of course, this is because our program contained a rule for +of the file. Of course, this is because our program contains a rule for @samp{NR == 1}: @example @@ -26291,13 +27857,7 @@ This tells us that @command{gawk} is now ready to execute line 67, which decides whether to give the lines the special ``field skipping'' treatment indicated by the @option{-f} command-line option. (Notice that we skipped from where we were before at line 64 to here, since the condition in line 64 - -@example -if (fcount == 0 && charcount == 0) -@end example - -@noindent -was false.) +@samp{if (fcount == 0 && charcount == 0)} was false.) Continuing to step, we now get to the splitting of the current and last records: @@ -26406,7 +27966,7 @@ and problem solved! The @command{gawk} debugger command set can be divided into the following categories: -@itemize @bullet{} +@itemize @value{BULLET} @item Breakpoint control @@ -26432,7 +27992,7 @@ In the following descriptions, commands which may be abbreviated show the abbreviation on a second description line. A debugger command name may also be truncated if that partial name is unambiguous. The debugger has the built-in capability to -automatically repeat the previous command when just hitting @key{Enter}. +automatically repeat the previous command just by hitting @key{Enter}. This works for the commands @code{list}, @code{next}, @code{nexti}, @code{step}, @code{stepi} and @code{continue} executed without any argument. 
@@ -26459,21 +28019,24 @@ controlling breakpoints are: @cindex debugger commands, @code{break} @cindex @code{break} debugger command @cindex @code{b} debugger command (alias for @code{break}) +@cindex set breakpoint +@cindex breakpoint, setting @item @code{break} [[@var{filename}@code{:}]@var{n} | @var{function}] [@code{"@var{expression}"}] @itemx @code{b} [[@var{filename}@code{:}]@var{n} | @var{function}] [@code{"@var{expression}"}] Without any argument, set a breakpoint at the next instruction to be executed in the selected stack frame. Arguments can be one of the following: +@c @asis for docbook @c nested table -@table @var -@item n +@table @asis +@item @var{n} Set a breakpoint at line number @var{n} in the current source file. -@item filename@code{:}n +@item @var{filename}@code{:}@var{n} Set a breakpoint at line number @var{n} in source file @var{filename}. -@item function +@item @var{function} Set a breakpoint at entry to (the first instruction of) function @var{function}. @end table @@ -26489,6 +28052,8 @@ it continues executing the program. @cindex debugger commands, @code{clear} @cindex @code{clear} debugger command +@cindex delete breakpoint at location +@cindex breakpoint at location, how to delete @item @code{clear} [[@var{filename}@code{:}]@var{n} | @var{function}] Without any argument, delete any breakpoint at the next instruction to be executed in the selected stack frame. If the program stops at @@ -26496,19 +28061,20 @@ a breakpoint, this deletes that breakpoint so that the program does not stop at that location again. Arguments can be one of the following: @c nested table -@table @var -@item n +@table @asis +@item @var{n} Delete breakpoint(s) set at line number @var{n} in the current source file. -@item filename@code{:}n +@item @var{filename}@code{:}@var{n} Delete breakpoint(s) set at line number @var{n} in source file @var{filename}. -@item function +@item @var{function} Delete breakpoint(s) set at entry to function @var{function}. 
@end table @cindex debugger commands, @code{condition} @cindex @code{condition} debugger command +@cindex breakpoint condition @item @code{condition} @var{n} @code{"@var{expression}"} Add a condition to existing breakpoint or watchpoint @var{n}. The condition is an @command{awk} expression that the debugger evaluates @@ -26522,6 +28088,8 @@ watchpoint is made unconditional. @cindex debugger commands, @code{delete} @cindex @code{delete} debugger command @cindex @code{d} debugger command (alias for @code{delete}) +@cindex delete breakpoint by number +@cindex breakpoint, delete by number @item @code{delete} [@var{n1 n2} @dots{}] [@var{n}--@var{m}] @itemx @code{d} [@var{n1 n2} @dots{}] [@var{n}--@var{m}] Delete specified breakpoints or a range of breakpoints. Deletes @@ -26529,6 +28097,8 @@ all defined breakpoints if no argument is supplied. @cindex debugger commands, @code{disable} @cindex @code{disable} debugger command +@cindex disable breakpoint +@cindex breakpoint, how to disable or enable @item @code{disable} [@var{n1 n2} @dots{} | @var{n}--@var{m}] Disable specified breakpoints or a range of breakpoints. Without any argument, disables all breakpoints. @@ -26537,6 +28107,7 @@ any argument, disables all breakpoints. @cindex debugger commands, @code{enable} @cindex @code{enable} debugger command @cindex @code{e} debugger command (alias for @code{enable}) +@cindex enable breakpoint @item @code{enable} [@code{del} | @code{once}] [@var{n1 n2} @dots{}] [@var{n}--@var{m}] @itemx @code{e} [@code{del} | @code{once}] [@var{n1 n2} @dots{}] [@var{n}--@var{m}] Enable specified breakpoints or a range of breakpoints. Without @@ -26556,6 +28127,7 @@ the program stops at the breakpoint. @cindex debugger commands, @code{ignore} @cindex @code{ignore} debugger command +@cindex ignore breakpoint @item @code{ignore} @var{n} @var{count} Ignore breakpoint number @var{n} the next @var{count} times it is hit. @@ -26564,6 +28136,7 @@ hit. 
@cindex debugger commands, @code{tbreak} @cindex @code{tbreak} debugger command @cindex @code{t} debugger command (alias for @code{tbreak}) +@cindex temporary breakpoint @item @code{tbreak} [[@var{filename}@code{:}]@var{n} | @var{function}] @itemx @code{t} [[@var{filename}@code{:}]@var{n} | @var{function}] Set a temporary breakpoint (enabled for only one stop). @@ -26584,6 +28157,8 @@ execution of the program than we saw in our earlier example: @cindex @code{silent} debugger command @cindex debugger commands, @code{end} @cindex @code{end} debugger command +@cindex breakpoint commands +@cindex commands to execute at breakpoint @item @code{commands} [@var{n}] @itemx @code{silent} @itemx @dots{} @@ -26611,6 +28186,7 @@ gawk> @cindex debugger commands, @code{c} (@code{continue}) @cindex debugger commands, @code{continue} +@cindex continue program, in debugger @item @code{continue} [@var{count}] @itemx @code{c} [@var{count}] Resume program execution. If continued from a breakpoint and @var{count} is @@ -26627,6 +28203,7 @@ Print the returned value. @cindex debugger commands, @code{next} @cindex @code{next} debugger command @cindex @code{n} debugger command (alias for @code{next}) +@cindex single-step execution, in the debugger @item @code{next} [@var{count}] @itemx @code{n} [@var{count}] Continue execution to the next source line, stepping over function calls. @@ -26721,6 +28298,7 @@ items on the list. @cindex debugger commands, @code{eval} @cindex @code{eval} debugger command +@cindex evaluate expressions, in debugger @item @code{eval "@var{awk statements}"} Evaluate @var{awk statements} in the context of the running program. You can do anything that an @command{awk} program would do: assign @@ -26738,6 +28316,7 @@ parameters defined by the program. 
@cindex debugger commands, @code{print} @cindex @code{print} debugger command @cindex @code{p} debugger command (alias for @code{print}) +@cindex print variables, in debugger @item @code{print} @var{var1}[@code{,} @var{var2} @dots{}] @itemx @code{p} @var{var1}[@code{,} @var{var2} @dots{}] Print the value of a @command{gawk} variable or field. @@ -26771,10 +28350,11 @@ No newline is printed unless one is specified. @cindex debugger commands, @code{set} @cindex @code{set} debugger command +@cindex assign values to variables, in debugger @item @code{set} @var{var}@code{=}@var{value} Assign a constant (number or string) value to an @command{awk} variable or field. -String values must be enclosed between double quotes (@code{"@dots{}"}). +String values must be enclosed between double quotes (@code{"}@dots{}@code{"}). You can also set special @command{awk} variables, such as @code{FS}, @code{NF}, @code{NR}, etc. @@ -26783,6 +28363,7 @@ You can also set special @command{awk} variables, such as @code{FS}, @cindex debugger commands, @code{watch} @cindex @code{watch} debugger command @cindex @code{w} debugger command (alias for @code{watch}) +@cindex set watchpoint @item @code{watch} @var{var} | @code{$}@var{n} [@code{"@var{expression}"}] @itemx @code{w} @var{var} | @code{$}@var{n} [@code{"@var{expression}"}] Add variable @var{var} (or field @code{$@var{n}}) to the watch list. @@ -26799,12 +28380,14 @@ then the debugger stops execution and prompts for a command. Otherwise, @cindex debugger commands, @code{undisplay} @cindex @code{undisplay} debugger command +@cindex stop automatic display, in debugger @item @code{undisplay} [@var{n}] Remove item number @var{n} (or all items, if no argument) from the automatic display list. @cindex debugger commands, @code{unwatch} @cindex @code{unwatch} debugger command +@cindex delete watchpoint @item @code{unwatch} [@var{n}] Remove item number @var{n} (or all items, if no argument) from the watch list. 
@@ -26825,12 +28408,14 @@ functions which called the one you are in. The commands for doing this are: @cindex debugger commands, @code{backtrace} @cindex @code{backtrace} debugger command @cindex @code{bt} debugger command (alias for @code{backtrace}) +@cindex call stack, display in debugger +@cindex traceback, display in debugger @item @code{backtrace} [@var{count}] @itemx @code{bt} [@var{count}] Print a backtrace of all function calls (stack frames), or innermost @var{count} frames if @var{count} > 0. Print the outermost @var{count} frames if @var{count} < 0. The backtrace displays the name and arguments to each -function, the source file name, and the line number. +function, the source @value{FN}, and the line number. @cindex debugger commands, @code{down} @cindex @code{down} debugger command @@ -26844,10 +28429,11 @@ Then select and print the frame. @cindex @code{f} debugger command (alias for @code{frame}) @item @code{frame} [@var{n}] @itemx @code{f} [@var{n}] -Select and print (frame number, function and argument names, source file, -and the source line) stack frame @var{n}. Frame 0 is the currently executing, -or @dfn{innermost}, frame (function call), frame 1 is the frame that called the -innermost one. The highest numbered frame is the one for the main program. +Select and print stack frame @var{n}. Frame 0 is the currently executing, +or @dfn{innermost}, frame (function call), frame 1 is the frame that +called the innermost one. The highest numbered frame is the one for the +main program. The printed information consists of the frame number, +function and argument names, source file, and the source line. @cindex debugger commands, @code{up} @cindex @code{up} debugger command @@ -26878,25 +28464,32 @@ The value for @var{what} should be one of the following: @c nested table @table @code @item args +@cindex show function arguments, in debugger Arguments of the selected frame. @item break +@cindex show breakpoints List all currently set breakpoints. 
@item display +@cindex automatic displays, in debugger List all items in the automatic display list. @item frame +@cindex describe call stack frame, in debugger Description of the selected stack frame. @item functions -List all function definitions including source file names and +@cindex list function definitions, in debugger +List all function definitions including source @value{FN}s and line numbers. @item locals +@cindex show local variables, in debugger Local variables of the selected frame. @item source +@cindex show name of current source file, in debugger The name of the current source file. Each time the program stops, the current source file is the file containing the current instruction. When the debugger first starts, the current source file is the first file @@ -26905,12 +28498,15 @@ included via the @option{-f} option. The be used at any time to change the current source. @item sources +@cindex show all source files, in debugger List all program sources. @item variables +@cindex list all global variables, in debugger List all global variables. @item watch +@cindex show watchpoints List all items in the watch list. @end table @end table @@ -26924,6 +28520,8 @@ from a file. The commands are: @cindex debugger commands, @code{option} @cindex @code{option} debugger command @cindex @code{o} debugger command (alias for @code{option}) +@cindex display debugger options +@cindex debugger options @item @code{option} [@var{name}[@code{=}@var{value}]] @itemx @code{o} [@var{name}[@code{=}@var{value}]] Without an argument, display the available debugger options @@ -26933,40 +28531,49 @@ a new value to the named option. The available options are: @c nested table -@table @code -@item history_size +@c asis for docbook +@table @asis +@item @code{history_size} +@cindex debugger history size The maximum number of lines to keep in the history file @file{./.gawk_history}. The default is 100. 
-@item listsize +@item @code{listsize} +@cindex debugger default list amount The number of lines that @code{list} prints. The default is 15. -@item outfile +@item @code{outfile} +@cindex redirect @command{gawk} output, in debugger Send @command{gawk} output to a file; debugger output still goes to standard output. An empty string (@code{""}) resets output to standard output. -@item prompt +@item @code{prompt} +@cindex debugger prompt The debugger prompt. The default is @samp{@w{gawk> }}. -@item save_history @r{[}on @r{|} off@r{]} +@item @code{save_history} [@code{on} | @code{off}] +@cindex debugger history file Save command history to file @file{./.gawk_history}. The default is @code{on}. -@item save_options @r{[}on @r{|} off@r{]} +@item @code{save_options} [@code{on} | @code{off}] +@cindex save debugger options Save current options to file @file{./.gawkrc} upon exit. The default is @code{on}. Options are read back in to the next session upon startup. -@item trace @r{[}on @r{|} off@r{]} +@item @code{trace} [@code{on} | @code{off}] +@cindex instruction tracing, in debugger Turn instruction tracing on or off. The default is @code{off}. @end table @item @code{save} @var{filename} -Save the commands from the current session to the given file name, +Save the commands from the current session to the given @value{FN}, so that they can be replayed using the @command{source} command. @item @code{source} @var{filename} +@cindex debugger, read commands from a file Run command(s) from a file; an error in any command does not terminate execution of subsequent commands. Comments (lines starting with @samp{#}) are allowed in a command file. @@ -27002,7 +28609,7 @@ partial dump of Davide Brini's obfuscated code @smallexample gawk> @kbd{dump} -@print{} # BEGIN +@print{} # BEGIN @print{} @print{} [ 1:0xfcd340] Op_rule : [in_rule = BEGIN] [source_file = brini.awk] @print{} [ 1:0xfcc240] Op_push_i : "~" [MALLOC|STRING|STRCUR] @@ -27065,8 +28672,8 @@ about the command @var{command}. 
@cindex debugger commands, @code{list} @cindex @code{list} debugger command @cindex @code{l} debugger command (alias for @code{list}) -@item @code{list} [@code{-} | @code{+} | @var{n} | @var{filename@code{:}n} | @var{n}--@var{m} | @var{function}] -@itemx @code{l} [@code{-} | @code{+} | @var{n} | @var{filename@code{:}n} | @var{n}--@var{m} | @var{function}] +@item @code{list} [@code{-} | @code{+} | @var{n} | @var{filename}@code{:}@var{n} | @var{n}--@var{m} | @var{function}] +@itemx @code{l} [@code{-} | @code{+} | @var{n} | @var{filename}@code{:}@var{n} | @var{n}--@var{m} | @var{function}] Print the specified lines (default 15) from the current source file or the file named @var{filename}. The possible arguments to @code{list} are as follows: @@ -27086,7 +28693,7 @@ Print lines centered around line number @var{n}. @item @var{n}--@var{m} Print lines from @var{n} to @var{m}. -@item @var{filename@code{:}n} +@item @var{filename}@code{:}@var{n} Print lines centered around line number @var{n} in source file @var{filename}. This command may change the current source file. @@ -27099,6 +28706,7 @@ function @var{function}. This command may change the current source file. @cindex debugger commands, @code{quit} @cindex @code{quit} debugger command @cindex @code{q} debugger command (alias for @code{quit}) +@cindex exit the debugger @item @code{quit} @itemx @code{q} Exit the debugger. Debugging is great fun, but sometimes we all have @@ -27109,7 +28717,7 @@ running a program, the debugger warns you if you accidentally type @cindex debugger commands, @code{trace} @cindex @code{trace} debugger command -@item @code{trace} @code{on} @r{|} @code{off} +@item @code{trace} [@code{on} | @code{off}] Turn on or off a continuous printing of instructions which are about to be executed, along with printing the @command{awk} line which they implement. The default is @code{off}. 
@@ -27122,17 +28730,21 @@ fairly self-explanatory, and using @code{stepi} and @code{nexti} while @node Readline Support @section Readline Support +@cindex command completion, in debugger +@cindex history expansion, in debugger -If @command{gawk} is compiled with the @code{readline} library, you -can take advantage of that library's command completion and history expansion -features. The following types of completion are available: +If @command{gawk} is compiled with +@uref{http://cnswww.cns.cwru.edu/php/chet/readline/readline.html, +the @code{readline} library}, you can take advantage of that library's +command completion and history expansion features. The following types +of completion are available: @table @asis @item Command completion Command names. -@item Source file name completion -Source file names. Relevant commands are +@item Source @value{FN} completion +Source @value{FN}s. Relevant commands are @code{break}, @code{clear}, @code{list}, @@ -27162,7 +28774,7 @@ We hope you find the @command{gawk} debugger useful and enjoyable to work with, but as with any program, especially in its early releases, it still has some limitations. A few which are worth being aware of are: -@itemize @bullet{} +@itemize @value{BULLET} @item At this point, the debugger does not give a detailed explanation of what you did wrong when you type in something it doesn't like. Rather, it just @@ -27175,9 +28787,10 @@ If you perused the dump of opcodes in @ref{Miscellaneous Debugger Commands}, you will realize that much of the internal manipulation of data in @command{gawk}, as in many interpreters, is done on a stack. @code{Op_push}, @code{Op_pop}, etc., are the ``bread and butter'' of -most @command{gawk} code. Unfortunately, as of now, the @command{gawk} -debugger does not allow you to examine the stack's contents. +most @command{gawk} code. +Unfortunately, as of now, the @command{gawk} +debugger does not allow you to examine the stack's contents. 
That is, the intermediate results of expression evaluation are on the stack, but cannot be printed. Rather, only variables which are defined in the program can be printed. Of course, a workaround for @@ -27204,433 +28817,329 @@ The @command{gawk} debugger only accepts source supplied with the @option{-f} op Look forward to a future release when these and other missing features may be added, and of course feel free to try to add them yourself! +@node Debugging Summary +@section Summary + +@itemize @value{BULLET} +@item +Programs rarely work correctly the first time. Finding bugs +is @dfn{debugging} and a program that helps you find bugs is a +@dfn{debugger}. @command{gawk} has a built-in debugger that works very +similarly to the GNU Debugger, GDB. + +@item +Debuggers let you step through your program one statement at a time, +examine and change variable and array values, and do a number of other +things that let you understand what your program is actually doing (as +opposed to what it is supposed to do). + +@item +Like most debuggers, the @command{gawk} debugger works in terms of stack +frames, and lets you set both breakpoints (stop at a point in the code) +and watchpoints (stop when a data value changes). + +@item +The debugger command set is fairly complete, providing control over +breakpoints, execution, viewing and changing data, working with the stack, +getting information, and other tasks. + +@item +If the @code{readline} library is available when @command{gawk} is +compiled, it is used by the debugger to provide command-line history +and editing. + +@end itemize + @node Arbitrary Precision Arithmetic @chapter Arithmetic and Arbitrary Precision Arithmetic with @command{gawk} @cindex arbitrary precision @cindex multiple precision @cindex infinite precision -@cindex floating-point numbers, arbitrary precision -@cindex MPFR -@cindex GMP - -@cindex Knuth, Donald -@quotation -@i{There's a credibility gap: We don't know how much of the computer's answers -to believe.
Novice computer users solve this problem by implicitly trusting -in the computer as an infallible authority; they tend to believe that all -digits of a printed answer are significant. Disillusioned computer users have -just the opposite approach; they are constantly afraid that their answers -are almost meaningless.}@footnote{Donald E.@: Knuth. -@cite{The Art of Computer Programming}. Volume 2, -@cite{Seminumerical Algorithms}, third edition, -1998, ISBN 0-201-89683-4, p.@: 229.} -@author Donald Knuth -@end quotation - -This @value{CHAPTER} discusses issues that you may encounter -when performing arithmetic. It begins by discussing some of -the general attributes of computer arithmetic, along with how -this can influence what you see when running @command{awk} programs. -This discussion applies to all versions of @command{awk}. - -The @value{CHAPTER} then moves on to describe @dfn{arbitrary precision -arithmetic}, a feature which is specific to @command{gawk}. +@cindex floating-point, numbers@comma{} arbitrary precision + +This @value{CHAPTER} introduces some basic concepts relating to +how computers do arithmetic and briefly lists the features in +@command{gawk} for performing arbitrary precision floating point +computations. It then proceeds to describe floating-point arithmetic, +which is what @command{awk} uses for all its computations, including a +discussion of arbitrary precision floating point arithmetic, which is +a feature available only in @command{gawk}. It continues on to present +arbitrary precision integers, and concludes with a description of some +points where @command{gawk} and the POSIX standard are not quite in +agreement. @menu -* General Arithmetic:: An introduction to computer arithmetic. -* Floating-point Programming:: Effective Floating-point Programming. -* Gawk and MPFR:: How @command{gawk} provides - arbitrary-precision arithmetic. -* Arbitrary Precision Floats:: Arbitrary Precision Floating-point Arithmetic - with @command{gawk}. 
-* Arbitrary Precision Integers:: Arbitrary Precision Integer Arithmetic with - @command{gawk}. +* Computer Arithmetic:: A quick intro to computer math. +* Math Definitions:: Defining terms used. +* MPFR features:: The MPFR features in @command{gawk}. +* FP Math Caution:: Things to know. +* Arbitrary Precision Integers:: Arbitrary Precision Integer Arithmetic with + @command{gawk}. +* POSIX Floating Point Problems:: Standards Versus Existing Practice. +* Floating point summary:: Summary of floating point discussion. @end menu -@node General Arithmetic +@node Computer Arithmetic @section A General Description of Computer Arithmetic -@cindex integers -@cindex floating-point, numbers -@cindex numbers, floating-point -Within computers, there are two kinds of numeric values: @dfn{integers} -and @dfn{floating-point}. -In school, integer values were referred to as ``whole'' numbers---that is, -numbers without any fractional part, such as 1, 42, or @minus{}17. +Until now, we have worked with data as either numbers or +strings. Ultimately, however, computers represent everything in terms +of @dfn{binary digits}, or @dfn{bits}. A decimal digit can take on any +of 10 values: zero through nine. A binary digit can take on any of two +values, zero or one. Using binary, computers (and computer software) +can represent and manipulate numerical and character data. In general, +the more bits you can use to represent a particular thing, the greater +the range of possible values it can take on. + +Modern computers support at least two, and often more, ways to do +arithmetic. Each kind of arithmetic uses a different representation +(organization of the bits) for the numbers. The kinds of arithmetic +that interest us are: + +@table @asis +@item Decimal arithmetic +This is the kind of arithmetic you learned in elementary school, using +paper and pencil (and/or a calculator). 
In theory, numbers can have an +arbitrary number of digits on either side (or both sides) of the decimal +point, and the results of a computation are always exact. + +Some modern systems can do decimal arithmetic in hardware, but usually you +need a special software library to provide access to these instructions. +There are also libraries that do decimal arithmetic entirely in software. + +Despite the fact that some users expect @command{gawk} to be performing +decimal arithmetic,@footnote{We don't know why they expect this, but +they do.} it does not do so. + +@item Integer arithmetic +In school, integer values were referred to as ``whole'' numbers---that +is, numbers without any fractional part, such as 1, 42, or @minus{}17. The advantage to integer numbers is that they represent values exactly. -The disadvantage is that their range is limited. On most systems, -this range is @minus{}2,147,483,648 to 2,147,483,647. -However, many systems now support a range from -@minus{}9,223,372,036,854,775,808 to 9,223,372,036,854,775,807. +The disadvantage is that their range is limited. @cindex unsigned integers @cindex integers, unsigned -Integer values come in two flavors: @dfn{signed} and @dfn{unsigned}. -Signed values may be negative or positive, with the range of values just -described. -Unsigned values are always positive. On most systems, -the range is from 0 to 4,294,967,295. -However, many systems now support a range from -0 to 18,446,744,073,709,551,615. - -@cindex double precision floating-point -@cindex single precision floating-point -Floating-point numbers represent what are called ``real'' numbers; i.e., -those that do have a fractional part, such as 3.1415927. -The advantage to floating-point numbers is that they -can represent a much larger range of values. -The disadvantage is that there are numbers that they cannot represent -exactly.
-@command{awk} uses @dfn{double precision} floating-point numbers, which -can hold more digits than @dfn{single precision} -floating-point numbers. -@c Floating-point issues are discussed more fully in -@c @ref{Floating Point Issues}. - -There a several important issues to be aware of, described next. +In computers, integer values come in two flavors: @dfn{signed} and +@dfn{unsigned}. Signed values may be negative or positive, whereas +unsigned values are always positive (that is, greater than or equal +to zero). + +In computer systems, integer arithmetic is exact, but the possible +range of values is limited. Integer arithmetic is generally faster than +floating point arithmetic. + +@item Floating point arithmetic +Floating-point numbers represent what were called in school ``real'' +numbers; i.e., those that have a fractional part, such as 3.1415927. +The advantage to floating-point numbers is that they can represent a +much larger range of values than can integers. The disadvantage is that +there are numbers that they cannot represent exactly. + +Modern systems support floating point arithmetic in hardware, with a +limited range of values. There are software libraries that allow +the use of arbitrary precision floating point calculations. + +POSIX @command{awk} uses @dfn{double precision} floating-point numbers, which +can hold more digits than @dfn{single precision} floating-point numbers. +@command{gawk} has facilities for performing arbitrary precision floating +point arithmetic, which we describe in more detail shortly. +@end table -@menu -* Floating Point Issues:: Stuff to know about floating-point numbers. -* Integer Programming:: Effective integer programming. -@end menu +Computers work with integer and floating point values of different +ranges. Integer values are usually either 32 or 64 bits in size. Single +precision floating point values occupy 32 bits, whereas double precision +floating point values occupy 64 bits. 
Floating point values are always +signed. The possible ranges of values are shown in the following table. + +@multitable @columnfractions .34 .33 .33 +@headitem Numeric representation @tab Minimum value @tab Maximum value +@item 32-bit signed integer @tab @minus{}2,147,483,648 @tab 2,147,483,647 +@item 32-bit unsigned integer @tab 0 @tab 4,294,967,295 +@item 64-bit signed integer @tab @minus{}9,223,372,036,854,775,808 @tab 9,223,372,036,854,775,807 +@item 64-bit unsigned integer @tab 0 @tab 18,446,744,073,709,551,615 +@item Single precision floating point (approximate) @tab @code{1.175494e-38} @tab @code{3.402823e+38} +@item Double precision floating point (approximate) @tab @code{2.225074e-308} @tab @code{1.797693e+308} +@end multitable -@node Floating Point Issues -@subsection Floating-Point Number Caveats +@node Math Definitions +@section Other Stuff To Know -This @value{SECTION} describes some of the issues -involved in using floating-point numbers. +The rest of this @value{CHAPTER} uses a number of terms. Here are some +informal definitions that should help you work your way through the material +here. -There is a very nice -@uref{http://www.validlab.com/goldberg/paper.pdf, paper on floating-point arithmetic} -by David Goldberg, -``What Every Computer Scientist Should Know About Floating-point Arithmetic,'' -@cite{ACM Computing Surveys} @strong{23}, 1 (1991-03), 5-48. -This is worth reading if you are interested in the details, -but it does require a background in computer science. +@table @dfn +@item Accuracy +A floating-point calculation's accuracy is how close it comes +to the real (paper and pencil) value. -@menu -* String Conversion Precision:: The String Value Can Lie. -* Unexpected Results:: Floating Point Numbers Are Not Abstract - Numbers. -* POSIX Floating Point Problems:: Standards Versus Existing Practice. -@end menu +@item Error +The difference between what the result of a computation ``should be'' +and what it actually is.
It is best to minimize error as much +as possible. -@node String Conversion Precision -@subsubsection The String Value Can Lie +@item Exponent +The order of magnitude of a value; +some number of bits in a floating-point value store the exponent. -Internally, @command{awk} keeps both the numeric value -(double precision floating-point) and the string value for a variable. -Separately, @command{awk} keeps -track of what type the variable has -(@pxref{Typing and Comparison}), -which plays a role in how variables are used in comparisons. +@item Inf +A special value representing infinity. Operations involving another +number and infinity produce infinity. -It is important to note that the string value for a number may not -reflect the full value (all the digits) that the numeric value -actually contains. -The following program, @file{values.awk}, illustrates this: +@item NaN +``Not A Number.'' A special value indicating a result that can't +happen in real math, but that can happen in floating-point computations. -@example -@{ - sum = $1 + $2 - # see it for what it is - printf("sum = %.12g\n", sum) - # use CONVFMT - a = "<" sum ">" - print "a =", a - # use OFMT - print "sum =", sum -@} -@end example +@item Normalized +How the significand (see later in this list) is usually stored. The +value is adjusted so that the first bit is one, and then that leading +one is assumed instead of physically stored. This provides one +extra bit of precision. -@noindent -This program shows the full value of the sum of @code{$1} and @code{$2} -using @code{printf}, and then prints the string values obtained -from both automatic conversion (via @code{CONVFMT}) and -from printing (via @code{OFMT}). +@item Precision +The number of bits used to represent a floating-point number. +The more bits, the more digits you can represent. 
+Binary and decimal precisions are related approximately, according to the +formula: -Here is what happens when the program is run: +@display +@iftex +@math{prec = 3.322 @cdot dps} +@end iftex +@ifnottex +@ifnotdocbook +@var{prec} = 3.322 * @var{dps} +@end ifnotdocbook +@end ifnottex +@docbook +<emphasis>prec</emphasis> = 3.322 ⋅ <emphasis>dps</emphasis> @c +@end docbook +@end display -@example -$ @kbd{echo 3.654321 1.2345678 | awk -f values.awk} -@print{} sum = 4.8888888 -@print{} a = <4.88889> -@print{} sum = 4.88889 -@end example +@noindent +Here, @var{prec} denotes the binary precision +(measured in bits) and @var{dps} (short for decimal places) +is the decimal digits. + +@item Rounding mode +How numbers are rounded up or down when necessary. +More details are provided later. + +@item Significand +A floating point value consists of the significand multiplied by 10 +to the power of the exponent. For example, in @code{1.2345e67}, +the significand is @code{1.2345}. + +@item Stability +From @uref{http://en.wikipedia.org/wiki/Numerical_stability, +the Wikipedia article on numerical stability}: +``Calculations that can be proven not to magnify approximation errors +are called @dfn{numerically stable}.'' +@end table -This makes it clear that the full numeric value is different from -what the default string representations show. +See @uref{http://en.wikipedia.org/wiki/Accuracy_and_precision, +the Wikipedia article on accuracy and precision} for more information +on some of those terms. -@code{CONVFMT}'s default value is @code{"%.6g"}, which yields a value with -at most six significant digits. For some applications, you might want to -change it to specify more precision.
-On most modern machines, most of the time, -17 digits is enough to capture a floating-point number's -value exactly.@footnote{Pathological cases can require up to -752 digits (!), but we doubt that you need to worry about this.} +On modern systems, floating-point hardware uses the representation and +operations defined by the IEEE 754 standard. +Three of the standard IEEE 754 types are 32-bit single precision, +64-bit double precision and 128-bit quadruple precision. +The standard also specifies extended precision formats +to allow greater precisions and larger exponent ranges. +(@command{awk} uses only the 64-bit double precision format.) -@node Unexpected Results -@subsubsection Floating Point Numbers Are Not Abstract Numbers - -@cindex floating-point, numbers -Unlike numbers in the abstract sense (such as what you studied in high school -or college arithmetic), numbers stored in computers are limited in certain ways. -They cannot represent an infinite number of digits, nor can they always -represent things exactly. -In particular, -floating-point numbers cannot -always represent values exactly. Here is an example: - -@example -$ @kbd{awk '@{ printf("%010d\n", $1 * 100) @}'} -515.79 -@print{} 0000051579 -515.80 -@print{} 0000051579 -515.81 -@print{} 0000051580 -515.82 -@print{} 0000051582 -@kbd{Ctrl-d} -@end example +@ref{table-ieee-formats} lists the precision and exponent +field values for the basic IEEE 754 binary formats: -@noindent -This shows that some values can be represented exactly, -whereas others are only approximated. This is not a ``bug'' -in @command{awk}, but simply an artifact of how computers -represent numbers. 
+@float Table,table-ieee-formats +@caption{Basic IEEE Format Context Values} +@multitable @columnfractions .20 .20 .20 .20 .20 +@headitem Name @tab Total bits @tab Precision @tab emin @tab emax +@item Single @tab 32 @tab 24 @tab @minus{}126 @tab +127 +@item Double @tab 64 @tab 53 @tab @minus{}1022 @tab +1023 +@item Quadruple @tab 128 @tab 113 @tab @minus{}16382 @tab +16383 +@end multitable +@end float @quotation NOTE -It cannot be emphasized enough that the behavior just -described is fundamental to modern computers. You will -see this kind of thing happen in @emph{any} programming -language using hardware floating-point numbers. It is @emph{not} -a bug in @command{gawk}, nor is it something that can be ``just -fixed.'' +The precision numbers include the implied leading one that gives them +one extra bit of significand. @end quotation -@cindex negative zero -@cindex positive zero -@cindex zero@comma{} negative vs.@: positive -Another peculiarity of floating-point numbers on modern systems -is that they often have more than one representation for the number zero! -In particular, it is possible to represent ``minus zero'' as well as -regular, or ``positive'' zero. - -This example shows that negative and positive zero are distinct values -when stored internally, but that they are in fact equal to each other, -as well as to ``regular'' zero: - -@example -$ @kbd{gawk 'BEGIN @{ mz = -0 ; pz = 0} -> @kbd{printf "-0 = %g, +0 = %g, (-0 == +0) -> %d\n", mz, pz, mz == pz} -> @kbd{printf "mz == 0 -> %d, pz == 0 -> %d\n", mz == 0, pz == 0} -> @kbd{@}'} -@print{} -0 = -0, +0 = 0, (-0 == +0) -> 1 -@print{} mz == 0 -> 1, pz == 0 -> 1 -@end example - -It helps to keep this in mind should you process numeric data -that contains negative zero values; the fact that the zero is negative -is noted and can affect comparisons. 
- -@node POSIX Floating Point Problems -@subsubsection Standards Versus Existing Practice - -Historically, @command{awk} has converted any non-numeric looking string -to the numeric value zero, when required. Furthermore, the original -definition of the language and the original POSIX standards specified that -@command{awk} only understands decimal numbers (base 10), and not octal -(base 8) or hexadecimal numbers (base 16). - -Changes in the language of the -2001 and 2004 POSIX standards can be interpreted to imply that @command{awk} -should support additional features. These features are: - -@itemize @bullet -@item -Interpretation of floating point data values specified in hexadecimal -notation (@samp{0xDEADBEEF}). (Note: data values, @emph{not} -source code constants.) +@node MPFR features +@section Arbitrary Precision Arithmetic Features In @command{gawk} -@item -Support for the special IEEE 754 floating point values ``Not A Number'' -(NaN), positive Infinity (``inf'') and negative Infinity (``@minus{}inf''). -In particular, the format for these values is as specified by the ISO 1999 -C standard, which ignores case and can allow machine-dependent additional -characters after the @samp{nan} and allow either @samp{inf} or @samp{infinity}. -@end itemize - -The first problem is that both of these are clear changes to historical -practice: - -@itemize @bullet -@item -The @command{gawk} maintainer feels that supporting hexadecimal floating -point values, in particular, is ugly, and was never intended by the -original designers to be part of the language. - -@item -Allowing completely alphabetic strings to have valid numeric -values is also a very severe departure from historical practice. -@end itemize - -The second problem is that the @code{gawk} maintainer feels that this -interpretation of the standard, which requires a certain amount of -``language lawyering'' to arrive at in the first place, was not even -intended by the standard developers.
In other words, ``we see how you -got where you are, but we don't think that that's where you want to be.'' - -Recognizing the above issues, but attempting to provide compatibility -with the earlier versions of the standard, -the 2008 POSIX standard added explicit wording to allow, but not require, -that @command{awk} support hexadecimal floating point values and -special values for ``Not A Number'' and infinity. - -Although the @command{gawk} maintainer continues to feel that -providing those features is inadvisable, -nevertheless, on systems that support IEEE floating point, it seems -reasonable to provide @emph{some} way to support NaN and Infinity values. -The solution implemented in @command{gawk} is as follows: - -@itemize @bullet -@item -With the @option{--posix} command-line option, @command{gawk} becomes -``hands off.'' String values are passed directly to the system library's -@code{strtod()} function, and if it successfully returns a numeric value, -that is what's used.@footnote{You asked for it, you got it.} -By definition, the results are not portable across -different systems. They are also a little surprising: +By default, @command{gawk} uses the double precision floating point values +supplied by the hardware of the system it runs on. However, if it was +compiled to do so, @command{gawk} uses the @uref{http://www.mpfr.org, GNU +MPFR} and @uref{http://gmplib.org, GNU MP} (GMP) libraries for arbitrary +precision arithmetic on numbers. You can see if MPFR support is available +like so: @example -$ @kbd{echo nanny | gawk --posix '@{ print $1 + 0 @}'} -@print{} nan -$ @kbd{echo 0xDeadBeef | gawk --posix '@{ print $1 + 0 @}'} -@print{} 3735928559 +$ @kbd{gawk --version} +@print{} GNU Awk 4.1.1, API: 1.1 (GNU MPFR 3.1.0-p3, GNU MP 5.0.2) +@print{} Copyright (C) 1989, 1991-2014 Free Software Foundation.
+@dots{} @end example -@item -Without @option{--posix}, @command{gawk} interprets the four strings -@samp{+inf}, -@samp{-inf}, -@samp{+nan}, -and -@samp{-nan} -specially, producing the corresponding special numeric values. -The leading sign acts a signal to @command{gawk} (and the user) -that the value is really numeric. Hexadecimal floating point is -not supported (unless you also use @option{--non-decimal-data}, -which is @emph{not} recommended). For example: - -@example -$ @kbd{echo nanny | gawk '@{ print $1 + 0 @}'} -@print{} 0 -$ @kbd{echo +nan | gawk '@{ print $1 + 0 @}'} -@print{} nan -$ @kbd{echo 0xDeadBeef | gawk '@{ print $1 + 0 @}'} -@print{} 0 -@end example +@noindent +(You may see different version numbers than what's shown here. That's OK; +what's important is to see that GNU MPFR and GNU MP are listed in +the output.) -@command{gawk} does ignore case in the four special values. -Thus @samp{+nan} and @samp{+NaN} are the same. -@end itemize +Additionally, there are a few elements available in the @code{PROCINFO} +array to provide information about the MPFR and GMP libraries +(@pxref{Auto-set}). -@node Integer Programming -@subsection Mixing Integers And Floating-point - -As has been mentioned already, @command{awk} uses hardware double -precision with 64-bit IEEE binary floating-point representation -for numbers on most systems. A large integer like 9,007,199,254,740,997 -has a binary representation that, although finite, is more than 53 bits long; -it must also be rounded to 53 bits. -The biggest integer that can be stored in a C @code{double} is usually the same -as the largest possible value of a @code{double}. If your system @code{double} -is an IEEE 64-bit @code{double}, this largest possible value is an integer and -can be represented precisely. What more should one know about integers? 
- -If you want to know what is the largest integer, such that it and -all smaller integers can be stored in 64-bit doubles without losing precision, -then the answer is -@iftex -@math{2^{53}}. -@end iftex -@ifnottex -2^53. -@end ifnottex -The next representable number is the even number -@iftex -@math{2^{53} + 2}, -@end iftex -@ifnottex -2^53 + 2, -@end ifnottex -meaning it is unlikely that you will be able to make -@command{gawk} print -@iftex -@math{2^{53} + 1} -@end iftex -@ifnottex -2^53 + 1 -@end ifnottex -in integer format. -The range of integers exactly representable by a 64-bit double -is -@iftex -@math{[-2^{53}, 2^{53}]}. -@end iftex -@ifnottex -[@minus{}2^53, 2^53]. -@end ifnottex -If you ever see an integer outside this range in @command{awk} -using 64-bit doubles, you have reason to be very suspicious about -the accuracy of the output. Here is a simple program with erroneous output: +The MPFR library provides precise control over precisions and rounding +modes, and gives correctly rounded, reproducible, platform-independent +results. With either of the command-line options @option{--bignum} or +@option{-M}, all floating-point arithmetic operators and numeric functions +can yield results to any desired precision level supported by MPFR. -@example -$ @kbd{gawk 'BEGIN @{ i = 2^53 - 1; for (j = 0; j < 4; j++) print i + j @}'} -@print{} 9007199254740991 -@print{} 9007199254740992 -@print{} 9007199254740992 -@print{} 9007199254740994 -@end example +Two built-in variables, @code{PREC} and @code{ROUNDMODE}, +provide control over the working precision and the rounding mode. +The precision and the rounding mode are set globally for every operation +to follow. +@xref{Auto-set}, for more information. -The lesson is to not assume that any large integer printed by @command{awk} -represents an exact result from your computation, especially if it wraps -around on your screen. +@node FP Math Caution +@section Floating Point Arithmetic: Caveat Emptor! 
-@node Floating-point Programming -@section Understanding Floating-point Programming +@quotation +Math class is tough! +@author Late 1980's Barbie +@end quotation -Numerical programming is an extensive area; if you need to develop -sophisticated numerical algorithms then @command{gawk} may not be -the ideal tool, and this documentation may not be sufficient. -It might require digesting a book or two@footnote{One recommended title is -@cite{Numerical Computing with IEEE Floating Point Arithmetic}, Michael L.@: -Overton, Society for Industrial and Applied Mathematics, 2004. -ISBN: 0-89871-482-6, ISBN-13: 978-0-89871-482-1. See -@uref{http://www.cs.nyu.edu/cs/faculty/overton/book}.} -to really internalize how to compute -with ideal accuracy and precision, -and the result often depends on the particular application. +This @value{SECTION} provides a high level overview of the issues +involved when doing lots of floating-point arithmetic.@footnote{There +is a very nice @uref{http://www.validlab.com/goldberg/paper.pdf, +paper on floating-point arithmetic} by David Goldberg, ``What Every +Computer Scientist Should Know About Floating-point Arithmetic,'' +@cite{ACM Computing Surveys} @strong{23}, 1 (1991-03), 5-48. This is +worth reading if you are interested in the details, but it does require +a background in computer science.} +The discussion applies to both hardware and arbitrary-precision +floating-point arithmetic. -@quotation NOTE -A floating-point calculation's @dfn{accuracy} is how close it comes -to the real value. This is as opposed to the @dfn{precision}, which -usually refers to the number of bits used to represent the number -(see @uref{http://en.wikipedia.org/wiki/Accuracy_and_precision, -the Wikipedia article} for more information). +@quotation CAUTION +The material here is purposely general. If you need to do serious +computer arithmetic, you should do some research first, and not +rely just on what we tell you. 
@end quotation -There are two options for doing floating-point calculations: -hardware floating-point (as used by standard @command{awk} and -the default for @command{gawk}), and @dfn{arbitrary-precision} -floating-point, which is software based. -From this point forward, this @value{CHAPTER} -aims to provide enough information to understand both, and then -will focus on @command{gawk}'s facilities for the latter.@footnote{If you -are interested in other tools that perform arbitrary precision arithmetic, -you may want to investigate the POSIX @command{bc} tool. See -@uref{http://pubs.opengroup.org/onlinepubs/009695399/utilities/bc.html, -the POSIX specification for it}, for more information.} +@menu +* Inexactness of computations:: Floating point math is not exact. +* Getting Accuracy:: Getting more accuracy takes some work. +* Try To Round:: Add digits and round. +* Setting precision:: How to set the precision. +* Setting the rounding mode:: How to set the rounding mode. +@end menu + +@node Inexactness of computations +@subsection Floating Point Arithmetic Is Not Exact Binary floating-point representations and arithmetic are inexact. Simple values like 0.1 cannot be precisely represented using @@ -27642,7 +29151,16 @@ floating-point, you can set the precision before starting a computation, but then you cannot be sure of the number of significant decimal places in the final result. -Sometimes, before you start to write any code, you should think more +@menu +* Inexact representation:: Numbers are not exactly represented. +* Comparing FP Values:: How to compare floating point values. +* Errors accumulate:: Errors get bigger as they go. +@end menu + +@node Inexact representation +@subsubsection Many Numbers Cannot Be Represented Exactly + +So, before you start to write any code, you should think about what you really want and what's really happening. 
Consider the two numbers in the following example: @@ -27672,21 +29190,42 @@ you can always specify how much precision you would like in your output. Usually this is a format string like @code{"%.15g"}, which when used in the previous example, produces an output identical to the input. +@node Comparing FP Values +@subsubsection Be Careful Comparing Values + Because the underlying representation can be a little bit off from the exact value, -comparing floating-point values to see if they are equal is generally not a good idea. -Here is an example where it does not work like you expect: +comparing floating-point values to see if they are exactly equal is generally a bad idea. +Here is an example where it does not work like you would expect: @example $ @kbd{gawk 'BEGIN @{ print (0.1 + 12.2 == 12.3) @}'} @print{} 0 @end example -The loss of accuracy during a single computation with floating-point numbers -usually isn't enough to worry about. However, if you compute a value -which is the result of a sequence of floating point operations, +The general wisdom when comparing floating-point values is to see if +they are within some small range of each other (called a @dfn{delta}, +or @dfn{tolerance}). +You have to decide how small a delta is important to you. Code to do +this looks something like this: + +@example +delta = 0.00001 # for example +difference = abs(a - b) # absolute value of the difference +if (difference < delta) + # all ok +else + # not ok +@end example + +@node Errors accumulate +@subsubsection Errors Accumulate + +The loss of accuracy during a single computation with floating-point +numbers usually isn't enough to worry about. However, if you compute a +value which is the result of a sequence of floating point operations, the error can accumulate and greatly affect the computation itself.
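As a small illustration of accumulated error, and of comparing within a tolerance rather than for exact equality, the following sketch adds 0.1 ten times and compares the sum with 1.0. It is not part of the patch. It assumes hardware IEEE 754 double-precision arithmetic. The shell function @code{near_equal_demo}, the helper @code{abs()} (@command{awk} has no built-in absolute value function), and the value chosen for @code{delta} are all illustrative.

```shell
# Illustrative sketch: near_equal_demo, abs() and delta are hypothetical
# names and values chosen for this example, not part of gawk.
near_equal_demo() {
    awk '
    # awk has no built-in absolute value function, so define one.
    function abs(x) { return x < 0 ? -x : x }

    BEGIN {
        sum = 0
        for (i = 1; i <= 10; i++)       # ten additions, ten rounding steps
            sum += 0.1

        delta = 0.00001                 # tolerance, picked for illustration

        print (sum == 1.0)              # exact comparison
        print (abs(sum - 1.0) < delta)  # comparison within the tolerance
    }'
}

near_equal_demo
```

With IEEE 754 doubles the exact comparison prints 0 and the tolerance comparison prints 1. The test takes the absolute value of the @emph{difference} of the two numbers. How small @code{delta} should be depends on the application.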
-Here is an attempt to compute the value of the constant -@value{PI} using one of its many series representations: +Here is an attempt to compute the value of @value{PI} using one of its +many series representations: @example BEGIN @{ @@ -27700,8 +29239,8 @@ BEGIN @{ @} @end example -When run, the early errors propagating through later computations -cause the loop to terminate prematurely after an attempt to divide by zero. +When run, the early errors propagate through later computations, +causing the loop to terminate prematurely after attempting to divide by zero: @example $ @kbd{gawk -f pi.awk} @@ -27728,23 +29267,88 @@ $ @kbd{gawk 'BEGIN @{} @print{} 4 @end example -Can computation using arbitrary precision help with the previous examples? -If you are impatient to know, see -@ref{Exact Arithmetic}. +@node Getting Accuracy +@subsection Getting The Accuracy You Need + +Can arbitrary precision arithmetic give exact results? There are +no easy answers. The standard rules of algebra often do not apply +when using floating-point arithmetic. +Among other things, the distributive and associative laws +do not hold completely, and order of operation may be important +for your computation. Rounding error, cumulative precision loss +and underflow are often troublesome. + +When @command{gawk} tests the expressions @samp{0.1 + 12.2} and +@samp{12.3} for equality using the machine double precision arithmetic, +it decides that they are not equal! (@xref{Comparing FP Values}.) +You can get the result you want by increasing the precision; 56 bits in +this case does the job: + +@example +$ @kbd{gawk -M -v PREC=56 'BEGIN @{ print (0.1 + 12.2 == 12.3) @}'} +@print{} 1 +@end example + +If adding more bits is good, perhaps adding even more bits of +precision is better? 
+Here is what happens if we use an even larger value of @code{PREC}: + +@example +$ @kbd{gawk -M -v PREC=201 'BEGIN @{ print (0.1 + 12.2 == 12.3) @}'} +@print{} 0 +@end example + +This is not a bug in @command{gawk} or in the MPFR library. +It is easy to forget that the finite number of bits used to store the value +is often just an approximation after proper rounding. +The test for equality succeeds if and only if @emph{all} bits in the two operands +are exactly the same. Since this is not necessarily true after floating-point +computations with a particular precision and effective rounding rule, +a straight test for equality may not work. Instead, compare the +two numbers to see if they are within the desirable delta of each other. + +In applications where 15 or fewer decimal places suffice, +hardware double precision arithmetic can be adequate, and is usually much faster. +But you need to keep in mind that every floating-point operation +can suffer a new rounding error with catastrophic consequences as illustrated +by our earlier attempt to compute the value of @value{PI}. +Extra precision can greatly enhance the stability and the accuracy +of your computation in such cases. + +Repeated addition is not necessarily equivalent to multiplication +in floating-point arithmetic. In the example in +@ref{Errors accumulate}: + +@example +$ @kbd{gawk 'BEGIN @{} +> @kbd{for (d = 1.1; d <= 1.5; d += 0.1) # loop five times (?)} +> @kbd{i++} +> @kbd{print i} +> @kbd{@}'} +@print{} 4 +@end example + +@noindent +you may or may not succeed in getting the correct result by choosing +an arbitrarily large value for @code{PREC}. Reformulation of +the problem at hand is often the correct approach in such situations. + +@node Try To Round +@subsection Try A Few Extra Bits of Precision and Rounding Instead of arbitrary precision floating-point arithmetic, often all you need is an adjustment of your logic or a different order for the operations in your calculation. 
-The stability and the accuracy of the computation of the constant @value{PI} +The stability and the accuracy of the computation of @value{PI} in the earlier example can be enhanced by using the following simple algebraic transformation: @example -(sqrt(x * x + 1) - 1) / x = x / (sqrt(x * x + 1) + 1) +(sqrt(x * x + 1) - 1) / x @equiv{} x / (sqrt(x * x + 1) + 1) @end example @noindent -After making this, change the program does converge to +After making this, change the program converges to @value{PI} in under 30 iterations: @example @@ -27759,340 +29363,22 @@ $ @kbd{gawk -f pi2.awk} @print{} 3.141592653589797 @end example -There is no need to be unduly suspicious about the results from -floating-point arithmetic. The lesson to remember is that -floating-point arithmetic is always more complex than arithmetic using -pencil and paper. In order to take advantage of the power -of computer floating-point, you need to know its limitations -and work within them. For most casual use of floating-point arithmetic, -you will often get the expected result in the end if you simply round -the display of your final results to the correct number of significant -decimal digits. - -As general advice, avoid presenting numerical data in a manner that -implies better precision than is actually the case. - -@menu -* Floating-point Representation:: Binary floating-point representation. -* Floating-point Context:: Floating-point context. -* Rounding Mode:: Floating-point rounding mode. -@end menu - -@node Floating-point Representation -@subsection Binary Floating-point Representation -@cindex IEEE-754 format - -Although floating-point representations vary from machine to machine, -the most commonly encountered representation is that defined by the -IEEE 754 Standard. An IEEE-754 format value has three components: - -@itemize @bullet -@item -A sign bit telling whether the number is positive or negative. - -@item -An @dfn{exponent}, @var{e}, giving its order of magnitude. 
- -@item -A @dfn{significand}, @var{s}, -specifying the actual digits of the number. -@end itemize - -The value of the -number is then -@iftex -@math{s @cdot 2^e}. -@end iftex -@ifnottex -@var{s * 2^e}. -@end ifnottex -The first bit of a non-zero binary significand -is always one, so the significand in an IEEE-754 format only includes the -fractional part, leaving the leading one implicit. -The significand is stored in @dfn{normalized} format, -which means that the first bit is always a one. - -Three of the standard IEEE-754 types are 32-bit single precision, -64-bit double precision and 128-bit quadruple precision. -The standard also specifies extended precision formats -to allow greater precisions and larger exponent ranges. - -@node Floating-point Context -@subsection Floating-point Context -@cindex context, floating-point - -A floating-point @dfn{context} defines the environment for arithmetic operations. -It governs precision, sets rules for rounding, and limits the range for exponents. -The context has the following primary components: - -@table @dfn -@item Precision -Precision of the floating-point format in bits. - -@item emax -Maximum exponent allowed for the format. - -@item emin -Minimum exponent allowed for the format. - -@item Underflow behavior -The format may or may not support gradual underflow. - -@item Rounding -The rounding mode of the context. 
-@end table - -@ref{table-ieee-formats} lists the precision and exponent -field values for the basic IEEE-754 binary formats: - -@float Table,table-ieee-formats -@caption{Basic IEEE Format Context Values} -@multitable @columnfractions .20 .20 .20 .20 .20 -@headitem Name @tab Total bits @tab Precision @tab emin @tab emax -@item Single @tab 32 @tab 24 @tab @minus{}126 @tab +127 -@item Double @tab 64 @tab 53 @tab @minus{}1022 @tab +1023 -@item Quadruple @tab 128 @tab 113 @tab @minus{}16382 @tab +16383 -@end multitable -@end float - -@quotation NOTE -The precision numbers include the implied leading one that gives them -one extra bit of significand. -@end quotation - -A floating-point context can also determine which signals are treated -as exceptions, and can set rules for arithmetic with special values. -Please consult the IEEE-754 standard or other resources for details. - -@command{gawk} ordinarily uses the hardware double precision -representation for numbers. On most systems, this is IEEE-754 -floating-point format, corresponding to 64-bit binary with 53 bits -of precision. - -@quotation NOTE -In case an underflow occurs, the standard allows, but does not require, -the result from an arithmetic operation to be a number smaller than -the smallest nonzero normalized number. Such numbers do -not have as many significant digits as normal numbers, and are called -@dfn{denormals} or @dfn{subnormals}. The alternative, simply returning a zero, -is called @dfn{flush to zero}. The basic IEEE-754 binary formats -support subnormal numbers. -@end quotation - -@node Rounding Mode -@subsection Floating-point Rounding Mode -@cindex rounding mode, floating-point - -The @dfn{rounding mode} specifies the behavior for the results of numerical -operations when discarding extra precision. Each rounding mode indicates -how the least significant returned digit of a rounded result is to -be calculated. 
-@ref{table-rounding-modes} lists the IEEE-754 defined -rounding modes: - -@float Table,table-rounding-modes -@caption{IEEE 754 Rounding Modes} -@multitable @columnfractions .45 .55 -@headitem Rounding Mode @tab IEEE Name -@item Round to nearest, ties to even @tab @code{roundTiesToEven} -@item Round toward plus Infinity @tab @code{roundTowardPositive} -@item Round toward negative Infinity @tab @code{roundTowardNegative} -@item Round toward zero @tab @code{roundTowardZero} -@item Round to nearest, ties away from zero @tab @code{roundTiesToAway} -@end multitable -@end float - -The default mode @code{roundTiesToEven} is the most preferred, -but the least intuitive. This method does the obvious thing for most values, -by rounding them up or down to the nearest digit. -For example, rounding 1.132 to two digits yields 1.13, -and rounding 1.157 yields 1.16. - -However, when it comes to rounding a value that is exactly halfway between, -things do not work the way you probably learned in school. -In this case, the number is rounded to the nearest even digit. -So rounding 0.125 to two digits rounds down to 0.12, -but rounding 0.6875 to three digits rounds up to 0.688. -You probably have already encountered this rounding mode when -using @code{printf} to format floating-point numbers. 
-For example: - -@example -BEGIN @{ - x = -4.5 - for (i = 1; i < 10; i++) @{ - x += 1.0 - printf("%4.1f => %2.0f\n", x, x) - @} -@} -@end example - -@noindent -produces the following output when run on the author's system:@footnote{It -is possible for the output to be completely different if the -C library in your system does not use the IEEE-754 even-rounding -rule to round halfway cases for @code{printf}.} - -@example --3.5 => -4 --2.5 => -2 --1.5 => -2 --0.5 => 0 - 0.5 => 0 - 1.5 => 2 - 2.5 => 2 - 3.5 => 4 - 4.5 => 4 -@end example - -The theory behind the rounding mode @code{roundTiesToEven} is that -it more or less evenly distributes upward and downward rounds -of exact halves, which might cause any round-off error -to cancel itself out. This is the default rounding mode used -in IEEE-754 computing functions and operators. - -The other rounding modes are rarely used. -Round toward positive infinity (@code{roundTowardPositive}) -and round toward negative infinity (@code{roundTowardNegative}) -are often used to implement interval arithmetic, -where you adjust the rounding mode to calculate upper and lower bounds -for the range of output. The @code{roundTowardZero} -mode can be used for converting floating-point numbers to integers. -The rounding mode @code{roundTiesToAway} rounds the result to the -nearest number and selects the number with the larger magnitude -if a tie occurs. - -Some numerical analysts will tell you that your choice of rounding style -has tremendous impact on the final outcome, and advise you to wait until -final output for any rounding. Instead, you can often avoid round-off error problems by -setting the precision initially to some value sufficiently larger than -the final desired precision, so that the accumulation of round-off error -does not influence the outcome. 
-If you suspect that results from your computation are -sensitive to accumulation of round-off error, -one way to be sure is to look for a significant difference in output -when you change the rounding mode. - -@node Gawk and MPFR -@section @command{gawk} + MPFR = Powerful Arithmetic - -The rest of this @value{CHAPTER} describes how to use the arbitrary precision -(also known as @dfn{multiple precision} or @dfn{infinite precision}) numeric -capabilities in @command{gawk} to produce maximally accurate results -when you need it. - -But first you should check if your version of -@command{gawk} supports arbitrary precision arithmetic. -The easiest way to find out is to look at the output of -the following command: - -@example -$ @kbd{gawk --version} -@print{} GNU Awk 4.1.0, API: 1.0 (GNU MPFR 3.1.0-p3, GNU MP 5.0.2) -@print{} Copyright (C) 1989, 1991-2013 Free Software Foundation. -@dots{} -@end example - -@command{gawk} uses the -@uref{http://www.mpfr.org, GNU MPFR} -and -@uref{http://gmplib.org, GNU MP} (GMP) -libraries for arbitrary precision -arithmetic on numbers. So if you do not see the names of these libraries -in the output, then your version of @command{gawk} does not support -arbitrary precision arithmetic. - -Additionally, -there are a few elements available in the @code{PROCINFO} array -to provide information about the MPFR and GMP libraries. -@xref{Auto-set}, for more information. - -@ignore -Even if you aren't interested in arbitrary precision arithmetic, you -may still benefit from knowing about how @command{gawk} handles numbers -in general, and the limitations of doing arithmetic with ordinary -@command{gawk} numbers. -@end ignore - - -@node Arbitrary Precision Floats -@section Arbitrary Precision Floating-point Arithmetic with @command{gawk} - -@command{gawk} uses the GNU MPFR library -for arbitrary precision floating-point arithmetic. 
The MPFR library -provides precise control over precisions and rounding modes, and gives -correctly rounded, reproducible, platform-independent results. With one -of the command-line options @option{--bignum} or @option{-M}, -all floating-point arithmetic operators and numeric functions can yield -results to any desired precision level supported by MPFR. -Two built-in variables, @code{PREC} and @code{ROUNDMODE}, -provide control over the working precision and the rounding mode -(@pxref{Setting Precision}, and -@pxref{Setting Rounding Mode}). -The precision and the rounding mode are set globally for every operation -to follow. - -The default working precision for arbitrary precision floating-point values is -53 bits, and the default value for @code{ROUNDMODE} is @code{"N"}, -which selects the IEEE-754 @code{roundTiesToEven} rounding mode -(@pxref{Rounding Mode}).@footnote{The -default precision is 53 bits, since according to the MPFR documentation, -the library should be able to exactly reproduce all computations with -double-precision machine floating-point numbers (@code{double} type -in C), except the default exponent range is much wider and subnormal -numbers are not implemented.} -@command{gawk} uses the default exponent range in MPFR -@iftex -(@math{emax = 2^{30} - 1, emin = -emax}) -@end iftex -@ifnottex -(@var{emax} = 2^30 @minus{} 1, @var{emin} = @minus{}@var{emax}) -@end ifnottex -for all floating-point contexts. -There is no explicit mechanism to adjust the exponent range. -MPFR does not implement subnormal numbers by default, -and this behavior cannot be changed in @command{gawk}. - -@quotation NOTE -When emulating an IEEE-754 format (@pxref{Setting Precision}), -@command{gawk} internally adjusts the exponent range -to the value defined for the format and also performs computations needed for -gradual underflow (subnormal numbers). 
-@end quotation - -@quotation NOTE -MPFR numbers are variable-size entities, consuming only as much space as -needed to store the significant digits. Since the performance using MPFR -numbers pales in comparison to doing arithmetic using the underlying machine -types, you should consider using only as much precision as needed by -your program. -@end quotation - -@menu -* Setting Precision:: Setting the working precision. -* Setting Rounding Mode:: Setting the rounding mode. -* Floating-point Constants:: Representing floating-point constants. -* Changing Precision:: Changing the precision of a number. -* Exact Arithmetic:: Exact arithmetic with floating-point numbers. -@end menu - -@node Setting Precision -@subsection Setting the Working Precision -@cindex @code{PREC} variable +@node Setting precision +@subsection Setting The Precision @command{gawk} uses a global working precision; it does not keep track of the precision or accuracy of individual numbers. Performing an arithmetic operation or calling a built-in function rounds the result to the current -working precision. The default working precision is 53 bits, which can be -modified using the built-in variable @code{PREC}. You can also set the -value to one of the pre-defined case-insensitive strings +working precision. The default working precision is 53 bits, which you can +modify using the built-in variable @code{PREC}. You can also set the +value to one of the predefined case-insensitive strings shown in @ref{table-predefined-precision-strings}, -to emulate an IEEE-754 binary format. +to emulate an IEEE 754 binary format. @float Table,table-predefined-precision-strings -@caption{Predefined precision strings for @code{PREC}} +@caption{Predefined Precision Strings For @code{PREC}} @multitable {@code{"double"}} {12345678901234567890123456789012345} -@headitem @code{PREC} @tab IEEE-754 Binary Format +@headitem @code{PREC} @tab IEEE 754 Binary Format @item @code{"half"} @tab 16-bit half-precision. 
@item @code{"single"} @tab Basic 32-bit single precision. @item @code{"double"} @tab Basic 64-bit double precision. @@ -28111,49 +29397,34 @@ $ @kbd{gawk -M -v PREC=100 'BEGIN @{ x = 1.0e-400; print x + 0} @print{} 0 @end example -Binary and decimal precisions are related approximately, according to the -formula: +@quotation CAUTION +Be wary of floating-point constants! When reading a floating-point +constant from program source code, @command{gawk} uses the default +precision (that of a C @code{double}), unless overridden by an assignment +to the special variable @code{PREC} on the command line, to store it +internally as a MPFR number. Changing the precision using @code{PREC} +in the program text does @emph{not} change the precision of a constant. + +If you need to represent a floating-point constant at a higher precision +than the default and cannot use a command line assignment to @code{PREC}, +you should either specify the constant as a string, or as a rational +number, whenever possible. The following example illustrates the +differences among various ways to print a floating-point constant: +@end quotation -@iftex -@math{prec = 3.322 @cdot dps} -@end iftex -@ifnottex -@var{prec} = 3.322 * @var{dps} -@end ifnottex +@example +$ @kbd{gawk -M 'BEGIN @{ PREC = 113; printf("%0.25f\n", 0.1) @}'} +@print{} 0.1000000000000000055511151 +$ @kbd{gawk -M -v PREC=113 'BEGIN @{ printf("%0.25f\n", 0.1) @}'} +@print{} 0.1000000000000000000000000 +$ @kbd{gawk -M 'BEGIN @{ PREC = 113; printf("%0.25f\n", "0.1") @}'} +@print{} 0.1000000000000000000000000 +$ @kbd{gawk -M 'BEGIN @{ PREC = 113; printf("%0.25f\n", 1/10) @}'} +@print{} 0.1000000000000000000000000 +@end example -@noindent -Here, @var{prec} denotes the binary precision -(measured in bits) and @var{dps} (short for decimal places) -is the decimal digits. We can easily calculate how many decimal -digits the 53-bit significand of an IEEE double is equivalent to: -53 / 3.322 which is equal to about 15.95. 
-But what does 15.95 digits actually mean? It depends whether you are -concerned about how many digits you can rely on, or how many digits -you need. - -It is important to know how many bits it takes to uniquely identify -a double-precision value (the C type @code{double}). If you want to -convert from @code{double} to decimal and back to @code{double} (e.g., -saving a @code{double} representing an intermediate result to a file, and -later reading it back to restart the computation), then a few more decimal -digits are required. 17 digits is generally enough for a @code{double}. - -It can also be important to know what decimal numbers can be uniquely -represented with a @code{double}. If you want to convert -from decimal to @code{double} and back again, 15 digits is the most that -you can get. Stated differently, you should not present -the numbers from your floating-point computations with more than 15 -significant digits in them. - -Conversely, it takes a precision of 332 bits to hold an approximation -of the constant @value{PI} that is accurate to 100 decimal places. - -You should always add some extra bits in order to avoid the confusing round-off -issues that occur because numbers are stored internally in binary. - -@node Setting Rounding Mode -@subsection Setting the Rounding Mode -@cindex @code{ROUNDMODE} variable +@node Setting the rounding mode +@subsection Setting The Rounding Mode The @code{ROUNDMODE} variable provides program level control over the rounding mode. @@ -28172,190 +29443,104 @@ rounding modes is shown in @ref{table-gawk-rounding-modes}. @end multitable @end float -@code{ROUNDMODE} has the default value @code{"N"}, -which selects the IEEE-754 rounding mode @code{roundTiesToEven}. -In @ref{table-gawk-rounding-modes}, @code{"A"} is listed to select the IEEE-754 mode -@code{roundTiesToAway}. This is only available -if your version of the MPFR library supports it; otherwise setting -@code{ROUNDMODE} to this value has no effect. 
@xref{Rounding Mode}, -for the meanings of the various rounding modes. - -Here is an example of how to change the default rounding behavior of -@code{printf}'s output: - -@example -$ @kbd{gawk -M -v ROUNDMODE="Z" 'BEGIN @{ printf("%.2f\n", 1.378) @}'} -@print{} 1.37 -@end example - -@node Floating-point Constants -@subsection Representing Floating-point Constants -@cindex constants, floating-point - -Be wary of floating-point constants! When reading a floating-point constant -from program source code, @command{gawk} uses the default precision, -unless overridden -by an assignment to the special variable @code{PREC} on the command -line, to store it internally as a MPFR number. -Changing the precision using @code{PREC} in the program text does -@emph{not} change the precision of a constant. If you need to -represent a floating-point constant at a higher precision than the -default and cannot use a command line assignment to @code{PREC}, -you should either specify the constant as a string, or -as a rational number, whenever possible. The following example -illustrates the differences among various ways to -print a floating-point constant: - -@example -$ @kbd{gawk -M 'BEGIN @{ PREC = 113; printf("%0.25f\n", 0.1) @}'} -@print{} 0.1000000000000000055511151 -$ @kbd{gawk -M -v PREC=113 'BEGIN @{ printf("%0.25f\n", 0.1) @}'} -@print{} 0.1000000000000000000000000 -$ @kbd{gawk -M 'BEGIN @{ PREC = 113; printf("%0.25f\n", "0.1") @}'} -@print{} 0.1000000000000000000000000 -$ @kbd{gawk -M 'BEGIN @{ PREC = 113; printf("%0.25f\n", 1/10) @}'} -@print{} 0.1000000000000000000000000 -@end example - -In the first case, the number is stored with the default precision of 53 bits. +@code{ROUNDMODE} has the default value @code{"N"}, which +selects the IEEE 754 rounding mode @code{roundTiesToEven}. +In @ref{table-gawk-rounding-modes}, the value @code{"A"} selects +@code{roundTiesToAway}. 
This is only available if your version of the +MPFR library supports it; otherwise setting @code{ROUNDMODE} to @code{"A"} +has no effect. -@node Changing Precision -@subsection Changing the Precision of a Number - -@cindex Laurie, Dirk -@quotation -@i{The point is that in any variable-precision package, -a decision is made on how to treat numbers given as data, -or arising in intermediate results, which are represented in -floating-point format to a precision lower than working precision. -Do we promote them to full membership of the high-precision club, -or do we treat them and all their associates as second-class citizens? -Sometimes the first course is proper, sometimes the second, and it takes -careful analysis to tell which.}@footnote{Dirk Laurie. -@cite{Variable-precision Arithmetic Considered Perilous --- A Detective Story}. -Electronic Transactions on Numerical Analysis. Volume 28, pp. 168-173, 2008.} -@author Dirk Laurie -@end quotation +The default mode @code{roundTiesToEven} is the most preferred, +but the least intuitive. This method does the obvious thing for most values, +by rounding them up or down to the nearest digit. +For example, rounding 1.132 to two digits yields 1.13, +and rounding 1.157 yields 1.16. -@command{gawk} does not implicitly modify the precision of any previously -computed results when the working precision is changed with an assignment -to @code{PREC}. The precision of a number is always the one that was -used at the time of its creation, and there is no way for the user -to explicitly change it afterwards. However, since the result of a -floating-point arithmetic operation is always an arbitrary precision -floating-point value---with a precision set by the value of @code{PREC}---one of the -following workarounds effectively accomplishes the desired behavior: +However, when it comes to rounding a value that is exactly halfway between, +things do not work the way you probably learned in school. 
+In this case, the number is rounded to the nearest even digit. +So rounding 0.125 to two digits rounds down to 0.12, +but rounding 0.6875 to three digits rounds up to 0.688. +You probably have already encountered this rounding mode when +using @code{printf} to format floating-point numbers. +For example: @example -x = x + 0.0 +BEGIN @{ + x = -4.5 + for (i = 1; i < 10; i++) @{ + x += 1.0 + printf("%4.1f => %2.0f\n", x, x) + @} +@} @end example @noindent -or: - -@example -x += 0.0 -@end example - -@node Exact Arithmetic -@subsection Exact Arithmetic with Floating-point Numbers - -@quotation CAUTION -Never depend on the exactness of floating-point arithmetic, -even for apparently simple expressions! -@end quotation - -Can arbitrary precision arithmetic give exact results? There are -no easy answers. The standard rules of algebra often do not apply -when using floating-point arithmetic. -Among other things, the distributive and associative laws -do not hold completely, and order of operation may be important -for your computation. Rounding error, cumulative precision loss -and underflow are often troublesome. - -When @command{gawk} tests the expressions @samp{0.1 + 12.2} and @samp{12.3} -for equality -using the machine double precision arithmetic, it decides that they -are not equal! -(@xref{Floating-point Programming}.) -You can get the result you want by increasing the precision; -56 bits in this case will get the job done: - -@example -$ @kbd{gawk -M -v PREC=56 'BEGIN @{ print (0.1 + 12.2 == 12.3) @}'} -@print{} 1 -@end example - -If adding more bits is good, perhaps adding even more bits of -precision is better? -Here is what happens if we use an even larger value of @code{PREC}: - -@example -$ @kbd{gawk -M -v PREC=201 'BEGIN @{ print (0.1 + 12.2 == 12.3) @}'} -@print{} 0 -@end example - -This is not a bug in @command{gawk} or in the MPFR library. 
-It is easy to forget that the finite number of bits used to store the value -is often just an approximation after proper rounding. -The test for equality succeeds if and only if @emph{all} bits in the two operands -are exactly the same. Since this is not necessarily true after floating-point -computations with a particular precision and effective rounding rule, -a straight test for equality may not work. - -So, don't assume that floating-point values can be compared for equality. -You should also exercise caution when using other forms of comparisons. -The standard way to compare between floating-point numbers is to determine -how much error (or @dfn{tolerance}) you will allow in a comparison and -check to see if one value is within this error range of the other. - -In applications where 15 or fewer decimal places suffice, -hardware double precision arithmetic can be adequate, and is usually much faster. -But you do need to keep in mind that every floating-point operation -can suffer a new rounding error with catastrophic consequences as illustrated -by our earlier attempt to compute the value of the constant @value{PI} -(@pxref{Floating-point Programming}). -Extra precision can greatly enhance the stability and the accuracy -of your computation in such cases. - -Repeated addition is not necessarily equivalent to multiplication -in floating-point arithmetic. 
In the example in -@ref{Floating-point Programming}: +produces the following output when run on the author's system:@footnote{It +is possible for the output to be completely different if the +C library in your system does not use the IEEE 754 even-rounding +rule to round halfway cases for @code{printf}.} @example -$ @kbd{gawk 'BEGIN @{} -> @kbd{for (d = 1.1; d <= 1.5; d += 0.1) # loop five times (?)} -> @kbd{i++} -> @kbd{print i} -> @kbd{@}'} -@print{} 4 +-3.5 => -4 +-2.5 => -2 +-1.5 => -2 +-0.5 => 0 + 0.5 => 0 + 1.5 => 2 + 2.5 => 2 + 3.5 => 4 + 4.5 => 4 @end example -@noindent -you may or may not succeed in getting the correct result by choosing -an arbitrarily large value for @code{PREC}. Reformulation of -the problem at hand is often the correct approach in such situations. +The theory behind @code{roundTiesToEven} is that it more or less evenly +distributes upward and downward rounds of exact halves, which might +cause any accumulating round-off error to cancel itself out. This is the +default rounding mode for IEEE 754 computing functions and operators. + +The other rounding modes are rarely used. Round toward positive infinity +(@code{roundTowardPositive}) and round toward negative infinity +(@code{roundTowardNegative}) are often used to implement interval +arithmetic, where you adjust the rounding mode to calculate upper and +lower bounds for the range of output. The @code{roundTowardZero} mode can +be used for converting floating-point numbers to integers. The rounding +mode @code{roundTiesToAway} rounds the result to the nearest number and +selects the number with the larger magnitude if a tie occurs. + +Some numerical analysts will tell you that your choice of rounding +style has tremendous impact on the final outcome, and advise you to +wait until final output for any rounding. 
Instead, you can often avoid +round-off error problems by setting the precision initially to some +value sufficiently larger than the final desired precision, so that +the accumulation of round-off error does not influence the outcome. +If you suspect that results from your computation are sensitive to +accumulation of round-off error, look for a significant difference in +output when you change the rounding mode to be sure. @node Arbitrary Precision Integers @section Arbitrary Precision Integer Arithmetic with @command{gawk} -@cindex integer, arbitrary precision - -If one of the options @option{--bignum} or @option{-M} is specified, -@command{gawk} performs all -integer arithmetic using GMP arbitrary precision integers. -Any number that looks like an integer in a program source or data file -is stored as an arbitrary precision integer. -The size of the integer is limited only by your computer's memory. -The current floating-point context has no effect on operations involving integers. -For example, the following computes +@cindex integers, arbitrary precision +@cindex arbitrary precision integers + +When given one of the options @option{--bignum} or @option{-M}, +@command{gawk} performs all integer arithmetic using GMP arbitrary +precision integers. Any number that looks like an integer in a source +or @value{DF} is stored as an arbitrary precision integer. The size +of the integer is limited only by the available memory. For example, +the following computes @iftex @math{5^{4^{3^{2}}}}, @end iftex @ifnottex +@ifnotdocbook 5^4^3^2, +@end ifnotdocbook @end ifnottex +@docbook +5<superscript>4<superscript>3<superscript>2</superscript></superscript></superscript>, @c +@end docbook the result of which is beyond the -limits of ordinary @command{gawk} numbers: +limits of ordinary hardware double-precision floating point values: @example $ @kbd{gawk -M 'BEGIN @{} @@ -28367,25 +29552,32 @@ $ @kbd{gawk -M 'BEGIN @{} @print{} 62060698786608744707 ... 
92256259918212890625 @end example -If you were to compute the same value using arbitrary precision -floating-point values instead, the precision needed for correct output -(using the formula +If instead you were to compute the same value using arbitrary precision +floating-point values, the precision needed for correct output (using +the formula @iftex @math{prec = 3.322 @cdot dps}), would be @math{3.322 @cdot 183231}, @end iftex @ifnottex +@ifnotdocbook @samp{prec = 3.322 * dps}), would be 3.322 x 183231, +@end ifnotdocbook @end ifnottex +@docbook +<emphasis>prec</emphasis> = 3.322 ⋅ <emphasis>dps</emphasis>), +would be +<emphasis>prec</emphasis> = 3.322 ⋅ 183231, @c +@end docbook or 608693. The result from an arithmetic operation with an integer and a floating-point value is a floating-point value with a precision equal to the working precision. The following program calculates the eighth term in Sylvester's sequence@footnote{Weisstein, Eric W. -@cite{Sylvester's Sequence}. From MathWorld---A Wolfram Web Resource. -@url{http://mathworld.wolfram.com/SylvestersSequence.html}} +@cite{Sylvester's Sequence}. From MathWorld---A Wolfram Web Resource +@w{(@url{http://mathworld.wolfram.com/SylvestersSequence.html}).}} using a recurrence: @example @@ -28405,15 +29597,15 @@ floating-point results exactly. You can either increase the precision @samp{2.0} with an integer, to perform all computations using integer arithmetic to get the correct output. -It will sometimes be necessary for @command{gawk} to implicitly convert an -arbitrary precision integer into an arbitrary precision floating-point value. -This is primarily because the MPFR library does not always provide the -relevant interface to process arbitrary precision integers or mixed-mode -numbers as needed by an operation or function. -In such a case, the precision is set to the minimum value necessary -for exact conversion, and the working precision is not used for this purpose. 
-If this is not what you need or want, you can employ a subterfuge -like this: +Sometimes @command{gawk} must implicitly convert an arbitrary precision +integer into an arbitrary precision floating-point value. This is +primarily because the MPFR library does not always provide the relevant +interface to process arbitrary precision integers or mixed-mode numbers +as needed by an operation or function. In such a case, the precision is +set to the minimum value necessary for exact conversion, and the working +precision is not used for this purpose. If this is not what you need or +want, you can employ a subterfuge, and convert the integer to floating +point first, like this: @example gawk -M 'BEGIN @{ n = 13; print (n + 0.0) % 2.0 @}' @@ -28426,15 +29618,186 @@ to begin with: gawk -M 'BEGIN @{ n = 13.0; print n % 2.0 @}' @end example -Note that for the particular example above, there is likely best +Note that for the particular example above, it is likely best to just use the following: @example gawk -M 'BEGIN @{ n = 13; print n % 2 @}' @end example +@node POSIX Floating Point Problems +@section Standards Versus Existing Practice + +Historically, @command{awk} has converted any non-numeric looking string +to the numeric value zero, when required. Furthermore, the original +definition of the language and the original POSIX standards specified that +@command{awk} only understands decimal numbers (base 10), and not octal +(base 8) or hexadecimal numbers (base 16). + +Changes in the language of the +2001 and 2004 POSIX standards can be interpreted to imply that @command{awk} +should support additional features. These features are: + +@itemize @value{BULLET} +@item +Interpretation of floating point data values specified in hexadecimal +notation (e.g., @code{0xDEADBEEF}). (Note: data values, @emph{not} +source code constants.) 
+
+@item
+Support for the special IEEE 754 floating point values ``Not A Number''
+(NaN), positive Infinity (``inf'') and negative Infinity (``@minus{}inf'').
+In particular, the format for these values is as specified by the ISO 1999
+C standard, which ignores case and can allow implementation-dependent additional
+characters after the @samp{nan} and allow either @samp{inf} or @samp{infinity}.
+@end itemize
+
+The first problem is that both of these are clear changes to historical
+practice:
+
+@itemize @value{BULLET}
+@item
+The @command{gawk} maintainer feels that supporting hexadecimal floating
+point values, in particular, is ugly, and was never intended by the
+original designers to be part of the language.
+
+@item
+Allowing completely alphabetic strings to have valid numeric
+values is also a very severe departure from historical practice.
+@end itemize
+
+The second problem is that the @command{gawk} maintainer feels that this
+interpretation of the standard, which requires a certain amount of
+``language lawyering'' to arrive at in the first place, was not even
+intended by the standard developers. In other words, ``we see how you
+got where you are, but we don't think that that's where you want to be.''
+
+Recognizing the above issues, but attempting to provide compatibility
+with the earlier versions of the standard,
+the 2008 POSIX standard added explicit wording to allow, but not require,
+that @command{awk} support hexadecimal floating point values and
+special values for ``Not A Number'' and infinity.
+
+Although the @command{gawk} maintainer continues to feel that
+providing those features is inadvisable,
+nevertheless, on systems that support IEEE floating point, it seems
+reasonable to provide @emph{some} way to support NaN and Infinity values.
+The solution implemented in @command{gawk} is as follows:
+
+@itemize @value{BULLET}
+@item
+With the @option{--posix} command-line option, @command{gawk} becomes
+``hands off.'' String values are passed directly to the system library's
+@code{strtod()} function, and if it successfully returns a numeric value,
+that is what's used.@footnote{You asked for it, you got it.}
+By definition, the results are not portable across
+different systems. They are also a little surprising:
+
+@example
+$ @kbd{echo nanny | gawk --posix '@{ print $1 + 0 @}'}
+@print{} nan
+$ @kbd{echo 0xDeadBeef | gawk --posix '@{ print $1 + 0 @}'}
+@print{} 3735928559
+@end example
+
+@item
+Without @option{--posix}, @command{gawk} interprets the four strings
+@samp{+inf},
+@samp{-inf},
+@samp{+nan},
+and
+@samp{-nan}
+specially, producing the corresponding special numeric values.
+The leading sign acts as a signal to @command{gawk} (and the user)
+that the value is really numeric. Hexadecimal floating point is
+not supported (unless you also use @option{--non-decimal-data},
+which is @emph{not} recommended). For example:
+
+@example
+$ @kbd{echo nanny | gawk '@{ print $1 + 0 @}'}
+@print{} 0
+$ @kbd{echo +nan | gawk '@{ print $1 + 0 @}'}
+@print{} nan
+$ @kbd{echo 0xDeadBeef | gawk '@{ print $1 + 0 @}'}
+@print{} 0
+@end example
+
+@command{gawk} ignores case in the four special values.
+Thus @samp{+nan} and @samp{+NaN} are the same.
+@end itemize
+
+@node Floating point summary
+@section Summary
+
+@itemize @value{BULLET}
+@item
+Most computer arithmetic is done using either integers or floating-point
+values. The default for @command{awk} is to use double-precision
+floating-point values.
+
+@item
+In the 1980's, Barbie mistakenly said ``Math class is tough!''
+While math isn't tough, floating-point arithmetic isn't the same
+as pencil and paper math, and care must be taken:
+
+@c nested list
+@itemize @value{MINUS}
+@item
+Not all numbers can be represented exactly.
+ +@item +Comparing values should use a delta, instead of being done directly +with @samp{==} and @samp{!=}. + +@item +Errors accumulate. + +@item +Operations are not always truly associative or distributive. +@end itemize + +@item +Increasing the accuracy can help, but it is not a panacea. + +@item +Often, increasing the accuracy and then rounding to the desired +number of digits produces reasonable results. + +@item +Use either @option{-M} or @option{--bignum} to enable MPFR +arithmetic. Use @code{PREC} to set the precision in bits, and +@code{ROUNDMODE} to set the IEEE 754 rounding mode. + +@item +With @option{-M} or @option{--bignum}, @command{gawk} performs +arbitrary precision integer arithmetic using the GMP library. +This is faster and more space efficient than using MPFR for +the same calculations. + +@item +There are several ``dark corners'' with respect to floating-point +numbers where @command{gawk} disagrees with the POSIX standard. +It pays to be aware of them. + +@item +Overall, there is no need to be unduly suspicious about the results from +floating-point arithmetic. The lesson to remember is that floating-point +arithmetic is always more complex than arithmetic using pencil and +paper. In order to take advantage of the power of computer floating-point, +you need to know its limitations and work within them. For most casual +use of floating-point arithmetic, you will often get the expected result +if you simply round the display of your final results to the correct number +of significant decimal digits. + +@item +As general advice, avoid presenting numerical data in a manner that +implies better precision than is actually the case. + +@end itemize + @node Dynamic Extensions @chapter Writing Extensions for @command{gawk} +@cindex dynamically loaded extensions It is possible to add new functions written in C or C++ to @command{gawk} using dynamically loaded libraries. 
This facility is available on systems @@ -28464,11 +29827,14 @@ When @option{--sandbox} is specified, extensions are disabled * Extension Samples:: The sample extensions that ship with @code{gawk}. * gawkextlib:: The @code{gawkextlib} project. +* Extension summary:: Extension summary. +* Extension Exercises:: Exercises. @end menu @node Extension Intro @section Introduction +@cindex plug-in An @dfn{extension} (sometimes called a @dfn{plug-in}) is a piece of external compiled code that @command{gawk} can load at runtime to provide additional functionality, over and above the built-in capabilities @@ -28488,8 +29854,15 @@ the facilities that the API provides and how to use them, and presents a small sample extension. In addition, it documents the sample extensions included in the @command{gawk} distribution, and describes the @code{gawkextlib} project. +@ifclear FOR_PRINT @xref{Extension Design}, for a discussion of the extension mechanism goals and design. +@end ifclear +@ifset FOR_PRINT +See @uref{http://www.gnu.org/software/gawk/manual/html_node/Extension-Design.html} +for a discussion of the extension mechanism +goals and design. +@end ifset @node Plugin License @section Extension Licensing @@ -28514,45 +29887,83 @@ Communication between @command{gawk} and an extension is two-way. First, when an extension is loaded, it is passed a pointer to a @code{struct} whose fields are function pointers. -This is shown in @ref{load-extension}. +@ifnotdocbook +This is shown in @ref{figure-load-extension}. +@end ifnotdocbook +@ifdocbook +This is shown in @inlineraw{docbook, <xref linkend="figure-load-extension"/>}. +@end ifdocbook -@float Figure,load-extension +@ifnotdocbook +@float Figure,figure-load-extension @caption{Loading The Extension} @c FIXME: One day, it should not be necessary to have two cases, @c but rather just the one without the "txt" final argument. @c This applies to the other figures as well. 
@ifinfo -@center @image{api-figure1, , , Loading the extension, txt} +@center @image{api-figure1, , , Loading The Extension, txt} @end ifinfo @ifnotinfo -@center @image{api-figure1, , , Loading the extension} +@center @image{api-figure1, , , Loading The Extension} @end ifnotinfo @end float +@end ifnotdocbook + +@docbook +<figure id="figure-load-extension" float="0"> +<title>Loading The Extension</title> +<mediaobject> +<imageobject role="web"><imagedata fileref="api-figure1.png" format="PNG"/></imageobject> +</mediaobject> +</figure> +@end docbook The extension can call functions inside @command{gawk} through these function pointers, at runtime, without needing (link-time) access to @command{gawk}'s symbols. One of these function pointers is to a function for ``registering'' new built-in functions. -This is shown in @ref{load-new-function}. +@ifnotdocbook +This is shown in @ref{figure-load-new-function}. +@end ifnotdocbook +@ifdocbook +This is shown in @inlineraw{docbook, <xref linkend="figure-load-new-function"/>}. +@end ifdocbook -@float Figure,load-new-function +@ifnotdocbook +@float Figure,figure-load-new-function @caption{Loading The New Function} @ifinfo -@center @image{api-figure2, , , Loading the new function, txt} +@center @image{api-figure2, , , Loading The New Function, txt} @end ifinfo @ifnotinfo -@center @image{api-figure2, , , Loading the new function} +@center @image{api-figure2, , , Loading The New Function} @end ifnotinfo @end float +@end ifnotdocbook + +@docbook +<figure id="figure-load-new-function" float="0"> +<title>Loading The New Function</title> +<mediaobject> +<imageobject role="web"><imagedata fileref="api-figure2.png" format="PNG"/></imageobject> +</mediaobject> +</figure> +@end docbook In the other direction, the extension registers its new functions with @command{gawk} by passing function pointers to the functions that provide the new feature (@code{do_chdir()}, for example). 
@command{gawk} associates the function pointer with a name and can then call it, using a defined calling convention. -This is shown in @ref{call-new-function}. +@ifnotdocbook +This is shown in @ref{figure-call-new-function}. +@end ifnotdocbook +@ifdocbook +This is shown in @inlineraw{docbook, <xref linkend="figure-call-new-function"/>}. +@end ifdocbook -@float Figure,call-new-function +@ifnotdocbook +@float Figure,figure-call-new-function @caption{Calling The New Function} @ifinfo @center @image{api-figure3, , , Calling the new function, txt} @@ -28561,14 +29972,24 @@ This is shown in @ref{call-new-function}. @center @image{api-figure3, , , Calling the new function} @end ifnotinfo @end float +@end ifnotdocbook + +@docbook +<figure id="figure-call-new-function" float="0"> +<title>Calling The New Function</title> +<mediaobject> +<imageobject role="web"><imagedata fileref="api-figure3.png" format="PNG"/></imageobject> +</mediaobject> +</figure> +@end docbook The @code{do_@var{xxx}()} function, in turn, then uses the function pointers in the API @code{struct} to do its work, such as updating variables or arrays, printing messages, setting @code{ERRNO}, and so on. -Convenience macros in the @file{gawkapi.h} header file make calling -through the function pointers look like regular function calls so that -extension code is quite readable and understandable. +Convenience macros make calling through the function pointers look +like regular function calls so that extension code is quite readable +and understandable. Although all of this sounds somewhat complicated, the result is that extension code is quite straightforward to write and to read. You can @@ -28577,7 +29998,7 @@ Example}) and also the @file{testext.c} code for testing the APIs. 
Some other bits and pieces: -@itemize @bullet +@itemize @value{BULLET} @item The API provides access to @command{gawk}'s @code{do_@var{xxx}} values, reflecting command line options, like @code{do_lint}, @code{do_profiling} @@ -28597,13 +30018,18 @@ happen, but we all know how @emph{that} goes.) @node Extension API Description @section API Description +@cindex extension API +C or C++ code for an extension must include the header file +@file{gawkapi.h}, which declares the functions and defines the data +types used to communicate with @command{gawk}. This (rather large) @value{SECTION} describes the API in detail. @menu * Extension API Functions Introduction:: Introduction to the API functions. * General Data Types:: The data types. * Requesting Values:: How to get a value. +* Memory Allocation Functions:: Functions for allocating memory. * Constructor Functions:: Functions for creating values. * Registration Functions:: Functions to register things with @command{gawk}. @@ -28625,10 +30051,10 @@ by calling through function pointers passed into your extension. API function pointers are provided for the following kinds of operations: -@itemize @bullet +@itemize @value{BULLET} @item -Registrations functions. You may register: -@itemize @minus +Registration functions. You may register: +@itemize @value{MINUS} @item extension functions, @item @@ -28659,6 +30085,9 @@ Symbol table access: retrieving a global variable, creating one, or changing one. @item +Allocating, reallocating, and releasing memory. + +@item Creating and releasing cached values; this provides an efficient way to use values for multiple variables and can be a big performance win. @@ -28666,7 +30095,7 @@ can be a big performance win. 
@item Manipulating arrays: -@itemize @minus +@itemize @value{MINUS} @item Retrieving, adding, deleting, and modifying elements @@ -28686,7 +30115,7 @@ Flattening an array for easy C style looping over all its indices and elements Some points about using the API: -@itemize @bullet +@itemize @value{BULLET} @item The following types and/or macros and/or functions are referenced in @file{gawkapi.h}. For correct use, you must therefore include the @@ -28695,12 +30124,11 @@ corresponding standard header file @emph{before} including @file{gawkapi.h}: @multitable {@code{memset()}, @code{memcpy()}} {@code{<sys/types.h>}} @headitem C Entity @tab Header File @item @code{EOF} @tab @code{<stdio.h>} +@item Values for @code{errno} @tab @code{<errno.h>} @item @code{FILE} @tab @code{<stdio.h>} @item @code{NULL} @tab @code{<stddef.h>} -@item @code{malloc()} @tab @code{<stdlib.h>} @item @code{memcpy()} @tab @code{<string.h>} @item @code{memset()} @tab @code{<string.h>} -@item @code{realloc()} @tab @code{<stdlib.h>} @item @code{size_t} @tab @code{<sys/types.h>} @item @code{struct stat} @tab @code{<sys/stat.h>} @end multitable @@ -28712,9 +30140,6 @@ is necessary in order to keep @file{gawkapi.h} clean, instead of becoming a portability hodge-podge as can be seen in some parts of the @command{gawk} source code. -To pass reasonable integer values for @code{ERRNO}, you will also need to -include @code{<errno.h>}. - @item The @file{gawkapi.h} file may be included more than once without ill effect. Doing so, however, is poor coding practice. @@ -28730,14 +30155,15 @@ does not support this keyword, you should either place All pointers filled in by @command{gawk} are to memory managed by @command{gawk} and should be treated by the extension as read-only. Memory for @emph{all} strings passed into @command{gawk} -from the extension @emph{must} come from @code{malloc()} and is managed -by @command{gawk} from then on. 
+from the extension @emph{must} come from calling the API-provided function +pointers @code{api_malloc()}, @code{api_calloc()} or @code{api_realloc()}, +and is managed by @command{gawk} from then on. @item The API defines several simple @code{struct}s that map values as seen from @command{awk}. A value can be a @code{double}, a string, or an array (as in multidimensional arrays, or when creating a new array). -String values maintain both pointer and length since embedded @code{NUL} +String values maintain both pointer and length since embedded @sc{nul} characters are allowed. @quotation NOTE @@ -28771,10 +30197,14 @@ the macros as if they were functions. @node General Data Types @subsection General Purpose Data Types +@cindex Robbins, Arnold +@cindex Ramey, Chet @quotation @i{I have a true love/hate relationship with unions.} @author Arnold Robbins +@end quotation +@quotation @i{That's the thing about unions: the compiler will arrange things so they can accommodate both love and hate.} @author Chet Ramey @@ -28797,9 +30227,9 @@ certain fields in the API data structures unwritable from extension code, while allowing @command{gawk} to use them as it needs to. @item typedef enum awk_bool @{ -@item @ @ @ @ awk_false = 0, -@item @ @ @ @ awk_true -@item @} awk_bool_t; +@itemx @ @ @ @ awk_false = 0, +@itemx @ @ @ @ awk_true +@itemx @} awk_bool_t; A simple boolean type. @item typedef struct awk_string @{ @@ -28809,7 +30239,8 @@ A simple boolean type. This represents a mutable string. @command{gawk} owns the memory pointed to if it supplied the value. Otherwise, it takes ownership of the memory pointed to. -@strong{Such memory must come from @code{malloc()}!} +@strong{Such memory must come from calling the API-provided function +pointers @code{api_malloc()}, @code{api_calloc()}, or @code{api_realloc()}!} As mentioned earlier, strings are maintained using the current multibyte encoding. @@ -28864,7 +30295,7 @@ Scalar values in @command{awk} are either numbers or strings. 
The indicates what is in the @code{union}. Representing numbers is easy---the API uses a C @code{double}. Strings -require more work. Since @command{gawk} allows embedded @code{NUL} bytes +require more work. Since @command{gawk} allows embedded @sc{nul} bytes in string values, a string must be represented as a pair containing a data-pointer and length. This is the @code{awk_string_t} type. @@ -28894,8 +30325,11 @@ reading and/or changing the value of one or more scalar variables, you can obtain a @dfn{scalar cookie}@footnote{See @uref{http://catb.org/jargon/html/C/cookie.html, the ``cookie'' entry in the Jargon file} for a definition of @dfn{cookie}, and @uref{http://catb.org/jargon/html/M/magic-cookie.html, -the ``magic cookie'' entry in the Jargon file} for a nice example. See -also the entry for ``Cookie'' in the @ref{Glossary}.} +the ``magic cookie'' entry in the Jargon file} for a nice example. +@ifclear FOR_PRINT +See also the entry for ``Cookie'' in the @ref{Glossary}. +@end ifclear +} object for that variable, and then use the cookie for getting the variable's value or for changing the variable's value. @@ -28925,9 +30359,94 @@ print an error message, or reissue the request for the actual value type, as appropriate. This behavior is summarized in @ref{table-value-types-returned}. -@ifnotplaintext +@c FIXME: Try to do this with spans... 
+ @float Table,table-value-types-returned -@caption{Value Types Returned} +@caption{API Value Types Returned} +@docbook +<informaltable> +<tgroup cols="2"> + <colspec colwidth="50*"/><colspec colwidth="50*"/> + <thead> + <row><entry></entry><entry><para>Type of Actual Value:</para></entry></row> + </thead> + <tbody> + <row><entry></entry><entry></entry></row> + </tbody> +</tgroup> +<tgroup cols="6"> + <colspec colwidth="16.6*"/> + <colspec colwidth="16.6*"/> + <colspec colwidth="19.8*"/> + <colspec colwidth="15*"/> + <colspec colwidth="15*"/> + <colspec colwidth="16.6*"/> + <thead> + <row> + <entry></entry> + <entry></entry> + <entry><para>String</para></entry> + <entry><para>Number</para></entry> + <entry><para>Array</para></entry> + <entry><para>Undefined</para></entry> + </row> + </thead> + <tbody> + <row> + <entry></entry> + <entry><para><emphasis role="bold">String</emphasis></para></entry> + <entry><para>String</para></entry> + <entry><para>String</para></entry> + <entry><para>false</para></entry> + <entry><para>false</para></entry> + </row> + <row> + <entry></entry> + <entry><para><emphasis role="bold">Number</emphasis></para></entry> + <entry><para>Number if can be converted, else false</para></entry> + <entry><para>Number</para></entry> + <entry><para>false</para></entry> + <entry><para>false</para></entry> + </row> + <row> + <entry><para><emphasis role="bold">Type</emphasis></para></entry> + <entry><para><emphasis role="bold">Array</emphasis></para></entry> + <entry><para>false</para></entry> + <entry><para>false</para></entry> + <entry><para>Array</para></entry> + <entry><para>false</para></entry> + </row> + <row> + <entry><para><emphasis role="bold">Requested:</emphasis></para></entry> + <entry><para><emphasis role="bold">Scalar</emphasis></para></entry> + <entry><para>Scalar</para></entry> + <entry><para>Scalar</para></entry> + <entry><para>false</para></entry> + <entry><para>false</para></entry> + </row> + <row> + <entry></entry> + 
<entry><para><emphasis role="bold">Undefined</emphasis></para></entry> + <entry><para>String</para></entry> + <entry><para>Number</para></entry> + <entry><para>Array</para></entry> + <entry><para>Undefined</para></entry> + </row> + <row> + <entry></entry> + <entry><para><emphasis role="bold">Value Cookie</emphasis></para></entry> + <entry><para>false</para></entry> + <entry><para>false</para></entry> + <entry><para>false</para> + </entry><entry><para>false</para></entry> + </row> + </tbody> +</tgroup> +</informaltable> +@end docbook + +@ifnotplaintext +@ifnotdocbook @multitable @columnfractions .50 .50 @headitem @tab Type of Actual Value: @end multitable @@ -28940,11 +30459,9 @@ value type, as appropriate. This behavior is summarized in @item @tab @b{Undefined} @tab String @tab Number @tab Array @tab Undefined @item @tab @b{Value Cookie} @tab false @tab false @tab false @tab false @end multitable -@end float +@end ifnotdocbook @end ifnotplaintext @ifplaintext -@float Table,table-value-types-returned -@caption{Value Types Returned} @example +-------------------------------------------------+ | Type of Actual Value: | @@ -28968,60 +30485,62 @@ value type, as appropriate. This behavior is summarized in | | Cookie | | | | | +-----------+-----------+------------+------------+-----------+-----------+ @end example -@end float @end ifplaintext +@end float -@node Constructor Functions -@subsection Constructor Functions and Convenience Macros +@node Memory Allocation Functions +@subsection Memory Allocation Functions and Convenience Macros +@cindex allocating memory for extensions +@cindex extensions, allocating memory -The API provides a number of @dfn{constructor} functions for creating -string and numeric values, as well as a number of convenience macros. -This @value{SUBSECTION} presents them all as function prototypes, in -the way that extension code would use them. 
+The API provides a number of @dfn{memory allocation} functions for +allocating memory that can be passed to @command{gawk}, as well as a number of +convenience macros. @table @code -@item static inline awk_value_t * -@itemx make_const_string(const char *string, size_t length, awk_value_t *result) -This function creates a string value in the @code{awk_value_t} variable -pointed to by @code{result}. It expects @code{string} to be a C string constant -(or other string data), and automatically creates a @emph{copy} of the data -for storage in @code{result}. It returns @code{result}. +@item void *gawk_malloc(size_t size); +Call @command{gawk}-provided @code{api_malloc()} to allocate storage that may +be passed to @command{gawk}. -@item static inline awk_value_t * -@itemx make_malloced_string(const char *string, size_t length, awk_value_t *result) -This function creates a string value in the @code{awk_value_t} variable -pointed to by @code{result}. It expects @code{string} to be a @samp{char *} -value pointing to data previously obtained from @code{malloc()}. The idea here -is that the data is passed directly to @command{gawk}, which assumes -responsibility for it. It returns @code{result}. +@item void *gawk_calloc(size_t nmemb, size_t size); +Call @command{gawk}-provided @code{api_calloc()} to allocate storage that may +be passed to @command{gawk}. -@item static inline awk_value_t * -@itemx make_null_string(awk_value_t *result) -This specialized function creates a null string (the ``undefined'' value) -in the @code{awk_value_t} variable pointed to by @code{result}. -It returns @code{result}. +@item void *gawk_realloc(void *ptr, size_t size); +Call @command{gawk}-provided @code{api_realloc()} to allocate storage that may +be passed to @command{gawk}. -@item static inline awk_value_t * -@itemx make_number(double num, awk_value_t *result) -This function simply creates a numeric value in the @code{awk_value_t} variable -pointed to by @code{result}. 
+@item void gawk_free(void *ptr); +Call @command{gawk}-provided @code{api_free()} to release storage that was +allocated with @code{gawk_malloc()}, @code{gawk_calloc()} or @code{gawk_realloc()}. @end table -Two convenience macros may be used for allocating storage from @code{malloc()} -and @code{realloc()}. If the allocation fails, they cause @command{gawk} to -exit with a fatal error message. They should be used as if they were +The API has to provide these functions because it is possible +for an extension to be compiled and linked against a different +version of the C library than was used for the @command{gawk} +executable.@footnote{This is more common on MS-Windows systems, but +can happen on Unix-like systems as well.} If @command{gawk} were +to use its version of @code{free()} when the memory came from an +unrelated version of @code{malloc()}, unexpected behavior would +likely result. + +Two convenience macros may be used for allocating storage +from the API-provided function pointers @code{api_malloc()} and +@code{api_realloc()}. If the allocation fails, they cause @command{gawk} +to exit with a fatal error message. They should be used as if they were procedure calls that do not return a value. @table @code @item #define emalloc(pointer, type, size, message) @dots{} The arguments to this macro are as follows: + @c nested table @table @code @item pointer The pointer variable to point at the allocated storage. @item type -The type of the pointer variable, used to create a cast for the call to @code{malloc()}. +The type of the pointer variable, used to create a cast for the call to @code{api_malloc()}. @item size The total number of bytes to be allocated. @@ -29045,13 +30564,51 @@ make_malloced_string(message, strlen(message), & result); @end example @item #define erealloc(pointer, type, size, message) @dots{} -This is like @code{emalloc()}, but it calls @code{realloc()}, -instead of @code{malloc()}. 
+This is like @code{emalloc()}, but it calls @code{api_realloc()}, +instead of @code{api_malloc()}. The arguments are the same as for the @code{emalloc()} macro. @end table +@node Constructor Functions +@subsection Constructor Functions + +The API provides a number of @dfn{constructor} functions for creating +string and numeric values, as well as a number of convenience macros. +This @value{SUBSECTION} presents them all as function prototypes, in +the way that extension code would use them. + +@table @code +@item static inline awk_value_t * +@itemx make_const_string(const char *string, size_t length, awk_value_t *result) +This function creates a string value in the @code{awk_value_t} variable +pointed to by @code{result}. It expects @code{string} to be a C string constant +(or other string data), and automatically creates a @emph{copy} of the data +for storage in @code{result}. It returns @code{result}. + +@item static inline awk_value_t * +@itemx make_malloced_string(const char *string, size_t length, awk_value_t *result) +This function creates a string value in the @code{awk_value_t} variable +pointed to by @code{result}. It expects @code{string} to be a @samp{char *} +value pointing to data previously obtained from the api-provided functions @code{api_malloc()}, @code{api_calloc()} or @code{api_realloc()}. The idea here +is that the data is passed directly to @command{gawk}, which assumes +responsibility for it. It returns @code{result}. + +@item static inline awk_value_t * +@itemx make_null_string(awk_value_t *result) +This specialized function creates a null string (the ``undefined'' value) +in the @code{awk_value_t} variable pointed to by @code{result}. +It returns @code{result}. + +@item static inline awk_value_t * +@itemx make_number(double num, awk_value_t *result) +This function simply creates a numeric value in the @code{awk_value_t} variable +pointed to by @code{result}. 
+@end table + @node Registration Functions @subsection Registration Functions +@cindex register extension +@cindex extension registration This @value{SECTION} describes the API functions for registering parts of your extension with @command{gawk}. @@ -29096,8 +30653,8 @@ Letter case in function names is significant. This is a pointer to the C function that provides the desired functionality. The function must fill in the result with either a number -or a string. @command{awk} takes ownership of any string memory. -As mentioned earlier, string memory @strong{must} come from @code{malloc()}. +or a string. @command{gawk} takes ownership of any string memory. +As mentioned earlier, string memory @strong{must} come from the api-provided functions @code{api_malloc()}, @code{api_calloc()} or @code{api_realloc()}. The @code{num_actual_args} argument tells the C function how many actual parameters were passed from the calling @command{awk} code. @@ -29128,7 +30685,7 @@ empty string (@code{""}). The @code{func} pointer is the address of a An @dfn{exit callback} function is a function that @command{gawk} calls before it exits. -Such functions are useful if you have general ``clean up'' tasks +Such functions are useful if you have general ``cleanup'' tasks that should be performed in your extension (such as closing data base connections or other resource deallocations). You can register such @@ -29138,6 +30695,7 @@ a function with @command{gawk} using the following function. @item void awk_atexit(void (*funcp)(void *data, int exit_status), @itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ void *arg0); The parameters are: + @c nested table @table @code @item funcp @@ -29173,6 +30731,7 @@ is invoked with the @option{--version} option. @node Input Parsers @subsubsection Customized Input Parsers +@cindex customized input parser By default, @command{gawk} reads text files as its input. 
It uses the value of @code{RS} to find the end of the record, and then uses @code{FS} @@ -29230,8 +30789,9 @@ A pointer to your @code{@var{XXX}_can_take_file()} function. A pointer to your @code{@var{XXX}_take_control_of()} function. @item awk_const struct input_parser *awk_const next; -This pointer is used by @command{gawk}. -The extension cannot modify it. +This is for use by @command{gawk}; +therefore it is marked @code{awk_const} so that the extension cannot +modify it. @end table The steps are as follows: @@ -29278,7 +30838,7 @@ open the file, then @code{fd} will @emph{not} be equal to @code{INVALID_HANDLE}. Otherwise, it will. @item struct stat sbuf; -If file descriptor is valid, then @command{gawk} will have filled +If the file descriptor is valid, then @command{gawk} will have filled in this structure via a call to the @code{fstat()} system call. @end table @@ -29420,7 +30980,9 @@ Register the input parser pointed to by @code{input_parser} with @node Output Wrappers @subsubsection Customized Output Wrappers +@cindex customized output wrapper +@cindex output wrapper An @dfn{output wrapper} is the mirror image of an input parser. It allows an extension to take over the output to a file opened with the @samp{>} or @samp{>>} I/O redirection operators (@pxref{Redirection}). @@ -29457,8 +31019,8 @@ as described below, and return true if successful, false otherwise. @item awk_const struct output_wrapper *awk_const next; This is for use by @command{gawk}; -therefore they are marked @code{awk_const} so that the extension cannot -modify them. +therefore it is marked @code{awk_const} so that the extension cannot +modify it. @end table The @code{awk_output_buf_t} structure looks like this: @@ -29520,7 +31082,7 @@ The @code{@var{XXX}_can_take_file()} function should make a decision based upon the @code{name} and @code{mode} fields, and any additional state (such as @command{awk} variable values) that is appropriate. 
-When @command{gawk} calls @code{@var{XXX}_take_control_of()}, it should fill +When @command{gawk} calls @code{@var{XXX}_take_control_of()}, that function should fill in the other fields, as appropriate, except for @code{fp}, which it should just use normally. @@ -29534,6 +31096,7 @@ Register the output wrapper pointed to by @code{output_wrapper} with @node Two-way processors @subsubsection Customized Two-way Processors +@cindex customized two-way processor A @dfn{two-way processor} combines an input parser and an output wrapper for two-way I/O with the @samp{|&} operator (@pxref{Redirection}). It makes identical @@ -29560,7 +31123,7 @@ The fields are as follows: The name of the two-way processor. @item awk_bool_t (*can_take_two_way)(const char *name); -This function returns true if it wants to take over two-way I/O for this filename. +This function returns true if it wants to take over two-way I/O for this @value{FN}. It should not change any state (variable values, etc.) within @command{gawk}. @@ -29573,8 +31136,8 @@ This function should fill in the @code{awk_input_buf_t} and @item awk_const struct two_way_processor *awk_const next; This is for use by @command{gawk}; -therefore they are marked @code{awk_const} so that the extension cannot -modify them. +therefore it is marked @code{awk_const} so that the extension cannot +modify it. @end table As with the input parser and output processor, you provide @@ -29591,6 +31154,8 @@ Register the two-way processor pointed to by @code{two_way_processor} with @node Printing Messages @subsection Printing Messages +@cindex printing messages from extensions +@cindex messages from extensions You can print different kinds of warning messages from your extension, as described below. Note that for these functions, @@ -29664,6 +31229,7 @@ for more information on creating arrays. 
@node Symbol Table Access @subsection Symbol Table Access +@cindex accessing global variables from extensions Two sets of routines provide access to global variables, and one set allows you to create and release cached values. @@ -29709,6 +31275,13 @@ An extension can look up the value of @command{gawk}'s special variables. However, with the exception of the @code{PROCINFO} array, an extension cannot change any of those variables. +@quotation NOTE +It is possible for the lookup of @code{PROCINFO} to fail. This happens if +the @command{awk} program being run does not reference @code{PROCINFO}; +in this case @command{gawk} doesn't bother to create the array and +populate it. +@end quotation + @node Symbol table by cookie @subsubsection Variable Access and Update by Cookie @@ -29730,7 +31303,7 @@ Return false if the value cannot be retrieved. @item awk_bool_t sym_update_scalar(awk_scalar_t cookie, awk_value_t *value); Update the value associated with a scalar cookie. Return false if -the new value is not one of @code{AWK_STRING} or @code{AWK_NUMBER}. +the new value is not of type @code{AWK_STRING} or @code{AWK_NUMBER}. Here too, the built-in variables may not be updated. @end table @@ -29835,7 +31408,7 @@ assign those values to variables using @code{sym_update()} or @code{sym_update_scalar()}, as you like. However, you can understand the point of cached values if you remember that -@emph{every} string value's storage @emph{must} come from @code{malloc()}. +@emph{every} string value's storage @emph{must} come from @code{api_malloc()}, @code{api_calloc()} or @code{api_realloc()}. If you have 20 variables, all of which have the same string value, you must create 20 identical copies of the string.@footnote{Numeric values are clearly less problematic, requiring only a C @code{double} to store.} @@ -29848,7 +31421,7 @@ is what the routines in this section let you do. 
The functions are as follows: @item awk_bool_t create_value(awk_value_t *value, awk_value_cookie_t *result); Create a cached string or numeric value from @code{value} for efficient later assignment. -Only @code{AWK_NUMBER} and @code{AWK_STRING} values are allowed. Any other type +Only values of type @code{AWK_NUMBER} and @code{AWK_STRING} are allowed. Any other type is rejected. While @code{AWK_UNDEFINED} could be allowed, doing so would result in inferior performance. @@ -29909,18 +31482,19 @@ What happens if @command{awk} code assigns a new value to @code{VAR1}, are all the others changed too?'' That's a great question. The answer is that no, it's not a problem. -Internally, @command{gawk} uses reference-counted strings. This means +Internally, @command{gawk} uses @dfn{reference-counted strings}. This means that many variables can share the same string value, and @command{gawk} keeps track of the usage. When a variable's value changes, @command{gawk} simply decrements the reference count on the old value and updates the variable to use the new value. -Finally, as part of your clean up action (@pxref{Exit Callback Functions}) +Finally, as part of your cleanup action (@pxref{Exit Callback Functions}) you should release any cached values that you created, using @code{release_value()}. @node Array Manipulation @subsection Array Manipulation +@cindex array manipulation in extensions The primary data structure@footnote{Okay, the only data structure.} in @command{awk} is the associative array (@pxref{Arrays}). @@ -30032,7 +31606,7 @@ requires that you understand how such values are converted to strings (@pxref{Conversion}); thus using integral values is safest.
As with @emph{all} strings passed into @code{gawk} from an extension, -the string value of @code{index} must come from @code{malloc()}, and +the string value of @code{index} must come from the API-provided functions @code{api_malloc()}, @code{api_calloc()} or @code{api_realloc()} and @command{gawk} releases the storage. @item awk_bool_t set_array_element(awk_array_t a_cookie, @@ -30040,7 +31614,8 @@ the string value of @code{index} must come from @code{malloc()}, and @itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ const@ awk_value_t *const value); In the array represented by @code{a_cookie}, create or modify the element whose index is given by @code{index}. -The @code{ARGV} and @code{ENVIRON} arrays may not be changed. +The @code{ARGV} and @code{ENVIRON} arrays may not be changed, +although the @code{PROCINFO} array can be. @item awk_bool_t set_array_element_by_elem(awk_array_t a_cookie, @itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ awk_element_t element); @@ -30311,7 +31886,7 @@ you must add the new array to its parent before adding any elements to it. Thus, the correct way to build an array is to work ``top down.'' Create the array, and immediately install it in @command{gawk}'s symbol table using @code{sym_update()}, or install it as an element in a previously -existing array using @code{set_element()}. We show example code shortly. +existing array using @code{set_array_element()}. We show example code shortly. @item Due to gawk internals, after using @code{sym_update()} to install an array @@ -30337,7 +31912,7 @@ of the array cookie after the call to @code{set_element()}. @end enumerate The following C code is a simple test extension to create an array -with two regular elements and with a subarray. The leading @samp{#include} +with two regular elements and with a subarray. The leading @code{#include} directives and boilerplate variable declarations are omitted for brevity. 
The first step is to create a new array and then install it in the symbol table: @@ -30500,6 +32075,8 @@ information about how @command{gawk} was invoked. @node Extension Versioning @subsubsection API Version Constants and Variables +@cindex API version +@cindex extension API version The API provides both a ``major'' and a ``minor'' version number. The API versions are available at compile time as constants: @@ -30553,18 +32130,23 @@ provided in @file{gawkapi.h} (discussed later, in @node Extension API Informational Variables @subsubsection Informational Variables +@cindex API informational variables +@cindex extension API informational variables The API provides access to several variables that describe whether the corresponding command-line options were enabled when @command{gawk} was invoked. The variables are: @table @code +@item do_debug +This variable is true if @command{gawk} was invoked with @option{--debug} option. + @item do_lint This variable is true if @command{gawk} was invoked with @option{--lint} option (@pxref{Options}). -@item do_traditional -This variable is true if @command{gawk} was invoked with @option{--traditional} option. +@item do_mpfr +This variable is true if @command{gawk} was invoked with @option{--bignum} option. @item do_profile This variable is true if @command{gawk} was invoked with @option{--profile} option. @@ -30572,11 +32154,8 @@ This variable is true if @command{gawk} was invoked with @option{--profile} opti @item do_sandbox This variable is true if @command{gawk} was invoked with @option{--sandbox} option. -@item do_debug -This variable is true if @command{gawk} was invoked with @option{--debug} option. - -@item do_mpfr -This variable is true if @command{gawk} was invoked with @option{--bignum} option. +@item do_traditional +This variable is true if @command{gawk} was invoked with @option{--traditional} option. 
@end table The value of @code{do_lint} can change if @command{awk} code @@ -30627,8 +32206,14 @@ These variables and functions are as follows: @table @code @item int plugin_is_GPL_compatible; -This asserts that the extension is compatible with the GNU GPL -(@pxref{Copying}). If your extension does not have this, @command{gawk} +This asserts that the extension is compatible with +@ifclear FOR_PRINT +the GNU GPL (@pxref{Copying}). +@end ifclear +@ifset FOR_PRINT +the GNU GPL. +@end ifset +If your extension does not have this, @command{gawk} will not load it (@pxref{Plugin License}). @item static gawk_api_t *const api; @@ -30652,8 +32237,9 @@ as described earlier (@pxref{Extension Functions}). It can then be looped over for multiple calls to @code{add_ext_func()}. +@c Use @var{OR} for docbook @item static awk_bool_t (*init_func)(void) = NULL; -@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @r{OR} +@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @var{OR} @itemx static awk_bool_t init_my_module(void) @{ @dots{} @} @itemx static awk_bool_t (*init_func)(void) = init_my_module; If you need to do some initialization work, you should define a @@ -30698,6 +32284,8 @@ the version string with @command{gawk}. @node Finding Extensions @section How @command{gawk} Finds Extensions +@cindex extension search path +@cindex finding extensions Compiled extensions have to be installed in a directory where @command{gawk} can find them. If @command{gawk} is configured and @@ -30708,6 +32296,7 @@ path with a list of directories to search for compiled extensions. @node Extension Example @section Example: Some File Functions +@cindex extension example @quotation @i{No matter where you go, there you are.} @@ -30889,7 +32478,6 @@ Those are followed by the necessary variable declarations to make use of the API macros and boilerplate code (@pxref{Extension API Boilerplate}). 
-@c break line for page breaking @example #ifdef HAVE_CONFIG_H #include <config.h> @@ -30976,7 +32564,6 @@ The @code{stat()} extension is more involved. First comes a function that turns a numeric mode into a printable representation (e.g., 644 becomes @samp{-rw-r--r--}). This is omitted here for brevity: -@c break line for page breaking @example /* format_mode --- turn a stat mode field into something readable */ @@ -31166,7 +32753,7 @@ do_stat(int nargs, awk_value_t *result) awk_array_t array; int ret; struct stat sbuf; - /* default is stat() */ + /* default is lstat() */ int (*statfunc)(const char *path, struct stat *sbuf) = lstat; assert(result != NULL); @@ -31250,7 +32837,9 @@ structures for loading each function into @command{gawk}: static awk_ext_func_t func_table[] = @{ @{ "chdir", do_chdir, 1 @}, @{ "stat", do_stat, 2 @}, +#ifndef __MINGW32__ @{ "fts", do_fts, 3 @}, +#endif @}; @end example @@ -31264,9 +32853,7 @@ everything that needs to be loaded. It is simplest to use the dl_load_func(func_table, filefuncs, "") @end example -And that's it! As an exercise, consider adding functions to -implement system calls such as @code{chown()}, @code{chmod()}, -and @code{umask()}. +And that's it! @node Using Internal File Ops @subsection Integrating The Extensions @@ -31278,7 +32865,7 @@ code must be compiled. Assuming that the functions are in a file named @file{filefuncs.c}, and @var{idir} is the location of the @file{gawkapi.h} header file, the following steps@footnote{In practice, you would probably want to -use the GNU Autotools---Automake, Autoconf, Libtool, and Gettext---to +use the GNU Autotools---Automake, Autoconf, Libtool, and @command{gettext}---to configure and build your libraries. Instructions for doing so are beyond the scope of this @value{DOCUMENT}. 
@xref{gawkextlib}, for WWW links to the tools.} create a GNU/Linux shared library: @@ -31320,7 +32907,7 @@ BEGIN @{ @end example The @env{AWKLIBPATH} environment variable tells -@command{gawk} where to find shared libraries (@pxref{Finding Extensions}). +@command{gawk} where to find extensions (@pxref{Finding Extensions}). We set it to the current directory and run the program: @example @@ -31352,6 +32939,7 @@ $ @kbd{AWKLIBPATH=$PWD gawk -f testff.awk} @node Extension Samples @section The Sample Extensions In The @command{gawk} Distribution +@cindex extensions distributed with @command{gawk} This @value{SECTION} provides brief overviews of the sample extensions that come in the @command{gawk} distribution. Some of them are intended @@ -31382,19 +32970,19 @@ Others mainly provide example code that shows how to use the extension API. The @code{filefuncs} extension provides three different functions, as follows: The usage is: -@table @code +@table @asis @item @@load "filefuncs" This is how you load the extension. -@cindex @code{chdir} extension function -@item result = chdir("/some/directory") +@cindex @code{chdir()} extension function +@item @code{result = chdir("/some/directory")} The @code{chdir()} function is a direct hook to the @code{chdir()} system call to change the current directory. It returns zero upon success or less than zero upon error. In the latter case it updates @code{ERRNO}. -@cindex @code{stat} extension function -@item result = stat("/some/path", statdata [, follow]) +@cindex @code{stat()} extension function +@item @code{result = stat("/some/path", statdata} [@code{, follow}]@code{)} The @code{stat()} function provides a hook into the @code{stat()} system call. It returns zero upon success or less than zero upon error. @@ -31407,69 +32995,27 @@ In all cases, it clears the @code{statdata} array. 
When the call is successful, @code{stat()} fills the @code{statdata} array with information retrieved from the filesystem, as follows: -@c nested table -@multitable @columnfractions .25 .60 -@item @code{statdata["name"]} @tab -The name of the file. - -@item @code{statdata["dev"]} @tab -Corresponds to the @code{st_dev} field in the @code{struct stat}. - -@item @code{statdata["ino"]} @tab -Corresponds to the @code{st_ino} field in the @code{struct stat}. - -@item @code{statdata["mode"]} @tab -Corresponds to the @code{st_mode} field in the @code{struct stat}. - -@item @code{statdata["nlink"]} @tab -Corresponds to the @code{st_nlink} field in the @code{struct stat}. - -@item @code{statdata["uid"]} @tab -Corresponds to the @code{st_uid} field in the @code{struct stat}. - -@item @code{statdata["gid"]} @tab -Corresponds to the @code{st_gid} field in the @code{struct stat}. - -@item @code{statdata["size"]} @tab -Corresponds to the @code{st_size} field in the @code{struct stat}. - -@item @code{statdata["atime"]} @tab -Corresponds to the @code{st_atime} field in the @code{struct stat}. - -@item @code{statdata["mtime"]} @tab -Corresponds to the @code{st_mtime} field in the @code{struct stat}. - -@item @code{statdata["ctime"]} @tab -Corresponds to the @code{st_ctime} field in the @code{struct stat}. - -@item @code{statdata["rdev"]} @tab -Corresponds to the @code{st_rdev} field in the @code{struct stat}. -This element is only present for device files. - -@item @code{statdata["major"]} @tab -Corresponds to the @code{st_major} field in the @code{struct stat}. -This element is only present for device files. - -@item @code{statdata["minor"]} @tab -Corresponds to the @code{st_minor} field in the @code{struct stat}. -This element is only present for device files. - -@item @code{statdata["blksize"]} @tab -Corresponds to the @code{st_blksize} field in the @code{struct stat}, -if this field is present on your system. -(It is present on all modern systems that we know of.) 
- -@item @code{statdata["pmode"]} @tab -A human-readable version of the mode value, such as printed by -@command{ls}. For example, @code{"-rwxr-xr-x"}. - -@item @code{statdata["linkval"]} @tab -If the named file is a symbolic link, this element will exist -and its value is the value of the symbolic link (where the -symbolic link points to). - -@item @code{statdata["type"]} @tab -The type of the file as a string. One of +@multitable @columnfractions .15 .50 .20 +@headitem Subscript @tab Field in @code{struct stat} @tab File type +@item @code{"name"} @tab The @value{FN} @tab All +@item @code{"dev"} @tab @code{st_dev} @tab All +@item @code{"ino"} @tab @code{st_ino} @tab All +@item @code{"mode"} @tab @code{st_mode} @tab All +@item @code{"nlink"} @tab @code{st_nlink} @tab All +@item @code{"uid"} @tab @code{st_uid} @tab All +@item @code{"gid"} @tab @code{st_gid} @tab All +@item @code{"size"} @tab @code{st_size} @tab All +@item @code{"atime"} @tab @code{st_atime} @tab All +@item @code{"mtime"} @tab @code{st_mtime} @tab All +@item @code{"ctime"} @tab @code{st_ctime} @tab All +@item @code{"rdev"} @tab @code{st_rdev} @tab Device files +@item @code{"major"} @tab @code{st_major} @tab Device files +@item @code{"minor"} @tab @code{st_minor} @tab Device files +@item @code{"blksize"} @tab @code{st_blksize} @tab All +@item @code{"pmode"} @tab A human-readable version of the mode value, such as printed by +@command{ls}. For example, @code{"-rwxr-xr-x"} @tab All +@item @code{"linkval"} @tab The value of the symbolic link @tab Symbolic links +@item @code{"type"} @tab The type of the file as a string. One of @code{"file"}, @code{"blockdev"}, @code{"chardev"}, @@ -31480,12 +33026,12 @@ The type of the file as a string. One of @code{"door"}, or @code{"unknown"}. -Not all systems support all file types. +Not all systems support all file types. @tab All @end multitable -@cindex @code{fts} extension function -@item flags = or(FTS_PHYSICAL, ...) 
-@itemx result = fts(pathlist, flags, filedata) +@cindex @code{fts()} extension function +@item @code{flags = or(FTS_PHYSICAL, ...)} +@itemx @code{result = fts(pathlist, flags, filedata)} Walk the file trees provided in @code{pathlist} and fill in the @code{filedata} array as described below. @code{flags} is the bitwise OR of several predefined constant values, also described below. @@ -31502,7 +33048,7 @@ The arguments are as follows: @table @code @item pathlist -An array of filenames. The element values are used; the index values are ignored. +An array of @value{FN}s. The element values are used; the index values are ignored. @item flags This should be the bitwise OR of one or more of the following @@ -31604,58 +33150,48 @@ See @file{test/fts.awk} in the @command{gawk} distribution for an example. @node Extension Sample Fnmatch @subsection Interface To @code{fnmatch()} -@cindex @code{fnmatch} extension function This extension provides an interface to the C library @code{fnmatch()} function. The usage is: -@example -@@load "fnmatch" +@table @code +@item @@load "fnmatch" +This is how you load the extension. -result = fnmatch(pattern, string, flags) -@end example +@cindex @code{fnmatch()} extension function +@item result = fnmatch(pattern, string, flags) +The return value is zero on success, @code{FNM_NOMATCH} +if the string did not match the pattern, or +a different non-zero value if an error occurred. +@end table -The @code{fnmatch} extension adds a single function named -@code{fnmatch()}, one constant (@code{FNM_NOMATCH}), and an array of -flag values named @code{FNM}. +Besides the @code{fnmatch()} function, the @code{fnmatch} extension +adds one constant (@code{FNM_NOMATCH}), and an array of flag values +named @code{FNM}. The arguments to @code{fnmatch()} are: @table @code @item pattern -The filename wildcard to match. +The @value{FN} wildcard to match. @item string -The filename string. +The @value{FN} string. 
@item flags Either zero, or the bitwise OR of one or more of the flags in the @code{FNM} array. @end table -The return value is zero on success, @code{FNM_NOMATCH} -if the string did not match the pattern, or -a different non-zero value if an error occurred. - The flags are as follows: @multitable @columnfractions .25 .75 -@item @code{FNM["CASEFOLD"]} @tab -Corresponds to the @code{FNM_CASEFOLD} flag as defined in @code{fnmatch()}. - -@item @code{FNM["FILE_NAME"]} @tab -Corresponds to the @code{FNM_FILE_NAME} flag as defined in @code{fnmatch()}. - -@item @code{FNM["LEADING_DIR"]} @tab -Corresponds to the @code{FNM_LEADING_DIR} flag as defined in @code{fnmatch()}. - -@item @code{FNM["NOESCAPE"]} @tab -Corresponds to the @code{FNM_NOESCAPE} flag as defined in @code{fnmatch()}. - -@item @code{FNM["PATHNAME"]} @tab -Corresponds to the @code{FNM_PATHNAME} flag as defined in @code{fnmatch()}. - -@item @code{FNM["PERIOD"]} @tab -Corresponds to the @code{FNM_PERIOD} flag as defined in @code{fnmatch()}. +@headitem Array element @tab Corresponding flag defined by @code{fnmatch()} +@item @code{FNM["CASEFOLD"]} @tab @code{FNM_CASEFOLD} +@item @code{FNM["FILE_NAME"]} @tab @code{FNM_FILE_NAME} +@item @code{FNM["LEADING_DIR"]} @tab @code{FNM_LEADING_DIR} +@item @code{FNM["NOESCAPE"]} @tab @code{FNM_NOESCAPE} +@item @code{FNM["PATHNAME"]} @tab @code{FNM_PATHNAME} +@item @code{FNM["PERIOD"]} @tab @code{FNM_PERIOD} @end multitable Here is an example: @@ -31677,21 +33213,21 @@ The @code{fork} extension adds three functions, as follows. @item @@load "fork" This is how you load the extension. -@cindex @code{fork} extension function +@cindex @code{fork()} extension function @item pid = fork() -This function creates a new process. The return value is the zero in the -child and the process-id number of the child in the parent, or @minus{}1 +This function creates a new process. The return value is zero in the +child and the process-ID number of the child in the parent, or @minus{}1 upon error.
In the latter case, @code{ERRNO} indicates the problem. In the child, @code{PROCINFO["pid"]} and @code{PROCINFO["ppid"]} are updated to reflect the correct values. -@cindex @code{waitpid} extension function +@cindex @code{waitpid()} extension function @item ret = waitpid(pid) -This function takes a numeric argument, which is the process-id to +This function takes a numeric argument, which is the process-ID to wait for. The return value is that of the @code{waitpid()} system call. -@cindex @code{wait} extension function +@cindex @code{wait()} extension function @item ret = wait() This function waits for the first child to die. The return value is that of the @@ -31746,8 +33282,8 @@ standard output to a temporary file configured to have the same owner and permissions as the original. After the file has been processed, the extension restores standard output to its original destination. If @code{INPLACE_SUFFIX} is not an empty string, the original file is -linked to a backup filename created by appending that suffix. Finally, -the temporary file is renamed to the original filename. +linked to a backup @value{FN} created by appending that suffix. Finally, +the temporary file is renamed to the original @value{FN}. If any error occurs, the extension issues a fatal error to terminate processing immediately without damaging the original file. @@ -31765,9 +33301,6 @@ $ @kbd{gawk -i inplace -v INPLACE_SUFFIX=.bak '@{ gsub(/foo/, "bar") @}} > @kbd{@{ print @}' file1 file2 file3} @end example -We leave it as an exercise to write a wrapper script that presents an -interface similar to @samp{sed -i}. - @node Extension Sample Ord @subsection Character and Numeric values: @code{ord()} and @code{chr()} @@ -31778,11 +33311,11 @@ The @code{ordchr} extension adds two functions, named @item @@load "ordchr" This is how you load the extension. 
-@cindex @code{ord} extension function +@cindex @code{ord()} extension function @item number = ord(string) Return the numeric value of the first character in @code{string}. -@cindex @code{chr} extension function +@cindex @code{chr()} extension function @item char = chr(number) Return a string whose first character is that represented by @code{number}. @end table @@ -31813,11 +33346,14 @@ on the command line (or with @code{getline}), they are read, with each entry returned as a record. The record consists of three fields. The first two are the inode number and the -filename, separated by a forward slash character. +@value{FN}, separated by a forward slash character. On systems where the directory entry contains the file type, the record has a third field (also separated by a slash) which is a single letter -indicating the type of the file: +indicating the type of the file. The letters and file types are shown +in @ref{table-readdir-file-types}. +@float Table,table-readdir-file-types +@caption{File Types Returned By @code{readdir()}} @multitable @columnfractions .1 .9 @headitem Letter @tab File Type @item @code{b} @tab Block device @@ -31829,6 +33365,7 @@ indicating the type of the file: @item @code{s} @tab Socket @item @code{u} @tab Anything else (unknown) @end multitable +@end float On systems without the file type information, the third field is always @samp{u}. @@ -31863,12 +33400,12 @@ Here is an example: BEGIN @{ REVOUT = 1 - print "hello, world" > "/dev/stdout" + print "don't panic" > "/dev/stdout" @} @end example The output from this program is: -@samp{dlrow ,olleh}. +@samp{cinap t'nod}.
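The byte reversal that produces this output can be sketched in C, the language of the extension API. This is a minimal illustration under stated assumptions, not the extension's actual source: @code{rev_copy()} is a hypothetical helper name that does not appear in @file{gawkapi.h}, and a real output wrapper would write through the output functions that @command{gawk} hands it rather than copy into a caller-supplied buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * rev_copy --- hypothetical helper, NOT part of the gawk API.
 * Copy the len bytes at src into dst in reverse order and
 * NUL-terminate dst.  dst must have room for len + 1 bytes.
 * Returns the number of bytes that were reversed.
 */
static size_t
rev_copy(char *dst, const char *src, size_t len)
{
	size_t i;

	assert(dst != NULL && src != NULL);

	/* Byte i of the output is byte (len - 1 - i) of the input. */
	for (i = 0; i < len; i++)
		dst[i] = src[len - 1 - i];
	dst[len] = '\0';

	return len;
}
```

Calling @code{rev_copy(out, "don't panic", 11)} with a sufficiently large @code{out} buffer leaves the string @samp{cinap t'nod} in @code{out}. The sketch reverses bytes, not multibyte characters, so it is only appropriate for single-byte text.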
@node Extension Sample Rev2way @subsection Two-Way I/O Example @@ -31885,13 +33422,22 @@ The following example shows how to use it: BEGIN @{ cmd = "/magic/mirror" - print "hello, world" |& cmd + print "don't panic" |& cmd cmd |& getline result print result close(cmd) @} @end example +The output from this program +@ifnotinfo +also is: +@end ifnotinfo +@ifinfo +is: +@end ifinfo +@samp{cinap t'nod}. + @node Extension Sample Read write array @subsection Dumping and Restoring An Array @@ -31899,14 +33445,14 @@ The @code{rwarray} extension adds two functions, named @code{writea()} and @code{reada()}, as follows: @table @code -@cindex @code{writea} extension function +@cindex @code{writea()} extension function @item ret = writea(file, array) This function takes a string argument, which is the name of the file -to which dump the array, and the array itself as the second argument. -@code{writea()} understands multidimensional arrays. It returns one on +to which to dump the array, and the array itself as the second argument. +@code{writea()} understands arrays of arrays. It returns one on success, or zero upon failure. -@cindex @code{reada} extension function +@cindex @code{reada()} extension function @item ret = reada(file, array) @code{reada()} is the inverse of @code{writea()}; it reads the file named as its first argument, filling in @@ -31943,17 +33489,23 @@ ret = reada("arraydump.bin", array) @subsection Reading An Entire File The @code{readfile} extension adds a single function -named @code{readfile()}: +named @code{readfile()}, and an input parser: @table @code @item @@load "readfile" This is how you load the extension. -@cindex @code{readfile} extension function +@cindex @code{readfile()} extension function @item result = readfile("/some/path") The argument is the name of the file to read. The return value is a string containing the entire contents of the requested file. Upon error, the function returns the empty string and sets @code{ERRNO}. 
+ +@item BEGIN @{ PROCINFO["readfile"] = 1 @} +In addition, the extension adds an input parser that is activated if +@code{PROCINFO["readfile"]} exists. +When activated, each input file is returned in its entirety as @code{$0}. +@code{RT} is set to the null string. @end table Here is an example: @@ -31982,25 +33534,24 @@ for more information. @node Extension Sample Time @subsection Extension Time Functions -These functions can be used either by invoking @command{gawk} -with a command-line argument of @samp{-l time} or by -inserting @samp{@@load "time"} in your script. +The @code{time} extension adds two functions, named @code{gettimeofday()} +and @code{sleep()}, as follows: @table @code @item @@load "time" This is how you load the extension. -@cindex @code{gettimeofday} extension function +@cindex @code{gettimeofday()} extension function @item the_time = gettimeofday() Return the time in seconds that has elapsed since 1970-01-01 UTC as a floating point value. If the time is unavailable on this platform, return @minus{}1 and set @code{ERRNO}. The returned time should have sub-second precision, but the actual precision may vary based on the platform. If the standard C @code{gettimeofday()} system call is available on this -platform, then it simply returns the value. Otherwise, if on Windows, +platform, then it simply returns the value. Otherwise, if on MS-Windows, it tries to use @code{GetSystemTimeAsFileTime()}. -@cindex @code{sleep} extension function +@cindex @code{sleep()} extension function @item result = sleep(@var{seconds}) Attempt to sleep for @var{seconds} seconds. If @var{seconds} is negative, or the attempt to sleep fails, return @minus{}1 and set @code{ERRNO}. @@ -32012,6 +33563,8 @@ tries to use @code{nanosleep()} or @code{select()} to implement the delay. 
@node gawkextlib @section The @code{gawkextlib} Project +@cindex @code{gawkextlib} +@cindex extensions, where to find @cindex @code{gawkextlib} project The @uref{http://sourceforge.net/projects/gawkextlib/, @code{gawkextlib}} @@ -32019,14 +33572,17 @@ project provides a number of @command{gawk} extensions, including one for processing XML files. This is the evolution of the original @command{xgawk} (XML @command{gawk}) project. -As of this writing, there are four extensions: +As of this writing, there are five extensions: -@itemize @bullet +@itemize @value{BULLET} @item XML parser extension, using the @uref{http://expat.sourceforge.net, Expat} XML parsing library. @item +PDF extension. + +@item PostgreSQL extension. @item @@ -32042,8 +33598,9 @@ The @code{time} extension described earlier (@pxref{Extension Sample Time}) was originally from this project but has been moved in to the main @command{gawk} distribution. +@cindex @command{git} utility You can check out the code for the @code{gawkextlib} project -using the @uref{http://git-scm.com, GIT} distributed source +using the @uref{http://git-scm.com, Git} distributed source code control system. The command is as follows: @example @@ -32059,7 +33616,7 @@ In addition, you must have the GNU Autotools installed @uref{http://www.gnu.org/software/automake, Automake}, @uref{http://www.gnu.org/software/libtool, Libtool}, and -@uref{http://www.gnu.org/software/gettext, Gettext}). +@uref{http://www.gnu.org/software/gettext, GNU @command{gettext}}). The simple recipe for building and testing @code{gawkextlib} is as follows. First, build and install @command{gawk}: @@ -32093,26 +33650,168 @@ If you write an extension that you wish to share with other @code{gawkextlib} project. See the project's web site for more information. 
-@iftex -@part Part IV:@* Appendices -@end iftex +@node Extension summary +@section Summary + +@itemize @value{BULLET} +@item +You can write extensions (sometimes called plug-ins) for @command{gawk} +in C or C++ using the Application Programming Interface (API) defined +by the @command{gawk} developers. + +@item +Extensions must have a license compatible with the GNU General Public +License (GPL), and they must assert that fact by declaring a variable +named @code{plugin_is_GPL_compatible}. + +@item +Communication between @command{gawk} and an extension is two-way. +@command{gawk} passes a @code{struct} to the extension which contains +various data fields and function pointers. The extension can then call +into @command{gawk} via the supplied function pointers to accomplish +certain tasks. + +@item +One of these tasks is to ``register'' the name and implementation of +a new @command{awk}-level function with @command{gawk}. The implementation +takes the form of a C function pointer with a defined signature. +By convention, implementation functions are named @code{do_@var{XXXX}()} +for some @command{awk}-level function @code{@var{XXXX}()}. + +@item +The API is defined in a header file named @file{gawkapi.h}. You must include +a number of standard header files @emph{before} including it in your source file. + +@item +API function pointers are provided for the following kinds of operations: + +@itemize @value{BULLET} +@item +Registration functions. You may register +extension functions, +exit callbacks, +a version string, +input parsers, +output wrappers, +and two-way processors. + +@item +Printing fatal, warning, and ``lint'' warning messages. + +@item +Updating @code{ERRNO}, or unsetting it. + +@item +Accessing parameters, including converting an undefined parameter into +an array. + +@item +Symbol table access: retrieving a global variable, creating one, +or changing one. + +@item +Allocating, reallocating, and releasing memory. 
+ +@item +Creating and releasing cached values; this provides an +efficient way to use values for multiple variables and +can be a big performance win. + +@item +Manipulating arrays: +retrieving, adding, deleting, and modifying elements; +getting the count of elements in an array; +creating a new array; +clearing an array; +and +flattening an array for easy C-style looping over all its indices and elements. +@end itemize + +@item +The API defines a number of standard data types for representing +@command{awk} values, array elements, and arrays. + +@item +The API provides convenience functions for constructing values. +It also provides memory management functions to ensure compatibility +between memory allocated by @command{gawk} and memory allocated by an +extension. + +@item +@emph{All} memory passed from @command{gawk} to an extension must be +treated as read-only by the extension. + +@item +@emph{All} memory passed from an extension to @command{gawk} must come from +the API's memory allocation functions. @command{gawk} takes responsibility for +the memory and will release it when appropriate. + +@item +The API provides information about the running version of @command{gawk} so +that an extension can make sure it is compatible with the @command{gawk} +that loaded it. + +@item +It is easiest to start a new extension by copying the boilerplate code +described in this @value{CHAPTER}. Macros in the @file{gawkapi.h} header file make +this easier to do. + +@item +The @command{gawk} distribution includes a number of small but useful +sample extensions. The @code{gawkextlib} project includes several more, +larger, extensions. If you wish to write an extension and contribute it +to the community of @command{gawk} users, the @code{gawkextlib} project +should be the place to do so. 
+ +@end itemize + +@node Extension Exercises +@section Exercises + +@enumerate +@item +Add functions to implement system calls such as @code{chown()}, +@code{chmod()}, and @code{umask()} to the file operations extension +presented in @ref{Internal File Ops}. + +@item +(Hard.) +How would you provide namespaces in @command{gawk}, so that the +names of functions in different extensions don't conflict with each other? +If you come up with a really good scheme, contact the @command{gawk} +maintainer to tell him about it. + +@item +Write a wrapper script that provides an interface similar to +@samp{sed -i} for the ``inplace'' extension presented in +@ref{Extension Sample Inplace}. + +@end enumerate + +@ifnotinfo +@part @value{PART4}Appendices +@end ifnotinfo -@ignore @ifdocbook -@part Part IV:@* Appendices +@ifclear FOR_PRINT +Part IV contains the appendixes (including the two licenses that cover +the @command{gawk} source code and this @value{DOCUMENT}, respectively) +and the Glossary: +@end ifclear -Part IV provides the appendices, the Glossary, and two licenses that cover -the @command{gawk} source code and this @value{DOCUMENT}, respectively. -It contains the following appendices: +@ifset FOR_PRINT +Part IV contains two appendixes: +@end ifset -@itemize @bullet +@itemize @value{BULLET} @item @ref{Language History}. @item @ref{Installation}. +@ifclear FOR_PRINT @item @ref{Notes}. @@ -32127,24 +33826,31 @@ It contains the following appendices: @item @ref{GNU Free Documentation License}. +@end ifclear @end itemize @end ifdocbook -@end ignore @node Language History @appendix The Evolution of the @command{awk} Language -This @value{DOCUMENT} describes the GNU implementation of @command{awk}, which follows -the POSIX specification. -Many long-time @command{awk} users learned @command{awk} programming -with the original @command{awk} implementation in Version 7 Unix. -(This implementation was the basis for @command{awk} in Berkeley Unix, -through 4.3-Reno. 
Subsequent versions of Berkeley Unix, and some systems -derived from 4.4BSD-Lite, use various versions of @command{gawk} -for their @command{awk}.) -This @value{CHAPTER} briefly describes the -evolution of the @command{awk} language, with cross-references to other parts -of the @value{DOCUMENT} where you can find more information. +This @value{DOCUMENT} describes the GNU implementation of @command{awk}, +which follows the POSIX specification. Many long-time @command{awk} +users learned @command{awk} programming with the original @command{awk} +implementation in Version 7 Unix. (This implementation was the basis for +@command{awk} in Berkeley Unix, through 4.3-Reno. Subsequent versions +of Berkeley Unix, and some systems derived from 4.4BSD-Lite, used various +versions of @command{gawk} for their @command{awk}.) This @value{CHAPTER} +briefly describes the evolution of the @command{awk} language, with +cross-references to other parts of the @value{DOCUMENT} where you can +find more information. + +@ifset FOR_PRINT +To save space, we have omitted +information on the history of features in @command{gawk} from this +edition. You can find it in the +@uref{http://www.gnu.org/software/gawk/manual/html_node/Feature-History.html, +online documentation}. +@end ifset @menu * V7/SVR3.1:: The major changes between V7 and System V @@ -32156,9 +33862,11 @@ of the @value{DOCUMENT} where you can find more information. @command{awk}. * POSIX/GNU:: The extensions in @command{gawk} not in POSIX @command{awk}. +* Feature History:: The history of the features in @command{gawk}. * Common Extensions:: Common Extensions Summary. * Ranges and Locales:: How locales used to affect regexp ranges. * Contributors:: The major contributors to @command{gawk}. +* History summary:: History summary. @end menu @node V7/SVR3.1 @@ -32173,7 +33881,7 @@ Version 7 Unix (1978) and the new version that was first made generally availabl System V Release 3.1 (1987). 
This @value{SECTION} summarizes the changes, with cross-references to further details: -@itemize @bullet +@itemize @value{BULLET} @item The requirement for @samp{;} to separate rules on a line (@pxref{Statements/Lines}). @@ -32264,7 +33972,7 @@ Multidimensional arrays The System V Release 4 (1989) version of Unix @command{awk} added these features (some of which originated in @command{gawk}): -@itemize @bullet +@itemize @value{BULLET} @item The @code{ENVIRON} array (@pxref{Built-in Variables}). @c gawk and MKS awk @@ -32324,7 +34032,7 @@ Processing of escape sequences inside command-line variable assignments The POSIX Command Language and Utilities standard for @command{awk} (1992) introduced the following changes into the language: -@itemize @bullet +@itemize @value{BULLET} @item The use of @option{-W} for implementation-specific options (@pxref{Options}). @@ -32349,7 +34057,7 @@ features of the language. In 2012, a number of extensions that had been commonly available for many years were finally added to POSIX. They are: -@itemize @bullet +@itemize @value{BULLET} @item The @code{fflush()} built-in function for flushing buffered output (@pxref{I/O Functions}). @@ -32386,7 +34094,7 @@ has made his version available via his home page This @value{SECTION} describes common extensions that originally appeared in his version of @command{awk}. -@itemize @bullet +@itemize @value{BULLET} @item The @samp{**} and @samp{**=} operators (@pxref{Arithmetic Ops} @@ -32404,7 +34112,7 @@ The @code{fflush()} built-in function for flushing buffered output @ignore @item The @code{SYMTAB} array, that allows access to @command{awk}'s internal symbol -table. This feature is not documented, largely because +table. This feature was never documented for his @command{awk}, largely because it is somewhat shakily implemented. For instance, you cannot access arrays or array elements through it. @end ignore @@ -32431,12 +34139,12 @@ A number of features have come and gone over the years. 
This @value{SECTION} summarizes the additional features over POSIX @command{awk} that are in the current version of @command{gawk}. -@itemize @bullet +@itemize @value{BULLET} @item Additional built-in variables: -@itemize @minus +@itemize @value{MINUS} @item The @code{ARGIND} @@ -32457,10 +34165,10 @@ variables @item Special files in I/O redirections: -@itemize @minus{} +@itemize @value{MINUS} @item The @file{/dev/stdin}, @file{/dev/stdout}, @file{/dev/stderr} and -@file{/dev/fd/@var{N}} special file names +@file{/dev/fd/@var{N}} special @value{FN}s (@pxref{Special Files}). @item @@ -32473,7 +34181,7 @@ IP protocol to use. @item Changes and/or additions to the language: -@itemize @minus{} +@itemize @value{MINUS} @item The @samp{\x} escape sequence (@pxref{Escape Sequences}). @@ -32512,7 +34220,7 @@ Directories on the command line produce a warning and are skipped @item New keywords: -@itemize @minus{} +@itemize @value{MINUS} @item The @code{BEGINFILE} and @code{ENDFILE} special patterns. (@pxref{BEGINFILE/ENDFILE}). @@ -32533,7 +34241,7 @@ The @code{switch} statement @item Changes to standard @command{awk} functions: -@itemize @minus +@itemize @value{MINUS} @item The optional second argument to @code{close()} that allows closing one end of a two-way pipe to a coprocess @@ -32566,7 +34274,7 @@ argument which is an array to hold the text of the field separators. 
@item Additional functions only in @command{gawk}: -@itemize @minus +@itemize @value{MINUS} @item The @code{and()}, @@ -32609,7 +34317,7 @@ functions for working with timestamps @item Changes and/or additions in the command-line options: -@itemize @minus +@itemize @value{MINUS} @item The @env{AWKPATH} environment variable for specifying a path search for the @option{-f} command-line option @@ -32684,10 +34392,10 @@ long options @item Support for the following obsolete systems was removed from the code -and the documentation for @command{gawk} version 4.0: +and the documentation for @command{gawk} @value{PVERSION} 4.0: @c nested table -@itemize @minus +@itemize @value{MINUS} @item Amiga @@ -32721,6 +34429,9 @@ Tandem (non-POSIX) @item Prestandard VAX C compiler for VAX/VMS +@item +GCC for VAX and Alpha has not been tested for a while. + @end itemize @end itemize @@ -32731,6 +34442,618 @@ Prestandard VAX C compiler for VAX/VMS @c ENDOFRANGE exgnot @c ENDOFRANGE posnot +@c This does not need to be in the formal book. +@ifclear FOR_PRINT +@node Feature History +@appendixsec History of @command{gawk} Features + +@ignore +See the thread: +https://groups.google.com/forum/#!topic/comp.lang.awk/SAUiRuff30c +This motivated me to add this section. +@end ignore + +@ignore +I've tried to follow this general order, esp.@: for the 3.0 and 3.1 sections: + variables + special files + language changes (e.g., hex constants) + differences in standard awk functions + new gawk functions + new keywords + new command-line options + behavioral changes + new ports +Within each category, be alphabetical. +@end ignore + +This @value{SECTION} describes the features in @command{gawk} +over and above those in POSIX @command{awk}, +in the order they were added to @command{gawk}. 
+ +Version 2.10 of @command{gawk} introduced the following features: + +@itemize @value{BULLET} +@item +The @env{AWKPATH} environment variable for specifying a path search for +the @option{-f} command-line option +(@pxref{Options}). + +@item +The @code{IGNORECASE} variable and its effects +(@pxref{Case-sensitivity}). + +@item +The @file{/dev/stdin}, @file{/dev/stdout}, @file{/dev/stderr} and +@file{/dev/fd/@var{N}} special @value{FN}s +(@pxref{Special Files}). +@end itemize + +Version 2.13 of @command{gawk} introduced the following features: + +@itemize @value{BULLET} +@item +The @code{FIELDWIDTHS} variable and its effects +(@pxref{Constant Size}). + +@item +The @code{systime()} and @code{strftime()} built-in functions for obtaining +and printing timestamps +(@pxref{Time Functions}). + +@item +Additional command-line options +(@pxref{Options}): + +@itemize @value{MINUS} +@item +The @option{-W lint} option to provide error and portability checking +for both the source code and at runtime. + +@item +The @option{-W compat} option to turn off the GNU extensions. + +@item +The @option{-W posix} option for full POSIX compliance. +@end itemize +@end itemize + +Version 2.14 of @command{gawk} introduced the following feature: + +@itemize @value{BULLET} +@item +The @code{next file} statement for skipping to the next @value{DF} +(@pxref{Nextfile Statement}). +@end itemize + +Version 2.15 of @command{gawk} introduced the following features: + +@itemize @value{BULLET} +@item +New variables (@pxref{Built-in Variables}): + +@itemize @value{MINUS} +@item +@code{ARGIND}, which tracks the movement of @code{FILENAME} +through @code{ARGV}. + +@item +@code{ERRNO}, which contains the system error message when +@code{getline} returns @minus{}1 or @code{close()} fails. +@end itemize + +@item +The @file{/dev/pid}, @file{/dev/ppid}, @file{/dev/pgrpid}, and +@file{/dev/user} special @value{FN}s. These have since been removed. 
+ +@item +The ability to delete all of an array at once with @samp{delete @var{array}} +(@pxref{Delete}). + +@item +Command line option changes +(@pxref{Options}): + +@itemize @value{MINUS} +@item +The ability to use GNU-style long-named options that start with @option{--}. + +@item +The @option{--source} option for mixing command-line and library-file +source code. +@end itemize +@end itemize + +Version 3.0 of @command{gawk} introduced the following features: + +@itemize @value{BULLET} +@item +New or changed variables: + +@itemize @value{MINUS} +@item +@code{IGNORECASE} changed, now applying to string comparison as well +as regexp operations +(@pxref{Case-sensitivity}). + +@item +@code{RT}, which contains the input text that matched @code{RS} +(@pxref{Records}). +@end itemize + +@item +Full support for both POSIX and GNU regexps +(@pxref{Regexp}). + +@item +The @code{gensub()} function for more powerful text manipulation +(@pxref{String Functions}). + +@item +The @code{strftime()} function acquired a default time format, +allowing it to be called with no arguments +(@pxref{Time Functions}). + +@item +The ability for @code{FS} and for the third +argument to @code{split()} to be null strings +(@pxref{Single Character Fields}). + +@item +The ability for @code{RS} to be a regexp +(@pxref{Records}). + +@item +The @code{next file} statement became @code{nextfile} +(@pxref{Nextfile Statement}). + +@item +The @code{fflush()} function from +Brian Kernighan's @command{awk} +(then at Bell Laboratories; +@pxref{I/O Functions}). + +@item +New command line options: + +@itemize @value{MINUS} +@item +The @option{--lint-old} option to +warn about constructs that are not available in +the original Version 7 Unix version of @command{awk} +(@pxref{V7/SVR3.1}). + +@item +The @option{-m} option from Brian Kernighan's @command{awk}. (He was +still at Bell Laboratories at the time.) This was later removed from +both his @command{awk} and from @command{gawk}. 
+ +@item +The @option{--re-interval} option to provide interval expressions in regexps +(@pxref{Regexp Operators}). + +@item +The @option{--traditional} option was added as a better name for +@option{--compat} (@pxref{Options}). +@end itemize + +@item +The use of GNU Autoconf to control the configuration process +(@pxref{Quick Installation}). + +@item +Amiga support. +This has since been removed. + +@end itemize + +Version 3.1 of @command{gawk} introduced the following features: + +@itemize @value{BULLET} +@item +New variables +(@pxref{Built-in Variables}): + +@itemize @value{MINUS} +@item +@code{BINMODE}, for non-POSIX systems, +which allows binary I/O for input and/or output files +(@pxref{PC Using}). + +@item +@code{LINT}, which dynamically controls lint warnings. + +@item +@code{PROCINFO}, an array for providing process-related information. + +@item +@code{TEXTDOMAIN}, for setting an application's internationalization text domain +(@pxref{Internationalization}). +@end itemize + +@item +The ability to use octal and hexadecimal constants in @command{awk} +program source code +(@pxref{Nondecimal-numbers}). + +@item +The @samp{|&} operator for two-way I/O to a coprocess +(@pxref{Two-way I/O}). + +@item +The @file{/inet} special files for TCP/IP networking using @samp{|&} +(@pxref{TCP/IP Networking}). + +@item +The optional second argument to @code{close()} that allows closing one end +of a two-way pipe to a coprocess +(@pxref{Two-way I/O}). + +@item +The optional third argument to the @code{match()} function +for capturing text-matching subexpressions within a regexp +(@pxref{String Functions}). + +@item +Positional specifiers in @code{printf} formats for +making translations easier +(@pxref{Printf Ordering}). + +@item +A number of new built-in functions: + +@itemize @value{MINUS} +@item +The @code{asort()} and @code{asorti()} functions for sorting arrays +(@pxref{Array Sorting}). 
+ +@item +The @code{bindtextdomain()}, @code{dcgettext()} and @code{dcngettext()} functions +for internationalization +(@pxref{Programmer i18n}). + +@item +The @code{extension()} function and the ability to add +new built-in functions dynamically +(@pxref{Dynamic Extensions}). + +@item +The @code{mktime()} function for creating timestamps +(@pxref{Time Functions}). + +@item +The @code{and()}, @code{or()}, @code{xor()}, @code{compl()}, +@code{lshift()}, @code{rshift()}, and @code{strtonum()} functions +(@pxref{Bitwise Functions}). +@end itemize + +@item +@cindex @code{next file} statement +The support for @samp{next file} as two words was removed completely +(@pxref{Nextfile Statement}). + +@item +Additional command-line options +(@pxref{Options}): + +@itemize @value{MINUS} +@item +The @option{--dump-variables} option to print a list of all global variables. + +@item +The @option{--exec} option, for use in CGI scripts. + +@item +The @option{--gen-po} command-line option and the use of a leading +underscore to mark strings that should be translated +(@pxref{String Extraction}). + +@item +The @option{--non-decimal-data} option to allow non-decimal +input data +(@pxref{Nondecimal Data}). + +@item +The @option{--profile} option and @command{pgawk}, the +profiling version of @command{gawk}, for producing execution +profiles of @command{awk} programs +(@pxref{Profiling}). + +@item +The @option{--use-lc-numeric} option to force @command{gawk} +to use the locale's decimal point for parsing input data +(@pxref{Conversion}). +@end itemize + +@item +The use of GNU Automake to help in standardizing the configuration process +(@pxref{Quick Installation}). + +@item +The use of GNU @command{gettext} for @command{gawk}'s own message output +(@pxref{Gawk I18N}). + +@item +BeOS support. This was later removed. + +@item +Tandem support. This was later removed. + +@item +The Atari port became officially unsupported and was +later removed entirely. 
+ +@item +The source code changed to use ISO C standard-style function definitions. + +@item +POSIX compliance for @code{sub()} and @code{gsub()} +(@pxref{Gory Details}). + +@item +The @code{length()} function was extended to accept an array argument +and return the number of elements in the array +(@pxref{String Functions}). + +@item +The @code{strftime()} function acquired a third argument to +enable printing times as UTC +(@pxref{Time Functions}). +@end itemize + +Version 4.0 of @command{gawk} introduced the following features: + +@itemize @value{BULLET} + +@item +Variable additions: + +@itemize @value{MINUS} +@item +@code{FPAT}, which allows you to specify a regexp that matches +the fields, instead of matching the field separator +(@pxref{Splitting By Content}). + +@item +If @code{PROCINFO["sorted_in"]} exists, @samp{for(iggy in foo)} loops sort the +indices before looping over them. The value of this element +provides control over how the indices are sorted before the loop +traversal starts +(@pxref{Controlling Scanning}). + +@item +@code{PROCINFO["strftime"]}, which holds +the default format for @code{strftime()} +(@pxref{Time Functions}). +@end itemize + +@item +The special files @file{/dev/pid}, @file{/dev/ppid}, @file{/dev/pgrpid} +and @file{/dev/user} were removed. + +@item +Support for IPv6 was added via the @file{/inet6} special file. +@file{/inet4} forces IPv4 and @file{/inet} chooses the system +default, which is probably IPv4 +(@pxref{TCP/IP Networking}). + +@item +The use of @samp{\s} and @samp{\S} escape sequences in regular expressions +(@pxref{GNU Regexp Operators}). + +@item +Interval expressions became part of default regular expressions +(@pxref{Regexp Operators}). + +@item +POSIX character classes work even with @option{--traditional} +(@pxref{Regexp Operators}). + +@item +@code{break} and @code{continue} became invalid outside a loop, +even with @option{--traditional} +(@pxref{Break Statement}, and also see +@ref{Continue Statement}). 
+ +@item +@code{fflush()}, @code{nextfile}, and @samp{delete @var{array}} +are allowed if @option{--posix} or @option{--traditional}, since they +are all now part of POSIX. + +@item +An optional third argument to +@code{asort()} and @code{asorti()}, specifying how to sort +(@pxref{String Functions}). + +@item +The behavior of @code{fflush()} changed to match Brian Kernighan's @command{awk} +and for POSIX; now both @samp{fflush()} and @samp{fflush("")} +flush all open output redirections +(@pxref{I/O Functions}). + +@item +The @code{isarray()} +function which distinguishes if an item is an array +or not, to make it possible to traverse arrays of arrays +(@pxref{Type Functions}). + +@item +The @code{patsplit()} +function which gives the same capability as @code{FPAT}, for splitting +(@pxref{String Functions}). + +@item +An optional fourth argument to the @code{split()} function, +which is an array to hold the values of the separators +(@pxref{String Functions}). + +@item +Arrays of arrays +(@pxref{Arrays of Arrays}). + +@item +The @code{BEGINFILE} and @code{ENDFILE} special patterns +(@pxref{BEGINFILE/ENDFILE}). + +@item +Indirect function calls +(@pxref{Indirect Calls}). + +@item +@code{switch} / @code{case} are enabled by default +(@pxref{Switch Statement}). + +@item +Command line option changes +(@pxref{Options}): + +@itemize @value{MINUS} +@item +The @option{-b} and @option{--characters-as-bytes} options +which prevent @command{gawk} from treating input as a multibyte string. + +@item +The redundant @option{--compat}, @option{--copyleft}, and @option{--usage} +long options were removed. + +@item +The @option{--gen-po} option was finally renamed to the correct @option{--gen-pot}. + +@item +The @option{--sandbox} option which disables certain features. + +@item +All long options acquired corresponding short options, for use in @samp{#!} scripts. 
+@end itemize + +@item +Directories named on the command line now produce a warning, not a fatal +error, unless @option{--posix} or @option{--traditional} are used +(@pxref{Command line directories}). + +@item +The @command{gawk} internals were rewritten, bringing the @command{dgawk} +debugger and possibly improved performance +(@pxref{Debugger}). + +@item +Per the GNU Coding Standards, dynamic extensions must now define +a global symbol indicating that they are GPL-compatible +(@pxref{Plugin License}). + +@item +In POSIX mode, string comparisons use @code{strcoll()} / @code{wcscoll()} +(@pxref{POSIX String Comparison}). + +@item +The option for raw sockets was removed, since it was never implemented +(@pxref{TCP/IP Networking}). + +@item +Ranges of the form @samp{[d-h]} are treated as if they were in the +C locale, no matter what kind of regexp is being used, and even if +@option{--posix} +(@pxref{Ranges and Locales}). + +@item +Support was removed for the following systems: + +@itemize @value{MINUS} +@item +Atari + +@item +Amiga + +@item +BeOS + +@item +Cray + +@item +MIPS RiscOS + +@item +MS-DOS with Microsoft Compiler + +@item +MS-Windows with Microsoft Compiler + +@item +NeXT + +@item +SunOS 3.x, Sun 386 (Road Runner) + +@item +Tandem (non-POSIX) + +@item +Prestandard VAX C compiler for VAX/VMS +@end itemize +@end itemize + +Version 4.1 of @command{gawk} introduced the following features: + +@itemize @value{BULLET} + +@item +Three new arrays: +@code{SYMTAB}, @code{FUNCTAB}, and @code{PROCINFO["identifiers"]} +(@pxref{Auto-set}). + +@item +The three executables @command{gawk}, @command{pgawk}, and @command{dgawk}, were merged into +one, named just @command{gawk}. As a result the command line options changed. + +@item +Command line option changes +(@pxref{Options}): + +@itemize @value{MINUS} +@item +The @option{-D} option invokes the debugger. + +@item +The @option{-i} and @option{--include} options +load @command{awk} library files. 
+ +@item +The @option{-l} and @option{--load} options load compiled dynamic extensions. + +@item +The @option{-M} and @option{--bignum} options enable MPFR. + +@item +The @option{-o} option only does pretty-printing. + +@item +The @option{-p} option is used for profiling. + +@item +The @option{-R} option was removed. +@end itemize + +@item +Support for high precision arithmetic with MPFR +(@pxref{Arbitrary Precision Arithmetic}). + +@item +The @code{and()}, @code{or()} and @code{xor()} functions +changed to allow any number of arguments, +with a minimum of two +(@pxref{Bitwise Functions}). + +@item +The dynamic extension interface was completely redone +(@pxref{Dynamic Extensions}). + +@end itemize + +@c XXX ADD MORE STUFF HERE +@end ifclear + @node Common Extensions @appendixsec Common Extensions Summary @@ -32744,18 +35067,18 @@ the three most widely-used freely available versions of @command{awk} @multitable {@file{/dev/stderr} special file} {BWK Awk} {Mawk} {GNU Awk} @headitem Feature @tab BWK Awk @tab Mawk @tab GNU Awk @item @samp{\x} Escape sequence @tab X @tab X @tab X -@item @code{RS} as regexp @tab @tab X @tab X @item @code{FS} as null string @tab X @tab X @tab X -@item @file{/dev/stdin} special file @tab X @tab @tab X +@item @file{/dev/stdin} special file @tab X @tab X @tab X @item @file{/dev/stdout} special file @tab X @tab X @tab X @item @file{/dev/stderr} special file @tab X @tab X @tab X -@item @code{**} and @code{**=} operators @tab X @tab @tab X +@item @code{delete} without subscript @tab X @tab X @tab X @item @code{fflush()} function @tab X @tab X @tab X -@item @code{func} keyword @tab X @tab @tab X +@item @code{length()} of an array @tab X @tab X @tab X @item @code{nextfile} statement @tab X @tab X @tab X -@item @code{delete} without subscript @tab X @tab X @tab X -@item @code{length()} of an array @tab X @tab @tab X +@item @code{**} and @code{**=} operators @tab X @tab @tab X +@item @code{func} keyword @tab X @tab @tab X @item @code{BINMODE} variable 
@tab @tab X @tab X +@item @code{RS} as regexp @tab @tab X @tab X @item Time related functions @tab @tab X @tab X @end multitable @@ -32775,7 +35098,7 @@ character ranges (such as @samp{[a-z]}) to match any character between the first character in the range and the last character in the range, inclusive. Ordering was based on the numeric value of each character in the machine's native character set. Thus, on ASCII-based systems, -@code{[a-z]} matched all the lowercase letters, and only the lowercase +@samp{[a-z]} matched all the lowercase letters, and only the lowercase letters, since the numeric values for the letters from @samp{a} through @samp{z} were contiguous. (On an EBCDIC system, the range @samp{[a-z]} includes additional, non-alphabetic characters as well.) @@ -32786,7 +35109,7 @@ as working in this fashion, and in particular, would teach that the that @samp{[A-Z]} was the ``correct'' way to match uppercase letters. And indeed, this was true.@footnote{And Life was good.} -The 1993 POSIX standard introduced the idea of locales (@pxref{Locales}). +The 1992 POSIX standard introduced the idea of locales (@pxref{Locales}). Since many locales include other letters besides the plain twenty-six letters of the American English alphabet, the POSIX standard added character classes (@pxref{Bracket Expressions}) as a way to match @@ -32825,9 +35148,10 @@ This output is unexpected, since the @samp{bc} at the end of This result is due to the locale setting (and thus you may not see it on your system). +@cindex Unicode Similar considerations apply to other ranges. For example, @samp{["-/]} is perfectly valid in ASCII, but is not valid in many Unicode locales, -such as @samp{en_US.UTF-8}. +such as @code{en_US.UTF-8}. Early versions of @command{gawk} used regexp matching code that was not locale aware, so ranges had their traditional interpretation. 
@@ -32836,18 +35160,19 @@ When @command{gawk} switched to using locale-aware regexp matchers, the problems began; especially as both GNU/Linux and commercial Unix vendors started implementing non-ASCII locales, @emph{and making them the default}. Perhaps the most frequently asked question became something -like ``why does @code{[A-Z]} match lowercase letters?!?'' +like ``why does @samp{[A-Z]} match lowercase letters?!?'' +@cindex Berry, Karl This situation existed for close to 10 years, if not more, and the @command{gawk} maintainer grew weary of trying to explain that @command{gawk} was being nicely standards-compliant, and that the issue -was in the user's locale. During the development of version 4.0, +was in the user's locale. During the development of @value{PVERSION} 4.0, he modified @command{gawk} to always treat ranges in the original, pre-POSIX fashion, unless @option{--posix} was used (@pxref{Options}).@footnote{And -thus was born the Campain for Rational Range Interpretation (or RRI). A number -of GNU tools, such as @command{grep} and @command{sed}, have either -implemented this change, or will soon. Thanks to Karl Berry for coining the phrase -``Rational Range Interpretation.''} +thus was born the Campaign for Rational Range Interpretation (or +RRI). A number of GNU tools have either implemented this change, +or will soon. Thanks to Karl Berry for coining the phrase ``Rational +Range Interpretation.''} Fortunately, shortly before the final release of @command{gawk} 4.0, the maintainer learned that the 2008 standard had changed the @@ -32860,7 +35185,7 @@ and By using this lovely technical term, the standard gives license to implementors to implement ranges in whatever way they choose. 
The @command{gawk} maintainer chose to apply the pre-POSIX meaning in all -cases: the default regexp matching; with @option{--traditional}, and with +cases: the default regexp matching; with @option{--traditional} and with @option{--posix}; in all cases, @command{gawk} remains POSIX compliant. @node Contributors @@ -32874,7 +35199,7 @@ cases: the default regexp matching; with @option{--traditional}, and with This @value{SECTION} names the major contributors to @command{gawk} and/or this @value{DOCUMENT}, in approximate chronological order: -@itemize @bullet +@itemize @value{BULLET} @item @cindex Aho, Alfred @cindex Weinberger, Peter @@ -32954,8 +35279,8 @@ provided the initial port to OS/2 and its documentation. Michal Jaegermann provided the port to Atari systems and its documentation. (This port is no longer supported.) -He continues to provide portability checking with DEC Alpha -systems, and has done a lot of work to make sure @command{gawk} +He continues to provide portability checking, +and has done a lot of work to make sure @command{gawk} works on non-32-bit systems. @item @@ -33026,7 +35351,7 @@ provided the port to BeOS and its documentation. @cindex Peters, Arno Arno Peters did the initial work to convert @command{gawk} to use -GNU Automake and GNU @code{gettext}. +GNU Automake and GNU @command{gettext}. @item @cindex Broder, Alan J.@: @@ -33056,17 +35381,25 @@ environments. (This is no longer supported) @item +@cindex Wallin, Anders +Anders Wallin helped keep the VMS port going for several years. + +@item +@cindex Gordon, Assaf +Assaf Gordon contributed the code to implement the +@option{--sandbox} option. + +@item @cindex Haque, John John Haque made the following contributions: -@itemize @minus +@itemize @value{MINUS} @item The modifications to convert @command{gawk} into a byte-code interpreter, including the debugger. @item -The addition of true multidimensional arrays. -@ref{Arrays of Arrays}. +The addition of true arrays of arrays. 
@item The additional modifications for support of arbitrary precision arithmetic. @@ -33087,6 +35420,10 @@ The improved array sorting features were driven by John together with Pat Rankin. @end itemize +@cindex Papadopoulos, Panos +@item +Panos Papadopoulos contributed the original text for @ref{Include Files}. + @item @cindex Yawitz, Efraim Efraim Yawitz contributed the original text for @ref{Debugger}. @@ -33099,17 +35436,57 @@ Arnold Robbins and Andrew Schorr, with notable contributions from the rest of the development team. @item +@cindex Colombo, Antonio +Antonio Giovanni Colombo rewrote a number of examples in the early +chapters that were severely dated, for which I am incredibly grateful. + +@item @cindex Robbins, Arnold Arnold Robbins has been working on @command{gawk} since 1988, at first helping David Trueman, and as the primary maintainer since around 1994. @end itemize +@node History summary +@appendixsec Summary + +@itemize @value{BULLET} +@item +The @command{awk} language has evolved over time. The first release +was with V7 Unix circa 1978. In 1987 for System V Release 3.1, +major additions, including user-defined functions, were made to the language. +Additional changes were made for System V Release 4, in 1989. +Since then, further minor changes happen under the auspices of the +POSIX standard. + +@item +Brian Kernighan's @command{awk} provides a small number of extensions +that are implemented in common with other versions of @command{awk}. + +@item +@command{gawk} provides a large number of extensions over POSIX @command{awk}. +They can be disabled with either the @option{--traditional} or @option{--posix} +options. + +@item +The interaction of POSIX locales and regexp matching in @command{gawk} has been confusing over +the years. 
Today, @command{gawk} implements Rational Range Interpretation, where +ranges of the form @samp{[a-z]} match @emph{only} the characters numerically between +@samp{a} through @samp{z} in the machine's native character set. Usually this is ASCII +but it can be EBCDIC on IBM S/390 systems. + +@item +Many people have contributed to @command{gawk} development over the years. +We hope that the list provided in this @value{CHAPTER} is complete and gives +the appropriate credit where credit is due. + +@end itemize + @node Installation @appendix Installing @command{gawk} @c last two commas are part of see also -@cindex operating systems, See Also GNU/Linux, PC operating systems, Unix +@cindex operating systems, See Also GNU/Linux@comma{} PC operating systems@comma{} Unix @c STARTOFRANGE gligawk @cindex @command{gawk}, installing @c STARTOFRANGE ingawk @@ -33130,6 +35507,7 @@ the respective ports. * Bugs:: Reporting Problems and Bugs. * Other Versions:: Other freely available @command{awk} implementations. +* Installation summary:: Summary of installation. @end menu @node Gawk Distribution @@ -33149,9 +35527,9 @@ subdirectories. @node Getting @appendixsubsec Getting the @command{gawk} Distribution @cindex @command{gawk}, source code@comma{} obtaining -There are three ways to get GNU software: +There are two ways to get GNU software: -@itemize @bullet +@itemize @value{BULLET} @item Copy it from someone else who already has it. @@ -33190,7 +35568,6 @@ file and then use @code{tar} to extract it. You can use the following pipeline to produce the @command{gawk} distribution: @example -# Under System V, add 'o' to the tar options gzip -d -c gawk-@value{VERSION}.@value{PATCHLEVEL}.tar.gz | tar -xvpf - @end example @@ -33206,7 +35583,7 @@ Extracting the archive creates a directory named @file{gawk-@value{VERSION}.@value{PATCHLEVEL}} in the current directory. 
-The distribution file name is of the form +The distribution @value{FN} is of the form @file{gawk-@var{V}.@var{R}.@var{P}.tar.gz}. The @var{V} represents the major version of @command{gawk}, the @var{R} represents the current release of version @var{V}, and @@ -33345,8 +35722,8 @@ actual @file{Makefile} for creating the documentation. @item Makefile.am @itemx */Makefile.am -Files used by the GNU @command{automake} software for generating -the @file{Makefile.in} files used by @command{autoconf} and +Files used by the GNU Automake software for generating +the @file{Makefile.in} files used by Autoconf and @command{configure}. @item Makefile.in @@ -33389,15 +35766,23 @@ They are installed as part of the installation process. The rest of the programs in this @value{DOCUMENT} are available in appropriate subdirectories of @file{awklib/eg}. +@item extension/* +The source code, manual pages, and infrastructure files for +the sample extensions included with @command{gawk}. +@xref{Dynamic Extensions}, for more information. + @item posix/* Files needed for building @command{gawk} on POSIX-compliant systems. @item pc/* -Files needed for building @command{gawk} under MS-Windows and OS/2 +Files needed for building @command{gawk} under MS-Windows +@ifclear FOR_PRINT +and OS/2 +@end ifclear (@pxref{PC Installation}, for details). @item vms/* -Files needed for building @command{gawk} under VMS +Files needed for building @command{gawk} under Vax/VMS and OpenVMS (@pxref{VMS Installation}, for details). @item test/* @@ -33434,9 +35819,9 @@ to @file{gawk-@value{VERSION}.@value{PATCHLEVEL}}. Like most GNU software, @command{gawk} is configured automatically for your system by running the @command{configure} program. This program is a Bourne shell script that is generated automatically using -GNU @command{autoconf}. +GNU Autoconf. 
@ifnotinfo -(The @command{autoconf} software is +(The Autoconf software is described fully in @cite{Autoconf---Generating Automatic Configuration Scripts}, which can be found online at @@ -33444,7 +35829,7 @@ which can be found online at the Free Software Foundation's web site}.) @end ifnotinfo @ifinfo -(The @command{autoconf} software is described fully starting with +(The Autoconf software is described fully starting with @inforef{Top, , Autoconf, autoconf,Autoconf---Generating Automatic Configuration Scripts}.) @end ifinfo @@ -33492,7 +35877,7 @@ please send in a bug report (@pxref{Bugs}). Of course, once you've built @command{gawk}, it is likely that you will wish to install it. To do so, you need to run the command @samp{make -check}, as a user with the appropriate permissions. How to do this +install}, as a user with the appropriate permissions. How to do this varies by system, but on many systems you can use the @command{sudo} command to do so. The command then becomes @samp{sudo make install}. It is likely that you will be asked for your password, and you will have @@ -33509,7 +35894,7 @@ command line when compiling @command{gawk} from scratch, including: @table @code -@cindex @code{--disable-extensions} configuration option +@cindex @option{--disable-extensions} configuration option @cindex configuration option, @code{--disable-extensions} @item --disable-extensions Disable configuring and building the sample extensions in the @@ -33517,7 +35902,7 @@ Disable configuring and building the sample extensions in the The default action is to dynamically check if the extensions can be configured and compiled. -@cindex @code{--disable-lint} configuration option +@cindex @option{--disable-lint} configuration option @cindex configuration option, @code{--disable-lint} @item --disable-lint Disable all lint checking within @code{gawk}. The @@ -33537,17 +35922,17 @@ Using this option may bring you some slight performance improvement. 
Using this option will cause some of the tests in the test suite to fail. This option may be removed at a later date. -@cindex @code{--disable-nls} configuration option +@cindex @option{--disable-nls} configuration option @cindex configuration option, @code{--disable-nls} @item --disable-nls Disable all message-translation facilities. This is usually not desirable, but it may bring you some slight performance improvement. -@cindex @code{--with-whiny-user-strftime} configuration option +@cindex @option{--with-whiny-user-strftime} configuration option @cindex configuration option, @code{--with-whiny-user-strftime} @item --with-whiny-user-strftime -Force use of the included version of the @code{strftime()} +Force use of the included version of the C @code{strftime()} function for deficient systems. @end table @@ -33594,9 +35979,9 @@ should not have. @file{custom.h} is automatically included by @file{config.h}. It is also possible that the @command{configure} program generated by -@command{autoconf} will not work on your system in some other fashion. +Autoconf will not work on your system in some other fashion. If you do have a problem, the file @file{configure.ac} is the input for -@command{autoconf}. You may be able to change this file and generate a +Autoconf. You may be able to change this file and generate a new version of @command{configure} that works on your system (@pxref{Bugs}, for information on how to report problems in configuring @command{gawk}). @@ -33624,16 +36009,21 @@ various non-Unix systems. @cindex PC operating systems@comma{} @command{gawk} on, installing @cindex operating systems, PC@comma{} @command{gawk} on, installing This @value{SECTION} covers installation and usage of @command{gawk} on x86 machines +@ifclear FOR_PRINT running MS-DOS, any version of MS-Windows, or OS/2. +@end ifclear +@ifset FOR_PRINT +running MS-DOS and any version of MS-Windows. 
+@end ifset In this @value{SECTION}, the term ``Windows32'' -refers to any of Microsoft Windows-95/98/ME/NT/2000/XP/Vista/7. +refers to any of Microsoft Windows-95/98/ME/NT/2000/XP/Vista/7/8. -The limitations of MS-DOS (and MS-DOS shells under Windows32 or OS/2) has meant -that various ``DOS extenders'' are often used with programs such as -@command{gawk}. The varying capabilities of Microsoft Windows 3.1 -and Windows32 can add to the confusion. For an overview of the -considerations, please refer to @file{README_d/README.pc} in the -distribution. +The limitations of MS-DOS (and MS-DOS shells under the other operating +systems) has meant that various ``DOS extenders'' are often used with +programs such as @command{gawk}. The varying capabilities of Microsoft +Windows 3.1 and Windows32 can add to the confusion. For an overview +of the considerations, please refer to @file{README_d/README.pc} in +the distribution. @menu * PC Binary Installation:: Installing a prepared distribution. @@ -33647,6 +36037,7 @@ distribution. * MSYS:: Using @command{gawk} In The MSYS Environment. @end menu +@ifclear FOR_PRINT @node PC Binary Installation @appendixsubsubsec Installing a Prepared Distribution for PC Systems @@ -33685,13 +36076,21 @@ install-info --info-dir=x:/usr/info x:/usr/info/gawkinet.info The binary distribution may contain a separate file containing additional or more detailed installation instructions. +@end ifclear @node PC Compiling @appendixsubsubsec Compiling @command{gawk} for PC Operating Systems +@ifclear FOR_PRINT @command{gawk} can be compiled for MS-DOS, Windows32, and OS/2 using the GNU -development tools from DJ Delorie (DJGPP: MS-DOS only) or Eberhard -Mattes (EMX: MS-DOS, Windows32 and OS/2). The file +development tools from DJ Delorie (DJGPP: MS-DOS only), MinGW (Windows32) or Eberhard +Mattes (EMX: MS-DOS, Windows32 and OS/2). 
+@end ifclear +@ifset FOR_PRINT +@command{gawk} can be compiled for MS-DOS and Windows32 using the GNU +development tools from DJ Delorie (DJGPP: MS-DOS only) or MinGW (Windows32). +@end ifset +The file @file{README_d/README.pc} in the @command{gawk} distribution contains additional notes, and @file{pc/Makefile} contains important information on compilation options. @@ -33713,6 +36112,7 @@ build @command{gawk} using the DJGPP tools, enter @samp{make djgpp}. @uref{ftp://ftp.delorie.com/pub/djgpp/current/v2gnu/}.) To build a native MS-Windows binary of @command{gawk}, type @samp{make mingw32}. +@ifclear FOR_PRINT @cindex compiling @command{gawk} with EMX for OS/2 The 32 bit EMX version of @command{gawk} works ``out of the box'' under OS/2. However, it is highly recommended to use GCC 2.95.3 for the compilation. @@ -33747,7 +36147,7 @@ and @option{--libexecdir=c:/usr/lib}. @end ignore @ignore -The internal @code{gettext} library tends to be problematic. It is therefore recommended +The internal @command{gettext} library tends to be problematic. It is therefore recommended to use either an external one (@option{--without-included-gettext}) or to disable NLS entirely (@option{--disable-nls}). @end ignore @@ -33784,8 +36184,11 @@ Ancient OS/2 ports of GNU @command{make} are not able to handle the Makefiles of this package. If you encounter any problems with @command{make}, try GNU Make 3.79.1 or later versions. You should find the latest version on -@uref{ftp://hobbes.nmsu.edu/pub/os2/}. +@uref{ftp://hobbes.nmsu.edu/pub/os2/}.@footnote{As of May, 2014, +this site is still there, but the author could not find a package +for GNU Make.} @end quotation +@end ifclear @node PC Testing @appendixsubsubsec Testing @command{gawk} on PC Operating Systems @@ -33797,6 +36200,7 @@ be converted so that they have the usual MS-DOS-style end-of-line markers. 
Alternatively, run @command{make check CMP="diff -a"} to use GNU @command{diff} in text mode instead of @command{cmp} to compare the resulting files. +@ifclear FOR_PRINT Most of the tests work properly with Stewartson's shell along with the companion utilities or appropriate GNU utilities. However, some editing of @@ -33809,7 +36213,7 @@ On OS/2 the @code{pid} test fails because @code{spawnl()} is used instead of @code{fork()}/@code{execl()} to start child processes. Also the @code{mbfw1} and @code{mbprintf1} tests fail because the needed multibyte functionality is not available. - +@end ifclear @node PC Using @appendixsubsubsec Using @command{gawk} on PC Operating Systems @@ -33818,15 +36222,15 @@ multibyte functionality is not available. @c STARTOFRANGE pcgawon @cindex PC operating systems, @command{gawk} on -With the exception of the Cygwin environment, -the @samp{|&} operator and TCP/IP networking -(@pxref{TCP/IP Networking}) -are not supported for MS-DOS or MS-Windows. EMX (OS/2 only) does support -at least the @samp{|&} operator. +Under MS-DOS and MS-Windows, the Cygwin and MinGW environments support +both the @samp{|&} operator and TCP/IP networking +(@pxref{TCP/IP Networking}). +@ifclear FOR_PRINT +EMX (OS/2 only) supports at least the @samp{|&} operator. +@end ifclear @cindex search paths @cindex search paths, for source files -@cindex @command{gawk}, OS/2 version of @cindex @command{gawk}, MS-DOS version of @cindex @command{gawk}, MS-Windows version of @cindex @code{;} (semicolon), @code{AWKPATH} variable and @@ -33837,36 +36241,50 @@ program files as described in @ref{AWKPATH Variable}. However, semicolons (rather than colons) separate elements in the @env{AWKPATH} variable. If @env{AWKPATH} is not set or is empty, then the default search path for MS-Windows and MS-DOS versions is -@code{@w{".;c:/lib/awk;c:/gnu/lib/awk"}}. +@samp{@w{.;c:/lib/awk;c:/gnu/lib/awk}}. 
+@ifclear FOR_PRINT +@cindex @command{gawk}, OS/2 version of @cindex @code{UNIXROOT} variable, on OS/2 systems The search path for OS/2 (32 bit, EMX) is determined by the prefix directory (most likely @file{/usr} or @file{c:/usr}) that has been specified as an option of -the @command{configure} script like it is the case for the Unix versions. +the @command{configure} script as is the case for the Unix versions. If @file{c:/usr} is the prefix directory then the default search path contains @file{.} and @file{c:/usr/share/awk}. Additionally, to support binary distributions of @command{gawk} for OS/2 -systems whose drive @samp{c:} might not support long file names or might not exist +systems whose drive @samp{c:} might not support long @value{FN}s or might not exist at all, there is a special environment variable. If @env{UNIXROOT} specifies a drive then this specific drive is also searched for program files. E.g., if @env{UNIXROOT} is set to @file{e:} the complete default search path is -@code{@w{".;c:/usr/share/awk;e:/usr/share/awk"}}. +@samp{@w{.;c:/usr/share/awk;e:/usr/share/awk}}. An @command{sh}-like shell (as opposed to @command{command.com} under MS-DOS or @command{cmd.exe} under MS-Windows or OS/2) may be useful for @command{awk} programming. The DJGPP collection of tools includes an MS-DOS port of Bash, and several shells are available for OS/2, including @command{ksh}. +@end ifclear +@ifset FOR_PRINT +An @command{sh}-like shell (as opposed to @command{command.com} under MS-DOS +or @command{cmd.exe} under MS-Windows) may be useful for @command{awk} programming. +The DJGPP collection of tools includes an MS-DOS port of Bash. 
+@end ifset @cindex common extensions, @code{BINMODE} variable @cindex extensions, common@comma{} @code{BINMODE} variable @cindex differences in @command{awk} and @command{gawk}, @code{BINMODE} variable @cindex @code{BINMODE} variable -Under MS-Windows, OS/2 and MS-DOS, @command{gawk} (and many other text programs) silently -translate end-of-line @code{"\r\n"} to @code{"\n"} on input and @code{"\n"} -to @code{"\r\n"} on output. A special @code{BINMODE} variable @value{COMMONEXT} +@ifclear FOR_PRINT +Under MS-Windows, OS/2 and MS-DOS, +@end ifclear +@ifset FOR_PRINT +Under MS-Windows and MS-DOS, +@end ifset +@command{gawk} (and many other text programs) silently +translate end-of-line @samp{\r\n} to @samp{\n} on input and @samp{\n} +to @samp{\r\n} on output. A special @code{BINMODE} variable @value{COMMONEXT} allows control over these translations and is interpreted as follows: -@itemize @bullet +@itemize @value{BULLET} @item If @code{BINMODE} is @code{"r"}, or one, then @@ -33904,7 +36322,7 @@ The name @code{BINMODE} was chosen to match @command{mawk} @command{mawk} adds a @samp{-W BINMODE=@var{N}} option and an environment variable that can set @code{BINMODE}, @code{RS}, and @code{ORS}. The files @file{binmode[1-3].awk} (under @file{gnu/lib/awk} in some of the -prepared distributions) have been chosen to match @command{mawk}'s @samp{-W +prepared binary distributions) have been chosen to match @command{mawk}'s @samp{-W BINMODE=@var{N}} option. These can be changed or discarded; in particular, the setting of @code{RS} giving the fewest ``surprises'' is open to debate. @command{mawk} uses @samp{RS = "\r\n"} if binary mode is set on read, which is @@ -33952,7 +36370,7 @@ moved into the @code{BEGIN} rule. @command{gawk} can be built and used ``out of the box'' under MS-Windows if you are using the @uref{http://www.cygwin.com, Cygwin environment}. 
-This environment provides an excellent simulation of Unix, using the +This environment provides an excellent simulation of GNU/Linux, using the GNU tools, such as Bash, the GNU Compiler Collection (GCC), GNU Make, and other GNU programs. Compilation and installation for Cygwin is the same as for a Unix system: @@ -33968,13 +36386,6 @@ When compared to GNU/Linux on the same system, the @samp{configure} step on Cygwin takes considerably longer. However, it does finish, and then the @samp{make} proceeds as usual. -@quotation NOTE -The @samp{|&} operator and TCP/IP networking -(@pxref{TCP/IP Networking}) -are fully supported in the Cygwin environment. This is not true -for any other environment on MS-Windows. -@end quotation - @node MSYS @appendixsubsubsec Using @command{gawk} In The MSYS Environment @@ -33987,7 +36398,7 @@ been ported to MS-Windows that expect @command{gawk} to do automatic translation of @code{"\r\n"}, since it won't. Caveat Emptor! @node VMS Installation -@appendixsubsec How to Compile and Install @command{gawk} on VMS +@appendixsubsec How to Compile and Install @command{gawk} on Vax/VMS and OpenVMS @c based on material from Pat Rankin <rankin@eql.caltech.edu> @c now rankin@pactechdata.com @@ -34000,8 +36411,11 @@ The older designation ``VMS'' is used throughout to refer to OpenVMS. @menu * VMS Compilation:: How to compile @command{gawk} under VMS. +* VMS Dynamic Extensions:: Compiling @command{gawk} dynamic extensions on + VMS. * VMS Installation Details:: How to install @command{gawk} under VMS. * VMS Running:: How to run @command{gawk} under VMS. +* VMS GNV:: The VMS GNV Project. * VMS Old Gawk:: An old version comes with some VMS systems. @end menu @@ -34009,41 +36423,110 @@ The older designation ``VMS'' is used throughout to refer to OpenVMS. 
@appendixsubsubsec Compiling @command{gawk} on VMS @cindex compiling @command{gawk} for VMS -To compile @command{gawk} under VMS, there is a @code{DCL} command procedure that -issues all the necessary @code{CC} and @code{LINK} commands. There is -also a @file{Makefile} for use with the @code{MMS} utility. From the source -directory, use either: +To compile @command{gawk} under VMS, there is a @code{DCL} command procedure +that issues all the necessary @code{CC} and @code{LINK} commands. There is +also a @file{Makefile} for use with the @code{MMS} and @code{MMK} utilities. +From the source directory, use either: @example -$ @kbd{@@[.VMS]VMSBUILD.COM} +$ @kbd{@@[.vms]vmsbuild.com} @end example @noindent or: @example -$ @kbd{MMS/DESCRIPTION=[.VMS]DESCRIP.MMS GAWK} +$ @kbd{MMS/DESCRIPTION=[.vms]descrip.mms gawk} @end example -Older versions of @command{gawk} could be built with VAX C or -GNU C on VAX/VMS, as well as with DEC C, but that is no longer -supported. DEC C (also briefly known as ``Compaq C'' and now known -as ``HP C,'' but referred to here as ``DEC C'') is required. Both -@code{VMSBUILD.COM} and @code{DESCRIP.MMS} contain some obsolete support -for the older compilers but are set up to use DEC C by default. +@noindent +or: + +@example +$ @kbd{MMK/DESCRIPTION=[.vms]descrip.mms gawk} +@end example + +@command{MMK} is an open source, free, near-clone of @command{MMS} and +can better handle ODS-5 volumes with upper- and lowercase @value{FN}s. +@command{MMK} is available from @uref{https://github.com/endlesssoftware/mmk}. + +With ODS-5 volumes and extended parsing enabled, the case of the target +parameter may need to be exact. -@command{gawk} has been tested under Alpha/VMS 7.3-1 using Compaq C V6.4, -and on Alpha/VMS 7.3, Alpha/VMS 7.3-2, and IA64/VMS 8.3.@footnote{The IA64 -architecture is also known as ``Itanium.''} +@command{gawk} has been tested under VAX/VMS 7.3 and Alpha/VMS 7.3-1 +using Compaq C V6.4, and Alpha/VMS 7.3, Alpha/VMS 7.3-2, and IA64/VMS 8.3. 
+The most recent builds used HP C V7.3 on Alpha VMS 8.3 and both +Alpha and IA64 VMS 8.4 used HP C 7.3.@footnote{The IA64 architecture +is also known as ``Itanium.''} + +@xref{VMS GNV}, for information on building +@command{gawk} as a PCSI kit that is compatible with the GNV product. + +@node VMS Dynamic Extensions +@appendixsubsubsec Compiling @command{gawk} Dynamic Extensions on VMS + +The extensions that have been ported to VMS can be built using one of +the following commands. + +@example +$ @kbd{MMS/DESCRIPTION=[.vms]descrip.mms extensions} +@end example + +@noindent +or: + +@example +$ @kbd{MMK/DESCRIPTION=[.vms]descrip.mms extensions} +@end example + +@command{gawk} uses @code{AWKLIBPATH} as either an environment variable +or a logical name to find the dynamic extensions. + +Dynamic extensions need to be compiled with the same compiler options for +floating point, pointer size, and symbol name handling as were used +to compile @command{gawk} itself. +Alpha and Itanium should use IEEE floating point. The pointer size is 32 bits, +and the symbol name handling should be exact case with CRC shortening for +symbols longer than 32 bits. + +For Alpha and Itanium: + +@example +/name=(as_is,short) +/float=ieee/ieee_mode=denorm_results +@end example + +For VAX: + +@example +/name=(as_is,short) +@end example + +Compile time macros need to be defined before the first VMS-supplied +header file is included. + +@example +#if (__CRTL_VER >= 70200000) && !defined (__VAX) +#define _LARGEFILE 1 +#endif + +#ifndef __VAX +#ifdef __CRTL_VER +#if __CRTL_VER >= 80200000 +#define _USE_STD_STAT 1 +#endif +#endif +#endif +@end example @node VMS Installation Details @appendixsubsubsec Installing @command{gawk} on VMS -To install @command{gawk}, all you need is a ``foreign'' command, which is -a @code{DCL} symbol whose value begins with a dollar sign. 
For example: +To use @command{gawk}, all you need is a ``foreign'' command, which is a +@code{DCL} symbol whose value begins with a dollar sign. For example: @example -$ @kbd{GAWK :== $disk1:[gnubin]GAWK} +$ @kbd{GAWK :== $disk1:[gnubin]gawk} @end example @noindent @@ -34055,10 +36538,29 @@ Alternatively, the symbol may be placed in the system-wide @file{sylogin.com} procedure, which allows all users to run @command{gawk}. -Optionally, the help entry can be loaded into a VMS help library: +If your @command{gawk} was installed by a PCSI kit into the +@file{GNV$GNU:} directory tree, the program will be known as +@file{GNV$GNU:[bin]gnv$gawk.exe} and the help file will be +@file{GNV$GNU:[vms_help]gawk.hlp}. + +The PCSI kit also installs a @file{GNV$GNU:[vms_bin]gawk_verb.cld} file +which can be used to add @command{gawk} and @command{awk} as DCL commands. + +For just the current process you can use: + +@example +$ @kbd{set command gnv$gnu:[vms_bin]gawk_verb.cld} +@end example + +Or the system manager can use @file{GNV$GNU:[vms_bin]gawk_verb.cld} to +add the @command{gawk} and @command{awk} to the system wide @samp{DCLTABLES}. + +The DCL syntax is documented in the @file{gawk.hlp} file. + +Optionally, the @file{gawk.hlp} entry can be loaded into a VMS help library: @example -$ @kbd{LIBRARY/HELP SYS$HELP:HELPLIB [.VMS]GAWK.HLP} +$ @kbd{LIBRARY/HELP sys$help:helplib [.vms]gawk.hlp} @end example @noindent @@ -34076,11 +36578,11 @@ provides information about both the @command{gawk} implementation and the The logical name @samp{AWK_LIBRARY} can designate a default location for @command{awk} program files. For the @option{-f} option, if the specified -file name has no device or directory path information in it, @command{gawk} +@value{FN} has no device or directory path information in it, @command{gawk} looks in the current directory first, then in the directory specified by the translation of @samp{AWK_LIBRARY} if the file is not found. 
If, after searching in both directories, the file still is not found, -@command{gawk} appends the suffix @samp{.awk} to the filename and retries +@command{gawk} appends the suffix @samp{.awk} to the @value{FN} and retries the file search. If @samp{AWK_LIBRARY} has no definition, a default value of @samp{SYS$LIBRARY:} is used for it. @@ -34109,9 +36611,42 @@ One side effect of dual command-line parsing is that if there is only a single parameter (as in the quoted string program above), the command becomes ambiguous. To work around this, the normally optional @option{--} flag is required to force Unix-style parsing rather than @code{DCL} parsing. If any -other dash-type options (or multiple parameters such as data files to +other dash-type options (or multiple parameters such as @value{DF}s to process) are present, there is no ambiguity and @option{--} can be omitted. +@cindex exit status, of VMS +The @code{exit} value is a Unix-style value and is encoded to a VMS exit +status value when the program exits. + +The VMS severity bits will be set based on the @code{exit} value. +A failure is indicated by 1 and VMS sets the @code{ERROR} status. +A fatal error is indicated by 2 and VMS will set the @code{FATAL} status. +All other values will have the @code{SUCCESS} status. The exit value is +encoded to comply with VMS coding standards and will have the +@code{C_FACILITY_NO} of @code{0x350000} with the constant @code{0xA000} +added to the number shifted over by 3 bits to make room for the severity codes. + +To extract the actual @command{gawk} exit code from the VMS status use: + +@example +unix_status = (vms_status .and. &x7f8) / 8 +@end example + +@noindent +A C program that uses @code{exec()} to call @command{gawk} will get the original +Unix-style exit value. + +Older versions of @command{gawk} treated a Unix exit code 0 as 1, a failure +as 2, a fatal error as 4, and passed all the other numbers through. +This violated the VMS exit status coding requirements. 
+ +@cindex floating-point, VAX/VMS +VAX/VMS floating point uses unbiased rounding. @xref{Round Function}. + +VMS reports time values in GMT unless one of the @code{SYS$TIMEZONE_RULE} +or @code{TZ} logical names is set. Older versions of VMS, such as VAX/VMS +7.3 do not set these logical names. + @c @cindex directory search @c @cindex path, search @cindex search paths @@ -34123,6 +36658,21 @@ of @env{AWKPATH} is a comma-separated list of directory specifications. When defining it, the value should be quoted so that it retains a single translation and not a multitranslation @code{RMS} searchlist. +@node VMS GNV +@appendixsubsubsec The VMS GNV Project + +The VMS GNV package provides a build environment similar to POSIX with ports +of a collection of open source tools. The @command{gawk} found in the GNV +base kit is an older port. Currently the GNV project is being reorganized +to supply individual PCSI packages for each component. +See @w{@uref{https://sourceforge.net/p/gnv/wiki/InstallingGNVPackages/}.} + +The normal build procedure for @command{gawk} produces a program that +is suitable for use with GNV. + +The @file{vms/gawk_build_steps.txt} in the source documents the procedure +for building a VMS PCSI kit that is compatible with GNV. + @ignore @c The VMS POSIX product, also known as POSIX for OpenVMS, is long defunct @c and building gawk for it has not been tested in many years, but these @@ -34170,7 +36720,7 @@ define a symbol, as follows: $ @kbd{gawk :== $sys$common:[syshlp.examples.tcpip.snmp]gawk.exe} @end example -This is apparently version 2.15.6, which is extremely old. We +This is apparently @value{PVERSION} 2.15.6, which is extremely old. We recommend compiling and using the current version. @c ENDOFRANGE opgawx @@ -34199,8 +36749,8 @@ what you're trying to do. If it's not clear whether you should be able to do something or not, report that too; it's a bug in the documentation! 
Before reporting a bug or trying to fix it yourself, try to isolate it -to the smallest possible @command{awk} program and input data file that -reproduces the problem. Then send us the program and data file, +to the smallest possible @command{awk} program and input @value{DF} that +reproduces the problem. Then send us the program and @value{DF}, some idea of what kind of Unix system you're using, the compiler you used to compile @command{gawk}, and the exact results @command{gawk} gave you. Also say what you expected to occur; this helps @@ -34216,12 +36766,14 @@ Once you have a precise problem, send email to @EMAIL{bug-gawk@@gnu.org,bug-gawk at gnu dot org}. @cindex Robbins, Arnold -Using this address automatically sends a copy of your -mail to me. If necessary, I can be reached directly at +The @command{gawk} maintainers subscribe to this address and +thus they will receive your bug report. +If necessary, the primary maintainer can be reached directly at @EMAIL{arnold@@skeeve.com,arnold at skeeve dot com}. The bug reporting address is preferred since the email list is archived at the GNU Project. -@emph{All email should be in English, since that is my native language.} +@emph{All email should be in English. This is the only language +understood in common by all the maintainers.} @cindex @code{comp.lang.awk} newsgroup @quotation CAUTION @@ -34254,32 +36806,39 @@ mail at the Internet address noted previously. If you find bugs in one of the non-Unix ports of @command{gawk}, please send an electronic mail message to the person who maintains that port. They -are named in the following list, as well as in the @file{README} file in the @command{gawk} -distribution. Information in the @file{README} file should be considered -authoritative if it conflicts with this @value{DOCUMENT}. +are named in the following list, as well as in the @file{README} file +in the @command{gawk} distribution. 
Information in the @file{README} +file should be considered authoritative if it conflicts with this +@value{DOCUMENT}. The people maintaining the non-Unix ports of @command{gawk} are as follows: -@multitable {MS-Windows with MINGW} {123456789012345678901234567890123456789001234567890} +@c put the index entries outside the table, for docbook @cindex Deifik, Scott +@cindex Zaretskii, Eli +@cindex Buening, Andreas +@cindex Rankin, Pat +@cindex Malmberg, John +@cindex Pitts, Dave +@multitable {MS-Windows with MinGW} {123456789012345678901234567890123456789001234567890} @item MS-DOS with DJGPP @tab Scott Deifik, @EMAIL{scottd.mail@@sbcglobal.net,scottd dot mail at sbcglobal dot net}. -@cindex Zaretskii, Eli -@item MS-Windows with MINGW @tab Eli Zaretskii, @EMAIL{eliz@@gnu.org,eliz at gnu dot org}. +@item MS-Windows with MinGW @tab Eli Zaretskii, @EMAIL{eliz@@gnu.org,eliz at gnu dot org}. -@cindex Buening, Andreas +@c Leave this in the print version on purpose. +@c OS/2 is not mentioned anywhere else in the print version though. @item OS/2 @tab Andreas Buening, @EMAIL{andreas.buening@@nexgo.de,andreas dot buening at nexgo dot de}. -@cindex Rankin, Pat -@item VMS @tab Pat Rankin, @EMAIL{r.pat.rankin@@gmail.com,r.pat.rankin at gmail.com} +@item VMS @tab Pat Rankin, @EMAIL{r.pat.rankin@@gmail.com,r.pat.rankin at gmail.com}, and +John Malmberg, @EMAIL{wb8tyw@@qsl.net,wb8tyw at qsl.net}. -@cindex Pitts, Dave @item z/OS (OS/390) @tab Dave Pitts, @EMAIL{dpitts@@cozx.com,dpitts at cozx dot com}. @end multitable If your bug is also reproducible under Unix, please send a copy of your -report to the @EMAIL{bug-gawk@@gnu.org,bug-gawk at gnu dot org} email list as well. +report to the @EMAIL{bug-gawk@@gnu.org,bug-gawk at gnu dot org} email +list as well. 
@c ENDOFRANGE dbugg @c ENDOFRANGE tblgawb @@ -34308,7 +36867,7 @@ This @value{SECTION} briefly describes where to get them: @cindex Kernighan, Brian @cindex source code, Brian Kernighan's @command{awk} @cindex @command{awk}, versions of, See Also Brian Kernighan's @command{awk} -@cindex Brian Kernighan's @command{awk} +@cindex Brian Kernighan's @command{awk}, source code @item Unix @command{awk} Brian Kernighan, one of the original designers of Unix @command{awk}, has made his implementation of @@ -34328,6 +36887,7 @@ It is available in several archive formats: @uref{http://www.cs.princeton.edu/~bwk/btl.mirror/awk.zip} @end table +@cindex @command{git} utility You can also retrieve it from Git Hub: @example @@ -34347,12 +36907,17 @@ from GCC (the GNU Compiler Collection) works quite nicely. for a list of extensions in this @command{awk} that are not in POSIX @command{awk}. @cindex Brennan, Michael -@cindex @command{mawk} program +@cindex @command{mawk} utility @cindex source code, @command{mawk} @item @command{mawk} Michael Brennan wrote an independent implementation of @command{awk}, -called @command{mawk}. It is available under the GPL -(@pxref{Copying}), +called @command{mawk}. It is available under the +@ifclear FOR_PRINT +GPL (@pxref{Copying}), +@end ifclear +@ifset FOR_PRINT +GPL, +@end ifset just as @command{gawk} is. The original distribution site for the @command{mawk} source code @@ -34393,7 +36958,7 @@ To get @command{awka}, go to @url{http://sourceforge.net/projects/awka}. The project seems to be frozen; no new code changes have been made since approximately 2003. -@cindex Beebe, Nelson +@cindex Beebe, Nelson H.F.@: @cindex @command{pawk} (profiling version of Brian Kernighan's @command{awk}) @cindex source code, @command{pawk} @item @command{pawk} @@ -34421,10 +36986,10 @@ information, see the @uref{http://busybox.net, project's home page}. 
@cindex Solaris, POSIX-compliant @command{awk} @cindex source code, Solaris @command{awk} @item The OpenSolaris POSIX @command{awk} -The version of @command{awk} in @file{/usr/xpg4/bin} on Solaris is -more-or-less POSIX-compliant. It is based on the @command{awk} from -Mortice Kern Systems for PCs. -This author was able to make it compile and work under GNU/Linux +The versions of @command{awk} in @file{/usr/xpg4/bin} and +@file{/usr/xpg6/bin} on Solaris are more-or-less POSIX-compliant. +They are based on the @command{awk} from Mortice Kern Systems for PCs. +This author was able to make this code compile and work under GNU/Linux with 1--2 hours of work. Making it more generally portable (using GNU Autoconf and/or Automake) would take more work, and this has not been done, at least to our knowledge. @@ -34456,6 +37021,7 @@ This is an embeddable @command{awk} interpreter derived from @uref{http://repo.hu/projects/libmawk/}. @item @code{pawk} +@cindex source code, @command{pawk} (Python version) @cindex @code{pawk}, @command{awk}-like facilities for Python This is a Python module that claims to bring @command{awk}-like features to Python. See @uref{https://github.com/alecthomas/pawk} @@ -34478,15 +37044,56 @@ under the GPL. It has a large number of extensions over standard See @uref{http://www.quiktrim.org/QTawk.html} for more information, including the manual and a download link. +The project may also be frozen; no new code changes have been made +since approximately 2008. + @item Other Versions See also the @uref{http://en.wikipedia.org/wiki/Awk_language#Versions_and_implementations, Wikipedia article}, for information on additional versions. @end table +@c ENDOFRANGE awkim + +@node Installation summary +@appendixsec Summary + +@itemize @value{BULLET} +@item +The @command{gawk} distribution is available from GNU project's main +distribution site, @code{ftp.gnu.org}. 
The canonical build recipe is: + +@example +wget http://ftp.gnu.org/gnu/gawk/gawk-@value{VERSION}.@value{PATCHLEVEL}.tar.gz +tar -xvpzf gawk-@value{VERSION}.@value{PATCHLEVEL}.tar.gz +cd gawk-@value{VERSION}.@value{PATCHLEVEL} +./configure && make && make check +@end example + +@item +@command{gawk} may be built on non-POSIX systems as well. The currently +supported systems are MS-Windows using DJGPP, MSYS, MinGW and Cygwin, +@ifclear FOR_PRINT +OS/2 using EMX, +@end ifclear +and both VAX/VMS and OpenVMS. +Instructions for each system are included in this @value{CHAPTER}. + +@item +Bug reports should be sent via email to @email{bug-gawk@@gnu.org}. +Bug reports should be in English, and should include the version of @command{gawk}, +how it was compiled, and a short program and @value{DF} which demonstrate +the problem. + +@item +There are a number of other freely available @command{awk} +implementations. Many are POSIX compliant; others are less so. + +@end itemize + @c ENDOFRANGE gligawk @c ENDOFRANGE ingawk -@c ENDOFRANGE awkim +@ifclear FOR_PRINT @node Notes @appendix Implementation Notes @c STARTOFRANGE gawii @@ -34506,6 +37113,7 @@ maintainers of @command{gawk}. Everything in it applies specifically to * Implementation Limitations:: Some limitations of the implementation. * Extension Design:: Design notes about the extension API. * Old Extension Mechanism:: Some compatibility for old extensions. +* Notes summary:: Summary of implementation notes. @end menu @node Compatibility Mode @@ -34526,7 +37134,7 @@ is one more option available on the command line: @table @code @item -Y @itemx --parsedebug -Prints out the parse stack information as the program is being parsed. +Print out the parse stack information as the program is being parsed. @end table This option is intended only for serious @command{gawk} developers @@ -34550,8 +37158,8 @@ as well as any considerations you should bear in mind. @command{gawk}. 
* New Ports:: Porting @command{gawk} to a new operating system. -* Derived Files:: Why derived files are kept in the - @command{git} repository. +* Derived Files:: Why derived files are kept in the Git + repository. @end menu @node Accessing The Source @@ -34561,6 +37169,7 @@ As @command{gawk} is Free Software, the source code is always available. @ref{Gawk Distribution}, describes how to get and build the formal, released versions of @command{gawk}. +@cindex @command{git} utility However, if you want to modify @command{gawk} and contribute back your changes, you will probably wish to work with the development version. To do so, you will need to access the @command{gawk} source code @@ -34574,8 +37183,8 @@ git clone git://git.savannah.gnu.org/gawk.git @end example @noindent -This will clone the @command{gawk} repository. If you are behind a -firewall that will not allow you to use the Git native protocol, you +This clones the @command{gawk} repository. If you are behind a +firewall that does not allow you to use the Git native protocol, you can still access the repository using: @example @@ -34603,7 +37212,7 @@ that has a Git plug-in for working with Git repositories. You are free to add any new features you like to @command{gawk}. However, if you want your changes to be incorporated into the @command{gawk} distribution, there are several steps that you need to take in order to -make it possible to include your changes: +make it possible to include them: @enumerate 1 @item @@ -34625,14 +37234,15 @@ or @EMAIL{assign@@gnu.org,assign at gnu dot org}. @item Get the latest version. It is much easier for me to integrate changes if they are relative to -the most recent distributed version of @command{gawk}. If your version of -@command{gawk} is very old, I may not be able to integrate them at all. +the most recent distributed version of @command{gawk}, or better yet, +relative to the latest code in the Git repository. 
If your version of +@command{gawk} is very old, I may not be able to integrate your changes at all. (@xref{Getting}, for information on getting the latest version of @command{gawk}.) @item @ifnotinfo -Follow the @cite{GNU Coding Standards}. +Follow the @uref{http://www.gnu.org/prep/standards/, @cite{GNU Coding Standards}}. @end ifnotinfo @ifinfo See @inforef{Top, , Version, standards, GNU Coding Standards}. @@ -34653,7 +37263,7 @@ using the traditional ``K&R'' style, particularly as regards to the placement of braces and the use of TABs. In brief, the coding rules for @command{gawk} are as follows: -@itemize @bullet +@itemize @value{BULLET} @item Use ANSI/ISO style (prototype) function headers when defining functions. @@ -34736,6 +37346,7 @@ If possible, please update the @command{man} page as well. You will also have to sign paperwork for your documentation changes. +@cindex @command{git} utility @item Submit changes as unified diffs. Use @samp{diff -u -r -N} to compare @@ -34756,6 +37367,7 @@ not do so, particularly if there are lots of changes. Include an entry for the @file{ChangeLog} file with your submission. This helps further minimize the amount of work I have to do, making it easier for me to accept patches. +It is simplest if you just make this part of your diff. @end enumerate Although this sounds like a lot of work, please remember that while you @@ -34791,11 +37403,9 @@ Be prepared to sign the appropriate paperwork. In order for the FSF to distribute your code, you must either place your code in the public domain and submit a signed statement to that effect, or assign the copyright in your code to the FSF. -@ifinfo Both of these actions are easy to do and @emph{many} people have done so already. If you have questions, please contact me, or @email{gnu@@gnu.org}. 
-@end ifinfo @item When doing a port, bear in mind that your code must coexist peacefully @@ -34815,10 +37425,39 @@ A number of the files that come with @command{gawk} are maintained by other people. Thus, you should not change them unless it is for a very good reason; i.e., changes are not out of the question, but changes to these files are scrutinized extra carefully. -The files are @file{dfa.c}, @file{dfa.h}, @file{getopt1.c}, @file{getopt.c}, -@file{getopt.h}, @file{install-sh}, @file{mkinstalldirs}, @file{regcomp.c}, -@file{regex.c}, @file{regexec.c}, @file{regexex.c}, @file{regex.h}, -@file{regex_internal.c}, and @file{regex_internal.h}. +The files are +@file{dfa.c}, +@file{dfa.h}, +@file{getopt.c}, +@file{getopt.h}, +@file{getopt1.c}, +@file{getopt_int.h}, +@file{gettext.h}, +@file{regcomp.c}, +@file{regex.c}, +@file{regex.h}, +@file{regex_internal.c}, +@file{regex_internal.h}, +and +@file{regexec.c}. + +@item +A number of other files are provided by the GNU +Autotools (Autoconf, Automake, and GNU @command{gettext}). +You should not change them either, unless it is for a very +good reason. The files are +@file{ABOUT-NLS}, +@file{config.guess}, +@file{config.rpath}, +@file{config.sub}, +@file{depcomp}, +@file{INSTALL}, +@file{install-sh}, +@file{missing}, +@file{mkinstalldirs}, +@file{xalloc.h}, +and +@file{ylwrap}. @item Be willing to continue to maintain the port. @@ -34869,14 +37508,16 @@ In the code that you supply and maintain, feel free to use a coding style and brace layout that suits your taste. @node Derived Files -@appendixsubsec Why Generated Files Are Kept In @command{git} +@appendixsubsec Why Generated Files Are Kept In Git +@c STARTOFRANGE gawkgit +@cindex Git, use of for @command{gawk} source code @c From emails written March 22, 2012, to the gawk developers list. 
-If you look at the @command{gawk} source in the @command{git} +If you look at the @command{gawk} source in the Git repository, you will notice that it includes files that are automatically generated by GNU infrastructure tools, such as @file{Makefile.in} from -@command{automake} and even @file{configure} from @command{autoconf}. +Automake and even @file{configure} from Autoconf. This is different from many Free Software projects that do not store the derived files, because that keeps the repository less cluttered, @@ -34902,11 +37543,10 @@ there a guarantee that we could find that @command{bison} version? Or that @emph{it} would build?) If the repository has all the generated files, then it's easy to just check -them out and build. (Or @emph{easier}, depending upon how far back we go. -@code{:-)}) +them out and build. (Or @emph{easier}, depending upon how far back we go.) And that brings us to the second (and stronger) reason why all the files -really need to be in @command{git}. It boils down to who do you cater +really need to be in Git. It boils down to who do you cater to---the @command{gawk} developer(s), or the user who just wants to check out a version and try it out? @@ -34915,10 +37555,10 @@ wants it to be possible for any interested @command{awk} user in the world to just clone the repository, check out the branch of interest and build it. Without their having to have the correct version(s) of the autotools.@footnote{There is one GNU program that is (in our opinion) -severely difficult to bootstrap from the @command{git} repository. For -example, on the author's old (but still working) PowerPC macintosh with +severely difficult to bootstrap from the Git repository. For +example, on the author's old (but still working) PowerPC Macintosh with Mac OS X 10.5, it was necessary to bootstrap a ton of software, starting -with @command{git} itself, in order to try to work with the latest code. +with Git itself, in order to try to work with the latest code. 
It's not pleasant, and especially on older systems, it's a big waste of time. @@ -34941,14 +37581,14 @@ This is extremely important for the @code{master} and Further, the @command{gawk} maintainer would argue that it's also important for the @command{gawk} developers. When he tried to check out -the @code{xgawk} branch@footnote{A branch created by one of the other +the @code{xgawk} branch@footnote{A branch (since removed) created by one of the other developers that did not include the generated files.} to build it, he couldn't. (No @file{ltmain.sh} file, and he had no idea how to create it, and that was not the only problem.) He felt @emph{extremely} frustrated. With respect to that branch, the maintainer is no different than Jane User who wants to try to build -@code{gawk-4.0-stable} or @code{master} from the repository. +@code{gawk-4.1-stable} or @code{master} from the repository. Thus, the maintainer thinks that it's not just important, but critical, that for any given branch, the above incantation @emph{just works}. @@ -34968,29 +37608,29 @@ It's the maintainer's job to merge them and he will deal with it. @item He is really good at @samp{git diff x y > /tmp/diff1 ; gvim /tmp/diff1} to -remove the diffs that aren't of interest in order to review code. @code{:-)} +remove the diffs that aren't of interest in order to review code. @end enumerate @item It would certainly help if everyone used the same versions of the GNU tools as he does, which in general are the latest released versions of -@command{automake}, -@command{autoconf}, +Automake, +Autoconf, @command{bison}, and -@command{gettext}. +GNU @command{gettext}. @ignore -If it would help if I sent out an "I just upgraded to version x.y -of tool Z" kind of message to this list, I can do that. Up until +If it would help if I sent out an ``I just upgraded to version x.y +of tool Z'' kind of message to this list, I can do that. 
Up until now it hasn't been a real issue since I'm the only one who's been dorking with the configuration machinery. @end ignore -@enumerate A -@item +@c @enumerate A +@c @item Installing from source is quite easy. It's how the maintainer worked for years -under Fedora. +(and still works). He had @file{/usr/local/bin} at the front of his @env{PATH} and just did: @example @@ -35001,10 +37641,11 @@ cd @var{package}-@var{x}.@var{y}.@var{z} make install # as root @end example -@item +@c @item +@ignore These days the maintainer uses Ubuntu 12.04 which is medium current, but -he is already doing the above for @command{autoconf}, @command{automake} -and @command{bison}. +he is already doing the above for Automake, Autoconf, and @command{bison}. +@end ignore @ignore (C. Rant: Recent Linux versions with GNOME 3 really suck. What @@ -35012,7 +37653,7 @@ and @command{bison}. me to Ubuntu, but Ubuntu 11.04 and 11.10 are totally unusable from a UI perspective. Bleah.) @end ignore -@end enumerate +@c @end enumerate @ignore @item @@ -35028,7 +37669,7 @@ the "real" changes and the second with "everything else needed for Most of the above was originally written by the maintainer to other @command{gawk} developers. It raised the objection from one of the developers ``@dots{} that anybody pulling down the source from -@command{git} is not an end user.'' +Git is not an end user.'' However, this is not true. There are ``power @command{awk} users'' who can build @command{gawk} (using the magic incantation shown previously) @@ -35038,10 +37679,10 @@ kept buildable all the time. It was then suggested that there be a @command{cron} job to create nightly tarballs of ``the source.'' Here, the problem is that there are source trees, corresponding to the various branches! So, -nightly tar balls aren't the answer, especially as the repository can go +nightly tarballs aren't the answer, especially as the repository can go for weeks without significant change being introduced. 
-Fortunately, the @command{git} server can meet this need. For any given +Fortunately, the Git server can meet this need. For any given branch named @var{branchname}, use: @example @@ -35050,7 +37691,7 @@ wget http://git.savannah.gnu.org/cgit/gawk.git/snapshot/gawk-@var{branchname}.ta @noindent to retrieve a snapshot of the given branch. - +@c ENDOFRANGE gawkgit @node Future Extensions @appendixsec Probable Future Extensions @@ -35094,14 +37735,17 @@ Larry @quotation @i{AWK is a language similar to PERL, only considerably more elegant.} @author Arnold Robbins +@end quotation +@quotation @i{Hey!} @author Larry Wall @end quotation -The @file{TODO} file in the @command{gawk} Git repository lists possible -future enhancements. Some of these relate to the source code, and others -to possible new features. Please see that file for the list. +The @file{TODO} file in the @code{master} branch of the @command{gawk} +Git repository lists possible future enhancements. Some of these relate +to the source code, and others to possible new features. Please see +that file for the list. @xref{Additions}, if you are interested in tackling any of the projects listed there. @@ -35115,7 +37759,7 @@ different limits. @multitable @columnfractions .40 .60 @headitem Item @tab Limit @item Characters in a character class @tab 2^(number of bits per byte) -@item Length of input record @tab @code{MAX_INT } +@item Length of input record @tab @code{MAX_INT} @item Length of output record @tab Unlimited @item Length of source line @tab Unlimited @item Number of fields in a record @tab @code{MAX_LONG} @@ -35124,9 +37768,9 @@ different limits. 
@item Number of input records total @tab @code{MAX_LONG} @item Number of pipe redirections @tab min(number of processes per user, number of open files) @item Numeric values @tab Double-precision floating point (if not using MPFR) -@item Size of a field @tab @code{MAX_INT } -@item Size of a literal string @tab @code{MAX_INT } -@item Size of a printf string @tab @code{MAX_INT } +@item Size of a field @tab @code{MAX_INT} +@item Size of a literal string @tab @code{MAX_INT} +@item Size of a printf string @tab @code{MAX_INT} @end multitable @node Extension Design @@ -35161,7 +37805,7 @@ mechanism was bolted onto the side and was not really well thought out. The old extension mechanism had several problems: -@itemize @bullet +@itemize @value{BULLET} @item It depended heavily upon @command{gawk} internals. Any time the @code{NODE} structure@footnote{A critical central data structure @@ -35173,8 +37817,8 @@ documentation in this @value{DOCUMENT}, but it was quite minimal. @item Being able to call into @command{gawk} from an extension required linker facilities that are common on Unix-derived systems but that did -not work on Windows systems; users wanting extensions on Windows -had to statically link them into @command{gawk}, even though Windows supports +not work on MS-Windows systems; users wanting extensions on MS-Windows +had to statically link them into @command{gawk}, even though MS-Windows supports dynamic loading of shared objects. @item @@ -35197,7 +37841,7 @@ project is provided in @ref{gawkextlib}. Some goals for the new API were: -@itemize @bullet +@itemize @value{BULLET} @item The API should be independent of @command{gawk} internals. Changes in @command{gawk} internals should not be visible to the writer of an @@ -35212,7 +37856,7 @@ The API should enable extensions written in C or C++ to have roughly the same ``appearance'' to @command{awk}-level code as @command{awk} functions do. 
This means that extensions should have: -@itemize @minus +@itemize @value{MINUS} @item The ability to access function parameters. @@ -35228,13 +37872,13 @@ in order to loop over all the element in an easy fashion for C code. @item The ability to create arrays (including @command{gawk}'s true -multidimensional arrays). +arrays of arrays). @end itemize @end itemize Some additional important goals were: -@itemize @bullet +@itemize @value{BULLET} @item The API should use only features in ISO C 90, so that extensions can be written using the widest range of C and C++ compilers. The header @@ -35249,15 +37893,15 @@ The API mechanism should not require access to @command{gawk}'s symbols@footnote{The @dfn{symbols} are the variables and functions defined inside @command{gawk}. Access to these symbols by code external to @command{gawk} loaded dynamically at runtime is -problematic on Windows.} by the compile-time or dynamic linker, -in order to enable creation of extensions that also work on Windows. +problematic on MS-Windows.} by the compile-time or dynamic linker, +in order to enable creation of extensions that also work on MS-Windows. @end itemize During development, it became clear that there were other features that should be available to extensions, which were also subsequently provided: -@itemize @bullet +@itemize @value{BULLET} @item Extensions should have the ability to hook into @command{gawk}'s I/O redirection mechanism. In particular, the @command{xgawk} @@ -35268,7 +37912,7 @@ two-way I/O. @item An extension should be able to provide a ``call back'' function -to perform clean up actions when @command{gawk} exits. +to perform cleanup actions when @command{gawk} exits. @item An extension should be able to provide a version string so that @@ -35338,7 +37982,7 @@ to provide a minimal yet powerful set of features for creating extensions. 
The API can later be expanded, in two ways: -@itemize @bullet +@itemize @value{BULLET} @item @command{gawk} passes an ``extension id'' into the extension when it first loads the extension. The extension then passes this id back @@ -35361,12 +38005,12 @@ to any of the above. @ref{Dynamic Extensions}, describes the supported API and mechanisms for writing extensions for @command{gawk}. This API was introduced -in version 4.1. However, for many years @command{gawk} +in @value{PVERSION} 4.1. However, for many years @command{gawk} provided an extension mechanism that required knowledge of @command{gawk} internals and that was not as well designed. -In order to provide a transition period, @command{gawk} version -4.1 continues to support the original extension mechanism. +In order to provide a transition period, @command{gawk} @value{PVERSION} 4.1 +continues to support the original extension mechanism. This will be true for the life of exactly one major release. This support will be withdrawn, and removed from the source code, at the next major release. @@ -35392,6 +38036,42 @@ The @command{gawk} development team strongly recommends that you convert any old extensions that you may have to use the new API described in @ref{Dynamic Extensions}. +@node Notes summary +@appendixsec Summary + +@itemize @value{BULLET} +@item +@command{gawk}'s extensions can be disabled with either the +@option{--traditional} option or with the @option{--posix} option. +The @option{--parsedebug} option is available if @command{gawk} is +compiled with @samp{-DDEBUG}. + +@item +The source code for @command{gawk} is maintained in a publicly +accessible Git repository. Anyone may check it out and view the source. + +@item +Contributions to @command{gawk} are welcome. Following the steps +outlined in this @value{CHAPTER} will make it easier to integrate +your contributions into the code base. +This applies both to new feature contributions and to ports to +additional operating systems. 
+ +@item +@command{gawk} has some limits---generally those that are imposed by +the machine architecture. + +@item +The extension API design was intended to solve a number of problems +with the previous extension mechanism, enable features needed by +the @code{xgawk} project, and provide binary compatibility going forward. + +@item +The previous extension mechanism is still supported in @value{PVERSION} 4.1 +of @command{gawk}, but it @emph{will} be removed in the next major release. + +@end itemize + @c ENDOFRANGE impis @c ENDOFRANGE gawii @@ -35419,8 +38099,15 @@ other introductory texts that you should refer to instead.) @cindex processing data At the most basic level, the job of a program is to process -some input data and produce results. See @ref{figure-general-flow}. +some input data and produce results. +@ifnotdocbook +See @ref{figure-general-flow}. +@end ifnotdocbook +@ifdocbook +See @inlineraw{docbook, <xref linkend="figure-general-flow"/>}. +@end ifdocbook +@ifnotdocbook @float Figure,figure-general-flow @caption{General Program Flow} @ifinfo @@ -35430,6 +38117,16 @@ some input data and produce results. See @ref{figure-general-flow}. @center @image{general-program, , , General program flow} @end ifnotinfo @end float +@end ifnotdocbook + +@docbook +<figure id="figure-general-flow" float="0"> +<title>General Program Flow</title> +<mediaobject> +<imageobject role="web"><imagedata fileref="general-program.png" format="PNG"/></imageobject> +</mediaobject> +</figure> +@end docbook @cindex compiled programs @cindex interpreted programs @@ -35445,9 +38142,15 @@ instructions in your program to process the data. 
@cindex programming, basic steps When you write a program, it usually consists -of the following, very basic set of steps, as shown -in @ref{figure-process-flow}: +of the following, very basic set of steps, +@ifnotdocbook +as shown in @ref{figure-process-flow}: +@end ifnotdocbook +@ifdocbook +as shown in @inlineraw{docbook, <xref linkend="figure-process-flow"/>}: +@end ifdocbook +@ifnotdocbook @float Figure,figure-process-flow @caption{Basic Program Steps} @ifinfo @@ -35457,6 +38160,16 @@ in @ref{figure-process-flow}: @center @image{process-flow, , , Basic Program Stages} @end ifnotinfo @end float +@end ifnotdocbook + +@docbook +<figure id="figure-process-flow" float="0"> +<title>Basic Program Stages</title> +<mediaobject> +<imageobject role="web"><imagedata fileref="process-flow.png" format="PNG"/></imageobject> +</mediaobject> +</figure> +@end docbook @table @asis @item Initialization @@ -35552,7 +38265,7 @@ Individual variables, as well as numeric and string variables, are referred to as @dfn{scalar} values. Groups of values, such as arrays, are not scalars. -@ref{General Arithmetic}, provided a basic introduction to numeric +@ref{Computer Arithmetic}, provided a basic introduction to numeric types (integer and floating-point) and how they are used in a computer. Please review that information, including a number of caveats that were presented. @@ -35568,14 +38281,14 @@ like this: @code{""}. Humans are used to working in decimal; i.e., base 10. In base 10, numbers go from 0 to 9, and then ``roll over'' into the next -column. (Remember grade school? 42 is 4 times 10 plus 2.) +column. (Remember grade school? 42 = 4 x 10 + 2.) There are other number bases though. Computers commonly use base 2 or @dfn{binary}, base 8 or @dfn{octal}, and base 16 or @dfn{hexadecimal}. In binary, each column represents two times the value in the column to its right. Each column may contain either a 0 or a 1. 
-Thus, binary 1010 represents 1 times 8, plus 0 times 4, plus 1 times 2, -plus 0 times 1, or decimal 10. +Thus, binary 1010 represents (1 x 8) + (0 x 4) + (1 x 2) ++ (0 x 1), or decimal 10. Octal and hexadecimal are discussed more in @ref{Nondecimal-numbers}. @@ -35612,7 +38325,7 @@ Where it makes sense, POSIX @command{awk} is compatible with 1999 ISO C. @item Action A series of @command{awk} statements attached to a rule. If the rule's pattern matches an input record, @command{awk} executes the -rule's action. Actions are always enclosed in curly braces. +rule's action. Actions are always enclosed in braces. (@xref{Action Overview}.) @cindex Spencer, Henry @@ -35627,7 +38340,7 @@ better written in another language. You can get it from @uref{http://awk.info/?awk100/aaa}. @cindex Ada programming language -@cindex Programming languages, Ada +@cindex programming languages, Ada @item Ada A programming language originally defined by the U.S.@: Department of Defense for embedded programming. It was designed to enforce good @@ -35695,9 +38408,6 @@ The GNU version of the standard shell @end ifinfo See also ``Bourne Shell.'' -@item BBS -See ``Bulletin Board System.'' - @item Bit Short for ``Binary Digit.'' All values in computer memory ultimately reduce to binary digits: values @@ -35720,7 +38430,7 @@ Named after the English mathematician Boole. See also ``Logical Expression.'' @item Bourne Shell The standard shell (@file{/bin/sh}) on Unix and Unix-like systems, -originally written by Steven R.@: Bourne. +originally written by Steven R.@: Bourne at Bell Laboratories. Many shells (Bash, @command{ksh}, @command{pdksh}, @command{zsh}) are generally upwardly compatible with the Bourne shell. @@ -35770,12 +38480,9 @@ Changing some of them affects @command{awk}'s running environment. (@xref{Built-in Variables}.) 
@item Braces -See ``Curly Braces.'' - -@item Bulletin Board System -A computer system allowing users to log in and read and/or leave messages -for other users of the system, much like leaving paper notes on a bulletin -board. +The characters @samp{@{} and @samp{@}}. Braces are used in +@command{awk} for delimiting actions, compound statements, and function +bodies. @item C The system programming language that most GNU software is written in. The @@ -35800,9 +38507,11 @@ or place. The most common character set in use today is ASCII (American Standard Code for Information Interchange). Many European countries use an extension of ASCII known as ISO-8859-1 (ISO Latin-1). The @uref{http://www.unicode.org, Unicode character set} is -becoming increasingly popular and standard, and is particularly +increasingly popular and standard, and is particularly widely used on GNU/Linux systems. +@cindex Kernighan, Brian +@cindex Bentley, Jon @cindex @command{chem} utility @item CHEM A preprocessor for @command{pic} that reads descriptions of molecules @@ -35811,10 +38520,11 @@ It was written in @command{awk} by Brian Kernighan and Jon Bentley, and is available from @uref{http://netlib.sandia.gov/netlib/typesetting/chem.gz}. +@cindex McIlroy, Doug @cindex cookie @item Cookie A peculiar goodie, token, saying or remembrance -produced by or presented to a program. (With thanks to Doug McIlroy.) +produced by or presented to a program. (With thanks to Professor Doug McIlroy.) @ignore From: Doug McIlroy <doug@cs.dartmouth.edu> Date: Sat, 13 Oct 2012 19:55:25 -0400 @@ -35892,9 +38602,7 @@ statements, and in patterns to select which input records to process. (@xref{Typing and Comparison}.) @item Curly Braces -The characters @samp{@{} and @samp{@}}. Curly braces are used in -@command{awk} for delimiting actions, compound statements, and function -bodies. +See ``Braces.'' @cindex dark corner @item Dark Corner @@ -35939,7 +38647,7 @@ ordinary expression. 
It could be a string constant, such as (@xref{Computed Regexps}.) @item Environment -A collection of strings, of the form @var{name@code{=}val}, that each +A collection of strings, of the form @samp{@var{name}=@var{val}}, that each program has available to it. Users generally place values into the environment in order to provide information to various programs. Typical examples are the environment variables @env{HOME} and @env{PATH}. @@ -35993,8 +38701,8 @@ this is just a number that can have a fractional part. See also ``Double Precision'' and ``Single Precision.'' @item Format -Format strings are used to control the appearance of output in the -@code{strftime()} and @code{sprintf()} functions, and are used in the +Format strings control the appearance of output in the +@code{strftime()} and @code{sprintf()} functions, and in the @code{printf} statement as well. Also, data conversions from numbers to strings are controlled by the format strings contained in the built-in variables @code{CONVFMT} and @code{OFMT}. (@xref{Control Letters}.) @@ -36063,7 +38771,7 @@ Base 16 notation, where the digits are @code{0}--@code{9} and @code{A}--@code{F}, with @samp{A} representing 10, @samp{B} representing 11, and so on, up to @samp{F} for 15. Hexadecimal numbers are written in C using a leading @samp{0x}, -to indicate their base. Thus, @code{0x12} is 18 (1 times 16 plus 2). +to indicate their base. Thus, @code{0x12} is 18 ((1 x 16) + 2). @xref{Nondecimal-numbers}. @item I/O @@ -36108,7 +38816,7 @@ information about the name of the organization and its language-independent three-letter acronym. @cindex Java programming language -@cindex Programming languages, Java +@cindex programming languages, Java @item Java A modern programming language originally developed by Sun Microsystems (now Oracle) supporting Object-Oriented programming. Although usually @@ -36137,8 +38845,8 @@ meaning. Keywords are reserved and may not be used as variable names. 
@code{function}, @code{func}, @code{if}, -@code{nextfile}, @code{next}, +@code{nextfile}, @code{switch}, and @code{while}. @@ -36199,13 +38907,9 @@ Ancient @command{awk} implementations used single precision floating-point. @item Octal Base-eight notation, where the digits are @code{0}--@code{7}. Octal numbers are written in C using a leading @samp{0}, -to indicate their base. Thus, @code{013} is 11 (one times 8 plus 3). +to indicate their base. Thus, @code{013} is 11 ((1 x 8) + 3). @xref{Nondecimal-numbers}. -@cindex P1003.1 POSIX standard -@item P1003.1 -See ``POSIX.'' - @item Pattern Patterns tell @command{awk} which input records are interesting to which rules. @@ -36246,8 +38950,8 @@ specify single lines. (@xref{Pattern Overview}.) @item Recursion When a function calls itself, either directly or indirectly. -As long as this is not clear, refer to the entry for ``recursion.'' If this is clear, stop, and proceed to the next entry. +Otherwise, refer to the entry for ``recursion.'' @item Redirection Redirection means performing input from something other than the standard input @@ -36326,14 +39030,14 @@ expressions, and function calls have side effects. An internal representation of numbers that can have fractional parts. Single precision numbers keep track of fewer digits than do double precision numbers, but operations on them are sometimes less expensive in terms of CPU time. -This is the type used by some very old versions of @command{awk} to store +This is the type used by some ancient versions of @command{awk} to store numeric values. It is the C type @code{float}. @item Space The character generated by hitting the space bar on the keyboard. @item Special File -A file name interpreted internally by @command{gawk}, instead of being handed +A @value{FN} interpreted internally by @command{gawk}, instead of being handed directly to the underlying operating system---for example, @file{/dev/stderr}. (@xref{Special Files}.) 
@@ -36363,7 +39067,7 @@ into the local language. A value in the ``seconds since the epoch'' format used by Unix and POSIX systems. Used for the @command{gawk} functions @code{mktime()}, @code{strftime()}, and @code{systime()}. -See also ``Epoch'' and ``UTC.'' +See also ``Epoch,'' ``GMT,'' and ``UTC.'' @cindex Linux @cindex GNU/Linux @@ -36395,7 +39099,12 @@ record or a string. @c The GNU General Public License. @node Copying @unnumbered GNU General Public License +@ifnotdocbook @center Version 3, 29 June 2007 +@end ifnotdocbook +@docbook +<subtitle>Version 3, 29 June 2007</subtitle> +@end docbook @c This file is intended to be included within another document, @c hence no sectioning command or @node. @@ -37120,10 +39829,17 @@ first, please read @url{http://www.gnu.org/philosophy/why-not-lgpl.html}. @c The GNU Free Documentation License. @node GNU Free Documentation License @unnumbered GNU Free Documentation License +@ifnotdocbook +@center Version 1.3, 3 November 2008 +@end ifnotdocbook + +@docbook +<subtitle>Version 1.3, 3 November 2008</subtitle> +@end docbook + @cindex FDL (Free Documentation License) @cindex Free Documentation License (FDL) @cindex GNU Free Documentation License -@center Version 1.3, 3 November 2008 @c This file is intended to be included within another document, @c hence no sectioning command or @node. @@ -37624,12 +40340,12 @@ recommend releasing these examples in parallel under your choice of free software license, such as the GNU General Public License, to permit their use in free software. -@c Local Variables: -@c ispell-local-pdict: "ispell-dict" -@c End: +@end ifclear +@ifnotdocbook @node Index @unnumbered Index +@end ifnotdocbook @printindex cp @bye @@ -37737,11 +40453,6 @@ ORA uses filename, thus the macro. Suggestions: ------------ -% Next edition: -% 1. Standardize the error messages from the functions and programs -% in the two sample code chapters. -% 2. Nuke the BBS stuff and use something that won't be obsolete -% 3. 
Turn the advanced notes into sidebars by using @cartouche Better sidebars can almost sort of be done with: @@ -37774,3 +40485,5 @@ But to use it you have to say which sorta sucks. +TODO: +-----
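Note on the number-base hunks in this diff: the glossary changes rewrite the worked arithmetic as binary 1010 = (1 x 8) + (0 x 4) + (1 x 2) + (0 x 1), hexadecimal 0x12 = (1 x 16) + 2, and octal 013 = (1 x 8) + 3. The sketch below is one way to check those sums by positional expansion. It is an illustration only, not code from the manual: `to_decimal` is a hypothetical helper name, and the script assumes a POSIX `awk` on the PATH and a system where `/dev/stderr` can be written to.

```sh
# Hypothetical helper (not part of gawk or its manual): convert a digit
# string in a given base (2 through 16) to decimal by positional expansion.
# Uses only POSIX awk features; no gawk extensions such as strtonum().
to_decimal() {
    awk -v num="$1" -v base="$2" 'BEGIN {
        digits = "0123456789abcdef"
        num = tolower(num)
        value = 0
        for (i = 1; i <= length(num); i++) {
            # index() is 1-based and returns 0 when the character is absent
            d = index(digits, substr(num, i, 1)) - 1
            if (d < 0 || d >= base) {
                print "bad digit: " substr(num, i, 1) > "/dev/stderr"
                exit 1
            }
            # Shift the running total one column left, then add this digit
            value = value * base + d
        }
        print value
    }'
}

to_decimal 1010 2    # binary 1010:      (1 x 8) + (0 x 4) + (1 x 2) + (0 x 1) -> 10
to_decimal 12 16     # hexadecimal 0x12: (1 x 16) + 2                          -> 18
to_decimal 13 8      # octal 013:        (1 x 8) + 3                           -> 11
```

The base prefixes (`0x`, leading `0`) are left off the arguments on purpose: the helper takes the base as an explicit second argument rather than guessing it from the spelling of the number.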