Diffstat (limited to 'doc/gawk.texi')
-rw-r--r-- | doc/gawk.texi | 12578 |
1 files changed, 8116 insertions, 4462 deletions
diff --git a/doc/gawk.texi b/doc/gawk.texi
index 8cd7e38e..fb008d74 100644
--- a/doc/gawk.texi
+++ b/doc/gawk.texi
@@ -20,9 +20,9 @@
 @c applies to and all the info about who's publishing this edition
 @c These apply across the board.
-@set UPDATE-MONTH February, 2012
+@set UPDATE-MONTH November, 2012
 @set VERSION 4.0
-@set PATCHLEVEL 1
+@set PATCHLEVEL 2
 @set FSF
@@ -66,6 +66,15 @@
 @set DARKCORNER (d.c.)
 @set COMMONEXT (c.e.)
 @end ifdocbook
+@ifxml
+@set DOCUMENT book
+@set CHAPTER chapter
+@set APPENDIX appendix
+@set SECTION section
+@set SUBSECTION subsection
+@set DARKCORNER (d.c.)
+@set COMMONEXT (c.e.)
+@end ifxml
 @ifplaintext
 @set DOCUMENT book
 @set CHAPTER chapter
@@ -285,22 +294,24 @@ particular records in a file and perform operations upon them.
 * Arrays:: The description and use of arrays. Also includes array-oriented control statements.
 * Functions:: Built-in and user-defined functions.
+* Library Functions:: A Library of @command{awk} Functions.
+* Sample Programs:: Many @command{awk} programs with complete
+    explanations.
 * Internationalization:: Getting @command{gawk} to speak your language.
-* Arbitrary Precision Arithmetic:: Arbitrary precision arithmetic with
-    @command{gawk}.
 * Advanced Features:: Stuff for advanced users, specific to @command{gawk}.
-* Library Functions:: A Library of @command{awk} Functions.
-* Sample Programs:: Many @command{awk} programs with complete
-    explanations.
 * Debugger:: The @code{gawk} debugger.
+* Arbitrary Precision Arithmetic:: Arbitrary precision arithmetic with
+    @command{gawk}.
+* Dynamic Extensions:: Adding new built-in functions to
+    @command{gawk}.
 * Language History:: The evolution of the @command{awk} language.
 * Installation:: Installing @command{gawk} under various operating systems.
-* Notes:: Notes about @command{gawk} extensions and
-    possible future work.
+* Notes:: Notes about adding things to @command{gawk}
+    and possible future work.
 * Basic Concepts:: A very quick introduction to programming concepts.
 * Glossary:: An explanation of some unfamiliar terms.
@@ -310,416 +321,532 @@ particular records in a file and perform operations upon them.
 * Index:: Concept and Variable Index.
 @detailmenu
-* History:: The history of @command{gawk} and
-    @command{awk}.
-* Names:: What name to use to find @command{awk}.
-* This Manual:: Using this @value{DOCUMENT}. Includes
-    sample input files that you can use.
-* Conventions:: Typographical Conventions.
-* Manual History:: Brief history of the GNU project and this
-    @value{DOCUMENT}.
-* How To Contribute:: Helping to save the world.
-* Acknowledgments:: Acknowledgments.
-* Running gawk:: How to run @command{gawk} programs;
-    includes command-line syntax.
-* One-shot:: Running a short throwaway @command{awk}
-    program.
-* Read Terminal:: Using no input files (input from terminal
-    instead).
-* Long:: Putting permanent @command{awk} programs in
-    files.
-* Executable Scripts:: Making self-contained @command{awk}
-    programs.
-* Comments:: Adding documentation to @command{gawk}
-    programs.
-* Quoting:: More discussion of shell quoting issues.
-* DOS Quoting:: Quoting in Windows Batch Files.
-* Sample Data Files:: Sample data files for use in the
-    @command{awk} programs illustrated in this
-    @value{DOCUMENT}.
-* Very Simple:: A very simple example.
-* Two Rules:: A less simple one-line example using two
-    rules.
-* More Complex:: A more complex example.
-* Statements/Lines:: Subdividing or combining statements into
-    lines.
-* Other Features:: Other Features of @command{awk}.
-* When:: When to use @command{gawk} and when to use
-    other things.
-* Command Line:: How to run @command{awk}.
-* Options:: Command-line options and their meanings.
-* Other Arguments:: Input file names and variable assignments.
-* Naming Standard Input:: How to specify standard input with other
-    files.
-* Environment Variables:: The environment variables @command{gawk}
-    uses.
-* AWKPATH Variable:: Searching directories for @command{awk}
-    programs.
-* Other Environment Variables:: The environment variables.
-* Exit Status:: @command{gawk}'s exit status.
-* Include Files:: Including other files into your program.
-* Obsolete:: Obsolete Options and/or features.
-* Undocumented:: Undocumented Options and Features.
-* Regexp Usage:: How to Use Regular Expressions.
-* Escape Sequences:: How to write nonprinting characters.
-* Regexp Operators:: Regular Expression Operators.
-* Bracket Expressions:: What can go between @samp{[...]}.
-* GNU Regexp Operators:: Operators specific to GNU software.
-* Case-sensitivity:: How to do case-insensitive matching.
-* Leftmost Longest:: How much text matches.
-* Computed Regexps:: Using Dynamic Regexps.
-* Records:: Controlling how data is split into records.
-* Fields:: An introduction to fields.
-* Nonconstant Fields:: Nonconstant Field Numbers.
-* Changing Fields:: Changing the Contents of a Field.
-* Field Separators:: The field separator and how to change it.
-* Default Field Splitting:: How fields are normally separated.
-* Regexp Field Splitting:: Using regexps as the field separator.
-* Single Character Fields:: Making each character a separate field.
-* Command Line Field Separator:: Setting @code{FS} from the command-line.
-* Field Splitting Summary:: Some final points and a summary table.
-* Constant Size:: Reading constant width data.
-* Splitting By Content:: Defining Fields By Content
-* Multiple Line:: Reading multi-line records.
-* Getline:: Reading files under explicit program
-    control using the @code{getline} function.
-* Plain Getline:: Using @code{getline} with no arguments.
-* Getline/Variable:: Using @code{getline} into a variable.
-* Getline/File:: Using @code{getline} from a file.
-* Getline/Variable/File:: Using @code{getline} into a variable from a
-    file.
-* Getline/Pipe:: Using @code{getline} from a pipe.
-* Getline/Variable/Pipe:: Using @code{getline} into a variable from a
-    pipe.
-* Getline/Coprocess:: Using @code{getline} from a coprocess.
-* Getline/Variable/Coprocess:: Using @code{getline} into a variable from a
-    coprocess.
-* Getline Notes:: Important things to know about
-    @code{getline}.
-* Getline Summary:: Summary of @code{getline} Variants.
-* Read Timeout:: Reading input with a timeout.
-* Command line directories:: What happens if you put a directory on the
-    command line.
-* Print:: The @code{print} statement.
-* Print Examples:: Simple examples of @code{print} statements.
-* Output Separators:: The output separators and how to change
-    them.
-* OFMT:: Controlling Numeric Output With
-    @code{print}.
-* Printf:: The @code{printf} statement.
-* Basic Printf:: Syntax of the @code{printf} statement.
-* Control Letters:: Format-control letters.
-* Format Modifiers:: Format-specification modifiers.
-* Printf Examples:: Several examples.
-* Redirection:: How to redirect output to multiple files
-    and pipes.
-* Special Files:: File name interpretation in @command{gawk}.
-    @command{gawk} allows access to inherited
-    file descriptors.
-* Special FD:: Special files for I/O.
-* Special Network:: Special files for network communications.
-* Special Caveats:: Things to watch out for.
-* Close Files And Pipes:: Closing Input and Output Files and Pipes.
-* Values:: Constants, Variables, and Regular
-    Expressions.
-* Constants:: String, numeric and regexp constants.
-* Scalar Constants:: Numeric and string constants.
-* Nondecimal-numbers:: What are octal and hex numbers.
-* Regexp Constants:: Regular Expression constants.
-* Using Constant Regexps:: When and how to use a regexp constant.
-* Variables:: Variables give names to values for later
-    use.
-* Using Variables:: Using variables in your programs.
-* Assignment Options:: Setting variables on the command-line and a
-    summary of command-line syntax. This is an
-    advanced method of input.
-* Conversion:: The conversion of strings to numbers and
-    vice versa.
-* All Operators:: @command{gawk}'s operators.
-* Arithmetic Ops:: Arithmetic operations (@samp{+}, @samp{-},
-    etc.)
-* Concatenation:: Concatenating strings.
-* Assignment Ops:: Changing the value of a variable or a
-    field.
-* Increment Ops:: Incrementing the numeric value of a
-    variable.
-* Truth Values and Conditions:: Testing for true and false.
-* Truth Values:: What is ``true'' and what is ``false''.
-* Typing and Comparison:: How variables acquire types and how this
-    affects comparison of numbers and strings
-    with @samp{<}, etc.
-* Variable Typing:: String type versus numeric type.
-* Comparison Operators:: The comparison operators.
-* POSIX String Comparison:: String comparison with POSIX rules.
-* Boolean Ops:: Combining comparison expressions using
-    boolean operators @samp{||} (``or''),
-    @samp{&&} (``and'') and @samp{!} (``not'').
-* Conditional Exp:: Conditional expressions select between two
-    subexpressions under control of a third
-    subexpression.
-* Function Calls:: A function call is an expression.
-* Precedence:: How various operators nest.
-* Locales:: How the locale affects things.
-* Pattern Overview:: What goes into a pattern.
-* Regexp Patterns:: Using regexps as patterns.
-* Expression Patterns:: Any expression can be used as a pattern.
-* Ranges:: Pairs of patterns specify record ranges.
-* BEGIN/END:: Specifying initialization and cleanup
-    rules.
-* Using BEGIN/END:: How and why to use BEGIN/END rules.
-* I/O And BEGIN/END:: I/O issues in BEGIN/END rules.
-* BEGINFILE/ENDFILE:: Two special patterns for advanced control.
-* Empty:: The empty pattern, which matches every
-    record.
-* Using Shell Variables:: How to use shell variables with
-    @command{awk}.
-* Action Overview:: What goes into an action.
-* Statements:: Describes the various control statements in
-    detail.
-* If Statement:: Conditionally execute some @command{awk}
-    statements.
-* While Statement:: Loop until some condition is satisfied.
-* Do Statement:: Do specified action while looping until
-    some condition is satisfied.
-* For Statement:: Another looping statement, that provides
-    initialization and increment clauses.
-* Switch Statement:: Switch/case evaluation for conditional
-    execution of statements based on a value.
-* Break Statement:: Immediately exit the innermost enclosing
-    loop.
-* Continue Statement:: Skip to the end of the innermost enclosing
-    loop.
-* Next Statement:: Stop processing the current input record.
-* Nextfile Statement:: Stop processing the current file.
-* Exit Statement:: Stop execution of @command{awk}.
-* Built-in Variables:: Summarizes the built-in variables.
-* User-modified:: Built-in variables that you change to
-    control @command{awk}.
-* Auto-set:: Built-in variables where @command{awk}
-    gives you information.
-* ARGC and ARGV:: Ways to use @code{ARGC} and @code{ARGV}.
-* Array Basics:: The basics of arrays.
-* Array Intro:: Introduction to Arrays
-* Reference to Elements:: How to examine one element of an array.
-* Assigning Elements:: How to change an element of an array.
-* Array Example:: Basic Example of an Array
-* Scanning an Array:: A variation of the @code{for} statement. It
-    loops through the indices of an array's
-    existing elements.
-* Controlling Scanning:: Controlling the order in which arrays are
-    scanned.
-* Delete:: The @code{delete} statement removes an
-    element from an array.
-* Numeric Array Subscripts:: How to use numbers as subscripts in
-    @command{awk}.
-* Uninitialized Subscripts:: Using Uninitialized variables as
-    subscripts.
-* Multi-dimensional:: Emulating multidimensional arrays in
-    @command{awk}.
-* Multi-scanning:: Scanning multidimensional arrays.
-* Arrays of Arrays:: True multidimensional arrays.
-* Built-in:: Summarizes the built-in functions.
-* Calling Built-in:: How to call built-in functions.
-* Numeric Functions:: Functions that work with numbers, including
-    @code{int()}, @code{sin()} and
-    @code{rand()}.
-* String Functions:: Functions for string manipulation, such as
-    @code{split()}, @code{match()} and
-    @code{sprintf()}.
-* Gory Details:: More than you want to know about @samp{\}
-    and @samp{&} with @code{sub()},
-    @code{gsub()}, and @code{gensub()}.
-* I/O Functions:: Functions for files and shell commands.
-* Time Functions:: Functions for dealing with timestamps.
-* Bitwise Functions:: Functions for bitwise operations.
-* Type Functions:: Functions for type information.
-* I18N Functions:: Functions for string translation.
-* User-defined:: Describes User-defined functions in detail.
-* Definition Syntax:: How to write definitions and what they
-    mean.
-* Function Example:: An example function definition and what it
-    does.
-* Function Caveats:: Things to watch out for.
-* Calling A Function:: Don't use spaces.
-* Variable Scope:: Controlling variable scope.
-* Pass By Value/Reference:: Passing parameters.
-* Return Statement:: Specifying the value a function returns.
-* Dynamic Typing:: How variable types can change at runtime.
-* Indirect Calls:: Choosing the function to call at runtime.
-* I18N and L10N:: Internationalization and Localization.
-* Explaining gettext:: How GNU @code{gettext} works.
-* Programmer i18n:: Features for the programmer.
-* Translator i18n:: Features for the translator.
-* String Extraction:: Extracting marked strings.
-* Printf Ordering:: Rearranging @code{printf} arguments.
-* I18N Portability:: @command{awk}-level portability issues.
-* I18N Example:: A simple i18n example.
-* Gawk I18N:: @command{gawk} is also internationalized.
-* Floating-point Programming:: Effective floating-point programming.
-* Floating-point Representation:: Binary floating-point representation.
-* Floating-point Context:: Floating-point context.
-* Rounding Mode:: Floating-point rounding mode.
-* Arbitrary Precision Floats:: Arbitrary precision floating-point
-    arithmetic with @command{gawk}.
-* Setting Precision:: Setting the working precision.
-* Setting Rounding Mode:: Setting the rounding mode.
-* Floating-point Constants:: Representing floating-point constants.
-* Changing Precision:: Changing the precision of a number.
-* Exact Arithmetic:: Exact arithmetic with floating-point numbers.
-* Integer Programming:: Effective integer programming.
-* Arbitrary Precision Integers:: Arbitrary precision integer
-    arithmetic with @command{gawk}.
-* MPFR and GMP Libraries:: Information about the MPFR and GMP libraries.
-* Nondecimal Data:: Allowing nondecimal input data.
-* Array Sorting:: Facilities for controlling array traversal
-    and sorting arrays.
-* Controlling Array Traversal:: How to use PROCINFO["sorted_in"].
-* Array Sorting Functions:: How to use @code{asort()} and
-    @code{asorti()}.
-* Two-way I/O:: Two-way communications with another
-    process.
-* TCP/IP Networking:: Using @command{gawk} for network
-    programming.
-* Profiling:: Profiling your @command{awk} programs.
-* Library Names:: How to best name private global variables
-    in library functions.
-* General Functions:: Functions that are of general use.
-* Strtonum Function:: A replacement for the built-in
-    @code{strtonum()} function.
-* Assert Function:: A function for assertions in @command{awk}
-    programs.
-* Round Function:: A function for rounding if @code{sprintf()}
-    does not do it correctly.
-* Cliff Random Function:: The Cliff Random Number Generator.
-* Ordinal Functions:: Functions for using characters as numbers
-    and vice versa.
-* Join Function:: A function to join an array into a string.
-* Gettimeofday Function:: A function to get formatted times.
-* Data File Management:: Functions for managing command-line data
-    files.
-* Filetrans Function:: A function for handling data file
-    transitions.
-* Rewind Function:: A function for rereading the current file.
-* File Checking:: Checking that data files are readable.
-* Empty Files:: Checking for zero-length files.
-* Ignoring Assigns:: Treating assignments as file names.
-* Getopt Function:: A function for processing command-line
-    arguments.
-* Passwd Functions:: Functions for getting user information.
-* Group Functions:: Functions for getting group information.
-* Walking Arrays:: A function to walk arrays of arrays.
-* Running Examples:: How to run these examples.
-* Clones:: Clones of common utilities.
-* Cut Program:: The @command{cut} utility.
-* Egrep Program:: The @command{egrep} utility.
-* Id Program:: The @command{id} utility.
-* Split Program:: The @command{split} utility.
-* Tee Program:: The @command{tee} utility.
-* Uniq Program:: The @command{uniq} utility.
-* Wc Program:: The @command{wc} utility.
-* Miscellaneous Programs:: Some interesting @command{awk} programs.
-* Dupword Program:: Finding duplicated words in a document.
-* Alarm Program:: An alarm clock.
-* Translate Program:: A program similar to the @command{tr}
-    utility.
-* Labels Program:: Printing mailing labels.
-* Word Sorting:: A program to produce a word usage count.
-* History Sorting:: Eliminating duplicate entries from a
-    history file.
-* Extract Program:: Pulling out programs from Texinfo source
-    files.
-* Simple Sed:: A Simple Stream Editor.
-* Igawk Program:: A wrapper for @command{awk} that includes
-    files.
-* Anagram Program:: Finding anagrams from a dictionary.
-* Signature Program:: People do amazing things with too much time
-    on their hands.
-* Debugging:: Introduction to @command{gawk} Debugger.
-* Debugging Concepts:: Debugging in General.
-* Debugging Terms:: Additional Debugging Concepts.
-* Awk Debugging:: Awk Debugging.
-* Sample Debugging Session:: Sample Debugging Session.
-* Debugger Invocation:: How to Start the Debugger.
-* Finding The Bug:: Finding the Bug.
-* List of Debugger Commands:: Main Commands.
-* Breakpoint Control:: Control of Breakpoints.
-* Debugger Execution Control:: Control of Execution.
-* Viewing And Changing Data:: Viewing and Changing Data.
-* Execution Stack:: Dealing with the Stack.
-* Debugger Info:: Obtaining Information about the Program and
-    the Debugger State.
-* Miscellaneous Debugger Commands:: Miscellaneous Commands.
-* Readline Support:: Readline Support.
-* Limitations:: Limitations and Future Plans.
-* V7/SVR3.1:: The major changes between V7 and System V
-    Release 3.1.
-* SVR4:: Minor changes between System V Releases 3.1
-    and 4.
-* POSIX:: New features from the POSIX standard.
-* BTL:: New features from Brian Kernighan's version
-    of @command{awk}.
-* POSIX/GNU:: The extensions in @command{gawk} not in
-    POSIX @command{awk}.
-* Common Extensions:: Common Extensions Summary.
-* Ranges and Locales:: How locales used to affect regexp ranges.
-* Contributors:: The major contributors to @command{gawk}.
-* Gawk Distribution:: What is in the @command{gawk} distribution.
-* Getting:: How to get the distribution.
-* Extracting:: How to extract the distribution.
-* Distribution contents:: What is in the distribution.
-* Unix Installation:: Installing @command{gawk} under various
-    versions of Unix.
-* Quick Installation:: Compiling @command{gawk} under Unix.
-* Additional Configuration Options:: Other compile-time options.
-* Configuration Philosophy:: How it's all supposed to work.
-* Non-Unix Installation:: Installation on Other Operating Systems.
-* PC Installation:: Installing and Compiling @command{gawk} on
-    MS-DOS and OS/2.
-* PC Binary Installation:: Installing a prepared distribution.
-* PC Compiling:: Compiling @command{gawk} for MS-DOS,
-    Windows32, and OS/2.
-* PC Testing:: Testing @command{gawk} on PC systems.
-* PC Using:: Running @command{gawk} on MS-DOS, Windows32
-    and OS/2.
-* Cygwin:: Building and running @command{gawk} for
-    Cygwin.
-* MSYS:: Using @command{gawk} In The MSYS
-    Environment.
-* VMS Installation:: Installing @command{gawk} on VMS.
-* VMS Compilation:: How to compile @command{gawk} under VMS.
-* VMS Installation Details:: How to install @command{gawk} under VMS.
-* VMS Running:: How to run @command{gawk} under VMS.
-* VMS Old Gawk:: An old version comes with some VMS systems.
-* Bugs:: Reporting Problems and Bugs.
-* Other Versions:: Other freely available @command{awk}
-    implementations.
-* Compatibility Mode:: How to disable certain @command{gawk}
-    extensions.
-* Additions:: Making Additions To @command{gawk}.
-* Accessing The Source:: Accessing the Git repository.
-* Adding Code:: Adding code to the main body of
-    @command{gawk}.
-* New Ports:: Porting @command{gawk} to a new operating
-    system.
-* Dynamic Extensions:: Adding new built-in functions to
-    @command{gawk}.
-* Internals:: A brief look at some @command{gawk}
-    internals.
-* Plugin License:: A note about licensing.
-* Loading Extensions:: How to load dynamic extensions.
-* Sample Library:: A example of new functions.
-* Internal File Description:: What the new functions will do.
-* Internal File Ops:: The code for internal file operations.
-* Using Internal File Ops:: How to use an external extension.
-* Future Extensions:: New features that may be implemented one
-    day.
-* Basic High Level:: The high level view.
-* Basic Data Typing:: A very quick intro to data types.
-* Floating Point Issues:: Stuff to know about floating-point numbers.
-* String Conversion Precision:: The String Value Can Lie.
-* Unexpected Results:: Floating Point Numbers Are Not Abstract
-    Numbers.
-* POSIX Floating Point Problems:: Standards Versus Existing Practice.
+* History:: The history of @command{gawk} and
+    @command{awk}.
+* Names:: What name to use to find
+    @command{awk}.
+* This Manual:: Using this @value{DOCUMENT}. Includes
+    sample input files that you can use.
+* Conventions:: Typographical Conventions.
+* Manual History:: Brief history of the GNU project and
+    this @value{DOCUMENT}.
+* How To Contribute:: Helping to save the world.
+* Acknowledgments:: Acknowledgments.
+* Running gawk:: How to run @command{gawk} programs;
+    includes command-line syntax.
+* One-shot:: Running a short throwaway
+    @command{awk} program.
+* Read Terminal:: Using no input files (input from
+    terminal instead).
+* Long:: Putting permanent @command{awk}
+    programs in files.
+* Executable Scripts:: Making self-contained @command{awk}
+    programs.
+* Comments:: Adding documentation to @command{gawk}
+    programs.
+* Quoting:: More discussion of shell quoting
+    issues.
+* DOS Quoting:: Quoting in Windows Batch Files.
+* Sample Data Files:: Sample data files for use in the
+    @command{awk} programs illustrated in
+    this @value{DOCUMENT}.
+* Very Simple:: A very simple example.
+* Two Rules:: A less simple one-line example using
+    two rules.
+* More Complex:: A more complex example.
+* Statements/Lines:: Subdividing or combining statements
+    into lines.
+* Other Features:: Other Features of @command{awk}.
+* When:: When to use @command{gawk} and when to
+    use other things.
+* Command Line:: How to run @command{awk}.
+* Options:: Command-line options and their
+    meanings.
+* Other Arguments:: Input file names and variable
+    assignments.
+* Naming Standard Input:: How to specify standard input with
+    other files.
+* Environment Variables:: The environment variables
+    @command{gawk} uses.
+* AWKPATH Variable:: Searching directories for
+    @command{awk} programs.
+* AWKLIBPATH Variable:: Searching directories for
+    @command{awk} shared libraries.
+* Other Environment Variables:: The environment variables.
+* Exit Status:: @command{gawk}'s exit status.
+* Include Files:: Including other files into your
+    program.
+* Loading Shared Libraries:: Loading shared libraries into your
+    program.
+* Obsolete:: Obsolete Options and/or features.
+* Undocumented:: Undocumented Options and Features.
+* Regexp Usage:: How to Use Regular Expressions.
+* Escape Sequences:: How to write nonprinting characters.
+* Regexp Operators:: Regular Expression Operators.
+* Bracket Expressions:: What can go between @samp{[...]}.
+* GNU Regexp Operators:: Operators specific to GNU software.
+* Case-sensitivity:: How to do case-insensitive matching.
+* Leftmost Longest:: How much text matches.
+* Computed Regexps:: Using Dynamic Regexps.
+* Records:: Controlling how data is split into
+    records.
+* Fields:: An introduction to fields.
+* Nonconstant Fields:: Nonconstant Field Numbers.
+* Changing Fields:: Changing the Contents of a Field.
+* Field Separators:: The field separator and how to change
+    it.
+* Default Field Splitting:: How fields are normally separated.
+* Regexp Field Splitting:: Using regexps as the field separator.
+* Single Character Fields:: Making each character a separate
+    field.
+* Command Line Field Separator:: Setting @code{FS} from the
+    command-line.
+* Field Splitting Summary:: Some final points and a summary table.
+* Constant Size:: Reading constant width data.
+* Splitting By Content:: Defining Fields By Content
+* Multiple Line:: Reading multi-line records.
+* Getline:: Reading files under explicit program
+    control using the @code{getline}
+    function.
+* Plain Getline:: Using @code{getline} with no
+    arguments.
+* Getline/Variable:: Using @code{getline} into a variable.
+* Getline/File:: Using @code{getline} from a file.
+* Getline/Variable/File:: Using @code{getline} into a variable
+    from a file.
+* Getline/Pipe:: Using @code{getline} from a pipe.
+* Getline/Variable/Pipe:: Using @code{getline} into a variable
+    from a pipe.
+* Getline/Coprocess:: Using @code{getline} from a coprocess.
+* Getline/Variable/Coprocess:: Using @code{getline} into a variable
+    from a coprocess.
+* Getline Notes:: Important things to know about
+    @code{getline}.
+* Getline Summary:: Summary of @code{getline} Variants.
+* Read Timeout:: Reading input with a timeout.
+* Command line directories:: What happens if you put a directory on
+    the command line.
+* Print:: The @code{print} statement.
+* Print Examples:: Simple examples of @code{print}
+    statements.
+* Output Separators:: The output separators and how to
+    change them.
+* OFMT:: Controlling Numeric Output With
+    @code{print}.
+* Printf:: The @code{printf} statement.
+* Basic Printf:: Syntax of the @code{printf} statement.
+* Control Letters:: Format-control letters.
+* Format Modifiers:: Format-specification modifiers.
+* Printf Examples:: Several examples.
+* Redirection:: How to redirect output to multiple
+    files and pipes.
+* Special Files:: File name interpretation in
+    @command{gawk}. @command{gawk} allows
+    access to inherited file descriptors.
+* Special FD:: Special files for I/O.
+* Special Network:: Special files for network
+    communications.
+* Special Caveats:: Things to watch out for.
+* Close Files And Pipes:: Closing Input and Output Files and
+    Pipes.
+* Values:: Constants, Variables, and Regular
+    Expressions.
+* Constants:: String, numeric and regexp constants.
+* Scalar Constants:: Numeric and string constants.
+* Nondecimal-numbers:: What are octal and hex numbers.
+* Regexp Constants:: Regular Expression constants.
+* Using Constant Regexps:: When and how to use a regexp constant.
+* Variables:: Variables give names to values for
+    later use.
+* Using Variables:: Using variables in your programs.
+* Assignment Options:: Setting variables on the command-line
+    and a summary of command-line syntax.
+    This is an advanced method of input.
+* Conversion:: The conversion of strings to numbers
+    and vice versa.
+* All Operators:: @command{gawk}'s operators.
+* Arithmetic Ops:: Arithmetic operations (@samp{+},
+    @samp{-}, etc.)
+* Concatenation:: Concatenating strings.
+* Assignment Ops:: Changing the value of a variable or a
+    field.
+* Increment Ops:: Incrementing the numeric value of a
+    variable.
+* Truth Values and Conditions:: Testing for true and false.
+* Truth Values:: What is ``true'' and what is
+    ``false''.
+* Typing and Comparison:: How variables acquire types and how
+    this affects comparison of numbers and
+    strings with @samp{<}, etc.
+* Variable Typing:: String type versus numeric type.
+* Comparison Operators:: The comparison operators.
+* POSIX String Comparison:: String comparison with POSIX rules.
+* Boolean Ops:: Combining comparison expressions using
+    boolean operators @samp{||} (``or''),
+    @samp{&&} (``and'') and @samp{!}
+    (``not'').
+* Conditional Exp:: Conditional expressions select between
+    two subexpressions under control of a
+    third subexpression.
+* Function Calls:: A function call is an expression.
+* Precedence:: How various operators nest.
+* Locales:: How the locale affects things.
+* Pattern Overview:: What goes into a pattern.
+* Regexp Patterns:: Using regexps as patterns.
+* Expression Patterns:: Any expression can be used as a
+    pattern.
+* Ranges:: Pairs of patterns specify record
+    ranges.
+* BEGIN/END:: Specifying initialization and cleanup
+    rules.
+* Using BEGIN/END:: How and why to use BEGIN/END rules.
+* I/O And BEGIN/END:: I/O issues in BEGIN/END rules.
+* BEGINFILE/ENDFILE:: Two special patterns for advanced
+    control.
+* Empty:: The empty pattern, which matches every
+    record.
+* Using Shell Variables:: How to use shell variables with
+    @command{awk}.
+* Action Overview:: What goes into an action.
+* Statements:: Describes the various control
+    statements in detail.
+* If Statement:: Conditionally execute some
+    @command{awk} statements.
+* While Statement:: Loop until some condition is
+    satisfied.
+* Do Statement:: Do specified action while looping
+    until some condition is satisfied.
+* For Statement:: Another looping statement, that
+    provides initialization and increment
+    clauses.
+* Switch Statement:: Switch/case evaluation for conditional
+    execution of statements based on a
+    value.
+* Break Statement:: Immediately exit the innermost
+    enclosing loop.
+* Continue Statement:: Skip to the end of the innermost
+    enclosing loop.
+* Next Statement:: Stop processing the current input
+    record.
+* Nextfile Statement:: Stop processing the current file.
+* Exit Statement:: Stop execution of @command{awk}.
+* Built-in Variables:: Summarizes the built-in variables.
+* User-modified:: Built-in variables that you change to
+    control @command{awk}.
+* Auto-set:: Built-in variables where @command{awk}
+    gives you information.
+* ARGC and ARGV:: Ways to use @code{ARGC} and
+    @code{ARGV}.
+* Array Basics:: The basics of arrays.
+* Array Intro:: Introduction to Arrays
+* Reference to Elements:: How to examine one element of an
+    array.
+* Assigning Elements:: How to change an element of an array.
+* Array Example:: Basic Example of an Array
+* Scanning an Array:: A variation of the @code{for}
+    statement. It loops through the
+    indices of an array's existing
+    elements.
+* Controlling Scanning:: Controlling the order in which arrays
+    are scanned.
+* Delete:: The @code{delete} statement removes an
+    element from an array.
+* Numeric Array Subscripts:: How to use numbers as subscripts in
+    @command{awk}.
+* Uninitialized Subscripts:: Using Uninitialized variables as
+    subscripts.
+* Multi-dimensional:: Emulating multidimensional arrays in
+    @command{awk}.
+* Multi-scanning:: Scanning multidimensional arrays.
+* Arrays of Arrays:: True multidimensional arrays.
+* Built-in:: Summarizes the built-in functions.
+* Calling Built-in:: How to call built-in functions.
+* Numeric Functions:: Functions that work with numbers,
+    including @code{int()}, @code{sin()}
+    and @code{rand()}.
+* String Functions:: Functions for string manipulation,
+    such as @code{split()}, @code{match()}
+    and @code{sprintf()}.
+* Gory Details:: More than you want to know about
+    @samp{\} and @samp{&} with
+    @code{sub()}, @code{gsub()}, and
+    @code{gensub()}.
+* I/O Functions:: Functions for files and shell
+    commands.
+* Time Functions:: Functions for dealing with timestamps.
+* Bitwise Functions:: Functions for bitwise operations.
+* Type Functions:: Functions for type information.
+* I18N Functions:: Functions for string translation.
+* User-defined:: Describes User-defined functions in
+    detail.
+* Definition Syntax:: How to write definitions and what they
+    mean.
+* Function Example:: An example function definition and
+    what it does.
+* Function Caveats:: Things to watch out for.
+* Calling A Function:: Don't use spaces.
+* Variable Scope:: Controlling variable scope.
+* Pass By Value/Reference:: Passing parameters.
+* Return Statement:: Specifying the value a function
+    returns.
+* Dynamic Typing:: How variable types can change at
+    runtime.
+* Indirect Calls:: Choosing the function to call at
+    runtime.
+* Library Names:: How to best name private global
+    variables in library functions.
+* General Functions:: Functions that are of general use.
+* Strtonum Function:: A replacement for the built-in
+    @code{strtonum()} function.
+* Assert Function:: A function for assertions in
+    @command{awk} programs.
+* Round Function:: A function for rounding if
+    @code{sprintf()} does not do it
+    correctly.
+* Cliff Random Function:: The Cliff Random Number Generator.
+* Ordinal Functions:: Functions for using characters as
+    numbers and vice versa.
+* Join Function:: A function to join an array into a
+    string.
+* Getlocaltime Function:: A function to get formatted times.
+* Data File Management:: Functions for managing command-line
+    data files.
+* Filetrans Function:: A function for handling data file
+    transitions.
+* Rewind Function:: A function for rereading the current
+    file.
+* File Checking:: Checking that data files are readable.
+* Empty Files:: Checking for zero-length files.
+* Ignoring Assigns:: Treating assignments as file names.
+* Getopt Function:: A function for processing command-line
+    arguments.
+* Passwd Functions:: Functions for getting user
+    information.
+* Group Functions:: Functions for getting group + information. +* Walking Arrays:: A function to walk arrays of arrays. +* Running Examples:: How to run these examples. +* Clones:: Clones of common utilities. +* Cut Program:: The @command{cut} utility. +* Egrep Program:: The @command{egrep} utility. +* Id Program:: The @command{id} utility. +* Split Program:: The @command{split} utility. +* Tee Program:: The @command{tee} utility. +* Uniq Program:: The @command{uniq} utility. +* Wc Program:: The @command{wc} utility. +* Miscellaneous Programs:: Some interesting @command{awk} + programs. +* Dupword Program:: Finding duplicated words in a + document. +* Alarm Program:: An alarm clock. +* Translate Program:: A program similar to the @command{tr} + utility. +* Labels Program:: Printing mailing labels. +* Word Sorting:: A program to produce a word usage + count. +* History Sorting:: Eliminating duplicate entries from a + history file. +* Extract Program:: Pulling out programs from Texinfo + source files. +* Simple Sed:: A Simple Stream Editor. +* Igawk Program:: A wrapper for @command{awk} that + includes files. +* Anagram Program:: Finding anagrams from a dictionary. +* Signature Program:: People do amazing things with too much + time on their hands. +* I18N and L10N:: Internationalization and Localization. +* Explaining gettext:: How GNU @code{gettext} works. +* Programmer i18n:: Features for the programmer. +* Translator i18n:: Features for the translator. +* String Extraction:: Extracting marked strings. +* Printf Ordering:: Rearranging @code{printf} arguments. +* I18N Portability:: @command{awk}-level portability + issues. +* I18N Example:: A simple i18n example. +* Gawk I18N:: @command{gawk} is also + internationalized. +* Nondecimal Data:: Allowing nondecimal input data. +* Array Sorting:: Facilities for controlling array + traversal and sorting arrays. +* Controlling Array Traversal:: How to use PROCINFO["sorted_in"]. 
+* Array Sorting Functions:: How to use @code{asort()} and + @code{asorti()}. +* Two-way I/O:: Two-way communications with another + process. +* TCP/IP Networking:: Using @command{gawk} for network + programming. +* Profiling:: Profiling your @command{awk} programs. +* Debugging:: Introduction to @command{gawk} + debugger. +* Debugging Concepts:: Debugging in General. +* Debugging Terms:: Additional Debugging Concepts. +* Awk Debugging:: Awk Debugging. +* Sample Debugging Session:: Sample debugging session. +* Debugger Invocation:: How to Start the Debugger. +* Finding The Bug:: Finding the Bug. +* List of Debugger Commands:: Main debugger commands. +* Breakpoint Control:: Control of Breakpoints. +* Debugger Execution Control:: Control of Execution. +* Viewing And Changing Data:: Viewing and Changing Data. +* Execution Stack:: Dealing with the Stack. +* Debugger Info:: Obtaining Information about the + Program and the Debugger State. +* Miscellaneous Debugger Commands:: Miscellaneous Commands. +* Readline Support:: Readline support. +* Limitations:: Limitations and future plans. +* General Arithmetic:: An introduction to computer + arithmetic. +* Floating Point Issues:: Stuff to know about floating-point + numbers. +* String Conversion Precision:: The String Value Can Lie. +* Unexpected Results:: Floating Point Numbers Are Not + Abstract Numbers. +* POSIX Floating Point Problems:: Standards Versus Existing Practice. +* Integer Programming:: Effective integer programming. +* Floating-point Programming:: Effective Floating-point Programming. +* Floating-point Representation:: Binary floating-point representation. +* Floating-point Context:: Floating-point context. +* Rounding Mode:: Floating-point rounding mode. +* Gawk and MPFR:: How @command{gawk} provides + arbitrary-precision arithmetic. +* Arbitrary Precision Floats:: Arbitrary Precision Floating-point + Arithmetic with @command{gawk}. +* Setting Precision:: Setting the working precision. 
+* Setting Rounding Mode:: Setting the rounding mode. +* Floating-point Constants:: Representing floating-point constants. +* Changing Precision:: Changing the precision of a number. +* Exact Arithmetic:: Exact arithmetic with floating-point + numbers. +* Arbitrary Precision Integers:: Arbitrary Precision Integer Arithmetic + with @command{gawk}. +* Extension Intro:: What is an extension. +* Plugin License:: A note about licensing. +* Extension Design:: Design notes about the extension API. +* Old Extension Problems:: Problems with the old mechanism. +* Extension New Mechanism Goals:: Goals for the new mechanism. +* Extension Other Design Decisions:: Some other design decisions. +* Extension Mechanism Outline:: An outline of how it works. +* Extension Future Growth:: Some room for future growth. +* Extension API Description:: A full description of the API. +* Extension API Functions Introduction:: Introduction to the API functions. +* General Data Types:: The data types. +* Requesting Values:: How to get a value. +* Constructor Functions:: Functions for creating values. +* Registration Functions:: Functions to register things with + @command{gawk}. +* Extension Functions:: Registering extension functions. +* Exit Callback Functions:: Registering an exit callback. +* Extension Version String:: Registering a version string. +* Input Parsers:: Registering an input parser. +* Output Wrappers:: Registering an output wrapper. +* Two-way processors:: Registering a two-way processor. +* Printing Messages:: Functions for printing messages. +* Updating @code{ERRNO}:: Functions for updating @code{ERRNO}. +* Accessing Parameters:: Functions for accessing parameters. +* Symbol Table Access:: Functions for accessing global + variables. +* Symbol table by name:: Accessing variables by name. +* Symbol table by cookie:: Accessing variables by ``cookie''. +* Cached values:: Creating and using cached values. +* Array Manipulation:: Functions for working with arrays. 
+* Array Data Types:: Data types for working with arrays. +* Array Functions:: Functions for working with arrays. +* Flattening Arrays:: How to flatten arrays. +* Creating Arrays:: How to create and populate arrays. +* Extension API Variables:: Variables provided by the API. +* Extension Versioning:: API Version information. +* Extension API Informational Variables:: Variables providing information about + @command{gawk}'s invocation. +* Extension API Boilerplate:: Boilerplate code for using the API. +* Finding Extensions:: How @command{gawk} finds compiled + extensions. +* Extension Example:: Example C code for an extension. +* Internal File Description:: What the new functions will do. +* Internal File Ops:: The code for internal file operations. +* Using Internal File Ops:: How to use an external extension. +* Extension Samples:: The sample extensions that ship with + @code{gawk}. +* Extension Sample File Functions:: The file functions sample. +* Extension Sample Fnmatch:: An interface to @code{fnmatch()}. +* Extension Sample Fork:: An interface to @code{fork()} and + other process functions. +* Extension Sample Ord:: Character to value to character + conversions. +* Extension Sample Readdir:: An interface to @code{readdir()}. +* Extension Sample Revout:: Reversing output sample output + wrapper. +* Extension Sample Rev2way:: Reversing data sample two-way + processor. +* Extension Sample Read write array:: Serializing an array to a file. +* Extension Sample Readfile:: Reading an entire file into a string. +* Extension Sample API Tests:: Tests for the API. +* Extension Sample Time:: An interface to @code{gettimeofday()} + and @code{sleep()}. +* gawkextlib:: The @code{gawkextlib} project. +* V7/SVR3.1:: The major changes between V7 and + System V Release 3.1. +* SVR4:: Minor changes between System V + Releases 3.1 and 4. +* POSIX:: New features from the POSIX standard. +* BTL:: New features from Brian Kernighan's + version of @command{awk}. 
+* POSIX/GNU:: The extensions in @command{gawk} not + in POSIX @command{awk}. +* Common Extensions:: Common Extensions Summary. +* Ranges and Locales:: How locales used to affect regexp + ranges. +* Contributors:: The major contributors to + @command{gawk}. +* Gawk Distribution:: What is in the @command{gawk} + distribution. +* Getting:: How to get the distribution. +* Extracting:: How to extract the distribution. +* Distribution contents:: What is in the distribution. +* Unix Installation:: Installing @command{gawk} under + various versions of Unix. +* Quick Installation:: Compiling @command{gawk} under Unix. +* Additional Configuration Options:: Other compile-time options. +* Configuration Philosophy:: How it's all supposed to work. +* Non-Unix Installation:: Installation on Other Operating + Systems. +* PC Installation:: Installing and Compiling + @command{gawk} on MS-DOS and OS/2. +* PC Binary Installation:: Installing a prepared distribution. +* PC Compiling:: Compiling @command{gawk} for MS-DOS, + Windows32, and OS/2. +* PC Testing:: Testing @command{gawk} on PC systems. +* PC Using:: Running @command{gawk} on MS-DOS, + Windows32 and OS/2. +* Cygwin:: Building and running @command{gawk} + for Cygwin. +* MSYS:: Using @command{gawk} In The MSYS + Environment. +* VMS Installation:: Installing @command{gawk} on VMS. +* VMS Compilation:: How to compile @command{gawk} under + VMS. +* VMS Installation Details:: How to install @command{gawk} under + VMS. +* VMS Running:: How to run @command{gawk} under VMS. +* VMS Old Gawk:: An old version comes with some VMS + systems. +* Bugs:: Reporting Problems and Bugs. +* Other Versions:: Other freely available @command{awk} + implementations. +* Compatibility Mode:: How to disable certain @command{gawk} + extensions. +* Additions:: Making Additions To @command{gawk}. +* Accessing The Source:: Accessing the Git repository. +* Adding Code:: Adding code to the main body of + @command{gawk}. 
+* New Ports:: Porting @command{gawk} to a new + operating system. +* Derived Files:: Why derived files are kept in the + @command{git} repository. +* Future Extensions:: New features that may be implemented + one day. +* Implementation Limitations:: Some limitations of the implementation. +* Basic High Level:: The high level view. +* Basic Data Typing:: A very quick intro to data types. @end detailmenu @end menu @@ -1125,6 +1252,12 @@ expert should find useful. In particular, the description of POSIX @ref{Sample Programs}, should be of interest. +This @value{DOCUMENT} is split into several parts, as follows: + +Part I describes the @command{awk} language and @command{gawk} program in detail. +It starts with the basics, and continues through all of the features of @command{awk}. +It contains the following chapters: + @ref{Getting Started}, provides the essentials you need to know to begin using @command{awk}. @@ -1168,6 +1301,22 @@ describes the built-in functions @command{awk} and @command{gawk} provide, as well as how to define your own functions. +Part II shows how to use @command{awk} and @command{gawk} for problem solving. +There is lots of code here for you to read and learn from. +It contains the following chapters: + +@ref{Library Functions}, which provides a number of functions meant to +be used from main @command{awk} programs. + +@ref{Sample Programs}, +which provides many sample @command{awk} programs. + +Reading these two chapters allows you to see @command{awk} +solving real problems. + +Part III focuses on features specific to @command{gawk}. +It contains the following chapters: + @ref{Internationalization}, describes special features in @command{gawk} for translating program messages into different languages at runtime. @@ -1179,14 +1328,19 @@ are the abilities to have two-way communications with another process, perform TCP/IP networking, and profile your @command{awk} programs. 
-@ref{Library Functions}, and -@ref{Sample Programs}, -provide many sample @command{awk} programs. -Reading them allows you to see @command{awk} -solving real problems. - @ref{Debugger}, describes the @command{awk} debugger. +@ref{Arbitrary Precision Arithmetic}, +describes advanced arithmetic facilities provided by +@command{gawk}. + +@ref{Dynamic Extensions}, describes how to add new variables and +functions to @command{gawk} by writing extensions in C. + +Part IV provides the appendices, the Glossary, and two licenses that cover +the @command{gawk} source code and this @value{DOCUMENT}, respectively. +It contains the following appendices: + @ref{Language History}, describes how the @command{awk} language has evolved since its first release to present. It also describes how @command{gawk} @@ -1203,8 +1357,7 @@ available @command{awk} implementations. @ref{Notes}, describes how to disable @command{gawk}'s extensions, as well as how to contribute new code to @command{gawk}, -how to write extension libraries, and some possible -future directions for @command{gawk} development. +and some possible future directions for @command{gawk} development. @ref{Basic Concepts}, provides some very cursory background material for those who @@ -1648,12 +1801,14 @@ Nof Ayalon @* ISRAEL @* March, 2011 -@ignore -@c Try this @iftex -@page -@headings off -@majorheading I@ @ @ @ The @command{awk} Language and @command{gawk} +@part Part I:@* The @command{awk} Language +@end iftex + +@ignore +@ifdocbook +@part Part I:@* The @command{awk} Language + Part I describes the @command{awk} language and @command{gawk} program in detail. It starts with the basics, and continues through all of the features of @command{awk} and @command{gawk}. It contains the following chapters: @@ -1663,6 +1818,9 @@ and @command{gawk}. It contains the following chapters: @ref{Getting Started}. @item +@ref{Invoking Gawk}. + +@item @ref{Regexp}. @item @@ -1682,21 +1840,8 @@ and @command{gawk}. 
It contains the following chapters: @item @ref{Functions}. - -@item -@ref{Internationalization}. - -@item -@ref{Advanced Features}. - -@item -@ref{Invoking Gawk}. @end itemize - -@page -@evenheading @thispage@ @ @ @strong{@value{TITLE}} @| @| -@oddheading @| @| @strong{@thischapter}@ @ @ @thispage -@end iftex +@end ifdocbook @end ignore @node Getting Started @@ -2927,6 +3072,7 @@ things in this @value{CHAPTER} that don't interest you right now. * Environment Variables:: The environment variables @command{gawk} uses. * Exit Status:: @command{gawk}'s exit status. * Include Files:: Including other files into your program. +* Loading Shared Libraries:: Loading shared libraries into your program. * Obsolete:: Obsolete Options and/or features. * Undocumented:: Undocumented Options and Features. @end menu @@ -3018,6 +3164,22 @@ This option may be given multiple times; the @command{awk} program consists of the concatenation of the contents of each specified @var{source-file}. +@item -i @var{source-file} +@itemx --include @var{source-file} +@cindex @code{-i} option +@cindex @code{--include} option +@cindex @command{awk} programs, location of +Read @command{awk} source library from @var{source-file}. This option is +completely equivalent to using the @samp{@@include} directive inside +your program. This option is very +similar to the @option{-f} option, but there are two important differences. +First, when @option{-i} is used, the program source will not be loaded if it has +been previously loaded, whereas the @option{-f} option will always load the file. +Second, because this option is intended to be used with code libraries, the +@command{awk} command does not recognize such files as constituting main program +input. Thus, after processing an @option{-i} argument, we still expect to +find the main source code via the @option{-f} option or on the command-line.
+ @item -v @var{var}=@var{val} @itemx --assign @var{var}=@var{val} @cindex @code{-v} option @@ -3078,6 +3240,9 @@ The following list describes @command{gawk}-specific options: @cindex @code{-b} option @cindex @code{--characters-as-bytes} option Cause @command{gawk} to treat all input data as single-byte characters. +In addition, all output written with @code{print} or @code{printf} +is treated as single-byte characters. + Normally, @command{gawk} follows the POSIX standard and attempts to process its input data according to the current locale. This can often involve converting multibyte characters into wide characters (internally), and @@ -3209,9 +3374,12 @@ that @command{gawk} accepts and then exit. @cindex @code{-l} option @cindex @code{--load} option @cindex loading, library -Load a shared library @var{lib}. This searches for the library using the @env{AWKPATH} -environment variable. The suffix @samp{.so} in the library name is optional. -The library initialization routine should be named @code{dlload()}. +Load a shared library @var{lib}. This searches for the library using the @env{AWKLIBPATH} +environment variable. The correct library suffix for your platform will be +supplied by default, so it need not be specified in the library name. +The library initialization routine should be named @code{dl_load()}. +An alternative is to use the @samp{@@load} keyword inside the program to load +a shared library. @item -L @r{[}value@r{]} @itemx --lint@r{[}=value@r{]} @@ -3590,6 +3758,8 @@ behaves. @menu * AWKPATH Variable:: Searching directories for @command{awk} programs. +* AWKLIBPATH Variable:: Searching directories for @command{awk} shared + libraries. * Other Environment Variables:: The environment variables. @end menu @@ -3607,7 +3777,8 @@ on the command-line with the @option{-f} option. In most @command{awk} implementations, you must supply a precise path name for each program file, unless the file is in the current directory.
-But in @command{gawk}, if the @value{FN} supplied to the @option{-f} option +But in @command{gawk}, if the @value{FN} supplied to the @option{-f} +or @option{-i} options does not contain a @samp{/}, then @command{gawk} searches a list of directories (called the @dfn{search path}), one by one, looking for a file with the specified name. @@ -3629,13 +3800,16 @@ standard directory in the default path and then specified on the command line with a short @value{FN}. Otherwise, the full @value{FN} would have to be typed for each file. -By using both the @option{--source} and @option{-f} options, your command-line +By using the @option{-i} option, or the @option{--source} and @option{-f} options, your command-line @command{awk} programs can use facilities in @command{awk} library files (@pxref{Library Functions}). Path searching is not done if @command{gawk} is in compatibility mode. This is true for both @option{--traditional} and @option{--posix}. @xref{Options}. +If the source code is not found after the initial search, the path is searched +again after adding the default @samp{.awk} suffix to the filename. + @quotation NOTE To include the current directory in the path, either place @@ -3665,6 +3839,21 @@ sense: the @env{AWKPATH} environment variable is used to find the program source files. Once your program is running, all the files have been found, and @command{gawk} no longer needs to use @env{AWKPATH}. +@node AWKLIBPATH Variable +@subsection The @env{AWKLIBPATH} Environment Variable +@cindex @env{AWKLIBPATH} environment variable +@cindex directories, searching +@cindex search paths +@cindex search paths, for shared libraries +@cindex differences in @command{awk} and @command{gawk}, @code{AWKLIBPATH} environment variable + +The @env{AWKLIBPATH} environment variable is similar to the @env{AWKPATH} +variable, but it is used to search for shared libraries specified +with the @option{-l} option rather than for source files. 
If the library +is not found, the path is searched again after adding the appropriate +shared library suffix for the platform. For example, on GNU/Linux systems, +the suffix @samp{.so} is used. + @node Other Environment Variables @subsection Other Environment Variables @@ -3767,7 +3956,8 @@ code from various @command{awk} scripts. In other words, you can group together @command{awk} functions, used to carry out specific tasks, into external files. These files can be used just like function libraries, using the @samp{@@include} keyword in conjunction with the @env{AWKPATH} -environment variable. +environment variable. Note that source files may also be included +using the @option{-i} option. Let's see an example. We'll start with two (trivial) @command{awk} scripts, namely @@ -3873,6 +4063,41 @@ As mentioned in @ref{AWKPATH Variable}, the current directory is always searched first for source files, before searching in @env{AWKPATH}, and this also applies to files named with @samp{@@include}. +@node Loading Shared Libraries +@section Loading Shared Libraries Into Your Program + +This @value{SECTION} describes a feature that is specific to @command{gawk}. + +The @samp{@@load} keyword can be used to read external @command{awk} shared +libraries. This allows you to link in compiled code that may offer superior +performance and/or give you access to extended capabilities not supported +by the @command{awk} language. The @env{AWKLIBPATH} variable is used to +search for the shared library. Using @samp{@@load} is completely equivalent +to using the @option{-l} command-line option. + +If the shared library is not initially found in @env{AWKLIBPATH}, another +search is conducted after appending the platform's default shared library +suffix to the filename. For example, on GNU/Linux systems, the suffix +@samp{.so} is used. 
+ +@example +$ @kbd{gawk '@@load "ordchr"; BEGIN @{print chr(65)@}'} +@print{} A +@end example + +@noindent +This is equivalent to the following example: + +@example +$ @kbd{gawk -lordchr 'BEGIN @{print chr(65)@}'} +@print{} A +@end example + +@noindent +For command-line usage, the @option{-l} option is more convenient, +but @samp{@@load} is useful for embedding inside an @command{awk} source file +that requires access to a shared library. + @node Obsolete @section Obsolete Options and/or Features @@ -3969,31 +4194,6 @@ long-undocumented ``feature'' of Unix @code{awk}. @end ignore -@ignore -@c Try this -@iftex -@page -@headings off -@majorheading II@ @ @ Using @command{awk} and @command{gawk} -Part II shows how to use @command{awk} and @command{gawk} for problem solving. -There is lots of code here for you to read and learn from. -It contains the following chapters: - -@itemize @bullet -@item -@ref{Library Functions}. - -@item -@ref{Sample Programs}. - -@end itemize - -@page -@evenheading @thispage@ @ @ @strong{@value{TITLE}} @| @| -@oddheading @| @| @strong{@thischapter}@ @ @ @thispage -@end iftex -@end ignore - @node Regexp @chapter Regular Expressions @cindex regexp, See regular expressions @@ -5180,7 +5380,6 @@ used with it do not have to be named on the @command{awk} command line * Getline:: Reading files under explicit program control using the @code{getline} function. * Read Timeout:: Reading input with a timeout. - * Command line directories:: What happens if you put a directory on the command line. @end menu @@ -5306,16 +5505,22 @@ awk '@{ print $0 @}' RS="/" BBS-list This sets @code{RS} to @samp{/} before processing @file{BBS-list}. Using an unusual character such as @samp{/} for the record separator -produces correct behavior in the vast majority of cases. However, -the following (extreme) pipeline prints a surprising @samp{1}: +produces correct behavior in the vast majority of cases. 
+ +There is one unusual case, that occurs when @command{gawk} is +being fully POSIX-compliant (@pxref{Options}). +Then, the following (extreme) pipeline prints a surprising @samp{1}: @example -$ echo | awk 'BEGIN @{ RS = "a" @} ; @{ print NF @}' +$ echo | gawk --posix 'BEGIN @{ RS = "a" @} ; @{ print NF @}' @print{} 1 @end example There is one field, consisting of a newline. The value of the built-in variable @code{NF} is the number of fields in the current record. +(In the normal case, @command{gawk} treats the newline as whitespace, +printing @samp{0} as the result. Most other versions of @command{awk} +also act this way.) @cindex dark corner, input files Reaching the end of an input file terminates the current input record, @@ -7230,6 +7435,34 @@ trying to accomplish. It is worth noting that those variants which do not use redirection can cause @code{FILENAME} to be updated if they cause @command{awk} to start reading a new input file. + +@item +If the variable being assigned is an expression with side effects, +different versions of @command{awk} behave differently upon encountering +end-of-file. Some versions don't evaluate the expression; many versions +(including @command{gawk}) do. Here is an example, due to Duncan Moore: + +@ignore +Date: Sun, 01 Apr 2012 11:49:33 +0100 +From: Duncan Moore <duncan.moore@@gmx.com> +@end ignore + +@example +BEGIN @{ + system("echo 1 > f") + while ((getline a[++c] < "f") > 0) @{ @} + print c +@} +@end example + +@noindent +Here, the side effect is the @samp{++c}. Is @code{c} incremented if +end of file is encountered, before the element in @code{a} is assigned? + +@command{gawk} treats @code{getline} like a function call, and evaluates +the expression @samp{a[++c]} before attempting to read from @file{f}. +Other versions of @command{awk} only evaluate the expression once they +know that there is a string value to be assigned. Caveat Emptor. 
@end itemize @node Getline Summary @@ -7240,9 +7473,10 @@ can cause @code{FILENAME} to be updated if they cause summarizes the eight variants of @code{getline}, listing which built-in variables are set by each one, and whether the variant is standard or a @command{gawk} extension. +Note: for each variant, @command{gawk} sets the @code{RT} built-in variable. @float Table,table-getline-variants -@caption{getline Variants and What They Set} +@caption{@code{getline} Variants and What They Set} @multitable @columnfractions .33 .38 .27 @headitem Variant @tab Effect @tab Standard / Extension @item @code{getline} @tab Sets @code{$0}, @code{NF}, @code{FNR}, and @code{NR} @tab Standard @@ -11481,9 +11715,9 @@ fatal error. @item If you have written extensions that modify the record handling (by inserting -an ``open hook''), you can invoke them at this point, before @command{gawk} +an ``input parser''), you can invoke them at this point, before @command{gawk} has started processing the file. (This is a @emph{very} advanced feature, -currently used only by the @uref{http://xmlgawk.sourceforge.net, XMLgawk project}.) +currently used only by the @uref{http://gawkextlib.sourceforge.net, @code{gawkextlib} project}.) @end itemize The @code{ENDFILE} rule is called when @command{gawk} has finished processing @@ -12275,44 +12509,46 @@ function body reads the next record and starts processing it with the first rule in the program. @node Nextfile Statement -@subsection Using @command{gawk}'s @code{nextfile} Statement +@subsection The @code{nextfile} Statement @cindex @code{nextfile} statement @cindex differences in @command{awk} and @command{gawk}, @code{next}/@code{nextfile} statements @cindex common extensions, @code{nextfile} statement @cindex extensions, common@comma{} @code{nextfile} statement -@command{gawk} provides the @code{nextfile} statement, -which is similar to the @code{next} statement. 
@value{COMMONEXT} +The @code{nextfile} statement +is similar to the @code{next} statement. However, instead of abandoning processing of the current record, the -@code{nextfile} statement instructs @command{gawk} to stop processing the +@code{nextfile} statement instructs @command{awk} to stop processing the current @value{DF}. -The @code{nextfile} statement is a @command{gawk} extension. -In most other @command{awk} implementations, -or if @command{gawk} is in compatibility mode -(@pxref{Options}), -@code{nextfile} is not special. - Upon execution of the @code{nextfile} statement, -any @code{ENDFILE} rules are executed except in the case as -mentioned below, @code{FILENAME} is +@code{FILENAME} is updated to the name of the next @value{DF} listed on the command line, -@code{FNR} is reset to one, @code{ARGIND} is incremented, -any @code{BEGINFILE} rules are executed, and processing +@code{FNR} is reset to one, +and processing starts over with the first rule in the program. -(@code{ARGIND} hasn't been introduced yet. @xref{Built-in Variables}.) If the @code{nextfile} statement causes the end of the input to be reached, then the code in any @code{END} rules is executed. An exception to this is -when the @code{nextfile} is invoked during execution of any statement in an +when @code{nextfile} is invoked during execution of any statement in an @code{END} rule; in this case, it causes the program to stop immediately. @xref{BEGIN/END}. The @code{nextfile} statement is useful when there are many @value{DF}s to process but it isn't necessary to process every record in every file. -Normally, in order to move on to the next @value{DF}, a program -has to continue scanning the unwanted records. The @code{nextfile} +Without @code{nextfile}, +in order to move on to the next @value{DF}, a program +would have to continue scanning the unwanted records. The @code{nextfile} statement accomplishes this much more efficiently.
-In addition, @code{nextfile} is useful inside a @code{BEGINFILE} +In @command{gawk}, execution of @code{nextfile} causes additional things +to happen: +any @code{ENDFILE} rules are executed except in the case +mentioned below, +@code{ARGIND} is incremented, +and +any @code{BEGINFILE} rules are executed. +(@code{ARGIND} hasn't been introduced yet. @xref{Built-in Variables}.) + +With @command{gawk}, @code{nextfile} is useful inside a @code{BEGINFILE} rule to skip over a file that would otherwise cause @command{gawk} to exit with a fatal error. In this case, @code{ENDFILE} rules are not executed. @xref{BEGINFILE/ENDFILE}. @@ -12323,6 +12559,13 @@ reserved for closing files, pipes, and coprocesses that are opened with redirections. It is not related to the main processing that @command{awk} does with the files listed in @code{ARGV}. +@quotation NOTE +For many years, @code{nextfile} was a +@command{gawk} extension. As of September, 2012, it was accepted for +inclusion into the POSIX standard. +See @uref{http://austingroupbugs.net/view.php?id=607, the Austin Group website}. +@end quotation + @cindex functions, user-defined, @code{next}/@code{nextfile} statements and @cindex @code{nextfile} statement, user-defined functions and The current version of Brian Kernighan's @command{awk} (@pxref{Other @@ -12794,7 +13037,9 @@ does not affect the environment passed on to any programs that Some operating systems may not have environment variables. On such systems, the @code{ENVIRON} array is empty (except for @w{@code{ENVIRON["AWKPATH"]}}, -@pxref{AWKPATH Variable}). +@pxref{AWKPATH Variable} and +@w{@code{ENVIRON["AWKLIBPATH"]}}, +@pxref{AWKLIBPATH Variable}). @cindex @command{gawk}, @code{ERRNO} variable in @cindex @code{ERRNO} variable @@ -12870,6 +13115,16 @@ assigning a value to @code{NF} has the potential to affect to @code{NF} can be used to create or remove fields from the current record. @xref{Changing Fields}.
+@cindex @code{FUNCTAB} array +@cindex @command{gawk}, @code{FUNCTAB} array in +@cindex differences in @command{awk} and @command{gawk}, @code{FUNCTAB} variable +@item FUNCTAB # +An array whose indices are the names of all the user-defined +or extension functions in the program. +@strong{NOTE}: The array values cannot currently be used. +Also, you may not use the @code{delete} statement with the +@code{FUNCTAB} array. + @cindex @code{NR} variable @item NR The number of input records @command{awk} has processed since @@ -12899,6 +13154,34 @@ This is @code{"FIELDWIDTHS"} if field splitting with @code{FIELDWIDTHS} is in effect, or @code{"FPAT"} if field matching with @code{FPAT} is in effect. +@item PROCINFO["identifiers"] +A subarray, indexed by the names of all identifiers used in the +text of the AWK program. For each identifier, the value of the element is one of the following: + +@table @code +@item "array" +The identifier is an array. + +@item "extension" +The identifier is an extension function loaded via +@code{@@load}. + +@item "scalar" +The identifier is a scalar. + +@item "untyped" +The identifier is untyped (could be used as a scalar or array, +@command{gawk} doesn't know yet). + +@item "user" +The identifier is a user-defined function. +@end table + +@noindent +The values indicate what @command{gawk} knows about the identifiers +after it has finished parsing the program; they are @emph{not} updated +while the program runs. + @item PROCINFO["gid"] The value of the @code{getgid()} system call. @@ -12997,6 +13280,57 @@ In other @command{awk} implementations, or if @command{gawk} is in compatibility mode (@pxref{Options}), it is not special. + +@cindex @command{gawk}, @code{SYMTAB} array in +@cindex @code{SYMTAB} array +@cindex differences in @command{awk} and @command{gawk}, @code{SYMTAB} variable +@item SYMTAB # +An array whose indices are the names of all currently defined +global variables and arrays in the program. 
The array may be used +for indirect access to read or write the value of a variable: + +@example +foo = 5 +SYMTAB["foo"] = 4 +print foo # prints 4 +@end example + +@noindent +The @code{isarray()} function (@pxref{Type Functions}) may be used to test +if an element in @code{SYMTAB} is an array. +Also, you may not use the @code{delete} statement with the +@code{SYMTAB} array. + +You may use an index for @code{SYMTAB} that is not a predefined identifier: + +@example +SYMTAB["xxx"] = 5 +print SYMTAB["xxx"] +@end example + +@noindent +This works as expected: in this case @code{SYMTAB} acts just like +a regular array. The only difference is that you can't then delete +@code{SYMTAB["xxx"]}. + +The @code{SYMTAB} array is more interesting than it looks. Andrew Schorr +points out that it effectively gives @command{awk} data pointers. Consider his +example: + +@example +# Indirect multiply of any variable by amount, return result + +function multiply(variable, amount) +@{ + return SYMTAB[variable] *= amount +@} +@end example + +@quotation NOTE +In order to avoid severe time-travel paradoxes@footnote{Not to mention difficult +implementation issues.}, neither @code{FUNCTAB} nor @code{SYMTAB} +are available as elements within the @code{SYMTAB} array. +@end quotation @end table @c ENDOFRANGE bvconi @c ENDOFRANGE vbconi @@ -13815,21 +14149,28 @@ is not in the array is deleted. @cindex deleting entire arrays @cindex differences in @command{awk} and @command{gawk}, array elements, deleting All the elements of an array may be deleted with a single statement -@value{COMMONEXT} by leaving off the subscript in the @code{delete} statement, as follows: + @example delete @var{array} @end example -This ability is a @command{gawk} extension; it is not available in -compatibility mode (@pxref{Options}). - Using this version of the @code{delete} statement is about three times more efficient than the equivalent loop that deletes each element one at a time.
+@quotation NOTE +For many years, +using @code{delete} without a subscript was a @command{gawk} extension. +As of September, 2012, it was accepted for +inclusion into the POSIX standard. See @uref{http://austingroupbugs.net/view.php?id=544, +the Austin Group website}. This form of the @code{delete} statement is also supported +by Brian Kernighan's @command{awk} and @command{mawk}, as well as +by a number of other implementations (@pxref{Other Versions}). +@end quotation + @cindex portability, deleting array elements @cindex Brennan, Michael The following statement provides a portable but nonobvious way to clear @@ -15349,7 +15690,7 @@ output literally. The interpretation of @samp{\} and @samp{&} then becomes as shown in @ref{table-sub-posix-92}. @float Table,table-sub-posix-92 -@caption{1992 POSIX Rules for sub and gsub Escape Sequence Processing} +@caption{1992 POSIX Rules for @code{sub()} and @code{gsub()} Escape Sequence Processing} @c thanks to Karl Berry for formatting this table @tex \vbox{\bigskip @@ -15418,7 +15759,7 @@ to produce a @samp{\} preceding the matched text. This is shown in @ref{table-sub-proposed}. @float Table,table-sub-proposed -@caption{Proposed rules for sub and backslash} +@caption{Proposed Rules For @code{sub()} And Backslash} @tex \vbox{\bigskip % This table has lots of &'s and \'s, so unspecialize them. @@ -15480,7 +15821,7 @@ by anything else is not special; the @samp{\} is placed straight into the output These rules are presented in @ref{table-posix-sub}. @float Table,table-posix-sub -@caption{POSIX rules for @code{sub()} and @code{gsub()}} +@caption{POSIX Rules For @code{sub()} And @code{gsub()}} @tex \vbox{\bigskip % This table has lots of &'s and \'s, so unspecialize them. @@ -15548,7 +15889,7 @@ appears in the generated text and the @samp{\} does not, as shown in @ref{table-gensub-escapes}. 
@float Table,table-gensub-escapes -@caption{Escape Sequence Processing for @code{gensub()}} +@caption{Escape Sequence Processing For @code{gensub()}} @tex \vbox{\bigskip % This table has lots of &'s and \'s, so unspecialize them. @@ -16388,8 +16729,8 @@ bitwise operations just described. They are: @cindex @command{gawk}, bitwise operations in @table @code @cindex @code{and()} function (@command{gawk}) -@item and(@var{v1}, @var{v2}) -Return the bitwise AND of the values provided by @var{v1} and @var{v2}. +@item and(@var{v1}, @var{v2} @r{[}, @r{@dots{}]}) +Return the bitwise AND of the arguments. There must be at least two. @cindex @code{compl()} function (@command{gawk}) @item compl(@var{val}) @@ -16400,16 +16741,16 @@ Return the bitwise complement of @var{val}. Return the value of @var{val}, shifted left by @var{count} bits. @cindex @code{or()} function (@command{gawk}) -@item or(@var{v1}, @var{v2}) -Return the bitwise OR of the values provided by @var{v1} and @var{v2}. +@item or(@var{v1}, @var{v2} @r{[}, @r{@dots{}]}) +Return the bitwise OR of the arguments. There must be at least two. @cindex @code{rshift()} function (@command{gawk}) @item rshift(@var{val}, @var{count}) Return the value of @var{val}, shifted right by @var{count} bits. @cindex @code{xor()} function (@command{gawk}) -@item xor(@var{v1}, @var{v2}) -Return the bitwise XOR of the values provided by @var{v1} and @var{v2}. +@item xor(@var{v1}, @var{v2} @r{[}, @r{@dots{}]}) +Return the bitwise XOR of the arguments. There must be at least two. @end table For all of these functions, first the double precision floating-point value is @@ -16974,6 +17315,47 @@ foo's i=1 top's i=10 @end example +Besides scalar values (strings and numbers), you may also have +local arrays. By using a parameter name as an array, @command{awk} +treats it as an array, and it is local to the function. +In addition, recursive calls create new arrays. 
+Consider this example: + +@example +function some_func(p1, a) +@{ + if (p1++ > 3) + return + + a[p1] = p1 + + some_func(p1) + + printf("At level %d, index %d %s found in a\n", + p1, (p1 - 1), (p1 - 1) in a ? "is" : "is not") + printf("At level %d, index %d %s found in a\n", + p1, p1, p1 in a ? "is" : "is not") + print "" +@} + +BEGIN @{ + some_func(1) +@} +@end example + +When run, this program produces the following output: + +@example +At level 4, index 3 is not found in a +At level 4, index 4 is found in a + +At level 3, index 2 is not found in a +At level 3, index 3 is found in a + +At level 2, index 1 is not found in a +At level 2, index 2 is found in a +@end example + @node Pass By Value/Reference @subsubsection Passing Function Arguments By Value Or By Reference @@ -17579,2748 +17961,28 @@ for (i = 1; i <= n; i++) @c ENDOFRANGE funcud -@node Internationalization -@chapter Internationalization with @command{gawk} - -Once upon a time, computer makers -wrote software that worked only in English. -Eventually, hardware and software vendors noticed that if their -systems worked in the native languages of non-English-speaking -countries, they were able to sell more systems. -As a result, internationalization and localization -of programs and software systems became a common practice. - -@c STARTOFRANGE inloc -@cindex internationalization, localization -@cindex @command{gawk}, internationalization and, See internationalization -@cindex internationalization, localization, @command{gawk} and -For many years, the ability to provide internationalization -was largely restricted to programs written in C and C++. -This @value{CHAPTER} describes the underlying library @command{gawk} -uses for internationalization, as well as how -@command{gawk} makes internationalization -features available at the @command{awk} program level. 
-Having internationalization available at the @command{awk} level -gives software developers additional flexibility---they are no -longer forced to write in C or C++ when internationalization is -a requirement. - -@menu -* I18N and L10N:: Internationalization and Localization. -* Explaining gettext:: How GNU @code{gettext} works. -* Programmer i18n:: Features for the programmer. -* Translator i18n:: Features for the translator. -* I18N Example:: A simple i18n example. -* Gawk I18N:: @command{gawk} is also internationalized. -@end menu - -@node I18N and L10N -@section Internationalization and Localization - -@cindex internationalization -@cindex localization, See internationalization@comma{} localization -@cindex localization -@dfn{Internationalization} means writing (or modifying) a program once, -in such a way that it can use multiple languages without requiring -further source-code changes. -@dfn{Localization} means providing the data necessary for an -internationalized program to work in a particular language. -Most typically, these terms refer to features such as the language -used for printing error messages, the language used to read -responses, and information related to how numerical and -monetary values are printed and read. - -@node Explaining gettext -@section GNU @code{gettext} - -@cindex internationalizing a program -@c STARTOFRANGE gettex -@cindex @code{gettext} library -The facilities in GNU @code{gettext} focus on messages; strings printed -by a program, either directly or via formatting with @code{printf} or -@code{sprintf()}.@footnote{For some operating systems, the @command{gawk} -port doesn't support GNU @code{gettext}. -Therefore, these features are not available -if you are using one of those operating systems. Sorry.} - -@cindex portability, @code{gettext} library and -When using GNU @code{gettext}, each application has its own -@dfn{text domain}. This is a unique name, such as @samp{kpilot} or @samp{gawk}, -that identifies the application. 
-A complete application may have multiple components---programs written -in C or C++, as well as scripts written in @command{sh} or @command{awk}. -All of the components use the same text domain. - -To make the discussion concrete, assume we're writing an application -named @command{guide}. Internationalization consists of the -following steps, in this order: - -@enumerate -@item -The programmer goes -through the source for all of @command{guide}'s components -and marks each string that is a candidate for translation. -For example, @code{"`-F': option required"} is a good candidate for translation. -A table with strings of option names is not (e.g., @command{gawk}'s -@option{--profile} option should remain the same, no matter what the local -language). - -@cindex @code{textdomain()} function (C library) -@item -The programmer indicates the application's text domain -(@code{"guide"}) to the @code{gettext} library, -by calling the @code{textdomain()} function. - -@cindex @code{.pot} files -@cindex files, @code{.pot} -@cindex portable object template files -@cindex files, portable object template -@item -Messages from the application are extracted from the source code and -collected into a portable object template file (@file{guide.pot}), -which lists the strings and their translations. -The translations are initially empty. -The original (usually English) messages serve as the key for -lookup of the translations. - -@cindex @code{.po} files -@cindex files, @code{.po} -@cindex portable object files -@cindex files, portable object -@item -For each language with a translator, @file{guide.pot} -is copied to a portable object file (@code{.po}) -and translations are created and shipped with the application. -For example, there might be a @file{fr.po} for a French translation. 
- -@cindex @code{.mo} files -@cindex files, @code{.mo} -@cindex message object files -@cindex files, message object -@item -Each language's @file{.po} file is converted into a binary -message object (@file{.mo}) file. -A message object file contains the original messages and their -translations in a binary format that allows fast lookup of translations -at runtime. - -@item -When @command{guide} is built and installed, the binary translation files -are installed in a standard place. - -@cindex @code{bindtextdomain()} function (C library) -@item -For testing and development, it is possible to tell @code{gettext} -to use @file{.mo} files in a different directory than the standard -one by using the @code{bindtextdomain()} function. - -@cindex @code{.mo} files, specifying directory of -@cindex files, @code{.mo}, specifying directory of -@cindex message object files, specifying directory of -@cindex files, message object, specifying directory of -@item -At runtime, @command{guide} looks up each string via a call -to @code{gettext()}. The returned string is the translated string -if available, or the original string if not. - -@item -If necessary, it is possible to access messages from a different -text domain than the one belonging to the application, without -having to switch the application's default text domain back -and forth. -@end enumerate - -@cindex @code{gettext()} function (C library) -In C (or C++), the string marking and dynamic translation lookup -are accomplished by wrapping each string in a call to @code{gettext()}: - -@example -printf("%s", gettext("Don't Panic!\n")); -@end example - -The tools that extract messages from source code pull out all -strings enclosed in calls to @code{gettext()}. 
- -@cindex @code{_} (underscore), @code{_} C macro -@cindex underscore (@code{_}), @code{_} C macro -The GNU @code{gettext} developers, recognizing that typing -@samp{gettext(@dots{})} over and over again is both painful and ugly to look -at, use the macro @samp{_} (an underscore) to make things easier: - -@example -/* In the standard header file: */ -#define _(str) gettext(str) - -/* In the program text: */ -printf("%s", _("Don't Panic!\n")); -@end example - -@cindex internationalization, localization, locale categories -@cindex @code{gettext} library, locale categories -@cindex locale categories -@noindent -This reduces the typing overhead to just three extra characters per string -and is considerably easier to read as well. - -There are locale @dfn{categories} -for different types of locale-related information. -The defined locale categories that @code{gettext} knows about are: - -@table @code -@cindex @code{LC_MESSAGES} locale category -@item LC_MESSAGES -Text messages. This is the default category for @code{gettext} -operations, but it is possible to supply a different one explicitly, -if necessary. (It is almost never necessary to supply a different category.) - -@cindex sorting characters in different languages -@cindex @code{LC_COLLATE} locale category -@item LC_COLLATE -Text-collation information; i.e., how different characters -and/or groups of characters sort in a given language. - -@cindex @code{LC_CTYPE} locale category -@item LC_CTYPE -Character-type information (alphabetic, digit, upper- or lowercase, and -so on). -This information is accessed via the -POSIX character classes in regular expressions, -such as @code{/[[:alnum:]]/} -(@pxref{Regexp Operators}). - -@cindex monetary information, localization -@cindex currency symbols, localization -@cindex @code{LC_MONETARY} locale category -@item LC_MONETARY -Monetary information, such as the currency symbol, and whether the -symbol goes before or after a number. 
- -@cindex @code{LC_NUMERIC} locale category -@item LC_NUMERIC -Numeric information, such as which characters to use for the decimal -point and the thousands separator.@footnote{Americans -use a comma every three decimal places and a period for the decimal -point, while many Europeans do exactly the opposite: -1,234.56 versus 1.234,56.} - -@cindex @code{LC_RESPONSE} locale category -@item LC_RESPONSE -Response information, such as how ``yes'' and ``no'' appear in the -local language, and possibly other information as well. - -@cindex time, localization and -@cindex dates, information related to@comma{} localization -@cindex @code{LC_TIME} locale category -@item LC_TIME -Time- and date-related information, such as 12- or 24-hour clock, month printed -before or after the day in a date, local month abbreviations, and so on. - -@cindex @code{LC_ALL} locale category -@item LC_ALL -All of the above. (Not too useful in the context of @code{gettext}.) -@end table -@c ENDOFRANGE gettex - -@node Programmer i18n -@section Internationalizing @command{awk} Programs -@c STARTOFRANGE inap -@cindex @command{awk} programs, internationalizing - -@command{gawk} provides the following variables and functions for -internationalization: - -@table @code -@cindex @code{TEXTDOMAIN} variable -@item TEXTDOMAIN -This variable indicates the application's text domain. -For compatibility with GNU @code{gettext}, the default -value is @code{"messages"}. - -@cindex internationalization, localization, marked strings -@cindex strings, for localization -@item _"your message here" -String constants marked with a leading underscore -are candidates for translation at runtime. -String constants without a leading underscore are not translated. - -@cindex @code{dcgettext()} function (@command{gawk}) -@item dcgettext(@var{string} @r{[}, @var{domain} @r{[}, @var{category}@r{]]}) -Return the translation of @var{string} in -text domain @var{domain} for locale category @var{category}. 
-The default value for @var{domain} is the current value of @code{TEXTDOMAIN}. -The default value for @var{category} is @code{"LC_MESSAGES"}. - -If you supply a value for @var{category}, it must be a string equal to -one of the known locale categories described in -@ifnotinfo -the previous @value{SECTION}. -@end ifnotinfo -@ifinfo -@ref{Explaining gettext}. -@end ifinfo -You must also supply a text domain. Use @code{TEXTDOMAIN} if -you want to use the current domain. - -@quotation CAUTION -The order of arguments to the @command{awk} version -of the @code{dcgettext()} function is purposely different from the order for -the C version. The @command{awk} version's order was -chosen to be simple and to allow for reasonable @command{awk}-style -default arguments. -@end quotation - -@cindex @code{dcngettext()} function (@command{gawk}) -@item dcngettext(@var{string1}, @var{string2}, @var{number} @r{[}, @var{domain} @r{[}, @var{category}@r{]]}) -Return the plural form used for @var{number} of the -translation of @var{string1} and @var{string2} in text domain -@var{domain} for locale category @var{category}. @var{string1} is the -English singular variant of a message, and @var{string2} the English plural -variant of the same message. -The default value for @var{domain} is the current value of @code{TEXTDOMAIN}. -The default value for @var{category} is @code{"LC_MESSAGES"}. - -The same remarks about argument order as for the @code{dcgettext()} function apply. - -@cindex @code{.mo} files, specifying directory of -@cindex files, @code{.mo}, specifying directory of -@cindex message object files, specifying directory of -@cindex files, message object, specifying directory of -@cindex @code{bindtextdomain()} function (@command{gawk}) -@item bindtextdomain(@var{directory} @r{[}, @var{domain}@r{]}) -Change the directory in which -@code{gettext} looks for @file{.mo} files, in case they -will not or cannot be placed in the standard locations -(e.g., during testing). 
-Return the directory in which @var{domain} is ``bound.'' - -The default @var{domain} is the value of @code{TEXTDOMAIN}. -If @var{directory} is the null string (@code{""}), then -@code{bindtextdomain()} returns the current binding for the -given @var{domain}. -@end table - -To use these facilities in your @command{awk} program, follow the steps -outlined in -@ifnotinfo -the previous @value{SECTION}, -@end ifnotinfo -@ifinfo -@ref{Explaining gettext}, -@end ifinfo -like so: - -@enumerate -@cindex @code{BEGIN} pattern, @code{TEXTDOMAIN} variable and -@cindex @code{TEXTDOMAIN} variable, @code{BEGIN} pattern and -@item -Set the variable @code{TEXTDOMAIN} to the text domain of -your program. This is best done in a @code{BEGIN} rule -(@pxref{BEGIN/END}), -or it can also be done via the @option{-v} command-line -option (@pxref{Options}): - -@example -BEGIN @{ - TEXTDOMAIN = "guide" - @dots{} -@} -@end example - -@cindex @code{_} (underscore), translatable string -@cindex underscore (@code{_}), translatable string -@item -Mark all translatable strings with a leading underscore (@samp{_}) -character. It @emph{must} be adjacent to the opening -quote of the string. For example: - -@example -print _"hello, world" -x = _"you goofed" -printf(_"Number of users is %d\n", nusers) -@end example - -@item -If you are creating strings dynamically, you can -still translate them, using the @code{dcgettext()} -built-in function: - -@example -message = nusers " users logged in" -message = dcgettext(message, "adminprog") -print message -@end example - -Here, the call to @code{dcgettext()} supplies a different -text domain (@code{"adminprog"}) in which to find the -message, but it uses the default @code{"LC_MESSAGES"} category. - -@cindex @code{LC_MESSAGES} locale category, @code{bindtextdomain()} function (@command{gawk}) -@item -During development, you might want to put the @file{.mo} -file in a private directory for testing. 
This is done -with the @code{bindtextdomain()} built-in function: - -@example -BEGIN @{ - TEXTDOMAIN = "guide" # our text domain - if (Testing) @{ - # where to find our files - bindtextdomain("testdir") - # joe is in charge of adminprog - bindtextdomain("../joe/testdir", "adminprog") - @} - @dots{} -@} -@end example - -@end enumerate - -@xref{I18N Example}, -for an example program showing the steps to create -and use translations from @command{awk}. - -@node Translator i18n -@section Translating @command{awk} Programs - -@cindex @code{.po} files -@cindex files, @code{.po} -@cindex portable object files -@cindex files, portable object -Once a program's translatable strings have been marked, they must -be extracted to create the initial @file{.po} file. -As part of translation, it is often helpful to rearrange the order -in which arguments to @code{printf} are output. - -@command{gawk}'s @option{--gen-pot} command-line option extracts -the messages and is discussed next. -After that, @code{printf}'s ability to -rearrange the order for @code{printf} arguments at runtime -is covered. - -@menu -* String Extraction:: Extracting marked strings. -* Printf Ordering:: Rearranging @code{printf} arguments. -* I18N Portability:: @command{awk}-level portability issues. -@end menu - -@node String Extraction -@subsection Extracting Marked Strings -@cindex strings, extracting -@cindex marked strings@comma{} extracting -@cindex @code{--gen-pot} option -@cindex command-line options, string extraction -@cindex string extraction (internationalization) -@cindex marked string extraction (internationalization) -@cindex extraction, of marked strings (internationalization) - -@cindex @code{--gen-pot} option -Once your @command{awk} program is working, and all the strings have -been marked and you've set (and perhaps bound) the text domain, -it is time to produce translations. 
-First, use the @option{--gen-pot} command-line option to create -the initial @file{.pot} file: - -@example -$ @kbd{gawk --gen-pot -f guide.awk > guide.pot} -@end example - -@cindex @code{xgettext} utility -When run with @option{--gen-pot}, @command{gawk} does not execute your -program. Instead, it parses it as usual and prints all marked strings -to standard output in the format of a GNU @code{gettext} Portable Object -file. Also included in the output are any constant strings that -appear as the first argument to @code{dcgettext()} or as the first and -second argument to @code{dcngettext()}.@footnote{The -@command{xgettext} utility that comes with GNU -@code{gettext} can handle @file{.awk} files.} -@xref{I18N Example}, -for the full list of steps to go through to create and test -translations for @command{guide}. - -@node Printf Ordering -@subsection Rearranging @code{printf} Arguments - -@cindex @code{printf} statement, positional specifiers -@cindex positional specifiers, @code{printf} statement -Format strings for @code{printf} and @code{sprintf()} -(@pxref{Printf}) -present a special problem for translation. -Consider the following:@footnote{This example is borrowed -from the GNU @code{gettext} manual.} - -@c line broken here only for smallbook format -@example -printf(_"String `%s' has %d characters\n", - string, length(string))) -@end example - -A possible German translation for this might be: - -@example -"%d Zeichen lang ist die Zeichenkette `%s'\n" -@end example - -The problem should be obvious: the order of the format -specifications is different from the original! -Even though @code{gettext()} can return the translated string -at runtime, -it cannot change the argument order in the call to @code{printf}. - -To solve this problem, @code{printf} format specifiers may have -an additional optional element, which we call a @dfn{positional specifier}. 
-For example: - -@example -"%2$d Zeichen lang ist die Zeichenkette `%1$s'\n" -@end example - -Here, the positional specifier consists of an integer count, which indicates which -argument to use, and a @samp{$}. Counts are one-based, and the -format string itself is @emph{not} included. Thus, in the following -example, @samp{string} is the first argument and @samp{length(string)} is the second: - -@example -$ @kbd{gawk 'BEGIN @{} -> @kbd{string = "Dont Panic"} -> @kbd{printf _"%2$d characters live in \"%1$s\"\n",} -> @kbd{string, length(string)} -> @kbd{@}'} -@print{} 10 characters live in "Dont Panic" -@end example - -If present, positional specifiers come first in the format specification, -before the flags, the field width, and/or the precision. - -Positional specifiers can be used with the dynamic field width and -precision capability: - -@example -$ @kbd{gawk 'BEGIN @{} -> @kbd{printf("%*.*s\n", 10, 20, "hello")} -> @kbd{printf("%3$*2$.*1$s\n", 20, 10, "hello")} -> @kbd{@}'} -@print{} hello -@print{} hello -@end example - -@quotation NOTE -When using @samp{*} with a positional specifier, the @samp{*} -comes first, then the integer position, and then the @samp{$}. -This is somewhat counterintuitive. -@end quotation - -@cindex @code{printf} statement, positional specifiers, mixing with regular formats -@cindex positional specifiers, @code{printf} statement, mixing with regular formats -@cindex format specifiers, mixing regular with positional specifiers -@command{gawk} does not allow you to mix regular format specifiers -and those with positional specifiers in the same string: - -@example -$ @kbd{gawk 'BEGIN @{ printf _"%d %3$s\n", 1, 2, "hi" @}'} -@error{} gawk: cmd. line:1: fatal: must use `count$' on all formats or none -@end example - -@quotation NOTE -There are some pathological cases that @command{gawk} may fail to -diagnose. In such cases, the output may not be what you expect. 
-It's still a bad idea to try mixing them, even if @command{gawk} -doesn't detect it. -@end quotation - -Although positional specifiers can be used directly in @command{awk} programs, -their primary purpose is to help in producing correct translations of -format strings into languages different from the one in which the program -is first written. - -@node I18N Portability -@subsection @command{awk} Portability Issues - -@cindex portability, internationalization and -@cindex internationalization, localization, portability and -@command{gawk}'s internationalization features were purposely chosen to -have as little impact as possible on the portability of @command{awk} -programs that use them to other versions of @command{awk}. -Consider this program: - -@example -BEGIN @{ - TEXTDOMAIN = "guide" - if (Test_Guide) # set with -v - bindtextdomain("/test/guide/messages") - print _"don't panic!" -@} -@end example - -@noindent -As written, it won't work on other versions of @command{awk}. -However, it is actually almost portable, requiring very little -change: - -@itemize @bullet -@cindex @code{TEXTDOMAIN} variable, portability and -@item -Assignments to @code{TEXTDOMAIN} won't have any effect, -since @code{TEXTDOMAIN} is not special in other @command{awk} implementations. - -@item -Non-GNU versions of @command{awk} treat marked strings -as the concatenation of a variable named @code{_} with the string -following it.@footnote{This is good fodder for an ``Obfuscated -@command{awk}'' contest.} Typically, the variable @code{_} has -the null string (@code{""}) as its value, leaving the original string constant as -the result. - -@item -By defining ``dummy'' functions to replace @code{dcgettext()}, @code{dcngettext()} -and @code{bindtextdomain()}, the @command{awk} program can be made to run, but -all the messages are output in the original language. 
-For example: - -@cindex @code{bindtextdomain()} function (@command{gawk}), portability and -@cindex @code{dcgettext()} function (@command{gawk}), portability and -@cindex @code{dcngettext()} function (@command{gawk}), portability and -@example -@c file eg/lib/libintl.awk -function bindtextdomain(dir, domain) -@{ - return dir -@} - -function dcgettext(string, domain, category) -@{ - return string -@} - -function dcngettext(string1, string2, number, domain, category) -@{ - return (number == 1 ? string1 : string2) -@} -@c endfile -@end example - -@item -The use of positional specifications in @code{printf} or -@code{sprintf()} is @emph{not} portable. -To support @code{gettext()} at the C level, many systems' C versions of -@code{sprintf()} do support positional specifiers. But it works only if -enough arguments are supplied in the function call. Many versions of -@command{awk} pass @code{printf} formats and arguments unchanged to the -underlying C library version of @code{sprintf()}, but only one format and -argument at a time. What happens if a positional specification is -used is anybody's guess. -However, since the positional specifications are primarily for use in -@emph{translated} format strings, and since non-GNU @command{awk}s never -retrieve the translated string, this should not be a problem in practice. -@end itemize -@c ENDOFRANGE inap - -@node I18N Example -@section A Simple Internationalization Example - -Now let's look at a step-by-step example of how to internationalize and -localize a simple @command{awk} program, using @file{guide.awk} as our -original source: - -@example -@c file eg/prog/guide.awk -BEGIN @{ - TEXTDOMAIN = "guide" - bindtextdomain(".") # for testing - print _"Don't Panic" - print _"The Answer Is", 42 - print "Pardon me, Zaphod who?" 
-@} -@c endfile -@end example - -@noindent -Run @samp{gawk --gen-pot} to create the @file{.pot} file: - -@example -$ @kbd{gawk --gen-pot -f guide.awk > guide.pot} -@end example - -@noindent -This produces: - -@example -@c file eg/data/guide.po -#: guide.awk:4 -msgid "Don't Panic" -msgstr "" - -#: guide.awk:5 -msgid "The Answer Is" -msgstr "" - -@c endfile -@end example - -This original portable object template file is saved and reused for each language -into which the application is translated. The @code{msgid} -is the original string and the @code{msgstr} is the translation. - -@quotation NOTE -Strings not marked with a leading underscore do not -appear in the @file{guide.pot} file. -@end quotation - -Next, the messages must be translated. -Here is a translation to a hypothetical dialect of English, -called ``Mellow'':@footnote{Perhaps it would be better if it were -called ``Hippy.'' Ah, well.} - -@example -@group -$ cp guide.pot guide-mellow.po -@var{Add translations to} guide-mellow.po @dots{} -@end group -@end example - -@noindent -Following are the translations: - -@example -@c file eg/data/guide-mellow.po -#: guide.awk:4 -msgid "Don't Panic" -msgstr "Hey man, relax!" - -#: guide.awk:5 -msgid "The Answer Is" -msgstr "Like, the scoop is" - -@c endfile -@end example - -@cindex Linux -@cindex GNU/Linux -The next step is to make the directory to hold the binary message object -file and then to create the @file{guide.mo} file. -The directory layout shown here is standard for GNU @code{gettext} on -GNU/Linux systems. 
Other versions of @code{gettext} may use a different -layout: - -@example -$ @kbd{mkdir en_US en_US/LC_MESSAGES} -@end example - -@cindex @code{.po} files, converting to @code{.mo} -@cindex files, @code{.po}, converting to @code{.mo} -@cindex @code{.mo} files, converting from @code{.po} -@cindex files, @code{.mo}, converting from @code{.po} -@cindex portable object files, converting to message object files -@cindex files, portable object, converting to message object files -@cindex message object files, converting from portable object files -@cindex files, message object, converting from portable object files -@cindex @command{msgfmt} utility -The @command{msgfmt} utility does the conversion from human-readable -@file{.po} file to machine-readable @file{.mo} file. -By default, @command{msgfmt} creates a file named @file{messages}. -This file must be renamed and placed in the proper directory so that -@command{gawk} can find it: - -@example -$ @kbd{msgfmt guide-mellow.po} -$ @kbd{mv messages en_US/LC_MESSAGES/guide.mo} -@end example - -Finally, we run the program to test it: - -@example -$ @kbd{gawk -f guide.awk} -@print{} Hey man, relax! -@print{} Like, the scoop is 42 -@print{} Pardon me, Zaphod who? -@end example - -If the three replacement functions for @code{dcgettext()}, @code{dcngettext()} -and @code{bindtextdomain()} -(@pxref{I18N Portability}) -are in a file named @file{libintl.awk}, -then we can run @file{guide.awk} unchanged as follows: - -@example -$ @kbd{gawk --posix -f guide.awk -f libintl.awk} -@print{} Don't Panic -@print{} The Answer Is 42 -@print{} Pardon me, Zaphod who? -@end example - -@node Gawk I18N -@section @command{gawk} Can Speak Your Language - -@command{gawk} itself has been internationalized -using the GNU @code{gettext} package. -(GNU @code{gettext} is described in -complete detail in -@ifinfo -@inforef{Top, , GNU @code{gettext} utilities, gettext, GNU gettext tools}.) -@end ifinfo -@ifnotinfo -@cite{GNU gettext tools}.) 
-@end ifnotinfo -As of this writing, the latest version of GNU @code{gettext} is -@uref{ftp://ftp.gnu.org/gnu/gettext/gettext-0.18.1.tar.gz, @value{PVERSION} 0.18.1}. - -If a translation of @command{gawk}'s messages exists, -then @command{gawk} produces usage messages, warnings, -and fatal errors in the local language. -@c ENDOFRANGE inloc - -@node Arbitrary Precision Arithmetic -@chapter Arbitrary Precision Arithmetic with @command{gawk} -@cindex arbitrary precision -@cindex multiple precision -@cindex infinite precision -@cindex floating-point numbers, arbitrary precision -@cindex MPFR -@cindex GMP - -@cindex Knuth, Donald -@quotation -@i{There's a credibility gap: We don't know how much of the computer's answers -to believe. Novice computer users solve this problem by implicitly trusting -in the computer as an infallible authority; they tend to believe that all -digits of a printed answer are significant. Disillusioned computer users have -just the opposite approach; they are constantly afraid that their answers -are almost meaningless.} - -Donald Knuth@footnote{Donald E.@: Knuth. -@cite{The Art of Computer Programming}. Volume 2, -@cite{Seminumerical Algorithms}, third edition, -1998, ISBN 0-201-89683-4, p.@: 229.} -@end quotation - -This @value{CHAPTER} describes how to use the arbitrary precision -(also known as @dfn{multiple precision} or @dfn{infinite precision}) numeric -capabilities in @command{gawk} to produce maximally accurate results -when you need them. But first you should check if your version of -@command{gawk} supports arbitrary precision arithmetic. -The easiest way to find out is to look at the output of -the following command: - -@example -$ @kbd{gawk --version} -@print{} GNU Awk 4.1.0 (GNU MPFR 3.1.0, GNU MP 5.0.3) -@print{} Copyright (C) 1989, 1991-2012 Free Software Foundation. 
-@dots{} -@end example - -@command{gawk} uses the -@uref{http://www.mpfr.org, GNU MPFR} -and -@uref{http://gmplib.org, GNU MP} (GMP) -libraries for arbitrary precision -arithmetic on numbers. So if you do not see the names of these libraries -in the output, then your version of @command{gawk} does not support -arbitrary precision arithmetic. - -Even if you aren't interested in arbitrary precision arithmetic, you -may still benefit from knowing how @command{gawk} handles numbers -in general, and the limitations of doing arithmetic with ordinary -@command{gawk} numbers. - -@menu -* Floating-point Programming:: Effective Floating-point Programming. -* Floating-point Representation:: Binary Floating-point Representation. -* Floating-point Context:: Floating-point Context. -* Rounding Mode:: Floating-point Rounding Mode. -* Arbitrary Precision Floats:: Arbitrary Precision Floating-point - Arithmetic with @command{gawk}. -* Setting Precision:: Setting the Working Precision. -* Setting Rounding Mode:: Setting the Rounding Mode. -* Floating-point Constants:: Representing Floating-point Constants. -* Changing Precision:: Changing the Precision of a Number. -* Exact Arithmetic:: Exact Arithmetic with Floating-point Numbers. -* Integer Programming:: Effective Integer Programming. -* Arbitrary Precision Integers:: Arbitrary Precision Integer - Arithmetic with @command{gawk}. -* MPFR and GMP Libraries:: Information About the MPFR and GMP Libraries. -@end menu - -@node Floating-point Programming -@section Effective Floating-point Programming - -Numerical programming is an extensive area; if you need to develop -sophisticated numerical algorithms then @command{gawk} may not be -the ideal tool, and this documentation may not be sufficient. -@c FIXME: JOHN: Do you want to cite some actual books? -It might require a book or two to communicate how to compute -with ideal accuracy and precision -and the result often depends on the particular application. 
- -@quotation NOTE -A floating-point calculation's @dfn{accuracy} is how close it comes -to the real value. This is as opposed to the @dfn{precision}, which -usually refers to the number of bits used to represent the number -(see @uref{http://en.wikipedia.org/wiki/Accuracy_and_precision, -the Wikipedia article} for more information). -@end quotation - -Binary floating-point representations and arithmetic are inexact. -Simple values like 0.1 cannot be precisely represented using -binary floating-point numbers, and the limited precision of -floating-point numbers means that slight changes in -the order of operations or the precision of intermediate storage -can change the result. To make matters worse with arbitrary precision -floating-point, you can set the precision before starting a computation, -but then you cannot be sure of the number of significant decimal places -in the final result. - -Sometimes you need to think more about what you really want -and what's really happening. Consider the two numbers -in the following example: - -@example -x = 0.875 # 1/2 + 1/4 + 1/8 -y = 0.425 -@end example - -Unlike the number in @code{y}, the number stored in @code{x} -is exactly representable -in binary since it can be written as a finite sum of one or -more fractions whose denominators are all powers of two. -When @command{gawk} reads a floating-point number from -program source, it automatically rounds that number to whatever -precision your machine supports. If you try to print the numeric -content of a variable using an output format string of @code{"%.17g"}, -it may not produce the same number as you assigned to it: - -@example -$ @kbd{gawk 'BEGIN @{ x = 0.875; y = 0.425} -> @kbd{ printf("%0.17g, %0.17g\n", x, y) @}'} -@print{} 0.875, 0.42499999999999999 -@end example - -Often the error is so small you do not even notice it, and if you do, -you can always specify how much precision you would like in your output. 
-Usually this is a format string like @code{"%.15g"}, which when -used in the previous example, produces an output identical to the input. - -Because the underlying representation can be a little bit off from the exact value, -comparing floats to see if they are equal is generally not a good idea. -Here is an example where it does not work like you expect: - -@example -$ @kbd{gawk 'BEGIN @{ print (0.1 + 12.2 == 12.3) @}'} -@print{} 0 -@end example - -The loss of accuracy during a single computation with floating-point numbers -usually isn't enough to worry about. However, if you compute a value -which is the result of a sequence of floating-point operations, -the error can accumulate and greatly affect the computation itself. -Here is an attempt to compute the value of the constant -@value{PI} using one of its many series representations: - -@example -BEGIN @{ - x = 1.0 / sqrt(3.0) - n = 6 - for (i = 1; i < 30; i++) @{ - n = n * 2.0 - x = (sqrt(x * x + 1) - 1) / x - printf("%.15f\n", n * x) - @} -@} -@end example - -When run, the early errors propagating through later computations -cause the loop to terminate prematurely after an attempt to divide by zero. - -@example -$ @kbd{gawk -f pi.awk} -@print{} 3.215390309173475 -@print{} 3.159659942097510 -@print{} 3.146086215131467 -@print{} 3.142714599645573 -@dots{} -@print{} 3.224515243534819 -@print{} 2.791117213058638 -@print{} 0.000000000000000 -@error{} gawk: pi.awk:6: fatal: division by zero attempted -@end example - -Here is one more example where the inaccuracies in internal representations -yield an unexpected result: - -@example -$ @kbd{gawk 'BEGIN @{} -> @kbd{for (d = 1.1; d <= 1.5; d += 0.1)} -> @kbd{i++} -> @kbd{print i} -> @kbd{@}'} -@print{} 4 -@end example - -Can computation using arbitrary precision help with the previous examples? -If you are impatient to know, see -@ref{Exact Arithmetic}. 
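The failed equality test and the loop that runs only four times can both be sidestepped without any extra precision. The following is a minimal sketch and not part of the @command{gawk} distribution: the function name @code{close_enough}, the tolerance @code{1e-9}, and the shell wrapper @code{float_demo} are assumptions chosen for illustration. It compares two values within a tolerance and drives the loop with an integer counter, so that no rounding error accumulates in the loop variable:

```shell
# Illustrative wrapper so the program can be rerun; any awk with IEEE doubles
# should behave the same way here, gawk included.
float_demo() {
    awk '
    # Hypothetical helper: true when a and b differ by at most tol.
    function close_enough(a, b, tol,    diff) {
        diff = a - b
        if (diff < 0)
            diff = -diff
        return diff <= tol
    }
    BEGIN {
        print (0.1 + 12.2 == 12.3)                  # exact test fails
        print close_enough(0.1 + 12.2, 12.3, 1e-9)  # tolerance test succeeds
        n = 0
        for (i = 11; i <= 15; i++) {   # integer counter, exact in a double
            d = i / 10                 # d takes the values 1.1 through 1.5
            n++
        }
        print n                        # all five iterations run
    }'
}
float_demo
```

The right tolerance depends on the magnitude of the values being compared, so the fixed @code{1e-9} used here is only a placeholder.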
- -Instead of arbitrary precision floating-point arithmetic, -often all you need is an adjustment of your logic -or a different order for the operations in your calculation. -The stability and the accuracy of the computation of the constant @value{PI} -in the previous example can be enhanced by using the following -simple algebraic transformation: - -@example -(sqrt(x * x + 1) - 1) / x = x / (sqrt(x * x + 1) + x) -@end example - -There is no need to be unduly suspicious about the results from -floating-point arithmetic. The lesson to remember is that -floating-point math is always more complex than the math using -pencil and paper. In order to take advantage of the power -of computer floating-point, you need to know its limitations -and work within them. For most casual use of floating-point arithmetic, -you will often get the expected result in the end if you simply round -the display of your final results to the correct number of significant -decimal digits. Avoid presenting numerical data in a manner that -implies better precision than is actually the case. - -@node Floating-point Representation -@section Binary Floating-point Representation -@cindex IEEE-754 format - -Although floating-point representations vary from machine to machine, -the most commonly encountered representation is that defined by the -IEEE 754 Standard. An IEEE-754 format value has three components: - -@itemize @bullet -@item -a sign bit telling whether the number is positive or negative, - -@item -an @dfn{exponent} giving its order of magnitude, @var{e}, - -@item -and a @dfn{significand}, @var{s}, -specifying the actual digits of the number. -@end itemize - -The value of the -number is then @iftex -@math{s @cdot 2^e}. +@part Part II:@* Problem Solving With @command{awk} @end iftex -@ifnottex -@var{s * 2^e}. 
-@end ifnottex -The first bit of a non-zero binary significand -is always one, so the significand in an IEEE-754 format only includes the -fractional part, leaving the leading one implicit. - -Three of the standard IEEE-754 types are 32-bit single precision, -64-bit double precision and 128-bit quadruple precision. -The standard also specifies extended precision formats -to allow greater precisions and larger exponent ranges. -@node Floating-point Context -@section Floating-point Context -@cindex context, floating-point - -A floating-point context defines the environment for arithmetic operations. -It governs precision, sets rules for rounding and limits range for exponents. -The context has the following primary components: - -@table @code -@item precision -Precision of the floating-point format in bits. -@item emax -Maximum exponent allowed for this format. -@item emin -Minimum exponent allowed for this format. -@item underflow behavior -The format may or may not support gradual underflow. -@item rounding -The rounding mode of this context. -@end table - -@ref{table-ieee-formats} lists the precision and exponent -field values for the basic IEEE-754 binary formats: - -@float Table,table-ieee-formats -@caption{Basic IEEE Formats} -@multitable @columnfractions .20 .20 .20 .20 .20 -@headitem Name @tab Total bits @tab Precision @tab emin @tab emax -@item Single @tab 32 @tab 24 @tab @minus{}126 @tab +127 -@item Double @tab 64 @tab 53 @tab @minus{}1022 @tab +1023 -@item Quadruple @tab 128 @tab 113 @tab @minus{}16382 @tab +16383 -@end multitable -@end float - -@quotation NOTE -The precision numbers include the implied leading one that gives them -one extra bit of significand. -@end quotation - -A floating-point context can also determine which signals are treated -as exceptions, and can set rules for arithmetic with special values. -Please consult the IEEE-754 standard or other resources for details. 
- -@command{gawk} ordinarily uses the hardware double precision -representation for numbers. On most systems, this is IEEE-754 -floating-point format, corresponding to 64-bit binary with 53 bits -of precision. - -@quotation NOTE -In case an underflow occurs, the standard allows, but does not require, -the result from an arithmetic operation to be a number smaller than -the smallest nonzero normalized number. Such numbers do -not have as many significant digits as normal numbers, and are called -@dfn{denormals} or @dfn{subnormals}. The alternative, simply returning a zero, -is called @dfn{flush to zero}. The basic IEEE-754 binary formats -support subnormal numbers. -@end quotation - -@node Rounding Mode -@section Floating-point Rounding Mode -@cindex rounding mode, floating-point - -The @dfn{rounding mode} specifies the behavior for the results of numerical -operations when discarding extra precision. Each rounding mode indicates -how the least significant returned digit of a rounded result is to -be calculated. -The @code{ROUNDMODE} variable (@pxref{Setting Rounding Mode}) provides -program level control over the rounding mode. -@ref{table-rounding-modes} lists the IEEE-754 defined -rounding modes: - -@float Table,table-rounding-modes -@caption{Rounding Modes} -@multitable @columnfractions .45 .30 .25 -@headitem Rounding Mode @tab IEEE Name @tab @code{ROUNDMODE} -@item Round to nearest, ties to even @tab @code{roundTiesToEven} @tab @code{"N"} or @code{"n"} -@item Round toward plus Infinity @tab @code{roundTowardPositive} @tab @code{"U"} or @code{"u"} -@item Round toward negative Infinity @tab @code{roundTowardNegative} @tab @code{"D"} or @code{"d"} -@item Round toward zero @tab @code{roundTowardZero} @tab @code{"Z"} or @code{"z"} -@item Round to nearest, ties away from zero @tab @code{roundTiesToAway} @tab @code{"A"} or @code{"a"} -@end multitable -@end float - -The default mode @samp{roundTiesToEven} is the most preferred, -but the least intuitive. 
This method does the obvious thing for most values, -by rounding them up or down to the nearest digit. -For example, rounding 1.132 to two digits yields 1.13, -and rounding 1.157 yields 1.16. - -However, when it comes to rounding a value that is exactly halfway between, -things do not work the way you probably learned in school. -In this case, the number is rounded to the nearest even digit. -So rounding 0.125 to two digits rounds down to 0.12, -but rounding 0.6875 to three digits rounds up to 0.688. -You probably have already encountered this rounding mode when -using the @code{printf} routine to format floating-point numbers. -For example: - -@example -BEGIN @{ - x = -4.5 - for (i = 1; i < 10; i++) @{ - x += 1.0 - printf("%4.1f => %2.0f\n", x, x) - @} -@} -@end example - -@noindent -produces the following output when run@footnote{It -is possible for the output to be completely different if the -C library in your system does not use the IEEE-754 even-rounding -rule to round halfway cases for @code{printf()}.}: - -@example --3.5 => -4 --2.5 => -2 --1.5 => -2 --0.5 => 0 - 0.5 => 0 - 1.5 => 2 - 2.5 => 2 - 3.5 => 4 - 4.5 => 4 -@end example - -The theory behind the rounding mode @samp{roundTiesToEven} is that -it more or less evenly distributes upward and downward rounds -of exact halves, which might cause the round-off error -to cancel itself out. This is the default rounding mode used -in IEEE-754 computing functions and operators. - -The other rounding modes are rarely used. -Round toward positive infinity (@samp{roundTowardPositive}) -and round toward negative infinity (@samp{roundTowardNegative}) -are often used to implement interval arithmetic, -where you adjust the rounding mode to calculate upper and lower bounds -for the range of output. The @samp{roundTowardZero} -mode can be used for converting floating-point numbers to integers. 
-The rounding mode @samp{roundTiesToAway} rounds the result to the -nearest number and selects the number with the larger magnitude -if a tie occurs. - -Some numerical analysts will tell you that your choice of rounding style -has tremendous impact on the final outcome, and advise you to wait until -final output for any rounding. Instead, you can often achieve this goal by -setting the precision initially to some value sufficiently larger than -the final desired precision, so that the accumulation of round-off error -does not influence the outcome. -If you suspect that results from your computation are -sensitive to accumulation of round-off error, -one way to be sure is to look for a significant difference in output -when you change the rounding mode. - -@node Arbitrary Precision Floats -@section Arbitrary Precision Floating-point Arithmetic with @command{gawk} - -@command{gawk} uses the GNU MPFR library -for arbitrary precision floating-point arithmetic. The MPFR library -provides precise control over precisions and rounding modes, and gives -correctly rounded reproducible platform-independent results. With the -command-line option @option{--bignum} or @option{-M}, -all floating-point arithmetic operators and numeric functions can yield -results to any desired precision level supported by MPFR. -Two built-in -variables @code{PREC} -(@pxref{Setting Precision}) -and @code{ROUNDMODE} -(@pxref{Setting Rounding Mode}) -provide control over the working precision and the rounding mode. -The precision and the rounding mode are set globally for every operation -to follow. 
- -The default working precision for arbitrary precision floats is 53, -and the default value for @code{ROUNDMODE} is @code{"N"}, -which selects the IEEE-754 -@samp{roundTiesToEven} (@pxref{Rounding Mode}) rounding mode.@footnote{The -default precision is 53, since according to the MPFR documentation, -the library should be able to exactly reproduce all computations with -double-precision machine floating-point numbers (@code{double} type -in C), except the default exponent range is much wider and subnormal -numbers are not implemented.} -@command{gawk} uses the default exponent range in MPFR -@iftex -(@math{emax = 2^{30} - 1, emin = -emax}) -@end iftex -@ifnottex -(@var{emax} = 2^30 @minus{} 1, @var{emin} = @minus{}@var{emax}) -@end ifnottex -for all floating-point contexts. -There is no explicit mechanism to adjust the exponent range. -MPFR does not implement subnormal numbers by default, -and this behavior cannot be changed in @command{gawk}. - -@quotation NOTE -When emulating an IEEE-754 format (@pxref{Setting Precision}), -@command{gawk} internally adjusts the exponent range -to the value defined for the format and also performs computations needed for -gradual underflow (subnormal numbers). -@end quotation - -@quotation NOTE -MPFR numbers are variable-size entities, consuming only as much space as -needed to store the significant digits. Since the performance using MPFR -numbers pales in comparison to doing math using the underlying machine -types, you should consider using only as much precision as needed by -your program. -@end quotation - -@node Setting Precision -@section Setting the Working Precision -@cindex @code{PREC} variable - -@command{gawk} uses a global working precision; it does not keep track of -the precision or accuracy of individual numbers. Performing an arithmetic -operation or calling a built-in function rounds the result to the current -working precision. 
The default working precision is 53, which can be -modified using the built-in variable @code{PREC}. You can also set the -value to one of the following pre-defined case-insensitive strings -to emulate an IEEE-754 binary format: - -@multitable {@code{"double"}} {12345678901234567890123456789012345} -@headitem @code{PREC} @tab IEEE-754 Binary Format -@item @code{"half"} @tab 16-bit half-precision. -@item @code{"single"} @tab Basic 32-bit single precision. -@item @code{"double"} @tab Basic 64-bit double precision. -@item @code{"quad"} @tab Basic 128-bit quadruple precision. -@item @code{"oct"} @tab 256-bit octuple precision. -@end multitable - -The following example illustrates the effects of changing precision -on arithmetic operations: - -@example -$ @kbd{gawk -M -vPREC=100 'BEGIN @{ x = 1.0e-400; print x + 0; \} -> @kbd{PREC = "double"; print x + 0 @}'} -@print{} 1e-400 -@print{} 0 -@end example - -Binary and decimal precisions are related approximately according to the -formula: - -@iftex -@math{prec = 3.322 @cdot dps} -@end iftex -@ifnottex -@var{prec} = 3.322 * @var{dps} -@end ifnottex - -@noindent -Here, @var{prec} denotes the binary precision -(measured in bits) and @var{dps} (short for decimal places) -is the number of decimal digits. We can easily calculate how many decimal -digits the 53-bit significand of an IEEE double is equivalent to: -53 / 3.322 which is equal to about 15.95. -But what does 15.95 digits actually mean? It depends whether you are -concerned about how many digits you can rely on, or how many digits -you need. - -It is important to know how many bits it takes to uniquely identify -a double-precision value (the C type @code{double}). If you want to -convert from @code{double} to decimal and back to @code{double} (e.g., -saving a @code{double} representing an intermediate result to a file, and -later reading it back to restart the computation), then a few more decimal -digits are required. 17 digits is generally enough for a @code{double}. 
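The round trip from @code{double} to decimal text and back can be checked directly. This is a minimal sketch under stated assumptions: the shell function @code{roundtrip} and the test value @code{0.1 + 0.2} are illustrative choices, and the program relies on ordinary IEEE 64-bit doubles rather than on the @option{-M} option. It formats a value with a given number of significant digits, converts the text back to a number, and reports whether the original value was recovered:

```shell
# Hypothetical helper: does a value survive double -> decimal -> double
# when written with the given number of significant digits?
roundtrip() {
    awk -v digits="$1" 'BEGIN {
        x = 0.1 + 0.2                      # not exactly 0.3 in binary
        s = sprintf("%." digits "g", x)    # decimal text, as saved to a file
        print s, (s + 0 == x)              # 1 if reading it back yields x
    }'
}
roundtrip 15    # prints: 0.3 0
roundtrip 17    # prints: 0.30000000000000004 1
```

With 15 digits the text reads back as a different @code{double}; with 17 digits the original value is recovered.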
- -It can also be important to know what decimal numbers can be uniquely -represented with a @code{double}. If you want to convert -from decimal to @code{double} and back again, 15 digits is the most that -you can get. Stated differently, you should not present -the numbers from your floating-point computations with more than 15 -significant digits in them. - -Conversely, it takes a precision of 332 bits to hold an approximation -of constant @value{PI} that is accurate to 100 decimal places. -You should always add some extra bits in order to avoid the confusing round-off -issues that occur because numbers are stored internally in binary. - -@node Setting Rounding Mode -@section Setting the Rounding Mode -@cindex @code{ROUNDMODE} variable - -The built-in variable @code{ROUNDMODE} has the default value @code{"N"}, -which selects the IEEE-754 rounding mode @samp{roundTiesToEven}. -The other possible values for @code{ROUNDMODE} are @code{"U"} for rounding mode -@samp{roundTowardPositive}, @code{"D"} for @samp{roundTowardNegative}, -and @code{"Z"} for @samp{roundTowardZero}. -@command{gawk} also accepts @code{"A"} to select the IEEE-754 mode -@samp{roundTiesToAway} -if your version of the MPFR library supports it; otherwise setting -@code{ROUNDMODE} to this value has no effect. @xref{Rounding Mode}, -for the meanings of the various rounding modes. - -Here is an example of how to change the default rounding behavior of -@code{printf}'s output: - -@example -$ @kbd{gawk -M -vROUNDMODE="Z" 'BEGIN @{ printf("%.2f\n", 1.378) @}'} -@print{} 1.37 -@end example - -@node Floating-point Constants -@section Representing Floating-point Constants -@cindex constants, floating-point - -Be wary of floating-point constants! When reading a floating-point constant -from program source code, @command{gawk} uses the default precision, -unless overridden -by an assignment to the special variable @code{PREC} on the command -line, to store it internally as a MPFR number. 
-Changing the precision using @code{PREC} in the program text does -not change the precision of a constant. If you need to -represent a floating-point constant at a higher precision than the -default and cannot use a command line assignment to @code{PREC}, -you should either specify the constant as a string, or -a rational number whenever possible. The following example -illustrates the differences among various ways to -print a floating-point constant: - -@example -$ @kbd{gawk -M 'BEGIN @{ PREC = 113; printf("%0.25f\n", 0.1) @}'} -@print{} 0.1000000000000000055511151 -$ @kbd{gawk -M -vPREC=113 'BEGIN @{ printf("%0.25f\n", 0.1) @}'} -@print{} 0.1000000000000000000000000 -$ @kbd{gawk -M 'BEGIN @{ PREC = 113; printf("%0.25f\n", "0.1") @}'} -@print{} 0.1000000000000000000000000 -$ @kbd{gawk -M 'BEGIN @{ PREC = 113; printf("%0.25f\n", 1/10) @}'} -@print{} 0.1000000000000000000000000 -@end example - -In the first case, the number is stored with the default precision of 53. - -@node Changing Precision -@section Changing the Precision of a Number - -@cindex Laurie, Dirk -@quotation -@i{The point is that in any variable-precision package, -a decision is made on how to treat numbers given as data, -or arising in intermediate results, which are represented in -floating-point format to a precision lower than working precision. -Do we promote them to full membership of the high-precision club, -or do we treat them and all their associates as second-class citizens? -Sometimes the first course is proper, sometimes the second, and it takes -careful analysis to tell which.} - -Dirk Laurie@footnote{Dirk Laurie. -@cite{Variable-precision Arithmetic Considered Perilous -- A Detective Story}. -Electronic Transactions on Numerical Analysis. Volume 28, pp. 168-173, 2008.} -@end quotation - -@command{gawk} does not implicitly modify the precision of any previously -computed results when the working precision is changed with an assignment -to @code{PREC}. 
The precision of a number is always the one that was -used at the time of its creation, and there is no way for the user -to explicitly change it afterwards. However, since the result of a -floating-point arithmetic operation is always an arbitrary precision -floating-point value---with a precision set by the value of @code{PREC}---one of the -following workarounds effectively accomplishes the desired behavior: - -@example -x = x + 0.0 -@end example - -@noindent -or: - -@example -x += 0.0 -@end example - -@node Exact Arithmetic -@section Exact Arithmetic with Floating-point Numbers - -@quotation CAUTION -Never depend on the exactness of floating-point arithmetic, -even for apparently simple expressions! -@end quotation - -Can arbitrary precision arithmetic give exact results? There are -no easy answers. The standard rules of algebra often do not apply -when using floating-point arithmetic. -Among other things, the distributive and associative laws -do not hold completely, and order of operation may be important -for your computation. Rounding error, cumulative precision loss -and underflow are often troublesome. - -When @command{gawk} tests the expressions @samp{0.1 + 12.2} and @samp{12.3} -for equality -using the machine double precision arithmetic, it decides that they -are not equal! -(@xref{Floating-point Programming}.) -You can get the result you want by increasing the precision; -56 in this case will get the job done: - -@example -$ @kbd{gawk -M -vPREC=56 'BEGIN @{ print (0.1 + 12.2 == 12.3) @}'} -@print{} 1 -@end example - -If adding more bits is good, perhaps adding even more bits of -precision is better? -Here is what happens if we use an even larger value of @code{PREC}: - -@example -$ @kbd{gawk -M -vPREC=201 'BEGIN @{ print (0.1 + 12.2 == 12.3) @}'} -@print{} 0 -@end example - -This is not a bug in @command{gawk} or in the MPFR library. 
-It is easy to forget that the finite number of bits used to store the value -is often just an approximation after proper rounding. -The test for equality succeeds if and only if @emph{all} bits in the two operands -are exactly the same. Since this is not necessarily true after floating-point -computations with a particular precision and effective rounding rule, -a straight test for equality may not work. - -So, don't assume that floating-point values can be compared for equality. -You should also exercise caution when using other forms of comparisons. -The standard way to compare between floating-point numbers is to determine -how much error (or @dfn{tolerance}) you will allow in a comparison and -check to see if one value is within this error range of the other. - -In applications where 15 or fewer decimal places suffice, -hardware double precision arithmetic can be adequate, and is usually much faster. -But you do need to keep in mind that every floating-point operation -can suffer a new rounding error with catastrophic consequences as illustrated -by our attempt to compute the value of the constant @value{PI}, -(@pxref{Floating-point Programming}). -Extra precision can greatly enhance the stability and the accuracy -of your computation in such cases. - -Repeated addition is not necessarily equivalent to multiplication -in floating-point arithmetic. In the last example -(@pxref{Floating-point Programming}), -you may or may not succeed in getting the correct result by choosing -an arbitrarily large value for @code{PREC}. Reformulation of -the problem at hand is often the correct approach in such situations. - - -@node Integer Programming -@section Effective Integer Programming - -As has been mentioned already, @command{gawk} ordinarily uses hardware double -precision with 64-bit IEEE binary floating-point representation -for numbers on most systems. 
A large integer like 9007199254740997 -has a binary representation that, although finite, is more than 53 bits long; -it must also be rounded to 53 bits. -The biggest integer that can be stored in a C @code{double} is usually the same -as the largest possible value of a @code{double}. If your system @code{double} -is an IEEE 64-bit @code{double}, this largest possible value is an integer and -can be represented precisely. What more should one know about integers? - -If you want to know what is the largest integer, such that it and -all smaller integers can be stored in 64-bit doubles without losing precision, -then the answer is -@iftex -@math{2^{53}}. -@end iftex -@ifnottex -2^53. -@end ifnottex -The next representable number is the even number -@iftex -@math{2^{53} + 2}, -@end iftex -@ifnottex -2^53 + 2, -@end ifnottex -meaning it is unlikely that you will be able to make -@command{gawk} print -@iftex -@math{2^{53} + 1} -@end iftex -@ifnottex -2^53 + 1 -@end ifnottex -in integer format. -The range of integers exactly representable by a 64-bit double -is -@iftex -@math{[-2^{53}, 2^{53}]}. -@end iftex -@ifnottex -[@minus{}2^53, 2^53]. -@end ifnottex -If you ever see an integer outside this range in @command{gawk} -using 64-bit doubles, you have reason to be very suspicious about -the accuracy of the output. Here is a simple program with erroneous output: - -@example -$ @kbd{gawk 'BEGIN @{ i = 2^53 - 1; for (j = 0; j < 4; j++) print i + j @}'} -@print{} 9007199254740991 -@print{} 9007199254740992 -@print{} 9007199254740992 -@print{} 9007199254740994 -@end example - -The lesson is to not assume that any large integer printed by @command{gawk} -represents an exact result from your computation, especially if it wraps -around on your screen. 
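A program can guard against this silently lost precision by testing whether a value lies inside the exactly representable range before trusting it. The following is a minimal sketch, assuming IEEE 64-bit doubles and no @option{-M} option; the names @code{is_safe_int} and @code{exact_int_demo} are hypothetical and appear nowhere in @command{gawk} itself. Only 0 and 1 are printed, so the output does not depend on how a particular @command{awk} formats large numbers:

```shell
# Illustrative wrapper around a small awk program.
exact_int_demo() {
    awk '
    # Hypothetical helper: true if n is an integer inside [-2^53, 2^53].
    function is_safe_int(n) {
        return n == int(n) && n >= -(2^53) && n <= 2^53
    }
    BEGIN {
        print (2^53 + 1 == 2^53)      # 1: 2^53 + 1 rounds back to 2^53
        print is_safe_int(2^53 - 1)   # 1: inside the exact range
        print is_safe_int(2^53 + 2)   # 0: representable, but neighbors are not
        print is_safe_int(0.5)        # 0: not an integer at all
    }'
}
exact_int_demo
```

A check like this only flags the final value; an intermediate result that left the range earlier in a computation may already have been rounded.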
- -@node Arbitrary Precision Integers -@section Arbitrary Precision Integer Arithmetic with @command{gawk} -@cindex integer, arbitrary precision - -If the option @option{--bignum} or @option{-M} is specified, -@command{gawk} performs all -integer arithmetic using GMP arbitrary precision integers. -Any number that looks like an integer in a program source or data file -is stored as an arbitrary precision integer. -The size of the integer is limited only by your computer's memory. -The current floating-point context has no effect on operations involving integers. -For example, the following computes -@iftex -@math{5^{4^{3^{2}}}}, -@end iftex -@ifnottex -5^4^3^2, -@end ifnottex -the result of which is beyond the -limits of ordinary @command{gawk} numbers: - -@example -$ @kbd{gawk -M 'BEGIN @{} -> @kbd{x = 5^4^3^2} -> @kbd{print "# of digits =", length(x)} -> @kbd{print substr(x, 1, 20), "...", substr(x, length(x) - 19, 20)} -> @kbd{@}'} -@print{} # of digits = 183231 -@print{} 62060698786608744707 ... 92256259918212890625 -@end example - -If you were to compute the same value using arbitrary precision -floating-point values instead, the precision needed for correct output -(using the formula -@iftex -@math{prec = 3.322 @cdot dps}), -would be @math{3.322 @cdot 183231}, -@end iftex -@ifnottex -@samp{prec = 3.322 * dps}), -would be 3.322 x 183231, -@end ifnottex -or 608693. - -The result from an arithmetic operation with an integer and a floating-point value -is a floating-point value with a precision equal to the working precision. -The following program calculates the eighth term in -Sylvester's sequence@footnote{Weisstein, Eric W. -@cite{Sylvester's Sequence}. From MathWorld--A Wolfram Web Resource. 
-@url{http://mathworld.wolfram.com/SylvestersSequence.html}} -using a recurrence: - -@example -$ @kbd{gawk -M 'BEGIN @{} -> @kbd{s = 2.0} -> @kbd{for (i = 1; i <= 7; i++)} -> @kbd{s = s * (s - 1) + 1} -> @kbd{print s} -> @kbd{@}'} -@print{} 113423713055421845118910464 -@end example - -The output differs from the actual number, 113423713055421844361000443, -because the default precision of 53 is not enough to represent the -floating-point results exactly. You can either increase the precision -(100 is enough in this case), or replace the floating-point constant -@code{2.0} with an integer, to perform all computations using integer -arithmetic to get the correct output. - -It will sometimes be necessary for @command{gawk} to implicitly convert an -arbitrary precision integer into an arbitrary precision floating-point value. -This is primarily because the MPFR library does not always provide the -relevant interface to process arbitrary precision integers or mixed-mode -numbers as needed by an operation or function. -In such a case, the precision is set to the minimum value necessary -for exact conversion, and the working precision is not used for this purpose. -If this is not what you need or want, you can employ a subterfuge -like this: - -@example -gawk -M 'BEGIN @{ n = 13; print (n + 0.0) % 2.0 @}' -@end example - -You can avoid this issue altogether by specifying the number as a float -to begin with: - -@example -gawk -M 'BEGIN @{ n = 13.0; print n % 2.0 @}' -@end example - -Note that for the particular example above, there is unlikely to be a -reason not to simply use the following: - -@example -gawk -M 'BEGIN @{ n = 13; print n % 2 @}' -@end example - - -@node MPFR and GMP Libraries -@section Information About the MPFR and GMP Libraries - -There are a few elements available in the @code{PROCINFO} array -to provide information about the MPFR and GMP libraries. -@xref{Auto-set}, for more information. 
- -@node Advanced Features -@chapter Advanced Features of @command{gawk} -@cindex advanced features, network connections, See Also networks, connections -@c STARTOFRANGE gawadv -@cindex @command{gawk}, features, advanced -@c STARTOFRANGE advgaw -@cindex advanced features, @command{gawk} @ignore -Contributed by: Peter Langston <pud!psl@bellcore.bellcore.com> - - Found in Steve English's "signature" line: - -"Write documentation as if whoever reads it is a violent psychopath -who knows where you live." -@end ignore -@quotation -@i{Write documentation as if whoever reads it is -a violent psychopath who knows where you live.}@* -Steve English, as quoted by Peter Langston -@end quotation - -This @value{CHAPTER} discusses advanced features in @command{gawk}. -It's a bit of a ``grab bag'' of items that are otherwise unrelated -to each other. -First, a command-line option allows @command{gawk} to recognize -nondecimal numbers in input data, not just in @command{awk} -programs. -Then, @command{gawk}'s special features for sorting arrays are presented. -Next, two-way I/O, discussed briefly in earlier parts of this -@value{DOCUMENT}, is described in full detail, along with the basics -of TCP/IP networking. Finally, @command{gawk} -can @dfn{profile} an @command{awk} program, making it possible to tune -it for performance. - -@ref{Dynamic Extensions}, -discusses the ability to dynamically add new built-in functions to -@command{gawk}. As this feature is still immature and likely to change, -its description is relegated to an appendix. - -@menu -* Nondecimal Data:: Allowing nondecimal input data. -* Array Sorting:: Facilities for controlling array traversal and - sorting arrays. -* Two-way I/O:: Two-way communications with another process. -* TCP/IP Networking:: Using @command{gawk} for network programming. -* Profiling:: Profiling your @command{awk} programs. 
-@end menu - -@node Nondecimal Data -@section Allowing Nondecimal Input Data -@cindex @code{--non-decimal-data} option -@cindex advanced features, @command{gawk}, nondecimal input data -@cindex input, data@comma{} nondecimal -@cindex constants, nondecimal - -If you run @command{gawk} with the @option{--non-decimal-data} option, -you can have nondecimal constants in your input data: - -@c line break here for small book format -@example -$ @kbd{echo 0123 123 0x123 |} -> @kbd{gawk --non-decimal-data '@{ printf "%d, %d, %d\n",} -> @kbd{$1, $2, $3 @}'} -@print{} 83, 123, 291 -@end example - -For this feature to work, write your program so that -@command{gawk} treats your data as numeric: - -@example -$ @kbd{echo 0123 123 0x123 | gawk '@{ print $1, $2, $3 @}'} -@print{} 0123 123 0x123 -@end example - -@noindent -The @code{print} statement treats its expressions as strings. -Although the fields can act as numbers when necessary, -they are still strings, so @code{print} does not try to treat them -numerically. You may need to add zero to a field to force it to -be treated as a number. For example: - -@example -$ @kbd{echo 0123 123 0x123 | gawk --non-decimal-data '} -> @kbd{@{ print $1, $2, $3} -> @kbd{print $1 + 0, $2 + 0, $3 + 0 @}'} -@print{} 0123 123 0x123 -@print{} 83 123 291 -@end example - -Because it is common to have decimal data with leading zeros, and because -using this facility could lead to surprising results, the default is to leave it -disabled. If you want it, you must explicitly request it. - -@cindex programming conventions, @code{--non-decimal-data} option -@cindex @code{--non-decimal-data} option, @code{strtonum()} function and -@cindex @code{strtonum()} function (@command{gawk}), @code{--non-decimal-data} option and -@quotation CAUTION -@emph{Use of this option is not recommended.} -It can break old programs very badly. -Instead, use the @code{strtonum()} function to convert your data -(@pxref{Nondecimal-numbers}). 
-This makes your programs easier to write and easier to read, and -leads to less surprising results. -@end quotation - -@node Array Sorting -@section Controlling Array Traversal and Array Sorting - -@command{gawk} lets you control the order in which a @samp{for (i in array)} -loop traverses an array. - -In addition, two built-in functions, @code{asort()} and @code{asorti()}, -let you sort arrays based on the array values and indices, respectively. -These two functions also provide control over the sorting criteria used -to order the elements during sorting. - -@menu -* Controlling Array Traversal:: How to use PROCINFO["sorted_in"]. -* Array Sorting Functions:: How to use @code{asort()} and @code{asorti()}. -@end menu - -@node Controlling Array Traversal -@subsection Controlling Array Traversal - -By default, the order in which a @samp{for (i in array)} loop -scans an array is not defined; it is generally based upon -the internal implementation of arrays inside @command{awk}. - -Often, though, it is desirable to be able to loop over the elements -in a particular order that you, the programmer, choose. @command{gawk} -lets you do this. - -@ref{Controlling Scanning}, describes how you can assign special, -pre-defined values to @code{PROCINFO["sorted_in"]} in order to -control the order in which @command{gawk} will traverse an array -during a @code{for} loop. - -In addition, the value of @code{PROCINFO["sorted_in"]} can be a function name. -This lets you traverse an array based on any custom criterion. -The array elements are ordered according to the return value of this -function. The comparison function should be defined with at least -four arguments: - -@example -function comp_func(i1, v1, i2, v2) -@{ - @var{compare elements 1 and 2 in some fashion} - @var{return < 0; 0; or > 0} -@} -@end example - -Here, @var{i1} and @var{i2} are the indices, and @var{v1} and @var{v2} -are the corresponding values of the two elements being compared. 
-Either @var{v1} or @var{v2}, or both, can be arrays if the array being -traversed contains subarrays as values. -(@xref{Arrays of Arrays}, for more information about subarrays.) -The three possible return values are interpreted as follows: - -@table @code -@item comp_func(i1, v1, i2, v2) < 0 -Index @var{i1} comes before index @var{i2} during loop traversal. - -@item comp_func(i1, v1, i2, v2) == 0 -Indices @var{i1} and @var{i2} -come together but the relative order with respect to each other is undefined. - -@item comp_func(i1, v1, i2, v2) > 0 -Index @var{i1} comes after index @var{i2} during loop traversal. -@end table - -Our first comparison function can be used to scan an array in -numerical order of the indices: - -@example -function cmp_num_idx(i1, v1, i2, v2) -@{ - # numerical index comparison, ascending order - return (i1 - i2) -@} -@end example - -Our second function traverses an array based on the string order of -the element values rather than by indices: - -@example -function cmp_str_val(i1, v1, i2, v2) -@{ - # string value comparison, ascending order - v1 = v1 "" - v2 = v2 "" - if (v1 < v2) - return -1 - return (v1 != v2) -@} -@end example - -The third -comparison function makes all numbers, and numeric strings without -any leading or trailing spaces, come out first during loop traversal: - -@example -function cmp_num_str_val(i1, v1, i2, v2, n1, n2) -@{ - # numbers before string value comparison, ascending order - n1 = v1 + 0 - n2 = v2 + 0 - if (n1 == v1) - return (n2 == v2) ? (n1 - n2) : -1 - else if (n2 == v2) - return 1 - return (v1 < v2) ? 
-1 : (v1 != v2) -@} -@end example - -Here is a main program to demonstrate how @command{gawk} -behaves using each of the previous functions: - -@example -BEGIN @{ - data["one"] = 10 - data["two"] = 20 - data[10] = "one" - data[100] = 100 - data[20] = "two" - - f[1] = "cmp_num_idx" - f[2] = "cmp_str_val" - f[3] = "cmp_num_str_val" - for (i = 1; i <= 3; i++) @{ - printf("Sort function: %s\n", f[i]) - PROCINFO["sorted_in"] = f[i] - for (j in data) - printf("\tdata[%s] = %s\n", j, data[j]) - print "" - @} -@} -@end example - -Here are the results when the program is run: -@page - -@example -$ @kbd{gawk -f compdemo.awk} -@print{} Sort function: cmp_num_idx @ii{Sort by numeric index} -@print{} data[two] = 20 -@print{} data[one] = 10 @ii{Both strings are numerically zero} -@print{} data[10] = one -@print{} data[20] = two -@print{} data[100] = 100 -@print{} -@print{} Sort function: cmp_str_val @ii{Sort by element values as strings} -@print{} data[one] = 10 -@print{} data[100] = 100 @ii{String 100 is less than string 20} -@print{} data[two] = 20 -@print{} data[10] = one -@print{} data[20] = two -@print{} -@print{} Sort function: cmp_num_str_val @ii{Sort all numeric values before all strings} -@print{} data[one] = 10 -@print{} data[two] = 20 -@print{} data[100] = 100 -@print{} data[10] = one -@print{} data[20] = two -@end example - -Consider sorting the entries of a GNU/Linux system password file -according to login name. The following program sorts records -by a specific field position and can be used for this purpose: - -@example -# sort.awk --- simple program to sort by field position -# field position is specified by the global variable POS - -function cmp_field(i1, v1, i2, v2) -@{ - # comparison by value, as string, and ascending order - return v1[POS] < v2[POS] ? 
-1 : (v1[POS] != v2[POS]) -@} - -@{ - for (i = 1; i <= NF; i++) - a[NR][i] = $i -@} - -END @{ - PROCINFO["sorted_in"] = "cmp_field" - if (POS < 1 || POS > NF) - POS = 1 - for (i in a) @{ - for (j = 1; j <= NF; j++) - printf("%s%c", a[i][j], j < NF ? ":" : "") - print "" - @} -@} -@end example - -The first field in each entry of the password file is the user's login name, -and the fields are separated by colons. -Each record defines a subarray, -with each field as an element in the subarray. -Running the program produces the -following output: - -@example -$ @kbd{gawk -vPOS=1 -F: -f sort.awk /etc/passwd} -@print{} adm:x:3:4:adm:/var/adm:/sbin/nologin -@print{} apache:x:48:48:Apache:/var/www:/sbin/nologin -@print{} avahi:x:70:70:Avahi daemon:/:/sbin/nologin -@dots{} -@end example - -The comparison should normally always return the same value when given a -specific pair of array elements as its arguments. If inconsistent -results are returned then the order is undefined. This behavior can be -exploited to introduce random order into otherwise seemingly -ordered data: - -@example -function cmp_randomize(i1, v1, i2, v2) -@{ - # random order - return (2 - 4 * rand()) -@} -@end example - -As mentioned above, the order of the indices is arbitrary if two -elements compare equal. This is usually not a problem, but letting -the tied elements come out in arbitrary order can be an issue, especially -when comparing item values. The partial ordering of the equal elements -may change during the next loop traversal, if other elements are added or -removed from the array. One way to resolve ties when comparing elements -with otherwise equal values is to include the indices in the comparison -rules. Note that doing this may make the loop traversal less efficient, -so consider it only if necessary. 
The following comparison functions -force a deterministic order, and are based on the fact that the -indices of two elements are never equal: - -@example -function cmp_numeric(i1, v1, i2, v2) -@{ - # numerical value (and index) comparison, descending order - return (v1 != v2) ? (v2 - v1) : (i2 - i1) -@} - -function cmp_string(i1, v1, i2, v2) -@{ - # string value (and index) comparison, descending order - v1 = v1 i1 - v2 = v2 i2 - return (v1 > v2) ? -1 : (v1 != v2) -@} -@end example - -@c Avoid using the term ``stable'' when describing the unpredictable behavior -@c if two items compare equal. Usually, the goal of a "stable algorithm" -@c is to maintain the original order of the items, which is a meaningless -@c concept for a list constructed from a hash. - -A custom comparison function can often simplify ordered loop -traversal, and the sky is really the limit when it comes to -designing such a function. - -When string comparisons are made during a sort, either for element -values where one or both aren't numbers, or for element indices -handled as strings, the value of @code{IGNORECASE} -(@pxref{Built-in Variables}) controls whether -the comparisons treat corresponding uppercase and lowercase letters as -equivalent or distinct. - -Another point to keep in mind is that in the case of subarrays -the element values can themselves be arrays; a production comparison -function should use the @code{isarray()} function -(@pxref{Type Functions}), -to check for this, and choose a defined sorting order for subarrays. - -All sorting based on @code{PROCINFO["sorted_in"]} -is disabled in POSIX mode, -since the @code{PROCINFO} array is not special in that case. - -As a side note, sorting the array indices before traversing -the array has been reported to add 15% to 20% overhead to the -execution time of @command{awk} programs. For this reason, -sorted array traversal is not the default. 
- -@c The @command{gawk} -@c maintainers believe that only the people who wish to use a -@c feature should have to pay for it. - -@node Array Sorting Functions -@subsection Sorting Array Values and Indices with @command{gawk} - -@cindex arrays, sorting -@cindex @code{asort()} function (@command{gawk}) -@cindex @code{asort()} function (@command{gawk}), arrays@comma{} sorting -@cindex sort function, arrays, sorting -In most @command{awk} implementations, sorting an array requires -writing a @code{sort()} function. -While this can be educational for exploring different sorting algorithms, -usually that's not the point of the program. -@command{gawk} provides the built-in @code{asort()} -and @code{asorti()} functions -(@pxref{String Functions}) -for sorting arrays. For example: - -@example -@var{populate the array} data -n = asort(data) -for (i = 1; i <= n; i++) - @var{do something with} data[i] -@end example - -After the call to @code{asort()}, the array @code{data} is indexed from 1 -to some number @var{n}, the total number of elements in @code{data}. -(This count is @code{asort()}'s return value.) -@code{data[1]} @value{LEQ} @code{data[2]} @value{LEQ} @code{data[3]}, and so on. -The comparison is based on the type of the elements -(@pxref{Typing and Comparison}). -All numeric values come before all string values, -which in turn come before all subarrays. - -@cindex side effects, @code{asort()} function -An important side effect of calling @code{asort()} is that -@emph{the array's original indices are irrevocably lost}. -As this isn't always desirable, @code{asort()} accepts a -second argument: - -@example -@var{populate the array} source -n = asort(source, dest) -for (i = 1; i <= n; i++) - @var{do something with} dest[i] -@end example - -In this case, @command{gawk} copies the @code{source} array into the -@code{dest} array and then sorts @code{dest}, destroying its indices. -However, the @code{source} array is not affected. 
- -@code{asort()} accepts a third string argument to control comparison of -array elements. As with @code{PROCINFO["sorted_in"]}, this argument -may be one of the predefined names that @command{gawk} provides -(@pxref{Controlling Scanning}), or the name of a user-defined function -(@pxref{Controlling Array Traversal}). - -@quotation NOTE -In all cases, the sorted element values consist of the original -array's element values. The ability to control comparison merely -affects the way in which they are sorted. -@end quotation - -Often, what's needed is to sort on the values of the @emph{indices} -instead of the values of the elements. -To do that, use the -@code{asorti()} function. The interface is identical to that of -@code{asort()}, except that the index values are used for sorting, and -become the values of the result array: - -@example -@{ source[$0] = some_func($0) @} - -END @{ - n = asorti(source, dest) - for (i = 1; i <= n; i++) @{ - @ii{Work with sorted indices directly:} - @var{do something with} dest[i] - @dots{} - @ii{Access original array via sorted indices:} - @var{do something with} source[dest[i]] - @} -@} -@end example - -Similar to @code{asort()}, -in all cases, the sorted element values consist of the original -array's indices. The ability to control comparison merely -affects the way in which they are sorted. - -Sorting the array by replacing the indices provides maximal flexibility. -To traverse the elements in decreasing order, use a loop that goes from -@var{n} down to 1, either over the elements or over the indices.@footnote{You -may also use one of the predefined sorting names that sorts in -decreasing order.} - -@cindex reference counting, sorting arrays -Copying array indices and elements isn't expensive in terms of memory. -Internally, @command{gawk} maintains @dfn{reference counts} to data. 
-For example, when @code{asort()} copies the first array to the second one, -there is only one copy of the original array elements' data, even though -both arrays use the values. - -@c Document It And Call It A Feature. Sigh. -@cindex @command{gawk}, @code{IGNORECASE} variable in -@cindex @code{IGNORECASE} variable -@cindex arrays, sorting, @code{IGNORECASE} variable and -@cindex @code{IGNORECASE} variable, array sorting and -Because @code{IGNORECASE} affects string comparisons, the value -of @code{IGNORECASE} also affects sorting for both @code{asort()} and @code{asorti()}. -Note also that the locale's sorting order does @emph{not} -come into play; comparisons are based on character values only.@footnote{This -is true because locale-based comparison occurs only when in POSIX -compatibility mode, and since @code{asort()} and @code{asorti()} are -@command{gawk} extensions, they are not available in that case.} -Caveat Emptor. - -@node Two-way I/O -@section Two-Way Communications with Another Process -@cindex Brennan, Michael -@cindex programmers, attractiveness of -@smallexample -@c Path: cssun.mathcs.emory.edu!gatech!newsxfer3.itd.umich.edu!news-peer.sprintlink.net!news-sea-19.sprintlink.net!news-in-west.sprintlink.net!news.sprintlink.net!Sprint!204.94.52.5!news.whidbey.com!brennan -From: brennan@@whidbey.com (Mike Brennan) -Newsgroups: comp.lang.awk -Subject: Re: Learn the SECRET to Attract Women Easily -Date: 4 Aug 1997 17:34:46 GMT -@c Organization: WhidbeyNet -@c Lines: 12 -Message-ID: <5s53rm$eca@@news.whidbey.com> -@c References: <5s20dn$2e1@chronicle.concentric.net> -@c Reply-To: brennan@whidbey.com -@c NNTP-Posting-Host: asn202.whidbey.com -@c X-Newsreader: slrn (0.9.4.1 UNIX) -@c Xref: cssun.mathcs.emory.edu comp.lang.awk:5403 - -On 3 Aug 1997 13:17:43 GMT, Want More Dates??? 
-<tracy78@@kilgrona.com> wrote: ->Learn the SECRET to Attract Women Easily -> ->The SCENT(tm) Pheromone Sex Attractant For Men to Attract Women - -The scent of awk programmers is a lot more attractive to women than -the scent of perl programmers. --- -Mike Brennan -@c brennan@@whidbey.com -@end smallexample - -@cindex advanced features, @command{gawk}, processes@comma{} communicating with -@cindex processes, two-way communications with -It is often useful to be able to -send data to a separate program for -processing and then read the result. This can always be -done with temporary files: - -@example -# Write the data for processing -tempfile = ("mydata." PROCINFO["pid"]) -while (@var{not done with data}) - print @var{data} | ("subprogram > " tempfile) -close("subprogram > " tempfile) - -# Read the results, remove tempfile when done -while ((getline newdata < tempfile) > 0) - @var{process} newdata @var{appropriately} -close(tempfile) -system("rm " tempfile) -@end example - -@noindent -This works, but not elegantly. Among other things, it requires that -the program be run in a directory that cannot be shared among users; -for example, @file{/tmp} will not do, as another user might happen -to be using a temporary file with the same name. - -@cindex coprocesses -@cindex input/output, two-way -@cindex @code{|} (vertical bar), @code{|&} operator (I/O) -@cindex vertical bar (@code{|}), @code{|&} operator (I/O) -@cindex @command{csh} utility, @code{|&} operator, comparison with -However, with @command{gawk}, it is possible to -open a @emph{two-way} pipe to another process. The second process is -termed a @dfn{coprocess}, since it runs in parallel with @command{gawk}. 
-The two-way connection is created using the @samp{|&} operator -(borrowed from the Korn shell, @command{ksh}):@footnote{This is very -different from the same operator in the C shell.} - -@example -do @{ - print @var{data} |& "subprogram" - "subprogram" |& getline results -@} while (@var{data left to process}) -close("subprogram") -@end example - -The first time an I/O operation is executed using the @samp{|&} -operator, @command{gawk} creates a two-way pipeline to a child process -that runs the other program. Output created with @code{print} -or @code{printf} is written to the program's standard input, and -output from the program's standard output can be read by the @command{gawk} -program using @code{getline}. -As is the case with processes started by @samp{|}, the subprogram -can be any program, or pipeline of programs, that can be started by -the shell. +@ifdocbook +@part Part II:@* Problem Solving With @command{awk} -There are some cautionary items to be aware of: +Part II shows how to use @command{awk} and @command{gawk} for problem solving. +There is lots of code here for you to read and learn from. +It contains the following chapters: @itemize @bullet @item -As the code inside @command{gawk} currently stands, the coprocess's -standard error goes to the same place that the parent @command{gawk}'s -standard error goes. It is not possible to read the child's -standard error separately. +@ref{Library Functions}. -@cindex deadlocks -@cindex buffering, input/output -@cindex @code{getline} command, deadlock and @item -I/O buffering may be a problem. @command{gawk} automatically -flushes all output down the pipe to the coprocess. -However, if the coprocess does not flush its output, -@command{gawk} may hang when doing a @code{getline} in order to read -the coprocess's results. This could lead to a situation -known as @dfn{deadlock}, where each process is waiting for the -other one to do something. +@ref{Sample Programs}. 
@end itemize - -@cindex @code{close()} function, two-way pipes and -It is possible to close just one end of the two-way pipe to -a coprocess, by supplying a second argument to the @code{close()} -function of either @code{"to"} or @code{"from"} -(@pxref{Close Files And Pipes}). -These strings tell @command{gawk} to close the end of the pipe -that sends data to the coprocess or the end that reads from it, -respectively. - -@cindex @command{sort} utility, coprocesses and -This is particularly necessary in order to use -the system @command{sort} utility as part of a coprocess; -@command{sort} must read @emph{all} of its input -data before it can produce any output. -The @command{sort} program does not receive an end-of-file indication -until @command{gawk} closes the write end of the pipe. - -When you have finished writing data to the @command{sort} -utility, you can close the @code{"to"} end of the pipe, and -then start reading sorted data via @code{getline}. -For example: - -@example -BEGIN @{ - command = "LC_ALL=C sort" - n = split("abcdefghijklmnopqrstuvwxyz", a, "") - - for (i = n; i > 0; i--) - print a[i] |& command - close(command, "to") - - while ((command |& getline line) > 0) - print "got", line - close(command) -@} -@end example - -This program writes the letters of the alphabet in reverse order, one -per line, down the two-way pipe to @command{sort}. It then closes the -write end of the pipe, so that @command{sort} receives an end-of-file -indication. This causes @command{sort} to sort the data and write the -sorted data back to the @command{gawk} program. Once all of the data -has been read, @command{gawk} terminates the coprocess and exits. - -As a side note, the assignment @samp{LC_ALL=C} in the @command{sort} -command ensures traditional Unix (ASCII) sorting from @command{sort}. 
- -@cindex @command{gawk}, @code{PROCINFO} array in -@cindex @code{PROCINFO} array -You may also use pseudo-ttys (ptys) for -two-way communication instead of pipes, if your system supports them. -This is done on a per-command basis, by setting a special element -in the @code{PROCINFO} array -(@pxref{Auto-set}), -like so: - -@example -command = "sort -nr" # command, save in convenience variable -PROCINFO[command, "pty"] = 1 # update PROCINFO -print @dots{} |& command # start two-way pipe -@dots{} -@end example - -@noindent -Using ptys avoids the buffer deadlock issues described earlier, at some -loss in performance. If your system does not have ptys, or if all the -system's ptys are in use, @command{gawk} automatically falls back to -using regular pipes. - -@node TCP/IP Networking -@section Using @command{gawk} for Network Programming -@cindex advanced features, @command{gawk}, network programming -@cindex networks, programming -@c STARTOFRANGE tcpip -@cindex TCP/IP -@cindex @code{/inet/@dots{}} special files (@command{gawk}) -@cindex files, @code{/inet/@dots{}} (@command{gawk}) -@cindex @code{/inet4/@dots{}} special files (@command{gawk}) -@cindex files, @code{/inet4/@dots{}} (@command{gawk}) -@cindex @code{/inet6/@dots{}} special files (@command{gawk}) -@cindex files, @code{/inet6/@dots{}} (@command{gawk}) -@cindex @code{EMISTERED} -@quotation -@code{EMISTERED}:@* -@ @ @ @ @i{A host is a host from coast to coast,@* -@ @ @ @ and no-one can talk to host that's close,@* -@ @ @ @ unless the host that isn't close@* -@ @ @ @ is busy hung or dead.} -@end quotation - -In addition to being able to open a two-way pipeline to a coprocess -on the same system -(@pxref{Two-way I/O}), -it is possible to make a two-way connection to -another process on another system across an IP network connection. - -You can think of this as just a @emph{very long} two-way pipeline to -a coprocess. 
-The way @command{gawk} decides that you want to use TCP/IP networking is -by recognizing special @value{FN}s that begin with one of @samp{/inet/}, -@samp{/inet4/} or @samp{/inet6/}. - -The full syntax of the special @value{FN} is -@file{/@var{net-type}/@var{protocol}/@var{local-port}/@var{remote-host}/@var{remote-port}}. -The components are: - -@table @var -@item net-type -Specifies the kind of Internet connection to make. -Use @samp{/inet4/} to force IPv4, and -@samp{/inet6/} to force IPv6. -Plain @samp{/inet/} (which used to be the only option) uses -the system default, most likely IPv4. - -@item protocol -The protocol to use over IP. This must be either @samp{tcp}, or -@samp{udp}, for a TCP or UDP IP connection, -respectively. The use of TCP is recommended for most applications. - -@item local-port -@cindex @code{getaddrinfo()} function (C library) -The local TCP or UDP port number to use. Use a port number of @samp{0} -when you want the system to pick a port. This is what you should do -when writing a TCP or UDP client. -You may also use a well-known service name, such as @samp{smtp} -or @samp{http}, in which case @command{gawk} attempts to determine -the predefined port number using the C @code{getaddrinfo()} function. - -@item remote-host -The IP address or fully-qualified domain name of the Internet -host to which you want to connect. - -@item remote-port -The TCP or UDP port number to use on the given @var{remote-host}. -Again, use @samp{0} if you don't care, or else a well-known -service name. -@end table - -@cindex @command{gawk}, @code{ERRNO} variable in -@cindex @code{ERRNO} variable -@quotation NOTE -Failure in opening a two-way socket will result in a non-fatal error -being returned to the calling code. The value of @code{ERRNO} indicates -the error (@pxref{Auto-set}).
-@end quotation - -Consider the following very simple example: - -@example -BEGIN @{ - Service = "/inet/tcp/0/localhost/daytime" - Service |& getline - print $0 - close(Service) -@} -@end example - -This program reads the current date and time from the local system's -TCP @samp{daytime} server. -It then prints the results and closes the connection. - -Because this topic is extensive, the use of @command{gawk} for -TCP/IP programming is documented separately. -@ifinfo -See -@inforef{Top, , General Introduction, gawkinet, TCP/IP Internetworking with @command{gawk}}, -@end ifinfo -@ifnotinfo -See @cite{TCP/IP Internetworking with @command{gawk}}, -which comes as part of the @command{gawk} distribution, -@end ifnotinfo -for a much more complete introduction and discussion, as well as -extensive examples. - -@c ENDOFRANGE tcpip - -@node Profiling -@section Profiling Your @command{awk} Programs -@c STARTOFRANGE awkp -@cindex @command{awk} programs, profiling -@c STARTOFRANGE proawk -@cindex profiling @command{awk} programs -@cindex profiling @command{gawk} -@cindex @code{awkprof.out} file -@cindex files, @code{awkprof.out} - -You may produce execution traces of your @command{awk} programs. -This is done by passing the option @option{--profile} to @command{gawk}. -When @command{gawk} has finished running, it creates a profile of your program in a file -named @file{awkprof.out}. Because it is profiling, it also executes up to 45% slower than -@command{gawk} normally does. - -@cindex @code{--profile} option -As shown in the following example, -the @option{--profile} option can be used to change the name of the file -where @command{gawk} will write the profile: - -@example -gawk --profile=myprog.prof -f myprog.awk data1 data2 -@end example - -@noindent -In the above example, @command{gawk} places the profile in -@file{myprog.prof} instead of in @file{awkprof.out}. 
- -Here is a sample session showing a simple @command{awk} program, its input data, and the -results from running @command{gawk} with the @option{--profile} option. -First, the @command{awk} program: - -@example -BEGIN @{ print "First BEGIN rule" @} - -END @{ print "First END rule" @} - -/foo/ @{ - print "matched /foo/, gosh" - for (i = 1; i <= 3; i++) - sing() -@} - -@{ - if (/foo/) - print "if is true" - else - print "else is true" -@} - -BEGIN @{ print "Second BEGIN rule" @} - -END @{ print "Second END rule" @} - -function sing( dummy) -@{ - print "I gotta be me!" -@} -@end example - -Following is the input data: - -@example -foo -bar -baz -foo -junk -@end example - -Here is the @file{awkprof.out} that results from running the @command{gawk} -profiler on this program and data (this example also illustrates that @command{awk} -programmers sometimes have to work late): - -@cindex @code{BEGIN} pattern -@cindex @code{END} pattern -@example - # gawk profile, created Sun Aug 13 00:00:15 2000 - - # BEGIN block(s) - - BEGIN @{ - 1 print "First BEGIN rule" - 1 print "Second BEGIN rule" - @} - - # Rule(s) - - 5 /foo/ @{ # 2 - 2 print "matched /foo/, gosh" - 6 for (i = 1; i <= 3; i++) @{ - 6 sing() - @} - @} - - 5 @{ - 5 if (/foo/) @{ # 2 - 2 print "if is true" - 3 @} else @{ - 3 print "else is true" - @} - @} - - # END block(s) - - END @{ - 1 print "First END rule" - 1 print "Second END rule" - @} - - # Functions, listed alphabetically - - 6 function sing(dummy) - @{ - 6 print "I gotta be me!" - @} -@end example - -This example illustrates many of the basic features of profiling output. -They are as follows: - -@itemize @bullet -@item -The program is printed in the order @code{BEGIN} rule, -@code{BEGINFILE} rule, -pattern/action rules, -@code{ENDFILE} rule, @code{END} rule and functions, listed -alphabetically. -Multiple @code{BEGIN} and @code{END} rules are merged together, -as are multiple @code{BEGINFILE} and @code{ENDFILE} rules. 
- -@cindex patterns, counts -@item -Pattern-action rules have two counts. -The first count, to the left of the rule, shows how many times -the rule's pattern was @emph{tested}. -The second count, to the right of the rule's opening left brace -in a comment, -shows how many times the rule's action was @emph{executed}. -The difference between the two indicates how many times the rule's -pattern evaluated to false. - -@item -Similarly, -the count for an @code{if}-@code{else} statement shows how many times -the condition was tested. -To the right of the opening left brace for the @code{if}'s body -is a count showing how many times the condition was true. -The count for the @code{else} -indicates how many times the test failed. - -@cindex loops, count for header -@item -The count for a loop header (such as @code{for} -or @code{while}) shows how many times the loop test was executed. -(Because of this, you can't just look at the count on the first -statement in a rule to determine how many times the rule was executed. -If the first statement is a loop, the count is misleading.) - -@cindex functions, user-defined, counts -@cindex user-defined, functions, counts -@item -For user-defined functions, the count next to the @code{function} -keyword indicates how many times the function was called. -The counts next to the statements in the body show how many times -those statements were executed. - -@cindex @code{@{@}} (braces) -@cindex braces (@code{@{@}}) -@item -The layout uses ``K&R'' style with TABs. -Braces are used everywhere, even when -the body of an @code{if}, @code{else}, or loop is only a single statement. - -@cindex @code{()} (parentheses) -@cindex parentheses @code{()} -@item -Parentheses are used only where needed, as indicated by the structure -of the program and the precedence rules. -@c extra verbiage here satisfies the copyeditor. ugh. -For example, @samp{(3 + 5) * 4} means add three plus five, then multiply -the total by four. 
However, @samp{3 + 5 * 4} has no parentheses, and -means @samp{3 + (5 * 4)}. - -@ignore -@item -All string concatenations are parenthesized too. -(This could be made a bit smarter.) +@end ifdocbook @end ignore -@item -Parentheses are used around the arguments to @code{print} -and @code{printf} only when -the @code{print} or @code{printf} statement is followed by a redirection. -Similarly, if -the target of a redirection isn't a scalar, it gets parenthesized. - -@item -@command{gawk} supplies leading comments in -front of the @code{BEGIN} and @code{END} rules, -the pattern/action rules, and the functions. - -@end itemize - -The profiled version of your program may not look exactly like what you -typed when you wrote it. This is because @command{gawk} creates the -profiled version by ``pretty printing'' its internal representation of -the program. The advantage to this is that @command{gawk} can produce -a standard representation. The disadvantage is that all source-code -comments are lost, as are the distinctions among multiple @code{BEGIN}, -@code{END}, @code{BEGINFILE}, and @code{ENDFILE} rules. Also, things such as: - -@example -/foo/ -@end example - -@noindent -come out as: - -@example -/foo/ @{ - print $0 -@} -@end example - -@noindent -which is correct, but possibly surprising. - -@cindex profiling @command{awk} programs, dynamically -@cindex @command{gawk} program, dynamic profiling -Besides creating profiles when a program has completed, -@command{gawk} can produce a profile while it is running. -This is useful if your @command{awk} program goes into an -infinite loop and you want to see what has been executed. 
-To use this feature, run @command{gawk} with the @option{--profile} -option in the background: - -@example -$ @kbd{gawk --profile -f myprog &} -[1] 13992 -@end example - -@cindex @command{kill} command@comma{} dynamic profiling -@cindex @code{USR1} signal -@cindex @code{SIGUSR1} signal -@cindex signals, @code{USR1}/@code{SIGUSR1} -@noindent -The shell prints a job number and process ID number; in this case, 13992. -Use the @command{kill} command to send the @code{USR1} signal -to @command{gawk}: - -@example -$ @kbd{kill -USR1 13992} -@end example - -@noindent -As usual, the profiled version of the program is written to -@file{awkprof.out}, or to a different file if one was specified with -the @option{--profile} option. - -Along with the regular profile, as shown earlier, the profile -includes a trace of any active functions: - -@example -# Function Call Stack: - -# 3. baz -# 2. bar -# 1. foo -# -- main -- -@end example - -You may send @command{gawk} the @code{USR1} signal as many times as you like. -Each time, the profile and function call trace are appended to the output -profile file. - -@cindex @code{HUP} signal -@cindex @code{SIGHUP} signal -@cindex signals, @code{HUP}/@code{SIGHUP} -If you use the @code{HUP} signal instead of the @code{USR1} signal, -@command{gawk} produces the profile and the function call trace and then exits. - -@cindex @code{INT} signal (MS-Windows) -@cindex @code{SIGINT} signal (MS-Windows) -@cindex signals, @code{INT}/@code{SIGINT} (MS-Windows) -@cindex @code{QUIT} signal (MS-Windows) -@cindex @code{SIGQUIT} signal (MS-Windows) -@cindex signals, @code{QUIT}/@code{SIGQUIT} (MS-Windows) -When @command{gawk} runs on MS-Windows systems, it uses the -@code{INT} and @code{QUIT} signals for producing the profile and, in -the case of the @code{INT} signal, @command{gawk} exits. This is -because these systems don't support the @command{kill} command, so the -only signals you can deliver to a program are those generated by the -keyboard.
The @code{INT} signal is generated by the -@kbd{@value{CTL}-@key{C}} or @kbd{@value{CTL}-@key{BREAK}} key, while the -@code{QUIT} signal is generated by the @kbd{@value{CTL}-@key{\}} key. - -Finally, @command{gawk} also accepts another option @option{--pretty-print}. -When called this way, @command{gawk} ``pretty prints'' the program into -@file{awkprof.out}, without any execution counts. -@c ENDOFRANGE advgaw -@c ENDOFRANGE gawadv -@c ENDOFRANGE awkp -@c ENDOFRANGE proawk - @node Library Functions @chapter A Library of @command{awk} Functions @c STARTOFRANGE libf @@ -20523,7 +18185,7 @@ programming use. * Ordinal Functions:: Functions for using characters as numbers and vice versa. * Join Function:: A function to join an array into a string. -* Gettimeofday Function:: A function to get formatted times. +* Getlocaltime Function:: A function to get formatted times. @end menu @node Strtonum Function @@ -21048,7 +18710,7 @@ be nice if @command{awk} had an assignment operator for concatenation. The lack of an explicit operator for concatenation makes string operations more difficult than they really need to be.} -@node Gettimeofday Function +@node Getlocaltime Function @subsection Managing the Time of Day @cindex libraries of @command{awk} functions, managing, time @@ -21062,14 +18724,14 @@ in human readable form. While @code{strftime()} is extensive, the control formats are not necessarily easy to remember or intuitively obvious when reading a program. -The following function, @code{gettimeofday()}, populates a user-supplied array +The following function, @code{getlocaltime()}, populates a user-supplied array with preformatted time information. 
It returns a string with the current time formatted in the same way as the @command{date} utility: -@cindex @code{gettimeofday()} user-defined function +@cindex @code{getlocaltime()} user-defined function @example @c file eg/lib/gettime.awk -# gettimeofday.awk --- get the time of day in a usable format +# getlocaltime.awk --- get the time of day in a usable format @c endfile @ignore @c file eg/lib/gettime.awk @@ -21102,7 +18764,7 @@ time formatted in the same way as the @command{date} utility: # time["weeknum"] -- week number, Sunday first day # time["altweeknum"] -- week number, Monday first day -function gettimeofday(time, ret, now, i) +function getlocaltime(time, ret, now, i) @{ # get time once, avoids unnecessary system calls now = systime() @@ -21144,7 +18806,7 @@ The string indices are easier to use and read than the various formats required by @code{strftime()}. The @code{alarm} program presented in @ref{Alarm Program}, uses this function. -A more general design for the @code{gettimeofday()} function would have +A more general design for the @code{getlocaltime()} function would have allowed the user to supply an optional timestamp value to use instead of the current time. @@ -24436,8 +22098,8 @@ it prints the message on the standard output. In addition, you can give it the number of times to repeat the message as well as a delay between repetitions. -This program uses the @code{gettimeofday()} function from -@ref{Gettimeofday Function}. +This program uses the @code{getlocaltime()} function from +@ref{Getlocaltime Function}. All the work is done in the @code{BEGIN} rule. 
The first part is argument checking and setting of defaults: the delay, the count, and the message to @@ -24456,7 +22118,7 @@ Here is the program: @c file eg/prog/alarm.awk # alarm.awk --- set an alarm # -# Requires gettimeofday() library function +# Requires getlocaltime() library function @c endfile @ignore @c file eg/prog/alarm.awk @@ -24528,7 +22190,7 @@ is how long to wait before setting off the alarm: minute = atime[2] + 0 # force numeric # get current broken down time - gettimeofday(now) + getlocaltime(now) # if time given is 12-hour hours and it's after that # hour, e.g., `alarm 5:30' at 9 a.m. means 5:30 p.m., @@ -26195,6 +23857,1921 @@ BEGIN { } @end ignore +@iftex +@part Part III:@* Moving Beyond Standard @command{awk} With @command{gawk} +@end iftex + +@ignore +@ifdocbook + +@part Part III:@* Moving Beyond Standard @command{awk} With @command{gawk} + +Part III focuses on features specific to @command{gawk}. +It contains the following chapters: + +@itemize @bullet +@item +@ref{Internationalization}. + +@item +@ref{Advanced Features}. + +@item +@ref{Debugger}. + +@item +@ref{Arbitrary Precision Arithmetic}. + +@item +@ref{Dynamic Extensions}. +@end ifdocbook +@end ignore + +@node Internationalization +@chapter Internationalization with @command{gawk} + +Once upon a time, computer makers +wrote software that worked only in English. +Eventually, hardware and software vendors noticed that if their +systems worked in the native languages of non-English-speaking +countries, they were able to sell more systems. +As a result, internationalization and localization +of programs and software systems became a common practice. + +@c STARTOFRANGE inloc +@cindex internationalization, localization +@cindex @command{gawk}, internationalization and, See internationalization +@cindex internationalization, localization, @command{gawk} and +For many years, the ability to provide internationalization +was largely restricted to programs written in C and C++. 
+This @value{CHAPTER} describes the underlying library @command{gawk} +uses for internationalization, as well as how +@command{gawk} makes internationalization +features available at the @command{awk} program level. +Having internationalization available at the @command{awk} level +gives software developers additional flexibility---they are no +longer forced to write in C or C++ when internationalization is +a requirement. + +@menu +* I18N and L10N:: Internationalization and Localization. +* Explaining gettext:: How GNU @code{gettext} works. +* Programmer i18n:: Features for the programmer. +* Translator i18n:: Features for the translator. +* I18N Example:: A simple i18n example. +* Gawk I18N:: @command{gawk} is also internationalized. +@end menu + +@node I18N and L10N +@section Internationalization and Localization + +@cindex internationalization +@cindex localization, See internationalization@comma{} localization +@cindex localization +@dfn{Internationalization} means writing (or modifying) a program once, +in such a way that it can use multiple languages without requiring +further source-code changes. +@dfn{Localization} means providing the data necessary for an +internationalized program to work in a particular language. +Most typically, these terms refer to features such as the language +used for printing error messages, the language used to read +responses, and information related to how numerical and +monetary values are printed and read. + +@node Explaining gettext +@section GNU @code{gettext} + +@cindex internationalizing a program +@c STARTOFRANGE gettex +@cindex @code{gettext} library +The facilities in GNU @code{gettext} focus on messages: strings printed +by a program, either directly or via formatting with @code{printf} or +@code{sprintf()}.@footnote{For some operating systems, the @command{gawk} +port doesn't support GNU @code{gettext}. +Therefore, these features are not available +if you are using one of those operating systems.
Sorry.} + +@cindex portability, @code{gettext} library and +When using GNU @code{gettext}, each application has its own +@dfn{text domain}. This is a unique name, such as @samp{kpilot} or @samp{gawk}, +that identifies the application. +A complete application may have multiple components---programs written +in C or C++, as well as scripts written in @command{sh} or @command{awk}. +All of the components use the same text domain. + +To make the discussion concrete, assume we're writing an application +named @command{guide}. Internationalization consists of the +following steps, in this order: + +@enumerate +@item +The programmer goes +through the source for all of @command{guide}'s components +and marks each string that is a candidate for translation. +For example, @code{"`-F': option required"} is a good candidate for translation. +A table with strings of option names is not (e.g., @command{gawk}'s +@option{--profile} option should remain the same, no matter what the local +language). + +@cindex @code{textdomain()} function (C library) +@item +The programmer indicates the application's text domain +(@code{"guide"}) to the @code{gettext} library, +by calling the @code{textdomain()} function. + +@cindex @code{.pot} files +@cindex files, @code{.pot} +@cindex portable object template files +@cindex files, portable object template +@item +Messages from the application are extracted from the source code and +collected into a portable object template file (@file{guide.pot}), +which lists the strings and their translations. +The translations are initially empty. +The original (usually English) messages serve as the key for +lookup of the translations. + +@cindex @code{.po} files +@cindex files, @code{.po} +@cindex portable object files +@cindex files, portable object +@item +For each language with a translator, @file{guide.pot} +is copied to a portable object file (@code{.po}) +and translations are created and shipped with the application. 
+For example, there might be a @file{fr.po} for a French translation. + +@cindex @code{.mo} files +@cindex files, @code{.mo} +@cindex message object files +@cindex files, message object +@item +Each language's @file{.po} file is converted into a binary +message object (@file{.mo}) file. +A message object file contains the original messages and their +translations in a binary format that allows fast lookup of translations +at runtime. + +@item +When @command{guide} is built and installed, the binary translation files +are installed in a standard place. + +@cindex @code{bindtextdomain()} function (C library) +@item +For testing and development, it is possible to tell @code{gettext} +to use @file{.mo} files in a different directory than the standard +one by using the @code{bindtextdomain()} function. + +@cindex @code{.mo} files, specifying directory of +@cindex files, @code{.mo}, specifying directory of +@cindex message object files, specifying directory of +@cindex files, message object, specifying directory of +@item +At runtime, @command{guide} looks up each string via a call +to @code{gettext()}. The returned string is the translated string +if available, or the original string if not. + +@item +If necessary, it is possible to access messages from a different +text domain than the one belonging to the application, without +having to switch the application's default text domain back +and forth. +@end enumerate + +@cindex @code{gettext()} function (C library) +In C (or C++), the string marking and dynamic translation lookup +are accomplished by wrapping each string in a call to @code{gettext()}: + +@example +printf("%s", gettext("Don't Panic!\n")); +@end example + +The tools that extract messages from source code pull out all +strings enclosed in calls to @code{gettext()}. 
+ +@cindex @code{_} (underscore), @code{_} C macro +@cindex underscore (@code{_}), @code{_} C macro +The GNU @code{gettext} developers, recognizing that typing +@samp{gettext(@dots{})} over and over again is both painful and ugly to look +at, use the macro @samp{_} (an underscore) to make things easier: + +@example +/* In the standard header file: */ +#define _(str) gettext(str) + +/* In the program text: */ +printf("%s", _("Don't Panic!\n")); +@end example + +@cindex internationalization, localization, locale categories +@cindex @code{gettext} library, locale categories +@cindex locale categories +@noindent +This reduces the typing overhead to just three extra characters per string +and is considerably easier to read as well. + +There are locale @dfn{categories} +for different types of locale-related information. +The defined locale categories that @code{gettext} knows about are: + +@table @code +@cindex @code{LC_MESSAGES} locale category +@item LC_MESSAGES +Text messages. This is the default category for @code{gettext} +operations, but it is possible to supply a different one explicitly, +if necessary. (It is almost never necessary to supply a different category.) + +@cindex sorting characters in different languages +@cindex @code{LC_COLLATE} locale category +@item LC_COLLATE +Text-collation information; i.e., how different characters +and/or groups of characters sort in a given language. + +@cindex @code{LC_CTYPE} locale category +@item LC_CTYPE +Character-type information (alphabetic, digit, upper- or lowercase, and +so on). +This information is accessed via the +POSIX character classes in regular expressions, +such as @code{/[[:alnum:]]/} +(@pxref{Regexp Operators}). + +@cindex monetary information, localization +@cindex currency symbols, localization +@cindex @code{LC_MONETARY} locale category +@item LC_MONETARY +Monetary information, such as the currency symbol, and whether the +symbol goes before or after a number. 
+ +@cindex @code{LC_NUMERIC} locale category +@item LC_NUMERIC +Numeric information, such as which characters to use for the decimal +point and the thousands separator.@footnote{Americans +use a comma every three decimal places and a period for the decimal +point, while many Europeans do exactly the opposite: +1,234.56 versus 1.234,56.} + +@cindex @code{LC_RESPONSE} locale category +@item LC_RESPONSE +Response information, such as how ``yes'' and ``no'' appear in the +local language, and possibly other information as well. + +@cindex time, localization and +@cindex dates, information related to@comma{} localization +@cindex @code{LC_TIME} locale category +@item LC_TIME +Time- and date-related information, such as 12- or 24-hour clock, month printed +before or after the day in a date, local month abbreviations, and so on. + +@cindex @code{LC_ALL} locale category +@item LC_ALL +All of the above. (Not too useful in the context of @code{gettext}.) +@end table +@c ENDOFRANGE gettex + +@node Programmer i18n +@section Internationalizing @command{awk} Programs +@c STARTOFRANGE inap +@cindex @command{awk} programs, internationalizing + +@command{gawk} provides the following variables and functions for +internationalization: + +@table @code +@cindex @code{TEXTDOMAIN} variable +@item TEXTDOMAIN +This variable indicates the application's text domain. +For compatibility with GNU @code{gettext}, the default +value is @code{"messages"}. + +@cindex internationalization, localization, marked strings +@cindex strings, for localization +@item _"your message here" +String constants marked with a leading underscore +are candidates for translation at runtime. +String constants without a leading underscore are not translated. + +@cindex @code{dcgettext()} function (@command{gawk}) +@item dcgettext(@var{string} @r{[}, @var{domain} @r{[}, @var{category}@r{]]}) +Return the translation of @var{string} in +text domain @var{domain} for locale category @var{category}. 
+The default value for @var{domain} is the current value of @code{TEXTDOMAIN}. +The default value for @var{category} is @code{"LC_MESSAGES"}. + +If you supply a value for @var{category}, it must be a string equal to +one of the known locale categories described in +@ifnotinfo +the previous @value{SECTION}. +@end ifnotinfo +@ifinfo +@ref{Explaining gettext}. +@end ifinfo +You must also supply a text domain. Use @code{TEXTDOMAIN} if +you want to use the current domain. + +@quotation CAUTION +The order of arguments to the @command{awk} version +of the @code{dcgettext()} function is purposely different from the order for +the C version. The @command{awk} version's order was +chosen to be simple and to allow for reasonable @command{awk}-style +default arguments. +@end quotation + +@cindex @code{dcngettext()} function (@command{gawk}) +@item dcngettext(@var{string1}, @var{string2}, @var{number} @r{[}, @var{domain} @r{[}, @var{category}@r{]]}) +Return the plural form used for @var{number} of the +translation of @var{string1} and @var{string2} in text domain +@var{domain} for locale category @var{category}. @var{string1} is the +English singular variant of a message, and @var{string2} the English plural +variant of the same message. +The default value for @var{domain} is the current value of @code{TEXTDOMAIN}. +The default value for @var{category} is @code{"LC_MESSAGES"}. + +The same remarks about argument order as for the @code{dcgettext()} function apply. + +@cindex @code{.mo} files, specifying directory of +@cindex files, @code{.mo}, specifying directory of +@cindex message object files, specifying directory of +@cindex files, message object, specifying directory of +@cindex @code{bindtextdomain()} function (@command{gawk}) +@item bindtextdomain(@var{directory} @r{[}, @var{domain}@r{]}) +Change the directory in which +@code{gettext} looks for @file{.mo} files, in case they +will not or cannot be placed in the standard locations +(e.g., during testing). 
+Return the directory in which @var{domain} is ``bound.'' + +The default @var{domain} is the value of @code{TEXTDOMAIN}. +If @var{directory} is the null string (@code{""}), then +@code{bindtextdomain()} returns the current binding for the +given @var{domain}. +@end table + +To use these facilities in your @command{awk} program, follow the steps +outlined in +@ifnotinfo +the previous @value{SECTION}, +@end ifnotinfo +@ifinfo +@ref{Explaining gettext}, +@end ifinfo +like so: + +@enumerate +@cindex @code{BEGIN} pattern, @code{TEXTDOMAIN} variable and +@cindex @code{TEXTDOMAIN} variable, @code{BEGIN} pattern and +@item +Set the variable @code{TEXTDOMAIN} to the text domain of +your program. This is best done in a @code{BEGIN} rule +(@pxref{BEGIN/END}), +or it can also be done via the @option{-v} command-line +option (@pxref{Options}): + +@example +BEGIN @{ + TEXTDOMAIN = "guide" + @dots{} +@} +@end example + +@cindex @code{_} (underscore), translatable string +@cindex underscore (@code{_}), translatable string +@item +Mark all translatable strings with a leading underscore (@samp{_}) +character. It @emph{must} be adjacent to the opening +quote of the string. For example: + +@example +print _"hello, world" +x = _"you goofed" +printf(_"Number of users is %d\n", nusers) +@end example + +@item +If you are creating strings dynamically, you can +still translate them, using the @code{dcgettext()} +built-in function: + +@example +message = nusers " users logged in" +message = dcgettext(message, "adminprog") +print message +@end example + +Here, the call to @code{dcgettext()} supplies a different +text domain (@code{"adminprog"}) in which to find the +message, but it uses the default @code{"LC_MESSAGES"} category. + +@cindex @code{LC_MESSAGES} locale category, @code{bindtextdomain()} function (@command{gawk}) +@item +During development, you might want to put the @file{.mo} +file in a private directory for testing. 
This is done +with the @code{bindtextdomain()} built-in function: + +@example +BEGIN @{ + TEXTDOMAIN = "guide" # our text domain + if (Testing) @{ + # where to find our files + bindtextdomain("testdir") + # joe is in charge of adminprog + bindtextdomain("../joe/testdir", "adminprog") + @} + @dots{} +@} +@end example + +@end enumerate + +@xref{I18N Example}, +for an example program showing the steps to create +and use translations from @command{awk}. + +@node Translator i18n +@section Translating @command{awk} Programs + +@cindex @code{.po} files +@cindex files, @code{.po} +@cindex portable object files +@cindex files, portable object +Once a program's translatable strings have been marked, they must +be extracted to create the initial @file{.po} file. +As part of translation, it is often helpful to rearrange the order +in which arguments to @code{printf} are output. + +@command{gawk}'s @option{--gen-pot} command-line option extracts +the messages and is discussed next. +After that, @code{printf}'s ability to +rearrange the order for @code{printf} arguments at runtime +is covered. + +@menu +* String Extraction:: Extracting marked strings. +* Printf Ordering:: Rearranging @code{printf} arguments. +* I18N Portability:: @command{awk}-level portability issues. +@end menu + +@node String Extraction +@subsection Extracting Marked Strings +@cindex strings, extracting +@cindex marked strings@comma{} extracting +@cindex @code{--gen-pot} option +@cindex command-line options, string extraction +@cindex string extraction (internationalization) +@cindex marked string extraction (internationalization) +@cindex extraction, of marked strings (internationalization) + +@cindex @code{--gen-pot} option +Once your @command{awk} program is working, and all the strings have +been marked and you've set (and perhaps bound) the text domain, +it is time to produce translations. 
+First, use the @option{--gen-pot} command-line option to create +the initial @file{.pot} file: + +@example +$ @kbd{gawk --gen-pot -f guide.awk > guide.pot} +@end example + +@cindex @code{xgettext} utility +When run with @option{--gen-pot}, @command{gawk} does not execute your +program. Instead, it parses it as usual and prints all marked strings +to standard output in the format of a GNU @code{gettext} Portable Object +file. Also included in the output are any constant strings that +appear as the first argument to @code{dcgettext()} or as the first and +second argument to @code{dcngettext()}.@footnote{The +@command{xgettext} utility that comes with GNU +@code{gettext} can handle @file{.awk} files.} +@xref{I18N Example}, +for the full list of steps to go through to create and test +translations for @command{guide}. + +@node Printf Ordering +@subsection Rearranging @code{printf} Arguments + +@cindex @code{printf} statement, positional specifiers +@cindex positional specifiers, @code{printf} statement +Format strings for @code{printf} and @code{sprintf()} +(@pxref{Printf}) +present a special problem for translation. +Consider the following:@footnote{This example is borrowed +from the GNU @code{gettext} manual.} + +@c line broken here only for smallbook format +@example +printf(_"String `%s' has %d characters\n", + string, length(string)) +@end example + +A possible German translation for this might be: + +@example +"%d Zeichen lang ist die Zeichenkette `%s'\n" +@end example + +The problem should be obvious: the order of the format +specifications is different from the original! +Even though @code{gettext()} can return the translated string +at runtime, +it cannot change the argument order in the call to @code{printf}. + +To solve this problem, @code{printf} format specifiers may have +an additional optional element, which we call a @dfn{positional specifier}.
+For example: + +@example +"%2$d Zeichen lang ist die Zeichenkette `%1$s'\n" +@end example + +Here, the positional specifier consists of an integer count, which indicates which +argument to use, and a @samp{$}. Counts are one-based, and the +format string itself is @emph{not} included. Thus, in the following +example, @samp{string} is the first argument and @samp{length(string)} is the second: + +@example +$ @kbd{gawk 'BEGIN @{} +> @kbd{string = "Dont Panic"} +> @kbd{printf _"%2$d characters live in \"%1$s\"\n",} +> @kbd{string, length(string)} +> @kbd{@}'} +@print{} 10 characters live in "Dont Panic" +@end example + +If present, positional specifiers come first in the format specification, +before the flags, the field width, and/or the precision. + +Positional specifiers can be used with the dynamic field width and +precision capability: + +@example +$ @kbd{gawk 'BEGIN @{} +> @kbd{printf("%*.*s\n", 10, 20, "hello")} +> @kbd{printf("%3$*2$.*1$s\n", 20, 10, "hello")} +> @kbd{@}'} +@print{} hello +@print{} hello +@end example + +@quotation NOTE +When using @samp{*} with a positional specifier, the @samp{*} +comes first, then the integer position, and then the @samp{$}. +This is somewhat counterintuitive. +@end quotation + +@cindex @code{printf} statement, positional specifiers, mixing with regular formats +@cindex positional specifiers, @code{printf} statement, mixing with regular formats +@cindex format specifiers, mixing regular with positional specifiers +@command{gawk} does not allow you to mix regular format specifiers +and those with positional specifiers in the same string: + +@example +$ @kbd{gawk 'BEGIN @{ printf _"%d %3$s\n", 1, 2, "hi" @}'} +@error{} gawk: cmd. line:1: fatal: must use `count$' on all formats or none +@end example + +@quotation NOTE +There are some pathological cases that @command{gawk} may fail to +diagnose. In such cases, the output may not be what you expect. 
+It's still a bad idea to try mixing them, even if @command{gawk} +doesn't detect it. +@end quotation + +Although positional specifiers can be used directly in @command{awk} programs, +their primary purpose is to help in producing correct translations of +format strings into languages different from the one in which the program +is first written. + +@node I18N Portability +@subsection @command{awk} Portability Issues + +@cindex portability, internationalization and +@cindex internationalization, localization, portability and +@command{gawk}'s internationalization features were purposely chosen to +have as little impact as possible on the portability of @command{awk} +programs that use them to other versions of @command{awk}. +Consider this program: + +@example +BEGIN @{ + TEXTDOMAIN = "guide" + if (Test_Guide) # set with -v + bindtextdomain("/test/guide/messages") + print _"don't panic!" +@} +@end example + +@noindent +As written, it won't work on other versions of @command{awk}. +However, it is actually almost portable, requiring very little +change: + +@itemize @bullet +@cindex @code{TEXTDOMAIN} variable, portability and +@item +Assignments to @code{TEXTDOMAIN} won't have any effect, +since @code{TEXTDOMAIN} is not special in other @command{awk} implementations. + +@item +Non-GNU versions of @command{awk} treat marked strings +as the concatenation of a variable named @code{_} with the string +following it.@footnote{This is good fodder for an ``Obfuscated +@command{awk}'' contest.} Typically, the variable @code{_} has +the null string (@code{""}) as its value, leaving the original string constant as +the result. + +@item +By defining ``dummy'' functions to replace @code{dcgettext()}, @code{dcngettext()} +and @code{bindtextdomain()}, the @command{awk} program can be made to run, but +all the messages are output in the original language. 
+For example: + +@cindex @code{bindtextdomain()} function (@command{gawk}), portability and +@cindex @code{dcgettext()} function (@command{gawk}), portability and +@cindex @code{dcngettext()} function (@command{gawk}), portability and +@example +@c file eg/lib/libintl.awk +function bindtextdomain(dir, domain) +@{ + return dir +@} + +function dcgettext(string, domain, category) +@{ + return string +@} + +function dcngettext(string1, string2, number, domain, category) +@{ + return (number == 1 ? string1 : string2) +@} +@c endfile +@end example + +@item +The use of positional specifications in @code{printf} or +@code{sprintf()} is @emph{not} portable. +To support @code{gettext()} at the C level, many systems' C versions of +@code{sprintf()} do support positional specifiers. But it works only if +enough arguments are supplied in the function call. Many versions of +@command{awk} pass @code{printf} formats and arguments unchanged to the +underlying C library version of @code{sprintf()}, but only one format and +argument at a time. What happens if a positional specification is +used is anybody's guess. +However, since the positional specifications are primarily for use in +@emph{translated} format strings, and since non-GNU @command{awk}s never +retrieve the translated string, this should not be a problem in practice. +@end itemize +@c ENDOFRANGE inap + +@node I18N Example +@section A Simple Internationalization Example + +Now let's look at a step-by-step example of how to internationalize and +localize a simple @command{awk} program, using @file{guide.awk} as our +original source: + +@example +@c file eg/prog/guide.awk +BEGIN @{ + TEXTDOMAIN = "guide" + bindtextdomain(".") # for testing + print _"Don't Panic" + print _"The Answer Is", 42 + print "Pardon me, Zaphod who?" 
+@} +@c endfile +@end example + +@noindent +Run @samp{gawk --gen-pot} to create the @file{.pot} file: + +@example +$ @kbd{gawk --gen-pot -f guide.awk > guide.pot} +@end example + +@noindent +This produces: + +@example +@c file eg/data/guide.po +#: guide.awk:4 +msgid "Don't Panic" +msgstr "" + +#: guide.awk:5 +msgid "The Answer Is" +msgstr "" + +@c endfile +@end example + +This original portable object template file is saved and reused for each language +into which the application is translated. The @code{msgid} +is the original string and the @code{msgstr} is the translation. + +@quotation NOTE +Strings not marked with a leading underscore do not +appear in the @file{guide.pot} file. +@end quotation + +Next, the messages must be translated. +Here is a translation to a hypothetical dialect of English, +called ``Mellow'':@footnote{Perhaps it would be better if it were +called ``Hippy.'' Ah, well.} + +@example +@group +$ cp guide.pot guide-mellow.po +@var{Add translations to} guide-mellow.po @dots{} +@end group +@end example + +@noindent +Following are the translations: + +@example +@c file eg/data/guide-mellow.po +#: guide.awk:4 +msgid "Don't Panic" +msgstr "Hey man, relax!" + +#: guide.awk:5 +msgid "The Answer Is" +msgstr "Like, the scoop is" + +@c endfile +@end example + +@cindex Linux +@cindex GNU/Linux +The next step is to make the directory to hold the binary message object +file and then to create the @file{guide.mo} file. +The directory layout shown here is standard for GNU @code{gettext} on +GNU/Linux systems. 
Other versions of @code{gettext} may use a different +layout: + +@example +$ @kbd{mkdir en_US en_US/LC_MESSAGES} +@end example + +@cindex @code{.po} files, converting to @code{.mo} +@cindex files, @code{.po}, converting to @code{.mo} +@cindex @code{.mo} files, converting from @code{.po} +@cindex files, @code{.mo}, converting from @code{.po} +@cindex portable object files, converting to message object files +@cindex files, portable object, converting to message object files +@cindex message object files, converting from portable object files +@cindex files, message object, converting from portable object files +@cindex @command{msgfmt} utility +The @command{msgfmt} utility does the conversion from human-readable +@file{.po} file to machine-readable @file{.mo} file. +By default, @command{msgfmt} creates a file named @file{messages}. +This file must be renamed and placed in the proper directory so that +@command{gawk} can find it: + +@example +$ @kbd{msgfmt guide-mellow.po} +$ @kbd{mv messages en_US/LC_MESSAGES/guide.mo} +@end example + +Finally, we run the program to test it: + +@example +$ @kbd{gawk -f guide.awk} +@print{} Hey man, relax! +@print{} Like, the scoop is 42 +@print{} Pardon me, Zaphod who? +@end example + +If the three replacement functions for @code{dcgettext()}, @code{dcngettext()} +and @code{bindtextdomain()} +(@pxref{I18N Portability}) +are in a file named @file{libintl.awk}, +then we can run @file{guide.awk} unchanged as follows: + +@example +$ @kbd{gawk --posix -f guide.awk -f libintl.awk} +@print{} Don't Panic +@print{} The Answer Is 42 +@print{} Pardon me, Zaphod who? +@end example + +@node Gawk I18N +@section @command{gawk} Can Speak Your Language + +@command{gawk} itself has been internationalized +using the GNU @code{gettext} package. +(GNU @code{gettext} is described in +complete detail in +@ifinfo +@inforef{Top, , GNU @code{gettext} utilities, gettext, GNU gettext tools}.) +@end ifinfo +@ifnotinfo +@cite{GNU gettext tools}.) 
+@end ifnotinfo +As of this writing, the latest version of GNU @code{gettext} is +@uref{ftp://ftp.gnu.org/gnu/gettext/gettext-0.18.1.tar.gz, @value{PVERSION} 0.18.1}. + +If a translation of @command{gawk}'s messages exists, +then @command{gawk} produces usage messages, warnings, +and fatal errors in the local language. +@c ENDOFRANGE inloc + +@node Advanced Features +@chapter Advanced Features of @command{gawk} +@cindex advanced features, network connections, See Also networks, connections +@c STARTOFRANGE gawadv +@cindex @command{gawk}, features, advanced +@c STARTOFRANGE advgaw +@cindex advanced features, @command{gawk} +@ignore +Contributed by: Peter Langston <pud!psl@bellcore.bellcore.com> + + Found in Steve English's "signature" line: + +"Write documentation as if whoever reads it is a violent psychopath +who knows where you live." +@end ignore +@quotation +@i{Write documentation as if whoever reads it is +a violent psychopath who knows where you live.}@* +Steve English, as quoted by Peter Langston +@end quotation + +This @value{CHAPTER} discusses advanced features in @command{gawk}. +It's a bit of a ``grab bag'' of items that are otherwise unrelated +to each other. +First, a command-line option allows @command{gawk} to recognize +nondecimal numbers in input data, not just in @command{awk} +programs. +Then, @command{gawk}'s special features for sorting arrays are presented. +Next, two-way I/O, discussed briefly in earlier parts of this +@value{DOCUMENT}, is described in full detail, along with the basics +of TCP/IP networking. Finally, @command{gawk} +can @dfn{profile} an @command{awk} program, making it possible to tune +it for performance. + +@ref{Dynamic Extensions}, +discusses the ability to dynamically add new built-in functions to +@command{gawk}. As this feature is still immature and likely to change, +its description is relegated to an appendix. + +@menu +* Nondecimal Data:: Allowing nondecimal input data. 
+* Array Sorting:: Facilities for controlling array traversal and + sorting arrays. +* Two-way I/O:: Two-way communications with another process. +* TCP/IP Networking:: Using @command{gawk} for network programming. +* Profiling:: Profiling your @command{awk} programs. +@end menu + +@node Nondecimal Data +@section Allowing Nondecimal Input Data +@cindex @code{--non-decimal-data} option +@cindex advanced features, @command{gawk}, nondecimal input data +@cindex input, data@comma{} nondecimal +@cindex constants, nondecimal + +If you run @command{gawk} with the @option{--non-decimal-data} option, +you can have nondecimal constants in your input data: + +@c line break here for small book format +@example +$ @kbd{echo 0123 123 0x123 |} +> @kbd{gawk --non-decimal-data '@{ printf "%d, %d, %d\n",} +> @kbd{$1, $2, $3 @}'} +@print{} 83, 123, 291 +@end example + +For this feature to work, write your program so that +@command{gawk} treats your data as numeric: + +@example +$ @kbd{echo 0123 123 0x123 | gawk '@{ print $1, $2, $3 @}'} +@print{} 0123 123 0x123 +@end example + +@noindent +The @code{print} statement treats its expressions as strings. +Although the fields can act as numbers when necessary, +they are still strings, so @code{print} does not try to treat them +numerically. You may need to add zero to a field to force it to +be treated as a number. For example: + +@example +$ @kbd{echo 0123 123 0x123 | gawk --non-decimal-data '} +> @kbd{@{ print $1, $2, $3} +> @kbd{print $1 + 0, $2 + 0, $3 + 0 @}'} +@print{} 0123 123 0x123 +@print{} 83 123 291 +@end example + +Because it is common to have decimal data with leading zeros, and because +using this facility could lead to surprising results, the default is to leave it +disabled. If you want it, you must explicitly request it. 
+ +@cindex programming conventions, @code{--non-decimal-data} option +@cindex @code{--non-decimal-data} option, @code{strtonum()} function and +@cindex @code{strtonum()} function (@command{gawk}), @code{--non-decimal-data} option and +@quotation CAUTION +@emph{Use of this option is not recommended.} +It can break old programs very badly. +Instead, use the @code{strtonum()} function to convert your data +(@pxref{Nondecimal-numbers}). +This makes your programs easier to write and easier to read, and +leads to less surprising results. +@end quotation + +@node Array Sorting +@section Controlling Array Traversal and Array Sorting + +@command{gawk} lets you control the order in which a @samp{for (i in array)} +loop traverses an array. + +In addition, two built-in functions, @code{asort()} and @code{asorti()}, +let you sort arrays based on the array values and indices, respectively. +These two functions also provide control over the sorting criteria used +to order the elements during sorting. + +@menu +* Controlling Array Traversal:: How to use PROCINFO["sorted_in"]. +* Array Sorting Functions:: How to use @code{asort()} and @code{asorti()}. +@end menu + +@node Controlling Array Traversal +@subsection Controlling Array Traversal + +By default, the order in which a @samp{for (i in array)} loop +scans an array is not defined; it is generally based upon +the internal implementation of arrays inside @command{awk}. + +Often, though, it is desirable to be able to loop over the elements +in a particular order that you, the programmer, choose. @command{gawk} +lets you do this. + +@ref{Controlling Scanning}, describes how you can assign special, +pre-defined values to @code{PROCINFO["sorted_in"]} in order to +control the order in which @command{gawk} will traverse an array +during a @code{for} loop. + +In addition, the value of @code{PROCINFO["sorted_in"]} can be a function name. +This lets you traverse an array based on any custom criterion. 
+The array elements are ordered according to the return value of this +function. The comparison function should be defined with at least +four arguments: + +@example +function comp_func(i1, v1, i2, v2) +@{ + @var{compare elements 1 and 2 in some fashion} + @var{return < 0; 0; or > 0} +@} +@end example + +Here, @var{i1} and @var{i2} are the indices, and @var{v1} and @var{v2} +are the corresponding values of the two elements being compared. +Either @var{v1} or @var{v2}, or both, can be arrays if the array being +traversed contains subarrays as values. +(@xref{Arrays of Arrays}, for more information about subarrays.) +The three possible return values are interpreted as follows: + +@table @code +@item comp_func(i1, v1, i2, v2) < 0 +Index @var{i1} comes before index @var{i2} during loop traversal. + +@item comp_func(i1, v1, i2, v2) == 0 +Indices @var{i1} and @var{i2} +come together but the relative order with respect to each other is undefined. + +@item comp_func(i1, v1, i2, v2) > 0 +Index @var{i1} comes after index @var{i2} during loop traversal. +@end table + +Our first comparison function can be used to scan an array in +numerical order of the indices: + +@example +function cmp_num_idx(i1, v1, i2, v2) +@{ + # numerical index comparison, ascending order + return (i1 - i2) +@} +@end example + +Our second function traverses an array based on the string order of +the element values rather than by indices: + +@example +function cmp_str_val(i1, v1, i2, v2) +@{ + # string value comparison, ascending order + v1 = v1 "" + v2 = v2 "" + if (v1 < v2) + return -1 + return (v1 != v2) +@} +@end example + +The third +comparison function makes all numbers, and numeric strings without +any leading or trailing spaces, come out first during loop traversal: + +@example +function cmp_num_str_val(i1, v1, i2, v2, n1, n2) +@{ + # numbers before string value comparison, ascending order + n1 = v1 + 0 + n2 = v2 + 0 + if (n1 == v1) + return (n2 == v2) ? 
(n1 - n2) : -1 + else if (n2 == v2) + return 1 + return (v1 < v2) ? -1 : (v1 != v2) +@} +@end example + +Here is a main program to demonstrate how @command{gawk} +behaves using each of the previous functions: + +@example +BEGIN @{ + data["one"] = 10 + data["two"] = 20 + data[10] = "one" + data[100] = 100 + data[20] = "two" + + f[1] = "cmp_num_idx" + f[2] = "cmp_str_val" + f[3] = "cmp_num_str_val" + for (i = 1; i <= 3; i++) @{ + printf("Sort function: %s\n", f[i]) + PROCINFO["sorted_in"] = f[i] + for (j in data) + printf("\tdata[%s] = %s\n", j, data[j]) + print "" + @} +@} +@end example + +Here are the results when the program is run: +@page + +@example +$ @kbd{gawk -f compdemo.awk} +@print{} Sort function: cmp_num_idx @ii{Sort by numeric index} +@print{} data[two] = 20 +@print{} data[one] = 10 @ii{Both strings are numerically zero} +@print{} data[10] = one +@print{} data[20] = two +@print{} data[100] = 100 +@print{} +@print{} Sort function: cmp_str_val @ii{Sort by element values as strings} +@print{} data[one] = 10 +@print{} data[100] = 100 @ii{String 100 is less than string 20} +@print{} data[two] = 20 +@print{} data[10] = one +@print{} data[20] = two +@print{} +@print{} Sort function: cmp_num_str_val @ii{Sort all numeric values before all strings} +@print{} data[one] = 10 +@print{} data[two] = 20 +@print{} data[100] = 100 +@print{} data[10] = one +@print{} data[20] = two +@end example + +Consider sorting the entries of a GNU/Linux system password file +according to login name. The following program sorts records +by a specific field position and can be used for this purpose: + +@example +# sort.awk --- simple program to sort by field position +# field position is specified by the global variable POS + +function cmp_field(i1, v1, i2, v2) +@{ + # comparison by value, as string, and ascending order + return v1[POS] < v2[POS] ? 
-1 : (v1[POS] != v2[POS]) +@} + +@{ + for (i = 1; i <= NF; i++) + a[NR][i] = $i +@} + +END @{ + PROCINFO["sorted_in"] = "cmp_field" + if (POS < 1 || POS > NF) + POS = 1 + for (i in a) @{ + for (j = 1; j <= NF; j++) + printf("%s%c", a[i][j], j < NF ? ":" : "") + print "" + @} +@} +@end example + +The first field in each entry of the password file is the user's login name, +and the fields are separated by colons. +Each record defines a subarray, +with each field as an element in the subarray. +Running the program produces the +following output: + +@example +$ @kbd{gawk -v POS=1 -F: -f sort.awk /etc/passwd} +@print{} adm:x:3:4:adm:/var/adm:/sbin/nologin +@print{} apache:x:48:48:Apache:/var/www:/sbin/nologin +@print{} avahi:x:70:70:Avahi daemon:/:/sbin/nologin +@dots{} +@end example + +The comparison should normally always return the same value when given a +specific pair of array elements as its arguments. If inconsistent +results are returned then the order is undefined. This behavior can be +exploited to introduce random order into otherwise seemingly +ordered data: + +@example +function cmp_randomize(i1, v1, i2, v2) +@{ + # random order + return (2 - 4 * rand()) +@} +@end example + +As mentioned above, the order of the indices is arbitrary if two +elements compare equal. This is usually not a problem, but letting +the tied elements come out in arbitrary order can be an issue, especially +when comparing item values. The partial ordering of the equal elements +may change during the next loop traversal, if other elements are added or +removed from the array. One way to resolve ties when comparing elements +with otherwise equal values is to include the indices in the comparison +rules. Note that doing this may make the loop traversal less efficient, +so consider it only if necessary. 
The following comparison functions +force a deterministic order, and are based on the fact that the +indices of two elements are never equal: + +@example +function cmp_numeric(i1, v1, i2, v2) +@{ + # numerical value (and index) comparison, descending order + return (v1 != v2) ? (v2 - v1) : (i2 - i1) +@} + +function cmp_string(i1, v1, i2, v2) +@{ + # string value (and index) comparison, descending order + v1 = v1 i1 + v2 = v2 i2 + return (v1 > v2) ? -1 : (v1 != v2) +@} +@end example + +@c Avoid using the term ``stable'' when describing the unpredictable behavior +@c if two items compare equal. Usually, the goal of a "stable algorithm" +@c is to maintain the original order of the items, which is a meaningless +@c concept for a list constructed from a hash. + +A custom comparison function can often simplify ordered loop +traversal, and the sky is really the limit when it comes to +designing such a function. + +When string comparisons are made during a sort, either for element +values where one or both aren't numbers, or for element indices +handled as strings, the value of @code{IGNORECASE} +(@pxref{Built-in Variables}) controls whether +the comparisons treat corresponding uppercase and lowercase letters as +equivalent or distinct. + +Another point to keep in mind is that in the case of subarrays +the element values can themselves be arrays; a production comparison +function should use the @code{isarray()} function +(@pxref{Type Functions}), +to check for this, and choose a defined sorting order for subarrays. + +All sorting based on @code{PROCINFO["sorted_in"]} +is disabled in POSIX mode, +since the @code{PROCINFO} array is not special in that case. + +As a side note, sorting the array indices before traversing +the array has been reported to add 15% to 20% overhead to the +execution time of @command{awk} programs. For this reason, +sorted array traversal is not the default. 
+ +@c The @command{gawk} +@c maintainers believe that only the people who wish to use a +@c feature should have to pay for it. + +@node Array Sorting Functions +@subsection Sorting Array Values and Indices with @command{gawk} + +@cindex arrays, sorting +@cindex @code{asort()} function (@command{gawk}) +@cindex @code{asort()} function (@command{gawk}), arrays@comma{} sorting +@cindex sort function, arrays, sorting +In most @command{awk} implementations, sorting an array requires +writing a @code{sort()} function. +While this can be educational for exploring different sorting algorithms, +usually that's not the point of the program. +@command{gawk} provides the built-in @code{asort()} +and @code{asorti()} functions +(@pxref{String Functions}) +for sorting arrays. For example: + +@example +@var{populate the array} data +n = asort(data) +for (i = 1; i <= n; i++) + @var{do something with} data[i] +@end example + +After the call to @code{asort()}, the array @code{data} is indexed from 1 +to some number @var{n}, the total number of elements in @code{data}. +(This count is @code{asort()}'s return value.) +@code{data[1]} @value{LEQ} @code{data[2]} @value{LEQ} @code{data[3]}, and so on. +The comparison is based on the type of the elements +(@pxref{Typing and Comparison}). +All numeric values come before all string values, +which in turn come before all subarrays. + +@cindex side effects, @code{asort()} function +An important side effect of calling @code{asort()} is that +@emph{the array's original indices are irrevocably lost}. +As this isn't always desirable, @code{asort()} accepts a +second argument: + +@example +@var{populate the array} source +n = asort(source, dest) +for (i = 1; i <= n; i++) + @var{do something with} dest[i] +@end example + +In this case, @command{gawk} copies the @code{source} array into the +@code{dest} array and then sorts @code{dest}, destroying its indices. +However, the @code{source} array is not affected. 
+ +@code{asort()} accepts a third string argument to control comparison of +array elements. As with @code{PROCINFO["sorted_in"]}, this argument +may be one of the predefined names that @command{gawk} provides +(@pxref{Controlling Scanning}), or the name of a user-defined function +(@pxref{Controlling Array Traversal}). + +@quotation NOTE +In all cases, the sorted element values consist of the original +array's element values. The ability to control comparison merely +affects the way in which they are sorted. +@end quotation + +Often, what's needed is to sort on the values of the @emph{indices} +instead of the values of the elements. +To do that, use the +@code{asorti()} function. The interface is identical to that of +@code{asort()}, except that the index values are used for sorting, and +become the values of the result array: + +@example +@{ source[$0] = some_func($0) @} + +END @{ + n = asorti(source, dest) + for (i = 1; i <= n; i++) @{ + @ii{Work with sorted indices directly:} + @var{do something with} dest[i] + @dots{} + @ii{Access original array via sorted indices:} + @var{do something with} source[dest[i]] + @} +@} +@end example + +Similar to @code{asort()}, +in all cases, the sorted element values consist of the original +array's indices. The ability to control comparison merely +affects the way in which they are sorted. + +Sorting the array by replacing the indices provides maximal flexibility. +To traverse the elements in decreasing order, use a loop that goes from +@var{n} down to 1, either over the elements or over the indices.@footnote{You +may also use one of the predefined sorting names that sorts in +decreasing order.} + +@cindex reference counting, sorting arrays +Copying array indices and elements isn't expensive in terms of memory. +Internally, @command{gawk} maintains @dfn{reference counts} to data. 
+For example, when @code{asort()} copies the first array to the second one, +there is only one copy of the original array elements' data, even though +both arrays use the values. + +@c Document It And Call It A Feature. Sigh. +@cindex @command{gawk}, @code{IGNORECASE} variable in +@cindex @code{IGNORECASE} variable +@cindex arrays, sorting, @code{IGNORECASE} variable and +@cindex @code{IGNORECASE} variable, array sorting and +Because @code{IGNORECASE} affects string comparisons, the value +of @code{IGNORECASE} also affects sorting for both @code{asort()} and @code{asorti()}. +Note also that the locale's sorting order does @emph{not} +come into play; comparisons are based on character values only.@footnote{This +is true because locale-based comparison occurs only when in POSIX +compatibility mode, and since @code{asort()} and @code{asorti()} are +@command{gawk} extensions, they are not available in that case.} +Caveat Emptor. + +@node Two-way I/O +@section Two-Way Communications with Another Process +@cindex Brennan, Michael +@cindex programmers, attractiveness of +@smallexample +@c Path: cssun.mathcs.emory.edu!gatech!newsxfer3.itd.umich.edu!news-peer.sprintlink.net!news-sea-19.sprintlink.net!news-in-west.sprintlink.net!news.sprintlink.net!Sprint!204.94.52.5!news.whidbey.com!brennan +From: brennan@@whidbey.com (Mike Brennan) +Newsgroups: comp.lang.awk +Subject: Re: Learn the SECRET to Attract Women Easily +Date: 4 Aug 1997 17:34:46 GMT +@c Organization: WhidbeyNet +@c Lines: 12 +Message-ID: <5s53rm$eca@@news.whidbey.com> +@c References: <5s20dn$2e1@chronicle.concentric.net> +@c Reply-To: brennan@whidbey.com +@c NNTP-Posting-Host: asn202.whidbey.com +@c X-Newsreader: slrn (0.9.4.1 UNIX) +@c Xref: cssun.mathcs.emory.edu comp.lang.awk:5403 + +On 3 Aug 1997 13:17:43 GMT, Want More Dates??? 
+<tracy78@@kilgrona.com> wrote: +>Learn the SECRET to Attract Women Easily +> +>The SCENT(tm) Pheromone Sex Attractant For Men to Attract Women + +The scent of awk programmers is a lot more attractive to women than +the scent of perl programmers. +-- +Mike Brennan +@c brennan@@whidbey.com +@end smallexample + +@cindex advanced features, @command{gawk}, processes@comma{} communicating with +@cindex processes, two-way communications with +It is often useful to be able to +send data to a separate program for +processing and then read the result. This can always be +done with temporary files: + +@example +# Write the data for processing +tempfile = ("mydata." PROCINFO["pid"]) +while (@var{not done with data}) + print @var{data} | ("subprogram > " tempfile) +close("subprogram > " tempfile) + +# Read the results, remove tempfile when done +while ((getline newdata < tempfile) > 0) + @var{process} newdata @var{appropriately} +close(tempfile) +system("rm " tempfile) +@end example + +@noindent +This works, but not elegantly. Among other things, it requires that +the program be run in a directory that cannot be shared among users; +for example, @file{/tmp} will not do, as another user might happen +to be using a temporary file with the same name. + +@cindex coprocesses +@cindex input/output, two-way +@cindex @code{|} (vertical bar), @code{|&} operator (I/O) +@cindex vertical bar (@code{|}), @code{|&} operator (I/O) +@cindex @command{csh} utility, @code{|&} operator, comparison with +However, with @command{gawk}, it is possible to +open a @emph{two-way} pipe to another process. The second process is +termed a @dfn{coprocess}, since it runs in parallel with @command{gawk}. 
+The two-way connection is created using the @samp{|&} operator +(borrowed from the Korn shell, @command{ksh}):@footnote{This is very +different from the same operator in the C shell.} + +@example +do @{ + print @var{data} |& "subprogram" + "subprogram" |& getline results +@} while (@var{data left to process}) +close("subprogram") +@end example + +The first time an I/O operation is executed using the @samp{|&} +operator, @command{gawk} creates a two-way pipeline to a child process +that runs the other program. Output created with @code{print} +or @code{printf} is written to the program's standard input, and +output from the program's standard output can be read by the @command{gawk} +program using @code{getline}. +As is the case with processes started by @samp{|}, the subprogram +can be any program, or pipeline of programs, that can be started by +the shell. + +There are some cautionary items to be aware of: + +@itemize @bullet +@item +As the code inside @command{gawk} currently stands, the coprocess's +standard error goes to the same place that the parent @command{gawk}'s +standard error goes. It is not possible to read the child's +standard error separately. + +@cindex deadlocks +@cindex buffering, input/output +@cindex @code{getline} command, deadlock and +@item +I/O buffering may be a problem. @command{gawk} automatically +flushes all output down the pipe to the coprocess. +However, if the coprocess does not flush its output, +@command{gawk} may hang when doing a @code{getline} in order to read +the coprocess's results. This could lead to a situation +known as @dfn{deadlock}, where each process is waiting for the +other one to do something. +@end itemize + +@cindex @code{close()} function, two-way pipes and +It is possible to close just one end of the two-way pipe to +a coprocess, by supplying a second argument to the @code{close()} +function of either @code{"to"} or @code{"from"} +(@pxref{Close Files And Pipes}). 
+These strings tell @command{gawk} to close the end of the pipe +that sends data to the coprocess or the end that reads from it, +respectively. + +@cindex @command{sort} utility, coprocesses and +This is particularly necessary in order to use +the system @command{sort} utility as part of a coprocess; +@command{sort} must read @emph{all} of its input +data before it can produce any output. +The @command{sort} program does not receive an end-of-file indication +until @command{gawk} closes the write end of the pipe. + +When you have finished writing data to the @command{sort} +utility, you can close the @code{"to"} end of the pipe, and +then start reading sorted data via @code{getline}. +For example: + +@example +BEGIN @{ + command = "LC_ALL=C sort" + n = split("abcdefghijklmnopqrstuvwxyz", a, "") + + for (i = n; i > 0; i--) + print a[i] |& command + close(command, "to") + + while ((command |& getline line) > 0) + print "got", line + close(command) +@} +@end example + +This program writes the letters of the alphabet in reverse order, one +per line, down the two-way pipe to @command{sort}. It then closes the +write end of the pipe, so that @command{sort} receives an end-of-file +indication. This causes @command{sort} to sort the data and write the +sorted data back to the @command{gawk} program. Once all of the data +has been read, @command{gawk} terminates the coprocess and exits. + +As a side note, the assignment @samp{LC_ALL=C} in the @command{sort} +command ensures traditional Unix (ASCII) sorting from @command{sort}. + +@cindex @command{gawk}, @code{PROCINFO} array in +@cindex @code{PROCINFO} array +You may also use pseudo-ttys (ptys) for +two-way communication instead of pipes, if your system supports them. 
+This is done on a per-command basis, by setting a special element +in the @code{PROCINFO} array +(@pxref{Auto-set}), +like so: + +@example +command = "sort -nr" # command, save in convenience variable +PROCINFO[command, "pty"] = 1 # update PROCINFO +print @dots{} |& command # start two-way pipe +@dots{} +@end example + +@noindent +Using ptys avoids the buffer deadlock issues described earlier, at some +loss in performance. If your system does not have ptys, or if all the +system's ptys are in use, @command{gawk} automatically falls back to +using regular pipes. + +@node TCP/IP Networking +@section Using @command{gawk} for Network Programming +@cindex advanced features, @command{gawk}, network programming +@cindex networks, programming +@c STARTOFRANGE tcpip +@cindex TCP/IP +@cindex @code{/inet/@dots{}} special files (@command{gawk}) +@cindex files, @code{/inet/@dots{}} (@command{gawk}) +@cindex @code{/inet4/@dots{}} special files (@command{gawk}) +@cindex files, @code{/inet4/@dots{}} (@command{gawk}) +@cindex @code{/inet6/@dots{}} special files (@command{gawk}) +@cindex files, @code{/inet6/@dots{}} (@command{gawk}) +@cindex @code{EMISTERED} +@quotation +@code{EMISTERED}:@* +@ @ @ @ @i{A host is a host from coast to coast,@* +@ @ @ @ and no-one can talk to host that's close,@* +@ @ @ @ unless the host that isn't close@* +@ @ @ @ is busy hung or dead.} +@end quotation + +In addition to being able to open a two-way pipeline to a coprocess +on the same system +(@pxref{Two-way I/O}), +it is possible to make a two-way connection to +another process on another system across an IP network connection. + +You can think of this as just a @emph{very long} two-way pipeline to +a coprocess. +The way @command{gawk} decides that you want to use TCP/IP networking is +by recognizing special @value{FN}s that begin with one of @samp{/inet/}, +@samp{/inet4/} or @samp{/inet6/}.
+ +The full syntax of the special @value{FN} is +@file{/@var{net-type}/@var{protocol}/@var{local-port}/@var{remote-host}/@var{remote-port}}. +The components are: + +@table @var +@item net-type +Specifies the kind of Internet connection to make. +Use @samp{/inet4/} to force IPv4, and +@samp{/inet6/} to force IPv6. +Plain @samp{/inet/} (which used to be the only option) uses +the system default, most likely IPv4. + +@item protocol +The protocol to use over IP. This must be either @samp{tcp}, or +@samp{udp}, for a TCP or UDP IP connection, +respectively. The use of TCP is recommended for most applications. + +@item local-port +@cindex @code{getaddrinfo()} function (C library) +The local TCP or UDP port number to use. Use a port number of @samp{0} +when you want the system to pick a port. This is what you should do +when writing a TCP or UDP client. +You may also use a well-known service name, such as @samp{smtp} +or @samp{http}, in which case @command{gawk} attempts to determine +the predefined port number using the C @code{getaddrinfo()} function. + +@item remote-host +The IP address or fully-qualified domain name of the Internet +host to which you want to connect. + +@item remote-port +The TCP or UDP port number to use on the given @var{remote-host}. +Again, use @samp{0} if you don't care, or else a well-known +service name. +@end table + +@cindex @command{gawk}, @code{ERRNO} variable in +@cindex @code{ERRNO} variable +@quotation NOTE +Failure in opening a two-way socket will result in a non-fatal error +being returned to the calling code. The value of @code{ERRNO} indicates +the error (@pxref{Auto-set}). +@end quotation + +Consider the following very simple example: + +@example +BEGIN @{ + Service = "/inet/tcp/0/localhost/daytime" + Service |& getline + print $0 + close(Service) +@} +@end example + +This program reads the current date and time from the local system's +TCP @samp{daytime} server. +It then prints the results and closes the connection. 
+ +Because this topic is extensive, the use of @command{gawk} for +TCP/IP programming is documented separately. +@ifinfo +See +@inforef{Top, , General Introduction, gawkinet, TCP/IP Internetworking with @command{gawk}}, +@end ifinfo +@ifnotinfo +See @cite{TCP/IP Internetworking with @command{gawk}}, +which comes as part of the @command{gawk} distribution, +@end ifnotinfo +for a much more complete introduction and discussion, as well as +extensive examples. + +@c ENDOFRANGE tcpip + +@node Profiling +@section Profiling Your @command{awk} Programs +@c STARTOFRANGE awkp +@cindex @command{awk} programs, profiling +@c STARTOFRANGE proawk +@cindex profiling @command{awk} programs +@cindex profiling @command{gawk} +@cindex @code{awkprof.out} file +@cindex files, @code{awkprof.out} + +You may produce execution traces of your @command{awk} programs. +This is done by passing the option @option{--profile} to @command{gawk}. +When @command{gawk} has finished running, it creates a profile of your program in a file +named @file{awkprof.out}. Because it is profiling, it also executes up to 45% slower than +@command{gawk} normally does. + +@cindex @code{--profile} option +As shown in the following example, +the @option{--profile} option can be used to change the name of the file +where @command{gawk} will write the profile: + +@example +gawk --profile=myprog.prof -f myprog.awk data1 data2 +@end example + +@noindent +In the above example, @command{gawk} places the profile in +@file{myprog.prof} instead of in @file{awkprof.out}. + +Here is a sample session showing a simple @command{awk} program, its input data, and the +results from running @command{gawk} with the @option{--profile} option. 
+First, the @command{awk} program: + +@example +BEGIN @{ print "First BEGIN rule" @} + +END @{ print "First END rule" @} + +/foo/ @{ + print "matched /foo/, gosh" + for (i = 1; i <= 3; i++) + sing() +@} + +@{ + if (/foo/) + print "if is true" + else + print "else is true" +@} + +BEGIN @{ print "Second BEGIN rule" @} + +END @{ print "Second END rule" @} + +function sing( dummy) +@{ + print "I gotta be me!" +@} +@end example + +Following is the input data: + +@example +foo +bar +baz +foo +junk +@end example + +Here is the @file{awkprof.out} that results from running the @command{gawk} +profiler on this program and data (this example also illustrates that @command{awk} +programmers sometimes have to work late): + +@cindex @code{BEGIN} pattern +@cindex @code{END} pattern +@example + # gawk profile, created Sun Aug 13 00:00:15 2000 + + # BEGIN block(s) + + BEGIN @{ + 1 print "First BEGIN rule" + 1 print "Second BEGIN rule" + @} + + # Rule(s) + + 5 /foo/ @{ # 2 + 2 print "matched /foo/, gosh" + 6 for (i = 1; i <= 3; i++) @{ + 6 sing() + @} + @} + + 5 @{ + 5 if (/foo/) @{ # 2 + 2 print "if is true" + 3 @} else @{ + 3 print "else is true" + @} + @} + + # END block(s) + + END @{ + 1 print "First END rule" + 1 print "Second END rule" + @} + + # Functions, listed alphabetically + + 6 function sing(dummy) + @{ + 6 print "I gotta be me!" + @} +@end example + +This example illustrates many of the basic features of profiling output. +They are as follows: + +@itemize @bullet +@item +The program is printed in the order @code{BEGIN} rule, +@code{BEGINFILE} rule, +pattern/action rules, +@code{ENDFILE} rule, @code{END} rule and functions, listed +alphabetically. +Multiple @code{BEGIN} and @code{END} rules are merged together, +as are multiple @code{BEGINFILE} and @code{ENDFILE} rules. + +@cindex patterns, counts +@item +Pattern-action rules have two counts. +The first count, to the left of the rule, shows how many times +the rule's pattern was @emph{tested}. 
+The second count, to the right of the rule's opening left brace +in a comment, +shows how many times the rule's action was @emph{executed}. +The difference between the two indicates how many times the rule's +pattern evaluated to false. + +@item +Similarly, +the count for an @code{if}-@code{else} statement shows how many times +the condition was tested. +To the right of the opening left brace for the @code{if}'s body +is a count showing how many times the condition was true. +The count for the @code{else} +indicates how many times the test failed. + +@cindex loops, count for header +@item +The count for a loop header (such as @code{for} +or @code{while}) shows how many times the loop test was executed. +(Because of this, you can't just look at the count on the first +statement in a rule to determine how many times the rule was executed. +If the first statement is a loop, the count is misleading.) + +@cindex functions, user-defined, counts +@cindex user-defined, functions, counts +@item +For user-defined functions, the count next to the @code{function} +keyword indicates how many times the function was called. +The counts next to the statements in the body show how many times +those statements were executed. + +@cindex @code{@{@}} (braces) +@cindex braces (@code{@{@}}) +@item +The layout uses ``K&R'' style with TABs. +Braces are used everywhere, even when +the body of an @code{if}, @code{else}, or loop is only a single statement. + +@cindex @code{()} (parentheses) +@cindex parentheses @code{()} +@item +Parentheses are used only where needed, as indicated by the structure +of the program and the precedence rules. +@c extra verbiage here satisfies the copyeditor. ugh. +For example, @samp{(3 + 5) * 4} means add three plus five, then multiply +the total by four. However, @samp{3 + 5 * 4} has no parentheses, and +means @samp{3 + (5 * 4)}. + +@ignore +@item +All string concatenations are parenthesized too. +(This could be made a bit smarter.) 
+@end ignore + +@item +Parentheses are used around the arguments to @code{print} +and @code{printf} only when +the @code{print} or @code{printf} statement is followed by a redirection. +Similarly, if +the target of a redirection isn't a scalar, it gets parenthesized. + +@item +@command{gawk} supplies leading comments in +front of the @code{BEGIN} and @code{END} rules, +the pattern/action rules, and the functions. + +@end itemize + +The profiled version of your program may not look exactly like what you +typed when you wrote it. This is because @command{gawk} creates the +profiled version by ``pretty printing'' its internal representation of +the program. The advantage to this is that @command{gawk} can produce +a standard representation. The disadvantage is that all source-code +comments are lost, as are the distinctions among multiple @code{BEGIN}, +@code{END}, @code{BEGINFILE}, and @code{ENDFILE} rules. Also, things such as: + +@example +/foo/ +@end example + +@noindent +come out as: + +@example +/foo/ @{ + print $0 +@} +@end example + +@noindent +which is correct, but possibly surprising. + +@cindex profiling @command{awk} programs, dynamically +@cindex @command{gawk} program, dynamic profiling +Besides creating profiles when a program has completed, +@command{gawk} can produce a profile while it is running. +This is useful if your @command{awk} program goes into an +infinite loop and you want to see what has been executed. +To use this feature, run @command{gawk} with the @option{--profile} +option in the background: + +@example +$ @kbd{gawk --profile -f myprog &} +[1] 13992 +@end example + +@cindex @command{kill} command@comma{} dynamic profiling +@cindex @code{USR1} signal +@cindex @code{SIGUSR1} signal +@cindex signals, @code{USR1}/@code{SIGUSR1} +@noindent +The shell prints a job number and process ID number; in this case, 13992. 
+Use the @command{kill} command to send the @code{USR1} signal +to @command{gawk}: + +@example +$ @kbd{kill -USR1 13992} +@end example + +@noindent +As usual, the profiled version of the program is written to +@file{awkprof.out}, or to a different file if one is specified with +the @option{--profile} option. + +Along with the regular profile, as shown earlier, the profile +includes a trace of any active functions: + +@example +# Function Call Stack: + +# 3. baz +# 2. bar +# 1. foo +# -- main -- +@end example + +You may send @command{gawk} the @code{USR1} signal as many times as you like. +Each time, the profile and function call trace are appended to the output +profile file. + +@cindex @code{HUP} signal +@cindex @code{SIGHUP} signal +@cindex signals, @code{HUP}/@code{SIGHUP} +If you use the @code{HUP} signal instead of the @code{USR1} signal, +@command{gawk} produces the profile and the function call trace and then exits. + +@cindex @code{INT} signal (MS-Windows) +@cindex @code{SIGINT} signal (MS-Windows) +@cindex signals, @code{INT}/@code{SIGINT} (MS-Windows) +@cindex @code{QUIT} signal (MS-Windows) +@cindex @code{SIGQUIT} signal (MS-Windows) +@cindex signals, @code{QUIT}/@code{SIGQUIT} (MS-Windows) +When @command{gawk} runs on MS-Windows systems, it uses the +@code{INT} and @code{QUIT} signals for producing the profile and, in +the case of the @code{INT} signal, @command{gawk} exits. This is +because these systems don't support the @command{kill} command, so the +only signals you can deliver to a program are those generated by the +keyboard. The @code{INT} signal is generated by the +@kbd{@value{CTL}-@key{C}} or @kbd{@value{CTL}-@key{BREAK}} key, while the +@code{QUIT} signal is generated by the @kbd{@value{CTL}-@key{\}} key. + +Finally, @command{gawk} also accepts another option, @option{--pretty-print}. +When called this way, @command{gawk} ``pretty prints'' the program into +@file{awkprof.out}, without any execution counts.
+@c ENDOFRANGE advgaw +@c ENDOFRANGE gawadv +@c ENDOFRANGE awkp +@c ENDOFRANGE proawk + @c The original text for this chapter was contributed by Efraim Yawitz. @c FIXME: Add more indexing. @@ -27444,15 +27021,4971 @@ The @command{gawk} debugger only accepts source supplied with the @option{-f} op Look forward to a future release when these and other missing features may be added, and of course feel free to try to add them yourself! +@node Arbitrary Precision Arithmetic +@chapter Arithmetic and Arbitrary Precision Arithmetic with @command{gawk} +@cindex arbitrary precision +@cindex multiple precision +@cindex infinite precision +@cindex floating-point numbers, arbitrary precision +@cindex MPFR +@cindex GMP + +@cindex Knuth, Donald +@quotation +@i{There's a credibility gap: We don't know how much of the computer's answers +to believe. Novice computer users solve this problem by implicitly trusting +in the computer as an infallible authority; they tend to believe that all +digits of a printed answer are significant. Disillusioned computer users have +just the opposite approach; they are constantly afraid that their answers +are almost meaningless.}@* +Donald Knuth@footnote{Donald E.@: Knuth. +@cite{The Art of Computer Programming}. Volume 2, +@cite{Seminumerical Algorithms}, third edition, +1998, ISBN 0-201-89683-4, p.@: 229.} +@end quotation + +This @value{CHAPTER} discusses issues that you may encounter +when performing arithmetic. It begins by discussing some of +the general attributes of computer arithmetic, along with how +this can influence what you see when running @command{awk} programs. +This discussion applies to all versions of @command{awk}. + +Then the @value{CHAPTER} moves on to @dfn{arbitrary precision +arithmetic}, a feature which is specific to @command{gawk}. + +@menu +* General Arithmetic:: An introduction to computer arithmetic. +* Floating-point Programming:: Effective Floating-point Programming. 
+* Gawk and MPFR:: How @command{gawk} provides + arbitrary-precision arithmetic. +* Arbitrary Precision Floats:: Arbitrary Precision Floating-point Arithmetic + with @command{gawk}. +* Arbitrary Precision Integers:: Arbitrary Precision Integer Arithmetic with + @command{gawk}. +@end menu + +@node General Arithmetic +@section A General Description of Computer Arithmetic + +@cindex integers +@cindex floating-point, numbers +@cindex numbers, floating-point +Within computers, there are two kinds of numeric values: @dfn{integers} +and @dfn{floating-point}. +In school, integer values were referred to as ``whole'' numbers---that is, +numbers without any fractional part, such as 1, 42, or @minus{}17. +The advantage to integer numbers is that they represent values exactly. +The disadvantage is that their range is limited. On most systems, +this range is @minus{}2,147,483,648 to 2,147,483,647. +However, many systems now support a range from +@minus{}9,223,372,036,854,775,808 to 9,223,372,036,854,775,807. + +@cindex unsigned integers +@cindex integers, unsigned +Integer values come in two flavors: @dfn{signed} and @dfn{unsigned}. +Signed values may be negative or positive, with the range of values just +described. +Unsigned values are always positive. On most systems, +the range is from 0 to 4,294,967,295. +However, many systems now support a range from +0 to 18,446,744,073,709,551,615. + +@cindex double precision floating-point +@cindex single precision floating-point +Floating-point numbers represent what are called ``real'' numbers; i.e., +those that do have a fractional part, such as 3.1415927. +The advantage to floating-point numbers is that they +can represent a much larger range of values. +The disadvantage is that there are numbers that they cannot represent +exactly. +@command{awk} uses @dfn{double precision} floating-point numbers, which +can hold more digits than @dfn{single precision} +floating-point numbers. 
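The trade-off between range and exactness can be seen from the command line. The following is a small sketch, assuming an @command{awk} that stores numbers as IEEE 754 double precision values and a C library whose @code{printf} prints them faithfully; the shell variable names @code{big} and @code{exact} are made up for the illustration.

```shell
# 2^1000 is far outside any 64-bit integer range, yet it fits in a double.
big=$(awk 'BEGIN { printf("%g\n", 2^1000) }')
echo "$big"

# 0.5 is a power of two and is stored exactly; 0.1 is only approximated.
exact=$(awk 'BEGIN { printf("%.17g %.17g\n", 0.5, 0.1) }')
echo "$exact"
```

Under those assumptions the first command prints @samp{1.07151e+301}, and the second prints @samp{0.5 0.10000000000000001}. The trailing digit shows that the stored value of 0.1 is slightly off from the decimal constant.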
+@c Floating-point issues are discussed more fully in +@c @ref{Floating Point Issues}. + +There are several important issues to be aware of, described next. + +@menu +* Floating Point Issues:: Stuff to know about floating-point numbers. +* Integer Programming:: Effective integer programming. +@end menu + +@node Floating Point Issues +@subsection Floating-Point Number Caveats + +This @value{SECTION} describes some of the issues +involved in using floating-point numbers. + +There is a very nice +@uref{http://www.validlab.com/goldberg/paper.pdf, paper on floating-point arithmetic} +by David Goldberg, +``What Every Computer Scientist Should Know About Floating-point Arithmetic,'' +@cite{ACM Computing Surveys} @strong{23}, 1 (1991-03), 5-48. +This is worth reading if you are interested in the details, +but it does require a background in computer science. + +@menu +* String Conversion Precision:: The String Value Can Lie. +* Unexpected Results:: Floating Point Numbers Are Not Abstract + Numbers. +* POSIX Floating Point Problems:: Standards Versus Existing Practice. +@end menu + +@node String Conversion Precision +@subsubsection The String Value Can Lie + +Internally, @command{awk} keeps both the numeric value +(double precision floating-point) and the string value for a variable. +Separately, @command{awk} keeps +track of what type the variable has +(@pxref{Typing and Comparison}), +which plays a role in how variables are used in comparisons. + +It is important to note that the string value for a number may not +reflect the full value (all the digits) that the numeric value +actually contains.
+The following program (@file{values.awk}) illustrates this: + +@example +@{ + sum = $1 + $2 + # see it for what it is + printf("sum = %.12g\n", sum) + # use CONVFMT + a = "<" sum ">" + print "a =", a + # use OFMT + print "sum =", sum +@} +@end example + +@noindent +This program shows the full value of the sum of @code{$1} and @code{$2} +using @code{printf}, and then prints the string values obtained +from both automatic conversion (via @code{CONVFMT}) and +from printing (via @code{OFMT}). + +Here is what happens when the program is run: + +@example +$ @kbd{echo 3.654321 1.2345678 | awk -f values.awk} +@print{} sum = 4.8888888 +@print{} a = <4.88889> +@print{} sum = 4.88889 +@end example + +This makes it clear that the full numeric value is different from +what the default string representations show. + +@code{CONVFMT}'s default value is @code{"%.6g"}, which yields a value with +at least six significant digits. For some applications, you might want to +change it to specify more precision. +On most modern machines, most of the time, +17 digits is enough to capture a floating-point number's +value exactly.@footnote{Pathological cases can require up to +752 digits (!), but we doubt that you need to worry about this.} + +@node Unexpected Results +@subsubsection Floating Point Numbers Are Not Abstract Numbers + +@cindex floating-point, numbers +Unlike numbers in the abstract sense (such as what you studied in high school +or college arithmetic), numbers stored in computers are limited in certain ways. +They cannot represent an infinite number of digits, nor can they always +represent things exactly. +In particular, +floating-point numbers cannot +always represent values exactly. 
Here is an example: + +@example +$ @kbd{awk '@{ printf("%010d\n", $1 * 100) @}'} +515.79 +@print{} 0000051579 +515.80 +@print{} 0000051579 +515.81 +@print{} 0000051580 +515.82 +@print{} 0000051582 +@kbd{@value{CTL}-d} +@end example + +@noindent +This shows that some values can be represented exactly, +whereas others are only approximated. This is not a ``bug'' +in @command{awk}, but simply an artifact of how computers +represent numbers. + +@quotation NOTE +It cannot be emphasized enough that the behavior just +described is fundamental to modern computers. You will +see this kind of thing happen in @emph{any} programming +language using hardware floating-point numbers. It is @emph{not} +a bug in @command{gawk}, nor is it something that can be ``just +fixed.'' +@end quotation + +@cindex negative zero +@cindex positive zero +@cindex zero@comma{} negative vs.@: positive +Another peculiarity of floating-point numbers on modern systems +is that they often have more than one representation for the number zero! +In particular, it is possible to represent ``minus zero'' as well as +regular, or ``positive'' zero. + +This example shows that negative and positive zero are distinct values +when stored internally, but that they are in fact equal to each other, +as well as to ``regular'' zero: + +@example +$ @kbd{gawk 'BEGIN @{ mz = -0 ; pz = 0} +> @kbd{printf "-0 = %g, +0 = %g, (-0 == +0) -> %d\n", mz, pz, mz == pz} +> @kbd{printf "mz == 0 -> %d, pz == 0 -> %d\n", mz == 0, pz == 0} +> @kbd{@}'} +@print{} -0 = -0, +0 = 0, (-0 == +0) -> 1 +@print{} mz == 0 -> 1, pz == 0 -> 1 +@end example + +It helps to keep this in mind should you process numeric data +that contains negative zero values; the fact that the zero is negative +is noted and can affect comparisons. + +@node POSIX Floating Point Problems +@subsubsection Standards Versus Existing Practice + +Historically, @command{awk} has converted any non-numeric looking string +to the numeric value zero, when required. 
Furthermore, the original +definition of the language and the original POSIX standards specified that +@command{awk} only understands decimal numbers (base 10), and not octal +(base 8) or hexadecimal numbers (base 16). + +Changes in the language of the +2001 and 2004 POSIX standards can be interpreted to imply that @command{awk} +should support additional features. These features are: + +@itemize @bullet +@item +Interpretation of floating point data values specified in hexadecimal +notation (@samp{0xDEADBEEF}). (Note: data values, @emph{not} +source code constants.) + +@item +Support for the special IEEE 754 floating point values ``Not A Number'' +(NaN), positive Infinity (``inf'') and negative Infinity (``@minus{}inf''). +In particular, the format for these values is as specified by the ISO 1999 +C standard, which ignores case and can allow machine-dependent additional +characters after the @samp{nan} and allow either @samp{inf} or @samp{infinity}. +@end itemize + +The first problem is that both of these are clear changes to historical +practice: + +@itemize @bullet +@item +The @command{gawk} maintainer feels that supporting hexadecimal floating +point values, in particular, is ugly, and was never intended by the +original designers to be part of the language. + +@item +Allowing completely alphabetic strings to have valid numeric +values is also a very severe departure from historical practice. +@end itemize + +The second problem is that the @code{gawk} maintainer feels that this +interpretation of the standard, which requires a certain amount of +``language lawyering'' to arrive at in the first place, was not even +intended by the standard developers. 
In other words, ``we see how you +got where you are, but we don't think that that's where you want to be.'' + +Recognizing the above issues, but attempting to provide compatibility +with the earlier versions of the standard, +the 2008 POSIX standard added explicit wording to allow, but not require, +that @command{awk} support hexadecimal floating point values and +special values for ``Not A Number'' and infinity. + +Although the @command{gawk} maintainer continues to feel that +providing those features is inadvisable, +nevertheless, on systems that support IEEE floating point, it seems +reasonable to provide @emph{some} way to support NaN and Infinity values. +The solution implemented in @command{gawk} is as follows: + +@itemize @bullet +@item +With the @option{--posix} command-line option, @command{gawk} becomes +``hands off.'' String values are passed directly to the system library's +@code{strtod()} function, and if it successfully returns a numeric value, +that is what's used.@footnote{You asked for it, you got it.} +By definition, the results are not portable across +different systems. They are also a little surprising: + +@example +$ @kbd{echo nanny | gawk --posix '@{ print $1 + 0 @}'} +@print{} nan +$ @kbd{echo 0xDeadBeef | gawk --posix '@{ print $1 + 0 @}'} +@print{} 3735928559 +@end example + +@item +Without @option{--posix}, @command{gawk} interprets the four strings +@samp{+inf}, +@samp{-inf}, +@samp{+nan}, +and +@samp{-nan} +specially, producing the corresponding special numeric values. +The leading sign acts as a signal to @command{gawk} (and the user) +that the value is really numeric. Hexadecimal floating point is +not supported (unless you also use @option{--non-decimal-data}, +which is @emph{not} recommended).
For example: + +@example +$ @kbd{echo nanny | gawk '@{ print $1 + 0 @}'} +@print{} 0 +$ @kbd{echo +nan | gawk '@{ print $1 + 0 @}'} +@print{} nan +$ @kbd{echo 0xDeadBeef | gawk '@{ print $1 + 0 @}'} +@print{} 0 +@end example + +@command{gawk} does ignore case in the four special values. +Thus @samp{+nan} and @samp{+NaN} are the same. +@end itemize + +@node Integer Programming +@subsection Mixing Integers And Floating-point + +As has been mentioned already, @command{gawk} ordinarily uses hardware double +precision with 64-bit IEEE binary floating-point representation +for numbers on most systems. A large integer like 9,007,199,254,740,997 +has a binary representation that, although finite, is more than 53 bits long; +it must also be rounded to 53 bits. +The biggest integer that can be stored in a C @code{double} is usually the same +as the largest possible value of a @code{double}. If your system @code{double} +is an IEEE 64-bit @code{double}, this largest possible value is an integer and +can be represented precisely. What more should one know about integers? + +If you want to know what is the largest integer, such that it and +all smaller integers can be stored in 64-bit doubles without losing precision, +then the answer is +@iftex +@math{2^{53}}. +@end iftex +@ifnottex +2^53. +@end ifnottex +The next representable number is the even number +@iftex +@math{2^{53} + 2}, +@end iftex +@ifnottex +2^53 + 2, +@end ifnottex +meaning it is unlikely that you will be able to make +@command{gawk} print +@iftex +@math{2^{53} + 1} +@end iftex +@ifnottex +2^53 + 1 +@end ifnottex +in integer format. +The range of integers exactly representable by a 64-bit double +is +@iftex +@math{[-2^{53}, 2^{53}]}. +@end iftex +@ifnottex +[@minus{}2^53, 2^53]. +@end ifnottex +If you ever see an integer outside this range in @command{gawk} +using 64-bit doubles, you have reason to be very suspicious about +the accuracy of the output. 
Here is a simple program with erroneous output: + +@example +$ @kbd{gawk 'BEGIN @{ i = 2^53 - 1; for (j = 0; j < 4; j++) print i + j @}'} +@print{} 9007199254740991 +@print{} 9007199254740992 +@print{} 9007199254740992 +@print{} 9007199254740994 +@end example + +The lesson is to not assume that any large integer printed by @command{gawk} +represents an exact result from your computation, especially if it wraps +around on your screen. + +@node Floating-point Programming +@section Understanding Floating-point Programming + +Numerical programming is an extensive area; if you need to develop +sophisticated numerical algorithms then @command{gawk} may not be +the ideal tool, and this documentation may not be sufficient. +@c FIXME: JOHN: Do you want to cite some actual books? +It might require digesting a book or two to really internalize how to compute +with ideal accuracy and precision, +and the result often depends on the particular application. + +@quotation NOTE +A floating-point calculation's @dfn{accuracy} is how close it comes +to the real value. This is as opposed to the @dfn{precision}, which +usually refers to the number of bits used to represent the number +(see @uref{http://en.wikipedia.org/wiki/Accuracy_and_precision, +the Wikipedia article} for more information). +@end quotation + +There are two options for doing floating-point calculations: +hardware floating-point (as used by standard @command{awk} and +the default for @command{gawk}), and @dfn{arbitrary-precision} +floating-point, which is software based. +From this point forward, this @value{CHAPTER} +aims to provide enough information to understand both, and then +will focus on @command{gawk}'s facilities for the latter.@footnote{If you +are interested in other tools that perform arbitrary precision arithmetic, +you may want to investigate the POSIX @command{bc} tool. 
See +@uref{http://pubs.opengroup.org/onlinepubs/009695399/utilities/bc.html, +the POSIX specification for it}, for more information.} + +Binary floating-point representations and arithmetic are inexact. +Simple values like 0.1 cannot be precisely represented using +binary floating-point numbers, and the limited precision of +floating-point numbers means that slight changes in +the order of operations or the precision of intermediate storage +can change the result. To make matters worse, with arbitrary precision +floating-point, you can set the precision before starting a computation, +but then you cannot be sure of the number of significant decimal places +in the final result. + +Sometimes, before you start to write any code, you should think more +about what you really want and what's really happening. Consider the +two numbers in the following example: + +@example +x = 0.875 # 1/2 + 1/4 + 1/8 +y = 0.425 +@end example + +Unlike the number in @code{y}, the number stored in @code{x} +is exactly representable +in binary since it can be written as a finite sum of one or +more fractions whose denominators are all powers of two. +When @command{gawk} reads a floating-point number from +program source, it automatically rounds that number to whatever +precision your machine supports. If you try to print the numeric +content of a variable using an output format string of @code{"%.17g"}, +it may not produce the same number as you assigned to it: + +@example +$ @kbd{gawk 'BEGIN @{ x = 0.875; y = 0.425} +> @kbd{ printf("%0.17g, %0.17g\n", x, y) @}'} +@print{} 0.875, 0.42499999999999999 +@end example + +Often the error is so small you do not even notice it, and if you do, +you can always specify how much precision you would like in your output. +Usually this is a format string like @code{"%.15g"}, which when +used in the previous example, produces an output identical to the input. 
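To make the effect of the output format concrete, here is a brief sketch that prints the same two values first with 17 and then with 15 significant digits. It assumes IEEE 754 double precision numbers and a C library that prints them faithfully; the shell variable names @code{wide} and @code{narrow} are invented for the illustration.

```shell
# 17 significant digits expose the rounding of 0.425 ...
wide=$(awk 'BEGIN { x = 0.875; y = 0.425; printf("%.17g, %.17g\n", x, y) }')
echo "$wide"

# ... while 15 significant digits print the values as they were typed.
narrow=$(awk 'BEGIN { x = 0.875; y = 0.425; printf("%.15g, %.15g\n", x, y) }')
echo "$narrow"
```

Under those assumptions the first line is @samp{0.875, 0.42499999999999999} and the second is @samp{0.875, 0.425}. The stored value of @code{y} is the same in both cases; only the number of digits requested for display differs.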
+ +Because the underlying representation can be a little bit off from the exact value, +comparing floating-point values to see if they are equal is generally not a good idea. +Here is an example where it does not work like you expect: + +@example +$ @kbd{gawk 'BEGIN @{ print (0.1 + 12.2 == 12.3) @}'} +@print{} 0 +@end example + +The loss of accuracy during a single computation with floating-point numbers +usually isn't enough to worry about. However, if you compute a value +which is the result of a sequence of floating point operations, +the error can accumulate and greatly affect the computation itself. +Here is an attempt to compute the value of the constant +@value{PI} using one of its many series representations: + +@example +BEGIN @{ + x = 1.0 / sqrt(3.0) + n = 6 + for (i = 1; i < 30; i++) @{ + n = n * 2.0 + x = (sqrt(x * x + 1) - 1) / x + printf("%.15f\n", n * x) + @} +@} +@end example + +When run, the early errors propagating through later computations +cause the loop to terminate prematurely after an attempt to divide by zero. + +@example +$ @kbd{gawk -f pi.awk} +@print{} 3.215390309173475 +@print{} 3.159659942097510 +@print{} 3.146086215131467 +@print{} 3.142714599645573 +@dots{} +@print{} 3.224515243534819 +@print{} 2.791117213058638 +@print{} 0.000000000000000 +@error{} gawk: pi.awk:6: fatal: division by zero attempted +@end example + +Here is an additional example where the inaccuracies in internal representations +yield an unexpected result: + +@example +$ @kbd{gawk 'BEGIN @{} +> @kbd{for (d = 1.1; d <= 1.5; d += 0.1)} +> @kbd{i++} +> @kbd{print i} +> @kbd{@}'} +@print{} 4 +@end example + +Can computation using arbitrary precision help with the previous examples? +If you are impatient to know, see +@ref{Exact Arithmetic}. + +Instead of arbitrary precision floating-point arithmetic, +often all you need is an adjustment of your logic +or a different order for the operations in your calculation. 
+The stability and the accuracy of the computation of the constant @value{PI} +in the previous example can be enhanced by using the following +simple algebraic transformation: + +@example +(sqrt(x * x + 1) - 1) / x = x / (sqrt(x * x + 1) + 1) +@end example + +@noindent +After making this change, the program does converge to +@value{PI} in under 30 iterations: + +@example +$ @kbd{gawk -f /tmp/pi2.awk} +@print{} 3.215390309173473 +@print{} 3.159659942097501 +@print{} 3.146086215131436 +@print{} 3.142714599645370 +@print{} 3.141873049979825 +@dots{} +@print{} 3.141592653589797 +@print{} 3.141592653589797 +@end example + +There is no need to be unduly suspicious about the results from +floating-point arithmetic. The lesson to remember is that +floating-point arithmetic is always more complex than arithmetic using +pencil and paper. In order to take advantage of the power +of computer floating-point, you need to know its limitations +and work within them. For most casual use of floating-point arithmetic, +you will often get the expected result in the end if you simply round +the display of your final results to the correct number of significant +decimal digits. + +As general advice, avoid presenting numerical data in a manner that +implies better precision than is actually the case. + +@menu +* Floating-point Representation:: Binary floating-point representation. +* Floating-point Context:: Floating-point context. +* Rounding Mode:: Floating-point rounding mode. +@end menu + +@node Floating-point Representation +@subsection Binary Floating-point Representation +@cindex IEEE-754 format + +Although floating-point representations vary from machine to machine, +the most commonly encountered representation is that defined by the +IEEE 754 Standard. An IEEE-754 format value has three components: + +@itemize @bullet +@item +A sign bit telling whether the number is positive or negative. + +@item +An @dfn{exponent}, @var{e}, giving its order of magnitude.
+ +@item +A @dfn{significand}, @var{s}, +specifying the actual digits of the number. +@end itemize + +The value of the +number is then +@iftex +@math{s @cdot 2^e}. +@end iftex +@ifnottex +@var{s * 2^e}. +@end ifnottex +The first bit of a non-zero binary significand +is always one, so the significand in an IEEE-754 format only includes the +fractional part, leaving the leading one implicit. +The significand is stored in @dfn{normalized} format, +which means that the first bit is always a one. + +Three of the standard IEEE-754 types are 32-bit single precision, +64-bit double precision and 128-bit quadruple precision. +The standard also specifies extended precision formats +to allow greater precisions and larger exponent ranges. + +@node Floating-point Context +@subsection Floating-point Context +@cindex context, floating-point + +A floating-point @dfn{context} defines the environment for arithmetic operations. +It governs precision, sets rules for rounding, and limits the range for exponents. +The context has the following primary components: + +@table @dfn +@item Precision +Precision of the floating-point format in bits. +@item emax +Maximum exponent allowed for this format. +@item emin +Minimum exponent allowed for this format. +@item Underflow behavior +The format may or may not support gradual underflow. +@item Rounding +The rounding mode of this context. 
+@end table + +@ref{table-ieee-formats} lists the precision and exponent +field values for the basic IEEE-754 binary formats: + +@float Table,table-ieee-formats +@caption{Basic IEEE Format Context Values} +@multitable @columnfractions .20 .20 .20 .20 .20 +@headitem Name @tab Total bits @tab Precision @tab emin @tab emax +@item Single @tab 32 @tab 24 @tab @minus{}126 @tab +127 +@item Double @tab 64 @tab 53 @tab @minus{}1022 @tab +1023 +@item Quadruple @tab 128 @tab 113 @tab @minus{}16382 @tab +16383 +@end multitable +@end float + +@quotation NOTE +The precision numbers include the implied leading one that gives them +one extra bit of significand. +@end quotation + +A floating-point context can also determine which signals are treated +as exceptions, and can set rules for arithmetic with special values. +Please consult the IEEE-754 standard or other resources for details. + +@command{gawk} ordinarily uses the hardware double precision +representation for numbers. On most systems, this is IEEE-754 +floating-point format, corresponding to 64-bit binary with 53 bits +of precision. + +@quotation NOTE +In case an underflow occurs, the standard allows, but does not require, +the result from an arithmetic operation to be a number smaller than +the smallest nonzero normalized number. Such numbers do +not have as many significant digits as normal numbers, and are called +@dfn{denormals} or @dfn{subnormals}. The alternative, simply returning a zero, +is called @dfn{flush to zero}. The basic IEEE-754 binary formats +support subnormal numbers. +@end quotation + +@node Rounding Mode +@subsection Floating-point Rounding Mode +@cindex rounding mode, floating-point + +The @dfn{rounding mode} specifies the behavior for the results of numerical +operations when discarding extra precision. Each rounding mode indicates +how the least significant returned digit of a rounded result is to +be calculated. 
+@ref{table-rounding-modes} lists the IEEE-754 defined +rounding modes: + +@float Table,table-rounding-modes +@caption{IEEE 754 Rounding Modes} +@multitable @columnfractions .45 .55 +@headitem Rounding Mode @tab IEEE Name +@item Round to nearest, ties to even @tab @code{roundTiesToEven} +@item Round toward plus Infinity @tab @code{roundTowardPositive} +@item Round toward negative Infinity @tab @code{roundTowardNegative} +@item Round toward zero @tab @code{roundTowardZero} +@item Round to nearest, ties away from zero @tab @code{roundTiesToAway} +@end multitable +@end float + +The default mode @code{roundTiesToEven} is the most preferred, +but the least intuitive. This method does the obvious thing for most values, +by rounding them up or down to the nearest digit. +For example, rounding 1.132 to two digits yields 1.13, +and rounding 1.157 yields 1.16. + +However, when it comes to rounding a value that is exactly halfway between, +things do not work the way you probably learned in school. +In this case, the number is rounded to the nearest even digit. +So rounding 0.125 to two digits rounds down to 0.12, +but rounding 0.6875 to three digits rounds up to 0.688. +You probably have already encountered this rounding mode when +using the @code{printf} routine to format floating-point numbers. 
+For example: + +@example +BEGIN @{ + x = -4.5 + for (i = 1; i < 10; i++) @{ + x += 1.0 + printf("%4.1f => %2.0f\n", x, x) + @} +@} +@end example + +@noindent +produces the following output when run:@footnote{It +is possible for the output to be completely different if the +C library in your system does not use the IEEE-754 even-rounding +rule to round halfway cases for @code{printf()}.} + +@example +-3.5 => -4 +-2.5 => -2 +-1.5 => -2 +-0.5 => 0 + 0.5 => 0 + 1.5 => 2 + 2.5 => 2 + 3.5 => 4 + 4.5 => 4 +@end example + +The theory behind the rounding mode @code{roundTiesToEven} is that +it more or less evenly distributes upward and downward rounds +of exact halves, which might cause the round-off error +to cancel itself out. This is the default rounding mode used +in IEEE-754 computing functions and operators. + +The other rounding modes are rarely used. +Round toward positive infinity (@code{roundTowardPositive}) +and round toward negative infinity (@code{roundTowardNegative}) +are often used to implement interval arithmetic, +where you adjust the rounding mode to calculate upper and lower bounds +for the range of output. The @code{roundTowardZero} +mode can be used for converting floating-point numbers to integers. +The rounding mode @code{roundTiesToAway} rounds the result to the +nearest number and selects the number with the larger magnitude +if a tie occurs. + +Some numerical analysts will tell you that your choice of rounding style +has tremendous impact on the final outcome, and advise you to wait until +final output for any rounding. Instead, you can often avoid round-off error problems by +setting the precision initially to some value sufficiently larger than +the final desired precision, so that the accumulation of round-off error +does not influence the outcome. 
+If you suspect that results from your computation are +sensitive to accumulation of round-off error, +one way to be sure is to look for a significant difference in output +when you change the rounding mode. + +@node Gawk and MPFR +@section @command{gawk} + MPFR = Powerful Arithmetic + +The rest of this @value{CHAPTER} describes how to use the arbitrary precision +(also known as @dfn{multiple precision} or @dfn{infinite precision}) numeric +capabilities in @command{gawk} to produce maximally accurate results +when you need it. + +But first you should check if your version of +@command{gawk} supports arbitrary precision arithmetic. +The easiest way to find out is to look at the output of +the following command: + +@example +$ @kbd{gawk --version} +@print{} GNU Awk 4.1.0 (GNU MPFR 3.1.0, GNU MP 5.0.3) +@print{} Copyright (C) 1989, 1991-2012 Free Software Foundation. +@dots{} +@end example + +@command{gawk} uses the +@uref{http://www.mpfr.org, GNU MPFR} +and +@uref{http://gmplib.org, GNU MP} (GMP) +libraries for arbitrary precision +arithmetic on numbers. So if you do not see the names of these libraries +in the output, then your version of @command{gawk} does not support +arbitrary precision arithmetic. + +Additionally, +there are a few elements available in the @code{PROCINFO} array +to provide information about the MPFR and GMP libraries. +@xref{Auto-set}, for more information. + @ignore -@c Try this +Even if you aren't interested in arbitrary precision arithmetic, you +may still benefit from knowing about how @command{gawk} handles numbers +in general, and the limitations of doing arithmetic with ordinary +@command{gawk} numbers. +@end ignore + + +@node Arbitrary Precision Floats +@section Arbitrary Precision Floating-point Arithmetic with @command{gawk} + +@command{gawk} uses the GNU MPFR library +for arbitrary precision floating-point arithmetic. 
The MPFR library +provides precise control over precisions and rounding modes, and gives +correctly rounded, reproducible, platform-independent results. With the +command-line option @option{--bignum} or @option{-M}, +all floating-point arithmetic operators and numeric functions can yield +results to any desired precision level supported by MPFR. +Two built-in variables, @code{PREC} and @code{ROUNDMODE}, +provide control over the working precision and the rounding mode +(@pxref{Setting Precision}, and +@pxref{Setting Rounding Mode}). +The precision and the rounding mode are set globally for every operation +to follow. + +The default working precision for arbitrary precision floating-point values is 53, +and the default value for @code{ROUNDMODE} is @code{"N"}, +which selects the IEEE-754 @code{roundTiesToEven} rounding mode +(@pxref{Rounding Mode}).@footnote{The +default precision is 53, since according to the MPFR documentation, +the library should be able to exactly reproduce all computations with +double-precision machine floating-point numbers (@code{double} type +in C), except the default exponent range is much wider and subnormal +numbers are not implemented.} +@command{gawk} uses the default exponent range in MPFR @iftex -@page -@headings off -@majorheading III@ @ @ Appendixes -Part III provides the appendixes, the Glossary, and two licenses that cover +(@math{emax = 2^{30} - 1, emin = -emax}) +@end iftex +@ifnottex +(@var{emax} = 2^30 @minus{} 1, @var{emin} = @minus{}@var{emax}) +@end ifnottex +for all floating-point contexts. +There is no explicit mechanism to adjust the exponent range. +MPFR does not implement subnormal numbers by default, +and this behavior cannot be changed in @command{gawk}. + +@quotation NOTE +When emulating an IEEE-754 format (@pxref{Setting Precision}), +@command{gawk} internally adjusts the exponent range +to the value defined for the format and also performs computations needed for +gradual underflow (subnormal numbers). 
+@end quotation + +@quotation NOTE +MPFR numbers are variable-size entities, consuming only as much space as +needed to store the significant digits. Since the performance using MPFR +numbers pales in comparison to doing arithmetic using the underlying machine +types, you should consider using only as much precision as needed by +your program. +@end quotation + +@menu +* Setting Precision:: Setting the working precision. +* Setting Rounding Mode:: Setting the rounding mode. +* Floating-point Constants:: Representing floating-point constants. +* Changing Precision:: Changing the precision of a number. +* Exact Arithmetic:: Exact arithmetic with floating-point numbers. +@end menu + +@node Setting Precision +@subsection Setting the Working Precision +@cindex @code{PREC} variable + +@command{gawk} uses a global working precision; it does not keep track of +the precision or accuracy of individual numbers. Performing an arithmetic +operation or calling a built-in function rounds the result to the current +working precision. The default working precision is 53, which can be +modified using the built-in variable @code{PREC}. You can also set the +value to one of the following pre-defined case-insensitive strings +to emulate an IEEE-754 binary format: + +@multitable {@code{"double"}} {12345678901234567890123456789012345} +@headitem @code{PREC} @tab IEEE-754 Binary Format +@item @code{"half"} @tab 16-bit half-precision. +@item @code{"single"} @tab Basic 32-bit single precision. +@item @code{"double"} @tab Basic 64-bit double precision. +@item @code{"quad"} @tab Basic 128-bit quadruple precision. +@item @code{"oct"} @tab 256-bit octuple precision. 
+@end multitable + +The following example illustrates the effects of changing precision +on arithmetic operations: + +@example +$ @kbd{gawk -M -v PREC=100 'BEGIN @{ x = 1.0e-400; print x + 0; \} +> @kbd{PREC = "double"; print x + 0 @}'} +@print{} 1e-400 +@print{} 0 +@end example + +Binary and decimal precisions are related approximately, according to the +formula: + +@iftex +@math{prec = 3.322 @cdot dps} +@end iftex +@ifnottex +@var{prec} = 3.322 * @var{dps} +@end ifnottex + +@noindent +Here, @var{prec} denotes the binary precision +(measured in bits) and @var{dps} (short for decimal places) +is the number of decimal digits. We can easily calculate how many decimal +digits the 53-bit significand of an IEEE double is equivalent to: +53 / 3.322 which is equal to about 15.95. +But what does 15.95 digits actually mean? It depends whether you are +concerned about how many digits you can rely on, or how many digits +you need. + +It is important to know how many decimal digits it takes to uniquely identify +a double-precision value (the C type @code{double}). If you want to +convert from @code{double} to decimal and back to @code{double} (e.g., +saving a @code{double} representing an intermediate result to a file, and +later reading it back to restart the computation), then a few more decimal +digits are required. 17 digits is generally enough for a @code{double}. + +It can also be important to know what decimal numbers can be uniquely +represented with a @code{double}. If you want to convert +from decimal to @code{double} and back again, 15 digits is the most that +you can get. Stated differently, you should not present +the numbers from your floating-point computations with more than 15 +significant digits in them. + +Conversely, it takes a precision of 332 bits to hold an approximation +of the constant @value{PI} that is accurate to 100 decimal places.
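The relationship between bits and decimal digits, and the 15 versus 17 digit distinction, can be checked with a few lines of C. This is only a sketch: the function names @code{bits_for_digits()}, @code{digits_for_bits()}, and @code{survives_round_trip()} are invented for this illustration, and the round-trip check assumes that @code{double} is the IEEE-754 64-bit format and that the C library's @code{snprintf()} and @code{strtod()} round correctly.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* log2(10): bits per decimal digit, i.e. the factor 3.322 in the formula. */
static const double BITS_PER_DIGIT = 3.321928094887362;

/* prec = 3.322 * dps: bits needed to hold dps decimal digits. */
double bits_for_digits(int dps)
{
    assert(dps > 0);
    return dps * BITS_PER_DIGIT;
}

/* dps = prec / 3.322: decimal digits equivalent to prec bits. */
double digits_for_bits(int prec)
{
    assert(prec > 0);
    return prec / BITS_PER_DIGIT;
}

/*
 * Format x with the given number of significant decimal digits, read the
 * text back, and return nonzero if the result is the same double.
 */
int survives_round_trip(double x, int digits)
{
    char buf[64];

    assert(digits >= 1 && digits <= 40);
    snprintf(buf, sizeof buf, "%.*e", digits - 1, x);
    return strtod(buf, NULL) == x;
}
```

Under those assumptions, @code{digits_for_bits(53)} is about 15.95 and @code{bits_for_digits(100)} is about 332.2. A value such as 1.0000000000000002 (the next @code{double} after 1) survives the round trip with 17 digits but not with 15, because 15 digits cannot record its last bit.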
+ +You should always add some extra bits in order to avoid the confusing round-off +issues that occur because numbers are stored internally in binary. + +@node Setting Rounding Mode +@subsection Setting the Rounding Mode +@cindex @code{ROUNDMODE} variable + +The @code{ROUNDMODE} variable provides +program level control over the rounding mode. +The correspondence between @code{ROUNDMODE} and the IEEE +rounding modes is shown in @ref{table-gawk-rounding-modes}. + +@float Table,table-gawk-rounding-modes +@caption{@command{gawk} Rounding Modes} +@multitable @columnfractions .45 .30 .25 +@headitem Rounding Mode @tab IEEE Name @tab @code{ROUNDMODE} +@item Round to nearest, ties to even @tab @code{roundTiesToEven} @tab @code{"N"} or @code{"n"} +@item Round toward plus Infinity @tab @code{roundTowardPositive} @tab @code{"U"} or @code{"u"} +@item Round toward negative Infinity @tab @code{roundTowardNegative} @tab @code{"D"} or @code{"d"} +@item Round toward zero @tab @code{roundTowardZero} @tab @code{"Z"} or @code{"z"} +@item Round to nearest, ties away from zero @tab @code{roundTiesToAway} @tab @code{"A"} or @code{"a"} +@end multitable +@end float + +@code{ROUNDMODE} has the default value @code{"N"}, +which selects the IEEE-754 rounding mode @code{roundTiesToEven}. +@ref{table-gawk-rounding-modes}, lists @code{"A"} to select the IEEE-754 mode +@code{roundTiesToAway}. This is only available +if your version of the MPFR library supports it; otherwise setting +@code{ROUNDMODE} to this value has no effect. @xref{Rounding Mode}, +for the meanings of the various rounding modes. + +Here is an example of how to change the default rounding behavior of +@code{printf}'s output: + +@example +$ @kbd{gawk -M -v ROUNDMODE="Z" 'BEGIN @{ printf("%.2f\n", 1.378) @}'} +@print{} 1.37 +@end example + +@node Floating-point Constants +@subsection Representing Floating-point Constants +@cindex constants, floating-point + +Be wary of floating-point constants! 
When reading a floating-point constant +from program source code, @command{gawk} uses the default precision, +unless overridden +by an assignment to the special variable @code{PREC} on the command +line, to store it internally as an MPFR number. +Changing the precision using @code{PREC} in the program text does +@emph{not} change the precision of a constant. If you need to +represent a floating-point constant at a higher precision than the +default and cannot use a command line assignment to @code{PREC}, +you should either specify the constant as a string, or +as a rational number, whenever possible. The following example +illustrates the differences among various ways to +print a floating-point constant: + +@example +$ @kbd{gawk -M 'BEGIN @{ PREC = 113; printf("%0.25f\n", 0.1) @}'} +@print{} 0.1000000000000000055511151 +$ @kbd{gawk -M -v PREC=113 'BEGIN @{ printf("%0.25f\n", 0.1) @}'} +@print{} 0.1000000000000000000000000 +$ @kbd{gawk -M 'BEGIN @{ PREC = 113; printf("%0.25f\n", "0.1") @}'} +@print{} 0.1000000000000000000000000 +$ @kbd{gawk -M 'BEGIN @{ PREC = 113; printf("%0.25f\n", 1/10) @}'} +@print{} 0.1000000000000000000000000 +@end example + +In the first case, the number is stored with the default precision of 53. + +@node Changing Precision +@subsection Changing the Precision of a Number + +@cindex Laurie, Dirk +@quotation +@i{The point is that in any variable-precision package, +a decision is made on how to treat numbers given as data, +or arising in intermediate results, which are represented in +floating-point format to a precision lower than working precision. +Do we promote them to full membership of the high-precision club, +or do we treat them and all their associates as second-class citizens? +Sometimes the first course is proper, sometimes the second, and it takes +careful analysis to tell which.} + +Dirk Laurie@footnote{Dirk Laurie. +@cite{Variable-precision Arithmetic Considered Perilous --- A Detective Story}.
+Electronic Transactions on Numerical Analysis. Volume 28, pp. 168-173, 2008.} +@end quotation + +@command{gawk} does not implicitly modify the precision of any previously +computed results when the working precision is changed with an assignment +to @code{PREC}. The precision of a number is always the one that was +used at the time of its creation, and there is no way for the user +to explicitly change it afterwards. However, since the result of a +floating-point arithmetic operation is always an arbitrary precision +floating-point value---with a precision set by the value of @code{PREC}---one of the +following workarounds effectively accomplishes the desired behavior: + +@example +x = x + 0.0 +@end example + +@noindent +or: + +@example +x += 0.0 +@end example + +@node Exact Arithmetic +@subsection Exact Arithmetic with Floating-point Numbers + +@quotation CAUTION +Never depend on the exactness of floating-point arithmetic, +even for apparently simple expressions! +@end quotation + +Can arbitrary precision arithmetic give exact results? There are +no easy answers. The standard rules of algebra often do not apply +when using floating-point arithmetic. +Among other things, the distributive and associative laws +do not hold completely, and order of operation may be important +for your computation. Rounding error, cumulative precision loss +and underflow are often troublesome. + +When @command{gawk} tests the expressions @samp{0.1 + 12.2} and @samp{12.3} +for equality +using the machine double precision arithmetic, it decides that they +are not equal! +(@xref{Floating-point Programming}.) +You can get the result you want by increasing the precision; +56 in this case will get the job done: + +@example +$ @kbd{gawk -M -v PREC=56 'BEGIN @{ print (0.1 + 12.2 == 12.3) @}'} +@print{} 1 +@end example + +If adding more bits is good, perhaps adding even more bits of +precision is better? 
+Here is what happens if we use an even larger value of @code{PREC}: + +@example +$ @kbd{gawk -M -v PREC=201 'BEGIN @{ print (0.1 + 12.2 == 12.3) @}'} +@print{} 0 +@end example + +This is not a bug in @command{gawk} or in the MPFR library. +It is easy to forget that the finite number of bits used to store the value +is often just an approximation after proper rounding. +The test for equality succeeds if and only if @emph{all} bits in the two operands +are exactly the same. Since this is not necessarily true after floating-point +computations with a particular precision and effective rounding rule, +a straight test for equality may not work. + +So, don't assume that floating-point values can be compared for equality. +You should also exercise caution when using other forms of comparisons. +The standard way to compare between floating-point numbers is to determine +how much error (or @dfn{tolerance}) you will allow in a comparison and +check to see if one value is within this error range of the other. + +In applications where 15 or fewer decimal places suffice, +hardware double precision arithmetic can be adequate, and is usually much faster. +But you do need to keep in mind that every floating-point operation +can suffer a new rounding error with catastrophic consequences as illustrated +by our earlier attempt to compute the value of the constant @value{PI} +(@pxref{Floating-point Programming}). +Extra precision can greatly enhance the stability and the accuracy +of your computation in such cases. + +Repeated addition is not necessarily equivalent to multiplication +in floating-point arithmetic. In the example in +@ref{Floating-point Programming}: + +@example +$ @kbd{gawk 'BEGIN @{} +> @kbd{for (d = 1.1; d <= 1.5; d += 0.1)} +> @kbd{i++} +> @kbd{print i} +> @kbd{@}'} +@print{} 4 +@end example + +@noindent +you may or may not succeed in getting the correct result by choosing +an arbitrarily large value for @code{PREC}. 
Reformulation of +the problem at hand is often the correct approach in such situations. + +@node Arbitrary Precision Integers +@section Arbitrary Precision Integer Arithmetic with @command{gawk} +@cindex integer, arbitrary precision + +If the option @option{--bignum} or @option{-M} is specified, +@command{gawk} performs all +integer arithmetic using GMP arbitrary precision integers. +Any number that looks like an integer in a program source or data file +is stored as an arbitrary precision integer. +The size of the integer is limited only by your computer's memory. +The current floating-point context has no effect on operations involving integers. +For example, the following computes +@iftex +@math{5^{4^{3^{2}}}}, +@end iftex +@ifnottex +5^4^3^2, +@end ifnottex +the result of which is beyond the +limits of ordinary @command{gawk} numbers: + +@example +$ @kbd{gawk -M 'BEGIN @{} +> @kbd{x = 5^4^3^2} +> @kbd{print "# of digits =", length(x)} +> @kbd{print substr(x, 1, 20), "...", substr(x, length(x) - 19, 20)} +> @kbd{@}'} +@print{} # of digits = 183231 +@print{} 62060698786608744707 ... 92256259918212890625 +@end example + +If you were to compute the same value using arbitrary precision +floating-point values instead, the precision needed for correct output +(using the formula +@iftex +@math{prec = 3.322 @cdot dps}), +would be @math{3.322 @cdot 183231}, +@end iftex +@ifnottex +@samp{prec = 3.322 * dps}), +would be 3.322 x 183231, +@end ifnottex +or 608693. + +The result from an arithmetic operation with an integer and a floating-point value +is a floating-point value with a precision equal to the working precision. +The following program calculates the eighth term in +Sylvester's sequence@footnote{Weisstein, Eric W. +@cite{Sylvester's Sequence}. From MathWorld---A Wolfram Web Resource. 
+@url{http://mathworld.wolfram.com/SylvestersSequence.html}} +using a recurrence: + +@example +$ @kbd{gawk -M 'BEGIN @{} +> @kbd{s = 2.0} +> @kbd{for (i = 1; i <= 7; i++)} +> @kbd{s = s * (s - 1) + 1} +> @kbd{print s} +> @kbd{@}'} +@print{} 113423713055421845118910464 +@end example + +The output differs from the actual number, 113,423,713,055,421,844,361,000,443, +because the default precision of 53 is not enough to represent the +floating-point results exactly. You can either increase the precision +(100 is enough in this case), or replace the floating-point constant +@samp{2.0} with an integer, to perform all computations using integer +arithmetic to get the correct output. + +It will sometimes be necessary for @command{gawk} to implicitly convert an +arbitrary precision integer into an arbitrary precision floating-point value. +This is primarily because the MPFR library does not always provide the +relevant interface to process arbitrary precision integers or mixed-mode +numbers as needed by an operation or function. +In such a case, the precision is set to the minimum value necessary +for exact conversion, and the working precision is not used for this purpose. +If this is not what you need or want, you can employ a subterfuge +like this: + +@example +gawk -M 'BEGIN @{ n = 13; print (n + 0.0) % 2.0 @}' +@end example + +You can avoid this issue altogether by specifying the number as a floating-point value +to begin with: + +@example +gawk -M 'BEGIN @{ n = 13.0; print n % 2.0 @}' +@end example + +Note that for the particular example above, it is likely best +to just use the following: + +@example +gawk -M 'BEGIN @{ n = 13; print n % 2 @}' +@end example + +@node Dynamic Extensions +@chapter Writing Extensions for @command{gawk} + +It is possible to add new built-in functions to @command{gawk} using +dynamically loaded libraries. This facility is available on systems +that support the C @code{dlopen()} and @code{dlsym()} +functions.
This @value{CHAPTER} describes how to create extensions +using code written in C or C++. + +If you don't know anything about C programming, you can safely skip this +@value{CHAPTER}, although you may wish to review the documentation on the +extensions that come with @command{gawk} (@pxref{Extension Samples}), +and the @value{SECTION} on the @code{gawkextlib} project (@pxref{gawkextlib}). +The sample extensions are automatically built and installed when +@command{gawk} is. + +@quotation NOTE +When @option{--sandbox} is specified, extensions are disabled +(@pxref{Options}). +@end quotation + +@menu +* Extension Intro:: What is an extension. +* Plugin License:: A note about licensing. +* Extension Design:: Design notes about the extension API. +* Extension API Description:: A full description of the API. +* Extension Example:: Example C code for an extension. +* Extension Samples:: The sample extensions that ship with + @code{gawk}. +* gawkextlib:: The @code{gawkextlib} project. +@end menu + +@node Extension Intro +@section Introduction + +An @dfn{extension} (sometimes called a @dfn{plug-in}) is a piece of +external compiled code that @command{gawk} can load at runtime to +provide additional functionality, over and above the built-in capabilities +described in the rest of this @value{DOCUMENT}. + +Extensions are useful because they allow you (of course) to extend +@command{gawk}'s functionality. For example, they can provide access to +system calls (such as @code{chdir()} to change directory) and to other +C library routines that could be of use. As with most software, +``the sky is the limit;'' if you can imagine something that you might +want to do and can write in C or C++, you can write an extension to do it! + +Extensions are written in C or C++, using the @dfn{Application Programming +Interface} (API) defined for this purpose by the @command{gawk} +developers. 
The rest of this @value{CHAPTER} explains the design +decisions behind the API, the facilities that it provides and how to use +them, and presents a small sample extension. In addition, it documents +the sample extensions included in the @command{gawk} distribution, +and describes the @code{gawkextlib} project. + +@node Plugin License +@section Extension Licensing + +Every dynamic extension should define the global symbol +@code{plugin_is_GPL_compatible} to assert that it has been licensed under +a GPL-compatible license. If this symbol does not exist, @command{gawk} +emits a fatal error and exits when it tries to load your extension. + +The declared type of the symbol should be @code{int}. It does not need +to be in any allocated section, though. The code merely asserts that +the symbol exists in the global scope. Something like this is enough: + +@example +int plugin_is_GPL_compatible; +@end example + +@node Extension Design +@section Extension API Design + +The first version of extensions for @command{gawk} was developed in +the mid-1990s and released with @command{gawk} 3.1 in the late 1990s. +The basic mechanisms and design remained unchanged for close to 15 years, +until 2012. + +The old extension mechanism used data types and functions from +@command{gawk} itself, with a ``clever hack'' to install extension +functions. + +@command{gawk} included some sample extensions, of which a few were +really useful. However, it was clear from the outset that the extension +mechanism was bolted onto the side and was not really thought out. + +@menu +* Old Extension Problems:: Problems with the old mechanism. +* Extension New Mechanism Goals:: Goals for the new mechanism. +* Extension Other Design Decisions:: Some other design decisions. +* Extension Mechanism Outline:: An outline of how it works. +* Extension Future Growth:: Some room for future growth. 
+@end menu + +@node Old Extension Problems +@subsection Problems With The Old Mechanism + +The old extension mechanism had several problems: + +@itemize @bullet +@item +It depended heavily upon @command{gawk} internals. Any time the +@code{NODE} structure@footnote{A critical central data structure +inside @command{gawk}.} changed, an extension would have to be +recompiled. Furthermore, to really write extensions required understanding +something about @command{gawk}'s internal functions. There was some +documentation in this @value{DOCUMENT}, but it was quite minimal. + +@item +Being able to call into @command{gawk} from an extension required linker +facilities that are common on Unix-derived systems but that did +not work on Windows systems; users wanting extensions on Windows +had to statically link them into @command{gawk}, even though Windows supports +dynamic loading of shared objects. + +@item +The API would change occasionally as @command{gawk} changed; no compatibility +between versions was ever offered or planned for. +@end itemize + +Despite the drawbacks, the @command{xgawk} project developers forked +@command{gawk} and developed several significant extensions. They also +enhanced @command{gawk}'s facilities relating to file inclusion and +shared object access. + +A new API was desired for a long time, but only in 2012 did the +@command{gawk} maintainer and the @command{xgawk} developers finally +start working on it together. More information about the @command{xgawk} +project is provided in @ref{gawkextlib}. + +@node Extension New Mechanism Goals +@subsection Goals For A New Mechanism + +Some goals for the new API were: + +@itemize @bullet +@item +The API should be independent of @command{gawk} internals. Changes in +@command{gawk} internals should not be visible to the writer of an +extension function. + +@item +The API should provide @emph{binary} compatibility across @command{gawk} +releases as long as the API itself does not change. 
+ +@item +The API should enable extensions written in C to have roughly the +same ``appearance'' to @command{awk}-level code as @command{awk} +functions do. This means that extensions should have: + +@itemize @minus +@item +The ability to access function parameters. + +@item +The ability to turn an undefined parameter into an array (call by reference). + +@item +The ability to create, access and update global variables. + +@item +Easy access to all the elements of an array at once (``array flattening'') +in order to loop over all the elements in an easy fashion for C code. + +@item +The ability to create arrays (including @command{gawk}'s true +multi-dimensional arrays). +@end itemize +@end itemize + +Some additional important goals were: + +@itemize @bullet +@item +The API should use only features in ISO C 90, so that extensions +can be written using the widest range of C and C++ compilers. The header +should include the appropriate @samp{#ifdef __cplusplus} and @samp{extern "C"} +magic so that a C++ compiler could be used. (If using C++, the runtime +system has to be smart enough to call any constructors and destructors, +as @command{gawk} is a C program. As of this writing, this has not been +tested.) + +@item +The API mechanism should not require access to @command{gawk}'s +symbols@footnote{The @dfn{symbols} are the variables and functions +defined inside @command{gawk}. Access to these symbols by code +external to @command{gawk} loaded dynamically at runtime is +problematic on Windows.} by the compile-time or dynamic linker, +in order to enable creation of extensions that also work on Windows. +@end itemize + +During development, it became clear that there were other features +that should be available to extensions, which were also subsequently +provided: + +@itemize @bullet +@item +Extensions should have the ability to hook into @command{gawk}'s +I/O redirection mechanism.
In particular, the @command{xgawk} +developers provided a so-called ``open hook'' to take over reading +records. During development, this was generalized to allow +extensions to hook into input processing, output processing, and +two-way I/O. + +@item +An extension should be able to provide a ``call back'' function +to perform clean up actions when @command{gawk} exits. + +@item +An extension should be able to provide a version string so that +@command{gawk}'s @option{--version} option can provide information +about extensions as well. +@end itemize + +@node Extension Other Design Decisions +@subsection Other Design Decisions + +As an arbitrary design decision, extensions can read the values of +built-in variables and arrays (such as @code{ARGV} and @code{FS}), but cannot +change them, with the exception of @code{PROCINFO}. + +The reason for this is to prevent an extension function from affecting +the flow of an @command{awk} program outside its control. While a real +@command{awk} function can do what it likes, that is at the discretion +of the programmer. An extension function should provide a service or +make a C API available for use within @command{awk}, and not mess with +@code{FS} or @code{ARGC} and @code{ARGV}. + +In addition, it becomes easy to start down a slippery slope. How +much access to @command{gawk} facilities do extensions need? +Do they need @code{getline}? What about calling @code{gsub()} or +compiling regular expressions? What about calling into @command{awk} +functions? (@emph{That} would be messy.) + +In order to avoid these issues, the @command{gawk} developers chose +to start with the simplest, most basic features that are still truly useful. + +Another decision is that although @command{gawk} provides nice things like +MPFR, and arrays indexed internally by integers, these features are not +being brought out to the API in order to keep things simple and close to +traditional @command{awk} semantics. 
(In fact, arrays indexed internally +by integers are so transparent that they aren't even documented!) + +Additionally, all functions in the API check that their pointer +input parameters are not @code{NULL}. If they are, they return an error. +(It is a good idea for extension code to verify that +pointers received from @command{gawk} are not @code{NULL}. +Such a thing should not happen, but the @command{gawk} developers +are only human, and they have been known to occasionally make +mistakes.) + +With time, the API will undoubtedly evolve; the @command{gawk} developers +expect this to be driven by user needs. For now, the current API seems +to provide a minimal yet powerful set of features for creating extensions. + +@node Extension Mechanism Outline +@subsection At A High Level How It Works + +The requirement to avoid access to @command{gawk}'s symbols is, at first +glance, a difficult one to meet. + +One design, apparently used by Perl and Ruby and maybe others, would +be to make the mainline @command{gawk} code into a library, with the +@command{gawk} utility a small C @code{main()} function linked against +the library. + +This seemed like the tail wagging the dog, complicating build and +installation and making a simple copy of the @command{gawk} executable +from one system to another (or one place to another on the same +system!) into a chancy operation. + +Pat Rankin suggested the solution that was adopted. Communication between +@command{gawk} and an extension is two-way. First, when an extension +is loaded, it is passed a pointer to a @code{struct} whose fields are +function pointers. +This is shown in @ref{load-extension}. 
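The idea of handing the extension a @code{struct} of function pointers can be sketched in a few lines of plain C. This is a hypothetical illustration only, not the @command{gawk} API: the names @code{host_api}, @code{extension_load()}, and @code{host_load_extension()} are invented here, and both sides live in one file for brevity, whereas a real extension would be a separate shared object located with @code{dlopen()} and @code{dlsym()}.

```c
#include <stdio.h>

/*
 * Hypothetical table of services offered by the host program.
 * These names are illustrative; they are not taken from gawkapi.h.
 */
typedef struct host_api {
    int major_version;
    void (*set_message)(const char *text);
    const char *(*get_message)(void);
} host_api;

/* ---- Host side: the real implementations behind the pointers. ---- */
static char host_message[64];

void host_set_message(const char *text)
{
    snprintf(host_message, sizeof host_message, "%s", text);
}

const char *host_get_message(void)
{
    return host_message;
}

/* ---- Extension side: sees only the table, never the host's symbols. ---- */
static const host_api *api;

int extension_load(const host_api *host)
{
    /* Reject a missing table or a version this extension was not built for. */
    if (host == NULL || host->major_version != 1)
        return 0;
    api = host;
    api->set_message("extension loaded");   /* call into the host */
    return 1;
}

/* ---- Host side: build the table and pass it in at load time. ---- */
int host_load_extension(void)
{
    static const host_api table = { 1, host_set_message, host_get_message };

    return extension_load(&table);
}
```

Because the extension reaches the host only through the pointer it was given, it needs no link-time reference to any function defined in the host, which is the property that matters on Windows.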
+ +@float Figure,load-extension +@caption{Loading The Extension} +@ifinfo +@center @image{api-figure1, , , Loading the extension, txt} +@end ifinfo +@ifnotinfo +@center @image{api-figure1, , , Loading the extension} +@end ifnotinfo +@end float + +The extension can call functions inside @command{gawk} through these +function pointers, at runtime, without needing (link-time) access +to @command{gawk}'s symbols. One of these function pointers is to a +function for ``registering'' new built-in functions. +This is shown in @ref{load-new-function}. + +@float Figure,load-new-function +@caption{Loading The New Function} +@ifinfo +@center @image{api-figure2, , , Loading the new function, txt} +@end ifinfo +@ifnotinfo +@center @image{api-figure2, , , Loading the new function} +@end ifnotinfo +@end float + +In the other direction, the extension registers its new functions +with @command{gawk} by passing function pointers to the functions that +provide the new feature (@code{do_chdir()}, for example). @command{gawk} +associates the function pointer with a name and can then call it, using a +defined calling convention. +This is shown in @ref{call-new-function}. + +@float Figure,call-new-function +@caption{Calling The New Function} +@ifinfo +@center @image{api-figure3, , , Calling the new function, txt} +@end ifinfo +@ifnotinfo +@center @image{api-figure3, , , Calling the new function} +@end ifnotinfo +@end float + +The @code{do_@var{xxx}()} function, in turn, then uses the function +pointers in the API @code{struct} to do its work, such as updating +variables or arrays, printing messages, setting @code{ERRNO}, and so on. + +Convenience macros in the @file{gawkapi.h} header file make calling +through the function pointers look like regular function calls so that +extension code is quite readable and understandable. + +Although all of this sounds somewhat complicated, the result is that +extension code is quite straightforward to write and to read. 
You can +see this in the sample extensions @file{filefuncs.c} (@pxref{Extension +Example}) and also the @file{testext.c} code for testing the APIs. + +Some other bits and pieces: + +@itemize @bullet +@item +The API provides access to @command{gawk}'s @code{do_@var{xxx}} values, +reflecting command line options, like @code{do_lint}, @code{do_profiling} +and so on (@pxref{Extension API Variables}). +These are informational: an extension cannot affect these +inside @command{gawk}. In addition, attempting to assign to them +produces a compile-time error. + +@item +The API also provides major and minor version numbers, so that an +extension can check if the @command{gawk} it is loaded with supports the +facilities it was compiled with. (Version mismatches ``shouldn't'' +happen, but we all know how @emph{that} goes.) +@xref{Extension Versioning}, for details. +@end itemize + +@node Extension Future Growth +@subsection Room For Future Growth + +The API can later be expanded, in two ways: + +@itemize @bullet +@item +@command{gawk} passes an ``extension id'' into the extension when it +first loads the extension. The extension then passes this id back +to @command{gawk} with each function call. This mechanism allows +@command{gawk} to identify the extension calling into it, should it need +to know. + +@item +Similarly, the extension passes a ``name space'' into @command{gawk} +when it registers each extension function. This allows a future +mechanism for grouping extension functions and possibly avoiding name +conflicts. +@end itemize + +Of course, as of this writing, no decisions have been made with respect +to any of the above. + +@node Extension API Description +@section API Description + +This (rather large) @value{SECTION} describes the API in detail. + +@menu +* Extension API Functions Introduction:: Introduction to the API functions. +* General Data Types:: The data types. +* Requesting Values:: How to get a value. 
+* Constructor Functions:: Functions for creating values. +* Registration Functions:: Functions to register things with + @command{gawk}. +* Printing Messages:: Functions for printing messages. +* Updating @code{ERRNO}:: Functions for updating @code{ERRNO}. +* Accessing Parameters:: Functions for accessing parameters. +* Symbol Table Access:: Functions for accessing global + variables. +* Array Manipulation:: Functions for working with arrays. +* Extension API Variables:: Variables provided by the API. +* Extension API Boilerplate:: Boilerplate code for using the API. +* Finding Extensions:: How @command{gawk} finds compiled + extensions. +@end menu + +@node Extension API Functions Introduction +@subsection Introduction + +Access to facilities within @command{gawk} is made available +by calling through function pointers passed into your extension. + +API function pointers are provided for the following kinds of operations: + +@itemize @bullet +@item +Registration functions. You may register: +@itemize @minus +@item +extension functions, +@item +exit callbacks, +@item +a version string, +@item +input parsers, +@item +output wrappers, +@item +and two-way processors. +@end itemize +All of these are discussed in detail later in this @value{CHAPTER}. + +@item +Printing fatal, warning, and ``lint'' warning messages. + +@item +Updating @code{ERRNO}, or unsetting it. + +@item +Accessing parameters, including converting an undefined parameter into +an array. + +@item +Symbol table access: retrieving a global variable, creating one, +or changing one. This also includes the ability to create a scalar +variable that will be @emph{constant} within @command{awk} code. + +@item +Creating and releasing cached values; this provides an +efficient way to use values for multiple variables and +can be a big performance win. 
+ +@item +Manipulating arrays: +@itemize @minus +@item +Retrieving, adding, deleting, and modifying elements +@item +Getting the count of elements in an array +@item +Creating a new array +@item +Clearing an array +@item +Flattening an array for easy C style looping over all its indices and elements +@end itemize +@end itemize + +Some points about using the API: + +@itemize @bullet +@item +The following types and/or macros and/or functions are referenced +in @file{gawkapi.h}. For correct use, you must therefore include the +corresponding standard header file @emph{before} including @file{gawkapi.h}: + +@multitable {C Entity} {@code{<sys/types.h>}} +@headitem C Entity @tab Header File +@item @code{FILE} @tab @code{<stdio.h>} +@item @code{NULL} @tab @code{<stddef.h>} +@item @code{malloc()} @tab @code{<stdlib.h>} +@item @code{memset()}, @code{memcpy()} @tab @code{<string.h>} +@item @code{size_t} @tab @code{<sys/types.h>} +@item @code{struct stat} @tab @code{<sys/stat.h>} +@end multitable + +Due to portability concerns, especially to systems that are not +fully standards-compliant, it is your responsibility +to include the correct files in the correct way. This requirement +is necessary in order to keep @file{gawkapi.h} clean, instead of becoming +a portability hodge-podge as can be seen in the @command{gawk} source code. + +To pass reasonable integer values for @code{ERRNO}, you will also need to +include @code{<errno.h>}. + +@item +The @file{gawkapi.h} file may be included more than once without ill effect. +Doing so, however, is poor coding practice. + +@item +Although the API only uses ISO C 90 features, there is an exception; the +``constructor'' functions use the @code{inline} keyword. If your compiler +does not support this keyword, you should either place +@samp{-Dinline=''} on your command line, or use the GNU Autotools and include a +@file{config.h} file in your extensions. 
+ +@item +All pointers filled in by @command{gawk} are to memory +managed by @command{gawk} and should be treated by the extension as +read-only. Memory for @emph{all} strings passed into @command{gawk} +from the extension @emph{must} come from @code{malloc()} and is managed +by @command{gawk} from then on. + +@item +The API defines several simple structs that map values as seen +from @command{awk}. A value can be a @code{double}, a string, or an +array (as in multidimensional arrays, or when creating a new array). +Strings maintain both pointer and length since embedded @code{NUL} +characters are allowed. + +By intent, strings are maintained using the current multibyte encoding (as +defined by @env{LC_@var{xxx}} environment variables) and not using wide +characters. This matches how @command{gawk} stores strings internally +and also how characters are likely to be input and output from files. + +@item +When retrieving a value (such as a parameter or that of a global variable +or array element), the extension requests a specific type (number, string, +scalars, value cookie, array, or ``undefined''). When the request is +``undefined,'' the returned value will have the real underlying type. + +However, if the request and actual type don't match, the access function +returns ``false'' and fills in the type of the actual value that is there, +so that the extension can, e.g., print an error message +(``scalar passed where array expected''). + +@c This is documented in the header file and needs some expanding upon. +@c The table there should be presented here +@end itemize + +While you may call the API functions by using the function pointers +directly, the interface is not so pretty. To make extension code look +more like regular code, the @file{gawkapi.h} header file defines several +macros that you should use in your code. This @value{SECTION} presents +the macros as if they were functions. 
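To see why such macros help, consider the following self-contained sketch. It does @emph{not} use the real header: @code{demo_api_t}, @code{demo_ext_id_t}, @code{demo_warning()}, and the other names are hypothetical stand-ins chosen for illustration. The macro hides both the call through the function pointer and the extension id that must accompany each call, so the extension code reads like an ordinary function call.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the real gawkapi.h declarations. */
typedef void *demo_ext_id_t;

typedef struct demo_api {
    void (*api_warning)(demo_ext_id_t id, const char *msg);
} demo_api_t;

/* Saved by the extension when it is loaded. */
static const demo_api_t *api;
static demo_ext_id_t ext_id;

/* The convenience macro: looks like a plain function call. */
#define demo_warning(msg) (api->api_warning(ext_id, (msg)))

/* ---- host side, for demonstration only ---- */
static int warn_count;
static demo_ext_id_t last_id;

static void host_warning(demo_ext_id_t id, const char *msg)
{
    (void) msg;         /* a real host would print the message */
    last_id = id;
    warn_count++;
}

/* ---- extension code using the macro ---- */
static int ext_check(int n)
{
    assert(api != NULL);    /* the extension must have been loaded */
    if (n < 0) {
        demo_warning("negative argument");
        return 0;
    }
    return 1;
}
```

Without the macro, every call site would have to spell out @code{api->api_warning(ext_id, ...)} by hand.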
+ +@node General Data Types +@subsection General Purpose Data Types + +@quotation +@i{I have a true love/hate relationship with unions.}@* +Arnold Robbins + +@i{That's the thing about unions: the compiler will arrange things so they +can accommodate both love and hate.}@* +Chet Ramey +@end quotation + +The extension API defines a number of simple types and structures for general +purpose use. Additional, more specialized data structures are introduced +in subsequent @value{SECTION}s, together with the functions that use them. + +@table @code +@item typedef void *awk_ext_id_t; +A value of this type is received from @command{gawk} when an extension is loaded. +That value must then be passed back to @command{gawk} as the first parameter of +each API function. + +@item #define awk_const @dots{} +This macro expands to @samp{const} when compiling an extension, +and to nothing when compiling @command{gawk} itself. This makes +certain fields in the API data structures unwritable from extension code, +while allowing @command{gawk} to use them as it needs to. + +@item typedef int awk_bool_t; +A simple boolean type. At the moment, the API does not define special +``true'' and ``false'' values, although perhaps it should. + +@item typedef struct @{ +@itemx @ @ @ @ char *str;@ @ @ @ @ @ /* data */ +@itemx @ @ @ @ size_t len;@ @ @ @ @ /* length thereof, in chars */ +@itemx @} awk_string_t; +This represents a mutable string. @command{gawk} +owns the memory pointed to if it supplied +the value. Otherwise (when the extension supplies the value), +@command{gawk} takes ownership of the memory pointed to. +@strong{Such memory must come from @code{malloc()}!} + +As mentioned earlier, strings are maintained using the current +multibyte encoding. 
+ +@item typedef enum @{ +@itemx @ @ @ @ AWK_UNDEFINED, +@itemx @ @ @ @ AWK_NUMBER, +@itemx @ @ @ @ AWK_STRING, +@itemx @ @ @ @ AWK_ARRAY, +@itemx @ @ @ @ AWK_SCALAR,@ @ @ @ @ @ @ @ @ /* opaque access to a variable */ +@itemx @ @ @ @ AWK_VALUE_COOKIE@ @ @ /* for updating a previously created value */ +@itemx @} awk_valtype_t; +This @code{enum} indicates the type of a value. +It is used in the following @code{struct}. + +@item typedef struct @{ +@itemx @ @ @ @ awk_valtype_t val_type; +@itemx @ @ @ @ union @{ +@itemx @ @ @ @ @ @ @ @ awk_string_t@ @ @ @ @ @ @ s; +@itemx @ @ @ @ @ @ @ @ double@ @ @ @ @ @ @ @ @ @ @ @ @ d; +@itemx @ @ @ @ @ @ @ @ awk_array_t@ @ @ @ @ @ @ @ a; +@itemx @ @ @ @ @ @ @ @ awk_scalar_t@ @ @ @ @ @ @ scl; +@itemx @ @ @ @ @ @ @ @ awk_value_cookie_t@ vc; +@itemx @ @ @ @ @} u; +@itemx @} awk_value_t; +An ``@command{awk} value.'' +The @code{val_type} member indicates what kind of value the +@code{union} holds, and each member is of the appropriate type. + +@item #define str_value@ @ @ @ @ @ u.s +@itemx #define num_value@ @ @ @ @ @ u.d +@itemx #define array_cookie@ @ @ u.a +@itemx #define scalar_cookie@ @ u.scl +@itemx #define value_cookie@ @ @ u.vc +These macros make accessing the fields of the @code{awk_value_t} more +readable. + +@item typedef void *awk_scalar_t; +Scalars can be represented as an opaque type. These values are obtained from +@command{gawk} and then passed back into it. This is discussed in a general fashion below, +and in more detail in @ref{Symbol table by cookie}. + +@item typedef void *awk_value_cookie_t; +A ``value cookie'' is an opaque type representing a cached value. +This is also discussed in a general fashion below, +and in more detail in @ref{Cached values}. + +@end table + +Scalar values in @command{awk} are either numbers or strings. The +@code{awk_value_t} struct represents values. The @code{val_type} member +indicates what is in the @code{union}. + +Representing numbers is easy---the API uses a C @code{double}. 
Strings +require more work. Since @command{gawk} allows embedded @code{NUL} bytes +in string values, a string must be represented as a pair containing a +data pointer and a length. This is the @code{awk_string_t} type. + +Identifiers (i.e., the names of global variables) can be associated +with either scalar values or arrays. In addition, @command{gawk} +provides true arrays of arrays, where any given array element can +itself be an array. Discussion of arrays is delayed until +@ref{Array Manipulation}. + +The various macros listed earlier make it easier to use the elements +of the @code{union} as if they were fields in a @code{struct}; this +is a common coding practice in C. Such code is easier to write and to +read; however, it remains @emph{your} responsibility to make sure that +the @code{val_type} member correctly reflects the type of the value in +the @code{awk_value_t}. + +Conceptually, the first three members of the @code{union} (number, string, +and array) are all that are needed for working with @command{awk} values. +However, since the API provides routines for accessing and changing +the value of global scalar variables only by using the variable's name, +there is a performance penalty: @command{gawk} must find the variable +each time it is accessed or changed. This turns out to be a real issue, +not just a theoretical one. + +Thus, if you know that your extension will spend considerable time +reading and/or changing the value of one or more scalar variables, you +can obtain a @dfn{scalar cookie}@footnote{See +@uref{http://catb.org/jargon/html/C/cookie.html, the ``cookie'' entry in the Jargon file} for a +definition of @dfn{cookie}, and @uref{http://catb.org/jargon/html/M/magic-cookie.html, +the ``magic cookie'' entry in the Jargon file} for a nice example. See +also the entry for ``Cookie'' in the @ref{Glossary}.} +object for that variable, and then use +the cookie for getting the variable's value or for changing the variable's +value. 
+This is the @code{awk_scalar_t} type and @code{scalar_cookie} macro. +Given a scalar cookie, @command{gawk} can directly retrieve or +modify the value, as required, without having to first find it. + +The @code{awk_value_cookie_t} type and @code{value_cookie} macro are similar. +If you know that you wish to +use the same numeric or string @emph{value} for one or more variables, +you can create the value once, retaining a @dfn{value cookie} for it, +and then pass in that value cookie whenever you wish to set the value of a +variable. This saves both storage space within the running @command{gawk} +process and the time needed to create the value. + +@node Requesting Values +@subsection Requesting Values + +All of the functions that return values from @command{gawk} +work in the same way. You pass in an @code{awk_valtype_t} value +to indicate what kind of value you expect. If the actual value +matches what you requested, the function returns true and fills +in the @code{awk_value_t} result. +Otherwise, the function returns false, and the @code{val_type} +member indicates the type of the actual value. You may then +print an error message, or reissue the request for the actual +value type, as appropriate. This behavior is summarized in +@ref{table-value-types-returned}. 
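The request rules can also be written down as a small decision function. The sketch below is a model of the ``Value Types Returned'' table only, not code from @command{gawk}: @code{demo_type_t} and @code{demo_request()} are hypothetical names, and the @code{convertible} flag stands in for ``the string can be converted to a number.'' It assumes, as the table reads, that a request for ``undefined'' always succeeds and reports whatever is actually there.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mirror of the value types; not the real awk_valtype_t. */
typedef enum {
    T_UNDEFINED, T_NUMBER, T_STRING, T_ARRAY, T_SCALAR, T_VALUE_COOKIE
} demo_type_t;

/*
 * Model of the table: returns 1 (true) if a request for `wanted'
 * succeeds against a value whose real type is `actual'.  On success
 * *seen is the type handed back; on failure *seen is the actual type,
 * so that the caller can print a message or reissue the request.
 */
static int demo_request(demo_type_t wanted, demo_type_t actual,
                        int convertible, demo_type_t *seen)
{
    assert(seen != NULL);
    *seen = actual;     /* default: report what is really there */

    switch (wanted) {
    case T_UNDEFINED:   /* "give me whatever it is" */
        return 1;
    case T_STRING:      /* numbers are converted to strings */
        if (actual == T_STRING || actual == T_NUMBER) {
            *seen = T_STRING;
            return 1;
        }
        return 0;
    case T_NUMBER:      /* strings qualify only if convertible */
        if (actual == T_NUMBER
            || (actual == T_STRING && convertible)) {
            *seen = T_NUMBER;
            return 1;
        }
        return 0;
    case T_ARRAY:
        return actual == T_ARRAY;
    case T_SCALAR:      /* any string or number is a scalar */
        if (actual == T_STRING || actual == T_NUMBER) {
            *seen = T_SCALAR;
            return 1;
        }
        return 0;
    case T_VALUE_COOKIE: /* never returned by a request */
    default:
        return 0;
    }
}
```

For instance, requesting an array when the variable holds a string yields false with @code{*seen} set to the string type, which is exactly the information needed for a message such as ``scalar passed where array expected.''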
+ +@ifnotplaintext +@float Table,table-value-types-returned +@caption{Value Types Returned} +@multitable @columnfractions .50 .50 +@headitem @tab Type of Actual Value: +@end multitable +@multitable @columnfractions .166 .166 .198 .15 .15 .166 +@headitem @tab @tab String @tab Number @tab Array @tab Undefined +@item @tab @b{String} @tab String @tab String @tab false @tab false +@item @tab @b{Number} @tab Number if can be converted, else false @tab Number @tab false @tab false +@item @b{Type} @tab @b{Array} @tab false @tab false @tab Array @tab false +@item @b{Requested:} @tab @b{Scalar} @tab Scalar @tab Scalar @tab false @tab false +@item @tab @b{Undefined} @tab String @tab Number @tab Array @tab Undefined +@item @tab @b{Value Cookie} @tab false @tab false @tab false @tab false +@end multitable +@end float +@end ifnotplaintext +@ifplaintext +@float Table,table-value-types-returned +@caption{Value Types Returned} +@example + +-------------------------------------------------+ + | Type of Actual Value: | + +------------+------------+-----------+-----------+ + | String | Number | Array | Undefined | ++-----------+-----------+------------+------------+-----------+-----------+ +| | String | String | String | false | false | +| |-----------+------------+------------+-----------+-----------+ +| | Number | Number if | Number | false | false | +| | | can be | | | | +| | | converted, | | | | +| | | else false | | | | +| |-----------+------------+------------+-----------+-----------+ +| Type | Array | false | false | Array | false | +| Requested |-----------+------------+------------+-----------+-----------+ +| | Scalar | Scalar | Scalar | false | false | +| |-----------+------------+------------+-----------+-----------+ +| | Undefined | String | Number | Array | Undefined | +| |-----------+------------+------------+-----------+-----------+ +| | Value | false | false | false | false | +| | Cookie | | | | | 
++-----------+-----------+------------+------------+-----------+-----------+ +@end example +@end float +@end ifplaintext + +@node Constructor Functions +@subsection Constructor Functions and Convenience Macros + +The API provides a number of @dfn{constructor} functions for creating +string and numeric values, as well as a number of convenience macros. +This @value{SUBSECTION} presents them all as function prototypes, in +the way that extension code would use them. + +@table @code +@item static inline awk_value_t * +@itemx make_const_string(const char *string, size_t length, awk_value_t *result) +This function creates a string value in the @code{awk_value_t} variable +pointed to by @code{result}. It expects @code{string} to be a C string constant +(or other string data), and automatically creates a @emph{copy} of the data +for storage in @code{result}. It returns @code{result}. + +@item static inline awk_value_t * +@itemx make_malloced_string(const char *string, size_t length, awk_value_t *result) +This function creates a string value in the @code{awk_value_t} variable +pointed to by @code{result}. It expects @code{string} to be a @samp{char *} +value pointing to data previously obtained from @code{malloc()}. The idea here +is that the data is passed directly to @command{gawk}, which assumes +responsibility for it. It returns @code{result}. + +@item static inline awk_value_t * +@itemx make_null_string(awk_value_t *result) +This specialized function creates a null string (the ``undefined'' value) +in the @code{awk_value_t} variable pointed to by @code{result}. +It returns @code{result}. + +@item static inline awk_value_t * +@itemx make_number(double num, awk_value_t *result) +This function simply creates a numeric value in the @code{awk_value_t} variable +pointed to by @code{result}. +@end table + +Two convenience macros may be used for allocating storage from @code{malloc()} +and @code{realloc()}. 
If the allocation fails, they cause @command{gawk} to +exit with a fatal error message. They should be used as if they were +procedure calls that do not return a value. + +@table @code +@item emalloc(pointer, type, size, message) +The arguments to this macro are as follows: +@c nested table +@table @code +@item pointer +The pointer variable to point at the allocated storage. + +@item type +The type of the pointer variable, used to create a cast for the call to @code{malloc()}. + +@item size +The total number of bytes to be allocated. + +@item message +A message to be prefixed to the fatal error message. Typically this is the name +of the function using the macro. +@end table + +@noindent +For example, you might allocate a string value like so: + +@example +awk_value_t result; +char *message; +const char greet[] = "Don't Panic!"; + +emalloc(message, char *, sizeof(greet), "myfunc"); +strcpy(message, greet); +make_malloced_string(message, strlen(message), & result); +@end example + +@item erealloc(pointer, type, size, message) +This is like @code{emalloc()}, but it calls @code{realloc()}, +instead of @code{malloc()}. +The arguments are the same as for the @code{emalloc()} macro. +@end table + +@node Registration Functions +@subsection Registration Functions + +This @value{SECTION} describes the API functions for +registering parts of your extension with @command{gawk}. + +@menu +* Extension Functions:: Registering extension functions. +* Exit Callback Functions:: Registering an exit callback. +* Extension Version String:: Registering a version string. +* Input Parsers:: Registering an input parser. +* Output Wrappers:: Registering an output wrapper. +* Two-way processors:: Registering a two-way processor. 
+@end menu + +@node Extension Functions +@subsubsection Registering An Extension Function + +Extension functions are described by the following record: + +@example +typedef struct @{ +@ @ @ @ const char *name; +@ @ @ @ awk_value_t *(*function)(int num_actual_args, awk_value_t *result); +@ @ @ @ size_t num_expected_args; +@} awk_ext_func_t; +@end example + +The fields are: + +@table @code +@item const char *name; +The name of the new function. +@command{awk}-level code calls the function by this name. +This is a regular C string. + +@item awk_value_t *(*function)(int num_actual_args, awk_value_t *result); +This is a pointer to the C function that provides the desired +functionality. +The function must fill in the value pointed to by @code{result} with either a number +or a string. @command{gawk} takes ownership of any string memory. +As mentioned earlier, string memory @strong{must} come from @code{malloc()}. + +The function must return the value of @code{result}. +This is for the convenience of the calling code inside @command{gawk}. + +@item size_t num_expected_args; +This is the number of arguments the function expects to receive. +Each extension function may decide what to do if the number of +arguments isn't what it expected. As with @command{awk} functions, it +is likely OK to ignore extra arguments. +@end table + +Once you have a record representing your extension function, you register +it with @command{gawk} using this API function: + +@table @code +@item awk_bool_t add_ext_func(const char *namespace, const awk_ext_func_t *func); +This function returns true upon success, false otherwise. +The @code{namespace} parameter is currently not used; you should pass in an +empty string (@code{""}). The @code{func} pointer is the address of a +@code{struct} representing your function, as just described. +@end table + +@node Exit Callback Functions +@subsubsection Registering An Exit Callback Function + +An @dfn{exit callback} function is a function that +@command{gawk} calls before it exits. 
+Such functions are useful if you have general ``clean up'' tasks +that should be performed in your extension (such as closing data +base connections or other resource deallocations). +You can register such +a function with @command{gawk} using the following function. + +@table @code +@item void awk_atexit(void (*funcp)(void *data, int exit_status), +@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ void *arg0); +The parameters are: +@c nested table +@table @code +@item funcp +A pointer to the function to be called before @command{gawk} exits. The @code{data} +parameter will be the original value of @code{arg0}. +The @code{exit_status} parameter is +the exit status value that @command{gawk} will pass to the @code{exit()} system call. + +@item arg0 +A pointer to private data which @command{gawk} saves in order to pass to +the function pointed to by @code{funcp}. +@end table +@end table + +Exit callback functions are called in Last-In-First-Out (LIFO) order---that is, in +the reverse order in which they are registered with @command{gawk}. + +@node Extension Version String +@subsubsection Registering An Extension Version String + +You can register a version string which indicates the name and +version of your extension, with @command{gawk}, as follows: + +@table @code +@item void register_ext_version(const char *version); +Register the string pointed to by @code{version} with @command{gawk}. +@command{gawk} does @emph{not} copy the @code{version} string, so +it should not be changed. +@end table + +@command{gawk} prints all registered extension version strings when it +is invoked with the @option{--version} option. + +@node Input Parsers +@subsubsection Customized Input Parsers + +By default, @command{gawk} reads text files as its input. It uses the value +of @code{RS} to find the end of the record, and then uses @code{FS} +(or @code{FIELDWIDTHS}) to split it into fields (@pxref{Reading Files}). +Additionally, it sets the value of @code{RT} (@pxref{Built-in Variables}). 
+ +If you want, you can provide your own custom input parser. An input +parser's job is to return a record to the @command{gawk} record processing +code, along with indicators for the value and length of the data to be +used for @code{RT}, if any. + +To provide an input parser, you must first provide two functions +(where @var{XXX} is a prefix name for your extension): + +@table @code +@item awk_bool_t @var{XXX}_can_take_file(const awk_input_buf_t *iobuf) +This function examines the information available in @code{iobuf} +(which we discuss shortly). Based on the information there, it +decides if the input parser should be used for this file. +If so, it should return true. Otherwise, it should return false. +It should not change any state (variable values, etc.) within @command{gawk}. + +@item awk_bool_t @var{XXX}_take_control_of(awk_input_buf_t *iobuf) +When @command{gawk} decides to hand control of the file over to the +input parser, it calls this function. This function in turn must fill +in certain fields in the @code{awk_input_buf_t} structure, and ensure +that certain conditions are true. It should then return true. If an +error of some kind occurs, it should not fill in any fields, and should +return false; then @command{gawk} will not use the input parser. +The details are presented shortly. +@end table + +Your extension should package these functions inside an +@code{awk_input_parser_t}, which looks like this: + +@example +typedef struct input_parser @{ + const char *name; /* name of parser */ + awk_bool_t (*can_take_file)(const awk_input_buf_t *iobuf); + awk_bool_t (*take_control_of)(awk_input_buf_t *iobuf); + awk_const struct input_parser *awk_const next; /* for use by gawk */ +@} awk_input_parser_t; +@end example + +The fields are: + +@table @code +@item const char *name; +The name of the input parser. This is a regular C string. 
+ +@item awk_bool_t (*can_take_file)(const awk_input_buf_t *iobuf); +A pointer to your @code{@var{XXX}_can_take_file()} function. + +@item awk_bool_t (*take_control_of)(awk_input_buf_t *iobuf); +A pointer to your @code{@var{XXX}_take_control_of()} function. + +@item awk_const struct input_parser *awk_const next; +This pointer is used by @command{gawk}. +The extension cannot modify it. +@end table + +The steps are as follows: + +@enumerate +@item +Create a @code{static awk_input_parser_t} variable and initialize it +appropriately. + +@item +When your extension is loaded, register your input parser with +@command{gawk} using the @code{register_input_parser()} API function +(described below). +@end enumerate + +An @code{awk_input_buf_t} looks like this: + +@example +typedef struct awk_input @{ + const char *name; /* filename */ + int fd; /* file descriptor */ +#define INVALID_HANDLE (-1) + void *opaque; /* private data for input parsers */ + int (*get_record)(char **out, struct awk_input *iobuf, + int *errcode, char **rt_start, size_t *rt_len); + void (*close_func)(struct awk_input *iobuf); + struct stat sbuf; /* stat buf */ +@} awk_input_buf_t; +@end example + +The fields can be divided into two categories: those for use (initially, +at least) by @code{@var{XXX}_can_take_file()}, and those for use by +@code{@var{XXX}_take_control_of()}. The first group of fields and their uses +are as follows: + +@table @code +@item const char *name; +The name of the file. + +@item int fd; +A file descriptor for the file. If @command{gawk} was able to +open the file, then @code{fd} will @emph{not} be equal to +@code{INVALID_HANDLE}. Otherwise, it will be. + +@item struct stat sbuf; +If the file descriptor is valid, then @command{gawk} will have filled +in this structure via a call to the @code{fstat()} system call. +@end table + +The @code{@var{XXX}_can_take_file()} function should examine these +fields and decide if the input parser should be used for the file. 
+The decision can be made based upon @command{gawk} state (the value +of a variable defined previously by the extension and set by +@command{awk} code), the name of the +file, whether or not the file descriptor is valid, the information +in the @code{struct stat}, or any combination of the above. + +Once @code{@var{XXX}_can_take_file()} has returned true, and +@command{gawk} has decided to use your input parser, it calls +@code{@var{XXX}_take_control_of()}. That function then fills in at +least the @code{get_record} field of the @code{awk_input_buf_t}. It must +also ensure that @code{fd} is not set to @code{INVALID_HANDLE}. All of +the fields that may be filled by @code{@var{XXX}_take_control_of()} +are as follows: + +@table @code +@item void *opaque; +This is used to hold any state information needed by the input parser +for this file. It is ``opaque'' to @command{gawk}. The input parser +is not required to use this pointer. + +@item int@ (*get_record)(char@ **out, +@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ struct@ awk_input *iobuf, +@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ int *errcode, +@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ char **rt_start, +@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ size_t *rt_len); +This function pointer should point to a function that creates the input +records. Said function is the core of the input parser. Its behavior +is described below. + +@item void (*close_func)(struct awk_input *iobuf); +This function pointer should point to a function that does +the ``tear down.'' It should release any resources allocated by +@code{@var{XXX}_take_control_of()}. It may also close the file. If it +does so, it should set the @code{fd} field to @code{INVALID_HANDLE}. + +If @code{fd} is still not @code{INVALID_HANDLE} after the call to this +function, @command{gawk} calls the regular @code{close()} system call. + +Having a ``tear down'' function is optional. If your input parser does +not need it, do not set this field. 
Then, @command{gawk} calls the +regular @code{close()} system call on the file descriptor, so it should +be valid. +@end table + +The @code{@var{XXX}_get_record()} function does the work of creating +input records. The parameters are as follows: + +@table @code +@item char **out +This is a pointer to a @code{char *} variable which is set to point +to the record. @command{gawk} makes its own copy of the data, so +the extension must manage this storage. + +@item struct awk_input *iobuf +This is the @code{awk_input_buf_t} for the file. The fields should be +used for reading data (@code{fd}) and for managing private state +(@code{opaque}), if any. + +@item int *errcode +If an error occurs, @code{*errcode} should be set to an appropriate +code from @code{<errno.h>}. + +@item char **rt_start +@itemx size_t *rt_len +If the concept of a ``record terminator'' makes sense, then +@code{*rt_start} should be set to point to the data to be used for +@code{RT}, and @code{*rt_len} should be set to the length of the +data. Otherwise, @code{*rt_len} should be set to zero. +@code{gawk} makes its own copy of this data, so the +extension must manage the storage. +@end table + +The return value is the length of the buffer pointed to by +@code{*out}, or @code{EOF} if end-of-file was reached or an +error occurred. + +It is guaranteed that @code{errcode} is a valid pointer, so there is no +need to test for a @code{NULL} value. @command{gawk} sets @code{*errcode} +to zero, so there is no need to set it unless an error occurs. + +If an error does occur, the function should return @code{EOF} and set +@code{*errcode} to a non-zero value. In that case, if @code{*errcode} +does not equal @minus{}1, @command{gawk} automatically updates +the @code{ERRNO} variable based on the value of @code{*errcode} (e.g., +setting @samp{*errcode = errno} should do the right thing). 
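As an illustration of this contract, here is a minimal sketch of a record reader that returns newline-terminated records. It is not taken from the @command{gawk} sources. To stay self-contained it uses a hypothetical @code{struct demo_input} holding only the @code{fd} and @code{opaque} fields instead of the real @code{awk_input_buf_t}, and the names @code{demo_get_record()} and @code{DEMO_MAX_RECORD} are likewise invented. It assumes a POSIX system for @code{read()}.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>      /* EOF */
#include <unistd.h>     /* read(), ssize_t */

/* Hypothetical stand-in for the awk_input_buf_t fields used here. */
struct demo_input {
    int fd;
    void *opaque;
};

#define DEMO_MAX_RECORD 4096

/* Returns the record length, or EOF at end of file or on error. */
static int demo_get_record(char **out, struct demo_input *iobuf,
                           int *errcode, char **rt_start, size_t *rt_len)
{
    static char buf[DEMO_MAX_RECORD];   /* storage owned by the parser */
    static char newline[] = "\n";
    size_t len = 0;
    ssize_t n;
    char c;

    assert(out != NULL && iobuf != NULL);
    assert(rt_start != NULL && rt_len != NULL);
    *rt_len = 0;                        /* no terminator seen yet */

    while (len < sizeof(buf)) {
        n = read(iobuf->fd, &c, 1);     /* simple, not fast */
        if (n < 0) {
            if (errno == EINTR)
                continue;               /* interrupted: retry */
            *errcode = errno;           /* report the error */
            return EOF;
        }
        if (n == 0)
            break;                      /* end of file */
        if (c == '\n') {
            *rt_start = newline;        /* data to be used for RT */
            *rt_len = 1;
            *out = buf;
            return (int) len;
        }
        buf[len++] = c;
    }

    if (len == 0)
        return EOF;     /* nothing left; *errcode is left untouched */
    *out = buf;         /* last record, or an over-long one cut short */
    return (int) len;
}
```

Because the caller makes its own copies of the record and of the terminator data, the sketch can keep reusing one static buffer. A production parser would read in larger chunks and would not split records longer than the buffer.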
+ +@command{gawk} ships with a sample extension that reads directories, +returning records for each entry in the directory (@pxref{Extension +Sample Readdir}). You may wish to use that code as a guide for writing +your own input parser. + +When writing an input parser, you should think about (and document) +how it is expected to interact with @command{awk} code. You may want +it to always be called, and take effect as appropriate (as the +@code{readdir} extension does). Or you may want it to take effect +based upon the value of an @code{awk} variable, as the XML extension +from the @code{gawkextlib} project does (@pxref{gawkextlib}). +In the latter case, code in a @code{BEGINFILE} section +can look at @code{FILENAME} and @code{ERRNO} to decide whether or +not to activate an input parser (@pxref{BEGINFILE/ENDFILE}). + +You register your input parser with the following function: + +@table @code +@item void register_input_parser(awk_input_parser_t *input_parser); +Register the input parser pointed to by @code{input_parser} with +@command{gawk}. +@end table + +@node Output Wrappers +@subsubsection Customized Output Wrappers + +An @dfn{output wrapper} is the mirror image of an input parser. +It allows an extension to take over the output to a file opened +with the @samp{>} or @samp{>>} operators (@pxref{Redirection}). + +The output wrapper is very similar to the input parser structure: + +@example +typedef struct output_wrapper @{ + const char *name; /* name of the wrapper */ + awk_bool_t (*can_take_file)(const awk_output_buf_t *outbuf); + awk_bool_t (*take_control_of)(awk_output_buf_t *outbuf); + awk_const struct output_wrapper *awk_const next; /* for use by gawk */ +@} awk_output_wrapper_t; +@end example + +The members are as follows: + +@table @code +@item const char *name; +This is the name of the output wrapper. 
+ +@item awk_bool_t (*can_take_file)(const awk_output_buf_t *outbuf); +This points to a function that examines the information in +the @code{awk_output_buf_t} structure pointed to by @code{outbuf}. +It should return true if the output wrapper wants to take over the +file, and false otherwise. It should not change any state (variable +values, etc.) within @command{gawk}. + +@item awk_bool_t (*take_control_of)(awk_output_buf_t *outbuf); +The function pointed to by this field is called when @command{gawk} +decides to let the output wrapper take control of the file. It should +fill in appropriate members of the @code{awk_output_buf_t} structure, +as described below, and return true if successful, false otherwise. + +@item awk_const struct output_wrapper *awk_const next; +This is for use by @command{gawk}. +@end table + +The @code{awk_output_buf_t} structure looks like this: + +@example +typedef struct @{ + const char *name; /* name of output file */ + const char *mode; /* mode argument to fopen */ + FILE *fp; /* stdio file pointer */ + awk_bool_t redirected; /* true if a wrapper is active */ + void *opaque; /* for use by output wrapper */ + size_t (*gawk_fwrite)(const void *buf, size_t size, size_t count, + FILE *fp, void *opaque); + int (*gawk_fflush)(FILE *fp, void *opaque); + int (*gawk_ferror)(FILE *fp, void *opaque); + int (*gawk_fclose)(FILE *fp, void *opaque); +@} awk_output_buf_t; +@end example + +Here too, your extension will define @code{@var{XXX}_can_take_file()} +and @code{@var{XXX}_take_control_of()} functions that examine and update +data members in the @code{awk_output_buf_t}. +The data members are as follows: + +@table @code +@item const char *name; +The name of the output file. + +@item const char *mode; +The mode string (as would be used in the second argument to @code{fopen()}) +with which the file was opened. + +@item FILE *fp; +The @code{FILE} pointer from @code{<stdio.h>}. @command{gawk} opens the file +before attempting to find an output wrapper. 
+ +@item awk_bool_t redirected; +This field must be set to true by the @code{@var{XXX}_take_control_of()} function. + +@item void *opaque; +This pointer is opaque to @command{gawk}. The extension should use it to store +a pointer to any private data associated with the file. + +@item size_t (*gawk_fwrite)(const void *buf, size_t size, size_t count, +@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ FILE *fp, void *opaque); +@itemx int (*gawk_fflush)(FILE *fp, void *opaque); +@itemx int (*gawk_ferror)(FILE *fp, void *opaque); +@itemx int (*gawk_fclose)(FILE *fp, void *opaque); +These pointers should be set to point to functions that perform +the equivalent function as the @code{<stdio.h>} functions do, if appropriate. +@command{gawk} uses these function pointers for all output. +@command{gawk} initializes the pointers to point to internal, ``pass through'' +functions that just call the regular @code{<stdio.h>} functions, so an +extension only needs to redefine those functions that are appropriate for +what it does. +@end table + +The @code{@var{XXX}_can_take_file()} function should make a decision based +upon the @code{name} and @code{mode} fields, and any additional state +(such as @command{awk} variable values) that is appropriate. + +When @command{gawk} calls @code{@var{XXX}_take_control_of()}, it should fill +in the other fields, as appropriate, except for @code{fp}, which it should just +use normally. + +You register your output wrapper with the following function: + +@table @code +@item void register_output_wrapper(awk_output_wrapper_t *output_wrapper); +Register the output wrapper pointed to by @code{output_wrapper} with +@command{gawk}. +@end table + +@node Two-way processors +@subsubsection Customized Two-way Processors + +A @dfn{two-way processor} combines an input parser and an output wrapper for +two-way I/O with the @samp{|&} operator (@pxref{Redirection}). 
It makes identical +use of the @code{awk_input_parser_t} and @code{awk_output_buf_t} structures +as described earlier. + +A two-way processor is represented by the following structure: + +@example +typedef struct two_way_processor @{ + const char *name; /* name of the two-way processor */ + awk_bool_t (*can_take_two_way)(const char *name); + awk_bool_t (*take_control_of)(const char *name, + awk_input_buf_t *inbuf, + awk_output_buf_t *outbuf); + awk_const struct two_way_processor *awk_const next; /* for use by gawk */ +@} awk_two_way_processor_t; +@end example + +The fields are as follows: + +@table @code +@item const char *name; +The name of the two-way processor. + +@item awk_bool_t (*can_take_two_way)(const char *name); +This function returns true if it wants to take over two-way I/O for this filename. +It should not change any state (variable +values, etc.) within @command{gawk}. + +@item awk_bool_t (*take_control_of)(const char *name, +@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ awk_input_buf_t *inbuf, +@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ awk_output_buf_t *outbuf); +This function should fill in the @code{awk_input_buf_t} and +@code{awk_output_buf_t} structures pointed to by @code{inbuf} and +@code{outbuf}, respectively. These structures were described earlier. + +@item awk_const struct two_way_processor *awk_const next; +This is for use by @command{gawk}. +@end table + +As with the input parser and output wrapper, you provide +``yes I can take this'' and ``take over for this'' functions, +@code{@var{XXX}_can_take_two_way()} and @code{@var{XXX}_take_control_of()}. + +You register your two-way processor with the following function: + +@table @code +@item void register_two_way_processor(awk_two_way_processor_t *two_way_processor); +Register the two-way processor pointed to by @code{two_way_processor} with +@command{gawk}.
+@end table + +@node Printing Messages +@subsection Printing Messages + +You can print different kinds of warning messages from your +extension, as described below. Note that for these functions, +you must pass in the extension id received from @command{gawk} +when the extension was loaded.@footnote{Because the API uses only ISO C 90 +features, it cannot make use of the ISO C 99 variadic macro feature to hide +that parameter. More's the pity.} + +@table @code +@item void fatal(awk_ext_id_t id, const char *format, ...); +Print a message and then cause @command{gawk} to exit immediately. + +@item void warning(awk_ext_id_t id, const char *format, ...); +Print a warning message. + +@item void lintwarn(awk_ext_id_t id, const char *format, ...); +Print a ``lint warning.'' Normally this is the same as printing a +warning message, but if @command{gawk} was invoked with @samp{--lint=fatal}, +then lint warnings become fatal error messages. +@end table + +All of these functions are otherwise like the C @code{printf()} +family of functions, where the @code{format} parameter is a string +with literal characters and formatting codes intermixed. + +@node Updating @code{ERRNO} +@subsection Updating @code{ERRNO} + +The following functions allow you to update the @code{ERRNO} +variable: + +@table @code +@item void update_ERRNO_int(int errno_val); +Set @code{ERRNO} to the string equivalent of the error code +in @code{errno_val}. The value should be one of the defined +error codes in @code{<errno.h>}, and @command{gawk} turns it +into a (possibly translated) string using the C @code{strerror()} function. + +@item void update_ERRNO_string(const char *string); +Set @code{ERRNO} directly to the value of @code{string}. +@command{gawk} makes a copy of the value of @code{string}. + +@item void unset_ERRNO(); +Unset @code{ERRNO}.
+@end table + +@node Accessing Parameters +@subsection Accessing and Updating Parameters + +Two functions give you access to the arguments (parameters) +passed to your extension function. They are: + +@table @code +@item awk_bool_t get_argument(size_t count, +@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ awk_valtype_t wanted, +@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ awk_value_t *result); +Fill in the @code{awk_value_t} structure pointed to by @code{result} +with the @code{count}'th argument. Return true if the actual +type matches @code{wanted}, false otherwise. In the latter +case, @code{result@w{->}val_type} indicates the actual type +(@pxref{table-value-types-returned}). Counts are zero based---the first +argument is numbered zero, the second one, and so on. @code{wanted} +indicates the type of value expected. + +@item awk_bool_t set_argument(size_t count, awk_array_t array); +Convert a parameter that was undefined into an array; this provides +call-by-reference for arrays. Return false if @code{count} is too big, +or if the argument's type is not undefined. @xref{Array Manipulation}, +for more information on creating arrays. +@end table + +@node Symbol Table Access +@subsection Symbol Table Access + +Two sets of routines provide access to global variables, and one set +allows you to create and release cached values. + +@menu +* Symbol table by name:: Accessing variables by name. +* Symbol table by cookie:: Accessing variables by ``cookie''. +* Cached values:: Creating and using cached values. +@end menu + +@node Symbol table by name +@subsubsection Variable Access and Update by Name + +The following routines provide the ability to access and update +global @command{awk}-level variables by name. In compiler terminology, +identifiers of different kinds are termed @dfn{symbols}, thus the ``sym'' +in the routines' names. The data structure which stores information +about symbols is termed a @dfn{symbol table}. 
+ +@table @code +@item awk_bool_t sym_lookup(const char *name, +@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ awk_valtype_t wanted, +@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ awk_value_t *result); +Fill in the @code{awk_value_t} structure pointed to by @code{result} +with the value of the variable named by the string @code{name}, which is +a regular C string. @code{wanted} indicates the type of value expected. +Return true if the actual type matches @code{wanted}, false otherwise. +In the latter case, @code{result->val_type} indicates the actual type +(@pxref{table-value-types-returned}). + +@item awk_bool_t sym_update(const char *name, awk_value_t *value); +Update the variable named by the string @code{name}, which is a regular +C string. The variable is added to @command{gawk}'s symbol table +if it is not there. Return true if everything worked, false otherwise. + +Changing types (scalar to array or vice versa) of an existing variable +is @emph{not} allowed, nor may this routine be used to update an array. +This routine cannot be used to update any of the predefined +variables (such as @code{ARGC} or @code{NF}). + +@item awk_bool_t sym_constant(const char *name, awk_value_t *value); +Create a variable named by the string @code{name}, which is +a regular C string, that has the constant value as given by +@code{value}. @command{awk}-level code cannot change the value of this +variable.@footnote{There (currently) is no @code{awk}-level feature that +provides this ability.} The extension may change the value of @code{name}'s +variable with subsequent calls to this routine, and may also convert +a variable created by @code{sym_update()} into a constant. However, +once a variable becomes a constant, it cannot later be reverted into a +mutable variable. +@end table + +@node Symbol table by cookie +@subsubsection Variable Access and Update by Cookie + +A @dfn{scalar cookie} is an opaque handle that provides access +to a global variable or array.
It is an optimization that +avoids looking up variables in @command{gawk}'s symbol table every time +access is needed. This was discussed earlier, in @ref{General Data Types}. + +The following functions let you work with scalar cookies. + +@table @code +@item awk_bool_t sym_lookup_scalar(awk_scalar_t cookie, +@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ awk_valtype_t wanted, +@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ awk_value_t *result); +Retrieve the current value of a scalar cookie. +Once you have obtained a scalar cookie using @code{sym_lookup()}, you can +use this function to get its value more efficiently. +Return false if the value cannot be retrieved. + +@item awk_bool_t sym_update_scalar(awk_scalar_t cookie, awk_value_t *value); +Update the value associated with a scalar cookie. Return false if +the new value is not one of @code{AWK_STRING} or @code{AWK_NUMBER}. +Here too, the built-in variables may not be updated. +@end table + +It is not obvious at first glance how to work with scalar cookies or +what their @i{raison d@^etre} really is. In theory, the @code{sym_lookup()} +and @code{sym_update()} routines are all you really need to work with +variables. For example, you might have code that looked up the value of +a variable, evaluated a condition, and then possibly changed the value +of the variable based on the result of that evaluation, like so: + +@example +/* do_magic --- do something really great */ + +static awk_value_t * +do_magic(int nargs, awk_value_t *result) +@{ + awk_value_t value; + + if ( sym_lookup("MAGIC_VAR", AWK_NUMBER, & value) + && some_condition(value.num_value)) @{ + value.num_value += 42; + sym_update("MAGIC_VAR", & value); + @} + + return make_number(0.0, result); +@} +@end example + +@noindent +This code looks (and is) simple and straightforward. So what's the problem?
+ +Consider what happens if @command{awk}-level code associated with your +extension calls the @code{magic()} function (implemented in C by @code{do_magic()}), +once per record, while processing hundreds of thousands or millions of records. +The @code{MAGIC_VAR} variable is looked up in the symbol table once or twice per function call! + +The symbol table lookup is really pure overhead; it is considerably more efficient +to get a cookie that represents the variable, and use that to get the variable's +value and update it as needed.@footnote{The difference is measurable and quite real. Trust us.} + +Thus, the way to use cookies is as follows. First, install your extension's variable +in @command{gawk}'s symbol table using @code{sym_update()}, as usual. Then get a +scalar cookie for the variable using @code{sym_lookup()}: + +@example +static awk_scalar_t magic_var_cookie; /* cookie for MAGIC_VAR */ + +static void +my_extension_init() +@{ + awk_value_t value; + + /* install initial value */ + sym_update("MAGIC_VAR", make_number(42.0, & value)); + + /* get cookie */ + sym_lookup("MAGIC_VAR", AWK_SCALAR, & value); + + /* save the cookie */ + magic_var_cookie = value.scalar_cookie; + @dots{} +@} +@end example + +Next, use the routines in this section for retrieving and updating +the value through the cookie. Thus, @code{do_magic()} now becomes +something like this: + +@example +/* do_magic --- do something really great */ + +static awk_value_t * +do_magic(int nargs, awk_value_t *result) +@{ + awk_value_t value; + + if ( sym_lookup_scalar(magic_var_cookie, AWK_NUMBER, & value) + && some_condition(value.num_value)) @{ + value.num_value += 42; + sym_update_scalar(magic_var_cookie, & value); + @} + @dots{} + + return make_number(0.0, result); +@} +@end example + +@quotation NOTE +The previous code omitted error checking for +presentation purposes. Your extension code should be more robust +and carefully check the return values from the API functions. 
+@end quotation + +@node Cached values +@subsubsection Creating and Using Cached Values + +The routines in this section allow you to create and release +cached values. As with scalar cookies, in theory, cached values +are not necessary. You can create numbers and strings using +the functions in @ref{Constructor Functions}. You can then +assign those values to variables using @code{sym_update()} +or @code{sym_update_scalar()}, as you like. + +However, you can understand the point of cached values if you remember that +@emph{every} string value's storage @emph{must} come from @code{malloc()}. +If you have 20 variables, all of which have the same string value, you +must create 20 identical copies of the string.@footnote{Numeric values +are clearly less problematic, requiring only a C @code{double} to store.} + +It is clearly more efficient, if possible, to create a value once, and +then tell @command{gawk} to reuse the value for multiple variables. That +is what the routines in this section let you do. The functions are as follows: + +@table @code +@item awk_bool_t create_value(awk_value_t *value, awk_value_cookie_t *result); +Create a cached string or numeric value from @code{value} for efficient later +assignment. +Only @code{AWK_NUMBER} and @code{AWK_STRING} values are allowed. Any other type +is rejected. While @code{AWK_UNDEFINED} could be allowed, doing so would +result in inferior performance. + +@item awk_bool_t release_value(awk_value_cookie_t vc); +Release the memory associated with a value cookie obtained +from @code{create_value()}. +@end table + +You use value cookies in a fashion similar to the way you use scalar cookies. 
+In the extension initialization routine, you create the value cookie: + +@example +static awk_value_cookie_t answer_cookie; /* static value cookie */ + +static void +my_extension_init() +@{ + awk_value_t value; + char *long_string; + size_t long_string_len; + + /* code from earlier */ + @dots{} + /* @dots{} fill in long_string and long_string_len @dots{} */ + make_malloced_string(long_string, long_string_len, & value); + create_value(& value, & answer_cookie); /* create cookie */ + @dots{} +@} +@end example + +Once the value is created, you can use it as the value of any number +of variables: + +@example +static awk_value_t * +do_magic(int nargs, awk_value_t *result) +@{ + awk_value_t value; + + @dots{} /* as earlier */ + + value.val_type = AWK_VALUE_COOKIE; + value.value_cookie = answer_cookie; + sym_update("VAR1", & value); + sym_update("VAR2", & value); + @dots{} + sym_update("VAR100", & value); + @dots{} +@} +@end example + +@noindent +Using value cookies in this way saves considerable storage, since all of +@code{VAR1} through @code{VAR100} share the same value. + +You might be wondering, ``Is this sharing problematic? +What happens if @command{awk} code assigns a new value to @code{VAR1}, +are all the others changed too?'' + +That's a great question. The answer is that no, it's not a problem. +Internally, @command{gawk} uses reference-counted strings. This means +that many variables can share the same string, and @command{gawk} +keeps track of the usage. When a variable's value changes, @command{gawk} +simply decrements the reference count on the old value and updates +the variable to use the new value. + +Finally, as part of your clean up action (@pxref{Exit Callback Functions}) +you should release any cached values that you created, using +@code{release_value()}.
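The sharing behavior described above can be pictured with a small, self-contained reference-counting sketch. This is not @command{gawk}'s internal code: @code{rc_string_t}, @code{rc_new()}, @code{rc_ref()}, @code{rc_unref()}, and @code{rc_assign()} are hypothetical names invented purely for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical reference-counted string; not gawk's real representation. */
typedef struct rc_string {
    char  *str;
    size_t len;
    int    refs;    /* how many holders currently share this value */
} rc_string_t;

/* rc_new: create a value with one reference; returns NULL on failure */
static rc_string_t *
rc_new(const char *s)
{
    rc_string_t *v = malloc(sizeof(*v));

    if (v == NULL)
        return NULL;
    v->len = strlen(s);
    v->str = malloc(v->len + 1);
    if (v->str == NULL) {
        free(v);
        return NULL;
    }
    memcpy(v->str, s, v->len + 1);
    v->refs = 1;
    return v;
}

/* rc_ref: share an existing value with one more holder */
static rc_string_t *
rc_ref(rc_string_t *v)
{
    assert(v != NULL);
    v->refs++;
    return v;
}

/* rc_unref: drop one reference; free the storage when none remain */
static void
rc_unref(rc_string_t *v)
{
    assert(v != NULL && v->refs > 0);
    if (--v->refs == 0) {
        free(v->str);
        free(v);
    }
}

/* rc_assign: point *var at new_val, releasing whatever it held before */
static void
rc_assign(rc_string_t **var, rc_string_t *new_val)
{
    rc_string_t *old = *var;

    /* take the new reference first, in case old == new_val */
    *var = rc_ref(new_val);
    if (old != NULL)
        rc_unref(old);
}
```

In this sketch, assigning a new value to one variable only lowers the count on the old shared string; the other variables keep pointing at it, which is the property that makes sharing safe.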
+ +@node Array Manipulation +@subsection Array Manipulation + +The primary data structure@footnote{Okay, the only data structure.} in @command{awk} +is the associative array (@pxref{Arrays}). +Extensions need to be able to manipulate @command{awk} arrays. +The API provides a number of data structures for working with arrays, +functions for working with individual elements, and functions for +working with arrays as a whole. This includes the ability to +``flatten'' an array so that it is easy for C code to traverse +every element in an array. The array data structures integrate +nicely with the data structures for values to make it easy to +both work with and create true arrays of arrays (@pxref{General Data Types}). + +@menu +* Array Data Types:: Data types for working with arrays. +* Array Functions:: Functions for working with arrays. +* Flattening Arrays:: How to flatten arrays. +* Creating Arrays:: How to create and populate arrays. +@end menu + +@node Array Data Types +@subsubsection Array Data Types + +The data types associated with arrays are listed below. + +@table @code +@item typedef void *awk_array_t; +If you request the value of an array variable, you get back an +@code{awk_array_t} value. This value is opaque@footnote{It is also +a ``cookie,'' but the @command{gawk} developers did not wish to overuse this +term.} to the extension; it uniquely identifies the array but can +only be used by passing it into API functions or receiving it from API +functions. This is very similar to the way @samp{FILE *} values are used +with the @code{<stdio.h>} library routines.
+ +@item typedef struct awk_element @{ +@itemx @ @ @ @ /* convenience linked list pointer, not used by gawk */ +@itemx @ @ @ @ struct awk_element *next; +@itemx @ @ @ @ enum @{ +@itemx @ @ @ @ @ @ @ @ AWK_ELEMENT_DEFAULT = 0,@ @ /* set by gawk */ +@itemx @ @ @ @ @ @ @ @ AWK_ELEMENT_DELETE = 1@ @ @ @ /* set by extension if should be deleted */ +@itemx @ @ @ @ @} flags; +@itemx @ @ @ @ awk_value_t index; +@itemx @ @ @ @ awk_value_t value; +@itemx @} awk_element_t; +The @code{awk_element_t} is a ``flattened'' +array element. @command{awk} produces an array of these +inside the @code{awk_flat_array_t} (see the next item). +Individual elements may be marked for deletion. New elements must be added +individually, one at a time, using the separate API for that purpose. +The fields are as follows: + +@c nested table +@table @code +@item struct awk_element *next; +This pointer is for the convenience of extension writers. It allows +an extension to create a linked list of new elements that can then be +added to an array in a loop that traverses the list. + +@item enum @{ @dots{} @} flags; +A set of flag values that convey information between @command{gawk} +and the extension. Currently there is only one: @code{AWK_ELEMENT_DELETE}. +Setting it causes @command{gawk} to delete the +element from the original array upon release of the flattened array. + +@item index +@itemx value +The index and value of the element, respectively. +@emph{All} memory pointed to by @code{index} and @code{value} belongs to @command{gawk}. +@end table + +@item typedef struct awk_flat_array @{ +@itemx @ @ @ @ awk_const void *awk_const opaque1;@ @ @ @ /* private data for use by gawk */ +@itemx @ @ @ @ awk_const void *awk_const opaque2;@ @ @ @ /* private data for use by gawk */ +@itemx @ @ @ @ awk_const size_t count;@ @ @ @ @ /* how many elements */ +@itemx @ @ @ @ awk_element_t elements[1];@ @ /* will be extended */ +@itemx @} awk_flat_array_t; +This is a flattened array. 
When an extension gets one of these +from @command{gawk}, the @code{elements} array is of actual +size @code{count}. +The @code{opaque1} and @code{opaque2} pointers are for use by @command{gawk}; +therefore they are marked @code{awk_const} so that the extension cannot +modify them. +@end table + +@node Array Functions +@subsubsection Array Functions + +The following functions relate to individual array elements. + +@table @code +@item awk_bool_t get_element_count(awk_array_t a_cookie, size_t *count); +For the array represented by @code{a_cookie}, return in @code{*count} +the number of elements it contains. A subarray counts as a single element. +Return false if there is an error. + +@item awk_bool_t get_array_element(awk_array_t a_cookie, +@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ const awk_value_t *const index, +@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ awk_valtype_t wanted, +@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ awk_value_t *result); +For the array represented by @code{a_cookie}, return in @code{*result} +the value of the element whose index is @code{index}. +@code{wanted} specifies the type of value you wish to retrieve. +Return false if @code{wanted} does not match the actual type or if +@code{index} is not in the array (@pxref{table-value-types-returned}). + +The value for @code{index} can be numeric, in which case @command{gawk} +converts it to a string. Using non-integral values is possible, but +requires that you understand how such values are converted to strings +(@pxref{Conversion}); thus using integral values is safest. + +As with @emph{all} strings passed into @code{gawk} from an extension, +the string value of @code{index} must come from @code{malloc()}, and +@command{gawk} releases the storage. 
+ +@item awk_bool_t set_array_element(awk_array_t a_cookie, +@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ const@ awk_value_t *const index, +@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ const@ awk_value_t *const value); +In the array represented by @code{a_cookie}, create or modify +the element whose index is given by @code{index}. +The @code{ARGV} and @code{ENVIRON} arrays may not be changed. + +@item awk_bool_t set_array_element_by_elem(awk_array_t a_cookie, +@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ awk_element_t element); +Like @code{set_array_element()}, but take the @code{index} and @code{value} +from @code{element}. This is a convenience macro. + +@item awk_bool_t del_array_element(awk_array_t a_cookie, +@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ const awk_value_t* const index); +Remove the element with the given index from the array +represented by @code{a_cookie}. +Return true if the element was removed, or false if the element did +not exist in the array. +@end table + +The following functions relate to arrays as a whole: + +@table @code +@item awk_array_t create_array(); +Create a new array to which elements may be added. +@xref{Creating Arrays}, for a discussion of how to +create a new array and add elements to it. + +@item awk_bool_t clear_array(awk_array_t a_cookie); +Clear the array represented by @code{a_cookie}. +Return false if there was some kind of problem, true otherwise. +The array remains an array, but after calling this function, it +has no elements. This is equivalent to using the @code{delete} +statement (@pxref{Delete}). + +@item awk_bool_t flatten_array(awk_array_t a_cookie, awk_flat_array_t **data); +For the array represented by @code{a_cookie}, create an @code{awk_flat_array_t} +structure and fill it in. Set the pointer whose address is passed as @code{data} +to point to this structure. +Return true upon success, or false otherwise. 
+@xref{Flattening Arrays}, for a discussion of how to +flatten an array and work with it. + +@item awk_bool_t release_flattened_array(awk_array_t a_cookie, +@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ awk_flat_array_t *data); +When done with a flattened array, release the storage using this function. +You must pass in both the original array cookie, and the address of +the created @code{awk_flat_array_t} structure. +The function returns true upon success, false otherwise. +@end table + +@node Flattening Arrays +@subsubsection Working With All The Elements of an Array + +To @dfn{flatten} an array is to create a structure that +represents the full array in a fashion that makes it easy +for C code to traverse the entire array. Test code +in @file{extension/testext.c} does this, and also serves +as a nice example to show how to use the APIs. + +First, the @command{gawk} script that drives the test extension: + +@example +@@load "testext" +BEGIN @{ + n = split("blacky rusty sophie raincloud lucky", pets) + printf("pets has %d elements\n", length(pets)) + ret = dump_array_and_delete("pets", "3") + printf("dump_array_and_delete(pets) returned %d\n", ret) + if ("3" in pets) + printf("dump_array_and_delete() did NOT remove index \"3\"!\n") + else + printf("dump_array_and_delete() did remove index \"3\"!\n") + print "" +@} +@end example + +@noindent +This code creates an array with @code{split()} (@pxref{String Functions}) +and then calls @code{dump_array_and_delete()}. That function looks up +the array whose name is passed as the first argument, and +deletes the element at the index passed in the second argument. +It then prints the return value and checks if the element +was indeed deleted. Here is the C code that implements +@code{dump_array_and_delete()}. It has been edited slightly for +presentation.
+ +The first part declares variables, sets up the default +return value in @code{result}, and checks that the function +was called with the correct number of arguments: + +@example +static awk_value_t * +dump_array_and_delete(int nargs, awk_value_t *result) +@{ + awk_value_t value, value2, value3; + awk_flat_array_t *flat_array; + size_t count; + char *name; + int i; + + assert(result != NULL); + make_number(0.0, result); + + if (nargs != 2) @{ + printf("dump_array_and_delete: nargs not right " + "(%d should be 2)\n", nargs); + goto out; + @} +@end example + +The function then proceeds in steps, as follows. First, retrieve +the name of the array, passed as the first argument. Then +retrieve the array itself. If either operation fails, print +error messages and return: + +@example + /* get argument named array as flat array and print it */ + if (get_argument(0, AWK_STRING, & value)) @{ + name = value.str_value.str; + if (sym_lookup(name, AWK_ARRAY, & value2)) + printf("dump_array_and_delete: sym_lookup of %s passed\n", + name); + else @{ + printf("dump_array_and_delete: sym_lookup of %s failed\n", + name); + goto out; + @} + @} else @{ + printf("dump_array_and_delete: get_argument(0) failed\n"); + goto out; + @} +@end example + +For testing purposes and to make sure that the C code sees +the same number of elements as the @command{awk} code, +the second step is to get the count of elements in the array +and print it: + +@example + if (! get_element_count(value2.array_cookie, & count)) @{ + printf("dump_array_and_delete: get_element_count failed\n"); + goto out; + @} + + printf("dump_array_and_delete: incoming size is %lu\n", + (unsigned long) count); +@end example + +The third step is to actually flatten the array, and then +to double check that the count in the @code{awk_flat_array_t} +is the same as the count just retrieved: + +@example + if (! 
flatten_array(value2.array_cookie, & flat_array)) @{ + printf("dump_array_and_delete: could not flatten array\n"); + goto out; + @} + + if (flat_array->count != count) @{ + printf("dump_array_and_delete: flat_array->count (%lu)" + " != count (%lu)\n", + (unsigned long) flat_array->count, + (unsigned long) count); + goto out; + @} +@end example + +The fourth step is to retrieve the index of the element +to be deleted, which was passed as the second argument. +Remember that argument counts passed to @code{get_argument()} +are zero-based, thus the second argument is numbered one: + +@example + if (! get_argument(1, AWK_STRING, & value3)) @{ + printf("dump_array_and_delete: get_argument(1) failed\n"); + goto out; + @} +@end example + +The fifth step is where the ``real work'' is done. The function +loops over every element in the array, printing the index and +element values. In addition, upon finding the element with the +index that is supposed to be deleted, the function sets the +@code{AWK_ELEMENT_DELETE} bit in the @code{flags} field +of the element. When the array is released, @command{gawk} +traverses the flattened array, and deletes any elements that +have this flag bit set: + +@example + for (i = 0; i < flat_array->count; i++) @{ + printf("\t%s[\"%.*s\"] = %s\n", + name, + (int) flat_array->elements[i].index.str_value.len, + flat_array->elements[i].index.str_value.str, + valrep2str(& flat_array->elements[i].value)); + + if (strcmp(value3.str_value.str, + flat_array->elements[i].index.str_value.str) + == 0) @{ + flat_array->elements[i].flags |= AWK_ELEMENT_DELETE; + printf("dump_array_and_delete: marking element \"%s\" " + "for deletion\n", + flat_array->elements[i].index.str_value.str); + @} + @} +@end example + +The sixth step is to release the flattened array. This tells +@command{gawk} that the extension is no longer using the array, +and that it should delete any elements marked for deletion.
+@command{gawk} also frees any storage that was allocated, +so you should not use the pointer (@code{flat_array} in this +code) once you have called @code{release_flattened_array()}: + +@example + if (! release_flattened_array(value2.array_cookie, flat_array)) @{ + printf("dump_array_and_delete: could not release flattened array\n"); + goto out; + @} +@end example + +Finally, since everything was successful, the function sets the +return value to success, and returns: + +@example + make_number(1.0, result); +out: + return result; +@} +@end example + +Here is the output from running this part of the test: + +@example +pets has 5 elements +dump_array_and_delete: sym_lookup of pets passed +dump_array_and_delete: incoming size is 5 + pets["1"] = "blacky" + pets["2"] = "rusty" + pets["3"] = "sophie" +dump_array_and_delete: marking element "3" for deletion + pets["4"] = "raincloud" + pets["5"] = "lucky" +dump_array_and_delete(pets) returned 1 +dump_array_and_delete() did remove index "3"! +@end example + +@node Creating Arrays +@subsubsection How To Create and Populate Arrays + +Besides working with arrays created by @command{awk} code, you can +create arrays and populate them as you see fit, and then @command{awk} +code can access them and manipulate them. + +There are two important points about creating arrays from extension code: + +@enumerate 1 +@item +You must install a new array into @command{gawk}'s symbol +table immediately upon creating it. Once you have done so, +you can then populate the array. + +@ignore +Strictly speaking, this is required only +for arrays that will have subarrays as elements; however it is +a good idea to always do this. This restriction may be relaxed +in a subsequent revision of the API. +@end ignore + +Similarly, if installing a new array as a subarray of an existing array, +you must add the new array to its parent before adding any elements to it. 
+ +Thus, the correct way to build an array is to work ``top down.'' Create +the array, and immediately install it in @command{gawk}'s symbol table +using @code{sym_update()}, or install it as an element in a previously +existing array using @code{set_array_element()}. We show example code shortly. + +@item +Due to @command{gawk} internals, after using @code{sym_update()} to install an array +into @command{gawk}, you have to retrieve the array cookie from the value +passed in to @code{sym_update()} before doing anything else with it, like so: + +@example +awk_value_t value; +awk_array_t new_array; + +new_array = create_array(); +value.val_type = AWK_ARRAY; +value.array_cookie = new_array; + +/* install array in the symbol table */ +sym_update("array", & value); + +new_array = value.array_cookie; /* YOU MUST DO THIS */ +@end example + +If installing an array as a subarray, you must also retrieve the value +of the array cookie after the call to @code{set_array_element()}. +@end enumerate + +The following C code is a simple test extension to create an array +with two regular elements and with a subarray. The leading @samp{#include} +directives and boilerplate variable declarations are omitted for brevity. 
+The first step is to create a new array and then install it +in the symbol table: + +@example +@ignore +#ifdef HAVE_CONFIG_H +#include <config.h> +#endif + +#include <stdio.h> +#include <assert.h> +#include <errno.h> +#include <stdlib.h> +#include <string.h> +#include <unistd.h> + +#include <sys/types.h> +#include <sys/stat.h> + +#include "gawkapi.h" + +static const gawk_api_t *api; /* for convenience macros to work */ +static awk_ext_id_t *ext_id; +static const char *ext_version = "testarray extension: version 1.0"; + +int plugin_is_GPL_compatible; + +@end ignore +/* create_new_array --- create a named array */ + +static void +create_new_array() +@{ + awk_array_t a_cookie; + awk_array_t subarray; + awk_value_t index, value; + + a_cookie = create_array(); + value.val_type = AWK_ARRAY; + value.array_cookie = a_cookie; + + if (! sym_update("new_array", & value)) + printf("create_new_array: sym_update(\"new_array\") failed!\n"); + a_cookie = value.array_cookie; +@end example + +@noindent +Note how @code{a_cookie} is reset from the @code{array_cookie} field in +the @code{value} structure. + +The second step is to install two regular values into @code{new_array}: + +@example + (void) make_const_string("hello", 5, & index); + (void) make_const_string("world", 5, & value); + if (! set_array_element(a_cookie, & index, & value)) @{ + printf("fill_in_array: set_array_element failed\n"); + return; + @} + + (void) make_const_string("answer", 6, & index); + (void) make_number(42.0, & value); + if (! set_array_element(a_cookie, & index, & value)) @{ + printf("fill_in_array: set_array_element failed\n"); + return; + @} +@end example + +The third step is to create the subarray and install it: + +@example + (void) make_const_string("subarray", 8, & index); + subarray = create_array(); + value.val_type = AWK_ARRAY; + value.array_cookie = subarray; + if (! 
set_array_element(a_cookie, & index, & value)) @{ + printf("fill_in_array: set_array_element failed\n"); + return; + @} + subarray = value.array_cookie; +@end example + +The final step is to populate the subarray with its own element: + +@example + (void) make_const_string("foo", 3, & index); + (void) make_const_string("bar", 3, & value); + if (! set_array_element(subarray, & index, & value)) @{ + printf("fill_in_array: set_array_element failed\n"); + return; + @} +@} +@ignore +static awk_ext_func_t func_table[] = @{ + @{ NULL, NULL, 0 @} +@}; + +/* init_testarray --- additional initialization function */ + +static awk_bool_t init_testarray(void) +@{ + create_new_array(); + + return 1; +@} + +static awk_bool_t (*init_func)(void) = init_testarray; + +dl_load_func(func_table, testarray, "") +@end ignore +@end example + +Here is a sample script that loads the extension +and then dumps the array: + +@example +@@load "subarray" + +function dumparray(name, array, i) +@{ + for (i in array) + if (isarray(array[i])) + dumparray(name "[\"" i "\"]", array[i]) + else + printf("%s[\"%s\"] = %s\n", name, i, array[i]) +@} + +BEGIN @{ + dumparray("new_array", new_array); +@} +@end example + +Here is the result of running the script: + +@example +$ @kbd{AWKLIBPATH=$PWD ./gawk -f subarray.awk} +@print{} new_array["subarray"]["foo"] = bar +@print{} new_array["hello"] = world +@print{} new_array["answer"] = 42 +@end example + +@noindent +(@xref{Finding Extensions}, for more information on the +@env{AWKLIBPATH} environment variable.) + +@node Extension API Variables +@subsection API Variables + +The API provides two sets of variables. The first provides information +about the version of the API (both with which the extension was compiled, +and with which @command{gawk} was compiled). The second provides +information about how @command{gawk} was invoked. + +@menu +* Extension Versioning:: API Version information. 
+* Extension API Informational Variables:: Variables providing information about + @command{gawk}'s invocation. +@end menu + +@node Extension Versioning +@subsubsection API Version Constants and Variables + +The API provides both a ``major'' and a ``minor'' version number. +The API versions are available at compile time as constants: + +@table @code +@item GAWK_API_MAJOR_VERSION +The major version of the API. + +@item GAWK_API_MINOR_VERSION +The minor version of the API. +@end table + +The minor version increases when new functions are added to the API. Such +new functions are always added to the end of the API @code{struct}. + +The major version increases (and the minor version is reset to zero) if any +of the data types change size or member order, or if any of the existing +functions change signature. + +It could happen that an extension may be compiled against one version +of the API but loaded by a version of @command{gawk} using a different +version. For this reason, the major and minor API versions of the +running @command{gawk} are included in the API @code{struct} as read-only +constant integers: + +@table @code +@item api->major_version +The major version of the running @command{gawk}. + +@item api->minor_version +The minor version of the running @command{gawk}. +@end table + +It is up to the extension to decide if there are API incompatibilities. +Typically a check like this is enough: + +@example +if (api->major_version != GAWK_API_MAJOR_VERSION + || api->minor_version < GAWK_API_MINOR_VERSION) @{ + fprintf(stderr, "foo_extension: version mismatch with gawk!\n"); + fprintf(stderr, "\tmy version (%d, %d), gawk version (%d, %d)\n", + GAWK_API_MAJOR_VERSION, GAWK_API_MINOR_VERSION, + api->major_version, api->minor_version); + exit(1); +@} +@end example + +Such code is included in the boilerplate @code{dl_load_func()} macro +provided in @file{gawkapi.h} (discussed later, in +@ref{Extension API Boilerplate}). 
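The compatibility rule described above can be tried out in isolation. The following is only a minimal sketch, not the code in @file{gawkapi.h}: the @code{EXT_API_MAJOR_VERSION} and @code{EXT_API_MINOR_VERSION} constants, their values, the @code{struct api_version} type, and the @code{api_is_compatible()} helper are all hypothetical stand-ins for the real compile-time constants and the version members of the API @code{struct}.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the compile-time constants; values are made up. */
#define EXT_API_MAJOR_VERSION 1
#define EXT_API_MINOR_VERSION 2

/* Hypothetical miniature of the version members of the API struct. */
struct api_version {
    int major_version;  /* major version of the running gawk */
    int minor_version;  /* minor version of the running gawk */
};

/*
 * api_is_compatible --- return 1 if an extension compiled against
 * (EXT_API_MAJOR_VERSION, EXT_API_MINOR_VERSION) can be loaded by a
 * gawk that reports the versions in *running, 0 otherwise.
 */
static int
api_is_compatible(const struct api_version *running)
{
    assert(running != NULL);

    /* a different major version means data types or signatures changed */
    if (running->major_version != EXT_API_MAJOR_VERSION)
        return 0;

    /* an older minor version may lack functions the extension calls */
    if (running->minor_version < EXT_API_MINOR_VERSION)
        return 0;

    /* same major version, same or newer minor version */
    return 1;
}
```

A newer minor version in the running @command{gawk} is accepted because new functions are only ever appended to the end of the API @code{struct}, so everything the extension was compiled against is still in the same place.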
+ +@node Extension API Informational Variables +@subsubsection Informational Variables + +The API provides access to several variables that describe +whether the corresponding command-line options were enabled when +@command{gawk} was invoked. The variables are: + +@table @code +@item do_lint +This variable is true if @command{gawk} was invoked with the @option{--lint} option +(@pxref{Options}). + +@item do_traditional +This variable is true if @command{gawk} was invoked with the @option{--traditional} option. + +@item do_profile +This variable is true if @command{gawk} was invoked with the @option{--profile} option. + +@item do_sandbox +This variable is true if @command{gawk} was invoked with the @option{--sandbox} option. + +@item do_debug +This variable is true if @command{gawk} was invoked with the @option{--debug} option. + +@item do_mpfr +This variable is true if @command{gawk} was invoked with the @option{--bignum} option. +@end table + +The value of @code{do_lint} can change if @command{awk} code +modifies the @code{LINT} built-in variable (@pxref{Built-in Variables}). +The others should not change during execution. + +@node Extension API Boilerplate +@subsection Boilerplate Code + +As mentioned earlier (@pxref{Extension Mechanism Outline}), the function +definitions as presented are really macros. To use these macros, your +extension must provide a small amount of boilerplate code (variables and +functions) towards the top of your source file, using pre-defined names +as described below. 
The boilerplate needed is also provided in comments +in the @file{gawkapi.h} header file: + +@example +/* Boilerplate code: */ +int plugin_is_GPL_compatible; + +static const gawk_api_t *api; +static awk_ext_id_t ext_id; +static const char *ext_version = NULL; /* or @dots{} = "some string" */ + +static awk_ext_func_t func_table[] = @{ + @{ "name", do_name, 1 @}, + /* @dots{} */ +@}; + +/* EITHER: */ + +static awk_bool_t (*init_func)(void) = NULL; + +/* OR: */ + +static awk_bool_t +init_my_module(void) +@{ + @dots{} +@} + +static awk_bool_t (*init_func)(void) = init_my_module; + +dl_load_func(func_table, some_name, "name_space_in_quotes") +@end example + +These variables and functions are as follows: + +@table @code +@item int plugin_is_GPL_compatible; +This asserts that the extension is compatible with the GNU GPL +(@pxref{Copying}). If your extension does not have this, @command{gawk} +will not load it (@pxref{Plugin License}). + +@item static const gawk_api_t *api; +This global @code{static} variable should be set to point to +the @code{gawk_api_t} pointer that @command{gawk} passes to your +@code{dl_load()} function. This variable is used by all of the macros. + +@item static awk_ext_id_t ext_id; +This global @code{static} variable should be set to the @code{awk_ext_id_t} +value that @command{gawk} passes to your @code{dl_load()} function. +This variable is used by all of the macros. + +@item static const char *ext_version = NULL; /* or @dots{} = "some string" */ +This global @code{static} variable should be set either +to @code{NULL}, or to point to a string giving the name and version of +your extension. + +@item static awk_ext_func_t func_table[] = @{ @dots{} @}; +This is an array of one or more @code{awk_ext_func_t} structures +as described earlier (@pxref{Extension Functions}). +It can then be looped over for multiple calls to +@code{add_ext_func()}. 
+ +@item static awk_bool_t (*init_func)(void) = NULL; +@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @r{OR} +@itemx static awk_bool_t init_my_module(void) @{ @dots{} @} +@itemx static awk_bool_t (*init_func)(void) = init_my_module; +If you need to do some initialization work, you should define a +function that does it (creates variables, opens files, etc.) +and then define the @code{init_func} pointer to point to your +function. +The function should return zero (false) upon failure, non-zero +(success) if everything goes well. + +If you don't need to do any initialization, define the pointer and +initialize it to @code{NULL}. + +@item dl_load_func(func_table, some_name, "name_space_in_quotes") +This macro expands to a @code{dl_load()} function that performs +all the necessary initializations. +@end table + +The point of all the variables and arrays is to let the +@code{dl_load()} function (from the @code{dl_load_func()} +macro) do all the standard work. It does the following: + +@enumerate 1 +@item +Check the API versions. If the extension major version does not match +@command{gawk}'s, or if the extension minor version is greater than +@command{gawk}'s, it prints a fatal error message and exits. + +@item +Load the functions defined in @code{func_table}. +If any of them fails to load, it prints a warning message but +continues on. + +@item +If the @code{init_func} pointer is not @code{NULL}, call the +function it points to. If it returns zero (false), print a +warning message. + +@item +If @code{ext_version} is not @code{NULL}, register +the version string with @command{gawk}. +@end enumerate + +@node Finding Extensions +@subsection How @command{gawk} Finds Extensions + +Compiled extensions have to be installed in a directory where +@command{gawk} can find them. If @command{gawk} is configured and +built in the default fashion, the directory in which to find +extensions is @file{/usr/local/lib/gawk}. 
You can also specify a search +path with a list of directories to search for compiled extensions. +@xref{AWKLIBPATH Variable}, for more information. + +@node Extension Example +@section Example: Some File Functions + +@quotation +@i{No matter where you go, there you are.} @* +Buckaroo Banzai +@end quotation + +@c It's enough to show chdir and stat, no need for fts + +Two useful functions that are not in @command{awk} are @code{chdir()} (so +that an @command{awk} program can change its directory) and @code{stat()} +(so that an @command{awk} program can gather information about a file). +This @value{SECTION} implements these functions for @command{gawk} +in an extension. + +@menu +* Internal File Description:: What the new functions will do. +* Internal File Ops:: The code for internal file operations. +* Using Internal File Ops:: How to use an external extension. +@end menu + +@node Internal File Description +@subsection Using @code{chdir()} and @code{stat()} + +This @value{SECTION} shows how to use the new functions at +the @command{awk} level once they've been integrated into the +running @command{gawk} interpreter. Using @code{chdir()} is very +straightforward. It takes one argument, the new directory to change to: + +@example +@@load "filefuncs" +@dots{} +newdir = "/home/arnold/funstuff" +ret = chdir(newdir) +if (ret < 0) @{ + printf("could not change to %s: %s\n", + newdir, ERRNO) > "/dev/stderr" + exit 1 +@} +@dots{} +@end example + +The return value is negative if the @code{chdir()} failed, and +@code{ERRNO} (@pxref{Built-in Variables}) is set to a string indicating +the error. + +Using @code{stat()} is a bit more complicated. The C @code{stat()} +function fills in a structure that has a fair amount of information. 
+The right way to model this in @command{awk} is to fill in an associative +array with the appropriate information: + +@c broke printf for page breaking +@example +file = "/home/arnold/.profile" +ret = stat(file, fdata) +if (ret < 0) @{ + printf("could not stat %s: %s\n", + file, ERRNO) > "/dev/stderr" + exit 1 +@} +printf("size of %s is %d bytes\n", file, fdata["size"]) +@end example + +The @code{stat()} function always clears the data array, even if +the @code{stat()} fails. It fills in the following elements: + +@table @code +@item "name" +The name of the file that was @code{stat()}'ed. + +@item "dev" +@itemx "ino" +The file's device and inode numbers, respectively. + +@item "mode" +The file's mode, as a numeric value. This includes both the file's +type and its permissions. + +@item "nlink" +The number of hard links (directory entries) the file has. + +@item "uid" +@itemx "gid" +The numeric user and group ID numbers of the file's owner. + +@item "size" +The size in bytes of the file. + +@item "blocks" +The number of disk blocks the file actually occupies. This may not +be a function of the file's size if the file has holes. + +@item "atime" +@itemx "mtime" +@itemx "ctime" +The file's last access, modification, and inode update times, +respectively. These are numeric timestamps, suitable for formatting +with @code{strftime()} +(@pxref{Time Functions}). + +@item "pmode" +The file's ``printable mode.'' This is a string representation of +the file's type and permissions, such as is produced by +@samp{ls -l}---for example, @code{"drwxr-xr-x"}. + +@item "type" +A printable string representation of the file's type. The value +is one of the following: + +@table @code +@item "blockdev" +@itemx "chardev" +The file is a block or character device (``special file''). + +@ignore +@item "door" +The file is a Solaris ``door'' (special file used for +interprocess communications). +@end ignore + +@item "directory" +The file is a directory. 
+ +@item "fifo" +The file is a named-pipe (also known as a FIFO). + +@item "file" +The file is just a regular file. + +@item "socket" +The file is an @code{AF_UNIX} (``Unix domain'') socket in the +filesystem. + +@item "symlink" +The file is a symbolic link. +@end table +@end table + +Several additional elements may be present depending upon the operating +system and the type of the file. You can test for them in your @command{awk} +program by using the @code{in} operator +(@pxref{Reference to Elements}): + +@table @code +@item "blksize" +The preferred block size for I/O to the file. This field is not +present on all POSIX-like systems in the C @code{stat} structure. + +@item "linkval" +If the file is a symbolic link, this element is the name of the +file the link points to (i.e., the value of the link). + +@item "rdev" +@itemx "major" +@itemx "minor" +If the file is a block or character device file, then these values +represent the numeric device number and the major and minor components +of that number, respectively. +@end table + +@node Internal File Ops +@subsection C Code for @code{chdir()} and @code{stat()} + +Here is the C code for these extensions.@footnote{This version is +edited slightly for presentation. See @file{extension/filefuncs.c} +in the @command{gawk} distribution for the complete version.} + +The file includes a number of standard header files, and then includes +the @file{gawkapi.h} header file which provides the API definitions. +Those are followed by the necessary variable declarations +to make use of the API macros and boilerplate code +(@pxref{Extension API Boilerplate}). 
+ +@c break line for page breaking +@example +#ifdef HAVE_CONFIG_H +#include <config.h> +#endif + +#include <stdio.h> +#include <assert.h> +#include <errno.h> +#include <stdlib.h> +#include <string.h> +#include <unistd.h> + +#include <sys/types.h> +#include <sys/stat.h> + +#include "gawkapi.h" + +#include "gettext.h" +#define _(msgid) gettext(msgid) +#define N_(msgid) msgid + +#include "gawkfts.h" +#include "stack.h" + +static const gawk_api_t *api; /* for convenience macros to work */ +static awk_ext_id_t *ext_id; +static awk_bool_t init_filefuncs(void); +static awk_bool_t (*init_func)(void) = init_filefuncs; +static const char *ext_version = "filefuncs extension: version 1.0"; + +int plugin_is_GPL_compatible; +@end example + +@cindex programming conventions, @command{gawk} internals +By convention, for an @command{awk} function @code{foo()}, the C function +that implements it is called @code{do_foo()}. The function should have +two arguments: the first is an @code{int} usually called @code{nargs}, +that represents the number of actual arguments for the function. +The second is a pointer to an @code{awk_value_t}, usually named +@code{result}. + +@example +/* do_chdir --- provide dynamically loaded chdir() builtin for gawk */ + +static awk_value_t * +do_chdir(int nargs, awk_value_t *result) +@{ + awk_value_t newdir; + int ret = -1; + + assert(result != NULL); + + if (do_lint && nargs != 1) + lintwarn(ext_id, + _("chdir: called with incorrect number of arguments, " + "expecting 1")); +@end example + +The @code{newdir} +variable represents the new directory to change to, retrieved +with @code{get_argument()}. Note that the first argument is +numbered zero. + +If the argument is retrieved successfully, the function calls the +@code{chdir()} system call. If the @code{chdir()} fails, @code{ERRNO} +is updated. 
+ +@example + if (get_argument(0, AWK_STRING, & newdir)) @{ + ret = chdir(newdir.str_value.str); + if (ret < 0) + update_ERRNO_int(errno); + @} +@end example + +Finally, the function returns the return value to the @command{awk} level: + +@example + return make_number(ret, result); +@} +@end example + +The @code{stat()} extension is more involved. First comes a function +that turns a numeric mode into a printable representation +(e.g., 644 becomes @samp{-rw-r--r--}). This is omitted here for brevity: + +@c break line for page breaking +@example +/* format_mode --- turn a stat mode field into something readable */ + +static char * +format_mode(unsigned long fmode) +@{ + @dots{} +@} +@end example + +Next comes a function for reading symbolic links, which is also +omitted here for brevity: + +@example +/* read_symlink --- read a symbolic link into an allocated buffer. + @dots{} */ + +static char * +read_symlink(const char *fname, size_t bufsize, ssize_t *linksize) +@{ + @dots{} +@} +@end example + +Two helper functions simplify entering values in the +array that will contain the result of the @code{stat()}: + +@example +/* array_set --- set an array element */ + +static void +array_set(awk_array_t array, const char *sub, awk_value_t *value) +@{ + awk_value_t index; + + set_array_element(array, + make_const_string(sub, strlen(sub), & index), + value); + +@} + +/* array_set_numeric --- set an array element with a number */ + +static void +array_set_numeric(awk_array_t array, const char *sub, double num) +@{ + awk_value_t tmp; + + array_set(array, sub, make_number(num, & tmp)); +@} +@end example + +The following function does most of the work to fill in +the @code{awk_array_t} result array with values obtained +from a valid @code{struct stat}. 
It is done in a separate function +to support the @code{stat()} function for @command{gawk} and also +to support the @code{fts()} extension which is included in +the same file but whose code is not shown here +(@pxref{Extension Sample File Functions}). + +The first part of the function is variable declarations, +including a table to map file types to strings: + +@example +/* fill_stat_array --- do the work to fill an array with stat info */ + +static int +fill_stat_array(const char *name, awk_array_t array, struct stat *sbuf) +@{ + char *pmode; /* printable mode */ + const char *type = "unknown"; + awk_value_t tmp; + static struct ftype_map @{ + unsigned int mask; + const char *type; + @} ftype_map[] = @{ + @{ S_IFREG, "file" @}, + @{ S_IFBLK, "blockdev" @}, + @{ S_IFCHR, "chardev" @}, + @{ S_IFDIR, "directory" @}, +#ifdef S_IFSOCK + @{ S_IFSOCK, "socket" @}, +#endif +#ifdef S_IFIFO + @{ S_IFIFO, "fifo" @}, +#endif +#ifdef S_IFLNK + @{ S_IFLNK, "symlink" @}, +#endif +#ifdef S_IFDOOR /* Solaris weirdness */ + @{ S_IFDOOR, "door" @}, +#endif /* S_IFDOOR */ + @}; + int j, k; +@end example + +The destination array is cleared, and then code fills in +various elements based on values in the @code{struct stat}: + +@example + /* empty out the array */ + clear_array(array); + + /* fill in the array */ + array_set(array, "name", make_const_string(name, strlen(name), + & tmp)); + array_set_numeric(array, "dev", sbuf->st_dev); + array_set_numeric(array, "ino", sbuf->st_ino); + array_set_numeric(array, "mode", sbuf->st_mode); + array_set_numeric(array, "nlink", sbuf->st_nlink); + array_set_numeric(array, "uid", sbuf->st_uid); + array_set_numeric(array, "gid", sbuf->st_gid); + array_set_numeric(array, "size", sbuf->st_size); + array_set_numeric(array, "blocks", sbuf->st_blocks); + array_set_numeric(array, "atime", sbuf->st_atime); + array_set_numeric(array, "mtime", sbuf->st_mtime); + array_set_numeric(array, "ctime", sbuf->st_ctime); + + /* for block and character devices, add 
rdev, + major and minor numbers */ + if (S_ISBLK(sbuf->st_mode) || S_ISCHR(sbuf->st_mode)) @{ + array_set_numeric(array, "rdev", sbuf->st_rdev); + array_set_numeric(array, "major", major(sbuf->st_rdev)); + array_set_numeric(array, "minor", minor(sbuf->st_rdev)); + @} +@end example + +@noindent +The latter part of the function makes selective additions +to the destination array, depending upon the availability of +certain members and/or the type of the file. It then returns zero, +for success: + +@example +#ifdef HAVE_ST_BLKSIZE + array_set_numeric(array, "blksize", sbuf->st_blksize); +#endif /* HAVE_ST_BLKSIZE */ + + pmode = format_mode(sbuf->st_mode); + array_set(array, "pmode", make_const_string(pmode, strlen(pmode), + & tmp)); + + /* for symbolic links, add a linkval field */ + if (S_ISLNK(sbuf->st_mode)) @{ + char *buf; + ssize_t linksize; + + if ((buf = read_symlink(name, sbuf->st_size, + & linksize)) != NULL) + array_set(array, "linkval", + make_malloced_string(buf, linksize, & tmp)); + else + warning(ext_id, _("stat: unable to read symbolic link `%s'"), + name); + @} + + /* add a type field */ + type = "unknown"; /* shouldn't happen */ + for (j = 0, k = sizeof(ftype_map)/sizeof(ftype_map[0]); j < k; j++) @{ + if ((sbuf->st_mode & S_IFMT) == ftype_map[j].mask) @{ + type = ftype_map[j].type; + break; + @} + @} + + array_set(array, "type", make_const_string(type, strlen(type), &tmp)); + + return 0; +@} +@end example + +Finally, here is the @code{do_stat()} function. It starts with +variable declarations and argument checking: + +@ignore +Changed message for page breaking. 
Used to be: + "stat: called with incorrect number of arguments (%d), should be 2", +@end ignore +@example +/* do_stat --- provide a stat() function for gawk */ + +static awk_value_t * +do_stat(int nargs, awk_value_t *result) +@{ + awk_value_t file_param, array_param; + char *name; + awk_array_t array; + int ret; + struct stat sbuf; + int (*statfunc)(const char *path, struct stat *sbuf) = lstat; /* default */ + + assert(result != NULL); + + if (nargs != 2 && nargs != 3) @{ + if (do_lint) + lintwarn(ext_id, _("stat: called with wrong number of arguments")); + return make_number(-1, result); + @} +@end example + +The third argument to @code{stat()} was not discussed previously. This argument +is optional. If present, it causes @code{stat()} to use the @code{stat()} +system call instead of the @code{lstat()} system call. + +Then comes the actual work. First, the function gets the arguments. +Next, it gets the information for the file. +By default, the code uses @code{lstat()} (instead of @code{stat()}) +to get the file information, +in case the file is a symbolic link. +If there's an error, it sets @code{ERRNO} and returns: + +@example + /* file is first arg, array to hold results is second */ + if ( ! get_argument(0, AWK_STRING, & file_param) + || ! get_argument(1, AWK_ARRAY, & array_param)) @{ + warning(ext_id, _("stat: bad parameters")); + return make_number(-1, result); + @} + + if (nargs == 3) @{ + statfunc = stat; + @} + + name = file_param.str_value.str; + array = array_param.array_cookie; + + /* always empty out the array */ + clear_array(array); + + /* stat the file, if error, set ERRNO and return */ + ret = statfunc(name, & sbuf); + if (ret < 0) @{ + update_ERRNO_int(errno); + return make_number(ret, result); + @} +@end example + +The tedious work is done by @code{fill_stat_array()}, shown +earlier. 
When done, return the result from @code{fill_stat_array()}: + +@example + ret = fill_stat_array(name, array, & sbuf); + + return make_number(ret, result); +@} +@end example + +@cindex programming conventions, @command{gawk} internals +Finally, it's necessary to provide the ``glue'' that loads the +new function(s) into @command{gawk}. + +The @code{filefuncs} extension also provides an @code{fts()} +function, which we omit here. For its sake there is an initialization +function: + +@example +/* init_filefuncs --- initialization routine */ + +static awk_bool_t +init_filefuncs(void) +@{ + @dots{} +@} +@end example + +We are almost done. We need an array of @code{awk_ext_func_t} +structures for loading each function into @command{gawk}: + +@example +static awk_ext_func_t func_table[] = @{ + @{ "chdir", do_chdir, 1 @}, + @{ "stat", do_stat, 2 @}, + @{ "fts", do_fts, 3 @}, +@}; +@end example + +Each extension must have a routine named @code{dl_load()} to load +everything that needs to be loaded. It is simplest to use the +@code{dl_load_func()} macro in @code{gawkapi.h}: + +@example +/* define the dl_load() function using the boilerplate macro */ + +dl_load_func(func_table, filefuncs, "") +@end example + +And that's it! As an exercise, consider adding functions to +implement system calls such as @code{chown()}, @code{chmod()}, +and @code{umask()}. + +@node Using Internal File Ops +@subsection Integrating The Extensions + +@cindex @command{gawk}, interpreter@comma{} adding code to +Now that the code is written, it must be possible to add it at +runtime to the running @command{gawk} interpreter. First, the +code must be compiled. Assuming that the functions are in +a file named @file{filefuncs.c}, and @var{idir} is the location +of the @file{gawkapi.h} header file, +the following steps@footnote{In practice, you would probably want to +use the GNU Autotools---Automake, Autoconf, Libtool, and Gettext---to +configure and build your libraries. 
Instructions for doing so are beyond +the scope of this @value{DOCUMENT}. @xref{gawkextlib}, for WWW links to +the tools.} create a GNU/Linux shared library: + +@example +$ @kbd{gcc -fPIC -shared -DHAVE_CONFIG_H -c -O -g -I@var{idir} filefuncs.c} +$ @kbd{ld -o filefuncs.so -shared filefuncs.o -lc} +@end example + +Once the library exists, it is loaded by using the @code{@@load} keyword. + +@example +# file testff.awk +@@load "filefuncs" + +BEGIN @{ + "pwd" | getline curdir # save current directory + close("pwd") + + chdir("/tmp") + system("pwd") # test it + chdir(curdir) # go back + + print "Info for testff.awk" + ret = stat("testff.awk", data) + print "ret =", ret + for (i in data) + printf "data[\"%s\"] = %s\n", i, data[i] + print "testff.awk modified:", + strftime("%m %d %y %H:%M:%S", data["mtime"]) + + print "\nInfo for JUNK" + ret = stat("JUNK", data) + print "ret =", ret + for (i in data) + printf "data[\"%s\"] = %s\n", i, data[i] + print "JUNK modified:", strftime("%m %d %y %H:%M:%S", data["mtime"]) +@} +@end example + +The @env{AWKLIBPATH} environment variable tells +@command{gawk} where to find shared libraries (@pxref{Finding Extensions}). 
+We set it to the current directory and run the program: + +@example +$ @kbd{AWKLIBPATH=$PWD gawk -f testff.awk} +@print{} /tmp +@print{} Info for testff.awk +@print{} ret = 0 +@print{} data["blksize"] = 4096 +@print{} data["mtime"] = 1350838628 +@print{} data["mode"] = 33204 +@print{} data["type"] = file +@print{} data["dev"] = 2053 +@print{} data["gid"] = 1000 +@print{} data["ino"] = 1719496 +@print{} data["ctime"] = 1350838628 +@print{} data["blocks"] = 8 +@print{} data["nlink"] = 1 +@print{} data["name"] = testff.awk +@print{} data["atime"] = 1350838632 +@print{} data["pmode"] = -rw-rw-r-- +@print{} data["size"] = 662 +@print{} data["uid"] = 1000 +@print{} testff.awk modified: 10 21 12 18:57:08 +@print{} +@print{} Info for JUNK +@print{} ret = -1 +@print{} JUNK modified: 01 01 70 02:00:00 +@end example + +@node Extension Samples +@section The Sample Extensions In The @command{gawk} Distribution + +This @value{SECTION} provides brief overviews of the sample extensions +that come in the @command{gawk} distribution. Some of them are intended +for production use, such as the @code{filefuncs} and @code{readdir} extensions. +Others mainly provide example code that shows how to use the extension API. + +@menu +* Extension Sample File Functions:: The file functions sample. +* Extension Sample Fnmatch:: An interface to @code{fnmatch()}. +* Extension Sample Fork:: An interface to @code{fork()} and other + process functions. +* Extension Sample Ord:: Character to value to character + conversions. +* Extension Sample Readdir:: An interface to @code{readdir()}. +* Extension Sample Revout:: Reversing output sample output wrapper. +* Extension Sample Rev2way:: Reversing data sample two-way processor. +* Extension Sample Read write array:: Serializing an array to a file. +* Extension Sample Readfile:: Reading an entire file into a string. +* Extension Sample API Tests:: Tests for the API. +* Extension Sample Time:: An interface to @code{gettimeofday()} + and @code{sleep()}. 
+@end menu + +@node Extension Sample File Functions +@subsection File Related Functions + +The @code{filefuncs} extension provides three different functions. +The usage is as follows: + +@table @code +@item @@load "filefuncs" +This is how you load the extension. + +@item result = chdir("/some/directory") +The @code{chdir()} function is a direct hook to the @code{chdir()} +system call to change the current directory. It returns zero +upon success or less than zero upon error. In the latter case it updates +@code{ERRNO}. + +@item result = stat("/some/path", statdata [, follow]) +The @code{stat()} function provides a hook into the +@code{stat()} system call. +It returns zero upon success or less than zero upon error. +In the latter case it updates @code{ERRNO}. + +By default, it uses the @code{lstat()} system call. However, if passed +a third argument, it uses @code{stat()} instead. + +In all cases, it clears the @code{statdata} array. +When the call is successful, @code{stat()} fills the @code{statdata} +array with information retrieved from the filesystem, as follows: + +@c nested table +@multitable @columnfractions .25 .60 +@item @code{statdata["name"]} @tab +The name of the file. + +@item @code{statdata["dev"]} @tab +Corresponds to the @code{st_dev} field in the @code{struct stat}. + +@item @code{statdata["ino"]} @tab +Corresponds to the @code{st_ino} field in the @code{struct stat}. + +@item @code{statdata["mode"]} @tab +Corresponds to the @code{st_mode} field in the @code{struct stat}. + +@item @code{statdata["nlink"]} @tab +Corresponds to the @code{st_nlink} field in the @code{struct stat}. + +@item @code{statdata["uid"]} @tab +Corresponds to the @code{st_uid} field in the @code{struct stat}. + +@item @code{statdata["gid"]} @tab +Corresponds to the @code{st_gid} field in the @code{struct stat}. + +@item @code{statdata["size"]} @tab +Corresponds to the @code{st_size} field in the @code{struct stat}. 
+ +@item @code{statdata["atime"]} @tab +Corresponds to the @code{st_atime} field in the @code{struct stat}. + +@item @code{statdata["mtime"]} @tab +Corresponds to the @code{st_mtime} field in the @code{struct stat}. + +@item @code{statdata["ctime"]} @tab +Corresponds to the @code{st_ctime} field in the @code{struct stat}. + +@item @code{statdata["rdev"]} @tab +Corresponds to the @code{st_rdev} field in the @code{struct stat}. +This element is only present for device files. + +@item @code{statdata["major"]} @tab +Corresponds to the @code{st_major} field in the @code{struct stat}. +This element is only present for device files. + +@item @code{statdata["minor"]} @tab +Corresponds to the @code{st_minor} field in the @code{struct stat}. +This element is only present for device files. + +@item @code{statdata["blksize"]} @tab +Corresponds to the @code{st_blksize} field in the @code{struct stat}, +if this field is present on your system. +(It is present on all modern systems that we know of.) + +@item @code{statdata["pmode"]} @tab +A human-readable version of the mode value, such as printed by +@command{ls}. For example, @code{"-rwxr-xr-x"}. + +@item @code{statdata["linkval"]} @tab +If the named file is a symbolic link, this element will exist +and its value is the value of the symbolic link (where the +symbolic link points to). + +@item @code{statdata["type"]} @tab +The type of the file as a string. One of +@code{"file"}, +@code{"blockdev"}, +@code{"chardev"}, +@code{"directory"}, +@code{"socket"}, +@code{"fifo"}, +@code{"symlink"}, +@code{"door"}, +or +@code{"unknown"}. +Not all systems support all file types. +@end multitable + +@item flags = or(FTS_PHYSICAL, ...) +@itemx result = fts(pathlist, flags, filedata) +Walk the file trees provided in @code{pathlist} and fill in the +@code{filedata} array as described below. @code{flags} is the bitwise +OR of several predefined constant values, also as described below. 
+Return zero if there were no errors, otherwise return @minus{}1. +@end table + +The @code{fts()} function provides a hook to the C library @code{fts()} +routines for traversing file hierarchies. Instead of returning data +about one file at a time in a stream, it fills in a multi-dimensional +array with data about each file and directory encountered in the requested +hierarchies. + +The arguments are as follows: + +@table @code +@item pathlist +An array of filenames. The element values are used; the index values are ignored. + +@item flags +This should be the bitwise OR of one or more of the following +predefined constant flag values. At least one of +@code{FTS_LOGICAL} or @code{FTS_PHYSICAL} must be provided; otherwise +@code{fts()} returns an error value and sets @code{ERRNO}. +The flags are: + +@c nested table +@table @code +@item FTS_LOGICAL +Do a ``logical'' file traversal, where the information returned for +a symbolic link refers to the linked-to file, and not to the symbolic +link itself. This flag is mutually exclusive with @code{FTS_PHYSICAL}. + +@item FTS_PHYSICAL +Do a ``physical'' file traversal, where the information returned for a +symbolic link refers to the symbolic link itself. This flag is mutually +exclusive with @code{FTS_LOGICAL}. + +@item FTS_NOCHDIR +As a performance optimization, the C library @code{fts()} routines +change directory as they traverse a file hierarchy. This flag disables +that optimization. + +@item FTS_COMFOLLOW +Immediately follow a symbolic link named in @code{pathlist}, +whether or not @code{FTS_LOGICAL} is set. + +@item FTS_SEEDOT +By default, the @code{fts()} routines do not return entries for @file{.} +and @file{..}. This option causes entries for @file{..} to also +be included. (The extension always includes an entry for @file{.}, +see below.) + +@item FTS_XDEV +During a traversal, do not cross onto a different mounted filesystem. +@end table + +@item filedata +The @code{filedata} array is first cleared. 
Then, @code{fts()} creates +an element in @code{filedata} for every element in @code{pathlist}. +The index is the name of the directory or file given in @code{pathlist}. +The element for this index is itself an array. There are two cases. + +@c nested table +@table @emph +@item The path is a file. +In this case, the array contains two or three elements: + +@c doubly nested table +@table @code +@item "path" +The full path to this file, starting from the ``root'' that was given +in the @code{pathlist} array. + +@item "stat" +This element is itself an array, containing the same information as provided +by the @code{stat()} function described earlier for its +@code{statdata} argument. The element may not be present if +the @code{stat()} system call for the file failed. + +@item "error" +If some kind of error was encountered, the array will also +contain an element named @code{"error"}, which is a string describing the error. +@end table + +@item The path is a directory. +In this case, the array contains one element for each entry in the +directory. If an entry is a file, that element is as for files, just +described. If the entry is a directory, that element is (recursively), +an array describing the subdirectory. If @code{FTS_SEEDOT} was provided +in the flags, then there will also be an element named @code{".."}. This +element will be an array containing the data as provided by @code{stat()}. + +In addition, there will be an element whose index is @code{"."}. +This element is an array containing the same two or three elements as +for a file: @code{"path"}, @code{"stat"}, and @code{"error"}. +@end table +@end table + +The @code{fts()} function returns zero if there were no errors. +Otherwise it returns @minus{}1. 
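The nested layout described above can be walked with a short recursive function. The following is a minimal sketch, not code from the distribution: it assumes a @command{gawk} that supports @code{@@load} and arrays of arrays, and that @code{filefuncs} can be found through @env{AWKLIBPATH}. The helper name @code{list_files()} and the choice of @file{.} as the starting point are made up for illustration. It relies only on the rule stated above that every directory array has a @code{"."} element, while a file array does not.

```awk
@load "filefuncs"

# list_files() is a hypothetical helper, not part of the extension.
# It prints the "path" of every regular file found under node.
function list_files(node,    entry)
{
    if (! ("." in node)) {
        # No "." element: node describes a file, not a directory.
        # "stat" may be missing if the stat() system call failed.
        if (("stat" in node) && node["stat"]["type"] == "file")
            print node["path"]
        return
    }
    for (entry in node) {
        if (entry == "." || entry == "..")
            continue        # skip the directory's own entries
        list_files(node[entry])
    }
}

BEGIN {
    roots[1] = "."          # element values are the paths to walk
    if (fts(roots, FTS_PHYSICAL, filedata) != 0) {
        print "fts() failed:", ERRNO > "/dev/stderr"
        exit 1
    }
    for (root in filedata)
        list_files(filedata[root])
}
```

Testing for @code{"."} rather than for @code{"path"} avoids misclassifying a directory that happens to contain a file named @file{path}. The order in which files are printed follows @command{awk}'s array traversal order, so it is not guaranteed.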
+ +@quotation NOTE +The @code{fts()} extension does not exactly mimic the +interface of the C library @code{fts()} routines, choosing instead to +provide an interface that is based on associative arrays, which should +be more comfortable to use from an @command{awk} program. This includes the +lack of a comparison function, since @command{gawk} already provides +powerful array sorting facilities. While an @code{fts_read()}-like +interface could have been provided, this felt less natural than simply +creating a multi-dimensional array to represent the file hierarchy and +its information. +@end quotation + +See @file{test/fts.awk} in the @command{gawk} distribution for an example. + +@node Extension Sample Fnmatch +@subsection Interface To @code{fnmatch()} + +This extension provides an interface to the C library +@code{fnmatch()} function. The usage is: + +@example +@@load "fnmatch" + +result = fnmatch(pattern, string, flags) +@end example + +The @code{fnmatch} extension adds a single function named +@code{fnmatch()}, one constant (@code{FNM_NOMATCH}), and an array of +flag values named @code{FNM}. + +The arguments to @code{fnmatch()} are: + +@table @code +@item pattern +The filename wildcard to match. + +@item string +The filename string. + +@item flags +Either zero, or the bitwise OR of one or more of the +flags in the @code{FNM} array. +@end table + +The return value is zero on success, @code{FNM_NOMATCH} +if the string did not match the pattern, or +a different non-zero value if an error occurred. + +The flags are as follows: + +@multitable @columnfractions .25 .75 +@item @code{FNM["CASEFOLD"]} @tab +Corresponds to the @code{FNM_CASEFOLD} flag as defined in @code{fnmatch()}. + +@item @code{FNM["FILE_NAME"]} @tab +Corresponds to the @code{FNM_FILE_NAME} flag as defined in @code{fnmatch()}. + +@item @code{FNM["LEADING_DIR"]} @tab +Corresponds to the @code{FNM_LEADING_DIR} flag as defined in @code{fnmatch()}. 
+ +@item @code{FNM["NOESCAPE"]} @tab +Corresponds to the @code{FNM_NOESCAPE} flag as defined in @code{fnmatch()}. + +@item @code{FNM["PATHNAME"]} @tab +Corresponds to the @code{FNM_PATHNAME} flag as defined in @code{fnmatch()}. + +@item @code{FNM["PERIOD"]} @tab +Corresponds to the @code{FNM_PERIOD} flag as defined in @code{fnmatch()}. +@end multitable + +Here is an example: + +@example +@@load "fnmatch" +@dots{} +flags = or(FNM["PERIOD"], FNM["NOESCAPE"]) +if (fnmatch("*.a", "foo.c", flags) == FNM_NOMATCH) + print "no match" +@end example + +@node Extension Sample Fork +@subsection Interface To @code{fork()}, @code{wait()} and @code{waitpid()} + +The @code{fork} extension adds three functions, as follows. + +@table @code +@item @@load "fork" +This is how you load the extension. + +@item pid = fork() +This function creates a new process. The return value is zero in the +child and the process-id number of the child in the parent, or @minus{}1 +upon error. In the latter case, @code{ERRNO} indicates the problem. +In the child, @code{PROCINFO["pid"]} and @code{PROCINFO["ppid"]} are +updated to reflect the correct values. + +@item ret = waitpid(pid) +This function takes a numeric argument, which is the process-id to +wait for. The return value is that of the +@code{waitpid()} system call. + +@item ret = wait() +This function waits for the first child to die. +The return value is that of the +@code{wait()} system call. +@end table + +There is no corresponding @code{exec()} function. + +Here is an example: + +@example +@@load "fork" +@dots{} +if ((pid = fork()) == 0) + print "hello from the child" +else + print "hello from the parent" +@end example + +@node Extension Sample Ord +@subsection Character and Numeric values: @code{ord()} and @code{chr()} + +The @code{ordchr} extension adds two functions, named +@code{ord()} and @code{chr()}, as follows. + +@table @code +@item number = ord(string) +Return the numeric value of the first character in @code{string}. 
+ +@item char = chr(number) +Return the string whose first character is that represented by @code{number}. +@end table + +These functions are inspired by the Pascal language functions +of the same name. Here is an example: + +@example +@@load "ordchr" +@dots{} +printf("The numeric value of 'A' is %d\n", ord("A")) +printf("The string value of 65 is %s\n", chr(65)) +@end example + +@node Extension Sample Readdir +@subsection Reading Directories + +The @code{readdir} extension adds an input parser for directories. +The usage is as follows: + +@example +@@load "readdir" +@end example + +When this extension is in use, instead of skipping directories named +on the command line (or with @code{getline}), +they are read, with each entry returned as a record. + +The record consists of three fields. The first two are the inode number and the +filename, separated by a forward slash character. +On systems where the directory entry contains the file type, the record +has a third field which is a single letter indicating the type of the +file: + +@multitable @columnfractions .1 .9 +@headitem Letter @tab File Type +@item @code{b} @tab Block device +@item @code{c} @tab Character device +@item @code{d} @tab Directory +@item @code{f} @tab Regular file +@item @code{l} @tab Symbolic link +@item @code{p} @tab Named pipe (FIFO) +@item @code{s} @tab Socket +@item @code{u} @tab Anything else (unknown) +@end multitable + +On systems without the file type information, the third field is always +@samp{u}. + +@quotation NOTE +On GNU/Linux systems, there are filesystems that don't support the +@code{d_type} entry (see the @i{readdir}(3) manual page), and so the file +type is always @samp{u}. You can use the @code{filefuncs} extension to call +@code{stat()} in order to get correct type information. 
+@end quotation + +Here is an example: + +@example +@@load "readdir" +@dots{} +BEGIN @{ FS = "/" @} +@{ print "file name is", $2 @} +@end example + +@node Extension Sample Revout +@subsection Reversing Output + +The @code{revoutput} extension adds a simple output wrapper that reverses +the characters in each output line. Its main purpose is to show how to +write an output wrapper, although it may be mildly amusing for the unwary. +Here is an example: + +@example +@@load "revoutput" + +BEGIN @{ + REVOUT = 1 + print "hello, world" > "/dev/stdout" +@} +@end example + +The output from this program is: +@samp{dlrow ,olleh}. + +@node Extension Sample Rev2way +@subsection Two-Way I/O Example + +The @code{revtwoway} extension adds a simple two-way processor that +reverses the characters in each line sent to it for reading back by +the @command{awk} program. Its main purpose is to show how to write +a two-way processor, although it may also be mildly amusing. +The following example shows how to use it: + +@example +@@load "revtwoway" + +BEGIN @{ + cmd = "/magic/mirror" + print "hello, world" |& cmd + cmd |& getline result + print result + close(cmd) +@} +@end example + +@node Extension Sample Read write array +@subsection Dumping and Restoring An Array + +The @code{rwarray} extension adds two functions, +named @code{writea()} and @code{reada()}, as follows: + +@table @code +@item ret = writea(file, array) +This function takes a string argument, which is the name of the file +to which to dump the array, and the array itself as the second argument. +@code{writea()} understands multidimensional arrays. It returns one on +success, or zero upon failure. + +@item ret = reada(file, array) +@code{reada()} is the inverse of @code{writea()}; +it reads the file named as its first argument, filling in +the array named as the second argument. It clears the array first. +Here too, the return value is one on success and zero upon failure. 
+@end table + +The array created by @code{reada()} is identical to that written by +@code{writea()} in the sense that the contents are the same. However, +due to implementation issues, the array traversal order of the recreated +array is likely to be different from that of the original array. As array +traversal order in @command{awk} is by default undefined, this is not +(technically) a problem. If you need to guarantee a particular traversal +order, use the array sorting features in @command{gawk} to do so +(@pxref{Array Sorting}). + +The file contains binary data. All integral values are written in network +byte order. However, double precision floating-point values are written +as native binary data. Thus, arrays containing only string data can +theoretically be dumped on systems with one byte order and restored on +systems with a different one, but this has not been tried. + +Here is an example: + +@example +@@load "rwarray" +@dots{} +ret = writea("arraydump.bin", array) +@dots{} +ret = reada("arraydump.bin", array) +@end example + +@node Extension Sample Readfile +@subsection Reading An Entire File + +The @code{readfile} extension adds a single function +named @code{readfile()}: + +@table @code +@item result = readfile("/some/path") +The argument is the name of the file to read. The return value is a +string containing the entire contents of the requested file. Upon error, +the function returns the empty string and sets @code{ERRNO}. +@end table + +Here is an example: + +@example +@@load "readfile" +@dots{} +contents = readfile("/path/to/file"); +if (contents == "" && ERRNO != "") @{ + print("problem reading file", ERRNO) > "/dev/stderr" + ... +@} +@end example + +@node Extension Sample API Tests +@subsection API Tests + +The @code{testext} extension exercises parts of the extension API that +are not tested by the other samples. 
The @file{extension/testext.c} +file contains both the C code for the extension and @command{awk} +test code inside C comments that run the tests. The testing framework +extracts the @command{awk} code and runs the tests. See the source file +for more information. + +@node Extension Sample Time +@subsection Extension Time Functions + +@cindex time +@cindex sleep + +These functions can be used by either invoking @command{gawk} +with a command-line argument of @samp{-l time} or by +inserting @samp{@@load "time"} in your script. + +@table @code + +@cindex @code{gettimeofday} time extension function +@item the_time = gettimeofday() +Return the time in seconds that has elapsed since 1970-01-01 UTC as a +floating point value. If the time is unavailable on this platform, return +@minus{}1 and set @code{ERRNO}. The returned time should have sub-second +precision, but the actual precision may vary based on the platform. +If the standard C @code{gettimeofday()} system call is available on this +platform, then it simply returns the value. Otherwise, if on Windows, +it tries to use @code{GetSystemTimeAsFileTime()}. + +@cindex @code{sleep} time extension function +@item result = sleep(@var{seconds}) +Attempt to sleep for @var{seconds} seconds. If @var{seconds} is negative, +or the attempt to sleep fails, return @minus{}1 and set @code{ERRNO}. +Otherwise, return zero after sleeping for the indicated amount of time. +Note that @var{seconds} may be a floating-point (non-integral) value. +Implementation details: depending on platform availability, this function +tries to use @code{nanosleep()} or @code{select()} to implement the delay. +@end table + +@node gawkextlib +@section The @code{gawkextlib} Project + +The @uref{http://sourceforge.net/projects/gawkextlib/, @code{gawkextlib}} +project provides a number of @command{gawk} extensions, including one for +processing XML files. This is the evolution of the original @command{xgawk} +(XML @command{gawk}) project. 
+ +As of this writing, there are four extensions: + +@itemize @bullet +@item +XML parser extension, using the @uref{http://expat.sourceforge.net, Expat} +XML parsing library. + +@item +PostgreSQL extension. + +@item +GD graphics library extension. + +@item +MPFR library extension. +This provides access to a number of MPFR functions which @command{gawk}'s +native MPFR support does not. +@end itemize + +The @code{time} extension described earlier (@pxref{Extension Sample +Time}) was originally from this project but has been moved into the +main @command{gawk} distribution. + +You can check out the code for the @code{gawkextlib} project +using the @uref{http://git-scm.com, Git} distributed source +code control system. The command is as follows: + +@example +git clone git://git.code.sf.net/p/gawkextlib/code gawkextlib-code +@end example + +You will need to have the @uref{http://expat.sourceforge.net, Expat} +XML parser library installed in order to build and use the XML extension. + +In addition, you must have the GNU Autotools installed +(@uref{http://www.gnu.org/software/autoconf, Autoconf}, +@uref{http://www.gnu.org/software/automake, Automake}, +@uref{http://www.gnu.org/software/libtool, Libtool}, +and +@uref{http://www.gnu.org/software/gettext, Gettext}). + +The simple recipe for building and testing @code{gawkextlib} is as follows. 
+First, build and install @command{gawk}: + +@example +cd .../path/to/gawk/code +./configure --prefix=/tmp/newgawk @ii{Install in /tmp/newgawk for now} +make && make check @ii{Build and check that all is OK} +make install @ii{Install gawk} +@end example + +Next, build @code{gawkextlib} and test it: + +@example +cd .../path/to/gawkextlib-code +./update-autotools @ii{Generate configure, etc.} + @ii{You may have to run this command twice} +./configure --with-gawk=/tmp/newgawk @ii{Configure, point at ``installed'' gawk} +make && make check @ii{Build and check that all is OK} +@end example + +If you write an extension that you wish to share with other +@command{gawk} users, please consider doing so through the +@code{gawkextlib} project. + +@iftex +@part Part IV:@* Appendices +@end iftex + +@ignore +@ifdocbook + +@part Part IV:@* Appendices + +Part IV provides the appendices, the Glossary, and two licenses that cover the @command{gawk} source code and this @value{DOCUMENT}, respectively. -It contains the following appendixes: +It contains the following appendices: @itemize @bullet @item @@ -27476,11 +32009,7 @@ It contains the following appendixes: @item @ref{GNU Free Documentation License}. @end itemize - -@page -@evenheading @thispage@ @ @ @strong{@value{TITLE}} @| @| -@oddheading @| @| @strong{@thischapter}@ @ @ @thispage -@end iftex +@end ifdocbook @end ignore @node Language History @@ -27498,8 +32027,6 @@ This @value{CHAPTER} briefly describes the evolution of the @command{awk} language, with cross-references to other parts of the @value{DOCUMENT} where you can find more information. -@c FIXME: Try to determine whether it was 3.1 or 3.2 that had new awk. - @menu * V7/SVR3.1:: The major changes between V7 and System V Release 3.1. @@ -27914,6 +32441,7 @@ and @code{xor()} functions for bit manipulation (@pxref{Bitwise Functions}). 
+@c In 4.1, and(), or() and xor() grew the ability to take > 2 arguments @item The @code{asort()} and @code{asorti()} functions for sorting arrays @@ -27925,11 +32453,6 @@ functions for internationalization (@pxref{Programmer i18n}). @item -The @code{extension()} built-in function and the ability to add -new functions dynamically -(@pxref{Dynamic Extensions}). - -@item The @code{fflush()} function from Brian Kernighan's version of @command{awk} (@pxref{I/O Functions}). @@ -27956,29 +32479,70 @@ the @option{-f} command-line option (@pxref{Options}). @item -The ability to use GNU-style long-named options that start with @option{--} +The @env{AWKLIBPATH} environment variable for specifying a path search for +the @option{-l} command-line option +(@pxref{Options}). + +@item +The +@option{-b}, +@option{-c}, +@option{-C}, +@option{-d}, +@option{-D}, +@option{-e}, +@option{-E}, +@option{-g}, +@option{-h}, +@option{-i}, +@option{-l}, +@option{-L}, +@option{-M}, +@option{-n}, +@option{-N}, +@option{-o}, +@option{-O}, +@option{-p}, +@option{-P}, +@option{-r}, +@option{-S}, +@option{-t}, +and +@option{-V} +short options. Also, the +ability to use GNU-style long-named options that start with @option{--} and the +@option{--assign}, +@option{--bignum}, @option{--characters-as-bytes}, -@option{--compat}, +@option{--copyright}, +@option{--debug}, @option{--dump-variables}, -@option{--exec}, +@option{--execle}, +@option{--field-separator}, +@option{--file}, @option{--gen-pot}, +@option{--help}, +@option{--include}, @option{--lint}, @option{--lint-old}, +@option{--load}, @option{--non-decimal-data}, +@option{--optimize}, @option{--posix}, +@option{--pretty-print}, @option{--profile}, @option{--re-interval}, @option{--sandbox}, @option{--source}, @option{--traditional}, +@option{--use-lc-numeric}, and -@option{--use-lc-numeric} -options +@option{--version} +long options (@pxref{Options}). 
@end itemize - @c new ports @item @@ -28076,7 +32640,7 @@ Almost all introductory Unix literature explained range expressions as working in this fashion, and in particular, would teach that the ``correct'' way to match lowercase letters was with @samp{[a-z]}, and that @samp{[A-Z]} was the ``correct'' way to match uppercase letters. -And indeed, this was true. +And indeed, this was true.@footnote{And Life was good.} The 1993 POSIX standard introduced the idea of locales (@pxref{Locales}). Since many locales include other letters besides the plain twenty-six @@ -28094,12 +32658,12 @@ But outside those locales, the ordering was defined to be based on In many locales, @samp{A} and @samp{a} are both less than @samp{B}. In other words, these locales sort characters in dictionary order, and @samp{[a-dx-z]} is typically not equivalent to @samp{[abcdxyz]}; -instead it might be equivalent to @samp{[aBbCcdXxYyz]}, for example. +instead it might be equivalent to @samp{[ABCXYabcdxyz]}, for example. This point needs to be emphasized: Much literature teaches that you should use @samp{[a-z]} to match a lowercase character. But on systems with non-ASCII locales, this also matched all of the uppercase characters -except @samp{Z}! This was a continuous cause of confusion, even well +except @samp{A} or @samp{Z}! This was a continuous cause of confusion, even well into the twenty-first century. To demonstrate these issues, the following example uses the @code{sub()} @@ -28135,13 +32699,16 @@ the @command{gawk} maintainer grew weary of trying to explain that @command{gawk} was being nicely standards-compliant, and that the issue was in the user's locale. During the development of version 4.0, he modified @command{gawk} to always treat ranges in the original, -pre-POSIX fashion, unless @option{--posix} was used (@pxref{Options}). 
+pre-POSIX fashion, unless @option{--posix} was used (@pxref{Options}).@footnote{And +thus was born the Campaign for Rational Range Interpretation (or RRI). 
A number +of GNU tools, such as @command{grep} and @command{sed}, have either +implemented this change, or will soon. Thanks to Karl Berry for coining the phrase +``Rational Range Interpretation.''} Fortunately, shortly before the final release of @command{gawk} 4.0, the maintainer learned that the 2008 standard had changed the definition of ranges, such that outside the @code{"C"} and @code{"POSIX"} -locales, the meaning of range expressions was -@emph{undefined}.@footnote{See +locales, the meaning of range expressions was @emph{undefined}.@footnote{See @uref{http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03_05, the standard} and @uref{http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap09.html#tag_21_09_03_05, its rationale}.} @@ -28151,7 +32718,6 @@ to implementors to implement ranges in whatever way they choose. The @command{gawk} maintainer chose to apply the pre-POSIX meaning in all cases: the default regexp matching; with @option{--traditional}, and with @option{--posix}; in all cases, @command{gawk} remains POSIX compliant. - @node Contributors @appendixsec Major Contributors to @command{gawk} @cindex @command{gawk}, list of contributors to @@ -28284,6 +32850,7 @@ the various PC platforms. Christos Zoulas provided the @code{extension()} built-in function for dynamically adding new modules. +(This was removed at @command{gawk} 4.1.) @item @cindex Kahrs, J@"urgen @@ -29712,9 +34279,8 @@ maintainers of @command{gawk}. Everything in it applies specifically to * Compatibility Mode:: How to disable certain @command{gawk} extensions. * Additions:: Making Additions To @command{gawk}. -* Dynamic Extensions:: Adding new built-in functions to - @command{gawk}. * Future Extensions:: New features that may be implemented one day. +* Implementation Limitations:: Some limitations of the implementation. @end menu @node Compatibility Mode @@ -29759,6 +34325,8 @@ as well as any considerations you should bear in mind. @command{gawk}. 
* New Ports:: Porting @command{gawk} to a new operating system. +* Derived Files:: Why derived files are kept in the + @command{git} repository. @end menu @node Accessing The Source @@ -29790,7 +34358,7 @@ git clone http://git.savannah.gnu.org/r/gawk.git @end example Once you have made changes, you can use @samp{git diff} to produce a -patch, and send that to the @command{gawk} maintainer; see @ref{Bugs} +patch, and send that to the @command{gawk} maintainer; see @ref{Bugs}, for how to do that. Finally, if you cannot install Git (e.g., if it hasn't been ported @@ -29801,6 +34369,10 @@ to check out a copy using CVS, as follows: cvs -d:pserver:anonymous@@pserver.git.sv.gnu.org:/gawk.git co -d gawk master @end example +Note that this gateway is flakey; you may have better luck using +a more modern version control system like Bazaar, that has a Git +plug-in for working with Git repositories. + @node Adding Code @appendixsubsec Adding New Features @@ -29902,7 +34474,8 @@ of @code{switch} statements, instead of just the plain pointer or character value. @item -Use the @code{TRUE}, @code{FALSE} and @code{NULL} symbolic constants +Use @code{true}, @code{false} for @code{bool} values, +the @code{NULL} symbolic constant for pointer values, and the character constant @code{'\0'} where appropriate, instead of @code{1} and @code{0}. @@ -29949,8 +34522,9 @@ You will also have to sign paperwork for your documentation changes. Submit changes as unified diffs. Use @samp{diff -u -r -N} to compare the original @command{gawk} source tree with your version. -I recommend using the GNU version of @command{diff}. -Send the output produced by either run of @command{diff} to me when you +I recommend using the GNU version of @command{diff}, or best of all, +@samp{git diff} or @samp{git format-patch}. +Send the output produced by @command{diff} to me when you submit your changes. (@xref{Bugs}, for the electronic mail information.) 
@@ -30076,840 +34650,188 @@ operating systems' code that is already there. In the code that you supply and maintain, feel free to use a coding style and brace layout that suits your taste. -@node Dynamic Extensions -@appendixsec Adding New Built-in Functions to @command{gawk} -@cindex Robinson, Will -@cindex robot, the -@cindex Lost In Space -@quotation -@i{Danger Will Robinson! Danger!!@* -Warning! Warning!}@* -The Robot -@end quotation +@node Derived Files +@appendixsubsec Why Generated Files Are Kept In @command{git} -@c STARTOFRANGE gladfgaw -@cindex @command{gawk}, functions, adding -@c STARTOFRANGE adfugaw -@cindex adding, functions to @command{gawk} -@c STARTOFRANGE fubadgaw -@cindex functions, built-in, adding to @command{gawk} -It is possible to add new built-in -functions to @command{gawk} using dynamically loaded libraries. This -facility is available on systems (such as GNU/Linux) that support -the C @code{dlopen()} and @code{dlsym()} functions. -This @value{SECTION} describes how to write and use dynamically -loaded extensions for @command{gawk}. -Experience with programming in -C or C++ is necessary when reading this @value{SECTION}. +@c From emails written March 22, 2012, to the gawk developers list. -@quotation CAUTION -The facilities described in this @value{SECTION} -are very much subject to change in a future @command{gawk} release. -Be aware that you may have to re-do everything, -at some future time. - -If you have written your own dynamic extensions, -be sure to recompile them for each new @command{gawk} release. -There is no guarantee of binary compatibility between different -releases, nor will there ever be such a guarantee. -@end quotation +If you look at the @command{gawk} source in the @command{git} +repository, you will notice that it includes files that are automatically +generated by GNU infrastructure tools, such as @file{Makefile.in} from +@command{automake} and even @file{configure} from @command{autoconf}. 
-@quotation NOTE -When @option{--sandbox} is specified, extensions are disabled -(@pxref{Options}. -@end quotation +This is different from many Free Software projects that do not store +the derived files, because that keeps the repository less cluttered, +and it is easier to see the substantive changes when comparing versions +and trying to understand what changed between commits. -@menu -* Internals:: A brief look at some @command{gawk} internals. -* Plugin License:: A note about licensing. -* Loading Extensions:: How to load dynamic extensions. -* Sample Library:: A example of new functions. -@end menu +However, there are two reasons why the @command{gawk} maintainer +likes to have everything in the repository. -@node Internals -@appendixsubsec A Minimal Introduction to @command{gawk} Internals -@c STARTOFRANGE gawint -@cindex @command{gawk}, internals - -The truth is that @command{gawk} was not designed for simple extensibility. -The facilities for adding functions using shared libraries work, but -are something of a ``bag on the side.'' Thus, this tour is -brief and simplistic; would-be @command{gawk} hackers are encouraged to -spend some time reading the source code before trying to write -extensions based on the material presented here. Of particular note -are the files @file{awk.h}, @file{builtin.c}, and @file{eval.c}. -Reading @file{awkgram.y} in order to see how the parse tree is built -would also be of use. - -@cindex @code{awk.h} file (internal) -With the disclaimers out of the way, the following types, structure -members, functions, and macros are declared in @file{awk.h} and are of -use when writing extensions. The next @value{SECTION} -shows how they are used: +First, because it is then easy to reproduce any given version completely, +without relying upon the availability of (older, likely obsolete, and +maybe even impossible to find) other tools. 
-@table @code -@cindex floating-point, numbers, @code{AWKNUM} internal type -@cindex numbers, floating-point, @code{AWKNUM} internal type -@cindex @code{AWKNUM} internal type -@cindex internal type, @code{AWKNUM} -@item AWKNUM -An @code{AWKNUM} is the internal type of @command{awk} -floating-point numbers. Typically, it is a C @code{double}. - -@cindex @code{NODE} internal type -@cindex internal type, @code{NODE} -@cindex strings, @code{NODE} internal type -@cindex numbers, @code{NODE} internal type -@item NODE -Just about everything is done using objects of type @code{NODE}. -These contain both strings and numbers, as well as variables and arrays. - -@cindex @code{force_number()} internal function -@cindex internal function, @code{force_number()} -@cindex numeric, values -@item AWKNUM force_number(NODE *n) -This macro forces a value to be numeric. It returns the actual -numeric value contained in the node. -It may end up calling an internal @command{gawk} function. - -@cindex @code{force_string()} internal function -@cindex internal function, @code{force_string()} -@item void force_string(NODE *n) -This macro guarantees that a @code{NODE}'s string value is current. -It may end up calling an internal @command{gawk} function. -It also guarantees that the string is zero-terminated. - -@cindex @code{force_wstring()} internal function -@cindex internal function, @code{force_wstring()} -@item void force_wstring(NODE *n) -Similarly, this -macro guarantees that a @code{NODE}'s wide-string value is current. -It may end up calling an internal @command{gawk} function. -It also guarantees that the wide string is zero-terminated. - -@cindex parameters@comma{} number of -@cindex @code{nargs} internal variable -@cindex internal variable, @code{nargs} -@item nargs -Inside an extension function, this is the actual number of -parameters passed to the current function. 
- -@cindex @code{stptr} internal variable -@cindex internal variable, @code{stptr} -@cindex @code{stlen} internal variable -@cindex internal variable, @code{stlen} -@item n->stptr -@itemx n->stlen -The data and length of a @code{NODE}'s string value, respectively. -The string is @emph{not} guaranteed to be zero-terminated. -If you need to pass the string value to a C library function, save -the value in @code{n->stptr[n->stlen]}, assign @code{'\0'} to it, -call the routine, and then restore the value. - -@cindex @code{wstptr} internal variable -@cindex internal variable, @code{wstptr} -@cindex @code{wstlen} internal variable -@cindex internal variable, @code{wstlen} -@item n->wstptr -@itemx n->wstlen -The data and length of a @code{NODE}'s wide-string value, respectively. -Use @code{force_wstring()} to make sure these values are current. - -@cindex @code{type} internal variable -@cindex internal variable, @code{type} -@item n->type -The type of the @code{NODE}. This is a C @code{enum}. Values should -be one of @code{Node_var}, @code{Node_var_new}, or @code{Node_var_array} -for function parameters. - -@cindex @code{vname} internal variable -@cindex internal variable, @code{vname} -@item n->vname -The ``variable name'' of a node. This is not of much use inside -externally written extensions. - -@cindex arrays, associative, clearing -@cindex @code{assoc_clear()} internal function -@cindex internal function, @code{assoc_clear()} -@item void assoc_clear(NODE *n) -Clears the associative array pointed to by @code{n}. -Make sure that @samp{n->type == Node_var_array} first. - -@cindex arrays, elements, installing -@cindex @code{assoc_lookup()} internal function -@cindex internal function, @code{assoc_lookup()} -@item NODE **assoc_lookup(NODE *symbol, NODE *subs) -Finds, and installs if necessary, array elements. -@code{symbol} is the array, @code{subs} is the subscript. -This is usually a value created with @code{make_string()} (see below). 
- -@cindex strings -@cindex @code{make_string()} internal function -@cindex internal function, @code{make_string()} -@item NODE *make_string(char *s, size_t len) -Take a C string and turn it into a pointer to a @code{NODE} that -can be stored appropriately. This is permanent storage; understanding -of @command{gawk} memory management is helpful. - -@cindex numbers -@cindex @code{make_number()} internal function -@cindex internal function, @code{make_number()} -@item NODE *make_number(AWKNUM val) -Take an @code{AWKNUM} and turn it into a pointer to a @code{NODE} that -can be stored appropriately. This is permanent storage; understanding -of @command{gawk} memory management is helpful. - - -@cindex nodes@comma{} duplicating -@cindex @code{dupnode()} internal function -@cindex internal function, @code{dupnode()} -@item NODE *dupnode(NODE *n) -Duplicate a node. In most cases, this increments an internal -reference count instead of actually duplicating the entire @code{NODE}; -understanding of @command{gawk} memory management is helpful. - -@cindex memory, releasing -@cindex @code{unref()} internal function -@cindex internal function, @code{unref()} -@item void unref(NODE *n) -This macro releases the memory associated with a @code{NODE} -allocated with @code{make_string()} or @code{make_number()}. -Understanding of @command{gawk} memory management is helpful. - -@cindex @code{make_builtin()} internal function -@cindex internal function, @code{make_builtin()} -@item void make_builtin(const char *name, NODE *(*func)(NODE *), int count) -Register a C function pointed to by @code{func} as new built-in -function @code{name}. @code{name} is a regular C string. @code{count} -is the maximum number of arguments that the function takes. 
-The function should be written in the following manner: - -@example -/* do_xxx --- do xxx function for gawk */ - -NODE * -do_xxx(int nargs) -@{ - @dots{} -@} -@end example +As an extreme example, if you ever even think about trying to compile, +oh, say, the V7 @command{awk}, you will discover that not only do you +have to bootstrap the V7 @command{yacc} to do so, but you also need the +V7 @command{lex}. And the latter is pretty much impossible to bring up +on a modern GNU/Linux system.@footnote{We tried. It was painful.} -@cindex arguments, retrieving -@cindex @code{get_argument()} internal function -@cindex internal function, @code{get_argument()} -@item NODE *get_argument(int i) -This function is called from within a C extension function to get -the @code{i}-th argument from the function call. -The first argument is argument zero. - -@cindex @code{get_actual_argument()} internal function -@cindex internal function, @code{get_actual_argument()} -@item NODE *get_actual_argument(int i, -@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ int@ optional,@ int@ wantarray); -This function retrieves a particular argument @code{i}. @code{wantarray} is @code{TRUE} -if the argument should be an array, @code{FALSE} otherwise. If @code{optional} is -@code{TRUE}, the argument need not have been supplied. If it wasn't, the return -value is @code{NULL}. It is a fatal error if @code{optional} is @code{TRUE} but -the argument was not provided. - -@cindex @code{get_scalar_argument()} internal macro -@cindex internal macro, @code{get_scalar_argument()} -@item get_scalar_argument(i, opt) -This is a convenience macro that calls @code{get_actual_argument()}. - -@cindex @code{get_array_argument()} internal macro -@cindex internal macro, @code{get_array_argument()} -@item get_array_argument(i, opt) -This is a convenience macro that calls @code{get_actual_argument()}. 
- -@cindex functions, return values@comma{} setting +(Or, let's say @command{gawk} 1.2 required @command{bison} whatever-it-was +in 1989 and that there was no @file{awkgram.c} file in the repository. Is +there a guarantee that we could find that @command{bison} version? Or that +@emph{it} would build?) -@cindex @code{ERRNO} variable -@cindex @code{update_ERRNO()} internal function -@cindex internal function, @code{update_ERRNO()} -@item void update_ERRNO(void) -This function is called from within a C extension function to set -the value of @command{gawk}'s @code{ERRNO} variable, based on the current -value of the C @code{errno} global variable. -It is provided as a convenience. - -@cindex @code{ERRNO} variable -@cindex @code{update_ERRNO_saved()} internal function -@cindex internal function, @code{update_ERRNO_saved()} -@item void update_ERRNO_saved(int errno_saved) -This function is called from within a C extension function to set -the value of @command{gawk}'s @code{ERRNO} variable, based on the error -value provided as the argument. -It is provided as a convenience. - -@cindex @code{ENVIRON} array -@cindex @code{PROCINFO} array -@cindex @code{register_deferred_variable()} internal function -@cindex internal function, @code{register_deferred_variable()} -@item void register_deferred_variable(const char *name, NODE *(*load_func)(void)) -This function is called to register a function to be called when a -reference to an undefined variable with the given name is encountered. -The callback function will never be called if the variable exists already, -so, unless the calling code is running at program startup, it should first -check whether a variable of the given name already exists. -The argument function must return a pointer to a @code{NODE} containing the -newly created variable. This function is used to implement the builtin -@code{ENVIRON} and @code{PROCINFO} arrays, so you can refer to them -for examples. 
- -@cindex @code{IOBUF} internal structure -@cindex internal structure, @code{IOBUF} -@cindex @code{iop_alloc()} internal function -@cindex internal function, @code{iop_alloc()} -@cindex @code{get_record()} input method -@cindex @code{close_func}() input method -@cindex @code{INVALID_HANDLE} internal constant -@cindex internal constant, @code{INVALID_HANDLE} -@cindex XML (eXtensible Markup Language) -@cindex eXtensible Markup Language (XML) -@cindex @code{register_open_hook()} internal function -@cindex internal function, @code{register_open_hook()} -@item void register_open_hook(void *(*open_func)(IOBUF *)) -This function is called to register a function to be called whenever -a new data file is opened, leading to the creation of an @code{IOBUF} -structure in @code{iop_alloc()}. After creating the new @code{IOBUF}, -@code{iop_alloc()} will call (in reverse order of registration, so the last -function registered is called first) each open hook until one returns -non-@code{NULL}. If any hook returns a non-@code{NULL} value, that value is assigned -to the @code{IOBUF}'s @code{opaque} field (which will presumably point -to a structure containing additional state associated with the input -processing), and no further open hooks are called. - -The function called will most likely want to set the @code{IOBUF}'s -@code{get_record} method to indicate that future input records should -be retrieved by calling that method instead of using the standard -@command{gawk} input processing. - -And the function will also probably want to set the @code{IOBUF}'s -@code{close_func} method to be called when the file is closed to clean -up any state associated with the input. - -Finally, hook functions should be prepared to receive an @code{IOBUF} -structure where the @code{fd} field is set to @code{INVALID_HANDLE}, -meaning that @command{gawk} was not able to open the file itself. 
In -this case, the hook function must be able to successfully open the file -and place a valid file descriptor there. - -Currently, for example, the hook function facility is used to implement -the XML parser shared library extension. For more info, please look in -@file{awk.h} and in @file{io.c}. -@end table - -An argument that is supposed to be an array needs to be handled with -some extra code, in case the array being passed in is actually -from a function parameter. - -The following boilerplate code shows how to do this: - -@example -NODE *the_arg; - -/* assume need 3rd arg, 0-based */ -the_arg = get_array_argument(2, FALSE); -@end example - -Again, you should spend time studying the @command{gawk} internals; -don't just blindly copy this code. -@c ENDOFRANGE gawint - -@node Plugin License -@appendixsubsec Extension Licensing +If the repository has all the generated files, then it's easy to just check +them out and build. (Or @emph{easier}, depending upon how far back we go. +@code{:-)}) -Every dynamic extension should define the global symbol -@code{plugin_is_GPL_compatible} to assert that it has been licensed under -a GPL-compatible license. If this symbol does not exist, @command{gawk} -will emit a fatal error and exit. +And that brings us to the second (and stronger) reason why all the files +really need to be in @command{git}. It boils down to who do you cater +to---the @command{gawk} developer(s), or the user who just wants to check +out a version and try it out? -The declared type of the symbol should be @code{int}. It does not need -to be in any allocated section, though. The code merely asserts that -the symbol exists in the global scope. Something like this is enough: +The @command{gawk} maintainer +wants it to be possible for any interested @command{awk} user in the +world to just clone the repository, check out the branch of interest and +build it. 
Without their having to have the correct version(s) of the +autotools.@footnote{There is one GNU program that is (in our opinion) +severely difficult to bootstrap from the @command{git} repository. For +example, on the author's old (but still working) PowerPC Macintosh with +Mac OS X 10.5, it was necessary to bootstrap a ton of software, starting +with @command{git} itself, in order to try to work with the latest code. +It's not pleasant, and especially on older systems, it's a big waste +of time. -@example -int plugin_is_GPL_compatible; -@end example - -@node Loading Extensions -@appendixsubsec Loading a Dynamic Extension -@cindex loading extension -@cindex @command{gawk}, functions, loading -There are two ways to load a dynamically linked library. The first is to use the -builtin @code{extension()}: +Starting with the latest tarball was no picnic either. The maintainers +had dropped @file{.gz} and @file{.bz2} files and only distribute +@file{.tar.xz} files. It was necessary to bootstrap @command{xz} first!} +That is the point of the @file{bootstrap.sh} file. It touches the +various other files in the right order such that @example -extension(libname, init_func) +# The canonical incantation for building GNU software: +./bootstrap.sh && ./configure && make @end example -where @file{libname} is the library to load, and @samp{init_func} is the -name of the initialization or bootstrap routine to run once loaded. - -The second method for dynamic loading of a library is to use the -command line option @option{-l}: - -@example -$ @kbd{gawk -l libname -f myprog} -@end example - -This will work only if the initialization routine is named @code{dlload()}. - -If you use @code{extension()}, the library will be loaded -at run time. This means that the functions are available only to the rest of -your script. If you use the command line option @option{-l} instead, -the library will be loaded before @command{gawk} starts compiling the -actual program.
The net effect is that you can use those functions -anywhere in the program. - -@command{gawk} has a list of directories where it searches for libraries. -By default, the list includes directories that depend upon how gawk was built -and installed (@pxref{AWKPATH Variable}). If you want @command{gawk} -to look for libraries in your private directory, you have to tell it. -The way to do it is to set the @env{AWKPATH} environment variable -(@pxref{AWKPATH Variable}). -@command{gawk} supplies the default suffix @samp{.so} if it is not -present in the name of the library. -If the name of your library is @file{mylib.so}, you can simply type - -@example -$ @kbd{gawk -l mylib -f myprog} -@end example - -and @command{gawk} will do everything necessary to load in your library, -and then call your @code{dlload()} routine. - -You can always specify the library using an absolute pathname, in which -case @command{gawk} will not use @env{AWKPATH} to search for it. - -@node Sample Library -@appendixsubsec Example: Directory and File Operation Built-ins -@c STARTOFRANGE chdirg -@cindex @code{chdir()} function@comma{} implementing in @command{gawk} -@c STARTOFRANGE statg -@cindex @code{stat()} function@comma{} implementing in @command{gawk} -@c STARTOFRANGE filre -@cindex files, information about@comma{} retrieving -@c STARTOFRANGE dirch -@cindex directories, changing - -Two useful functions that are not in @command{awk} are @code{chdir()} -(so that an @command{awk} program can change its directory) and -@code{stat()} (so that an @command{awk} program can gather information about -a file). -This @value{SECTION} implements these functions for @command{gawk} in an -external extension library. - -@menu -* Internal File Description:: What the new functions will do. -* Internal File Ops:: The code for internal file operations. -* Using Internal File Ops:: How to use an external extension. 
-@end menu - -@node Internal File Description -@appendixsubsubsec Using @code{chdir()} and @code{stat()} - -This @value{SECTION} shows how to use the new functions at the @command{awk} -level once they've been integrated into the running @command{gawk} -interpreter. -Using @code{chdir()} is very straightforward. It takes one argument, -the new directory to change to: - -@example -@dots{} -newdir = "/home/arnold/funstuff" -ret = chdir(newdir) -if (ret < 0) @{ - printf("could not change to %s: %s\n", - newdir, ERRNO) > "/dev/stderr" - exit 1 -@} -@dots{} -@end example - -The return value is negative if the @code{chdir} failed, -and @code{ERRNO} -(@pxref{Built-in Variables}) -is set to a string indicating the error. +@noindent +will @emph{just work}. -Using @code{stat()} is a bit more complicated. -The C @code{stat()} function fills in a structure that has a fair -amount of information. -The right way to model this in @command{awk} is to fill in an associative -array with the appropriate information: +This is extremely important for the @code{master} and +@code{gawk-@var{X}.@var{Y}-stable} branches. -@c broke printf for page breaking -@example -file = "/home/arnold/.profile" -fdata[1] = "x" # force `fdata' to be an array -ret = stat(file, fdata) -if (ret < 0) @{ - printf("could not stat %s: %s\n", - file, ERRNO) > "/dev/stderr" - exit 1 -@} -printf("size of %s is %d bytes\n", file, fdata["size"]) -@end example +Further, the @command{gawk} maintainer would argue that it's also +important for the @command{gawk} developers. When he tried to check out +the @code{xgawk} branch@footnote{A branch created by one of the other +developers that did not include the generated files.} to build it, he +couldn't. (No @file{ltmain.sh} file, and he had no idea how to create it, +and that was not the only problem.) -The @code{stat()} function always clears the data array, even if -the @code{stat()} fails. It fills in the following elements: +He felt @emph{extremely} frustrated. 
With respect to that branch, +the maintainer is no different than Jane User who wants to try to build +@code{gawk-4.0-stable} or @code{master} from the repository. -@table @code -@item "name" -The name of the file that was @code{stat()}'ed. +Thus, the maintainer thinks that it's not just important, but critical, +that for any given branch, the above incantation @emph{just works}. -@item "dev" -@itemx "ino" -The file's device and inode numbers, respectively. +@c So - that's my reasoning and philosophy. -@item "mode" -The file's mode, as a numeric value. This includes both the file's -type and its permissions. +What are some of the consequences and/or actions to take? -@item "nlink" -The number of hard links (directory entries) the file has. - -@item "uid" -@itemx "gid" -The numeric user and group ID numbers of the file's owner. - -@item "size" -The size in bytes of the file. - -@item "blocks" -The number of disk blocks the file actually occupies. This may not -be a function of the file's size if the file has holes. - -@item "atime" -@itemx "mtime" -@itemx "ctime" -The file's last access, modification, and inode update times, -respectively. These are numeric timestamps, suitable for formatting -with @code{strftime()} -(@pxref{Built-in}). +@enumerate 1 +@item +We don't mind that there are differing files in the different branches +as a result of different versions of the autotools. -@item "pmode" -The file's ``printable mode.'' This is a string representation of -the file's type and permissions, such as what is produced by -@samp{ls -l}---for example, @code{"drwxr-xr-x"}. +@enumerate A +@item +It's the maintainer's job to merge them and he will deal with it. -@item "type" -A printable string representation of the file's type. The value -is one of the following: +@item +He is really good at @samp{git diff x y > /tmp/diff1 ; gvim /tmp/diff1} to +remove the diffs that aren't of interest in order to review code. 
@code{:-)} +@end enumerate -@table @code -@item "blockdev" -@itemx "chardev" -The file is a block or character device (``special file''). +@item +It would certainly help if everyone used the same versions of the GNU tools +as he does, which in general are the latest released versions of +@command{automake}, +@command{autoconf}, +@command{bison}, +and +@command{gettext}. @ignore -@item "door" -The file is a Solaris ``door'' (special file used for -interprocess communications). +If it would help if I sent out an "I just upgraded to version x.y +of tool Z" kind of message to this list, I can do that. Up until +now it hasn't been a real issue since I'm the only one who's been +dorking with the configuration machinery. @end ignore -@item "directory" -The file is a directory. - -@item "fifo" -The file is a named-pipe (also known as a FIFO). - -@item "file" -The file is just a regular file. - -@item "socket" -The file is an @code{AF_UNIX} (``Unix domain'') socket in the -filesystem. - -@item "symlink" -The file is a symbolic link. -@end table -@end table - -Several additional elements may be present depending upon the operating -system and the type of the file. You can test for them in your @command{awk} -program by using the @code{in} operator -(@pxref{Reference to Elements}): - -@table @code -@item "blksize" -The preferred block size for I/O to the file. This field is not -present on all POSIX-like systems in the C @code{stat} structure. - -@item "linkval" -If the file is a symbolic link, this element is the name of the -file the link points to (i.e., the value of the link). - -@item "rdev" -@itemx "major" -@itemx "minor" -If the file is a block or character device file, then these values -represent the numeric device number and the major and minor components -of that number, respectively. -@end table - -@node Internal File Ops -@appendixsubsubsec C Code for @code{chdir()} and @code{stat()} - -Here is the C code for these extensions. They were written for -GNU/Linux. 
The code needs some more work for complete portability -to other POSIX-compliant systems:@footnote{This version is edited -slightly for presentation. See -@file{extension/filefuncs.c} in the @command{gawk} distribution -for the complete version.} - -@c break line for page breaking -@example -#include "awk.h" - -#include <sys/sysmacros.h> - -int plugin_is_GPL_compatible; - -/* do_chdir --- provide dynamically loaded chdir() builtin for gawk */ - -static NODE * -do_chdir(int nargs) -@{ - NODE *newdir; - int ret = -1; - - if (do_lint && nargs != 1) - lintwarn("chdir: called with incorrect number of arguments"); - - newdir = get_scalar_argument(0, FALSE); -@end example - -The file includes the @code{"awk.h"} header file for definitions -for the @command{gawk} internals. It includes @code{<sys/sysmacros.h>} -for access to the @code{major()} and @code{minor}() macros. - -@cindex programming conventions, @command{gawk} internals -By convention, for an @command{awk} function @code{foo}, the function that -implements it is called @samp{do_foo}. The function should take -a @samp{int} argument, usually called @code{nargs}, that -represents the number of defined arguments for the function. The @code{newdir} -variable represents the new directory to change to, retrieved -with @code{get_scalar_argument()}. Note that the first argument is -numbered zero. - -This code actually accomplishes the @code{chdir()}. It first forces -the argument to be a string and passes the string value to the -@code{chdir()} system call. If the @code{chdir()} fails, @code{ERRNO} -is updated. - -@example - (void) force_string(newdir); - ret = chdir(newdir->stptr); - if (ret < 0) - update_ERRNO(); -@end example - -Finally, the function returns the return value to the @command{awk} level: - -@example - return make_number((AWKNUM) ret); -@} -@end example - -The @code{stat()} built-in is more involved. 
First comes a function -that turns a numeric mode into a printable representation -(e.g., 644 becomes @samp{-rw-r--r--}). This is omitted here for brevity: +@enumerate A +@item +Installing from source is quite easy. It's how the maintainer worked for years +under Fedora. +He had @file{/usr/local/bin} at the front of his @env{PATH} and just did: -@c break line for page breaking @example -/* format_mode --- turn a stat mode field into something readable */ - -static char * -format_mode(unsigned long fmode) -@{ - @dots{} -@} +wget http://ftp.gnu.org/gnu/@var{package}/@var{package}-@var{x}.@var{y}.@var{z}.tar.gz +tar -xpzvf @var{package}-@var{x}.@var{y}.@var{z}.tar.gz +cd @var{package}-@var{x}.@var{y}.@var{z} +./configure && make && make check +make install # as root @end example -Next comes the @code{do_stat()} function. It starts with -variable declarations and argument checking: +@item +These days the maintainer uses Ubuntu 10.11 which is medium current, but +he is already doing the above for @command{autoconf} and @command{bison}. @ignore -Changed message for page breaking. Used to be: - "stat: called with incorrect number of arguments (%d), should be 2", +(C. Rant: Recent Linux versions with GNOME 3 really suck. What + are all those people thinking? Fedora 15 was such a bust it drove + me to Ubuntu, but Ubuntu 11.04 and 11.10 are totally unusable from + a UI perspective. Bleah.) @end ignore -@example -/* do_stat --- provide a stat() function for gawk */ - -static NODE * -do_stat(int nargs) -@{ - NODE *file, *array, *tmp; - struct stat sbuf; - int ret; - NODE **aptr; - char *pmode; /* printable mode */ - char *type = "unknown"; - - if (do_lint && nargs > 2) - lintwarn("stat: called with too many arguments"); -@end example - -Then comes the actual work. First, the function gets the arguments. -Then, it always clears the array. -The code use @code{lstat()} (instead of @code{stat()}) -to get the file information, -in case the file is a symbolic link.
-If there's an error, it sets @code{ERRNO} and returns: - -@c comment made multiline for page breaking -@example - /* file is first arg, array to hold results is second */ - file = get_scalar_argument(0, FALSE); - array = get_array_argument(1, FALSE); - - /* empty out the array */ - assoc_clear(array); - - /* lstat the file, if error, set ERRNO and return */ - (void) force_string(file); - ret = lstat(file->stptr, & sbuf); - if (ret < 0) @{ - update_ERRNO(); - return make_number((AWKNUM) ret); - @} -@end example - -Now comes the tedious part: filling in the array. Only a few of the -calls are shown here, since they all follow the same pattern: - -@example - /* fill in the array */ - aptr = assoc_lookup(array, tmp = make_string("name", 4)); - *aptr = dupnode(file); - unref(tmp); - - aptr = assoc_lookup(array, tmp = make_string("mode", 4)); - *aptr = make_number((AWKNUM) sbuf.st_mode); - unref(tmp); - - aptr = assoc_lookup(array, tmp = make_string("pmode", 5)); - pmode = format_mode(sbuf.st_mode); - *aptr = make_string(pmode, strlen(pmode)); - unref(tmp); -@end example - -When done, return the @code{lstat()} return value: - -@example - - return make_number((AWKNUM) ret); -@} -@end example - -@cindex programming conventions, @command{gawk} internals -Finally, it's necessary to provide the ``glue'' that loads the -new function(s) into @command{gawk}. By convention, each library has -a routine named @code{dlload()} that does the job: - -@example -/* dlload --- load new builtins in this library */ - -NODE * -dlload(NODE *tree, void *dl) -@{ - make_builtin("chdir", do_chdir, 1); - make_builtin("stat", do_stat, 2); - return make_number((AWKNUM) 0); -@} -@end example +@end enumerate -And that's it! As an exercise, consider adding functions to -implement system calls such as @code{chown()}, @code{chmod()}, -and @code{umask()}. 
+@ignore +@item +If someone still feels really strongly about all this, then perhaps they +can have two branches, one for their development with just the clean +changes, and one that is buildable (xgawk and xgawk-buildable, maybe). +Or, as I suggested in another mail, make commits in pairs, the first with +the "real" changes and the second with "everything else needed for + building". +@end ignore +@end enumerate -@node Using Internal File Ops -@appendixsubsubsec Integrating the Extensions +Most of the above was originally written by the maintainer to other +@command{gawk} developers. It raised the objection from one of +the developers ``@dots{} that anybody pulling down the source from +@command{git} is not an end user.'' -@cindex @command{gawk}, interpreter@comma{} adding code to -Now that the code is written, it must be possible to add it at -runtime to the running @command{gawk} interpreter. First, the -code must be compiled. Assuming that the functions are in -a file named @file{filefuncs.c}, and @var{idir} is the location -of the @command{gawk} include files, -the following steps create -a GNU/Linux shared library: +However, this is not true. There are ``power @command{awk} users'' +who can build @command{gawk} (using the magic incantation shown previously) +but who can't program in C. Thus, the major branches should be +kept buildable all the time. -@example -$ @kbd{gcc -fPIC -shared -DHAVE_CONFIG_H -c -O -g -I@var{idir} filefuncs.c} -$ @kbd{ld -o filefuncs.so -shared filefuncs.o} -@end example +It was then suggested that there be a @command{cron} job to create +nightly tarballs of ``the source.'' Here, the problem is that there +are source trees, corresponding to the various branches! So, +nightly tarballs aren't the answer, especially as the repository can go +for weeks without significant change being introduced.
-@cindex @code{extension()} function (@command{gawk}) -Once the library exists, it is loaded by calling the @code{extension()} -built-in function. -This function takes two arguments: the name of the -library to load and the name of a function to call when the library -is first loaded. This function adds the new functions to @command{gawk}. -It returns the value returned by the initialization function -within the shared library: +Fortunately, the @command{git} server can meet this need. For any given +branch named @var{branchname}, use: @example -# file testff.awk -BEGIN @{ - extension("./filefuncs.so", "dlload") - - chdir(".") # no-op - - data[1] = 1 # force `data' to be an array - print "Info for testff.awk" - ret = stat("testff.awk", data) - print "ret =", ret - for (i in data) - printf "data[\"%s\"] = %s\n", i, data[i] - print "testff.awk modified:", - strftime("%m %d %y %H:%M:%S", data["mtime"]) - - print "\nInfo for JUNK" - ret = stat("JUNK", data) - print "ret =", ret - for (i in data) - printf "data[\"%s\"] = %s\n", i, data[i] - print "JUNK modified:", strftime("%m %d %y %H:%M:%S", data["mtime"]) -@} +wget http://git.savannah.gnu.org/cgit/gawk.git/snapshot/gawk-@var{branchname}.tar.gz @end example -Here are the results of running the program: +@noindent +to retrieve a snapshot of the given branch. 
-@example -$ @kbd{gawk -f testff.awk} -@print{} Info for testff.awk -@print{} ret = 0 -@print{} data["size"] = 607 -@print{} data["ino"] = 14945891 -@print{} data["name"] = testff.awk -@print{} data["pmode"] = -rw-rw-r-- -@print{} data["nlink"] = 1 -@print{} data["atime"] = 1293993369 -@print{} data["mtime"] = 1288520752 -@print{} data["mode"] = 33204 -@print{} data["blksize"] = 4096 -@print{} data["dev"] = 2054 -@print{} data["type"] = file -@print{} data["gid"] = 500 -@print{} data["uid"] = 500 -@print{} data["blocks"] = 8 -@print{} data["ctime"] = 1290113572 -@print{} testff.awk modified: 10 31 10 12:25:52 -@print{} -@print{} Info for JUNK -@print{} ret = -1 -@print{} JUNK modified: 01 01 70 02:00:00 -@end example -@c ENDOFRANGE filre -@c ENDOFRANGE dirch -@c ENDOFRANGE statg -@c ENDOFRANGE chdirg -@c ENDOFRANGE gladfgaw -@c ENDOFRANGE adfugaw -@c ENDOFRANGE fubadgaw @node Future Extensions @appendixsec Probable Future Extensions @@ -30958,66 +34880,37 @@ Arnold Robbins Larry Wall @end quotation -This @value{SECTION} briefly lists extensions and possible improvements -that indicate the directions we are -currently considering for @command{gawk}. The file @file{FUTURES} in the -@command{gawk} distribution lists these extensions as well. - -Following is a list of probable future changes visible at the -@command{awk} language level: - -@c these are ordered by likelihood -@table @asis -@item Loadable module interface -It is not clear that the @command{awk}-level interface to the -modules facility is as good as it should be. The interface needs to be -redesigned, particularly taking namespace issues into account, as -well as possibly including issues such as library search path order -and versioning. - -@item @code{RECLEN} variable for fixed-length records -Along with @code{FIELDWIDTHS}, this would speed up the processing of -fixed-length records. 
-@code{PROCINFO["RS"]} would be @code{"RS"} or @code{"RECLEN"}, -depending upon which kind of record processing is in effect. - -@item Databases -It may be possible to map a GDBM/NDBM/SDBM file into an @command{awk} array. - -@item More @code{lint} warnings -There are more things that could be checked for portability. -@end table - -Following is a list of probable improvements that will make @command{gawk}'s -source code easier to work with: - -@table @asis -@item Loadable module mechanics -The current extension mechanism works -(@pxref{Dynamic Extensions}), -but is rather primitive. It requires a fair amount of manual work -to create and integrate a loadable module. -Nor is the current mechanism as portable as might be desired. -The GNU @command{libtool} package provides a number of features that -would make using loadable modules much easier. -@command{gawk} should be changed to use @command{libtool}. - -@item Loadable module internals -The API to its internals that @command{gawk} ``exports'' should be revised. -Too many things are needlessly exposed. A new API should be designed -and implemented to make module writing easier. - -@item Better array subscript management -@command{gawk}'s management of array subscript storage could use revamping, -so that using the same value to index multiple arrays only -stores one copy of the index value. -@end table - -Finally, -the programs in the test suite could use documenting in this @value{DOCUMENT}. - +The @file{TODO} file in the @command{gawk} Git repository lists possible +future enhancements. Some of these relate to the source code, and others +to possible new features. Please see that file for the list. @xref{Additions}, -if you are interested in tackling any of these projects. +if you are interested in tackling any of the projects listed there. 
+ +@node Implementation Limitations +@appendixsec Some Limitations of the Implementation + +This following table describes limits of @command{gawk} on a Unix-like +system (although it is variable even then). Other systems may have +different limits. + +@c @multitable {Number of file redirections} {min(number of processes per user, number of open files)} +@multitable @columnfractions .40 .60 +@headitem Item @tab Limit +@item Characters in a character class @tab 2^(number of bits per byte) +@item Length of input record @tab @code{MAX_INT } +@item Length of output record @tab Unlimited +@item Length of source line @tab Unlimited +@item Number of fields in a record @tab @code{MAX_LONG} +@item Number of file redirections @tab Unlimited +@item Number of input records in one file @tab @code{MAX_LONG} +@item Number of input records total @tab @code{MAX_LONG} +@item Number of pipe redirections @tab min(number of processes per user, number of open files) +@item Numeric values @tab Double-precision floating point (if not using MPFR) +@item Size of a field @tab @code{MAX_INT } +@item Size of a literal string @tab @code{MAX_INT } +@item Size of a printf string @tab @code{MAX_INT } +@end multitable + @c ENDOFRANGE impis @c ENDOFRANGE gawii @@ -31038,7 +34931,6 @@ other introductory texts that you should refer to instead.) @menu * Basic High Level:: The high level view. * Basic Data Typing:: A very quick intro to data types. -* Floating Point Issues:: Stuff to know about floating-point numbers. @end menu @node Basic High Level @@ -31046,19 +34938,17 @@ other introductory texts that you should refer to instead.) @cindex processing data At the most basic level, the job of a program is to process -some input data and produce results. +some input data and produce results. See @ref{figure-general-flow}. 
-@iftex -@image{general-program} -@end iftex -@ifnottex -@example - _______ -+------+ / \ +---------+ -| Data | -----> < Program > -----> | Results | -+------+ \_______/ +---------+ -@end example -@end ifnottex +@float Figure,figure-general-flow +@caption{General Program Flow} +@ifinfo +@center @image{general-program, , , General program flow, txt} +@end ifinfo +@ifnotinfo +@center @image{general-program, , , General program flow} +@end ifnotinfo +@end float @cindex compiled programs @cindex interpreted programs @@ -31074,26 +34964,18 @@ instructions in your program to process the data. @cindex programming, basic steps When you write a program, it usually consists -of the following, very basic set of steps: +of the following, very basic set of steps, as shown +in @ref{figure-process-flow}: -@iftex -@image{process-flow} -@end iftex -@ifnottex -@example - ______ -+----------------+ / More \ No +----------+ -| Initialization | -------> < Data > -------> | Clean Up | -+----------------+ ^ \ ? / +----------+ - | +--+-+ - | | Yes - | | - | V - | +---------+ - +-----+ Process | - +---------+ -@end example -@end ifnottex +@float Figure,figure-process-flow +@caption{Basic Program Steps} +@ifinfo +@center @image{process-flow, , , Basic Program Stages, txt} +@end ifinfo +@ifnotinfo +@center @image{process-flow, , , Basic Program Stages} +@end ifnotinfo +@end float @table @asis @item Initialization @@ -31189,47 +35071,10 @@ Individual variables, as well as numeric and string variables, are referred to as @dfn{scalar} values. Groups of values, such as arrays, are not scalars. -@cindex integers -@cindex floating-point, numbers -@cindex numbers, floating-point -Within computers, there are two kinds of numeric values: @dfn{integers} -and @dfn{floating-point}. -In school, integer values were referred to as ``whole'' numbers---that is, -numbers without any fractional part, such as 1, 42, or @minus{}17. -The advantage to integer numbers is that they represent values exactly. 
-The disadvantage is that their range is limited. On most systems, -this range is @minus{}2,147,483,648 to 2,147,483,647. -However, many systems now support a range from -@minus{}9,223,372,036,854,775,808 to 9,223,372,036,854,775,807. - -@cindex unsigned integers -@cindex integers, unsigned -Integer values come in two flavors: @dfn{signed} and @dfn{unsigned}. -Signed values may be negative or positive, with the range of values just -described. -Unsigned values are always positive. On most systems, -the range is from 0 to 4,294,967,295. -However, many systems now support a range from -0 to 18,446,744,073,709,551,615. - -@cindex double precision floating-point -@cindex single precision floating-point -Floating-point numbers represent what are called ``real'' numbers; i.e., -those that do have a fractional part, such as 3.1415927. -The advantage to floating-point numbers is that they -can represent a much larger range of values. -The disadvantage is that there are numbers that they cannot represent -exactly. -@command{awk} uses @dfn{double precision} floating-point numbers, which -can hold more digits than @dfn{single precision} -floating-point numbers. -Floating-point issues are discussed more fully in -@ref{Floating Point Issues}. - -At the very lowest level, computers store values as groups of binary digits, -or @dfn{bits}. Modern computers group bits into groups of eight, called @dfn{bytes}. -Advanced applications sometimes have to manipulate bits directly, -and @command{gawk} provides functions for doing so. +@ref{General Arithmetic}, provided a basic introduction to numeric +types (integer and floating-point) and how they are used in a computer. +Please review that information, including a number of caveats that +were presented. @cindex null strings While you are probably used to the idea of a number without a value (i.e., zero), @@ -31253,6 +35098,11 @@ plus 0 times 1, or decimal 10. Octal and hexadecimal are discussed more in @ref{Nondecimal-numbers}. 
+At the very lowest level, computers store values as groups of binary digits, +or @dfn{bits}. Modern computers group bits into groups of eight, called @dfn{bytes}. +Advanced applications sometimes have to manipulate bits directly, +and @command{gawk} provides functions for doing so. + Programs are written in programming languages. Hundreds, if not thousands, of programming languages exist. One of the most popular is the C programming language. @@ -31272,239 +35122,6 @@ standard for C. This standard became an ISO standard in 1990. In 1999, a revised ISO C standard was approved and released. Where it makes sense, POSIX @command{awk} is compatible with 1999 ISO C. -@node Floating Point Issues -@appendixsec Floating-Point Number Caveats - -As mentioned earlier, floating-point numbers represent what are called -``real'' numbers, i.e., those that have a fractional part. @command{awk} -uses double precision floating-point numbers to represent all -numeric values. This @value{SECTION} describes some of the issues -involved in using floating-point numbers. - -There is a very nice -@uref{http://www.validlab.com/goldberg/paper.pdf, paper on floating-point arithmetic} -by David Goldberg, -``What Every Computer Scientist Should Know About Floating-point Arithmetic,'' -@cite{ACM Computing Surveys} @strong{23}, 1 (1991-03), 5-48. -This is worth reading if you are interested in the details, -but it does require a background in computer science. - -@menu -* String Conversion Precision:: The String Value Can Lie. -* Unexpected Results:: Floating Point Numbers Are Not Abstract - Numbers. -* POSIX Floating Point Problems:: Standards Versus Existing Practice. -@end menu - -@node String Conversion Precision -@appendixsubsec The String Value Can Lie - -Internally, @command{awk} keeps both the numeric value -(double precision floating-point) and the string value for a variable. 
-Separately, @command{awk} keeps -track of what type the variable has -(@pxref{Typing and Comparison}), -which plays a role in how variables are used in comparisons. - -It is important to note that the string value for a number may not -reflect the full value (all the digits) that the numeric value -actually contains. -The following program (@file{values.awk}) illustrates this: - -@example -@{ - sum = $1 + $2 - # see it for what it is - printf("sum = %.12g\n", sum) - # use CONVFMT - a = "<" sum ">" - print "a =", a - # use OFMT - print "sum =", sum -@} -@end example - -@noindent -This program shows the full value of the sum of @code{$1} and @code{$2} -using @code{printf}, and then prints the string values obtained -from both automatic conversion (via @code{CONVFMT}) and -from printing (via @code{OFMT}). - -Here is what happens when the program is run: - -@example -$ @kbd{echo 3.654321 1.2345678 | awk -f values.awk} -@print{} sum = 4.8888888 -@print{} a = <4.88889> -@print{} sum = 4.88889 -@end example - -This makes it clear that the full numeric value is different from -what the default string representations show. - -@code{CONVFMT}'s default value is @code{"%.6g"}, which yields a value with -at least six significant digits. For some applications, you might want to -change it to specify more precision. -On most modern machines, most of the time, -17 digits is enough to capture a floating-point number's -value exactly.@footnote{Pathological cases can require up to -752 digits (!), but we doubt that you need to worry about this.} - -@node Unexpected Results -@appendixsubsec Floating Point Numbers Are Not Abstract Numbers - -@cindex floating-point, numbers -Unlike numbers in the abstract sense (such as what you studied in high school -or college math), numbers stored in computers are limited in certain ways. -They cannot represent an infinite number of digits, nor can they always -represent things exactly. 
-In particular, -floating-point numbers cannot -always represent values exactly. Here is an example: - -@example -$ @kbd{awk '@{ printf("%010d\n", $1 * 100) @}'} -515.79 -@print{} 0000051579 -515.80 -@print{} 0000051579 -515.81 -@print{} 0000051580 -515.82 -@print{} 0000051582 -@kbd{@value{CTL}-d} -@end example - -@noindent -This shows that some values can be represented exactly, -whereas others are only approximated. This is not a ``bug'' -in @command{awk}, but simply an artifact of how computers -represent numbers. - -@cindex negative zero -@cindex positive zero -@cindex zero@comma{} negative vs.@: positive -Another peculiarity of floating-point numbers on modern systems -is that they often have more than one representation for the number zero! -In particular, it is possible to represent ``minus zero'' as well as -regular, or ``positive'' zero. - -This example shows that negative and positive zero are distinct values -when stored internally, but that they are in fact equal to each other, -as well as to ``regular'' zero: - -@example -$ @kbd{gawk 'BEGIN @{ mz = -0 ; pz = 0} -> @kbd{printf "-0 = %g, +0 = %g, (-0 == +0) -> %d\n", mz, pz, mz == pz} -> @kbd{printf "mz == 0 -> %d, pz == 0 -> %d\n", mz == 0, pz == 0} -> @kbd{@}'} -@print{} -0 = -0, +0 = 0, (-0 == +0) -> 1 -@print{} mz == 0 -> 1, pz == 0 -> 1 -@end example - -It helps to keep this in mind should you process numeric data -that contains negative zero values; the fact that the zero is negative -is noted and can affect comparisons. - -@node POSIX Floating Point Problems -@appendixsubsec Standards Versus Existing Practice - -Historically, @command{awk} has converted any non-numeric looking string -to the numeric value zero, when required. Furthermore, the original -definition of the language and the original POSIX standards specified that -@command{awk} only understands decimal numbers (base 10), and not octal -(base 8) or hexadecimal numbers (base 16). 
- -Changes in the language of the -2001 and 2004 POSIX standard can be interpreted to imply that @command{awk} -should support additional features. These features are: - -@itemize @bullet -@item -Interpretation of floating point data values specified in hexadecimal -notation (@samp{0xDEADBEEF}). (Note: data values, @emph{not} -source code constants.) - -@item -Support for the special IEEE 754 floating point values ``Not A Number'' -(NaN), positive Infinity (``inf'') and negative Infinity (``@minus{}inf''). -In particular, the format for these values is as specified by the ISO 1999 -C standard, which ignores case and can allow machine-dependent additional -characters after the @samp{nan} and allow either @samp{inf} or @samp{infinity}. -@end itemize - -The first problem is that both of these are clear changes to historical -practice: - -@itemize @bullet -@item -The @command{gawk} maintainer feels that supporting hexadecimal floating -point values, in particular, is ugly, and was never intended by the -original designers to be part of the language. - -@item -Allowing completely alphabetic strings to have valid numeric -values is also a very severe departure from historical practice. -@end itemize - -The second problem is that the @code{gawk} maintainer feels that this -interpretation of the standard, which requires a certain amount of -``language lawyering'' to arrive at in the first place, was not even -intended by the standard developers. In other words, ``we see how you -got where you are, but we don't think that that's where you want to be.'' - -The 2008 POSIX standard added explicit wording to allow, but not require, -that @command{awk} support hexadecimal floating point values and -special values for ``Not A Number'' and infinity. 
- -Although the @command{gawk} maintainer continues to feel that -providing those features is inadvisable, -nevertheless, on systems that support IEEE floating point, it seems -reasonable to provide @emph{some} way to support NaN and Infinity values. -The solution implemented in @command{gawk} is as follows: - -@itemize @bullet -@item -With the @option{--posix} command-line option, @command{gawk} becomes -``hands off.'' String values are passed directly to the system library's -@code{strtod()} function, and if it successfully returns a numeric value, -that is what's used.@footnote{You asked for it, you got it.} -By definition, the results are not portable across -different systems. They are also a little surprising: - -@example -$ @kbd{echo nanny | gawk --posix '@{ print $1 + 0 @}'} -@print{} nan -$ @kbd{echo 0xDeadBeef | gawk --posix '@{ print $1 + 0 @}'} -@print{} 3735928559 -@end example - -@item -Without @option{--posix}, @command{gawk} interprets the four strings -@samp{+inf}, -@samp{-inf}, -@samp{+nan}, -and -@samp{-nan} -specially, producing the corresponding special numeric values. -The leading sign acts a signal to @command{gawk} (and the user) -that the value is really numeric. Hexadecimal floating point is -not supported (unless you also use @option{--non-decimal-data}, -which is @emph{not} recommended). For example: - -@example -$ @kbd{echo nanny | gawk '@{ print $1 + 0 @}'} -@print{} 0 -$ @kbd{echo +nan | gawk '@{ print $1 + 0 @}'} -@print{} nan -$ @kbd{echo 0xDeadBeef | gawk '@{ print $1 + 0 @}'} -@print{} 0 -@end example - -@command{gawk} does ignore case in the four special values. -Thus @samp{+nan} and @samp{+NaN} are the same. -@end itemize - @c ENDOFRANGE procon @node Glossary @@ -31713,6 +35330,50 @@ It was written in @command{awk} by Brian Kernighan and Jon Bentley, and is available from @uref{http://netlib.sandia.gov/netlib/typesetting/chem.gz}. 
+@cindex cookie +@item Cookie +A peculiar goodie, token, saying or remembrance +produced by or presented to a program. (With thanks to Doug McIlroy.) +@ignore +From: Doug McIlroy <doug@cs.dartmouth.edu> +Date: Sat, 13 Oct 2012 19:55:25 -0400 +To: arnold@skeeve.com +Subject: Re: origin of the term "cookie"? + +I believe the term "cookie", for a more or less inscrutable +saying or crumb of information, was injected into Unix +jargon by Bob Morris, who used the word quite frequently. +It had no fixed meaning as it now does in browsers. + +The word had been around long before it was recognized in +the 8th edition glossary (earlier editions had no glossary): + +cookie a peculiar goodie, token, saying or remembrance +returned by or presented to a program. [I would say that +"returned by" would better read "produced by", and assume +responsibility for the inexactitude.] + +Doug McIlroy + +From: Doug McIlroy <doug@cs.dartmouth.edu> +Date: Sun, 14 Oct 2012 10:08:43 -0400 +To: arnold@skeeve.com +Subject: Re: origin of the term "cookie"? + +> Can I forward your email to Eric Raymond, for possible addition to the +> Jargon File? + +Sure. I might add that I don't know how "cookie" entered Morris's +vocabulary. Certainly "values of beta give rise to dom!" (see google) +was an early, if not the earliest Unix cookie. The fact that it was +found lying around on a model 37 teletype (which had Greek beta in +its type box) suggests that maybe it was seen to be like milk and +cookies laid out for Santa Claus. Morris was wont to make such +connections. + +Doug +@end ignore + @item Coprocess A subordinate program with which two-way communications is possible. @@ -31955,12 +35616,15 @@ in @command{awk} programs. @cindex ISO @item ISO -The International Standards Organization. +The International Organization for Standardization. This organization produces international standards for many things, including programming languages, such as C and C++. 
In the computer arena, important standards like those for C, C++, and POSIX become both American national and ISO international standards simultaneously. This @value{DOCUMENT} refers to Standard C as ``ISO C'' throughout. +See @uref{http://www.iso.org/iso/home/about.htm, the ISO website} for more +information about the name of the organization and its language-independent +three-letter acronym. @cindex Java programming language @cindex Programming languages, Java @@ -33494,9 +37158,6 @@ Unresolved Issues: of how to use them. It would be useful to perhaps have a "programming style" section of the manual that would include this and other tips. -2. The default AWKPATH search path should be configurable via `configure' - The default and how this changes needs to be documented. - Consistency issues: /.../ regexps are in @code, not @samp ".." strings are in @code, not @samp @@ -33591,14 +37252,7 @@ ORA uses filename, thus the macro. Suggestions: ------------ -Enhance FIELDWIDTHS with some way to indicate "the rest of the record". -E.g., a length of 0 or -1 or something. May be "n"? - -Make FIELDWIDTHS be an array? - % Next edition: -% 1. Talk about common extensions, those in nawk, gawk, mawk -% 2. Use @code{foo} for variables and @code{foo()} for functions -% 3. Standardize the error messages from the functions and programs -% in Chapters 12 and 13. -% 4. Nuke the BBS stuff and use something that won't be obsolete +% 1. Standardize the error messages from the functions and programs +% in the two sample code chapters. +% 2. Nuke the BBS stuff and use something that won't be obsolete