From 34b9e9e666c79e4c42a59d0b7b7584a0620295f0 Mon Sep 17 00:00:00 2001
From: "Arnold D. Robbins"
Date: Tue, 16 Apr 2013 10:50:46 +0300
Subject: Largely done with doc cleanup.

---
 doc/gawk.info | 2352 ++++++++++++++++++++++++++++-----------------------------
 1 file changed, 1176 insertions(+), 1176 deletions(-)

(limited to 'doc/gawk.info')

diff --git a/doc/gawk.info b/doc/gawk.info
index 9271620b..2d1ee6c3 100644
--- a/doc/gawk.info
+++ b/doc/gawk.info
@@ -90,10 +90,10 @@ texts being (a) (see below), and with the Back-Cover Texts being (b)
 * Library Functions:: A Library of `awk' Functions.
 * Sample Programs:: Many `awk' programs with complete
 explanations.
-* Internationalization:: Getting `gawk' to speak your
- language.
 * Advanced Features:: Stuff for advanced users, specific to
 `gawk'.
+* Internationalization:: Getting `gawk' to speak your
+ language.
 * Debugger:: The `gawk' debugger.
 * Arbitrary Precision Arithmetic:: Arbitrary precision arithmetic with
 `gawk'.
@@ -997,14 +997,14 @@ problems.
 Part III focuses on features specific to `gawk'. It contains the
 following chapters:
 
- *note Internationalization::, describes special features in `gawk'
-for translating program messages into different languages at runtime.
-
 *note Advanced Features::, describes a number of `gawk'-specific
 advanced features. Of particular note are the abilities to have two-way
 communications with another process, perform TCP/IP networking, and
 profile your `awk' programs.
 
+ *note Internationalization::, describes special features in `gawk'
+for translating program messages into different languages at runtime.
+
 *note Debugger::, describes the `awk' debugger.
 
 *note Arbitrary Precision Arithmetic::, describes advanced
@@ -15242,7 +15242,7 @@ user-defined function that expects to receive and index and a value, and
 then processes the element.
 
-File: gawk.info, Node: Sample Programs, Next: Internationalization, Prev: Library Functions, Up: Top +File: gawk.info, Node: Sample Programs, Next: Advanced Features, Prev: Library Functions, Up: Top 11 Practical `awk' Programs *************************** @@ -17861,1449 +17861,1449 @@ supplies the following copyright terms: We leave it to you to determine what the program does.  -File: gawk.info, Node: Internationalization, Next: Advanced Features, Prev: Sample Programs, Up: Top +File: gawk.info, Node: Advanced Features, Next: Internationalization, Prev: Sample Programs, Up: Top -12 Internationalization with `gawk' -*********************************** +12 Advanced Features of `gawk' +****************************** -Once upon a time, computer makers wrote software that worked only in -English. Eventually, hardware and software vendors noticed that if -their systems worked in the native languages of non-English-speaking -countries, they were able to sell more systems. As a result, -internationalization and localization of programs and software systems -became a common practice. + Write documentation as if whoever reads it is a violent psychopath + who knows where you live. + Steve English, as quoted by Peter Langston - For many years, the ability to provide internationalization was -largely restricted to programs written in C and C++. This major node -describes the underlying library `gawk' uses for internationalization, -as well as how `gawk' makes internationalization features available at -the `awk' program level. Having internationalization available at the -`awk' level gives software developers additional flexibility--they are -no longer forced to write in C or C++ when internationalization is a -requirement. + This major node discusses advanced features in `gawk'. It's a bit +of a "grab bag" of items that are otherwise unrelated to each other. 
+First, a command-line option allows `gawk' to recognize nondecimal +numbers in input data, not just in `awk' programs. Then, `gawk''s +special features for sorting arrays are presented. Next, two-way I/O, +discussed briefly in earlier parts of this Info file, is described in +full detail, along with the basics of TCP/IP networking. Finally, +`gawk' can "profile" an `awk' program, making it possible to tune it +for performance. -* Menu: + A number of advanced features require separate major nodes of their +own: -* I18N and L10N:: Internationalization and Localization. -* Explaining gettext:: How GNU `gettext' works. -* Programmer i18n:: Features for the programmer. -* Translator i18n:: Features for the translator. -* I18N Example:: A simple i18n example. -* Gawk I18N:: `gawk' is also internationalized. + * *note Internationalization::, discusses how to internationalize + your `awk' programs, so that they can speak multiple national + languages. - -File: gawk.info, Node: I18N and L10N, Next: Explaining gettext, Up: Internationalization + * *note Debugger::, describes `gawk''s built-in command-line + debugger for debugging `awk' programs. -12.1 Internationalization and Localization -========================================== + * *note Arbitrary Precision Arithmetic::, describes how you can use + `gawk' to perform arbitrary-precision arithmetic. -"Internationalization" means writing (or modifying) a program once, in -such a way that it can use multiple languages without requiring further -source-code changes. "Localization" means providing the data necessary -for an internationalized program to work in a particular language. -Most typically, these terms refer to features such as the language used -for printing error messages, the language used to read responses, and -information related to how numerical and monetary values are printed -and read. + * *note Dynamic Extensions::, discusses the ability to dynamically + add new built-in functions to `gawk'. 
- -File: gawk.info, Node: Explaining gettext, Next: Programmer i18n, Prev: I18N and L10N, Up: Internationalization +* Menu: -12.2 GNU `gettext' -================== +* Nondecimal Data:: Allowing nondecimal input data. +* Array Sorting:: Facilities for controlling array traversal and + sorting arrays. +* Two-way I/O:: Two-way communications with another process. +* TCP/IP Networking:: Using `gawk' for network programming. +* Profiling:: Profiling your `awk' programs. -The facilities in GNU `gettext' focus on messages; strings printed by a -program, either directly or via formatting with `printf' or -`sprintf()'.(1) + +File: gawk.info, Node: Nondecimal Data, Next: Array Sorting, Up: Advanced Features - When using GNU `gettext', each application has its own "text -domain". This is a unique name, such as `kpilot' or `gawk', that -identifies the application. A complete application may have multiple -components--programs written in C or C++, as well as scripts written in -`sh' or `awk'. All of the components use the same text domain. +12.1 Allowing Nondecimal Input Data +=================================== - To make the discussion concrete, assume we're writing an application -named `guide'. Internationalization consists of the following steps, -in this order: +If you run `gawk' with the `--non-decimal-data' option, you can have +nondecimal constants in your input data: - 1. The programmer goes through the source for all of `guide''s - components and marks each string that is a candidate for - translation. For example, `"`-F': option required"' is a good - candidate for translation. A table with strings of option names - is not (e.g., `gawk''s `--profile' option should remain the same, - no matter what the local language). + $ echo 0123 123 0x123 | + > gawk --non-decimal-data '{ printf "%d, %d, %d\n", + > $1, $2, $3 }' + -| 83, 123, 291 - 2. 
The programmer indicates the application's text domain (`"guide"') - to the `gettext' library, by calling the `textdomain()' function. + For this feature to work, write your program so that `gawk' treats +your data as numeric: - 3. Messages from the application are extracted from the source code - and collected into a portable object template file (`guide.pot'), - which lists the strings and their translations. The translations - are initially empty. The original (usually English) messages - serve as the key for lookup of the translations. + $ echo 0123 123 0x123 | gawk '{ print $1, $2, $3 }' + -| 0123 123 0x123 - 4. For each language with a translator, `guide.pot' is copied to a - portable object file (`.po') and translations are created and - shipped with the application. For example, there might be a - `fr.po' for a French translation. +The `print' statement treats its expressions as strings. Although the +fields can act as numbers when necessary, they are still strings, so +`print' does not try to treat them numerically. You may need to add +zero to a field to force it to be treated as a number. For example: - 5. Each language's `.po' file is converted into a binary message - object (`.mo') file. A message object file contains the original - messages and their translations in a binary format that allows - fast lookup of translations at runtime. + $ echo 0123 123 0x123 | gawk --non-decimal-data ' + > { print $1, $2, $3 + > print $1 + 0, $2 + 0, $3 + 0 }' + -| 0123 123 0x123 + -| 83 123 291 - 6. When `guide' is built and installed, the binary translation files - are installed in a standard place. + Because it is common to have decimal data with leading zeros, and +because using this facility could lead to surprising results, the +default is to leave it disabled. If you want it, you must explicitly +request it. - 7. 
For testing and development, it is possible to tell `gettext' to - use `.mo' files in a different directory than the standard one by - using the `bindtextdomain()' function. + CAUTION: _Use of this option is not recommended._ It can break old + programs very badly. Instead, use the `strtonum()' function to + convert your data (*note Nondecimal-numbers::). This makes your + programs easier to write and easier to read, and leads to less + surprising results. - 8. At runtime, `guide' looks up each string via a call to - `gettext()'. The returned string is the translated string if - available, or the original string if not. + +File: gawk.info, Node: Array Sorting, Next: Two-way I/O, Prev: Nondecimal Data, Up: Advanced Features - 9. If necessary, it is possible to access messages from a different - text domain than the one belonging to the application, without - having to switch the application's default text domain back and - forth. +12.2 Controlling Array Traversal and Array Sorting +================================================== - In C (or C++), the string marking and dynamic translation lookup are -accomplished by wrapping each string in a call to `gettext()': +`gawk' lets you control the order in which a `for (i in array)' loop +traverses an array. - printf("%s", gettext("Don't Panic!\n")); + In addition, two built-in functions, `asort()' and `asorti()', let +you sort arrays based on the array values and indices, respectively. +These two functions also provide control over the sorting criteria used +to order the elements during sorting. - The tools that extract messages from source code pull out all -strings enclosed in calls to `gettext()'. +* Menu: - The GNU `gettext' developers, recognizing that typing `gettext(...)' -over and over again is both painful and ugly to look at, use the macro -`_' (an underscore) to make things easier: +* Controlling Array Traversal:: How to use PROCINFO["sorted_in"]. 
+* Array Sorting Functions:: How to use `asort()' and `asorti()'. - /* In the standard header file: */ - #define _(str) gettext(str) + +File: gawk.info, Node: Controlling Array Traversal, Next: Array Sorting Functions, Up: Array Sorting - /* In the program text: */ - printf("%s", _("Don't Panic!\n")); +12.2.1 Controlling Array Traversal +---------------------------------- -This reduces the typing overhead to just three extra characters per -string and is considerably easier to read as well. +By default, the order in which a `for (i in array)' loop scans an array +is not defined; it is generally based upon the internal implementation +of arrays inside `awk'. - There are locale "categories" for different types of locale-related -information. The defined locale categories that `gettext' knows about -are: + Often, though, it is desirable to be able to loop over the elements +in a particular order that you, the programmer, choose. `gawk' lets +you do this. -`LC_MESSAGES' - Text messages. This is the default category for `gettext' - operations, but it is possible to supply a different one - explicitly, if necessary. (It is almost never necessary to supply - a different category.) + *note Controlling Scanning::, describes how you can assign special, +pre-defined values to `PROCINFO["sorted_in"]' in order to control the +order in which `gawk' will traverse an array during a `for' loop. -`LC_COLLATE' - Text-collation information; i.e., how different characters and/or - groups of characters sort in a given language. + In addition, the value of `PROCINFO["sorted_in"]' can be a function +name. This lets you traverse an array based on any custom criterion. +The array elements are ordered according to the return value of this +function. The comparison function should be defined with at least four +arguments: -`LC_CTYPE' - Character-type information (alphabetic, digit, upper- or - lowercase, and so on). 
This information is accessed via the POSIX - character classes in regular expressions, such as `/[[:alnum:]]/' - (*note Regexp Operators::). + function comp_func(i1, v1, i2, v2) + { + COMPARE ELEMENTS 1 AND 2 IN SOME FASHION + RETURN < 0; 0; OR > 0 + } -`LC_MONETARY' - Monetary information, such as the currency symbol, and whether the - symbol goes before or after a number. + Here, I1 and I2 are the indices, and V1 and V2 are the corresponding +values of the two elements being compared. Either V1 or V2, or both, +can be arrays if the array being traversed contains subarrays as values. +(*Note Arrays of Arrays::, for more information about subarrays.) The +three possible return values are interpreted as follows: -`LC_NUMERIC' - Numeric information, such as which characters to use for the - decimal point and the thousands separator.(2) +`comp_func(i1, v1, i2, v2) < 0' + Index I1 comes before index I2 during loop traversal. -`LC_RESPONSE' - Response information, such as how "yes" and "no" appear in the - local language, and possibly other information as well. +`comp_func(i1, v1, i2, v2) == 0' + Indices I1 and I2 come together but the relative order with + respect to each other is undefined. -`LC_TIME' - Time- and date-related information, such as 12- or 24-hour clock, - month printed before or after the day in a date, local month - abbreviations, and so on. +`comp_func(i1, v1, i2, v2) > 0' + Index I1 comes after index I2 during loop traversal. -`LC_ALL' - All of the above. (Not too useful in the context of `gettext'.) + Our first comparison function can be used to scan an array in +numerical order of the indices: - ---------- Footnotes ---------- + function cmp_num_idx(i1, v1, i2, v2) + { + # numerical index comparison, ascending order + return (i1 - i2) + } - (1) For some operating systems, the `gawk' port doesn't support GNU -`gettext'. Therefore, these features are not available if you are -using one of those operating systems. Sorry. 
+ Our second function traverses an array based on the string order of +the element values rather than by indices: - (2) Americans use a comma every three decimal places and a period -for the decimal point, while many Europeans do exactly the opposite: -1,234.56 versus 1.234,56. + function cmp_str_val(i1, v1, i2, v2) + { + # string value comparison, ascending order + v1 = v1 "" + v2 = v2 "" + if (v1 < v2) + return -1 + return (v1 != v2) + } - -File: gawk.info, Node: Programmer i18n, Next: Translator i18n, Prev: Explaining gettext, Up: Internationalization + The third comparison function makes all numbers, and numeric strings +without any leading or trailing spaces, come out first during loop +traversal: -12.3 Internationalizing `awk' Programs -====================================== + function cmp_num_str_val(i1, v1, i2, v2, n1, n2) + { + # numbers before string value comparison, ascending order + n1 = v1 + 0 + n2 = v2 + 0 + if (n1 == v1) + return (n2 == v2) ? (n1 - n2) : -1 + else if (n2 == v2) + return 1 + return (v1 < v2) ? -1 : (v1 != v2) + } -`gawk' provides the following variables and functions for -internationalization: + Here is a main program to demonstrate how `gawk' behaves using each +of the previous functions: -`TEXTDOMAIN' - This variable indicates the application's text domain. For - compatibility with GNU `gettext', the default value is - `"messages"'. + BEGIN { + data["one"] = 10 + data["two"] = 20 + data[10] = "one" + data[100] = 100 + data[20] = "two" -`_"your message here"' - String constants marked with a leading underscore are candidates - for translation at runtime. String constants without a leading - underscore are not translated. 
+ f[1] = "cmp_num_idx" + f[2] = "cmp_str_val" + f[3] = "cmp_num_str_val" + for (i = 1; i <= 3; i++) { + printf("Sort function: %s\n", f[i]) + PROCINFO["sorted_in"] = f[i] + for (j in data) + printf("\tdata[%s] = %s\n", j, data[j]) + print "" + } + } -`dcgettext(STRING [, DOMAIN [, CATEGORY]])' - Return the translation of STRING in text domain DOMAIN for locale - category CATEGORY. The default value for DOMAIN is the current - value of `TEXTDOMAIN'. The default value for CATEGORY is - `"LC_MESSAGES"'. + Here are the results when the program is run: - If you supply a value for CATEGORY, it must be a string equal to - one of the known locale categories described in *note Explaining - gettext::. You must also supply a text domain. Use `TEXTDOMAIN' - if you want to use the current domain. + $ gawk -f compdemo.awk + -| Sort function: cmp_num_idx Sort by numeric index + -| data[two] = 20 + -| data[one] = 10 Both strings are numerically zero + -| data[10] = one + -| data[20] = two + -| data[100] = 100 + -| + -| Sort function: cmp_str_val Sort by element values as strings + -| data[one] = 10 + -| data[100] = 100 String 100 is less than string 20 + -| data[two] = 20 + -| data[10] = one + -| data[20] = two + -| + -| Sort function: cmp_num_str_val Sort all numeric values before all strings + -| data[one] = 10 + -| data[two] = 20 + -| data[100] = 100 + -| data[10] = one + -| data[20] = two - CAUTION: The order of arguments to the `awk' version of the - `dcgettext()' function is purposely different from the order - for the C version. The `awk' version's order was chosen to - be simple and to allow for reasonable `awk'-style default - arguments. + Consider sorting the entries of a GNU/Linux system password file +according to login name. 
The following program sorts records by a +specific field position and can be used for this purpose: -`dcngettext(STRING1, STRING2, NUMBER [, DOMAIN [, CATEGORY]])' - Return the plural form used for NUMBER of the translation of - STRING1 and STRING2 in text domain DOMAIN for locale category - CATEGORY. STRING1 is the English singular variant of a message, - and STRING2 the English plural variant of the same message. The - default value for DOMAIN is the current value of `TEXTDOMAIN'. - The default value for CATEGORY is `"LC_MESSAGES"'. + # sort.awk --- simple program to sort by field position + # field position is specified by the global variable POS - The same remarks about argument order as for the `dcgettext()' - function apply. + function cmp_field(i1, v1, i2, v2) + { + # comparison by value, as string, and ascending order + return v1[POS] < v2[POS] ? -1 : (v1[POS] != v2[POS]) + } -`bindtextdomain(DIRECTORY [, DOMAIN])' - Change the directory in which `gettext' looks for `.mo' files, in - case they will not or cannot be placed in the standard locations - (e.g., during testing). Return the directory in which DOMAIN is - "bound." + { + for (i = 1; i <= NF; i++) + a[NR][i] = $i + } - The default DOMAIN is the value of `TEXTDOMAIN'. If DIRECTORY is - the null string (`""'), then `bindtextdomain()' returns the - current binding for the given DOMAIN. + END { + PROCINFO["sorted_in"] = "cmp_field" + if (POS < 1 || POS > NF) + POS = 1 + for (i in a) { + for (j = 1; j <= NF; j++) + printf("%s%c", a[i][j], j < NF ? ":" : "") + print "" + } + } - To use these facilities in your `awk' program, follow the steps -outlined in *note Explaining gettext::, like so: + The first field in each entry of the password file is the user's +login name, and the fields are separated by colons. Each record +defines a subarray, with each field as an element in the subarray. +Running the program produces the following output: - 1. 
Set the variable `TEXTDOMAIN' to the text domain of your program. - This is best done in a `BEGIN' rule (*note BEGIN/END::), or it can - also be done via the `-v' command-line option (*note Options::): + $ gawk -v POS=1 -F: -f sort.awk /etc/passwd + -| adm:x:3:4:adm:/var/adm:/sbin/nologin + -| apache:x:48:48:Apache:/var/www:/sbin/nologin + -| avahi:x:70:70:Avahi daemon:/:/sbin/nologin + ... - BEGIN { - TEXTDOMAIN = "guide" - ... - } + The comparison should normally always return the same value when +given a specific pair of array elements as its arguments. If +inconsistent results are returned then the order is undefined. This +behavior can be exploited to introduce random order into otherwise +seemingly ordered data: - 2. Mark all translatable strings with a leading underscore (`_') - character. It _must_ be adjacent to the opening quote of the - string. For example: + function cmp_randomize(i1, v1, i2, v2) + { + # random order + return (2 - 4 * rand()) + } - print _"hello, world" - x = _"you goofed" - printf(_"Number of users is %d\n", nusers) + As mentioned above, the order of the indices is arbitrary if two +elements compare equal. This is usually not a problem, but letting the +tied elements come out in arbitrary order can be an issue, especially +when comparing item values. The partial ordering of the equal elements +may change during the next loop traversal, if other elements are added +or removed from the array. One way to resolve ties when comparing +elements with otherwise equal values is to include the indices in the +comparison rules. Note that doing this may make the loop traversal +less efficient, so consider it only if necessary. The following +comparison functions force a deterministic order, and are based on the +fact that the indices of two elements are never equal: - 3. 
If you are creating strings dynamically, you can still translate - them, using the `dcgettext()' built-in function: + function cmp_numeric(i1, v1, i2, v2) + { + # numerical value (and index) comparison, descending order + return (v1 != v2) ? (v2 - v1) : (i2 - i1) + } - message = nusers " users logged in" - message = dcgettext(message, "adminprog") - print message + function cmp_string(i1, v1, i2, v2) + { + # string value (and index) comparison, descending order + v1 = v1 i1 + v2 = v2 i2 + return (v1 > v2) ? -1 : (v1 != v2) + } - Here, the call to `dcgettext()' supplies a different text domain - (`"adminprog"') in which to find the message, but it uses the - default `"LC_MESSAGES"' category. + A custom comparison function can often simplify ordered loop +traversal, and the sky is really the limit when it comes to designing +such a function. - 4. During development, you might want to put the `.mo' file in a - private directory for testing. This is done with the - `bindtextdomain()' built-in function: + When string comparisons are made during a sort, either for element +values where one or both aren't numbers, or for element indices handled +as strings, the value of `IGNORECASE' (*note Built-in Variables::) +controls whether the comparisons treat corresponding uppercase and +lowercase letters as equivalent or distinct. - BEGIN { - TEXTDOMAIN = "guide" # our text domain - if (Testing) { - # where to find our files - bindtextdomain("testdir") - # joe is in charge of adminprog - bindtextdomain("../joe/testdir", "adminprog") - } - ... - } + Another point to keep in mind is that in the case of subarrays the +element values can themselves be arrays; a production comparison +function should use the `isarray()' function (*note Type Functions::), +to check for this, and choose a defined sorting order for subarrays. + All sorting based on `PROCINFO["sorted_in"]' is disabled in POSIX +mode, since the `PROCINFO' array is not special in that case. 
- *Note I18N Example::, for an example program showing the steps to -create and use translations from `awk'. + As a side note, sorting the array indices before traversing the +array has been reported to add 15% to 20% overhead to the execution +time of `awk' programs. For this reason, sorted array traversal is not +the default.  -File: gawk.info, Node: Translator i18n, Next: I18N Example, Prev: Programmer i18n, Up: Internationalization - -12.4 Translating `awk' Programs -=============================== +File: gawk.info, Node: Array Sorting Functions, Prev: Controlling Array Traversal, Up: Array Sorting -Once a program's translatable strings have been marked, they must be -extracted to create the initial `.po' file. As part of translation, it -is often helpful to rearrange the order in which arguments to `printf' -are output. +12.2.2 Sorting Array Values and Indices with `gawk' +--------------------------------------------------- - `gawk''s `--gen-pot' command-line option extracts the messages and -is discussed next. After that, `printf''s ability to rearrange the -order for `printf' arguments at runtime is covered. +In most `awk' implementations, sorting an array requires writing a +`sort()' function. While this can be educational for exploring +different sorting algorithms, usually that's not the point of the +program. `gawk' provides the built-in `asort()' and `asorti()' +functions (*note String Functions::) for sorting arrays. For example: -* Menu: + POPULATE THE ARRAY data + n = asort(data) + for (i = 1; i <= n; i++) + DO SOMETHING WITH data[i] -* String Extraction:: Extracting marked strings. -* Printf Ordering:: Rearranging `printf' arguments. -* I18N Portability:: `awk'-level portability issues. + After the call to `asort()', the array `data' is indexed from 1 to +some number N, the total number of elements in `data'. (This count is +`asort()''s return value.) `data[1]' <= `data[2]' <= `data[3]', and so +on. 
The comparison is based on the type of the elements (*note Typing +and Comparison::). All numeric values come before all string values, +which in turn come before all subarrays. - -File: gawk.info, Node: String Extraction, Next: Printf Ordering, Up: Translator i18n + An important side effect of calling `asort()' is that _the array's +original indices are irrevocably lost_. As this isn't always +desirable, `asort()' accepts a second argument: -12.4.1 Extracting Marked Strings --------------------------------- + POPULATE THE ARRAY source + n = asort(source, dest) + for (i = 1; i <= n; i++) + DO SOMETHING WITH dest[i] -Once your `awk' program is working, and all the strings have been -marked and you've set (and perhaps bound) the text domain, it is time -to produce translations. First, use the `--gen-pot' command-line -option to create the initial `.pot' file: + In this case, `gawk' copies the `source' array into the `dest' array +and then sorts `dest', destroying its indices. However, the `source' +array is not affected. - $ gawk --gen-pot -f guide.awk > guide.pot + `asort()' accepts a third string argument to control comparison of +array elements. As with `PROCINFO["sorted_in"]', this argument may be +one of the predefined names that `gawk' provides (*note Controlling +Scanning::), or the name of a user-defined function (*note Controlling +Array Traversal::). - When run with `--gen-pot', `gawk' does not execute your program. -Instead, it parses it as usual and prints all marked strings to -standard output in the format of a GNU `gettext' Portable Object file. -Also included in the output are any constant strings that appear as the -first argument to `dcgettext()' or as the first and second argument to -`dcngettext()'.(1) *Note I18N Example::, for the full list of steps to -go through to create and test translations for `guide'. + NOTE: In all cases, the sorted element values consist of the + original array's element values. 
The ability to control + comparison merely affects the way in which they are sorted. - ---------- Footnotes ---------- + Often, what's needed is to sort on the values of the _indices_ +instead of the values of the elements. To do that, use the `asorti()' +function. The interface is identical to that of `asort()', except that +the index values are used for sorting, and become the values of the +result array: - (1) The `xgettext' utility that comes with GNU `gettext' can handle -`.awk' files. + { source[$0] = some_func($0) } - -File: gawk.info, Node: Printf Ordering, Next: I18N Portability, Prev: String Extraction, Up: Translator i18n + END { + n = asorti(source, dest) + for (i = 1; i <= n; i++) { + Work with sorted indices directly: + DO SOMETHING WITH dest[i] + ... + Access original array via sorted indices: + DO SOMETHING WITH source[dest[i]] + } + } -12.4.2 Rearranging `printf' Arguments -------------------------------------- + Similar to `asort()', in all cases, the sorted element values +consist of the original array's indices. The ability to control +comparison merely affects the way in which they are sorted. -Format strings for `printf' and `sprintf()' (*note Printf::) present a -special problem for translation. Consider the following:(1) + Sorting the array by replacing the indices provides maximal +flexibility. To traverse the elements in decreasing order, use a loop +that goes from N down to 1, either over the elements or over the +indices.(1) - printf(_"String `%s' has %d characters\n", - string, length(string))) + Copying array indices and elements isn't expensive in terms of +memory. Internally, `gawk' maintains "reference counts" to data. For +example, when `asort()' copies the first array to the second one, there +is only one copy of the original array elements' data, even though both +arrays use the values. 
- A possible German translation for this might be: + Because `IGNORECASE' affects string comparisons, the value of +`IGNORECASE' also affects sorting for both `asort()' and `asorti()'. +Note also that the locale's sorting order does _not_ come into play; +comparisons are based on character values only.(2) Caveat Emptor. - "%d Zeichen lang ist die Zeichenkette `%s'\n" + ---------- Footnotes ---------- - The problem should be obvious: the order of the format -specifications is different from the original! Even though `gettext()' -can return the translated string at runtime, it cannot change the -argument order in the call to `printf'. + (1) You may also use one of the predefined sorting names that sorts +in decreasing order. - To solve this problem, `printf' format specifiers may have an -additional optional element, which we call a "positional specifier". -For example: + (2) This is true because locale-based comparison occurs only when in +POSIX compatibility mode, and since `asort()' and `asorti()' are `gawk' +extensions, they are not available in that case. - "%2$d Zeichen lang ist die Zeichenkette `%1$s'\n" + +File: gawk.info, Node: Two-way I/O, Next: TCP/IP Networking, Prev: Array Sorting, Up: Advanced Features - Here, the positional specifier consists of an integer count, which -indicates which argument to use, and a `$'. Counts are one-based, and -the format string itself is _not_ included. 
Thus, in the following -example, `string' is the first argument and `length(string)' is the -second: +12.3 Two-Way Communications with Another Process +================================================ - $ gawk 'BEGIN { - > string = "Dont Panic" - > printf _"%2$d characters live in \"%1$s\"\n", - > string, length(string) - > }' - -| 10 characters live in "Dont Panic" + From: brennan@whidbey.com (Mike Brennan) + Newsgroups: comp.lang.awk + Subject: Re: Learn the SECRET to Attract Women Easily + Date: 4 Aug 1997 17:34:46 GMT + Message-ID: <5s53rm$eca@news.whidbey.com> - If present, positional specifiers come first in the format -specification, before the flags, the field width, and/or the precision. + On 3 Aug 1997 13:17:43 GMT, Want More Dates??? + wrote: + >Learn the SECRET to Attract Women Easily + > + >The SCENT(tm) Pheromone Sex Attractant For Men to Attract Women - Positional specifiers can be used with the dynamic field width and -precision capability: + The scent of awk programmers is a lot more attractive to women than + the scent of perl programmers. + -- + Mike Brennan - $ gawk 'BEGIN { - > printf("%*.*s\n", 10, 20, "hello") - > printf("%3$*2$.*1$s\n", 20, 10, "hello") - > }' - -| hello - -| hello + It is often useful to be able to send data to a separate program for +processing and then read the result. This can always be done with +temporary files: - NOTE: When using `*' with a positional specifier, the `*' comes - first, then the integer position, and then the `$'. This is - somewhat counterintuitive. + # Write the data for processing + tempfile = ("mydata." 
PROCINFO["pid"]) + while (NOT DONE WITH DATA) + print DATA | ("subprogram > " tempfile) + close("subprogram > " tempfile) - `gawk' does not allow you to mix regular format specifiers and those -with positional specifiers in the same string: + # Read the results, remove tempfile when done + while ((getline newdata < tempfile) > 0) + PROCESS newdata APPROPRIATELY + close(tempfile) + system("rm " tempfile) - $ gawk 'BEGIN { printf _"%d %3$s\n", 1, 2, "hi" }' - error--> gawk: cmd. line:1: fatal: must use `count$' on all formats or none +This works, but not elegantly. Among other things, it requires that +the program be run in a directory that cannot be shared among users; +for example, `/tmp' will not do, as another user might happen to be +using a temporary file with the same name. - NOTE: There are some pathological cases that `gawk' may fail to - diagnose. In such cases, the output may not be what you expect. - It's still a bad idea to try mixing them, even if `gawk' doesn't - detect it. + However, with `gawk', it is possible to open a _two-way_ pipe to +another process. The second process is termed a "coprocess", since it +runs in parallel with `gawk'. The two-way connection is created using +the `|&' operator (borrowed from the Korn shell, `ksh'):(1) - Although positional specifiers can be used directly in `awk' -programs, their primary purpose is to help in producing correct -translations of format strings into languages different from the one in -which the program is first written. + do { + print DATA |& "subprogram" + "subprogram" |& getline results + } while (DATA LEFT TO PROCESS) + close("subprogram") - ---------- Footnotes ---------- + The first time an I/O operation is executed using the `|&' operator, +`gawk' creates a two-way pipeline to a child process that runs the +other program. 
Output created with `print' or `printf' is written to +the program's standard input, and output from the program's standard +output can be read by the `gawk' program using `getline'. As is the +case with processes started by `|', the subprogram can be any program, +or pipeline of programs, that can be started by the shell. - (1) This example is borrowed from the GNU `gettext' manual. + There are some cautionary items to be aware of: - -File: gawk.info, Node: I18N Portability, Prev: Printf Ordering, Up: Translator i18n + * As the code inside `gawk' currently stands, the coprocess's + standard error goes to the same place that the parent `gawk''s + standard error goes. It is not possible to read the child's + standard error separately. -12.4.3 `awk' Portability Issues -------------------------------- + * I/O buffering may be a problem. `gawk' automatically flushes all + output down the pipe to the coprocess. However, if the coprocess + does not flush its output, `gawk' may hang when doing a `getline' + in order to read the coprocess's results. This could lead to a + situation known as "deadlock", where each process is waiting for + the other one to do something. -`gawk''s internationalization features were purposely chosen to have as -little impact as possible on the portability of `awk' programs that use -them to other versions of `awk'. Consider this program: + It is possible to close just one end of the two-way pipe to a +coprocess, by supplying a second argument to the `close()' function of +either `"to"' or `"from"' (*note Close Files And Pipes::). These +strings tell `gawk' to close the end of the pipe that sends data to the +coprocess or the end that reads from it, respectively. - BEGIN { - TEXTDOMAIN = "guide" - if (Test_Guide) # set with -v - bindtextdomain("/test/guide/messages") - print _"don't panic!" 
- } + This is particularly necessary in order to use the system `sort' +utility as part of a coprocess; `sort' must read _all_ of its input +data before it can produce any output. The `sort' program does not +receive an end-of-file indication until `gawk' closes the write end of +the pipe. -As written, it won't work on other versions of `awk'. However, it is -actually almost portable, requiring very little change: + When you have finished writing data to the `sort' utility, you can +close the `"to"' end of the pipe, and then start reading sorted data +via `getline'. For example: - * Assignments to `TEXTDOMAIN' won't have any effect, since - `TEXTDOMAIN' is not special in other `awk' implementations. + BEGIN { + command = "LC_ALL=C sort" + n = split("abcdefghijklmnopqrstuvwxyz", a, "") - * Non-GNU versions of `awk' treat marked strings as the - concatenation of a variable named `_' with the string following - it.(1) Typically, the variable `_' has the null string (`""') as - its value, leaving the original string constant as the result. + for (i = n; i > 0; i--) + print a[i] |& command + close(command, "to") - * By defining "dummy" functions to replace `dcgettext()', - `dcngettext()' and `bindtextdomain()', the `awk' program can be - made to run, but all the messages are output in the original - language. For example: + while ((command |& getline line) > 0) + print "got", line + close(command) + } - function bindtextdomain(dir, domain) - { - return dir - } + This program writes the letters of the alphabet in reverse order, one +per line, down the two-way pipe to `sort'. It then closes the write +end of the pipe, so that `sort' receives an end-of-file indication. +This causes `sort' to sort the data and write the sorted data back to +the `gawk' program. Once all of the data has been read, `gawk' +terminates the coprocess and exits. 
- function dcgettext(string, domain, category) - { - return string - } + As a side note, the assignment `LC_ALL=C' in the `sort' command +ensures traditional Unix (ASCII) sorting from `sort'. - function dcngettext(string1, string2, number, domain, category) - { - return (number == 1 ? string1 : string2) - } + You may also use pseudo-ttys (ptys) for two-way communication +instead of pipes, if your system supports them. This is done on a +per-command basis, by setting a special element in the `PROCINFO' array +(*note Auto-set::), like so: - * The use of positional specifications in `printf' or `sprintf()' is - _not_ portable. To support `gettext()' at the C level, many - systems' C versions of `sprintf()' do support positional - specifiers. But it works only if enough arguments are supplied in - the function call. Many versions of `awk' pass `printf' formats - and arguments unchanged to the underlying C library version of - `sprintf()', but only one format and argument at a time. What - happens if a positional specification is used is anybody's guess. - However, since the positional specifications are primarily for use - in _translated_ format strings, and since non-GNU `awk's never - retrieve the translated string, this should not be a problem in - practice. + command = "sort -nr" # command, save in convenience variable + PROCINFO[command, "pty"] = 1 # update PROCINFO + print ... |& command # start two-way pipe + ... + +Using ptys avoids the buffer deadlock issues described earlier, at some +loss in performance. If your system does not have ptys, or if all the +system's ptys are in use, `gawk' automatically falls back to using +regular pipes. ---------- Footnotes ---------- - (1) This is good fodder for an "Obfuscated `awk'" contest. + (1) This is very different from the same operator in the C shell.  
-File: gawk.info, Node: I18N Example, Next: Gawk I18N, Prev: Translator i18n, Up: Internationalization +File: gawk.info, Node: TCP/IP Networking, Next: Profiling, Prev: Two-way I/O, Up: Advanced Features -12.5 A Simple Internationalization Example -========================================== +12.4 Using `gawk' for Network Programming +========================================= -Now let's look at a step-by-step example of how to internationalize and -localize a simple `awk' program, using `guide.awk' as our original -source: + `EMISTERED': + A host is a host from coast to coast, + and no-one can talk to host that's close, + unless the host that isn't close + is busy hung or dead. - BEGIN { - TEXTDOMAIN = "guide" - bindtextdomain(".") # for testing - print _"Don't Panic" - print _"The Answer Is", 42 - print "Pardon me, Zaphod who?" - } + In addition to being able to open a two-way pipeline to a coprocess +on the same system (*note Two-way I/O::), it is possible to make a +two-way connection to another process on another system across an IP +network connection. -Run `gawk --gen-pot' to create the `.pot' file: + You can think of this as just a _very long_ two-way pipeline to a +coprocess. The way `gawk' decides that you want to use TCP/IP +networking is by recognizing special file names that begin with one of +`/inet/', `/inet4/' or `/inet6'. - $ gawk --gen-pot -f guide.awk > guide.pot + The full syntax of the special file name is +`/NET-TYPE/PROTOCOL/LOCAL-PORT/REMOTE-HOST/REMOTE-PORT'. The +components are: -This produces: +NET-TYPE + Specifies the kind of Internet connection to make. Use `/inet4/' + to force IPv4, and `/inet6/' to force IPv6. Plain `/inet/' (which + used to be the only option) uses the system default, most likely + IPv4. - #: guide.awk:4 - msgid "Don't Panic" - msgstr "" +PROTOCOL + The protocol to use over IP. This must be either `tcp', or `udp', + for a TCP or UDP IP connection, respectively. The use of TCP is + recommended for most applications. 
- #: guide.awk:5 - msgid "The Answer Is" - msgstr "" +LOCAL-PORT + The local TCP or UDP port number to use. Use a port number of `0' + when you want the system to pick a port. This is what you should do + when writing a TCP or UDP client. You may also use a well-known + service name, such as `smtp' or `http', in which case `gawk' + attempts to determine the predefined port number using the C + `getaddrinfo()' function. - This original portable object template file is saved and reused for -each language into which the application is translated. The `msgid' is -the original string and the `msgstr' is the translation. +REMOTE-HOST + The IP address or fully-qualified domain name of the Internet host + to which you want to connect. - NOTE: Strings not marked with a leading underscore do not appear - in the `guide.pot' file. +REMOTE-PORT + The TCP or UDP port number to use on the given REMOTE-HOST. + Again, use `0' if you don't care, or else a well-known service + name. - Next, the messages must be translated. Here is a translation to a -hypothetical dialect of English, called "Mellow":(1) + NOTE: Failure in opening a two-way socket will result in a + non-fatal error being returned to the calling code. The value of + `ERRNO' indicates the error (*note Auto-set::). - $ cp guide.pot guide-mellow.po - ADD TRANSLATIONS TO guide-mellow.po ... + Consider the following very simple example: -Following are the translations: + BEGIN { + Service = "/inet/tcp/0/localhost/daytime" + Service |& getline + print $0 + close(Service) + } - #: guide.awk:4 - msgid "Don't Panic" - msgstr "Hey man, relax!" + This program reads the current date and time from the local system's +TCP `daytime' server. It then prints the results and closes the +connection. - #: guide.awk:5 - msgid "The Answer Is" - msgstr "Like, the scoop is" + Because this topic is extensive, the use of `gawk' for TCP/IP +programming is documented separately. 
See *note (General +Introduction)Top:: gawkinet, TCP/IP Internetworking with `gawk', for a +much more complete introduction and discussion, as well as extensive +examples. - The next step is to make the directory to hold the binary message -object file and then to create the `guide.mo' file. The directory -layout shown here is standard for GNU `gettext' on GNU/Linux systems. -Other versions of `gettext' may use a different layout: + +File: gawk.info, Node: Profiling, Prev: TCP/IP Networking, Up: Advanced Features - $ mkdir en_US en_US/LC_MESSAGES +12.5 Profiling Your `awk' Programs +================================== - The `msgfmt' utility does the conversion from human-readable `.po' -file to machine-readable `.mo' file. By default, `msgfmt' creates a -file named `messages'. This file must be renamed and placed in the -proper directory so that `gawk' can find it: +You may produce execution traces of your `awk' programs. This is done +by passing the option `--profile' to `gawk'. When `gawk' has finished +running, it creates a profile of your program in a file named +`awkprof.out'. Because it is profiling, it also executes up to 45% +slower than `gawk' normally does. - $ msgfmt guide-mellow.po - $ mv messages en_US/LC_MESSAGES/guide.mo + As shown in the following example, the `--profile' option can be +used to change the name of the file where `gawk' will write the profile: - Finally, we run the program to test it: + gawk --profile=myprog.prof -f myprog.awk data1 data2 - $ gawk -f guide.awk - -| Hey man, relax! - -| Like, the scoop is 42 - -| Pardon me, Zaphod who? +In the above example, `gawk' places the profile in `myprog.prof' +instead of in `awkprof.out'. 
- If the three replacement functions for `dcgettext()', `dcngettext()' -and `bindtextdomain()' (*note I18N Portability::) are in a file named -`libintl.awk', then we can run `guide.awk' unchanged as follows: + Here is a sample session showing a simple `awk' program, its input +data, and the results from running `gawk' with the `--profile' option. +First, the `awk' program: - $ gawk --posix -f guide.awk -f libintl.awk - -| Don't Panic - -| The Answer Is 42 - -| Pardon me, Zaphod who? + BEGIN { print "First BEGIN rule" } - ---------- Footnotes ---------- + END { print "First END rule" } - (1) Perhaps it would be better if it were called "Hippy." Ah, well. + /foo/ { + print "matched /foo/, gosh" + for (i = 1; i <= 3; i++) + sing() + } - -File: gawk.info, Node: Gawk I18N, Prev: I18N Example, Up: Internationalization + { + if (/foo/) + print "if is true" + else + print "else is true" + } -12.6 `gawk' Can Speak Your Language -=================================== + BEGIN { print "Second BEGIN rule" } -`gawk' itself has been internationalized using the GNU `gettext' -package. (GNU `gettext' is described in complete detail in *note (GNU -`gettext' utilities)Top:: gettext, GNU gettext tools.) As of this -writing, the latest version of GNU `gettext' is version 0.18.2.1 -(ftp://ftp.gnu.org/gnu/gettext/gettext-0.18.2.1.tar.gz). + END { print "Second END rule" } - If a translation of `gawk''s messages exists, then `gawk' produces -usage messages, warnings, and fatal errors in the local language. + function sing( dummy) + { + print "I gotta be me!" + } - -File: gawk.info, Node: Advanced Features, Next: Debugger, Prev: Internationalization, Up: Top + Following is the input data: -13 Advanced Features of `gawk' -****************************** + foo + bar + baz + foo + junk - Write documentation as if whoever reads it is a violent psychopath - who knows where you live. 
- Steve English, as quoted by Peter Langston + Here is the `awkprof.out' that results from running the `gawk' +profiler on this program and data (this example also illustrates that +`awk' programmers sometimes have to work late): - This major node discusses advanced features in `gawk'. It's a bit -of a "grab bag" of items that are otherwise unrelated to each other. -First, a command-line option allows `gawk' to recognize nondecimal -numbers in input data, not just in `awk' programs. Then, `gawk''s -special features for sorting arrays are presented. Next, two-way I/O, -discussed briefly in earlier parts of this Info file, is described in -full detail, along with the basics of TCP/IP networking. Finally, -`gawk' can "profile" an `awk' program, making it possible to tune it -for performance. + # gawk profile, created Sun Aug 13 00:00:15 2000 - A number of advanced features require separate major nodes of their -own: + # BEGIN block(s) - * *note Internationalization::, discusses how to internationalize - your `awk' programs, so that they can speak multiple national - languages. + BEGIN { + 1 print "First BEGIN rule" + 1 print "Second BEGIN rule" + } - * *note Debugger::, describes `gawk''s built-in command-line - debugger for debugging `awk' programs. + # Rule(s) - * *note Arbitrary Precision Arithmetic::, describes how you can use - `gawk' to perform arbitrary-precision arithmetic. + 5 /foo/ { # 2 + 2 print "matched /foo/, gosh" + 6 for (i = 1; i <= 3; i++) { + 6 sing() + } + } - * *note Dynamic Extensions::, discusses the ability to dynamically - add new built-in functions to `gawk'. + 5 { + 5 if (/foo/) { # 2 + 2 print "if is true" + 3 } else { + 3 print "else is true" + } + } -* Menu: - -* Nondecimal Data:: Allowing nondecimal input data. -* Array Sorting:: Facilities for controlling array traversal and - sorting arrays. -* Two-way I/O:: Two-way communications with another process. -* TCP/IP Networking:: Using `gawk' for network programming. 
-* Profiling:: Profiling your `awk' programs. - - -File: gawk.info, Node: Nondecimal Data, Next: Array Sorting, Up: Advanced Features - -13.1 Allowing Nondecimal Input Data -=================================== + # END block(s) -If you run `gawk' with the `--non-decimal-data' option, you can have -nondecimal constants in your input data: + END { + 1 print "First END rule" + 1 print "Second END rule" + } - $ echo 0123 123 0x123 | - > gawk --non-decimal-data '{ printf "%d, %d, %d\n", - > $1, $2, $3 }' - -| 83, 123, 291 + # Functions, listed alphabetically - For this feature to work, write your program so that `gawk' treats -your data as numeric: + 6 function sing(dummy) + { + 6 print "I gotta be me!" + } - $ echo 0123 123 0x123 | gawk '{ print $1, $2, $3 }' - -| 0123 123 0x123 + This example illustrates many of the basic features of profiling +output. They are as follows: -The `print' statement treats its expressions as strings. Although the -fields can act as numbers when necessary, they are still strings, so -`print' does not try to treat them numerically. You may need to add -zero to a field to force it to be treated as a number. For example: + * The program is printed in the order `BEGIN' rule, `BEGINFILE' rule, + pattern/action rules, `ENDFILE' rule, `END' rule and functions, + listed alphabetically. Multiple `BEGIN' and `END' rules are + merged together, as are multiple `BEGINFILE' and `ENDFILE' rules. - $ echo 0123 123 0x123 | gawk --non-decimal-data ' - > { print $1, $2, $3 - > print $1 + 0, $2 + 0, $3 + 0 }' - -| 0123 123 0x123 - -| 83 123 291 + * Pattern-action rules have two counts. The first count, to the + left of the rule, shows how many times the rule's pattern was + _tested_. The second count, to the right of the rule's opening + left brace in a comment, shows how many times the rule's action + was _executed_. The difference between the two indicates how many + times the rule's pattern evaluated to false. 
- Because it is common to have decimal data with leading zeros, and -because using this facility could lead to surprising results, the -default is to leave it disabled. If you want it, you must explicitly -request it. + * Similarly, the count for an `if'-`else' statement shows how many + times the condition was tested. To the right of the opening left + brace for the `if''s body is a count showing how many times the + condition was true. The count for the `else' indicates how many + times the test failed. - CAUTION: _Use of this option is not recommended._ It can break old - programs very badly. Instead, use the `strtonum()' function to - convert your data (*note Nondecimal-numbers::). This makes your - programs easier to write and easier to read, and leads to less - surprising results. + * The count for a loop header (such as `for' or `while') shows how + many times the loop test was executed. (Because of this, you + can't just look at the count on the first statement in a rule to + determine how many times the rule was executed. If the first + statement is a loop, the count is misleading.) - -File: gawk.info, Node: Array Sorting, Next: Two-way I/O, Prev: Nondecimal Data, Up: Advanced Features + * For user-defined functions, the count next to the `function' + keyword indicates how many times the function was called. The + counts next to the statements in the body show how many times + those statements were executed. -13.2 Controlling Array Traversal and Array Sorting -================================================== + * The layout uses "K&R" style with TABs. Braces are used + everywhere, even when the body of an `if', `else', or loop is only + a single statement. -`gawk' lets you control the order in which a `for (i in array)' loop -traverses an array. + * Parentheses are used only where needed, as indicated by the + structure of the program and the precedence rules. For example, + `(3 + 5) * 4' means add three plus five, then multiply the total + by four. 
However, `3 + 5 * 4' has no parentheses, and means `3 + + (5 * 4)'. - In addition, two built-in functions, `asort()' and `asorti()', let -you sort arrays based on the array values and indices, respectively. -These two functions also provide control over the sorting criteria used -to order the elements during sorting. + * Parentheses are used around the arguments to `print' and `printf' + only when the `print' or `printf' statement is followed by a + redirection. Similarly, if the target of a redirection isn't a + scalar, it gets parenthesized. -* Menu: + * `gawk' supplies leading comments in front of the `BEGIN' and `END' + rules, the pattern/action rules, and the functions. -* Controlling Array Traversal:: How to use PROCINFO["sorted_in"]. -* Array Sorting Functions:: How to use `asort()' and `asorti()'. - -File: gawk.info, Node: Controlling Array Traversal, Next: Array Sorting Functions, Up: Array Sorting + The profiled version of your program may not look exactly like what +you typed when you wrote it. This is because `gawk' creates the +profiled version by "pretty printing" its internal representation of +the program. The advantage to this is that `gawk' can produce a +standard representation. The disadvantage is that all source-code +comments are lost, as are the distinctions among multiple `BEGIN', +`END', `BEGINFILE', and `ENDFILE' rules. Also, things such as: -13.2.1 Controlling Array Traversal ----------------------------------- + /foo/ -By default, the order in which a `for (i in array)' loop scans an array -is not defined; it is generally based upon the internal implementation -of arrays inside `awk'. +come out as: - Often, though, it is desirable to be able to loop over the elements -in a particular order that you, the programmer, choose. `gawk' lets -you do this. 
+ /foo/ { + print $0 + } - *note Controlling Scanning::, describes how you can assign special, -pre-defined values to `PROCINFO["sorted_in"]' in order to control the -order in which `gawk' will traverse an array during a `for' loop. +which is correct, but possibly surprising. - In addition, the value of `PROCINFO["sorted_in"]' can be a function -name. This lets you traverse an array based on any custom criterion. -The array elements are ordered according to the return value of this -function. The comparison function should be defined with at least four -arguments: + Besides creating profiles when a program has completed, `gawk' can +produce a profile while it is running. This is useful if your `awk' +program goes into an infinite loop and you want to see what has been +executed. To use this feature, run `gawk' with the `--profile' option +in the background: - function comp_func(i1, v1, i2, v2) - { - COMPARE ELEMENTS 1 AND 2 IN SOME FASHION - RETURN < 0; 0; OR > 0 - } + $ gawk --profile -f myprog & + [1] 13992 - Here, I1 and I2 are the indices, and V1 and V2 are the corresponding -values of the two elements being compared. Either V1 or V2, or both, -can be arrays if the array being traversed contains subarrays as values. -(*Note Arrays of Arrays::, for more information about subarrays.) The -three possible return values are interpreted as follows: +The shell prints a job number and process ID number; in this case, +13992. Use the `kill' command to send the `USR1' signal to `gawk': -`comp_func(i1, v1, i2, v2) < 0' - Index I1 comes before index I2 during loop traversal. + $ kill -USR1 13992 -`comp_func(i1, v1, i2, v2) == 0' - Indices I1 and I2 come together but the relative order with - respect to each other is undefined. +As usual, the profiled version of the program is written to +`awkprof.out', or to a different file if one specified with the +`--profile' option. -`comp_func(i1, v1, i2, v2) > 0' - Index I1 comes after index I2 during loop traversal. 
+ Along with the regular profile, as shown earlier, the profile +includes a trace of any active functions: - Our first comparison function can be used to scan an array in -numerical order of the indices: + # Function Call Stack: - function cmp_num_idx(i1, v1, i2, v2) - { - # numerical index comparison, ascending order - return (i1 - i2) - } + # 3. baz + # 2. bar + # 1. foo + # -- main -- - Our second function traverses an array based on the string order of -the element values rather than by indices: + You may send `gawk' the `USR1' signal as many times as you like. +Each time, the profile and function call trace are appended to the +output profile file. - function cmp_str_val(i1, v1, i2, v2) - { - # string value comparison, ascending order - v1 = v1 "" - v2 = v2 "" - if (v1 < v2) - return -1 - return (v1 != v2) - } + If you use the `HUP' signal instead of the `USR1' signal, `gawk' +produces the profile and the function call trace and then exits. - The third comparison function makes all numbers, and numeric strings -without any leading or trailing spaces, come out first during loop -traversal: + When `gawk' runs on MS-Windows systems, it uses the `INT' and `QUIT' +signals for producing the profile and, in the case of the `INT' signal, +`gawk' exits. This is because these systems don't support the `kill' +command, so the only signals you can deliver to a program are those +generated by the keyboard. The `INT' signal is generated by the +`Ctrl-<C>' or `Ctrl-<BREAK>' key, while the `QUIT' signal is generated +by the `Ctrl-<\>' key. - function cmp_num_str_val(i1, v1, i2, v2, n1, n2) - { - # numbers before string value comparison, ascending order - n1 = v1 + 0 - n2 = v2 + 0 - if (n1 == v1) - return (n2 == v2) ? (n1 - n2) : -1 - else if (n2 == v2) - return 1 - return (v1 < v2) ? -1 : (v1 != v2) - } + Finally, `gawk' also accepts another option, `--pretty-print'. When +called this way, `gawk' "pretty prints" the program into `awkprof.out', +without any execution counts. 
- Here is a main program to demonstrate how `gawk' behaves using each -of the previous functions: + +File: gawk.info, Node: Internationalization, Next: Debugger, Prev: Advanced Features, Up: Top - BEGIN { - data["one"] = 10 - data["two"] = 20 - data[10] = "one" - data[100] = 100 - data[20] = "two" +13 Internationalization with `gawk' +*********************************** - f[1] = "cmp_num_idx" - f[2] = "cmp_str_val" - f[3] = "cmp_num_str_val" - for (i = 1; i <= 3; i++) { - printf("Sort function: %s\n", f[i]) - PROCINFO["sorted_in"] = f[i] - for (j in data) - printf("\tdata[%s] = %s\n", j, data[j]) - print "" - } - } +Once upon a time, computer makers wrote software that worked only in +English. Eventually, hardware and software vendors noticed that if +their systems worked in the native languages of non-English-speaking +countries, they were able to sell more systems. As a result, +internationalization and localization of programs and software systems +became a common practice. - Here are the results when the program is run: + For many years, the ability to provide internationalization was +largely restricted to programs written in C and C++. This major node +describes the underlying library `gawk' uses for internationalization, +as well as how `gawk' makes internationalization features available at +the `awk' program level. Having internationalization available at the +`awk' level gives software developers additional flexibility--they are +no longer forced to write in C or C++ when internationalization is a +requirement. 
- $ gawk -f compdemo.awk - -| Sort function: cmp_num_idx Sort by numeric index - -| data[two] = 20 - -| data[one] = 10 Both strings are numerically zero - -| data[10] = one - -| data[20] = two - -| data[100] = 100 - -| - -| Sort function: cmp_str_val Sort by element values as strings - -| data[one] = 10 - -| data[100] = 100 String 100 is less than string 20 - -| data[two] = 20 - -| data[10] = one - -| data[20] = two - -| - -| Sort function: cmp_num_str_val Sort all numeric values before all strings - -| data[one] = 10 - -| data[two] = 20 - -| data[100] = 100 - -| data[10] = one - -| data[20] = two +* Menu: - Consider sorting the entries of a GNU/Linux system password file -according to login name. The following program sorts records by a -specific field position and can be used for this purpose: +* I18N and L10N:: Internationalization and Localization. +* Explaining gettext:: How GNU `gettext' works. +* Programmer i18n:: Features for the programmer. +* Translator i18n:: Features for the translator. +* I18N Example:: A simple i18n example. +* Gawk I18N:: `gawk' is also internationalized. - # sort.awk --- simple program to sort by field position - # field position is specified by the global variable POS + +File: gawk.info, Node: I18N and L10N, Next: Explaining gettext, Up: Internationalization - function cmp_field(i1, v1, i2, v2) - { - # comparison by value, as string, and ascending order - return v1[POS] < v2[POS] ? -1 : (v1[POS] != v2[POS]) - } +13.1 Internationalization and Localization +========================================== - { - for (i = 1; i <= NF; i++) - a[NR][i] = $i - } +"Internationalization" means writing (or modifying) a program once, in +such a way that it can use multiple languages without requiring further +source-code changes. "Localization" means providing the data necessary +for an internationalized program to work in a particular language. 
+Most typically, these terms refer to features such as the language used +for printing error messages, the language used to read responses, and +information related to how numerical and monetary values are printed +and read. - END { - PROCINFO["sorted_in"] = "cmp_field" - if (POS < 1 || POS > NF) - POS = 1 - for (i in a) { - for (j = 1; j <= NF; j++) - printf("%s%c", a[i][j], j < NF ? ":" : "") - print "" - } - } + +File: gawk.info, Node: Explaining gettext, Next: Programmer i18n, Prev: I18N and L10N, Up: Internationalization - The first field in each entry of the password file is the user's -login name, and the fields are separated by colons. Each record -defines a subarray, with each field as an element in the subarray. -Running the program produces the following output: +13.2 GNU `gettext' +================== - $ gawk -v POS=1 -F: -f sort.awk /etc/passwd - -| adm:x:3:4:adm:/var/adm:/sbin/nologin - -| apache:x:48:48:Apache:/var/www:/sbin/nologin - -| avahi:x:70:70:Avahi daemon:/:/sbin/nologin - ... +The facilities in GNU `gettext' focus on messages; strings printed by a +program, either directly or via formatting with `printf' or +`sprintf()'.(1) - The comparison should normally always return the same value when -given a specific pair of array elements as its arguments. If -inconsistent results are returned then the order is undefined. This -behavior can be exploited to introduce random order into otherwise -seemingly ordered data: + When using GNU `gettext', each application has its own "text +domain". This is a unique name, such as `kpilot' or `gawk', that +identifies the application. A complete application may have multiple +components--programs written in C or C++, as well as scripts written in +`sh' or `awk'. All of the components use the same text domain. - function cmp_randomize(i1, v1, i2, v2) - { - # random order - return (2 - 4 * rand()) - } + To make the discussion concrete, assume we're writing an application +named `guide'. 
Internationalization consists of the following steps, +in this order: - As mentioned above, the order of the indices is arbitrary if two -elements compare equal. This is usually not a problem, but letting the -tied elements come out in arbitrary order can be an issue, especially -when comparing item values. The partial ordering of the equal elements -may change during the next loop traversal, if other elements are added -or removed from the array. One way to resolve ties when comparing -elements with otherwise equal values is to include the indices in the -comparison rules. Note that doing this may make the loop traversal -less efficient, so consider it only if necessary. The following -comparison functions force a deterministic order, and are based on the -fact that the indices of two elements are never equal: + 1. The programmer goes through the source for all of `guide''s + components and marks each string that is a candidate for + translation. For example, `"`-F': option required"' is a good + candidate for translation. A table with strings of option names + is not (e.g., `gawk''s `--profile' option should remain the same, + no matter what the local language). - function cmp_numeric(i1, v1, i2, v2) - { - # numerical value (and index) comparison, descending order - return (v1 != v2) ? (v2 - v1) : (i2 - i1) - } + 2. The programmer indicates the application's text domain (`"guide"') + to the `gettext' library, by calling the `textdomain()' function. - function cmp_string(i1, v1, i2, v2) - { - # string value (and index) comparison, descending order - v1 = v1 i1 - v2 = v2 i2 - return (v1 > v2) ? -1 : (v1 != v2) - } + 3. Messages from the application are extracted from the source code + and collected into a portable object template file (`guide.pot'), + which lists the strings and their translations. The translations + are initially empty. The original (usually English) messages + serve as the key for lookup of the translations. 
- A custom comparison function can often simplify ordered loop -traversal, and the sky is really the limit when it comes to designing -such a function. + 4. For each language with a translator, `guide.pot' is copied to a + portable object file (`.po') and translations are created and + shipped with the application. For example, there might be a + `fr.po' for a French translation. - When string comparisons are made during a sort, either for element -values where one or both aren't numbers, or for element indices handled -as strings, the value of `IGNORECASE' (*note Built-in Variables::) -controls whether the comparisons treat corresponding uppercase and -lowercase letters as equivalent or distinct. + 5. Each language's `.po' file is converted into a binary message + object (`.mo') file. A message object file contains the original + messages and their translations in a binary format that allows + fast lookup of translations at runtime. - Another point to keep in mind is that in the case of subarrays the -element values can themselves be arrays; a production comparison -function should use the `isarray()' function (*note Type Functions::), -to check for this, and choose a defined sorting order for subarrays. + 6. When `guide' is built and installed, the binary translation files + are installed in a standard place. - All sorting based on `PROCINFO["sorted_in"]' is disabled in POSIX -mode, since the `PROCINFO' array is not special in that case. + 7. For testing and development, it is possible to tell `gettext' to + use `.mo' files in a different directory than the standard one by + using the `bindtextdomain()' function. - As a side note, sorting the array indices before traversing the -array has been reported to add 15% to 20% overhead to the execution -time of `awk' programs. For this reason, sorted array traversal is not -the default. + 8. At runtime, `guide' looks up each string via a call to + `gettext()'. 
The returned string is the translated string if + available, or the original string if not. - -File: gawk.info, Node: Array Sorting Functions, Prev: Controlling Array Traversal, Up: Array Sorting + 9. If necessary, it is possible to access messages from a different + text domain than the one belonging to the application, without + having to switch the application's default text domain back and + forth. -13.2.2 Sorting Array Values and Indices with `gawk' ---------------------------------------------------- + In C (or C++), the string marking and dynamic translation lookup are +accomplished by wrapping each string in a call to `gettext()': -In most `awk' implementations, sorting an array requires writing a -`sort()' function. While this can be educational for exploring -different sorting algorithms, usually that's not the point of the -program. `gawk' provides the built-in `asort()' and `asorti()' -functions (*note String Functions::) for sorting arrays. For example: + printf("%s", gettext("Don't Panic!\n")); - POPULATE THE ARRAY data - n = asort(data) - for (i = 1; i <= n; i++) - DO SOMETHING WITH data[i] + The tools that extract messages from source code pull out all +strings enclosed in calls to `gettext()'. - After the call to `asort()', the array `data' is indexed from 1 to -some number N, the total number of elements in `data'. (This count is -`asort()''s return value.) `data[1]' <= `data[2]' <= `data[3]', and so -on. The comparison is based on the type of the elements (*note Typing -and Comparison::). All numeric values come before all string values, -which in turn come before all subarrays. + The GNU `gettext' developers, recognizing that typing `gettext(...)' +over and over again is both painful and ugly to look at, use the macro +`_' (an underscore) to make things easier: - An important side effect of calling `asort()' is that _the array's -original indices are irrevocably lost_. 
As this isn't always -desirable, `asort()' accepts a second argument: + /* In the standard header file: */ + #define _(str) gettext(str) - POPULATE THE ARRAY source - n = asort(source, dest) - for (i = 1; i <= n; i++) - DO SOMETHING WITH dest[i] + /* In the program text: */ + printf("%s", _("Don't Panic!\n")); - In this case, `gawk' copies the `source' array into the `dest' array -and then sorts `dest', destroying its indices. However, the `source' -array is not affected. +This reduces the typing overhead to just three extra characters per +string and is considerably easier to read as well. - `asort()' accepts a third string argument to control comparison of -array elements. As with `PROCINFO["sorted_in"]', this argument may be -one of the predefined names that `gawk' provides (*note Controlling -Scanning::), or the name of a user-defined function (*note Controlling -Array Traversal::). + There are locale "categories" for different types of locale-related +information. The defined locale categories that `gettext' knows about +are: - NOTE: In all cases, the sorted element values consist of the - original array's element values. The ability to control - comparison merely affects the way in which they are sorted. +`LC_MESSAGES' + Text messages. This is the default category for `gettext' + operations, but it is possible to supply a different one + explicitly, if necessary. (It is almost never necessary to supply + a different category.) - Often, what's needed is to sort on the values of the _indices_ -instead of the values of the elements. To do that, use the `asorti()' -function. The interface is identical to that of `asort()', except that -the index values are used for sorting, and become the values of the -result array: +`LC_COLLATE' + Text-collation information; i.e., how different characters and/or + groups of characters sort in a given language. 
- { source[$0] = some_func($0) } +`LC_CTYPE' + Character-type information (alphabetic, digit, upper- or + lowercase, and so on). This information is accessed via the POSIX + character classes in regular expressions, such as `/[[:alnum:]]/' + (*note Regexp Operators::). - END { - n = asorti(source, dest) - for (i = 1; i <= n; i++) { - Work with sorted indices directly: - DO SOMETHING WITH dest[i] - ... - Access original array via sorted indices: - DO SOMETHING WITH source[dest[i]] - } - } +`LC_MONETARY' + Monetary information, such as the currency symbol, and whether the + symbol goes before or after a number. - Similar to `asort()', in all cases, the sorted element values -consist of the original array's indices. The ability to control -comparison merely affects the way in which they are sorted. +`LC_NUMERIC' + Numeric information, such as which characters to use for the + decimal point and the thousands separator.(2) - Sorting the array by replacing the indices provides maximal -flexibility. To traverse the elements in decreasing order, use a loop -that goes from N down to 1, either over the elements or over the -indices.(1) +`LC_RESPONSE' + Response information, such as how "yes" and "no" appear in the + local language, and possibly other information as well. - Copying array indices and elements isn't expensive in terms of -memory. Internally, `gawk' maintains "reference counts" to data. For -example, when `asort()' copies the first array to the second one, there -is only one copy of the original array elements' data, even though both -arrays use the values. +`LC_TIME' + Time- and date-related information, such as 12- or 24-hour clock, + month printed before or after the day in a date, local month + abbreviations, and so on. - Because `IGNORECASE' affects string comparisons, the value of -`IGNORECASE' also affects sorting for both `asort()' and `asorti()'. 
-Note also that the locale's sorting order does _not_ come into play; -comparisons are based on character values only.(2) Caveat Emptor. +`LC_ALL' + All of the above. (Not too useful in the context of `gettext'.) ---------- Footnotes ---------- - (1) You may also use one of the predefined sorting names that sorts -in decreasing order. + (1) For some operating systems, the `gawk' port doesn't support GNU +`gettext'. Therefore, these features are not available if you are +using one of those operating systems. Sorry. - (2) This is true because locale-based comparison occurs only when in -POSIX compatibility mode, and since `asort()' and `asorti()' are `gawk' -extensions, they are not available in that case. + (2) Americans use a comma every three decimal places and a period +for the decimal point, while many Europeans do exactly the opposite: +1,234.56 versus 1.234,56.  -File: gawk.info, Node: Two-way I/O, Next: TCP/IP Networking, Prev: Array Sorting, Up: Advanced Features +File: gawk.info, Node: Programmer i18n, Next: Translator i18n, Prev: Explaining gettext, Up: Internationalization -13.3 Two-Way Communications with Another Process -================================================ +13.3 Internationalizing `awk' Programs +====================================== - From: brennan@whidbey.com (Mike Brennan) - Newsgroups: comp.lang.awk - Subject: Re: Learn the SECRET to Attract Women Easily - Date: 4 Aug 1997 17:34:46 GMT - Message-ID: <5s53rm$eca@news.whidbey.com> +`gawk' provides the following variables and functions for +internationalization: - On 3 Aug 1997 13:17:43 GMT, Want More Dates??? - wrote: - >Learn the SECRET to Attract Women Easily - > - >The SCENT(tm) Pheromone Sex Attractant For Men to Attract Women +`TEXTDOMAIN' + This variable indicates the application's text domain. For + compatibility with GNU `gettext', the default value is + `"messages"'. - The scent of awk programmers is a lot more attractive to women than - the scent of perl programmers. 
- -- - Mike Brennan +`_"your message here"' + String constants marked with a leading underscore are candidates + for translation at runtime. String constants without a leading + underscore are not translated. - It is often useful to be able to send data to a separate program for -processing and then read the result. This can always be done with -temporary files: +`dcgettext(STRING [, DOMAIN [, CATEGORY]])' + Return the translation of STRING in text domain DOMAIN for locale + category CATEGORY. The default value for DOMAIN is the current + value of `TEXTDOMAIN'. The default value for CATEGORY is + `"LC_MESSAGES"'. - # Write the data for processing - tempfile = ("mydata." PROCINFO["pid"]) - while (NOT DONE WITH DATA) - print DATA | ("subprogram > " tempfile) - close("subprogram > " tempfile) + If you supply a value for CATEGORY, it must be a string equal to + one of the known locale categories described in *note Explaining + gettext::. You must also supply a text domain. Use `TEXTDOMAIN' + if you want to use the current domain. - # Read the results, remove tempfile when done - while ((getline newdata < tempfile) > 0) - PROCESS newdata APPROPRIATELY - close(tempfile) - system("rm " tempfile) + CAUTION: The order of arguments to the `awk' version of the + `dcgettext()' function is purposely different from the order + for the C version. The `awk' version's order was chosen to + be simple and to allow for reasonable `awk'-style default + arguments. -This works, but not elegantly. Among other things, it requires that -the program be run in a directory that cannot be shared among users; -for example, `/tmp' will not do, as another user might happen to be -using a temporary file with the same name. +`dcngettext(STRING1, STRING2, NUMBER [, DOMAIN [, CATEGORY]])' + Return the plural form used for NUMBER of the translation of + STRING1 and STRING2 in text domain DOMAIN for locale category + CATEGORY. 
STRING1 is the English singular variant of a message, + and STRING2 the English plural variant of the same message. The + default value for DOMAIN is the current value of `TEXTDOMAIN'. + The default value for CATEGORY is `"LC_MESSAGES"'. - However, with `gawk', it is possible to open a _two-way_ pipe to -another process. The second process is termed a "coprocess", since it -runs in parallel with `gawk'. The two-way connection is created using -the `|&' operator (borrowed from the Korn shell, `ksh'):(1) + The same remarks about argument order as for the `dcgettext()' + function apply. - do { - print DATA |& "subprogram" - "subprogram" |& getline results - } while (DATA LEFT TO PROCESS) - close("subprogram") +`bindtextdomain(DIRECTORY [, DOMAIN])' + Change the directory in which `gettext' looks for `.mo' files, in + case they will not or cannot be placed in the standard locations + (e.g., during testing). Return the directory in which DOMAIN is + "bound." - The first time an I/O operation is executed using the `|&' operator, -`gawk' creates a two-way pipeline to a child process that runs the -other program. Output created with `print' or `printf' is written to -the program's standard input, and output from the program's standard -output can be read by the `gawk' program using `getline'. As is the -case with processes started by `|', the subprogram can be any program, -or pipeline of programs, that can be started by the shell. + The default DOMAIN is the value of `TEXTDOMAIN'. If DIRECTORY is + the null string (`""'), then `bindtextdomain()' returns the + current binding for the given DOMAIN. - There are some cautionary items to be aware of: + To use these facilities in your `awk' program, follow the steps +outlined in *note Explaining gettext::, like so: - * As the code inside `gawk' currently stands, the coprocess's - standard error goes to the same place that the parent `gawk''s - standard error goes. 
It is not possible to read the child's - standard error separately. + 1. Set the variable `TEXTDOMAIN' to the text domain of your program. + This is best done in a `BEGIN' rule (*note BEGIN/END::), or it can + also be done via the `-v' command-line option (*note Options::): + + BEGIN { + TEXTDOMAIN = "guide" + ... + } - * I/O buffering may be a problem. `gawk' automatically flushes all - output down the pipe to the coprocess. However, if the coprocess - does not flush its output, `gawk' may hang when doing a `getline' - in order to read the coprocess's results. This could lead to a - situation known as "deadlock", where each process is waiting for - the other one to do something. + 2. Mark all translatable strings with a leading underscore (`_') + character. It _must_ be adjacent to the opening quote of the + string. For example: - It is possible to close just one end of the two-way pipe to a -coprocess, by supplying a second argument to the `close()' function of -either `"to"' or `"from"' (*note Close Files And Pipes::). These -strings tell `gawk' to close the end of the pipe that sends data to the -coprocess or the end that reads from it, respectively. + print _"hello, world" + x = _"you goofed" + printf(_"Number of users is %d\n", nusers) - This is particularly necessary in order to use the system `sort' -utility as part of a coprocess; `sort' must read _all_ of its input -data before it can produce any output. The `sort' program does not -receive an end-of-file indication until `gawk' closes the write end of -the pipe. + 3. If you are creating strings dynamically, you can still translate + them, using the `dcgettext()' built-in function: - When you have finished writing data to the `sort' utility, you can -close the `"to"' end of the pipe, and then start reading sorted data -via `getline'. 
For example: + message = nusers " users logged in" + message = dcgettext(message, "adminprog") + print message - BEGIN { - command = "LC_ALL=C sort" - n = split("abcdefghijklmnopqrstuvwxyz", a, "") + Here, the call to `dcgettext()' supplies a different text domain + (`"adminprog"') in which to find the message, but it uses the + default `"LC_MESSAGES"' category. - for (i = n; i > 0; i--) - print a[i] |& command - close(command, "to") + 4. During development, you might want to put the `.mo' file in a + private directory for testing. This is done with the + `bindtextdomain()' built-in function: - while ((command |& getline line) > 0) - print "got", line - close(command) - } + BEGIN { + TEXTDOMAIN = "guide" # our text domain + if (Testing) { + # where to find our files + bindtextdomain("testdir") + # joe is in charge of adminprog + bindtextdomain("../joe/testdir", "adminprog") + } + ... + } - This program writes the letters of the alphabet in reverse order, one -per line, down the two-way pipe to `sort'. It then closes the write -end of the pipe, so that `sort' receives an end-of-file indication. -This causes `sort' to sort the data and write the sorted data back to -the `gawk' program. Once all of the data has been read, `gawk' -terminates the coprocess and exits. - As a side note, the assignment `LC_ALL=C' in the `sort' command -ensures traditional Unix (ASCII) sorting from `sort'. + *Note I18N Example::, for an example program showing the steps to +create and use translations from `awk'. - You may also use pseudo-ttys (ptys) for two-way communication -instead of pipes, if your system supports them. This is done on a -per-command basis, by setting a special element in the `PROCINFO' array -(*note Auto-set::), like so: + +File: gawk.info, Node: Translator i18n, Next: I18N Example, Prev: Programmer i18n, Up: Internationalization - command = "sort -nr" # command, save in convenience variable - PROCINFO[command, "pty"] = 1 # update PROCINFO - print ... 
|& command # start two-way pipe - ... +13.4 Translating `awk' Programs +=============================== -Using ptys avoids the buffer deadlock issues described earlier, at some -loss in performance. If your system does not have ptys, or if all the -system's ptys are in use, `gawk' automatically falls back to using -regular pipes. +Once a program's translatable strings have been marked, they must be +extracted to create the initial `.po' file. As part of translation, it +is often helpful to rearrange the order in which arguments to `printf' +are output. - ---------- Footnotes ---------- + `gawk''s `--gen-pot' command-line option extracts the messages and +is discussed next. After that, `printf''s ability to rearrange the +order for `printf' arguments at runtime is covered. - (1) This is very different from the same operator in the C shell. +* Menu: + +* String Extraction:: Extracting marked strings. +* Printf Ordering:: Rearranging `printf' arguments. +* I18N Portability:: `awk'-level portability issues.  -File: gawk.info, Node: TCP/IP Networking, Next: Profiling, Prev: Two-way I/O, Up: Advanced Features +File: gawk.info, Node: String Extraction, Next: Printf Ordering, Up: Translator i18n -13.4 Using `gawk' for Network Programming -========================================= +13.4.1 Extracting Marked Strings +-------------------------------- - `EMISTERED': - A host is a host from coast to coast, - and no-one can talk to host that's close, - unless the host that isn't close - is busy hung or dead. +Once your `awk' program is working, and all the strings have been +marked and you've set (and perhaps bound) the text domain, it is time +to produce translations. First, use the `--gen-pot' command-line +option to create the initial `.pot' file: - In addition to being able to open a two-way pipeline to a coprocess -on the same system (*note Two-way I/O::), it is possible to make a -two-way connection to another process on another system across an IP -network connection. 
+ $ gawk --gen-pot -f guide.awk > guide.pot - You can think of this as just a _very long_ two-way pipeline to a -coprocess. The way `gawk' decides that you want to use TCP/IP -networking is by recognizing special file names that begin with one of -`/inet/', `/inet4/' or `/inet6'. + When run with `--gen-pot', `gawk' does not execute your program. +Instead, it parses it as usual and prints all marked strings to +standard output in the format of a GNU `gettext' Portable Object file. +Also included in the output are any constant strings that appear as the +first argument to `dcgettext()' or as the first and second argument to +`dcngettext()'.(1) *Note I18N Example::, for the full list of steps to +go through to create and test translations for `guide'. - The full syntax of the special file name is -`/NET-TYPE/PROTOCOL/LOCAL-PORT/REMOTE-HOST/REMOTE-PORT'. The -components are: + ---------- Footnotes ---------- -NET-TYPE - Specifies the kind of Internet connection to make. Use `/inet4/' - to force IPv4, and `/inet6/' to force IPv6. Plain `/inet/' (which - used to be the only option) uses the system default, most likely - IPv4. + (1) The `xgettext' utility that comes with GNU `gettext' can handle +`.awk' files. -PROTOCOL - The protocol to use over IP. This must be either `tcp', or `udp', - for a TCP or UDP IP connection, respectively. The use of TCP is - recommended for most applications. + +File: gawk.info, Node: Printf Ordering, Next: I18N Portability, Prev: String Extraction, Up: Translator i18n -LOCAL-PORT - The local TCP or UDP port number to use. Use a port number of `0' - when you want the system to pick a port. This is what you should do - when writing a TCP or UDP client. You may also use a well-known - service name, such as `smtp' or `http', in which case `gawk' - attempts to determine the predefined port number using the C - `getaddrinfo()' function. 
+13.4.2 Rearranging `printf' Arguments +------------------------------------- -REMOTE-HOST - The IP address or fully-qualified domain name of the Internet host - to which you want to connect. +Format strings for `printf' and `sprintf()' (*note Printf::) present a +special problem for translation. Consider the following:(1) -REMOTE-PORT - The TCP or UDP port number to use on the given REMOTE-HOST. - Again, use `0' if you don't care, or else a well-known service - name. + printf(_"String `%s' has %d characters\n", + string, length(string)) - NOTE: Failure in opening a two-way socket will result in a - non-fatal error being returned to the calling code. The value of - `ERRNO' indicates the error (*note Auto-set::). + A possible German translation for this might be: - Consider the following very simple example: + "%d Zeichen lang ist die Zeichenkette `%s'\n" - BEGIN { - Service = "/inet/tcp/0/localhost/daytime" - Service |& getline - print $0 - close(Service) - } + The problem should be obvious: the order of the format +specifications is different from the original! Even though `gettext()' +can return the translated string at runtime, it cannot change the +argument order in the call to `printf'. - This program reads the current date and time from the local system's -TCP `daytime' server. It then prints the results and closes the -connection. + To solve this problem, `printf' format specifiers may have an +additional optional element, which we call a "positional specifier". +For example: - Because this topic is extensive, the use of `gawk' for TCP/IP -programming is documented separately. See *note (General -Introduction)Top:: gawkinet, TCP/IP Internetworking with `gawk', for a -much more complete introduction and discussion, as well as extensive -examples. 
+ "%2$d Zeichen lang ist die Zeichenkette `%1$s'\n" - -File: gawk.info, Node: Profiling, Prev: TCP/IP Networking, Up: Advanced Features + Here, the positional specifier consists of an integer count, which +indicates which argument to use, and a `$'. Counts are one-based, and +the format string itself is _not_ included. Thus, in the following +example, `string' is the first argument and `length(string)' is the +second: -13.5 Profiling Your `awk' Programs -================================== + $ gawk 'BEGIN { + > string = "Dont Panic" + > printf _"%2$d characters live in \"%1$s\"\n", + > string, length(string) + > }' + -| 10 characters live in "Dont Panic" -You may produce execution traces of your `awk' programs. This is done -by passing the option `--profile' to `gawk'. When `gawk' has finished -running, it creates a profile of your program in a file named -`awkprof.out'. Because it is profiling, it also executes up to 45% -slower than `gawk' normally does. + If present, positional specifiers come first in the format +specification, before the flags, the field width, and/or the precision. - As shown in the following example, the `--profile' option can be -used to change the name of the file where `gawk' will write the profile: + Positional specifiers can be used with the dynamic field width and +precision capability: - gawk --profile=myprog.prof -f myprog.awk data1 data2 + $ gawk 'BEGIN { + > printf("%*.*s\n", 10, 20, "hello") + > printf("%3$*2$.*1$s\n", 20, 10, "hello") + > }' + -| hello + -| hello -In the above example, `gawk' places the profile in `myprog.prof' -instead of in `awkprof.out'. + NOTE: When using `*' with a positional specifier, the `*' comes + first, then the integer position, and then the `$'. This is + somewhat counterintuitive. - Here is a sample session showing a simple `awk' program, its input -data, and the results from running `gawk' with the `--profile' option. 
-First, the `awk' program: + `gawk' does not allow you to mix regular format specifiers and those +with positional specifiers in the same string: - BEGIN { print "First BEGIN rule" } + $ gawk 'BEGIN { printf _"%d %3$s\n", 1, 2, "hi" }' + error--> gawk: cmd. line:1: fatal: must use `count$' on all formats or none - END { print "First END rule" } + NOTE: There are some pathological cases that `gawk' may fail to + diagnose. In such cases, the output may not be what you expect. + It's still a bad idea to try mixing them, even if `gawk' doesn't + detect it. - /foo/ { - print "matched /foo/, gosh" - for (i = 1; i <= 3; i++) - sing() - } + Although positional specifiers can be used directly in `awk' +programs, their primary purpose is to help in producing correct +translations of format strings into languages different from the one in +which the program is first written. - { - if (/foo/) - print "if is true" - else - print "else is true" - } + ---------- Footnotes ---------- + + (1) This example is borrowed from the GNU `gettext' manual. - BEGIN { print "Second BEGIN rule" } + +File: gawk.info, Node: I18N Portability, Prev: Printf Ordering, Up: Translator i18n - END { print "Second END rule" } +13.4.3 `awk' Portability Issues +------------------------------- - function sing( dummy) - { - print "I gotta be me!" - } +`gawk''s internationalization features were purposely chosen to have as +little impact as possible on the portability of `awk' programs that use +them to other versions of `awk'. Consider this program: - Following is the input data: + BEGIN { + TEXTDOMAIN = "guide" + if (Test_Guide) # set with -v + bindtextdomain("/test/guide/messages") + print _"don't panic!" + } - foo - bar - baz - foo - junk +As written, it won't work on other versions of `awk'. 
However, it is +actually almost portable, requiring very little change: - Here is the `awkprof.out' that results from running the `gawk' -profiler on this program and data (this example also illustrates that -`awk' programmers sometimes have to work late): + * Assignments to `TEXTDOMAIN' won't have any effect, since + `TEXTDOMAIN' is not special in other `awk' implementations. - # gawk profile, created Sun Aug 13 00:00:15 2000 + * Non-GNU versions of `awk' treat marked strings as the + concatenation of a variable named `_' with the string following + it.(1) Typically, the variable `_' has the null string (`""') as + its value, leaving the original string constant as the result. - # BEGIN block(s) + * By defining "dummy" functions to replace `dcgettext()', + `dcngettext()' and `bindtextdomain()', the `awk' program can be + made to run, but all the messages are output in the original + language. For example: - BEGIN { - 1 print "First BEGIN rule" - 1 print "Second BEGIN rule" - } + function bindtextdomain(dir, domain) + { + return dir + } - # Rule(s) + function dcgettext(string, domain, category) + { + return string + } - 5 /foo/ { # 2 - 2 print "matched /foo/, gosh" - 6 for (i = 1; i <= 3; i++) { - 6 sing() - } - } + function dcngettext(string1, string2, number, domain, category) + { + return (number == 1 ? string1 : string2) + } - 5 { - 5 if (/foo/) { # 2 - 2 print "if is true" - 3 } else { - 3 print "else is true" - } - } + * The use of positional specifications in `printf' or `sprintf()' is + _not_ portable. To support `gettext()' at the C level, many + systems' C versions of `sprintf()' do support positional + specifiers. But it works only if enough arguments are supplied in + the function call. Many versions of `awk' pass `printf' formats + and arguments unchanged to the underlying C library version of + `sprintf()', but only one format and argument at a time. What + happens if a positional specification is used is anybody's guess. 
+ However, since the positional specifications are primarily for use + in _translated_ format strings, and since non-GNU `awk's never + retrieve the translated string, this should not be a problem in + practice. - # END block(s) + ---------- Footnotes ---------- - END { - 1 print "First END rule" - 1 print "Second END rule" - } + (1) This is good fodder for an "Obfuscated `awk'" contest. - # Functions, listed alphabetically + +File: gawk.info, Node: I18N Example, Next: Gawk I18N, Prev: Translator i18n, Up: Internationalization - 6 function sing(dummy) - { - 6 print "I gotta be me!" - } +13.5 A Simple Internationalization Example +========================================== - This example illustrates many of the basic features of profiling -output. They are as follows: +Now let's look at a step-by-step example of how to internationalize and +localize a simple `awk' program, using `guide.awk' as our original +source: - * The program is printed in the order `BEGIN' rule, `BEGINFILE' rule, - pattern/action rules, `ENDFILE' rule, `END' rule and functions, - listed alphabetically. Multiple `BEGIN' and `END' rules are - merged together, as are multiple `BEGINFILE' and `ENDFILE' rules. + BEGIN { + TEXTDOMAIN = "guide" + bindtextdomain(".") # for testing + print _"Don't Panic" + print _"The Answer Is", 42 + print "Pardon me, Zaphod who?" + } - * Pattern-action rules have two counts. The first count, to the - left of the rule, shows how many times the rule's pattern was - _tested_. The second count, to the right of the rule's opening - left brace in a comment, shows how many times the rule's action - was _executed_. The difference between the two indicates how many - times the rule's pattern evaluated to false. +Run `gawk --gen-pot' to create the `.pot' file: - * Similarly, the count for an `if'-`else' statement shows how many - times the condition was tested. 
To the right of the opening left - brace for the `if''s body is a count showing how many times the - condition was true. The count for the `else' indicates how many - times the test failed. + $ gawk --gen-pot -f guide.awk > guide.pot - * The count for a loop header (such as `for' or `while') shows how - many times the loop test was executed. (Because of this, you - can't just look at the count on the first statement in a rule to - determine how many times the rule was executed. If the first - statement is a loop, the count is misleading.) +This produces: - * For user-defined functions, the count next to the `function' - keyword indicates how many times the function was called. The - counts next to the statements in the body show how many times - those statements were executed. + #: guide.awk:4 + msgid "Don't Panic" + msgstr "" - * The layout uses "K&R" style with TABs. Braces are used - everywhere, even when the body of an `if', `else', or loop is only - a single statement. + #: guide.awk:5 + msgid "The Answer Is" + msgstr "" - * Parentheses are used only where needed, as indicated by the - structure of the program and the precedence rules. For example, - `(3 + 5) * 4' means add three plus five, then multiply the total - by four. However, `3 + 5 * 4' has no parentheses, and means `3 + - (5 * 4)'. + This original portable object template file is saved and reused for +each language into which the application is translated. The `msgid' is +the original string and the `msgstr' is the translation. - * Parentheses are used around the arguments to `print' and `printf' - only when the `print' or `printf' statement is followed by a - redirection. Similarly, if the target of a redirection isn't a - scalar, it gets parenthesized. + NOTE: Strings not marked with a leading underscore do not appear + in the `guide.pot' file. - * `gawk' supplies leading comments in front of the `BEGIN' and `END' - rules, the pattern/action rules, and the functions. 
+ Next, the messages must be translated. Here is a translation to a +hypothetical dialect of English, called "Mellow":(1) + $ cp guide.pot guide-mellow.po + ADD TRANSLATIONS TO guide-mellow.po ... - The profiled version of your program may not look exactly like what -you typed when you wrote it. This is because `gawk' creates the -profiled version by "pretty printing" its internal representation of -the program. The advantage to this is that `gawk' can produce a -standard representation. The disadvantage is that all source-code -comments are lost, as are the distinctions among multiple `BEGIN', -`END', `BEGINFILE', and `ENDFILE' rules. Also, things such as: +Following are the translations: - /foo/ + #: guide.awk:4 + msgid "Don't Panic" + msgstr "Hey man, relax!" -come out as: + #: guide.awk:5 + msgid "The Answer Is" + msgstr "Like, the scoop is" - /foo/ { - print $0 - } + The next step is to make the directory to hold the binary message +object file and then to create the `guide.mo' file. The directory +layout shown here is standard for GNU `gettext' on GNU/Linux systems. +Other versions of `gettext' may use a different layout: -which is correct, but possibly surprising. + $ mkdir en_US en_US/LC_MESSAGES - Besides creating profiles when a program has completed, `gawk' can -produce a profile while it is running. This is useful if your `awk' -program goes into an infinite loop and you want to see what has been -executed. To use this feature, run `gawk' with the `--profile' option -in the background: + The `msgfmt' utility does the conversion from human-readable `.po' +file to machine-readable `.mo' file. By default, `msgfmt' creates a +file named `messages'. This file must be renamed and placed in the +proper directory so that `gawk' can find it: - $ gawk --profile -f myprog & - [1] 13992 + $ msgfmt guide-mellow.po + $ mv messages en_US/LC_MESSAGES/guide.mo -The shell prints a job number and process ID number; in this case, -13992. 
Use the `kill' command to send the `USR1' signal to `gawk': + Finally, we run the program to test it: - $ kill -USR1 13992 + $ gawk -f guide.awk + -| Hey man, relax! + -| Like, the scoop is 42 + -| Pardon me, Zaphod who? -As usual, the profiled version of the program is written to -`awkprof.out', or to a different file if one specified with the -`--profile' option. + If the three replacement functions for `dcgettext()', `dcngettext()' +and `bindtextdomain()' (*note I18N Portability::) are in a file named +`libintl.awk', then we can run `guide.awk' unchanged as follows: - Along with the regular profile, as shown earlier, the profile -includes a trace of any active functions: + $ gawk --posix -f guide.awk -f libintl.awk + -| Don't Panic + -| The Answer Is 42 + -| Pardon me, Zaphod who? - # Function Call Stack: + ---------- Footnotes ---------- - # 3. baz - # 2. bar - # 1. foo - # -- main -- + (1) Perhaps it would be better if it were called "Hippy." Ah, well. - You may send `gawk' the `USR1' signal as many times as you like. -Each time, the profile and function call trace are appended to the -output profile file. + +File: gawk.info, Node: Gawk I18N, Prev: I18N Example, Up: Internationalization - If you use the `HUP' signal instead of the `USR1' signal, `gawk' -produces the profile and the function call trace and then exits. +13.6 `gawk' Can Speak Your Language +=================================== - When `gawk' runs on MS-Windows systems, it uses the `INT' and `QUIT' -signals for producing the profile and, in the case of the `INT' signal, -`gawk' exits. This is because these systems don't support the `kill' -command, so the only signals you can deliver to a program are those -generated by the keyboard. The `INT' signal is generated by the -`Ctrl-<C>' or `Ctrl-<BREAK>' key, while the `QUIT' signal is generated -by the `Ctrl-<\>' key. +`gawk' itself has been internationalized using the GNU `gettext' +package.
(GNU `gettext' is described in complete detail in *note (GNU +`gettext' utilities)Top:: gettext, GNU gettext tools.) As of this +writing, the latest version of GNU `gettext' is version 0.18.2.1 +(ftp://ftp.gnu.org/gnu/gettext/gettext-0.18.2.1.tar.gz). - Finally, `gawk' also accepts another option, `--pretty-print'. When -called this way, `gawk' "pretty prints" the program into `awkprof.out', -without any execution counts. + If a translation of `gawk''s messages exists, then `gawk' produces +usage messages, warnings, and fatal errors in the local language.  -File: gawk.info, Node: Debugger, Next: Arbitrary Precision Arithmetic, Prev: Advanced Features, Up: Top +File: gawk.info, Node: Debugger, Next: Arbitrary Precision Arithmetic, Prev: Internationalization, Up: Top 14 Debugging `awk' Programs *************************** @@ -32254,65 +32254,65 @@ Ref: Passwd Functions-Footnote-1619614 Node: Group Functions619702 Node: Walking Arrays627786 Node: Sample Programs629923 -Node: Running Examples630600 -Node: Clones631328 -Node: Cut Program632552 -Node: Egrep Program642397 -Ref: Egrep Program-Footnote-1650170 -Node: Id Program650280 -Node: Split Program653896 -Ref: Split Program-Footnote-1657415 -Node: Tee Program657543 -Node: Uniq Program660346 -Node: Wc Program667775 -Ref: Wc Program-Footnote-1672041 -Ref: Wc Program-Footnote-2672241 -Node: Miscellaneous Programs672333 -Node: Dupword Program673521 -Node: Alarm Program675552 -Node: Translate Program680301 -Ref: Translate Program-Footnote-1684688 -Ref: Translate Program-Footnote-2684916 -Node: Labels Program685050 -Ref: Labels Program-Footnote-1688421 -Node: Word Sorting688505 -Node: History Sorting692389 -Node: Extract Program694228 -Ref: Extract Program-Footnote-1701729 -Node: Simple Sed701857 -Node: Igawk Program704919 -Ref: Igawk Program-Footnote-1720076 -Ref: Igawk Program-Footnote-2720277 -Node: Anagram Program720415 -Node: Signature Program723483 -Node: Internationalization724583 -Node: I18N and L10N726015 -Node: 
Explaining gettext726701 -Ref: Explaining gettext-Footnote-1731767 -Ref: Explaining gettext-Footnote-2731951 -Node: Programmer i18n732116 -Node: Translator i18n736316 -Node: String Extraction737109 -Ref: String Extraction-Footnote-1738070 -Node: Printf Ordering738156 -Ref: Printf Ordering-Footnote-1740940 -Node: I18N Portability741004 -Ref: I18N Portability-Footnote-1743453 -Node: I18N Example743516 -Ref: I18N Example-Footnote-1746151 -Node: Gawk I18N746223 -Node: Advanced Features746844 -Node: Nondecimal Data748719 -Node: Array Sorting750302 -Node: Controlling Array Traversal750999 -Node: Array Sorting Functions759237 -Ref: Array Sorting Functions-Footnote-1762911 -Ref: Array Sorting Functions-Footnote-2763004 -Node: Two-way I/O763198 -Ref: Two-way I/O-Footnote-1768630 -Node: TCP/IP Networking768700 -Node: Profiling771544 -Node: Debugger778999 +Node: Running Examples630597 +Node: Clones631325 +Node: Cut Program632549 +Node: Egrep Program642394 +Ref: Egrep Program-Footnote-1650167 +Node: Id Program650277 +Node: Split Program653893 +Ref: Split Program-Footnote-1657412 +Node: Tee Program657540 +Node: Uniq Program660343 +Node: Wc Program667772 +Ref: Wc Program-Footnote-1672038 +Ref: Wc Program-Footnote-2672238 +Node: Miscellaneous Programs672330 +Node: Dupword Program673518 +Node: Alarm Program675549 +Node: Translate Program680298 +Ref: Translate Program-Footnote-1684685 +Ref: Translate Program-Footnote-2684913 +Node: Labels Program685047 +Ref: Labels Program-Footnote-1688418 +Node: Word Sorting688502 +Node: History Sorting692386 +Node: Extract Program694225 +Ref: Extract Program-Footnote-1701726 +Node: Simple Sed701854 +Node: Igawk Program704916 +Ref: Igawk Program-Footnote-1720073 +Ref: Igawk Program-Footnote-2720274 +Node: Anagram Program720412 +Node: Signature Program723480 +Node: Advanced Features724580 +Node: Nondecimal Data726462 +Node: Array Sorting728045 +Node: Controlling Array Traversal728742 +Node: Array Sorting Functions736980 +Ref: Array Sorting 
Functions-Footnote-1740654 +Ref: Array Sorting Functions-Footnote-2740747 +Node: Two-way I/O740941 +Ref: Two-way I/O-Footnote-1746373 +Node: TCP/IP Networking746443 +Node: Profiling749287 +Node: Internationalization756742 +Node: I18N and L10N758167 +Node: Explaining gettext758853 +Ref: Explaining gettext-Footnote-1763919 +Ref: Explaining gettext-Footnote-2764103 +Node: Programmer i18n764268 +Node: Translator i18n768468 +Node: String Extraction769261 +Ref: String Extraction-Footnote-1770222 +Node: Printf Ordering770308 +Ref: Printf Ordering-Footnote-1773092 +Node: I18N Portability773156 +Ref: I18N Portability-Footnote-1775605 +Node: I18N Example775668 +Ref: I18N Example-Footnote-1778303 +Node: Gawk I18N778375 +Node: Debugger778996 Node: Debugging779967 Node: Debugging Concepts780400 Node: Debugging Terms782256 -- cgit v1.2.3