                              Txr Internals Guide
                         Kaz Kylheku <kaz@kylheku.com>

CONTENTS:

0.  Overview

1.  Coding Practice
1.1  Language
1.2  Program File Structure
1.3  Style
1.3  Error Handling
1.4  I/O
1.5  Regression

2.  Dynamic Types
2.1  Two Kinds of Values
2.1  Pointer Bitfield
2.2  Heap Objects
2.3  The COBJ type
2.4  Strings
2.4.1  Encapsulated C Strings
2.4.2  Representation Hacks for 2 Byte wchar_t

3.  Garbage Collection
3.1  Root Pointers
3.2  GC-safe Code
3.2.1  Rule One: Full Initialization
3.2.2  Rule Two: Make it Reachable
3.3  Weak Reference Support
3.4  Generational GC
3.4.1  Preprocessor Delimitation
3.4.2  Representation of Generations
3.4.3  Basic Algorithm
3.4.4  Handling Backpointers

4.  Debugging
4.2.  Debugging the Yacc-generated Parser
4.3.  Debugging GC Issues
4.4.  Valgrind: Your Friend


0. Overview

This is an internals guide for someone who wants to understand, and possibly
change or extend, the txr program. The purpose is to give explanations,
provide rationale and make coding recommendations.


1. Coding Practice

1.1  Language

Txr is written in a language that consists of the common dialect between C90
and C++98.  The code can be built with either the GNU C compiler or the GNU C++
compiler.  Use is made of some Unix functions from before Unix95, which are
requested by means of -D_XOPEN_SOURCE (POSIX.1, POSIX.2, X/Open Portability
Guide 4).  Also, the <wchar.h> header is used, which was introduced by a 1995
addendum to the C language, so it may be said that the actual C dialect is C95.

In coding new features or fixing bugs, care must be taken to preserve this.
Code must continue to compile as C and C++, and not increase the portability
requirements.

C++ compilation can be arranged using ./configure --ccname=g++ (for instance).

Note that txr takes some non-portable liberties with the language, such as
encoding bit fields into pointers, and treating automatic storage as a flat
stack which can be treated as an array that can be walked by a garbage
collector looking for references to objects. There are assumptions about the
alignment of objects too.

1.2  Program File Structure

The txr code has a simple flat structure: a collection of .c files
(and also a .l flex file and a .y yacc file) and headers.
The txr project follows the include header style in which every C source file
includes all needed headers, in the proper order. Headers do not include other
headers.

The generation of the dependency makefile dep.mk depends on this; the
depend.txr script does not scan headers for inclusion of other headers.  If
this stylistic decision is ever changed, the dependency generation will have to
be updated.


1.3  Style

Tab characters are avoided in txr source files. The indentation is two
characters.  Formatting is similar to K&R, though the yacc grammar files use a
Lispy formatting.  Expression or statement elements which are syntactically
parallel, but on separate lines, must be horizontally aligned with each other:

  if (function(argument1,
               argument))

rather than:

  if (function(argument1,
      argument))

The opening brace of a function goes on a separate line.  if/else braces
``cuddle'' into the previous line, except when the condition spans multiple
lines:

  if (multi +
      line +
      condition)
  {
    /* brace doesn't cuddle */
  } else {

  }

switch cases indent with the switch:

  switch (x)
  {
  case ...
    ...
    break;
  }

switches handle all enumeration members; default cases have a break
even if they are last in the block. The following style is permitted:

  if (...) {
    ...
  } else switch (...) {
  case ...

  }

Forward and backward goto are permitted, unless it is /glaringly/
obvious that the code can be written better without it.

Certain C programming conventions are avoided. For generic pointers to anything
(needed in some low-level code) use the type mem_t *, not void *, and use casts
on conversions to and from this pointer.

The void * pointer, which came into C by way of C++, is brain-damaged.  It
allows C programs to subvert the type system without any cast operators or
diagnostics. In C++ it's a little better because conversions from void *
require a cast.  In this project, we want all hazardous pointer conversions to
be marked in the code by casts, whose presence is demanded by compiler
diagnostics.
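
A minimal sketch of this convention follows. It assumes that mem_t is a
typedef for unsigned char (an assumption made for illustration; consult lib.h
for the real definition), and alloc_bytes and struct point are hypothetical
names which do not exist in txr:

```c
#include <stdlib.h>

typedef unsigned char mem_t;   /* assumed definition, for illustration only */

struct point { int x, y; };    /* hypothetical structure */

/* Hypothetical wrapper: hands out raw storage as mem_t *, never void *. */
static mem_t *alloc_bytes(size_t size)
{
  return (mem_t *) malloc(size);
}

static struct point *make_point(int x, int y)
{
  /* The cast is mandatory: without it, a C compiler must diagnose the
     conversion from mem_t * to struct point *. With void *, C would
     accept the conversion silently. */
  struct point *p = (struct point *) alloc_bytes(sizeof *p);

  if (p != 0) {
    p->x = x;
    p->y = y;
  }

  return p;
}
```

Searching the code for casts involving mem_t * then finds every such
conversion.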

1.3 Error Handling

Multiple return points from functions are encouraged. Txr has a garbage
collector, so there is usually no need to branch to a common cleanup just for
the sake of freeing memory.

Txr also has exceptions; code that must free some resource other than
garbage collected memory if a failure occurs, must be exception safe.

Exceptions should be used for both internal errors and environmental
situations. The internal_error macro is preferred to calling abort.


1.4 I/O

Use of the C streams and printf must be avoided. Txr has its own
streams and its own formatter function called format. Printing to
a dynamic string is supported. There are three global streams:
std_output, std_input and std_error. These streams don't do everything
that standard I/O streams can do, such as binary I/O, but
their capabilities can be extended.


1.5 Regression

All changes must be verified not to break the test cases. This is done by
running ``make tests''. Running ``make tests'' is not possible if the code is
being cross-compiled; in that case run ``make install-tests'' after ``make
install''. This will add the test cases and a shell script to run them to the
installation. The cases can then be installed and run on the cross target.


2. Dynamic Types

The txr code is organized around a dynamic typing paradigm implemented in C.
Values are represented by the C type val, which is a typedef name for
a pointer to obj_t, i.e. obj_t *.


2.1 Two Kinds of Values

A value of type val is one of two kinds: heaped or immediate.

A heaped val points to an obj_t object, which is a union of a number
of structure types, discriminated by a type field.

An immediate val actually contains the value inside the pointer, and does not
point to anything.


2.1 Pointer Bitfield

Immediate and heaped values are distinguished by a two-bit field in the least
significant bits of the pointer. If the two bits are 00 (i.e. the pointer is
four-byte-aligned) then the value is a pointer to a heaped object, unless
it is the null pointer. The null pointer is understood to be the object
nil.  The is_ptr(v) macro evaluates true for a value v which is not nil,
and which points to a heap object (at least according to its bit field;
is_ptr does not validate the pointer).

The codes 01, 10 and 11 indicate immediate values: values of type NUM, CHR and
LIT, respectively. That is to say, if the tag bits are 01, then the remaining
upper bits of the pointer constitute a signed integer. The range of this
integer is NUM_MIN to NUM_MAX, defined in lib.h.  The code 10 is for
characters: the remaining bits of the pointer encode a wchar_t value. The bits
11 indicate that the object is a pointer to an encapsulated C string (of
wide characters), which is most often a literal. See the subsection
Encapsulated C Strings below.  Only C strings whose first character is suitably
aligned can be represented as LIT objects. The address of the first character
of the string is formed by masking out the 11 code, leaving a pointer which is
four-byte aligned.
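
The tagging scheme can be sketched as follows. This is an illustration under
stated assumptions, not the code from lib.h: apart from TAG_LIT, the constant
and function names are hypothetical, and for brevity the sketch uses
<stdint.h>, which lies outside the C90 dialect described in section 1.1.

```c
#include <stdint.h>

typedef union obj obj_t;   /* opaque here; only the pointer bits matter */
typedef obj_t *val;

/* Hypothetical names for the two-bit tag field and its four codes. */
enum {
  TAG_SHIFT = 2, TAG_MASK = 3,
  TAG_PTR = 0, TAG_NUM = 1, TAG_CHR = 2, TAG_LIT = 3
};

static int tag_of(val v)
{
  return (int) ((uintptr_t) v & TAG_MASK);
}

/* Non-nil and tagged 00: looks like a heap pointer (not validated). */
static int is_heap_ptr(val v)
{
  return v != 0 && tag_of(v) == TAG_PTR;
}

static val make_num(long n)
{
  return (val) (((uintptr_t) n << TAG_SHIFT) | TAG_NUM);
}

static long get_num(val v)
{
  /* Relies on arithmetic right shift of negative values, which is
     implementation-defined in C but nearly universal. */
  return (long) ((intptr_t) v >> TAG_SHIFT);
}

static val make_chr(long ch)
{
  return (val) (((uintptr_t) ch << TAG_SHIFT) | TAG_CHR);
}

static long get_chr(val v)
{
  return (long) ((intptr_t) v >> TAG_SHIFT);
}
```

Note that nil (the null pointer) has the tag bits 00, yet is not a heap
pointer, which is why the pointer test checks for it explicitly.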


2.2 Heap Objects

Heap types are a union of various structures: union obj. The obj_t name is a
typedef.  All of the structures are no larger than four pointer-sized words,
including the type tag, and it should be kept that way.  Heaps are managed as
arrays of this union obj.  If any one of the union members is made larger than
four words, then the heap size will increase.

Though the type tag is defined by an enumeration, for memory management
purposes, the type field is overloaded with additional bitmasked values.
The FREE flag in a type field marks an object on the free list.
The REACHABLE flag is the marking bit used during garbage collection.


2.3 The COBJ type

The COBJ type is a mechanism whereby a ``native'' C type can be integrated into
the dynamic type system. Under the COBJ model, the heap allocated object of
type COBJ serves as a handle which points to a separately allocated C object,
which can be an arbitrary structure. The relationship between the dynamic world
and this object is managed through a registered table of operations. The module
managing that object must provide functions for dealing with garbage
collection, printing, equality and hashing.  The garbage collector hooks allow
the object's module to be notified when the associated COBJ handle becomes
unreachable. The associated C object may contain references to dynamic objects
(i.e. members of type val).  In that case, it must provide the mark function,
which, when invoked, must traverse the object's members of this type and
report to the garbage collector that they are reachable by invoking
mark_obj on them.
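
The shape of such an operations table can be sketched as below. Everything
here is hypothetical and simplified for illustration: the real table also
carries the printing, equality and hashing functions, and the names
cobj_ops_sketch, pair_ctx and mark_obj_sketch are inventions of this example,
as is the assumed definition of mem_t.

```c
#include <stdlib.h>

typedef unsigned char mem_t;      /* assumed definition, for illustration */

struct obj { int reachable; };    /* toy stand-in for a heap object */
typedef struct obj *val;

/* Hypothetical, reduced operations table: only the gc hooks. */
struct cobj_ops_sketch {
  void (*mark)(mem_t *handle);     /* report val members as reachable */
  void (*destroy)(mem_t *handle);  /* called when the handle is garbage */
};

/* A "native" C structure which holds references to dynamic objects. */
struct pair_ctx {
  val first, second;
};

/* Toy stand-in for mark_obj. */
static void mark_obj_sketch(val v)
{
  if (v != 0)
    v->reachable = 1;
}

static void pair_mark(mem_t *handle)
{
  struct pair_ctx *p = (struct pair_ctx *) handle;
  mark_obj_sketch(p->first);
  mark_obj_sketch(p->second);
}

static void pair_destroy(mem_t *handle)
{
  free(handle);
}

static const struct cobj_ops_sketch pair_ops = { pair_mark, pair_destroy };
```

The collector never looks inside struct pair_ctx itself; it only calls
through the table, so a mark function which forgets a member leads to the
premature reclamation of whatever that member refers to.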


2.4 Strings

All string manipulation should be done using the dynamic object system.
The object system provides three kinds of strings: encapsulated
C strings, regular strings and lazy strings (type tags LIT, STR and LSTR,
respectively). Most code working with strings doesn't have to care about
the difference between these. However, taking advantage of the performance
capabilities of lazy strings requires some special coding (which is
backward compatible with regular strings). For instance, if you want to
know whether the length of a lazy string S is greater than 42, you don't want
to do this: gt(length_str(S), num(42)). This will force an instantiation
of the lazy string. There are functions for testing whether a string's length
is greater than, less than, greater than or equal to, or less than or equal to
some number.
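
Why a dedicated comparison helps can be seen in the following toy model. It
is not txr's lazy string implementation: struct lazy_str, force_one,
length_full and length_gt are hypothetical names, and the "string" is reduced
to a pair of counters so that the amount of forcing can be observed.

```c
#include <stddef.h>

/* Toy lazy string: only counts how much has been materialized. */
struct lazy_str {
  size_t forced;      /* characters materialized so far */
  size_t remaining;   /* characters the source can still produce */
};

/* Materialize one more character if possible; returns 1 on success. */
static int force_one(struct lazy_str *s)
{
  if (s->remaining == 0)
    return 0;
  s->remaining--;
  s->forced++;
  return 1;
}

/* Full length: forces the entire string. */
static size_t length_full(struct lazy_str *s)
{
  while (force_one(s))
    ;
  return s->forced;
}

/* Is the length greater than n? Forces at most n + 1 characters. */
static int length_gt(struct lazy_str *s, size_t n)
{
  while (s->forced <= n)
    if (!force_one(s))
      return 0;
  return 1;
}
```

Asking whether a lazy string of a thousand characters is longer than 42
materializes only 43 of them in this model, whereas computing the full length
first materializes all of them.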


2.4.1 Encapsulated C Strings

The design of the dynamic type system recognizes that programs contain literals
and static strings, and that sometimes transient strings are used which
have temporary lifetimes. Therefore, a special provision is made in the val
type to be able to represent C strings directly, without having to create
dynamically allocated copies in heap storage. These C strings represented as
values of type val are referred to in this document as encapsulated C strings.
A C string whose address is aligned on a four-byte boundary, or more strictly,
is converted to an encapsulated C string by masking the bits 11 into the least
significant two bit positions of its pointer, and then manipulated as a value
of type val (pointer to obj_t).

Encapsulated C strings can be transparently used wherever the other kinds of
strings can be used, so the benefit is immense, for the small cost of a bit
operation.

Most often, this feature is used for literals, and the lit macro is provided
for this situation. The macro call lit("abc") produces a value of type val
which represents the wide string L"abc".

However, C strings other than literals can be encapsulated as values also. The
most obvious candidates are static strings which are arrays, rather than
literals, and stack-allocated strings, which C programs often use as
efficient temporary buffers for character manipulation. Two functions are
provided for converting these kinds of strings to encapsulated strings: the
functions static_str and auto_str. They do the same thing: simply take the
wchar_t * pointer and convert it to an obj_t * pointer with the bits 11 in the
tag field (thus requiring that the C string pointer be aligned such
that these bits are originally 00).  Two different functions which do the same
thing are provided, because it is generally much safer to convert a static
string to a val (due to its indefinite lifetime) than an automatic string
(which becomes indeterminate when the enclosing block terminates).  Care should
be taken to only ever use auto_str to wrap a stack-allocated string as a val,
so that such usage can be found in the program by searching for occurrences of
``auto_str''. Secondly, care should be taken to ensure that values produced by
auto_str do not try to escape beyond the lifetime of the enclosing block.  If
they are passed to functions those functions must not retain the value in any
persistent place.  For instance if an object is constructed which contains an
automatic string, that object must not be used beyond the lifetime of that
string.  Note that it is okay if garbage objects contain auto_str values, which
refer to strings that no longer exist, because the garbage collector will
recognize these pointers by their type tag and not use them.

2.4.2 Representation Hacks for 2 Byte wchar_t

On some systems (notably Cygwin), the wide character type wchar_t is only
two bytes wide, and the alignment of string literals and arrays is two
byte. This creates a problem: we need a two-bit type tag in the pointer,
but pointers have only one spare bit due to their strict alignment.

It turns out that this is not a problem provided that we can ensure that no two
distinct string objects share the same four byte word, and if we're willing to
incur a small performance penalty to find the beginning of the string when we
need it.

On these systems, what we do is add a null character at the beginning of the
string, and an extra one at the end: So the literal L"abc" is actually
represented by L"\0" L"abc" L"\0".  We then take the pointer to the 'a'
character as the string, which falls into one of two cases: it is either
four-byte aligned (case 1), or it is two-byte aligned (case 2). Either way, it
falls into some four byte cell, either at its base or at its third byte. When
we add the tag bits 11 (TAG_LIT), we make this pointer point to the fourth byte
(byte 3) of the four byte cell.  To recover the pointer, we remove the tag
(replace it with bits 00), which leaves us pointing to the base of the
four-byte cell. The string either starts there (case 1) or two bytes higher
(case 2). The case is distinguished by looking at the pointed-at wchar_t. If it
is the null character, then the pointer is incremented to the next character.

The padding at the end of the string ensures that this trick works for the
null string, where the test for the null character always succeeds.
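
The pointer recovery described above can be sketched as follows. This is a
model under stated assumptions, not the code behind the lit macro: wch2_t
stands in for a two-byte wchar_t, the union merely forces four-byte alignment
of the test arrays, and the names encapsulate and recover are hypothetical.

```c
#include <stdint.h>

typedef uint16_t wch2_t;   /* stand-in for a two-byte wchar_t */

/* The uint32_t member forces four-byte alignment of the array. */
union cell_buf {
  wch2_t ch[8];
  uint32_t align;
};

/* Case 1: 'a' sits at offset 4, the base of a four-byte cell. */
static const union cell_buf case1 = { { 'x', 0, 'a', 'b', 'c', 0, 0 } };

/* Case 2: 'a' sits at offset 2, the third byte of a four-byte cell. */
static const union cell_buf case2 = { { 0, 'a', 'b', 'c', 0, 0 } };

/* Null string in case 1: terminator at offset 4, padding null after it. */
static const union cell_buf empty1 = { { 'x', 0, 0, 0 } };

/* Add the tag bits 11: the result points at byte 3 of the cell. */
static uintptr_t encapsulate(const wch2_t *first_char)
{
  return (uintptr_t) first_char | 3;
}

/* Strip the tag, then skip the leading null if we landed on it. */
static const wch2_t *recover(uintptr_t v)
{
  const wch2_t *base = (const wch2_t *) (v & ~(uintptr_t) 3);
  return (*base == 0) ? base + 1 : base;
}
```

For the null string in case 1, the test for the null character succeeds on
the terminator itself, and the recovered pointer lands on the padding null
that follows it, which still reads as an empty string.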

The lit macro, which existed before this hack, takes care of doing this so most
code doesn't know the difference.

The new wli macro helps manage this representation when access is needed to C
string literals which are not used directly, but first assigned to variables,
and also provides type safety by using a different pointer type for strings
which have been treated with the padding.

  const wchli_t *abc = wli("abc"); /* special type */

  val abc_obj = static_str(abc); /* good: requires const wchli_t * pointer */

  val xyz_obj = static_str(L"xyz"); /* error */

  val def_obj = static_str(lit("abc")); /* error */

The wini and wref macros manage this representation when character arrays are
used. The wini macro abstracts away the initializer, so the programmer doesn't
have to be aware of the extra null bytes:

  wchar_t abc[] = wini("abc"); /* potentially six wchar_t units! */

The wref macro hides the displacement of the first character:

  wchar_t *ptr_a = wref(abc); /* pointer to "a" */

  wref(abc)[1] = L'B'; /* overwrite 'b' with 'B' */

On a platform where this hack isn't needed, these w* macros are no-ops.


3. Garbage Collection

Txr has a fairly simple mark-and-sweep garbage collector. The collector marks
objects by performing a depth-first-search over the graph formed by
inter-object references, starting at certain root values.  Objects which are
not marked are identified during the sweep phase, which is a linear scan
through the object heaps, and placed on the free list.

During the marking phase, the bit value 0x100 (denoted by the symbolic constant
REACHABLE) is used to mark reachable objects. This flag is reset during
the sweep phase, but the flag 0x200 (the value of the symbolic constant FREE)
is added to the type field of objects on the free list. This FREE flag has
the effect of ``poisoning'' free objects: if an object is prematurely
reclaimed (indicating a bug in the garbage collection system), uses
of that object will see a bad type tag, so that there is a good chance
the program will throw an exception due to a failed type check.
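
The interplay of the two flags can be sketched with a toy collector over a
tiny heap of cons-like cells. The flag values are the ones given above;
everything else (struct cell, T_CONS, heap_init, alloc_cons and the fixed heap
size) is hypothetical and much simpler than gc.c.

```c
#include <stddef.h>

enum { T_CONS = 1, REACHABLE = 0x100, FREE = 0x200 };

struct cell {
  int type;                 /* type tag, possibly OR-ed with a flag */
  struct cell *car, *cdr;   /* references; car links free cells */
};

#define HEAP_SIZE 8
static struct cell heap[HEAP_SIZE];
static struct cell *free_list;

static void heap_init(void)
{
  size_t i;
  free_list = 0;
  for (i = 0; i < HEAP_SIZE; i++) {
    heap[i].type = FREE;
    heap[i].car = free_list;
    heap[i].cdr = 0;
    free_list = &heap[i];
  }
}

static struct cell *alloc_cons(struct cell *car, struct cell *cdr)
{
  struct cell *c = free_list;
  if (c == 0)
    return 0;               /* a real allocator would trigger gc here */
  free_list = c->car;
  c->type = T_CONS;         /* overwriting the type clears FREE */
  c->car = car;
  c->cdr = cdr;
  return c;
}

/* Depth-first marking from one root. */
static void mark(struct cell *c)
{
  if (c == 0 || (c->type & REACHABLE))
    return;
  c->type |= REACHABLE;
  mark(c->car);
  mark(c->cdr);
}

/* Linear scan: unmark survivors, poison and free the rest. */
static void sweep(void)
{
  size_t i;
  for (i = 0; i < HEAP_SIZE; i++) {
    struct cell *c = &heap[i];
    if (c->type & FREE)
      continue;
    if (c->type & REACHABLE) {
      c->type &= ~REACHABLE;
    } else {
      c->type |= FREE;      /* type no longer matches any valid tag */
      c->car = free_list;
      c->cdr = 0;
      free_list = c;
    }
  }
}
```

After a sweep, a reclaimed cons has the type T_CONS | FREE, which compares
unequal to T_CONS, so a type check on a stale reference fails instead of
quietly succeeding.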

3.1  Root Pointers

The marking phase of the garbage collector looks in two places for root
pointers: by scanning the entire call stack, and by looking at a registered
list of global variables.

Scanning the stack means that the garbage collector is conservative: it could
encounter values which look like valid object references, but are actually only
accidentally so due to having the right bit pattern. When this happens,
objects that should be considered garbage will remain live.
This is called "spurious retention", and can be a bad problem, but it's
better than the opposite problem of premature deallocation.

Global root pointers are registered individually using the prot1 function,
or many at once using the protect function. Care must be taken to properly
null-terminate the variable argument list to protect. It does not use the
nao convention, but rather (val *) 0.

The garbage collector takes care to also scan the machine registers.  This is
currently done using a broadly portable approach, namely recording the machine
state into the stack with the setjmp macro.
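
The reason for the (val *) 0 terminator can be seen in a sketch of how such a
variadic registration function might be written. The names prot1_sketch and
protect_sketch, the fixed-size table and its limit are hypothetical; only the
calling convention is taken from the text above.

```c
#include <stdarg.h>

typedef struct obj *val;     /* opaque object pointer */

#define MAX_ROOTS 32
static val *root_locs[MAX_ROOTS];   /* addresses of protected variables */
static int root_count;

/* Register the location of a single global variable. */
static void prot1_sketch(val *loc)
{
  if (root_count < MAX_ROOTS)
    root_locs[root_count++] = loc;
}

/* Register many locations; the list ends with (val *) 0. */
static void protect_sketch(val *first, ...)
{
  va_list vl;
  val *next = first;

  va_start(vl, first);

  while (next != 0) {
    prot1_sketch(next);
    /* Each argument is fetched as a val *. A bare 0 in the call would
       be passed as an int, which need not have the size or
       representation of a pointer; hence the cast at the call site. */
    next = va_arg(vl, val *);
  }

  va_end(vl);
}
```

A call then looks like protect_sketch(&a, &b, &c, (val *) 0). The collector
stores the addresses of the variables, not their current values, so that it
sees whatever the variables hold at the time of marking.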


3.2  GC-safe Code

Since garbage collection is being used in code processed by a compiler which
knows nothing about garbage collection, it is important to obey certain rules
so that the code is gc-safe. Code which is not gc-safe is susceptible
to two potential serious problems: the premature garbage collection of an
object, and accesses, in the garbage collector, to uninitialized parts of an
object.

The rules for gc-safe code are not difficult in txr, due to the immense
simplification that the garbage collector scans the stack and registers.  If a
value is in an automatic local variable, or if the code is working with the
value as the result of an expression, function return, or passing it as a
function parameter, that value is visible to the gc and protected.  Thus, the
rules only have to be followed in lower-level code which is close to the
allocator. Normal application code does not have to follow any special rules.

The garbage collector is called implicitly by code which calls make_obj to
pull a raw object from the garbage collector's free list. Code which does
not allocate objects will not be interrupted by the garbage collector.
That's another helpful simplification, but it comes at the cost of not
supporting multithreading. However, code that calls make_obj must be
written with the assumption that make_obj may garbage collect on any call.

Now, here come the rules.

3.2.1 Rule One: Full Initialization

  A function which calls make_obj must not be hanging on to any 
  references to a partially initialized object. 

Any partially initialized object may be visited by the garbage collector during
the call to make_obj. A partially initialized object may have a type code which
still indicates that it is free. If the garbage collector encounters an object
on the stack which is free, it will simply skip that object. This means that
the sweep phase may then return that object to the free list. If a free object
is encountered during transitive marking, the garbage collector will abort.

In other words, if the program allocates an object from the free list, but then
accidentally invokes the garbage collector prior to completing the
initialization of that object, the object may be reclaimed back to the
free list and the program is then working with a freed object; or
the program may even abort.

If the program initializes only the type field of the object from make_obj,
but not the other fields that may contain a value of type val, and then invokes
the garbage collector, the garbage collector will treat that object as visible,
and then try to mark the val-typed fields of that object, thereby using
uninitialized memory.

The full initialization rule is therefore that after make_obj is called, the
object must be fully initialized before doing any other operation that may
allocate gc memory. Fully initialized means that the type field is initialized,
as well as any other field that is visited during garbage collection.

3.2.2 Rule Two: Make it Reachable

  A function which constructs an object must place it in live, reachable
  storage before attempting to construct another object.

The garbage collector does not scan all of memory for root pointers, only the
stack and registered globals. So for instance, if the only reference to an
object is inside a dynamically allocated structure, and that structure is not
visible to the allocator, then if gc is invoked, that object will be reclaimed.
So the following pattern is incorrect.

  {
    some_struct_type *t = (some_struct_type *) chk_malloc(sizeof *t);
    t->value = cons(foo, bar);
    return cobj((mem_t *) t, some_type_symbol, &some_type_ops);
  }

There are three allocations in the code. The allocation of the structure
assigned to pointer t, the allocation of the cons cell stored in t->value, and
the allocation of the COBJ.  The issue is that the object t is not known to the
allocator.  It is a ``native'' C type, which the garbage collector will not
traverse.  The garbage collector can see the pointer t, because it scans the
stack and registers, but that pointer is not recognized by the garbage
collector since it doesn't point into one of its heaps, and so the collector
will not find and mark the t->value member.

Of course, the operations structure ``some_type_ops'' presumably contains a
mark function which knows how to traverse this object and find values inside
it. But that does not come into play until this object is registered as a
COBJ, which does not happen until the last line in the above block
where the cobj function is called. After the cobj call, the t pointer
is hooked into the COBJ object, and visible to the garbage collector.

So the object allocated by cons(foo, bar) is put into a structure which is
yet invisible to the allocator, and that reference is the only live reference
which the program has to that cons cell. Consequently, the subsequent call to
the allocator, hidden inside the cobj function, may trigger gc, and cause this
cons cell to be reclaimed into the free list!

The following adjustment does not fix the problem:

  {
    val c = cons(foo, bar);
    some_struct_type *t = (some_struct_type *) chk_malloc(sizeof *t);
    t->value = c; /* still wrong */
    return cobj((mem_t *) t, some_type_symbol, &some_type_ops);
  }

Even though the cons cell is now also held in a local variable, as well as in
the structure, it is still not necessarily visible to the garbage collector.
The problem is that after the ``t->value = c'' assignment, the variable c is no
longer live. Variable liveness is a concept from dataflow analysis, which
is a process implemented in optimizing compilers. A variable is live at some
point in the code if the value stored in it has a next use: other code can be
reached from that point which uses the value.  The variable c has no next use
after the t->value = c assignment. There is only one execution path from that
point in the code, and that path leads to the termination of the block, which
destroys c. Essentially, the t->value structure member is the sink for the
data flow which carries the cons cell: The data flow emanates from the call
cons(foo, bar), and terminates in t->value.

There are several right ways to fix this:

  {
    val co;
    some_struct_type *t = (some_struct_type *) chk_malloc(sizeof *t);
    t->value = nil;
    co = cobj((mem_t *) t, some_type_symbol, &some_type_ops);
    t->value = cons(foo, bar);
    return co;
  }

The above properly initializes the structure, and then associates it with the
COBJ. This makes the structure visible to the garbage collector (through the co
variable, which is live at the point where the cobj function is called, due to
having a next use in the return statement!) Now we can safely stash a newly
allocated cons cell into that structure, allowing that structure to hold the
one and only reference to that object.

Another approach, which avoids two-step initialization of the structure:

  {
    val co;
    val c = cons(foo, bar);
    some_struct_type *t = (some_struct_type *) chk_malloc(sizeof *t);
    co = cobj((mem_t *) t, some_type_symbol, &some_type_ops);
    t->value = c;
    return co;
  }

In this situation, the variable c maintains a live, gc-visible reference to the
cons across the cobj allocation. The variable c is live at the point of the
cobj call because it has a next use: its value is used in the subsequent
assignment to t->value.  We don't initialize the structure because even if
the cobj function triggers gc, the gc cannot yet see that structure and
so there is no danger. After cobj returns, the first thing we do is
initialize the structure (obeying the first rule of gc-safe code). 
Just after cobj returns, the structure is uninitialized and visible to the
garbage collector, but there is nothing that will trigger gc prior to
the initialization.

Note that this premature collection problem also affects functions which simply
take an existing object and put it into a structure, where it is not obvious
that an object may have been allocated which is not visible to gc:

  /* Looks harmless: allocate structure, stick the argument object
     into it and make a COBJ! */

  val make_foo(val member)
  {
    foo *f = (foo *) chk_malloc(sizeof *f);
    f->mem = member; /* Oops, member is no longer live. */
    return cobj((mem_t *) f, ...);
  }

The problem is that the caller which invokes make_foo might not maintain any live
reference to the argument object either, and so the f->mem = member might
be the one and only sink for the data flow carrying that object; i.e.
the one and only reference to that object in the entire program.
One way that can happen is that the object is just a temporary that is
allocated in the function call expression itself:

  make_foo(string("abc")); /* oops! */

The make_foo function can be corrected like this:

  val make_foo(val member)
  {
    val co;
    foo *f = (foo *) chk_malloc(sizeof *f);
    co = cobj((mem_t *) f, ...);
    f->mem = member;
    return co;
  }


3.3 Weak Reference Support

COBJ objects can support weak pointers, but there is no fully encapsulated
interface for this; to be more specific, when adding a new module of objects
that have weak references, it is necessary to add a function call to the
garbage collection function.

Modules with weak references should closely follow the design pattern used by
the hash module.  Hash tables are implemented using COBJ, and provide weak key
and value support thanks to cooperation with the gc module.

Weak references work as follows. During gc marking, a given COBJ module
must maintain a list of all objects of its kind which are marked
(or at least just that subset of them which contains weak references).
It must refrain from marking the weak references contained in these
objects, but rather leave them unmarked.

After the initial marking phase, gc will call a global function in each module
that manages objects with weak references. (Currently there is only one such
function: hash_process_weak; a similar function must be written
for a new module and added).

This function must process and clear the weak list gathered during the
initial marking. Each weak reference in each object on this weak
marked list must be inspected to see whether it refers to an object which is
still reachable. Weak references which point to values which have not been
reached (do not have the REACHABLE bit) must be lapsed according to the
object's rules for lapsing weak references. For instance, a hash table with
weak keys will delete a key/value pair if the key reference lapses.  A weak
pointer container object might convert a lapsed weak reference to the value
nil.

Weak objects can defer marking certain other non-weak objects.  For instance
the hash module, during marking, does not mark the vector object that serves as
the hash chain table (at least not for weak hashes), and neither does it mark
the conses which make up the hash chains emanating from that vector. This
marking is completed in hash_process_weak. After the lapsed entries are removed
(their conses are spliced out of the chains), then the vector is marked, which
transitively causes the chain conses to be marked. The conses that were removed
due to the lapsing of weak keys thus stay unmarked and are reclaimed during
the sweep phase of the gc, which soon follows.


3.4 Generational GC

3.4.1 Preprocessor Delimitation

Currently, the generational GC code is delimited by #ifdef CONFIG_GEN_GC.
So to understand what the differences are between the regular GC and
generational, one just has to read those sections dependent on that
preprocessor symbol.

3.4.2 Representation of Generations

Generational garbage collectors are typically copying collectors. In a copying
system, objects can be segregated into generations by their physical location.
If an object is in a certain area, then it's in a certain generation. Moving
it to a different area reassigns it to a different generation.

In TXR, the garbage collection is non-copying. For generational GC support, we
simply carve some bits from the type field of an object to indicate the
generation.

There are only three generation values: -1, 0 and 1. Both -1 and 0 indicate a
"baby" object. The value 1 indicates a mature object.  A freshly allocated
object is put into generation 0. The value -1 is used to prevent
a young object from being placed into the checkobj array twice.
There would be no harm in doing so, so this is an optimization.  Avoiding
duplicate entries in the table prevents it from filling up due to repetitions,
which is desirable because when checkobj fills up, a gc must be triggered.
The checkobj array has to do with backreferences from old to young
objects, and is described a few paragraphs below.
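
As an illustration of carving generation bits out of the type field, here is
a minimal sketch. The struct layout, the field widths and the names toy_obj,
is_baby and is_mature are assumptions made for this example, not the actual
definitions in the code base.

```c
/* Sketch only: field names and widths are assumptions. */
enum { TOY_CONS = 1, TOY_STR = 2 };

typedef struct toy_obj {
  unsigned int type : 6;  /* type code */
  signed int gen : 2;     /* generation: -1, 0 or 1 */
} toy_obj;

/* Both -1 and 0 denote a baby object. */
int is_baby(const toy_obj *o)
{
  return o->gen < 1;
}

/* Only 1 denotes a mature object. */
int is_mature(const toy_obj *o)
{
  return o->gen == 1;
}
```

A signed two-bit field is just wide enough to hold the three values in use.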

3.4.3 Basic Algorithm

When an object is newly allocated, it is not only assigned generation 0, but is
appended into the freshobj array. This array allows the garbage collector to
identify all of the baby objects (because unlike a copying collector, it cannot
just traverse a "nursery" area). The array is cleared on every garbage
collection, and after each garbage collection, there are only mature objects,
no babies since all live objects are promoted to generation 1. So freshobj
identifies all baby objects since the last garbage collection, which is the
same as all baby objects in existence, period. Whenever the freshobj array
fills up, a generational collection cycle must be triggered, otherwise
there is no place to record the next baby object.

Generational collection walks all of the root places like the stack and
registered globals. However, generational garbage collection does not traverse
generation 1 objects. It traverses only objects whose generation is less
than 1. When a generation 1 object is visited, the recursion simply returns.
All generation 1 objects are considered reachable, without the necessity of
visiting all of them and marking them. This of course may be wrong: there
may be generation 1 objects which have become garbage.  Generational garbage
collection will not find generation 1 garbage, only a full garbage collection
pass will.

Under generational GC, a full sweep is also not performed. Since a full mark
was not done, it would be pointless. A full sweep would just waste time
visiting all of the heaps and necessarily skipping all the unmarked
generation 1 objects, almost defeating the point of generational GC.

The full sweep is replaced by a generational sweep which traverses only the
baby objects, which, recall, are all in the freshobj array.  Those baby objects
which were not marked during the marking phase are recycled.  So generational
GC saves time by avoiding doing full marking (terminating whenever it meets a
generation 1 object) and avoiding a full sweep (processing only the freshobj
array).
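
The following toy model sketches the cycle described above: registration of
babies in a freshobj array, marking that stops at generation 1, and a sweep
that visits only the babies. All names other than freshobj are hypothetical,
the array size is arbitrary, and each object has just one outgoing reference
to keep the example short.

```c
#include <stddef.h>

#define FRESHOBJ_SIZE 8    /* assumed capacity, for illustration */

typedef struct toy_obj {
  int gen;                 /* -1 or 0: baby; 1: mature */
  int reachable;           /* mark bit */
  int is_free;             /* set when the object is recycled */
  struct toy_obj *child;   /* single outgoing reference */
} toy_obj;

static toy_obj *freshobj[FRESHOBJ_SIZE];
static size_t freshobj_idx;

/* Record a newly allocated object as a baby.  Returns 0 if the array
   is full, in which case a collection is needed first. */
int toy_register_fresh(toy_obj *o)
{
  if (freshobj_idx == FRESHOBJ_SIZE)
    return 0;
  o->gen = 0;
  o->reachable = 0;
  o->is_free = 0;
  freshobj[freshobj_idx++] = o;
  return 1;
}

/* Generational marking: traversal stops at generation 1 objects,
   which are assumed reachable without being visited. */
void toy_mark(toy_obj *o)
{
  while (o != NULL && o->gen < 1 && !o->reachable) {
    o->reachable = 1;
    o = o->child;
  }
}

/* Generational sweep: only the babies in freshobj are examined.
   Survivors are promoted to generation 1; the rest are recycled. */
void toy_gen_sweep(void)
{
  size_t i;

  for (i = 0; i < freshobj_idx; i++) {
    toy_obj *o = freshobj[i];
    if (o->reachable) {
      o->reachable = 0;
      o->gen = 1;
    } else {
      o->is_free = 1;
    }
  }

  freshobj_idx = 0;
}
```

A cycle in this model consists of calling toy_mark on each root and then
calling toy_gen_sweep, after which freshobj is empty and no babies remain.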

3.4.4 Handling Backpointers

Under generational GC, there is the problem that objects in generation 1 can
be destructively changed (mutated) so that they point to baby objects. This is
a problem, because generational GC avoids traversing the generation 1 objects.
If the only reference to a baby object is a mutated pointer in a mature object,
and the GC doesn't realize this, it will reclaim that object, leaving the
mature object with an invalid, dangling pointer.  

This problem is solved by identifying all such destructive operations
in the code base, and ensuring that they use the set macro defined in "lib.h"
rather than straight C assignment.

If TXR is not compiled for generational GC support, then the set macro
expands to a C assignment, otherwise it expands to a call to the function
gc_set. First, gc_set checks whether the assignment place looks like it might
be in the heap.  If the assignment place is not in the heap then it must be in
the stack, or else a static variable: places which are traversed for root
references.  For such places, gc_set can proceed to do a straight assignment
and return.  Secondly, gc_set checks whether the object being assigned is a
generation 0 heap object. Non-heap objects such as string literals, fixnum
integers and characters do not have a generation and are ignored: for these,
the assignment is simply done.

If gc_set detects that the address of a generation 0 object is being written
into what looks like it might be a heap location, it changes the generation of
the object to -1 and stores it in the next available element of the checkobj
array.  The change to -1 prevents it from repeating this action for the same
object, since duplicates only waste space in the checkobj array. Not only are
duplicates wastefully visited more than once, but when checkobj is full, a
generational GC cycle is triggered.

During a generational gc, the checkobj array is treated as an additional root
area, ensuring that baby objects that might be the target of a backpointer from
generation 1 are marked and retained.
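
The write barrier described in this section can be sketched as follows. This
is a toy model under stated assumptions, not the gc_set function from the
code base: toy_heap is a single static array standing in for the heaps,
toy_gen_gc_stub merely imitates the effect of a collection on the checkobj
array, and the names TOY_SET, TOY_GEN_GC, toy_gc_set and in_toy_heap are
invented for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_GEN_GC 1       /* stands in for CONFIG_GEN_GC */
#define TOY_HEAP_SIZE 16
#define CHECKOBJ_SIZE 4    /* assumed capacity, for illustration */

typedef struct toy_obj {
  int gen;                 /* -1 or 0: baby; 1: mature */
  struct toy_obj *slot;    /* a mutable reference inside the object */
} toy_obj;

static toy_obj toy_heap[TOY_HEAP_SIZE];  /* stand-in for the heaps */
static toy_obj *checkobj[CHECKOBJ_SIZE];
static size_t checkobj_idx;
static int toy_gc_count;

/* Does this address look like it lies within the heap? */
static int in_toy_heap(const void *p)
{
  uintptr_t a = (uintptr_t) p;
  uintptr_t lo = (uintptr_t) toy_heap;
  uintptr_t hi = (uintptr_t) (toy_heap + TOY_HEAP_SIZE);
  return a >= lo && a < hi;
}

/* Stub: a real generational gc would mark from the roots and from
   checkobj; here the recorded objects are simply promoted. */
static void toy_gen_gc_stub(void)
{
  size_t i;

  for (i = 0; i < checkobj_idx; i++)
    checkobj[i]->gen = 1;
  checkobj_idx = 0;
  toy_gc_count++;
}

toy_obj *toy_gc_set(toy_obj **place, toy_obj *val)
{
  *place = val;

  /* Stack and static places are root areas: nothing more to do. */
  if (!in_toy_heap(place))
    return val;

  /* Only generation 0 heap objects need to be recorded. */
  if (val == NULL || !in_toy_heap(val) || val->gen != 0)
    return val;

  val->gen = -1;                    /* prevents a duplicate entry */
  checkobj[checkobj_idx++] = val;

  if (checkobj_idx == CHECKOBJ_SIZE)
    toy_gen_gc_stub();              /* full table forces a gc cycle */

  return val;
}

#ifdef TOY_GEN_GC
#define TOY_SET(place, val) (toy_gc_set(&(place), (val)))
#else
#define TOY_SET(place, val) ((place) = (val))
#endif
```

With this in place, a store such as TOY_SET(toy_heap[0].slot, &toy_heap[1])
records toy_heap[1] in checkobj once, while the same store into a local
variable records nothing.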


4. Debugging

4.1. Using gdb

Debugging txr is mostly easy thanks to the dynamic types. The function d()
is provided which makes it easy to print an object.

Most of the Lisp-like functions in txr can be invoked from the debugger.
You can construct objects, inspect values with complex expressions etc.

If the problem you're debugging can be reproduced in an unoptimized build,
then use that. It's much better because values are not optimized out.
Simply run

 ./configure opt_flags=

then "make clean" and "make".

If the program catches an exception and terminates cleanly, then
place a breakpoint on the function "uw_throw" to catch this in the debugger.

Sample debug session:

      $ gdb ./txr
      GNU gdb (GDB) Fedora (6.8.50.20090302-23.fc11)
      Copyright (C) 2009 Free Software Foundation, Inc.
      License GPLv3+: GNU GPL version 3 or later
      <http://gnu.org/licenses/gpl.html>
      This is free software: you are free to change and redistribute it.
      There is NO WARRANTY, to the extent permitted by law.  Type "show
      copying" and "show warranty" for details.
      This GDB was configured as "i586-redhat-linux-gnu".
      For bug reporting instructions, please see:
      <http://www.gnu.org/software/gdb/bugs/>...
      (gdb) b match_line
      Breakpoint 1 at 0x80503a2: file match.c, line 295.
      (gdb) r -c '@a' -
      Starting program: /home/kaz/txr/txr -c '@a' -
      hello

      Breakpoint 1, match_line (bindings=0x0, specline=0xb7fd163c, 
          dataline=0xb7fd15bc, pos=0x1, spec_lineno=0x5, data_lineno=0x5, 
          file=0xb7fd15fc) at match.c:295
      295         if (specline == nil)
      (gdb) p d(specline)
      ((sys:var a))
      $1 = void
      (gdb) p d(car(specline))
      (sys:var a)
      $2 = void
      (gdb) p d(dataline)
      "hello"
      $3 = void
      (gdb) n
      298         elem = first(specline);
      (gdb) n
      300         switch (elem ? type(elem) : 0) {
      (gdb) p d(elem)
      (sys:var a)
      $4 = void
      (gdb) n
      303             val directive = first(elem);
      (gdb) n
      305             if (directive == var_s) {
      (gdb) n
      306               val sym = second(elem);
      (gdb) n
      307               val pat = third(elem);
      (gdb) p d(sym)
      a
      $5 = void
      (gdb) n
      308               val modifier = fourth(elem);
      (gdb) n
      309               val pair = assoc(bindings, sym); /* var exists alr...
      */
      (gdb) p d(bindings)
      nil
      $6 = void
      (gdb) n
      311               if (gt(length(modifier), one)) {
      (gdb) p d(length(modifier))
      0
      $7 = void
      (gdb) p d(one)
      No symbol "one" in current context.
      (gdb) n
      316               modifier = car(modifier);
      (gdb) n
      318               if (pair) {
      (gdb) n
      349               } else if (consp(modifier)) { /* regex variable */
      (gdb) n
      363               } else if (nump(modifier)) { /* fixed field */
      (gdb) n
      378               } else if (modifier) {
      (gdb) n
      381               } else if (pat == nil) { /* no modifier, no elem 
      (gdb) n
      382                 bindings = acons_new(bindings, sym, sub_str(data...
      (gdb) n
      383                 pos = length_str(dataline);
      (gdb) p d(bindings)
      ((a . "hello"))
      $8 = void
      (gdb) n
      628           break;
      (gdb) p d(pos)
      5
      $9 = void
      (gdb) n
      646         specline = cdr(specline);
      (gdb) n
      647       }
      (gdb) n

      Breakpoint 1, match_line (bindings=0xb7fd154c, specline=0x0, 
          dataline=0xb7fd15bc, pos=0x15, spec_lineno=0x5, data_lineno=0x5, 
          file=0xb7fd15fc) at match.c:295
      295         if (specline == nil)
      (gdb) n
      649       return cons(bindings, pos);
      (gdb) n
      650     }
      (gdb) n
      match_files (spec=0xb7fd161c, files=0xb7fd15dc, bindings=0x0, 
          first_file_parsed=0xb7feaebc, data_linenum=0x0) at match.c:1995
      1995          if (nump(success) && c_num(success) < c_num(length_st ...
      (gdb) quit


4.2. Debugging the Yacc-generated Parser

To debug the parser, which should be rare, you have to edit the makefiles
(config.make is a good place) to pass the -t option to yacc to build an
instrumented parser. To force a regeneration of the parser, remove y.tab.c and
run make.  To see the debug trace, you must also set the yydebug variable.
Instead of modifying the program, another way is to just set a breakpoint on
main in gdb and do a "set yydebug=1".

The file y.output is useful; it summarizes the LALR(1) state machine generated
by yacc for the parser.


4.3. Debugging GC Issues

Use the --gc-debug option of txr to run it in a mode in which it eagerly
reclaims garbage after nearly every operation. This slows it down, but makes it
more likely to catch invalid uses of garbage. It works even better with
Valgrind integration.

There are other GC issues that are hard to catch, like spurious retention.
This is when the code generated by the C compiler hangs on to an object
which, in the source code semantics, should be garbage. It can happen,
for example, when a variable has gone out of scope, but the stack location
where that variable was last stored has not been overwritten. Register-save
areas in the stack frame can similarly contain stale data, because when a
register value is restored from the save area, the copy remains there.

Spurious retention can also happen if a bit pattern is generated which looks
like a reference to an object, by chance. We share this problem with
garbage collectors like Boehm. Luckily, unlike Boehm, we do not have this
problem over dynamic objects, because we do not scan dynamic memory. All
dynamic objects are registered with the garbage collector and are precisely
traced. What isn't precisely traced is the call stack and machine context.


4.4. Valgrind: Your Friend

To get the most out of running txr under valgrind, build it with valgrind
support.  Of course, you have to have the valgrind development stuff installed
(so the valgrind.h header file is visible), not only the valgrind executables.
Do a

  ./configure --valgrind

then rebuild. If this is enabled, txr uses the Valgrind API to inform valgrind
about the state of allocated or unallocated areas on the garbage-collected
heap, if it is additionally run with the --vg-debug option. Valgrind will be
able to trap uses of objects which are marked as garbage. Using --gc-debug
together with --vg-debug while running txr under valgrind is a pretty good way
to catch gc-related errors. However, Valgrind will not precisely
identify individual heap objects. If a freed object is misused, Valgrind will
only be able to report something like this: the pointer is 536 bytes into a
large block allocated in the more function called from make_obj (i.e. a heap).
Valgrind will not give you the call trace which led to that particular
object being allocated, only the call stack which triggered the containing heap
being allocated: an irrelevant piece of information that can confuse you!