utf8: bugfix: trailing char fragment ignored.

After "years of trouble-free operation", a bug was found in the UTF-8 decoder. The bug violates the decoder's property that any sequence of bytes decodes to some kind of string, which in turn encodes back to the original bytes. When the UTF-8 data ends prematurely in the middle of a valid character, the decoder just drops that data as if it didn't exist. For instance, the two-byte sequence E6 BC should decode to "\xDCE6\xDCBC", since it is a fragment of a three-byte UTF-8 sequence. It actually decodes to the empty string.

* utf8.c (utf8_from_buf): When the buffer is exhausted, if we are not in the utf8_init state, it means we were in the middle of a UTF-8 sequence. Walk the bytes from the backtrack point to the end of the buffer and store them into the string as U+DCxx codes.

* tests/012/buf.tl: Tests added for this via buf-str, str-buf.
author: Kaz Kylheku <kaz@kylheku.com> 2022-05-20 22:11:06 -0700
committer: Kaz Kylheku <kaz@kylheku.com> 2022-05-20 22:11:06 -0700
commit: 378318ef010dfb15045dfadf242231793d1434de (patch)
tree: bc8d0d4e8d4f2674b4065f18ddc48af1f31e23d6
parent: 0cff857c70c0f770259066d29a720f4404770558 (diff)
download: txr-378318ef010dfb15045dfadf242231793d1434de.tar.gz
txr-378318ef010dfb15045dfadf242231793d1434de.tar.bz2
txr-378318ef010dfb15045dfadf242231793d1434de.zip
2 files changed, 16 insertions, 0 deletions
diff --git a/tests/012/buf.tl b/tests/012/buf.tl
index 1c8040d6..8f494264 100644
--- a/tests/012/buf.tl
+++ b/tests/012/buf.tl
@@ -2,3 +2,9 @@
 
 (vtest (uint-buf (make-buf 8 255 16)) (pred (expt 2 64)))
 (test (int-buf (make-buf 8 255 16)) -1)
+
+(mtest
+  (str-buf #b'E6BC') "\xDCE6\xDCBC"
+  (buf-str "\xDCE6\xDCBC") #b'E6BC'
+  (str-buf #b'E6') "\xDCE6"
+  (buf-str "\xDCE6") #b'E6')
diff --git a/utf8.c b/utf8.c
index e1e696fc..9ec6aed9 100644
--- a/utf8.c
+++ b/utf8.c
@@ -138,8 +138,18 @@ size_t utf8_from_buf(wchar_t *wdst, const unsigned char *src, size_t nbytes)
     }
   }
 
+  if (state != utf8_init) {
+    while (backtrack != src) {
+      if (wdst)
+        *wdst++ = 0xDC00 | *backtrack;
+      nchar++;
+      backtrack++;
+    }
+  }
+
   if (wdst)
     *wdst++ = 0;
+
   return nchar;
 }
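
The behavior described in the commit message can be illustrated with a small standalone sketch. This is not the code in utf8.c: the names sketch_utf8_decode and tail_count are hypothetical, the output type is uint32_t rather than wchar_t, and instead of a state machine with a backtrack pointer it simply retries byte by byte. It also makes no attempt to reject overlong forms or encoded surrogates. It only shows the idea that a byte which cannot be decoded, including a truncated trailing fragment, is stored as U+DCxx rather than dropped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of continuation bytes that a lead byte
   announces, or -1 if the byte cannot begin a UTF-8 sequence. */
static int tail_count(unsigned char b)
{
  if (b < 0x80) return 0;
  if ((b & 0xE0) == 0xC0) return 1;
  if ((b & 0xF0) == 0xE0) return 2;
  if ((b & 0xF8) == 0xF0) return 3;
  return -1;
}

/* Sketch only. Decodes nbytes of src into dst (if dst is non-null) and
   returns the number of code points including the terminating zero.
   Any byte that does not begin a complete sequence is emitted as
   0xDC00 | byte, so a fragment at the end of the buffer is preserved. */
size_t sketch_utf8_decode(uint32_t *dst, const unsigned char *src,
                          size_t nbytes)
{
  size_t nchar = 1, pos = 0;

  assert (src != NULL || nbytes == 0);

  while (pos < nbytes) {
    unsigned char lead = src[pos];
    int tail = tail_count(lead);
    /* The sequence is usable only if all of its bytes are in the buffer. */
    int i, ok = (tail >= 0 && nbytes - pos > (size_t) tail);
    uint32_t cp = 0;

    if (ok) {
      cp = (tail == 0) ? lead : (uint32_t) (lead & (0x3F >> tail));
      for (i = 1; ok && i <= tail; i++) {
        unsigned char c = src[pos + i];
        ok = ((c & 0xC0) == 0x80);
        cp = (cp << 6) | (c & 0x3F);
      }
    }

    if (!ok) {
      /* Invalid or truncated: escape this one byte and move on. */
      cp = 0xDC00 | lead;
      tail = 0;
    }

    if (dst)
      *dst++ = cp;
    nchar++;
    pos += (size_t) tail + 1;
  }

  if (dst)
    *dst = 0;

  return nchar;
}
```

Under these assumptions, decoding the two bytes E6 BC stores 0xDCE6 and 0xDCBC followed by the terminator, which mirrors the (str-buf #b'E6BC') test case, while the complete sequence E6 BC A2 decodes to the single code point U+6F22.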