Re: Extract data from file

Author: Kaz Kylheku
Date:  
To: Roger Mason
CC: txr-users
Subject: Re: Extract data from file
On 2023-04-25 04:58, Roger Mason wrote:
> Hello,
>
> I have files like this (Si.in):
>
>  'Si'                                       : spsymb
>  'silicon'                                  : spname
>   -14.0000                                  : spzn
>    51196.73454                              : spmass
>   0.534522E-06    2.2000   47.8169   400    : rminsp, rmt, rmaxsp, nrmt
>    7                                        : nstsp
>    1   0   1   2.00000    T                 : nsp, lsp, ksp, occsp, spcore
>    2   0   1   2.00000    T
>    2   1   1   2.00000    T
>    2   1   2   4.00000    T
>    3   0   1   2.00000    F
>    3   1   1   1.00000    F
>    3   1   2   1.00000    F
>    1                                        : apword
>     0.1500   0  F                           : apwe0, apwdm, apwve
>    0                                        : nlx
>    2                                        : nlorb
>    0   2                                    : lorbl, lorbord
>     0.1500   0  F                           : lorbe0, lorbdm, lorbve
>     0.1500   1  F
>    1   2                                    : lorbl, lorbord
>     0.1500   0  F                           : lorbe0, lorbdm, lorbve
>     0.1500   1  F


I have the impression that the indentation of the data indicates
a nesting level, so that there is a hierarchy.

A general approach to parsing the whole file is possible, along these lines.

We define a simple data structure to represent a frame.

- A frame consists of headings, rows and children.

- The headings are a list of strings like ("lorbl" "lorbord").

- The rows are a vector of lists of items, which we can tokenize
into strings and floating-point numbers (or possibly more finely: we
could have T and F become the Lisp objects t and nil, or whatever).

- The children are the other frames listed below a given frame that
are indented by one more space than that frame.
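
For anyone who would rather see that structure outside of TXR Lisp, here is a rough Python sketch of it. The names Frame, tokenize_row and the booleans flag are just assumptions made for the illustration; the flag shows the "more finely" option of turning T and F into real booleans.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Frame:
    headings: list                                  # e.g. ["lorbl", "lorbord"]
    rows: list = field(default_factory=list)        # each row is a list of items
    children: list = field(default_factory=list)    # frames indented one deeper


def tokenize_row(text, booleans=False):
    """Split a data field into floats and strings; optionally map T/F to bools."""
    items = []
    for tok in re.findall(r"'.*'|[^ ]+", text):
        if booleans and tok in ("T", "F"):
            items.append(tok == "T")
            continue
        try:
            items.append(float(tok))                # numeric token
        except ValueError:
            quoted = len(tok) >= 2 and tok[0] == tok[-1] == "'"
            items.append(tok[1:-1] if quoted else tok)
    return items


# One "lorbl, lorbord" block from Si.in, with its indented child frame.
child = Frame(["lorbe0", "lorbdm", "lorbve"],
              rows=[tokenize_row("0.1500   0  F"), tokenize_row("0.1500   1  F")])
parent = Frame(["lorbl", "lorbord"], rows=[tokenize_row("0   2")], children=[child])

print(parent.rows, child.rows[1])
print(tokenize_row("3   1   2   1.00000    F", booleans=True))
```

Python's float() accepts a few spellings (such as "nan" or "inf") that may be treated differently elsewhere, so this is only an approximation of the tokenizing rule.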

According to this, I wrote a prototype program:

(defstruct frame ()
  headings
  rows
  children)

(defun tokenize-data (str)
  (let ((toks (tok #/'.*'|[^ ]+/ str)))
    (collect-each ((tok toks))
      (match-case tok
        (@(@f (tofloat)) f)
        (@(and @(starts-with "'") @(ends-with "'")) [tok 1..-1])
        (@else tok)))))


(defun table-data-read (: (stream *stdin*))
  ;; stack[level] holds the most recent frame seen at that indentation
  (let ((stack (vector 32)))
    (build
      (whilet ((line (get-line stream)))
        (let ((level (match-regex line #/ */)))
          (if (< level 32)
            (match-case line
              (`@data : @headings`
                (let ((fr (new frame
                               headings (spl ", " headings)
                               rows (vec (tokenize-data data))
                               children (vec))))
                  (set [stack level] fr)
                  (if (eql 1 level)
                    (add fr)
                    (iflet ((parent [stack (pred level)]))
                      (vec-push parent.children fr)))))
              (`@data`
                (iflet ((current [stack level]))
                  (vec-push current.rows (tokenize-data data)))))))))))


(prinl (table-data-read))

Note that this contains a hack: the root level is taken to be 1 rather
than 0, because the sample data's root node is indented by
one space. See the expression (eql 1 level).
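
One possible way around that hack, sketched here in Python rather than TXR Lisp, is to treat the indentation of the first non-blank line as the root level. This is only an illustration under that assumption: read_frames is a made-up name, frames are plain dicts, and the row items are left as raw strings to keep the focus on the indentation handling.

```python
def read_frames(lines):
    """Group indented 'data : headings' lines into nested frames (dicts)."""
    roots, stack, root_level = [], {}, None
    for line in lines:
        if not line.strip():
            continue                          # ignore blank lines
        level = len(line) - len(line.lstrip(" "))
        if root_level is None:
            root_level = level                # first data line defines the root
        data, sep, heads = line.partition(" : ")
        if sep:                               # "data : headings" starts a frame
            fr = {"headings": heads.strip().split(", "),
                  "rows": [data.split()],
                  "children": []}
            stack[level] = fr                 # latest frame at this indentation
            if level == root_level:
                roots.append(fr)
            elif level - 1 in stack:
                stack[level - 1]["children"].append(fr)
        elif level in stack:                  # bare row extends the current frame
            stack[level]["rows"].append(data.split())
    return roots


sample = [
    " 'Si'                       : spsymb",
    " 'silicon'                  : spname",
    "  -14.0000                  : spzn",
    "   7                        : nstsp",
    "   1   0   1   2.00000    T : nsp, lsp, ksp, occsp, spcore",
    "   2   0   1   2.00000    T",
]
roots = read_frames(sample)
print([fr["headings"] for fr in roots])
```

With this variant the same input parses identically whether its root lines start in column 0 or column 1.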

The program produces the following data (which I reformatted
manually).

Is this barking up the right tree?

(#S(frame headings ("spsymb")
          rows #(("Si"))
          children #())
 #S(frame headings ("spname")
          rows #(("silicon"))
          children #(#S(frame headings ("spzn")
                              rows #((-14.0))
                              children #(#S(frame headings ("spmass")
                                                  rows #((51196.73454))
                                                  children #())))
                     #S(frame headings ("rminsp" "rmt" "rmaxsp" "nrmt")
                              rows #((5.34522e-7 2.2 47.8169 400.0))
                              children #(#S(frame headings ("nstsp")
                                                  rows #((7.0))
                                                  children #())
                                         #S(frame headings ("nsp" "lsp" "ksp" "occsp" "spcore")
                                                  rows #((1.0 0.0 1.0 2.0 "T")
                                                         (2.0 0.0 1.0 2.0 "T")
                                                         (2.0 1.0 1.0 2.0 "T")
                                                         (2.0 1.0 2.0 4.0 "T")
                                                         (3.0 0.0 1.0 2.0 "F")
                                                         (3.0 1.0 1.0 1.0 "F")
                                                         (3.0 1.0 2.0 1.0 "F"))
                                                  children #())
                                         #S(frame headings ("apword")
                                                  rows #((1.0))
                                                  children
                                                  #(#S(frame headings ("apwe0" "apwdm" "apwve")
                                                             rows #((0.15 0.0 "F"))
                                                             children #())))
                                         #S(frame headings ("nlx")
                                                  rows #((0.0))
                                                  children #())
                                         #S(frame headings ("nlorb")
                                                  rows #((2.0))
                                                  children #())
                                         #S(frame headings ("lorbl" "lorbord")
                                                  rows #((0.0 2.0))
                                                  children #(#S(frame headings ("lorbe0" "lorbdm" "lorbve")
                                                                      rows #((0.15 0.0 "F")
                                                                             (0.15 1.0 "F"))
                                                                      children #())))
                                         #S(frame headings ("lorbl" "lorbord")
                                                  rows #((1.0 2.0))
                                                  children #(#S(frame headings ("lorbe0" "lorbdm" "lorbve")
                                                                      rows #((0.15 0.0 "F")
                                                                             (0.15 1.0 "F"))
                                                                      children #()))))))))