Re: Extract data from file

Author: Kaz Kylheku
Date:
To: Roger Mason
CC: txr-users
Subject: Re: Extract data from file

On 2023-04-25 04:58, Roger Mason wrote:
As for that:
> (Si_states.txt):
> nsp, lsp, ksp, occsp, spcore
> 1 0 1 2.00000 T
> 2 0 1 2.00000 T
> 2 1 1 2.00000 T
> 2 1 2 4.00000 T
> 3 0 1 2.00000 F
> 3 1 1 1.00000 F
> 3 1 2 1.00000 F

There is more than one way to skin the cat. The main
inconvenience in the data format is that the first
row of data precedes the data header, and then the
other rows follow.

Let's say we don't care about gathering all the
data into lists: we can just handle the first
row outside of the collect, and then collect the
following lines with a zero gap, until the next
colon-containing data record.

'@elem' : spsymb
@(skip)
@nsp0 @lsp0 @ksp0 @occsp0 @spcore0 : nsp, lsp, ksp, occsp, spcore
@(collect :gap 0)
@nsp @lsp @ksp @occsp @spcore
@(until)
@nil : @nil
@(end)
@(output)
nsp, lsp, ksp, occsp, spcore
@nsp0 @lsp0 @ksp0 @occsp0 @spcore0
@ (repeat)
@nsp @lsp @ksp @occsp @spcore
@ (end)
@(end)

I matched the spsymb line, so this could all be wrapped into some
larger repeat or collect or whatever to extract for multiple
elements. Maybe then the output would include the element symbol.

Now one way to put the first batch of data into the
list is with @(merge), which we have to use for each
variable:

'@elem' : spsymb
@(skip)
@nsp0 @lsp0 @ksp0 @occsp0 @spcore0 : nsp, lsp, ksp, occsp, spcore
@(collect :gap 0)
@nsp @lsp @ksp @occsp @spcore
@(until)
@nil : @nil
@(end)
@(merge nsp nsp0 nsp)
@(merge lsp lsp0 lsp)
@(merge ksp ksp0 ksp)
@(merge occsp occsp0 occsp)
@(merge spcore spcore0 spcore)
@(output)
nsp, lsp, ksp, occsp, spcore
@ (repeat)
@nsp @lsp @ksp @occsp @spcore
@ (end)
@(end)

Now suppose we don't like how we are special-casing
the first row of the data and would like to collect
all of it, without any merging afterward. That can
get complicated:

 '@elem' : spsymb
@(skip)
@(all)
 @nil : nsp, lsp, ksp, occsp, spcore
@(and)
@  (collect :gap 0)
@    (some)
 @nsp @lsp @ksp @occsp @spcore : @nil
@    (or)
 @nsp @lsp @ksp @occsp @spcore
@    (end)
@  (until)
 @nil : @keys
@    (require (not (contains "nsp" keys)))
@  (end)
@(end)
@(output)
nsp, lsp, ksp, occsp, spcore
@  (repeat)
 @nsp @lsp @ksp @occsp @spcore
@  (end)
@(end)

We now have a parallel match using @(all).
When the header line matches, then in parallel,
we kick off a @(collect) at that location in the
data. The body of our collect has a @(some)
to handle the two cases. The case with the colon
and field names is handled first; we fall back on
the case when there are no field names.

The termination becomes complicated. We can't
just collect until we get a @nil : @nil
like we did before. The reason is that this
pattern matches the starting line of our
data! We have to add a constraint:

@  (until)
 @nil : @keys
@    (require (not (contains "nsp" keys)))
@  (end)

I.e. our keys, the ones with "nsp" in them,
are not the terminator. Some colon line without
nsp has to terminate the batch.
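
To spell that rule out away from the TXR syntax,
here is a rough sketch in Python. It is only an
illustration: the function name is made up, and it
assumes the separator always appears as " : " with
a space on each side.

```python
def is_terminator(line):
    """Return True if this line should end the nsp batch.

    A line ends the batch when it has a " : " separator and the
    key text after the separator does not mention "nsp".
    (Hypothetical helper; it mirrors the @(until) clause above.)
    """
    _, sep, keys = line.partition(" : ")
    return bool(sep) and "nsp" not in keys
```

A plain data row has no separator, so it never
terminates; the header row has one but is let
through because its keys contain "nsp".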

If you make assumptions about what records
are present and in what order, you
can simplify things. If you can rely on apword
following the nsp stuff, the termination test
could be:

@ (until)
@nil : apword
@ (end)

Likewise, if we take advantage of the whole
nsp section following an nstsp line, that
also simplifies things.

The data exhibits a regular structure so that other techniques
are applicable. Like this TXR Lisp program:

(awk
  (#/: nsp/                     (match `@nil : @keys` rec
                                  (put-line keys)))
  ((rng- #/: nsp/ #/: apword/)  (prn "" [f 0..5]) nil))
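
For comparison, the same range idea can be sketched
in Python. This is not a translation of the awk
macro, just the logic as I read it: start at the
": nsp" line, stop before the ": apword" line, and
keep the first five fields of each row. The function
name and the sample lines are made up for
illustration; the sample only imitates the layout
discussed in this thread.

```python
def extract_states(lines):
    """Collect the key header, then the first five fields of each row.

    Assumes the block starts on the line containing ": nsp" and ends
    just before the line containing ": apword" (hypothetical helper).
    """
    out = []
    in_range = False
    for line in lines:
        if in_range and ": apword" in line:
            in_range = False          # end of range, line itself excluded
        elif not in_range and ": nsp" in line:
            in_range = True           # start of range, line itself included
            out.append(line.split(":", 1)[1].strip())
        if in_range:
            out.append(" ".join(line.split()[:5]))
    return out


# Made-up sample imitating the layout discussed in this thread.
sample = [
    "'Si' : spsymb",
    "7 : nstsp",
    "1 0 1 2.00000 T : nsp, lsp, ksp, occsp, spcore",
    "2 0 1 2.00000 T",
    "3 1 2 1.00000 F",
    "1 : apword",
]

for row in extract_states(sample):
    print(row)
```

As in the awk version, the header row is used twice:
once for its key names and once for its five data
fields.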

Cheers ..
