Extracting data from files

I have a bunch of files that contain some data I would like to extract. Somewhere within each file is the string “HEADER”, followed by 15 bytes of junk data that can be anything, followed by 4 bytes that represent the length of the string I want to extract (a binary value, LSB first, not ASCII text), followed by the string itself. I want to create a single file by appending [path & name of file containing extracted string] [tab] [extracted string] [newline] for each input file.

I was thinking the script would just work on a single file, and I could use “find … -exec” to pass each file to the script.

Does anyone have any ideas on how to do this with a script calling standard utilities?

How big are the files? If they’re not infeasibly big (say, less than 10% of your RAM), you could read each one into memory in one go and use regex matching to grab the string and what follows. This is easily done in Perl, but other languages can do it too.

Python will do the job as well. The str type has some convenient methods for this. Look at 6. Built-in Types — Python v2.6.4 documentation (at 6.6.1).

And of course you can use regular expressions to extract what you want: 8.2. re — Regular expression operations — Python v2.6.4 documentation
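
A minimal sketch of that in Python, assuming the layout from the question (a literal “HEADER” marker, 15 junk bytes, then a 4-byte LSB-first length); the junk-byte count is a parameter, since the exact figure is revised later in the thread, and the function name is just for illustration:

```python
import struct

def extract_after_header(data, junk_len=15):
    """Find the HEADER marker, skip the junk bytes, read the 4-byte
    LSB-first length, and return the string that follows."""
    i = data.find(b"HEADER")
    if i < 0:
        return None
    pos = i + len(b"HEADER") + junk_len
    # "<I" = little-endian (LSB-first) unsigned 32-bit integer
    (n,) = struct.unpack("<I", data[pos:pos + 4])
    return data[pos + 4:pos + 4 + n]
```

Read the whole file with `open(path, "rb").read()` and pass the bytes in; no line-based processing is needed.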

The files range from 12 to 200 kilobytes. I don’t know anything about Perl or Python. I could probably write something in C, but I haven’t compiled any C for Linux yet. I was hoping to get it done with a bash script.

In the past, I’ve used xxd to translate binary files to hexadecimal represented in ASCII to get around limitations with line-based programs, then inserted tokens like “XX” before key strings with tr, then used csplit to break the files up at every occurrence of “XX.” The main part I can’t figure out is how to grab a variable-length string based on the value of bytes in the file. Maybe I could use bc to translate the hexadecimal to a decimal value and use that to specify how many bytes to grab.
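
The bc step specifically, turning the hex digits from an xxd dump into a decimal byte count, can also be done in Python (suggested upthread) with int() in base 16; the sample value here is hypothetical:

```python
# What bc (with ibase=16) would do to the length byte as printed by xxd:
hex_byte = "C8"            # hypothetical hex digits from an xxd dump
count = int(hex_byte, 16)  # decimal value: how many bytes to grab
assert count == 200
```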

Though the length is represented by four bytes, none of the strings are more than 255 bytes, so I could ignore three of the bytes and only read the least significant byte.
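
A quick check of that simplification in Python: for a little-endian length under 256, the first byte carries the whole value, so reading only the LSB is safe.

```python
import struct

# A 4-byte LSB-first length field for the value 200:
field = struct.pack("<I", 200)
assert field == b"\xc8\x00\x00\x00"
# Since no string exceeds 255 bytes, the least significant (first)
# byte alone is the whole length:
assert field[0] == 200   # Python 3: indexing bytes yields an int
```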

On 2010-01-19, kylefaucett <kylefaucett@no-mx.forums.opensuse.org> wrote:
>
> I have a bunch of files that contain some data I would like to extract.
> Somewhere within each file is the string “HEADER,” followed by 15 bytes
> of junk data that can be anything, followed by 4 bytes that represent
> the length of the string I want to extract (represented in binary with
> LSB first, not in ASCII text), followed by the string. I want to create
> a single file by appending [path & name of file containing extracted
> string] [tab] [extracted string] [newline] for each input file.
>
> I was thinking the script would just work on a single file, and I could
> use “find…-exec” to pass each file to the script.
>
> Does anyone have any ideas on how to do this with a script calling
> standard utilities?

Does each file contain this HEADER string only once?
Can’t you put a copy of such a file somewhere?


Any time things appear to be going better, you have overlooked
something.

The header will occur only once per file.

A sample file is available here (the password is empty):

ftp://OpenSuseForum:@66.224.206.75/jobentry3.pdj

I have a parent directory containing directories named “10000,” “10001,” “10002,” etc. Inside each of these is a file named “jobentry3.pdj,” along with other, irrelevant files. In the sample, the header is at offsets 10B-113 (a null-terminated string); it can be anywhere in the file, without alignment to any boundary. Offsets 114-121 are 14 bytes (I said 15 originally) of junk data; it’s all zeroes here, but usually is not. Offsets 122-125 are the length of the string; the last three bytes should always be zero, because the string should never be more than 255 bytes. Offsets 126-139 are the variable-length string I want (here it’s “ThisIsTheOrderNumber”).

Offsets are zero based and inclusive.

Assuming this file is in the directory “12345,” I would like to create a script that processes this file (either with the full path passed to it or executed within the directory) and appends a log file with the following line:

12345[tab]ThisIsTheOrderNumber[newline]
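
Putting that spec together, here is a rough sketch in Python (also suggested upthread). The skip of 9 + 14 bytes (9-byte null-terminated header string plus 14 junk bytes) is taken from the sample layout above and may need adjusting, and the log file name is an assumption:

```python
import os
import struct
import sys

def extract_order_number(data):
    # Locate the "HEADER" marker. Per the sample layout (offsets 10B-139):
    # a 9-byte null-terminated header string, then 14 junk bytes, then a
    # 4-byte LSB-first length, then the string itself.
    i = data.find(b"HEADER")
    if i < 0:
        return None
    pos = i + 9 + 14                     # skip header string + junk bytes
    (n,) = struct.unpack("<I", data[pos:pos + 4])
    return data[pos + 4:pos + 4 + n].decode("ascii")

def append_log(path, log):
    # Emit "<containing directory name><TAB><extracted string><newline>",
    # e.g. ".../12345/jobentry3.pdj" -> "12345\tThisIsTheOrderNumber\n".
    with open(path, "rb") as f:
        s = extract_order_number(f.read())
    if s is not None:
        dirname = os.path.basename(os.path.dirname(os.path.abspath(path)))
        log.write("%s\t%s\n" % (dirname, s))

if __name__ == "__main__":
    for path in sys.argv[1:]:
        with open("log.txt", "a") as log:   # log file name is an assumption
            append_log(path, log)
```

It could be driven by find as originally planned, e.g. `find . -name jobentry3.pdj -exec python extract.py {} +` (script name hypothetical), appending one line per file to log.txt.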

It would be trivial in Perl, about 10 lines. Nice chance to learn.

You would match the HEADER and the string with a regular expression, using subexpressions to capture the string-length bytes and the next 255 bytes into separate variables. Then you would use unpack to decode the length and decide how much of those 255 bytes to write out.
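
The Perl itself is left as the exercise, but the same regex-plus-unpack idea can be sketched in Python, with struct.unpack("<I", ...) playing the role of Perl's unpack("V", ...); the 17-byte gap after the literal HEADER (rest of the null-terminated header string plus the junk bytes) is inferred from the sample layout and is an assumption:

```python
import re
import struct

def extract(data):
    # One regex pass: skip 17 bytes after the literal marker, then capture
    # the 4 length bytes and up to 255 bytes of payload as subexpressions.
    # DOTALL lets "." match any byte, including newlines.
    m = re.search(rb"HEADER.{17}(.{4})(.{0,255})", data, re.DOTALL)
    if m is None:
        return None
    (n,) = struct.unpack("<I", m.group(1))   # decode the LSB-first length
    return m.group(2)[:n]                    # keep only n of the 255 bytes
```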