Paperless filing

Hello there,
I was looking for a solution to store my utility bills, bank statements, etc. on my hard drive in a virtual file cabinet or database. Anybody have any ideas? I was looking at gscan2pdf, but I would like to just scan things, maybe apply a filter for certain words such as company names, and have it automate the filing.
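For illustration, the "filter for certain words" idea could look roughly like the Python sketch below. It assumes the text of each scan is already available (for example from an OCR step), and the folder names and keywords in `RULES` are made-up placeholders, not real company names.

```python
from typing import Optional

# Hypothetical keyword -> folder rules; replace with your own issuers.
RULES = {
    "Gas": ["acme gas", "gas bill"],
    "Electric": ["electric", "kwh"],
}


def pick_folder(text: str, rules: dict = RULES) -> Optional[str]:
    """Return the first folder whose keywords occur in the text, else None."""
    lowered = text.lower()
    for folder, keywords in rules.items():
        if any(word in lowered for word in keywords):
            return folder
    return None  # nothing matched: leave the scan for manual filing
```

Anything that matches no rule is better left in an inbox folder for filing by hand than guessed at.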

Just scan and convert to PDF and then add some tags for document name, date, etc and file under the appropriate directory. Nearly all the time you don’t need an OCRed version, just something you can read on the screen.

PS: I don’t know about your country, but here many utilities already send out or let you download electronic invoices and statements.

And be sure to keep backups, including offsite backups.

I was looking for a solution to store my utility bills, bank statements, etc. on my hard drive in a virtual file cabinet or database. Anybody have any ideas?

I do not have a ready-made solution (still working on it myself), but the plan is to use a combination of:

  1. A good scanner producing tiff
  2. ImageMagick (reduce images to black and white)
  3. tesseract (open source OCR)
  4. MySQL

The scanned images all go into a file system tree. A bash script then runs nightly to detect what is new, make a b/w copy of each new image, run tesseract on it to produce raw text, and load that text into a MySQL database with a fulltext index on the text and a pointer to the location of the image file.
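As a rough sketch of that nightly job (written in Python here rather than bash, purely for illustration): the function names are invented, the set of already-indexed paths is assumed to come from the database, and it assumes the ImageMagick `convert` and `tesseract` commands are on the PATH.

```python
import subprocess
from pathlib import Path

TIFF_SUFFIXES = {".tif", ".tiff"}


def find_new_scans(root, already_indexed):
    """Return TIFF files under root whose paths are not yet indexed, sorted."""
    found = [
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in TIFF_SUFFIXES
    ]
    return sorted(p for p in found if str(p) not in already_indexed)


def ocr_commands(image, work_dir):
    """Build the two command lines for one scan; nothing is executed here."""
    image = Path(image)
    bw_copy = Path(work_dir) / (image.stem + "_bw.tif")
    text_base = Path(work_dir) / image.stem  # tesseract appends ".txt" itself
    return [
        ["convert", str(image), "-monochrome", str(bw_copy)],
        ["tesseract", str(bw_copy), str(text_base)],
    ]


def ocr_scan(image, work_dir):
    """Run both steps for one scan and return the recognised text."""
    for argv in ocr_commands(image, work_dir):
        subprocess.run(argv, check=True)
    text_file = Path(work_dir) / (Path(image).stem + ".txt")
    return text_file.read_text(encoding="utf-8", errors="replace")
```

Keep `work_dir` outside the scan tree, otherwise the b/w copies would be picked up as new scans on the next run.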

The (desired) result: you can do a fulltext search within all documents and display the scanned image for viewing.
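A minimal sketch of the database side might look like this. The table and column names are invented for the example, the `%s` placeholders assume one of the usual Python MySQL drivers, and `cursor` is any DB-API cursor from such a driver. MyISAM is named explicitly because InnoDB only gained FULLTEXT indexes in MySQL 5.6.

```python
# Hypothetical schema: one row per scanned page or document.
CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS scans (
    id         INT AUTO_INCREMENT PRIMARY KEY,
    image_path VARCHAR(1024) NOT NULL,
    ocr_text   MEDIUMTEXT NOT NULL,
    FULLTEXT KEY ft_ocr_text (ocr_text)
) ENGINE=MyISAM
"""

INSERT_SQL = "INSERT INTO scans (image_path, ocr_text) VALUES (%s, %s)"

SEARCH_SQL = (
    "SELECT image_path FROM scans "
    "WHERE MATCH(ocr_text) AGAINST (%s IN NATURAL LANGUAGE MODE)"
)


def add_scan(cursor, image_path, ocr_text):
    """Store the OCR text together with a pointer to the image file."""
    cursor.execute(INSERT_SQL, (image_path, ocr_text))


def search_scans(cursor, words):
    """Return the image paths of all scans whose text matches the words."""
    cursor.execute(SEARCH_SQL, (words,))
    return [row[0] for row in cursor.fetchall()]
```

The paths returned by `search_scans` can then be handed to whatever image viewer you prefer.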

Note: the paperless office is an illusion, just like the paperless toilet :sarcastic:

Could you expand more on the file system tree? Would it just be a collection of folders and subfolders created by me, like this?
/home/File cabinet
  Gas
    10/10/10.pdf
  Electric
    10/10/10.pdf

Well you shouldn’t use 10/10/10.pdf as the / is a directory separator. I use filenames like

2010-10-10.pdf
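If you script the filing, a small helper along these lines (names are just for illustration) builds such paths. A side benefit of year-month-day names is that a plain alphabetical listing is also chronological.

```python
from datetime import date
from pathlib import Path


def filed_path(cabinet, category, bill_date):
    """Return cabinet/category/YYYY-MM-DD.pdf for a given bill date."""
    return Path(cabinet) / category / (bill_date.isoformat() + ".pdf")


# Example: the Gas bill dated 10 October 2010.
example = filed_path("/home/me/cabinet", "Gas", date(2010, 10, 10))
```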

On 2010-11-06 23:06, ken yap wrote:
>
> Well you shouldn’t use 10/10/10.pdf as the / is a directory separator. I
> use filenames like

He might be indicating directories, precisely.

> 2010-10-10.pdf

By the way, if the original invoice comes as a file, keep that file intact.
PDFs can have legal value if they are cryptographically signed.

Otherwise, if you are scanning paper, a better format than PDF is DjVu.

But remember that neither file will have legal value; only the original
paper has it.


Cheers / Saludos,

Carlos E. R.
(from 11.2 x86_64 “Emerald” at Telcontar)

I assumed he wasn’t because I think that level of granularity is ridiculous. I mean how many gas bills do you get in a month to warrant a directory for each month and a file for each bill? Maybe if you are keeping records for a city, but then you would use a database for that, not a filesystem.

Actually, around here you don’t even get paper any more. A downloaded PDF is all you get, and AFAICT they are not signed. They are acceptable for taxation reporting because there is a trail back to the issuing entity. Paper is not free from fraud either. Who’s to say you didn’t make up some legit-looking paper stock and print a fake invoice on it? So whether the invoice is physical or electronic, if they do an audit you get caught and there will be hell to pay. Presumably, like random checking of public transport tickets, in theory the penalties are set to be roughly X times the cheat, where 1/X is the chance of getting caught.

On 2010-11-07 13:06, ken yap wrote:

> I assumed he wasn’t because I think that level of granularity is
> ridiculous. I mean how many gas bills do you get in a month to warrant a
> directory for each month and a file for each bill?

It could be client invoices, if he works for some kind of business. Not
only utility invoices.

> Maybe if you are
> keeping records for a city, but then you would use a database for that,
> not a filesystem.

Actually, no. I would be using a filesystem with files, and a database of
records (text and numeric data), including where the PDF is in the
directory structure.
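To make the split concrete, here is a minimal sketch using Python's bundled sqlite3 module as a stand-in for whatever database you would really use. The table and column names are invented for the example. The database holds only the record data and the path, while the PDF itself stays an ordinary file.

```python
import sqlite3


def open_index(db_file=":memory:"):
    """Open (or create) the index database; the PDFs stay on disk."""
    conn = sqlite3.connect(db_file)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS invoices (
               id           INTEGER PRIMARY KEY,
               issuer       TEXT NOT NULL,
               issued_on    TEXT NOT NULL,      -- ISO date, e.g. 2010-10-10
               amount_cents INTEGER NOT NULL,
               pdf_path     TEXT NOT NULL UNIQUE
           )"""
    )
    return conn


def add_invoice(conn, issuer, issued_on, amount_cents, pdf_path):
    """Record one invoice and where its PDF lives in the directory tree."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO invoices (issuer, issued_on, amount_cents, pdf_path) "
            "VALUES (?, ?, ?, ?)",
            (issuer, issued_on, amount_cents, pdf_path),
        )


def pdfs_from(conn, issuer):
    """Return the PDF paths for one issuer, oldest first."""
    rows = conn.execute(
        "SELECT pdf_path FROM invoices WHERE issuer = ? ORDER BY issued_on",
        (issuer,),
    )
    return [row[0] for row in rows]
```

If the database is ever damaged, the files are still there and the index can be rebuilt from them.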

Much safer, and perhaps faster, if you have to produce the signed pdf to
send to somebody instantly. Either you keep the file, or you generate a new
file from the data.

A huge database structure can be easily damaged and becomes a nightmare at
such sizes.

By the way, that was one of the ideas behind reiserfs development: use that
filesystem as database storage, small or big records as files.


Cheers / Saludos,

Carlos E. R.
(from 11.2 x86_64 “Emerald” at Telcontar)

Well, perhaps you are right about storing PDFs in files in enterprises, and the debate about filesystems vs databases could be a thread in its own right. I still think the OP was talking about his home bills. Also, he put the filename on a line by itself. If he had been following his own convention, he should have put a separate line per directory. So I went for the more likely explanation. But you’d probably make a good disaster recovery person since you seem to think up all sorts of unlikely scenarios. :wink:

On 2010-11-08 02:36, ken yap wrote:

> But you’d probably make a good
> disaster recovery person since you seem to think up all sorts of
> unlikely scenarios. :wink:

Ha, I think my friends call me pessimistic. It must be part of my training,
thinking of the worst-case scenario :slight_smile:


Cheers / Saludos,

Carlos E. R.
(from 11.2 x86_64 “Emerald” at Telcontar)