Best document format for reading with smallest file size.

Hi, I am Rupesh from India, and I have the habit of reading ebooks on an Android tablet. I have up to 25 GB of PDF books.

I downloaded the PDF books legally and with the permission of the site owner; if you don't trust me, I am ready to provide its address.

Previously I downloaded a 7 MB DjVu file from a website. At that time I couldn't find any reader to open it, so I converted it to PDF, and surprisingly the generated PDF file was 300 MB.

From this, anyone can see that there are document formats with much smaller file sizes than PDF.

By converting to such a format, I think the total size of my files could be reduced to 5 to 6 GB.

The PDF files I have actually consist of scanned images from textbooks. I think DjVu is the best format for storing scanned images at the smallest file size.

If you know of any other document format with a smaller file size, please suggest it, along with software that converts PDF to it. If you think DjVu is the best, please suggest a converter that can batch-convert PDF to DjVu.
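For the batch conversion asked about here, one possible approach is a small wrapper around the `pdf2djvu` command-line tool. This is only a sketch: it assumes `pdf2djvu` is installed and on the PATH, the function names and the 300 dpi default are made up for illustration, and how much smaller the output is will depend heavily on the scans themselves.

```python
import subprocess
from pathlib import Path


def build_commands(src_dir, dst_dir, dpi=300):
    """Return one pdf2djvu command (as an argument list) per PDF in src_dir."""
    src, dst = Path(src_dir), Path(dst_dir)
    commands = []
    for pdf in sorted(src.glob("*.pdf")):
        out = dst / (pdf.stem + ".djvu")
        # -d sets the rendering resolution, -o names the output file
        commands.append(["pdf2djvu", "-d", str(dpi), "-o", str(out), str(pdf)])
    return commands


def convert_all(src_dir, dst_dir, dpi=300):
    """Run the conversions one at a time; raises on the first failure."""
    Path(dst_dir).mkdir(parents=True, exist_ok=True)
    for cmd in build_commands(src_dir, dst_dir, dpi):
        subprocess.run(cmd, check=True)
```

Calling `convert_all("books_pdf", "books_djvu")` (hypothetical folder names) would convert every PDF in the first folder. Trying it on a handful of books first and comparing size and legibility is probably wiser than converting all 25 GB at once.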

Google really is your friend… :wink:

https://encrypted.google.com/search?hl=en&as_q=best+epub+format
https://encrypted.google.com/search?hl=en&q=epub+format+conversion

Lots of interesting material there, from which you can make your own decisions based on exactly what you want and what you've already tried.

File size is always a trade-off against something… DYOR :slight_smile:

Do you know about DjVu, which stores pages as images, much as the MPEG format stores its pictures as images?

Previously I decompressed a 5 MB MPEG file into JPEG files; the result was 6000 files with a total size of 1 GB.

I think DjVu compresses in the same way.

After getting suggestions about DjVu and deciding whether it is suitable or not, I will try EPUB.

Regarding “document formats”
You first need to define some basic requirements, primarily whether the document should be editable or not; another might be whether a special viewer application is required simply for viewing.

If editing is not an issue, then documents which display individual text characters can be considered, and then quality will depend on the quality of your font library and how well the viewer app renders the document. Because you have no consistent control over how well the document will render, this is often not the preferred choice for document creators.

If the document is not to be edited and, as a document creator, you want reasonable control over quality no matter what device the document is viewed on, then an image-based format is generally preferred, with all its ramifications.
If you know graphics, you'd know that there are two main ways to define a picture: either by each and every individual pixel (i.e. raster, or bitmap) or by vector (shapes defined by algorithms).

Rasterized images can maintain high quality across a very wide range of display sizes by simply removing pixels as needed (downsampling) and, although not desirable, by interpolating new pixels when enlarging. One major drawback of rasterized images is that if you want to maintain extraordinarily high quality for very large images, the number of mapped bits is very large, so the file is very large.
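To put rough numbers on that drawback, here is a minimal sketch that computes the uncompressed size of a raster page. The function name is hypothetical, and the letter-size page and the 150/300 dpi resolutions are just example assumptions; real files are compressed and will be smaller.

```python
def raw_raster_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of a page scanned at the given resolution."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


# A letter-size page (8.5 x 11 inches):
color_300 = raw_raster_bytes(8.5, 11, 300, 24)  # 25,245,000 bytes, 24-bit color
color_150 = raw_raster_bytes(8.5, 11, 150, 24)  # 6,311,250 bytes, 24-bit color
bw_300 = raw_raster_bytes(8.5, 11, 300, 1)      # 1,051,875 bytes, 1-bit black and white
```

Halving the resolution cuts the raw data to a quarter, and dropping from 24-bit color to 1-bit black and white cuts it by a factor of 24, which is why resolution and color depth matter so much for scanned books.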

Vector-based images are also very popular for their ability to display images with very small files compared to rasterization. By defining common and recurring shapes with an algorithm, you can also often get shapes with smoother edges through anti-aliasing at display time, which, although possible with rasterization, would require even more data. If you consider a web page as a kind of document, the text on web pages is displayed this way, drawn from vector font outlines.

A PDF can contain both vector content and raster images, but a scanned PDF like yours is essentially a stack of page images, so based on your personal experience you should be able to guess which type of image format you are dealing with… (of course, it's rasterization). The file size is enormous only because you want such fine detail <at an enormous viewing size>. If the file will be seen only on mobile devices, you should re-compile (convert) the PDF to the viewing size, e.g. 320 pixels on a 3x5 inch screen. The result would be a file that looks fine on that size of screen but lousy on a big monitor, at a reasonable file size.
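One way to try the re-compile step is Ghostscript's `pdfwrite` device with one of its quality presets. Ghostscript is not mentioned elsewhere in this thread, so treat it as an assumption; the helper name is hypothetical, and the sketch only builds the command so you can inspect it before running anything.

```python
def gs_downsample_command(src_pdf, dst_pdf, preset="/ebook"):
    """Build a Ghostscript command that rewrites a PDF with downsampled images.

    preset is one of Ghostscript's PDFSETTINGS values, e.g. "/screen"
    (lowest resolution), "/ebook" (medium) or "/printer" (high).
    """
    return [
        "gs",
        "-sDEVICE=pdfwrite",
        "-dPDFSETTINGS=" + preset,
        "-dNOPAUSE",
        "-dBATCH",
        "-dQUIET",
        "-sOutputFile=" + dst_pdf,
        src_pdf,
    ]


# To actually run it (assumes the gs binary is installed):
#   import subprocess
#   subprocess.run(gs_downsample_command("book.pdf", "book-small.pdf"), check=True)
```

Whether `/screen` is still readable for scanned text is something to check by eye on the tablet; `/ebook` is a more cautious starting point.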

You can also search for vector-based document formats and experiment with those, but as I described above you'll need a special viewer for that format, so make sure the viewer is available first.

There's plenty more beyond the generalizations I just described; you can either read up or just do a lot of experimenting, which I suspect you'd need to do anyway.

TSU

Take a look at Calibre - it can do the conversions, and you can try
several out and see what meets your needs.


Jim Henderson
openSUSE Forums Administrator
Forum Use Terms & Conditions at http://tinyurl.com/openSUSE-T-C

If you want to keep PDFs as small as possible, use only the standard fonts: Times (v3) (in regular, italic, bold, and bold italic), Courier (in regular, oblique, bold and bold oblique), Helvetica (v3) (in regular, oblique, bold and bold oblique), Symbol and Zapf Dingbats. Any other fonts need extra space because they are not supported natively and have to be embedded in the file.

Like tsu said, the right document format depends on its use.
DjVu is used for scanned documents, i.e. images, and it consists of 3 layers: a 256-level color background compressed with a wavelet algorithm (IW44), a bitonal (black and white) text layer compressed with JB2, a JBIG-style coder, and an optional hidden OCR text layer compressed with BZZ.
Converting a highly compressed DjVu to a PDF that does not have the 256-color limitation results in a huge file.
Most Linux document readers can open DjVu files (Calibre on KDE and Evince on GNOME). On Windows you can use WinDjView or SumatraPDF (both are open source) to read DjVu files. Once upon a time the company behind DjVu provided a Mozilla plugin, but since Firefox 52 no plugins except Flash are allowed.
Regarding "best document format": this is a vague statement. For scanned documents, DjVu can create the smallest file. For computer-generated documents, I'd say it depends on where the file is going to be read. If it's for a PC or tablet, a PDF file is the best choice, as PDF does support advanced compression technologies; it's just that most people don't know how to use them. If it's going to be read on a phone or a small-screen tablet, EPUB is the better choice: as a reflowable format it can be read on small screens (have you tried reading an A4 or letter formatted PDF on a 5" screen?).
If the document is meant to be edited, not just read, then stick with ODT, as it is an ISO standard and all document editors (inc. MS Office) can open and edit it.

I am specifying my needs below so that you can give appropriate suggestions:

  1. I want to read the converted files on an Android tablet with a 7 inch screen and a quad-core processor.
  2. I don't want to edit the converted files; I mean I don't want to add any text.
  3. If the source PDF files consist of scanned images, the converted files may contain the same images as the source.
  4. As I don't want to share them with anyone, it is not a problem at all if the files open properly on my tablet but don't work on other systems.
  5. My tablet has limited storage, so if the converted files are larger than my tablet's storage, I will need to buy a newer tablet with more storage than the current one.

Someone suggested that EPUB reduces size by 35 to 50 percent compared with the original PDF files. Is that true? And which is better, EPUB or DjVu?

One small question: the source PDF files contain text in a language other than English (an Indian language). Do EPUB and DjVu support such languages?

My experience is that converted files never look as good as the original, and they are smaller only if the embedded images are over-compressed and a lot of quality/information is lost.
The statement that EPUBs are 30 percent smaller than PDFs is laughable; comparing EPUB to PDF is like comparing apples to oranges. PDFs are a fixed page layout format that supports a wide variety of compression algorithms for both text and images, inc. JPEG2000 for color and JBIG2 for monotone images, and PDFs have stream-based zip (deflate) compression. EPUBs are basically HTML pages with the accompanying JPEG or PNG files in a zip, with a few extra text files that contain metadata and bookmark information. Being based on HTML, EPUB is a reflowable format, meaning that if you increase the text size the formatting will change and text will reflow to the next line. Because of this, EPUB is good for small-screen devices, where reading an A4 (letter) formatted PDF would be a hassle.
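The "HTML pages in a zip" description can be checked with Python's standard `zipfile` module. The sketch below builds a tiny EPUB-like archive in memory and lists what is inside. The file names follow common EPUB convention, but the contents are placeholders, so this is an illustration of the container only and not a valid EPUB.

```python
import io
import zipfile


def build_tiny_epub():
    """Build an illustrative EPUB-like archive in memory (not a validated EPUB)."""
    chapter = "<html><body><p>" + "Some book text. " * 200 + "</p></body></html>"
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # By convention the mimetype entry comes first and is stored uncompressed.
        zf.writestr("mimetype", "application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        zf.writestr("META-INF/container.xml", "<container/>",
                    compress_type=zipfile.ZIP_DEFLATED)
        zf.writestr("OEBPS/content.opf", "<package/>",
                    compress_type=zipfile.ZIP_DEFLATED)
        zf.writestr("OEBPS/chapter1.xhtml", chapter,
                    compress_type=zipfile.ZIP_DEFLATED)
    return buf.getvalue()


def list_entries(epub_bytes):
    """Return (name, original size, compressed size) for each archive entry."""
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as zf:
        return [(i.filename, i.file_size, i.compress_size) for i in zf.infolist()]


for name, size, packed in list_entries(build_tiny_epub()):
    print(name, size, packed)
```

Running `list_entries` on the bytes of a real `.epub` file should show the same kind of layout. Text compresses very well with deflate, but scanned page images stored as JPEG barely shrink at all, which is why an EPUB made from scans would not be expected to be much smaller than the PDF.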
As for your question: a 7" screen is sufficient for reading both A4 and letter PDFs as well as EPUB and DjVu documents. Do not convert anything; just copy your files to your SD card and read them.

There are a ton of good open-source, freeware and payware document readers for Android. For example, FBReader is a GPLed open-source reader that supports tons of formats, inc. PDF, EPUB, DjVu, FB2, CBZ … (some of the formats need extra plugins, which are also free):
https://play.google.com/store/apps/details?id=org.geometerplus.zlibrary.ui.android&hl=en
Then there's the minimalistic MuPDF, also GPLed, which can read PDF, EPUB, XPS and CBZ (no CBR):
https://play.google.com/store/apps/details?id=com.artifex.mupdfdemo&hl=en
My favorite is Moon+ Reader. It supports a lot of formats (no DjVu), but it's not open source; there are 2 versions, a free ad-sponsored one and a paid one:
https://play.google.com/store/apps/details?id=com.flyersoft.moonreader&hl=en
I use the free version, but as I have the F-Droid repo I also have AdAway installed, so my droid is ad-free:
https://f-droid.org/
https://f-droid.org/packages/org.adaway/

Thanks for the detailed reply.