DjVu (pronounced like "déjà vu") is a computer file format designed primarily to store scanned images, especially those containing text and line drawings. It uses techniques such as separating the text layer from the background and pictures, progressive loading, arithmetic coding, and lossy compression for bitonal images. This allows high-quality, readable images to be stored in very little space, so that they can be made available on the web.
Filename extension | .djvu, .djv |
---|---|
Internet media type | image/vnd.djvu |
Type code | DJVU |
Developed by | AT&T Research |
Type of format | Image file formats |
Progressive loading makes the format well suited to images served over the internet. DjVu has been promoted as an alternative to PDF and is reported to produce smaller files than PDF for most scanned documents. The DjVu developers report that color magazine pages compress to 40-70 KB, black-and-white technical papers to 15-40 KB, and ancient manuscripts to around 100 KB; all of these are significantly smaller than the typical 500 KB required for a satisfactory JPEG image. This has led to its widespread use in distributing math books on file sharing networks. Like PDF, DjVu can contain an OCR text layer, making it easy to cut and paste text.
The DjVu technology was originally developed by Yann LeCun, Léon Bottou, Patrick Haffner, and Paul G. Howard at AT&T Laboratories in 1996. DjVu is a free file format: both the file format specification and the source code of the reference library are published. The rights to commercial development of the encoding software have been transferred to different companies over the years, including AT&T and LizardTech. The original authors maintain a GPL-licensed implementation named "DjVuLibre".
DjVu divides a single image into several layers and compresses them separately. To create a DjVu file, the initial image is first separated into three images: a background image, a foreground image, and a mask image. The background and foreground images are typically lower-resolution color images (e.g., 100 dpi); the mask image is a high-resolution bilevel image (e.g., 300 dpi) and is typically where the text is stored. The background and foreground images are then compressed using a wavelet-based compression algorithm named IW44. The mask image is compressed using a method called JB2. The JB2 encoding method identifies nearly identical shapes on the page, such as multiple occurrences of a particular character in a given font, style, and size. It compresses the bitmap of each unique shape separately, and then encodes the locations where each shape appears on the page. Thus, instead of compressing a letter "e" in a given font multiple times, it compresses the letter "e" once (as a compressed bit image) and then records every place on the page where it occurs.
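The shape-dictionary idea behind JB2 can be illustrated with a small Python sketch. This is a toy model, not the JB2 algorithm itself: it matches only exactly identical bitmaps (JB2 also matches nearly identical shapes), it applies no arithmetic coding, and the names `build_shape_dictionary` and `render`, as well as the tiny 3x5 glyphs, are hypothetical choices made for illustration.

```python
# Toy illustration of a shape dictionary: each distinct glyph bitmap is
# stored once, and the page is described as a list of (shape, x, y) entries.
# Assumption: glyphs match only when their bitmaps are exactly identical.

# A glyph is a tuple of equal-length strings; '#' is ink, '.' is paper.
Glyph = tuple


def build_shape_dictionary(placed_glyphs):
    """Return (shapes, placements) for an iterable of (x, y, glyph) items.

    shapes     -- list of unique glyph bitmaps, in order of first appearance
    placements -- list of (shape_index, x, y), one entry per input glyph
    """
    shapes = []
    index_of = {}      # glyph bitmap -> index into shapes
    placements = []
    for x, y, glyph in placed_glyphs:
        if glyph not in index_of:
            index_of[glyph] = len(shapes)
            shapes.append(glyph)
        placements.append((index_of[glyph], x, y))
    return shapes, placements


def render(shapes, placements, width, height):
    """Rebuild the bilevel page from the dictionary and the placements."""
    page = [["."] * width for _ in range(height)]
    for shape_index, x, y in placements:
        for dy, row in enumerate(shapes[shape_index]):
            for dx, pixel in enumerate(row):
                if pixel == "#":
                    page[y + dy][x + dx] = "#"
    return ["".join(row) for row in page]


# Two hypothetical 3x5 glyphs standing in for scanned characters.
E = ("###", "#..", "###", "#..", "###")
L = ("#..", "#..", "#..", "#..", "###")

# The "page" contains the E shape three times and the L shape once.
placed = [(0, 0, E), (4, 0, L), (8, 0, E), (12, 0, E)]
shapes, placements = build_shape_dictionary(placed)

print(len(shapes), "unique shapes for", len(placements), "glyphs")
for line in render(shapes, placements, width=15, height=5):
    print(line)
```

In this sketch the repeated glyph costs one bitmap plus three small position records, which is the source of the savings on text-heavy pages where the same characters recur many times.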
In 2002 the DjVu file format was chosen by the Internet Archive as the format in which its Million Book Project provides scanned public-domain books online (along with TIFF and PDF).
The DjVu format will be used by the One Laptop per Child project to supply existing paper books in an e-book format. The advantages of DjVu here are that it is highly compressed and does not require any font support. [1]
Limitations of DjVu compared with PDF
Before converting PDF files to DjVu, consider the following limitations:
- The DjVu format does not specify a way to certify the authenticity of a document.
- The DjVu format cannot store formatted text. It stores text as plain text, with each text component tagged to a specific area of the page.
- Once a PDF has been converted to DjVu, it is effectively an image and a plain-text file bound together.
- The DjVu format does not support digital rights management (DRM).
- DjVu is not a general replacement for PDF. It is a highly optimized image file with a text file attached to it.
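The "plain text tagged to areas" model in the list above can be pictured with a short Python sketch. It is a simplified, hypothetical data structure and not the actual on-disk layout of a DjVu text layer; the class `TextZone`, its fields, and the helper functions are assumptions made for illustration.

```python
from dataclasses import dataclass


# Hypothetical model of a text layer: plain strings tied to page rectangles.
# Note what is absent: no font, size, style, or layout attributes.
@dataclass(frozen=True)
class TextZone:
    text: str
    x: int        # left edge of the rectangle, in page pixels
    y: int        # top edge of the rectangle, in page pixels
    width: int
    height: int


def page_text(zones):
    """Plain text of the page: the zone strings joined in reading order."""
    return " ".join(zone.text for zone in zones)


def zones_at(zones, px, py):
    """Zones whose rectangle contains the point (px, py)."""
    return [
        zone for zone in zones
        if zone.x <= px < zone.x + zone.width
        and zone.y <= py < zone.y + zone.height
    ]


zones = [
    TextZone("DjVu", x=100, y=50, width=80, height=20),
    TextZone("format", x=190, y=50, width=110, height=20),
]

print(page_text(zones))            # text available for search or copy
print(zones_at(zones, 120, 60))    # zone under a given point on the image
```

A structure like this is enough for searching and for copying text from a selected region of the scan, but it has nowhere to record typography, which is why formatted text does not survive a conversion to DjVu.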
When to choose DjVu or PDF
DjVu is, in essence, a very good image format. For storing scanned images, DjVu is superior to PDF. However, if you have OCRed a scanned document and proofread the result, and the fonts it uses are available on your system, PDF is the better choice. For example, if you OCR a mathematical expression and edit it in MS Word, keep it as PDF; in DjVu the expression would be reduced to a line of plain text and become unusable. PDF also supports font embedding. On the other hand, if the fonts used in a scanned document are not available on your system, you might not be able to OCR it. DjVu is the better fit when the image matters more than the text: for example, when you have scanned old scriptures and might transcribe the text by hand. Even so, DjVu is not a replacement for PDF: PDF is a document format, whereas DjVu is an image format with an embedded plain-text file.