The lease was six pages, mostly white space, and the email bounced anyway. Somewhere between the phone and the outbox, a document a lawyer could read aloud in ten minutes had become a forty-megabyte object too heavy for the internet's oldest delivery system. Most email providers cap attachments somewhere around 20 to 25 megabytes, which means people usually meet this problem not as a technical curiosity but as a small, timed humiliation: the landlord is waiting, and the file will not go.

The usual response is to search for a PDF compressor and squeeze. Sometimes that works. Often it produces a smeared gray ghost of the original — readable the way a photocopy of a photocopy is readable, which is to say resentfully. There is a better way to think about the problem, and it starts with understanding why the file got so big in the first place.

A Scan Is a Photograph That Doesn't Know It's a Document

A text file stores characters. The letter e costs a byte or two, and an entire novel fits in less space than a single snapshot. A scan stores none of that. It stores a picture of the page — every pixel, including the millions that depict blank paper.

At 300 dots per inch, the long-standing standard for legible text and reliable OCR, a letter-size page is 2,550 by 3,300 pixels, about 8.4 million in all. In full color, at three bytes per pixel, that is roughly 25 megabytes before any compression touches it. And most of those bytes are spent remembering things nobody asked to keep: the faint texture of the paper, the shadow of your hand, the warm cast of the kitchen light. The file cannot tell signal from noise, so it dutifully memorizes both.
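The arithmetic above is easy to check. The following is a minimal sketch, with a hypothetical helper name (raw_scan_bytes) that is not part of any scanner's API, computing the uncompressed size of a page from its dimensions, resolution, and bits per pixel.

```python
def raw_scan_bytes(width_in: float, height_in: float, dpi: int, bits_per_pixel: int) -> int:
    """Uncompressed size of a scanned page, in bytes (illustrative helper)."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes


# Letter-size page, 300 DPI, 24-bit color: 2,550 x 3,300 pixels at 3 bytes each.
letter_color = raw_scan_bytes(8.5, 11, 300, 24)
print(f"{letter_color:,} bytes")  # 25,245,000 bytes, roughly 25 megabytes
```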

This is the one idea worth carrying out of this article: file size is a memory problem. A file is small when there is little it is required to remember. Every honest technique for shrinking a scan is a way of deciding, in advance, what about the page needs to be remembered, and letting the rest go.

The Three Dials: Resolution, Color Depth, Compression

Only three settings meaningfully control a scan's size.

Resolution sets how many pixels exist at all. Halving it from 600 to 300 DPI quarters the pixel count. Below 300, text starts to lose readability and OCR accuracy, so for documents this dial has a floor.

Color depth sets how much each pixel remembers. Full color spends 24 bits per pixel. Grayscale spends 8. A pure black-and-white scan — bilevel, every pixel simply ink or not-ink — spends exactly one. That is a twenty-fourfold reduction before compression has even started, which is why the single most effective size decision is usually made at capture, not afterward.
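To see the first two dials side by side, here is a small sketch that tabulates the uncompressed size of a letter-size page at two resolutions and three color depths. The names (pixel_count, raw_megabytes) are illustrative, and the figures are raw sizes before the third dial is applied.

```python
LETTER_INCHES = (8.5, 11.0)
BITS_PER_PIXEL = {"color": 24, "grayscale": 8, "bilevel": 1}


def pixel_count(dpi: int) -> int:
    """Number of pixels on a letter-size page at the given resolution."""
    width_in, height_in = LETTER_INCHES
    return round(width_in * dpi) * round(height_in * dpi)


def raw_megabytes(dpi: int, mode: str) -> float:
    """Uncompressed page size in megabytes (10**6 bytes)."""
    return pixel_count(dpi) * BITS_PER_PIXEL[mode] / 8 / 1_000_000


for dpi in (600, 300):
    for mode in BITS_PER_PIXEL:
        print(f"{dpi} DPI  {mode:9s}  {raw_megabytes(dpi, mode):7.2f} MB")
```

Each step down in resolution divides every row of the table by four; each step down in color depth divides it by three and then by eight.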

Compression is the third dial, and the most interesting, because compression algorithms are not neutral. Each one makes a bet about what kind of image it is looking at.

JPEG Believes Every Page Is a Sunset

JPEG, the compression inside most photo-style scans, works by breaking the image into small blocks, describing each block as a blend of coarse and fine patterns, and quietly discarding the finest ones — detail the human eye barely registers in a photograph. For a sunset, this is a brilliant bet. Gradients survive; nobody misses the noise.

Text loses that bet completely. A letterform is nothing but fine detail — a hard black edge against white, exactly the kind of information JPEG is designed to throw away first. Push the quality slider down and you get the familiar wreckage: gray halos and speckle around every character, edges that look breathed-on. Worse, OCR engines depend on those clean edges, so a heavily JPEG-crushed scan is not just uglier — it reads worse, to machines and people alike.
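The bet can be made concrete with a deliberately simplified model. Real JPEG applies a two-dimensional transform to blocks of pixels and rounds the fine patterns with quantization tables; the sketch below assumes a single row of eight pixels and simply zeroes the finest patterns outright, which exaggerates the effect but shows its direction. All function names are illustrative.

```python
import math


def dct(samples):
    """Orthonormal DCT-II: describe samples as a blend of coarse-to-fine patterns."""
    n = len(samples)
    out = []
    for k in range(n):
        scale = math.sqrt((1 if k == 0 else 2) / n)
        total = sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                    for i, x in enumerate(samples))
        out.append(scale * total)
    return out


def idct(coeffs):
    """Inverse of dct(): rebuild samples from pattern weights."""
    n = len(coeffs)
    out = []
    for i in range(n):
        total = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt((1 if k == 0 else 2) / n)
            total += scale * c * math.cos(math.pi * (i + 0.5) * k / n)
        out.append(total)
    return out


def crush(samples, keep):
    """Keep only the `keep` coarsest patterns and discard the rest."""
    coeffs = dct(samples)
    return idct(coeffs[:keep] + [0.0] * (len(coeffs) - keep))


def worst_error(samples, keep):
    """Largest per-pixel difference between the original and the crushed row."""
    return max(abs(a - b) for a, b in zip(samples, crush(samples, keep)))


sunset = [0, 36, 72, 108, 144, 180, 216, 252]   # smooth gradient
letter_edge = [255, 255, 255, 255, 0, 0, 0, 0]  # white paper meeting black ink

print("gradient worst error:", round(worst_error(sunset, 3), 1))
print("edge worst error:    ", round(worst_error(letter_edge, 3), 1))
```

With the same patterns discarded, the gradient comes back nearly intact while the edge is rebuilt with pixels that are neither ink nor paper, which is the gray halo in miniature.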

The Compression That Was Built for Ink

Documents have their own native compression, inherited from the fax era. CCITT Group 4 handles bilevel images by encoding runs of white and black, and how those runs shift from one scan line to the next, rather than individual pixels. It is lossless: nothing is guessed, nothing is discarded. Because a typical typed page is overwhelmingly uninterrupted white, it compresses astonishingly well: a full 300 DPI page of text can drop to a few tens of kilobytes. An entire filing cabinet, encoded this way, weighs less than a vacation's worth of phone photos.
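The core idea, runs instead of pixels, fits in a few lines. This is a toy run-length coder, not Group 4 itself: the real codec also codes each line relative to the one above it and packs the results with fixed code tables. The function names and the sample row are made up for illustration.

```python
def encode_runs(row):
    """Turn a row of 0 (white) / 1 (black) pixels into alternating run lengths.

    Following fax convention, the first run is always white, so a row that
    starts with ink begins with a white run of length zero.
    """
    runs = []
    current = 0
    length = 0
    for pixel in row:
        if pixel == current:
            length += 1
        else:
            runs.append(length)
            current = pixel
            length = 1
    runs.append(length)
    return runs


def decode_runs(runs):
    """Rebuild the exact pixel row from its run lengths."""
    row = []
    color = 0
    for length in runs:
        row.extend([color] * length)
        color ^= 1
    return row


# One 2,550-pixel scan line crossing two thin strokes of ink.
row = [0] * 1000 + [1] * 12 + [0] * 30 + [1] * 8 + [0] * 1500
runs = encode_runs(row)
print(runs)                      # [1000, 12, 30, 8, 1500]
print(decode_runs(runs) == row)  # True: nothing guessed, nothing discarded
```

Five numbers stand in for 2,550 pixels, and a fully blank line collapses to one.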

Modern scanner software goes one step further with an approach called mixed raster content: it splits the page into layers, keeping the text as a crisp bilevel mask while storing the background — paper tone, stamps, highlighter — as a soft, cheap image underneath. Each layer gets the compression suited to it. This is how a good scan manages to be simultaneously small and sharp, and it is worth knowing whether your scanning tool does it, because it is the difference between choosing among trade-offs and refusing the trade-off entirely.
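The layer split can be sketched on a tiny grayscale grid. This is an assumption-heavy toy: production mixed raster content encoders segment pages far more carefully, and the threshold value and function names here are invented for the example.

```python
def split_layers(page, ink_threshold=96):
    """Split a grayscale page (0 = black, 255 = white) into mask and background.

    mask:       1 where a pixel is dark enough to count as ink, else 0
    background: the page with ink pixels painted over in the average paper tone
    """
    mask = [[1 if p < ink_threshold else 0 for p in row] for row in page]
    paper = [p for row, mrow in zip(page, mask) for p, m in zip(row, mrow) if not m]
    fill = round(sum(paper) / len(paper)) if paper else 255
    background = [[fill if m else p for p, m in zip(row, mrow)]
                  for row, mrow in zip(page, mask)]
    return mask, background


def merge_layers(mask, background, ink=0):
    """Recombine the layers: solid ink where the mask is set, background elsewhere."""
    return [[ink if m else b for m, b in zip(mrow, brow)]
            for mrow, brow in zip(mask, background)]


page = [
    [250, 248, 10, 251],
    [249, 12, 8, 250],
]
mask, background = split_layers(page)
print(mask)        # [[0, 0, 1, 0], [0, 1, 1, 0]]
print(background)  # [[250, 248, 250, 251], [249, 250, 250, 250]]
print(merge_layers(mask, background))
```

The mask is exactly the kind of bilevel image that lossless fax-style coding handles well, and the background, with its hard edges painted out, is the kind of soft image that lossy coding handles cheaply.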

When Compression Starts Making Things Up

There is a cautionary tale here, and it is real. In 2013, the German computer scientist David Kriesel discovered that certain Xerox WorkCentre machines were silently altering the documents they scanned: a 6 on a floor plan would come out as an 8, cleanly and confidently, with no blur to warn anyone. The culprit was the lossy mode of a format called JBIG2, which shrinks files by building a dictionary of character-shaped patches and reusing one patch for every symbol that looks similar enough. When the threshold for similar enough was set too loosely, the compressor began substituting digits. Xerox eventually shipped fixes, but the episode left a permanent lesson.
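The failure mode is easy to reproduce in miniature. The sketch below is a toy patch dictionary, not the JBIG2 format or any vendor's implementation: glyphs are tiny hand-drawn bitmaps, similarity is a plain count of differing pixels, and max_mismatch is a hypothetical stand-in for the encoder's threshold.

```python
# 4x5 bitmaps, flattened row by row ('#' = ink). They differ in two pixels.
EIGHT = ".##." "#..#" ".##." "#..#" ".##."
SIX   = ".##." "#..." "###." "#..#" ".##."


def mismatch(a, b):
    """Number of pixels at which two same-sized patches differ."""
    return sum(x != y for x, y in zip(a, b))


def compress(glyphs, max_mismatch):
    """Store each glyph as an index into a dictionary of patches.

    A glyph reuses the first stored patch that differs from it by at most
    max_mismatch pixels; otherwise it is added as a new patch.
    """
    dictionary, indices = [], []
    for glyph in glyphs:
        for i, known in enumerate(dictionary):
            if mismatch(glyph, known) <= max_mismatch:
                indices.append(i)
                break
        else:
            dictionary.append(glyph)
            indices.append(len(dictionary) - 1)
    return dictionary, indices


def decompress(dictionary, indices):
    return [dictionary[i] for i in indices]


page = [EIGHT, SIX, EIGHT]

strict = decompress(*compress(page, max_mismatch=0))
loose = decompress(*compress(page, max_mismatch=2))

print(strict == page)     # True: every glyph survives
print(loose[1] == EIGHT)  # True: the 6 has been replaced by a clean 8
```

The loose result is smaller, one patch instead of two, and perfectly sharp, which is exactly why nothing on the page warns the reader.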

The lesson is not that compression is dangerous. It is that the smaller you demand a file be, the more the encoder has to guess, and the most aggressive settings guess about the content itself. Lossless schemes like Group 4 never lie. JPEG degrades visibly; its blur announces itself. Be wary of anything advertised as extreme compression on documents where a digit matters.

A Recipe That Works Before You Reach for a Compressor

First, decide what the page is, because that decision picks the right dials for you. Typed text with no meaningful color: black-and-white or grayscale at 300 DPI, and the size problem largely evaporates. Mixed pages with signatures, stamps, or highlighting: grayscale or color, ideally with layered compression doing the separating. Photographs and artwork: keep color, and accept the honest cost.
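Written down as a lookup, the first step might look like the sketch below. Every name here (PageKind, ScanSettings, recommend) is hypothetical, the compression labels are shorthand rather than settings from any particular scanner, and the 300 DPI figure for photographs is an assumption: the recipe above only fixes resolution for text.

```python
from dataclasses import dataclass
from enum import Enum


class PageKind(Enum):
    TEXT = "typed text, no meaningful color"
    MIXED = "signatures, stamps, highlighting"
    PHOTO = "photographs and artwork"


@dataclass(frozen=True)
class ScanSettings:
    dpi: int
    color_mode: str
    compression: str


def recommend(kind: PageKind) -> ScanSettings:
    """Map what the page is to a starting point for the three dials."""
    if kind is PageKind.TEXT:
        return ScanSettings(300, "bilevel", "lossless fax-style (Group 4)")
    if kind is PageKind.MIXED:
        return ScanSettings(300, "grayscale or color", "layered (mixed raster content)")
    # Assumption: 300 DPI for photos; the source text does not specify one.
    return ScanSettings(300, "color", "JPEG at gentle settings")


for kind in PageKind:
    print(f"{kind.name:5s} -> {recommend(kind)}")
```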

Second, scan right rather than shrink afterward. Re-compressing an already bloated color scan re-encodes its artifacts along with its content — generation loss, the photocopy-of-a-photocopy effect. A fresh capture at the correct settings beats any post-hoc squeezer.

Third, if a portal or email limit forces you to crush a file, keep the original. Compress the copy, send the copy, and let the archive stay whole. And if the document will ever be OCR'd, compress gently — every artifact you add is accuracy you subtract.

When Not to Shrink at All

Some documents should never be optimized for the outbox. Medical records, contracts, anything that might one day be evidence: storage is cheap and getting cheaper, and a pristine 25-megabyte file costs fractions of a cent to keep. The attachment cap is an email problem, not a document problem — when a file genuinely must stay large, share a link instead of degrading the record itself.

The Decisions You Shouldn't Have to Make by Hand

Everything above is knowable, but nobody wants to think about bilevel masks while standing over a lease. This is the quiet work a good scanner app does for you: LumenScan looks at each page as it is captured, treats text like text and images like images, and produces PDFs that are small because they remember only what matters — with OCR running entirely on your device, so the searchable text layer arrives without your documents ever leaving your hands. If your scans keep bouncing off attachment limits, or arriving smeared on the other end, you can try it at lumenscan.lumenlabs.works.