What the Metadata in Your Scanned Documents Reveals — and How to Remove It

The metadata in scanned documents can reveal your location, device, and edit history — even when the page looks harmless. Here's how to find it and strip it.

In December 2012, the software entrepreneur John McAfee was on the run from police in Belize when Vice published a photo of him under the headline "We Are with John McAfee Right Now, Suckers." The journalists were careful not to say where they were. The iPhone that took the photo was not: embedded in the image file were GPS coordinates placing McAfee in Guatemala. Within days, Guatemalan police had him in custody.

Nothing in the visible photo gave him away. The betrayal happened in a part of the file almost nobody looks at — and that part exists, in some form, in nearly every document you have ever scanned with your phone.

We tend to think of a scan as a picture of a page: what you see is what's there. But a scan is a file, and files keep records about themselves. Understanding what those records contain — and how to remove them before a document leaves your hands — is the most overlooked piece of document privacy there is.

The file behind the file

Every digital image and every PDF carries a second layer of information that never appears on screen. In photos, it's called EXIF data — a standardized block inside JPEG and HEIC files that records the camera make and model, the exact date and time of capture, exposure settings, and, if location services were on, latitude and longitude accurate to a few meters. None of this is sinister in origin. Your photo library uses it to sort images by date and place, and to know which way is up.
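For the curious, here is roughly how those coordinates sit in the file. EXIF stores latitude and longitude as three fractions each (degrees, minutes, seconds) plus a one-letter hemisphere reference, and mapping software converts them to the decimal form you'd paste into a map. The sketch below is illustrative only: the function name and the sample values are made up for this example and do not come from any real photo.

```python
from fractions import Fraction


def dms_to_decimal(dms, ref):
    """Convert EXIF-style GPS values to signed decimal degrees.

    dms: three (numerator, denominator) pairs for degrees, minutes, seconds.
    ref: hemisphere letter, one of "N", "S", "E", "W".
    """
    degrees, minutes, seconds = (Fraction(n, d) for n, d in dms)
    value = degrees + minutes / 60 + seconds / 3600
    # South and west are written as negative numbers in decimal notation.
    if ref in ("S", "W"):
        value = -value
    return float(value)


# Hypothetical values, chosen only to show the arithmetic.
latitude = dms_to_decimal(((15, 1), (39, 1), (2958, 100)), "N")
longitude = dms_to_decimal(((88, 1), (59, 1), (3174, 100)), "W")
print(latitude, longitude)
```

One arc-second of latitude is roughly 31 meters on the ground, so a seconds value carrying two decimal places describes a position far more finely than a street address does.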

PDFs keep their own ledger: a document information dictionary and an XMP block that store a title, an author, the software that created the file, and both a creation date and a last-modified date. The author field is often auto-filled from your computer account name or the name registered to your software. A PDF produced by a scanning app frequently records which app made it, which version, and sometimes the device model.
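To make that ledger concrete, here is a minimal Python sketch that pulls the common fields out of a PDF's raw bytes. It is a rough peek, not a parser: it assumes the information dictionary is stored uncompressed with plain parenthesized strings, which many simple PDFs do but plenty don't (fields can also live in compressed object streams, as hex or UTF-16 strings, or only in the XMP block). The function name and the sample bytes are invented for illustration.

```python
import re

# Standard keys of the PDF document information dictionary.
INFO_KEYS = (b"Title", b"Author", b"Creator", b"Producer",
             b"CreationDate", b"ModDate")


def peek_pdf_info(data):
    """Return any info-dictionary fields found as plain literal strings."""
    found = {}
    for key in INFO_KEYS:
        # Matches "/Key (value)", allowing backslash-escaped characters
        # inside the value but not nested, unescaped parentheses.
        pattern = rb"/" + key + rb"\s*\(((?:\\.|[^\\()])*)\)"
        matches = re.findall(pattern, data)
        if matches:
            # Later revisions are appended to the file, so take the last hit.
            found[key.decode()] = matches[-1].decode("latin-1")
    return found


# A made-up fragment standing in for a real file's contents.
sample = (
    b"1 0 obj\n"
    b"<< /Title (Lease agreement) /Author (jsmith)\n"
    b"   /Producer (ExampleScan 2.1) /CreationDate (D:20240105093000Z) >>\n"
    b"endobj\n"
)
for field, value in peek_pdf_info(sample).items():
    print(f"{field}: {value}")
```

An empty result from a sketch like this does not mean a file is clean, only that its fields are not stored in the simplest form.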

The page is the part you proofread. The metadata is the part that talks about you behind your back.

What a photo of a document can say

Suppose you photograph a signed form to "scan" it, then email the image to a stranger — a landlord, a marketplace buyer, an insurance adjuster. If the photo was taken at home with location enabled, the EXIF block contains your home coordinates. You may have blacked out your address on the page itself and still handed it over in the container.

Where you send the file matters as much as what's in it. Large social platforms strip EXIF from images on upload (though they typically read it first). Email does no such thing — an attachment arrives byte-for-byte as it left. Most messaging apps recompress photos sent as photos, which discards the metadata along with some quality; but send the same image "as a file" or "as a document" to preserve quality, and the metadata is usually preserved too. The route that feels most careful is often the one that leaks.

What a PDF remembers

The most famous metadata catch in history involved no photo at all. In 2005, the serial killer known as BTK — who had taunted Wichita police for three decades — asked them, through one of his messages, whether a floppy disk could be traced. Police answered, through a newspaper ad, that it couldn't. He mailed one to a local TV station. The deleted Word document on it carried metadata naming the computer it was last saved on — at Christ Lutheran Church — and the user who saved it: "Dennis." Dennis Rader was the congregation's president. The thirty-year hunt ended in a matter of days.

PDFs invite the same class of mistake. The author field can expose a real name or an employer's machine naming scheme. The creation and modification dates can contradict a claim about when a document was made. And because PDFs support incremental saving — new versions appended to the end of the same file — an edited PDF can retain earlier versions of itself internally, recoverable with ordinary tools.
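A quick way to see whether a PDF may be carrying earlier versions is to count its end-of-file markers. Each incremental save appends a new section ending in %%EOF, so more than one marker suggests more than one revision, and cutting the file off at an earlier marker generally yields the earlier version. The sketch below assumes that layout; the function name and the sample bytes (which are not a valid PDF) are made up for illustration. Treat it as a heuristic: linearized ("fast web view") PDFs contain two markers without ever having been edited, and the same byte sequence can turn up inside binary stream data.

```python
import re


def revision_cut_points(data):
    """Return the byte offset just past each %%EOF marker in the data."""
    return [match.end() for match in re.finditer(rb"%%EOF", data)]


# Stand-in bytes that mimic the shape of an incrementally saved file.
sample = (
    b"%PDF-1.4\n(first draft: salary 50,000)\n%%EOF\n"
    b"(second draft: salary removed)\n%%EOF\n"
)

cuts = revision_cut_points(sample)
print(f"{len(cuts)} end-of-file markers found")
if len(cuts) > 1:
    # Everything up to the first marker is the file as first saved.
    first_revision = sample[:cuts[0]]
    print(first_revision.decode("latin-1"))
```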

There's also the text layer. A scanned PDF that has been through OCR contains the page's full machine-readable text riding invisibly beneath the image. That's exactly what makes scans searchable, and it's worth remembering when you share one: a document can be more legible to software than it looks to you. (Visual redaction has the same trap — covering text is not the same as deleting it — but metadata sits a layer below even that.)

Why nobody checks: seeing is believing

The psychologist Daniel Kahneman described a habit of mind he abbreviated WYSIATI — "what you see is all there is." We build confident judgments from the information in front of us and rarely ask what information might exist that we can't see. A page that looks harmless is judged harmless; the audit stops at the pixels.

That instinct isn't laziness. It's a paper-era intuition. With a physical document, inspecting the artifact really is a complete audit — paper has no hidden compartments. Digital files broke that guarantee decades ago, but the intuition never updated, which is why intelligent, careful people redact a page meticulously and then email it wrapped in their own coordinates.

When metadata is on your side

None of this means metadata is the enemy. A capture timestamp can help show that a receipt was scanned before a deadline. GPS coordinates can corroborate that photos of accident damage were taken at the scene. Modification dates and author fields have helped settle disputes about who changed a contract, and when. Insurers, courts, and auditors lean on exactly the fields described above.

So the goal isn't reflexive stripping — it's deliberate handling. A sensible rule: keep metadata on your archival copies, where it serves you, and strip it from copies that leave your control, where it serves whoever receives them.

How to check — and how to strip

Seeing the hidden layer takes about a minute. On a Mac, open an image in Preview and choose Tools → Show Inspector; the EXIF and GPS tabs show what the file carries. On Windows, right-click → Properties → Details does the same, and the "Remove Properties and Personal Information" link below it will strip the file on the spot. For PDFs, File → Properties in any full-featured PDF reader shows the author, producer, and date fields; Adobe Acrobat Pro's "Sanitize Document" removes metadata, hidden layers, and embedded leftovers in one pass. On the command line, exiftool file.jpg prints the full ledger, and exiftool -all= file.jpg erases it. By default exiftool keeps the untouched original beside the cleaned copy as file.jpg_original, so delete that backup, or add -overwrite_original, before you share the folder.
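If you'd rather not install anything, the same stripping can be sketched in a few lines of Python. In a JPEG, EXIF lives in a segment tagged APP1 (XMP metadata uses APP1 as well), so copying every segment except those drops that hidden layer while leaving the image data untouched. This is a minimal sketch under simplifying assumptions: a well-formed JPEG, no other metadata segments you care about (IPTC data, for example, sits in a different segment and is left alone here), and a function name invented for this example. For anything important, a maintained tool such as exiftool is the safer choice.

```python
import struct

SOI = b"\xff\xd8"  # start-of-image marker that opens every JPEG
APP1 = 0xE1        # segment type used by EXIF and XMP
SOS = 0xDA         # start-of-scan: compressed image data follows


def strip_app1(jpeg):
    """Return a copy of the JPEG bytes with every APP1 segment removed."""
    if not jpeg.startswith(SOI):
        raise ValueError("not a JPEG: missing start-of-image marker")
    out = bytearray(SOI)
    pos = 2
    while pos + 4 <= len(jpeg):
        if jpeg[pos] != 0xFF:
            raise ValueError(f"expected a segment marker at byte {pos}")
        marker = jpeg[pos + 1]
        if marker == SOS:
            break  # the rest is image data; copy it verbatim below
        # The 2-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", jpeg[pos + 2:pos + 4])
        if length < 2:
            raise ValueError(f"invalid segment length at byte {pos}")
        end = pos + 2 + length
        if marker != APP1:
            out += jpeg[pos:end]
        pos = end
    out += jpeg[pos:]
    return bytes(out)


# Usage with hypothetical file names:
#   from pathlib import Path
#   Path("clean.jpg").write_bytes(strip_app1(Path("scan.jpg").read_bytes()))
```

One side effect worth knowing: the orientation flag lives in EXIF too, so a stripped photo may display rotated until it is re-saved the right way up.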

Before sharing a photo from an iPhone, tap "Options" at the top of the share sheet and switch Location off — the copy you send goes out clean while your original keeps its data. Several Android gallery apps offer a similar toggle when sharing.

And when in doubt, there's the blunt instrument: print the document to a fresh PDF. The visible content survives; the old container, with its history, does not. The new file has metadata of its own, typically a creation date and a producer string and sometimes an author name filled in from your account, so give it the same one-minute check before it goes out.

The quiet advantage of staying on-device

Every system that touches a file is another chance for its hidden layer to be copied, logged, or read — which is why the first rule of metadata hygiene is simply minimizing how many hands a document passes through. That principle is the reason LumenScan works the way it does: capture, cleanup, and OCR all happen on your phone, so the page — and whatever its container carries — never has to visit a server to become a clean, searchable PDF. You decide what leaves the device, and when, with the full picture of what you're sending. If your documents deserve that kind of discretion, you can try it at lumenscan.lumenlabs.works.