The Secret Life Of JPEGs

Using photographs for OSINT purposes is not new, but almost all work of this kind focuses on the visual content of images. Much less well known is the usefulness of image metadata and file data for research purposes. Beyond the visual appearance of an image there is often additional information hidden within the file that can be just as useful. The best known of these data types is EXIF data. EXIF is an information goldmine because it contains information about the camera, its settings, and sometimes even location data, but you’ll be very lucky if you come across EXIF data in an image on the web or on any kind of social media platform. EXIF data features in every OSINT course or CTF that I’m aware of, but as a rule most platforms remove it from images, and finding it in the wild is quite rare. This means that working with primary sources is always better than working with secondhand images that have been processed – but it doesn’t mean that we should abandon image metadata altogether. In this post we’ll dig into the world of digital forensics and see that there are still little details hidden away inside web images that can prove to be useful.

Tools

There are a few tools that are useful for extracting data from JPEG image files. I’m going to refer to three of them in this article.

Forensically – this is a very simple web-based image forensics tool that runs in your browser. Simply load an image with “Open File” to begin analysing it. The tabs “Meta Data”, “Geo Tags” and “String Extraction” will be useful for accessing the data we need.

ExifTool – this is a simple but very powerful tool for extracting metadata from many different file types, not just images. It runs on Windows, Mac, and Linux. There’s a simple setup guide here and this is a useful video tutorial for Windows. It isn’t as immediately easy to use as Forensically, but it is more effective for viewing recovered metadata.

Bless – Bless is a hex editor that allows you to see the structure of a file in its rawest possible form. It’s not quite down to the ones and zeroes, but almost. HxD is also a good basic hex viewer for Windows, but any hex editor will do for this purpose. Using a hex editor is the most comprehensive way to inspect the structure of a file and it will allow us to inspect the metadata very precisely.

What’s Inside A JPEG?

Every type of file has its own file signature – it’s how a computer can tell a .doc from a .exe. In Windows the operating system looks at the extension on the end of a filename (e.g. .pdf, .xls) to decide which program to use to open the file, although this is actually less important than you’d think as far as the computer is concerned. What matters more is the file signature. Most file formats start with a short sequence of bytes, usually written in hexadecimal, that identifies what kind of file it is. A .exe file starts with the signature 4D 5A, a .docx file starts with 50 4B 03 04 (the ZIP signature, because a .docx is a ZIP archive), and a JPEG image file starts with FF D8, and so on. There’s a full list of common signatures here. Don’t worry if all this is completely alien, it’ll all be clearer shortly!
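To make the idea of a file signature concrete, here is a minimal Python sketch that checks the first bytes of a file against the three signatures mentioned above. The function names and the lookup table are my own illustration and are nowhere near a complete list of signatures.

```python
# Signatures quoted in the text above; illustrative only, not exhaustive.
SIGNATURES = {
    bytes.fromhex("FFD8"): "JPEG image",
    bytes.fromhex("4D5A"): "Windows executable (.exe)",
    bytes.fromhex("504B0304"): "ZIP container (e.g. .docx)",
}


def sniff_signature(data: bytes) -> str:
    """Return a label for the first matching signature, or 'unknown'."""
    for magic, label in SIGNATURES.items():
        if data.startswith(magic):
            return label
    return "unknown"


def sniff_file(path: str) -> str:
    """Read only the first few bytes of a file and classify them."""
    with open(path, "rb") as handle:
        return sniff_signature(handle.read(8))
```

Calling sniff_file on a photo that has been renamed to something like report.pdf should still report a JPEG, because the check ignores the extension entirely.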

This means that when a computer is reading data, it sees “FF D8” at the start of a file and knows that it’s a JPEG. The end of the file is marked by a corresponding “FF D9“. For a quick example, let’s look at a picture from today’s news. Here’s how your computer presents the image to you:

Here’s how the computer sees the same image. By opening it with a hex viewer we can see the file (almost) as the computer would. Notice it starts with the file signature FF D8, which indicates it is a JPEG:

 

and the file ends with FF D9, which indicates the end of the JPEG:

So what has this got to do with EXIF and other file data that we might be interested in? Just as there are specific hexadecimal values that indicate the start and end of a JPEG file, there are other hexadecimal patterns within the file that indicate where specific kinds of information are to be found. These are all located in the file header, which is the first part of the file before the main bulk of data relating to the actual image itself. There’s a full list of all the useful JPEG codes here, but there are only a few that we’re interested in:

FF E1 – the start of any EXIF data within the file.

FF E2 – ICC (International Colour Consortium) profile information. An ICC profile is a set of properties that determine how a particular device displays colours. This can be important for OSINT because even when most websites remove EXIF data, the ICC profile is often left intact (assuming it was present at the outset). With some devices (e.g. Apple products) it is still sometimes possible to determine the manufacturer of a device from this information.

FF ED – Photoshop and IPTC data. This marker denotes the start of metadata generated by Photoshop processing. IPTC data contains other information such as copyright details, the photographer’s details, and a caption. This data is often not present in photos even when unmodified, but when it is present it is very helpful.
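The three markers above can be located programmatically as well as with Ctrl+F. The following is a rough sketch, assuming the usual JPEG layout in which each header segment is a marker followed by a two-byte big-endian length. The function names are hypothetical, and the parser is deliberately simplified (it does not handle padding bytes or unusual markers), so treat it as an illustration rather than a robust tool.

```python
import struct

# Labels follow the descriptions in this article.
MARKER_NAMES = {
    0xE1: "APP1 (EXIF)",
    0xE2: "APP2 (ICC profile)",
    0xED: "APP13 (Photoshop / IPTC)",
}


def list_header_segments(data: bytes):
    """Return (marker, payload) pairs for each segment before the image scan."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("missing FF D8 signature, not a JPEG")
    segments = []
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        marker = data[pos + 1]
        if marker == 0xDA:  # FF DA: start of scan, compressed image data follows
            break
        # Two-byte big-endian length that counts itself but not the marker.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        segments.append((marker, data[pos + 4:pos + 2 + length]))
        pos += 2 + length
    return segments


def describe(data: bytes):
    """Summarise each header segment as a short human-readable line."""
    return [
        f"FF {marker:02X} {MARKER_NAMES.get(marker, 'other')}, {len(payload)} bytes"
        for marker, payload in list_header_segments(data)
    ]
```

Running describe over the bytes of a downloaded image gives a quick inventory of which of the metadata segments survived, before reaching for a hex editor.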

This is much easier to see with a practical example, so let’s explore this idea with a web photo that has retained its EXIF data.

Example: A Photo With EXIF Data

This photo is taken from a recent CNN article. CNN seem to be unusual among publishers in that they don’t always remove EXIF data from images on their website. There’s a direct link to the image here.

MESA, CALIFORNIA – MARCH 27: Agricultural laborers pick lemons inside the orchards of Samag Services, Inc, where they grow Avocado, Lemons and Oranges. The bottom has fallen out of the Avocado market as restaurants close during this period of the Covid-19 Coronavirus pandemic. Agricultural workers have become essential workers in the race to maintain Americas food supply while simultaneously staying healthy. (Photo by Brent Stirton/Getty Images.)

Let’s start by saving it and uploading it to Forensically. Here’s what the Metadata tab shows:

There’s lots of useful detail in there. The ImageDescription field is also populated, so when I uploaded this photo to this blog it automatically populated the caption field. There are other snippets of information in the “String Extraction” tab too:

The large amount of data within the header of this JPEG makes it a good test subject for examining a file with a hex editor. Here’s what the file looks like in hex format:

Notice how the EXIF data is visible in the right hand column. We know that the EXIF section of a JPEG begins with FF E1, so we can Ctrl+F to find this part of the file. It’s right there at the start straight after the file signature:

Next we could check for the ICC Profile by searching for FF E2, but we won’t find it in this particular image. However the FF ED field (Photoshop and IPTC data) is present, so we know that the file has likely been processed by Photoshop:

The hex view of a file offers the highest level of detail, but it isn’t always the easiest to read. Forensically does a good job of extracting and displaying most of the data, but ExifTool does a neater job in my opinion. Here’s how it presents a selection of some of the metadata:

 

Because the photo has been processed by Photoshop, creation and modification timestamps for the Photoshop activity are embedded in the image. These Photoshop editing timestamps are not the same as the created/modified/accessed timestamps that you’ll find on all files on your computer – those come from the device’s own filesystem and are not part of the file itself.
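The difference between the two kinds of timestamp is easy to demonstrate. This small Python sketch (the function names are mine) reads the timestamps the filesystem holds about a file and then changes them, which leaves the bytes of the file, and therefore any timestamps embedded in its header, untouched.

```python
import os
import time


def filesystem_times(path: str) -> dict:
    """Timestamps the filesystem keeps about a file (not stored in its bytes)."""
    info = os.stat(path)
    return {
        "modified": time.ctime(info.st_mtime),
        "accessed": time.ctime(info.st_atime),
    }


def set_modified_time(path: str, epoch_seconds: float) -> None:
    """Change the filesystem timestamps without touching the file contents."""
    os.utime(path, (epoch_seconds, epoch_seconds))
```

This is also why filesystem timestamps on a downloaded image usually tell you when you saved it, not when it was created.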

This is all very useful, but we already know that hardly any web images contain EXIF data, and even when it is present there are already plenty of easy to use tools like Jeffrey’s EXIF Viewer that can easily extract it for us, so what’s the point? It’s true that there is no substitute for an image with a load of EXIF data attached, but as we’ve seen not all data inside a file header is EXIF data, and by knowing how to dig a little deeper we can still find useful scraps of original data that EXIF-removal tools miss. We will also see that the major platforms remove EXIF data in their own characteristic way which may actually make it easier to identify where an image originated from.

All Is Not Lost…

Almost every major platform removes metadata, but they all do it in different ways. Last year the IPTC did a comprehensive test across many popular social media and image hosting platforms to see how each one treats EXIF and IPTC data. As you can see, each one handles files differently:

 

Sometimes we can see how a platform strips data. For example, if we look at this bike that was for sale on eBay we can look at a hex view of the file and even tell what software they used to do it:

 

Right where we want the EXIF FF E1 field to begin we have a nice “Processed By eBay with ImageMagick” greeting instead! ImageMagick is a widely used image processing toolkit that can also strip metadata (or falsify it if you want, but that’s another matter…). At least we know how to tell when an image has been borrowed from eBay now!
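A banner like this can be found without a hex editor by pulling readable text out of the raw bytes, much as the “String Extraction” tab in Forensically does. This is only a sketch of that idea with names of my own choosing. It looks for runs of printable ASCII and makes no attempt to work out which JPEG segment the text sits in.

```python
import re


def extract_strings(data: bytes, min_length: int = 6):
    """Return runs of printable ASCII at least min_length characters long."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    return [match.decode("ascii") for match in pattern.findall(data)]


def find_banner(data: bytes, needle: str = "ImageMagick"):
    """Return extracted strings that mention the needle, ignoring case."""
    return [text for text in extract_strings(data) if needle.lower() in text.lower()]
```

Passing the first few kilobytes of a file to find_banner is usually enough, since this kind of text lives in the header rather than in the compressed image data.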

However not all websites leave such a blatant trace when removing metadata, and the way each site does it is different. In the rest of this post I’ll look at how some popular websites handle image metadata and add their own distinguishing features along the way.

Facebook

To see what a Facebook image looks like under the hood, I’m going to use this picture from Mark Zuckerberg’s public mobile uploads.

The filename for this image is 88004843_10111606095638101_261759268640784384_o.jpg. This particular file naming format is derived from the unique way in which Facebook stores its billions of images. You can read more about their ingenious Haystack storage system here, but the filename of each image on Facebook actually denotes a specific block on a specific hard drive cluster within Facebook’s huge ecosystem rather than anything to do with the account it originated from. The consequence of this is that Facebook image filenames are very distinctive.
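Because the naming format is so distinctive, a simple pattern match can flag filenames that look like they came from Facebook. The regular expression below is my own guess, inferred from the single example filename above (three underscore-separated numbers followed by a one-letter suffix), so it should be treated as a heuristic rather than a specification of Facebook’s scheme.

```python
import re

# Heuristic inferred from the example filename above; not an official format.
FACEBOOK_STYLE = re.compile(r"^\d+_\d+_\d+_[a-z]\.jpe?g$", re.IGNORECASE)


def looks_like_facebook_filename(filename: str) -> bool:
    """True when the filename has the three-number shape seen in the example."""
    return FACEBOOK_STYLE.match(filename) is not None
```

A match is only a hint about where an image was downloaded from, and it disappears as soon as someone renames the file.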

Let’s use ExifTool to look inside the file itself:


ExifTool Version Number : 10.80
File Name : 84068253_10111506635776461_6848249074852823040_o.jpg
Directory : .
File Size : 371 kB
File Permissions : rw-rw-r--
File Type : JPEG
File Type Extension : jpg
MIME Type : image/jpeg
Profile CMM Type :
Profile Version : 2.0.0
Profile Class : Display Device Profile
Color Space Data : RGB
Profile Connection Space : XYZ
Profile Date Time : 2009:03:27 21:36:31
Profile File Signature : acsp
Primary Platform : Unknown ()
CMM Flags : Not Embedded, Independent
Device Manufacturer :
Device Model :
Device Attributes : Reflective, Glossy, Positive, Color
Rendering Intent : Perceptual
Connection Space Illuminant : 0.9642 1 0.82491
Profile Creator :
Profile ID : 29f83ddeaff255ae7842fae4ca83390d
Profile Description : sRGB IEC61966-2-1 black scaled
Blue Matrix Column : 0.14307 0.06061 0.7141
Blue Tone Reproduction Curve : (Binary data 2060 bytes, use -b option to extract)
Device Model Desc : IEC 61966-2-1 Default RGB Colour Space - sRGB
Green Matrix Column : 0.38515 0.71687 0.09708
Green Tone Reproduction Curve : (Binary data 2060 bytes, use -b option to extract)
Luminance : 0 80 0
Measurement Observer : CIE 1931
Measurement Backing : 0 0 0
Measurement Geometry : Unknown
Measurement Flare : 0%
Measurement Illuminant : D65
Media Black Point : 0.01205 0.0125 0.01031
Red Matrix Column : 0.43607 0.22249 0.01392
Red Tone Reproduction Curve : (Binary data 2060 bytes, use -b option to extract)
Technology : Cathode Ray Tube Display
Viewing Cond Desc : Reference Viewing Condition in IEC 61966-2-1
Media White Point : 0.9642 1 0.82491
Profile Copyright : Copyright International Color Consortium, 2009
Chromatic Adaptation : 1.04791 0.02293 -0.0502 0.0296 0.99046 -0.01707 -0.00925 0.01506 0.75179
JFIF Version : 1.01
Resolution Unit : None
X Resolution : 1
Y Resolution : 1
Current IPTC Digest : 2aa1d117b0d20226dcefbb16249a023f
Original Transmission Reference : GggUWrgwZ9hSQFQXeGJa
Image Width : 1504
Image Height : 1505
Encoding Process : Progressive DCT, Huffman coding
Bits Per Sample : 8
Color Components : 3
Y Cb Cr Sub Sampling : YCbCr4:2:0 (2 2)
Image Size : 1504x1505
Megapixels : 2.3

The vast majority of this data is to do with the colour settings of the image. Even the interesting looking field “Profile ID” refers to a non-unique colour profile setting rather than something specific like a user profile. However what is unique to this Facebook image is the IPTC Digest hash:

Current IPTC Digest : 2aa1d117b0d20226dcefbb16249a023f

This is an MD5 hash derived from the IPTC data associated with the image. We cannot access the IPTC data itself, but the hash is still useful because it acts as a fingerprint for that data. This data field was widely misreported last year as another form of “Facebook tracking” or even some kind of steganography. There are myriad ways in which Facebook tracks you to be sure, but this is not one of them. The IPTC digest is more akin to a form of copyright marking, but nothing more. The useful thing to know as a researcher is that if I take an image from Facebook and change the filename, the original IPTC digest remains unchanged inside the metadata. Here’s what happened when I renamed 88004843_10111606095638101_261759268640784384_o.jpg to someimage.jpg and ran it through ExifTool again. It’s the same result:

Current IPTC Digest : 2aa1d117b0d20226dcefbb16249a023f
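The reason the digest survives a rename is that a filename is stored by the filesystem, not inside the file. The sketch below illustrates this with a hash of the whole file rather than the IPTC digest itself (which ExifTool calculates over the IPTC block only). The function names and temporary filenames are hypothetical.

```python
import hashlib
import os
import tempfile


def file_sha256(path: str) -> str:
    """Hash every byte of the file; the filename is not part of the input."""
    with open(path, "rb") as handle:
        return hashlib.sha256(handle.read()).hexdigest()


def hash_survives_rename(original_bytes: bytes) -> bool:
    """Write the bytes to a file, rename it, and compare hashes before and after."""
    with tempfile.TemporaryDirectory() as folder:
        old_path = os.path.join(folder, "original_name.jpg")
        new_path = os.path.join(folder, "someimage.jpg")
        with open(old_path, "wb") as handle:
            handle.write(original_bytes)
        before = file_sha256(old_path)
        os.rename(old_path, new_path)
        return before == file_sha256(new_path)
```

Since no byte of the file changes, anything calculated from its contents, including the embedded IPTC digest, comes out the same.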

But if I change even a single pixel in the image and resave it, all the original metadata is lost:


File Type : JPEG
File Type Extension : jpg
MIME Type : image/jpeg
JFIF Version : 1.01
Resolution Unit : None
X Resolution : 1
Y Resolution : 1
Image Width : 1504
Image Height : 1505
Encoding Process : Baseline DCT, Huffman coding
Bits Per Sample : 8
Color Components : 3
Y Cb Cr Sub Sampling : YCbCr4:2:0 (2 2)
Image Size : 1504x1505
Megapixels : 2.3

Not a very effective way of tracking.
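To compare two ExifTool listings like the ones above without reading them line by line, the text output can be turned into a dictionary and compared. This is a small sketch that assumes the “Tag : Value” layout shown in this post. The function names are my own, and it splits on the first colon because values such as dates contain colons of their own.

```python
def parse_exiftool_text(output: str) -> dict:
    """Turn 'Tag Name : value' lines into a dictionary of tag to value."""
    fields = {}
    for line in output.splitlines():
        name, separator, value = line.partition(":")
        if separator and name.strip():
            fields[name.strip()] = value.strip()
    return fields


def lost_fields(before: str, after: str) -> list:
    """List the tags present in the first listing but missing from the second."""
    kept = parse_exiftool_text(after)
    return [name for name in parse_exiftool_text(before) if name not in kept]
```

Applied to the two listings above, this would list the colour profile fields, the Current IPTC Digest, and the Original Transmission Reference among the casualties of the resave.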

Twitter

Twitter also removes metadata, but it does leave the ICC Profile field (FF E2) and Photoshop metadata field (FF ED) untouched when files are uploaded from Apple devices. Here’s an example from an old Quiztime photo posted by Julia Bayer:

Opening the image in Forensically and choosing “String Extraction” pulls out the information from the FF E2 and FF ED fields:

There’s still some original metadata in there that hasn’t been removed. ExifTool makes it easier to read:


ExifTool Version Number : 10.80
File Name : EP4i4PqUUA86HSg.jpeg
Directory : .
File Size : 284 kB
File Modification Date/Time : 2020:02:18 19:34:40+00:00
File Access Date/Time : 2020:02:18 19:34:40+00:00
File Inode Change Date/Time : 2020:02:18 19:34:40+00:00
File Permissions : rw-rw-r--
File Type : JPEG
File Type Extension : jpg
MIME Type : image/jpeg
JFIF Version : 1.01
Resolution Unit : None
X Resolution : 72
Y Resolution : 72
Profile CMM Type : Apple Computer Inc.
Profile Version : 4.0.0
Profile Class : Display Device Profile
Color Space Data : RGB
Profile Connection Space : XYZ
Profile Date Time : 2017:07:07 13:22:32
Profile File Signature : acsp
Primary Platform : Apple Computer Inc.
CMM Flags : Not Embedded, Independent
Device Manufacturer : Apple Computer Inc.
Device Model :
Device Attributes : Reflective, Glossy, Positive, Color
Rendering Intent : Perceptual
Connection Space Illuminant : 0.9642 1 0.82491
Profile Creator : Apple Computer Inc.
Profile ID : ca1a9582257f104d389913d5d1ea1582
Profile Description : Display P3
Profile Copyright : Copyright Apple Inc., 2017
Media White Point : 0.95045 1 1.08905
Red Matrix Column : 0.51512 0.2412 -0.00105
Green Matrix Column : 0.29198 0.69225 0.04189
Blue Matrix Column : 0.1571 0.06657 0.78407
Red Tone Reproduction Curve : (Binary data 32 bytes, use -b option to extract)
Chromatic Adaptation : 1.04788 0.02292 -0.0502 0.02959 0.99048 -0.01706 -0.00923 0.01508 0.75168
Blue Tone Reproduction Curve : (Binary data 32 bytes, use -b option to extract)
Green Tone Reproduction Curve : (Binary data 32 bytes, use -b option to extract)
IPTC Digest : d41d8cd98f00b204e9800998ecf8427e
Image Width : 1604
Image Height : 2048
Encoding Process : Progressive DCT, Huffman coding
Bits Per Sample : 8
Color Components : 3
Y Cb Cr Sub Sampling : YCbCr4:2:0 (2 2)
Image Size : 1604x2048
Megapixels : 3.3

The image still retains the metadata that shows it was created with an Apple device:

Profile Creator : Apple Computer Inc.
Profile ID : ca1a9582257f104d389913d5d1ea1582
Profile Description : Display P3
Profile Copyright : Copyright Apple Inc., 2017

 

Even though Twitter removes most of the image metadata following the FF E1 (EXIF) field, it leaves other metadata fields intact. This only seems to be the case for Apple devices, but it’s a tiny scrap of information that could be useful to prove or disprove image attribution and is often overlooked.
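One more detail in the ExifTool output above is worth a second look. The IPTC Digest value d41d8cd98f00b204e9800998ecf8427e is the MD5 hash of zero bytes of input, which suggests the IPTC block in this image is empty rather than holding hidden data. The short sketch below shows the check, with the assumption (based on how ExifTool describes the field) that the digest is an MD5 over the IPTC block. The helper name is hypothetical.

```python
import hashlib

# MD5 of an empty byte string, i.e. the digest of "no IPTC data at all".
EMPTY_MD5 = hashlib.md5(b"").hexdigest()


def iptc_digest_is_empty(digest: str) -> bool:
    """True when a reported IPTC digest is just the hash of empty input."""
    return digest.strip().lower() == EMPTY_MD5
```

The Facebook digest shown earlier in this post does not match the empty value, so in that case there was real IPTC content behind the hash.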

Reddit also leaves the same metadata intact for photos from Apple devices. Here’s a photo from today’s front page:

We see traces of the same Apple origins in this image too, despite all the other data having been stripped out:


Profile CMM Type : Apple Computer Inc.
Profile Version : 4.0.0
Profile Class : Display Device Profile
Color Space Data : RGB
Profile Connection Space : XYZ
Profile Date Time : 2017:07:07 13:22:32
Profile File Signature : acsp
Primary Platform : Apple Computer Inc.
CMM Flags : Not Embedded, Independent
Device Manufacturer : Apple Computer Inc.
Device Model :
Device Attributes : Reflective, Glossy, Positive, Color
Rendering Intent : Perceptual
Connection Space Illuminant : 0.9642 1 0.82491
Profile Creator : Apple Computer Inc.
Profile ID : ca1a9582257f104d389913d5d1ea1582
Profile Description : Display P3
Profile Copyright : Copyright Apple Inc., 2017
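The Apple fields that ExifTool reports come from the fixed 128-byte header at the start of the ICC profile, which is stored in the FF E2 segment after an “ICC_PROFILE” label and two bytes giving the chunk number and chunk count. The following is a rough sketch of reading a few of those header fields by hand. The function name and the choice of fields are mine, it only handles the first chunk of a profile, and it returns the raw four-character codes rather than the friendly names ExifTool substitutes for them.

```python
import struct

ICC_TAG = b"ICC_PROFILE\x00"


def parse_icc_header(app2_payload: bytes) -> dict:
    """Pull a few identifying fields out of the ICC header in an FF E2 payload."""
    if not app2_payload.startswith(ICC_TAG):
        raise ValueError("payload is not an ICC profile chunk")
    profile = app2_payload[len(ICC_TAG) + 2:]  # skip chunk number and chunk count
    if len(profile) < 128 or profile[36:40] != b"acsp":
        raise ValueError("ICC header is truncated or malformed")
    year, month, day, hour, minute, second = struct.unpack(">6H", profile[24:36])

    def code(start: int) -> str:
        # Four-character signature, with padding nulls and spaces removed.
        return profile[start:start + 4].decode("ascii", "replace").strip("\x00 ")

    return {
        "cmm_type": code(4),
        "primary_platform": code(40),
        "device_manufacturer": code(48),
        "profile_creator": code(80),
        "profile_date": f"{year:04d}:{month:02d}:{day:02d} {hour:02d}:{minute:02d}:{second:02d}",
        "profile_id": profile[84:100].hex(),
    }
```

The payload for this function could come from any routine that splits a JPEG into its segments and hands over the contents of the FF E2 one. Seeing Apple signatures in these fields says the colour profile came from Apple software, which is a useful hint about the device but not proof of it.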

 

AI-Generated Imagery

In the last OSINTCurious webcast we discussed whether services like Facebook might use image metadata and filenames to detect fake profile pictures such as those produced by This Person Does Not Exist. To see what kind of metadata these images carry, I created this image and then ran it through ExifTool and Bless.

 

A lot of the standard metadata fields are present (all beginning with ‘FF’) but they’re all blank and don’t actually contain any data. This is certainly unusual, since most real profile images would contain at least some JPEG header information and this one is almost entirely blank. This is what you’d expect from an image created by computer software and not a camera. Could this be a reason that Facebook flags profiles created with these images? It’s possible, but I’m not completely convinced: partly because there are plenty of active profiles out there that use these images without any problem, but also because it’s simply not possible to know exactly what triggers Facebook to flag profiles as inauthentic.

For information about the types of retrievable photo metadata, I recommend reading through this detailed guide.

 

 

 

 

 

 

 

 

 

 
