ExifTool
(Update - 2007.08.28: I inadvertently missed out the term names in the last example of XMP as RDF/N3 with QNames and have now added these in. Also - a biggie - I said that PRISM had no XMP schema defined. This is actually wrong and as I blogged here today, the new PRISM 2.0 spec does indeed have a mapping of PRISM terms to XMP value types. Should actually have read the spec instead of just blogging about it earlier here. :~)
Having previously stooped to an extremely crass hack for pulling out a document information dictionary from PDFs (for which no apologies are sufficient, though it does often work), I feel I should make some kind of amends and mention the wonderful ExifTool by Phil Harvey for reading and writing metadata in media files. It is both a Perl library and a command-line application (so it’s cross-platform; a Windows .exe and a Mac OS .dmg are also provided). Besides handling EXIF tags in image files, this veritable Swiss Army knife of metadata inspectors can also read PDFs for the information dictionary and the document XMP packet. Moreover, and intriguingly, it can dump the raw (document) XMP packet.
I’m still experimenting with it. There are quite a number of features to explore. But some preliminary findings are listed below.
Taking one of our standard (metadata poor) PDFs we get this dump:
% exiftool nature05428.pdf
ExifTool Version Number : 6.95
File Name : nature05428.pdf
Directory : .
File Size : 367 kB
File Modification Date/Time : 2007:07:26 14:01:23
File Type : PDF
MIME Type : application/pdf
Page Count : 3
Producer : Acrobat Distiller 6.0.1 (Windows)
Mod Date : 2006:12:19 15:03:23+08:00
Creation Date : 2006:12:18 16:57:58+08:00
Creator : 3B2 Total Publishing System 7.51n/W
Creator Tool : 3B2 Total Publishing System 7.51n/W
Modify Date : 2006:12:19 15:03:23+08:00
Create Date : 2006:12:18 16:57:58+08:00
Metadata Date : 2006:12:19 15:03:23+08:00
Document ID : uuid:f598740b-ad11-41c5-a49e-7caffea783f0
Format : application/pdf
Title : untitled
By way of comparison, if we take a demo (metadata rich) PDF with added descriptive DC and PRISM metadata terms, we then get this dump:
% exiftool 445037a.pdf
ExifTool Version Number : 6.95
File Name : 445037a.pdf
Directory : .
File Size : 265 kB
File Modification Date/Time : 2007:07:26 16:18:17
File Type : PDF
MIME Type : application/pdf
Page Count : 1
Creator Tool : InDesign: pictwpstops filter 1.0
Metadata Date : 2006:12:22 12:10:07Z
Document ID : uuid:4cd39128-2c8e-41c0-9cad-eea2a1fdb64f
Identifier : doi:10.1038/445037a
Description : doi:10.1038/445037a
Source : Nature 445, 37 (2007)
Date : 2007:01:04
Format : application/pdf
Publisher : Nature Publishing Group
Language : en
Rights : © 2007 Nature Publishing Group
Publication Name : Nature
Issn : 0028-0836
E Issn : 1476-4679
Publication Date : 2007-01-04
Copyright : © 2007 Nature Publishing Group
Rights Agent : permissions@nature.com
Volume : 445
Number : 7123
Starting Page : 37
Ending Page : 37
Section : News and Views
Modify Date : 2006:12:22 12:10:07Z
Create Date : 2006:12:22 11:46:18Z
Title : 4.1 N&V NS NEW.indd
Trapped : False
Creator : InDesign: pictwpstops filter 1.0
GTS PDFX Version : PDF/X-1:2001
GTS PDFX Conformance : PDF/X-1a:2001
Author : x
Producer : Acrobat Distiller 6.0.1 for Macintosh
Note that the DC and PRISM terms are encoded as in my earlier examples and do not take account of a) how DC is defined as an XMP schema (i.e. the XMP value types for the separate terms), or b) how PRISM might (because it isn’t yet) be defined as an XMP schema. Nor are identifier considerations fully taken into account. Nonetheless this gives more than an idea of what things could look like.
Now, with ExifTool it is also possible to list out the terms by group, e.g.
% exiftool -g1 445037a.pdf
---- ExifTool ----
ExifTool Version Number : 6.95
---- File ----
File Name : 445037a.pdf
Directory : .
File Size : 265 kB
File Modification Date/Time : 2007:07:26 16:18:17
File Type : PDF
MIME Type : application/pdf
---- PDF ----
Page Count : 1
Modify Date : 2006:12:22 12:10:07Z
Create Date : 2006:12:22 11:46:18Z
Title : 4.1 N&V NS NEW.indd
Trapped : False
Creator : InDesign: pictwpstops filter 1.0
GTS PDFX Version : PDF/X-1:2001
GTS PDFX Conformance : PDF/X-1a:2001
Author : x
Producer : Acrobat Distiller 6.0.1 for Macintosh
---- XMP-xmp ----
Creator Tool : InDesign: pictwpstops filter 1.0
Metadata Date : 2006:12:22 12:10:07Z
---- XMP-xmpMM ----
Document ID : uuid:4cd39128-2c8e-41c0-9cad-eea2a1fdb64f
---- XMP-dc ----
Identifier : doi:10.1038/445037a
Description : doi:10.1038/445037a
Source : Nature 445, 37 (2007)
Date : 2007:01:04
Format : application/pdf
Publisher : Nature Publishing Group
Language : en
Rights : © 2007 Nature Publishing Group
---- XMP-prism ----
Publication Name : Nature
Issn : 0028-0836
E Issn : 1476-4679
Publication Date : 2007-01-04
Copyright : © 2007 Nature Publishing Group
Rights Agent : permissions@nature.com
Volume : 445
Number : 7123
Starting Page : 37
Ending Page : 37
Section : News and Views
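The grouped listing is regular enough to load into a nested structure for further processing. What follows is only a minimal sketch in Python, assuming the plain-text layout shown above ("---- Group ----" headers followed by "Tag : Value" lines); the function name parse_grouped is hypothetical and not part of ExifTool.

```python
import re

# Matches ExifTool group headers such as "---- XMP-dc ----".
GROUP_HEADER = re.compile(r"^-{4}\s+(.+?)\s+-{4}$")


def parse_grouped(text):
    """Return {group: {tag: value}} from `exiftool -g1` style text output."""
    groups = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        header = GROUP_HEADER.match(line)
        if header:
            current = groups.setdefault(header.group(1), {})
            continue
        if current is None:
            continue  # ignore anything before the first group header
        # Assumption: tag names never contain a colon, so the first colon
        # separates name from value (values such as dates may contain more).
        name, sep, value = line.partition(":")
        if sep:
            current[name.strip()] = value.strip()
    return groups


SAMPLE = """\
---- ExifTool ----
ExifTool Version Number : 6.95
---- PDF ----
Page Count : 1
GTS PDFX Version : PDF/X-1:2001
---- XMP-dc ----
Identifier : doi:10.1038/445037a
Date : 2007:01:04
"""

if __name__ == "__main__":
    for group, tags in parse_grouped(SAMPLE).items():
        print(group, tags)
```

If a tag is repeated within a group, the later value overwrites the earlier one in this sketch; a list per tag would be needed to keep both.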
Going back to the first example we can extract the (document) XMP packet as:
% exiftool -xmp -b nature05428.pdf
<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d' bytes='1753'?>
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'
xmlns:iX='http://ns.adobe.com/iX/1.0/'>
<rdf:Description about='uuid:3d686cee-18e6-483c-b1c9-e128e9f0d009'
xmlns='http://ns.adobe.com/pdf/1.3/'
xmlns:pdf='http://ns.adobe.com/pdf/1.3/'>
<pdf:Producer>Acrobat Distiller 6.0.1 (Windows)</pdf:Producer>
<pdf:ModDate>2006-12-19T15:03:23+08:00</pdf:ModDate>
<pdf:CreationDate>2006-12-18T16:57:58+08:00</pdf:CreationDate>
<pdf:Title>untitled</pdf:Title>
<pdf:Creator>3B2 Total Publishing System 7.51n/W</pdf:Creator>
</rdf:Description>
<rdf:Description about='uuid:3d686cee-18e6-483c-b1c9-e128e9f0d009'
xmlns='http://ns.adobe.com/xap/1.0/'
xmlns:xap='http://ns.adobe.com/xap/1.0/'>
<xap:CreatorTool>3B2 Total Publishing System 7.51n/W</xap:CreatorTool>
<xap:ModifyDate>2006-12-19T15:03:23+08:00</xap:ModifyDate>
<xap:CreateDate>2006-12-18T16:57:58+08:00</xap:CreateDate>
<xap:Format>application/pdf</xap:Format>
<xap:Title>
<rdf:Alt>
<rdf:li xml:lang='x-default'>untitled</rdf:li>
</rdf:Alt>
</xap:Title>
<xap:MetadataDate>2006-12-19T15:03:23+08:00</xap:MetadataDate>
</rdf:Description>
<rdf:Description about='uuid:3d686cee-18e6-483c-b1c9-e128e9f0d009'
xmlns='http://ns.adobe.com/xap/1.0/mm/'
xmlns:xapMM='http://ns.adobe.com/xap/1.0/mm/'>
<xapMM:DocumentID>uuid:f598740b-ad11-41c5-a49e-7caffea783f0</xapMM:DocumentID>
</rdf:Description>
<rdf:Description about='uuid:3d686cee-18e6-483c-b1c9-e128e9f0d009'
xmlns='http://purl.org/dc/elements/1.1/'
xmlns:dc='http://purl.org/dc/elements/1.1/'>
<dc:format>application/pdf</dc:format>
<dc:title>untitled</dc:title>
</rdf:Description>
</rdf:RDF>
<?xpacket end='r'?>%
Note that this PDF also included XMP packets for illustrations but the tool extracted the main, or document, XMP packet.
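Since the extracted packet is ordinary XML, its properties can also be listed with a stock XML parser before reaching for any RDF tooling. The following is a rough sketch in Python using only the standard library; the function name xmp_properties and the shortened sample packet are illustrative assumptions, and only simple values and rdf:Alt language alternatives are handled.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def xmp_properties(packet):
    """List (namespace, local name, value) for each property in the packet."""
    root = ET.fromstring(packet)  # the <?xpacket?> PIs are skipped by the parser
    props = []
    for desc in root.iter("{%s}Description" % RDF_NS):
        for child in desc:
            # ElementTree spells qualified names as "{namespace}local".
            ns, _, local = child.tag[1:].partition("}")
            # Language alternatives: take the first rdf:li inside rdf:Alt.
            li = child.find("{%s}Alt/{%s}li" % (RDF_NS, RDF_NS))
            value = li.text if li is not None else (child.text or "").strip()
            props.append((ns, local, value))
    return props


# Shortened, illustrative packet modelled on the dump above.
SAMPLE = """<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?>
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>
 <rdf:Description about='uuid:3d686cee-18e6-483c-b1c9-e128e9f0d009'
   xmlns:xap='http://ns.adobe.com/xap/1.0/'>
  <xap:Format>application/pdf</xap:Format>
  <xap:Title>
   <rdf:Alt>
    <rdf:li xml:lang='x-default'>untitled</rdf:li>
   </rdf:Alt>
  </xap:Title>
 </rdf:Description>
</rdf:RDF>
<?xpacket end='r'?>"""

if __name__ == "__main__":
    for prop in xmp_properties(SAMPLE):
        print(prop)
```

This treats the packet purely as XML; it does not check that the content is valid RDF.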
And now that it’s easier to extract the metadata, one can look to do something more interesting. For example, if one has cwm installed (Tim BL’s Closed World Machine for semweb dabblings; a Python application, so again cross-platform), one can pipe the XMP packet into cwm as RDF/XML, verify it as valid RDF and read it out in another format, e.g. RDF/N3. For the above example we can do this as follows.
But let me first define a pipeline that extracts the XMP, passes it through a couple of filters to strip out processing instructions (these include the opening and closing <?xpacket> XMP PIs as well as an undocumented - legacy? - <?adobe> Adobe PI) and any xmpmeta wrapper lines, feeds the result into cwm as RDF/XML and reads it out as RDF/N3. (Note that instead of ExifTool to extract the XMP another tool could have been used, e.g. something based on the sample apps shipped with the Adobe XMP SDK, or something bespoke.)
% alias get_n3
exiftool -xmp -b !$ | grep -v "<?" | grep -v xmpmeta | cwm --rdf --n3
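The two grep filters in that alias are purely line-based, which is easy to mirror in a few lines of Python should a shell not be to hand. This is just a sketch of the same idea; the function name strip_xmp_wrappers and the sample packet (including its x:xmpmeta wrapper) are illustrative assumptions.

```python
def strip_xmp_wrappers(packet):
    """Drop every line that contains '<?' or 'xmpmeta', like the two greps."""
    kept = [
        line
        for line in packet.splitlines()
        if "<?" not in line and "xmpmeta" not in line
    ]
    return "\n".join(kept)


# Illustrative packet with the wrapper lines the filters are meant to remove.
SAMPLE = """<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?>
<x:xmpmeta xmlns:x='adobe:ns:meta/'>
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>
</rdf:RDF>
</x:xmpmeta>
<?xpacket end='r'?>"""

if __name__ == "__main__":
    print(strip_xmp_wrappers(SAMPLE))
```

As with grep -v, a whole line is discarded whenever it matches, so a packet that puts a processing instruction on the same line as RDF content would lose that content.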
We can then simply request to get the metadata from this PDF in RDF/N3 format:
% get_n3 nature05428.pdf
#Processed by Id: cwm.py,v 1.164 2004/10/28 17:41:59 timbl Exp
# using base file:/Users/tony/Xcode/xmp/dev/
# Notation3 generation by
# notation3.py,v 1.166 2004/10/28 17:41:59 timbl Exp
# Base was: file:/Users/tony/Xcode/xmp/dev/
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
<uuid:3d686cee-18e6-483c-b1c9-e128e9f0d009> <http://ns.adobe.com/pdf/1.3/CreationDate> "2006-12-18T16:57:58+08:00";
<http://ns.adobe.com/pdf/1.3/Creator> "3B2 Total Publishing System 7.51n/W";
<http://ns.adobe.com/pdf/1.3/ModDate> "2006-12-19T15:03:23+08:00";
<http://ns.adobe.com/pdf/1.3/Producer> "Acrobat Distiller 6.0.1 (Windows)";
<http://ns.adobe.com/pdf/1.3/Title> "untitled";
<http://ns.adobe.com/xap/1.0/CreateDate> "2006-12-18T16:57:58+08:00";
<http://ns.adobe.com/xap/1.0/CreatorTool> "3B2 Total Publishing System 7.51n/W";
<http://ns.adobe.com/xap/1.0/Format> "application/pdf";
<http://ns.adobe.com/xap/1.0/MetadataDate> "2006-12-19T15:03:23+08:00";
<http://ns.adobe.com/xap/1.0/ModifyDate> "2006-12-19T15:03:23+08:00";
<http://ns.adobe.com/xap/1.0/Title> [
a rdf:Alt;
rdf:_1 "untitled"@x-default ];
<http://ns.adobe.com/xap/1.0/mm/DocumentID> "uuid:f598740b-ad11-41c5-a49e-7caffea783f0";
<http://purl.org/dc/elements/1.1/format> "application/pdf";
<http://purl.org/dc/elements/1.1/title> "untitled" .
#ENDS
Or writing that out again with QNames for readability (and dropping the UUID as RDF subject, as recommended by the latest XMP spec) we have:
#Processed by Id: cwm.py,v 1.164 2004/10/28 17:41:59 timbl Exp
# using base file:/Users/tony/Xcode/xmp/dev/
# Notation3 generation by
# notation3.py,v 1.166 2004/10/28 17:41:59 timbl Exp
# Base was: file:/Users/tony/Xcode/xmp/dev/
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix pdf: <http://ns.adobe.com/pdf/1.3/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xmp: <http://ns.adobe.com/xap/1.0/> .
@prefix xmpMM: <http://ns.adobe.com/xap/1.0/mm/> .
<> pdf:CreationDate "2006-12-18T16:57:58+08:00";
pdf:Creator "3B2 Total Publishing System 7.51n/W";
pdf:ModDate "2006-12-19T15:03:23+08:00";
pdf:Producer "Acrobat Distiller 6.0.1 (Windows)";
pdf:Title "untitled";
xmp:CreateDate "2006-12-18T16:57:58+08:00";
xmp:CreatorTool "3B2 Total Publishing System 7.51n/W";
xmp:Format "application/pdf";
xmp:MetadataDate "2006-12-19T15:03:23+08:00";
xmp:ModifyDate "2006-12-19T15:03:23+08:00";
xmp:Title [
a rdf:Alt;
rdf:_1 "untitled"@x-default ];
xmpMM:DocumentID "uuid:f598740b-ad11-41c5-a49e-7caffea783f0";
dc:format "application/pdf";
dc:title "untitled" .
#ENDS
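The move from full URIs to QNames is a simple prefix substitution. A minimal sketch in Python follows, assuming a hand-written prefix table covering a subset of the namespaces above; the names PREFIXES and to_qname are hypothetical.

```python
# Prefix table for the Adobe namespaces used in the listing above.
PREFIXES = {
    "pdf": "http://ns.adobe.com/pdf/1.3/",
    "xmp": "http://ns.adobe.com/xap/1.0/",
    "xmpMM": "http://ns.adobe.com/xap/1.0/mm/",
}


def to_qname(uri, prefixes=PREFIXES):
    """Abbreviate a URI to prefix:local, or return <uri> if nothing matches."""
    best = None
    for prefix, namespace in prefixes.items():
        # Prefer the longest matching namespace, so that .../xap/1.0/mm/
        # wins over .../xap/1.0/ for the xmpMM terms.
        if uri.startswith(namespace) and (
            best is None or len(namespace) > len(best[1])
        ):
            best = (prefix, namespace)
    if best is None:
        return "<%s>" % uri
    return "%s:%s" % (best[0], uri[len(best[1]):])


if __name__ == "__main__":
    print(to_qname("http://ns.adobe.com/xap/1.0/mm/DocumentID"))
```

Picking the longest match matters here because the xmp namespace is itself a prefix of the xmpMM one.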
Now just imagine that there were something a little more interesting in there. Like a DOI. Like descriptive metadata, perhaps. 🙂