(upbeat music) - Hello everyone, and welcome to this talk on Safedocs: an update to industry. This talk is all about how PDF, parsing and cybersecurity overlap. My name is Peter Wyatt and I'm the Principal Scientist of the PDF Association and the principal investigator involved in the Safedocs program on behalf of the PDF Association. I am also the co-project leader of ISO 32000, the core PDF specification. Safedocs lies at the intersection of file formats, as represented by extant data in the wild; parsers, as represented by both proprietary and open source implementations across all platforms and devices; and cybersecurity. I believe it has relevance to everyone in the PDF industry and every user of PDF, because it seeks to make PDF provably more secure, trusted and a robust technology choice. Formally, Safedocs is a DARPA-funded
fundamental research program that aims to develop novel verified programming methodologies for building high-assurance parsers, and new methodologies for comprehending, simplifying and reducing these formats to their safe, unambiguous and verification-friendly subsets. You might think of this as safe subsetting. Safedocs addresses the ambiguity and complexity obstacles to the application of verified programming and formal methods posed by complex electronic data formats such as PDF. PDF was selected as the primary Safedocs file format due to its ubiquity, its usefulness, its complexity, its evolution and legacy and, rightly or wrongly, its track record. Other formats supporting protocols and streaming data are also in the mix. The PDF Association
was selected by DARPA to work on Safedocs as the industry representative, to bring PDF expertise and knowledge to the various researchers and to help transition wins back into industry. We seek to bring to the researchers an understanding of the real world of PDF, with more than 25 years and billions of legacy files, multiple PDF versions and continual evolution, and a huge number of PDF creators and PDF consumers. Safedocs is currently about 15 months into a three-to-four-year program of fundamental research, and today I'd like to present just some of the early work arising from Safedocs. Let's begin by looking at the interaction between parsers and extant data. As anyone involved with the development of PDF parsers likely knows, what the PDF spec states shall and should be done and what extant PDFs from the wild actually do can be very different. In Safedocs terminology, we refer to this as parser permissiveness: a parser supports deviations from the spec, extensions or malformations, whether intentionally or unintentionally, and often silently and without user knowledge. When different parsers
get different results, whether this be a rendered image, extracted text or something else, we refer to this as a parser differential. Parser differentials are only sometimes due to parser permissiveness. They also arise from bugs, errors of omission and differences in understanding of what constitutes correct behavior. Due to end user
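To make the idea of a parser differential concrete, here is a toy sketch of my own; it is not Safedocs tooling, and the token handling is deliberately simplified. Two simplistic "parsers" for a PDF-style name token disagree on the same malformed input:

```python
# Toy illustration of a parser differential: two simplistic "parsers"
# for a PDF-style name token (e.g. b"/Type"). Not real Safedocs tooling.

def strict_name(token: bytes) -> bytes:
    """Rejects anything that deviates from the expected form."""
    if not token.startswith(b"/") or b"\x00" in token:
        raise ValueError("malformed name token")
    return token[1:]

def permissive_name(token: bytes) -> bytes:
    """Silently repairs deviations, without the user ever knowing."""
    return token.lstrip(b"/").replace(b"\x00", b"")

malformed = b"/Ty\x00pe"  # embedded NUL byte: not a valid PDF name
# permissive_name(malformed) yields b"Type", while strict_name raises.
# Two tools, one file, two different outcomes: a parser differential.
```

The permissive parser's silent repair is exactly the behavior described above: the user never learns that the input deviated from the spec.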
expectations of PDF, and thus business imperatives, this has effectively created a de facto standard of PDF based on the collective behavior of parsers, which is different to that of the official PDF specification. Corpora and test suites
have become a key part in understanding a file format and in testing parser implementations. Large organizations often have their own private corpora, established over many years from customer issues, but there is no recognized large and diverse public corpus for PDF. It is also not unusual to hear about development teams using internet searching or the Common Crawl internet repository to help inform design decisions or test software. By using very large and diverse collections of PDF, Safedocs hopes to gain a much better understanding of both the de facto PDF standard and what makes a corpus useful. Safedocs researchers at
NASA JPL undertook a study on the effectiveness
of internet searches and internet crawl repositories to establish a large scale
representative corpus of PDF documents. Their results were presented
at the LangSec 2020 Conference in "Building a Wide Reach Corpus for Secure Parser Development" by Tim Allison. The core findings in this report identified three key points. Firstly, Common Crawl truncates PDFs at one megabyte. If you require intact files, then they must be re-fetched from the original websites using the cached URL, assuming it is still valid. In the December 2019 crawl,
nearly 430,000 PDFs, or 22%, were truncated. Secondly, determining file types from the web can be problematic, as there are multiple competing sources of truth, such as HTTP Content-Type headers, the file extension, or file identification. In addition, search engines can further exclude results due to their internal PDF parsing limitations, result filtering, et cetera. Lastly, not only
does Common Crawl not fully capture a website, but different search engines such as Google and Bing can give very different results. For the domain jpl.nasa.gov, Bing reports over 64,000 PDFs while Google reports fewer than 51,000, with Common Crawl having just 7 PDFs in the December 2019 Crawl database. In understanding
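The first two findings can be sketched in a few lines of code. This is my own illustration, not the JPL team's actual tooling; the one-megabyte cap and the heuristic that %%EOF is conventionally expected within the last 1024 bytes of a PDF are the only assumptions baked in:

```python
CRAWL_CAP = 1024 * 1024  # Common Crawl truncates payloads at 1 MB

def looks_truncated(data: bytes) -> bool:
    """Flag a fetched PDF as suspect if it sits at the crawl size cap
    or lacks the %%EOF marker, conventionally expected within the
    last 1024 bytes of the file."""
    return len(data) >= CRAWL_CAP or b"%%EOF" not in data[-1024:]

def pdf_type_signals(url: str, content_type: str, head: bytes) -> dict:
    """Three competing sources of truth for 'is this a PDF?':
    file extension, HTTP Content-Type header, and magic bytes."""
    return {
        "file_extension": url.lower().split("?")[0].endswith(".pdf"),
        "http_header": "application/pdf" in content_type.lower(),
        "magic_bytes": head.lstrip()[:5] == b"%PDF-",
    }
```

When the three type signals disagree, a corpus builder has to decide which one to trust, which is exactly the problem the JPL study describes.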
parser permissiveness and parser differentials, and making robust parsers, it is apparent that malformed PDFs are required. But with these limitations of internet searches, can we do better at finding all those rusty needles in the haystack that is the internet? At Safedocs, we recognize
that there is data out there that can help. This dataset is the problematic files attached to issue or bug trackers for other PDF parsers. These attachments largely represent unusual and unexpected inputs that have caused issues for those parsers, along with targeted test cases, regression test cases and other examples, which exhibit a far higher occurrence rate of malformations than any corpora gathered via normal means such as internet searching. Internet search engines also do not bother to index into the attachments of these issue trackers. With the assistance of NASA JPL, Apache Sparkler web crawlers were extended to be able to crawl and collate attachment data from the likes of Bugzilla, Jira and GitHub issue repositories. Safedocs has now publicly
released the first version of a new issue tracker PDF corpus, comprising over 16 gigabytes and 20,000 stressful PDF files for the PDF parsers shown here. We have proven that using this corpus against various other open source parsers has given a much higher ROI, meaning discovery of latent issues, than internet searching alone. Each file in this corpus is also named so that the original bug report and associated developer discussions can be quickly identified, hopefully accelerating your ability to fix those latent defects. Based on the Electronic
Frontier Foundation SSL Observatory model, Safedocs is establishing a PDF observatory which aims to focus a microscope on an internet-size corpus of PDFs. The Safedocs PDF Observatory is designed to be an internet-scale, cloud-based system, built on top of Amazon Athena, Apache Tika, Elasticsearch and Kibana, that supports instant queries about the PDF syntactic elements (keys and values, if you will) of the millions of unique PDF files. It stores and indexes the results from multiple parsers and tools, and so it does not solely rely on the behavior of a single technology and its limitations. The current PDF observatory has over 550,000
unique PDFs from Common Crawl, including the re-fetched versions of those truncated PDFs, the Govdocs1 corpus, which is very well studied, as well as various other smaller corpus collections. Based on just this initial data set and a limited tool set, we can conduct some interesting queries that might be used by a PDF development team in deciding what level of permissiveness to implement or malformations to support. Let's have a look at
some examples in action. This is a screenshot of the prototype PDF observatory system. And yes, the current user interface is a little cryptic. In this example, we are doing a case-sensitive search for the incorrectly spelled type key (it should be an uppercase T, not lower case), with results returned in a matter of a few seconds. From the current corpus of
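A crude, stand-alone approximation of this kind of query is sketched below. This is my own illustration, not how the observatory works: the observatory relies on Tika, Elasticsearch and multiple real PDF parsers, whereas a raw, case-sensitive byte scan like this only sees keys in uncompressed objects.

```python
import re

# Case-sensitive pattern for the misspelled /type key (correct is /Type).
# A PDF name ends at whitespace or a delimiter, hence the trailing class.
MISSPELLED_TYPE = re.compile(rb"/type[\s/\[\]<>(%]")

def has_misspelled_type(pdf_bytes: bytes) -> bool:
    """True if the raw bytes contain a lowercase /type key."""
    return MISSPELLED_TYPE.search(pdf_bytes) is not None
```

Scanning a directory of files with a function like this gives a rough hit count comparable in spirit, if not in fidelity, to the observatory query described here.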
approximately 550,000 PDFs, 86 match this criterion. Of course, this may be influenced by the quality of the behind-the-scenes tools we're using, but given the corpus size and the hit-count results, a few PDFs either way will really make no significant difference. I can select different columns to display and even get a quick drill-down into some data: in this case, the creator list determined by Apache Tika. And this shows that about 1/3 of the hits are related to a creator called Iris. With the original file name, I can then drill down further into specific examples to examine each file manually. Selecting a few
Govdocs PDFs at random, a common theme quickly emerged: there is very clearly a recurring issue in link annotation inline URI action dictionaries. If I go back to Kibana, I can select other fields to show that many of these files were created between 2005 and 2008, according to the PDF file metadata. If these PDFs were critical to business, developers could now make an informed decision about whether or not to support this clearly non-compliant malformation for link annotation inline URI action dictionaries. A much more lax and potentially dangerous design decision, with much wider implications, would be supporting a lowercase type key in every PDF dictionary. Please don't do this. Even better, developers
could reach out to the creators of those PDFs to notify them of their clear mistake. But what about something needing deeper investigation? How efficient can we be, remembering that each query and UI change returns to us in just a few seconds? Let's look at Subtype incorrectly spelt with camel case and a capital T. An initial search of the incorrect SubType key has 8,450 hits from the 550,000 PDFs I am using. The UI shows the proprietary
key named para XML, or Powerlink XML, is very common in the result set, as I've highlighted here using my browser's find functionality. I should add, this is not the correct way to specify second-class names in PDF, but putting that to one side: on inspection of a small, random selection of these PDFs, I can see that the misspelt SubType key is actually used inside these proprietary para dictionaries. This situation is not an error, it's just breaking convention. So let's exclude all those PDFs that have both the incorrect
spelling of the Subtype key and also a key that starts with para. We now get 8,365 hits, or 85 fewer, across that corpus, and we also have a wide range of creators and producers. So there is no clearly at-fault technology. Again, we can do manual investigation of a small selection of PDFs. By doing this, we can
quickly say that occurrences appear to be in either the OCG usage dictionary CreatorInfo key value, from Table 100 in ISO 32000 part 2, or the OCG usage dictionary PageElement key, also from Table 100 in ISO 32000 part 2. And based on examination of more samples, the first case appears to have some correlation with the Esri creator metadata, although the second case does not seem to have this correlation. Regardless, what we have learned has been very quick, across a corpus of over half a million PDFs. Now, let's move to
something more insidious, that could make visual differences, the kind of thing that customers report very quickly. Take, for example, the BlackIs1 Boolean key in the optional parameters dictionary for the CCITTFaxDecode filter. Some developers may be aware that this key is sometimes misspelt using a lowercase L instead of an uppercase I. In this case, I've used a different font to try and highlight this visually subtle, but critically important, difference. The Kibana UI does not use this font, so please trust that I typed this correctly. In this case, we have 47 PDFs with the bad spelling and 4,379 hits with the correct spelling, out of the 550,000. So the incorrect spelling occurs roughly 1% of the time, but the key is present in less than 1% of all files in this half-million corpus. Again, there's no clearly
identified creator or producer at fault, based on the metadata. Here are some more examples of what you can discover very quickly with the Safedocs PDF observatory. These are all extensions or deviations described in legacy Adobe PDF specification documents or old blog posts, with hit counts from the 550,000 PDF documents. In particular, the last example is what I have called a Kibanawhack, much like a Googlewhack, where we expect
precisely one example PDF exhibiting this old and
obviously very rare feature. In the future, the Safedocs PDF observatory will hopefully scale up to include many more PDFs from many more sources, including the new issue tracker corpus and those not discoverable via normal internet searching. The Safedocs goal is to achieve 10 to the power of 8, or even 10 to the power of 10, documents. The behind-the-scenes tooling will also be improved to support a broader range of more in-depth technical queries, and the usability and user interface improved. In the mid-term, we hope that the PDF observatory can also be made public. So far we have been
looking at extant data and a PDF observatory which enables us to easily and rapidly gain insights across a very large volume of files. But these insights are informed by our ability to comprehend the specifications that define all aspects of PDF. So let us now turn our attention to the official PDF specification, ISO 32000. The Safedocs approach also needs to understand how formats are formally specified, so work was done to examine the latest ISO 32000 part 2 publication. As you would appreciate, the official PDF specification is a very large document, almost 1,000 pages, written by a committee in imprecise English, with all the nuances, ambiguities, mistakes, et cetera, that that implies. All ISO documents, and many of
the other kinds of standards, for that matter, identify two kinds of references: normative references, and bibliographic or informative references. Bibliographic references provide general background information and do not introduce technical requirements. PDF 2.0 makes direct normative reference to 80 other technical publications, ranging from other ISO, IEC and Ecma standards, to RFCs and W3C recommendations, Adobe technical notes, et cetera. A normative reference means that some or all of the requirements in those referenced documents are also required to understand PDF; they are, if you like, inherited by PDF. Each normatively referenced
document can be a highly specific dated version, an undated reference to the latest edition of a document, or even a family reference to a whole collection of documents. If we then drill into every normative reference of each of these documents, we can start to see a complex tree of technical specifications that fully define everything related to PDF. This is a screenshot
of an interactive 3D virtual reality visualization that is being prepared for Safedocs, illustrating the more than 600 normative references on which PDF 2.0 depends. This is based on a structured database of every normative reference that is directly or indirectly referenced by PDF 2.0. It clearly highlights
the complexity of PDF and the difficulty in harmonizing and aligning requirements across so many different documents. The color represents the publisher or standards development organization, and the lines show dependencies, with ISO 32000 being at the center of this picture. It is clearly a very complex spider web of interdependencies that spans many organizations and committees. A single normative reference can be referenced multiple times, which is definitely not a bad thing, as it provides consistency. Of more concern is where different versions of the same document are normatively referenced, which creates dreaded ambiguity when implementing. You can clearly see here that Unicode 2.0, 3.0, 3.2, 4.0 and 11.0 are all mentioned in these orange boxes. This indicates a
potential technical issue to be investigated and resolved. And if you just grab and go with a third-party Unicode support library, how can you be sure you have correctly supported the correct Unicode version in all the correct places? But what if there was a machine-readable definition of PDF that was directly derived from the specification documents? The benefits and applications of a machine-readable definition of PDF are wide, ranging from determining differences with extant data, and thus identifying the de facto file specification, to file validation, parser code generation, API generation, test case creation, establishing a ground truth for machine learning applications and, from a Safedocs perspective, formal reasoning about the PDF syntax using theorem provers. The idea of a machine-readable
definition is not new. For example, back at the PDF Association Technical Conference in June 2013, there was a presentation called "Validating PDF, DVA and Beyond" which discussed two technical initiatives and an internal ISO ad hoc committee that had been formed at that time. The technical initiatives included the Adobe dictionary validation agent, or DVA, that is still being used today by the Preflight syntax check feature in Adobe Acrobat, and a levigo custom Xtext grammar. I recommend you watch this presentation on YouTube, as it is as relevant today as it was more than seven years ago. In the interest of time, I won't delve into all
the possible benefits of having a formal, representational, machine-readable definition of PDF. However, in the intervening years, the benefits and use cases have not changed much, but not much has progressed either, until now. Code-named the Arlington PDF DOM, Safedocs has created a specification-derived, machine-readable definition of the PDF 2.0 object model, based on the very latest ISO 32000 part 2 2020 document, the primary data source being all the tables in the PDF 2.0 spec. It is neutral against all implementations and applicable to both parsers and unparsers. We have encoded just
the object model, which is the bulk of the 32000 document, i.e. all the dictionaries and all the arrays. We have not encoded the PDF COS syntax lexical conventions nor PDF content streams, although this may be future work. We view the specification-derived PDF DOM as a written-very-rarely-by-very-few, read-very-often-by-very-many grammar, so we have made all design decisions to ensure an extremely low bar for adoption and usage by industry. We did some experiments
with DSL systems, domain-specific language systems such as Xtext, but these created a heavy burden for usage and forced specific tooling. However, we may still move to such a system in the future. Currently, the Safedocs PDF DOM is expressed in tab-separated (TSV) files, with one file per distinct PDF object, named appropriately. TSV is also natively supported by GitHub to enable easy online visualization. The columns in each object file list all the keys
or array entries, the permitted types for the value, the PDF version when introduced, the PDF version when deprecated if appropriate, whether the key is required or not, requirements for always being a direct object or always being an indirect reference, the default value if one exists, the sets of possible values, and linkages to other PDF objects. We have also invented our own
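Since each object is just a fixed-layout TSV file, consuming one needs nothing beyond a TSV reader. Here is a minimal sketch; the column names are my paraphrase of the description above, not the published Arlington headers, and the sample row is invented for illustration.

```python
import csv
import io

# Column layout paraphrased from the talk; not the official Arlington names.
COLUMNS = ["Key", "Type", "SinceVersion", "DeprecatedIn", "Required",
           "IndirectRef", "DefaultValue", "PossibleValues", "Link"]

# Invented example row for a hypothetical dictionary's /Type key.
SAMPLE_TSV = "Type\tname\t1.0\t\tTRUE\tFALSE\t\t[Catalog]\t\n"

def load_object(tsv_text):
    """Parse one Arlington-style object file into a list of key records."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    return [dict(zip(COLUMNS, row)) for row in reader]

rows = load_object(SAMPLE_TSV)
```

This low bar for tooling (a standard-library TSV reader suffices) is exactly the design goal mentioned above.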
declarative internal grammar to express more complex inter- and intra-object relationships. By way of verifying
this Arlington DOM, we've done a comparison with the Adobe DVA grammar and have reported issues back to Adobe. We have also developed various proof-of-concept applications in Python, C++ and Java, including the ability to check extant data files against the relationships we have captured, internal validation of our own grammar, checking for typos in the data, and conversion to XML and JSON equivalents. As a result of this methodical work, a number of corrections and clarifications have also already been made to the latest PDF 2.0 standard, which is of benefit to everyone, even if you never use the Arlington DOM. Let's begin with a quick look at the files in
the Arlington DOM. This is a screen recording of a Linux command prompt. As you can see, there are 505 individual TSV files, all appropriately named, and every file has the same fixed column layout. Using the Linux cut command, which extracts a specific field from each TSV file, we can get a list of all the unique data types defined in the Arlington DOM. The first column shows the key names and the second column the types. But we also know that many keys and array elements permit combinations of data types, and here is the comprehensive list of all defined combinations. We use a semicolon-separated, alphabetically sorted list of those basic types. If we want to know
all the conditions for when keys are required, this is also a simple Linux cut command on the 5th column. This also shows our internal custom declarative grammar, which I believe will be understandable by everyone, regardless of whether you develop a PDF writer (an unparser, in Safedocs terminology) or a PDF parser. If we repeat the same command on the 6th column, then we get the requirements for when keys must be indirect references or direct objects. Note that we use the spreadsheet convention of uppercase FALSE and TRUE, so as not to be confused with the PDF lowercase true and false keywords. Let's drill a bit
deeper on which keys must always be direct objects. These keys are related to the file trailer, developer extensions and cross-reference streams, and you can see here also the column of data. We can also draw data for keys which must be direct objects, but only under specific conditions. All our internal declarative functions are prefixed with the fn: prefix. In this case, the answer is file trailer and cross-reference stream keys when the PDF is encrypted. And thus we can also easily find out the full set of declarative
functions we have defined. There's still some work to do in ensuring that we have captured all the inter- and intra-object relationships in the Arlington DOM. In the coming months, we hope to make the Arlington DOM more widely available via GitHub, once we are confident that the core PDF DOM is correct and we have improved our documentation, obviously. And in no way does this replace the need to have the latest PDF 2.0 ISO 32000 part 2 spec beside each and every developer. Thank you for listening to
this deep dive into Safedocs. I will now take some questions. Thank you. - Okay. Thank you, Peter. We'll now take questions
and the chat pod is open or the question pod. Please put your questions... Ask your questions
for Peter there. One question I already
see here, Peter, is you've identified some issues with normative
references for PDFs. How can this be fixed? - It's very complicated to fix that as everyone
would appreciate, with so many standards
bodies involved. Our first step has been to
propose an automated system for ISO project leaders, at least to be informed when
standards they reference get changed on them. In ISO, this is
currently a manual step that requires every project
leader of every document to manually check
every reference every time they look at it. And as you would appreciate,
that is a huge burden. So myself and Duff Johnson, as project leaders of ISO 32000, have proposed a method to improve this. And we hope that maybe this
will be more widely adopted into other standards bodies to at least make the
visibility of these changes more simple to manage. Obviously, the actual management
of the standards themselves is just a complex
technology problem. And I think further
research can be done in Safedocs on that topic. - All right. Thank you -- Go ahead, Peter. - I was just gonna say, I
should add to everyone that my deck is available
in the handout section in the GoToMeeting, sorry, GoToWebinar panel too. - So when will some of these
tools become available? - I thought somebody
might ask those questions. So I'll break this
down into two parts. So the first part
is the Kibana system, or the PDF Observatory system. Now, that's currently hosted by the NASA JPL team
part of Safedocs and we have a discussion
with them next week about how we might
bring this to market, how we might host it and how we might manage
it into the future. It is built on a set of
open source technologies. It runs in Amazon Cloud. So I'm hoping that
we can bring this in a relatively short
amount of time forward. And as I did say in my talk, we are also planning
on expanding both
the corpora side and the tooling side and hopefully improve
the user interface. And as you probably
appreciate it just from the
screenshots you saw, there will be a need for
some documentation as well, just to assist people new
to Kibana, to help them. But I would hope it'd
be a couple of months or maybe early in the new year we can bring something to
industry at least to try. Now, the Arlington DOM,
that's a separate thing that is actually developed
as open-source code under the Apache 2 License. At the moment, it's only been
validated with a few eyeballs and I think it
would be really good if we can get some PDF
Association members to maybe begin to trial the DOM. I will admit that we're human as much as the people that
write the PDF spec are human. So I fully expect that
we have made mistakes and it's really a matter
of having a few more people look and make sure
that we can fix those. That is something
relatively easy, I think, that we can
make available quickly. I propose that
we'll talk about it in the next PDF
Association board meeting and then look at a way
to bring it to industry. It is currently
hosted in GitHub. And as I mentioned in my talk, that was one of the design
decisions to support TSV files so that even if you
don't want to run code or anything like that, you can still navigate it
quite simply in GitHub. Again, I hope that would
be by the end of the year. - So Peter, you've prepared
a poll for this webinar. Would you like me to launch it? - Oh yes please. Thanks, Duff. - Poll is up. Go ahead. - I was just gonna say we're
very interested in hearing what industry would like
where Safedocs might focus. So obviously expanding
the issue tracker corpus, we've already had some
feedback on that corpus and already... For example, Patrick
mentioned it in his talk earlier this week and I know from other people that we're getting quite
good results with that ROI, in other words, finding those latent defects
in existing technologies. So whether we
should expand that, making the observatory
publicly available as I discussed in
my presentation and obviously a public release of the machine
readable Arlington DOM. I should also add
that the Arlington DOM is likely just like the
observatory to evolve over time as we define more
internal functions in our internal grammar, expressing more relationships. So this is not just
a completed activity. This is just early work that
I believe is useful to industry. And we'll be certainly
seeking feedback from any early
adopters out there. - Let's have a look
at the results. - Of the observatory, okay. That's great. So maybe as we went through
that, we might do another... I'll do another
presentation maybe and do a deep dive
on how to use it. It is a little bit complicated
as I said at the moment, but it is just a
prototype system and certainly we'll work
with the experts at NASA JPL to hopefully improve
the experience, so that's great. I'm very conscious of time, Duff, so thank you everyone
for listening and if you do have
any questions... Also, my email is
at the end of the deck. So please do directly
reach out to me and I'll try and answer any
questions that you might have. (upbeat music)