(inspirational music) - So good morning everybody. During this session, I'm looking forward to sharing our experiences with the sorts of things you can do with vector graphics that are contained
inside of PDF documents. Before I get started, I just wanna pick up on a couple of themes that I've sort of picked up on over the last couple of days. First was David Blatner, who talked about kerning and nuance. Being in the business of graphics, we care a lot about how things look, and our customers tend
to care about that too. They put a lot of work
into developing their PDF, and they want it to look exactly the same when it's going into another format or displaying on another system. And you become, once you
start noticing good design, you become more sensitive to it. Dove talked about the mindless mantra of where PDF goes to die. And while I agree
wholeheartedly with his approach of attaching data to PDF files, sometimes that just simply isn't possible. Sometimes, it's interesting
to pull meaningful information out of PDF. It's not always scraping,
which sounds kinda negative. Sometimes, it's targeted
extraction with a purpose, and sometimes, we don't need the data, we just need a visual. And that's what we're
trying to mine from a PDF. So a lot of the discussions here have been about getting organizations to adopt more of the PDF/A standard. But we also have to
respect the number of PDFs that exist today. It's an overwhelming number. As Kenny Swope said, part of
the digital transformation is looking at that content
that we already have and trying to leverage
more information from it, and that's basically what
a lot of my presentation is going to be about. Okay, my name is Jean Haney. I'm the cofounder and
CEO of Visual Integrity. Jan Homan is my partner. And he and I started the
company together in 1993. It was the same year that Adobe first released Acrobat and PDF. Up until that point, we had already built up a good deal of experience in developing vector-drawing applications, as well as converting vector
graphics into PostScript, converting the vector
graphics from PostScript and EPS files into vector formats like WMF, SVG, CGM and
MIF for tech pub systems. Back in those days, a lot of content was being created in engineering companies on Sun and other workstations. And they mostly output PostScript, but the tech pubs people needed to use EPS or MIF or another format
in their technical manuals. So that's sort of where we got our start. And between the two of us, we have and I really
hate to admit to this, almost 70 years of experience in these vector graphics technologies, products, customers, and workflows. So I know that you're a
technically astute audience, so I'm not gonna spend a lot of time talking about the
difference between raster and vector graphics but
I am gonna set the stage for anybody who may not be in this world. So then, we're gonna
talk a little bit about how end users can use the graphics in PDF today, how integrators are using them to automate workflow solutions, and how developers can access and work with vector PDF information in their applications at both the file and the object level. Okay, so first, vector versus raster. The first thing we ask ourselves is, what is visual information? And visual information
is all of the things you see in this chart. Let me just find my place here. You've all heard the adage that a picture is worth a thousand words. And when we think of pictures, we probably think of
both photos and line art. The average user might never stop to consider that these are by nature, very different types of graphics. Photos are made up of pixels, while illustrations are made up of objects derived from data. Pixels are just dots on a screen or page devoid of any information about what they collectively represent. On the other hand, illustrations
are rendered from data. Every line, character,
object, group, and layer contains rich amounts of information that could be extracted
to further understand or drive your business. This is sort of in line with what Kenny Swope was saying in his presentation this morning about how you can have a huge locker of information in PDFs in your business, but the digital transformation is sort of trying to figure out how to mine those documents for even more information, when you can see them digitally versus just on paper. So this slide summarizes
many of the types of visual information which contain data. These are vector graphics, and that's what we're
gonna focus on today. So scanned documents are raster documents. They are raster PDFs. Even though they look like drawings, act like drawings and say
drawing in neon on them, they are still bitmap images. They're like photocopies
or photos or photographs. They're a snap from a moment in time, and there's no intelligence
in them at all. You can't leverage what is lost. So there's not much we
can do with scanned files. Computer-generated documents are primarily vector PDF files. But remember, garbage in, garbage out. When a lot of people create PDFs, they don't give so much
attention to the graphics. Even if it was computer-generated and was a vector, they might take a screen-snap of it, or a lot of the conversion engines out there will just convert all of the graphics to bitmaps. And that information is lost. So what we're advocating for is to try and keep as much of the vector
in your file as possible. So why do I say primarily? Because in a PDF file, the fonts, the graphs and charts, the call-outs, the block quotes, the logos, drawings, schematics, tables, that's
all vector information, and there's intelligence in there. The raster information
would be the photographs, the screen-snaps, the original art that might be in there. So basically, a way to look at it is that rasters come from
paper grabs and scanners while vectors come from any kind of illustration package,
technical drawing package, and original line art. Okay, so maximize your vector content. This image right here is basically just showing the difference between a bitmap and a vector. And I'm sure that you're
all familiar with that. The great thing about vector graphics is that they're clear and crisp no matter how large or small they are. And this is a really important thing for responsive design on the web, for display on all different sizes of devices. Vectors are device independent, resolution independent, simple to edit, searchable, actionable,
intelligent and precise. Okay, so what I'm gonna
do now is switch gears a little bit and talk
about actual examples so that you can see the
way people right now are looking to use the graphics
in their PDF documents. And these are all
examples of real customers or real products that exist today that people are using to do this. We'll start with PowerPoint. And in this example, what we started with was the World Wildlife Fund's 2017 annual report
downloaded from the web. And let's imagine that we're a local organizing group and we want to try and get contributions for the WWF. So what happens is we go to the document, and we go to the first page, we open it up in PowerPoint,
we grab the logo. And then we're gonna grab a graphic, which talks about how important
individual contributions are to their program. And when we go through that, you see, it's up here on the left. I knew it was gonna be pretty hard to see. You see the original. This is page 30 of the annual report. That's the number two slide. And you see that we've grabbed the top graphic and the bottom graphic, and we've brought them over into the slide, and we've enlarged them, and we've changed the color of the two areas we're interested in. One is how much it relies on individual contributions, to show people how important their donations are. So basically, you can take an ordinary corporate document with
very high quality graphics, and you can grab those and
use them for a purpose, where you adapt them a little bit. And I have an example of how it works here. So we'll just look at that. Hopefully, this starts on its own. If not, I'll hit this. That's not good. That was supposed to be a
kind of video demonstration. I don't know how else
to get it to work here. I guess we should've tested that part. Okay, so I'm just gonna
go back to that slide for a second and just
manually walk you through what we did. So basically, we took those graphics, we grabbed the, you see the
logo up in the very corner of the home page, the
front page of the document. And so we grabbed that. And in grabbing it, we then grouped it to make sure that it would scale properly. And then we were able to enlarge it. We could annotate it. We grabbed these parts of the slide, and we were able to do a lot but we were also able to really leverage those graphics that were
professionally designed in a way that was very powerful. I can show anybody who wants to see the video demonstration
on my laptop after this if you want to take a look at it. I don't know why it didn't
work here in this PDF. So then we'll move on to the next product, which is Visio. So a lot of people will receive
Visio floor plans, layouts, diagrams, and they'll receive them from maybe a floor plan from
somebody in the company. It was done a year ago. They wanna be able to modify a few things, move a few chairs and tables around. What you can do with a PDF is open it as an editable Visio diagram. You can specify which page you wanna open. You can scale to a drawing or to a page size. Remember that PDF is a page format. Things like CAD and Visio are more spatially oriented. They're based on scale. And when the PDF is created, all of the dimensions are lost. You no longer really know what the measurements are. You just have a drawing that's scaled to a page.
You can emulate PDF cropping. You can rotate, and you have editable text with kerning and font substitution. And a lot of these features are ones that have been developed to compensate for differences between PDF and the target formats. So for example, you might have Bezier curves in PDF, but then a format like WMF doesn't support Bezier curves. So we have to create polylines. So there's a lot of work that's gone into compensating for the differences in formats.
You can do things like recognize objects. A circle or an ellipse is a collection of Bezier curves and arcs. And we can look at it and say, hey, the way that's set up, and the way that they're connected, it looks like a circle, acts like a circle, let's make it a circle. And chances are it's a circle. And if it's not, then the end user can look at that in the end and modify whatever they need to. As far as AutoCAD goes,
there are a lot of solutions out there already for opening and editing PDF drawings in CAD. Autodesk themselves included
a feature in AutoCAD, which allows you to open a PDF file. You can specify the scaling factor again, same thing with Visio. When you have a PDF on
paper, it's a page size. So when you bring it into CAD, you have to be able to add the scale or add the dimensions. And hopefully, you know that from the legend on the drawing. You can see what one DXF unit equals in terms of feet or inches or meters. You can extract drawings
from multi-page PDF files. You can use the layers in PDF to combine, merge, create
separate files for each layer. More features that compensate for format differences are being able to render paths as polylines or polygons, and being able to recognize dashed lines, which the PDF may just store as a bunch of very small dashes. And technology like this can go in and say, hey, the way that's behaving, we think that's a line segment, and turn it back into a line segment. Adapting line widths, ignoring paths, text, images, keeping
or removing 3D effects, and adjusting for incompatibilities. And also layers, which I talked about. And you can do this with
external programs, with plugins. Many of the CAD programs are starting to add it either themselves
or through OEMing technology. So in the CAD world,
it's become very common to be able to open and edit PDF. This is an example of what
the drawing looks like once it's been imported into AutoCAD. You can see that everything
looks the way it should. One of the gotchas here is the text because in AutoCAD, CAD
programs support TrueType and OpenType text now. But they also have
something called SHX text, and that's a type of
text which is shape text. And it's basically plotted text. So any letter is made up
of a bunch of pen strokes. And at the moment, there aren't really any good ways to recognize that text with any kind of OCR or anything. And there are companies, including us, who are working on developing that. Autodesk has a simple feature for that in AutoCAD, but it requires you to know what SHX text was used, which font was used. It requires you to know a lot in order for it to review that text and say, here's what we think it says. And I did a lot of tests with it on our test files, which we've accumulated over many years, and it rarely got it right. So it's a really difficult
thing to do well. So the next thing we're gonna talk about are servers and APIs. So these would be workflow solutions. This is an example of
companies we've worked with and the ways that they
have implemented solutions that focus a lot on the
graphics content in PDF files. I'm just gonna scroll
along here for a second to get to my point. Okay, so they've implemented
some type of PDF document process automation or service related to vector graphics in some way. They fall into a few categories. One would be content management, another technical publishing, and then workflow in general. So if we look at the CMS and publishing examples first, on the left, you have SDL. They are a tech pubs company. They create products; for example, Merck uses their software for their FDA submissions, which are very long, involved documents. And they wanted to be able to make sure that the graphics from PDF could be inserted as EPS to neutralize font issues, because there are a lot of issues around fonts, as we've all heard. And when you have these
complicated equations for pharmaceutical compounds, it's very easy for that
to get off a little bit. And when that happens, the
whole submission falls apart because that compound is what the whole submission is about. So we find ways to do things like render the equation
fonts as Bezier curves. So it looks exactly right, no matter what happens to it after that. So in that case, it's been reduced to, the data has been taken
out of that equation. You wouldn't be able to open it up again in MathLab or whatever you're using. But you do have an exact representation that can go through your
document system now, and you never have to worry about it losing its meaning. So there is that example, where you ensure your visual representations. Another example in publishing is ABB, who convert PDF, as well as EPS and AI, which are related kinds of formats, into WMF print streams in their LinkOne publishing and viewing system to produce parts catalogs. NXP Semiconductors and
its spin-off, Nexperia, convert PDF into SVG and EPSF for their tech pub system. And Bosch does something similar, going from PDF to EPS and SVG. A lot of companies are interested in going from the graphics in PDF to SVG because, on the web, that's the standard vector graphics format for HTML, and they're trying to get their systems to be more browser based. The other companies on
this slide are doing volume automation of reports, statements, and direct mail. When converting PDF to
WMF, EMF print streams, they merge graphics and
text from the PDF file. The text's just pulled
from the PDF as ASCII and then placed precisely on the WMF form, which may be a statement or a check. Change Healthcare, who
we all know is WebMD, they convert 10,000-page
plus PDF documents into automated WMF, EMF print streams as a service business for
their healthcare providers. And then Fork, they also
extract selected ASCII text from the documents to populate their data management system. First Hawaiian Bank and Bank of the West have similar operations,
where they process 10K plus page documents, their PDF mainframe reports to extract and present text within a fixed 132-character-per-line constraint. And then they merge their logo with it and regenerate all of that as a PDF. So sometimes it's PDF in, PDF out. But you're doing a lot to
the PDF to add value to it in that process by using the
value of the object data. Examples of workflow would be Reuters, who's converting PDF-based Oracle reports and casting them to the browser as SVG. And Lufthansa also does this by converting flight navigation maps from PDF to SVG for web access. So that sort of wraps up
these command line examples. And then, what we'll do next is move to examples of where developers can use an API to do
a lot of these things. First off is PDF conversion. And with two, just two API calls, a conversion solution can
be implemented in half a day in an application to allow it
to open and edit PDF files. It's becoming a checklist
item to add support for PDF import, edit in
commercial applications. And conversion is not
always straightforward. In order for it to be done well, there has to be
compensation for differences between PDF and the target format, as I mentioned before. Things like PDF cropping,
line type definitions, layers and fonts. By mapping your format to PDF through use of an API,
you can also add features like snap to an underlay or viewing to your software service. Matching PDF to another format is never a one-to-one exercise. So these compensations
become very important. Examples here would be Grayburg, which adds PDF import to their application, whereas DraftSight creates a plugin to also be able to get a little bit of revenue for that functionality from their customers. And then for laser-cutting machines, they expect DXF files from their customers, but not everybody can create a DXF file. So by allowing the
customers to give them PDF, they expand their base
for the kinds of documents that they can consume and then cut with their laser-cutting machines. So that's a big area,
where PDF is becoming an import format for
those sorts of machines. As far as PDF creation APIs go, there are a lot of them out there between open source and
commercial solutions. And it's pretty much your choice. But when you have vector
graphic intensive documents, you need to pay attention
to get WYSIWYG accuracy. There are APIs to create PDF from scratch, from data, text, print streams, PostScript, EPS, and PDF, and APIs that allow you
to modify that content. And we've heard some of those
already at this conference. Being able to merge,
combine, stamp and watermark. The example I wanna tell you about here is the National Bank of Norway. They feed their economists' PS, PDF, and EPS charts and graphics from various analytical tools that they use into
their production system to produce PDFs for their
presentations and reports, EMF for MS Office charting, and PNG or SVG for the web. So they're basically creating a graphics-brokering system, where they can send,
convert and then send off graphics to multiple
different systems to be used, kinda like we started out
with the World Wildlife logo. Like it would be great if
that existed in one place, and everybody can use it. But what happens today
is people just sort of take a screen-snap of a logo. They stick it in something. It gets all skewed or
resized, and it looks awful. So it's a way to kind of ensure quality. Okay, so with the Bank of Norway, they also add their logo and page numbers to the document, after they do all of that work. We're gonna have to finish up pretty quickly here. I knew I had a lot to go through. PDF Object Access API. This is a really interesting area of development. It gives you a deep level of control for preprocessing PDF files. You can extract or search
for objects on a page. You could create your
own conversion engine with no intermediate
step or a print driver, which gives you better
performance and better quality. You can find text object strings and then pass them to something. You could perform
operations on object data, like find it and replace it. You could delete object data. You could change attributes
of the object data. And you can implement
features like snapping. Two customers that use this type of API that I know of are the
Open Design Alliance, which manages the DWG format outside of AutoCAD. And they used it to create a PDF underlay for the format, and to add clipping and snapping to their CAD platform. GstarCAD mapped their internal format to the Open Design Alliance's object database so that they could then use the objects within it; they could import PDF, and they could export PDF. So we can go and look a
little bit at the future. I'm just gonna catch
up here with my notes. I'm gonna show you a couple
of short-term projects, which we view as important. And I'll also talk briefly about some game-changing possibilities. What sort of short-term
challenges are there to improve vector extraction? They would be, first would
be plotted text recognition. This is what I talked about with the SHX fonts and AutoCAD. You can have PDF files that have this plotted text in them. And right now, there's no way to do OCR on that. And there are a couple of reasons why. One is because it's really hard to know the difference, when fonts get involved, between, like, an O and a zero. How do you tell the difference between them? How do you know which one it is? And then, also for example, this Q, that's actually a compound object. It's an ellipse and it's a straight line. And it takes two objects to make a Q. So there are a lot of issues in that. So we're working on that. Being able to recognize hatch fills. When you open up a PDF
file in a CAD program, and there's something that
has a hatch fill in it, it doesn't see it as a
circle with a pattern. It sees it as a circle with 20 lines in the middle of it. And this explodes the file
and makes it very large. So being able to see that
hatch fill as a hatch fill, and replace it as a pattern, that's an important
advancement to be able to make. And then, we also talked about fonts and how fonts are difficult. I mean, not only are there font-mapping issues, but even within fonts, if you use Windows 7, you probably have Arial MT on your system. And if you use Windows 10, you probably have Arial on your system. Those are exactly the same font. But does the PDF know that? No. So when you have a PDF with Arial MT, and you send it to somebody who has regular Arial, it won't know that that's the same Arial. So there are things to do to make up for it, like font variant name recognition: recognizing the names that are in the PDF files. And then, we also have to look at how graphics fit into the PDF Association and the ISO standards. Because now, the focus, as
we see at this conference, is a lot on accessibility. It's maybe not as much yet on the idea of using the PDFs that
exist to help people transform their digital businesses to find the intelligence that exists in their organization today, and we can do that a lot through this. So the last thing real quick, I know that my time is up, is the long-term goal we have is we're working on something called Vector Search in Action. And in this case, this is part of the
digital transformation. Organizations have a wealth of information locked inside their
PDF-based visual content. And mining it can lead to
better business intelligence, streamlined workflows,
and result ultimately in an increased competitive advantage. The solution for this is something we called Search in Action. And it needs a platform or API that enables the enterprise
to use the intelligence stored in the visual content through facilities for retrieval, viewing, discreet Search in Action, dynamic modifications, analysis, publishing and more to find, correlate, analyze and publish business results. We've created a prototype, which we developed with grants from the Netherlands
government Innovation Funds. We do our development in Holland. This has two parts to it. The first is compound object recognition, which is where you can pinpoint specific objects within files. A simple object would be something like a circle, but a compound object would be like a CAD part. And you would be able to search a parts catalog for that CAD part. And you would be able to find other CAD parts that look like it, both at that size and scale, and at any rotation or new orientation. You would really be able to start looking into the graphics or files. And in order to operate this system, you need to have an object query language. And so these are the two parts. It allows for a lot of dynamic functions within the graphics and PDF. It allows data-driven graphics, merging of files, stamping watermarks. That's it, that is it. And so if you have questions, I'm around. You can ask me. If you wanna see the way that PDF can be used in PowerPoint, that little video I was
gonna show, let me know. And if you wanna contact me after, there is my information. Thanks a lot. (audience applauding) (inspirational music)
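
As an illustration of one compensation mentioned in the talk (PDF has Bezier curves, a format like WMF does not, so polylines have to be created), here is a minimal Python sketch of flattening one cubic Bezier segment into a polyline. It is only a sketch under stated assumptions, not the speaker's implementation: the function name `flatten_cubic_bezier`, the fixed `segments` count, and the sample control points are all hypothetical, and a production converter would more likely pick the number of segments from a flatness tolerance.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def flatten_cubic_bezier(
    p0: Point, p1: Point, p2: Point, p3: Point, segments: int = 16
) -> List[Point]:
    """Approximate a cubic Bezier curve with a polyline.

    p0 and p3 are the end points, p1 and p2 the control points.
    The curve is sampled at evenly spaced parameter values, which is
    the simplest possible strategy (an assumption of this sketch).
    """
    if segments < 1:
        raise ValueError("segments must be at least 1")

    points: List[Point] = []
    for i in range(segments + 1):
        t = i / segments
        mt = 1.0 - t
        # Bernstein basis weights for a cubic curve.
        a = mt * mt * mt
        b = 3.0 * mt * mt * t
        c = 3.0 * mt * t * t
        d = t * t * t
        x = a * p0[0] + b * p1[0] + c * p2[0] + d * p3[0]
        y = a * p0[1] + b * p1[1] + c * p2[1] + d * p3[1]
        points.append((x, y))
    return points


# Hypothetical input: a quarter of a unit circle, which PDF producers
# commonly draw as one cubic Bezier using this well-known constant.
K = 0.5522847498
quarter_circle = flatten_cubic_bezier(
    (1.0, 0.0), (1.0, K), (K, 1.0), (0.0, 1.0), segments=8
)
for x, y in quarter_circle:
    print(f"{x:.4f} {y:.4f}")
```

The same sampled points also hint at the reverse step described in the talk: if every point on a closed set of connected curves sits at nearly the same distance from a common center, a converter could reasonably guess that the shape is a circle and emit one.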