Doc and PDF to HTML
From Open NZ Wiki
The "Doc and PDF to HTML" topic was led by Matthew Holloway, with contributions from Stuart Yeates. David Earle also proposed a "beyond HTML" topic, which there wasn't enough time for. That topic is also outlined below.
PDF and Docs to HTML
Introduction
The E-government Web Standards around HTML and other associated standards have been successful, but there's a lot of internal information in Office files and PDFs that should be published too as part of the open data approach. This topic is about how to design a publishing cycle that uses Office and PDF documents as source documents.
Overview
Existing approaches typically fail to apply traditional software principles to publishing processes, particularly around abstraction layers. There's an analogy to be made between modular software with abstraction layers and software that foolishly combines the application logic, GUI, and data together. Publishing processes should be modular with abstraction layers, not a single-stage conversion from Office to HTML.
File formats should be considered as abstraction layers themselves. Certain formats are good at expressing structure, others are good at expressing style. Directly converting from Office to HTML will always result in poor conversion, but a conversion through Office → OpenDocument → DocBook → HTML will be cleaner, more robust and more flexible.
A chain of conversions is called a Pipeline (like a Unix Pipeline), or a publishing cycle.
As in software, file format conversion should be separated into processes that deal distinctly with document structure, presentation, and data consolidation, rather than trying to do too much at once.
Further, multiple data sources can be converted into a single format in order to allow easier reuse through subsequent conversion. Converge your data sources to DocBook and then reuse the same conversion pipeline. This is comparable to having multiple databases and yet abstracting their differences away through a database abstraction layer.
Like in all software, the appropriate level of abstraction and length of the pipeline will vary based on your needs and the quality of your source documents.
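As a rough illustration of the idea, the sketch below chains conversion stages the way a Unix pipeline does and converges two kinds of source on DocBook before a shared final stage. The stage functions are hypothetical placeholders that only tag the text; real stages would call actual converters.

```python
from functools import reduce
from typing import Callable, Dict, List

# A stage takes a document (here just a string) and returns the converted form.
Stage = Callable[[str], str]


def make_pipeline(stages: List[Stage]) -> Stage:
    """Chain stages so each one's output feeds the next, like a Unix pipe."""
    def run(document: str) -> str:
        return reduce(lambda doc, stage: stage(doc), stages, document)
    return run


# Hypothetical placeholder stages: each wraps the text so the flow is visible.
def office_to_opendocument(doc: str) -> str:
    return f"odt({doc})"


def pdf_to_opendocument(doc: str) -> str:
    return f"odt({doc})"


def opendocument_to_docbook(doc: str) -> str:
    return f"docbook({doc})"


def docbook_to_html(doc: str) -> str:
    return f"html({doc})"


# Different sources converge on DocBook...
TO_DOCBOOK: Dict[str, Stage] = {
    "office": make_pipeline([office_to_opendocument, opendocument_to_docbook]),
    "pdf": make_pipeline([pdf_to_opendocument, opendocument_to_docbook]),
}


def publish(source_kind: str, document: str) -> str:
    """...and then share the same DocBook-to-HTML stage."""
    return docbook_to_html(TO_DOCBOOK[source_kind](document))
```

Adding a new output format then means writing one new stage that starts from DocBook, rather than one converter per source format.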
The Pipeline
A typical pipeline might look like:
- Source Office document, which is converted to... ↓
- OpenDocument, which is converted to... ↓
- DocBook, which is converted to... ↓
- Plain HTML, which is converted to... ↓
- Themed HTML.
However, the advantage of multiple stages is that you can converge on (for example) DocBook and then branch out to alternative output formats such as PDF or RSS. For example,
- Source Office Document ↓
- OpenDocument ↓
- DocBook ↓
- Plain Print layout (LaTeX or XSL-FO) ↓
- Themed PDF.
or
- Source Office Document ↓
- OpenDocument ↓
- DocBook ↓
- RSS
Other examples of output formats can be found at Pilferpage.
Dealing with Bad Source documents
When it comes to PDFs and Office documents, it's often the case that structure has been discarded. The file might just be a graphic of the document, in which case you'd need OCR or retyping. Even if it contains genuine text, genuine headings often don't exist, only variations in font size and font weight.
If the original author could fix the source document this would be best, but maybe they can't. If this is the case then you've got to work with what you've got.
Programmatic fixing of documents
Deriving structure where none exists is always problematic, but although you'll never restore a document perfectly, you will likely be able to improve it significantly. Like Google, you'll be approaching a mass of poorly structured documents and trying to derive structure and relationships.
If you need to OCR the document then do that. Next you'll need to convert your PDFs or Office files into a format that's easy to deal with programmatically. Dealing with raw PDF data isn't conducive to good programming, nor is the binary DOC format, so I suggest converting to OpenDocument by using OpenOffice.org or AbiWord. You can run OpenOffice in server mode and stream documents to it for conversion, and it will try to convert your PDF faux-tables to real tables.
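A minimal sketch of scripting this conversion step follows. Instead of the server mode described above, it uses a one-shot headless command line, and it assumes a LibreOffice-style `soffice` binary on the PATH that accepts `--headless`, `--convert-to` and `--outdir`; older OpenOffice.org builds may not support these options. It is aimed at Office files such as .doc; PDF input generally needs extra import-filter options that are not shown. The function names are illustrative only.

```python
import subprocess
from pathlib import Path
from typing import List


def build_convert_command(source: Path, out_dir: Path,
                          target_format: str = "odt") -> List[str]:
    """Build the command line for a one-shot headless conversion.

    Assumes a LibreOffice-style `soffice` binary is on the PATH.
    """
    return [
        "soffice",
        "--headless",
        "--convert-to", target_format,
        "--outdir", str(out_dir),
        str(source),
    ]


def convert(source: Path, out_dir: Path, target_format: str = "odt") -> Path:
    """Run the conversion and return the path where the output is expected."""
    subprocess.run(build_convert_command(source, out_dir, target_format),
                   check=True)
    return out_dir / f"{source.stem}.{target_format}"
```

Keeping command construction separate from execution makes it easy to log or dry-run a batch before converting anything.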
You can read about the heuristics that AbiWord uses for PDF import and that OpenOffice.org uses for PDF import.
Once it's in OpenDocument it's time to clean the file up. For example:
- If the document has an internal timestamp before 2000 and the font name contains the word "maori" or "māori", you could assume that it's a Māori font and that umlauts should be converted to macrons.
- Build up a list of font-sizes and weights, and use this to derive headings.
- Expand footnotes that contain hyperlinks to inline links, and so on.
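The first heuristic in the list above could be sketched as follows. The function names are hypothetical, and how you obtain the font name and the document's internal timestamp depends on your OpenDocument tooling; here they are simply passed in.

```python
import unicodedata
from datetime import date

# Map umlauted vowels to their macron equivalents, both cases.
UMLAUT_TO_MACRON = str.maketrans("äëïöüÄËÏÖÜ", "āēīōūĀĒĪŌŪ")


def looks_like_legacy_maori_font(font_name: str, doc_date: date) -> bool:
    """Heuristic from the text: pre-2000 document, font name mentions Māori."""
    name = font_name.lower()
    return doc_date.year < 2000 and ("maori" in name or "māori" in name)


def fix_macrons(text: str, font_name: str, doc_date: date) -> str:
    """Convert umlauts to macrons only when the heuristic matches."""
    if not looks_like_legacy_maori_font(font_name, doc_date):
        return text
    # Normalise first so vowel + combining diaeresis becomes one character.
    return unicodedata.normalize("NFC", text).translate(UMLAUT_TO_MACRON)
```

For instance, `fix_macrons("Mäori", "Times Maori", date(1998, 5, 1))` returns `"Māori"`, while the same text in a post-2000 document or in an unrelated font is left untouched.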
The exact nature of these conversions will depend on your source PDFs, but the pipeline could look like this:
- PDF ↓
- Unclean OpenDocument ↓
- Clean OpenDocument ↓
- DocBook ↓
- HTML
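The font-size heuristic mentioned in the clean-up list can be sketched like this. It assumes you have already extracted the text runs with their font size and weight (the `Run` tuple is a made-up representation), treats the style carrying the most characters as body text, and ranks anything larger, or equally large but bold, as a heading level. Real documents will need more rules than this.

```python
from collections import Counter
from typing import Dict, List, Tuple

Style = Tuple[float, bool]        # (font size in points, is bold)
Run = Tuple[str, float, bool]     # (text, font size in points, is bold)


def derive_heading_levels(runs: List[Run]) -> Dict[Style, int]:
    """Map each style that looks like a heading to a heading level (1 = top)."""
    if not runs:
        return {}

    # Weigh each style by how many characters use it; the heaviest is body text.
    weight: Counter = Counter()
    for text, size, bold in runs:
        weight[(size, bold)] += len(text)
    body_style = weight.most_common(1)[0][0]
    body_size, body_bold = body_style

    # Heading candidates: larger than body, or same size but bold.
    candidates = [
        style for style in weight
        if style != body_style
        and (style[0] > body_size
             or (style[0] == body_size and style[1] and not body_bold))
    ]
    # Largest first; at equal size, bold outranks regular.
    candidates.sort(key=lambda style: (-style[0], not style[1]))
    return {style: level for level, style in enumerate(candidates, start=1)}
```

Styles smaller than the body text, such as footnotes, are deliberately left out of the result.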
Manual fixing of documents
The NZETC (http://www.nzetc.org/) uses a number of commercial suppliers to digitise documents, which are then quality checked and have a great deal of metadata added. Typically documents are double-keyboarded by the commercial partner, but error rates vary hugely with the quality of the original scan, the language of the text and the complexity of the language. Structures like tables are hard and error prone. Very clear guidance to the encoders is necessary; this is given in the form of written guidelines, a schema (a specialisation of the TEI/XML schema) and a Schematron representing the house style. PDF and GIF/JPEG images may or may not be used in parallel with the XML, depending on the project.
We also use tesseract-ocr (rumoured to power Google Books) to do crude automatic filtering, such as separating pages of printed text from pages of handwriting.
The NZETC does some digitisation in conjunction with other cultural heritage institutions and partners, for example Transactions and Proceedings of the Royal Society of New Zealand 1868-1961, but for commercial purposes.
More information can be obtained from stuart.yeates@vuw.ac.nz or director@nzetc.org
Quick and not-so-dirty fixes
What you do depends on your goals and requirements. There are some open-source tools that strip out the text from PDF and Word documents very quickly and can be run in batch mode.
pdf2text prepares a text file from one or more PDFs very fast. If the PDF is tidy in structure, the text file is reasonably tidy. It is useful in situations where you want to grab a number of pages of continuous text from a PDF and are happy to do a bit of manual tidy-up. This tool is not good for tabular data, and multicolumn layouts can produce ugly results, depending on how they were created.
antiword was also mentioned for Word documents, as was running the strings command across the binary file.
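To show what running a strings-style pass over a binary file amounts to, here is a small stand-in written in Python rather than the real command. It pulls out runs of printable ASCII of at least four characters; the function name is illustrative, and text stored in other encodings (Word files often use UTF-16) will be missed by this simple version.

```python
import re
from typing import List


def extract_strings(data: bytes, min_length: int = 4) -> List[str]:
    """Return runs of printable ASCII at least `min_length` bytes long."""
    # \x20-\x7e covers space through tilde, i.e. printable ASCII.
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    return [match.decode("ascii") for match in pattern.findall(data)]


def extract_strings_from_file(path: str, min_length: int = 4) -> List[str]:
    """Convenience wrapper that reads the whole file into memory first."""
    with open(path, "rb") as handle:
        return extract_strings(handle.read(), min_length)
```

The output is unstructured and usually includes fragments of metadata alongside the body text, so expect some manual tidy-up.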
HTML and beyond
Observation
Most government agencies are now complying with the e-government standards and are putting up corporate, research, policy and other publications in reasonable, standards-compliant HTML or PDF, but this content is not easily findable by those outside government.
- Google searches it reasonably well, but you need to know what you are looking for and be patient in finding it. Google also works on content and largely ignores context and structure.
- The metadata schemas used by most government departments reflect the context in which the data was created, but are not user-centric for the person on the street.
Solution
So what would a user-driven application for discovering, sharing and commenting on this material look like?

