The Future of Documents: the Fonto perspective

At Fonto, our mission is to open up structured content authoring (SCA) for anyone.

By doing so, we help organizations make the strategic shift from traditional documents to structured content. We help them by engaging the people who are the most important stakeholders in any authoring process: the authors.

Fonto allows users to write, edit and review structured content – in a user-friendly and intuitive manner. It is a browser-based word processor and supports collaborative authoring without knowledge of XML-specifics such as schemas.

Fonto is in use in a wide range of industries and sectors. Examples are publishing, techdoc, life sciences, governments and standardization.

Structured content, semantic tagging

File-formats such as PDF and Word centre around the formatting of content, i.e. fonts and page lay-out, optimized for reading by humans. These file-formats are called ‘unstructured’ because the hierarchical or logical structure of a document is not an intrinsic part of the content. Readers are assisted to understand the organization of the content with the help of styles like headers, bold, underline and lists.

Structured content, as the term suggests, has structural content-organization as its core-principle. By tagging the beginning and end of all logical units in a text, a structured document gets a hierarchical ‘backbone’. For instance: a document consists of sections, sections contain subsections and subsections contain paragraphs, lists, and text-blocks. If parts of a document are stored as separate objects that can be reused within various documents, then they are referred to as content-components.

The second characteristic of structured content is that it is semantically tagged. This means that a fragment of text is marked-up for its meaning. For example, a section may be marked as a ‘summary’, or a word may be marked as a ‘product name’. This form of enrichment is precise and more valuable than working in unstructured formats, where a ‘summary’ or a ‘product name’ would be indicated with a style such as italics.

Structured content is a robust source for the automated generation of formatted documents. Because the structure is controlled and semantics are unambiguously tagged, an automatically generated document will not be prone to formatting inconsistencies. If needed, multiple documents can be generated from one and the same content, without the need to maintain any parallel versions of that document.

Structured content authoring

Traditionally, authors write unstructured content (like Word-files), which may then be converted to a structured format. This means that after authors have done their work, specialists and specialist tools are needed to prepare content for its eventual use.

A user-friendly authoring solution for structured content changes this practice. When authors themselves are empowered to create structured content, it becomes the foundation for efficient creation, management and publication of documentation.

As a result, its value increases. From merely a handy way to electronically exchange information, structured content becomes a basis on which innovation can take place. And, as we will discuss, a lot of innovation is expected in the document area.

A disruption in documents

Forrester, an analyst company, published a report called ‘The Future of Documents’ in December 2020. In this report, Forrester describes a disruption of how organizations and individuals think about and work with documents.

There are quite a number of markets and applications in which content authoring has already shifted from unstructured towards structured and component based. While these are still specialized, and sometimes even ‘niche’ applications, they will form a source of inspiration for much wider adoption. The main conclusion of the Forrester research is that in approximately seven years from now, fundamental paradigm shifts will take place in mainstream documentation. And we do see signs of proof of this. Let’s see why.

A niche-need becomes mainstream

It seems safe to say that Publishing and Tech-doc have been leading the way towards semantically tagged, format-free, structured content management. Their specific need – which goes back to the late ‘90s – was to find an efficient way to author for multiple channels. Websites, knowledge bases and embedded systems became prominent next to books, journals, and manuals.

Today, a much broader trend is observed – albeit that it is based on similar patterns. Many – if not most – business documents are not only (intended to be) read by humans, but increasingly processed in an automated way. There is a growing mismatch between the static format of such documents and the way these documents are processed.

Easy to visualize examples of this are the automated processing of invoices or purchase orders, but on a larger scale there are examples too. To pick one: in life sciences, we see the almost ironic practice that pharma companies spend a lot of resources to get correct data into official submissions and reports , and subsequently see that the regulators who receive these documents spend a considerable amount of time and resources to extract and interpret the data from them.

Alleviating symptoms vs a fundamental change

Of course, there are strategies to reduce the effort related to the automatic interpretation of unstructured documents by machines.

One way to automatically extract and interpret information or data from unstructured documents is to use text recognition and analytics tools. A second way is to enrich unstructured documents by adding a meta-data ‘envelope’ to them; which essentially summarizes key-information inside that document.

However, both strategies address the symptoms that are inherently related to traditional document formats, but not their root-cause.

The more fundamental solution is to adopt a document-format that dissolves the barrier between content and data. We move away from formatting data points, phrases and sections in a document for how they should look (fonts, margins, pages), towards tagging them for what they are (‘semantics’). Doing so, content and data blend into one format in which data remains data and documents are ready for electronic consumption.

Of course, human readers are not forgotten: structured and semantically tagged documents can automatically and flawlessly be published towards styled formats using automated publishing systems.

Changing authoring habits

The shift from WYSIWYG document creation to semantic authoring is a process. We see that strategic benefits are recognized on the level of organizations first. Embracing new ways of working by authors is a step that involves a change of habits.

Authors are still very much bound to the concepts of pages, documents and folders. Research suggests that around 80% of knowledge workers indicate that they are highly satisfied with the word-processing tools they use. On the other hand, we have seen wide adoption of SCA in the earlier industries (techdoc, publishing) and we do see a change in the way authoring takes place.

Authoring and Collaboration become increasingly blended. The paradigm of an author who works on a document in isolation is fundamentally changing into a form of continuous joint drafting of content, at some point resulting in a version of documentation that is being made public. Workflows will become more agile.
From documents to content-assets. Authors increasingly accept the role of contributing to a greater ‘body of information’ by providing content-assets than being responsible for an end-to-end publication or document. Content fragments, components, chunks or topics – whatever they are called, will be used in various contexts. Governance and orchestration of content assets towards the publications is an increasingly important but also distinct role.
Writing becomes increasingly assisted. Autocorrect, grammar check and type-ahead suggestions are the beginning. Automated inclusion of data, references and data from data sources are the next step. Semantic tagging will be semi-automated and based on entity recognition.
The line between data and content blurs. As mentioned before, content-becomes-data and data is embedded in content – as data.

Data-driven and purpose-built

At Fonto, we believe Forrester’s prediction that documentation will become data-driven and purpose-built for their audiences. In our view, this means that the future of documents involves componentized, structured and semantically tagged file formats that have been produced in controlled, but at the same time agile workflows.

Document authoring has changed a lot already and will continue to change in the years ahead. By making SCA ‘doable’ for anyone, we are preparing for the digital disruption of documents. Exciting times are ahead, and next to improving our tool to have a role in this, we will keep sharing our insights through blogs like this. Please feel free to respond and reach out to us. Let’s open the conversation!