Reflections on Metadata

Old can quickly become new in this market, where the ability to repurpose content or make it more discoverable has never been more potent. Archival content was not always given high importance. A finished-goods repository was once the equivalent of a dusty file cabinet. With the explosion of digital products, however, the status of that once-archival content has been transformed to become a publisher’s valued asset and future source of revenue. Content needs to be consistently structured, easily accessible, and managed.

Building on that prescription, this paper provides an executive overview of metadata and its importance to the viability of any commercial publishing endeavor. The intended audience is the publishing business executive. The purpose is not to address the fine detail of a specific metadata function but to provide an overview and suggest an overall strategy for dealing with it and avoiding common pitfalls.

Let’s start with a basic description: what is metadata? It is essentially information about the product or the product’s content. In a somewhat reflexive definition, it is data about data. It describes the product and governs transactions associated with that product. For the publisher, it enables the increasingly complex business rules for what, when, where, and how content is used.

In the past, metadata occupied a simpler role. With print-only products, and prior to the emergence of large online retailers, metadata was primarily a tool for tracking a single version of each title, providing information for print sales and basic cataloguing. It was often stored (and sometimes still is) in the publisher’s title-tracking or business system.

With the advent of e-products, online retailers, and the explosive growth of Web business, the volume of metadata has grown exponentially, and it has increased in both complexity and importance. A single title that originally carried a single print record suddenly became available in multiple formats, each with its own associated metadata. Typically, there are 30–40 fields of metadata required for every digital product in active distribution. Even for smaller publishers with inventories of fewer than 2,000 titles, the exposure can run to 50,000–100,000 data points.

In addition, online markets began demanding a variety of descriptive and transactional metadata. This metadata controls and enforces the rights, permissions, and distribution rules set by the publisher. Furthermore, with the advent of search engines, metadata became the key to discoverability. Precision in writing descriptive metadata became critical.

Many of the title-tracking/business systems used by publishers were originally built assuming a print-only architecture. Such systems could simply not accommodate the crush of additional data. In addition, not all title-tracking systems were designed to ensure a consistent format for each metadata element.

The Ins and Outs of Metadata

The role of metadata has expanded in ways not previously imagined. As noted above, discoverability, a publisher’s most critical quality, depends on how that publisher’s key-word metadata is captured and prioritized by various search engines. The library market is increasingly demanding about how content is catalogued and how that cataloguing information is made machine readable (MARC records). Pricing and transactional metadata varies across multiple channels and territories, and it needs to be kept up to date in as close to real time as possible. Metadata also defines content at the chapter and article level, making it possible to sell more granular information with more specificity.

Broadly speaking, metadata comes in two categories: metadata external to the content object, and metadata internal to the content object.

Metadata external to the content object includes the following:

• All descriptive properties, such as the product registry ID number (ISBN, ISSN, EAN), the author, editor, contributors, publisher, extent, trim size, format, abstracts. This includes overall product descriptions and product-assembly instructions for the warehouse.

• Associated properties such as author biographies and product reviews.

• Discoverability elements such as key words and BISAC subject codes.

• Commercial/transactional metadata, such as price, currency, territory rights, and digital rights.

• Cataloguing metadata, including MARC records and KBart.

Metadata internal to the content object includes the following:

• Digital-object identifiers (DOIs) and DOI registries, such as CrossRef and DataCite. (It is the use of DOIs and DOI registries that makes it possible for the publisher’s online system to capture individual book chapters for sale.)

• Industry ontologies with embedded definitions and references.

The Emergence of Onix

The Onix standard (the name stands for online information exchange) was created in 2000. It expresses metadata as an XML instance conforming to a discrete set of international standards. Computer-to-computer communication using Onix has become the most practical means of dealing with metadata.

Onix for Books made possible a consistent means for publishers to provide rich metadata to online retailers, supply-chain partners, data aggregators, and other interested parties in the publishing business. With Onix, publishers were finally able to provide metadata revisions without the cumbersome practice of developing Excel spreadsheets and waiting for each partner to get around to uploading the data.
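For readers who want to see what "metadata as an XML instance" looks like in practice, the following Python sketch builds one heavily abbreviated product record. The element names are modeled on Onix for Books 3.0 as commonly documented, but the record omits most required elements and has not been validated against the official schema; the record reference, ISBN, and title are invented placeholders, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_product(record_ref: str, isbn13: str, title: str) -> ET.Element:
    """Build an abbreviated, Onix-style <Product> element (illustrative only)."""
    product = ET.Element("Product")
    ET.SubElement(product, "RecordReference").text = record_ref

    # Product identifier; "15" is the code commonly used for ISBN-13.
    # Check the current code lists before relying on it.
    identifier = ET.SubElement(product, "ProductIdentifier")
    ET.SubElement(identifier, "ProductIDType").text = "15"
    ET.SubElement(identifier, "IDValue").text = isbn13

    # Descriptive metadata, reduced here to the title alone.
    descriptive = ET.SubElement(product, "DescriptiveDetail")
    title_detail = ET.SubElement(descriptive, "TitleDetail")
    title_element = ET.SubElement(title_detail, "TitleElement")
    ET.SubElement(title_element, "TitleText").text = title
    return product


def build_message(products: list[ET.Element]) -> ET.Element:
    """Wrap product records in a single message element."""
    message = ET.Element("ONIXMessage")
    message.extend(products)
    return message


def to_xml(message: ET.Element) -> str:
    """Serialize the message to a text string."""
    return ET.tostring(message, encoding="unicode")


if __name__ == "__main__":
    # Placeholder values, not a real product.
    record = build_product("example.com.0001", "9780000000002", "A Sample Title")
    print(to_xml(build_message([record])))
```

The point of the sketch is that a record like this can be generated and re-sent by a system whenever a value changes, rather than being retyped into a spreadsheet for each partner.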

Onix is an important tool for the publisher’s ability to respond quickly to market demands. Price changes and new product offerings can be made closer to real time than ever before. With Onix comes the ever-pressing demand for a dedicated repository and managed platform for metadata. It is no longer practical to manage metadata in separate silos under the various business units and departments that create it.

Whatever system a publisher adopts, the automatic processing of metadata requires having a consistently structured format for each element; in other words, a standard. There are a variety of standards in use, each having relevance for the task being performed (e.g., Onix) or a particular market being served (e.g., MARC records for librarians).

With all the standards and wide array of uses, publishers can no longer afford to have metadata records in various “pockets,” silos separated by different departments or different facets of the publishing business. Aggregating and maintaining consistent metadata company-wide is difficult when the employees managing metadata are operating in different systems with different business rules. It is time to get beyond disparate systems that don’t talk to one another.

In the ideal situation, the publisher will have one system that houses all metadata under a single roof. Publishers still cobbling together metadata from multiple sources are already handicapped by a lack of flexibility and timely response to market opportunities. This situation will become even more dire as time goes by.

In a single system under a single roof, there needs to be a systematic way to configure metadata to meet the exact demands or particular business needs of the publisher and its business partners. For example, there is good reason to keep commercial metadata in its own segment, so that it can be kept current and routinely distributed to aggregators. Access to metadata should be strictly administered.

To create a system that will enable publishers to retrieve metadata and structure it to a particular standard, companies need a responsive system, one that includes master templates for each market segment or interest group. All metadata keyed in or uploaded will need to be validated against appropriate templates. If it is metadata associated with a title to be sold on Amazon, it should be validated against a template configured with the latest Amazon requirements for markets and market influencers. These requirements can be frustratingly precise and eclectic. More than one publisher has complained about Apple’s pesky pricing grid for iBooks.
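The idea of validating records against a master template can be sketched in a few lines of Python. This is a minimal illustration, not any retailer's actual specification: the template name, the required fields, and the list of accepted currencies are all hypothetical, and a production system would carry far more rules.

```python
from dataclasses import dataclass, field


@dataclass
class Template:
    """A hypothetical per-partner template: required fields plus allowed values."""
    name: str
    required_fields: set[str]
    allowed_values: dict[str, set[str]] = field(default_factory=dict)


def validate(record: dict[str, str], template: Template) -> list[str]:
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    # Every required field must be present and non-blank.
    for name in sorted(template.required_fields):
        if not str(record.get(name) or "").strip():
            problems.append(f"missing required field: {name}")
    # Fields with a controlled vocabulary must use one of the allowed values.
    for name, allowed in template.allowed_values.items():
        value = record.get(name)
        if value and value not in allowed:
            problems.append(f"invalid value for {name}: {value!r}")
    return problems


# Hypothetical template for an unnamed retailer (illustrative values only).
RETAILER_TEMPLATE = Template(
    name="hypothetical_retailer",
    required_fields={"isbn13", "title", "product_type", "price", "currency"},
    allowed_values={"currency": {"USD", "EUR", "GBP"}},
)

if __name__ == "__main__":
    record = {
        "isbn13": "9780000000002",  # placeholder identifier
        "title": "A Sample Title",
        "price": "24.99",
        "currency": "USD",
    }
    for problem in validate(record, RETAILER_TEMPLATE):
        print(problem)
```

Keeping one template per partner means that when a partner changes its requirements, only that template is updated and every affected record can be rechecked automatically.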

Consider the financial implications of a publisher hobbled by an inefficient means of distributing price changes. The publisher determines to make a change in prices across an entire suite of products but can only send out a series of spreadsheets to aggregators. The time delay in creating those spreadsheets is one thing, but then how long will it take each aggregator to actually ingest those changes?

The High Price for the Imprecise

The price-change scenario is just one example of the numerous ways the mismanagement of metadata can lead to some very painful results. It might help to take a brief look at some real-life issues that occur in day-to-day business. Consider the following:

• A coffee table book extolling the pleasures of sipping bourbon is not selling. It turns out the subject code, designating the title as having to do with alcoholic beverages, had an unforeseen qualifier: it placed the title in a category of books dedicated to the prevention of alcoholism.

• A percentage of a publisher’s ISBNs are missing a product type entirely and, as a result, do not show up on predefined searches.

• A major partner has changed spec requirements. The publisher is not prepared to respond rapidly. The slow adjustment translates into lost sales.

• A percentage of customer orders fail on the first pass at order entry because of incomplete metadata. Delays result in lost revenue.

• Multiple-component orders are assembled incorrectly because of incorrect metadata, resulting in returns and unhappy customers.

• There is a mix-up in publication dates due to the U.S. style of showing month/day versus the European style of showing day/month.

• Incorrect distribution metadata results in restricted books going to the wrong aggregators or unrestricted books being missed.

The point of these examples is clear: when metadata is not properly managed, the publisher is at risk of losing time, money, and customers.
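The publication-date mix-up mentioned above is easy to demonstrate. In the Python sketch below, the same string is read under the U.S. and European conventions and yields two different dates; storing and exchanging dates in an unambiguous year-month-day form avoids the problem. The sample date and the function name are illustrative assumptions.

```python
from datetime import date, datetime

# The two conventions named in the text, expressed as parsing patterns.
FORMATS = {
    "US": "%m/%d/%Y",  # month/day/year
    "EU": "%d/%m/%Y",  # day/month/year
}


def parse_pub_date(raw: str, style: str) -> date:
    """Parse a publication date string under an explicitly stated convention."""
    return datetime.strptime(raw, FORMATS[style]).date()


if __name__ == "__main__":
    raw = "03/04/2015"  # illustrative value
    us_reading = parse_pub_date(raw, "US")
    eu_reading = parse_pub_date(raw, "EU")
    # The same characters produce two different publication dates.
    print(us_reading.isoformat())
    print(eu_reading.isoformat())
```

A system that records which convention each source uses, and converts everything to one unambiguous format on entry, removes this class of error before the data reaches a trading partner.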

The notion of consolidated metadata storage and efficient exchange can become overwhelming. Consider a simple math exercise. Imagine being a steward to just 1,000 titles, each with 140 metadata fields. That alone means there are 140,000 fields to track. Now, if only 10% of that metadata universe needs attention in a single year, it would mean dealing with 14,000 data points. Assuming roughly 260 working days a year, that works out to handling, on average, 54 metadata adjustments every working day.
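The arithmetic can be rerun with a publisher's own numbers using a short Python sketch. The title count, field count, and percentage come from the exercise above; the figure of 260 working days a year (52 weeks of five days) is an assumption made here, since the text does not state it, and the function name is hypothetical.

```python
def metadata_workload(titles: int, fields_per_title: int,
                      percent_changed: int, working_days: int = 260):
    """Return (total fields, fields changed per year, changes per working day).

    working_days defaults to 260, an assumed figure (52 weeks x 5 days).
    """
    total_fields = titles * fields_per_title
    # Integer arithmetic keeps the yearly count exact.
    changed_per_year = total_fields * percent_changed // 100
    per_working_day = changed_per_year / working_days
    return total_fields, changed_per_year, per_working_day


if __name__ == "__main__":
    total, changed, per_day = metadata_workload(1_000, 140, 10)
    print(f"fields to track:        {total:,}")
    print(f"changes per year:       {changed:,}")
    print(f"changes per working day: {per_day:.1f}")
```

Changing the inputs to match an actual list, including one record per format rather than per title, shows quickly how the daily workload scales.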

There is no harm in a publisher seeking outside help. Often, a publisher is too close to its own issues or too removed from an understanding of the best technologies available. An outside perspective can help. In the case of metadata and metadata/content systems, there are a number of good potential partners.

The first step for the publishing executive is to recognize this fact: metadata is far too important to slip under the radar. How content and metadata are managed will depend on individual publisher needs and budgets. If uncertain, an executive should get professional advice. It is always better to anticipate the pitfalls before stepping into one.

Read the full white paper this article is based on.

The Technology-Publishing Connection

This article is the second in a print and webinar series presented by CodeMantra on how publishers can best use technology to expand their businesses. The series will feature four print articles and four free webinars.
