File formats and software

Research data needs to be saved in a file format that keeps it accessible and reusable in both the short and the longer term.

Why does file format and software selection matter?

Making your data available and future-proofing it.

"All digital information is designed to be interpreted by computer programs to make it understandable and is - by nature - software dependent. All digital data are thus endangered by the obsolescence of the hardware and software environment on which access to data depends.

Despite the backward compatibility of many software packages to import data created in previous software versions and the interoperability between competing popular software programs, the safest option to guarantee long-term data access and usable data is to convert data to standard formats that most software are capable of interpreting, and that are suitable for data interchange and transformation." UK Data Archive – File Formats and Software

"The goal of digital curation is to ensure the appropriate usability of managed digital assets over time. Format is a fundamental characteristic of a digital asset that governs its ability to be used effectively.

Without strong format typing a digital asset is merely an undifferentiated string of bits. The information content encoded into an asset's bits can only be interpreted properly and rendered in human-sensible form if that asset's format is known.

While it is possible for bits to be preserved indefinitely without consideration of format, it is only through the careful management of format that the meaning of those bits remains accessible over time" Stephen Abrams, California Digital Library 

Which file formats are more likely to be accessible in the future?

There are no certainties in this area and no guarantees can be given, but you can at least follow some basic common-sense principles.

"As technology continually changes, researchers should plan for both hardware and software obsolescence. How will your data be read if the software used to produce them becomes unavailable?

Formats more likely to be accessible in the future are:

  • Non-proprietary
  • Open, documented standard
  • Common usage by research community
  • Standard representation (ASCII, Unicode)
  • Unencrypted
  • Uncompressed

Consider migrating your data into a format with the above characteristics, in addition to keeping a copy in the original software format." File Formats for Long-term Access – MIT
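
As one illustration of the migration advice above, the following minimal Python sketch converts a gzip-compressed CSV file in a legacy encoding into an uncompressed UTF-8 CSV file, leaving the original in place. The function name `migrate_to_plain_csv`, the `.utf8.csv` naming scheme and the default `latin-1` source encoding are assumptions made for this example; they are not part of the MIT guidance.

```python
import csv
import gzip
from pathlib import Path


def migrate_to_plain_csv(original: Path, source_encoding: str = "latin-1") -> Path:
    """Write an uncompressed UTF-8 CSV copy next to a gzip-compressed original.

    The original file is only read, never modified or deleted, so both
    copies are kept side by side.
    """
    # Hypothetical naming scheme: data.csv.gz -> data.utf8.csv
    target = original.with_name(original.name.removesuffix(".csv.gz") + ".utf8.csv")

    # Decompress and decode the original, then re-encode each row as UTF-8.
    with gzip.open(original, "rt", encoding=source_encoding, newline="") as src, \
            open(target, "w", encoding="utf-8", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            writer.writerow(row)
    return target
```

For instance, calling `migrate_to_plain_csv(Path("survey.csv.gz"))` on a hypothetical file `survey.csv.gz` would create `survey.utf8.csv` in the same folder. The sketch requires Python 3.9 or later and assumes you know the encoding of the original file.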

Checklist for file format selection:

  1. Is there a risk of file format obsolescence?
      • Is there format independence from specific platforms, hardware or software?
      • Could the supporting software be upgraded and the new version not able to open old files?
      • Could the supporting software be bought by a competitor and withdrawn?
      • Could the format fall into disuse and lack software support?
  2. Is the format proprietary, or is it open?
  3. Is the format specification publicly available?
  4. Is the format convenient for extracting the data for further use and indexing for search/discovery, or is it only suitable for viewing the data?
  5. Does the format store the data at the required level of fidelity? Will the data be degraded every time it is saved in this format?
  6. Does the format compress the data? (Generally compression makes a file less robust to errors in data transmission or damage or degradation of storage media.)
  7. Is the chosen format an accepted standard? (Either a formal standard managed by a standards organisation, or a de-facto standard in your field of research.)
  8. What software and formats have you or colleagues used in past projects?
  9. What formats will be easiest to share with colleagues for future projects?
  10. Are there any discipline-specific norms?
  11. What software is compatible with hardware you already have?
  12. Will you need to purchase new software?
  13. What formats will be easiest to annotate with metadata so that you and others can interpret them days, months, or years in the future?

Adapted from Australian National Data Service – "File Formats" and University of Cambridge – "Choosing Formats"
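
The checklist's final question, about annotating data with metadata, can be met in many ways. One possible approach is a plain-text "sidecar" file stored next to the data file. The Python sketch below is illustrative only: the function name `write_sidecar`, the `.meta.json` suffix and the field names are assumptions for this example and are not prescribed by the sources above.

```python
import json
from pathlib import Path


def write_sidecar(data_file: Path, description: str, columns: dict[str, str]) -> Path:
    """Write a UTF-8 JSON file describing `data_file` and save it alongside."""
    # Hypothetical naming scheme: results.csv -> results.csv.meta.json
    sidecar = data_file.with_name(data_file.name + ".meta.json")
    record = {
        "file": data_file.name,
        "description": description,
        "columns": columns,  # column name -> meaning and units
    }
    # ensure_ascii=False keeps non-ASCII text readable in the saved file.
    sidecar.write_text(
        json.dumps(record, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return sidecar
```

Because the sidecar is an open, text-based format, it can be read in any text editor even if the software that produced the data is no longer available.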

What file formats are recommended?

The UK Data Archive (UKDA) provides guidance on file formats it accepts for deposited data in relation to the following data types:

  • Quantitative tabular data with extensive metadata
  • Quantitative tabular data with minimal metadata
  • Geospatial data
  • Qualitative data
  • Digital image data
  • Digital audio data
  • Digital video data
  • Documentation and scripts

See also:

Common Image Formats (PDF) – Cambridge University
Digital Still Images – JISC Digital Media
Digital Video File Types – JISC Digital Media
Digital Audio Files – JISC Digital Media

What are 'non-proprietary' or 'open' formats, and why would I use them?

In the simplest case, a non-proprietary format is one that has no restrictions on its use and over which no one claims intellectual property rights. For example, Microsoft Office products, such as Microsoft Word, are proprietary, while OpenOffice products are non-proprietary (and open source).

For long-term access to files, digital preservation experts tend to recommend 'non-proprietary' and 'open' formats (terms which they use somewhat interchangeably). The logic here is that if the code behind the software is publicly available (i.e. open source), then that format/software will be supported so long as at least one competent tinkerer still finds it interesting or useful.

In contrast, a private software company can go out of business or stop producing a compatible version of the software in whose format your data was saved, and no one will have the rights or knowledge to provide it anymore.
Cambridge University – Choosing Formats