11 KiB
Raw Blame History

MATLAB objects

This document describes the format that MATLAB uses to store (new-style) objects in .mat files. As far as I know, this format is not documented anywhere. The official MATLAB documentation (as of version 2018) does not even mention the corresponding MX class.

We assume the reader's familiarity with the document linked above. It describes how the "normal" MATLAB data types (matrices, cell arrays, structure arrays, etc.) are stored in .mat files. Objects (instances of user-defined classes, or of build-in MATLAB types such as tables, etc.) are stored in a slightly different way. If you save a variable into a .mat file, it appears alongside the "normal" variables, but its data only contains a "reference" that points to the actual object contents. The actual data are stored in a so called "subsystem data" in a separate portion of a .mat file. Recall that .mat file header contains (among some text information and a couple of flags) a "subsystem data offset field". As described in the official documentation, it appears right after the first 124 bytes of a .mat file (which contain a "header text field") and occupies 16 bit, i.e., is a short (16-bit) integer. This integer in fact tells you the offset in the .mat file (starting from the very beginning of the file) where the subsystem data starts. Typically it appears after all the "normal" variables.

So, we have to describe two things: how the subsystem data is organized and how the references to it are stored.

Let's start with the second. Every MATLAB matrix is stored in a data element of type miMATRIX = 14. The type of its elements is determined by a "MATLAB array type" field (see table on page 1-16): mxCELL_CLASS = 1 means cell array, mxSTRUCT_CLASS = 2 means structure array, and so on, up to mxUINT64_CLASS = 15 (unsigned 64-bit integer). The official table ends here, but in fact there are (at least) to more classes: mxFUNCTION_CLASS = 16 for function handles, and mxOPAQUE_CLASS = 17 for "opaque objects", and that is what we are interested in.

Opaque data element format

After the array flags element that has array class field set to mxOPAQUE_CLASS = 17, come four more data elements:

  • Array name. This is the variable name in case of variables, and empty in other cases (same as for "normal" data types).
  • Type system name. This is stored in the same format as array name (class miINT8), and contains four characters: MCOS. It stands for "MATLAB Class Object System".
  • Class name. Again, the same format for storing strings. This element contains name of the class that this object belongs to.
  • Data reference. This element is a matrix of class mxUINT32 and has size n×1, where n is at least 6. The 32-bit unsigned integers inside contain information that you need to find the actual object's data inside subsystem data:
    • First number is 3707764736 (0xdd000000). This is the "magic" number that is used to identify MATLAB object data. We'll see it used (quite unexpectedly) below.
    • Second number (let's call it d) is the number of dimensions in the object array. Typically it's 2 (so, we usually deal with 2-dimensional object arrays).
    • Next d numbers contain the size of each of all the dimensions. So, for a single object we have 2 numbers: 1 and 1.
    • Then, for each of the object in the object array, comes objectId. Naturally, the number of the objects is just a product of all the dimension sizes.
    • Finally, the last number is classId.

So, the main information you can learn from this element is two numbers: objectId is the index of object in the list of all the objects stored in the file (numbering starts with 1), while classId is the index of our object's class in the list of all the classes that are present in the file (again, numbering starts with 1). For example, if we only store one object in out file, both of these numbers will be equal to 1.

These two numbers is all you need to find your object's data in the subsystem data, so let's proceed to that.

Subsystem data format

General structure

First, comes a small header.

  • 0x00, 0x01 (2 bytes)
  • 0x4d49 (MI, 16-bit integer): endian indicator (see page 1-7).

After that come variables that are stored in the same way as in the "normal" portion of the file.

I think that there should be only two variables, and the second one does not contain any useful information. So we'll focus on the first variables, which should be a structure array with a single field MCOS (MATLAB Class Object System, again). The value of this field is an array of class mxOPAQUE_CLASS again, so it has the structure described above. It is different, though: its class name is now FileWrapper__, and its data is not a matrix of class mxUINT32 with object number and class number, but a cell array. This cell array is what we are interested in. Its first element is a byte array (matrix of class mxUINT8) that ties together all the references to object data, i.e., object numbers and class numbers. Let's call this byte array object metadata. The second element doesn't really contain anything, as far as I can say: it is an empty cell array. The following elements contain the fields of the objects that are stored in the file; the object metadata tells you which fields belong to which objects. We shall refer to these elements as field contents. Finally, the last element is of unknown purpose: in all the examples I've seen, it contains a column cell array with several empty structure arrays.

Object metadata

Object metadata is a single byte array with several regions inside:

  • Table of contents: a list of 16-bit integers:

    • first is of unknown purpose;
    • second is number of field/class names (see below);
    • the rest are treated as offsets (in bytes) to the regions inside object metdata. The offsets are listed in ascending order, and the list is 0-terminated. Note that every region, including table of contents, is aligned on 8-byte boundaries, so there might be extra 4 bytes of padding after the terminating 0. In all the examples I've seen, this list contains 6 offsets, and the last one actually points to the end of object metadata (its value is the same as the metadata length). Thus we shall assume that the object metadata has 6 regions: the last one is probably empty.
    • Then come field/class names: a list of zero-terminated ASCII strings. The number of the strings was mentioned above; again, the data is aligned on 8-byte boundaries. This region gives you names for all the classes that are stored in the file, and all their fields. The order is arbitrary, and the information that connects the field to classes (and tells you which of the names are actually names of classes) comes later.
  • Region 1 contains several blocks, four 32-bit integers in each block. First block contains integers (0, 0, 0, 0), and the rest of the blocks contain (0, n, 0, 0), where n is an index in the field/class names list that points to a class name. Thus, the number of this 128-byte blocks is the number of different classes stored in the file, plus one zero-filled block in the beginning. This allows you to get class names from region 1; the rest are field names. Moreover, the order of the class names in this region corresponds to the "classId" numbers, which allows you to connect class names to classIds.

  • Region 2 seems to be empty.

  • Region 3 connects objects to classes. It contains a sequence of blocks, six 32-bit integers in each block. First block is filled with zeros, and the rest contain (classId, 0, 0, 0, objectId, objectDependencyId). Each block describes one object, and tells you which class this object has. The meaning of objectDependencyId is not exactly clear (but see below for some hints). Note that classId and objectId here are the same that you may find in the references stored in "opaque" objects.

  • Region 4 connects objects to their fields. This is a list of 32-bit integers. The first two integers (i.e., first 8 bytes) are zero. Then comes a sequence of blocks, one for each object, in the ascending order of objectIds. Each block starts with one 32-bit integer containing the number of sub-blocks to follow; a sub-block consists of three 32-bit integers. Note that each block is aligned on 64-bit boundaries, so there might be padding involved. As we said, a block corresponds to an object. Its sub-blocks correspond to the fields in this object, and contain the following three numbers: (fieldId, 1, fieldContentsId). Here fieldId allows you to get field name from the field/class names list of Region 1, and fieldContentsId points to the location of the value of this field in the current object. As we said earlier, the field contents are stored in the same cell array where object metadata is the first element. The fieldContentsId numbering, again, starts with 1, and does not include the first two cells (the object metadata cell and the unknown stuff cell).

This all might seem a little complicated, but actually is rather simple. We have four lists: classes, objects, fields, and field contents. Region 1 connects classes to their names (stored in field/class names list, which allows you to get names for the field, as well). Region 3 connects objects to classes. Region 4 connects objects to fields, and fields to field contents. The meaning of objectDependencyId values in Region 3 is not totally clear, but it seems to suggest an alternative numbering of objects, which takes into account their dependencies: if one object contains another one as a field, then the second one should be constructed before the first. Thus, the objectDependencyId of the second object should be lower that that of the first object.

Actually, this situation (objects as fields of other objects) contains a caveat: if the field that is stored in the field contents cells is itself an object, it is not wrapped in an "opaque object" class data type, in contrast to the objects stored in the "normal" section of a .mat file. Instead, a field contents cell just contains the mxUINT32 matrix that would be the "data reference" part of an "opaque object". This can lead to hilarious results: if you have an object that has a field containing a column matrix of type int32 whose first element is the magic number 0xdd000000, this object can be saved into a .mat file using MATLAB, but cannot be read back, since MATLAB treats it as an object reference and tries to extract its value based on the other values of the matrix, which can easily lead to crashes or even infinite loops.