Add description of mxOPAQUE_CLASS objects
This commit is contained in:
parent
092ce8a176
commit
6a6d0a342a
@ -9,6 +9,7 @@ Project("{9A19103F-16F7-4668-BE54-9A1E7A4F7556}") = "MatFileHandler.Tests", "Mat
|
||||
EndProject
|
||||
Project("{2150E333-8FDC-42A3-9474-1A3956D46DE8}") = "Solution Items", "Solution Items", "{2CF58D3D-4CEC-419B-AD67-58665A888BDF}"
|
||||
ProjectSection(SolutionItems) = preProject
|
||||
MatFileHandler\Objects.md = MatFileHandler\Objects.md
|
||||
README.md = README.md
|
||||
EndProjectSection
|
||||
EndProject
|
||||
|
201
MatFileHandler/Objects.md
Normal file
201
MatFileHandler/Objects.md
Normal file
@ -0,0 +1,201 @@
|
||||
# MATLAB objects
|
||||
|
||||
This document describes the format that MATLAB uses to store (new-style)
|
||||
objects in .mat files. As far as I know, this format is not documented
|
||||
anywhere. The [official MATLAB documentation](https://www.mathworks.com/help/pdf_doc/matlab/matfile_format.pdf)
|
||||
(as of version 2018) does not even mention the corresponding MX class.
|
||||
|
||||
We assume the reader's familiarity with the document linked above.
|
||||
It describes how the "normal" MATLAB data types (matrices, cell arrays,
|
||||
structure arrays, etc.) are stored in .mat files. Objects (instances
|
||||
of user-defined classes, or of build-in MATLAB types such as tables, etc.)
|
||||
are stored in a slightly different way. If you save a variable into a .mat
|
||||
file, it appears alongside the "normal" variables, but its data only contains
|
||||
a "reference" that points to the actual object contents.
|
||||
The actual data are stored in a so called "subsystem data" in a separate
|
||||
portion of a .mat file.
|
||||
Recall that .mat file header contains (among some text information and
|
||||
a couple of flags) a "subsystem data offset field".
|
||||
As described in the official documentation, it appears right after
|
||||
the first 124 bytes of a .mat file (which contain a "header text field")
|
||||
and occupies 16 bit, i.e., is a short (16-bit) integer.
|
||||
This integer in fact tells you the offset in the .mat file
|
||||
(starting from the very beginning of the file) where the subsystem data
|
||||
starts. Typically it appears after all the "normal" variables.
|
||||
|
||||
So, we have to describe two things: how the subsystem data is organized and how
|
||||
the references to it are stored.
|
||||
|
||||
Let's start with the second. Every MATLAB matrix is stored in a data element
|
||||
of type `miMATRIX = 14`. The type of its elements is determined by
|
||||
a "MATLAB array type" field (see table on page 1-16):
|
||||
`mxCELL_CLASS = 1` means cell array, `mxSTRUCT_CLASS = 2` means structure
|
||||
array, and so on, up to `mxUINT64_CLASS = 15` (unsigned 64-bit integer).
|
||||
The official table ends here, but in fact there are (at least) to more
|
||||
classes: `mxFUNCTION_CLASS = 16` for function handles,
|
||||
and `mxOPAQUE_CLASS = 17` for "opaque objects", and that is what we are
|
||||
interested in.
|
||||
|
||||
## Opaque data element format
|
||||
|
||||
After the array flags element that has array class field set
|
||||
to `mxOPAQUE_CLASS = 17`, come four more data elements:
|
||||
* Array name. This is the variable name in case of variables, and empty
|
||||
in other cases (same as for "normal" data types).
|
||||
* Type system name. This is stored in the same format as array name
|
||||
(class `miINT8`), and contains four characters: `MCOS`. It stands for
|
||||
"MATLAB Class Object System".
|
||||
* Class name. Again, the same format for storing strings. This element
|
||||
contains name of the class that this object belongs to.
|
||||
* Data reference. This element is a matrix of class `mxUINT32`
|
||||
and has size *n*×1, where *n* is at least 6.
|
||||
The 32-bit unsigned integers inside contain information that you need to find
|
||||
the actual object's data inside subsystem data:
|
||||
* First number is 3707764736 (0xdd000000). This is the "magic" number
|
||||
that is used to identify MATLAB object data. We'll see it used
|
||||
(quite unexpectedly) below.
|
||||
* Second number (let's call it *d*) is the number of dimensions
|
||||
in the object array. Typically it's 2 (so, we usually deal with
|
||||
2-dimensional object arrays).
|
||||
* Next *d* numbers contain the size of each of all the dimensions.
|
||||
So, for a single object we have 2 numbers: 1 and 1.
|
||||
* Then, for each of the object in the object array, comes
|
||||
**objectId**. Naturally, the number of the objects is just a
|
||||
product of all the dimension sizes.
|
||||
* Finally, the last number is **classId**.
|
||||
|
||||
So, the main information you can learn from this element is two numbers:
|
||||
*objectId* is the index of object in the list of all the objects
|
||||
stored in the file (numbering starts with 1), while *classId*
|
||||
is the index of our object's class in the list of all the classes
|
||||
that are present in the file (again, numbering starts with 1).
|
||||
For example, if we only store one object in out file,
|
||||
both of these numbers will be equal to 1.
|
||||
|
||||
These two numbers is all you need to find your object's data
|
||||
in the subsystem data, so let's proceed to that.
|
||||
|
||||
|
||||
## Subsystem data format
|
||||
|
||||
### General structure
|
||||
|
||||
First, comes a small header.
|
||||
|
||||
* `0x00`, `0x01` (2 bytes)
|
||||
* `0x4d49` (`MI`, 16-bit integer): endian indicator (see page 1-7).
|
||||
|
||||
After that come variables that are stored in the same way as in the
|
||||
"normal" portion of the file.
|
||||
|
||||
I think that there should be only two variables, and the second one does
|
||||
not contain any useful information.
|
||||
So we'll focus on the first variables, which should be a structure array
|
||||
with a single field `MCOS` (MATLAB Class Object System, again).
|
||||
The value of this field is an array of class `mxOPAQUE_CLASS` again,
|
||||
so it has the structure described above. It is different, though:
|
||||
its class name is now `FileWrapper__`, and its data is not a matrix of class
|
||||
`mxUINT32` with object number and class number, but a cell array.
|
||||
This cell array is what we are interested in.
|
||||
Its first element is a byte array (matrix of class `mxUINT8`)
|
||||
that ties together all the references to object data, i.e.,
|
||||
object numbers and class numbers. Let's call this byte array
|
||||
**object metadata**.
|
||||
The second element doesn't really contain anything, as far as I can say:
|
||||
it is an empty cell array.
|
||||
The following elements contain the fields of the objects that are stored
|
||||
in the file; the object metadata tells you which fields belong to which
|
||||
objects. We shall refer to these elements as **field contents**.
|
||||
Finally, the last element is of unknown purpose: in all the examples
|
||||
I've seen, it contains a column cell array with several empty
|
||||
structure arrays.
|
||||
|
||||
### Object metadata
|
||||
|
||||
Object metadata is a single byte array with several regions inside:
|
||||
|
||||
* **Table of contents**: a list of 16-bit integers:
|
||||
* first is of unknown purpose;
|
||||
* second is **number of field/class names** (see below);
|
||||
* the rest are treated as offsets (in bytes) to the regions inside
|
||||
object metdata. The offsets are listed in ascending order, and the list
|
||||
is 0-terminated. Note that every region, including table of contents,
|
||||
is aligned on 8-byte boundaries, so there might be extra 4 bytes of padding
|
||||
after the terminating 0.
|
||||
In all the examples I've seen, this list contains 6 offsets, and the last
|
||||
one actually points to the end of object metadata (its value is the same
|
||||
as the metadata length). Thus we shall assume that the object metadata has
|
||||
6 regions: the last one is probably empty.
|
||||
* Then come **field/class names**: a list of zero-terminated ASCII
|
||||
strings. The number of the strings was mentioned above; again, the data is
|
||||
aligned on 8-byte boundaries.
|
||||
This region gives you names for all the classes that are stored in the file,
|
||||
and all their fields. The order is arbitrary, and the information that
|
||||
connects the field to classes (and tells you which of the names are
|
||||
actually names of classes) comes later.
|
||||
|
||||
* Region 1 contains several blocks, four 32-bit integers in each block.
|
||||
First block contains integers (0, 0, 0, 0), and the rest of the blocks
|
||||
contain (0, *n*, 0, 0), where *n* is an index in the **field/class names**
|
||||
list that points to a class name. Thus, the number of this 128-byte blocks
|
||||
is the number of different classes stored in the file, plus one zero-filled
|
||||
block in the beginning.
|
||||
This allows you to get class names from region 1; the rest are field names.
|
||||
Moreover, the order of the class names in this region corresponds to
|
||||
the "classId" numbers, which allows you to connect class names to
|
||||
classIds.
|
||||
|
||||
* Region 2 seems to be empty.
|
||||
|
||||
* Region 3 connects objects to classes. It contains a sequence of blocks,
|
||||
six 32-bit integers in each block. First block is filled with zeros,
|
||||
and the rest contain (*classId*, 0, 0, 0, *objectId*, *objectDependencyId*).
|
||||
Each block describes one object, and tells you which class this object
|
||||
has. The meaning of *objectDependencyId* is not exactly clear
|
||||
(but see below for some hints). Note that *classId* and *objectId* here
|
||||
are the same that you may find in the references stored in
|
||||
"opaque" objects.
|
||||
|
||||
* Region 4 connects objects to their fields. This is a list
|
||||
of 32-bit integers. The first two integers (i.e., first 8 bytes) are zero.
|
||||
Then comes a sequence of blocks, one for each object, in the ascending order
|
||||
of objectIds. Each block starts with one 32-bit integer containing the number
|
||||
of sub-blocks to follow; a sub-block consists of three 32-bit integers.
|
||||
Note that each block is aligned on 64-bit boundaries, so there might
|
||||
be padding involved.
|
||||
As we said, a block corresponds to an object. Its sub-blocks
|
||||
correspond to the fields in this object, and contain the following three
|
||||
numbers: (*fieldId*, 1, *fieldContentsId*).
|
||||
Here *fieldId* allows you to get field name from the *field/class names*
|
||||
list of Region 1, and *fieldContentsId* points to the location of the value
|
||||
of this field in the current object. As we said earlier, the field contents
|
||||
are stored in the same cell array where object metadata is the first element.
|
||||
The fieldContentsId numbering, again, starts with 1, and does not include
|
||||
the first two cells (the object metadata cell and the unknown stuff cell).
|
||||
|
||||
This all might seem a little complicated, but actually is rather simple.
|
||||
We have four lists: classes, objects, fields, and field contents.
|
||||
Region 1 connects classes to their names (stored in **field/class names**
|
||||
list, which allows you to get names for the field, as well).
|
||||
Region 3 connects objects to classes.
|
||||
Region 4 connects objects to fields, and fields to field contents.
|
||||
The meaning of *objectDependencyId* values in Region 3 is not totally clear,
|
||||
but it seems to suggest an alternative numbering of objects, which takes
|
||||
into account their dependencies: if one object contains another one as a
|
||||
field, then the second one should be constructed before the first.
|
||||
Thus, the *objectDependencyId* of the second object should be lower
|
||||
that that of the first object.
|
||||
|
||||
Actually, this situation (objects as fields of other objects) contains
|
||||
a caveat: if the field that is stored in the *field contents* cells is
|
||||
itself an object, it is not wrapped in an "opaque object" class data type,
|
||||
in contrast to the objects stored in the "normal" section of a .mat file.
|
||||
Instead, a *field contents* cell just contains the `mxUINT32` matrix
|
||||
that would be the "data reference" part of an "opaque object".
|
||||
This can lead to hilarious results: if you have an object that has
|
||||
a field containing a column matrix of type `int32` whose first element
|
||||
is the magic number 0xdd000000, this object can be saved into a .mat file
|
||||
using MATLAB, but cannot be read back, since MATLAB treats it as an object
|
||||
reference and tries to extract its value based on the other values of the
|
||||
matrix, which can easily lead to crashes or even infinite loops.
|
||||
|
10
README.md
10
README.md
@ -1,5 +1,15 @@
|
||||
# MatFileHandler
|
||||
|
||||
MatFileHandler is a .NET library for reading and writing MATLAB .mat files
|
||||
(of so-called "Level 5"). MatFileHandler supports numerical arrays,
|
||||
logical arrays, sparse arrays, char arrays, cell arrays and structure arrays.
|
||||
Moreover, MatFileHandler is able to read the contents of MATLAB objects,
|
||||
and is currently probably the only third-party library that can do that.
|
||||
Since the format for storing MATLAB objects seems to be obscure and
|
||||
undocumented, support for them is preliminary and probably contains bugs.
|
||||
You can find (partial) technical description of MATLAB object data format
|
||||
[here](MatFileHandler/Objects.md).
|
||||
|
||||
This document briefly describes how to perform simple operations with .mat files
|
||||
using MatFileHandler.
|
||||
|
||||
|
Loading…
x
Reference in New Issue
Block a user