Accessing and Converting Merlin Binary Files

In (most) electron microscopes with a Medipix3 detector, the data is acquired using the Merlin readout system. This outputs a header file and the image data in a one or more binary files. These can be converted into a more convenient HDF5-based EMD (https://emdatasets.com/format) compatible format by using the MerlinBinary class. The class also allows quick access to the data without conversion.

To raw data is stored flat on disk, so to navigate and store the data in the correct shape, the scan parameters can be supplied. These are not stored in the Merlin files, but can be parsed (along with most other scan information) from a Digital Micrograph (DM) image acquired during the scan. If a DM file is not recorded, the information can be specified as parameter, either manually or by parsing it from another image.

Using a DM File

With a DM file, the class can be initialised using the following:

  • Binary file: medipixfile.mib
  • Header file: medipixfile.hdr
  • DM-file : dmfile.dm3
>>> from fpd.fpd_file import MerlinBinary
>>> mb = MerlinBinary(binfns='medipixfile.mib', hdrfn='medipixfile.hdr', dmfns=['dmfile.dm3'])

One or more DM files can be specified and all files will be parsed and stored as EMD groups and in raw binary form, allowing the native DM file to be extracted unmodified and read in Digital Micrograph. Only the first DM file is used for the scan information.

Without a DM File

If you do not have any DM files, the scan parameters can be specified through the scanXalu and scanYalu arguments which are tuples containing for each axis the axis, the label, and units. If axis is an integer, it is interpreted as the axis length and a new integer axis is generated. Otherwise, axis should be an iterable that will be used as the axis directly.

>>> import numpy as np
>>> mb = MerlinBinary(
...         binfns='medipixfile.mib', hdrfn='medipixfile.hdr',
...         scanXalu=(128, 'xaxis', 'pixels'), scanYalu=(np.arange(256)*0.1, 'yaxis', 'nm'))

If another image has been recorded with axis information and it can be read by HyperSpy, it can be parsed for the axes information using:

>>> from hyperspy.io import load
>>> im = load('image_filename')
>>> scanXalu, scanYalu = [(m.axis, m.name, m.units) for m in im.axes_manager.signal_axes]

Specifying the detector axes

In most systems, it is not trivial to determine the detector axes properties and, often, a calibration based upon known image dimensions is used. For example, in STEM the brightfield disc may be used to set the reciprocal space pixel size.

Accessing the data pre-conversion using the memmory mapping or other means discussed below allows the pixel size to be determined from the data. Once done, the detector axes may be specified in a similar manner as to file conversion without a DM file, discussed above. This is dome by specifying the detXalu and detYalu arguments. For example, to set the y-axis as pixels, and the x-axis in nm:

>>> import numpy as np
>>> mb = MerlinBinary(
...         binfns='medipixfile.mib', hdrfn='medipixfile.hdr', dmfns=['dmfile.dm3'],
...         detYalu=(256, 'y-axis', 'pixels'), detXalu=(np.arange(256)*0.1, 'x-axis', 'nm'))

Without any scan information

If no scan information is passed, and a single binary file is provided, all images in the file are processed as though the shape was [1, nimages, detY, detX] (as are TEM Data, discussed below). In this case, row_end_skip should be set to 0 unless there are a known number of unwanted images at the end of the dataset. For example:

Flyback

Depending on how the Merlin system is triggered, when acquiring data a number of extra pixels can be included at the end of each row due to triggering whilst the beam is moving to the start of the next row of the scan, the flyback. These extra pixels can be removed using the row_end_skip argument, which is by default is set to 1. Faster acquisitions and larger flyback times will produce more extra pixels. To use a different value:

>>> mb = MerlinBinary(
...         binfns='medipixfile.mib', hdrfn='medipixfile.hdr', dmfns=['dmfile.dm3'],
...         row_end_skip=2)

The number of extra pixels at the end of the data after taking account of flyback pixels is counted and printed at the end of file conversion. If this is unusually high (there are often a couple of extra pixels), the row_end_skip parameter may be incorrect.

Accessing Files Pre-Conversion

Conversation to HDF5 is the most flexible solution, however, it is often useful to access a file before spending the several minutes it takes to convert the data to HDF5. This is particularly useful when quickly checking a dataset after acquisition or preparing a mask for use in removal of bad pixels.

The MerlinBinary class provides a number of methods to access the data in this way. The most useful for common data acquisition modes is the get_memmap() method.

>>> mb = MerlinBinary(
...         binfns='medipixfile.mib', hdrfn='medipixfile.hdr', dmfns=['dmfile.dm3'],
...         row_end_skip=2)

>>> mm = mb.get_memmap()

mm is a memory mapped file with an array interface. It can be indexed in the usual way, passed to the processing functions in the fpd_processing module (or elsewhere) , or plotted using the DataBrowser. For example:

Where nav_im specifies a navigation image, and is set here to be blank for speed. The memory mapped data could be processed to generate one and will do so automatically if nav_im is not specified. If this navigation image is to be reused, it is best to assign it to a reusable variable (i.e. nav_im = …).

Memory mapping only works for certain datasets, and the class will issue appropriate errors when a file or files cannot be memory mapped. A much slower but more general method is accessing the data through the array interface of the MerlinBinary class. For example:

>>> mb = MerlinBinary(
...         binfns='medipixfile.mib', hdrfn='medipixfile.hdr', dmfns=['dmfile.dm3'],
...         row_end_skip=2)

>>> d = mb[0, :10]

The data is in-memory, so care must be taken over the amount of the available system memory that is used.

Conversion to HDF5

The data may be converted to hdf5 with the write_hdf5() method.

Data Chunking

An important parameter is data chunking, which defines how the data is structured in the HDF5 file. This can have a large influence on the I/O performance and compression. The default chunking is to store the dataset in cubes of 16x16[x1]x16x16 pixel chunks. Different settings may be more appropriate for other uses. The chunking can be set by scan_chunks, im_chunks, or chunks, with chunks taking precedence over the other chunk parameters. For example:

>>> mb = MerlinBinary(
...         binfns='medipixfile.mib', hdrfn='medipixfile.hdr', dmfns=['dmfile.dm3'])
>>> mb.write_hdf5(chunks=(8, 8, 256, 256))

The data is repacked by default (repack=True), unless memory mapping the file is possible. Repacking stores the fpd data in a temporary file, then copies it chunk-by-chunk to the final file. It therefore requires more disk space during conversion, but it is RAM efficient. The final file sizes are typically smaller and processing the data is faster. If memory mapping the file is possible, this will automatically happen for writing the data in chunks, avoiding the intermediate file.

Compression

The default compression method used is ‘gzip’. This method is built-in to the hdf5 library, so files made with this compression method may be opened anywhere.

Starting with fpd version 0.2.1, the compression method may be set as one of [‘gzip’, ‘LZ4’] through the compression parameter of the write_hdf5() method. This parameter only affects the main dataset. All other datasets are relatively small and so ‘gzip’ is used so that some idea of the data set may be obtained without plugins, if needed.

The LZ4 plugin is provided by the hdf5plugin package. This compression method creates slightly larger files than gzip, but is very much faster. To use LZ4, the compression parameter must be set:

>>> mb = MerlinBinary(
...         binfns='medipixfile.mib', hdrfn='medipixfile.hdr', dmfns=['dmfile.dm3'])
>>> mb.write_hdf5(compression='LZ4')

Compared with gzip, tests with LZ4 typical gave the following (the numbers are likely to vary with hardware spec):

  • 2.2x faster read speed
  • 2.8x faster write speed
  • 1.7x faster file creation

Data Masking

There are sometimes bad pixels (dead or hot or noisy or contaminated …) on a detector and these can be interpolated out during conversion by specifying a mask. In the following example, we generate a very simple mask using memory mapped file access to a specific image (it could be a flat field) and apply during the conversion. In general, the mask generation procedure may be more complex.

>>> mb = MerlinBinary(
...         binfns='medipixfile.mib', hdrfn='medipixfile.hdr', dmfns=['dmfile.dm3'])
>>> mm = mb.get_memmap()
>>> mask = mm[23, 45] < 6
>>> mb.write_hdf5(mask=mask)

Locations where the mask is True are replaced using cubic interpolation of the non-masked values. The mask is stored in the HDF5 file in the usual EMD format.

Data Manipulation

Arbitrary functions may be applied to the image data during conversion to hdf5. This is enabled by passing a function func to the write_hdf5() method. The function should be of the form func(image, scanYind, scanXind, **kwargs), where image is a single detector image and scanYind and scanXind are the scan pixel indices. kwargs are optional keyword argument pairs.

An example of shifting all images by a scan-position dependent value is:

>>> mb = MerlinBinary(
...         binfns='medipixfile.mib', hdrfn='medipixfile.hdr', dmfns=['dmfile.dm3'])

>>> from fpd.synthetic_data import shift_im
>>> def shift_func(image, scanYind, scanXind, shift_array):
>>>     syx = shift_array[:, scanYind, scanXind]
>>>     new_im = shift_im(image, syx, method='fourier')
>>>     return new_im

>>> mb.write_hdf5(func=shift_func, func_kwargs={'shift_array': syx})

where syx is an array of shifts in pixels along the y- and x-axis, of shape (2, scanY, scanX). Multiple keyword arguments may be passed through func_kwargs in a dictionary.

The shift array may be generated by analysing the data before conversion using the memmory mapping interface or other methods described above. Many functions in the fpd_processing module, such as the center_of_mass() function, or other methods may be used to generate the array. Most functions in the fpd_processing module can process data from memory mapped arrays.

Conversion of the data happens as the file is read and so is serial and single threaded which will slow down the conversion. However, it may still be useful to align images as described here for future processing which assumes the images are aligned (note that many algorithms do not assume this), or to apply more general data filtering. Note that the data type of the function return will be coerced to the original type.

Detector Axis and Colour Mode

The Medipix3 chip (the actual detector with which the Merlin readout system works) can come in a several different configurations and can be operated in number of different modes with different numbers of counting thresholds. Its four modes of operation are:

  • Single pixel mode (SPM), in which each pixel operates independently of one another and up two seperate counting thresholds can be set.
  • Charge summing mode (CSM), in which each pixel operates independently of one another and in which threshold 1 is “gated” by threshold 0. If using both counters, then the images acquired with both threshold can be saved.
  • Colour single pixel mode (Colour SPM), in which one in every four pixels is operational, with the three neighbouring pixels donating their two thresholds and counters to the operational pixel, so that up to eight independent counting thresholds can be set.
  • Colour charge summing mode (Colour CSM), in which, again, one in every four pixels is operational, with up to four thresholds, each of which is gated by another threshold, being available. If acquiring data with both counters, then the images acquired with the gating thresholds can also be saved, such that there are eight images (which can all be acquired with different, but not necessarily independent thresholds) for each frame saved.

The standard configuration has every pixel on the chip bonded to the sensor, resulting in a device with 256x256 55um pitch pixels (which can then be tiled together to create larger detector, e.g. 512x512 or 256x1024 pixels). It is possible to bond only one in every four pixels to the sensor, creating a device which has 128x128 110um pitch pixels. Such devices are intended to operate in one the Medipix3’s colour modes as the three unbonded pixels share their thresholds and counters with the pixel that is bonded.

The Merlin system is unaware of how many pixels on the chip are actually bonded to the sensor and displays and saves data on the basis of what mode of operation is chosen. If using a 256x256 55um pitch device in SPM or CSM, then the Merlin system will display 256x256 pixel images and save the data accordingly. If operating in Colour SPM or Colour CSM with the same device, then the Merlin system will display and save images in a 128x128 pixel format (with the signal deposited in three out of four pixels being disregarded).

Similiarly, if using a 128x128 110um pitch device in SPM or CSM, then the Merlin system will display and save images as having 256x256 pixels, but only one in every four pixels will register counts. When using such a device in Colour SPM and Colour CSM, then the images will be displayed and saved in a 128x128 pixel format by the Merlin system.

To simplify data conversion, the MerlinBinary class is written such that it determines the size of the dataset using input from the user and checking the number of thresholds and operational mode saved in the header data. When converting data, detY and detX should be the dimensions of the detector as determined by the Merlin system on the basis of the operational mode, not the physical number of bonded pixels.

For example, if using a 256x256 55um pitch device in SPM, or CSM then detY and detX should both be set equal to 256. If using the same device in Colour SPM or Colour CSM, then these two arguments should be set to 128. If using a 128x128 110um pitch device in SPM or CSM, then detY and detX should still be set to 256, while when using such a device in Colour SPM or Colour CSM then these arguments should again be set to 128.

TEM Data

CTEM data can be processed by setting the y-axis size to 1, and the x-axis size to the number of images. In this case, ds_start_skip and row_end_skip should be set to 0. For example:

>>> mb = MerlinBinary(
...         binfns='medipixfile.mib', hdrfn='medipixfile.hdr', dmfns=['dmfile.dm3'],
...         row_end_skip=0, ds_start_skip=0, scanYalu=(1, 'na', 'na'),
...         scanXalu=(16, 'Time', 'seconds'))

If the scanXalu axis is None, it is automatically generated from the header file with axis being the index. Note that the header file contains the maximum number of frames and the real number of frames could be smaller by design or error. For example:

>>> mb = MerlinBinary(
...         binfns='medipixfile.mib', hdrfn='medipixfile.hdr', dmfns=['dmfile.dm3'],
...         row_end_skip=0, ds_start_skip=0, scanYalu=(1, 'na', 'na'),
...         scanXalu=(None, 'images', 'index'))

Here, the y-axis is singular, which replicates the way in which the binary data is stored.

Updating Files

There are currently two methods of updating file contents: modifying the data, and updating calibrations.

Modifying data

A new copy of the src file may be made with an arbitrary function, func, applied to the multidimensional dataset using the make_updated_fpd_file() function.

For example, below we determine the positions of the direct beam and then displace each image by the amount required to centre the dataset. For simplicity, we use the center_of_mass() function with the minimum parameters set, which will calculate the centre of mass of the entire image (in general, additional parameters should be set to obtain the best results).

>>> from fpd.fpd_file import fpd_to_tuple, make_updated_fpd_file
>>> from fpd.fpd_processing import center_of_mass
>>> from fpd.synthetic_data import shift_im

>>> src = 'original_file.hdf5'
>>> dts = 'aligned_file.hdf5'

>>> fpd_nt = fpd_to_tuple(src)
>>> ds = fpd_nt.fpd_data.data

>>> com_yx = center_of_mass(ds, 16, 16)
>>> syx = com_yx - np.percentile(com_yx, 50, axis=(-2, -1))[..., None, None]

>>> def shift_func(image, scanYind, scanXind, shift_array):
...     syx = shift_array[:, scanYind, scanXind]
...     new_im = shift_im(image, -syx, fill_value=0).astype(int)
...     return new_im

>>> new_fn = make_updated_fpd_file(src, dst, func=shift_func,
...                               func_kwargs=dict(shift_array=syx),
...                               update_sum_im=True, update_sum_dif=True)

Updating Calibrations

Calibrations may be specified at the time of file creation, or updated afterwards using the update_calibration() function. All datasets are updated with the exception of the DM files and data.

In the example below, we update the calibration for the scan and detector axes in-place, i.e. the file is modified. The detector axis units are set to None so that the units in the file are unchanged, while the numerical values are updated.

>>> from fpd.fpd_file import update_calibration

>>> filename = 'my_file.hdf5'

>>> update_calibration(filename, scan=[0.2, 'nm'],
                        detector=[5.1, None], colour=[None, None])

File Contents and Access

The Merlin images and metadata, basic analysis of the images, and any DM files are parsed and embedded in a single HDF5 file in multiple EMD datasets and can be accessed and viewed in any HDF5 compatible software.

The simplest way to access the data in Python is to use the fpd_to_tuple() function. This function parses the HDF5 file and extracts all EMD datasets as a hierarchical namedtuple of datasets, names, units, and axes, all of which may be navigated by tab completion.

To read everything from a file called filename and then access the 4D dataset:

>>> import fpd.fpd_file as fpdf
>>> fpd_nt = fpdf.fpd_to_tuple(filename)

>>> ds = fpd_nt.fpd_data.data

To return only specific datasets, they may be specified by group_names:

>>> fpd_nt = fpdf.fpd_to_tuple(filename,
...     group_names=['fpd_data', 'fpd_sum_im', 'fpd_sum_dif'])

If group_names is None, all groups are returned.

The attributes of each dataset are also extracted (‘name’ and ‘units’), as are each dimension (eg ‘dim1’, ‘dim2’, …). The structure of the tuple is the following, where ‘dim1’ is another named tuple, following the same style:

>>> fpd_nt.fpd_data.data
>>> fpd_nt.fpd_data.name
>>> fpd_nt.fpd_data.unit
>>> fpd_nt.fpd_data.dim1.data
>>> fpd_nt.fpd_data.dim1.name
>>> fpd_nt.fpd_data.dim1.units

Control over whether the datasets are loaded into memory or as HDF5 objects is provided by the nd_max and gigabytes_max arguments. The former sets an upper limit to the dimensionality of the array that will be loaded into memory, while the latter is an upper limit of the data size. Both must be satisfied for the data to be loaded into memory.

Lower level Python access may be obtained with the h5py module.

These data may be then processed using any of the various processing functions that are contained in the fpd_processing module or by any other means.

Conversion to other formats

Various conversion utilities are included the fpd_file module.

HyperSpy Signals

All EMD datasets in an HDF5 file can be converted to ‘Lazy’ Hyperspy signals using fpd_to_hyperspy():

>>> from fpd.fpd_file import fpd_to_hyperspy
>>> fpd_signals = fpd_to_hyperspy('file.hdf5')
>>> im = fpd_signals.fpd_data

Here, fpd_signals is a named tuple containing all recognised EMD datasets as named fields.

FPD data to binary

hdf5_fpd_to_bin()

To be continued…

Visualisation

Built-in: DataBrowser

HyperSpy:

To be continued…