Accessing and Converting Merlin Binary Files¶
In (most) electron microscopes with a Medipix3 detector, the data is acquired
using the Merlin readout system. This outputs a header file and the image data
in a one or more binary files. These can be converted into a more convenient
HDF5-based EMD (https://emdatasets.com/format) compatible format by using the
MerlinBinary
class. The class also allows quick access
to the data without conversion.
To raw data is stored flat on disk, so to navigate and store the data in the correct shape, the scan parameters can be supplied. These are not stored in the Merlin files, but can be parsed (along with most other scan information) from a Digital Micrograph (DM) image acquired during the scan. If a DM file is not recorded, the information can be specified as parameter, either manually or by parsing it from another image.
Using a DM File¶
With a DM file, the class can be initialised using the following:
- Binary file:
medipixfile.mib
- Header file:
medipixfile.hdr
- DM-file :
dmfile.dm3
>>> from fpd.fpd_file import MerlinBinary
>>> mb = MerlinBinary(binfns='medipixfile.mib', hdrfn='medipixfile.hdr', dmfns=['dmfile.dm3'])
One or more DM files can be specified and all files will be parsed and stored as EMD groups and in raw binary form, allowing the native DM file to be extracted unmodified and read in Digital Micrograph. Only the first DM file is used for the scan information.
Without a DM File¶
If you do not have any DM files, the scan parameters can be specified through the
scanXalu
and scanYalu
arguments which are tuples containing for each axis
the axis, the label, and units. If axis is an integer, it is interpreted
as the axis length and a new integer axis is generated. Otherwise, axis
should be an iterable that will be used as the axis directly.
>>> import numpy as np
>>> mb = MerlinBinary(
... binfns='medipixfile.mib', hdrfn='medipixfile.hdr',
... scanXalu=(128, 'xaxis', 'pixels'), scanYalu=(np.arange(256)*0.1, 'yaxis', 'nm'))
If another image has been recorded with axis information and it can be read by HyperSpy, it can be parsed for the axes information using:
>>> from hyperspy.io import load
>>> im = load('image_filename')
>>> scanXalu, scanYalu = [(m.axis, m.name, m.units) for m in im.axes_manager.signal_axes]
Specifying the detector axes¶
In most systems, it is not trivial to determine the detector axes properties and, often, a calibration based upon known image dimensions is used. For example, in STEM the brightfield disc may be used to set the reciprocal space pixel size.
Accessing the data pre-conversion using the memmory mapping or other means discussed below
allows the pixel size to be determined from the data. Once done, the detector axes may be
specified in a similar manner as to file conversion without a DM file, discussed above. This is
dome by specifying the detXalu
and detYalu
arguments. For example, to set the y-axis
as pixels, and the x-axis in nm:
>>> import numpy as np
>>> mb = MerlinBinary(
... binfns='medipixfile.mib', hdrfn='medipixfile.hdr', dmfns=['dmfile.dm3'],
... detYalu=(256, 'y-axis', 'pixels'), detXalu=(np.arange(256)*0.1, 'x-axis', 'nm'))
Without any scan information¶
If no scan information is passed, and a single binary file is provided, all images in the file are processed as though the shape was [1, nimages, detY, detX] (as are TEM Data, discussed below). In this case, row_end_skip should be set to 0 unless there are a known number of unwanted images at the end of the dataset. For example:
Flyback¶
Depending on how the Merlin system is triggered, when acquiring data a number of
extra pixels can be included at the end of each row due to triggering whilst
the beam is moving to the start of the next row of the scan, the flyback.
These extra pixels can be removed using the row_end_skip
argument, which is
by default is set to 1
. Faster acquisitions and larger flyback times will
produce more extra pixels. To use a different value:
>>> mb = MerlinBinary(
... binfns='medipixfile.mib', hdrfn='medipixfile.hdr', dmfns=['dmfile.dm3'],
... row_end_skip=2)
The number of extra pixels at the end of the data after taking account of flyback
pixels is counted and printed at the end of file conversion. If this is unusually
high (there are often a couple of extra pixels), the row_end_skip
parameter
may be incorrect.
Accessing Files Pre-Conversion¶
Conversation to HDF5 is the most flexible solution, however, it is often useful to access a file before spending the several minutes it takes to convert the data to HDF5. This is particularly useful when quickly checking a dataset after acquisition or preparing a mask for use in removal of bad pixels.
The MerlinBinary
class provides a number of methods to
access the data in this way. The most useful for common data acquisition modes is the
get_memmap()
method.
>>> mb = MerlinBinary(
... binfns='medipixfile.mib', hdrfn='medipixfile.hdr', dmfns=['dmfile.dm3'],
... row_end_skip=2)
>>> mm = mb.get_memmap()
mm is a memory mapped file with an array interface. It can be indexed in the usual way,
passed to the processing functions in the fpd_processing
module (or elsewhere)
, or plotted using the DataBrowser
. For example:
Where nav_im specifies a navigation image, and is set here to be blank for speed. The memory mapped data could be processed to generate one and will do so automatically if nav_im is not specified. If this navigation image is to be reused, it is best to assign it to a reusable variable (i.e. nav_im = …).
Memory mapping only works for certain datasets, and the class will issue appropriate errors
when a file or files cannot be memory mapped. A much slower but more general method is
accessing the data through the array interface of the MerlinBinary
class.
For example:
>>> mb = MerlinBinary(
... binfns='medipixfile.mib', hdrfn='medipixfile.hdr', dmfns=['dmfile.dm3'],
... row_end_skip=2)
>>> d = mb[0, :10]
The data is in-memory, so care must be taken over the amount of the available system memory that is used.
Conversion to HDF5¶
The data may be converted to hdf5 with the write_hdf5()
method.
Data Chunking¶
An important parameter is data chunking, which defines how the data is
structured in the HDF5 file. This can have a large influence on the I/O
performance and compression. The default chunking is to store the dataset in cubes
of 16x16[x1]x16x16 pixel chunks. Different settings may be more appropriate for other uses.
The chunking can be set by scan_chunks
, im_chunks
, or chunks
, with
chunks
taking precedence over the other chunk parameters. For example:
>>> mb = MerlinBinary(
... binfns='medipixfile.mib', hdrfn='medipixfile.hdr', dmfns=['dmfile.dm3'])
>>> mb.write_hdf5(chunks=(8, 8, 256, 256))
The data is repacked by default (repack=True
), unless memory mapping the file is possible.
Repacking stores the fpd data in a temporary file, then copies it chunk-by-chunk to the final
file. It therefore requires more disk space during conversion, but it is RAM efficient. The
final file sizes are typically smaller and processing the data is faster. If memory mapping
the file is possible, this will automatically happen for writing the data in chunks, avoiding
the intermediate file.
Compression¶
The default compression method used is ‘gzip’. This method is built-in to the hdf5 library, so files made with this compression method may be opened anywhere.
Starting with fpd version 0.2.1, the compression method may be set as one of [‘gzip’, ‘LZ4’]
through the compression
parameter of the write_hdf5()
method.
This parameter only affects the main dataset. All other datasets are relatively small and so
‘gzip’ is used so that some idea of the data set may be obtained without plugins, if needed.
The LZ4 plugin is provided by the hdf5plugin package.
This compression method creates slightly larger files than gzip, but is very much faster.
To use LZ4, the compression
parameter must be set:
>>> mb = MerlinBinary(
... binfns='medipixfile.mib', hdrfn='medipixfile.hdr', dmfns=['dmfile.dm3'])
>>> mb.write_hdf5(compression='LZ4')
Compared with gzip, tests with LZ4 typical gave the following (the numbers are likely to vary with hardware spec):
- 2.2x faster read speed
- 2.8x faster write speed
- 1.7x faster file creation
Data Masking¶
There are sometimes bad pixels (dead or hot or noisy or contaminated …) on a detector and these can be interpolated out during conversion by specifying a mask. In the following example, we generate a very simple mask using memory mapped file access to a specific image (it could be a flat field) and apply during the conversion. In general, the mask generation procedure may be more complex.
>>> mb = MerlinBinary(
... binfns='medipixfile.mib', hdrfn='medipixfile.hdr', dmfns=['dmfile.dm3'])
>>> mm = mb.get_memmap()
>>> mask = mm[23, 45] < 6
>>> mb.write_hdf5(mask=mask)
Locations where the mask is True are replaced using cubic interpolation of the non-masked values. The mask is stored in the HDF5 file in the usual EMD format.
Data Manipulation¶
Arbitrary functions may be applied to the image data during conversion to hdf5. This
is enabled by passing a function func
to the write_hdf5()
method.
The function should be of the form func(image, scanYind, scanXind, **kwargs), where image
is a single detector image and scanYind and scanXind are the scan pixel indices.
kwargs are optional keyword argument pairs.
An example of shifting all images by a scan-position dependent value is:
>>> mb = MerlinBinary(
... binfns='medipixfile.mib', hdrfn='medipixfile.hdr', dmfns=['dmfile.dm3'])
>>> from fpd.synthetic_data import shift_im
>>> def shift_func(image, scanYind, scanXind, shift_array):
>>> syx = shift_array[:, scanYind, scanXind]
>>> new_im = shift_im(image, syx, method='fourier')
>>> return new_im
>>> mb.write_hdf5(func=shift_func, func_kwargs={'shift_array': syx})
where syx
is an array of shifts in pixels along the y- and x-axis, of shape
(2, scanY, scanX). Multiple keyword arguments may be passed through func_kwargs
in a dictionary.
The shift array may be generated by analysing the data before conversion using the
memmory mapping interface or other methods described above. Many functions in the
fpd_processing
module, such as the center_of_mass()
function, or other methods may be used to generate the array. Most functions in the
fpd_processing
module can process data from memory mapped arrays.
Conversion of the data happens as the file is read and so is serial and single threaded which will slow down the conversion. However, it may still be useful to align images as described here for future processing which assumes the images are aligned (note that many algorithms do not assume this), or to apply more general data filtering. Note that the data type of the function return will be coerced to the original type.
Detector Axis and Colour Mode¶
The Medipix3 chip (the actual detector with which the Merlin readout system works) can come in a several different configurations and can be operated in number of different modes with different numbers of counting thresholds. Its four modes of operation are:
- Single pixel mode (SPM), in which each pixel operates independently of one another and up two seperate counting thresholds can be set.
- Charge summing mode (CSM), in which each pixel operates independently of one another and in which threshold 1 is “gated” by threshold 0. If using both counters, then the images acquired with both threshold can be saved.
- Colour single pixel mode (Colour SPM), in which one in every four pixels is operational, with the three neighbouring pixels donating their two thresholds and counters to the operational pixel, so that up to eight independent counting thresholds can be set.
- Colour charge summing mode (Colour CSM), in which, again, one in every four pixels is operational, with up to four thresholds, each of which is gated by another threshold, being available. If acquiring data with both counters, then the images acquired with the gating thresholds can also be saved, such that there are eight images (which can all be acquired with different, but not necessarily independent thresholds) for each frame saved.
The standard configuration has every pixel on the chip bonded to the sensor, resulting in a device with 256x256 55um pitch pixels (which can then be tiled together to create larger detector, e.g. 512x512 or 256x1024 pixels). It is possible to bond only one in every four pixels to the sensor, creating a device which has 128x128 110um pitch pixels. Such devices are intended to operate in one the Medipix3’s colour modes as the three unbonded pixels share their thresholds and counters with the pixel that is bonded.
The Merlin system is unaware of how many pixels on the chip are actually bonded to the sensor and displays and saves data on the basis of what mode of operation is chosen. If using a 256x256 55um pitch device in SPM or CSM, then the Merlin system will display 256x256 pixel images and save the data accordingly. If operating in Colour SPM or Colour CSM with the same device, then the Merlin system will display and save images in a 128x128 pixel format (with the signal deposited in three out of four pixels being disregarded).
Similiarly, if using a 128x128 110um pitch device in SPM or CSM, then the Merlin system will display and save images as having 256x256 pixels, but only one in every four pixels will register counts. When using such a device in Colour SPM and Colour CSM, then the images will be displayed and saved in a 128x128 pixel format by the Merlin system.
To simplify data conversion, the MerlinBinary class is written such that it determines the
size of the dataset using input from the user and checking the number of thresholds and
operational mode saved in the header data. When converting data, detY
and detX
should be the dimensions of the detector as determined by the Merlin system on the basis
of the operational mode, not the physical number of bonded pixels.
For example, if using a 256x256 55um pitch device in SPM, or CSM then detY
and detX
should both be set equal to 256. If using the same device in Colour SPM or Colour CSM, then
these two arguments should be set to 128. If using a 128x128 110um pitch device
in SPM or CSM, then detY
and detX
should still be set to 256, while when using such
a device in Colour SPM or Colour CSM then these arguments should again be set to 128.
TEM Data¶
CTEM data can be processed by setting the y-axis size to 1, and the x-axis size
to the number of images. In this case, ds_start_skip
and row_end_skip
should be set to 0. For example:
>>> mb = MerlinBinary(
... binfns='medipixfile.mib', hdrfn='medipixfile.hdr', dmfns=['dmfile.dm3'],
... row_end_skip=0, ds_start_skip=0, scanYalu=(1, 'na', 'na'),
... scanXalu=(16, 'Time', 'seconds'))
If the scanXalu
axis is None, it is automatically generated from the header
file with axis being the index. Note that the header file contains the
maximum number of frames and the real number of frames could be smaller by
design or error. For example:
>>> mb = MerlinBinary(
... binfns='medipixfile.mib', hdrfn='medipixfile.hdr', dmfns=['dmfile.dm3'],
... row_end_skip=0, ds_start_skip=0, scanYalu=(1, 'na', 'na'),
... scanXalu=(None, 'images', 'index'))
Here, the y-axis is singular, which replicates the way in which the binary data is stored.
Updating Files¶
There are currently two methods of updating file contents: modifying the data, and updating calibrations.
Modifying data¶
A new copy of the src
file may be made with an arbitrary function, func
,
applied to the multidimensional dataset using the make_updated_fpd_file()
function.
For example, below we determine the positions of the direct beam and then
displace each image by the amount required to centre the dataset. For
simplicity, we use the center_of_mass()
function
with the minimum parameters set, which will calculate the centre of mass of
the entire image (in general, additional parameters should be set to obtain
the best results).
>>> from fpd.fpd_file import fpd_to_tuple, make_updated_fpd_file
>>> from fpd.fpd_processing import center_of_mass
>>> from fpd.synthetic_data import shift_im
>>> src = 'original_file.hdf5'
>>> dts = 'aligned_file.hdf5'
>>> fpd_nt = fpd_to_tuple(src)
>>> ds = fpd_nt.fpd_data.data
>>> com_yx = center_of_mass(ds, 16, 16)
>>> syx = com_yx - np.percentile(com_yx, 50, axis=(-2, -1))[..., None, None]
>>> def shift_func(image, scanYind, scanXind, shift_array):
... syx = shift_array[:, scanYind, scanXind]
... new_im = shift_im(image, -syx, fill_value=0).astype(int)
... return new_im
>>> new_fn = make_updated_fpd_file(src, dst, func=shift_func,
... func_kwargs=dict(shift_array=syx),
... update_sum_im=True, update_sum_dif=True)
Updating Calibrations¶
Calibrations may be specified at the time of file creation, or updated afterwards
using the update_calibration()
function. All datasets are
updated with the exception of the DM files and data.
In the example below, we update the calibration for the scan and detector axes
in-place, i.e. the file is modified. The detector axis units are set to None
so that the units in the file are unchanged, while the numerical values are updated.
>>> from fpd.fpd_file import update_calibration
>>> filename = 'my_file.hdf5'
>>> update_calibration(filename, scan=[0.2, 'nm'],
detector=[5.1, None], colour=[None, None])
File Contents and Access¶
The Merlin images and metadata, basic analysis of the images, and any DM files are parsed and embedded in a single HDF5 file in multiple EMD datasets and can be accessed and viewed in any HDF5 compatible software.
The simplest way to access the data in Python is to use the fpd_to_tuple()
function. This function parses the HDF5 file and extracts all EMD datasets
as a hierarchical namedtuple
of datasets, names, units, and axes, all of which may be navigated by tab completion.
To read everything from a file called filename
and then access the 4D dataset:
>>> import fpd.fpd_file as fpdf
>>> fpd_nt = fpdf.fpd_to_tuple(filename)
>>> ds = fpd_nt.fpd_data.data
To return only specific datasets, they may be specified by group_names
:
>>> fpd_nt = fpdf.fpd_to_tuple(filename,
... group_names=['fpd_data', 'fpd_sum_im', 'fpd_sum_dif'])
If group_names
is None, all groups are returned.
The attributes of each dataset are also extracted (‘name’ and ‘units’), as are each dimension (eg ‘dim1’, ‘dim2’, …). The structure of the tuple is the following, where ‘dim1’ is another named tuple, following the same style:
>>> fpd_nt.fpd_data.data
>>> fpd_nt.fpd_data.name
>>> fpd_nt.fpd_data.unit
>>> fpd_nt.fpd_data.dim1.data
>>> fpd_nt.fpd_data.dim1.name
>>> fpd_nt.fpd_data.dim1.units
Control over whether the datasets are loaded into memory or as HDF5 objects is
provided by the nd_max
and gigabytes_max
arguments. The former sets an
upper limit to the dimensionality of the array that will be loaded into memory,
while the latter is an upper limit of the data size. Both must be satisfied for
the data to be loaded into memory.
Lower level Python access may be obtained with the h5py module.
These data may be then processed using any of the various processing functions
that are contained in the fpd_processing
module or by any other
means.
Conversion to other formats¶
Various conversion utilities are included the fpd_file
module.
HyperSpy Signals¶
All EMD datasets in an HDF5 file can be converted to ‘Lazy’ Hyperspy signals
using fpd_to_hyperspy()
:
>>> from fpd.fpd_file import fpd_to_hyperspy
>>> fpd_signals = fpd_to_hyperspy('file.hdf5')
>>> im = fpd_signals.fpd_data
Here, fpd_signals
is a named tuple containing all recognised EMD datasets
as named fields.
Others¶
To be continued…
find_hdf5_objname_by_attribute()
hdf5_dm_tags_to_dict()
hdf5_dm_to_bin()
hdf5_src_to_file()