xbout.lazyload¶
Fast lazy-loading of BOUT++ multi-file datasets.
Overview¶
BOUT++ writes output distributed across many NetCDF files, one per processor. The standard xarray.open_mfdataset opens every file to read metadata before constructing the dataset, which is slow when there are hundreds of files on a parallel filesystem where each file open incurs a metadata server round-trip.
This module avoids that overhead by reading metadata from a single file and constructing a lazy dask-backed xarray Dataset without opening any other files. Data is only read from disk when explicitly requested via .compute() or .load().
Method¶
All BOUT++ output files share the same array shapes and metadata. The processor layout (NXPE x NYPE grid) and array dimensions are read from the first file (BOUT.dmp.0.nc). From this, the slice of the global array stored in each processor’s file is determined, accounting for MXG/MYG guard cells.
A dask task graph is constructed as a Python dict, with one entry per processor file per variable. Each task uses dask’s internal getter() function, which calls __getitem__ on a LazyFileArray object. LazyFileArray defers all file I/O to h5py, reading only the requested hyperslab when the task is executed.
Slice fusion¶
Dask’s _optimize_slices pass (dask/array/optimization.py) recognizes chained getter() tasks and fuses them into a single getter() call with composed slices. Each variable’s task graph uses two chained tasks per file:
lazy task: LazyFileArray object slice task: (getter, lazy_task_key, boundary_slices, False, False)
The boundary slice removes MXG/MYG guard cells from the raw file data as needed, depending on keep_xboundaries and keep_yboundaries. When the user slices the resulting DataArray, dask fuses the user slice with the boundary slice into a single composed slice, which is passed through getter() to LazyFileArray.__getitem__() and then directly to h5py as an HDF5 hyperslab selection. This means only the requested bytes are read from disk, with a single file open per chunk per compute() call.
Usage¶
ds = lazy_open_boutdataset(‘/path/to/BOUT/dmp/files/’) ds = lazy_open_boutdataset(‘/path/to/files/’, keep_xboundaries=True)
# Data is not read until here: data = ds[‘Ne’].isel(t=slice(10, 20)).compute()
Functions
|
Open a multi-file dataset by only opening one file. |
|
Identify processor layout and the array slices to be extracted from each file. |
|
Creates a lazy-loaded array, gathering data from a collection of NetCDF files. |
Classes
|
Presents a numpy-like interface that defers reads to HDF5 hyperslabs. |