xbout.lazyload

Fast lazy-loading of BOUT++ multi-file datasets.

Overview

BOUT++ writes output distributed across many NetCDF files, one per processor. The standard xarray.open_mfdataset opens every file to read metadata before constructing the dataset, which is slow when there are hundreds of files on a parallel filesystem where each file open incurs a metadata server round-trip.

This module avoids that overhead by reading metadata from a single file and constructing a lazy dask-backed xarray Dataset without opening any other files. Data is only read from disk when explicitly requested via .compute() or .load().

Method

All BOUT++ output files share the same array shapes and metadata. The processor layout (NXPE x NYPE grid) and array dimensions are read from the first file (BOUT.dmp.0.nc). From this, the slice of the global array stored in each processor’s file is determined, accounting for MXG/MYG guard cells.

A dask task graph is constructed as a Python dict, with one entry per processor file per variable. Each task uses dask’s internal getter() function, which calls __getitem__ on a LazyFileArray object. LazyFileArray defers all file I/O to h5py, reading only the requested hyperslab when the task is executed.

Slice fusion

Dask’s _optimize_slices pass (dask/array/optimization.py) recognizes chained getter() tasks and fuses them into a single getter() call with composed slices. Each variable’s task graph uses two chained tasks per file:

lazy task: LazyFileArray object slice task: (getter, lazy_task_key, boundary_slices, False, False)

The boundary slice removes MXG/MYG guard cells from the raw file data as needed, depending on keep_xboundaries and keep_yboundaries. When the user slices the resulting DataArray, dask fuses the user slice with the boundary slice into a single composed slice, which is passed through getter() to LazyFileArray.__getitem__() and then directly to h5py as an HDF5 hyperslab selection. This means only the requested bytes are read from disk, with a single file open per chunk per compute() call.

Usage

ds = lazy_open_boutdataset(‘/path/to/BOUT/dmp/files/’) ds = lazy_open_boutdataset(‘/path/to/files/’, keep_xboundaries=True)

# Data is not read until here: data = ds[‘Ne’].isel(t=slice(10, 20)).compute()

Functions

lazy_open_boutdataset(datapath[, ...])

Open a multi-file dataset by only opening one file.

make_chunkinfo(metadata[, keep_xboundaries, ...])

Identify processor layout and the array slices to be extracted from each file.

make_lazy_array(datafilepath, ds, chunkinfo, ...)

Creates a lazy-loaded array, gathering data from a collection of NetCDF files.

Classes

LazyFileArray(filepath, varname, shape, dtype)

Presents a numpy-like interface that defers reads to HDF5 hyperslabs.