gpm.bucket package#


gpm.bucket.analysis module#

This module provides utilities for GPM Geographic Bucket Analysis.

gpm.bucket.analysis.pl_add_geographic_bins(df, xbin_column, ybin_column, bin_spacing, x_column='lon', y_column='lat')[source]#
gpm.bucket.analysis.pl_df_to_xarray(df, xbin_column, ybin_column, bin_spacing)[source]#

gpm.bucket.dataset module#

This module provides utilities to write a GPM Geographic Bucket Apache Parquet Dataset.

gpm.bucket.dataset.write_partitioned_dataset(df, base_dir, partitioning, filename_prefix='part', **writer_kwargs)[source]#

… module#

This module provides utilities to search GPM Geographic Bucket files.

…, parallel=True)[source]#

Retrieve a dictionary with the list of filepaths for each bucket bin.

gpm.bucket.processing module#

This module provides utilities for the creation of GPM Geographic Buckets.

gpm.bucket.processing.assign_spatial_partitions(df, xbin_name, ybin_name, xbin_size, ybin_size, x_column='lat', y_column='lon')[source]#

Add partitioning bin columns to dataframe.

Works with both dask and pandas DataFrames.


Drop undesired columns like dataset dimensions without coordinates.


Convert an xr.Dataset to a dask DataFrame.

This function expects an xr.Dataset with only 2D spatial DataArrays.


Convert an xr.Dataset to a pandas DataFrame.

This function expects an xr.Dataset with only 2D spatial DataArrays.


Convert ‘object’ type columns to pyarrow strings.


Ensure the dataset has unique chunking.

Conversion to a dask DataFrame requires unique chunking. If the xr.Dataset does not have unique chunking, ds.unify_chunks() is applied.

Variable chunks can be visualized with:

    for var in ds.data_vars:
        print(var, ds[var].chunks)

gpm.bucket.processing.estimate_row_group_size(df, size='200MB')[source]#

Estimate row_group_size parameter based on the desired row group memory size.

row_group_size is a Parquet argument controlling the number of rows in each Apache Parquet File Row Group.
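The idea can be sketched as dividing the target row group memory size by the average memory footprint of one row. This is only an illustration and not the gpm-api implementation: the helper names `parse_size` and `rows_per_group` are hypothetical, and reading "MB" as 10^6 bytes is an assumption.

```python
import re

# Decimal byte multipliers (assumption: "MB" means 10**6 bytes).
_UNITS = {"B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9}


def parse_size(size):
    """Convert a string such as '200MB' into a number of bytes (hypothetical helper)."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([KMG]?B)\s*", size.upper())
    if match is None:
        raise ValueError(f"Cannot parse size: {size!r}")
    value, unit = match.groups()
    return int(float(value) * _UNITS[unit])


def rows_per_group(total_bytes, n_rows, size="200MB"):
    """Estimate how many rows fit in a row group of the requested memory size."""
    bytes_per_row = total_bytes / n_rows
    return max(1, int(parse_size(size) / bytes_per_row))


# Hypothetical dataframe: 1_000_000 rows occupying 80_000_000 bytes in memory.
estimated = rows_per_group(total_bytes=80_000_000, n_rows=1_000_000)
print(estimated)  # 2500000
```

With a real dataframe, the total memory footprint and the number of rows would be measured on the dataframe itself rather than passed in by hand.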

gpm.bucket.processing.get_bin_partition(values, bin_size)[source]#

Compute the bin partition values.

Parameters:

  • values (float or array-like) – Values to assign to bins.

  • bin_size (float) – Bin size.

Returns:

Bin value – Bin partition value associated with each input value.

Return type:

float or array-like
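A minimal sketch of one common way to compute such a partition value follows. It assumes each bin is labelled by its lower edge, which the text above does not confirm, so the actual gpm-api convention may differ. The helper name `bin_partition` is hypothetical.

```python
import math


def bin_partition(value, bin_size):
    """Return the lower edge of the bin containing value (hypothetical helper)."""
    return bin_size * math.floor(value / bin_size)


# With 15-degree bins, each longitude maps to the lower edge of its bin.
print(bin_partition(37.2, 15))   # 30
print(bin_partition(-37.2, 15))  # -45
print(bin_partition(0.0, 15))    # 0
```

Using the floor rather than truncation keeps negative coordinates in the correct bin: -37.2 falls in the bin starting at -45, not at -30.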


Get the dataframe columns which have ‘object’ type.


Check if a dataset has unique chunking.

gpm.bucket.readers module#

This module provides utilities to read GPM Geographic Bucket Apache Parquet files.

gpm.bucket.readers.read_partitioned_dataset(filepath, columns=None)[source]#

gpm.bucket.utils module#

This module provides utilities to manipulate GPM Geographic Buckets.

gpm.bucket.utils.add_bin_column(df, column, bin_size, vmin, vmax, bin_name, add_midpoint=True)[source]#
gpm.bucket.utils.add_spatial_bins(df, x='x', y='y', xbin_size=1, ybin_size=1, xlim=(-180, 180), ylim=(-90, 90), xbin_name='xbin', ybin_name='ybin', add_bin_midpoint=True)[source]#
gpm.bucket.utils.create_spatial_bin_empty_df(xbin_size=1, ybin_size=1, xlim=(-180, 180), ylim=(-90, 90), xbin_name='xbin', ybin_name='ybin')[source]#

Create empty spatial bin DataFrame.
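To illustrate what an empty spatial bin grid might look like, the sketch below enumerates every (xbin, ybin) combination for a given extent and bin size. It assumes bins are identified by their midpoints, which the `add_bin_midpoint` argument above suggests but the text does not confirm. The helper `bin_midpoints` is hypothetical, and plain Python lists stand in for a DataFrame.

```python
def bin_midpoints(vmin, vmax, bin_size):
    """Midpoints of equal-width bins covering [vmin, vmax] (hypothetical helper)."""
    n_bins = int(round((vmax - vmin) / bin_size))
    return [vmin + (i + 0.5) * bin_size for i in range(n_bins)]


# 90-degree bins keep the example small; the documented default bin size is 1.
x_mid = bin_midpoints(-180, 180, 90)  # [-135.0, -45.0, 45.0, 135.0]
y_mid = bin_midpoints(-90, 90, 90)    # [-45.0, 45.0]

# One row per (xbin, ybin) combination: an empty grid to be filled later.
grid = [{"xbin": x, "ybin": y} for x in x_mid for y in y_mid]
print(len(grid))  # 8
```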

gpm.bucket.writers module#

This module provides utilities to write GPM Geographic Bucket Apache Parquet files.

gpm.bucket.writers.split_list_in_blocks(values, block_size)[source]#
gpm.bucket.writers.write_granule_bucket(src_filepath, bucket_base_dir, ds_to_df_converter, xbin_size=15, ybin_size=15, xbin_name='lonbin', ybin_name='latbin', row_group_size='500MB', **writer_kwargs)[source]#

Write a geographically partitioned Parquet Dataset of a GPM granule.

Parameters:

  • src_filepath (str) – File path of the granule to store in the bucket archive.

  • bucket_base_dir (str) – Base directory of the per-granule bucket archive.

  • ds_to_df_converter (callable) – Function that takes a granule filepath, opens the granule, and returns a pandas or dask DataFrame.

  • xbin_name (str, optional) – Name of the binned column used to partition the data along the x dimension. The default is "lonbin".

  • ybin_name (str, optional) – Name of the binned column used to partition the data along the y dimension. The default is "latbin".

  • xbin_size (int) – Longitude bin size. The default is 15.

  • ybin_size (int) – Latitude bin size. The default is 15.

  • row_group_size (int or str, optional) – Maximum number of rows in each written Parquet row group. If specified as a string (e.g. "500MB"), the equivalent number of rows is estimated. The default is "500MB".

  • **writer_kwargs (dict) – Optional arguments passed to the pyarrow Dataset Writer. Common arguments are 'format' and 'use_threads'. The default file format is 'parquet'. The default use_threads is True, which enables multithreaded file writing. More information available at


Examples of partitioning:

  • Partition by 1° pixels: 64800 directories (360*180)

  • Partition by 5° pixels: 2592 directories (72*36)

  • Partition by 10° pixels: 648 directories (36*18)

  • Partition by 15° pixels: 288 directories (24*12)
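The directory counts listed above follow from dividing the global extent (360° of longitude, 180° of latitude) by the bin size. A short sketch of that arithmetic, assuming square bins that divide the extent evenly (the helper name `n_partitions` is hypothetical):

```python
def n_partitions(bin_size, x_extent=360, y_extent=180):
    """Number of bins (directories) for a global grid with square bins."""
    n_x = x_extent // bin_size
    n_y = y_extent // bin_size
    return n_x, n_y, n_x * n_y


for bin_size in (1, 5, 10, 15):
    n_x, n_y, total = n_partitions(bin_size)
    print(f"{bin_size}° bins: {n_x} x {n_y} = {total} directories")
```

Larger bins produce fewer, larger directories. The default of 15° used by write_granule_bucket gives 288 partitions.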

Module contents#

This directory defines the GPM-API geographic binning toolbox.