gpm.bucket package#

Submodules#

gpm.bucket.analysis module#

This module provides utilities for GPM Geographic Bucket Analysis.

gpm.bucket.analysis.get_cut_lat_breaks_labels(bin_spacing)[source]#
gpm.bucket.analysis.get_cut_lon_breaks_labels(bin_spacing)[source]#
gpm.bucket.analysis.get_lat_bins(bin_spacing)[source]#
gpm.bucket.analysis.get_lat_labels(bin_spacing)[source]#
gpm.bucket.analysis.get_lon_bins(bin_spacing)[source]#
gpm.bucket.analysis.get_lon_labels(bin_spacing)[source]#
gpm.bucket.analysis.get_n_decimals(number)[source]#
gpm.bucket.analysis.pl_add_geographic_bins(df, xbin_column, ybin_column, bin_spacing, x_column='lon', y_column='lat')[source]#
gpm.bucket.analysis.pl_df_to_xarray(df, xbin_column, ybin_column, bin_spacing)[source]#
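A minimal usage sketch (not taken from the library documentation): it assumes pl_add_geographic_bins operates on a polars DataFrame with lon/lat columns and that both functions return their result; the data, bin column names, and bin spacing are illustrative.

    import polars as pl
    from gpm.bucket.analysis import pl_add_geographic_bins, pl_df_to_xarray

    # Illustrative point data with longitude/latitude coordinates
    df = pl.DataFrame(
        {"lon": [10.2, -45.7, 120.1], "lat": [5.3, 60.0, -20.4], "precip": [0.1, 2.5, 0.0]}
    )
    # Add the geographic bin columns (assumed to return the augmented DataFrame)
    df = pl_add_geographic_bins(df, xbin_column="lonbin", ybin_column="latbin", bin_spacing=1)
    # Aggregate the binned DataFrame into an xarray object (assumed behavior)
    ds = pl_df_to_xarray(df, xbin_column="lonbin", ybin_column="latbin", bin_spacing=1)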

gpm.bucket.dataset module#

This module provides utilities to write a GPM Geographic Bucket Apache Parquet Dataset.

gpm.bucket.dataset.write_partitioned_dataset(df, base_dir, partitioning, filename_prefix='part', **writer_kwargs)[source]#
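A hedged usage sketch: it assumes the partitioning argument accepts the list of column names used to partition the dataset (as with pyarrow.dataset.write_dataset); the paths and data are illustrative.

    import pandas as pd
    from gpm.bucket.dataset import write_partitioned_dataset

    # Illustrative DataFrame already carrying the spatial bin columns
    df = pd.DataFrame({"lonbin": [7.5, 22.5], "latbin": [-7.5, 7.5], "precip": [0.4, 1.2]})
    # Write the Apache Parquet dataset partitioned by the bin columns
    write_partitioned_dataset(
        df,
        base_dir="/tmp/gpm_bucket",           # illustrative output directory
        partitioning=["lonbin", "latbin"],    # assumed: partition column names
        filename_prefix="part",
    )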

gpm.bucket.io module#

This module provides utilities to search GPM Geographic Bucket files.

gpm.bucket.io.get_filepaths_by_bin(base_dir, parallel=True)[source]#

Retrieve a dictionary with the list of filepaths for each bucket bin.
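A short usage sketch, assuming the returned dictionary maps each bin identifier to its list of Parquet file paths; the base directory is illustrative.

    from gpm.bucket.io import get_filepaths_by_bin

    # Map each bucket bin to the list of files it contains
    bin_filepaths = get_filepaths_by_bin("/tmp/gpm_bucket", parallel=True)
    for bin_id, filepaths in bin_filepaths.items():
        print(bin_id, len(filepaths))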

gpm.bucket.processing module#

This module provides utilities for the creation of GPM Geographic Buckets.

gpm.bucket.processing.assign_spatial_partitions(df, xbin_name, ybin_name, xbin_size, ybin_size, x_column='lat', y_column='lon')[source]#

Add partitioning bin columns to the dataframe.

Works with both dask and pandas DataFrames.
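A minimal sketch, assuming the function returns the dataframe with the added partition columns; the data and bin names are illustrative, and x_column/y_column are passed explicitly.

    import pandas as pd
    from gpm.bucket.processing import assign_spatial_partitions

    # Illustrative point data
    df = pd.DataFrame({"lon": [10.2, -45.7], "lat": [5.3, 60.0], "precip": [0.1, 2.5]})
    # Add the partitioning bin columns (assumed to return the augmented dataframe)
    df = assign_spatial_partitions(
        df,
        xbin_name="lonbin",
        ybin_name="latbin",
        xbin_size=15,
        ybin_size=15,
        x_column="lon",
        y_column="lat",
    )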

gpm.bucket.processing.convert_size_to_bytes(size)[source]#
gpm.bucket.processing.drop_undesired_columns(df)[source]#

Drop undesired columns like dataset dimensions without coordinates.

gpm.bucket.processing.ds_to_dask_df_function(ds)[source]#

Convert an xr.Dataset to a dask DataFrame.

This function expects an xr.Dataset with only 2D spatial DataArrays.

gpm.bucket.processing.ds_to_pd_df_function(ds)[source]#

Convert an xr.Dataset to a pandas.DataFrame.

This function expects an xr.Dataset with only 2D spatial DataArrays.
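A small sketch of the expected input, assuming both converters accept an xr.Dataset whose variables and coordinates are all 2D spatial arrays; the variable and coordinate names are illustrative.

    import numpy as np
    import xarray as xr
    from gpm.bucket.processing import ds_to_dask_df_function, ds_to_pd_df_function

    # Dataset with only 2D spatial arrays
    ds = xr.Dataset(
        {"precip": (("y", "x"), np.random.rand(4, 4))},
        coords={
            "lat": (("y", "x"), np.random.rand(4, 4)),
            "lon": (("y", "x"), np.random.rand(4, 4)),
        },
    )
    pd_df = ds_to_pd_df_function(ds)                # pandas DataFrame
    dask_df = ds_to_dask_df_function(ds.chunk())    # dask DataFrame (a chunked Dataset is assumed here)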

gpm.bucket.processing.ensure_pyarrow_string_columns(df)[source]#

Convert ‘object’ type columns to pyarrow strings.

gpm.bucket.processing.ensure_unique_chunking(ds)[source]#

Ensure the dataset has unique chunking.

Conversion to a dask DataFrame requires unique chunking. If the xr.Dataset does not have unique chunking, ds.unify_chunks is performed.

Variable chunks can be visualized with:

    for var in ds.data_vars:
        print(var, ds[var].chunks)

gpm.bucket.processing.estimate_row_group_size(df, size='200MB')[source]#

Estimate the row_group_size parameter based on the desired row group memory size.

row_group_size is a Parquet argument controlling the number of rows in each Apache Parquet file row group.
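A quick usage sketch; the DataFrame content and target size are illustrative.

    import numpy as np
    import pandas as pd
    from gpm.bucket.processing import estimate_row_group_size

    # Illustrative one-column DataFrame
    df = pd.DataFrame({"precip": np.random.rand(1_000_000)})
    # Estimate how many rows fit in roughly 200 MB row groups
    row_group_size = estimate_row_group_size(df, size="200MB")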

gpm.bucket.processing.get_bin_partition(values, bin_size)[source]#

Compute the bin partitioning values.

Parameters:
  • values (float or array-like) – Values.

  • bin_size (float) – Bin size.

Returns:

Bin value – The bin partition value(s) corresponding to the input values.

Return type:

float or array-like
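A brief usage sketch; the values and bin size are illustrative, and the exact binning convention (e.g. bin edge vs. midpoint) is not asserted here.

    import numpy as np
    from gpm.bucket.processing import get_bin_partition

    # Scalar input
    bin_value = get_bin_partition(7.3, bin_size=5)
    # Array-like input
    bin_values = get_bin_partition(np.array([7.3, -12.1, 44.9]), bin_size=5)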

gpm.bucket.processing.get_df_object_columns(df)[source]#

Get the dataframe columns which have ‘object’ type.
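A small sketch combining get_df_object_columns with ensure_pyarrow_string_columns (documented earlier), assuming ensure_pyarrow_string_columns returns the converted dataframe; the data are illustrative.

    import pandas as pd
    from gpm.bucket.processing import ensure_pyarrow_string_columns, get_df_object_columns

    df = pd.DataFrame({"granule_id": ["A001", "A002"], "precip": [0.4, 1.2]})
    # Columns with 'object' dtype (here: "granule_id")
    object_columns = get_df_object_columns(df)
    # Convert the 'object' columns to pyarrow-backed strings
    df = ensure_pyarrow_string_columns(df)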

gpm.bucket.processing.has_unique_chunking(ds)[source]#

Check if a dataset has unique chunking.
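A minimal sketch combining has_unique_chunking with ensure_unique_chunking (documented above), assuming ensure_unique_chunking returns the (possibly rechunked) dataset; the dataset is illustrative and requires dask.

    import numpy as np
    import xarray as xr
    from gpm.bucket.processing import ensure_unique_chunking, has_unique_chunking

    # Illustrative chunked dataset
    ds = xr.Dataset({"precip": (("y", "x"), np.random.rand(8, 8))}).chunk({"y": 4, "x": 4})
    if not has_unique_chunking(ds):
        ds = ensure_unique_chunking(ds)  # assumed to return the unified dataset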

gpm.bucket.readers module#

This module provides utilities to read GPM Geographic Bucket Apache Parquet files.

gpm.bucket.readers.read_partitioned_dataset(filepath, columns=None)[source]#
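A short usage sketch; the file path and column names are illustrative.

    from gpm.bucket.readers import read_partitioned_dataset

    # Read back only the columns of interest from a bucket Parquet dataset
    df = read_partitioned_dataset("/tmp/gpm_bucket", columns=["lon", "lat", "precip"])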

gpm.bucket.utils module#

This module provides utilities to manipulate GPM Geographic Buckets.

gpm.bucket.utils.add_bin_column(df, column, bin_size, vmin, vmax, bin_name, add_midpoint=True)[source]#
gpm.bucket.utils.add_spatial_bins(df, x='x', y='y', xbin_size=1, ybin_size=1, xlim=(-180, 180), ylim=(-90, 90), xbin_name='xbin', ybin_name='ybin', add_bin_midpoint=True)[source]#
gpm.bucket.utils.create_spatial_bin_empty_df(xbin_size=1, ybin_size=1, xlim=(-180, 180), ylim=(-90, 90), xbin_name='xbin', ybin_name='ybin')[source]#

Create an empty spatial bin DataFrame.
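A minimal sketch of the two utilities above, assuming add_spatial_bins returns the dataframe with the added bin (and midpoint) columns; the data and bin sizes are illustrative.

    import pandas as pd
    from gpm.bucket.utils import add_spatial_bins, create_spatial_bin_empty_df

    # Illustrative point data in x/y (longitude/latitude) coordinates
    df = pd.DataFrame({"x": [10.2, -45.7], "y": [5.3, 60.0]})
    df = add_spatial_bins(df, x="x", y="y", xbin_size=5, ybin_size=5)

    # Empty DataFrame covering the full global 5° x 5° bin grid (assumed behavior)
    empty_df = create_spatial_bin_empty_df(xbin_size=5, ybin_size=5)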

gpm.bucket.writers module#

This module provides utilities to write GPM Geographic Bucket Apache Parquet files.

gpm.bucket.writers.split_list_in_blocks(values, block_size)[source]#
gpm.bucket.writers.write_granule_bucket(src_filepath, bucket_base_dir, ds_to_df_converter, xbin_size=15, ybin_size=15, xbin_name='lonbin', ybin_name='latbin', row_group_size='500MB', **writer_kwargs)[source]#

Write a geographically partitioned Parquet Dataset of a GPM granule.

Parameters:
  • src_filepath (str) – File path of the granule to store in the bucket archive.

  • bucket_base_dir (str) – Base directory of the per-granule bucket archive.

  • ds_to_df_converter (callable) – Function taking a granule filepath, opening it, and returning a pandas or dask dataframe.

  • xbin_name (str, optional) – Name of the binned column used to partition the data along the x dimension. The default is "lonbin".

  • ybin_name (str, optional) – Name of the binned column used to partition the data along the y dimension. The default is "latbin".

  • xbin_size (int) – Longitude bin size. The default is 15.

  • ybin_size (int) – Latitude bin size. The default is 15.

  • row_group_size (int or str, optional) – Maximum number of rows in each written Parquet row group. If specified as a string (e.g. "500MB"), the corresponding number of rows is estimated from the desired row group memory size. The default is "500MB".

  • **writer_kwargs (dict) – Optional arguments to be passed to the pyarrow Dataset Writer. Common arguments are ‘format’ and ‘use_threads’. The default file format is 'parquet'. The default use_threads is True, which enables multithreaded file writing. More information is available at https://arrow.apache.org/docs/python/generated/pyarrow.dataset.write_dataset.html

Notes

Example of partitioning:

  • Partition by 1° pixels: 64800 directories (360*180)

  • Partition by 5° pixels: 2592 directories (72*36)

  • Partition by 10° pixels: 648 directories (36*18)

  • Partition by 15° pixels: 288 directories (24*12)
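A hedged end-to-end sketch: the granule path and output directory are illustrative, and the converter below simply opens the granule with xarray and reuses ds_to_pd_df_function; a real workflow would use the reader appropriate for the product being bucketed.

    import xarray as xr
    from gpm.bucket.processing import ds_to_pd_df_function
    from gpm.bucket.writers import write_granule_bucket

    def ds_to_df_converter(filepath):
        # Open the granule and convert it to a pandas DataFrame
        # (illustrative opening logic; adapt to the product at hand)
        ds = xr.open_dataset(filepath)
        return ds_to_pd_df_function(ds)

    write_granule_bucket(
        src_filepath="/data/granules/granule.HDF5",   # illustrative granule path
        bucket_base_dir="/data/gpm_bucket",           # illustrative bucket archive directory
        ds_to_df_converter=ds_to_df_converter,
        xbin_size=15,
        ybin_size=15,
        row_group_size="500MB",
    )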

Module contents#

This directory defines the GPM-API geographic binning toolbox.