gpm.bucket package#
Submodules#
gpm.bucket.analysis module#
This module provides utilities for GPM Geographic Bucket Analysis.
gpm.bucket.dataset module#
This module provides utilities to write a GPM Geographic Bucket Apache Parquet Dataset.
gpm.bucket.io module#
This module provides utilities to search GPM Geographic Bucket files.
gpm.bucket.processing module#
This module provides utilities for the creation of GPM Geographic Buckets.
- gpm.bucket.processing.assign_spatial_partitions(df, xbin_name, ybin_name, xbin_size, ybin_size, x_column='lat', y_column='lon')[source]#
Add partitioning bin columns to dataframe.
Works for both dask.dataframe and pandas.dataframe.
- gpm.bucket.processing.drop_undesired_columns(df)[source]#
Drop undesired columns like dataset dimensions without coordinates.
- gpm.bucket.processing.ds_to_dask_df_function(ds)[source]#
Convert an xr.Dataset to a dask.Dataframe.
This function expects an xr.Dataset with only 2D spatial DataArrays.
- gpm.bucket.processing.ds_to_pd_df_function(ds)[source]#
Convert an xr.Dataset to a pandas.Dataframe.
This function expects an xr.Dataset with only 2D spatial DataArrays.
- gpm.bucket.processing.ensure_pyarrow_string_columns(df)[source]#
Convert ‘object’ type columns to pyarrow strings.
- gpm.bucket.processing.ensure_unique_chunking(ds)[source]#
Ensure the dataset has unique chunking.
Conversion to dask.dataframe requires unique chunking. If the xr.Dataset does not have unique chunking, perform ds.unify_chunks.
Variable chunks can be visualized with:
    for var in ds.data_vars:
        print(var, ds[var].chunks)
- gpm.bucket.processing.estimate_row_group_size(df, size='200MB')[source]#
Estimate row_group_size parameter based on the desired row group memory size.
row_group_size is a Parquet argument controlling the number of rows in each Apache Parquet File Row Group.
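The relation between a target row group memory size and a row count can be sketched as follows. This is a minimal illustration, not the library's implementation: the helper names `parse_size` and `rows_for_size` are hypothetical, units are assumed to be decimal (1 MB = 10^6 bytes), and the average number of bytes per row is assumed to be known in advance.

```python
import re

# Decimal units are an assumption; the library may interpret sizes differently.
_UNITS = {"B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9}


def parse_size(size):
    """Convert a string such as '200MB' or '500 MB' into a number of bytes."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([KMG]?B)\s*", size.upper())
    if match is None:
        raise ValueError(f"Cannot parse size: {size!r}")
    value, unit = match.groups()
    return int(float(value) * _UNITS[unit])


def rows_for_size(size, bytes_per_row):
    """Estimate how many rows fit in a row group of the requested memory size."""
    if bytes_per_row <= 0:
        raise ValueError("bytes_per_row must be positive")
    # Always return at least one row, even for very wide rows.
    return max(1, parse_size(size) // bytes_per_row)


# Hypothetical usage: rows of about 100 bytes each and a 200MB target.
n_rows = rows_for_size("200MB", bytes_per_row=100)
```

In practice the bytes-per-row figure would have to be measured from the dataframe itself, for example from its in-memory footprint divided by its number of rows.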
- gpm.bucket.processing.get_bin_partition(values, bin_size)[source]#
Compute the bins partitioning values.
- Parameters:
values (float or array-like) – Values.
bin_size (float) – Bin size.
- Returns:
Bin value – Bin to which each value belongs.
- Return type:
float or array-like
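The idea of mapping values to bins of a fixed size can be sketched in pure Python. The labelling convention used here (each bin identified by its lower edge) is an assumption made for illustration and may differ from what `get_bin_partition` returns. The helper names `bin_lower_edge` and `bin_lower_edges` are hypothetical.

```python
import math


def bin_lower_edge(value, bin_size):
    """Return the lower edge of the bin of width bin_size that contains value."""
    return bin_size * math.floor(value / bin_size)


def bin_lower_edges(values, bin_size):
    """Apply bin_lower_edge to each element of an iterable."""
    return [bin_lower_edge(v, bin_size) for v in values]


# Hypothetical usage: assign longitudes to 15-degree bins.
lon_bins = bin_lower_edges([47.3, -3.2, 0, 14.9, 15, -180], 15)
```

Using `math.floor` rather than integer truncation keeps negative coordinates in the correct bin: with a bin size of 15, the value -3.2 falls in the bin starting at -15, not at 0.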
gpm.bucket.readers module#
This module provides utilities to read GPM Geographic Bucket Apache Parquet files.
gpm.bucket.utils module#
This module provides utilities to manipulate GPM Geographic Buckets.
- gpm.bucket.utils.add_bin_column(df, column, bin_size, vmin, vmax, bin_name, add_midpoint=True)[source]#
gpm.bucket.writers module#
This module provides utilities to write GPM Geographic Bucket Apache Parquet files.
- gpm.bucket.writers.write_granule_bucket(src_filepath, bucket_base_dir, ds_to_df_converter, xbin_size=15, ybin_size=15, xbin_name='lonbin', ybin_name='latbin', row_group_size='500MB', **writer_kwargs)[source]#
Write a geographically partitioned Parquet Dataset of a GPM granule.
- Parameters:
src_filepath (str) – File path of the granule to store in the bucket archive.
bucket_base_dir (str) – Base directory of the per-granule bucket archive.
ds_to_df_converter (callable) – Function taking a granule filepath, opening it and returning a pandas or dask dataframe.
xbin_name (str, optional) – Name of the binned column used to partition the data along the x dimension. The default is "lonbin".
ybin_name (str, optional) – Name of the binned column used to partition the data along the y dimension. The default is "latbin".
xbin_size (int) – Longitude bin size. The default is 15.
ybin_size (int) – Latitude bin size. The default is 15.
row_group_size ((int, str), optional) – Maximum number of rows in each written Parquet row group. If specified as a string (e.g. "500MB"), the equivalent number of rows is estimated. The default is "500MB".
**writer_kwargs (dict) – Optional arguments to be passed to the pyarrow Dataset Writer. Common arguments are ‘format’ and ‘use_threads’. The default file format is 'parquet'. The default use_threads is True, which enables multithreaded file writing. More information is available at https://arrow.apache.org/docs/python/generated/pyarrow.dataset.write_dataset.html
Notes
Example of partitioning:
Partition by 1° pixels: 64800 directories (360*180)
Partition by 5° pixels: 2592 directories (72*36)
Partition by 10° pixels: 648 directories (36*18)
Partition by 15° pixels: 288 directories (24*12)
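The directory counts listed above follow from dividing the global longitude range (360°) and latitude range (180°) by the bin sizes. A minimal sketch of that arithmetic, where the helper name `n_partitions` is hypothetical and not part of the package:

```python
import math


def n_partitions(xbin_size, ybin_size=None):
    """Count the longitude/latitude bins needed to cover the globe."""
    if ybin_size is None:
        ybin_size = xbin_size
    n_x = math.ceil(360 / xbin_size)  # bins along longitude
    n_y = math.ceil(180 / ybin_size)  # bins along latitude
    return n_x * n_y


# Reproduce the counts from the notes for square bins.
for size in (1, 5, 10, 15):
    print(f"{size} degree bins: {n_partitions(size)} directories")
```

Smaller bins allow more selective reads but multiply the number of directories and files to manage, which is the trade-off behind the choice of bin size.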
Module contents#
This directory defines the GPM-API geographic binning toolbox.