gpm.bucket package#
Submodules#
gpm.bucket.analysis module#
This module contains a mix of functions to analyze bucket archives.
gpm.bucket.dataframe module#
This module implements manipulation wrappers for multiple DataFrame classes.
gpm.bucket.filters module#
gpm.bucket.io module#
This module provides utilities to search GPM Geographic Bucket files.
- gpm.bucket.io.get_exisiting_partitions_paths(bucket_dir, dir_trees)[source]#
Get the paths of existing bucket partitions on disk.
- gpm.bucket.io.get_filepaths(bucket_dir, parallel=True, file_extension=None, glob_pattern=None, regex_pattern=None)[source]#
Return the filepaths matching the specified filename filtering criteria.
- gpm.bucket.io.get_filepaths_by_partition(bucket_dir, parallel=True, file_extension=None, glob_pattern=None, regex_pattern=None)[source]#
Return a dictionary with the list of filepaths for each bucket partition.
gpm.bucket.partitioning module#
This module implements Spatial Partitioning classes.
- class gpm.bucket.partitioning.Base2DPartitioning(x_bounds, y_bounds, levels, flavor=None, order=None)[source]#
Bases:
object
Handles partitioning of 2D data into rectangular tiles.
The size of the partitions can vary between and across the x and y directions.
- Parameters:
levels (str or list) – Name or names of the partition levels. If partitioning by a single level (i.e. by a unique partition id), specify a single partition name. If partitioning by two or more levels (i.e. by x and y), specify the x, y (z, …) partition level names.
x_bounds (numpy.ndarray) – The partition bounds across the x (horizontal) dimension.
y_bounds (numpy.ndarray) – The partition bounds across the y (vertical) dimension. Please provide the bounds in increasing order. The origin of the partition class indices is the top left corner.
order (list) – The order of the partitions when writing multi-level partitions (i.e. x, y) to disk. The default, None, corresponds to levels.
flavor (str) – This argument governs the directory names of partitioned datasets. The default, None, names the directories with the partition labels (DirectoryPartitioning). The option "hive" names the directories with the format {partition_name}={partition_label}.
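A minimal construction sketch (the bounds, level names and coordinate values are illustrative; the unpacking of query_indices into x and y index arrays is an assumption):

    import numpy as np
    from gpm.bucket.partitioning import Base2DPartitioning

    # Irregular x bins and regular y bins over a 0-100 x 0-50 domain
    x_bounds = np.array([0, 10, 30, 60, 100])
    y_bounds = np.array([0, 25, 50])
    partitioning = Base2DPartitioning(
        x_bounds=x_bounds,
        y_bounds=y_bounds,
        levels=["x_bin", "y_bin"],
    )
    # Map coordinates to 2D partition indices
    # (assumed return value: the x and y index arrays)
    x_indices, y_indices = partitioning.query_indices(x=[5, 75], y=[10, 40])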
- add_centroids(df, x, y, x_coord=None, y_coord=None, remove_invalid_rows=True)[source]#
Add partitions centroids to the dataframe.
- Parameters:
df (pandas.DataFrame, dask.dataframe.DataFrame, polars.DataFrame, pyarrow.Table or polars.LazyFrame) – Dataframe to which to add the partition centroids.
x (str) – Column name with the x coordinate.
y (str) – Column name with the y coordinate.
x_coord (str, optional) – Name of the new column with the centroid x coordinates. The default is “x_c”.
y_coord (str, optional) – Name of the new column with the centroid y coordinates. The default is “y_c”.
remove_invalid_rows (bool, optional) – Whether to remove dataframe rows for which the coordinates are invalid or outside the partitioning extent. The default is True.
- Returns:
df – Dataframe with the partition centroid x and y coordinate columns.
- Return type:
pandas.DataFrame, dask.dataframe.DataFrame, polars.DataFrame, pyarrow.Table or polars.LazyFrame
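A hedged example with a pandas dataframe (the "lon"/"lat" column names and the 5° LonLatPartitioning are illustrative; LonLatPartitioning is documented below):

    import pandas as pd
    from gpm.bucket import LonLatPartitioning

    df = pd.DataFrame({"lon": [5.2, -120.8], "lat": [45.1, 34.3], "value": [1.0, 2.0]})
    partitioning = LonLatPartitioning(size=5)
    # Adds the "x_c" and "y_c" centroid columns
    df = partitioning.add_centroids(df, x="lon", y="lat")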
- add_labels(df, x, y, remove_invalid_rows=True)[source]#
Add partitions labels to the dataframe.
- Parameters:
df (pandas.DataFrame, dask.dataframe.DataFrame, polars.DataFrame, pyarrow.Table or polars.LazyFrame) – Dataframe to which to add the partition labels.
x (str) – Column name with the x coordinate.
y (str) – Column name with the y coordinate.
remove_invalid_rows (bool, optional) – Whether to remove dataframe rows for which the coordinates are invalid or outside the partitioning extent. The default is True.
- Returns:
df – Dataframe with the partition label(s) column(s).
- Return type:
pandas.DataFrame, dask.dataframe.DataFrame, polars.DataFrame, pyarrow.Table or polars.LazyFrame
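A hedged example with a polars dataframe (the input column names are assumptions; the default LonLatPartitioning levels are ["lon_bin", "lat_bin"]):

    import polars as pl
    from gpm.bucket import LonLatPartitioning

    df = pl.DataFrame({"lon": [5.2, -120.8], "lat": [45.1, 34.3]})
    partitioning = LonLatPartitioning(size=5)
    # Adds the "lon_bin" and "lat_bin" label columns
    df = partitioning.add_labels(df, x="lon", y="lat")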
- directories_around_point(x, y, distance=None, size=None)[source]#
Return the directory trees with data within the specified distance from a point.
- directories_by_extent(extent)[source]#
Return the directory trees with data within the specified extent.
- get_partitions_around_point(x, y, distance=None, size=None)[source]#
Return the partition labels with data within the distance/size from a point.
- get_partitions_by_extent(extent)[source]#
Return the partition labels containing data within the extent.
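A hedged sketch of the spatial query methods above (extent values are illustrative; partitioning is any Base2DPartitioning subclass instance):

    # Partition labels intersecting an [xmin, xmax, ymin, ymax] extent
    labels = partitioning.get_partitions_by_extent(extent=[0, 20, 25, 50])
    # Directory trees covering the same extent (useful to build read paths)
    directories = partitioning.directories_by_extent(extent=[0, 20, 25, 50])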
- quadmesh_corners(origin='bottom')[source]#
Return the quadrilateral mesh corners.
A quadrilateral mesh is a grid of M by N adjacent quadrilaterals that are defined via a (M+1, N+1) grid of vertices.
The quadrilateral mesh is accepted by matplotlib.pyplot.pcolormesh, matplotlib.collections.QuadMesh and matplotlib.collections.PolyQuadMesh.
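A hedged plotting sketch, assuming quadmesh_corners() returns the x and y corner arrays of shape (M+1, N+1); the plotted counts array is fabricated for illustration:

    import matplotlib.pyplot as plt
    import numpy as np

    x_corners, y_corners = partitioning.quadmesh_corners(origin="bottom")
    m, n = x_corners.shape
    counts = np.random.rand(m - 1, n - 1)  # one value per quadrilateral
    plt.pcolormesh(x_corners, y_corners, counts)
    plt.show()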
- query_centroids(x, y)[source]#
Return the partition centroids for the specified x,y coordinates.
- query_centroids_by_indices(x_indices, y_indices)[source]#
Return the partition centroids for the specified x,y indices.
- query_indices(x, y)[source]#
Return the 2D partition indices for the specified x,y coordinates.
- query_labels_by_indices(x_indices, y_indices)[source]#
Return the partition labels as a function of the specified 2D partition indices.
- query_vertices_by_indices(x_indices, y_indices, ccw=True)[source]#
Return the partition vertices in an array of shape (indices, 4, 2).
- to_xarray(df, spatial_coords=None, aux_coords=None)[source]#
Convert dataframe to spatial xarray Dataset based on partitions centroids.
This routine assumes that you have grouped and aggregated the dataframe over the partition labels or the partition centroids!
Please add the partition centroids to the dataframe with add_centroids before calling this method, and specify the partition centroid x and y columns in the spatial_coords argument.
Please also specify the presence of auxiliary coordinates (indices) with aux_coords.
The array cells with coordinates not included in the dataframe will have NaN values.
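A hedged sketch of the expected workflow with polars (the "value" column and the default centroid column names "x_c"/"y_c" are assumptions):

    import polars as pl

    df = partitioning.add_centroids(df, x="lon", y="lat")  # adds "x_c" and "y_c"
    df_agg = df.group_by(["x_c", "y_c"]).agg(pl.col("value").mean())
    ds = partitioning.to_xarray(df_agg, spatial_coords=["x_c", "y_c"])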
- vertices(ccw=True, origin='bottom')[source]#
Return the partition vertices in an array of shape (N, M, 4, 2).
The output vertices, once the first 2 dimensions are flattened, can be passed directly to a matplotlib.collections.PolyCollection. For plotting with cartopy, the polygon vertices must be counterclockwise ordered.
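A hedged matplotlib sketch, flattening the (N, M, 4, 2) array over its first two dimensions as described above:

    import matplotlib.pyplot as plt
    from matplotlib.collections import PolyCollection

    verts = partitioning.vertices(ccw=True, origin="bottom")  # shape (N, M, 4, 2)
    collection = PolyCollection(verts.reshape(-1, 4, 2), edgecolor="gray", facecolor="none")
    fig, ax = plt.subplots()
    ax.add_collection(collection)
    ax.autoscale()
    plt.show()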
- class gpm.bucket.partitioning.LonLatPartitioning(size, extent=[-180, 180, -90, 90], levels=None, flavor='hive', order=None, labels_decimals=None)[source]#
Bases:
XYPartitioning
Handles geographic partitioning of data based on longitude and latitude bin sizes within a defined extent.
The last bin (in the lon and lat directions) might not be of size size!
- Parameters:
size (float) – The uniform size for longitude and latitude binning. Carefully consider the size of the partitions. Partitioning the Earth by:
- 1° corresponds to 64800 directories (360*180)
- 5° corresponds to 2592 directories (72*36)
- 10° corresponds to 648 directories (36*18)
- 15° corresponds to 288 directories (24*12)
levels (list, optional) – Names of the longitude and latitude partitions. The default is ["lon_bin", "lat_bin"].
extent (list, optional) – The geographical extent for the partitioning specified as [xmin, xmax, ymin, ymax]. Default is the whole Earth: [-180, 180, -90, 90].
order (list, optional) – The order of the partitions when writing partitioned datasets. The default, None, corresponds to levels.
flavor (str, optional) – This argument governs the directory names of partitioned datasets. The default, "hive", names the directories with the format {partition_name}={partition_label}. If None, names the directories with the partition labels (DirectoryPartitioning).
- directories_around_point(lon, lat, distance=None, size=None)[source]#
Return the directory trees with data within the distance/size from a point.
- directories_by_continent(name, padding=None)[source]#
Return the directory trees with data within a continent.
- directories_by_country(name, padding=None)[source]#
Return the directory trees with data within a country.
- get_partitions_around_point(lon, lat, distance=None, size=None)[source]#
Return the partition labels with data within the distance/size from a point.
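A hedged end-to-end sketch (the point coordinates and country name are illustrative):

    from gpm.bucket import LonLatPartitioning

    partitioning = LonLatPartitioning(size=5)  # 5° x 5° global partitions
    # Directory trees with data within 200 km of a point (distance in meters)
    dirs_point = partitioning.directories_around_point(lon=8.5, lat=46.0, distance=200_000)
    # Directory trees covering a country, with an optional padding in degrees
    dirs_country = partitioning.directories_by_country(name="Switzerland", padding=1)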
- class gpm.bucket.partitioning.TilePartitioning(size, extent, n_levels, levels=None, origin='bottom', direction='x', justify=False, flavor=None, order=None)[source]#
Bases:
Base2DPartitioning
Handles partitioning of data into tiles.
- Parameters:
size (int, float, tuple, list) – The size value(s) of the bins. An int or float enforces the same size in both the x and y directions; a tuple or list specifies the bin size for the x and y directions separately.
extent (list) – The extent for the partitioning specified as [xmin, xmax, ymin, ymax].
n_levels (int) – The number of tile partitioning levels. If n_levels=2, an (x, y) label is assigned to each tile. If n_levels=1, a unique id label is assigned to each tile by combining the x and y tile indices; the origin and direction parameters govern its value.
levels (list, optional) – If n_levels>=2, the first two names must correspond to the x and y partitions. The default with n_levels=1 is ["tile"]. The default with n_levels=2 is ["x", "y"].
origin (str, optional) – The origin of the y axis. Either "bottom" or "top". TMS tiles assume origin="top". Google Maps tiles assume origin="bottom". The default is "bottom".
direction (str, optional) – The direction to follow to define tile ids if n_levels=1 is specified. Valid direction values are "x" and "y". direction="x" numbers the tiles row by row; direction="y" numbers the tiles column by column.
justify (bool, optional) – Whether to justify the labels so that they all have the same number of characters. Zeros are added on the left side of the labels to pad the length. The default is False.
order (list, optional) – The order of the partitions when writing partitioned datasets. The default, None, corresponds to levels.
flavor (str, optional) – This argument governs the directory names of partitioned datasets. The default, None, names the directories with the partition labels (DirectoryPartitioning). The option "hive" names the directories with the format {partition_name}={partition_label}.
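A hedged construction sketch (the extent and tile size are illustrative):

    from gpm.bucket import TilePartitioning

    partitioning = TilePartitioning(
        size=10,                     # 10 x 10 tiles (in extent units)
        extent=[-180, 180, -90, 90],
        n_levels=1,                  # a single "tile" id per tile
        origin="top",                # TMS-style y-axis origin
        direction="x",               # number the tiles row by row
    )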
- class gpm.bucket.partitioning.XYPartitioning(size, extent, levels=None, order=None, flavor=None, labels_decimals=None)[source]#
Bases:
Base2DPartitioning
Handles partitioning of data into x and y regularly spaced bins.
- Parameters:
size (int, float, tuple, list) – The size value(s) of the bins. An int or float enforces the same size in both the x and y directions; a tuple or list specifies the bin size for the x and y directions separately.
extent (list) – The extent for the partitioning specified as [xmin, xmax, ymin, ymax].
levels (list, optional) – Names of the x and y partitions. The default is ["xbin", "ybin"].
order (list, optional) – The order of the x and y partitions when writing partitioned datasets. The default, None, corresponds to levels.
flavor (str, optional) – This argument governs the directory names of partitioned datasets. The default, None, names the directories with the partition labels (DirectoryPartitioning). The option "hive" names the directories with the format {partition_name}={partition_label}.
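A hedged construction sketch (the extent and bin sizes are illustrative):

    from gpm.bucket.partitioning import XYPartitioning

    # 0.25-wide x bins and 0.5-tall y bins; default levels are ["xbin", "ybin"]
    partitioning = XYPartitioning(size=(0.25, 0.5), extent=[0, 10, 0, 5], flavor="hive")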
- gpm.bucket.partitioning.check_partitioning_flavor(flavor)[source]#
Validate the flavor argument.
If None, defaults to “directory”.
- gpm.bucket.partitioning.check_valid_x_y(df, x, y)[source]#
Check if the x and y columns are in the dataframe.
- gpm.bucket.partitioning.get_array_combinations(x, y)[source]#
Return all combinations between the two input arrays.
- gpm.bucket.partitioning.get_centroids_from_bounds(bounds)[source]#
Define partition centroids from bounds.
- gpm.bucket.partitioning.get_directories(dict_labels, order, flavor)[source]#
Return the directory trees of a partitioned dataset.
- gpm.bucket.partitioning.get_n_decimals(number)[source]#
Get the number of decimals of a number.
- gpm.bucket.partitioning.get_partition_dir_name(partition_name, partition_labels, flavor)[source]#
Return the directory names of a partition.
- gpm.bucket.partitioning.get_tile_id_labels(x_indices, y_indices, origin, direction, n_x, n_y, justify)[source]#
Return the 1D tile labels for the specified x,y indices.
gpm.bucket.readers module#
This module provides utilities to read GPM Geographic Bucket Apache Parquet files.
- gpm.bucket.readers.read_bucket(bucket_dir, extent=None, country=None, continent=None, point=None, distance=None, size=None, padding=0, file_extension=None, glob_pattern=None, regex_pattern=None, backend='polars', **polars_kwargs)[source]#
Read a geographic bucket.
The extent, country, continent, or point arguments allow reading only a spatial subset of the original bucket. Please specify only one of these arguments!
The file_extension, glob_pattern and regex_pattern arguments allow to further restrict the selection of files read from the partitioned dataset.
- Parameters:
bucket_dir (str) – Base directory of the geographic bucket.
extent (list, optional) – The extent specified as [xmin, xmax, ymin, ymax].
country (str, optional) – The name of the country of interest.
continent (str, optional) – The name of the continent of interest.
point (list or tuple, optional) – The longitude and latitude coordinates of the point around which you are interested to get the data. To effectively subset data around this point, also specify the size or distance arguments.
distance (float, optional) – Distance (in meters) from the specified point in each direction.
size (int, float, tuple, list, optional) – The size in degrees of the extent in each direction centered around the specified point.
padding (int, float, tuple, list) – The number of degrees to extend the (country, continent) extent in each direction. If padding is a single number, the same padding is applied in all directions. If padding is a tuple or list, it must contain 2 or 4 elements. If two values are provided (x, y), they are interpreted as longitude and latitude padding, respectively. If four values are provided, they directly correspond to padding for each side (left, right, top, bottom). Default is 0.
file_extension (str, optional) – Name of the file extension. The default is None.
glob_pattern (str, optional) – Unix shell-style wildcards to subset the files to read in. The default is None.
regex_pattern (str, optional) – Regex pattern to subset the files to read in. The default is None.
backend (str, optional) – The desired type of dataframe returned by the function. The default is a polars.DataFrame. Valid backends are pandas, polars_lazy and pyarrow.
**polars_kwargs (dict) – Arguments to be passed to polars.read_parquet(). columns allows specifying the subset of columns to read. n_rows allows stopping reading data from Parquet files after n_rows rows. For other arguments, please refer to: https://docs.pola.rs/py-polars/html/reference/api/polars.read_parquet.html
- Returns:
df – Bucket dataframe.
- Return type:
pandas.DataFrame, polars.DataFrame, polars.LazyFrame or pyarrow.Table
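A hedged usage sketch (the bucket directory, extent and column names are illustrative):

    from gpm.bucket.readers import read_bucket

    df = read_bucket(
        bucket_dir="/data/GPM/bucket",
        extent=[5, 10, 44, 48],  # [xmin, xmax, ymin, ymax]
        backend="polars",
        # Forwarded to polars.read_parquet; column names are hypothetical
        columns=["lon", "lat", "precipRateNearSurface"],
    )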
gpm.bucket.routines module#
This module provides the routines for the creation of GPM Geographic Buckets.
- gpm.bucket.routines.write_granule_bucket(src_filepath, bucket_dir, partitioning, granule_to_df_func, x='lon', y='lat', **writer_kwargs)[source]#
Write a geographically partitioned Parquet Dataset of a GPM granule.
- Parameters:
src_filepath (str) – File path of the granule to store in the bucket archive.
bucket_dir (str) – Base directory of the per-granule bucket archive.
partitioning (gpm.bucket.SpatialPartitioning) – A spatial partitioning class.
granule_to_df_func (Callable) – Function taking a granule filepath, opening it and returning a pandas or dask dataframe.
x (str) – The name of the x column. The default is “lon”.
y (str) – The name of the y column. The default is “lat”.
**writer_kwargs (dict) – Optional arguments to be passed to the pyarrow Dataset Writer. Common arguments are ‘format’ and ‘use_threads’. The default file format is ‘parquet’. The default use_threads is True, which enables multithreaded file writing. More information available at https://arrow.apache.org/docs/python/generated/pyarrow.dataset.write_dataset.html
gpm.bucket.writers module#
This module provides utilities to write a GPM Geographic Bucket Apache Parquet Dataset.
- gpm.bucket.writers.estimate_row_group_size(df, size='200MB')[source]#
Estimate the row_group_size parameter based on the desired row group memory size.
row_group_size is a Parquet argument controlling the number of rows in each Apache Parquet file row group.
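A hedged one-liner (df is any dataframe already in memory):

    from gpm.bucket.writers import estimate_row_group_size

    row_group_size = estimate_row_group_size(df, size="200MB")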
- gpm.bucket.writers.write_arrow_partitioned_dataset(table, base_dir, filename_prefix, partitions, **writer_kwargs)[source]#
- gpm.bucket.writers.write_dask_partitioned_dataset(df, base_dir, filename_prefix, partitions, **writer_kwargs)[source]#
Write a Dask DataFrame to a partitioned dataset.
It loops over the dataframe partitions and writes them to disk. If row_group_size or max_file_size are specified as a string, it loads the first dataframe partition to estimate the row numbers.
- gpm.bucket.writers.write_pandas_partitioned_dataset(df, base_dir, filename_prefix, partitions, **writer_kwargs)[source]#
Module contents#
This directory defines the GPM-API geographic binning toolbox.
- class gpm.bucket.LonLatPartitioning(size, extent=[-180, 180, -90, 90], levels=None, flavor='hive', order=None, labels_decimals=None)[source]#
Bases:
XYPartitioning
Handles geographic partitioning of data based on longitude and latitude bin sizes within a defined extent.
The last bin (in the lon and lat directions) might not be of size size!
- Parameters:
size (float) – The uniform size for longitude and latitude binning. Carefully consider the size of the partitions. Partitioning the Earth by:
- 1° corresponds to 64800 directories (360*180)
- 5° corresponds to 2592 directories (72*36)
- 10° corresponds to 648 directories (36*18)
- 15° corresponds to 288 directories (24*12)
levels (list, optional) – Names of the longitude and latitude partitions. The default is ["lon_bin", "lat_bin"].
extent (list, optional) – The geographical extent for the partitioning specified as [xmin, xmax, ymin, ymax]. Default is the whole Earth: [-180, 180, -90, 90].
order (list, optional) – The order of the partitions when writing partitioned datasets. The default, None, corresponds to levels.
flavor (str, optional) – This argument governs the directory names of partitioned datasets. The default, "hive", names the directories with the format {partition_name}={partition_label}. If None, names the directories with the partition labels (DirectoryPartitioning).
- directories_around_point(lon, lat, distance=None, size=None)[source]#
Return the directory trees with data within the distance/size from a point.
- directories_by_continent(name, padding=None)[source]#
Return the directory trees with data within a continent.
- directories_by_country(name, padding=None)[source]#
Return the directory trees with data within a country.
- get_partitions_around_point(lon, lat, distance=None, size=None)[source]#
Return the partition labels with data within the distance/size from a point.
- class gpm.bucket.TilePartitioning(size, extent, n_levels, levels=None, origin='bottom', direction='x', justify=False, flavor=None, order=None)[source]#
Bases:
Base2DPartitioning
Handles partitioning of data into tiles.
- Parameters:
size (int, float, tuple, list) – The size value(s) of the bins. An int or float enforces the same size in both the x and y directions; a tuple or list specifies the bin size for the x and y directions separately.
extent (list) – The extent for the partitioning specified as [xmin, xmax, ymin, ymax].
n_levels (int) – The number of tile partitioning levels. If n_levels=2, an (x, y) label is assigned to each tile. If n_levels=1, a unique id label is assigned to each tile by combining the x and y tile indices; the origin and direction parameters govern its value.
levels (list, optional) – If n_levels>=2, the first two names must correspond to the x and y partitions. The default with n_levels=1 is ["tile"]. The default with n_levels=2 is ["x", "y"].
origin (str, optional) – The origin of the y axis. Either "bottom" or "top". TMS tiles assume origin="top". Google Maps tiles assume origin="bottom". The default is "bottom".
direction (str, optional) – The direction to follow to define tile ids if n_levels=1 is specified. Valid direction values are "x" and "y". direction="x" numbers the tiles row by row; direction="y" numbers the tiles column by column.
justify (bool, optional) – Whether to justify the labels so that they all have the same number of characters. Zeros are added on the left side of the labels to pad the length. The default is False.
order (list, optional) – The order of the partitions when writing partitioned datasets. The default, None, corresponds to levels.
flavor (str, optional) – This argument governs the directory names of partitioned datasets. The default, None, names the directories with the partition labels (DirectoryPartitioning). The option "hive" names the directories with the format {partition_name}={partition_label}.
- gpm.bucket.read(bucket_dir, extent=None, country=None, continent=None, point=None, distance=None, size=None, padding=0, file_extension=None, glob_pattern=None, regex_pattern=None, backend='polars', **polars_kwargs)[source]#
Read a geographic bucket.
The extent, country, continent, or point arguments allow reading only a spatial subset of the original bucket. Please specify only one of these arguments!
The file_extension, glob_pattern and regex_pattern arguments allow to further restrict the selection of files read from the partitioned dataset.
- Parameters:
bucket_dir (str) – Base directory of the geographic bucket.
extent (list, optional) – The extent specified as [xmin, xmax, ymin, ymax].
country (str, optional) – The name of the country of interest.
continent (str, optional) – The name of the continent of interest.
point (list or tuple, optional) – The longitude and latitude coordinates of the point around which you are interested to get the data. To effectively subset data around this point, also specify the size or distance arguments.
distance (float, optional) – Distance (in meters) from the specified point in each direction.
size (int, float, tuple, list, optional) – The size in degrees of the extent in each direction centered around the specified point.
padding (int, float, tuple, list) – The number of degrees to extend the (country, continent) extent in each direction. If padding is a single number, the same padding is applied in all directions. If padding is a tuple or list, it must contain 2 or 4 elements. If two values are provided (x, y), they are interpreted as longitude and latitude padding, respectively. If four values are provided, they directly correspond to padding for each side (left, right, top, bottom). Default is 0.
file_extension (str, optional) – Name of the file extension. The default is None.
glob_pattern (str, optional) – Unix shell-style wildcards to subset the files to read in. The default is None.
regex_pattern (str, optional) – Regex pattern to subset the files to read in. The default is None.
backend (str, optional) – The desired type of dataframe returned by the function. The default is a polars.DataFrame. Valid backends are pandas, polars_lazy and pyarrow.
**polars_kwargs (dict) – Arguments to be passed to polars.read_parquet(). columns allows specifying the subset of columns to read. n_rows allows stopping reading data from Parquet files after n_rows rows. For other arguments, please refer to: https://docs.pola.rs/py-polars/html/reference/api/polars.read_parquet.html
- Returns:
df – Bucket dataframe.
- Return type:
pandas.DataFrame, polars.DataFrame, polars.LazyFrame or pyarrow.Table
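A hedged usage sketch reading around a point (the path and coordinates are illustrative):

    import gpm.bucket

    df = gpm.bucket.read(
        bucket_dir="/data/GPM/bucket",
        point=(8.5, 46.0),    # (longitude, latitude)
        distance=100_000,     # meters in each direction
    )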