gpm.bucket package#

Submodules#

gpm.bucket.analysis module#

This module contains a mix of functions to analyze bucket archives.

gpm.bucket.analysis.get_list_overpass_time(timesteps)[source][source]#

Return a list with (start_time, end_time) of the overpasses.

This function is typically called on a regional subset of a bucket archive.
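
A minimal usage sketch (the bucket path, the point coordinates and the "time" column name are illustrative assumptions, not part of the API):

    import gpm.bucket
    from gpm.bucket.analysis import get_list_overpass_time

    # Read a regional subset of a (hypothetical) bucket archive around a point
    df = gpm.bucket.read(bucket_dir="/data/bucket", point=(8.9, 45.5), distance=100_000)
    # Derive the (start_time, end_time) tuples of the overpasses from the timesteps
    list_overpass_time = get_list_overpass_time(df["time"])
    for start_time, end_time in list_overpass_time:
        print(start_time, end_time)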

gpm.bucket.dataframe module#

This module implements manipulation wrappers for multiple DataFrame classes.

gpm.bucket.dataframe.check_valid_dataframe(df)[source][source]#

Check the dataframe class.

gpm.bucket.dataframe.df_add_column(df, column, values)[source][source]#

Add column to dataframe.

gpm.bucket.dataframe.df_get_column(df, column)[source][source]#

Get the dataframe column.

gpm.bucket.dataframe.df_is_column_in(df, column)[source][source]#
gpm.bucket.dataframe.df_select_valid_rows(df, valid_rows)[source][source]#

Select only the valid dataframe rows (using a boolean array).

gpm.bucket.dataframe.df_to_pandas(df)[source][source]#

Convert dataframe to pandas.
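
These wrappers dispatch on the dataframe class, so the same call works for pandas, polars, dask or pyarrow objects. A minimal sketch with pandas (the column names and values are illustrative):

    import pandas as pd
    from gpm.bucket.dataframe import df_add_column, df_get_column, df_to_pandas

    df = pd.DataFrame({"lon": [0.5, 10.2], "lat": [45.0, 46.1]})
    df = df_add_column(df, column="granule_id", values=[1, 2])  # add a column
    lats = df_get_column(df, column="lat")                      # extract a column
    df = df_to_pandas(df)                                       # no-op for pandas input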

gpm.bucket.filters module#

gpm.bucket.filters.apply_spatial_filters(df, filters=None)[source][source]#
gpm.bucket.filters.filter_around_point(df, lon, lat, distance)[source][source]#
gpm.bucket.filters.filter_by_extent(df, extent, x='lon', y='lat')[source][source]#
gpm.bucket.filters.get_geodesic_distance_from_point(lons, lats, lon, lat)[source][source]#
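
A hedged sketch of the spatial filters (the dataframe content is illustrative; distance is assumed to be in meters, consistently with gpm.bucket.read):

    import pandas as pd
    from gpm.bucket.filters import filter_around_point, filter_by_extent

    df = pd.DataFrame({"lon": [6.2, 9.1, 30.0], "lat": [45.1, 46.0, 10.0]})
    # Keep only rows within [xmin, xmax, ymin, ymax]
    df_extent = filter_by_extent(df, extent=[5, 10, 44, 47], x="lon", y="lat")
    # Keep only rows within 50 km of a point
    df_point = filter_around_point(df, lon=7.5, lat=45.5, distance=50_000)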

gpm.bucket.io module#

This module provides utilities to search GPM Geographic Bucket files.

gpm.bucket.io.get_bucket_partitioning(bucket_dir)[source][source]#
gpm.bucket.io.get_exisiting_partitions_paths(bucket_dir, dir_trees)[source][source]#

Get the path of existing bucket partitions on disk.

gpm.bucket.io.get_filepaths(bucket_dir, parallel=True, file_extension=None, glob_pattern=None, regex_pattern=None)[source][source]#

Return the filepaths matching the specified filename filtering criteria.

gpm.bucket.io.get_filepaths_by_partition(bucket_dir, parallel=True, file_extension=None, glob_pattern=None, regex_pattern=None)[source][source]#

Return a dictionary with the list of filepaths for each bucket partition.
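
A minimal sketch (the bucket directory is an illustrative assumption; the Parquet extension reflects the usual bucket file format):

    from gpm.bucket.io import get_filepaths, get_filepaths_by_partition

    # List all Parquet files of the bucket archive
    filepaths = get_filepaths("/data/bucket", file_extension=".parquet")
    # Group the filepaths by bucket partition
    dict_filepaths = get_filepaths_by_partition("/data/bucket", file_extension=".parquet")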

gpm.bucket.io.get_partitions_paths(bucket_dir)[source][source]#

Get the path of the bucket partitions.

gpm.bucket.io.read_bucket_info(bucket_dir)[source][source]#
gpm.bucket.io.write_bucket_info(bucket_dir, partitioning)[source][source]#

gpm.bucket.partitioning module#

This module implements Spatial Partitioning classes.

class gpm.bucket.partitioning.Base2DPartitioning(x_bounds, y_bounds, levels, flavor=None, order=None)[source][source]#

Bases: object

Handles partitioning of 2D data into rectangular tiles.

The size of the partitions can vary between and across the x and y directions.

Parameters:
  • levels (str or list) – Name or names of the partitions. If partitioning by a single level (i.e. by a unique partition id), specify a single partition name. If partitioning by 2 or more levels (i.e. by x and y), specify the x, y (z, …) partition level names.

  • x_bounds (numpy.ndarray) – The partition bounds across the x (horizontal) dimension.

  • y_bounds (numpy.ndarray) – The partition bounds across the y (vertical) dimension. Please provide the bounds in increasing order. The origin of the partition class indices is the top left corner.

  • order (list) – The order of the partitions when writing multi-level partitions (i.e. x, y) to disk. The default, None, corresponds to levels.

  • flavor (str) – This argument governs the directory names of partitioned datasets. The default, None, names the directories with the partition labels (DirectoryPartitioning). The option "hive" names the directories with the format {partition_name}={partition_label}.

add_centroids(df, x, y, x_coord=None, y_coord=None, remove_invalid_rows=True)[source][source]#

Add partitions centroids to the dataframe.

Parameters:
  • df (pandas.DataFrame, dask.dataframe.DataFrame, polars.DataFrame, pyarrow.Table or polars.LazyFrame) – Dataframe to which to add the partition centroids.

  • x (str) – Column name with the x coordinate.

  • y (str) – Column name with the y coordinate.

  • x_coord (str, optional) – Name of the new column with the centroids x coordinates. The default is “x_c”.

  • y_coord (str, optional) – Name of the new column with the centroids y coordinates. The default is “y_c”.

  • remove_invalid_rows (bool, optional) – Whether to remove dataframe rows for which coordinates are invalid or out of the partitioning extent. The default is True.

Returns:

df – Dataframe with the partition centroid x and y coordinate columns.

Return type:

pandas.DataFrame, dask.dataframe.DataFrame, polars.DataFrame, pyarrow.Table or polars.LazyFrame

add_labels(df, x, y, remove_invalid_rows=True)[source][source]#

Add partitions labels to the dataframe.

Parameters:
  • df (pandas.DataFrame, dask.dataframe.DataFrame, polars.DataFrame, pyarrow.Table or polars.LazyFrame) – Dataframe to which to add the partition labels.

  • x (str) – Column name with the x coordinate.

  • y (str) – Column name with the y coordinate.

  • remove_invalid_rows (bool, optional) – Whether to remove dataframe rows for which coordinates are invalid or out of the partitioning extent. The default is True.

Returns:

df – Dataframe with the partitions label(s) column(s).

Return type:

pandas.DataFrame, dask.dataframe.DataFrame, polars.DataFrame, pyarrow.Table or polars.LazyFrame
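
A minimal sketch of add_labels and add_centroids using the LonLatPartitioning subclass documented below (the "lon"/"lat" column names and values are illustrative):

    import pandas as pd
    from gpm.bucket.partitioning import LonLatPartitioning

    partitioning = LonLatPartitioning(size=5)
    df = pd.DataFrame({"lon": [2.3, -70.6], "lat": [48.8, -33.4]})
    # Add the partition label columns (default names: "lon_bin", "lat_bin")
    df = partitioning.add_labels(df, x="lon", y="lat")
    # Add the partition centroid columns
    df = partitioning.add_centroids(df, x="lon", y="lat", x_coord="lon_c", y_coord="lat_c")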

property bounds[source]#

Return the partitions bounds.

property centroids[source]#

Return the centroids array of shape (n_y, n_x, 2).

property directories[source]#

Return the directory trees.

directories_around_point(x, y, distance=None, size=None)[source][source]#

Return the directory trees with data within the specified distance from a point.

directories_by_extent(extent)[source][source]#

Return the directory trees with data within the specified extent.

get_partitions_around_point(x, y, distance=None, size=None)[source][source]#

Return the partition labels with data within the distance/size from a point.

get_partitions_by_extent(extent)[source][source]#

Return the partitions labels containing data within the extent.

property labels[source]#

Return the labels array of shape (n_y, n_x, n_levels).

quadmesh_corners(origin='bottom')[source][source]#

Return the quadrilateral mesh corners.

A quadrilateral mesh is a grid of M by N adjacent quadrilaterals that are defined via a (M+1, N+1) grid of vertices.

The quadrilateral mesh is accepted by matplotlib.pyplot.pcolormesh, matplotlib.collections.QuadMesh and matplotlib.collections.PolyQuadMesh.

Parameters:

origin (str) – Origin of the y axis. The default is bottom.

Returns:

(x_corners, y_corners) – Numpy arrays of shape (M+1, N+1)

Return type:

tuple
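
A hedged plotting sketch: the corner arrays feed matplotlib.pyplot.pcolormesh directly, and the random field stands in for aggregated partition values:

    import matplotlib.pyplot as plt
    import numpy as np
    from gpm.bucket.partitioning import LonLatPartitioning

    partitioning = LonLatPartitioning(size=10)
    x_corners, y_corners = partitioning.quadmesh_corners(origin="bottom")
    # One value per quadrilateral: shape (M, N) for (M+1, N+1) corners
    values = np.random.rand(x_corners.shape[0] - 1, x_corners.shape[1] - 1)
    plt.pcolormesh(x_corners, y_corners, values)
    plt.show()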

query_centroids(x, y)[source][source]#

Return the partition centroids for the specified x,y coordinates.

query_centroids_by_indices(x_indices, y_indices)[source][source]#

Return the partition centroids for the specified x,y indices.

query_indices(x, y)[source][source]#

Return the 2D partition indices for the specified x,y coordinates.

query_labels(x, y)[source][source]#

Return the partition labels for the specified x,y coordinates.

query_labels_by_indices(x_indices, y_indices)[source][source]#

Return the partition labels as a function of the specified 2D partition indices.

query_vertices(x, y, ccw=True)[source][source]#
query_vertices_by_indices(x_indices, y_indices, ccw=True)[source][source]#

Return the partitions vertices in an array of shape (indices, 4, 2).

to_shapely()[source][source]#

Return an array with shapely polygons.

to_xarray(df, spatial_coords=None, aux_coords=None)[source][source]#

Convert dataframe to spatial xarray Dataset based on partitions centroids.

This routine assumes that the dataframe has been grouped and aggregated over the partition labels or the partition centroids!

Please add the partition centroids to the dataframe with add_centroids before calling this method, and specify the partition centroid x and y columns in the spatial_coords argument.

Please also specify the presence of auxiliary coordinates (indices) with aux_coords. Array cells whose coordinates are not included in the dataframe are filled with NaN.
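
A hedged sketch of this workflow with polars (the "precip" column and the centroid column names are illustrative assumptions):

    import polars as pl
    from gpm.bucket.partitioning import LonLatPartitioning

    partitioning = LonLatPartitioning(size=5)
    df = pl.DataFrame({"lon": [2.3, 2.4, -70.6], "lat": [48.8, 48.9, -33.4], "precip": [1.0, 3.0, 2.0]})
    # 1. Add the partition centroids
    df = partitioning.add_centroids(df, x="lon", y="lat", x_coord="lon_c", y_coord="lat_c")
    # 2. Group and aggregate over the centroids
    df_stats = df.group_by(["lon_c", "lat_c"]).agg(pl.col("precip").mean())
    # 3. Convert to a spatial xarray Dataset
    ds = partitioning.to_xarray(df_stats, spatial_coords=["lon_c", "lat_c"])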

vertices(ccw=True, origin='bottom')[source][source]#

Return the partitions vertices in an array of shape (N, M, 4, 2).

The output vertices, once the first 2 dimensions are flattened, can be passed directly to a matplotlib.collections.PolyCollection. For plotting with cartopy, the polygon vertices must be ordered counterclockwise.

Parameters:
  • ccw (bool, optional) – If True, vertices are ordered counterclockwise. If False, vertices are ordered clockwise. The default is True.

  • origin (str) – Origin of the y axis. The default is bottom.
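
A minimal plotting sketch: flatten the leading (N, M) dimensions and pass the polygons to a PolyCollection:

    import matplotlib.pyplot as plt
    from matplotlib.collections import PolyCollection
    from gpm.bucket.partitioning import LonLatPartitioning

    partitioning = LonLatPartitioning(size=30)
    verts = partitioning.vertices(ccw=True, origin="bottom")
    polygons = verts.reshape(-1, 4, 2)  # flatten the first 2 dimensions
    fig, ax = plt.subplots()
    ax.add_collection(PolyCollection(polygons, edgecolor="black", facecolor="none"))
    ax.set_xlim(-180, 180)
    ax.set_ylim(-90, 90)
    plt.show()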

class gpm.bucket.partitioning.LonLatPartitioning(size, extent=[-180, 180, -90, 90], levels=None, flavor='hive', order=None, labels_decimals=None)[source][source]#

Bases: XYPartitioning

Handles geographic partitioning of data based on longitude and latitude bin sizes within a defined extent.

The last bin (in the lon and lat directions) might not be of size size!

Parameters:
  • size (float) – The uniform size for longitude and latitude binning. Carefully consider the size of the partitions: partitioning the Earth by 1° corresponds to 64800 directories (360*180), by 5° to 2592 directories (72*36), by 10° to 648 directories (36*18), and by 15° to 288 directories (24*12).

  • levels (list, optional) – Names of the longitude and latitude partitions. The default is ["lon_bin", "lat_bin"].

  • extent (list, optional) – The geographical extent for the partitioning specified as [xmin, xmax, ymin, ymax]. Default is the whole Earth: [-180, 180, -90, 90].

  • order (list, optional) – The order of the partitions when writing partitioned datasets. The default, None, corresponds to levels.

  • flavor (str, optional) – This argument governs the directory names of partitioned datasets. The default, "hive", names the directories with the format {partition_name}={partition_label}. If None, the directories are named with the partition labels (DirectoryPartitioning).

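A minimal usage sketch ("Italy" and the point coordinates are illustrative choices):

    from gpm.bucket.partitioning import LonLatPartitioning

    partitioning = LonLatPartitioning(size=5)
    # Directory trees covering a country (1° of padding on each side)
    dirs = partitioning.directories_by_country("Italy", padding=1)
    # Partition labels within a 2° box centered on a point
    labels = partitioning.get_partitions_around_point(lon=8.9, lat=45.5, size=2)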

directories_around_point(lon, lat, distance=None, size=None)[source][source]#

Return the directory trees with data within the distance/size from a point.

directories_by_continent(name, padding=None)[source][source]#

Return the directory trees with data within a continent.

directories_by_country(name, padding=None)[source][source]#

Return the directory trees with data within a country.

get_partitions_around_point(lon, lat, distance=None, size=None)[source][source]#

Return the partition labels with data within the distance/size from a point.

get_partitions_by_continent(name, padding=None)[source][source]#

Return the partition labels enclosing the specified continent.

get_partitions_by_country(name, padding=None)[source][source]#

Return the partition labels enclosing the specified country.

class gpm.bucket.partitioning.TilePartitioning(size, extent, n_levels, levels=None, origin='bottom', direction='x', justify=False, flavor=None, order=None)[source][source]#

Bases: Base2DPartitioning

Handles partitioning of data into tiles.

Parameters:
  • size (int, float, tuple, list) – The size value(s) of the bins. An int or float enforces the same size in both x and y directions; a tuple or list specifies the bin sizes for the x and y directions.

  • extent (list) – The extent for the partitioning specified as [xmin, xmax, ymin, ymax].

  • n_levels (int) – The number of tile partitioning levels. If n_levels=2, an (x, y) label is assigned to each tile. If n_levels=1, a unique id label combining the x and y tile indices is assigned to each tile. The origin and direction parameters govern its value.

  • levels (list, optional) – If n_levels>=2, the first two names must correspond to the x and y partitions. The default with n_levels=1 is ["tile"]. The default with n_levels=2 is ["x", "y"].

  • origin (str, optional) – The origin of the y axis. Either "bottom" or "top". TMS tiles assume origin="top". Google Maps tiles assume origin="bottom". The default is "bottom".

  • direction (str, optional) – The direction used to define tile ids when n_levels=1. Valid direction values are "x" and "y". direction="x" numbers the tiles row by row. direction="y" numbers the tiles column by column.

  • justify (bool, optional) – Whether to justify the labels so that they all have the same number of characters. Zeros are added on the left side of the labels to pad the length. The default is False.

  • order (list, optional) – The order of the partitions when writing partitioned datasets. The default, None, corresponds to levels.

  • flavor (str, optional) – This argument governs the directory names of partitioned datasets. The default, None, names the directories with the partition labels (DirectoryPartitioning). The option "hive" names the directories with the format {partition_name}={partition_label}.
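
A minimal sketch: four global tiles identified by a single "tile" id, numbered row by row from the bottom:

    from gpm.bucket.partitioning import TilePartitioning

    partitioning = TilePartitioning(
        size=(180, 90),                 # 2 x 2 tiles over the whole Earth
        extent=[-180, 180, -90, 90],
        n_levels=1,                     # a unique "tile" id per tile
        origin="bottom",
        direction="x",                  # number the tiles row by row
    )
    # Tile labels of two sample points
    labels = partitioning.query_labels(x=[-30.0, 100.0], y=[10.0, -60.0])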

to_dict()[source][source]#

Return the partitioning settings.

class gpm.bucket.partitioning.XYPartitioning(size, extent, levels=None, order=None, flavor=None, labels_decimals=None)[source][source]#

Bases: Base2DPartitioning

Handles partitioning of data into x and y regularly spaced bins.

Parameters:
  • size (int, float, tuple, list) – The size value(s) of the bins. An int or float enforces the same size in both x and y directions; a tuple or list specifies the bin sizes for the x and y directions.

  • extent (list) – The extent for the partitioning specified as [xmin, xmax, ymin, ymax].

  • levels (list, optional) – Names of the x and y partitions. The default is ["xbin", "ybin"].

  • order (list, optional) – The order of the x and y partitions when writing partitioned datasets. The default, None, corresponds to levels.

  • flavor (str, optional) – This argument governs the directory names of partitioned datasets. The default, None, names the directories with the partition labels (DirectoryPartitioning). The option "hive" names the directories with the format {partition_name}={partition_label}.
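
A minimal sketch over a custom extent (the extent and size are illustrative):

    from gpm.bucket.partitioning import XYPartitioning

    partitioning = XYPartitioning(size=0.25, extent=[0, 10, 0, 5])
    print(partitioning.x_labels)   # partition labels along the x dimension
    print(partitioning.to_dict())  # the partitioning settings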

to_dict()[source][source]#

Return the partitioning settings.

property x_labels[source]#

Return the partition labels across the horizontal dimension.

property y_labels[source]#

Return the partition labels across the vertical dimension.

gpm.bucket.partitioning.check_default_levels(levels, default_levels)[source][source]#
gpm.bucket.partitioning.check_partitioning_flavor(flavor)[source][source]#

Validate the flavor argument.

If None, defaults to “directory”.

gpm.bucket.partitioning.check_partitioning_order(levels, order)[source][source]#
gpm.bucket.partitioning.check_valid_x_y(df, x, y)[source][source]#

Check if the x and y columns are in the dataframe.

gpm.bucket.partitioning.flatten_indices_arrays(func)[source][source]#
gpm.bucket.partitioning.flatten_xy_arrays(func)[source][source]#
gpm.bucket.partitioning.get_array_combinations(x, y)[source][source]#

Return all combinations between the two input arrays.

gpm.bucket.partitioning.get_bounds(size, vmin, vmax)[source][source]#

Define partitions edges.

gpm.bucket.partitioning.get_centroids_from_bounds(bounds)[source][source]#

Define partitions centroids from bounds.

gpm.bucket.partitioning.get_directories(dict_labels, order, flavor)[source][source]#

Return the directory trees of a partitioned dataset.

gpm.bucket.partitioning.get_n_decimals(number)[source][source]#

Get the number of decimals of a number.

gpm.bucket.partitioning.get_partition_dir_name(partition_name, partition_labels, flavor)[source][source]#

Return the directory name of a partition.

gpm.bucket.partitioning.get_tile_id_labels(x_indices, y_indices, origin, direction, n_x, n_y, justify)[source][source]#

Return the 1D tile labels for the specified x,y indices.

gpm.bucket.partitioning.get_tile_xy_labels(x_indices, y_indices, origin, n_x, n_y, justify=False)[source][source]#

Return the 2D tile labels for the specified x,y indices.

gpm.bucket.partitioning.justify_labels(labels, length)[source][source]#
gpm.bucket.partitioning.mask_invalid_indices(flag_value=nan)[source][source]#
gpm.bucket.partitioning.np_broadcast_like(x, shape)[source][source]#
gpm.bucket.partitioning.query_indices(values, bounds)[source][source]#

Return the index for the specified coordinates.

Invalid values (NaN, None) or out-of-bounds values return NaN.
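
A small sketch combining get_bounds and query_indices (the edge values are illustrative):

    import numpy as np
    from gpm.bucket.partitioning import get_bounds, query_indices

    bounds = get_bounds(size=2, vmin=0, vmax=10)  # partition edges 0, 2, ..., 10
    indices = query_indices(np.array([1.5, 9.9, np.nan]), bounds)  # NaN for invalid values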

gpm.bucket.readers module#

This module provides utilities to read GPM Geographic Buckets Apache Parquet files.

gpm.bucket.readers.check_backend(backend)[source][source]#

Check backend type.

gpm.bucket.readers.read_bucket(bucket_dir, extent=None, country=None, continent=None, point=None, distance=None, size=None, padding=0, file_extension=None, glob_pattern=None, regex_pattern=None, backend='polars', **polars_kwargs)[source][source]#

Read a geographic bucket.

The extent, country, continent, or point arguments allow reading only a spatial subset of the original bucket. Please specify only one of these arguments!

The file_extension, glob_pattern and regex_pattern arguments allow further restricting the selection of files read from the partitioned dataset.

Parameters:
  • bucket_dir (str) – Base directory of the geographic bucket.

  • extent (list, optional) – The extent specified as [xmin, xmax, ymin, ymax].

  • country (str, optional) – The name of the country of interest.

  • continent (str, optional) – The name of the continent of interest.

  • point (list or tuple, optional) – The longitude and latitude coordinates of the point of interest. To effectively subset data around this point, also specify the size or distance argument.

  • distance (float, optional) – Distance (in meters) from the specified point in each direction.

  • size (int, float, tuple, list, optional) – The size in degrees of the extent in each direction centered around the specified point.

  • padding (int, float, tuple, list) – The number of degrees to extend the (country, continent) extent in each direction. If padding is a single number, the same padding is applied in all directions. If padding is a tuple or list, it must contain 2 or 4 elements. If two values are provided (x, y), they are interpreted as longitude and latitude padding, respectively. If four values are provided, they directly correspond to padding for each side (left, right, top, bottom). Default is 0.

  • file_extension (str, optional) – Name of the file extension. The default is None.

  • glob_pattern (str, optional) – Unix shell-style wildcards to subset the files to read in. The default is None.

  • regex_pattern (str, optional) – Regex pattern to subset the files to read in. The default is None.

  • backend (str, optional) – The desired type of dataframe returned by the function. The default is polars.DataFrame. Valid backends are pandas, polars_lazy and pyarrow.

  • **polars_kwargs (dict) – Arguments to be passed to polars.read_parquet(). columns allows specifying the subset of columns to read. n_rows allows stopping reading data from the Parquet files after reading n_rows. For other arguments, please refer to: https://docs.pola.rs/py-polars/html/reference/api/polars.read_parquet.html

Returns:

df – Bucket dataframe.

Return type:

pandas.DataFrame, polars.DataFrame, polars.LazyFrame or pyarrow.Table
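
A hedged usage sketch (the bucket path, extent and column names are illustrative assumptions):

    from gpm.bucket.readers import read_bucket

    df = read_bucket(
        bucket_dir="/data/bucket",
        extent=[5, 10, 44, 47],    # read only this lon/lat box
        backend="pandas",          # return a pandas.DataFrame
        columns=["lon", "lat"],    # forwarded to polars.read_parquet
    )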

gpm.bucket.readers.read_dask_partitioned_dataset(base_dir, columns=None)[source][source]#

gpm.bucket.routines module#

This module provides the routines for the creation of GPM Geographic Buckets.

gpm.bucket.routines.split_list_in_blocks(values, block_size)[source][source]#
gpm.bucket.routines.write_granule_bucket(src_filepath, bucket_dir, partitioning, granule_to_df_func, x='lon', y='lat', **writer_kwargs)[source][source]#

Write a geographically partitioned Parquet Dataset of a GPM granule.

Parameters:
  • src_filepath (str) – File path of the granule to store in the bucket archive.

  • bucket_dir (str) – Base directory of the per-granule bucket archive.

  • partitioning (gpm.bucket.SpatialPartitioning) – A spatial partitioning class.

  • granule_to_df_func (Callable) – Function taking a granule filepath, opening it and returning a pandas or dask dataframe.

  • x (str) – The name of the x column. The default is “lon”.

  • y (str) – The name of the y column. The default is “lat”.

  • **writer_kwargs (dict) – Optional arguments to be passed to the pyarrow Dataset Writer. Common arguments are 'format' and 'use_threads'. The default file format is 'parquet'. The default use_threads is True, which enables multithreaded file writing. More information is available at https://arrow.apache.org/docs/python/generated/pyarrow.dataset.write_dataset.html
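
A hedged sketch (open_granule_as_df is a hypothetical user-supplied reader; the paths are illustrative):

    import pandas as pd
    from gpm.bucket.partitioning import LonLatPartitioning
    from gpm.bucket.routines import write_granule_bucket

    def open_granule_as_df(filepath):
        # Hypothetical reader: replace with real granule-opening logic
        return pd.DataFrame({"lon": [0.5], "lat": [45.0], "precip": [1.0]})

    write_granule_bucket(
        src_filepath="/data/granules/granule.HDF5",
        bucket_dir="/data/bucket",
        partitioning=LonLatPartitioning(size=5),
        granule_to_df_func=open_granule_as_df,
    )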

gpm.bucket.writers module#

This module provides utilities to write a GPM Geographic Bucket Apache Parquet Dataset.

gpm.bucket.writers.convert_size_to_bytes(size)[source][source]#
gpm.bucket.writers.estimate_row_group_size(df, size='200MB')[source][source]#

Estimate row_group_size parameter based on the desired row group memory size.

row_group_size is a Parquet argument controlling the number of rows in each Apache Parquet File Row Group.
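
A minimal sketch (the dataframe content is illustrative):

    import numpy as np
    import pandas as pd
    from gpm.bucket.writers import estimate_row_group_size

    df = pd.DataFrame({"lon": np.zeros(10_000), "lat": np.zeros(10_000)})
    # Number of rows corresponding to roughly 200MB per row group
    row_group_size = estimate_row_group_size(df, size="200MB")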

gpm.bucket.writers.get_table_from_dask_dataframe_partition(df)[source][source]#
gpm.bucket.writers.get_table_schema_without_partitions(table, partitions=None)[source][source]#
gpm.bucket.writers.preprocess_writer_kwargs(writer_kwargs, df)[source][source]#
gpm.bucket.writers.write_arrow_partitioned_dataset(table, base_dir, filename_prefix, partitions, **writer_kwargs)[source][source]#
gpm.bucket.writers.write_dask_partitioned_dataset(df, base_dir, filename_prefix, partitions, **writer_kwargs)[source][source]#

Write a Dask DataFrame to a partitioned dataset.

It loops over the dataframe partitions and writes them to disk. If row_group_size or max_file_size are specified as strings, it loads the first dataframe partition to estimate the number of rows.

gpm.bucket.writers.write_dataset_metadata(base_dir, metadata_collector, schema)[source][source]#
gpm.bucket.writers.write_pandas_partitioned_dataset(df, base_dir, filename_prefix, partitions, **writer_kwargs)[source][source]#
gpm.bucket.writers.write_partitioned_dataset(df, base_dir, partitions=None, filename_prefix='part', **writer_kwargs)[source][source]#
gpm.bucket.writers.write_polars_partitioned_dataset(df, base_dir, filename_prefix, partitions, **writer_kwargs)[source][source]#

Module contents#

This directory defines the GPM-API geographic binning toolbox.

class gpm.bucket.LonLatPartitioning(size, extent=[-180, 180, -90, 90], levels=None, flavor='hive', order=None, labels_decimals=None)[source][source]#

Bases: XYPartitioning

Handles geographic partitioning of data based on longitude and latitude bin sizes within a defined extent.

The last bin (in the lon and lat directions) might not be of size size!

Parameters:
  • size (float) – The uniform size for longitude and latitude binning. Carefully consider the size of the partitions: partitioning the Earth by 1° corresponds to 64800 directories (360*180), by 5° to 2592 directories (72*36), by 10° to 648 directories (36*18), and by 15° to 288 directories (24*12).

  • levels (list, optional) – Names of the longitude and latitude partitions. The default is ["lon_bin", "lat_bin"].

  • extent (list, optional) – The geographical extent for the partitioning specified as [xmin, xmax, ymin, ymax]. Default is the whole Earth: [-180, 180, -90, 90].

  • order (list, optional) – The order of the partitions when writing partitioned datasets. The default, None, corresponds to levels.

  • flavor (str, optional) – This argument governs the directory names of partitioned datasets. The default, "hive", names the directories with the format {partition_name}={partition_label}. If None, the directories are named with the partition labels (DirectoryPartitioning).


directories_around_point(lon, lat, distance=None, size=None)[source][source]#

Return the directory trees with data within the distance/size from a point.

directories_by_continent(name, padding=None)[source][source]#

Return the directory trees with data within a continent.

directories_by_country(name, padding=None)[source][source]#

Return the directory trees with data within a country.

get_partitions_around_point(lon, lat, distance=None, size=None)[source][source]#

Return the partition labels with data within the distance/size from a point.

get_partitions_by_continent(name, padding=None)[source][source]#

Return the partition labels enclosing the specified continent.

get_partitions_by_country(name, padding=None)[source][source]#

Return the partition labels enclosing the specified country.

class gpm.bucket.TilePartitioning(size, extent, n_levels, levels=None, origin='bottom', direction='x', justify=False, flavor=None, order=None)[source][source]#

Bases: Base2DPartitioning

Handles partitioning of data into tiles.

Parameters:
  • size (int, float, tuple, list) – The size value(s) of the bins. An int or float enforces the same size in both x and y directions; a tuple or list specifies the bin sizes for the x and y directions.

  • extent (list) – The extent for the partitioning specified as [xmin, xmax, ymin, ymax].

  • n_levels (int) – The number of tile partitioning levels. If n_levels=2, an (x, y) label is assigned to each tile. If n_levels=1, a unique id label combining the x and y tile indices is assigned to each tile. The origin and direction parameters govern its value.

  • levels (list, optional) – If n_levels>=2, the first two names must correspond to the x and y partitions. The default with n_levels=1 is ["tile"]. The default with n_levels=2 is ["x", "y"].

  • origin (str, optional) – The origin of the y axis. Either "bottom" or "top". TMS tiles assume origin="top". Google Maps tiles assume origin="bottom". The default is "bottom".

  • direction (str, optional) – The direction used to define tile ids when n_levels=1. Valid direction values are "x" and "y". direction="x" numbers the tiles row by row. direction="y" numbers the tiles column by column.

  • justify (bool, optional) – Whether to justify the labels so that they all have the same number of characters. Zeros are added on the left side of the labels to pad the length. The default is False.

  • order (list, optional) – The order of the partitions when writing partitioned datasets. The default, None, corresponds to levels.

  • flavor (str, optional) – This argument governs the directory names of partitioned datasets. The default, None, names the directories with the partition labels (DirectoryPartitioning). The option "hive" names the directories with the format {partition_name}={partition_label}.

to_dict()[source][source]#

Return the partitioning settings.

gpm.bucket.merge_granule_buckets(*args, **kwargs)[source]#
gpm.bucket.read(bucket_dir, extent=None, country=None, continent=None, point=None, distance=None, size=None, padding=0, file_extension=None, glob_pattern=None, regex_pattern=None, backend='polars', **polars_kwargs)[source]#

Read a geographic bucket.

The extent, country, continent, or point arguments allow reading only a spatial subset of the original bucket. Please specify only one of these arguments!

The file_extension, glob_pattern and regex_pattern arguments allow further restricting the selection of files read from the partitioned dataset.

Parameters:
  • bucket_dir (str) – Base directory of the geographic bucket.

  • extent (list, optional) – The extent specified as [xmin, xmax, ymin, ymax].

  • country (str, optional) – The name of the country of interest.

  • continent (str, optional) – The name of the continent of interest.

  • point (list or tuple, optional) – The longitude and latitude coordinates of the point of interest. To effectively subset data around this point, also specify the size or distance argument.

  • distance (float, optional) – Distance (in meters) from the specified point in each direction.

  • size (int, float, tuple, list, optional) – The size in degrees of the extent in each direction centered around the specified point.

  • padding (int, float, tuple, list) – The number of degrees to extend the (country, continent) extent in each direction. If padding is a single number, the same padding is applied in all directions. If padding is a tuple or list, it must contain 2 or 4 elements. If two values are provided (x, y), they are interpreted as longitude and latitude padding, respectively. If four values are provided, they directly correspond to padding for each side (left, right, top, bottom). Default is 0.

  • file_extension (str, optional) – Name of the file extension. The default is None.

  • glob_pattern (str, optional) – Unix shell-style wildcards to subset the files to read in. The default is None.

  • regex_pattern (str, optional) – Regex pattern to subset the files to read in. The default is None.

  • backend (str, optional) – The desired type of dataframe returned by the function. The default is polars.DataFrame. Valid backends are pandas, polars_lazy and pyarrow.

  • **polars_kwargs (dict) – Arguments to be passed to polars.read_parquet(). columns allows specifying the subset of columns to read. n_rows allows stopping reading data from the Parquet files after reading n_rows. For other arguments, please refer to: https://docs.pola.rs/py-polars/html/reference/api/polars.read_parquet.html

Returns:

df – Bucket dataframe.

Return type:

pandas.DataFrame, polars.DataFrame, polars.LazyFrame or pyarrow.Table

gpm.bucket.write_bucket(*args, **kwargs)[source]#
gpm.bucket.write_granules_bucket(*args, **kwargs)[source]#