Table functions

legate_dataframe.lib.groupby_aggregation.groupby_aggregation(LogicalTable table, keys: Iterable[str], column_aggregations: Iterable[Tuple[str, AggregationKind, str]]) → LogicalTable

Perform a groupby and aggregation in a single operation.

Warning

Non-default cudf::aggregation arguments are ignored: the default constructor is always used. As a consequence, only aggregations that have a default constructor are supported.

Parameters:
  • table – The table to group and aggregate.

  • keys – The names of the columns whose rows act as the groupby keys.

  • column_aggregations – A list of column aggregations to perform. Each column aggregation produces a column in the output table by performing an AggregationKind on a column in table. It consists of a tuple: (<input-column-name>, <aggregation-kind>, <output-column-name>). E.g. ("x", SUM, "sum-of-x") will produce a column named “sum-of-x” in the output table, which, for each groupby key, has a row that contains the sum of the values in the column “x”. Multiple column aggregations can share the same input column, but all output column names must be unique and must not conflict with the names of the key columns.

Returns:

A new logical table that contains the key columns and the aggregated columns, using the output column names and order specified in column_aggregations.
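To make the (<input-column-name>, <aggregation-kind>, <output-column-name>) convention concrete, here is a minimal pure-Python sketch of the semantics described above. It does not call legate_dataframe: the helper groupby_aggregate_sketch, the dict-of-lists table representation, and the string aggregation names "sum", "min" and "max" are hypothetical stand-ins for LogicalTable and AggregationKind, used for illustration only.

```python
def groupby_aggregate_sketch(table, keys, column_aggregations):
    """Illustrative only: mimic the documented semantics on a dict of lists."""
    # Hypothetical stand-ins for AggregationKind values.
    kinds = {"sum": sum, "min": min, "max": max}

    keys = list(keys)
    column_aggregations = list(column_aggregations)
    out_names = [out for _, _, out in column_aggregations]
    # Output names must be unique and must not clash with the key columns.
    if len(set(out_names)) != len(out_names) or set(out_names) & set(keys):
        raise ValueError("output column names must be unique and differ from keys")

    n_rows = len(next(iter(table.values())))
    # Map each distinct key tuple to the row indices that belong to it.
    groups = {}
    for i in range(n_rows):
        groups.setdefault(tuple(table[k][i] for k in keys), []).append(i)

    # Key columns come first, then the aggregated columns in the requested order.
    result = {k: [] for k in keys}
    result.update({out: [] for out in out_names})
    for key, rows in groups.items():
        for k, value in zip(keys, key):
            result[k].append(value)
        for in_name, kind, out_name in column_aggregations:
            values = [table[in_name][i] for i in rows]
            result[out_name].append(kinds[kind](values))
    return result


table = {"k": ["a", "b", "a", "b"], "x": [1, 2, 3, 4]}
out = groupby_aggregate_sketch(
    table,
    keys=["k"],
    # The same input column "x" feeds two distinct output columns.
    column_aggregations=[("x", "sum", "sum-of-x"), ("x", "max", "max-of-x")],
)
print(out)  # {'k': ['a', 'b'], 'sum-of-x': [4, 6], 'max-of-x': [3, 4]}
```

The sketch emits groups in first-seen order; this page does not state the row order of the real output, so do not rely on it.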

legate_dataframe.lib.join.join(LogicalTable lhs, LogicalTable rhs, *, lhs_keys: Iterable[str], rhs_keys: Iterable[str], JoinType join_type, lhs_out_columns: Optional[Iterable[str]] = None, rhs_out_columns: Optional[Iterable[str]] = None, null_equality compare_nulls=null_equality.EQUAL, BroadcastInput broadcast=BroadcastInput.AUTO)

Perform a join between the specified tables.

By default, the returned Table includes the columns from both lhs and rhs. In order to select the desired output columns, please use the lhs_out_columns and rhs_out_columns arguments. This can be useful to avoid duplicate key names and columns.

Parameters:
  • lhs – The left table

  • rhs – The right table

  • lhs_keys – The column names of the left table to join on

  • rhs_keys – The column names of the right table to join on

  • join_type – The JoinType, such as INNER, LEFT, or FULL

  • lhs_out_columns – Left table column names to include in the result. If None, all columns are included. All names in lhs_out_columns and rhs_out_columns must be unique.

  • rhs_out_columns – Right table column names to include in the result. If None, all columns are included. All names in lhs_out_columns and rhs_out_columns must be unique.

  • compare_nulls – Controls whether null join-key values should match or not

  • broadcast (BroadcastInput) – Can be RIGHT or LEFT to indicate that the right or left table is “broadcast” to all workers (i.e. copied fully). This can be much faster, as it avoids more complex all-to-all communication. Defaults to AUTO, which may broadcast one of the tables based on the data size.

Returns:

The result of the join, which includes the columns specified in lhs_out_columns and rhs_out_columns (in that order).

Raises:

ValueError – If the numbers of elements in lhs_keys and rhs_keys do not match, or if the column names in lhs_out_columns and rhs_out_columns are not unique.
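The argument rules and output column order described above can be sketched in plain Python. This is an illustration of the documented behaviour for an inner join, not the library's implementation: inner_join_sketch and the dict-of-lists tables are hypothetical, nulls and broadcasting are not modelled, and the sketch assumes the uniqueness check is applied after None has been expanded to all columns of the respective table.

```python
def inner_join_sketch(lhs, rhs, *, lhs_keys, rhs_keys,
                      lhs_out_columns=None, rhs_out_columns=None):
    """Illustrative only: inner join on dict-of-lists tables."""
    lhs_keys, rhs_keys = list(lhs_keys), list(rhs_keys)
    if len(lhs_keys) != len(rhs_keys):
        raise ValueError("lhs_keys and rhs_keys must have the same length")

    # None means "all columns of that table".
    lhs_out = list(lhs) if lhs_out_columns is None else list(lhs_out_columns)
    rhs_out = list(rhs) if rhs_out_columns is None else list(rhs_out_columns)
    names = lhs_out + rhs_out  # left columns first, then right columns
    if len(set(names)) != len(names):
        raise ValueError("output column names must be unique")

    n_lhs = len(next(iter(lhs.values())))
    n_rhs = len(next(iter(rhs.values())))
    # Index the right-hand rows by their key tuple.
    rhs_index = {}
    for j in range(n_rhs):
        rhs_index.setdefault(tuple(rhs[k][j] for k in rhs_keys), []).append(j)

    result = {name: [] for name in names}
    for i in range(n_lhs):
        key = tuple(lhs[k][i] for k in lhs_keys)
        # Emit one output row per matching (left, right) row pair.
        for j in rhs_index.get(key, []):
            for name in lhs_out:
                result[name].append(lhs[name][i])
            for name in rhs_out:
                result[name].append(rhs[name][j])
    return result


lhs = {"id": [1, 2, 3], "a": [10, 20, 30]}
rhs = {"id": [2, 3, 4], "b": ["x", "y", "z"]}

# Keep "id" from the left table only, so the output names stay unique.
joined = inner_join_sketch(
    lhs, rhs,
    lhs_keys=["id"], rhs_keys=["id"],
    lhs_out_columns=["id", "a"], rhs_out_columns=["b"],
)
print(joined)  # {'id': [2, 3], 'a': [20, 30], 'b': ['x', 'y']}
```

In this sketch, omitting both lhs_out_columns and rhs_out_columns for these two tables would raise ValueError, because "id" would appear in the output twice. That is the situation the column-selection arguments are meant to avoid.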