Column functions

legate_dataframe.lib.unaryop.cast(LogicalColumn col, dtype) -> LogicalColumn

Cast a logical column to the desired data type.

Parameters:
  • col – Logical column as input

  • dtype – The data type of the result.

Return type:

Logical column of same size as col but with new data type.

legate_dataframe.lib.unaryop.round(LogicalColumn col, int32_t digits, mode='half_to_even') -> LogicalColumn

Round the values of a logical column to the given number of digits.

Parameters:
  • col – Logical column as input

  • digits – Number of decimals to round to.

  • mode – Rounding mode; currently only “half_to_even” and “half_away_from_zero” are supported.

Return type:

Logical column of same size as col containing the rounded values.
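The two rounding modes differ only in how exact ties are resolved. The following is a minimal plain-Python sketch of that difference using the standard `decimal` module; it does not call legate_dataframe, and the helper name `round_model` is hypothetical.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP


def round_model(value: str, digits: int, mode: str = "half_to_even") -> Decimal:
    """Illustrative model of the two rounding modes (not the library code)."""
    # decimal's ROUND_HALF_UP rounds ties away from zero.
    rounding = {
        "half_to_even": ROUND_HALF_EVEN,
        "half_away_from_zero": ROUND_HALF_UP,
    }[mode]
    # digits=2 -> quantum of 0.01, digits=0 -> quantum of 1
    quantum = Decimal(1).scaleb(-digits)
    return Decimal(value).quantize(quantum, rounding=rounding)


# Ties are the only inputs on which the two modes disagree.
print(round_model("2.5", 0))                          # tie goes to the even neighbour
print(round_model("2.5", 0, "half_away_from_zero"))   # tie goes away from zero
```

Values that are not exact ties round identically under both modes.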

legate_dataframe.lib.unaryop.unary_operation(LogicalColumn col, str op) -> LogicalColumn

Performs a unary operation on all values in the column.

Note: For decimal32 and decimal64, only abs, ceil and floor are supported.

Parameters:
  • col – Logical column as input

  • op – Operation to perform, see arrow compute functions.

Return type:

Logical column of same size as col containing result of the operation.

legate_dataframe.lib.binaryop.binary_operation(lhs: LogicalColumn | ScalarLike, rhs: LogicalColumn | ScalarLike, str op, output_type: DTypeLike) -> LogicalColumn

Performs a binary operation between two columns or a column and a scalar.

The output contains the result of op(lhs[i], rhs[i]) for all 0 <= i < lhs.size() where lhs[i] or rhs[i] (but not both) can be replaced with a scalar value.

Regardless of the operator, the validity of the output value is the logical AND of the validity of the two operands except for NullMin and NullMax (logical OR).

Parameters:
  • lhs – The left operand

  • rhs – The right operand

  • op – Name of the arrow compute function to apply, e.g. “add” or “multiply”

  • output_type – The desired data type of the output column

Returns:

Output column of type output_type containing the result of the binary operation.

Raises:
  • ValueError – if lhs and rhs are both scalars

  • RuntimeError – if lhs and rhs are different sizes

  • RuntimeError – if output_type dtype isn’t boolean for comparison and logical operations.

  • RuntimeError – if output_type dtype isn’t fixed-width

  • RuntimeError – if the operation is not supported for the types of lhs and rhs
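The scalar broadcasting and the AND rule for validity described above can be modelled in plain Python, with `None` standing in for a null entry. This is a hedged sketch of the semantics only: it does not call legate_dataframe, the helper name `binary_model` is hypothetical, and the NullMin/NullMax exception is not modelled.

```python
import operator


def binary_model(lhs, rhs, op):
    """Model of op(lhs[i], rhs[i]) with scalar broadcasting and null propagation."""
    lhs_is_col = isinstance(lhs, list)
    rhs_is_col = isinstance(rhs, list)
    if not (lhs_is_col or rhs_is_col):
        raise ValueError("lhs and rhs cannot both be scalars")
    if lhs_is_col and rhs_is_col and len(lhs) != len(rhs):
        raise RuntimeError("lhs and rhs have different sizes")
    n = len(lhs) if lhs_is_col else len(rhs)
    # Broadcast a scalar operand to the length of the column operand.
    left = lhs if lhs_is_col else [lhs] * n
    right = rhs if rhs_is_col else [rhs] * n
    # An output entry is valid only if both inputs are valid (logical AND).
    return [
        None if a is None or b is None else op(a, b)
        for a, b in zip(left, right)
    ]


print(binary_model([1, None, 3], [10, 20, None], operator.add))
print(binary_model([1, None, 3], 2, operator.mul))
```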

legate_dataframe.lib.copying.copy_if_else(LogicalColumn cond, lhs: LogicalColumn | ScalarLike, rhs: LogicalColumn | ScalarLike) -> LogicalColumn

Performs a ternary if/else operation along the columns.

The result will contain the values of lhs[i] if cond[i] else rhs[i]. Both lhs and rhs may be scalar columns in which case they are broadcast against cond. lhs and rhs must have the same type.

Parameters:
  • cond – Boolean column deciding which column each result element is taken from.

  • lhs – The left operand

  • rhs – The right operand

Return type:

Output column containing the result of the ternary if/else operation

Raises:

ValueError – If lhs and rhs do not have the same type or cond is not boolean.
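As an illustration of the `lhs[i] if cond[i] else rhs[i]` rule and of scalar broadcasting, here is a small plain-Python model. It does not call legate_dataframe, the helper name `copy_if_else_model` is hypothetical, and null entries in `cond` are deliberately left out because the text above does not specify how they are treated.

```python
def copy_if_else_model(cond, lhs, rhs):
    """Model of the ternary selection; scalars are broadcast against cond."""
    n = len(cond)
    left = lhs if isinstance(lhs, list) else [lhs] * n
    right = rhs if isinstance(rhs, list) else [rhs] * n
    # Take the element from lhs where cond is true, otherwise from rhs.
    return [a if c else b for c, a, b in zip(cond, left, right)]


# Keep positive values, replace everything else with the scalar 0.
values = [5, -2, 7, -1]
is_positive = [v > 0 for v in values]
print(copy_if_else_model(is_positive, values, 0))
```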

legate_dataframe.lib.copying.concatenate(columns)

Concatenate columns into a single long column.

Creates a new column by concatenating all input columns. At least one column must be given, and all columns must have the same type.

Parameters:

columns – Iterable of logical columns.

Return type:

Output column with as many rows as all input columns combined.

legate_dataframe.lib.timestamps.to_timestamps(LogicalColumn col, timestamp_type: DTypeLike, str format_pattern) -> LogicalColumn

Convert a strings column into timestamps using the provided format pattern.

The format pattern can include the following specifiers: “%Y, %y, %m, %d, %H, %I, %p, %M, %S, %f, %z”.

Please see to_timestamps() for details.

Warning

Invalid formats are not checked; the format pattern must be well defined as per the C++ API.

Parameters:
  • col – Strings instance for this operation

  • timestamp_type – The timestamp type used for creating the output column

  • format_pattern – String specifying the timestamp format in strings

Return type:

New datetime column

Raises:

RuntimeError – if timestamp_type is not a timestamp type.
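To give a feel for how such a format pattern maps onto strings, the sketch below uses Python's standard `datetime.strptime`, which accepts the same specifier spellings. It is only an analogy: it does not call legate_dataframe, and the exact behaviour of individual specifiers in the C++ API may differ from Python's.

```python
from datetime import datetime

# Pattern built only from specifiers listed above.
format_pattern = "%Y-%m-%d %H:%M:%S"
strings = ["2024-01-15 08:30:00", "2023-12-31 23:59:59"]

# Python stand-in for parsing every string in the column with one pattern.
parsed = [datetime.strptime(s, format_pattern) for s in strings]
for s, ts in zip(strings, parsed):
    print(s, "->", ts.isoformat())
```

Unlike the function documented here, `strptime` raises on input that does not match the pattern, so it cannot be used to predict how malformed strings are handled.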

legate_dataframe.lib.timestamps.extract_timestamp_component(LogicalColumn col, str component) -> LogicalColumn

Extract part of the timestamp as int64.

Parameters:
  • col (LogicalColumn) – Column of timestamps

  • component – The component to extract. A string such as “year”, “month”, “day”, or “millisecond”. See the arrow documentation on “Temporal component extraction” for a full list.

Return type:

New int64 column

Notes

Unlike pandas and cudf, this function counts the days of the week as Monday-Sunday being 1-7 and microsecond_fraction does not include milliseconds.
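The two conventions in the note can be illustrated with Python's standard `datetime`. This is a sketch of one reading of the note, not library code: it assumes that "does not include milliseconds" means the microsecond component only counts microseconds since the last full millisecond, and the variable names are chosen for illustration.

```python
from datetime import datetime

ts = datetime(2024, 1, 15, 8, 30, 0, 123456)  # 15 January 2024 is a Monday

# Monday-Sunday numbered 1-7, as described in the note.
day_of_week_1_to_7 = ts.isoweekday()
# The zero-based convention (Monday == 0), shown for contrast.
day_of_week_0_to_6 = ts.weekday()

# Assumed reading: milliseconds and the remaining microseconds are
# reported as two separate components.
millisecond = ts.microsecond // 1000
microsecond_fraction = ts.microsecond % 1000

print(day_of_week_1_to_7, day_of_week_0_to_6, millisecond, microsecond_fraction)
```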

legate_dataframe.lib.reduction.reduce(LogicalColumn col, str op, output_type, *, initial=None)

Apply a reduction along a column.

Parameters:
  • col – The column to reduce.

  • op – The operation to apply, must be one of the following: “sum”, “mean”, “min”, “max”, “product”, “count_valid”.

  • output_type – The result dtype, must be specified.

  • initial – Scalar column containing an initial value for the reduction.
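The listed operations can be sketched as a fold over the column. The plain-Python model below does not call legate_dataframe, and it rests on two assumptions that the text above does not confirm: null entries (modelled as `None`) are skipped, and `initial` simply seeds the fold. The names `reduce_model` and `_OPS` are hypothetical.

```python
from functools import reduce as fold

# Binary functions for the operations that can be written as a fold.
_OPS = {
    "sum": lambda a, b: a + b,
    "product": lambda a, b: a * b,
    "min": min,
    "max": max,
}


def reduce_model(col, op, initial=None):
    """Model of a column reduction. Assumption: nulls are skipped."""
    valid = [v for v in col if v is not None]
    if op == "count_valid":
        return len(valid)
    if op == "mean":
        return sum(valid) / len(valid)
    if initial is not None:
        # Assumption: the initial value seeds the fold.
        return fold(_OPS[op], valid, initial)
    return fold(_OPS[op], valid)


col = [1, None, 3, 4]
print(reduce_model(col, "sum"), reduce_model(col, "count_valid"))
```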

legate_dataframe.lib.replace.replace_nulls(LogicalColumn col, replacement: ScalarLike) -> LogicalColumn

Return a new column with NULL entries replaced by replacement.

Parameters:
  • col – Operand column

  • replacement – Value to replace NULLs with (currently limited to scalars).

Return type:

Output column of the same type as col without NULL entries.

Raises:

ValueError – if the value is not of the correct scalar type.
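The effect of the operation can be modelled in a few lines of plain Python, with `None` standing in for NULL. The sketch does not call legate_dataframe and the helper name `replace_nulls_model` is hypothetical.

```python
def replace_nulls_model(col, replacement):
    """Model: every null entry becomes `replacement`, valid entries are kept."""
    return [replacement if v is None else v for v in col]


print(replace_nulls_model([1.5, None, 3.0, None], 0.0))
```

The model performs no type checking, whereas the function documented here rejects a replacement of the wrong scalar type.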

legate_dataframe.lib.search.contains(LogicalColumn haystack, LogicalColumn needles) -> LogicalColumn

Check if haystack contains the values in needles.

The result will contain boolean values indicating whether each element in the input column exists in the set of values. This is an elementwise needles[i] in haystack.

Parameters:
  • haystack – Column of values to search against. This column is currently broadcast to all workers and assumed to be small.

  • needles – Column of values to check if they exist in the haystack.

Returns:

Boolean column indicating which values exist in the set; it has the same size and nullability as needles.

Raises:

ValueError – If the input columns have different types.
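The elementwise `needles[i] in haystack` check can be written as a short plain-Python model. It does not call legate_dataframe, the helper name `contains_model` is hypothetical, and null entries are left out because their treatment is not spelled out above.

```python
def contains_model(haystack, needles):
    """Model: one boolean per needle, true if that needle occurs in haystack."""
    lookup = set(haystack)  # the small haystack is turned into a lookup set
    return [n in lookup for n in needles]


haystack = [1, 2, 3]
needles = [2, 5, 3, 3]
print(contains_model(haystack, needles))
```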

legate_dataframe.lib.strings.match(str match_func, LogicalColumn column, str pattern) -> LogicalColumn

Check if strings match a given pattern.

Parameters:
  • match_func – The type of matching to perform: “starts_with”, “ends_with”, “match_substring”, or “match_substring_regex”. (Note that the “match_substring*” functions check for containment, not full matches.)

  • column – The column of string values to check

  • pattern – The pattern string to check for. A regular expression for “match_substring_regex”.

Return type:

A boolean column indicating which values match the pattern
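The four matching modes, and in particular the containment behaviour of the “match_substring*” variants, can be sketched in plain Python. The model does not call legate_dataframe, the helper name `match_model` is hypothetical, and Python's `re` module is only a stand-in, so regular expression syntax may differ in detail from what the library accepts.

```python
import re


def match_model(match_func, column, pattern):
    """Model of the four matching modes over a list of strings."""
    checks = {
        "starts_with": lambda s: s.startswith(pattern),
        "ends_with": lambda s: s.endswith(pattern),
        # Containment anywhere in the string, not a full match.
        "match_substring": lambda s: pattern in s,
        # re.search (not re.fullmatch) mirrors the containment semantics.
        "match_substring_regex": lambda s: re.search(pattern, s) is not None,
    }
    check = checks[match_func]
    return [check(s) for s in column]


fruits = ["apple", "banana", "grape"]
print(match_model("match_substring", fruits, "ap"))
print(match_model("match_substring_regex", fruits, "an+a"))
```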