Input/Output#

Parquet files#

legate_dataframe.lib.parquet.parquet_read(files, *, columns=None, ignore_row_groups=None) → LogicalTable#

Read Parquet files into a logical table

Parameters:
  • files (str, Path, or iterable of paths) – If a string, glob.glob is used to conveniently load multiple files; otherwise, must be a path or an iterable of paths (or strings).

  • columns – List of strings selecting a subset of columns to read.

  • ignore_row_groups

    If set to True, the read operation will not be chunked into row groups. When row groups are large, this may lead to better resource use and more efficient reads. Note that temporary resource use may be higher due to the different approach to reading the data.

    Note

    The Python-side default is currently set to True when LDF_PREFER_EAGER_ALLOCATIONS is used, as it helps with streaming. We expect future improvements to ignore_row_groups; as of now, at least on the CPU, it may not be beneficial even with large row groups.

Returns:

The read logical table.

See also

parquet_write

Write parquet data
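
A minimal usage sketch; the file pattern and column names below are hypothetical placeholders:

    from legate_dataframe.lib.parquet import parquet_read

    # A string is passed through glob.glob, so every matching file is
    # read into a single logical table; "a" and "b" are placeholder
    # column names.
    tbl = parquet_read("/path/to/output/part.*.parquet", columns=["a", "b"])

    # Read without chunking into row groups; see the note above for the
    # current caveats of this option.
    tbl = parquet_read("/path/to/output/part.*.parquet", ignore_row_groups=True)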

legate_dataframe.lib.parquet.parquet_read_array(files, *, columns=None, null_value=None, type=None) → LogicalArray#

Read Parquet files into a logical array

To successfully read the files, all selected columns must share the same type, and that type must be compatible with legate (currently only numeric types).

Parameters:
  • files (str, Path, or iterable of paths) – If a string, glob.glob is used to conveniently load multiple files; otherwise, must be a path or an iterable of paths (or strings).

  • columns – List of strings selecting a subset of columns to read.

  • null_value (legate.core.Scalar or None) – If given (not None), the result will not have a null mask; null values are instead replaced with this value.

  • type (legate.core.Type or None) – The desired result legate type. If given, columns are cast to this type. If not given, the dtype is inferred, but all columns must then share the same type.

Returns:

The read logical array.

See also

parquet_read

Read parquet data into a table
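
A usage sketch; it assumes legate.core.types exposes float64 and that legate.core.Scalar can be constructed from a value and a legate type, and the file pattern and column names are placeholders:

    import legate.core.types as ty
    from legate.core import Scalar
    from legate_dataframe.lib.parquet import parquet_read_array

    # All selected columns must share one numeric type; "x" and "y" are
    # placeholder column names.
    arr = parquet_read_array("/path/to/output/part.*.parquet", columns=["x", "y"])

    # Cast everything to float64 and replace nulls with 0.0, so the
    # resulting array carries no null mask.
    arr = parquet_read_array(
        "/path/to/output/part.*.parquet",
        null_value=Scalar(0.0, ty.float64),
        type=ty.float64,
    )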

legate_dataframe.lib.parquet.parquet_write(LogicalTable tbl, path: pathlib.Path | str) → None#

Write logical table to Parquet files

Each partition will be written to a separate file.

Parameters:
  • tbl (LogicalTable) – The table to write.

  • path (str or pathlib.Path) – Destination directory for data.

Files will be created in the specified output directory using the convention part.0.parquet, part.1.parquet, part.2.parquet, and so on for each partition in the table:

    /path/to/output/
    ├── part.0.parquet
    ├── part.1.parquet
    ├── part.2.parquet
    └── …

See also

parquet_read

Read parquet data
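
A round-trip sketch using only the calls documented on this page; the paths are placeholders:

    from legate_dataframe.lib.parquet import parquet_read, parquet_write

    tbl = parquet_read("/path/to/input/*.parquet")
    # Writes one part.N.parquet file per partition of the table into
    # the destination directory.
    parquet_write(tbl, "/path/to/output/")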

CSV files#

Legate-dataframe supports reading and writing CSV files via:

legate_dataframe.lib.csv.csv_read(files, *, dtypes, na_filter=True, delimiter=',', usecols=None, names=None)#

Read csv files into a logical table

Parameters:
  • files (str, Path, or iterable of paths) – If a string, glob.glob is used to conveniently load multiple files, otherwise must be a path or an iterable of paths (or strings).

  • dtypes (iterable of arrow dtype-likes) – The arrow dtypes to extract for each column (or a single one for all).

  • na_filter (bool, optional) – Whether to detect missing values; set to False to improve performance.

  • delimiter (str, optional) – The field delimiter.

  • usecols (iterable of str or int or None, optional) – If given, must match dtypes in length and denotes the column names to extract from the file. If passed as integers, the file is assumed to have no header and names must be passed.

  • names (iterable of str) – The names of the read columns; must be used together with integer usecols.

Returns:

The read logical table.

See also

csv_write

Write csv data

lib.parquet.parquet_write

Write parquet data
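
A usage sketch; the dtypes are given as pyarrow types here, and the file pattern and column names are placeholders:

    import pyarrow as pa
    from legate_dataframe.lib.csv import csv_read

    # Read two named columns, giving one arrow dtype per column.
    tbl = csv_read(
        "/path/to/data/*.csv",
        dtypes=[pa.int64(), pa.float64()],
        usecols=["a", "b"],
    )

    # For header-less files, select columns by index and supply names.
    tbl = csv_read(
        "/path/to/data/*.csv",
        dtypes=[pa.int64(), pa.float64()],
        usecols=[0, 1],
        names=["a", "b"],
    )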

legate_dataframe.lib.csv.csv_write(LogicalTable tbl, path, delimiter=', ')#

Write logical table to csv files

Each partition will be written to a separate file.

Parameters:
  • tbl (LogicalTable) – The table to write.

  • path (str or pathlib.Path) – Destination directory for data.

  • delimiter (str) – The field delimiter.

Files will be created in the specified output directory using the convention part.0.csv, part.1.csv, part.2.csv, and so on for each partition in the table:

    /path/to/output/
    ├── part.0.csv
    ├── part.1.csv
    ├── part.2.csv
    └── …

See also

csv_read

Read csv data

lib.parquet.parquet_read

Read parquet data
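
A round-trip sketch mirroring the Parquet example above; the paths are placeholders and the dtypes are given as pyarrow types:

    import pyarrow as pa
    from legate_dataframe.lib.csv import csv_read, csv_write

    tbl = csv_read("/path/to/input/*.csv", dtypes=[pa.int64(), pa.float64()])
    # Writes one part.N.csv file per partition of the table into the
    # destination directory.
    csv_write(tbl, "/path/to/output/", delimiter=",")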