Input/Output#
Parquet files#
- legate_dataframe.lib.parquet.parquet_read(files, *, columns=None, ignore_row_groups=None) → LogicalTable#
Read Parquet files into a logical table
- Parameters:
  - files (str, Path, or iterable of paths) – If a string, glob.glob is used to conveniently load multiple files; otherwise it must be a path or an iterable of paths (or strings).
  - columns – List of strings selecting a subset of columns to read.
  - ignore_row_groups – If set to True, the read operation will not be chunked into row groups. When row groups are large, this may lead to better resource use and more efficient reads. Note that temporary resource use may be higher due to the different approaches to reading the data.

    Note: The Python-side default is currently set to True when LDF_PREFER_EAGER_ALLOCATIONS is used, as this helps with streaming. We expect future improvements to ignore_row_groups; as of now, at least on the CPU, it may not be beneficial even with large row groups.
- Returns:
  The read logical table.
See also
parquet_write
Write parquet data
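As a minimal usage sketch based on the signature above (the glob pattern and column names are hypothetical placeholders):

```python
from legate_dataframe.lib.parquet import parquet_read

# Read all matching files into one LogicalTable, keeping only two columns.
# "data/*.parquet" and the column names are illustrative, not real paths.
tbl = parquet_read("data/*.parquet", columns=["a", "b"])
```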
- legate_dataframe.lib.parquet.parquet_read_array(files, *, columns=None, null_value=None, type=None) → LogicalArray#
Read Parquet files into a logical array
To successfully read the files, all selected columns must have the same type, and that type must be compatible with legate (currently only numeric types).
- Parameters:
  - files (str, Path, or iterable of paths) – If a string, glob.glob is used to conveniently load multiple files; otherwise it must be a path or an iterable of paths (or strings).
  - columns – List of strings selecting a subset of columns to read.
  - null_value (legate.core.Scalar or None) – If given (not None), the result will not have a null mask and null values are instead replaced with this value.
  - type (legate.core.Type or None) – The desired result legate type. If given, columns are cast to this type. If not given, the dtype is inferred, but all columns must have the same one.
- Returns:
  The read logical array.
See also
parquet_read
Read parquet data into a table
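A sketch of reading numeric columns into a single array; it assumes legate.core.types provides the desired type object, and the file pattern and column names are placeholders:

```python
from legate.core import types as ty  # assumed location of legate's type objects
from legate_dataframe.lib.parquet import parquet_read_array

# Both columns are cast to float64; they must be numeric to be readable.
arr = parquet_read_array("data/*.parquet", columns=["x", "y"], type=ty.float64)
```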
- legate_dataframe.lib.parquet.parquet_write(LogicalTable tbl, path: pathlib.Path | str) → None#
Write logical table to Parquet files
Each partition will be written to a separate file.
- Parameters:
  - tbl (LogicalTable) – The table to write.
  - path (str or pathlib.Path) – Destination directory for data. Files will be created in the specified output directory using the convention part.0.parquet, part.1.parquet, part.2.parquet, and so on for each partition in the table:

    /path/to/output/
    ├── part.0.parquet
    ├── part.1.parquet
    ├── part.2.parquet
    └── …
See also
parquet_read
Read parquet data
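For example, a round trip using the two functions above (the directory names are hypothetical):

```python
from legate_dataframe.lib.parquet import parquet_read, parquet_write

# Read a table, then write it back out as one part.N.parquet file per
# partition under the (hypothetical) "output/" directory.
tbl = parquet_read("data/*.parquet")
parquet_write(tbl, "output/")
```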
CSV files#
Legate-dataframe supports reading and writing CSV files via:
- legate_dataframe.lib.csv.csv_read(files, *, dtypes, na_filter=True, delimiter=',', usecols=None, names=None)#
Read csv files into a logical table
- Parameters:
  - files (str, Path, or iterable of paths) – If a string, glob.glob is used to conveniently load multiple files; otherwise it must be a path or an iterable of paths (or strings).
  - dtypes (iterable of arrow dtype-likes) – The arrow dtypes to extract for each column (or a single one for all).
  - na_filter (bool, optional) – Whether to detect missing values; set to False to improve performance.
  - delimiter (str, optional) – The field delimiter.
  - usecols (iterable of str or int or None, optional) – If given, must match dtypes in length and denotes the column names to be extracted from the file. If passed as integers, implies the file has no header, and names must be passed.
  - names (iterable of str) – The names of the read columns; must be used with integral usecols.
- Returns:
  The read logical table
See also
csv_write
Write csv data
lib.parquet.parquet_write
Write parquet data
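A minimal sketch of reading two typed columns from CSV; the paths and column names are placeholders, and the arrow dtypes come from pyarrow:

```python
import pyarrow as pa
from legate_dataframe.lib.csv import csv_read

# Extract two named columns with explicit arrow dtypes.
tbl = csv_read(
    "data/*.csv",
    dtypes=[pa.int64(), pa.float64()],
    usecols=["id", "value"],
)
```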
- legate_dataframe.lib.csv.csv_write(LogicalTable tbl, path, delimiter=', ')#
Write logical table to csv files
Each partition will be written to a separate file.
- Parameters:
  - tbl (LogicalTable) – The table to write.
  - path (str or pathlib.Path) – Destination directory for data. Files will be created in the specified output directory using the convention part.0.csv, part.1.csv, part.2.csv, and so on for each partition in the table:

    /path/to/output/
    ├── part.0.csv
    ├── part.1.csv
    ├── part.2.csv
    └── …
  - delimiter (str) – The field delimiter.
See also
csv_read
Read csv data
lib.parquet.parquet_read
Read parquet data
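And a matching write sketch (hypothetical paths; one part.N.csv file is produced per partition):

```python
import pyarrow as pa
from legate_dataframe.lib.csv import csv_read, csv_write

# Read CSV data, then write it back out under the (hypothetical) "output/".
tbl = csv_read("data/*.csv", dtypes=[pa.float64()], usecols=["value"])
csv_write(tbl, "output/", delimiter=",")
```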