This shows the basic code organization.
Currently, the repo is basically flat. All implementations are directly under
the cudf/ directory. All tests are in
Here’s a quick map to decide which file contains which feature:
- Column and its subclasses:
  - numerical.py for numeric columns
  - categorical.py for categorical columns
- .apply() and similar functions:
- .query() and similar functions:
- GPU helper functions:
- Docstring helpers:
- Output formatting:
- Dask serialization helpers:
- Operations on multiple DataFrames, Series, or Indices:
- Other general helper functions:
Code that should move to libgdf
The following code should be re-implemented in libgdf in CUDA C for better reusability and performance.
cudf/cudautils.py contains many GPU helper functions that numba JIT-compiles with
@cuda.jit into CUDA kernels. All CUDA kernels in this file should be moved to libgdf if possible.
Some logic in
cudf/groupby.py should be moved to libgdf to make groupby operations faster. Some groupby aggregations are implemented with
Code that cannot move to libgdf
Some features require the JIT to be useful, e.g. features that use user-defined functions. These features cannot be moved to libgdf.
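The distinction can be sketched with a toy, CPU-only example. It uses no cudf, numba, or libgdf API; the names PRECOMPILED_OPS, run_precompiled, and apply_udf are hypothetical and exist only for illustration. A precompiled library can only offer operations that were known when it was built, while a user-defined function is only known at run time.

```python
# Toy, CPU-only illustration. All names here are hypothetical;
# nothing below is part of cudf, numba, or libgdf.

# Stand-in for a precompiled library: the set of operations is
# fixed when the library is built.
PRECOMPILED_OPS = {
    "sum": sum,
    "min": min,
    "max": max,
}


def run_precompiled(op_name, values):
    """Dispatch to an operation that was known at library build time."""
    try:
        op = PRECOMPILED_OPS[op_name]
    except KeyError:
        raise NotImplementedError(
            f"{op_name!r} is not built into the library"
        ) from None
    return op(values)


def apply_udf(udf, values):
    """Apply an arbitrary user function element-wise.

    The body of `udf` is only known at run time, so a GPU
    implementation would have to compile it on the fly (JIT).
    """
    return [udf(v) for v in values]


values = [1, 2, 3, 4]

# A fixed aggregation: a candidate for a precompiled library.
total = run_precompiled("sum", values)

# A user-defined function: its logic cannot be shipped in advance.
shifted = apply_udf(lambda x: x * 2 + 1, values)
```

In this sketch, the fixed aggregations are the kind of code that could live in a precompiled library, whereas apply_udf stands in for features that accept arbitrary user code and therefore need runtime compilation.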