Developer Documentation

Code Organization

This section describes the basic code organization.

Currently, the repo is basically flat. All implementations are directly under the cudf/ directory. All tests are in the cudf/tests/ directory.

Here’s a quick map showing which file contains which feature:

  • DataFrame:
    • dataframe.py

  • Series:
    • series.py

  • Column and its subclasses:
    • column.py

    • columnops.py

    • numerical.py for numeric columns

    • categorical.py for categorical columns

  • Buffer:
    • buffer.py

  • .apply() and similar functions:
    • applyutils.py

  • .query() and similar functions:
    • queryutils.py

  • GPU helper functions:
    • cudautils.py

  • Docstring helpers:
    • docutils.py

  • Output formatting:
    • formatting.py

  • Arrow:
    • gpuarrow.py

  • Groupby:
    • groupby.py

  • Dask serialization helpers:
    • serialize.py

  • Index:
    • index.py

  • Operations on multiple DataFrames, Series, or Indices:
    • multi.py

  • Other general helper functions:
    • utils.py

Code that should move to libgdf

The following code should be re-implemented in libgdf in CUDA-C for better reusability and performance:

  • cudf/cudautils.py contains many GPU helper functions that Numba JIT-compiles into CUDA kernels with @cuda.jit. All CUDA kernels in this file should be moved to libgdf if possible.

  • Some logic in cudf/groupby.py should be moved to libgdf to make groupby operations faster. Some groupby aggregations are implemented with @cuda.jit here.

Code that cannot move to libgdf

Some features require the JIT to be useful, e.g., features that use user-defined functions. These features cannot be moved to libgdf.
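
The distinction can be illustrated with a small pure-Python analogy. This is only a sketch: it uses no GPU code, and every name in it (PRECOMPILED_REDUCTIONS, reduce_precompiled, apply_udf) is hypothetical and not part of cudf or libgdf. The dictionary stands in for a library built ahead of time, whose set of operations is fixed, while the user-defined function is an arbitrary callable that only exists at run time.

```python
# Stand-in for a library compiled ahead of time (an assumption for
# illustration): the available operations are fixed when it is built.
PRECOMPILED_REDUCTIONS = {
    "sum": sum,
    "min": min,
    "max": max,
}


def reduce_precompiled(name, values):
    """Dispatch by name to an operation that was built ahead of time."""
    try:
        op = PRECOMPILED_REDUCTIONS[name]
    except KeyError:
        # An operation that was never built in cannot be requested later.
        raise NotImplementedError(
            f"{name!r} was not compiled into the library"
        ) from None
    return op(values)


def apply_udf(func, values):
    """Apply a user-supplied function whose body is unknown until run time."""
    return [func(v) for v in values]


values = [1, 2, 3, 4]

# A fixed, well-known reduction can live in the precompiled library.
total = reduce_precompiled("sum", values)

# A user-defined function is arbitrary, so it has to be handled at run time.
scaled = apply_udf(lambda x: x * 2 + 1, values)
```

In this analogy, apply_udf plays the role of the JIT-based features: the function body is supplied by the user, so it has to be compiled when it is used rather than shipped in a prebuilt library.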