Developer Documentation
=======================

Code Organization
-----------------

This section describes the basic code organization. Currently, the repo is
essentially flat. All implementations are directly under the ``cudf/``
directory. All tests are in the ``cudf/tests/`` directory.

Here's a quick map to decide which file contains which feature:

- ``DataFrame``:

  - ``dataframe.py``

- ``Series``:

  - ``series.py``

- ``Column`` and its subclasses:

  - ``column.py``
  - ``columnops.py``
  - ``numerical.py`` for numeric columns
  - ``categorical.py`` for categorical columns

- ``Buffer``:

  - ``buffer.py``

- ``.apply()`` and similar functions:

  - ``applyutils.py``

- ``.query()`` and similar functions:

  - ``queryutils.py``

- GPU helper functions:

  - ``cudautils.py``

- Docstring helpers:

  - ``docutils.py``

- Output formatting:

  - ``formatting.py``

- Arrow:

  - ``gpuarrow.py``

- Groupby:

  - ``groupby.py``

- Dask serialization helpers:

  - ``serialize.py``

- ``Index``:

  - ``index.py``

- Operations on multiple DataFrames, Series, or Indices:

  - ``multi.py``

- Other general helper functions:

  - ``utils.py``

Code that should move to libgdf
-------------------------------

The following code should be re-implemented in libgdf in CUDA-C for better
reusability and performance:

- ``cudf/cudautils.py`` contains many GPU helper functions that are jitted by
  numba with ``@cuda.jit`` into CUDA kernels. All CUDA kernels in this file
  should be moved to libgdf if possible.
- Some logic in ``cudf/groupby.py`` should be moved to libgdf to make groupby
  operations faster. Some groupby aggregations are implemented with
  ``@cuda.jit`` here.

Code that cannot move to libgdf
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Some features require the jit to be useful, e.g. features that use
user-defined functions. These features cannot be moved to libgdf.
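
The distinction can be illustrated with a minimal, CPU-only sketch. It is not cudf or libgdf code: ``PRECOMPILED_OPS``, ``run_precompiled`` and ``run_udf`` are hypothetical names invented for this example, and plain Python stands in for both the precompiled library and the jit. The point it tries to show is that a precompiled library can only offer operations chosen ahead of time, while a user-defined function is only known at call time.

```python
# Hypothetical illustration only; none of these names exist in cudf or libgdf.

# Stand-in for a precompiled library: the set of operations is fixed
# when the library is built.
PRECOMPILED_OPS = {
    "sum": sum,
    "min": min,
    "max": max,
}


def run_precompiled(name, values):
    """Dispatch to one of the operations that were built ahead of time."""
    try:
        op = PRECOMPILED_OPS[name]
    except KeyError:
        raise NotImplementedError(
            "%r was not compiled into the library" % name
        )
    return op(values)


def run_udf(func, values):
    """Apply an arbitrary user-supplied function to every element.

    In a GPU setting this is the step where a jit would compile ``func``
    at runtime; here it is simply called from Python.
    """
    return [func(v) for v in values]


if __name__ == "__main__":
    data = [1, 2, 3]
    print(run_precompiled("sum", data))
    # The lambda below is only known at call time, so it cannot be part of
    # a library that was compiled beforehand.
    print(run_udf(lambda x: x * x + 1, data))
```

A fixed aggregation such as ``sum`` fits the precompiled model, whereas the lambda passed to ``run_udf`` has no counterpart in ``PRECOMPILED_OPS`` and needs to be handled at runtime.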