{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "10 Minutes to cuDF and Dask-cuDF\n", "=======================\n", "\n", "Modeled after 10 Minutes to Pandas, this is a short introduction to cuDF and Dask-cuDF, geared mainly for new users.\n", "\n", "### What are these Libraries?\n", "\n", "[cuDF](https://github.com/rapidsai/cudf) is a Python GPU DataFrame library (built on the Apache Arrow columnar memory format) for loading, joining, aggregating, filtering, and otherwise manipulating data.\n", "\n", "[Dask](https://dask.org/) is a flexible library for parallel computing in Python that makes scaling out your workflow smooth and simple.\n", "\n", "[Dask-cuDF](https://github.com/rapidsai/dask-cudf) is a library that provides a partitioned, GPU-backed dataframe, using Dask.\n", "\n", "\n", "### When to use cuDF and Dask-cuDF\n", "\n", "If your workflow is fast enough on a single GPU or your data comfortably fits in memory on a single GPU, you would want to use cuDF. If you want to distribute your workflow across multiple GPUs, have more data than you can fit in memory on a single GPU, or want to analyze data spread across many files at once, you would want to use Dask-cuDF." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "import numpy as np\n", "import pandas as pd\n", "import cudf\n", "import dask_cudf\n", "\n", "np.random.seed(12)\n", "\n", "#### Portions of this were borrowed and adapted from the\n", "#### cuDF cheatsheet, existing cuDF documentation,\n", "#### and 10 Minutes to Pandas." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Object Creation\n", "---------------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Creating a `cudf.Series` and `dask_cudf.Series`." 
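Since cuDF follows the pandas API closely, the constructs in this notebook can be prototyped on CPU with plain pandas and then moved to the GPU by changing the import. A minimal pandas-only sketch of the Series created next (no GPU assumed; `pd` stands in for `cudf`):

```python
import pandas as pd

# cudf.Series accepts the same arguments; only the module name differs.
s = pd.Series([1, 2, 3, None, 4])

# None is stored as a missing value, mirroring the null in the cuDF output.
missing = int(s.isna().sum())
```

On a machine with a GPU, replacing `pd` with `cudf` produces the equivalent GPU-backed Series.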
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 2\n", "2 3\n", "3 null\n", "4 4\n", "dtype: int64" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s = cudf.Series([1,2,3,None,4])\n", "s" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 2\n", "2 3\n", "3 null\n", "4 4\n", "dtype: int64" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ds = dask_cudf.from_cudf(s, npartitions=2) \n", "ds.compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Creating a `cudf.DataFrame` and a `dask_cudf.DataFrame` by specifying values for each column." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " a b c\n", "0 0 19 0\n", "1 1 18 1\n", "2 2 17 2\n", "3 3 16 3\n", "4 4 15 4\n", "5 5 14 5\n", "6 6 13 6\n", "7 7 12 7\n", "8 8 11 8\n", "9 9 10 9\n", "10 10 9 10\n", "11 11 8 11\n", "12 12 7 12\n", "13 13 6 13\n", "14 14 5 14\n", "15 15 4 15\n", "16 16 3 16\n", "17 17 2 17\n", "18 18 1 18\n", "19 19 0 19" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = cudf.DataFrame({'a': list(range(20)),\n", "                     'b': list(reversed(range(20))),\n", "                     'c': list(range(20))})\n", "df" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " a b c\n", "0 0 19 0\n", "1 1 18 1\n", "2 2 17 2\n", "3 3 16 3\n", "4 4 15 4\n", "5 5 14 5\n", "6 6 13 6\n", "7 7 12 7\n", "8 8 11 8\n", "9 9 10 9\n", "10 10 9 10\n", "11 11 8 11\n", "12 12 7 12\n", "13 13 6 13\n", "14 14 5 14\n", "15 15 4 15\n", "16 16 3 16\n", "17 17 2 17\n", "18 18 1 18\n", "19 19 0 19" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf = dask_cudf.from_cudf(df, npartitions=2) \n", "ddf.compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Creating a `cudf.DataFrame` from a pandas `Dataframe` and a `dask_cudf.Dataframe` from a `cudf.Dataframe`.\n", "\n", "*Note that best practice for using Dask-cuDF is to read data directly into a `dask_cudf.DataFrame` with something like `read_csv` (discussed below).*" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " a b\n", "0 0 0.1\n", "1 1 0.2\n", "2 2 null\n", "3 3 0.3" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pdf = pd.DataFrame({'a': [0, 1, 2, 3],'b': [0.1, 0.2, None, 0.3]})\n", "gdf = cudf.DataFrame.from_pandas(pdf)\n", "gdf" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " a b\n", "0 0 0.1\n", "1 1 0.2\n", "2 2 null\n", "3 3 0.3" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dask_gdf = dask_cudf.from_cudf(gdf, npartitions=2)\n", "dask_gdf.compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Viewing Data\n", "-------------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Viewing the top rows of a GPU dataframe." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " a b c\n", "0 0 19 0\n", "1 1 18 1" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head(2)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " a b c\n", "0 0 19 0\n", "1 1 18 1" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf.head(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sorting by values." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " a b c\n", "19 19 0 19\n", "18 18 1 18\n", "17 17 2 17\n", "16 16 3 16\n", "15 15 4 15\n", "14 14 5 14\n", "13 13 6 13\n", "12 12 7 12\n", "11 11 8 11\n", "10 10 9 10\n", "9 9 10 9\n", "8 8 11 8\n", "7 7 12 7\n", "6 6 13 6\n", "5 5 14 5\n", "4 4 15 4\n", "3 3 16 3\n", "2 2 17 2\n", "1 1 18 1\n", "0 0 19 0" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.sort_values(by='b')" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " a b c\n", "0 19 0 19\n", "1 18 1 18\n", "2 17 2 17\n", "3 16 3 16\n", "4 15 4 15\n", "5 14 5 14\n", "6 13 6 13\n", "7 12 7 12\n", "8 11 8 11\n", "9 10 9 10\n", "10 9 10 9\n", "11 8 11 8\n", "12 7 12 7\n", "13 6 13 6\n", "14 5 14 5\n", "15 4 15 4\n", "16 3 16 3\n", "17 2 17 2\n", "18 1 18 1\n", "19 0 19 0" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf.sort_values(by='b').compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Selection\n", "------------\n", "\n", "## Getting" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Selecting a single column, which initially yields a `cudf.Series` or `dask_cudf.Series`. Calling `compute` results in a `cudf.Series` (equivalent to `df.a`)." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0\n", "1 1\n", "2 2\n", "3 3\n", "4 4\n", "5 5\n", "6 6\n", "7 7\n", "8 8\n", "9 9\n", "10 10\n", "11 11\n", "12 12\n", "13 13\n", "14 14\n", "15 15\n", "16 16\n", "17 17\n", "18 18\n", "19 19\n", "Name: a, dtype: int64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['a']" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0\n", "1 1\n", "2 2\n", "3 3\n", "4 4\n", "5 5\n", "6 6\n", "7 7\n", "8 8\n", "9 9\n", "10 10\n", "11 11\n", "12 12\n", "13 13\n", "14 14\n", "15 15\n", "16 16\n", "17 17\n", "18 18\n", "19 19\n", "Name: a, dtype: int64" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf['a'].compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Selection by Label" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Selecting rows from index 2 to index 5 from columns 'a' and 'b'." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " a b\n", "2 2 17\n", "3 3 16\n", "4 4 15\n", "5 5 14" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[2:5, ['a', 'b']]" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " a b\n", "2 2 17\n", "3 3 16\n", "4 4 15\n", "5 5 14" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf.loc[2:5, ['a', 'b']].compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Selection by Position" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Selecting via integers and integer slices, like numpy/pandas. Note that this functionality is not available for Dask-cuDF DataFrames." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "a 0\n", "b 19\n", "c 0\n", "Name: 0, dtype: int64" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[0]" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " a b\n", "0 0 19\n", "1 1 18\n", "2 2 17" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[0:3, 0:2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also select elements of a `DataFrame` or `Series` with direct index access." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " a b c\n", "3 3 16 3\n", "4 4 15 4" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[3:5]" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3 null\n", "4 4\n", "dtype: int64" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s[3:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Boolean Indexing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Selecting rows in a `DataFrame` or `Series` by direct Boolean indexing." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " a b c\n", "0 0 19 0\n", "1 1 18 1\n", "2 2 17 2\n", "3 3 16 3" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df.b > 15]" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " a b c\n", "0 0 19 0\n", "1 1 18 1\n", "2 2 17 2\n", "3 3 16 3" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf[ddf.b > 15].compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Selecting values from a `DataFrame` where a Boolean condition is met, via the `query` API." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " a b c\n", "16 16 3 16" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.query(\"b == 3\")" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " a b c\n", "16 16 3 16" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf.query(\"b == 3\").compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also pass local variables to Dask-cuDF queries, via the `local_dict` keyword. With standard cuDF, you may either use the `local_dict` keyword or directly pass the variable via the `@` keyword. Supported logical operators include `>`, `<`, `>=`, `<=`, `==`, and `!=`." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " a b c\n", "16 16 3 16" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cudf_comparator = 3\n", "df.query(\"b == @cudf_comparator\")" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
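pandas supports the same `@`-variable syntax in `query`, so the pattern above can be checked without a GPU. A small pandas sketch (the `cutoff` variable name is illustrative, not from the original):

```python
import pandas as pd

df = pd.DataFrame({'a': [16, 17], 'b': [3, 2], 'c': [16, 17]})
cutoff = 3

# @cutoff pulls the Python variable into the query expression,
# exactly as in the cuDF cell above.
result = df.query("b == @cutoff")
```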
" ], "text/plain": [ " a b c\n", "16 16 3 16" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dask_cudf_comparator = 3\n", "ddf.query(\"b == @val\", local_dict={'val':dask_cudf_comparator}).compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the `isin` method for filtering." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " a b c\n", "0 0 19 0\n", "5 5 14 5" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df.a.isin([0, 5])]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## MultiIndex" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "cuDF supports hierarchical indexing of DataFrames using MultiIndex. Grouping hierarchically (see `Grouping` below) automatically produces a DataFrame with a MultiIndex." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "MultiIndex(levels=[0 a\n", "1 b\n", "dtype: object, 0 1\n", "1 2\n", "2 3\n", "3 4\n", "dtype: int64],\n", "codes= 0 1\n", "0 0 0\n", "1 0 1\n", "2 1 2\n", "3 1 3)" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "arrays = [['a', 'a', 'b', 'b'],\n", " [1, 2, 3, 4]]\n", "tuples = list(zip(*arrays))\n", "idx = cudf.MultiIndex.from_tuples(tuples)\n", "idx" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This index can back either axis of a DataFrame." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " first second\n", "a 1 0.154163 0.014575\n", " 2 0.740050 0.918747\n", "b 3 0.263315 0.900715\n", " 4 0.533739 0.033421" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gdf1 = cudf.DataFrame({'first': np.random.rand(4), 'second': np.random.rand(4)})\n", "gdf1.index = idx\n", "gdf1" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " a b \n", " 1 2 3 4\n", "first 0.956949 0.137209 0.283828 0.606083\n", "second 0.944225 0.852736 0.002259 0.521226" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gdf2 = cudf.DataFrame({'first': np.random.rand(4), 'second': np.random.rand(4)}).T\n", "gdf2.columns = idx\n", "gdf2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Accessing values of a DataFrame with a MultiIndex. Note that slicing is not yet supported." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
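For reference, the tuple-based access demonstrated next matches pandas MultiIndex semantics. A pandas sketch that also shows the partial-key ("slicing") access that cuDF did not yet support:

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 3), ('b', 4)])
gdf1 = pd.DataFrame({'first': np.random.rand(4),
                     'second': np.random.rand(4)}, index=idx)

full = gdf1.loc[('b', 3)]   # full-tuple access, as in the cuDF cell
partial = gdf1.loc['a']     # partial-key slice; pandas-only at the time
```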
" ], "text/plain": [ " first second\n", "b 3 0.263315 0.900715" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gdf1.loc[('b', 3)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Missing Data\n", "------------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Missing data can be replaced by using the `fillna` method." ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 2\n", "2 3\n", "3 999\n", "4 4\n", "dtype: int64" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s.fillna(999)" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 2\n", "2 3\n", "3 999\n", "4 4\n", "dtype: int64" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ds.fillna(999).compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Operations\n", "------------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Stats" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Calculating descriptive statistics for a `Series`." ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(2.5, 1.666666666666666)" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s.mean(), s.var()" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(2.5, 1.6666666666666667)" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ds.mean().compute(), ds.var().compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Applymap" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Applying functions to a `Series`. 
Note that applying user-defined functions directly with Dask-cuDF is not yet implemented. For now, you can use [map_partitions](http://docs.dask.org/en/stable/dataframe-api.html#dask.dataframe.DataFrame.map_partitions) to apply a function to each partition of the distributed dataframe." ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 10\n", "1 11\n", "2 12\n", "3 13\n", "4 14\n", "5 15\n", "6 16\n", "7 17\n", "8 18\n", "9 19\n", "10 20\n", "11 21\n", "12 22\n", "13 23\n", "14 24\n", "15 25\n", "16 26\n", "17 27\n", "18 28\n", "19 29\n", "Name: a, dtype: int64" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def add_ten(num):\n", "    return num + 10\n", "\n", "df['a'].applymap(add_ten)" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 10\n", "1 11\n", "2 12\n", "3 13\n", "4 14\n", "5 15\n", "6 16\n", "7 17\n", "8 18\n", "9 19\n", "10 20\n", "11 21\n", "12 22\n", "13 23\n", "14 24\n", "15 25\n", "16 26\n", "17 27\n", "18 28\n", "19 29\n", "Name: a, dtype: int64" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf['a'].map_partitions(add_ten).compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Histogramming" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Counting the number of occurrences of each unique value of a variable."
] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 1\n", "2 1\n", "3 1\n", "4 1\n", "5 1\n", "6 1\n", "7 1\n", "8 1\n", "9 1\n", "10 1\n", "11 1\n", "12 1\n", "13 1\n", "14 1\n", "15 1\n", "16 1\n", "17 1\n", "18 1\n", "19 1\n", "Name: a, dtype: int32" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.a.value_counts()" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 1\n", "2 1\n", "3 1\n", "4 1\n", "5 1\n", "6 1\n", "7 1\n", "8 1\n", "9 1\n", "10 1\n", "11 1\n", "12 1\n", "13 1\n", "14 1\n", "15 1\n", "16 1\n", "17 1\n", "18 1\n", "19 1\n", "Name: a, dtype: int64" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf.a.value_counts().compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## String Methods" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Like pandas, cuDF provides string processing methods in the `str` attribute of `Series`. Full documentation of string methods is a work in progress. Please see the cuDF API documentation for more information." 
] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 a\n", "1 b\n", "2 c\n", "3 aaba\n", "4 baca\n", "5 None\n", "6 caba\n", "7 dog\n", "8 cat\n", "dtype: object" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s = cudf.Series(['A', 'B', 'C', 'Aaba', 'Baca', None, 'CABA', 'dog', 'cat'])\n", "s.str.lower()" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 a\n", "1 b\n", "2 c\n", "3 aaba\n", "4 baca\n", "5 None\n", "6 caba\n", "7 dog\n", "8 cat\n", "dtype: object" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ds = dask_cudf.from_cudf(s, npartitions=2)\n", "ds.str.lower().compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Concat" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Concatenating `Series` and `DataFrames` row-wise." ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 2\n", "2 3\n", "3 null\n", "4 5\n", "0 1\n", "1 2\n", "2 3\n", "3 null\n", "4 5\n", "dtype: int64" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s = cudf.Series([1, 2, 3, None, 5])\n", "cudf.concat([s, s])" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 2\n", "2 3\n", "3 null\n", "4 5\n", "0 1\n", "1 2\n", "2 3\n", "3 null\n", "4 5\n", "dtype: int64" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ds2 = dask_cudf.from_cudf(s, npartitions=2)\n", "dask_cudf.concat([ds2, ds2]).compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Join" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Performing SQL style merges. 
Note that the dataframe order is not maintained, but may be restored post-merge by sorting by the index." ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " key vals_a vals_b\n", "0 a 10.0 100.0\n", "1 c 12.0 101.0\n", "2 e 14.0 102.0\n", "3 b 11.0 null\n", "4 d 13.0 null" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_a = cudf.DataFrame()\n", "df_a['key'] = ['a', 'b', 'c', 'd', 'e']\n", "df_a['vals_a'] = [float(i + 10) for i in range(5)]\n", "\n", "df_b = cudf.DataFrame()\n", "df_b['key'] = ['a', 'c', 'e']\n", "df_b['vals_b'] = [float(i+100) for i in range(3)]\n", "\n", "merged = df_a.merge(df_b, on=['key'], how='left')\n", "merged" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
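The row-order caveat above can be made concrete in pandas, whose merge semantics cuDF mirrors; sorting after the merge restores a deterministic order (a sketch, not the original cell):

```python
import pandas as pd

df_a = pd.DataFrame({'key': ['a', 'b', 'c', 'd', 'e'],
                     'vals_a': [float(i + 10) for i in range(5)]})
df_b = pd.DataFrame({'key': ['a', 'c', 'e'],
                     'vals_b': [float(i + 100) for i in range(3)]})

merged = df_a.merge(df_b, on=['key'], how='left')

# Whatever order the engine returned, sorting by the join key
# (or by the index) yields a stable, reproducible view.
merged = merged.sort_values('key').reset_index(drop=True)
```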
" ], "text/plain": [ " key vals_a vals_b\n", "0 a 10.0 100.0\n", "1 c 12.0 101.0\n", "2 b 11.0 null\n", "0 e 14.0 102.0\n", "1 d 13.0 null" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf_a = dask_cudf.from_cudf(df_a, npartitions=2)\n", "ddf_b = dask_cudf.from_cudf(df_b, npartitions=2)\n", "\n", "merged = ddf_a.merge(ddf_b, on=['key'], how='left').compute()\n", "merged" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Append" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Appending values from another `Series` or array-like object." ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 2\n", "2 3\n", "3 null\n", "4 5\n", "0 1\n", "1 2\n", "2 3\n", "3 null\n", "4 5\n", "dtype: int64" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s.append(s)" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 2\n", "2 3\n", "3 null\n", "4 5\n", "0 1\n", "1 2\n", "2 3\n", "3 null\n", "4 5\n", "dtype: int64" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ds2.append(ds2).compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Grouping" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Like pandas, cuDF and Dask-cuDF support the Split-Apply-Combine groupby paradigm." ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [], "source": [ "df['agg_col1'] = [1 if x % 2 == 0 else 0 for x in range(len(df))]\n", "df['agg_col2'] = [1 if x % 3 == 0 else 0 for x in range(len(df))]\n", "\n", "ddf = dask_cudf.from_cudf(df, npartitions=2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Grouping and then applying the `sum` function to the grouped data." 
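The split-apply-combine steps behind `groupby('agg_col1').sum()` can be illustrated with a small pure-Python sketch (toy rows standing in for the notebook's dataframe; this is an illustration of the paradigm, not the cuDF implementation):

```python
# Pure-Python sketch of split-apply-combine for a sum aggregation.
# Hypothetical toy rows; 'agg_col1' marks even rows, as in the cell above.
rows = [{'a': x, 'agg_col1': 1 if x % 2 == 0 else 0} for x in range(6)]

# Split: bucket rows by the grouping key.
groups = {}
for row in rows:
    groups.setdefault(row['agg_col1'], []).append(row)

# Apply + combine: sum column 'a' within each bucket, collect per-key results.
result = {key: sum(r['a'] for r in grp) for key, grp in groups.items()}
print(result)  # {1: 6, 0: 9}
```

cuDF and Dask-cuDF carry out the same split, apply, and combine stages on the GPU, as the cells below show.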
] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abcagg_col2
agg_col1
0100901003
190100904
\n", "
" ], "text/plain": [ " a b c agg_col2\n", "agg_col1 \n", "0 100 90 100 3\n", "1 90 100 90 4" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby('agg_col1').sum()" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abcagg_col2
agg_col1
0100901003
190100904
\n", "
" ], "text/plain": [ " a b c agg_col2\n", "agg_col1 \n", "0 100 90 100 3\n", "1 90 100 90 4" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf.groupby('agg_col1').sum().compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Grouping hierarchically then applying the `sum` function to grouped data." ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abc
agg_col1agg_col2
00736073
1273027
10546054
1364036
\n", "
" ], "text/plain": [ " a b c\n", "agg_col1 agg_col2 \n", "0 0 73 60 73\n", " 1 27 30 27\n", "1 0 54 60 54\n", " 1 36 40 36" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby(['agg_col1', 'agg_col2']).sum()" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abc
agg_col1agg_col2
11364036
00736073
10546054
01273027
\n", "
" ], "text/plain": [ " a b c\n", "agg_col1 agg_col2 \n", "1 1 36 40 36\n", "0 0 73 60 73\n", "1 0 54 60 54\n", "0 1 27 30 27" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf.groupby(['agg_col1', 'agg_col2']).sum().compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Grouping and applying statistical functions to specific columns, using `agg`." ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abc
agg_col1
0199.0100
11810.090
\n", "
" ], "text/plain": [ " a b c\n", "agg_col1 \n", "0 19 9.0 100\n", "1 18 10.0 90" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby('agg_col1').agg({'a':'max', 'b':'mean', 'c':'sum'})" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abc
agg_col1
0199.0100
11810.090
\n", "
" ], "text/plain": [ " a b c\n", "agg_col1 \n", "0 19 9.0 100\n", "1 18 10.0 90" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf.groupby('agg_col1').agg({'a':'max', 'b':'mean', 'c':'sum'}).compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Transpose" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Transposing a dataframe, using either the `transpose` method or `T` property. Currently, all columns must have the same type. Transposing is not currently implemented in Dask-cuDF." ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ab
014
125
236
\n", "
" ], "text/plain": [ " a b\n", "0 1 4\n", "1 2 5\n", "2 3 6" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample = cudf.DataFrame({'a':[1,2,3], 'b':[4,5,6]})\n", "sample" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
a123
b456
\n", "
" ], "text/plain": [ " 0 1 2\n", "a 1 2 3\n", "b 4 5 6" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample.transpose()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Time Series\n", "------------\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`DataFrames` supports `datetime` typed columns, which allow users to interact with and filter data based on specific timestamps." ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
datevalue
02018-11-200.552038
12018-11-210.485377
22018-11-220.768134
32018-11-230.160717
\n", "
" ], "text/plain": [ " date value\n", "0 2018-11-20 0.552038\n", "1 2018-11-21 0.485377\n", "2 2018-11-22 0.768134\n", "3 2018-11-23 0.160717" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import datetime as dt\n", "\n", "date_df = cudf.DataFrame()\n", "date_df['date'] = pd.date_range('11/20/2018', periods=72, freq='D')\n", "date_df['value'] = np.random.sample(len(date_df))\n", "\n", "search_date = dt.datetime.strptime('2018-11-23', '%Y-%m-%d')\n", "date_df.query('date <= @search_date')" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
datevalue
02018-11-200.552038
12018-11-210.485377
22018-11-220.768134
32018-11-230.160717
\n", "
" ], "text/plain": [ " date value\n", "0 2018-11-20 0.552038\n", "1 2018-11-21 0.485377\n", "2 2018-11-22 0.768134\n", "3 2018-11-23 0.160717" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "date_ddf = dask_cudf.from_cudf(date_df, npartitions=2)\n", "date_ddf.query('date <= @search_date', local_dict={'search_date':search_date}).compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Categoricals\n", "------------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`DataFrames` support categorical columns." ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idgrade
01a
12b
23b
34a
45a
56e
\n", "
" ], "text/plain": [ " id grade\n", "0 1 a\n", "1 2 b\n", "2 3 b\n", "3 4 a\n", "4 5 a\n", "5 6 e" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gdf = cudf.DataFrame({\"id\":[1,2,3,4,5,6], \"grade\":['a', 'b', 'b', 'a', 'a', 'e']})\n", "gdf['grade'] = gdf['grade'].astype('category')\n", "gdf" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idgrade
01a
12b
23b
34a
45a
56e
\n", "
" ], "text/plain": [ " id grade\n", "0 1 a\n", "1 2 b\n", "2 3 b\n", "3 4 a\n", "4 5 a\n", "5 6 e" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dgdf = dask_cudf.from_cudf(gdf, npartitions=2)\n", "dgdf.compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Accessing the categories of a column. Note that this is currently not supported in Dask-cuDF." ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "StringIndex(['a' 'b' 'e'], dtype='object')" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gdf.grade.cat.categories" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Accessing the underlying code values of each categorical observation." ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0\n", "1 1\n", "2 1\n", "3 0\n", "4 0\n", "5 2\n", "Name: grade, dtype: int32" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gdf.grade.cat.codes" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0\n", "1 1\n", "2 1\n", "0 0\n", "1 0\n", "2 2\n", "Name: grade, dtype: int32" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dgdf.grade.cat.codes.compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Converting Data Representation\n", "--------------------------------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pandas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Converting a cuDF and Dask-cuDF `DataFrame` to a pandas `DataFrame`." ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abcagg_col1agg_col2
0019011
1118100
2217210
3316301
4415410
\n", "
" ], "text/plain": [ " a b c agg_col1 agg_col2\n", "0 0 19 0 1 1\n", "1 1 18 1 0 0\n", "2 2 17 2 1 0\n", "3 3 16 3 0 1\n", "4 4 15 4 1 0" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head().to_pandas()" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abcagg_col1agg_col2
0019011
1118100
2217210
3316301
4415410
\n", "
" ], "text/plain": [ " a b c agg_col1 agg_col2\n", "0 0 19 0 1 1\n", "1 1 18 1 0 0\n", "2 2 17 2 1 0\n", "3 3 16 3 0 1\n", "4 4 15 4 1 0" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf.compute().head().to_pandas()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Numpy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Converting a cuDF or Dask-cuDF `DataFrame` to a numpy `ndarray`." ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 0, 19, 0, 1, 1],\n", " [ 1, 18, 1, 0, 0],\n", " [ 2, 17, 2, 1, 0],\n", " [ 3, 16, 3, 0, 1],\n", " [ 4, 15, 4, 1, 0],\n", " [ 5, 14, 5, 0, 0],\n", " [ 6, 13, 6, 1, 1],\n", " [ 7, 12, 7, 0, 0],\n", " [ 8, 11, 8, 1, 0],\n", " [ 9, 10, 9, 0, 1],\n", " [10, 9, 10, 1, 0],\n", " [11, 8, 11, 0, 0],\n", " [12, 7, 12, 1, 1],\n", " [13, 6, 13, 0, 0],\n", " [14, 5, 14, 1, 0],\n", " [15, 4, 15, 0, 1],\n", " [16, 3, 16, 1, 0],\n", " [17, 2, 17, 0, 0],\n", " [18, 1, 18, 1, 1],\n", " [19, 0, 19, 0, 0]])" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.as_matrix()" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 0, 19, 0, 1, 1],\n", " [ 1, 18, 1, 0, 0],\n", " [ 2, 17, 2, 1, 0],\n", " [ 3, 16, 3, 0, 1],\n", " [ 4, 15, 4, 1, 0],\n", " [ 5, 14, 5, 0, 0],\n", " [ 6, 13, 6, 1, 1],\n", " [ 7, 12, 7, 0, 0],\n", " [ 8, 11, 8, 1, 0],\n", " [ 9, 10, 9, 0, 1],\n", " [10, 9, 10, 1, 0],\n", " [11, 8, 11, 0, 0],\n", " [12, 7, 12, 1, 1],\n", " [13, 6, 13, 0, 0],\n", " [14, 5, 14, 1, 0],\n", " [15, 4, 15, 0, 1],\n", " [16, 3, 16, 1, 0],\n", " [17, 2, 17, 0, 0],\n", " [18, 1, 18, 1, 1],\n", " [19, 0, 19, 0, 0]])" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf.compute().as_matrix()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Converting a cuDF 
or Dask-cuDF `Series` to a numpy `ndarray`." ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,\n", " 17, 18, 19])" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['a'].to_array()" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,\n", " 17, 18, 19])" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf['a'].compute().to_array()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Arrow" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Converting a cuDF or Dask-cuDF `DataFrame` to a PyArrow `Table`." ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pyarrow.Table\n", "a: int64\n", "b: int64\n", "c: int64\n", "agg_col1: int64\n", "agg_col2: int64\n", "metadata\n", "--------\n", "{b'pandas': b'{\"index_columns\": [{\"kind\": \"range\", \"name\": null, \"start\": 0, \"'\n", " b'stop\": 20, \"step\": 1}], \"column_indexes\": [{\"name\": null, \"field'\n", " b'_name\": null, \"pandas_type\": \"unicode\", \"numpy_type\": \"object\", '\n", " b'\"metadata\": {\"encoding\": \"UTF-8\"}}], \"columns\": [{\"name\": \"a\", \"'\n", " b'field_name\": \"a\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\",'\n", " b' \"metadata\": null}, {\"name\": \"b\", \"field_name\": \"b\", \"pandas_typ'\n", " b'e\": \"int64\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\": '\n", " b'\"c\", \"field_name\": \"c\", \"pandas_type\": \"int64\", \"numpy_type\": \"i'\n", " b'nt64\", \"metadata\": null}, {\"name\": \"agg_col1\", \"field_name\": \"ag'\n", " b'g_col1\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", \"metadat'\n", " b'a\": null}, {\"name\": 
\"agg_col2\", \"field_name\": \"agg_col2\", \"panda'\n", " b's_type\": \"int64\", \"numpy_type\": \"int64\", \"metadata\": null}], \"cr'\n", " b'eator\": {\"library\": \"pyarrow\", \"version\": \"0.14.1\"}, \"pandas_ver'\n", " b'sion\": \"0.24.2\"}'}" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.to_arrow()" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pyarrow.Table\n", "a: int64\n", "b: int64\n", "c: int64\n", "agg_col1: int64\n", "agg_col2: int64\n", "__index_level_0__: int64\n", "metadata\n", "--------\n", "{b'pandas': b'{\"index_columns\": [\"__index_level_0__\"], \"column_indexes\": [{\"na'\n", " b'me\": null, \"field_name\": null, \"pandas_type\": \"unicode\", \"numpy_'\n", " b'type\": \"object\", \"metadata\": {\"encoding\": \"UTF-8\"}}], \"columns\":'\n", " b' [{\"name\": \"a\", \"field_name\": \"a\", \"pandas_type\": \"int64\", \"nump'\n", " b'y_type\": \"int64\", \"metadata\": null}, {\"name\": \"b\", \"field_name\":'\n", " b' \"b\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", \"metadata\":'\n", " b' null}, {\"name\": \"c\", \"field_name\": \"c\", \"pandas_type\": \"int64\",'\n", " b' \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"agg_col1\", '\n", " b'\"field_name\": \"agg_col1\", \"pandas_type\": \"int64\", \"numpy_type\": '\n", " b'\"int64\", \"metadata\": null}, {\"name\": \"agg_col2\", \"field_name\": \"'\n", " b'agg_col2\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", \"metad'\n", " b'ata\": null}, {\"name\": null, \"field_name\": \"__index_level_0__\", \"'\n", " b'pandas_type\": \"int64\", \"numpy_type\": \"int64\", \"metadata\": null}]'\n", " b', \"creator\": {\"library\": \"pyarrow\", \"version\": \"0.14.1\"}, \"panda'\n", " b's_version\": \"0.24.2\"}'}" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf.compute().to_arrow()" ] }, { "cell_type": 
"markdown", "metadata": {}, "source": [ "Getting Data In/Out\n", "------------------------\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## CSV" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Writing to a CSV file." ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [], "source": [ "if not os.path.exists('example_output'):\n", " os.mkdir('example_output')\n", " \n", "df.to_csv('example_output/foo.csv', index=False)" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [], "source": [ "ddf.compute().to_csv('example_output/foo_dask.csv', index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Reading from a csv file." ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abcagg_col1agg_col2
0019011
1118100
2217210
3316301
4415410
5514500
6613611
7712700
8811810
9910901
101091010
111181100
121271211
131361300
141451410
151541501
161631610
171721700
181811811
191901900
\n", "
" ], "text/plain": [ " a b c agg_col1 agg_col2\n", "0 0 19 0 1 1\n", "1 1 18 1 0 0\n", "2 2 17 2 1 0\n", "3 3 16 3 0 1\n", "4 4 15 4 1 0\n", "5 5 14 5 0 0\n", "6 6 13 6 1 1\n", "7 7 12 7 0 0\n", "8 8 11 8 1 0\n", "9 9 10 9 0 1\n", "10 10 9 10 1 0\n", "11 11 8 11 0 0\n", "12 12 7 12 1 1\n", "13 13 6 13 0 0\n", "14 14 5 14 1 0\n", "15 15 4 15 0 1\n", "16 16 3 16 1 0\n", "17 17 2 17 0 0\n", "18 18 1 18 1 1\n", "19 19 0 19 0 0" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = cudf.read_csv('example_output/foo.csv')\n", "df" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abcagg_col1agg_col2
0019011
1118100
2217210
3316301
4415410
5514500
6613611
7712700
8811810
9910901
101091010
111181100
121271211
131361300
141451410
151541501
161631610
171721700
181811811
191901900
\n", "
" ], "text/plain": [ " a b c agg_col1 agg_col2\n", "0 0 19 0 1 1\n", "1 1 18 1 0 0\n", "2 2 17 2 1 0\n", "3 3 16 3 0 1\n", "4 4 15 4 1 0\n", "5 5 14 5 0 0\n", "6 6 13 6 1 1\n", "7 7 12 7 0 0\n", "8 8 11 8 1 0\n", "9 9 10 9 0 1\n", "10 10 9 10 1 0\n", "11 11 8 11 0 0\n", "12 12 7 12 1 1\n", "13 13 6 13 0 0\n", "14 14 5 14 1 0\n", "15 15 4 15 0 1\n", "16 16 3 16 1 0\n", "17 17 2 17 0 0\n", "18 18 1 18 1 1\n", "19 19 0 19 0 0" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf = dask_cudf.read_csv('example_output/foo_dask.csv')\n", "ddf.compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Reading all CSV files in a directory into a single `dask_cudf.DataFrame`, using the star wildcard." ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abcagg_col1agg_col2
0019011
1118100
2217210
3316301
4415410
5514500
6613611
7712700
8811810
9910901
101091010
111181100
121271211
131361300
141451410
151541501
161631610
171721700
181811811
191901900
0019011
1118100
2217210
3316301
4415410
5514500
6613611
7712700
8811810
9910901
101091010
111181100
121271211
131361300
141451410
151541501
161631610
171721700
181811811
191901900
\n", "
" ], "text/plain": [ " a b c agg_col1 agg_col2\n", "0 0 19 0 1 1\n", "1 1 18 1 0 0\n", "2 2 17 2 1 0\n", "3 3 16 3 0 1\n", "4 4 15 4 1 0\n", "5 5 14 5 0 0\n", "6 6 13 6 1 1\n", "7 7 12 7 0 0\n", "8 8 11 8 1 0\n", "9 9 10 9 0 1\n", "10 10 9 10 1 0\n", "11 11 8 11 0 0\n", "12 12 7 12 1 1\n", "13 13 6 13 0 0\n", "14 14 5 14 1 0\n", "15 15 4 15 0 1\n", "16 16 3 16 1 0\n", "17 17 2 17 0 0\n", "18 18 1 18 1 1\n", "19 19 0 19 0 0\n", "0 0 19 0 1 1\n", "1 1 18 1 0 0\n", "2 2 17 2 1 0\n", "3 3 16 3 0 1\n", "4 4 15 4 1 0\n", "5 5 14 5 0 0\n", "6 6 13 6 1 1\n", "7 7 12 7 0 0\n", "8 8 11 8 1 0\n", "9 9 10 9 0 1\n", "10 10 9 10 1 0\n", "11 11 8 11 0 0\n", "12 12 7 12 1 1\n", "13 13 6 13 0 0\n", "14 14 5 14 1 0\n", "15 15 4 15 0 1\n", "16 16 3 16 1 0\n", "17 17 2 17 0 0\n", "18 18 1 18 1 1\n", "19 19 0 19 0 0" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf = dask_cudf.read_csv('example_output/*.csv')\n", "ddf.compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Parquet" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Writing to parquet files, using the CPU via PyArrow." ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/opt/conda/envs/rapids/lib/python3.7/site-packages/cudf/io/parquet.py:68: UserWarning: Using CPU via PyArrow to write Parquet dataset, this will be GPU accelerated in the future\n", " \"Using CPU via PyArrow to write Parquet dataset, this will \"\n" ] } ], "source": [ "df.to_parquet('example_output/temp_parquet')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Reading parquet files with a GPU-accelerated parquet reader." ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abcagg_col1agg_col2
0019011
1118100
2217210
3316301
4415410
5514500
6613611
7712700
8811810
9910901
101091010
111181100
121271211
131361300
141451410
151541501
161631610
171721700
181811811
191901900
\n", "
" ], "text/plain": [ " a b c agg_col1 agg_col2\n", "0 0 19 0 1 1\n", "1 1 18 1 0 0\n", "2 2 17 2 1 0\n", "3 3 16 3 0 1\n", "4 4 15 4 1 0\n", "5 5 14 5 0 0\n", "6 6 13 6 1 1\n", "7 7 12 7 0 0\n", "8 8 11 8 1 0\n", "9 9 10 9 0 1\n", "10 10 9 10 1 0\n", "11 11 8 11 0 0\n", "12 12 7 12 1 1\n", "13 13 6 13 0 0\n", "14 14 5 14 1 0\n", "15 15 4 15 0 1\n", "16 16 3 16 1 0\n", "17 17 2 17 0 0\n", "18 18 1 18 1 1\n", "19 19 0 19 0 0" ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = cudf.read_parquet('example_output/temp_parquet/81389eafdff74e20bcff94c134f6f162.parquet')\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Writing to parquet files from a `dask_cudf.DataFrame` using PyArrow under the hood." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "ddf.to_parquet('example_files') " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## ORC" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Reading ORC files." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df2 = cudf.read_orc('/rapids/cudf/python/cudf/cudf/tests/data/orc/TestOrcFile.test1.orc')\n", "df2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Dask Performance Tips\n", "--------------------------------\n", "\n", "Like Apache Spark, Dask operations are [lazy](https://en.wikipedia.org/wiki/Lazy_evaluation). Instead of being executed at that moment, most operations are added to a task graph and the actual evaluation is delayed until the result is needed.\n", "\n", "Sometimes, though, we want to force the execution of operations. Calling `persist` on a Dask collection fully computes it (or actively computes it in the background), persisting the result into memory. When we're using distributed systems, we may want to wait until `persist` is finished before beginning any downstream operations. 
We can enforce this by calling `wait` on a persisted collection: `wait` blocks until all of the underlying tasks have finished, so downstream operations only begin once their upstream results are ready.\n", "\n", "The snippets below provide basic examples, using `LocalCUDACluster` to create one dask-worker per GPU on the local machine. For more detailed information about `persist` and `wait`, please see the Dask documentation for [persist](https://docs.dask.org/en/latest/api.html#dask.persist) and [wait](https://docs.dask.org/en/latest/futures.html#distributed.wait). `wait` relies on the concept of Futures, which is beyond the scope of this tutorial; for more information on Futures, see the Dask [Futures](https://docs.dask.org/en/latest/futures.html) documentation. For more information about multi-GPU clusters, please see the [dask-cuda](https://github.com/rapidsai/dask-cuda) library (documentation is in progress)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we set up a GPU cluster. With our `client` set up, Dask-cuDF computation will be distributed across the GPUs in the cluster." ] }, { "cell_type": "code", "execution_count": 79, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "
\n", "

Client

\n", "\n", "
\n", "

Cluster

\n", "
    \n", "
  • Workers: 2
  • \n", "
  • Cores: 2
  • \n", "
  • Memory: 404.27 GB
  • \n", "
\n", "
" ], "text/plain": [ "" ] }, "execution_count": 79, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import time\n", "\n", "from dask.distributed import Client, wait\n", "from dask_cuda import LocalCUDACluster\n", "\n", "cluster = LocalCUDACluster()\n", "client = Client(cluster)\n", "client" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Persisting Data\n", "Next, we create our Dask-cuDF DataFrame and apply a transformation, storing the result as a new column." ] }, { "cell_type": "code", "execution_count": 80, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Dask DataFrame Structure:
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abc
npartitions=5
0int64int64int64
2000000.........
............
8000000.........
9999999.........
\n", "
\n", "
Dask Name: assign, 20 tasks
" ], "text/plain": [ "" ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nrows = 10000000\n", "\n", "df2 = cudf.DataFrame({'a':np.arange(nrows), 'b':np.arange(nrows)})\n", "ddf2 = dask_cudf.from_cudf(df2, npartitions=5)\n", "ddf2['c'] = ddf2['a'] + 5\n", "ddf2" ] }, { "cell_type": "code", "execution_count": 81, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Tue Aug 20 18:50:06 2019 \n", "+-----------------------------------------------------------------------------+\n", "| NVIDIA-SMI 410.104 Driver Version: 410.104 CUDA Version: 10.0 |\n", "|-------------------------------+----------------------+----------------------+\n", "| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |\n", "| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |\n", "|===============================+======================+======================|\n", "| 0 Tesla T4 On | 00000000:AF:00.0 Off | 0 |\n", "| N/A 51C P0 29W / 70W | 803MiB / 15079MiB | 0% Default |\n", "+-------------------------------+----------------------+----------------------+\n", "| 1 Tesla T4 On | 00000000:D8:00.0 Off | 0 |\n", "| N/A 46C P0 27W / 70W | 129MiB / 15079MiB | 0% Default |\n", "+-------------------------------+----------------------+----------------------+\n", " \n", "+-----------------------------------------------------------------------------+\n", "| Processes: GPU Memory |\n", "| GPU PID Type Process name Usage |\n", "|=============================================================================|\n", "+-----------------------------------------------------------------------------+\n" ] } ], "source": [ "!nvidia-smi" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because Dask is lazy, the computation has not yet occurred. We can see that there are twenty tasks in the task graph and we've used about 800 MB of memory. We can force computation by using `persist`. 
Once execution is forced, the result is held explicitly in memory and the task graph contains only one task per partition (the materialized data itself)." ] }, { "cell_type": "code", "execution_count": 82, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Dask DataFrame Structure:
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abc
npartitions=5
0int64int64int64
2000000.........
............
8000000.........
9999999.........
\n", "
\n", "
Dask Name: assign, 5 tasks
" ], "text/plain": [ "" ] }, "execution_count": 82, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf2 = ddf2.persist()\n", "ddf2" ] }, { "cell_type": "code", "execution_count": 83, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Tue Aug 20 18:50:29 2019 \n", "+-----------------------------------------------------------------------------+\n", "| NVIDIA-SMI 410.104 Driver Version: 410.104 CUDA Version: 10.0 |\n", "|-------------------------------+----------------------+----------------------+\n", "| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |\n", "| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |\n", "|===============================+======================+======================|\n", "| 0 Tesla T4 On | 00000000:AF:00.0 Off | 0 |\n", "| N/A 51C P0 29W / 70W | 1125MiB / 15079MiB | 2% Default |\n", "+-------------------------------+----------------------+----------------------+\n", "| 1 Tesla T4 On | 00000000:D8:00.0 Off | 0 |\n", "| N/A 46C P0 27W / 70W | 563MiB / 15079MiB | 1% Default |\n", "+-------------------------------+----------------------+----------------------+\n", " \n", "+-----------------------------------------------------------------------------+\n", "| Processes: GPU Memory |\n", "| GPU PID Type Process name Usage |\n", "|=============================================================================|\n", "+-----------------------------------------------------------------------------+\n" ] } ], "source": [ "!nvidia-smi" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because we forced computation, we now have a larger object in distributed GPU memory." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Wait\n", "Depending on our workflow or distributed computing setup, we may want to `wait` until all upstream tasks have finished before proceeding with a specific function. 
This section shows an example of this behavior, adapted from the Dask documentation.\n", "\n", "First, we create a new Dask DataFrame and define a function that we'll map to every partition in the dataframe." ] }, { "cell_type": "code", "execution_count": 87, "metadata": {}, "outputs": [], "source": [ "nrows = 10000000\n", "\n", "df1 = cudf.DataFrame({'a':np.arange(nrows), 'b':np.arange(nrows)})\n", "ddf1 = dask_cudf.from_cudf(df1, npartitions=5)\n", "\n", "def func(df):\n", " time.sleep(np.random.randint(1, 60))\n", " return (df + 5) * 3 - 11" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This function applies a basic transformation to every column in the dataframe, but the time spent in the function will vary because the `time.sleep` statement randomly adds between 1 and 59 seconds of delay. We'll run this on every partition of our dataframe using `map_partitions`, which adds these tasks to our task graph, and store the result. We can then call `persist` to force execution." ] }, { "cell_type": "code", "execution_count": 88, "metadata": {}, "outputs": [], "source": [ "results_ddf = ddf1.map_partitions(func)\n", "results_ddf = results_ddf.persist()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, some partitions will be done **much** sooner than others. If we have downstream processes that should wait for all partitions to be completed, we can enforce that behavior using `wait`." ] }, { "cell_type": "code", "execution_count": 89, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DoneAndNotDoneFutures(done={, , , , }, not_done=set())" ] }, "execution_count": 89, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wait(results_ddf)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With `wait` complete, we can safely proceed in our workflow."
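The persist-then-wait pattern shown here mirrors Python's standard `concurrent.futures` API, after which Dask's futures interface is modeled. A minimal CPU-only sketch of the same idea (no GPUs or Dask required; the toy `func`, sleep times, and partition count below are illustrative, not taken from the workflow above):

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor, wait

def func(x):
    # Simulate uneven per-partition runtimes, as in the Dask example
    time.sleep(random.uniform(0.01, 0.05))
    return (x + 5) * 3 - 11

with ThreadPoolExecutor(max_workers=4) as pool:
    # "persist": kick off work on every partition in the background
    futures = [pool.submit(func, part) for part in range(10)]
    # "wait": block until every upstream task has finished
    done, not_done = wait(futures)
    results = sorted(f.result() for f in futures)

print(results)  # → [4, 7, 10, 13, 16, 19, 22, 25, 28, 31]
```

As with Dask, the futures start running as soon as they are submitted; `wait` simply guarantees that nothing downstream observes a partially finished result.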
] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 4 }