{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "10 Minutes to cuDF and Dask-cuDF\n",
    "=======================\n",
    "\n",
    "Modeled after 10 Minutes to Pandas, this is a short introduction to cuDF and Dask-cuDF, geared mainly for new users.\n",
    "\n",
    "### What are these Libraries?\n",
    "\n",
    "[cuDF](https://github.com/rapidsai/cudf) is a Python GPU DataFrame library (built on the Apache Arrow columnar memory format) for loading, joining, aggregating, filtering, and otherwise manipulating data.\n",
    "\n",
    "[Dask](https://dask.org/) is a flexible library for parallel computing in Python that makes scaling out your workflow smooth and simple.\n",
    "\n",
    "[Dask-cuDF](https://github.com/rapidsai/dask-cudf) is a library that provides a partitioned, GPU-backed dataframe, using Dask.\n",
    "\n",
    "\n",
    "### When to use cuDF and Dask-cuDF\n",
    "\n",
    "If your workflow is fast enough on a single GPU or your data comfortably fits in memory on a single GPU, you would want to use cuDF. If you want to distribute your workflow across multiple GPUs, have more data than you can fit in memory on a single GPU, or want to analyze data spread across many files at once, you would want to use Dask-cuDF."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import cudf\n",
    "import dask_cudf\n",
    "\n",
    "np.random.seed(12)\n",
    "\n",
    "#### Portions of this were borrowed and adapted from the\n",
    "#### cuDF cheatsheet, existing cuDF documentation,\n",
    "#### and 10 Minutes to Pandas."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Object Creation\n",
    "---------------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Creating a `cudf.Series` and `dask_cudf.Series`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0    1\n",
      "1    2\n",
      "2    3\n",
      "3     \n",
      "4    4\n",
      "dtype: int64\n"
     ]
    }
   ],
   "source": [
    "s = cudf.Series([1,2,3,None,4])\n",
    "print(s)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0    1\n",
      "1    2\n",
      "2    3\n",
      "3     \n",
      "4    4\n",
      "dtype: int64\n"
     ]
    }
   ],
   "source": [
    "ds = dask_cudf.from_cudf(s, npartitions=2) \n",
    "print(ds.compute())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Creating a `cudf.DataFrame` and a `dask_cudf.DataFrame` by specifying values for each column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   a   b  c\n",
      "0  0  19  0\n",
      "1  1  18  1\n",
      "2  2  17  2\n",
      "3  3  16  3\n",
      "4  4  15  4\n",
      "5  5  14  5\n",
      "6  6  13  6\n",
      "7  7  12  7\n",
      "8  8  11  8\n",
      "9  9  10  9\n",
      "[10 more rows]\n"
     ]
    }
   ],
   "source": [
    "df = cudf.DataFrame([('a', list(range(20))),\n",
    "('b', list(reversed(range(20)))),\n",
    "('c', list(range(20)))])\n",
    "print(df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   a   b  c\n",
      "0  0  19  0\n",
      "1  1  18  1\n",
      "2  2  17  2\n",
      "3  3  16  3\n",
      "4  4  15  4\n",
      "5  5  14  5\n",
      "6  6  13  6\n",
      "7  7  12  7\n",
      "8  8  11  8\n",
      "9  9  10  9\n",
      "[10 more rows]\n"
     ]
    }
   ],
   "source": [
    "ddf = dask_cudf.from_cudf(df, npartitions=2) \n",
    "print(ddf.compute())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Creating a `cudf.DataFrame` from a pandas `Dataframe` and a `dask_cudf.Dataframe` from a `cudf.Dataframe`.\n",
    "\n",
    "*Note that best practice for using Dask-cuDF is to read data directly into a `dask_cudf.DataFrame` with something like `read_csv` (discussed below).*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   a    b\n",
      "0  0  0.1\n",
      "1  1  0.2\n",
      "2  2     \n",
      "3  3  0.3\n"
     ]
    }
   ],
   "source": [
    "pdf = pd.DataFrame({'a': [0, 1, 2, 3],'b': [0.1, 0.2, None, 0.3]})\n",
    "gdf = cudf.DataFrame.from_pandas(pdf)\n",
    "print(gdf)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   a    b\n",
      "0  0  0.1\n",
      "1  1  0.2\n",
      "2  2     \n",
      "3  3  0.3\n"
     ]
    }
   ],
   "source": [
    "dask_df = dask_cudf.from_cudf(pdf, npartitions=2)\n",
    "dask_gdf = dask_cudf.from_dask_dataframe(dask_df)\n",
    "print(dask_gdf.compute())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Viewing Data\n",
    "-------------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Viewing the top rows of a GPU dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   a   b  c\n",
      "0  0  19  0\n",
      "1  1  18  1\n"
     ]
    }
   ],
   "source": [
    "print(df.head(2))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   a   b  c\n",
      "0  0  19  0\n",
      "1  1  18  1\n"
     ]
    }
   ],
   "source": [
    "print(ddf.head(2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Sorting by values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    a  b   c\n",
      "19  19  0  19\n",
      "18  18  1  18\n",
      "17  17  2  17\n",
      "16  16  3  16\n",
      "15  15  4  15\n",
      "14  14  5  14\n",
      "13  13  6  13\n",
      "12  12  7  12\n",
      "11  11  8  11\n",
      "10  10  9  10\n",
      "[10 more rows]\n"
     ]
    }
   ],
   "source": [
    "print(df.sort_values(by='b'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    a  b   c\n",
      "0  19  0  19\n",
      "1  18  1  18\n",
      "2  17  2  17\n",
      "3  16  3  16\n",
      "4  15  4  15\n",
      "5  14  5  14\n",
      "6  13  6  13\n",
      "7  12  7  12\n",
      "8  11  8  11\n",
      "9  10  9  10\n",
      "[10 more rows]\n"
     ]
    }
   ],
   "source": [
    "print(ddf.sort_values(by='b').compute())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Selection\n",
    "------------\n",
    "\n",
    "## Getting"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Selecting a single column, which initially yields a `cudf.Series` or `dask_cudf.Series`. Calling `compute` results in a `cudf.Series` (equivalent to `df.a`)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0    0\n",
      "1    1\n",
      "2    2\n",
      "3    3\n",
      "4    4\n",
      "5    5\n",
      "6    6\n",
      "7    7\n",
      "8    8\n",
      "9    9\n",
      "[10 more rows]\n",
      "Name: a, dtype: int64\n"
     ]
    }
   ],
   "source": [
    "print(df['a'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0    0\n",
      "1    1\n",
      "2    2\n",
      "3    3\n",
      "4    4\n",
      "5    5\n",
      "6    6\n",
      "7    7\n",
      "8    8\n",
      "9    9\n",
      "[10 more rows]\n",
      "Name: a, dtype: int64\n"
     ]
    }
   ],
   "source": [
    "print(ddf['a'].compute())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Selection by Label"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Selecting rows from index 2 to index 5 from columns 'a' and 'b'."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   a   b\n",
      "2  2  17\n",
      "3  3  16\n",
      "4  4  15\n",
      "5  5  14\n"
     ]
    }
   ],
   "source": [
    "print(df.loc[2:5, ['a', 'b']])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   a   b\n",
      "2  2  17\n",
      "3  3  16\n",
      "4  4  15\n",
      "5  5  14\n"
     ]
    }
   ],
   "source": [
    "print(ddf.loc[2:5, ['a', 'b']].compute())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Selection by Position"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Selecting via integers and integer slices, like numpy/pandas. Note that this functionality is not available for Dask-cuDF DataFrames."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "a     0\n",
      "b    19\n",
      "c     0\n",
      "Name: 0, dtype: int64\n"
     ]
    }
   ],
   "source": [
    "print(df.iloc[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   a   b\n",
      "0  0  19\n",
      "1  1  18\n",
      "2  2  17\n"
     ]
    }
   ],
   "source": [
    "print(df.iloc[0:3, 0:2])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also select elements of a `DataFrame` or `Series` with direct index access."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   a   b  c\n",
      "3  3  16  3\n",
      "4  4  15  4\n"
     ]
    }
   ],
   "source": [
    "print(df[3:5])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "3     \n",
      "4    4\n",
      "dtype: int64\n"
     ]
    }
   ],
   "source": [
    "print(s[3:5])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Boolean Indexing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Selecting rows in a `DataFrame` or `Series` by direct Boolean indexing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   a   b  c\n",
      "0  0  19  0\n",
      "1  1  18  1\n",
      "2  2  17  2\n",
      "3  3  16  3\n"
     ]
    }
   ],
   "source": [
    "print(df[df.b > 15])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   a   b  c\n",
      "0  0  19  0\n",
      "1  1  18  1\n",
      "2  2  17  2\n",
      "3  3  16  3\n"
     ]
    }
   ],
   "source": [
    "print(ddf[ddf.b > 15].compute())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Selecting values from a `DataFrame` where a Boolean condition is met, via the `query` API."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    a  b   c\n",
      "16  16  3  16\n"
     ]
    }
   ],
   "source": [
    "print(df.query(\"b == 3\"))  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    a  b   c\n",
      "16  16  3  16\n"
     ]
    }
   ],
   "source": [
    "print(ddf.query(\"b == 3\").compute())  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also pass local variables to Dask-cuDF queries, via the `local_dict` keyword. With standard cuDF, you may either use the `local_dict` keyword or directly pass the variable via the `@` keyword."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    a  b   c\n",
      "16  16  3  16\n"
     ]
    }
   ],
   "source": [
    "cudf_comparator = 3\n",
    "print(df.query(\"b == @cudf_comparator\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    a  b   c\n",
      "16  16  3  16\n"
     ]
    }
   ],
   "source": [
    "dask_cudf_comparator = 3\n",
    "print(ddf.query(\"b == @val\", local_dict={'val':dask_cudf_comparator}).compute())  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Supported logical operators include `>`, `<`, `>=`, `<=`, `==`, and `!=`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## MultiIndex"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "cuDF supports hierarchical indexing of DataFrames using MultiIndex. Grouping hierarchically (see `Grouping` below) automatically produces a DataFrame with a MultiIndex."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "MultiIndex(levels=[array(['a', 'b'], dtype=object) array([1, 2, 3, 4])],\n",
       "codes=   0  1\n",
       "0  0  0\n",
       "1  0  1\n",
       "2  1  2\n",
       "3  1  3)"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "arrays = [['a', 'a', 'b', 'b'],\n",
    "          [1, 2, 3, 4]]\n",
    "tuples = list(zip(*arrays))\n",
    "idx = cudf.MultiIndex.from_tuples(tuples)\n",
    "idx"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This index can back either axis of a DataFrame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "        first    second\n",
      "a 1  0.154163  0.014575\n",
      "  2  0.740050  0.918747\n",
      "b 3  0.263315  0.900715\n",
      "  4  0.533739  0.033421\n"
     ]
    }
   ],
   "source": [
    "gdf1 = cudf.DataFrame({'first': np.random.rand(4), 'second': np.random.rand(4)})\n",
    "gdf1.index = idx\n",
    "print(gdf1.to_pandas())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "               a                   b          \n",
      "               1         2         3         4\n",
      "first   0.956949  0.137209  0.283828  0.606083\n",
      "second  0.944225  0.852736  0.002259  0.521226\n"
     ]
    }
   ],
   "source": [
    "gdf2 = cudf.DataFrame({'first': np.random.rand(4), 'second': np.random.rand(4)}).T\n",
    "gdf2.columns = idx\n",
    "print(gdf2.to_pandas())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Accessing values of a DataFrame with a MultiIndex. Note that slicing is not yet supported."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "        first    second\n",
      "b 3  0.263315  0.900715\n"
     ]
    }
   ],
   "source": [
    "print(gdf1.loc[('b', 3)].to_pandas())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Missing Data\n",
    "------------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Missing data can be replaced by using the `fillna` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0      1\n",
      "1      2\n",
      "2      3\n",
      "3    999\n",
      "4      4\n",
      "dtype: int64\n"
     ]
    }
   ],
   "source": [
    "print(s.fillna(999))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0      1\n",
      "1      2\n",
      "2      3\n",
      "3    999\n",
      "4      4\n",
      "dtype: int64\n"
     ]
    }
   ],
   "source": [
    "print(ds.fillna(999).compute())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Operations\n",
    "------------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Stats"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Calculating descriptive statistics for a `Series`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2.5 1.666666666666666\n"
     ]
    }
   ],
   "source": [
    "print(s.mean(), s.var())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2.5 1.6666666666666667\n"
     ]
    }
   ],
   "source": [
    "print(ds.mean().compute(), ds.var().compute())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Applymap"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Applying functions to a `Series`. Note that applying user defined functions directly with Dask-cuDF is not yet implemented. For now, you can use [map_partitions](http://docs.dask.org/en/stable/dataframe-api.html#dask.dataframe.DataFrame.map_partitions) to apply a function to each partition of the distributed dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0    10\n",
      "1    11\n",
      "2    12\n",
      "3    13\n",
      "4    14\n",
      "5    15\n",
      "6    16\n",
      "7    17\n",
      "8    18\n",
      "9    19\n",
      "[10 more rows]\n",
      "Name: a, dtype: int64\n"
     ]
    }
   ],
   "source": [
    "def add_ten(num):\n",
    "    return num + 10\n",
    "\n",
    "print(df['a'].applymap(add_ten))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0    10\n",
      "1    11\n",
      "2    12\n",
      "3    13\n",
      "4    14\n",
      "5    15\n",
      "6    16\n",
      "7    17\n",
      "8    18\n",
      "9    19\n",
      "[10 more rows]\n",
      "dtype: int64\n"
     ]
    }
   ],
   "source": [
    "print(ddf['a'].map_partitions(add_ten).compute())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Histogramming"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Counting the number of occurrences of each unique value of variable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0    1\n",
      "1    1\n",
      "2    1\n",
      "3    1\n",
      "4    1\n",
      "5    1\n",
      "6    1\n",
      "7    1\n",
      "8    1\n",
      "9    1\n",
      "[10 more rows]\n",
      "dtype: int64\n"
     ]
    }
   ],
   "source": [
    "print(df.a.value_counts())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0    1\n",
      "1    1\n",
      "2    1\n",
      "3    1\n",
      "4    1\n",
      "5    1\n",
      "6    1\n",
      "7    1\n",
      "8    1\n",
      "9    1\n",
      "[10 more rows]\n",
      "dtype: int64\n"
     ]
    }
   ],
   "source": [
    "print(ddf.a.value_counts().compute())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## String Methods"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Like pandas, cuDF provides string processing methods in the `str` attribute of `Series`. Full documentation of string methods is a work in progress. Please see the cuDF API documentation for more information."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0       a\n",
      "1       b\n",
      "2       c\n",
      "3    aaba\n",
      "4    baca\n",
      "5    None\n",
      "6    caba\n",
      "7     dog\n",
      "8     cat\n",
      "dtype: object\n"
     ]
    }
   ],
   "source": [
    "s = cudf.Series(['A', 'B', 'C', 'Aaba', 'Baca', None, 'CABA', 'dog', 'cat'])\n",
    "print(s.str.lower())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0       a\n",
      "1       b\n",
      "2       c\n",
      "3    aaba\n",
      "4    baca\n",
      "5    None\n",
      "6    caba\n",
      "7     dog\n",
      "8     cat\n",
      "dtype: object\n"
     ]
    }
   ],
   "source": [
    "ds = dask_cudf.from_cudf(s, npartitions=2)\n",
    "print(ds.str.lower().compute())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Concat"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Concatenating `Series` and `DataFrames` row-wise."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0    1\n",
      "1    2\n",
      "2    3\n",
      "3     \n",
      "4    5\n",
      "0    1\n",
      "1    2\n",
      "2    3\n",
      "3     \n",
      "4    5\n",
      "dtype: int64\n"
     ]
    }
   ],
   "source": [
    "s = cudf.Series([1, 2, 3, None, 5])\n",
    "print(cudf.concat([s, s]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0    1\n",
      "1    2\n",
      "2    3\n",
      "3     \n",
      "4    5\n",
      "0    1\n",
      "1    2\n",
      "2    3\n",
      "3     \n",
      "4    5\n",
      "dtype: int64\n"
     ]
    }
   ],
   "source": [
    "ds2 = dask_cudf.from_cudf(s, npartitions=2)\n",
    "print(dask_cudf.concat([ds2, ds2]).compute())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Join"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Performing SQL style merges. Note that the dataframe order is not maintained, but may be restored post-merge by sorting by the index."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   key  vals_a  vals_b\n",
      "0    a    10.0   100.0\n",
      "1    c    12.0   101.0\n",
      "2    e    14.0   102.0\n",
      "3    b    11.0        \n",
      "4    d    13.0        \n"
     ]
    }
   ],
   "source": [
    "df_a = cudf.DataFrame()\n",
    "df_a['key'] = ['a', 'b', 'c', 'd', 'e']\n",
    "df_a['vals_a'] = [float(i + 10) for i in range(5)]\n",
    "\n",
    "df_b = cudf.DataFrame()\n",
    "df_b['key'] = ['a', 'c', 'e']\n",
    "df_b['vals_b'] = [float(i+100) for i in range(3)]\n",
    "\n",
    "merged = df_a.merge(df_b, on=['key'], how='left')\n",
    "print(merged)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   key  vals_a  vals_b\n",
      "0    a    10.0   100.0\n",
      "1    c    12.0   101.0\n",
      "2    b    11.0        \n",
      "0    e    14.0   102.0\n",
      "1    d    13.0        \n"
     ]
    }
   ],
   "source": [
    "ddf_a = dask_cudf.from_cudf(df_a, npartitions=2)\n",
    "ddf_b = dask_cudf.from_cudf(df_b, npartitions=2)\n",
    "\n",
    "merged = ddf_a.merge(ddf_b, on=['key'], how='left').compute()\n",
    "print(merged)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Append"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Appending values from another `Series` or array-like object."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0    1\n",
      "1    2\n",
      "2    3\n",
      "3     \n",
      "4    5\n",
      "0    1\n",
      "1    2\n",
      "2    3\n",
      "3     \n",
      "4    5\n",
      "dtype: int64\n"
     ]
    }
   ],
   "source": [
    "print(s.append(s))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0    1\n",
      "1    2\n",
      "2    3\n",
      "3     \n",
      "4    5\n",
      "0    1\n",
      "1    2\n",
      "2    3\n",
      "3     \n",
      "4    5\n",
      "dtype: int64\n"
     ]
    }
   ],
   "source": [
    "print(ds2.append(ds2).compute())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Grouping"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Like pandas, cuDF and Dask-cuDF support the Split-Apply-Combine groupby paradigm."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['agg_col1'] = [1 if x % 2 == 0 else 0 for x in range(len(df))]\n",
    "df['agg_col2'] = [1 if x % 3 == 0 else 0 for x in range(len(df))]\n",
    "\n",
    "ddf = dask_cudf.from_cudf(df, npartitions=2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Grouping and then applying the `sum` function to the grouped data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "     a    b    c  agg_col2\n",
      "agg_col1\n",
      "0  100   90  100         3\n",
      "1   90  100   90         4\n"
     ]
    }
   ],
   "source": [
    "print(df.groupby('agg_col1').sum())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "     a    b    c  agg_col2\n",
      "0  100   90  100         3\n",
      "1   90  100   90         4\n"
     ]
    }
   ],
   "source": [
    "print(ddf.groupby('agg_col1').sum().compute())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Grouping hierarchically then applying the `sum` function to grouped data. We send the result to a pandas dataframe only for printing purposes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                    a   b   c\n",
      "agg_col1 agg_col2            \n",
      "0        0         73  60  73\n",
      "         1         27  30  27\n",
      "1        0         54  60  54\n",
      "         1         36  40  36\n"
     ]
    }
   ],
   "source": [
    "print(df.groupby(['agg_col1', 'agg_col2']).sum().to_pandas())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "
\n",
       "\n",
       "
\n",
       "  \n",
       "    \n",
       "      | \n",
       " | \n",
       " | a\n",
       " | b\n",
       " | c\n",
       " | 
\n",
       "    \n",
       "      | agg_col1\n",
       " | agg_col2\n",
       " | \n",
       " | \n",
       " | \n",
       " | 
\n",
       "  \n",
       "  \n",
       "    \n",
       "      | 0\n",
       " | 0\n",
       " | 73\n",
       " | 60\n",
       " | 73\n",
       " | 
\n",
       "    \n",
       "      | 1\n",
       " | 27\n",
       " | 30\n",
       " | 27\n",
       " | 
\n",
       "    \n",
       "      | 1\n",
       " | 0\n",
       " | 54\n",
       " | 60\n",
       " | 54\n",
       " | 
\n",
       "    \n",
       "      | 1\n",
       " | 36\n",
       " | 40\n",
       " | 36\n",
       " | 
\n",
       "  \n",
       "
\n",
       "
\n",
       "\n",
       "
\n",
       "  \n",
       "    \n",
       "      | \n",
       " | boolean1\n",
       " | byte1\n",
       " | short1\n",
       " | int1\n",
       " | long1\n",
       " | float1\n",
       " | double1\n",
       " | bytes1\n",
       " | string1\n",
       " | middle.list.int1\n",
       " | middle.list.string1\n",
       " | list.int1\n",
       " | list.string1\n",
       " | map\n",
       " | map.int1\n",
       " | map.string1\n",
       " | 
\n",
       "  \n",
       "  \n",
       "    \n",
       "      | 0\n",
       " | False\n",
       " | 1\n",
       " | 1024\n",
       " | 65536\n",
       " | 9223372036854775807\n",
       " | 1.0\n",
       " | -15.0\n",
       " | \u0000\u0001\u0002\u0003\u0004\n",
       " | hi\n",
       " | 3\n",
       " | bye\n",
       " | 4\n",
       " | \n",
       " | chani\n",
       " | 5\n",
       " | chani\n",
       " | 
\n",
       "    \n",
       "      | 1\n",
       " | True\n",
       " | 100\n",
       " | 2048\n",
       " | 65536\n",
       " | 9223372036854775807\n",
       " | 2.0\n",
       " | -5.0\n",
       " | \n",
       " | bye\n",
       " | 0\n",
       " | bye\n",
       " | 0\n",
       " | \n",
       " | mauddib\n",
       " | 1\n",
       " | mauddib\n",
       " | 
\n",
       "  \n",
       "
\n",
       "