{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "10 Minutes to cuDF\n",
    "=======================\n",
    "\n",
    "Modeled after 10 Minutes to Pandas, this is a short introduction to cuDF, geared mainly for new users."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import os\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import cudf\n",
    "np.random.seed(12)\n",
    "\n",
    "#### Portions of this were borrowed from the\n",
    "#### cuDF cheatsheet, existing cuDF documentation,\n",
    "#### and 10 Minutes to Pandas.\n",
    "#### Created November, 2018."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Object Creation\n",
    "---------------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Creating a `Series`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "      \n",
      "0    1\n",
      "1    2\n",
      "2    3\n",
      "3     \n",
      "4    4\n"
     ]
    }
   ],
   "source": [
    "s = cudf.Series([1,2,3,None,4])\n",
    "print(s)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Creating a `DataFrame` by specifying values for each column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "      a    b    c\n",
      " 0    0   19    0\n",
      " 1    1   18    1\n",
      " 2    2   17    2\n",
      " 3    3   16    3\n",
      " 4    4   15    4\n",
      " 5    5   14    5\n",
      " 6    6   13    6\n",
      " 7    7   12    7\n",
      " 8    8   11    8\n",
      " 9    9   10    9\n",
      "[10 more rows]\n"
     ]
    }
   ],
   "source": [
    "df = cudf.DataFrame([('a', list(range(20))),\n",
    "('b', list(reversed(range(20)))),\n",
    "('c', list(range(20)))])\n",
    "print(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Creating a `Dataframe` from a pandas `Dataframe`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "     a    b\n",
      "0    0  0.1\n",
      "1    1  0.2\n",
      "2    2     \n",
      "3    3  0.3\n"
     ]
    }
   ],
   "source": [
    "pdf = pd.DataFrame({'a': [0, 1, 2, 3],'b': [0.1, 0.2, None, 0.3]})\n",
    "gdf = cudf.DataFrame.from_pandas(pdf)\n",
    "print(gdf)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Viewing Data\n",
    "-------------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Viewing the top rows of the GPU dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "     a    b    c\n",
      "0    0   19    0\n",
      "1    1   18    1\n"
     ]
    }
   ],
   "source": [
    "print(df.head(2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Sorting by values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "      a    b    c\n",
      "19   19    0   19\n",
      "18   18    1   18\n",
      "17   17    2   17\n",
      "16   16    3   16\n",
      "15   15    4   15\n",
      "14   14    5   14\n",
      "13   13    6   13\n",
      "12   12    7   12\n",
      "11   11    8   11\n",
      "10   10    9   10\n",
      "[10 more rows]\n"
     ]
    }
   ],
   "source": [
    "print(df.sort_values(by='a', ascending=False))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Selection\n",
    "------------\n",
    "\n",
    "## Getting"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Selecting a single column, which yields a `cudf.Series`, equivalent to `df.a`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "       \n",
      " 0    0\n",
      " 1    1\n",
      " 2    2\n",
      " 3    3\n",
      " 4    4\n",
      " 5    5\n",
      " 6    6\n",
      " 7    7\n",
      " 8    8\n",
      " 9    9\n",
      "[10 more rows]\n"
     ]
    }
   ],
   "source": [
    "print(df['a'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Selection by Label"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Selecting rows from index 2 to index 5 from columns 'a' and 'b'."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "     a    b\n",
      "2    2   17\n",
      "3    3   16\n",
      "4    4   15\n",
      "5    5   14\n"
     ]
    }
   ],
   "source": [
    "print(df.loc[2:5, ['a', 'b']])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Selection by Position"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Selecting by integer slicing, like numpy/pandas."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "     a    b    c\n",
      "3    3   16    3\n",
      "4    4   15    4\n"
     ]
    }
   ],
   "source": [
    "print(df[3:5])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Selecting elements of a `Series` with direct index access."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "3\n"
     ]
    }
   ],
   "source": [
    "print(s[2])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Boolean Indexing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Selecting rows in a `Series` by direct Boolean indexing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "      \n",
      "0   19\n",
      "1   18\n",
      "2   17\n",
      "3   16\n"
     ]
    }
   ],
   "source": [
    "print(df.b[df.b > 15])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Selecting values from a `DataFrame` where a Boolean condition is met, via the `query` API."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "     a    b    c\n",
      "16   16    3   16\n"
     ]
    }
   ],
   "source": [
    "print(df.query(\"b == 3\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Supported logical operators include `>`, `<`, `>=`, `<=`, `==`, and `!=`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setting"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Missing Data\n",
    "------------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Missing data can be replaced by using the `fillna` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "      \n",
      "0    1\n",
      "1    2\n",
      "2    3\n",
      "3  999\n",
      "4    4\n"
     ]
    }
   ],
   "source": [
    "print(s.fillna(999))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Operations\n",
    "------------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Stats"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Calculating descriptive statistics for a `Series`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2.5 1.666666666666666\n"
     ]
    }
   ],
   "source": [
    "print(s.mean(), s.var())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Applymap"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Applying functions to a `Series`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "       \n",
      " 0   10\n",
      " 1   11\n",
      " 2   12\n",
      " 3   13\n",
      " 4   14\n",
      " 5   15\n",
      " 6   16\n",
      " 7   17\n",
      " 8   18\n",
      " 9   19\n",
      "[10 more rows]\n"
     ]
    }
   ],
   "source": [
    "def add_ten(num):\n",
    "    return num + 10\n",
    "\n",
    "print(df['a'].applymap(add_ten))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Histogramming"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Counting the number of occurrences of each unique value of variable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "       \n",
      " 0    1\n",
      " 1    1\n",
      " 2    1\n",
      " 3    1\n",
      " 4    1\n",
      " 5    1\n",
      " 6    1\n",
      " 7    1\n",
      " 8    1\n",
      " 9    1\n",
      "[10 more rows]\n"
     ]
    }
   ],
   "source": [
    "print(df.a.value_counts())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## String Methods"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Merge\n",
    "------------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Concat"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Concatenating `Series` and `DataFrames` row-wise."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(cudf.concat([s, s]))\n",
    "print(cudf.concat([df.head(), df.head()], ignore_index=True))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Join"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Performing SQL style merges."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1:float64\n",
      "2:int64\n",
      "3:float64\n",
      "   key vals_a vals_b\n",
      "3    0   10.0       \n",
      "0    1   11.0   10.0\n",
      "1    2   12.0   11.0\n",
      "4    3   13.0       \n",
      "2    4   14.0   12.0\n"
     ]
    }
   ],
   "source": [
    "df_a = cudf.DataFrame()\n",
    "df_a['key'] = [0, 1, 2, 3, 4]\n",
    "df_a['vals_a'] = [float(i + 10) for i in range(5)]\n",
    "\n",
    "df_b = cudf.DataFrame()\n",
    "df_b['key'] = [1, 2, 4]\n",
    "df_b['vals_b'] = [float(i+10) for i in range(3)]\n",
    "\n",
    "df_merged = df_a.merge(df_b, on=['key'], how='left')\n",
    "print(df_merged.sort_values('key'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Append"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Appending values from another `Series` or array-like object. `Append` does not support `Series` with nulls. For handling null values, use the `concat` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "       \n",
      " 0    0\n",
      " 1    1\n",
      " 2    2\n",
      " 3    3\n",
      " 4    4\n",
      " 5   19\n",
      " 6   18\n",
      " 7   17\n",
      " 8   16\n",
      " 9   15\n"
     ]
    }
   ],
   "source": [
    "print(df.a.head().append(df.b.head()))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Grouping"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Like pandas, cuDF supports the Split-Apply-Combine groupby paradigm."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "df['agg_col1'] = [1 if x % 2 == 0 else 0 for x in range(len(df))]\n",
    "df['agg_col2'] = [1 if x % 3 == 0 else 0 for x in range(len(df))]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Grouping and then applying the `sum` function to the grouped data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  agg_col1 sum_a sum_b sum_c sum_agg_col2\n",
      "0        0   100    90   100            3\n",
      "1        1    90   100    90            4\n"
     ]
    }
   ],
   "source": [
    "print(df.groupby('agg_col1').sum())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Grouping hierarchically then applying the `sum` function to grouped data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  agg_col1 agg_col2 sum_a sum_b sum_c\n",
      "0        0        0    73    60    73\n",
      "1        0        1    27    30    27\n",
      "2        1        0    54    60    54\n",
      "3        1        1    36    40    36\n"
     ]
    }
   ],
   "source": [
    "print(df.groupby(['agg_col1', 'agg_col2']).sum())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Grouping and applying statistical functions to specific columns, using `agg`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  agg_col1 mean_b sum_c max_a\n",
      "0        0      9   100    19\n",
      "1        1     10    90    18\n"
     ]
    }
   ],
   "source": [
    "print(df.groupby('agg_col1').agg({'a':'max', 'b':'mean', 'c':'sum'}))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Reshaping\n",
    "------------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Time Series\n",
    "------------\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "cuDF supports `datetime` typed columns, which allow users to interact with and filter data based on specific timestamps."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                     date               value\n",
      "0 2018-11-20T00:00:00.000 0.15416284237967237\n",
      "1 2018-11-21T00:00:00.000  0.7400496965154048\n",
      "2 2018-11-22T00:00:00.000 0.26331501518513467\n",
      "3 2018-11-23T00:00:00.000  0.5337393933802977\n"
     ]
    }
   ],
   "source": [
    "import datetime as dt\n",
    "\n",
    "date_df = cudf.DataFrame()\n",
    "date_df['date'] = pd.date_range('11/20/2018', periods=72, freq='D')\n",
    "date_df['value'] = np.random.sample(len(date_df))\n",
    "\n",
    "search_date = dt.datetime.strptime('2018-11-23', '%Y-%m-%d')\n",
    "print(date_df.query('date <= @search_date'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Categoricals\n",
    "------------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "cuDF supports categorical columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "collapsed": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  grade   id\n",
      "0     a    1\n",
      "1     b    2\n",
      "2     b    3\n",
      "3     a    4\n",
      "4     a    5\n",
      "5     e    6\n"
     ]
    }
   ],
   "source": [
    "pdf = pd.DataFrame({\"id\":[1,2,3,4,5,6], \"grade\":['a', 'b', 'b', 'a', 'a', 'e']})\n",
    "pdf[\"grade\"] = pdf[\"grade\"].astype(\"category\")\n",
    "\n",
    "gdf = cudf.DataFrame.from_pandas(pdf)\n",
    "print(gdf)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Accessing the categories of a column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "collapsed": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "('a', 'b', 'e')\n"
     ]
    }
   ],
   "source": [
    "print(gdf.grade.cat.categories)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Accessing the underlying code values of each categorical observation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "collapsed": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "      \n",
      "0    0\n",
      "1    1\n",
      "2    1\n",
      "3    0\n",
      "4    0\n",
      "5    2\n"
     ]
    }
   ],
   "source": [
    "print(gdf.grade.cat.codes)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Plotting\n",
    "------------\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Converting Data Representation\n",
    "--------------------------------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Pandas"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Converting a cuDF `DataFrame` to a pandas `DataFrame`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "collapsed": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   a   b  c  agg_col1  agg_col2\n",
      "0  0  19  0         1         1\n",
      "1  1  18  1         0         0\n",
      "2  2  17  2         1         0\n",
      "3  3  16  3         0         1\n",
      "4  4  15  4         1         0\n"
     ]
    }
   ],
   "source": [
    "print(df.head().to_pandas())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Numpy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Converting a cuDF `DataFrame` to a numpy `rec.array`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "collapsed": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[( 0,  0, 19,  0, 1, 1) ( 1,  1, 18,  1, 0, 0) ( 2,  2, 17,  2, 1, 0)\n",
      " ( 3,  3, 16,  3, 0, 1) ( 4,  4, 15,  4, 1, 0) ( 5,  5, 14,  5, 0, 0)\n",
      " ( 6,  6, 13,  6, 1, 1) ( 7,  7, 12,  7, 0, 0) ( 8,  8, 11,  8, 1, 0)\n",
      " ( 9,  9, 10,  9, 0, 1) (10, 10,  9, 10, 1, 0) (11, 11,  8, 11, 0, 0)\n",
      " (12, 12,  7, 12, 1, 1) (13, 13,  6, 13, 0, 0) (14, 14,  5, 14, 1, 0)\n",
      " (15, 15,  4, 15, 0, 1) (16, 16,  3, 16, 1, 0) (17, 17,  2, 17, 0, 0)\n",
      " (18, 18,  1, 18, 1, 1) (19, 19,  0, 19, 0, 0)]\n"
     ]
    }
   ],
   "source": [
    "print(df.to_records())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Converting a cuDF `Series` to a numpy `ndarray`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "collapsed": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]\n"
     ]
    }
   ],
   "source": [
    "print(df['a'].to_array())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Arrow"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Converting a cuDF `DataFrame` to a PyArrow `Table`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "collapsed": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "pyarrow.Table\n",
      "None: int64\n",
      "a: int64\n",
      "b: int64\n",
      "c: int64\n",
      "agg_col1: int64\n",
      "agg_col2: int64\n"
     ]
    }
   ],
   "source": [
    "print(df.to_arrow())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Getting Data In/Out\n",
    "------------------------\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## CSV"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Writing to a CSV file, by first sending data to a pandas `Dataframe` on the host."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "df.to_pandas().to_csv('foo.txt', index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Reading from a csv file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "collapsed": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "      a    b    c   a1   a2\n",
      " 0    0   19    0    1    1\n",
      " 1    1   18    1    0    0\n",
      " 2    2   17    2    1    0\n",
      " 3    3   16    3    0    1\n",
      " 4    4   15    4    1    0\n",
      " 5    5   14    5    0    0\n",
      " 6    6   13    6    1    1\n",
      " 7    7   12    7    0    0\n",
      " 8    8   11    8    1    0\n",
      " 9    9   10    9    0    1\n",
      "[10 more rows]\n"
     ]
    }
   ],
   "source": [
    "df = cudf.read_csv('foo.txt', delimiter=',',\n",
    "        names=['a', 'b', 'c', 'a1', 'a2'],\n",
    "        dtype=['int64', 'int64', 'int64', 'int64', 'int64'],\n",
    "        skiprows=1)\n",
    "print(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Parquet"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## ORC"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Gotchas\n",
    "--------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you are attempting to perform Boolean indexing directly or using the `query` API, you might see an exception like:\n",
    "\n",
    "```\n",
    "    ---------------------------------------------------------------------------\n",
    "    AssertionError                            Traceback (most recent call last)\n",
    "   ...\n",
    "        103     from .numerical import NumericalColumn\n",
    "    --> 104     assert column.null_count == 0  # We don't properly handle the boolmask yet\n",
    "        105     boolbits = cudautils.compact_mask_bytes(boolmask.to_gpu_array())\n",
    "        106     indices = cudautils.arange(len(boolmask))\n",
    "\n",
    "    AssertionError: \n",
    "\n",
    "```\n",
    "Boolean indexing a `Series` containing null values will cause this error. Consider filling or removing the missing values.\n"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}