{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "10 Minutes to cuDF\n", "=======================\n", "\n", "Modeled after 10 Minutes to Pandas, this is a short introduction to cuDF, geared mainly for new users." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import os\n", "import numpy as np\n", "import pandas as pd\n", "import cudf\n", "np.random.seed(12)\n", "\n", "#### Portions of this were borrowed from the\n", "#### cuDF cheatsheet, existing cuDF documentation,\n", "#### and 10 Minutes to Pandas.\n", "#### Created November, 2018." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Object Creation\n", "---------------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Creating a `Series`." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " \n", "0 1\n", "1 2\n", "2 3\n", "3 \n", "4 4\n" ] } ], "source": [ "s = cudf.Series([1,2,3,None,4])\n", "print(s)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Creating a `DataFrame` by specifying values for each column." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " a b c\n", " 0 0 19 0\n", " 1 1 18 1\n", " 2 2 17 2\n", " 3 3 16 3\n", " 4 4 15 4\n", " 5 5 14 5\n", " 6 6 13 6\n", " 7 7 12 7\n", " 8 8 11 8\n", " 9 9 10 9\n", "[10 more rows]\n" ] } ], "source": [ "df = cudf.DataFrame([('a', list(range(20))),\n", "('b', list(reversed(range(20)))),\n", "('c', list(range(20)))])\n", "print(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Creating a `Dataframe` from a pandas `Dataframe`. " ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " a b\n", "0 0 0.1\n", "1 1 0.2\n", "2 2 \n", "3 3 0.3\n" ] } ], "source": [ "pdf = pd.DataFrame({'a': [0, 1, 2, 3],'b': [0.1, 0.2, None, 0.3]})\n", "gdf = cudf.DataFrame.from_pandas(pdf)\n", "print(gdf)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Viewing Data\n", "-------------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Viewing the top rows of the GPU dataframe." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " a b c\n", "0 0 19 0\n", "1 1 18 1\n" ] } ], "source": [ "print(df.head(2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sorting by values." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " a b c\n", "19 19 0 19\n", "18 18 1 18\n", "17 17 2 17\n", "16 16 3 16\n", "15 15 4 15\n", "14 14 5 14\n", "13 13 6 13\n", "12 12 7 12\n", "11 11 8 11\n", "10 10 9 10\n", "[10 more rows]\n" ] } ], "source": [ "print(df.sort_values(by='a', ascending=False))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Selection\n", "------------\n", "\n", "## Getting" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Selecting a single column, which yields a `cudf.Series`, equivalent to `df.a`." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " \n", " 0 0\n", " 1 1\n", " 2 2\n", " 3 3\n", " 4 4\n", " 5 5\n", " 6 6\n", " 7 7\n", " 8 8\n", " 9 9\n", "[10 more rows]\n" ] } ], "source": [ "print(df['a'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Selection by Label" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Selecting rows from index 2 to index 5 from columns 'a' and 'b'." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " a b\n", "2 2 17\n", "3 3 16\n", "4 4 15\n", "5 5 14\n" ] } ], "source": [ "print(df.loc[2:5, ['a', 'b']])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Selection by Position" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Selecting by integer slicing, like numpy/pandas." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " a b c\n", "3 3 16 3\n", "4 4 15 4\n" ] } ], "source": [ "print(df[3:5])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Selecting elements of a `Series` with direct index access." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "3\n" ] } ], "source": [ "print(s[2])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Boolean Indexing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Selecting rows in a `Series` by direct Boolean indexing." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " \n", "0 19\n", "1 18\n", "2 17\n", "3 16\n" ] } ], "source": [ "print(df.b[df.b > 15])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Selecting values from a `DataFrame` where a Boolean condition is met, via the `query` API." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " a b c\n", "16 16 3 16\n" ] } ], "source": [ "print(df.query(\"b == 3\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Supported logical operators include `>`, `<`, `>=`, `<=`, `==`, and `!=`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setting" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Missing Data\n", "------------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Missing data can be replaced by using the `fillna` method." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " \n", "0 1\n", "1 2\n", "2 3\n", "3 999\n", "4 4\n" ] } ], "source": [ "print(s.fillna(999))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Operations\n", "------------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Stats" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Calculating descriptive statistics for a `Series`." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2.5 1.666666666666666\n" ] } ], "source": [ "print(s.mean(), s.var())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Applymap" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Applying functions to a `Series`." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " \n", " 0 10\n", " 1 11\n", " 2 12\n", " 3 13\n", " 4 14\n", " 5 15\n", " 6 16\n", " 7 17\n", " 8 18\n", " 9 19\n", "[10 more rows]\n" ] } ], "source": [ "def add_ten(num):\n", " return num + 10\n", "\n", "print(df['a'].applymap(add_ten))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Histogramming" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Counting the number of occurrences of each unique value of variable." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " \n", " 0 1\n", " 1 1\n", " 2 1\n", " 3 1\n", " 4 1\n", " 5 1\n", " 6 1\n", " 7 1\n", " 8 1\n", " 9 1\n", "[10 more rows]\n" ] } ], "source": [ "print(df.a.value_counts())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## String Methods" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Merge\n", "------------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Concat" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Concatenating `Series` and `DataFrames` row-wise." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(cudf.concat([s, s]))\n", "print(cudf.concat([df.head(), df.head()], ignore_index=True))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Join" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Performing SQL style merges." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1:float64\n", "2:int64\n", "3:float64\n", " key vals_a vals_b\n", "3 0 10.0 \n", "0 1 11.0 10.0\n", "1 2 12.0 11.0\n", "4 3 13.0 \n", "2 4 14.0 12.0\n" ] } ], "source": [ "df_a = cudf.DataFrame()\n", "df_a['key'] = [0, 1, 2, 3, 4]\n", "df_a['vals_a'] = [float(i + 10) for i in range(5)]\n", "\n", "df_b = cudf.DataFrame()\n", "df_b['key'] = [1, 2, 4]\n", "df_b['vals_b'] = [float(i+10) for i in range(3)]\n", "\n", "df_merged = df_a.merge(df_b, on=['key'], how='left')\n", "print(df_merged.sort_values('key'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Append" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Appending values from another `Series` or array-like object. `Append` does not support `Series` with nulls. For handling null values, use the `concat` method." ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " \n", " 0 0\n", " 1 1\n", " 2 2\n", " 3 3\n", " 4 4\n", " 5 19\n", " 6 18\n", " 7 17\n", " 8 16\n", " 9 15\n" ] } ], "source": [ "print(df.a.head().append(df.b.head()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Grouping" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Like pandas, cuDF supports the Split-Apply-Combine groupby paradigm." ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df['agg_col1'] = [1 if x % 2 == 0 else 0 for x in range(len(df))]\n", "df['agg_col2'] = [1 if x % 3 == 0 else 0 for x in range(len(df))]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Grouping and then applying the `sum` function to the grouped data." ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " agg_col1 sum_a sum_b sum_c sum_agg_col2\n", "0 0 100 90 100 3\n", "1 1 90 100 90 4\n" ] } ], "source": [ "print(df.groupby('agg_col1').sum())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Grouping hierarchically then applying the `sum` function to grouped data." ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " agg_col1 agg_col2 sum_a sum_b sum_c\n", "0 0 0 73 60 73\n", "1 0 1 27 30 27\n", "2 1 0 54 60 54\n", "3 1 1 36 40 36\n" ] } ], "source": [ "print(df.groupby(['agg_col1', 'agg_col2']).sum())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Grouping and applying statistical functions to specific columns, using `agg`." ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " agg_col1 mean_b sum_c max_a\n", "0 0 9 100 19\n", "1 1 10 90 18\n" ] } ], "source": [ "print(df.groupby('agg_col1').agg({'a':'max', 'b':'mean', 'c':'sum'}))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Reshaping\n", "------------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Time Series\n", "------------\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "cuDF supports `datetime` typed columns, which allow users to interact with and filter data based on specific timestamps." ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " date value\n", "0 2018-11-20T00:00:00.000 0.15416284237967237\n", "1 2018-11-21T00:00:00.000 0.7400496965154048\n", "2 2018-11-22T00:00:00.000 0.26331501518513467\n", "3 2018-11-23T00:00:00.000 0.5337393933802977\n" ] } ], "source": [ "import datetime as dt\n", "\n", "date_df = cudf.DataFrame()\n", "date_df['date'] = pd.date_range('11/20/2018', periods=72, freq='D')\n", "date_df['value'] = np.random.sample(len(date_df))\n", "\n", "search_date = dt.datetime.strptime('2018-11-23', '%Y-%m-%d')\n", "print(date_df.query('date <= @search_date'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Categoricals\n", "------------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "cuDF supports categorical columns." ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " grade id\n", "0 a 1\n", "1 b 2\n", "2 b 3\n", "3 a 4\n", "4 a 5\n", "5 e 6\n" ] } ], "source": [ "pdf = pd.DataFrame({\"id\":[1,2,3,4,5,6], \"grade\":['a', 'b', 'b', 'a', 'a', 'e']})\n", "pdf[\"grade\"] = pdf[\"grade\"].astype(\"category\")\n", "\n", "gdf = cudf.DataFrame.from_pandas(pdf)\n", "print(gdf)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Accessing the categories of a column." ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "('a', 'b', 'e')\n" ] } ], "source": [ "print(gdf.grade.cat.categories)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Accessing the underlying code values of each categorical observation." ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " \n", "0 0\n", "1 1\n", "2 1\n", "3 0\n", "4 0\n", "5 2\n" ] } ], "source": [ "print(gdf.grade.cat.codes)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plotting\n", "------------\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Converting Data Representation\n", "--------------------------------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pandas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Converting a cuDF `DataFrame` to a pandas `DataFrame`." ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " a b c agg_col1 agg_col2\n", "0 0 19 0 1 1\n", "1 1 18 1 0 0\n", "2 2 17 2 1 0\n", "3 3 16 3 0 1\n", "4 4 15 4 1 0\n" ] } ], "source": [ "print(df.head().to_pandas())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Numpy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Converting a cuDF `DataFrame` to a numpy `rec.array`." ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[( 0, 0, 19, 0, 1, 1) ( 1, 1, 18, 1, 0, 0) ( 2, 2, 17, 2, 1, 0)\n", " ( 3, 3, 16, 3, 0, 1) ( 4, 4, 15, 4, 1, 0) ( 5, 5, 14, 5, 0, 0)\n", " ( 6, 6, 13, 6, 1, 1) ( 7, 7, 12, 7, 0, 0) ( 8, 8, 11, 8, 1, 0)\n", " ( 9, 9, 10, 9, 0, 1) (10, 10, 9, 10, 1, 0) (11, 11, 8, 11, 0, 0)\n", " (12, 12, 7, 12, 1, 1) (13, 13, 6, 13, 0, 0) (14, 14, 5, 14, 1, 0)\n", " (15, 15, 4, 15, 0, 1) (16, 16, 3, 16, 1, 0) (17, 17, 2, 17, 0, 0)\n", " (18, 18, 1, 18, 1, 1) (19, 19, 0, 19, 0, 0)]\n" ] } ], "source": [ "print(df.to_records())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Converting a cuDF `Series` to a numpy `ndarray`." ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19]\n" ] } ], "source": [ "print(df['a'].to_array())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Arrow" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Converting a cuDF `DataFrame` to a PyArrow `Table`." ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "pyarrow.Table\n", "None: int64\n", "a: int64\n", "b: int64\n", "c: int64\n", "agg_col1: int64\n", "agg_col2: int64\n" ] } ], "source": [ "print(df.to_arrow())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Getting Data In/Out\n", "------------------------\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## CSV" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Writing to a CSV file, by first sending data to a pandas `Dataframe` on the host." ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df.to_pandas().to_csv('foo.txt', index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Reading from a csv file." ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " a b c a1 a2\n", " 0 0 19 0 1 1\n", " 1 1 18 1 0 0\n", " 2 2 17 2 1 0\n", " 3 3 16 3 0 1\n", " 4 4 15 4 1 0\n", " 5 5 14 5 0 0\n", " 6 6 13 6 1 1\n", " 7 7 12 7 0 0\n", " 8 8 11 8 1 0\n", " 9 9 10 9 0 1\n", "[10 more rows]\n" ] } ], "source": [ "df = cudf.read_csv('foo.txt', delimiter=',',\n", " names=['a', 'b', 'c', 'a1', 'a2'],\n", " dtype=['int64', 'int64', 'int64', 'int64', 'int64'],\n", " skiprows=1)\n", "print(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Parquet" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## ORC" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Gotchas\n", "--------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you are attempting to perform Boolean indexing directly or using the `query` API, you might see an exception like:\n", "\n", "```\n", " ---------------------------------------------------------------------------\n", " AssertionError Traceback (most recent call last)\n", " ...\n", " 103 from .numerical import NumericalColumn\n", " --> 104 assert column.null_count == 0 # We don't properly handle the boolmask yet\n", " 105 boolbits = cudautils.compact_mask_bytes(boolmask.to_gpu_array())\n", " 106 indices = cudautils.arange(len(boolmask))\n", "\n", " AssertionError: \n", "\n", "```\n", "Boolean indexing a `Series` containing null values will cause this error. Consider filling or removing the missing values.\n" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.6" } }, "nbformat": 4, "nbformat_minor": 2 }