{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "10 Minutes to cuDF and Dask-cuDF\n", "=======================\n", "\n", "Modeled after 10 Minutes to Pandas, this is a short introduction to cuDF and Dask-cuDF, geared mainly for new users.\n", "\n", "### What are these Libraries?\n", "\n", "[cuDF](https://github.com/rapidsai/cudf) is a Python GPU DataFrame library (built on the Apache Arrow columnar memory format) for loading, joining, aggregating, filtering, and otherwise manipulating data.\n", "\n", "[Dask](https://dask.org/) is a flexible library for parallel computing in Python that makes scaling out your workflow smooth and simple.\n", "\n", "[Dask-cuDF](https://github.com/rapidsai/dask-cudf) is a library that provides a partitioned, GPU-backed dataframe, using Dask.\n", "\n", "\n", "### When to use cuDF and Dask-cuDF\n", "\n", "If your workflow is fast enough on a single GPU or your data comfortably fits in memory on a single GPU, you would want to use cuDF. If you want to distribute your workflow across multiple GPUs, have more data than you can fit in memory on a single GPU, or want to analyze data spread across many files at once, you would want to use Dask-cuDF." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import os\n", "\n", "import numpy as np\n", "import pandas as pd\n", "import cudf\n", "import dask_cudf\n", "\n", "np.random.seed(12)\n", "\n", "#### Portions of this were borrowed and adapted from the\n", "#### cuDF cheatsheet, existing cuDF documentation,\n", "#### and 10 Minutes to Pandas." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Object Creation\n", "---------------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Creating a `cudf.Series` and `dask_cudf.Series`." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 1\n", "1 2\n", "2 3\n", "3 \n", "4 4\n", "dtype: int64\n" ] } ], "source": [ "s = cudf.Series([1,2,3,None,4])\n", "print(s)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 1\n", "1 2\n", "2 3\n", "3 \n", "4 4\n", "dtype: int64\n" ] } ], "source": [ "ds = dask_cudf.from_cudf(s, npartitions=2) \n", "print(ds.compute())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Creating a `cudf.DataFrame` and a `dask_cudf.DataFrame` by specifying values for each column." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " a b c\n", "0 0 19 0\n", "1 1 18 1\n", "2 2 17 2\n", "3 3 16 3\n", "4 4 15 4\n", "5 5 14 5\n", "6 6 13 6\n", "7 7 12 7\n", "8 8 11 8\n", "9 9 10 9\n", "[10 more rows]\n" ] } ], "source": [ "df = cudf.DataFrame([('a', list(range(20))),\n", "('b', list(reversed(range(20)))),\n", "('c', list(range(20)))])\n", "print(df)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " a b c\n", "0 0 19 0\n", "1 1 18 1\n", "2 2 17 2\n", "3 3 16 3\n", "4 4 15 4\n", "5 5 14 5\n", "6 6 13 6\n", "7 7 12 7\n", "8 8 11 8\n", "9 9 10 9\n", "[10 more rows]\n" ] } ], "source": [ "ddf = dask_cudf.from_cudf(df, npartitions=2) \n", "print(ddf.compute())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Creating a `cudf.DataFrame` from a pandas `Dataframe` and a `dask_cudf.Dataframe` from a `cudf.Dataframe`.\n", "\n", "*Note that best practice for using Dask-cuDF is to read data directly into a `dask_cudf.DataFrame` with something like `read_csv` (discussed below).*" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": 
"stream", "text": [ " a b\n", "0 0 0.1\n", "1 1 0.2\n", "2 2 \n", "3 3 0.3\n" ] } ], "source": [ "pdf = pd.DataFrame({'a': [0, 1, 2, 3],'b': [0.1, 0.2, None, 0.3]})\n", "gdf = cudf.DataFrame.from_pandas(pdf)\n", "print(gdf)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " a b\n", "0 0 0.1\n", "1 1 0.2\n", "2 2 \n", "3 3 0.3\n" ] } ], "source": [ "dask_df = dask_cudf.from_cudf(pdf, npartitions=2)\n", "dask_gdf = dask_cudf.from_dask_dataframe(dask_df)\n", "print(dask_gdf.compute())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Viewing Data\n", "-------------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Viewing the top rows of a GPU dataframe." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " a b c\n", "0 0 19 0\n", "1 1 18 1\n" ] } ], "source": [ "print(df.head(2))" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " a b c\n", "0 0 19 0\n", "1 1 18 1\n" ] } ], "source": [ "print(ddf.head(2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sorting by values." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " a b c\n", "19 19 0 19\n", "18 18 1 18\n", "17 17 2 17\n", "16 16 3 16\n", "15 15 4 15\n", "14 14 5 14\n", "13 13 6 13\n", "12 12 7 12\n", "11 11 8 11\n", "10 10 9 10\n", "[10 more rows]\n" ] } ], "source": [ "print(df.sort_values(by='b'))" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " a b c\n", "0 19 0 19\n", "1 18 1 18\n", "2 17 2 17\n", "3 16 3 16\n", "4 15 4 15\n", "5 14 5 14\n", "6 13 6 13\n", "7 12 7 12\n", "8 11 8 11\n", "9 10 9 10\n", "[10 more rows]\n" ] } ], "source": [ "print(ddf.sort_values(by='b').compute())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Selection\n", "------------\n", "\n", "## Getting" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Selecting a single column, which initially yields a `cudf.Series` or `dask_cudf.Series`. Calling `compute` results in a `cudf.Series` (equivalent to `df.a`)." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 0\n", "1 1\n", "2 2\n", "3 3\n", "4 4\n", "5 5\n", "6 6\n", "7 7\n", "8 8\n", "9 9\n", "[10 more rows]\n", "Name: a, dtype: int64\n" ] } ], "source": [ "print(df['a'])" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 0\n", "1 1\n", "2 2\n", "3 3\n", "4 4\n", "5 5\n", "6 6\n", "7 7\n", "8 8\n", "9 9\n", "[10 more rows]\n", "Name: a, dtype: int64\n" ] } ], "source": [ "print(ddf['a'].compute())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Selection by Label" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Selecting rows from index 2 to index 5 from columns 'a' and 'b'." 
] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " a b\n", "2 2 17\n", "3 3 16\n", "4 4 15\n", "5 5 14\n" ] } ], "source": [ "print(df.loc[2:5, ['a', 'b']])" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " a b\n", "2 2 17\n", "3 3 16\n", "4 4 15\n", "5 5 14\n" ] } ], "source": [ "print(ddf.loc[2:5, ['a', 'b']].compute())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Selection by Position" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Selecting via integers and integer slices, like numpy/pandas. Note that this functionality is not available for Dask-cuDF DataFrames." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "a 0\n", "b 19\n", "c 0\n", "Name: 0, dtype: int64\n" ] } ], "source": [ "print(df.iloc[0])" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " a b\n", "0 0 19\n", "1 1 18\n", "2 2 17\n" ] } ], "source": [ "print(df.iloc[0:3, 0:2])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also select elements of a `DataFrame` or `Series` with direct index access." 
] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " a b c\n", "3 3 16 3\n", "4 4 15 4\n" ] } ], "source": [ "print(df[3:5])" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "3 \n", "4 4\n", "dtype: int64\n" ] } ], "source": [ "print(s[3:5])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Boolean Indexing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Selecting rows in a `DataFrame` or `Series` by direct Boolean indexing." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " a b c\n", "0 0 19 0\n", "1 1 18 1\n", "2 2 17 2\n", "3 3 16 3\n" ] } ], "source": [ "print(df[df.b > 15])" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " a b c\n", "0 0 19 0\n", "1 1 18 1\n", "2 2 17 2\n", "3 3 16 3\n" ] } ], "source": [ "print(ddf[ddf.b > 15].compute())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Selecting values from a `DataFrame` where a Boolean condition is met, via the `query` API." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " a b c\n", "16 16 3 16\n" ] } ], "source": [ "print(df.query(\"b == 3\")) " ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " a b c\n", "16 16 3 16\n" ] } ], "source": [ "print(ddf.query(\"b == 3\").compute()) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also pass local variables to Dask-cuDF queries, via the `local_dict` keyword. With standard cuDF, you may either use the `local_dict` keyword or directly pass the variable via the `@` keyword." 
] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " a b c\n", "16 16 3 16\n" ] } ], "source": [ "cudf_comparator = 3\n", "print(df.query(\"b == @cudf_comparator\"))" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " a b c\n", "16 16 3 16\n" ] } ], "source": [ "dask_cudf_comparator = 3\n", "print(ddf.query(\"b == @val\", local_dict={'val':dask_cudf_comparator}).compute()) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Supported logical operators include `>`, `<`, `>=`, `<=`, `==`, and `!=`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## MultiIndex" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "cuDF supports hierarchical indexing of DataFrames using MultiIndex. Grouping hierarchically (see `Grouping` below) automatically produces a DataFrame with a MultiIndex." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "MultiIndex(levels=[array(['a', 'b'], dtype=object) array([1, 2, 3, 4])],\n", "codes= 0 1\n", "0 0 0\n", "1 0 1\n", "2 1 2\n", "3 1 3)" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "arrays = [['a', 'a', 'b', 'b'],\n", " [1, 2, 3, 4]]\n", "tuples = list(zip(*arrays))\n", "idx = cudf.MultiIndex.from_tuples(tuples)\n", "idx" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This index can back either axis of a DataFrame." 
] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " first second\n", "a 1 0.154163 0.014575\n", " 2 0.740050 0.918747\n", "b 3 0.263315 0.900715\n", " 4 0.533739 0.033421\n" ] } ], "source": [ "gdf1 = cudf.DataFrame({'first': np.random.rand(4), 'second': np.random.rand(4)})\n", "gdf1.index = idx\n", "print(gdf1.to_pandas())" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " a b \n", " 1 2 3 4\n", "first 0.956949 0.137209 0.283828 0.606083\n", "second 0.944225 0.852736 0.002259 0.521226\n" ] } ], "source": [ "gdf2 = cudf.DataFrame({'first': np.random.rand(4), 'second': np.random.rand(4)}).T\n", "gdf2.columns = idx\n", "print(gdf2.to_pandas())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Accessing values of a DataFrame with a MultiIndex. Note that slicing is not yet supported." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " first second\n", "b 3 0.263315 0.900715\n" ] } ], "source": [ "print(gdf1.loc[('b', 3)].to_pandas())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Missing Data\n", "------------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Missing data can be replaced by using the `fillna` method." 
] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 1\n", "1 2\n", "2 3\n", "3 999\n", "4 4\n", "dtype: int64\n" ] } ], "source": [ "print(s.fillna(999))" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 1\n", "1 2\n", "2 3\n", "3 999\n", "4 4\n", "dtype: int64\n" ] } ], "source": [ "print(ds.fillna(999).compute())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Operations\n", "------------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Stats" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Calculating descriptive statistics for a `Series`." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2.5 1.666666666666666\n" ] } ], "source": [ "print(s.mean(), s.var())" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2.5 1.6666666666666667\n" ] } ], "source": [ "print(ds.mean().compute(), ds.var().compute())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Applymap" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Applying functions to a `Series`. Note that applying user defined functions directly with Dask-cuDF is not yet implemented. For now, you can use [map_partitions](http://docs.dask.org/en/stable/dataframe-api.html#dask.dataframe.DataFrame.map_partitions) to apply a function to each partition of the distributed dataframe." 
] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 10\n", "1 11\n", "2 12\n", "3 13\n", "4 14\n", "5 15\n", "6 16\n", "7 17\n", "8 18\n", "9 19\n", "[10 more rows]\n", "Name: a, dtype: int64\n" ] } ], "source": [ "def add_ten(num):\n", " return num + 10\n", "\n", "print(df['a'].applymap(add_ten))" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 10\n", "1 11\n", "2 12\n", "3 13\n", "4 14\n", "5 15\n", "6 16\n", "7 17\n", "8 18\n", "9 19\n", "[10 more rows]\n", "dtype: int64\n" ] } ], "source": [ "print(ddf['a'].map_partitions(add_ten).compute())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Histogramming" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Counting the number of occurrences of each unique value of variable." ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 1\n", "1 1\n", "2 1\n", "3 1\n", "4 1\n", "5 1\n", "6 1\n", "7 1\n", "8 1\n", "9 1\n", "[10 more rows]\n", "dtype: int64\n" ] } ], "source": [ "print(df.a.value_counts())" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 1\n", "1 1\n", "2 1\n", "3 1\n", "4 1\n", "5 1\n", "6 1\n", "7 1\n", "8 1\n", "9 1\n", "[10 more rows]\n", "dtype: int64\n" ] } ], "source": [ "print(ddf.a.value_counts().compute())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## String Methods" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Like pandas, cuDF provides string processing methods in the `str` attribute of `Series`. Full documentation of string methods is a work in progress. Please see the cuDF API documentation for more information." 
] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 a\n", "1 b\n", "2 c\n", "3 aaba\n", "4 baca\n", "5 None\n", "6 caba\n", "7 dog\n", "8 cat\n", "dtype: object\n" ] } ], "source": [ "s = cudf.Series(['A', 'B', 'C', 'Aaba', 'Baca', None, 'CABA', 'dog', 'cat'])\n", "print(s.str.lower())" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 a\n", "1 b\n", "2 c\n", "3 aaba\n", "4 baca\n", "5 None\n", "6 caba\n", "7 dog\n", "8 cat\n", "dtype: object\n" ] } ], "source": [ "ds = dask_cudf.from_cudf(s, npartitions=2)\n", "print(ds.str.lower().compute())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Concat" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Concatenating `Series` and `DataFrames` row-wise." ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 1\n", "1 2\n", "2 3\n", "3 \n", "4 5\n", "0 1\n", "1 2\n", "2 3\n", "3 \n", "4 5\n", "dtype: int64\n" ] } ], "source": [ "s = cudf.Series([1, 2, 3, None, 5])\n", "print(cudf.concat([s, s]))" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 1\n", "1 2\n", "2 3\n", "3 \n", "4 5\n", "0 1\n", "1 2\n", "2 3\n", "3 \n", "4 5\n", "dtype: int64\n" ] } ], "source": [ "ds2 = dask_cudf.from_cudf(s, npartitions=2)\n", "print(dask_cudf.concat([ds2, ds2]).compute())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Join" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Performing SQL style merges. Note that the dataframe order is not maintained, but may be restored post-merge by sorting by the index." 
] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " key vals_a vals_b\n", "0 a 10.0 100.0\n", "1 c 12.0 101.0\n", "2 e 14.0 102.0\n", "3 b 11.0 \n", "4 d 13.0 \n" ] } ], "source": [ "df_a = cudf.DataFrame()\n", "df_a['key'] = ['a', 'b', 'c', 'd', 'e']\n", "df_a['vals_a'] = [float(i + 10) for i in range(5)]\n", "\n", "df_b = cudf.DataFrame()\n", "df_b['key'] = ['a', 'c', 'e']\n", "df_b['vals_b'] = [float(i+100) for i in range(3)]\n", "\n", "merged = df_a.merge(df_b, on=['key'], how='left')\n", "print(merged)" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " key vals_a vals_b\n", "0 a 10.0 100.0\n", "1 c 12.0 101.0\n", "2 b 11.0 \n", "0 e 14.0 102.0\n", "1 d 13.0 \n" ] } ], "source": [ "ddf_a = dask_cudf.from_cudf(df_a, npartitions=2)\n", "ddf_b = dask_cudf.from_cudf(df_b, npartitions=2)\n", "\n", "merged = ddf_a.merge(ddf_b, on=['key'], how='left').compute()\n", "print(merged)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Append" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Appending values from another `Series` or array-like object." 
] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 1\n", "1 2\n", "2 3\n", "3 \n", "4 5\n", "0 1\n", "1 2\n", "2 3\n", "3 \n", "4 5\n", "dtype: int64\n" ] } ], "source": [ "print(s.append(s))" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 1\n", "1 2\n", "2 3\n", "3 \n", "4 5\n", "0 1\n", "1 2\n", "2 3\n", "3 \n", "4 5\n", "dtype: int64\n" ] } ], "source": [ "print(ds2.append(ds2).compute())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Grouping" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Like pandas, cuDF and Dask-cuDF support the Split-Apply-Combine groupby paradigm." ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [], "source": [ "df['agg_col1'] = [1 if x % 2 == 0 else 0 for x in range(len(df))]\n", "df['agg_col2'] = [1 if x % 3 == 0 else 0 for x in range(len(df))]\n", "\n", "ddf = dask_cudf.from_cudf(df, npartitions=2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Grouping and then applying the `sum` function to the grouped data." ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " a b c agg_col2\n", "agg_col1\n", "0 100 90 100 3\n", "1 90 100 90 4\n" ] } ], "source": [ "print(df.groupby('agg_col1').sum())" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " a b c agg_col2\n", "0 100 90 100 3\n", "1 90 100 90 4\n" ] } ], "source": [ "print(ddf.groupby('agg_col1').sum().compute())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Grouping hierarchically then applying the `sum` function to grouped data. We send the result to a pandas dataframe only for printing purposes." 
] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " a b c\n", "agg_col1 agg_col2 \n", "0 0 73 60 73\n", " 1 27 30 27\n", "1 0 54 60 54\n", " 1 36 40 36\n" ] } ], "source": [ "print(df.groupby(['agg_col1', 'agg_col2']).sum().to_pandas())" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abc
agg_col1agg_col2
00736073
1273027
10546054
1364036
\n", "
" ], "text/plain": [ " a b c\n", "agg_col1 agg_col2 \n", "0 0 73 60 73\n", " 1 27 30 27\n", "1 0 54 60 54\n", " 1 36 40 36" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf.groupby(['agg_col1', 'agg_col2']).sum().compute().to_pandas()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Grouping and applying statistical functions to specific columns, using `agg`." ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " a b c\n", "agg_col1\n", "0 19 9.0 100\n", "1 18 10.0 90\n" ] } ], "source": [ "print(df.groupby('agg_col1').agg({'a':'max', 'b':'mean', 'c':'sum'}))" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " a b c\n", "0 19 9.0 100\n", "1 18 10.0 90\n" ] } ], "source": [ "print(ddf.groupby('agg_col1').agg({'a':'max', 'b':'mean', 'c':'sum'}).compute())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Transpose" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Transposing a dataframe, using either the `transpose` method or `T` property. Currently, all columns must have the same type. Transposing is not currently implemented in Dask-cuDF." 
] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " a b\n", "0 1 4\n", "1 2 5\n", "2 3 6\n" ] } ], "source": [ "sample = cudf.DataFrame({'a':[1,2,3], 'b':[4,5,6]})\n", "print(sample)" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0 1 2\n", "a 1 2 3\n", "b 4 5 6\n" ] } ], "source": [ "print(sample.transpose())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Time Series\n", "------------\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`DataFrames` supports `datetime` typed columns, which allow users to interact with and filter data based on specific timestamps." ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " date value\n", "0 2018-11-20T00:00:00.000 0.5520376332645666\n", "1 2018-11-21T00:00:00.000 0.4853774136627097\n", "2 2018-11-22T00:00:00.000 0.7681341540644223\n", "3 2018-11-23T00:00:00.000 0.1607167531255701\n" ] } ], "source": [ "import datetime as dt\n", "\n", "date_df = cudf.DataFrame()\n", "date_df['date'] = pd.date_range('11/20/2018', periods=72, freq='D')\n", "date_df['value'] = np.random.sample(len(date_df))\n", "\n", "search_date = dt.datetime.strptime('2018-11-23', '%Y-%m-%d')\n", "print(date_df.query('date <= @search_date'))" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " date value\n", "0 2018-11-20T00:00:00.000 0.5520376332645666\n", "1 2018-11-21T00:00:00.000 0.4853774136627097\n", "2 2018-11-22T00:00:00.000 0.7681341540644223\n", "3 2018-11-23T00:00:00.000 0.1607167531255701\n" ] } ], "source": [ "date_ddf = dask_cudf.from_cudf(date_df, npartitions=2)\n", "print(date_ddf.query('date <= @search_date', local_dict={'search_date':search_date}).compute())" ] }, { 
"cell_type": "markdown", "metadata": {}, "source": [ "Categoricals\n", "------------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`DataFrames` support categorical columns." ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " id grade\n", "0 1 a\n", "1 2 b\n", "2 3 b\n", "3 4 a\n", "4 5 a\n", "5 6 e\n" ] } ], "source": [ "pdf = pd.DataFrame({\"id\":[1,2,3,4,5,6], \"grade\":['a', 'b', 'b', 'a', 'a', 'e']})\n", "pdf[\"grade\"] = pdf[\"grade\"].astype(\"category\")\n", "\n", "gdf = cudf.DataFrame.from_pandas(pdf)\n", "print(gdf)" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " id grade\n", "0 1 a\n", "1 2 b\n", "2 3 b\n", "3 4 a\n", "4 5 a\n", "5 6 e\n" ] } ], "source": [ "dgdf = dask_cudf.from_cudf(gdf, npartitions=2)\n", "print(dgdf.compute())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Accessing the categories of a column. Note that this is currently not supported in Dask-cuDF." ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('a', 'b', 'e')" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gdf.grade.cat.categories" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Accessing the underlying code values of each categorical observation." 
] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 0\n", "1 1\n", "2 1\n", "3 0\n", "4 0\n", "5 2\n", "dtype: int8\n" ] } ], "source": [ "print(gdf.grade.cat.codes)" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 0\n", "1 1\n", "2 1\n", "0 0\n", "1 0\n", "2 2\n", "dtype: int8\n" ] } ], "source": [ "print(dgdf.grade.cat.codes.compute())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Converting Data Representation\n", "--------------------------------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pandas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Converting a cuDF and Dask-cuDF `DataFrame` to a pandas `DataFrame`." ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " a b c agg_col1 agg_col2\n", "0 0 19 0 1 1\n", "1 1 18 1 0 0\n", "2 2 17 2 1 0\n", "3 3 16 3 0 1\n", "4 4 15 4 1 0\n" ] } ], "source": [ "print(df.head().to_pandas())" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " a b c agg_col1 agg_col2\n", "0 0 19 0 1 1\n", "1 1 18 1 0 0\n", "2 2 17 2 1 0\n", "3 3 16 3 0 1\n", "4 4 15 4 1 0\n" ] } ], "source": [ "print(ddf.compute().head().to_pandas())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Numpy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Converting a cuDF or Dask-cuDF `DataFrame` to a numpy `ndarray`." 
] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[ 0 19 0 1 1]\n", " [ 1 18 1 0 0]\n", " [ 2 17 2 1 0]\n", " [ 3 16 3 0 1]\n", " [ 4 15 4 1 0]\n", " [ 5 14 5 0 0]\n", " [ 6 13 6 1 1]\n", " [ 7 12 7 0 0]\n", " [ 8 11 8 1 0]\n", " [ 9 10 9 0 1]\n", " [10 9 10 1 0]\n", " [11 8 11 0 0]\n", " [12 7 12 1 1]\n", " [13 6 13 0 0]\n", " [14 5 14 1 0]\n", " [15 4 15 0 1]\n", " [16 3 16 1 0]\n", " [17 2 17 0 0]\n", " [18 1 18 1 1]\n", " [19 0 19 0 0]]\n" ] } ], "source": [ "print(df.as_matrix())" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[ 0 19 0 1 1]\n", " [ 1 18 1 0 0]\n", " [ 2 17 2 1 0]\n", " [ 3 16 3 0 1]\n", " [ 4 15 4 1 0]\n", " [ 5 14 5 0 0]\n", " [ 6 13 6 1 1]\n", " [ 7 12 7 0 0]\n", " [ 8 11 8 1 0]\n", " [ 9 10 9 0 1]\n", " [10 9 10 1 0]\n", " [11 8 11 0 0]\n", " [12 7 12 1 1]\n", " [13 6 13 0 0]\n", " [14 5 14 1 0]\n", " [15 4 15 0 1]\n", " [16 3 16 1 0]\n", " [17 2 17 0 0]\n", " [18 1 18 1 1]\n", " [19 0 19 0 0]]\n" ] } ], "source": [ "print(ddf.compute().as_matrix())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Converting a cuDF or Dask-cuDF `Series` to a numpy `ndarray`." 
] }, { "cell_type": "code", "execution_count": 79, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,\n", " 17, 18, 19])" ] }, "execution_count": 79, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(df['a'].to_array())" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19]\n" ] } ], "source": [ "print(ddf['a'].compute().to_array())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Arrow" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Converting a cuDF or Dask-cuDF `DataFrame` to a PyArrow `Table`." ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "pyarrow.Table\n", "a: int64\n", "b: int64\n", "c: int64\n", "agg_col1: int64\n", "agg_col2: int64\n", "__index_level_0__: int64\n", "metadata\n", "--------\n", "OrderedDict([(b'pandas',\n", " b'{\"index_columns\": [\"__index_level_0__\"], \"column_indexes\": ['\n", " b'{\"name\": null, \"field_name\": null, \"pandas_type\": \"unicode\",'\n", " b' \"numpy_type\": \"object\", \"metadata\": {\"encoding\": \"UTF-8\"}}]'\n", " b', \"columns\": [{\"name\": \"a\", \"field_name\": \"a\", \"pandas_type\"'\n", " b': \"int64\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\"'\n", " b': \"b\", \"field_name\": \"b\", \"pandas_type\": \"int64\", \"numpy_typ'\n", " b'e\": \"int64\", \"metadata\": null}, {\"name\": \"c\", \"field_name\": '\n", " b'\"c\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", \"metadat'\n", " b'a\": null}, {\"name\": \"agg_col1\", \"field_name\": \"agg_col1\", \"p'\n", " b'andas_type\": \"int64\", \"numpy_type\": \"int64\", \"metadata\": nul'\n", " b'l}, {\"name\": \"agg_col2\", \"field_name\": \"agg_col2\", \"pandas_t'\n", " b'ype\": \"int64\", 
\"numpy_type\": \"int64\", \"metadata\": null}, {\"n'\n", " b'ame\": null, \"field_name\": \"__index_level_0__\", \"pandas_type\"'\n", " b': \"int64\", \"numpy_type\": \"int64\", \"metadata\": null}], \"panda'\n", " b's_version\": \"0.23.4\"}')])\n" ] } ], "source": [ "print(df.to_arrow())" ] }, { "cell_type": "code", "execution_count": 69, "metadata": { "collapsed": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "pyarrow.Table\n", "a: int64\n", "b: int64\n", "c: int64\n", "agg_col1: int64\n", "agg_col2: int64\n", "__index_level_0__: int64\n", "metadata\n", "--------\n", "OrderedDict([(b'pandas',\n", " b'{\"index_columns\": [\"__index_level_0__\"], \"column_indexes\": ['\n", " b'{\"name\": null, \"field_name\": null, \"pandas_type\": \"unicode\",'\n", " b' \"numpy_type\": \"object\", \"metadata\": {\"encoding\": \"UTF-8\"}}]'\n", " b', \"columns\": [{\"name\": \"a\", \"field_name\": \"a\", \"pandas_type\"'\n", " b': \"int64\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\"'\n", " b': \"b\", \"field_name\": \"b\", \"pandas_type\": \"int64\", \"numpy_typ'\n", " b'e\": \"int64\", \"metadata\": null}, {\"name\": \"c\", \"field_name\": '\n", " b'\"c\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", \"metadat'\n", " b'a\": null}, {\"name\": \"agg_col1\", \"field_name\": \"agg_col1\", \"p'\n", " b'andas_type\": \"int64\", \"numpy_type\": \"int64\", \"metadata\": nul'\n", " b'l}, {\"name\": \"agg_col2\", \"field_name\": \"agg_col2\", \"pandas_t'\n", " b'ype\": \"int64\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"n'\n", " b'ame\": null, \"field_name\": \"__index_level_0__\", \"pandas_type\"'\n", " b': \"int64\", \"numpy_type\": \"int64\", \"metadata\": null}], \"panda'\n", " b's_version\": \"0.23.4\"}')])\n" ] } ], "source": [ "print(ddf.compute().to_arrow())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Getting Data In/Out\n", "------------------------\n" ] }, { "cell_type": "markdown", "metadata": 
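{}, "source": [ "Arrow also offers a path for getting data in: a PyArrow `Table` can be converted back into a `cudf.DataFrame`. A quick sketch, assuming `cudf.DataFrame.from_arrow` is available in your version of cuDF:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sketch: round-trip cuDF -> Arrow -> cuDF (from_arrow is an assumed API)\n", "table = df.to_arrow()\n", "df_arrow = cudf.DataFrame.from_arrow(table)\n", "print(df_arrow.head())" ] }, { "cell_type": "markdown", "metadata": 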
{}, "source": [ "## CSV" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Writing to a CSV file by first sending the data to a pandas `DataFrame` on the host." ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [], "source": [ "if not os.path.exists('example_output'):\n", " os.mkdir('example_output')\n", " \n", "df.to_pandas().to_csv('example_output/foo.csv', index=False)" ] }, { "cell_type": "code", "execution_count": 71, "metadata": { "collapsed": true }, "outputs": [], "source": [ "ddf.compute().to_pandas().to_csv('example_output/foo_dask.csv', index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Reading from a CSV file." ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " a b c agg_col1 agg_col2\n", "0 0 19 0 1 1\n", "1 1 18 1 0 0\n", "2 2 17 2 1 0\n", "3 3 16 3 0 1\n", "4 4 15 4 1 0\n", "5 5 14 5 0 0\n", "6 6 13 6 1 1\n", "7 7 12 7 0 0\n", "8 8 11 8 1 0\n", "9 9 10 9 0 1\n", "[10 more rows]\n" ] } ], "source": [ "df = cudf.read_csv('example_output/foo.csv')\n", "print(df)" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " a b c agg_col1 agg_col2\n", "0 0 19 0 1 1\n", "1 1 18 1 0 0\n", "2 2 17 2 1 0\n", "3 3 16 3 0 1\n", "4 4 15 4 1 0\n", "5 5 14 5 0 0\n", "6 6 13 6 1 1\n", "7 7 12 7 0 0\n", "8 8 11 8 1 0\n", "9 9 10 9 0 1\n", "[10 more rows]\n" ] } ], "source": [ "ddf = dask_cudf.read_csv('example_output/foo_dask.csv')\n", "print(ddf.compute())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Reading all CSV files in a directory into a single `dask_cudf.DataFrame`, using the star wildcard."
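] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Each matched file becomes at least one partition of the resulting `dask_cudf.DataFrame`; the `npartitions` attribute reports how many. A quick sketch using the two files written above:*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sketch: count the partitions produced by a wildcard read\n", "print(dask_cudf.read_csv('example_output/*.csv').npartitions)"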
] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " a b c agg_col1 agg_col2\n", "0 0 19 0 1 1\n", "1 1 18 1 0 0\n", "2 2 17 2 1 0\n", "3 3 16 3 0 1\n", "4 4 15 4 1 0\n", "5 5 14 5 0 0\n", "6 6 13 6 1 1\n", "7 7 12 7 0 0\n", "8 8 11 8 1 0\n", "9 9 10 9 0 1\n", "[30 more rows]\n" ] } ], "source": [ "ddf = dask_cudf.read_csv('example_output/*.csv')\n", "print(ddf.compute())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Parquet" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Writing to parquet files, using the CPU via PyArrow." ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/conda/envs/rapids/lib/python3.6/site-packages/cudf/io/parquet.py:56: UserWarning: Using CPU via PyArrow to write Parquet dataset, this will be GPU accelerated in the future\n", " warnings.warn(\"Using CPU via PyArrow to write Parquet dataset, this will \"\n" ] } ], "source": [ "df.to_parquet('example_output/temp_parquet')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Reading parquet files with a GPU-accelerated parquet reader." 
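] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*(An aside, assuming `dask_cudf.read_parquet` exists in your build: a Dask-cuDF `DataFrame` can also read a Parquet directory directly.)*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sketch (assumed API): read the Parquet directory written above with Dask-cuDF\n", "ddf_pq = dask_cudf.read_parquet('example_output/temp_parquet')\n", "print(ddf_pq.compute())"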
] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " a b c agg_col1 agg_col2\n", "0 0 19 0 1 1\n", "1 1 18 1 0 0\n", "2 2 17 2 1 0\n", "3 3 16 3 0 1\n", "4 4 15 4 1 0\n", "5 5 14 5 0 0\n", "6 6 13 6 1 1\n", "7 7 12 7 0 0\n", "8 8 11 8 1 0\n", "9 9 10 9 0 1\n", "10 10 9 10 1 0\n", "11 11 8 11 0 0\n", "12 12 7 12 1 1\n", "13 13 6 13 0 0\n", "14 14 5 14 1 0\n", "15 15 4 15 0 1\n", "16 16 3 16 1 0\n", "17 17 2 17 0 0\n", "18 18 1 18 1 1\n", "19 19 0 19 0 0\n" ] } ], "source": [ "df = cudf.read_parquet('example_output/temp_parquet/72706b163a0d4feb949005d22146ad83.parquet')\n", "print(df.to_pandas())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Writing to parquet files from a `dask_cudf.DataFrame` using PyArrow under the hood." ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [], "source": [ "ddf.to_parquet('example_files') " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## ORC" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Reading ORC files." ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " boolean1 byte1 short1 int1 long1 float1 double1 \\\n", "0 False 1 1024 65536 9223372036854775807 1.0 -15.0 \n", "1 True 100 2048 65536 9223372036854775807 2.0 -5.0 \n", "\n", " bytes1 string1 middle.list.int1 middle.list.string1 list.int1 \\\n", "0 \u0000\u0001\u0002\u0003\u0004 hi 3 bye 4 \n", "1 bye 0 bye 0 \n", "\n", " list.string1 map map.int1 map.string1 \n", "0 chani 5 chani \n", "1 mauddib 1 mauddib " ] }, "execution_count": 78, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2 = cudf.read_orc('/cudf/python/cudf/tests/data/orc/TestOrcFile.test1.orc')\n", "df2.to_pandas()" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.8" } }, "nbformat": 4, "nbformat_minor": 2 }