{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"10 Minutes to cuDF and Dask-cuDF\n",
"=======================\n",
"\n",
"Modeled after 10 Minutes to Pandas, this is a short introduction to cuDF and Dask-cuDF, geared mainly for new users.\n",
"\n",
"### What are these Libraries?\n",
"\n",
"[cuDF](https://github.com/rapidsai/cudf) is a Python GPU DataFrame library (built on the Apache Arrow columnar memory format) for loading, joining, aggregating, filtering, and otherwise manipulating data.\n",
"\n",
"[Dask](https://dask.org/) is a flexible library for parallel computing in Python that makes scaling out your workflow smooth and simple.\n",
"\n",
"[Dask-cuDF](https://github.com/rapidsai/dask-cudf) is a library that provides a partitioned, GPU-backed dataframe, using Dask.\n",
"\n",
"\n",
"### When to use cuDF and Dask-cuDF\n",
"\n",
"If your workflow is fast enough on a single GPU, or your data comfortably fits in memory on a single GPU, use cuDF. If you want to distribute your workflow across multiple GPUs, have more data than fits in memory on a single GPU, or want to analyze data spread across many files at once, use Dask-cuDF."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import os\n",
"\n",
"import numpy as np\n",
"import pandas as pd\n",
"import cudf\n",
"import dask_cudf\n",
"\n",
"np.random.seed(12)\n",
"\n",
"#### Portions of this were borrowed and adapted from the\n",
"#### cuDF cheatsheet, existing cuDF documentation,\n",
"#### and 10 Minutes to Pandas."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Object Creation\n",
"---------------"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Creating a `cudf.Series` and `dask_cudf.Series`."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 1\n",
"1 2\n",
"2 3\n",
"3 \n",
"4 4\n",
"dtype: int64\n"
]
}
],
"source": [
"s = cudf.Series([1,2,3,None,4])\n",
"print(s)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 1\n",
"1 2\n",
"2 3\n",
"3 \n",
"4 4\n",
"dtype: int64\n"
]
}
],
"source": [
"ds = dask_cudf.from_cudf(s, npartitions=2) \n",
"print(ds.compute())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Creating a `cudf.DataFrame` and a `dask_cudf.DataFrame` by specifying values for each column."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" a b c\n",
"0 0 19 0\n",
"1 1 18 1\n",
"2 2 17 2\n",
"3 3 16 3\n",
"4 4 15 4\n",
"5 5 14 5\n",
"6 6 13 6\n",
"7 7 12 7\n",
"8 8 11 8\n",
"9 9 10 9\n",
"[10 more rows]\n"
]
}
],
"source": [
"df = cudf.DataFrame([('a', list(range(20))),\n",
"                     ('b', list(reversed(range(20)))),\n",
"                     ('c', list(range(20)))])\n",
"print(df)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" a b c\n",
"0 0 19 0\n",
"1 1 18 1\n",
"2 2 17 2\n",
"3 3 16 3\n",
"4 4 15 4\n",
"5 5 14 5\n",
"6 6 13 6\n",
"7 7 12 7\n",
"8 8 11 8\n",
"9 9 10 9\n",
"[10 more rows]\n"
]
}
],
"source": [
"ddf = dask_cudf.from_cudf(df, npartitions=2) \n",
"print(ddf.compute())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Creating a `cudf.DataFrame` from a pandas `DataFrame` and a `dask_cudf.DataFrame` from a `cudf.DataFrame`.\n",
"\n",
"*Note that best practice for using Dask-cuDF is to read data directly into a `dask_cudf.DataFrame` with something like `read_csv` (discussed below).*"
]
},
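{
"cell_type": "markdown",
"metadata": {},
"source": [
"*A minimal sketch of that best practice (left commented out, since the file pattern below is a hypothetical placeholder for your own data):*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical path; dask_cudf.read_csv accepts glob patterns and reads\n",
"# the matching files directly into a partitioned, GPU-backed dataframe.\n",
"# ddf_from_files = dask_cudf.read_csv('data/part-*.csv')"
]
},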
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" a b\n",
"0 0 0.1\n",
"1 1 0.2\n",
"2 2 \n",
"3 3 0.3\n"
]
}
],
"source": [
"pdf = pd.DataFrame({'a': [0, 1, 2, 3],'b': [0.1, 0.2, None, 0.3]})\n",
"gdf = cudf.DataFrame.from_pandas(pdf)\n",
"print(gdf)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" a b\n",
"0 0 0.1\n",
"1 1 0.2\n",
"2 2 \n",
"3 3 0.3\n"
]
}
],
"source": [
"dask_gdf = dask_cudf.from_cudf(gdf, npartitions=2)\n",
"print(dask_gdf.compute())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Viewing Data\n",
"-------------"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Viewing the top rows of a GPU dataframe."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" a b c\n",
"0 0 19 0\n",
"1 1 18 1\n"
]
}
],
"source": [
"print(df.head(2))"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" a b c\n",
"0 0 19 0\n",
"1 1 18 1\n"
]
}
],
"source": [
"print(ddf.head(2))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sorting by values."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" a b c\n",
"19 19 0 19\n",
"18 18 1 18\n",
"17 17 2 17\n",
"16 16 3 16\n",
"15 15 4 15\n",
"14 14 5 14\n",
"13 13 6 13\n",
"12 12 7 12\n",
"11 11 8 11\n",
"10 10 9 10\n",
"[10 more rows]\n"
]
}
],
"source": [
"print(df.sort_values(by='b'))"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" a b c\n",
"0 19 0 19\n",
"1 18 1 18\n",
"2 17 2 17\n",
"3 16 3 16\n",
"4 15 4 15\n",
"5 14 5 14\n",
"6 13 6 13\n",
"7 12 7 12\n",
"8 11 8 11\n",
"9 10 9 10\n",
"[10 more rows]\n"
]
}
],
"source": [
"print(ddf.sort_values(by='b').compute())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Selection\n",
"------------\n",
"\n",
"## Getting"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Selecting a single column, which yields a `cudf.Series` (equivalent to `df.a`). With Dask-cuDF this yields a `dask_cudf.Series`; calling `compute` materializes it as a `cudf.Series`."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 0\n",
"1 1\n",
"2 2\n",
"3 3\n",
"4 4\n",
"5 5\n",
"6 6\n",
"7 7\n",
"8 8\n",
"9 9\n",
"[10 more rows]\n",
"Name: a, dtype: int64\n"
]
}
],
"source": [
"print(df['a'])"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 0\n",
"1 1\n",
"2 2\n",
"3 3\n",
"4 4\n",
"5 5\n",
"6 6\n",
"7 7\n",
"8 8\n",
"9 9\n",
"[10 more rows]\n",
"Name: a, dtype: int64\n"
]
}
],
"source": [
"print(ddf['a'].compute())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Selection by Label"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Selecting rows from index 2 to index 5 from columns 'a' and 'b'."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" a b\n",
"2 2 17\n",
"3 3 16\n",
"4 4 15\n",
"5 5 14\n"
]
}
],
"source": [
"print(df.loc[2:5, ['a', 'b']])"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" a b\n",
"2 2 17\n",
"3 3 16\n",
"4 4 15\n",
"5 5 14\n"
]
}
],
"source": [
"print(ddf.loc[2:5, ['a', 'b']].compute())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Selection by Position"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Selecting via integers and integer slices, like numpy/pandas. Note that this functionality is not available for Dask-cuDF DataFrames."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"a 0\n",
"b 19\n",
"c 0\n",
"Name: 0, dtype: int64\n"
]
}
],
"source": [
"print(df.iloc[0])"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" a b\n",
"0 0 19\n",
"1 1 18\n",
"2 2 17\n"
]
}
],
"source": [
"print(df.iloc[0:3, 0:2])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also select elements of a `DataFrame` or `Series` with direct index access."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" a b c\n",
"3 3 16 3\n",
"4 4 15 4\n"
]
}
],
"source": [
"print(df[3:5])"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"3 \n",
"4 4\n",
"dtype: int64\n"
]
}
],
"source": [
"print(s[3:5])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Boolean Indexing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Selecting rows in a `DataFrame` or `Series` by direct Boolean indexing."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" a b c\n",
"0 0 19 0\n",
"1 1 18 1\n",
"2 2 17 2\n",
"3 3 16 3\n"
]
}
],
"source": [
"print(df[df.b > 15])"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" a b c\n",
"0 0 19 0\n",
"1 1 18 1\n",
"2 2 17 2\n",
"3 3 16 3\n"
]
}
],
"source": [
"print(ddf[ddf.b > 15].compute())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Selecting values from a `DataFrame` where a Boolean condition is met, via the `query` API."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" a b c\n",
"16 16 3 16\n"
]
}
],
"source": [
"print(df.query(\"b == 3\")) "
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" a b c\n",
"16 16 3 16\n"
]
}
],
"source": [
"print(ddf.query(\"b == 3\").compute()) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also pass local variables to Dask-cuDF queries via the `local_dict` keyword argument. With standard cuDF, you may either use `local_dict` or reference the variable directly in the query string with the `@` prefix."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" a b c\n",
"16 16 3 16\n"
]
}
],
"source": [
"cudf_comparator = 3\n",
"print(df.query(\"b == @cudf_comparator\"))"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" a b c\n",
"16 16 3 16\n"
]
}
],
"source": [
"dask_cudf_comparator = 3\n",
"print(ddf.query(\"b == @val\", local_dict={'val':dask_cudf_comparator}).compute()) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Supported comparison operators include `>`, `<`, `>=`, `<=`, `==`, and `!=`."
]
},
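{
"cell_type": "markdown",
"metadata": {},
"source": [
"*As a quick sketch, two of the other operators applied to the same `df` as above (outputs omitted):*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Same `query` API as above, using the >= and != comparators.\n",
"print(df.query(\"b >= 16\"))\n",
"print(df.query(\"b != 3\").head(2))"
]
},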
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## MultiIndex"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"cuDF supports hierarchical indexing of DataFrames using MultiIndex. Grouping hierarchically (see `Grouping` below) automatically produces a DataFrame with a MultiIndex."
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"MultiIndex(levels=[array(['a', 'b'], dtype=object) array([1, 2, 3, 4])],\n",
"codes= 0 1\n",
"0 0 0\n",
"1 0 1\n",
"2 1 2\n",
"3 1 3)"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"arrays = [['a', 'a', 'b', 'b'],\n",
" [1, 2, 3, 4]]\n",
"tuples = list(zip(*arrays))\n",
"idx = cudf.MultiIndex.from_tuples(tuples)\n",
"idx"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This index can back either axis of a DataFrame."
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" first second\n",
"a 1 0.154163 0.014575\n",
" 2 0.740050 0.918747\n",
"b 3 0.263315 0.900715\n",
" 4 0.533739 0.033421\n"
]
}
],
"source": [
"gdf1 = cudf.DataFrame({'first': np.random.rand(4), 'second': np.random.rand(4)})\n",
"gdf1.index = idx\n",
"print(gdf1.to_pandas())"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" a b \n",
" 1 2 3 4\n",
"first 0.956949 0.137209 0.283828 0.606083\n",
"second 0.944225 0.852736 0.002259 0.521226\n"
]
}
],
"source": [
"gdf2 = cudf.DataFrame({'first': np.random.rand(4), 'second': np.random.rand(4)}).T\n",
"gdf2.columns = idx\n",
"print(gdf2.to_pandas())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Accessing values of a DataFrame with a MultiIndex. Note that slicing is not yet supported."
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" first second\n",
"b 3 0.263315 0.900715\n"
]
}
],
"source": [
"print(gdf1.loc[('b', 3)].to_pandas())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Missing Data\n",
"------------"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Missing data can be replaced by using the `fillna` method."
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 1\n",
"1 2\n",
"2 3\n",
"3 999\n",
"4 4\n",
"dtype: int64\n"
]
}
],
"source": [
"print(s.fillna(999))"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 1\n",
"1 2\n",
"2 3\n",
"3 999\n",
"4 4\n",
"dtype: int64\n"
]
}
],
"source": [
"print(ds.fillna(999).compute())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Operations\n",
"------------"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Stats"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Calculating descriptive statistics for a `Series`."
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2.5 1.666666666666666\n"
]
}
],
"source": [
"print(s.mean(), s.var())"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2.5 1.6666666666666667\n"
]
}
],
"source": [
"print(ds.mean().compute(), ds.var().compute())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Applymap"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Applying functions to a `Series`. Note that applying user-defined functions directly with Dask-cuDF is not yet implemented. For now, you can use [map_partitions](http://docs.dask.org/en/stable/dataframe-api.html#dask.dataframe.DataFrame.map_partitions) to apply a function to each partition of the distributed dataframe."
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 10\n",
"1 11\n",
"2 12\n",
"3 13\n",
"4 14\n",
"5 15\n",
"6 16\n",
"7 17\n",
"8 18\n",
"9 19\n",
"[10 more rows]\n",
"Name: a, dtype: int64\n"
]
}
],
"source": [
"def add_ten(num):\n",
" return num + 10\n",
"\n",
"print(df['a'].applymap(add_ten))"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 10\n",
"1 11\n",
"2 12\n",
"3 13\n",
"4 14\n",
"5 15\n",
"6 16\n",
"7 17\n",
"8 18\n",
"9 19\n",
"[10 more rows]\n",
"dtype: int64\n"
]
}
],
"source": [
"print(ddf['a'].map_partitions(add_ten).compute())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Histogramming"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Counting the number of occurrences of each unique value in a variable."
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 1\n",
"1 1\n",
"2 1\n",
"3 1\n",
"4 1\n",
"5 1\n",
"6 1\n",
"7 1\n",
"8 1\n",
"9 1\n",
"[10 more rows]\n",
"dtype: int64\n"
]
}
],
"source": [
"print(df.a.value_counts())"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 1\n",
"1 1\n",
"2 1\n",
"3 1\n",
"4 1\n",
"5 1\n",
"6 1\n",
"7 1\n",
"8 1\n",
"9 1\n",
"[10 more rows]\n",
"dtype: int64\n"
]
}
],
"source": [
"print(ddf.a.value_counts().compute())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## String Methods"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Like pandas, cuDF provides string processing methods in the `str` attribute of `Series`. Full documentation of string methods is a work in progress. Please see the cuDF API documentation for more information."
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 a\n",
"1 b\n",
"2 c\n",
"3 aaba\n",
"4 baca\n",
"5 None\n",
"6 caba\n",
"7 dog\n",
"8 cat\n",
"dtype: object\n"
]
}
],
"source": [
"s = cudf.Series(['A', 'B', 'C', 'Aaba', 'Baca', None, 'CABA', 'dog', 'cat'])\n",
"print(s.str.lower())"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 a\n",
"1 b\n",
"2 c\n",
"3 aaba\n",
"4 baca\n",
"5 None\n",
"6 caba\n",
"7 dog\n",
"8 cat\n",
"dtype: object\n"
]
}
],
"source": [
"ds = dask_cudf.from_cudf(s, npartitions=2)\n",
"print(ds.str.lower().compute())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Concat"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Concatenating `Series` and `DataFrames` row-wise."
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 1\n",
"1 2\n",
"2 3\n",
"3 \n",
"4 5\n",
"0 1\n",
"1 2\n",
"2 3\n",
"3 \n",
"4 5\n",
"dtype: int64\n"
]
}
],
"source": [
"s = cudf.Series([1, 2, 3, None, 5])\n",
"print(cudf.concat([s, s]))"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 1\n",
"1 2\n",
"2 3\n",
"3 \n",
"4 5\n",
"0 1\n",
"1 2\n",
"2 3\n",
"3 \n",
"4 5\n",
"dtype: int64\n"
]
}
],
"source": [
"ds2 = dask_cudf.from_cudf(s, npartitions=2)\n",
"print(dask_cudf.concat([ds2, ds2]).compute())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Join"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Performing SQL-style merges. Note that dataframe order is not maintained, but may be restored post-merge by sorting by the index."
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" key vals_a vals_b\n",
"0 a 10.0 100.0\n",
"1 c 12.0 101.0\n",
"2 e 14.0 102.0\n",
"3 b 11.0 \n",
"4 d 13.0 \n"
]
}
],
"source": [
"df_a = cudf.DataFrame()\n",
"df_a['key'] = ['a', 'b', 'c', 'd', 'e']\n",
"df_a['vals_a'] = [float(i + 10) for i in range(5)]\n",
"\n",
"df_b = cudf.DataFrame()\n",
"df_b['key'] = ['a', 'c', 'e']\n",
"df_b['vals_b'] = [float(i+100) for i in range(3)]\n",
"\n",
"merged = df_a.merge(df_b, on=['key'], how='left')\n",
"print(merged)"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" key vals_a vals_b\n",
"0 a 10.0 100.0\n",
"1 c 12.0 101.0\n",
"2 b 11.0 \n",
"0 e 14.0 102.0\n",
"1 d 13.0 \n"
]
}
],
"source": [
"ddf_a = dask_cudf.from_cudf(df_a, npartitions=2)\n",
"ddf_b = dask_cudf.from_cudf(df_b, npartitions=2)\n",
"\n",
"merged = ddf_a.merge(ddf_b, on=['key'], how='left').compute()\n",
"print(merged)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Append"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Appending values from another `Series` or array-like object."
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 1\n",
"1 2\n",
"2 3\n",
"3 \n",
"4 5\n",
"0 1\n",
"1 2\n",
"2 3\n",
"3 \n",
"4 5\n",
"dtype: int64\n"
]
}
],
"source": [
"print(s.append(s))"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 1\n",
"1 2\n",
"2 3\n",
"3 \n",
"4 5\n",
"0 1\n",
"1 2\n",
"2 3\n",
"3 \n",
"4 5\n",
"dtype: int64\n"
]
}
],
"source": [
"print(ds2.append(ds2).compute())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Grouping"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Like pandas, cuDF and Dask-cuDF support the Split-Apply-Combine groupby paradigm."
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [],
"source": [
"df['agg_col1'] = [1 if x % 2 == 0 else 0 for x in range(len(df))]\n",
"df['agg_col2'] = [1 if x % 3 == 0 else 0 for x in range(len(df))]\n",
"\n",
"ddf = dask_cudf.from_cudf(df, npartitions=2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Grouping and then applying the `sum` function to the grouped data."
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" a b c agg_col2\n",
"agg_col1\n",
"0 100 90 100 3\n",
"1 90 100 90 4\n"
]
}
],
"source": [
"print(df.groupby('agg_col1').sum())"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" a b c agg_col2\n",
"0 100 90 100 3\n",
"1 90 100 90 4\n"
]
}
],
"source": [
"print(ddf.groupby('agg_col1').sum().compute())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Grouping hierarchically, then applying the `sum` function to the grouped data. We convert the result to a pandas `DataFrame` only for display purposes."
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" a b c\n",
"agg_col1 agg_col2 \n",
"0 0 73 60 73\n",
" 1 27 30 27\n",
"1 0 54 60 54\n",
" 1 36 40 36\n"
]
}
],
"source": [
"print(df.groupby(['agg_col1', 'agg_col2']).sum().to_pandas())"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" a b c\n",
"agg_col1 agg_col2 \n",
"0 0 73 60 73\n",
" 1 27 30 27\n",
"1 0 54 60 54\n",
" 1 36 40 36"
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ddf.groupby(['agg_col1', 'agg_col2']).sum().compute().to_pandas()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Grouping and applying statistical functions to specific columns, using `agg`."
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" a b c\n",
"agg_col1\n",
"0 19 9.0 100\n",
"1 18 10.0 90\n"
]
}
],
"source": [
"print(df.groupby('agg_col1').agg({'a':'max', 'b':'mean', 'c':'sum'}))"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" a b c\n",
"0 19 9.0 100\n",
"1 18 10.0 90\n"
]
}
],
"source": [
"print(ddf.groupby('agg_col1').agg({'a':'max', 'b':'mean', 'c':'sum'}).compute())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Transpose"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Transposing a dataframe, using either the `transpose` method or `T` property. Currently, all columns must have the same type. Transposing is not currently implemented in Dask-cuDF."
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" a b\n",
"0 1 4\n",
"1 2 5\n",
"2 3 6\n"
]
}
],
"source": [
"sample = cudf.DataFrame({'a':[1,2,3], 'b':[4,5,6]})\n",
"print(sample)"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 0 1 2\n",
"a 1 2 3\n",
"b 4 5 6\n"
]
}
],
"source": [
"print(sample.transpose())"
]
},
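{
"cell_type": "markdown",
"metadata": {},
"source": [
"*The `T` property mentioned above is shorthand for `transpose`; this cell should print the same transposed frame as the previous one:*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Equivalent to sample.transpose().\n",
"print(sample.T)"
]
},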
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Time Series\n",
"------------\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`DataFrame` objects support `datetime`-typed columns, which allow users to interact with and filter data based on specific timestamps."
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" date value\n",
"0 2018-11-20T00:00:00.000 0.5520376332645666\n",
"1 2018-11-21T00:00:00.000 0.4853774136627097\n",
"2 2018-11-22T00:00:00.000 0.7681341540644223\n",
"3 2018-11-23T00:00:00.000 0.1607167531255701\n"
]
}
],
"source": [
"import datetime as dt\n",
"\n",
"date_df = cudf.DataFrame()\n",
"date_df['date'] = pd.date_range('11/20/2018', periods=72, freq='D')\n",
"date_df['value'] = np.random.sample(len(date_df))\n",
"\n",
"search_date = dt.datetime.strptime('2018-11-23', '%Y-%m-%d')\n",
"print(date_df.query('date <= @search_date'))"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" date value\n",
"0 2018-11-20T00:00:00.000 0.5520376332645666\n",
"1 2018-11-21T00:00:00.000 0.4853774136627097\n",
"2 2018-11-22T00:00:00.000 0.7681341540644223\n",
"3 2018-11-23T00:00:00.000 0.1607167531255701\n"
]
}
],
"source": [
"date_ddf = dask_cudf.from_cudf(date_df, npartitions=2)\n",
"print(date_ddf.query('date <= @search_date', local_dict={'search_date':search_date}).compute())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Categoricals\n",
"------------"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`DataFrames` support categorical columns."
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" id grade\n",
"0 1 a\n",
"1 2 b\n",
"2 3 b\n",
"3 4 a\n",
"4 5 a\n",
"5 6 e\n"
]
}
],
"source": [
"pdf = pd.DataFrame({\"id\":[1,2,3,4,5,6], \"grade\":['a', 'b', 'b', 'a', 'a', 'e']})\n",
"pdf[\"grade\"] = pdf[\"grade\"].astype(\"category\")\n",
"\n",
"gdf = cudf.DataFrame.from_pandas(pdf)\n",
"print(gdf)"
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" id grade\n",
"0 1 a\n",
"1 2 b\n",
"2 3 b\n",
"3 4 a\n",
"4 5 a\n",
"5 6 e\n"
]
}
],
"source": [
"dgdf = dask_cudf.from_cudf(gdf, npartitions=2)\n",
"print(dgdf.compute())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Accessing the categories of a column. Note that this is currently not supported in Dask-cuDF."
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"('a', 'b', 'e')"
]
},
"execution_count": 59,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"gdf.grade.cat.categories"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Accessing the underlying code values of each categorical observation."
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 0\n",
"1 1\n",
"2 1\n",
"3 0\n",
"4 0\n",
"5 2\n",
"dtype: int8\n"
]
}
],
"source": [
"print(gdf.grade.cat.codes)"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 0\n",
"1 1\n",
"2 1\n",
"0 0\n",
"1 0\n",
"2 2\n",
"dtype: int8\n"
]
}
],
"source": [
"print(dgdf.grade.cat.codes.compute())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Converting Data Representation\n",
"--------------------------------"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Pandas"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Converting a cuDF or Dask-cuDF `DataFrame` to a pandas `DataFrame`."
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" a b c agg_col1 agg_col2\n",
"0 0 19 0 1 1\n",
"1 1 18 1 0 0\n",
"2 2 17 2 1 0\n",
"3 3 16 3 0 1\n",
"4 4 15 4 1 0\n"
]
}
],
"source": [
"print(df.head().to_pandas())"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" a b c agg_col1 agg_col2\n",
"0 0 19 0 1 1\n",
"1 1 18 1 0 0\n",
"2 2 17 2 1 0\n",
"3 3 16 3 0 1\n",
"4 4 15 4 1 0\n"
]
}
],
"source": [
"print(ddf.compute().head().to_pandas())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Numpy"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Converting a cuDF or Dask-cuDF `DataFrame` to a numpy `ndarray`."
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[ 0 19 0 1 1]\n",
" [ 1 18 1 0 0]\n",
" [ 2 17 2 1 0]\n",
" [ 3 16 3 0 1]\n",
" [ 4 15 4 1 0]\n",
" [ 5 14 5 0 0]\n",
" [ 6 13 6 1 1]\n",
" [ 7 12 7 0 0]\n",
" [ 8 11 8 1 0]\n",
" [ 9 10 9 0 1]\n",
" [10 9 10 1 0]\n",
" [11 8 11 0 0]\n",
" [12 7 12 1 1]\n",
" [13 6 13 0 0]\n",
" [14 5 14 1 0]\n",
" [15 4 15 0 1]\n",
" [16 3 16 1 0]\n",
" [17 2 17 0 0]\n",
" [18 1 18 1 1]\n",
" [19 0 19 0 0]]\n"
]
}
],
"source": [
"print(df.as_matrix())"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[ 0 19 0 1 1]\n",
" [ 1 18 1 0 0]\n",
" [ 2 17 2 1 0]\n",
" [ 3 16 3 0 1]\n",
" [ 4 15 4 1 0]\n",
" [ 5 14 5 0 0]\n",
" [ 6 13 6 1 1]\n",
" [ 7 12 7 0 0]\n",
" [ 8 11 8 1 0]\n",
" [ 9 10 9 0 1]\n",
" [10 9 10 1 0]\n",
" [11 8 11 0 0]\n",
" [12 7 12 1 1]\n",
" [13 6 13 0 0]\n",
" [14 5 14 1 0]\n",
" [15 4 15 0 1]\n",
" [16 3 16 1 0]\n",
" [17 2 17 0 0]\n",
" [18 1 18 1 1]\n",
" [19 0 19 0 0]]\n"
]
}
],
"source": [
"print(ddf.compute().as_matrix())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Converting a cuDF or Dask-cuDF `Series` to a numpy `ndarray`."
]
},
{
"cell_type": "code",
"execution_count": 79,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,\n",
" 17, 18, 19])"
]
},
"execution_count": 79,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['a'].to_array()"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19]\n"
]
}
],
"source": [
"print(ddf['a'].compute().to_array())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Arrow"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Converting a cuDF or Dask-cuDF `DataFrame` to a PyArrow `Table`."
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"pyarrow.Table\n",
"a: int64\n",
"b: int64\n",
"c: int64\n",
"agg_col1: int64\n",
"agg_col2: int64\n",
"__index_level_0__: int64\n",
"metadata\n",
"--------\n",
"OrderedDict([(b'pandas',\n",
" b'{\"index_columns\": [\"__index_level_0__\"], \"column_indexes\": ['\n",
" b'{\"name\": null, \"field_name\": null, \"pandas_type\": \"unicode\",'\n",
" b' \"numpy_type\": \"object\", \"metadata\": {\"encoding\": \"UTF-8\"}}]'\n",
" b', \"columns\": [{\"name\": \"a\", \"field_name\": \"a\", \"pandas_type\"'\n",
" b': \"int64\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\"'\n",
" b': \"b\", \"field_name\": \"b\", \"pandas_type\": \"int64\", \"numpy_typ'\n",
" b'e\": \"int64\", \"metadata\": null}, {\"name\": \"c\", \"field_name\": '\n",
" b'\"c\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", \"metadat'\n",
" b'a\": null}, {\"name\": \"agg_col1\", \"field_name\": \"agg_col1\", \"p'\n",
" b'andas_type\": \"int64\", \"numpy_type\": \"int64\", \"metadata\": nul'\n",
" b'l}, {\"name\": \"agg_col2\", \"field_name\": \"agg_col2\", \"pandas_t'\n",
" b'ype\": \"int64\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"n'\n",
" b'ame\": null, \"field_name\": \"__index_level_0__\", \"pandas_type\"'\n",
" b': \"int64\", \"numpy_type\": \"int64\", \"metadata\": null}], \"panda'\n",
" b's_version\": \"0.23.4\"}')])\n"
]
}
],
"source": [
"print(df.to_arrow())"
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {
"collapsed": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"pyarrow.Table\n",
"a: int64\n",
"b: int64\n",
"c: int64\n",
"agg_col1: int64\n",
"agg_col2: int64\n",
"__index_level_0__: int64\n",
"metadata\n",
"--------\n",
"OrderedDict([(b'pandas',\n",
" b'{\"index_columns\": [\"__index_level_0__\"], \"column_indexes\": ['\n",
" b'{\"name\": null, \"field_name\": null, \"pandas_type\": \"unicode\",'\n",
" b' \"numpy_type\": \"object\", \"metadata\": {\"encoding\": \"UTF-8\"}}]'\n",
" b', \"columns\": [{\"name\": \"a\", \"field_name\": \"a\", \"pandas_type\"'\n",
" b': \"int64\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\"'\n",
" b': \"b\", \"field_name\": \"b\", \"pandas_type\": \"int64\", \"numpy_typ'\n",
" b'e\": \"int64\", \"metadata\": null}, {\"name\": \"c\", \"field_name\": '\n",
" b'\"c\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", \"metadat'\n",
" b'a\": null}, {\"name\": \"agg_col1\", \"field_name\": \"agg_col1\", \"p'\n",
" b'andas_type\": \"int64\", \"numpy_type\": \"int64\", \"metadata\": nul'\n",
" b'l}, {\"name\": \"agg_col2\", \"field_name\": \"agg_col2\", \"pandas_t'\n",
" b'ype\": \"int64\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"n'\n",
" b'ame\": null, \"field_name\": \"__index_level_0__\", \"pandas_type\"'\n",
" b': \"int64\", \"numpy_type\": \"int64\", \"metadata\": null}], \"panda'\n",
" b's_version\": \"0.23.4\"}')])\n"
]
}
],
"source": [
"print(ddf.compute().to_arrow())"
]
},
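{
"cell_type": "markdown",
"metadata": {},
"source": [
"The conversion also runs in reverse: cuDF can build a `DataFrame` from a PyArrow `Table` via `cudf.DataFrame.from_arrow`. A minimal sketch, assuming `pyarrow` is importable:\n",
"\n",
"```python\n",
"import pyarrow as pa\n",
"\n",
"tbl = pa.Table.from_pandas(pd.DataFrame({'a': [1, 2, 3]}))\n",
"gdf = cudf.DataFrame.from_arrow(tbl)  # move the Arrow columns onto the GPU\n",
"```"
]
},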
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Getting Data In/Out\n",
"------------------------\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## CSV"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Writing to a CSV file by first sending the data to a pandas `DataFrame` on the host."
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {},
"outputs": [],
"source": [
"if not os.path.exists('example_output'):\n",
" os.mkdir('example_output')\n",
" \n",
"df.to_pandas().to_csv('example_output/foo.csv', index=False)"
]
},
{
"cell_type": "code",
"execution_count": 71,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"ddf.compute().to_pandas().to_csv('example_output/foo_dask.csv', index=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Reading from a CSV file."
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" a b c agg_col1 agg_col2\n",
"0 0 19 0 1 1\n",
"1 1 18 1 0 0\n",
"2 2 17 2 1 0\n",
"3 3 16 3 0 1\n",
"4 4 15 4 1 0\n",
"5 5 14 5 0 0\n",
"6 6 13 6 1 1\n",
"7 7 12 7 0 0\n",
"8 8 11 8 1 0\n",
"9 9 10 9 0 1\n",
"[10 more rows]\n"
]
}
],
"source": [
"df = cudf.read_csv('example_output/foo.csv')\n",
"print(df)"
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" a b c agg_col1 agg_col2\n",
"0 0 19 0 1 1\n",
"1 1 18 1 0 0\n",
"2 2 17 2 1 0\n",
"3 3 16 3 0 1\n",
"4 4 15 4 1 0\n",
"5 5 14 5 0 0\n",
"6 6 13 6 1 1\n",
"7 7 12 7 0 0\n",
"8 8 11 8 1 0\n",
"9 9 10 9 0 1\n",
"[10 more rows]\n"
]
}
],
"source": [
"ddf = dask_cudf.read_csv('example_output/foo_dask.csv')\n",
"print(ddf.compute())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Reading all CSV files in a directory into a single `dask_cudf.DataFrame`, using the star wildcard."
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" a b c agg_col1 agg_col2\n",
"0 0 19 0 1 1\n",
"1 1 18 1 0 0\n",
"2 2 17 2 1 0\n",
"3 3 16 3 0 1\n",
"4 4 15 4 1 0\n",
"5 5 14 5 0 0\n",
"6 6 13 6 1 1\n",
"7 7 12 7 0 0\n",
"8 8 11 8 1 0\n",
"9 9 10 9 0 1\n",
"[30 more rows]\n"
]
}
],
"source": [
"ddf = dask_cudf.read_csv('example_output/*.csv')\n",
"print(ddf.compute())"
]
},
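{
"cell_type": "markdown",
"metadata": {},
"source": [
"For comparison, the wildcard read above is roughly what this host-side pandas sketch does by hand, using `glob` to expand the pattern:\n",
"\n",
"```python\n",
"import glob\n",
"\n",
"frames = [pd.read_csv(path) for path in sorted(glob.glob('example_output/*.csv'))]\n",
"combined = pd.concat(frames, ignore_index=True)  # one DataFrame from all matching files\n",
"```"
]
},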
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Parquet"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Writing to Parquet files, using the CPU via PyArrow."
]
},
{
"cell_type": "code",
"execution_count": 75,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/conda/envs/rapids/lib/python3.6/site-packages/cudf/io/parquet.py:56: UserWarning: Using CPU via PyArrow to write Parquet dataset, this will be GPU accelerated in the future\n",
" warnings.warn(\"Using CPU via PyArrow to write Parquet dataset, this will \"\n"
]
}
],
"source": [
"df.to_parquet('example_output/temp_parquet')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Reading Parquet files with the GPU-accelerated Parquet reader."
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" a b c agg_col1 agg_col2\n",
"0 0 19 0 1 1\n",
"1 1 18 1 0 0\n",
"2 2 17 2 1 0\n",
"3 3 16 3 0 1\n",
"4 4 15 4 1 0\n",
"5 5 14 5 0 0\n",
"6 6 13 6 1 1\n",
"7 7 12 7 0 0\n",
"8 8 11 8 1 0\n",
"9 9 10 9 0 1\n",
"10 10 9 10 1 0\n",
"11 11 8 11 0 0\n",
"12 12 7 12 1 1\n",
"13 13 6 13 0 0\n",
"14 14 5 14 1 0\n",
"15 15 4 15 0 1\n",
"16 16 3 16 1 0\n",
"17 17 2 17 0 0\n",
"18 18 1 18 1 1\n",
"19 19 0 19 0 0\n"
]
}
],
"source": [
"df = cudf.read_parquet('example_output/temp_parquet/72706b163a0d4feb949005d22146ad83.parquet')\n",
"print(df.to_pandas())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Writing to Parquet files from a `dask_cudf.DataFrame`, using PyArrow under the hood."
]
},
{
"cell_type": "code",
"execution_count": 77,
"metadata": {},
"outputs": [],
"source": [
"ddf.to_parquet('example_files')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## ORC"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Reading ORC files."
]
},
{
"cell_type": "code",
"execution_count": 78,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" boolean1 byte1 short1 int1 long1 float1 double1 \\\n",
"0 False 1 1024 65536 9223372036854775807 1.0 -15.0 \n",
"1 True 100 2048 65536 9223372036854775807 2.0 -5.0 \n",
"\n",
" bytes1 string1 middle.list.int1 middle.list.string1 list.int1 \\\n",
"0 \u0000\u0001\u0002\u0003\u0004 hi 3 bye 4 \n",
"1 bye 0 bye 0 \n",
"\n",
" list.string1 map map.int1 map.string1 \n",
"0 chani 5 chani \n",
"1 mauddib 1 mauddib "
]
},
"execution_count": 78,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2 = cudf.read_orc('/cudf/python/cudf/tests/data/orc/TestOrcFile.test1.orc')\n",
"df2.to_pandas()"
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}