10 Minutes to cuDF and Dask-cuDF

Modeled after 10 Minutes to Pandas, this is a short introduction to cuDF and Dask-cuDF, geared mainly for new users.

What are these Libraries?

cuDF is a Python GPU DataFrame library (built on the Apache Arrow columnar memory format) for loading, joining, aggregating, filtering, and otherwise manipulating data.

Dask is a flexible library for parallel computing in Python that makes scaling out your workflow smooth and simple.

Dask-cuDF is a library that provides a partitioned, GPU-backed dataframe, using Dask.

When to use cuDF and Dask-cuDF

If your workflow is fast enough on a single GPU or your data comfortably fits in memory on a single GPU, you would want to use cuDF. If you want to distribute your workflow across multiple GPUs, have more data than you can fit in memory on a single GPU, or want to analyze data spread across many files at once, you would want to use Dask-cuDF.

[1]:
import os

import numpy as np
import pandas as pd
import cudf
import dask_cudf

np.random.seed(12)

#### Portions of this were borrowed and adapted from the
#### cuDF cheatsheet, existing cuDF documentation,
#### and 10 Minutes to Pandas.

Object Creation

Creating a cudf.Series and dask_cudf.Series.

[2]:
s = cudf.Series([1,2,3,None,4])
print(s)
0    1
1    2
2    3
3
4    4
dtype: int64
[3]:
ds = dask_cudf.from_cudf(s, npartitions=2)
print(ds.compute())
0    1
1    2
2    3
3
4    4
dtype: int64

Creating a cudf.DataFrame and a dask_cudf.DataFrame by specifying values for each column.

[4]:
df = cudf.DataFrame([('a', list(range(20))),
                     ('b', list(reversed(range(20)))),
                     ('c', list(range(20)))])
print(df)
   a   b  c
0  0  19  0
1  1  18  1
2  2  17  2
3  3  16  3
4  4  15  4
5  5  14  5
6  6  13  6
7  7  12  7
8  8  11  8
9  9  10  9
[10 more rows]
[5]:
ddf = dask_cudf.from_cudf(df, npartitions=2)
print(ddf.compute())
   a   b  c
0  0  19  0
1  1  18  1
2  2  17  2
3  3  16  3
4  4  15  4
5  5  14  5
6  6  13  6
7  7  12  7
8  8  11  8
9  9  10  9
[10 more rows]

Creating a cudf.DataFrame from a pandas DataFrame and a dask_cudf.DataFrame from a cudf.DataFrame.

Note that best practice for using Dask-cuDF is to read data directly into a ``dask_cudf.DataFrame`` with something like ``read_csv`` (discussed below).
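
For example, reading a set of CSV files straight into a distributed GPU dataframe is a one-liner (a sketch only; data/*.csv is a hypothetical placeholder path, and a runnable read_csv example appears in the Getting Data In/Out section below):

ddf = dask_cudf.read_csv('data/*.csv')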

[6]:
pdf = pd.DataFrame({'a': [0, 1, 2, 3],'b': [0.1, 0.2, None, 0.3]})
gdf = cudf.DataFrame.from_pandas(pdf)
print(gdf)
   a    b
0  0  0.1
1  1  0.2
2  2
3  3  0.3
[7]:
dask_gdf = dask_cudf.from_cudf(gdf, npartitions=2)
print(dask_gdf.compute())
   a    b
0  0  0.1
1  1  0.2
2  2
3  3  0.3

Viewing Data

Viewing the top rows of a GPU dataframe.

[8]:
print(df.head(2))
   a   b  c
0  0  19  0
1  1  18  1
[9]:
print(ddf.head(2))
   a   b  c
0  0  19  0
1  1  18  1

Sorting by values.

[10]:
print(df.sort_values(by='b'))
    a  b   c
19  19  0  19
18  18  1  18
17  17  2  17
16  16  3  16
15  15  4  15
14  14  5  14
13  13  6  13
12  12  7  12
11  11  8  11
10  10  9  10
[10 more rows]
[11]:
print(ddf.sort_values(by='b').compute())
    a  b   c
0  19  0  19
1  18  1  18
2  17  2  17
3  16  3  16
4  15  4  15
5  14  5  14
6  13  6  13
7  12  7  12
8  11  8  11
9  10  9  10
[10 more rows]

Selection

Getting

Selecting a single column, which initially yields a cudf.Series or dask_cudf.Series (equivalent to df.a). Calling compute on the dask_cudf.Series results in a cudf.Series.

[12]:
print(df['a'])
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
[10 more rows]
Name: a, dtype: int64
[13]:
print(ddf['a'].compute())
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
[10 more rows]
Name: a, dtype: int64

Selection by Label

Selecting rows from index 2 to index 5 from columns ‘a’ and ‘b’.

[14]:
print(df.loc[2:5, ['a', 'b']])
   a   b
2  2  17
3  3  16
4  4  15
5  5  14
[15]:
print(ddf.loc[2:5, ['a', 'b']].compute())
   a   b
2  2  17
3  3  16
4  4  15
5  5  14

Selection by Position

Selecting via integers and integer slices, like numpy/pandas. Note that this functionality is not available for Dask-cuDF DataFrames.

[16]:
print(df.iloc[0])
a     0
b    19
c     0
Name: 0, dtype: int64
[17]:
print(df.iloc[0:3, 0:2])
   a   b
0  0  19
1  1  18
2  2  17

You can also select elements of a DataFrame or Series with direct index access.

[18]:
print(df[3:5])
   a   b  c
3  3  16  3
4  4  15  4
[19]:
print(s[3:5])
3
4    4
dtype: int64

Boolean Indexing

Selecting rows in a DataFrame or Series by direct Boolean indexing.

[20]:
print(df[df.b > 15])
   a   b  c
0  0  19  0
1  1  18  1
2  2  17  2
3  3  16  3
[21]:
print(ddf[ddf.b > 15].compute())
   a   b  c
0  0  19  0
1  1  18  1
2  2  17  2
3  3  16  3

Selecting values from a DataFrame where a Boolean condition is met, via the query API.

[22]:
print(df.query("b == 3"))
    a  b   c
16  16  3  16
[23]:
print(ddf.query("b == 3").compute())
    a  b   c
16  16  3  16

You can also pass local variables to Dask-cuDF queries via the local_dict keyword. With standard cuDF, you may either use the local_dict keyword or reference the variable directly with @.

[24]:
cudf_comparator = 3
print(df.query("b == @cudf_comparator"))
    a  b   c
16  16  3  16
[25]:
dask_cudf_comparator = 3
print(ddf.query("b == @val", local_dict={'val':dask_cudf_comparator}).compute())
    a  b   c
16  16  3  16

Supported logical operators include >, <, >=, <=, ==, and !=.
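
For example, the local_dict form also works with standard cuDF, and the other comparison operators behave the same way (a sketch, not part of the numbered cells above):

cudf_comparator = 15
print(df.query("b >= @val", local_dict={'val': cudf_comparator}))
print(ddf.query("b != 3").compute())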

MultiIndex

cuDF supports hierarchical indexing of DataFrames using MultiIndex. Grouping hierarchically (see Grouping below) automatically produces a DataFrame with a MultiIndex.

[26]:
arrays = [['a', 'a', 'b', 'b'],
          [1, 2, 3, 4]]
tuples = list(zip(*arrays))
idx = cudf.MultiIndex.from_tuples(tuples)
idx
[26]:
MultiIndex(levels=[array(['a', 'b'], dtype=object) array([1, 2, 3, 4])],
codes=   0  1
0  0  0
1  0  1
2  1  2
3  1  3)

This index can back either axis of a DataFrame.

[27]:
gdf1 = cudf.DataFrame({'first': np.random.rand(4), 'second': np.random.rand(4)})
gdf1.index = idx
print(gdf1.to_pandas())
        first    second
a 1  0.154163  0.014575
  2  0.740050  0.918747
b 3  0.263315  0.900715
  4  0.533739  0.033421
[28]:
gdf2 = cudf.DataFrame({'first': np.random.rand(4), 'second': np.random.rand(4)}).T
gdf2.columns = idx
print(gdf2.to_pandas())
               a                   b
               1         2         3         4
first   0.956949  0.137209  0.283828  0.606083
second  0.944225  0.852736  0.002259  0.521226

Accessing values of a DataFrame with a MultiIndex. Note that slicing is not yet supported.

[29]:
print(gdf1.loc[('b', 3)].to_pandas())
        first    second
b 3  0.263315  0.900715

Missing Data

Missing data can be replaced by using the fillna method.

[30]:
print(s.fillna(999))
0      1
1      2
2      3
3    999
4      4
dtype: int64
[31]:
print(ds.fillna(999).compute())
0      1
1      2
2      3
3    999
4      4
dtype: int64

Operations

Stats

Calculating descriptive statistics for a Series.

[32]:
print(s.mean(), s.var())
2.5 1.666666666666666
[33]:
print(ds.mean().compute(), ds.var().compute())
2.5 1.6666666666666667

Applymap

Applying functions to a Series. Note that applying user-defined functions directly with Dask-cuDF is not yet implemented. For now, you can use map_partitions to apply a function to each partition of the distributed dataframe.

[34]:
def add_ten(num):
    return num + 10

print(df['a'].applymap(add_ten))
0    10
1    11
2    12
3    13
4    14
5    15
6    16
7    17
8    18
9    19
[10 more rows]
Name: a, dtype: int64
[35]:
print(ddf['a'].map_partitions(add_ten).compute())
0    10
1    11
2    12
3    13
4    14
5    15
6    16
7    17
8    18
9    19
[10 more rows]
dtype: int64

Histogramming

Counting the number of occurrences of each unique value in a column.

[36]:
print(df.a.value_counts())
0    1
1    1
2    1
3    1
4    1
5    1
6    1
7    1
8    1
9    1
[10 more rows]
dtype: int64
[37]:
print(ddf.a.value_counts().compute())
0    1
1    1
2    1
3    1
4    1
5    1
6    1
7    1
8    1
9    1
[10 more rows]
dtype: int64

String Methods

Like pandas, cuDF provides string processing methods in the str attribute of Series. Full documentation of string methods is a work in progress. Please see the cuDF API documentation for more information.

[38]:
s = cudf.Series(['A', 'B', 'C', 'Aaba', 'Baca', None, 'CABA', 'dog', 'cat'])
print(s.str.lower())
0       a
1       b
2       c
3    aaba
4    baca
5    None
6    caba
7     dog
8     cat
dtype: object
[39]:
ds = dask_cudf.from_cudf(s, npartitions=2)
print(ds.str.lower().compute())
0       a
1       b
2       c
3    aaba
4    baca
5    None
6    caba
7     dog
8     cat
dtype: object

Concat

Concatenating Series and DataFrames row-wise.

[40]:
s = cudf.Series([1, 2, 3, None, 5])
print(cudf.concat([s, s]))
0    1
1    2
2    3
3
4    5
0    1
1    2
2    3
3
4    5
dtype: int64
[41]:
ds2 = dask_cudf.from_cudf(s, npartitions=2)
print(dask_cudf.concat([ds2, ds2]).compute())
0    1
1    2
2    3
3
4    5
0    1
1    2
2    3
3
4    5
dtype: int64

Join

Performing SQL-style merges. Note that the dataframe order is not maintained, but may be restored post-merge by sorting by the index, as sketched after the examples below.

[42]:
df_a = cudf.DataFrame()
df_a['key'] = ['a', 'b', 'c', 'd', 'e']
df_a['vals_a'] = [float(i + 10) for i in range(5)]

df_b = cudf.DataFrame()
df_b['key'] = ['a', 'c', 'e']
df_b['vals_b'] = [float(i+100) for i in range(3)]

merged = df_a.merge(df_b, on=['key'], how='left')
print(merged)
   key  vals_a  vals_b
0    a    10.0   100.0
1    c    12.0   101.0
2    e    14.0   102.0
3    b    11.0
4    d    13.0
[43]:
ddf_a = dask_cudf.from_cudf(df_a, npartitions=2)
ddf_b = dask_cudf.from_cudf(df_b, npartitions=2)

merged = ddf_a.merge(ddf_b, on=['key'], how='left').compute()
print(merged)
   key  vals_a  vals_b
0    a    10.0   100.0
1    c    12.0   101.0
2    b    11.0
0    e    14.0   102.0
1    d    13.0
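
As the note above mentions, the merge result can be put back in a deterministic order by sorting on the index. A minimal sketch using the merged result computed in the previous cell, converting to pandas and using its sort_index purely for illustration (cuDF may offer an equivalent method):

# 'merged' is the cuDF DataFrame produced by .compute() in the previous cell;
# sorting on the index imposes a deterministic row order after the merge
print(merged.to_pandas().sort_index())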

Append

Appending values from another Series or array-like object.

[44]:
print(s.append(s))
0    1
1    2
2    3
3
4    5
0    1
1    2
2    3
3
4    5
dtype: int64
[45]:
print(ds2.append(ds2).compute())
0    1
1    2
2    3
3
4    5
0    1
1    2
2    3
3
4    5
dtype: int64

Grouping

Like pandas, cuDF and Dask-cuDF support the Split-Apply-Combine groupby paradigm.

[46]:
df['agg_col1'] = [1 if x % 2 == 0 else 0 for x in range(len(df))]
df['agg_col2'] = [1 if x % 3 == 0 else 0 for x in range(len(df))]

ddf = dask_cudf.from_cudf(df, npartitions=2)

Grouping and then applying the sum function to the grouped data.

[47]:
print(df.groupby('agg_col1').sum())
     a    b    c  agg_col2
agg_col1
0  100   90  100         3
1   90  100   90         4
[48]:
print(ddf.groupby('agg_col1').sum().compute())
     a    b    c  agg_col2
0  100   90  100         3
1   90  100   90         4

Grouping hierarchically, then applying the sum function to the grouped data. We convert the result to a pandas DataFrame for printing purposes only.

[49]:
print(df.groupby(['agg_col1', 'agg_col2']).sum().to_pandas())
                    a   b   c
agg_col1 agg_col2
0        0         73  60  73
         1         27  30  27
1        0         54  60  54
         1         36  40  36
[50]:
ddf.groupby(['agg_col1', 'agg_col2']).sum().compute().to_pandas()
[50]:
                    a   b   c
agg_col1 agg_col2
0        0         73  60  73
         1         27  30  27
1        0         54  60  54
         1         36  40  36

Grouping and applying statistical functions to specific columns, using agg.

[51]:
print(df.groupby('agg_col1').agg({'a':'max', 'b':'mean', 'c':'sum'}))
    a     b    c
agg_col1
0  19   9.0  100
1  18  10.0   90
[52]:
print(ddf.groupby('agg_col1').agg({'a':'max', 'b':'mean', 'c':'sum'}).compute())
    a     b    c
0  19   9.0  100
1  18  10.0   90

Transpose

Transposing a dataframe, using either the transpose method or T property. Currently, all columns must have the same type. Transposing is not currently implemented in Dask-cuDF.

[53]:
sample = cudf.DataFrame({'a':[1,2,3], 'b':[4,5,6]})
print(sample)
   a  b
0  1  4
1  2  5
2  3  6
[54]:
print(sample.transpose())
   0  1  2
a  1  2  3
b  4  5  6

Time Series

DataFrames support datetime-typed columns, which allow users to interact with and filter data based on specific timestamps.

[55]:
import datetime as dt

date_df = cudf.DataFrame()
date_df['date'] = pd.date_range('11/20/2018', periods=72, freq='D')
date_df['value'] = np.random.sample(len(date_df))

search_date = dt.datetime.strptime('2018-11-23', '%Y-%m-%d')
print(date_df.query('date <= @search_date'))
                     date               value
0 2018-11-20T00:00:00.000  0.5520376332645666
1 2018-11-21T00:00:00.000  0.4853774136627097
2 2018-11-22T00:00:00.000  0.7681341540644223
3 2018-11-23T00:00:00.000  0.1607167531255701
[56]:
date_ddf = dask_cudf.from_cudf(date_df, npartitions=2)
print(date_ddf.query('date <= @search_date', local_dict={'search_date':search_date}).compute())
                     date               value
0 2018-11-20T00:00:00.000  0.5520376332645666
1 2018-11-21T00:00:00.000  0.4853774136627097
2 2018-11-22T00:00:00.000  0.7681341540644223
3 2018-11-23T00:00:00.000  0.1607167531255701

Categoricals

DataFrames support categorical columns.

[57]:
pdf = pd.DataFrame({"id":[1,2,3,4,5,6], "grade":['a', 'b', 'b', 'a', 'a', 'e']})
pdf["grade"] = pdf["grade"].astype("category")

gdf = cudf.DataFrame.from_pandas(pdf)
print(gdf)
   id  grade
0   1      a
1   2      b
2   3      b
3   4      a
4   5      a
5   6      e
[58]:
dgdf = dask_cudf.from_cudf(gdf, npartitions=2)
print(dgdf.compute())
   id  grade
0   1      a
1   2      b
2   3      b
3   4      a
4   5      a
5   6      e

Accessing the categories of a column. Note that this is currently not supported in Dask-cuDF.

[59]:
gdf.grade.cat.categories
[59]:
('a', 'b', 'e')

Accessing the underlying code values of each categorical observation.

[60]:
print(gdf.grade.cat.codes)
0    0
1    1
2    1
3    0
4    0
5    2
dtype: int8
[61]:
print(dgdf.grade.cat.codes.compute())
0    0
1    1
2    1
0    0
1    0
2    2
dtype: int8

Converting Data Representation

Pandas

Converting a cuDF or Dask-cuDF DataFrame to a pandas DataFrame.

[62]:
print(df.head().to_pandas())
   a   b  c  agg_col1  agg_col2
0  0  19  0         1         1
1  1  18  1         0         0
2  2  17  2         1         0
3  3  16  3         0         1
4  4  15  4         1         0
[63]:
print(ddf.compute().head().to_pandas())
   a   b  c  agg_col1  agg_col2
0  0  19  0         1         1
1  1  18  1         0         0
2  2  17  2         1         0
3  3  16  3         0         1
4  4  15  4         1         0

Numpy

Converting a cuDF or Dask-cuDF DataFrame to a numpy ndarray.

[64]:
print(df.as_matrix())
[[ 0 19  0  1  1]
 [ 1 18  1  0  0]
 [ 2 17  2  1  0]
 [ 3 16  3  0  1]
 [ 4 15  4  1  0]
 [ 5 14  5  0  0]
 [ 6 13  6  1  1]
 [ 7 12  7  0  0]
 [ 8 11  8  1  0]
 [ 9 10  9  0  1]
 [10  9 10  1  0]
 [11  8 11  0  0]
 [12  7 12  1  1]
 [13  6 13  0  0]
 [14  5 14  1  0]
 [15  4 15  0  1]
 [16  3 16  1  0]
 [17  2 17  0  0]
 [18  1 18  1  1]
 [19  0 19  0  0]]
[65]:
print(ddf.compute().as_matrix())
[[ 0 19  0  1  1]
 [ 1 18  1  0  0]
 [ 2 17  2  1  0]
 [ 3 16  3  0  1]
 [ 4 15  4  1  0]
 [ 5 14  5  0  0]
 [ 6 13  6  1  1]
 [ 7 12  7  0  0]
 [ 8 11  8  1  0]
 [ 9 10  9  0  1]
 [10  9 10  1  0]
 [11  8 11  0  0]
 [12  7 12  1  1]
 [13  6 13  0  0]
 [14  5 14  1  0]
 [15  4 15  0  1]
 [16  3 16  1  0]
 [17  2 17  0  0]
 [18  1 18  1  1]
 [19  0 19  0  0]]

Converting a cuDF or Dask-cuDF Series to a numpy ndarray.

[66]:
print(df['a'].to_array())
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]
[67]:
print(ddf['a'].compute().to_array())
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]

Arrow

Converting a cuDF or Dask-cuDF DataFrame to a PyArrow Table.

[68]:
print(df.to_arrow())
pyarrow.Table
a: int64
b: int64
c: int64
agg_col1: int64
agg_col2: int64
__index_level_0__: int64
metadata
--------
OrderedDict([(b'pandas',
              b'{"index_columns": ["__index_level_0__"], "column_indexes": ['
              b'{"name": null, "field_name": null, "pandas_type": "unicode",'
              b' "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}]'
              b', "columns": [{"name": "a", "field_name": "a", "pandas_type"'
              b': "int64", "numpy_type": "int64", "metadata": null}, {"name"'
              b': "b", "field_name": "b", "pandas_type": "int64", "numpy_typ'
              b'e": "int64", "metadata": null}, {"name": "c", "field_name": '
              b'"c", "pandas_type": "int64", "numpy_type": "int64", "metadat'
              b'a": null}, {"name": "agg_col1", "field_name": "agg_col1", "p'
              b'andas_type": "int64", "numpy_type": "int64", "metadata": nul'
              b'l}, {"name": "agg_col2", "field_name": "agg_col2", "pandas_t'
              b'ype": "int64", "numpy_type": "int64", "metadata": null}, {"n'
              b'ame": null, "field_name": "__index_level_0__", "pandas_type"'
              b': "int64", "numpy_type": "int64", "metadata": null}], "panda'
              b's_version": "0.23.4"}')])
[69]:
print(ddf.compute().to_arrow())
pyarrow.Table
a: int64
b: int64
c: int64
agg_col1: int64
agg_col2: int64
__index_level_0__: int64
metadata
--------
OrderedDict([(b'pandas',
              b'{"index_columns": ["__index_level_0__"], "column_indexes": ['
              b'{"name": null, "field_name": null, "pandas_type": "unicode",'
              b' "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}]'
              b', "columns": [{"name": "a", "field_name": "a", "pandas_type"'
              b': "int64", "numpy_type": "int64", "metadata": null}, {"name"'
              b': "b", "field_name": "b", "pandas_type": "int64", "numpy_typ'
              b'e": "int64", "metadata": null}, {"name": "c", "field_name": '
              b'"c", "pandas_type": "int64", "numpy_type": "int64", "metadat'
              b'a": null}, {"name": "agg_col1", "field_name": "agg_col1", "p'
              b'andas_type": "int64", "numpy_type": "int64", "metadata": nul'
              b'l}, {"name": "agg_col2", "field_name": "agg_col2", "pandas_t'
              b'ype": "int64", "numpy_type": "int64", "metadata": null}, {"n'
              b'ame": null, "field_name": "__index_level_0__", "pandas_type"'
              b': "int64", "numpy_type": "int64", "metadata": null}], "panda'
              b's_version": "0.23.4"}')])

Getting Data In/Out

CSV

Writing to a CSV file by first sending data to a pandas DataFrame on the host.

[70]:
if not os.path.exists('example_output'):
    os.mkdir('example_output')

df.to_pandas().to_csv('example_output/foo.csv', index=False)
[71]:
ddf.compute().to_pandas().to_csv('example_output/foo_dask.csv', index=False)

Reading from a CSV file.

[72]:
df = cudf.read_csv('example_output/foo.csv')
print(df)
   a   b  c  agg_col1  agg_col2
0  0  19  0         1         1
1  1  18  1         0         0
2  2  17  2         1         0
3  3  16  3         0         1
4  4  15  4         1         0
5  5  14  5         0         0
6  6  13  6         1         1
7  7  12  7         0         0
8  8  11  8         1         0
9  9  10  9         0         1
[10 more rows]
[73]:
ddf = dask_cudf.read_csv('example_output/foo_dask.csv')
print(ddf.compute())
   a   b  c  agg_col1  agg_col2
0  0  19  0         1         1
1  1  18  1         0         0
2  2  17  2         1         0
3  3  16  3         0         1
4  4  15  4         1         0
5  5  14  5         0         0
6  6  13  6         1         1
7  7  12  7         0         0
8  8  11  8         1         0
9  9  10  9         0         1
[10 more rows]

Reading all CSV files in a directory into a single dask_cudf.DataFrame, using the star wildcard.

[74]:
ddf = dask_cudf.read_csv('example_output/*.csv')
print(ddf.compute())
   a   b  c  agg_col1  agg_col2
0  0  19  0         1         1
1  1  18  1         0         0
2  2  17  2         1         0
3  3  16  3         0         1
4  4  15  4         1         0
5  5  14  5         0         0
6  6  13  6         1         1
7  7  12  7         0         0
8  8  11  8         1         0
9  9  10  9         0         1
[30 more rows]

Parquet

Writing to parquet files, using the CPU via PyArrow.

[75]:
df.to_parquet('example_output/temp_parquet')
/conda/envs/rapids/lib/python3.6/site-packages/cudf/io/parquet.py:56: UserWarning: Using CPU via PyArrow to write Parquet dataset, this will be GPU accelerated in the future
  warnings.warn("Using CPU via PyArrow to write Parquet dataset, this will "

Reading parquet files with a GPU-accelerated parquet reader.

[76]:
df = cudf.read_parquet('example_output/temp_parquet/72706b163a0d4feb949005d22146ad83.parquet')
print(df.to_pandas())
     a   b   c  agg_col1  agg_col2
0    0  19   0         1         1
1    1  18   1         0         0
2    2  17   2         1         0
3    3  16   3         0         1
4    4  15   4         1         0
5    5  14   5         0         0
6    6  13   6         1         1
7    7  12   7         0         0
8    8  11   8         1         0
9    9  10   9         0         1
10  10   9  10         1         0
11  11   8  11         0         0
12  12   7  12         1         1
13  13   6  13         0         0
14  14   5  14         1         0
15  15   4  15         0         1
16  16   3  16         1         0
17  17   2  17         0         0
18  18   1  18         1         1
19  19   0  19         0         0

Writing to parquet files from a dask_cudf.DataFrame using PyArrow under the hood.

[77]:
ddf.to_parquet('example_files')

ORC

Reading ORC files.

[78]:
df2 = cudf.read_orc('/cudf/python/cudf/tests/data/orc/TestOrcFile.test1.orc')
df2.to_pandas()
[78]:
boolean1 byte1 short1 int1 long1 float1 double1 bytes1 string1 middle.list.int1 middle.list.string1 list.int1 list.string1 map map.int1 map.string1
0 False 1 1024 65536 9223372036854775807 1.0 -15.0  hi 3 bye 4 chani 5 chani
1 True 100 2048 65536 9223372036854775807 2.0 -5.0 bye 0 bye 0 mauddib 1 mauddib