10 Minutes to cuDF

Modeled after 10 Minutes to Pandas, this is a short introduction to cuDF, geared mainly for new users.

[1]:
import os
import numpy as np
import pandas as pd
import cudf
np.random.seed(12)

#### Portions of this were borrowed from the
#### cuDF cheatsheet, existing cuDF documentation,
#### and 10 Minutes to Pandas.
#### Created November, 2018.

Object Creation

Creating a Series.

[2]:
s = cudf.Series([1,2,3,None,4])
print(s)

0    1
1    2
2    3
3
4    4

Creating a DataFrame by specifying values for each column.

[3]:
df = cudf.DataFrame([('a', list(range(20))),
                     ('b', list(reversed(range(20)))),
                     ('c', list(range(20)))])
print(df)
      a    b    c
 0    0   19    0
 1    1   18    1
 2    2   17    2
 3    3   16    3
 4    4   15    4
 5    5   14    5
 6    6   13    6
 7    7   12    7
 8    8   11    8
 9    9   10    9
[10 more rows]

Creating a DataFrame from a pandas DataFrame.

[4]:
pdf = pd.DataFrame({'a': [0, 1, 2, 3],'b': [0.1, 0.2, None, 0.3]})
gdf = cudf.DataFrame.from_pandas(pdf)
print(gdf)
     a    b
0    0  0.1
1    1  0.2
2    2
3    3  0.3

Viewing Data

Viewing the top rows of the GPU dataframe.

[5]:
print(df.head(2))
     a    b    c
0    0   19    0
1    1   18    1

Sorting by values.

[6]:
print(df.sort_values(by='a', ascending=False))
      a    b    c
19   19    0   19
18   18    1   18
17   17    2   17
16   16    3   16
15   15    4   15
14   14    5   14
13   13    6   13
12   12    7   12
11   11    8   11
10   10    9   10
[10 more rows]

Selection

Getting

Selecting a single column, which yields a cudf.Series, equivalent to df.a.

[7]:
print(df['a'])

 0    0
 1    1
 2    2
 3    3
 4    4
 5    5
 6    6
 7    7
 8    8
 9    9
[10 more rows]

Selection by Label

Selecting rows from index 2 to index 5 from columns ‘a’ and ‘b’.

[8]:
print(df.loc[2:5, ['a', 'b']])
     a    b
2    2   17
3    3   16
4    4   15
5    5   14

Selection by Position

Selecting by integer slicing, like numpy/pandas.

[9]:
print(df[3:5])
     a    b    c
3    3   16    3
4    4   15    4

Selecting elements of a Series with direct index access.

[10]:
print(s[2])
3

Boolean Indexing

Selecting rows in a Series by direct Boolean indexing.

[11]:
print(df.b[df.b > 15])

0   19
1   18
2   17
3   16

Selecting values from a DataFrame where a Boolean condition is met, via the query API.

[12]:
print(df.query("b == 3"))
     a    b    c
16   16    3   16

Supported comparison operators include >, <, >=, <=, ==, and !=.
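
For example, selecting rows where b is at least 16 (an unexecuted cell, using the same query API shown above):

[ ]:
print(df.query("b >= 16"))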

Setting
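
Setting a new column works as it does elsewhere in this guide: assigning a list-like of matching length creates the column. A minimal, unexecuted sketch:

[ ]:
set_df = cudf.DataFrame()
set_df['a'] = list(range(5))
# Assigning a list of the same length creates a new column
set_df['doubled'] = [x * 2 for x in range(5)]
print(set_df)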

Missing Data

Missing data can be replaced by using the fillna method.

[13]:
print(s.fillna(999))

0    1
1    2
2    3
3  999
4    4

Operations

Stats

Calculating descriptive statistics for a Series.

[14]:
print(s.mean(), s.var())
2.5 1.666666666666666

Applymap

Applying functions to a Series.

[15]:
def add_ten(num):
    return num + 10

print(df['a'].applymap(add_ten))

 0   10
 1   11
 2   12
 3   13
 4   14
 5   15
 6   16
 7   17
 8   18
 9   19
[10 more rows]

Histogramming

Counting the number of occurrences of each unique value of a variable.

[16]:
print(df.a.value_counts())

 0    1
 1    1
 2    1
 3    1
 4    1
 5    1
 6    1
 7    1
 8    1
 9    1
[10 more rows]

String Methods
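
Depending on your cuDF version, string columns expose a pandas-like .str accessor. A minimal, unexecuted sketch (the Series below is hypothetical):

[ ]:
str_s = cudf.Series(['Apple', 'Banana', 'Cherry'])
# Lower-case each string on the GPU via the .str accessor
print(str_s.str.lower())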

Merge

Concat

Concatenating Series and DataFrames row-wise.

[ ]:
print(cudf.concat([s, s]))
print(cudf.concat([df.head(), df.head()], ignore_index=True))

Join

Performing SQL-style merges.

[17]:
df_a = cudf.DataFrame()
df_a['key'] = [0, 1, 2, 3, 4]
df_a['vals_a'] = [float(i + 10) for i in range(5)]

df_b = cudf.DataFrame()
df_b['key'] = [1, 2, 4]
df_b['vals_b'] = [float(i+10) for i in range(3)]

df_merged = df_a.merge(df_b, on=['key'], how='left')
print(df_merged.sort_values('key'))
   key vals_a vals_b
3    0   10.0
0    1   11.0   10.0
1    2   12.0   11.0
4    3   13.0
2    4   14.0   12.0

Append

Appending values from another Series or array-like object. Note that append does not support Series containing nulls; to handle null values, use the concat method instead.

[18]:
print(df.a.head().append(df.b.head()))

 0    0
 1    1
 2    2
 3    3
 4    4
 5   19
 6   18
 7   17
 8   16
 9   15

Grouping

Like pandas, cuDF supports the Split-Apply-Combine groupby paradigm.

[19]:
df['agg_col1'] = [1 if x % 2 == 0 else 0 for x in range(len(df))]
df['agg_col2'] = [1 if x % 3 == 0 else 0 for x in range(len(df))]

Grouping and then applying the sum function to the grouped data.

[20]:
print(df.groupby('agg_col1').sum())
  agg_col1 sum_a sum_b sum_c sum_agg_col2
0        0   100    90   100            3
1        1    90   100    90            4

Grouping hierarchically, then applying the sum function to the grouped data.

[21]:
print(df.groupby(['agg_col1', 'agg_col2']).sum())
  agg_col1 agg_col2 sum_a sum_b sum_c
0        0        0    73    60    73
1        0        1    27    30    27
2        1        0    54    60    54
3        1        1    36    40    36

Grouping and applying statistical functions to specific columns, using agg.

[22]:
print(df.groupby('agg_col1').agg({'a':'max', 'b':'mean', 'c':'sum'}))
  agg_col1 mean_b sum_c max_a
0        0      9   100    19
1        1     10    90    18

Reshaping
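
If your cuDF version provides cudf.melt, wide data can be unpivoted into long format, similar to pandas. A minimal, unexecuted sketch:

[ ]:
# Unpivot columns 'b' and 'c' against identifier column 'a' (assumes cudf.melt is available)
melted = cudf.melt(df.head(), id_vars=['a'], value_vars=['b', 'c'])
print(melted)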

Time Series

cuDF supports datetime typed columns, which allow users to interact with and filter data based on specific timestamps.

[23]:
import datetime as dt

date_df = cudf.DataFrame()
date_df['date'] = pd.date_range('11/20/2018', periods=72, freq='D')
date_df['value'] = np.random.sample(len(date_df))

search_date = dt.datetime.strptime('2018-11-23', '%Y-%m-%d')
print(date_df.query('date <= @search_date'))
                     date               value
0 2018-11-20T00:00:00.000 0.15416284237967237
1 2018-11-21T00:00:00.000  0.7400496965154048
2 2018-11-22T00:00:00.000 0.26331501518513467
3 2018-11-23T00:00:00.000  0.5337393933802977

Categoricals

cuDF supports categorical columns.

[24]:
pdf = pd.DataFrame({"id":[1,2,3,4,5,6], "grade":['a', 'b', 'b', 'a', 'a', 'e']})
pdf["grade"] = pdf["grade"].astype("category")

gdf = cudf.DataFrame.from_pandas(pdf)
print(gdf)
  grade   id
0     a    1
1     b    2
2     b    3
3     a    4
4     a    5
5     e    6

Accessing the categories of a column.

[25]:
print(gdf.grade.cat.categories)
('a', 'b', 'e')

Accessing the underlying code values of each categorical observation.

[26]:
print(gdf.grade.cat.codes)

0    0
1    1
2    1
3    0
4    0
5    2

Plotting
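
cuDF itself does not provide plotting; a common pattern is to move the data to the host with to_pandas and then use matplotlib (assumed to be installed). A minimal, unexecuted sketch:

[ ]:
import matplotlib.pyplot as plt

# Copy to host memory, then plot with the usual pandas/matplotlib tools
date_df.to_pandas().set_index('date')['value'].plot()
plt.show()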

Converting Data Representation

Pandas

Converting a cuDF DataFrame to a pandas DataFrame.

[27]:
print(df.head().to_pandas())
   a   b  c  agg_col1  agg_col2
0  0  19  0         1         1
1  1  18  1         0         0
2  2  17  2         1         0
3  3  16  3         0         1
4  4  15  4         1         0

Numpy

Converting a cuDF DataFrame to a numpy rec.array.

[28]:
print(df.to_records())
[( 0,  0, 19,  0, 1, 1) ( 1,  1, 18,  1, 0, 0) ( 2,  2, 17,  2, 1, 0)
 ( 3,  3, 16,  3, 0, 1) ( 4,  4, 15,  4, 1, 0) ( 5,  5, 14,  5, 0, 0)
 ( 6,  6, 13,  6, 1, 1) ( 7,  7, 12,  7, 0, 0) ( 8,  8, 11,  8, 1, 0)
 ( 9,  9, 10,  9, 0, 1) (10, 10,  9, 10, 1, 0) (11, 11,  8, 11, 0, 0)
 (12, 12,  7, 12, 1, 1) (13, 13,  6, 13, 0, 0) (14, 14,  5, 14, 1, 0)
 (15, 15,  4, 15, 0, 1) (16, 16,  3, 16, 1, 0) (17, 17,  2, 17, 0, 0)
 (18, 18,  1, 18, 1, 1) (19, 19,  0, 19, 0, 0)]

Converting a cuDF Series to a numpy ndarray.

[29]:
print(df['a'].to_array())
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]

Arrow

Converting a cuDF DataFrame to a PyArrow Table.

[30]:
print(df.to_arrow())
pyarrow.Table
None: int64
a: int64
b: int64
c: int64
agg_col1: int64
agg_col2: int64

Getting Data In/Out

CSV

Writing to a CSV file by first sending data to a pandas DataFrame on the host.

[31]:
df.to_pandas().to_csv('foo.txt', index=False)

Reading from a CSV file.

[32]:
df = cudf.read_csv('foo.txt', delimiter=',',
        names=['a', 'b', 'c', 'a1', 'a2'],
        dtype=['int64', 'int64', 'int64', 'int64', 'int64'],
        skiprows=1)
print(df)
      a    b    c   a1   a2
 0    0   19    0    1    1
 1    1   18    1    0    0
 2    2   17    2    1    0
 3    3   16    3    0    1
 4    4   15    4    1    0
 5    5   14    5    0    0
 6    6   13    6    1    1
 7    7   12    7    0    0
 8    8   11    8    1    0
 9    9   10    9    0    1
[10 more rows]

Parquet
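
Depending on your cuDF version, Parquet files can be read directly with cudf.read_parquet; writing can always go through pandas on the host. A minimal, unexecuted sketch:

[ ]:
# Write via pandas/PyArrow on the host, then read the file back into a GPU DataFrame
df.to_pandas().to_parquet('foo.parquet')
df_parquet = cudf.read_parquet('foo.parquet')
print(df_parquet.head())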

ORC
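
Similarly, ORC files can be read with cudf.read_orc if it is available in your cuDF version. A minimal, unexecuted sketch ('example.orc' is a hypothetical file):

[ ]:
# Read an ORC file from disk into a GPU DataFrame (assumes cudf.read_orc is available)
df_orc = cudf.read_orc('example.orc')
print(df_orc.head())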

Gotchas

If you are attempting to perform Boolean indexing directly or using the query API, you might see an exception like:

 ---------------------------------------------------------------------------
 AssertionError                            Traceback (most recent call last)
...
     103     from .numerical import NumericalColumn
 --> 104     assert column.null_count == 0  # We don't properly handle the boolmask yet
     105     boolbits = cudautils.compact_mask_bytes(boolmask.to_gpu_array())
     106     indices = cudautils.arange(len(boolmask))

 AssertionError:

Boolean indexing a Series containing null values will cause this error. Consider filling or removing the missing values.
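
For example, filling the nulls first (as with fillna above) makes the Boolean mask safe to apply:

[ ]:
filled = s.fillna(999)
# With no nulls remaining, direct Boolean indexing works
print(filled[filled > 2])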