API Reference

DataFrame

class cudf.dataframe.DataFrame(data=None, index=None, columns=None)

A GPU DataFrame object.

Examples

Build dataframe with __setitem__:

>>> import cudf
>>> df = cudf.DataFrame()
>>> df['key'] = [0, 1, 2, 3, 4]
>>> df['val'] = [float(i + 10) for i in range(5)]  # insert column
>>> print(df)
   key   val
0    0  10.0
1    1  11.0
2    2  12.0
3    3  13.0
4    4  14.0

Build dataframe with initializer:

>>> import cudf
>>> import numpy as np
>>> from datetime import datetime, timedelta
>>> ids = np.arange(5)

Create some datetime data

>>> t0 = datetime.strptime('2018-10-07 12:00:00', '%Y-%m-%d %H:%M:%S')
>>> datetimes = [(t0 + timedelta(seconds=x)) for x in range(5)]
>>> dts = np.array(datetimes, dtype='datetime64')

Create the GPU DataFrame

>>> df = cudf.DataFrame([('id', ids), ('datetimes', dts)])
>>> df
    id                datetimes
0    0  2018-10-07T12:00:00.000
1    1  2018-10-07T12:00:01.000
2    2  2018-10-07T12:00:02.000
3    3  2018-10-07T12:00:03.000
4    4  2018-10-07T12:00:04.000

Convert from a Pandas DataFrame:

>>> import pandas as pd
>>> import cudf
>>> pdf = pd.DataFrame({'a': [0, 1, 2, 3],'b': [0.1, 0.2, None, 0.3]})
>>> df = cudf.from_pandas(pdf)
>>> df
   a    b
0  0  0.1
1  1  0.2
2  2  nan
3  3  0.3

Attributes
T
columns

Returns a tuple of columns

dtypes

Return the dtypes in this object.

empty
iloc

Selecting rows and columns by position.

index

Returns the index of the DataFrame

loc

Selecting rows and columns by label or boolean mask.

ndim

Dimension of the data.

shape

Returns a tuple representing the dimensionality of the DataFrame.

Methods

add_column(self, name, data[, forceindex])

Add a column

apply_chunks(self, func, incols, outcols[, …])

Transform user-specified chunks using the user-provided function.

apply_rows(self, func, incols, outcols, kwargs)

Apply a row-wise user defined function.

as_gpu_matrix(self[, columns, order])

Convert to a matrix in device memory.

as_matrix(self[, columns])

Convert to a matrix in host memory.

assign(self, **kwargs)

Assign columns to DataFrame from keyword arguments.

at(self)

Alias for DataFrame.loc; provided for compatibility with Pandas.

copy(self[, deep])

Returns a copy of this dataframe

count(self, **kwargs)

describe(self[, percentiles, include, exclude])

Compute summary statistics of a DataFrame’s columns.

drop(self, labels[, axis, errors])

Drop column(s)

drop_column(self, name)

Drop a column by name

drop_duplicates(self[, subset, keep, inplace])

Return DataFrame with duplicate rows removed, optionally only considering a certain subset of columns.

dropna(self[, axis, how, subset, thresh])

Drops rows (or columns) containing nulls.

fillna(self, value[, method, axis, inplace, …])

Fill null values with value.

from_arrow(table)

Convert from a PyArrow Table.

from_gpu_matrix(data[, index, columns, …])

Convert from a numba gpu ndarray.

from_pandas(dataframe[, nan_as_null])

Convert from a Pandas DataFrame.

from_records(data[, index, columns, nan_as_null])

Convert from a numpy recarray or structured array.

groupby(self[, by, sort, as_index, method, …])

Groupby

hash_columns(self[, columns])

Hash the given columns and return a new Series

head(self[, n])

Returns the first n rows as a new DataFrame

iat(self)

Alias for DataFrame.iloc; provided for compatibility with Pandas.

isna(self, **kwargs)

Identify missing values in a DataFrame.

isnull(self, **kwargs)

Identify missing values in a DataFrame.

iteritems(self)

Iterate over column names and series pairs

join(self, other[, on, how, lsuffix, …])

Join columns with other DataFrame on index or on a key column.

label_encoding(self, column, prefix, cats[, …])

Encode labels in a column with label encoding.

mean(self[, numeric_only])

Return the mean of the values for the requested axis.

melt(self, **kwargs)

Unpivots a DataFrame from wide format to long format, optionally leaving identifier variables set.

merge(self, right[, on, how, left_on, …])

Merge GPU DataFrame objects by performing a database-style join operation by columns or indexes.

nans_to_nulls(self)

Convert nans (if any) to nulls.

nlargest(self, n, columns[, keep])

Get the rows of the DataFrame sorted by the n largest values of columns

notna(self, **kwargs)

Identify non-missing values in a DataFrame.

nsmallest(self, n, columns[, keep])

Get the rows of the DataFrame sorted by the n smallest values of columns

one_hot_encoding(self, column, prefix, cats)

Expand a column with one-hot-encoding.

partition_by_hash(self, columns, nparts)

Partition the dataframe by the hashed value of data in columns.

pop(self, item)

Return a column and drop it from the DataFrame.

quantile(self[, q, axis, numeric_only, …])

Return values at the given quantile.

query(self, expr[, local_dict])

Query with a boolean expression using Numba to compile a GPU kernel.

reindex(self[, labels, axis, index, …])

Return a new DataFrame whose axes conform to a new index

rename(self[, mapper, columns, copy, inplace])

Alter column labels.

replace(self, to_replace, replacement)

Replace values given in to_replace with replacement.

rolling(self, window[, min_periods, center, …])

Rolling window calculations.

select_dtypes(self[, include, exclude])

Return a subset of the DataFrame’s columns based on the column dtypes.

set_index(self, index[, drop])

Return a new DataFrame with a new index

sort_index(self[, ascending])

Sort by the index

sort_values(self, by[, ascending, na_position])

Sort by the values row-wise.

tail(self[, n])

Returns the last n rows as a new DataFrame

to_arrow(self[, preserve_index])

Convert to a PyArrow Table.

to_csv(self[, path, sep, na_rep, columns, …])

Write a dataframe to csv file format.

to_dlpack(self)

Converts a cuDF object into a DLPack tensor.

to_feather(self, path, *args, **kwargs)

Write a DataFrame to the feather format.

to_gpu_matrix(self)

Convert to a numba gpu ndarray

to_hdf(self, path_or_buf, key, *args, …)

Write the contained data to an HDF5 file using HDFStore.

to_json(self[, path_or_buf])

Convert the cuDF object to a JSON string.

to_pandas(self)

Convert to a Pandas DataFrame.

to_parquet(self, path, *args, **kwargs)

Write a DataFrame to the parquet format.

to_records(self[, index])

Convert to a numpy recarray

to_string(self)

Convert to string

transpose(self)

Transpose index and columns.

acos

add

all

any

argsort

asin

atan

cos

cummax

cummin

cumprod

cumsum

deserialize

equals

exp

floordiv

get_renderable_dataframe

log

mask

max

min

mod

mul

pow

product

radd

reset_index

rfloordiv

rmod

rmul

rpow

rsub

rtruediv

serialize

sin

sqrt

std

sub

sum

take

tan

truediv

var

add_column(self, name, data, forceindex=False)

Add a column

Parameters
name : str

Name of column to be added.

data : Series or array-like

Values to be added.

apply_chunks(self, func, incols, outcols, kwargs={}, chunks=None, tpb=1)

Transform user-specified chunks using the user-provided function.

Parameters
func : function

The transformation function that will be executed on the CUDA GPU.

incols: list

A list of names of input columns.

outcols: dict

A dictionary of output column names and their dtype.

kwargs: dict

name-value of extra arguments. These values are passed directly into the function.

chunks : int or Series-like

If an int, it is the chunk size. If an array, it contains the integer offsets marking the start of each chunk. The span of the i-th chunk is data[chunks[i] : chunks[i + 1]] when i + 1 < chunks.size, and data[chunks[i]:] when i == chunks.size - 1.

tpb : int, optional

The number of threads per block for the underlying kernel. The default of 1 emulates serial execution for each chunk; it is a good starting point but inefficient. The maximum possible value is limited by the available CUDA GPU resources.

Examples

For tpb > 1, func is executed by tpb number of threads concurrently. To access the thread id and count, use numba.cuda.threadIdx.x and numba.cuda.blockDim.x, respectively (see the numba CUDA kernel documentation).

In the example below, the kernel is invoked concurrently on each specified chunk. The kernel computes the corresponding output for the chunk.

By looping over range(cuda.threadIdx.x, in1.size, cuda.blockDim.x), the kernel function can be used with any tpb in an efficient manner.

>>> from numba import cuda
>>> @cuda.jit
... def kernel(in1, in2, in3, out1):
...      for i in range(cuda.threadIdx.x, in1.size, cuda.blockDim.x):
...          x = in1[i]
...          y = in2[i]
...          z = in3[i]
...          out1[i] = x * y + z
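
As a hedged sketch of how the kernel above might be invoked (the column names, chunk size, and tpb value here are illustrative assumptions, not part of the original example):

>>> import cudf
>>> import numpy as np
>>> df = cudf.DataFrame()
>>> df['in1'] = np.arange(100, dtype=np.float64)
>>> df['in2'] = np.arange(100, dtype=np.float64)
>>> df['in3'] = np.arange(100, dtype=np.float64)
>>> df = df.apply_chunks(kernel,
...                      incols=['in1', 'in2', 'in3'],
...                      outcols=dict(out1=np.float64),
...                      chunks=16,  # each chunk spans 16 rows
...                      tpb=8)      # 8 threads cooperate on each chunk
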
apply_rows(self, func, incols, outcols, kwargs, cache_key=None)

Apply a row-wise user defined function.

Parameters
func : function

The transformation function that will be executed on the CUDA GPU.

incols: list

A list of names of input columns.

outcols: dict

A dictionary of output column names and their dtype.

kwargs: dict

name-value of extra arguments. These values are passed directly into the function.

Examples

The user function should loop over the columns and set the output for each row. Loop execution order is arbitrary, so each iteration of the loop MUST be independent of each other.

When func is invoked, the array args corresponding to the input/output are strided so as to improve GPU parallelism. The loop in the function resembles serial code, but executes concurrently in multiple threads.

>>> import cudf
>>> import numpy as np
>>> df = cudf.DataFrame()
>>> nelem = 3
>>> df['in1'] = np.arange(nelem)
>>> df['in2'] = np.arange(nelem)
>>> df['in3'] = np.arange(nelem)

Define input columns for the kernel

>>> in1 = df['in1']
>>> in2 = df['in2']
>>> in3 = df['in3']
>>> def kernel(in1, in2, in3, out1, out2, kwarg1, kwarg2):
...     for i, (x, y, z) in enumerate(zip(in1, in2, in3)):
...         out1[i] = kwarg2 * x - kwarg1 * y
...         out2[i] = y - kwarg1 * z

Call .apply_rows with the name of the input columns, the name and dtype of the output columns, and, optionally, a dict of extra arguments.

>>> df.apply_rows(kernel,
...               incols=['in1', 'in2', 'in3'],
...               outcols=dict(out1=np.float64, out2=np.float64),
...               kwargs=dict(kwarg1=3, kwarg2=4))
   in1  in2  in3 out1 out2
0    0    0    0  0.0  0.0
1    1    1    1  1.0 -2.0
2    2    2    2  2.0 -4.0
as_gpu_matrix(self, columns=None, order='F')

Convert to a matrix in device memory.

Parameters
columns : sequence of str

List of column names to be extracted. The order is preserved. If None is specified, all columns are used.

order : ‘F’ or ‘C’

Optional argument to determine whether to return a column major (Fortran) matrix or a row major (C) matrix.

Returns
A (nrow x ncol) numba device ndarray in the requested order.

as_matrix(self, columns=None)

Convert to a matrix in host memory.

Parameters
columns : sequence of str

List of column names to be extracted. The order is preserved. If None is specified, all columns are used.

Returns
A (nrow x ncol) numpy ndarray in “F” order.
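
A minimal usage sketch of both conversions (the column names and values are assumed for illustration):

>>> import cudf
>>> df = cudf.DataFrame()
>>> df['a'] = [1.0, 2.0, 3.0]
>>> df['b'] = [4.0, 5.0, 6.0]
>>> host_mat = df.as_matrix()              # numpy ndarray in host memory
>>> dev_mat = df.as_gpu_matrix(order='F')  # numba device ndarray in device memory
>>> host_mat.shape
(3, 2)
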
assign(self, **kwargs)

Assign columns to DataFrame from keyword arguments.

Examples

>>> import cudf
>>> df = cudf.DataFrame()
>>> df = df.assign(a=[0, 1, 2], b=[3, 4, 5])
>>> print(df)
   a  b
0  0  3
1  1  4
2  2  5
at(self)

Alias for DataFrame.loc; provided for compatibility with Pandas.

property columns

Returns a tuple of columns

copy(self, deep=True)

Returns a copy of this dataframe

Parameters
deep: bool

Make a full copy of Series columns and Index at the GPU level, or create a new allocation with references.

describe(self, percentiles=None, include=None, exclude=None)

Compute summary statistics of a DataFrame’s columns. For numeric data, the output includes the minimum, maximum, mean, median, standard deviation, and various quantiles. For object data, the output includes the count, number of unique values, the most common value, and the number of occurrences of the most common value.

Parameters
percentiles : list-like, optional

The percentiles used to generate the output summary statistics. If None, the default percentiles used are the 25th, 50th and 75th. Values should be within the interval [0, 1].

include: str, list-like, optional

The dtypes to be included in the output summary statistics. Columns of dtypes not included in this list will not be part of the output. If include=’all’, all dtypes are included. Default of None includes all numeric columns.

exclude: str, list-like, optional

The dtypes to be excluded from the output summary statistics. Columns of dtypes included in this list will not be part of the output. Default of None excludes no columns.

Returns
output_frame : DataFrame

Summary statistics of relevant columns in the original dataframe.

Examples

Describing a Series containing numeric values.

>>> import cudf
>>> s = cudf.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
>>> print(s.describe())
   stats   values
0  count     10.0
1   mean      5.5
2    std  3.02765
3    min      1.0
4    25%      2.5
5    50%      5.5
6    75%      7.5
7    max     10.0

Describing a DataFrame. By default all numeric fields are returned.

>>> gdf = cudf.DataFrame()
>>> gdf['a'] = [1, 2, 3]
>>> gdf['b'] = [1.0, 2.0, 3.0]
>>> gdf['c'] = ['x', 'y', 'z']
>>> gdf['d'] = [1.0, 2.0, 3.0]
>>> gdf['d'] = gdf['d'].astype('float32')
>>> print(gdf.describe())
   stats    a    b    d
0  count  3.0  3.0  3.0
1   mean  2.0  2.0  2.0
2    std  1.0  1.0  1.0
3    min  1.0  1.0  1.0
4    25%  1.5  1.5  1.5
5    50%  1.5  1.5  1.5
6    75%  2.5  2.5  2.5
7    max  3.0  3.0  3.0

Using the include keyword to describe only specific dtypes.

>>> gdf = cudf.DataFrame()
>>> gdf['a'] = [1, 2, 3]
>>> gdf['b'] = [1.0, 2.0, 3.0]
>>> gdf['c'] = ['x', 'y', 'z']
>>> print(gdf.describe(include='int'))
   stats    a
0  count  3.0
1   mean  2.0
2    std  1.0
3    min  1.0
4    25%  1.5
5    50%  1.5
6    75%  2.5
7    max  3.0

drop(self, labels, axis=None, errors='raise')

Drop column(s)

Parameters
labels : str or sequence of str

Name of column(s) to be dropped.

axis : {0 or ‘index’, 1 or ‘columns’}, default 0

Only axis=1 is currently supported.

errors : {‘ignore’, ‘raise’}, default ‘raise’

This parameter is currently ignored.

Returns
A dataframe without dropped column(s)

Examples

>>> import cudf
>>> df = cudf.DataFrame()
>>> df['key'] = [0, 1, 2, 3, 4]
>>> df['val'] = [float(i + 10) for i in range(5)]
>>> df_new = df.drop('val')
>>> print(df)
   key   val
0    0  10.0
1    1  11.0
2    2  12.0
3    3  13.0
4    4  14.0
>>> print(df_new)
   key
0    0
1    1
2    2
3    3
4    4
drop_column(self, name)

Drop a column by name

drop_duplicates(self, subset=None, keep='first', inplace=False)

Return DataFrame with duplicate rows removed, optionally only considering a certain subset of columns.
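
A minimal usage sketch (data assumed for illustration):

>>> import cudf
>>> df = cudf.DataFrame({'a': [1, 1, 2], 'b': [3, 3, 4]})
>>> deduped = df.drop_duplicates()                        # keeps the first of each duplicate row
>>> by_a = df.drop_duplicates(subset=['a'], keep='last')  # duplicates judged on column 'a' only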

dropna(self, axis=0, how='any', subset=None, thresh=None)

Drops rows (or columns) containing nulls.

Parameters
axis : {0, 1}, optional

Whether to drop rows (axis=0, default) or columns (axis=1) containing nulls.

how : {“any”, “all”}, optional

Specifies how to decide whether to drop a row (or column). any (default) drops rows (or columns) containing at least one null value. all drops only rows (or columns) containing all null values.

subset : list, optional

List of columns to consider when dropping rows (all columns are considered by default). Alternatively, when dropping columns, subset is a list of rows to consider.

thresh: int, optional

If specified, drops every row (or column) containing fewer than thresh non-null values.

Returns
Copy of the DataFrame with rows/columns containing nulls dropped.
property dtypes

Return the dtypes in this object.

fillna(self, value, method=None, axis=None, inplace=False, limit=None)

Fill null values with value.

Parameters
value : scalar, Series-like or dict

Value to use to fill nulls. If Series-like, null values are filled with values in corresponding indices. A dict can be used to provide different values to fill nulls in different columns.

Returns
result : DataFrame

Copy with nulls filled.

Examples

>>> import cudf
>>> gdf = cudf.DataFrame({'a': [1, 2, None], 'b': [3, None, 5]})
>>> gdf.fillna(4).to_pandas()
   a  b
0  1  3
1  2  4
2  4  5
>>> gdf.fillna({'a': 3, 'b': 4}).to_pandas()
   a  b
0  1  3
1  2  4
2  3  5
classmethod from_arrow(table)

Convert from a PyArrow Table.

Raises
TypeError for invalid input type.
Notes

Does not support automatically setting index column(s), similar to how to_pandas works for PyArrow Tables.

Examples

>>> import pyarrow as pa
>>> import cudf
>>> data = [pa.array([1, 2, 3]), pa.array([4, 5, 6])]
>>> batch = pa.RecordBatch.from_arrays(data, ['f0', 'f1'])
>>> table = pa.Table.from_batches([batch])
>>> cudf.DataFrame.from_arrow(table)
<cudf.DataFrame ncols=2 nrows=3 >
classmethod from_gpu_matrix(data, index=None, columns=None, nan_as_null=False)

Convert from a numba gpu ndarray.

Parameters
data : numba gpu ndarray
index : str

The name of the index column in data. If None, the default index is used.

columns : list of str

List of column names to include.

Returns
DataFrame
classmethod from_pandas(dataframe, nan_as_null=True)

Convert from a Pandas DataFrame.

Raises
TypeError for invalid input type.

Examples

>>> import cudf
>>> import pandas as pd
>>> data = [[0,1], [1,2], [3,4]]
>>> pdf = pd.DataFrame(data, columns=['a', 'b'], dtype=int)
>>> cudf.from_pandas(pdf)
<cudf.DataFrame ncols=2 nrows=3 >
classmethod from_records(data, index=None, columns=None, nan_as_null=False)

Convert from a numpy recarray or structured array.

Parameters
data : numpy structured dtype or recarray of ndim=2
index : str

The name of the index column in data. If None, the default index is used.

columns : list of str

List of column names to include.

Returns
DataFrame
groupby(self, by=None, sort=True, as_index=True, method='hash', level=None, group_keys=True)

Groupby

Parameters
by : list-of-str or str

Column name(s) to group by.

sort : bool, default True

Force sorting group keys.

as_index : bool, default True

Indicates whether the grouped-by columns become the index of the returned DataFrame.

method : str, optional

A string indicating the method to use to perform the group by. Valid values are “hash” or “cudf”. “cudf” method may be deprecated in the future, but is currently the only method supporting group UDFs via the apply function.

Returns
The groupby object

Notes

No empty rows are returned. (For categorical keys, pandas returns rows for all categories even if there are no corresponding values.)
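
A minimal usage sketch (column names and the mean aggregation are assumed for illustration):

>>> import cudf
>>> df = cudf.DataFrame({'key': [0, 0, 1, 1], 'val': [1.0, 2.0, 3.0, 4.0]})
>>> gb = df.groupby('key')  # grouped-by column becomes the index (as_index=True)
>>> means = gb.mean()       # one row per group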

hash_columns(self, columns=None)

Hash the given columns and return a new Series

Parameters
columns : sequence of str, optional

Sequence of column names. If columns is None (unspecified), all columns in the frame are used.

head(self, n=5)

Returns the first n rows as a new DataFrame

Examples

>>> import cudf
>>> df = cudf.DataFrame()
>>> df['key'] = [0, 1, 2, 3, 4]
>>> df['val'] = [float(i + 10) for i in range(5)]  # insert column
>>> print(df.head(2))
   key   val
0    0  10.0
1    1  11.0
iat(self)

Alias for DataFrame.iloc; provided for compatibility with Pandas.

property iloc

Selecting rows and columns by position.

See also

DataFrame.loc

Examples

>>> df = DataFrame([('a', list(range(20))),
...                 ('b', list(range(20))),
...                 ('c', list(range(20)))])

Select a single row using an integer index.

>>> print(df.iloc[1])
a    1
b    1
c    1

Select multiple rows using a list of integers.

>>> print(df.iloc[[0, 2, 9, 18]])
      a    b    c
 0    0    0    0
 2    2    2    2
 9    9    9    9
18   18   18   18

Select rows using a slice.

>>> print(df.iloc[3:10:2])
     a    b    c
3    3    3    3
5    5    5    5
7    7    7    7
9    9    9    9

Select both rows and columns.

>>> print(df.iloc[[1, 3, 5, 7], 2])
1    1
3    3
5    5
7    7
Name: c, dtype: int64

Setting values in a column using iloc.

>>> df.iloc[:4] = 0
>>> print(df)
   a  b  c
0  0  0  0
1  0  0  0
2  0  0  0
3  0  0  0
4  4  4  4
5  5  5  5
6  6  6  6
7  7  7  7
8  8  8  8
9  9  9  9
[10 more rows]
property index

Returns the index of the DataFrame

isna(self, **kwargs)

Identify missing values in a DataFrame. Alias for isnull.

isnull(self, **kwargs)

Identify missing values in a DataFrame.

iteritems(self)

Iterate over column names and series pairs

join(self, other, on=None, how='left', lsuffix='', rsuffix='', sort=False, type='', method='hash')

Join columns with other DataFrame on index or on a key column.

Parameters
other : DataFrame
how : str

Only accepts “left”, “right”, “inner”, “outer”

lsuffix, rsuffix : str

The suffixes to add to the left (lsuffix) and right (rsuffix) column names when avoiding conflicts.

sort : bool

Set to True to ensure sorted ordering.

Returns
joined : DataFrame

Notes

Difference from pandas:

  • other must be a single DataFrame for now.

  • on is not supported yet due to lack of multi-index support.

label_encoding(self, column, prefix, cats, prefix_sep='_', dtype=None, na_sentinel=-1)

Encode labels in a column with label encoding.

Parameters
column : str

the source column with binary encoding for the data.

prefix : str

the new column name prefix.

cats : sequence of ints

the sequence of categories as integers.

prefix_sep : str

the separator between the prefix and the category.

dtype :

the dtype for the outputs; see Series.label_encoding

na_sentinel : number

Value to indicate missing category.

Returns
a new dataframe with a new column appended for the coded values.
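
A minimal usage sketch (the column name, prefix, and category values are assumed; the name of the appended column is derived from prefix):

>>> import cudf
>>> df = cudf.DataFrame({'cat': [1, 2, 2, 3]})
>>> df = df.label_encoding(column='cat', prefix='cat', cats=[1, 2, 3])
>>> # a new column of integer codes (0, 1, 1, 2) is appended
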
property loc

Selecting rows and columns by label or boolean mask.

See also

DataFrame.iloc

Examples

DataFrame with string index.

>>> print(df)
   a  b
a  0  5
b  1  6
c  2  7
d  3  8
e  4  9

Select a single row by label.

>>> print(df.loc['a'])
a    0
b    5
Name: a, dtype: int64

Select multiple rows and a single column.

>>> print(df.loc[['a', 'c', 'e'], 'b'])
a    5
c    7
e    9
Name: b, dtype: int64

Selection by boolean mask.

>>> print(df.loc[df.a > 2])
   a  b
d  3  8
e  4  9

Setting values using loc.

>>> df.loc[['a', 'c', 'e'], 'a'] = 0
>>> print(df)
   a  b
a  0  5
b  1  6
c  0  7
d  3  8
e  0  9

mean(self, numeric_only=None, **kwargs)

Return the mean of the values for the requested axis.

Parameters
axis : {index (0), columns (1)}

Axis for the function to be applied on.

skipna : bool, default True

Exclude NA/null values when computing the result.

level : int or level name, default None

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.

numeric_only : bool, default None

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

**kwargs

Additional keyword arguments to be passed to the function.

Returns
mean : Series or DataFrame (if level specified)

melt(self, **kwargs)

Unpivots a DataFrame from wide format to long format, optionally leaving identifier variables set.

Parameters
frame : DataFrame
id_vars : tuple, list, or ndarray, optional

Column(s) to use as identifier variables. default: None

value_vars : tuple, list, or ndarray, optional

Column(s) to unpivot. default: all columns that are not set as id_vars.

var_name : scalar

Name to use for the variable column. default: frame.columns.name or ‘variable’

value_name : str

Name to use for the value column. default: ‘value’

Returns
out : DataFrame

Melted result

merge(self, right, on=None, how='inner', left_on=None, right_on=None, left_index=False, right_index=False, sort=False, lsuffix=None, rsuffix=None, type='', method='hash', indicator=False, suffixes=('_x', '_y'))

Merge GPU DataFrame objects by performing a database-style join operation by columns or indexes.

Parameters
right : DataFrame
on : label or list; defaults to None

Column or index level names to join on. These must be found in both DataFrames.

If on is None and not merging on indexes then this defaults to the intersection of the columns in both DataFrames.

how : {‘left’, ‘outer’, ‘inner’}, default ‘inner’

Type of merge to be performed.

  • left: use only keys from left frame, similar to a SQL left outer join; preserve key order.

  • right: not supported.

  • outer: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically.

  • inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.

left_on : label or list, or array-like

Column or index level names to join on in the left DataFrame. Can also be an array or list of arrays of the length of the left DataFrame. These arrays are treated as if they are columns.

right_on : label or list, or array-like

Column or index level names to join on in the right DataFrame. Can also be an array or list of arrays of the length of the right DataFrame. These arrays are treated as if they are columns.

left_index : bool, default False

Use the index from the left DataFrame as the join key(s).

right_index : bool, default False

Use the index from the right DataFrame as the join key.

sort : bool, default False

Sort the join keys lexicographically in the result DataFrame. If False, the order of the join keys depends on the join type (see the how keyword).

suffixes: Tuple[str, str], defaults to (‘_x’, ‘_y’)

Suffixes applied to overlapping column names on the left and right sides

method : {‘hash’, ‘sort’}, default ‘hash’

The implementation method to be used for the operation.

Returns
merged : DataFrame

Examples

>>> import cudf
>>> df_a = cudf.DataFrame()
>>> df_a['key'] = [0, 1, 2, 3, 4]
>>> df_a['vals_a'] = [float(i + 10) for i in range(5)]
>>> df_b = cudf.DataFrame()
>>> df_b['key'] = [1, 2, 4]
>>> df_b['vals_b'] = [float(i+10) for i in range(3)]
>>> df_merged = df_a.merge(df_b, on=['key'], how='left')
>>> df_merged.sort_values('key')  
   key  vals_a  vals_b
3    0    10.0
0    1    11.0    10.0
1    2    12.0    11.0
4    3    13.0
2    4    14.0    12.0
nans_to_nulls(self)

Convert nans (if any) to nulls.

property ndim

Dimension of the data. DataFrame ndim is always 2.

nlargest(self, n, columns, keep='first')

Get the rows of the DataFrame sorted by the n largest values of columns

Notes

Difference from pandas:

  • Only a single column is supported in columns

notna(self, **kwargs)

Identify non-missing values in a DataFrame.

nsmallest(self, n, columns, keep='first')

Get the rows of the DataFrame sorted by the n smallest values of columns

Difference from pandas:

  • Only a single column is supported in columns
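
A minimal usage sketch for both methods (data assumed for illustration):

>>> import cudf
>>> df = cudf.DataFrame({'key': [0, 1, 2, 3], 'val': [7.0, 1.0, 9.0, 4.0]})
>>> top2 = df.nlargest(2, 'val')      # rows with the two largest 'val'
>>> bottom2 = df.nsmallest(2, 'val')  # rows with the two smallest 'val'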

one_hot_encoding(self, column, prefix, cats, prefix_sep='_', dtype='float64')

Expand a column with one-hot-encoding.

Parameters
column : str

the source column with binary encoding for the data.

prefix : str

the new column name prefix.

cats : sequence of ints

the sequence of categories as integers.

prefix_sep : str

the separator between the prefix and the category.

dtype :

the dtype for the outputs; defaults to float64.

Returns
a new dataframe with new columns appended for each category.

Examples

>>> import pandas as pd
>>> import cudf
>>> pet_owner = [1, 2, 3, 4, 5]
>>> pet_type = ['fish', 'dog', 'fish', 'bird', 'fish']
>>> df = pd.DataFrame({'pet_owner': pet_owner, 'pet_type': pet_type})
>>> df.pet_type = df.pet_type.astype('category')

Create a column with numerically encoded category values

>>> df['pet_codes'] = df.pet_type.cat.codes
>>> gdf = cudf.from_pandas(df)

Create the list of category codes to use in the encoding

>>> codes = gdf.pet_codes.unique()
>>> gdf.one_hot_encoding('pet_codes', 'pet_dummy', codes).head()
  pet_owner  pet_type  pet_codes  pet_dummy_0  pet_dummy_1  pet_dummy_2
0         1      fish          2          0.0          0.0          1.0
1         2       dog          1          0.0          1.0          0.0
2         3      fish          2          0.0          0.0          1.0
3         4      bird          0          1.0          0.0          0.0
4         5      fish          2          0.0          0.0          1.0
partition_by_hash(self, columns, nparts)

Partition the dataframe by the hashed value of data in columns.

Parameters
columns : sequence of str

The names of the columns to be hashed. Must have at least one name.

nparts : int

Number of output partitions

Returns
partitioned: list of DataFrame
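
A minimal usage sketch (data assumed for illustration):

>>> import cudf
>>> df = cudf.DataFrame({'key': [0, 1, 2, 3, 4], 'val': [5, 6, 7, 8, 9]})
>>> parts = df.partition_by_hash(['key'], nparts=2)
>>> len(parts)  # one DataFrame per partition; rows with equal keys land together
2
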
pop(self, item)

Return a column and drop it from the DataFrame.

quantile(self, q=0.5, axis=0, numeric_only=True, interpolation='linear', columns=None, exact=True)

Return values at the given quantile.

Parameters
q : float or array-like

0 <= q <= 1, the quantile(s) to compute

axis : int

axis is a NON-FUNCTIONAL parameter

numeric_only : boolean

numeric_only is a NON-FUNCTIONAL parameter

interpolation : {‘linear’, ‘lower’, ‘higher’, ‘midpoint’, ‘nearest’}

This parameter specifies the interpolation method to use, when the desired quantile lies between two data points i and j. Default ‘linear’.

columns : list of str

List of column names to include.

exact : boolean

Whether to use the exact quantile algorithm (True) or an approximate one (False).

Returns
DataFrame
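
A minimal usage sketch (data assumed for illustration):

>>> import cudf
>>> df = cudf.DataFrame({'a': [0.0, 1.0, 2.0, 3.0]})
>>> median = df.quantile(0.5)              # the 0.5 quantile of each numeric column
>>> quartiles = df.quantile([0.25, 0.75])  # one row per requested quantile
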
query(self, expr, local_dict={})

Query with a boolean expression using Numba to compile a GPU kernel.

See pandas.DataFrame.query.

Parameters
expr : str

A boolean expression. Names in expression refer to columns.

Names starting with @ refer to Python variables.

An output value will be null if any of the input values are null regardless of expression.

local_dict : dict

A dict containing the local variables to be used in the query.

Returns
filtered : DataFrame

Examples

>>> import cudf
>>> a = ('a', [1, 2, 2])
>>> b = ('b', [3, 4, 5])
>>> df = cudf.DataFrame([a, b])
>>> expr = "(a == 2 and b == 4) or (b == 3)"
>>> print(df.query(expr))
   a  b
0  1  3
1  2  4

DateTime conditionals:

>>> import numpy as np
>>> import datetime
>>> df = cudf.DataFrame()
>>> data = np.array(['2018-10-07', '2018-10-08'], dtype='datetime64')
>>> df['datetimes'] = data
>>> search_date = datetime.datetime.strptime('2018-10-08', '%Y-%m-%d')
>>> print(df.query('datetimes==@search_date'))
                datetimes
1 2018-10-08T00:00:00.000

Using local_dict:

>>> import numpy as np
>>> import datetime
>>> df = cudf.DataFrame()
>>> data = np.array(['2018-10-07', '2018-10-08'], dtype='datetime64')
>>> df['datetimes'] = data
>>> search_date2 = datetime.datetime.strptime('2018-10-08', '%Y-%m-%d')
>>> print(df.query('datetimes==@search_date',
...                local_dict={'search_date': search_date2}))
                datetimes
1 2018-10-08T00:00:00.000
reindex(self, labels=None, axis=0, index=None, columns=None, copy=True)

Return a new DataFrame whose axes conform to a new index

DataFrame.reindex supports two calling conventions:

  • (index=index_labels, columns=column_names)

  • (labels, axis={0 or 'index', 1 or 'columns'})

Parameters
labels : Index, Series-convertible, optional, default None
axis : {0 or ‘index’, 1 or ‘columns’}, optional, default 0
index : Index, Series-convertible, optional, default None

Shorthand for df.reindex(labels=index_labels, axis=0)

columns : array-like, optional, default None

Shorthand for df.reindex(labels=column_names, axis=1)

copy : boolean, optional, default True

Returns
A DataFrame whose axes conform to the new index(es)

Examples

>>> import cudf
>>> df = cudf.DataFrame()
>>> df['key'] = [0, 1, 2, 3, 4]
>>> df['val'] = [float(i + 10) for i in range(5)]
>>> df_new = df.reindex(index=[0, 3, 4, 5],
...                     columns=['key', 'val', 'sum'])
>>> print(df)
   key   val
0    0  10.0
1    1  11.0
2    2  12.0
3    3  13.0
4    4  14.0
>>> print(df_new)
   key   val  sum
0    0  10.0  NaN
3    3  13.0  NaN
4    4  14.0  NaN
5   -1   NaN  NaN
rename(self, mapper=None, columns=None, copy=True, inplace=False)

Alter column labels.

Function / dict values must be unique (1-to-1). Labels not contained in a dict / Series will be left as-is. Extra labels listed don’t throw an error.

Parameters
mapper, columns : dict-like or function, optional

dict-like or function transformations to apply to the column axis’s values.

copy : boolean, default True

Also copy underlying data

inplace: boolean, default False

If False (default), return a new DataFrame. If True, assign columns without copy.

Returns
DataFrame

Notes

Difference from pandas:
  • Support axis=’columns’ only.

  • Not supporting: index, level

Rename will not overwrite column names. If a list with duplicates is passed, column names will be postfixed.
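
A minimal usage sketch (column names assumed for illustration):

>>> import cudf
>>> df = cudf.DataFrame({'a': [1, 2], 'b': [3, 4]})
>>> df2 = df.rename(columns={'a': 'A'})  # 'b' is not in the dict and is left as-is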

replace(self, to_replace, replacement)

Replace values given in to_replace with replacement.

Parameters
to_replace : numeric, str, list-like or dict

Value(s) to replace.

  • numeric or str:

    • values equal to to_replace will be replaced with replacement

  • list of numeric or str:

    • If replacement is also list-like, to_replace and replacement must be of same length.

  • dict:

    • Dicts can be used to replace different values in different columns. For example, {‘a’: 1, ‘z’: 2} specifies that the value 1 in column a and the value 2 in column z should be replaced with replacement.

replacement : numeric, str, list-like, or dict

Value(s) to replace to_replace with. If a dict is provided, then its keys must match the keys in to_replace, and corresponding values must be compatible (e.g., if they are lists, then they must match in length).

Returns
result : DataFrame

DataFrame after replacement.
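
A minimal usage sketch of both the scalar and dict forms (data assumed for illustration):

>>> import cudf
>>> df = cudf.DataFrame({'a': [0, 1, 2], 'b': [0, 1, 2]})
>>> df2 = df.replace(1, 5)          # replace the value 1 with 5 in every column
>>> df3 = df.replace({'a': 0}, 10)  # replace the value 0 with 10 in column 'a' only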

rolling(self, window, min_periods=None, center=False, axis=0, win_type=None)

Rolling window calculations.

Parameters
window : int or offset

Size of the window, i.e., the number of observations used to calculate the statistic. For datetime indexes, an offset can be provided instead of an int. The offset must be convertible to a timedelta. As opposed to a fixed window size, each window will be sized to accommodate observations within the time period specified by the offset.

min_periods : int, optional

The minimum number of observations in the window that are required to be non-null, so that the result is non-null. If not provided or None, min_periods is equal to the window size.

center : bool, optional

If True, the result is set at the center of the window. If False (default), the result is set at the right edge of the window.

Returns
Rolling object.

Examples

>>> import cudf
>>> a = cudf.Series([1, 2, 3, None, 4])

Rolling sum with window size 2.

>>> print(a.rolling(2).sum())
0
1    3
2    5
3
4
dtype: int64

Rolling sum with window size 2 and min_periods 1.

>>> print(a.rolling(2, min_periods=1).sum())
0    1
1    3
2    5
3    3
4    4
dtype: int64

Rolling count with window size 3.

>>> print(a.rolling(3).count())
0    1
1    2
2    3
3    2
4    2
dtype: int64

Rolling count with window size 3, but with the result set at the center of the window.

>>> print(a.rolling(3, center=True).count())
0    2
1    3
2    2
3    2
4    1
dtype: int64

Rolling max with variable window size specified by an offset; only valid for datetime index.

>>> a = cudf.Series(
...     [1, 9, 5, 4, np.nan, 1],
...     index=[
...         pd.Timestamp('20190101 09:00:00'),
...         pd.Timestamp('20190101 09:00:01'),
...         pd.Timestamp('20190101 09:00:02'),
...         pd.Timestamp('20190101 09:00:04'),
...         pd.Timestamp('20190101 09:00:07'),
...         pd.Timestamp('20190101 09:00:08')
...     ]
... )
>>> print(a.rolling('2s').max())
2019-01-01T09:00:00.000    1
2019-01-01T09:00:01.000    9
2019-01-01T09:00:02.000    9
2019-01-01T09:00:04.000    4
2019-01-01T09:00:07.000
2019-01-01T09:00:08.000    1
dtype: int64

Apply custom function on the window with the apply method

>>> import numpy as np
>>> import math
>>> b = cudf.Series([16, 25, 36, 49, 64, 81], dtype=np.float64)
>>> def some_func(A):
...     b = 0
...     for a in A:
...         b = b + math.sqrt(a)
...     return b
...
>>> print(b.rolling(3, min_periods=1).apply(some_func))
0     4.0
1     9.0
2    15.0
3    18.0
4    21.0
5    24.0
dtype: float64

And this also works for window rolling set by an offset

>>> import pandas as pd
>>> c = cudf.Series(
...     [16, 25, 36, 49, 64, 81],
...     index=[
...          pd.Timestamp('20190101 09:00:00'),
...          pd.Timestamp('20190101 09:00:01'),
...          pd.Timestamp('20190101 09:00:02'),
...          pd.Timestamp('20190101 09:00:04'),
...          pd.Timestamp('20190101 09:00:07'),
...          pd.Timestamp('20190101 09:00:08')
...      ],
...     dtype=np.float64
... )
>>> print(c.rolling('2s').apply(some_func))
2019-01-01T09:00:00.000     4.0
2019-01-01T09:00:01.000     9.0
2019-01-01T09:00:02.000    11.0
2019-01-01T09:00:04.000     7.0
2019-01-01T09:00:07.000     8.0
2019-01-01T09:00:08.000    17.0
dtype: float64
select_dtypes(self, include=None, exclude=None)

Return a subset of the DataFrame’s columns based on the column dtypes.

Parameters
include : str or list

which columns to include based on dtypes

exclude : str or list

which columns to exclude based on dtypes
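
A minimal usage sketch (column names and dtypes assumed for illustration):

>>> import cudf
>>> df = cudf.DataFrame({'i': [1, 2], 'f': [1.0, 2.0]})
>>> floats = df.select_dtypes(include=['float64'])  # keeps only column 'f'
>>> ints = df.select_dtypes(exclude=['float64'])    # keeps only column 'i'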

set_index(self, index, drop=True)

Return a new DataFrame with a new index

Parameters
index : Index, Series-convertible, or str

Index: the new index. Series-convertible: values for the new index. str: name of the column to be used as the index.

drop : boolean

whether to drop the corresponding column when index is given as a str
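
A minimal usage sketch of the str form (data assumed for illustration):

>>> import cudf
>>> df = cudf.DataFrame({'key': [0, 1, 2], 'val': [10.0, 11.0, 12.0]})
>>> df2 = df.set_index('key')              # 'key' becomes the index and is dropped
>>> df3 = df.set_index('key', drop=False)  # 'key' stays as a column as well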

property shape

Returns a tuple representing the dimensionality of the DataFrame.

sort_index(self, ascending=True)

Sort by the index

sort_values(self, by, ascending=True, na_position='last')

Sort by the values row-wise.

Parameters
by : str or list of str

Name or list of names to sort by.

ascending : bool or list of bool, default True

Sort ascending vs. descending. Specify a list for multiple sort orders. If this is a list of bools, it must match the length of by.

na_position : {‘first’, ‘last’}, default ‘last’

‘first’ puts nulls at the beginning, ‘last’ puts nulls at the end

Returns
sorted_obj : cuDF DataFrame

Notes

Difference from pandas:
  • Support axis=’index’ only.

  • Not supporting: inplace, kind

Examples

>>> import cudf
>>> a = ('a', [0, 1, 2])
>>> b = ('b', [-3, 2, 0])
>>> df = cudf.DataFrame([a, b])
>>> print(df.sort_values('b'))
   a  b
0  0 -3
2  2  0
1  1  2
tail(self, n=5)

Returns the last n rows as a new DataFrame

Examples

>>> import cudf
>>> df = cudf.DataFrame()
>>> df['key'] = [0, 1, 2, 3, 4]
>>> df['val'] = [float(i + 10) for i in range(5)]  # insert column
>>> print(df.tail(2))
   key   val
3    3  13.0
4    4  14.0
to_arrow(self, preserve_index=True)

Convert to a PyArrow Table.

Examples

>>> import cudf
>>> a = ('a', [0, 1, 2])
>>> b = ('b', [-3, 2, 0])
>>> df = cudf.DataFrame([a, b])
>>> df.to_arrow()
pyarrow.Table
None: int64
a: int64
b: int64
to_csv(self, path=None, sep=',', na_rep='', columns=None, header=True, index=True, line_terminator='\n', chunksize=None)

Write a dataframe to csv file format.

Parameters
df : DataFrame

DataFrame object to be written to csv

path : str, default None

Path of file where DataFrame will be written

sep : char, default ‘,’

Delimiter to be used.

na_rep : str, default ‘’

String to use for null entries

columns : list of str, optional

Columns to write

header : bool, default True

Write out the column names

index : bool, default True

Write out the index as a column

line_terminator : char, default ‘\n’
chunksize : int or None, default None

Rows to write at a time

Notes

  • Follows the standard of Pandas csv.QUOTE_NONNUMERIC for all output.

  • If to_csv leads to memory errors consider setting the chunksize argument.

Examples

Write a dataframe to csv.

>>> import cudf
>>> filename = 'foo.csv'
>>> df = cudf.DataFrame({'x': [0, 1, 2, 3],
...                      'y': [1.0, 3.3, 2.2, 4.4],
...                      'z': ['a', 'b', 'c', 'd']})
>>> df = df.set_index([3, 2, 1, 0])
>>> df.to_csv(filename)
to_dlpack(self)

Converts a cuDF object into a DLPack tensor.

DLPack is an open-source memory tensor structure: dmlc/dlpack.

This function takes a cuDF object and converts it to a PyCapsule object which contains a pointer to a DLPack tensor. This function deep copies the data into the DLPack tensor from the cuDF object.

Parameters
cudf_obj : DataFrame, Series, Index, or Column

Returns
pycapsule_obj : PyCapsule

Output DLPack tensor pointer which is encapsulated in a PyCapsule object.

to_feather(self, path, *args, **kwargs)

Write a DataFrame to the feather format.

Parameters
path : str

File path

to_gpu_matrix(self)

Convert to a numba gpu ndarray

Returns
numba gpu ndarray
to_hdf(self, path_or_buf, key, *args, **kwargs)

Write the contained data to an HDF5 file using HDFStore.

Hierarchical Data Format (HDF) is self-describing, allowing an application to interpret the structure and contents of a file with no outside information. One HDF file can hold a mix of related objects which can be accessed as a group or as individual objects.

In order to add another DataFrame or Series to an existing HDF file please use append mode and a different key.

For more information see the user guide.

Parameters
path_or_buf : str or pandas.HDFStore

File path or HDFStore object.

key : str

Identifier for the group in the store.

mode : {‘a’, ‘w’, ‘r+’}, default ‘a’

Mode to open file:

  • ‘w’: write, a new file is created (an existing file with the same name would be deleted).

  • ‘a’: append, an existing file is opened for reading and writing, and if the file does not exist it is created.

  • ‘r+’: similar to ‘a’, but the file must already exist.

format : {‘fixed’, ‘table’}, default ‘fixed’

Possible values:

  • ‘fixed’: Fixed format. Fast writing/reading. Not appendable, nor searchable.

  • ‘table’: Table format. Write as a PyTables Table structure, which may perform worse but allows more flexible operations like searching / selecting subsets of the data.

append : bool, default False

For Table formats, append the input data to the existing data.

data_columns : list of columns or True, optional

List of columns to create as indexed data columns for on-disk queries, or True to use all columns. By default only the axes of the object are indexed. See Query via Data Columns. Applicable only to format=’table’.

complevel : {0-9}, optional

Specifies a compression level for data. A value of 0 disables compression.

complib : {‘zlib’, ‘lzo’, ‘bzip2’, ‘blosc’}, default ‘zlib’

Specifies the compression library to be used. As of v0.20.2 these additional compressors for Blosc are supported (default if no compressor specified: ‘blosc:blosclz’): {‘blosc:blosclz’, ‘blosc:lz4’, ‘blosc:lz4hc’, ‘blosc:snappy’, ‘blosc:zlib’, ‘blosc:zstd’}. Specifying a compression library which is not available raises a ValueError.

fletcher32 : bool, default False

If applying compression, use the fletcher32 checksum.

dropna : bool, default False

If True, rows that are all NaN will not be written to the store.

errors : str, default ‘strict’

Specifies how encoding and decoding errors are to be handled. See the errors argument for open() for a full list of options.

See also

cudf.io.hdf.read_hdf

Read from HDF file.

cudf.io.parquet.to_parquet

Write a DataFrame to the binary parquet format.

cudf.io.feather.to_feather

Write out feather-format for DataFrames.

to_json(self, path_or_buf=None, *args, **kwargs)

Convert the cuDF object to a JSON string. Note nulls and NaNs will be converted to null and datetime objects will be converted to UNIX timestamps.

Parameters
path_or_buf : string or file handle, optional

File path or object. If not specified, the result is returned as a string.

orient : string

Indication of expected JSON string format.

  • Series
    • default is ‘index’
    • allowed values are: {‘split’, ’records’, ’index’, ’table’}

  • DataFrame
    • default is ‘columns’
    • allowed values are: {‘split’, ’records’, ’index’, ’columns’, ’values’, ’table’}

  • The format of the JSON string
    • ‘split’ : dict like {‘index’ -> [index], ‘columns’ -> [columns], ‘data’ -> [values]}
    • ‘records’ : list like [{column -> value}, … , {column -> value}]
    • ‘index’ : dict like {index -> {column -> value}}
    • ‘columns’ : dict like {column -> {index -> value}}
    • ‘values’ : just the values array
    • ‘table’ : dict like {‘schema’: {schema}, ‘data’: {data}} describing the data, and the data component is like orient='records'.

date_format : {None, ‘epoch’, ‘iso’}

Type of date conversion. ‘epoch’ = epoch milliseconds, ‘iso’ = ISO8601. The default depends on the orient. For orient='table', the default is ‘iso’. For all other orients, the default is ‘epoch’.

double_precision : int, default 10

The number of decimal places to use when encoding floating point values.

force_ascii : bool, default True

Force encoded string to be ASCII.

date_unit : string, default ‘ms’ (milliseconds)

The time unit to encode to, governs timestamp and ISO8601 precision. One of ‘s’, ‘ms’, ‘us’, ‘ns’ for second, millisecond, microsecond, and nanosecond respectively.

default_handler : callable, default None

Handler to call if object cannot otherwise be converted to a suitable format for JSON. Should receive a single argument which is the object to convert and return a serialisable object.

lines : bool, default False

If ‘orient’ is ‘records’, write out line-delimited JSON format. Throws a ValueError for any other ‘orient’, since the others are not list-like.

compression : {‘infer’, ‘gzip’, ‘bz2’, ‘zip’, ‘xz’, None}

A string representing the compression to use in the output file, only used when the first argument is a filename. By default, the compression is inferred from the filename.

index : bool, default True

Whether to include the index values in the JSON string. Not including the index (index=False) is only supported when orient is ‘split’ or ‘table’.

See also

cudf.io.json.read_json

to_pandas(self)

Convert to a Pandas DataFrame.

Examples

>>> import cudf
>>> a = ('a', [0, 1, 2])
>>> b = ('b', [-3, 2, 0])
>>> df = cudf.DataFrame([a, b])
>>> type(df.to_pandas())
<class 'pandas.core.frame.DataFrame'>
to_parquet(self, path, *args, **kwargs)

Write a DataFrame to the parquet format.

Parameters
path : str

File path or Root Directory path. Will be used as Root Directory path while writing a partitioned dataset.

compression : {‘snappy’, ‘gzip’, ‘brotli’, None}, default ‘snappy’

Name of the compression to use. Use None for no compression.

index : bool, default None

If True, include the dataframe’s index(es) in the file output. If False, they will not be written to the file. If None, the engine’s default behavior will be used.

partition_cols : list, optional, default None

Column names by which to partition the dataset. Columns are partitioned in the order they are given.
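
A minimal usage sketch (file name and data assumed for illustration):

>>> import cudf
>>> df = cudf.DataFrame({'x': [0, 1, 2], 'y': ['a', 'b', 'c']})
>>> df.to_parquet('foo.parquet')  # writes a single parquet file with default compression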

to_records(self, index=True)

Convert to a numpy recarray

Parameters
index : bool

Whether to include the index in the output.

Returns
numpy recarray
to_string(self)

Convert to string

cuDF uses Pandas internals for efficient string formatting. Set formatting options using pandas string formatting options and cuDF objects will print identically to Pandas objects.

cuDF supports null/None as a value in any column type, which is transparently supported during this output process.

Examples

>>> import cudf
>>> df = cudf.DataFrame()
>>> df['key'] = [0, 1, 2]
>>> df['val'] = [float(i + 10) for i in range(3)]
>>> df.to_string()
'   key   val\n0    0  10.0\n1    1  11.0\n2    2  12.0'
transpose(self)

Transpose index and columns.

Returns
a new (ncol x nrow) dataframe. self is (nrow x ncol)

Notes

Difference from pandas: copy is not supported because the default and only behaviour is copy=True.

cudf.multi.concat(objs, axis=0, ignore_index=False, sort=None)

Concatenate DataFrames, Series, or Indices row-wise.

Parameters
objs : list of DataFrame, Series, or Index
axis : concatenation axis, 0 - index, 1 - columns
ignore_index : bool

Set True to ignore the index of the objs and provide a default range index instead.

Returns
A new object of like type with rows from each object in objs.
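
A minimal usage sketch, assuming the top-level alias cudf.concat (data assumed for illustration):

>>> import cudf
>>> df1 = cudf.DataFrame({'a': [0, 1]})
>>> df2 = cudf.DataFrame({'a': [2, 3]})
>>> combined = cudf.concat([df1, df2], ignore_index=True)  # 4 rows, fresh range index
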
cudf.reshape.general.get_dummies(df, prefix=None, prefix_sep='_', dummy_na=False, columns=None, cats={}, sparse=False, drop_first=False, dtype='int8')

Returns a dataframe whose columns are the one hot encodings of all columns in df

Parameters
df : cudf.DataFrame

dataframe to encode

prefix : str, dict, or sequence, optional

prefix to append. Either a str (to apply a constant prefix), dict mapping column names to prefixes, or sequence of prefixes to apply with the same length as the number of columns. If not supplied, defaults to the empty string

prefix_sep : str, dict, or sequence, optional, default ‘_’

separator to use when appending prefixes

dummy_na : boolean, optional

Right now this is a NON-FUNCTIONAL argument in rapids.

cats : dict, optional

dictionary mapping column names to sequences of integers representing that column’s category. See cudf.DataFrame.one_hot_encoding for more information. If not supplied, it will be computed

sparse : boolean, optional

Right now this is a NON-FUNCTIONAL argument in rapids.

drop_first : boolean, optional

Right now this is a NON-FUNCTIONAL argument in rapids.

columns : sequence of str, optional

Names of columns to encode. If not provided, will attempt to encode all columns. Note this is different from pandas default behavior, which encodes all columns with dtype object or categorical

dtype : str, optional

output dtype, default ‘int8’
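
A minimal usage sketch, assuming the top-level alias cudf.get_dummies (data assumed for illustration):

>>> import cudf
>>> df = cudf.DataFrame({'code': [0, 1, 2]})
>>> dummies = cudf.get_dummies(df, prefix='code')  # one 0/1 column per distinct value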

cudf.reshape.general.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None)

Unpivots a DataFrame from wide format to long format, optionally leaving identifier variables set.

Parameters
frame : DataFrame
id_vars : tuple, list, or ndarray, optional

Column(s) to use as identifier variables. default: None

value_vars : tuple, list, or ndarray, optional

Column(s) to unpivot. default: all columns that are not set as id_vars.

var_name : scalar

Name to use for the variable column. default: frame.columns.name or ‘variable’

value_name : str

Name to use for the value column. default: ‘value’

Returns
out : DataFrame

Melted result

Difference from pandas:
  • Does not support ‘col_level’ because cuDF does not have multi-index

Examples

>>> import cudf
>>> import numpy as np
>>> df = cudf.DataFrame({'A': {0: 1, 1: 1, 2: 5},
...                      'B': {0: 1, 1: 3, 2: 6},
...                      'C': {0: 1.0, 1: np.nan, 2: 4.0},
...                      'D': {0: 2.0, 1: 5.0, 2: 6.0}})
>>> cudf.melt(frame=df, id_vars=['A', 'B'], value_vars=['C', 'D'])
     A    B variable value
0    1    1        C   1.0
1    1    3        C
2    5    6        C   4.0
3    1    1        D   2.0
4    1    3        D   5.0
5    5    6        D   6.0

Series

class cudf.dataframe.series.Series(data=None, index=None, name=None, nan_as_null=True, dtype=None)

Data and null-masks.

Series objects are used as columns of DataFrame.

Attributes
cat
data

The gpu buffer for the data

dt
dtype

dtype of the Series

empty
has_null_mask

A boolean indicating whether a null-mask is needed

iloc

Select values by position.

index

The index object

is_monotonic
is_monotonic_decreasing
is_monotonic_increasing
is_unique
loc

Select values by label.

name

Returns name of the Series.

ndim

Dimension of the data.

null_count

Number of null values

nullmask

The gpu buffer for the null-mask

shape

Returns a tuple representing the dimensionality of the Series.

str
valid_count

Number of non-null values

values

Methods

abs(self)

Absolute value of each element of the series.

add(self, other[, fill_value])

Addition of series and other, element-wise (binary operator add).

append(self, other[, ignore_index])

Append values from another Series or array-like object.

applymap(self, udf[, out_dtype])

Apply an elementwise function to transform the values in the Column.

argsort(self[, ascending, na_position])

Returns a Series of int64 index that will sort the series.

as_mask(self)

Convert booleans to bitmask

astype(self, dtype, **kwargs)

Cast the Series to the given dtype

ceil(self)

Rounds each value upward to the smallest integral value not less than the original.

count(self[, axis, skipna])

The number of non-null values

cummax(self[, axis, skipna])

Compute the cumulative maximum of the series

cummin(self[, axis, skipna])

Compute the cumulative minimum of the series

cumprod(self[, axis, skipna])

Compute the cumulative product of the series

cumsum(self[, axis, skipna])

Compute the cumulative sum of the series

describe(self[, percentiles, include, exclude])

Compute summary statistics of a Series.

diff(self[, periods])

Calculate the difference between values at positions i and i - N in an array and store the output in a new array.

digitize(self, bins[, right])

Return the indices of the bins to which each value in series belongs.

drop_duplicates(self[, keep, inplace])

Return Series with duplicate values removed

dropna(self)

Return a Series with null values removed.

eq(self, other[, fill_value])

Equal to of series and other, element-wise (binary operator eq).

factorize(self[, na_sentinel])

Encode the input values as integer labels

fillna(self, value[, method, axis, inplace, …])

Fill null values with value.

find_first_value(self, value)

Returns offset of first value that matches

find_last_value(self, value)

Returns offset of last value that matches

floor(self)

Rounds each value downward to the largest integral value not greater than the original.

floordiv(self, other[, fill_value])

Integer division of series and other, element-wise (binary operator floordiv).

from_categorical(categorical[, codes])

Creates from a pandas.Categorical

from_masked_array(data, mask[, null_count])

Create a Series with null-mask.

ge(self, other[, fill_value])

Greater than or equal to of series and other, element-wise (binary operator ge).

gt(self, other[, fill_value])

Greater than of series and other, element-wise (binary operator gt).

hash_encode(self, stop[, use_name])

Encode column values as ints in [0, stop) using a hash function.

hash_values(self)

Compute the hash of values in this column.

isna(self)

Identify missing values in a Series.

isnull(self)

Identify missing values in a Series.

label_encoding(self, cats[, dtype, na_sentinel])

Perform label encoding

le(self, other[, fill_value])

Less than or equal to of series and other, element-wise (binary operator le).

lt(self, other[, fill_value])

Less than of series and other, element-wise (binary operator lt).

max(self[, axis, skipna, dtype])

Compute the max of the series

mean(self[, axis, skipna])

Compute the mean of the series

min(self[, axis, skipna, dtype])

Compute the min of the series

mod(self, other[, fill_value])

Modulo of series and other, element-wise (binary operator mod).

mul(self, other[, fill_value])

Multiplication of series and other, element-wise (binary operator mul).

nans_to_nulls(self)

Convert nans (if any) to nulls

ne(self, other[, fill_value])

Not equal to of series and other, element-wise (binary operator ne).

nlargest(self[, n, keep])

Returns a new Series of the n largest elements.

notna(self)

Identify non-missing values in a Series.

nsmallest(self[, n, keep])

Returns a new Series of the n smallest elements.

nunique(self[, method, dropna])

Returns the number of unique values of the Series: approximate version, and exact version to be moved to libgdf

one_hot_encoding(self, cats[, dtype])

Perform one-hot-encoding

pow(self, other[, fill_value])

Exponential power of series and other, element-wise (binary operator pow).

product(self[, axis, skipna, dtype])

Compute the product of the series

quantile(self[, q, interpolation, exact, …])

Return values at the given quantile.

radd(self, other[, fill_value])

Addition of series and other, element-wise (binary operator radd).

reindex(self[, index, copy])

Return a Series that conforms to a new index

rename(self[, index, copy])

Alter Series name.

replace(self, to_replace, replacement)

Replace values given in to_replace with replacement.

reset_index(self[, drop])

Reset index to RangeIndex

reverse(self)

Reverse the Series

rfloordiv(self, other[, fill_value])

Integer division of series and other, element-wise (binary operator rfloordiv).

rmod(self, other[, fill_value])

Modulo of series and other, element-wise (binary operator rmod).

rmul(self, other[, fill_value])

Multiplication of series and other, element-wise (binary operator rmul).

rolling(self, window[, min_periods, center, …])

Rolling window calculations.

round(self[, decimals])

Round a Series to a configurable number of decimal places.

rpow(self, other[, fill_value])

Exponential power of series and other, element-wise (binary operator rpow).

rsub(self, other[, fill_value])

Subtraction of series and other, element-wise (binary operator rsub).

rtruediv(self, other[, fill_value])

Floating division of series and other, element-wise (binary operator rtruediv).

scale(self)

Scale values to [0, 1] in float64

searchsorted(self, value[, side])

Find indices where elements should be inserted to maintain order

set_index(self, index)

Returns a new Series with a different index.

set_mask(self, mask[, null_count])

Create new Series by setting a mask array.

shift(self[, periods, freq, axis, fill_value])

Shift values of an input array by periods positions and store the output in a new array.

sort_index(self[, ascending])

Sort by the index.

sort_values(self[, ascending, na_position])

Sort by the values.

std(self[, ddof, axis, skipna])

Compute the standard deviation of the series

sub(self, other[, fill_value])

Subtraction of series and other, element-wise (binary operator sub).

sum(self[, axis, skipna, dtype])

Compute the sum of the series

tail(self[, n])

Returns the last n rows as a new Series

take(self, indices[, ignore_index])

Return Series by taking values from the corresponding indices.

to_array(self[, fillna])

Get a dense numpy array for the data.

to_dlpack(self)

Converts a cuDF object into a DLPack tensor.

to_frame(self[, name])

Convert Series into a DataFrame

to_gpu_array(self[, fillna])

Get a dense numba device array for the data.

to_hdf(self, path_or_buf, key, \*args, …)

Write the contained data to an HDF5 file using HDFStore.

to_json(self[, path_or_buf])

Convert the cuDF object to a JSON string.

to_string(self)

Convert to string

truediv(self, other[, fill_value])

Floating division of series and other, element-wise (binary operator truediv).

unique(self[, method, sort])

Returns unique values of this Series.

value_counts(self[, sort])

Returns a Series containing counts of unique values.

values_to_string(self[, nrows])

Returns a list of strings, one for each element.

var(self[, ddof, axis, skipna])

Compute the variance of the series

where(self, cond[, other, axis])

Replace values with other where the condition is False.

acos

all

any

as_index

asin

atan

copy

cos

deserialize

equals

exp

from_arrow

from_pandas

groupby

head

isin

log

logical_and

logical_not

logical_or

serialize

sin

sqrt

sum_of_squares

tan

to_arrow

to_pandas

unique_k

abs(self)

Absolute value of each element of the series.

Returns a new Series.

add(self, other, fill_value=None)

Addition of series and other, element-wise (binary operator add).

Parameters
other: Series or scalar value
fill_value : None or value

Value to fill nulls with before computation. If data in both corresponding Series locations is null, the result will be null.

all(self, axis=0, skipna=True, level=None)
any(self, axis=0, skipna=True, level=None)
append(self, other, ignore_index=False)

Append values from another Series or array-like object. If ignore_index=True, the index is reset.

applymap(self, udf, out_dtype=None)

Apply an elementwise function to transform the values in the Column.

The user function is expected to take one argument and return the result, which will be stored to the output Series. The function cannot reference globals except for other simple scalar objects.

Parameters
udf : function

Either a callable python function or a python function already decorated by ``numba.cuda.jit`` for call on the GPU as a device function.

out_dtype : numpy.dtype; optional

The dtype for use in the output. Only used for numba.cuda.jit decorated udf. By default, the result will have the same dtype as the source.

Returns
result : Series

The mask and index are preserved.

Notes

The supported Python features are listed in the numba documentation for CUDA Python, with these exceptions:

  • Math functions in cmath are not supported, since libcudf does not have complex number support and the output of cmath functions is most likely complex.

  • These five functions in math are not supported, since numba generates multiple PTX functions from them:

    • math.sin()

    • math.cos()

    • math.tan()

    • math.gamma()

    • math.lgamma()

argsort(self, ascending=True, na_position='last')

Returns a Series of int64 indices that will sort the series.

Uses Thrust sort.

Returns
result : Series

as_mask(self)

Convert booleans to bitmask

Returns
device array

astype(self, dtype, **kwargs)

Cast the Series to the given dtype

Parameters
dtype : data type
**kwargs : extra arguments to pass on to the constructor

Returns
out : Series

Copy of self cast to the given dtype. Returns self if dtype is the same as self.dtype.

ceil(self)

Rounds each value upward to the smallest integral value not less than the original.

Returns a new Series.

count(self, axis=None, skipna=True)

The number of non-null values

cummax(self, axis=0, skipna=True)

Compute the cumulative maximum of the series

cummin(self, axis=0, skipna=True)

Compute the cumulative minimum of the series

cumprod(self, axis=0, skipna=True)

Compute the cumulative product of the series

cumsum(self, axis=0, skipna=True)

Compute the cumulative sum of the series

property data

The gpu buffer for the data

describe(self, percentiles=None, include=None, exclude=None)

Compute summary statistics of a Series. For numeric data, the output includes the minimum, maximum, mean, median, standard deviation, and various quantiles. For object data, the output includes the count, number of unique values, the most common value, and the number of occurrences of the most common value.

Parameters
percentiles : list-like, optional

The percentiles used to generate the output summary statistics. If None, the default percentiles used are the 25th, 50th and 75th. Values should be within the interval [0, 1].

Returns
A DataFrame containing summary statistics of relevant columns from the input DataFrame.

Examples

Describing a Series containing numeric values.

>>> import cudf
>>> s = cudf.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
>>> print(s.describe())
   stats   values
0  count     10.0
1   mean      5.5
2    std  3.02765
3    min      1.0
4    25%      2.5
5    50%      5.5
6    75%      7.5
7    max     10.0

diff(self, periods=1)

Calculate the difference between values at positions i and i - N in an array and store the output in a new array.

Notes

Diff currently only supports float and integer dtype columns with no null values.

digitize(self, bins, right=False)

Return the indices of the bins to which each value in series belongs.

Parameters
bins : np.array

1-D monotonically increasing array with same type as this series.

right : bool

Indicates whether interval contains the right or left bin edge.

Returns
A new Series containing the indices.

Notes

Monotonicity of bins is assumed and not checked.

drop_duplicates(self, keep='first', inplace=False)

Return Series with duplicate values removed

dropna(self)

Return a Series with null values removed.

property dtype

dtype of the Series

eq(self, other, fill_value=None)

Equal to of series and other, element-wise (binary operator eq).

Parameters
other: Series or scalar value
fill_value : None or value

Value to fill nulls with before computation. If data in both corresponding Series locations is null, the result will be null.

factorize(self, na_sentinel=-1)

Encode the input values as integer labels

Parameters
na_sentinel : number

Value to indicate missing category.

Returns
(labels, cats) : (Series, Series)
  • labels contains the encoded values

  • cats contains the categories in order, such that the N-th item corresponds to code N-1.

fillna(self, value, method=None, axis=None, inplace=False, limit=None)

Fill null values with value.

Parameters
value : scalar or Series-like

Value to use to fill nulls. If Series-like, null values are filled with the values in corresponding indices of the given Series.

Returns
result : Series

Copy with nulls filled.

find_first_value(self, value)

Returns offset of first value that matches

find_last_value(self, value)

Returns offset of last value that matches

floor(self)

Rounds each value downward to the largest integral value not greater than the original.

Returns a new Series.

floordiv(self, other, fill_value=None)

Integer division of series and other, element-wise (binary operator floordiv).

Parameters
other: Series or scalar value
fill_value : None or value

Value to fill nulls with before computation. If data in both corresponding Series locations is null, the result will be null.

classmethod from_categorical(categorical, codes=None)

Creates from a pandas.Categorical

If codes is defined, use it instead of categorical.codes

classmethod from_masked_array(data, mask, null_count=None)

Create a Series with null-mask. This is equivalent to:

Series(data).set_mask(mask, null_count=null_count)

Parameters
data : 1D array-like

The values. Null values must not be skipped. They can appear as garbage values.

mask : 1D array-like of numpy.uint8

The null-mask. Valid values are marked as 1; otherwise 0. The mask bit given the data index idx is computed as:

(mask[idx // 8] >> (idx % 8)) & 1

null_count : int, optional

The number of null values. If None, it is calculated automatically.

ge(self, other, fill_value=None)

Greater than or equal to of series and other, element-wise (binary operator ge).

Parameters
other: Series or scalar value
fill_value : None or value

Value to fill nulls with before computation. If data in both corresponding Series locations is null, the result will be null.

gt(self, other, fill_value=None)

Greater than of series and other, element-wise (binary operator gt).

Parameters
other: Series or scalar value
fill_value : None or value

Value to fill nulls with before computation. If data in both corresponding Series locations is null, the result will be null.

property has_null_mask

A boolean indicating whether a null-mask is needed

hash_encode(self, stop, use_name=False)

Encode column values as ints in [0, stop) using hash function.

Parameters
stop : int

The upper bound on the encoding range.

use_name : bool

If True, combine the hashed column values with the hashed column name. This is useful when the same values in different columns should be encoded with different hashed values.

Returns
result : Series

The encoded Series.

hash_values(self)

Compute the hash of values in this column.

property iloc

Select values by position.

See DataFrame.iloc

property index

The index object

isna(self)

Identify missing values in a Series. Alias for isnull.

isnull(self)

Identify missing values in a Series.

label_encoding(self, cats, dtype=None, na_sentinel=-1)

Perform label encoding

Parameters
cats : sequence of input values
dtype : numpy.dtype; optional

Specifies the output dtype. If None is given, the smallest possible integer dtype (starting with np.int32) is used.

na_sentinel : number

Value to indicate missing category.

Returns
A sequence of encoded labels with values between 0 and n-1 (where n is the number of classes in cats).
le(self, other, fill_value=None)

Less than or equal to of series and other, element-wise (binary operator le).

Parameters
other: Series or scalar value
fill_value : None or value

Value to fill nulls with before computation. If data in both corresponding Series locations is null, the result will be null.

property loc

Select values by label.

See DataFrame.loc

lt(self, other, fill_value=None)

Less than of series and other, element-wise (binary operator lt).

Parameters
other: Series or scalar value
fill_value : None or value

Value to fill nulls with before computation. If data in both corresponding Series locations is null, the result will be null.

max(self, axis=None, skipna=True, dtype=None)

Compute the max of the series

mean(self, axis=None, skipna=True)

Compute the mean of the series

min(self, axis=None, skipna=True, dtype=None)

Compute the min of the series

mod(self, other, fill_value=None)

Modulo of series and other, element-wise (binary operator mod).

Parameters
other: Series or scalar value
fill_value : None or value

Value to fill nulls with before computation. If data in both corresponding Series locations is null, the result will be null.

mul(self, other, fill_value=None)

Multiplication of series and other, element-wise (binary operator mul).

Parameters
other: Series or scalar value
fill_value : None or value

Value to fill nulls with before computation. If data in both corresponding Series locations is null, the result will be null.

property name

Returns name of the Series.

nans_to_nulls(self)

Convert nans (if any) to nulls

property ndim

Dimension of the data. Series ndim is always 1.

ne(self, other, fill_value=None)

Not equal to of series and other, element-wise (binary operator ne).

Parameters
other: Series or scalar value
fill_value : None or value

Value to fill nulls with before computation. If data in both corresponding Series locations is null, the result will be null.

nlargest(self, n=5, keep='first')

Returns a new Series of the n largest elements.

notna(self)

Identify non-missing values in a Series.

nsmallest(self, n=5, keep='first')

Returns a new Series of the n smallest elements.

property null_count

Number of null values

property nullmask

The gpu buffer for the null-mask

nunique(self, method='sort', dropna=True)

Returns the number of unique values of the Series (approximate version; the exact version is to be moved to libgdf).

one_hot_encoding(self, cats, dtype='float64')

Perform one-hot-encoding

Parameters
cats : sequence of values

Values representing each category.

dtype : numpy.dtype

Specifies the output dtype.

Returns
A sequence of new series for each category. Its length is determined
by the length of cats.
pow(self, other, fill_value=None)

Exponential power of series and other, element-wise (binary operator pow).

Parameters
other: Series or scalar value
fill_value : None or value

Value to fill nulls with before computation. If data in both corresponding Series locations is null, the result will be null.

product(self, axis=None, skipna=True, dtype=None)

Compute the product of the series

quantile(self, q=0.5, interpolation='linear', exact=True, quant_index=True)

Return values at the given quantile.

Parameters
q : float or array-like, default 0.5 (50% quantile)

0 <= q <= 1, the quantile(s) to compute.

interpolation : {'linear', 'lower', 'higher', 'midpoint', 'nearest'}

This optional parameter specifies the interpolation method to use when the desired quantile lies between two data points i and j.

columns : list of str

List of column names to include.

exact : boolean

Whether to use approximate or exact quantile algorithm.

quant_index : boolean

Whether to use the list of quantiles as index.

Returns
DataFrame
radd(self, other, fill_value=None)

Addition of series and other, element-wise (binary operator radd).

Parameters
other: Series or scalar value
fill_value : None or value

Value to fill nulls with before computation. If data in both corresponding Series locations is null, the result will be null.

reindex(self, index=None, copy=True)

Return a Series that conforms to a new index

Parameters
index : Index, Series-convertible, default None
copy : boolean, default True

Returns
A new Series that conforms to the supplied index.

rename(self, index=None, copy=True)

Alter Series name.

Change Series.name with a scalar value.

Parameters
index : Scalar, optional

Scalar to alter the Series.name attribute

copy : boolean, default True

Also copy underlying data

Returns
Series

Difference from pandas:
  • Supports scalar values only for changing name attribute

  • Not supporting: inplace, level

replace(self, to_replace, replacement)

Replace values given in to_replace with replacement.

Parameters
to_replace : numeric, str or list-like

Value(s) to replace.

  • numeric or str:

    • values equal to to_replace will be replaced with replacement

  • list of numeric or str:

    • If replacement is also list-like, to_replace and replacement must be of same length.

replacement : numeric, str, list-like, or dict

Value(s) to replace to_replace with.

Returns
result : Series

Series after replacement. The mask and index are preserved.

See also

Series.fillna
reset_index(self, drop=False)

Reset index to RangeIndex

reverse(self)

Reverse the Series

rfloordiv(self, other, fill_value=None)

Integer division of series and other, element-wise (binary operator rfloordiv).

Parameters
other: Series or scalar value
fill_value : None or value

Value to fill nulls with before computation. If data in both corresponding Series locations is null, the result will be null.

rmod(self, other, fill_value=None)

Modulo of series and other, element-wise (binary operator rmod).

Parameters
other: Series or scalar value
fill_value : None or value

Value to fill nulls with before computation. If data in both corresponding Series locations is null, the result will be null.

rmul(self, other, fill_value=None)

Multiplication of series and other, element-wise (binary operator rmul).

Parameters
other: Series or scalar value
fill_value : None or value

Value to fill nulls with before computation. If data in both corresponding Series locations is null, the result will be null.

rolling(self, window, min_periods=None, center=False, axis=0, win_type=None)

Rolling window calculations.

Parameters
window : int or offset

Size of the window, i.e., the number of observations used to calculate the statistic. For datetime indexes, an offset can be provided instead of an int. The offset must be convertible to a timedelta. As opposed to a fixed window size, each window will be sized to accommodate observations within the time period specified by the offset.

min_periods : int, optional

The minimum number of observations in the window that are required to be non-null, so that the result is non-null. If not provided or None, min_periods is equal to the window size.

center : bool, optional

If True, the result is set at the center of the window. If False (default), the result is set at the right edge of the window.

Returns
Rolling object.

Examples

>>> import cudf
>>> a = cudf.Series([1, 2, 3, None, 4])

Rolling sum with window size 2.

>>> print(a.rolling(2).sum())
0
1    3
2    5
3
4
dtype: int64

Rolling sum with window size 2 and min_periods 1.

>>> print(a.rolling(2, min_periods=1).sum())
0    1
1    3
2    5
3    3
4    4
dtype: int64

Rolling count with window size 3.

>>> print(a.rolling(3).count())
0    1
1    2
2    3
3    2
4    2
dtype: int64

Rolling count with window size 3, but with the result set at the center of the window.

>>> print(a.rolling(3, center=True).count())
0    2
1    3
2    2
3    2
4    1
dtype: int64

Rolling max with variable window size specified by an offset; only valid for datetime index.

>>> import numpy as np
>>> import pandas as pd
>>> a = cudf.Series(
...     [1, 9, 5, 4, np.nan, 1],
...     index=[
...         pd.Timestamp('20190101 09:00:00'),
...         pd.Timestamp('20190101 09:00:01'),
...         pd.Timestamp('20190101 09:00:02'),
...         pd.Timestamp('20190101 09:00:04'),
...         pd.Timestamp('20190101 09:00:07'),
...         pd.Timestamp('20190101 09:00:08')
...     ]
... )
>>> print(a.rolling('2s').max())
2019-01-01T09:00:00.000    1
2019-01-01T09:00:01.000    9
2019-01-01T09:00:02.000    9
2019-01-01T09:00:04.000    4
2019-01-01T09:00:07.000
2019-01-01T09:00:08.000    1
dtype: int64

Apply custom function on the window with the apply method

>>> import numpy as np
>>> import math
>>> b = cudf.Series([16, 25, 36, 49, 64, 81], dtype=np.float64)
>>> def some_func(A):
...     b = 0
...     for a in A:
...         b = b + math.sqrt(a)
...     return b
...
>>> print(b.rolling(3, min_periods=1).apply(some_func))
0     4.0
1     9.0
2    15.0
3    18.0
4    21.0
5    24.0
dtype: float64

And this also works for window rolling set by an offset

>>> import pandas as pd
>>> c = cudf.Series(
...     [16, 25, 36, 49, 64, 81],
...     index=[
...          pd.Timestamp('20190101 09:00:00'),
...          pd.Timestamp('20190101 09:00:01'),
...          pd.Timestamp('20190101 09:00:02'),
...          pd.Timestamp('20190101 09:00:04'),
...          pd.Timestamp('20190101 09:00:07'),
...          pd.Timestamp('20190101 09:00:08')
...      ],
...     dtype=np.float64
... )
>>> print(c.rolling('2s').apply(some_func))
2019-01-01T09:00:00.000     4.0
2019-01-01T09:00:01.000     9.0
2019-01-01T09:00:02.000    11.0
2019-01-01T09:00:04.000     7.0
2019-01-01T09:00:07.000     8.0
2019-01-01T09:00:08.000    17.0
dtype: float64

round(self, decimals=0)

Round a Series to a configurable number of decimal places.

rpow(self, other, fill_value=None)

Exponential power of series and other, element-wise (binary operator rpow).

Parameters
other: Series or scalar value
fill_value : None or value

Value to fill nulls with before computation. If data in both corresponding Series locations is null, the result will be null.

rsub(self, other, fill_value=None)

Subtraction of series and other, element-wise (binary operator rsub).

Parameters
other: Series or scalar value
fill_value : None or value

Value to fill nulls with before computation. If data in both corresponding Series locations is null, the result will be null.

rtruediv(self, other, fill_value=None)

Floating division of series and other, element-wise (binary operator rtruediv).

Parameters
other: Series or scalar value
fill_value : None or value

Value to fill nulls with before computation. If data in both corresponding Series locations is null, the result will be null.

scale(self)

Scale values to [0, 1] in float64

searchsorted(self, value, side='left')

Find indices where elements should be inserted to maintain order

Parameters
value : array_like

Column of values to search for.

side : str {'left', 'right'}, optional

If 'left', the index of the first suitable location found is given. If 'right', return the last such index.

Returns
A Column of insertion points with the same shape as value
set_index(self, index)

Returns a new Series with a different index.

Parameters
index : Index, Series-convertible

the new index or values for the new index

set_mask(self, mask, null_count=None)

Create new Series by setting a mask array.

This will override the existing mask. The returned Series will reference the same data buffer as this Series.

Parameters
mask : 1D array-like of numpy.uint8

The null-mask. Valid values are marked as 1; otherwise 0. The mask bit given the data index idx is computed as:

(mask[idx // 8] >> (idx % 8)) & 1

null_count : int, optional

The number of null values. If None, it is calculated automatically.

property shape

Returns a tuple representing the dimensionality of the Series.

shift(self, periods=1, freq=None, axis=0, fill_value=None)

Shift values of an input array by periods positions and store the output in a new array.

Notes

Shift currently only supports float and integer dtype columns with no null values.

sort_index(self, ascending=True)

Sort by the index.

sort_values(self, ascending=True, na_position='last')

Sort by the values.

Sort a Series in ascending or descending order by some criterion.

Parameters
ascending : bool, default True

If True, sort values in ascending order, otherwise descending.

na_position : {'first', 'last'}, default 'last'

‘first’ puts nulls at the beginning, ‘last’ puts nulls at the end.

Returns
sorted_obj : cuDF Series

Difference from pandas:
  • Not supporting: inplace, kind

Examples

>>> import cudf
>>> s = cudf.Series([1, 5, 2, 4, 3])
>>> s.sort_values()
0    1
2    2
4    3
3    4
1    5

std(self, ddof=1, axis=None, skipna=True)

Compute the standard deviation of the series

sub(self, other, fill_value=None)

Subtraction of series and other, element-wise (binary operator sub).

Parameters
other: Series or scalar value
fill_value : None or value

Value to fill nulls with before computation. If data in both corresponding Series locations is null, the result will be null.

sum(self, axis=None, skipna=True, dtype=None)

Compute the sum of the series

tail(self, n=5)

Returns the last n rows as a new Series

Examples

>>> import cudf
>>> ser = cudf.Series([4, 3, 2, 1, 0])
>>> print(ser.tail(2))
3    1
4    0

take(self, indices, ignore_index=False)

Return Series by taking values from the corresponding indices.

to_array(self, fillna=None)

Get a dense numpy array for the data.

Parameters
fillna : str or None

Defaults to None, which will skip null values. If it equals "pandas", null values are filled with NaNs. Non-integral dtypes are promoted to np.float64.

Notes

If fillna is None, null values are skipped; therefore, the output size could be smaller.

to_dlpack(self)

Converts a cuDF object into a DLPack tensor.

DLPack is an open-source memory tensor structure: dmlc/dlpack.

This function takes a cuDF object and converts it to a PyCapsule object which contains a pointer to a DLPack tensor. This function deep copies the data into the DLPack tensor from the cuDF object.

Parameters
cudf_obj : DataFrame, Series, Index, or Column
Returns
pycapsule_obj : PyCapsule

Output DLPack tensor pointer which is encapsulated in a PyCapsule object.

to_frame(self, name=None)

Convert Series into a DataFrame

Parameters
name : str, default None

Name to be used for the column

Returns
DataFrame

cudf DataFrame

to_gpu_array(self, fillna=None)

Get a dense numba device array for the data.

Parameters
fillna : str or None

See the fillna argument in to_array.

Notes

If fillna is None, null values are skipped; therefore, the output size could be smaller.

to_hdf(self, path_or_buf, key, *args, **kwargs)

Write the contained data to an HDF5 file using HDFStore.

Hierarchical Data Format (HDF) is self-describing, allowing an application to interpret the structure and contents of a file with no outside information. One HDF file can hold a mix of related objects which can be accessed as a group or as individual objects.

In order to add another DataFrame or Series to an existing HDF file please use append mode and a different key.

For more information see the pandas user guide: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#hdf5-pytables

Parameters
path_or_buf : str or pandas.HDFStore

File path or HDFStore object.

key : str

Identifier for the group in the store.

mode : {'a', 'w', 'r+'}, default 'a'

Mode to open file:

  • 'w': write, a new file is created (an existing file with the same name would be deleted).

  • 'a': append, an existing file is opened for reading and writing, and if the file does not exist it is created.

  • 'r+': similar to 'a', but the file must already exist.

format : {'fixed', 'table'}, default 'fixed'

Possible values:

  • 'fixed': Fixed format. Fast writing/reading. Not-appendable, nor searchable.

  • 'table': Table format. Write as a PyTables Table structure which may perform worse but allow more flexible operations like searching / selecting subsets of the data.

append : bool, default False

For Table formats, append the input data to the existing.

data_columns : list of columns or True, optional

List of columns to create as indexed data columns for on-disk queries, or True to use all columns. By default only the axes of the object are indexed. See Query via Data Columns. Applicable only to format='table'.

complevel : {0-9}, optional

Specifies a compression level for data. A value of 0 disables compression.

complib : {'zlib', 'lzo', 'bzip2', 'blosc'}, default 'zlib'

Specifies the compression library to be used. As of v0.20.2 these additional compressors for Blosc are supported (default if no compressor specified: 'blosc:blosclz'): {'blosc:blosclz', 'blosc:lz4', 'blosc:lz4hc', 'blosc:snappy', 'blosc:zlib', 'blosc:zstd'}. Specifying a compression library which is not available issues a ValueError.

fletcher32 : bool, default False

If applying compression use the fletcher32 checksum.

dropna : bool, default False

If true, ALL nan rows will not be written to store.

errors : str, default 'strict'

Specifies how encoding and decoding errors are to be handled. See the errors argument for open() for a full list of options.

to_json(self, path_or_buf=None, *args, **kwargs)

Convert the cuDF object to a JSON string. Note nulls and NaNs will be converted to null and datetime objects will be converted to UNIX timestamps.

Parameters
path_or_buf : string or file handle, optional

File path or object. If not specified, the result is returned as a string.

orient : string

Indication of expected JSON string format.

  • Series:

    • default is 'index'

    • allowed values are: {'split', 'records', 'index', 'table'}

  • DataFrame:

    • default is 'columns'

    • allowed values are: {'split', 'records', 'index', 'columns', 'values', 'table'}

  • The format of the JSON string:

    • 'split' : dict like {'index' -> [index], 'columns' -> [columns], 'data' -> [values]}

    • 'records' : list like [{column -> value}, … , {column -> value}]

    • 'index' : dict like {index -> {column -> value}}

    • 'columns' : dict like {column -> {index -> value}}

    • 'values' : just the values array

    • 'table' : dict like {'schema': {schema}, 'data': {data}} describing the data, where the data component is like orient='records'.

date_format : {None, 'epoch', 'iso'}

Type of date conversion. 'epoch' = epoch milliseconds, 'iso' = ISO8601. The default depends on the orient. For orient='table', the default is 'iso'. For all other orients, the default is 'epoch'.

double_precision : int, default 10

The number of decimal places to use when encoding floating point values.

force_ascii : bool, default True

Force encoded string to be ASCII.

date_unit : string, default 'ms' (milliseconds)

The time unit to encode to, governs timestamp and ISO8601 precision. One of 's', 'ms', 'us', 'ns' for second, millisecond, microsecond, and nanosecond respectively.

default_handler : callable, default None

Handler to call if object cannot otherwise be converted to a suitable format for JSON. Should receive a single argument which is the object to convert and return a serialisable object.

lines : bool, default False

If 'orient' is 'records' write out line delimited json format. Will throw ValueError if incorrect 'orient' since others are not list like.

compression : {'infer', 'gzip', 'bz2', 'zip', 'xz', None}

A string representing the compression to use in the output file, only used when the first argument is a filename. By default, the compression is inferred from the filename.

index : bool, default True

Whether to include the index values in the JSON string. Not including the index (index=False) is only supported when orient is 'split' or 'table'.

to_string(self)

Convert to string

Uses Pandas formatting internals to produce output identical to Pandas. Use the Pandas formatting settings directly in Pandas to control cuDF output.

truediv(self, other, fill_value=None)

Floating division of series and other, element-wise (binary operator truediv).

Parameters
other: Series or scalar value
fill_value : None or value

Value to fill nulls with before computation. If data in both corresponding Series locations is null, the result will be null.

unique(self, method='sort', sort=True)

Returns unique values of this Series. The default method='sort' will be changed to 'hash' when that implementation is available.

property valid_count

Number of non-null values

value_counts(self, sort=True)

Returns a Series containing counts of unique values in this Series.

values_to_string(self, nrows=None)

Returns a list of strings, one for each element.

var(self, ddof=1, axis=None, skipna=True)

Compute the variance of the series

where(self, cond, other=None, axis=None)

Replace values with other where the condition is False.

Parameters
cond : boolean

Where cond is True, keep the original value. Where False, replace with corresponding value from other.

other : scalar, default None

Entries where cond is False are replaced with corresponding value from other.

axis

Returns
Series

Groupby

Groupby.apply(self, function)

Apply a python transformation function over the grouped chunk.

Parameters
func : function

The python transformation function that will be applied on the grouped chunk.

Examples

from cudf import DataFrame
df = DataFrame()
df['key'] = [0, 0, 1, 1, 2, 2, 2]
df['val'] = [0, 1, 2, 3, 4, 5, 6]
groups = df.groupby(['key'], method='cudf')

# Define a function to apply to each row in a group
def mult(df):
  df['out'] = df['key'] * df['val']
  return df

result = groups.apply(mult)
print(result)

Output:

   key  val  out
0    0    0    0
1    0    1    0
2    1    2    2
3    1    3    3
4    2    4    8
5    2    5   10
6    2    6   12

Groupby.apply_grouped(self, function, **kwargs)

Apply a transformation function over the grouped chunk.

This uses numba’s CUDA JIT compiler to convert the Python transformation function into a CUDA kernel, thus will have a compilation overhead during the first run.

Parameters
func : function

The transformation function that will be executed on the CUDA GPU.

incols : list

A list of names of input columns.

outcols : dict

A dictionary of output column names and their dtype.

kwargs : dict

name-value of extra arguments. These values are passed directly into the function.

Examples

from cudf import DataFrame
from numba import cuda
import numpy as np

df = DataFrame()
df['key'] = [0, 0, 1, 1, 2, 2, 2]
df['val'] = [0, 1, 2, 3, 4, 5, 6]
groups = df.groupby(['key'], method='cudf')

# Define a function to apply to each group
def mult_add(key, val, out1, out2):
    for i in range(cuda.threadIdx.x, len(key), cuda.blockDim.x):
        out1[i] = key[i] * val[i]
        out2[i] = key[i] + val[i]

result = groups.apply_grouped(mult_add,
                              incols=['key', 'val'],
                              outcols={'out1': np.int32,
                                       'out2': np.int32},
                              # threads per block
                              tpb=8)

print(result)

Output:

   key  val out1 out2
0    0    0    0    0
1    0    1    0    1
2    1    2    2    3
3    1    3    3    4
4    2    4    8    6
5    2    5   10    7
6    2    6   12    8

import cudf
import numpy as np
from numba import cuda
import pandas as pd
from random import randint

# Create a random 15 row dataframe with one categorical
# feature and one random integer valued feature
df = cudf.DataFrame(
        {
            "cat": [1] * 5 + [2] * 5 + [3] * 5,
            "val": [randint(0, 100) for _ in range(15)],
        }
     )

# Group the dataframe by its categorical feature
groups = df.groupby("cat", method="cudf")

# Define a kernel which takes the moving average of a
# sliding window
def rolling_avg(val, avg):
    win_size = 3
    for row, i in enumerate(range(cuda.threadIdx.x,
                                  len(val), cuda.blockDim.x)):
        if row < win_size - 1:
            # If there is not enough data to fill the window,
            # take the average to be NaN
            avg[i] = np.nan
        else:
            total = 0
            for j in range(i - win_size + 1, i + 1):
                total += val[j]
            avg[i] = total / win_size

# Compute moving avgs on all groups
results = groups.apply_grouped(rolling_avg,
                               incols=['val'],
                               outcols=dict(avg=np.float64))
print("Results:", results)

# Note this gives the same result as its pandas equivalent
pdf = df.to_pandas()
pd_results = pdf.groupby('cat')['val'].rolling(3).mean()

Output:

Results:
     cat  val                 avg
0    1   16
1    1   45
2    1   62                41.0
3    1   45  50.666666666666664
4    1   26  44.333333333333336
5    2    5
6    2   51
7    2   77  44.333333333333336
8    2    1                43.0
9    2   46  41.333333333333336
[5 more rows]

This is functionally equivalent to pandas.DataFrame.rolling().

Groupby.as_df(self)

Get the intermediate dataframe after shuffling the rows into groups.

Returns
(df, segs) : namedtuple

  • df : DataFrame

  • segs : Series

    Beginning offsets of each group.

Examples

from cudf import DataFrame

df = DataFrame()
df['key'] = [0, 0, 1, 1, 2, 2, 2]
df['val'] = [0, 1, 2, 3, 4, 5, 6]
groups = df.groupby(['key'], method='cudf')

df_groups = groups.as_df()

# DataFrame indexes of group starts
print(df_groups[1])

# DataFrame itself
print(df_groups[0])

Output:

# DataFrame indexes of group starts
0    0
1    2
2    4

# DataFrame itself
   key  val
0    0    0
1    0    1
2    1    2
3    1    3
4    2    4
5    2    5
6    2    6

Groupby.std(self)

Compute the std of each group

Returns
result : DataFrame

Groupby.var(self)

Compute the var of each group

Returns
result : DataFrame

Groupby.sum_of_squares(self)

Compute the sum_of_squares of each group

Returns
result : DataFrame

IO

cudf.io.csv.read_csv(filepath_or_buffer, lineterminator='\n', quotechar='"', quoting=0, doublequote=True, header='infer', mangle_dupe_cols=True, usecols=None, sep=',', delimiter=None, delim_whitespace=False, skipinitialspace=False, names=None, dtype=None, skipfooter=0, skiprows=0, dayfirst=False, compression='infer', thousands=None, decimal='.', true_values=None, false_values=None, nrows=None, byte_range=None, skip_blank_lines=True, parse_dates=None, comment=None, na_values=None, keep_default_na=True, na_filter=True, prefix=None, index_col=None)

Load a comma-separated values (CSV) dataset into a DataFrame

Parameters
filepath_or_buffer : str, path object, or file-like object

Either a path to a file (a str, pathlib.Path, or py._path.local.LocalPath), URL (including http, ftp, and S3 locations), or any object with a read() method (such as builtin open() file handler function or StringIO).

sep : char, default ','

Delimiter to be used.

delimiter : char, default None

Alternative argument name for sep.

delim_whitespace : bool, default False

Determines whether to use whitespace as delimiter.

lineterminator : char, default '\n'

Character to indicate end of line.

skipinitialspace : bool, default False

Skip spaces after delimiter.

names : list of str, default None

List of column names to be used.

dtype : type, list of types, or dict of column -> type, default None

Data type(s) for data or columns. If list, types are applied in the same order as the column names. If dict, types are mapped to the column names. E.g. {'a': np.float64, 'b': np.int32, 'c': 'float'}. If None, dtypes are inferred from the dataset. Use str to preserve data and not infer or interpret to dtype.

quotechar : char, default '"'

Character to indicate start and end of quote item.

quoting : str or int, default 0

Controls quoting behavior. Set to one of 0 (csv.QUOTE_MINIMAL), 1 (csv.QUOTE_ALL), 2 (csv.QUOTE_NONNUMERIC) or 3 (csv.QUOTE_NONE). Quoting is enabled with all values except 3.

doublequote : bool, default True

When quoting is enabled, indicates whether to interpret two consecutive quotechar inside fields as single quotechar.

header : int, default 'infer'

Row number to use as the column names. Default behavior is to infer the column names: if no names are passed, header=0; if column names are passed explicitly, header=None.

usecols : list of int or str, default None

Returns subset of the columns given in the list. All elements must be either integer indices (column number) or strings that correspond to column names.

mangle_dupe_cols : boolean, default True

Duplicate columns will be specified as 'X', 'X.1', … 'X.N'.

skiprows : int, default 0

Number of rows to be skipped from the start of file.

skipfooter : int, default 0

Number of rows to be skipped at the bottom of file.

compression : {'infer', 'gzip', 'zip', None}, default 'infer'

For on-the-fly decompression of on-disk data. If 'infer', then detect compression from the following extensions: '.gz', '.zip' (otherwise no decompression). If using 'zip', the ZIP file must contain only one data file to be read in, otherwise the first non-zero-sized file will be used. Set to None for no decompression.

decimal : char, default '.'

Character used as a decimal point.

thousands : char, default None

Character used as a thousands delimiter.

true_values : list, default None

Values to consider as boolean True.

false_values : list, default None

Values to consider as boolean False.

nrows : int, default None

If specified, maximum number of rows to read.

byte_range : list or tuple, default None

Byte range within the input file to be read. The first number is the offset in bytes, the second number is the range size in bytes. Set the size to zero to read all data after the offset location. Reads the row that starts before or at the end of the range, even if it ends after the end of the range.

skip_blank_lines : bool, default True

If True, discard and do not parse empty lines. If False, interpret empty lines as NaN values.

parse_dates : list of int or names, default None

If list of columns, then attempt to parse each entry as a date. Columns may not always be recognized as dates, for instance due to unusual or non-standard formats. To guarantee a date and increase parsing speed, explicitly specify dtype='date' for the desired columns.

comment : char, default None

Character used as a comments indicator. If found at the beginning of a line, the line will be ignored altogether.

na_values : list, default None

Values to consider as invalid.

keep_default_na : bool, default True

Whether or not to include the default NA values when parsing the data.

na_filter : bool, default True

Detect missing values (empty strings and the values in na_values). Passing False can improve performance.

prefix : str, default None

Prefix to add to column numbers when parsing without a header row.

index_col : int, string or False, default None

Column to use as the row labels of the DataFrame. Passing index_col=False explicitly disables index column inference and discards the last column.

Returns
GPU DataFrame object.

Examples

Create a test csv file

>>> import cudf
>>> filename = 'foo.csv'
>>> lines = [
...   "num1,datetime,text",
...   "123,2018-11-13T12:00:00,abc",
...   "456,2018-11-14T12:35:01,def",
...   "789,2018-11-15T18:02:59,ghi"
... ]
>>> with open(filename, 'w') as fp:
...     fp.write('\n'.join(lines)+'\n')

Read the file with cudf.read_csv

>>> cudf.read_csv(filename)
  num1                datetime text
0  123 2018-11-13T12:00:00.000 5451
1  456 2018-11-14T12:35:01.000 5784
2  789 2018-11-15T18:02:59.000 6117

cudf.io.csv.to_csv(df, path=None, sep=',', na_rep='', columns=None, header=True, index=True, line_terminator='\n', chunksize=None)

Write a dataframe to csv file format.

Parameters
df : DataFrame

DataFrame object to be written to csv

path : str, default None

Path of file where DataFrame will be written

sep : char, default ','

Delimiter to be used.

na_rep : str, default ''

String to use for null entries

columns : list of str, optional

Columns to write

header : bool, default True

Write out the column names

index : bool, default True

Write out the index as a column

line_terminator : char, default '\n'
chunksize : int or None, default None

Rows to write at a time

Notes

  • Follows the standard of Pandas csv.QUOTE_NONNUMERIC for all output.

  • If to_csv leads to memory errors consider setting the chunksize argument.

Examples

Write a dataframe to csv.

>>> import cudf
>>> filename = 'foo.csv'
>>> df = cudf.DataFrame({'x': [0, 1, 2, 3],
...                      'y': [1.0, 3.3, 2.2, 4.4],
...                      'z': ['a', 'b', 'c', 'd']})
>>> df = df.set_index([3, 2, 1, 0])
>>> df.to_csv(filename)

cudf.io.parquet.read_parquet(filepath_or_buffer, engine='cudf', columns=None, row_group=None, skip_rows=None, num_rows=None, strings_to_categorical=False, *args, **kwargs)

Load a Parquet dataset into a DataFrame

Parameters
filepath_or_buffer : str, path object, bytes, or file-like object

Either a path to a file (a str, pathlib.Path, or py._path.local.LocalPath), URL (including http, ftp, and S3 locations), Python bytes of raw binary data, or any object with a read() method (such as builtin open() file handler function or BytesIO).

engine : {'cudf', 'pyarrow'}, default 'cudf'

Parser engine to use.

columns : list, default None

If not None, only these columns will be read.

row_group : int, default None

If not None, only the row group with the specified index will be read.

skip_rows : int, default None

If not None, the number of rows to skip from the start of the file.

num_rows : int, default None

If not None, the total number of rows to read.

Returns
DataFrame

Examples

>>> import cudf
>>> df = cudf.read_parquet(filename)
>>> df
  num1                datetime text
0  123 2018-11-13T12:00:00.000 5451
1  456 2018-11-14T12:35:01.000 5784
2  789 2018-11-15T18:02:59.000 6117

cudf.io.parquet.read_parquet_metadata(path)

Read a Parquet file’s metadata and schema

Parameters
path : string or path object

Path of file to be read

Returns
Total number of rows
Number of row groups
List of column names

Examples

>>> import cudf
>>> num_rows, num_row_groups, names = cudf.io.read_parquet_metadata(filename)
>>> df = [cudf.read_parquet(filename, row_group=i) for i in range(num_row_groups)]
>>> df = cudf.concat(df)
>>> df
  num1                datetime text
0  123 2018-11-13T12:00:00.000 5451
1  456 2018-11-14T12:35:01.000 5784
2  789 2018-11-15T18:02:59.000 6117

cudf.io.parquet.to_parquet(df, path, *args, **kwargs)

Write a DataFrame to the parquet format.

Parameters
path : str

File path or Root Directory path. Will be used as Root Directory path while writing a partitioned dataset.

compression : {'snappy', 'gzip', 'brotli', None}, default 'snappy'

Name of the compression to use. Use None for no compression.

index : bool, default None

If True, include the dataframe's index(es) in the file output. If False, they will not be written to the file. If None, the engine's default behavior will be used.

partition_cols : list, optional, default None

Column names by which to partition the dataset. Columns are partitioned in the order they are given.

cudf.io.orc.read_orc(filepath_or_buffer, engine='cudf', columns=None, stripe=None, skip_rows=None, num_rows=None, use_index=True)

Load an ORC dataset into a DataFrame

Parameters
filepath_or_buffer : str, path object, bytes, or file-like object

Either a path to a file (a str, pathlib.Path, or py._path.local.LocalPath), URL (including http, ftp, and S3 locations), Python bytes of raw binary data, or any object with a read() method (such as builtin open() file handler function or BytesIO).

engine : {'cudf', 'pyarrow'}, default 'cudf'

Parser engine to use.

columns : list, default None

If not None, only these columns will be read from the file.

stripe : int, default None

If not None, only the stripe with the specified index will be read.

skip_rows : int, default None

If not None, the number of rows to skip from the start of the file.

num_rows : int, default None

If not None, the total number of rows to read.

use_index : bool, default True

If True, use row index if available for faster seeking.

kwargs are passed to the engine.

Returns
DataFrame

Examples

>>> import cudf
>>> df = cudf.read_orc(filename)
>>> df
  num1                datetime text
0  123 2018-11-13T12:00:00.000 5451
1  456 2018-11-14T12:35:01.000 5784
2  789 2018-11-15T18:02:59.000 6117

cudf.io.orc.read_orc_metadata(path)

Read an ORC file’s metadata and schema

Parameters
path : string or path object

Path of file to be read

Returns
Total number of rows
Number of stripes
List of column names

Examples

>>> import cudf
>>> num_rows, stripes, names = cudf.io.read_orc_metadata(filename)
>>> df = [cudf.read_orc(filename, stripe=i) for i in range(stripes)]
>>> df = cudf.concat(df)
>>> df
  num1                datetime text
0  123 2018-11-13T12:00:00.000 5451
1  456 2018-11-14T12:35:01.000 5784
2  789 2018-11-15T18:02:59.000 6117

cudf.io.json.read_json(path_or_buf, engine='auto', dtype=True, lines=False, compression='infer', byte_range=None, *args, **kwargs)

Load a JSON dataset into a DataFrame

Parameters
path_or_buf : str, path object, or file-like object

Either JSON data in a str, path to a file (a str, pathlib.Path, or py._path.local.LocalPath), URL (including http, ftp, and S3 locations), or any object with a read() method (such as builtin open() file handler function or StringIO).

engine : {'auto', 'cudf', 'pandas'}, default 'auto'

Parser engine to use. If 'auto' is passed, the engine will be automatically selected based on the other parameters.

orient : string

Indication of expected JSON string format (pandas engine only). Compatible JSON strings can be produced by to_json() with a corresponding orient value. The set of possible orients is:

  • 'split' : dict like {index -> [index], columns -> [columns], data -> [values]}

  • 'records' : list like [{column -> value}, ... , {column -> value}]

  • 'index' : dict like {index -> {column -> value}}

  • 'columns' : dict like {column -> {index -> value}}

  • 'values' : just the values array

The allowed and default values depend on the value of the typ parameter.

  • when typ == 'series':

    • allowed orients are {'split', 'records', 'index'}

    • default is 'index'

    • The Series index must be unique for orient 'index'.

  • when typ == 'frame':

    • allowed orients are {'split', 'records', 'index', 'columns', 'values', 'table'}

    • default is 'columns'

    • The DataFrame index must be unique for orients 'index' and 'columns'.

    • The DataFrame columns must be unique for orients 'index', 'columns', and 'records'.

    • 'table' is an allowed value for the orient argument.

typ : type of object to recover (series or frame), default 'frame'

With cudf engine, only frame output is supported.

dtype : boolean or dict, default True

If True, infer dtypes; if a dict of column to dtype, then use those; if False, then don't infer dtypes at all. Applies only to the data.

convert_axes : boolean, default True

Try to convert the axes to the proper dtypes (pandas engine only).

convert_dates : boolean, default True

List of columns to parse for dates (pandas engine only); if True, then try to parse datelike columns. The default is True. A column label is datelike if:

  • it ends with '_at',

  • it ends with '_time',

  • it begins with 'timestamp',

  • it is 'modified', or

  • it is 'date'.

keep_default_dates : boolean, default True

If parsing dates, parse the default datelike columns (pandas engine only).

numpy : boolean, default False

Direct decoding to numpy arrays (pandas engine only). Supports numeric data only, but non-numeric column and index labels are supported. Note also that the JSON ordering MUST be the same for each term if numpy=True.

precise_float : boolean, default False

Set to enable usage of higher precision (strtod) function when decoding string to double values (pandas engine only). Default (False) is to use fast but less precise builtin functionality.

date_unit : string, default None

The timestamp unit to detect if converting dates (pandas engine only). The default behaviour is to try and detect the correct precision, but if this is not desired then pass one of 's', 'ms', 'us' or 'ns' to force parsing only seconds, milliseconds, microseconds or nanoseconds.

encoding : str, default 'utf-8'

The encoding to use to decode py3 bytes. With cudf engine, only utf-8 is supported.

lines : boolean, default False

Read the file as a json object per line.

chunksize : integer, default None

Return JsonReader object for iteration (pandas engine only). See the line-delimited json docs for more information on chunksize. This can only be passed if lines=True. If this is None, the file will be read into memory all at once.

compression : {'infer', 'gzip', 'bz2', 'zip', 'xz', None}, default 'infer'

For on-the-fly decompression of on-disk data. If 'infer', then use gzip, bz2, zip or xz if path_or_buf is a string ending in '.gz', '.bz2', '.zip', or 'xz', respectively, and no decompression otherwise. If using 'zip', the ZIP file must contain only one data file to be read in. Set to None for no decompression.

byte_range : list or tuple, default None

Byte range within the input file to be read (cudf engine only). The first number is the offset in bytes, the second number is the range size in bytes. Set the size to zero to read all data after the offset location. Reads the row that starts before or at the end of the range, even if it ends after the end of the range.

Returns
result : Series or DataFrame, depending on the value of typ.

cudf.io.json.to_json(cudf_val, path_or_buf=None, *args, **kwargs)

Convert the cuDF object to a JSON string. Note nulls and NaNs will be converted to null and datetime objects will be converted to UNIX timestamps.

Parameters
path_or_buf : string or file handle, optional

File path or object. If not specified, the result is returned as a string.

orient : string

Indication of expected JSON string format.

  • Series:

    • default is 'index'

    • allowed values are: {'split', 'records', 'index', 'table'}

  • DataFrame:

    • default is 'columns'

    • allowed values are: {'split', 'records', 'index', 'columns', 'values', 'table'}

  • The format of the JSON string:

    • 'split' : dict like {'index' -> [index], 'columns' -> [columns], 'data' -> [values]}

    • 'records' : list like [{column -> value}, … , {column -> value}]

    • 'index' : dict like {index -> {column -> value}}

    • 'columns' : dict like {column -> {index -> value}}

    • 'values' : just the values array

    • 'table' : dict like {'schema': {schema}, 'data': {data}} describing the data, where the data component is like orient='records'.

date_format : {None, 'epoch', 'iso'}

Type of date conversion. 'epoch' = epoch milliseconds, 'iso' = ISO8601. The default depends on the orient. For orient='table', the default is 'iso'. For all other orients, the default is 'epoch'.

double_precision : int, default 10

The number of decimal places to use when encoding floating point values.

force_ascii : bool, default True

Force encoded string to be ASCII.

date_unit : string, default 'ms' (milliseconds)

The time unit to encode to, governs timestamp and ISO8601 precision. One of 's', 'ms', 'us', 'ns' for second, millisecond, microsecond, and nanosecond respectively.

default_handler : callable, default None

Handler to call if object cannot otherwise be converted to a suitable format for JSON. Should receive a single argument which is the object to convert and return a serialisable object.

lines : bool, default False

If 'orient' is 'records' write out line delimited json format. Will throw ValueError if incorrect 'orient' since others are not list like.

compression : {'infer', 'gzip', 'bz2', 'zip', 'xz', None}

A string representing the compression to use in the output file, only used when the first argument is a filename. By default, the compression is inferred from the filename.

index : bool, default True

Whether to include the index values in the JSON string. Not including the index (index=False) is only supported when orient is 'split' or 'table'.

See Also

cudf.io.json.read_json

cudf.io.feather.read_feather(path, *args, **kwargs)

Load a feather object from the file path, returning a DataFrame.

Parameters
path : string

File path

columns : list, default None

If not None, only these columns will be read from the file.

Returns
DataFrame

Examples

>>> import cudf
>>> df = cudf.read_feather(filename)
>>> df
  num1                datetime text
0  123 2018-11-13T12:00:00.000 5451
1  456 2018-11-14T12:35:01.000 5784
2  789 2018-11-15T18:02:59.000 6117

cudf.io.feather.to_feather(df, path, *args, **kwargs)

Write a DataFrame to the feather format.

Parameters
path : str

File path

cudf.io.hdf.read_hdf(path_or_buf, *args, **kwargs)

Read from the store, close it if we opened it.

Retrieve pandas object stored in file, optionally based on where criteria

Parameters
path_or_buf : string, buffer or path object

Path to the file to open, or an open HDFStore object. Supports any object implementing the __fspath__ protocol. This includes pathlib.Path and py._path.local.LocalPath objects.

key : object, optional

The group identifier in the store. Can be omitted if the HDF file contains a single pandas object.

mode : {'r', 'r+', 'a'}, optional

Mode to use when opening the file. Ignored if path_or_buf is a pandas.HDFStore. Default is 'r'.

where : list, optional

A list of Term (or convertible) objects.

start : int, optional

Row number to start selection.

stop : int, optional

Row number to stop selection.

columns : list, optional

A list of columns names to return.

iterator : bool, optional

Return an iterator object.

chunksize : int, optional

Number of rows to include in an iteration when using an iterator.

errors : str, default 'strict'

Specifies how encoding and decoding errors are to be handled. See the errors argument for open() for a full list of options.

**kwargs

Additional keyword arguments passed to HDFStore.

Returns
item : object

The selected object. Return type depends on the object stored.

See Also
cudf.io.hdf.to_hdf : Write a HDF file from a DataFrame.

cudf.io.hdf.to_hdf(path_or_buf, key, value, *args, **kwargs)

Write the contained data to an HDF5 file using HDFStore.

Hierarchical Data Format (HDF) is self-describing, allowing an application to interpret the structure and contents of a file with no outside information. One HDF file can hold a mix of related objects which can be accessed as a group or as individual objects.

In order to add another DataFrame or Series to an existing HDF file please use append mode and a different key.

For more information see the user guide.

Parameters
path_or_buf : str or pandas.HDFStore

File path or HDFStore object.

key : str

Identifier for the group in the store.

mode : {'a', 'w', 'r+'}, default 'a'

Mode to open file:

  • 'w': write, a new file is created (an existing file with the same name would be deleted).

  • 'a': append, an existing file is opened for reading and writing, and if the file does not exist it is created.

  • 'r+': similar to 'a', but the file must already exist.

format : {'fixed', 'table'}, default 'fixed'

Possible values:

  • 'fixed': Fixed format. Fast writing/reading. Not-appendable, nor searchable.

  • 'table': Table format. Write as a PyTables Table structure which may perform worse but allow more flexible operations like searching / selecting subsets of the data.

append : bool, default False

For Table formats, append the input data to the existing.

data_columns : list of columns or True, optional

List of columns to create as indexed data columns for on-disk queries, or True to use all columns. By default only the axes of the object are indexed. See Query via Data Columns. Applicable only to format='table'.

complevel : {0-9}, optional

Specifies a compression level for data. A value of 0 disables compression.

complib : {'zlib', 'lzo', 'bzip2', 'blosc'}, default 'zlib'

Specifies the compression library to be used. As of v0.20.2 these additional compressors for Blosc are supported (default if no compressor specified: 'blosc:blosclz'): {'blosc:blosclz', 'blosc:lz4', 'blosc:lz4hc', 'blosc:snappy', 'blosc:zlib', 'blosc:zstd'}. Specifying a compression library which is not available issues a ValueError.

fletcher32 : bool, default False

If applying compression use the fletcher32 checksum.

dropna : bool, default False

If true, ALL nan rows will not be written to store.

errors : str, default 'strict'

Specifies how encoding and decoding errors are to be handled. See the errors argument for open() for a full list of options.

See also

cudf.io.hdf.read_hdf

Read from HDF file.

cudf.io.parquet.to_parquet

Write a DataFrame to the binary parquet format.

cudf.io.feather.to_feather

Write out feather-format for DataFrames.

GpuArrowReader

class cudf.comm.gpuarrow.GpuArrowReader(schema, dev_ary)

Methods

to_dict(self)

Return a dictionary of Series objects

schema

to_dict(self)

Return a dictionary of Series objects