API Reference

DataFrame

class cudf.dataframe.DataFrame(name_series=None, index=None)

A GPU Dataframe object.

Examples

Build dataframe with __setitem__:

>>> import cudf
>>> df = cudf.DataFrame()
>>> df['key'] = [0, 1, 2, 3, 4]
>>> df['val'] = [float(i + 10) for i in range(5)]  # insert column
>>> print(df)
   key   val
0    0  10.0
1    1  11.0
2    2  12.0
3    3  13.0
4    4  14.0

Build dataframe with initializer:

>>> import cudf
>>> import numpy as np
>>> from datetime import datetime, timedelta
>>> ids = np.arange(5)

Create some datetime data

>>> t0 = datetime.strptime('2018-10-07 12:00:00', '%Y-%m-%d %H:%M:%S')
>>> datetimes = [(t0+ timedelta(seconds=x)) for x in range(5)]
>>> dts = np.array(datetimes, dtype='datetime64')

Create the GPU DataFrame

>>> df = cudf.DataFrame([('id', ids), ('datetimes', dts)])
>>> df
    id                datetimes
0    0  2018-10-07T12:00:00.000
1    1  2018-10-07T12:00:01.000
2    2  2018-10-07T12:00:02.000
3    3  2018-10-07T12:00:03.000
4    4  2018-10-07T12:00:04.000

Convert from a Pandas DataFrame:

>>> import pandas as pd
>>> import cudf
>>> pdf = pd.DataFrame({'a': [0, 1, 2, 3],'b': [0.1, 0.2, None, 0.3]})
>>> df = cudf.from_pandas(pdf)
>>> df
  a b
0 0 0.1
1 1 0.2
2 2 nan
3 3 0.3
Attributes
T
columns

Returns a tuple of columns

dtypes

Return the dtypes in this object.

empty
iloc

Returns an integer-location based indexer for selection by position.

index

Returns the index of the DataFrame

loc

Returns a label-based indexer for row-slicing and column selection.

ndim

Dimension of the data.

shape

Returns a tuple representing the dimensionality of the DataFrame.

Methods

add_column(self, name, data[, forceindex])

Add a column

apply_chunks(self, func, incols, outcols[, …])

Transform user-specified chunks using the user-provided function.

apply_rows(self, func, incols, outcols, kwargs)

Apply a row-wise user defined function.

as_gpu_matrix(self[, columns, order])

Convert to a matrix in device memory.

as_matrix(self[, columns])

Convert to a matrix in host memory.

assign(self, **kwargs)

Assign columns to DataFrame from keyword arguments.

copy(self[, deep])

Returns a copy of this dataframe

count(self)

describe(self[, percentiles, include, exclude])

Compute summary statistics of a DataFrame’s columns.

drop(self, labels[, axis])

Drop column(s)

drop_column(self, name)

Drop a column by name

fillna(self, value[, method, axis, inplace, …])

Fill null values with value.

from_arrow(table)

Convert from a PyArrow Table.

from_gpu_matrix(data[, index, columns, …])

Convert from a numba gpu ndarray.

from_pandas(dataframe[, nan_as_null])

Convert from a Pandas DataFrame.

from_records(data[, index, columns, nan_as_null])

Convert from a numpy recarray or structured array.

groupby(self[, by, sort, as_index, method, …])

Groupby

hash_columns(self[, columns])

Hash the given columns and return a new Series

head(self[, n])

Returns the first n rows as a new DataFrame

iteritems(self)

Iterate over column names and series pairs

join(self, other[, on, how, lsuffix, …])

Join columns with other DataFrame on index or on a key column.

label_encoding(self, column, prefix, cats[, …])

Encode labels in a column with label encoding.

mean(self[, axis, skipna, level, numeric_only])

Return the mean of the values for the requested axis.

melt(self, **kwargs)

Unpivots a DataFrame from wide format to long format, optionally leaving identifier variables set.

merge(self, right[, on, how, left_on, …])

Merge GPU DataFrame objects by performing a database-style join operation by columns or indexes.

nlargest(self, n, columns[, keep])

Get the rows of the DataFrame sorted by the n largest values of columns

nsmallest(self, n, columns[, keep])

Get the rows of the DataFrame sorted by the n smallest values of columns

one_hot_encoding(self, column, prefix, cats)

Expand a column with one-hot-encoding.

partition_by_hash(self, columns, nparts)

Partition the dataframe by the hashed value of data in columns.

pop(self, item)

Return a column and drop it from the DataFrame.

quantile(self[, q, interpolation, columns, …])

Return values at the given quantile.

query(self, expr[, local_dict])

Query with a boolean expression using Numba to compile a GPU kernel.

rename(self[, mapper, columns, copy, inplace])

Alter column labels.

replace(self, to_replace, value)

Replace values given in to_replace with value.

select_dtypes(self[, include, exclude])

Return a subset of the DataFrame’s columns based on the column dtypes.

set_index(self, index)

Return a new DataFrame with a new index

sort_index(self[, ascending])

Sort by the index

sort_values(self, by[, ascending, na_position])

Sort by the values row-wise.

tail(self[, n])

Returns the last n rows as a new DataFrame

to_arrow(self[, preserve_index])

Convert to a PyArrow Table.

to_dlpack(self)

Converts a cuDF object into a DLPack tensor.

to_feather(self, path, *args, **kwargs)

Write a DataFrame to the feather format.

to_gpu_matrix(self)

Convert to a numba gpu ndarray

to_hdf(self, path_or_buf, key, *args, …)

Write the contained data to an HDF5 file using HDFStore.

to_json(self[, path_or_buf])

Convert the cuDF object to a JSON string.

to_pandas(self)

Convert to a Pandas DataFrame.

to_parquet(self, path, *args, **kwargs)

Write a DataFrame to the parquet format.

to_records(self[, index])

Convert to a numpy recarray

to_string(self[, nrows, ncols])

Convert to string

transpose(self)

Transpose index and columns.

acos

argsort

asin

atan

cos

cummax

cummin

cumprod

cumsum

deserialize

equals

exp

log

mask

max

min

product

reset_index

serialize

sin

sqrt

std

sum

take

tan

var

add_column(self, name, data, forceindex=False)

Add a column

Parameters
name : str

Name of column to be added.

data : Series, array-like

Values to be added.

apply_chunks(self, func, incols, outcols, kwargs={}, chunks=None, tpb=1)

Transform user-specified chunks using the user-provided function.

Parameters
func : function

The transformation function that will be executed on the CUDA GPU.

incols: list

A list of names of input columns.

outcols: dict

A dictionary of output column names and their dtype.

kwargs: dict

name-value of extra arguments. These values are passed directly into the function.

chunks : int or Series-like

If it is an int, it is the chunksize. If it is an array, it contains the integer offset for the start of each chunk. The span of the i-th chunk is data[chunks[i] : chunks[i + 1]] for any i + 1 < chunks.size, or data[chunks[i]:] for i == len(chunks) - 1.

tpb : int; optional

It is the number of threads per block for the underlying kernel. The default uses 1 thread to emulate serial execution for each chunk. It is a good starting point but inefficient. Its maximum possible value is limited by the available CUDA GPU resources.

Examples

For tpb > 1, func is executed by tpb number of threads concurrently. To access the thread id and count, use numba.cuda.threadIdx.x and numba.cuda.blockDim.x, respectively (See numba CUDA kernel documentation).

In the example below, the kernel is invoked concurrently on each specified chunk. The kernel computes the corresponding output for the chunk.

By looping over the range range(cuda.threadIdx.x, in1.size, cuda.blockDim.x), the kernel function can be used with any tpb in an efficient manner.

>>> from numba import cuda
>>> @cuda.jit
... def kernel(in1, in2, in3, out1):
...      for i in range(cuda.threadIdx.x, in1.size, cuda.blockDim.x):
...          x = in1[i]
...          y = in2[i]
...          z = in3[i]
...          out1[i] = x * y + z
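
The following is a minimal invocation sketch for the kernel above; the column values, the chunk size of 4, and tpb=8 are illustrative choices, not part of the original example.

>>> import cudf
>>> import numpy as np
>>> df = cudf.DataFrame()
>>> df['in1'] = np.arange(8, dtype=np.float64)
>>> df['in2'] = np.arange(8, dtype=np.float64)
>>> df['in3'] = np.arange(8, dtype=np.float64)
>>> df = df.apply_chunks(kernel,
...                      incols=['in1', 'in2', 'in3'],
...                      outcols=dict(out1=np.float64),
...                      chunks=4,   # two chunks of 4 rows each
...                      tpb=8)      # 8 threads cooperate on each chunk
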
apply_rows(self, func, incols, outcols, kwargs, cache_key=None)

Apply a row-wise user defined function.

Parameters
func : function

The transformation function that will be executed on the CUDA GPU.

incols: list

A list of names of input columns.

outcols: dict

A dictionary of output column names and their dtype.

kwargs: dict

name-value of extra arguments. These values are passed directly into the function.

Examples

The user function should loop over the columns and set the output for each row. Loop execution order is arbitrary, so each iteration of the loop MUST be independent of each other.

When func is invoked, the array args corresponding to the input/output are strided so as to improve GPU parallelism. The loop in the function resembles serial code, but executes concurrently in multiple threads.

>>> import cudf
>>> import numpy as np
>>> df = cudf.DataFrame()
>>> nelem = 3
>>> df['in1'] = np.arange(nelem)
>>> df['in2'] = np.arange(nelem)
>>> df['in3'] = np.arange(nelem)

Define input columns for the kernel

>>> in1 = df['in1']
>>> in2 = df['in2']
>>> in3 = df['in3']
>>> def kernel(in1, in2, in3, out1, out2, kwarg1, kwarg2):
...     for i, (x, y, z) in enumerate(zip(in1, in2, in3)):
...         out1[i] = kwarg2 * x - kwarg1 * y
...         out2[i] = y - kwarg1 * z

Call .apply_rows with the name of the input columns, the name and dtype of the output columns, and, optionally, a dict of extra arguments.

>>> df.apply_rows(kernel,
...               incols=['in1', 'in2', 'in3'],
...               outcols=dict(out1=np.float64, out2=np.float64),
...               kwargs=dict(kwarg1=3, kwarg2=4))
   in1  in2  in3 out1 out2
0    0    0    0  0.0  0.0
1    1    1    1  1.0 -2.0
2    2    2    2  2.0 -4.0
as_gpu_matrix(self, columns=None, order='F')

Convert to a matrix in device memory.

Parameters
columns : sequence of str

List of column names to be extracted. The order is preserved. If None is specified, all columns are used.

order : 'F' or 'C'

Optional argument to determine whether to return a column major (Fortran) matrix or a row major (C) matrix.

Returns
A (nrow x ncol) numba device ndarray in 'F' order.
as_matrix(self, columns=None)

Convert to a matrix in host memory.

Parameters
columns : sequence of str

List of column names to be extracted. The order is preserved. If None is specified, all columns are used.

Returns
A (nrow x ncol) numpy ndarray in “F” order.
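
A small usage sketch (the column values are illustrative): as_matrix returns a host numpy array, while as_gpu_matrix returns a numba device ndarray of the same shape.

>>> import cudf
>>> df = cudf.DataFrame()
>>> df['a'] = [1.0, 2.0, 3.0]
>>> df['b'] = [4.0, 5.0, 6.0]
>>> host_mat = df.as_matrix()       # numpy.ndarray in host memory
>>> host_mat.shape
(3, 2)
>>> dev_mat = df.as_gpu_matrix()    # numba device ndarray in GPU memory
>>> dev_mat.shape
(3, 2)
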
assign(self, **kwargs)

Assign columns to DataFrame from keyword arguments.

Examples

>>> import cudf
>>> df = cudf.DataFrame()
>>> df = df.assign(a=[0, 1, 2], b=[3, 4, 5])
>>> print(df)
   a  b
0  0  3
1  1  4
2  2  5
columns

Returns a tuple of columns

copy(self, deep=True)

Returns a copy of this dataframe

Parameters
deep: bool

Make a full copy of Series columns and Index at the GPU level, or create a new allocation with references.

describe(self, percentiles=None, include=None, exclude=None)

Compute summary statistics of a DataFrame’s columns. For numeric data, the output includes the minimum, maximum, mean, median, standard deviation, and various quantiles. For object data, the output includes the count, number of unique values, the most common value, and the number of occurrences of the most common value.

Parameters
percentiles : list-like, optional

The percentiles used to generate the output summary statistics. If None, the default percentiles used are the 25th, 50th and 75th. Values should be within the interval [0, 1].

include: str, list-like, optional

The dtypes to be included in the output summary statistics. Columns of dtypes not included in this list will not be part of the output. If include=’all’, all dtypes are included. Default of None includes all numeric columns.

exclude: str, list-like, optional

The dtypes to be excluded from the output summary statistics. Columns of dtypes included in this list will not be part of the output. Default of None excludes no columns.

Returns
output_frame : DataFrame

Summary statistics of relevant columns in the original dataframe.

Examples

Describing a Series containing numeric values.

>>> import cudf
>>> s = cudf.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
>>> print(s.describe())
   stats    values
0  count      10.0
1   mean       5.5
2    std   3.02765
3    min       1.0
4    25%       2.5
5    50%       5.5
6    75%       7.5
7    max      10.0

Describing a DataFrame. By default all numeric fields are returned.

>>> gdf = cudf.DataFrame()
>>> gdf['a'] = [1, 2, 3]
>>> gdf['b'] = [1.0, 2.0, 3.0]
>>> gdf['c'] = ['x', 'y', 'z']
>>> gdf['d'] = [1.0, 2.0, 3.0]
>>> gdf['d'] = gdf['d'].astype('float32')
>>> print(gdf.describe())
   stats    a    b    d
0  count  3.0  3.0  3.0
1   mean  2.0  2.0  2.0
2    std  1.0  1.0  1.0
3    min  1.0  1.0  1.0
4    25%  1.5  1.5  1.5
5    50%  1.5  1.5  1.5
6    75%  2.5  2.5  2.5
7    max  3.0  3.0  3.0

Using the include keyword to describe only specific dtypes.

>>> gdf = cudf.DataFrame()
>>> gdf['a'] = [1, 2, 3]
>>> gdf['b'] = [1.0, 2.0, 3.0]
>>> gdf['c'] = ['x', 'y', 'z']
>>> print(gdf.describe(include='int'))
   stats    a
0  count  3.0
1   mean  2.0
2    std  1.0
3    min  1.0
4    25%  1.5
5    50%  1.5
6    75%  2.5
7    max  3.0

drop(self, labels, axis=None)

Drop column(s)

Parameters
labels : str or sequence of strings

Name of column(s) to be dropped.

Returns
A dataframe without dropped column(s)

Examples

>>> import cudf
>>> df = cudf.DataFrame()
>>> df['key'] = [0, 1, 2, 3, 4]
>>> df['val'] = [float(i + 10) for i in range(5)]
>>> df_new = df.drop('val')
>>> print(df)
   key   val
0    0  10.0
1    1  11.0
2    2  12.0
3    3  13.0
4    4  14.0
>>> print(df_new)
   key
0    0
1    1
2    2
3    3
4    4
drop_column(self, name)

Drop a column by name

dtypes

Return the dtypes in this object.

fillna(self, value, method=None, axis=None, inplace=False, limit=None)

Fill null values with value.

Parameters
value : scalar, Series-like or dict

Value to use to fill nulls. If Series-like, null values are filled with values in corresponding indices. A dict can be used to provide different values to fill nulls in different columns.

Returns
result : DataFrame

Copy with nulls filled.

Examples

>>> import cudf
>>> gdf = cudf.DataFrame({'a': [1, 2, None], 'b': [3, None, 5]})
>>> gdf.fillna(4).to_pandas()
a  b
0  1  3
1  2  4
2  4  5
>>> gdf.fillna({'a': 3, 'b': 4}).to_pandas()
a  b
0  1  3
1  2  4
2  3  5
classmethod from_arrow(table)

Convert from a PyArrow Table.

Raises
TypeError for invalid input type.
Notes
Does not support automatically setting index column(s) similar to how
to_pandas works for PyArrow Tables.

Examples

>>> import pyarrow as pa
>>> import cudf
>>> data = [pa.array([1, 2, 3]), pa.array([4, 5, 6])]
>>> batch = pa.RecordBatch.from_arrays(data, ['f0', 'f1'])
>>> table = pa.Table.from_batches([batch])
>>> cudf.DataFrame.from_arrow(table)
<cudf.DataFrame ncols=2 nrows=3 >
classmethod from_gpu_matrix(data, index=None, columns=None, nan_as_null=False)

Convert from a numba gpu ndarray.

Parameters
data : numba gpu ndarray
index : str

The name of the index column in data. If None, the default index is used.

columns : list of str

List of column names to include.

Returns
DataFrame
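
A rough sketch of building a DataFrame from a device array; the Fortran-ordered host array below is an illustrative assumption, used so each column slice is contiguous.

>>> import numpy as np
>>> import cudf
>>> from numba import cuda
>>> host = np.asfortranarray(np.arange(6, dtype=np.float64).reshape(3, 2))
>>> dev = cuda.to_device(host)      # 3x2 matrix on the GPU
>>> df = cudf.DataFrame.from_gpu_matrix(dev, columns=['a', 'b'])
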
classmethod from_pandas(dataframe, nan_as_null=True)

Convert from a Pandas DataFrame.

Raises
TypeError for invalid input type.

Examples

>>> import cudf
>>> import pandas as pd
>>> data = [[0,1], [1,2], [3,4]]
>>> pdf = pd.DataFrame(data, columns=['a', 'b'], dtype=int)
>>> cudf.from_pandas(pdf)
<cudf.DataFrame ncols=2 nrows=3 >
classmethod from_records(data, index=None, columns=None, nan_as_null=False)

Convert from a numpy recarray or structured array.

Parameters
data : numpy structured dtype or recarray of ndim=2
index : str

The name of the index column in data. If None, the default index is used.

columns : list of str

List of column names to include.

Returns
DataFrame
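
For example, with a small structured array (the field names and values are illustrative):

>>> import numpy as np
>>> import cudf
>>> rec = np.array([(1, 10.0), (2, 20.0), (3, 30.0)],
...                dtype=[('x', 'i8'), ('y', 'f8')])
>>> df = cudf.DataFrame.from_records(rec)   # columns 'x' and 'y'
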
groupby(self, by=None, sort=False, as_index=True, method='hash', level=None, group_keys=True)

Groupby

Parameters
by : list-of-str or str

Column name(s) to group by.

sort : bool

Force sorting group keys. Depends on the underlying algorithm.

as_index : bool; defaults to False

Must be False. Provided to be API compatible with pandas. The keys are always left as regular columns in the result.

method : str, optional

A string indicating the method to use to perform the group by. Valid values are “hash” or “cudf”. “cudf” method may be deprecated in the future, but is currently the only method supporting group UDFs via the apply function.

Returns
The groupby object

Notes

Unlike pandas, this groupby operation behaves like a SQL groupby. No empty rows are returned. (For categorical keys, pandas returns rows for all categories even if there are no corresponding values.)

Only a minimal number of operations is implemented so far.

  • Only by argument is supported.

  • Since we don’t support multiindex, the by columns are stored as regular columns.
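
A minimal grouping sketch (illustrative data; the mean aggregation on the returned groupby object is an assumption, not shown in the summary above):

>>> import cudf
>>> df = cudf.DataFrame()
>>> df['key'] = [0, 0, 1, 1, 2]
>>> df['val'] = [1.0, 2.0, 3.0, 4.0, 5.0]
>>> grouped = df.groupby(['key'])
>>> result = grouped.mean()   # one row of mean 'val' per distinct key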

hash_columns(self, columns=None)

Hash the given columns and return a new Series

Parameters
columns : sequence of str; optional

Sequence of column names. If columns is None (unspecified), all columns in the frame are used.

head(self, n=5)

Returns the first n rows as a new DataFrame

Examples

>>> import cudf
>>> df = cudf.DataFrame()
>>> df['key'] = [0, 1, 2, 3, 4]
>>> df['val'] = [float(i + 10) for i in range(5)]  # insert column
>>> print(df.head(2))
   key   val
0    0  10.0
1    1  11.0
iloc

Returns an integer-location based indexer for selection by position.

Examples

>>> df = DataFrame([('a', list(range(20))),
...                 ('b', list(range(20))),
...                 ('c', list(range(20)))])
>>> df.iloc[1]  # get the row at index 1
a    1
b    1
c    1
>>> df.iloc[[0, 2, 9, 18]]  # get the rows from indices 0,2,9 and 18.
      a    b    c
 0    0    0    0
 2    2    2    2
 9    9    9    9
18   18   18   18
>>> df.iloc[3:10:2]  # get the rows using slice indices
     a    b    c
3    3    3    3
5    5    5    5
7    7    7    7
9    9    9    9
index

Returns the index of the DataFrame

iteritems(self)

Iterate over column names and series pairs

join(self, other, on=None, how='left', lsuffix='', rsuffix='', sort=False, type='', method='hash')

Join columns with other DataFrame on index or on a key column.

Parameters
other : DataFrame
how : str

Only accepts “left”, “right”, “inner”, “outer”

lsuffix, rsuffix : str

The suffixes to add to the left (lsuffix) and right (rsuffix) column names when avoiding conflicts.

sort : bool

Set to True to ensure sorted ordering.

Returns
joined : DataFrame

Notes

Difference from pandas:

  • other must be a single DataFrame for now.

  • on is not supported yet due to lack of multi-index support.

label_encoding(self, column, prefix, cats, prefix_sep='_', dtype=None, na_sentinel=-1)

Encode labels in a column with label encoding.

Parameters
column : str

the source column with binary encoding for the data.

prefix : str

the new column name prefix.

cats : sequence of ints

the sequence of categories as integers.

prefix_sep : str

the separator between the prefix and the category.

dtype :

the dtype for the outputs; see Series.label_encoding

na_sentinel : number

Value to indicate missing category.

Returns
A new dataframe with a new column appended for the coded values.
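
A small sketch, assuming integer category values; the encoded column is added to the result under a name derived from prefix.

>>> import cudf
>>> df = cudf.DataFrame()
>>> df['species'] = [10, 20, 20, 30]
>>> cats = [10, 20, 30]
>>> df = df.label_encoding(column='species', prefix='species_code', cats=cats)
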
loc

Returns a label-based indexer for row-slicing and column selection.

Examples

>>> df = DataFrame([('a', list(range(20))),
...                 ('b', list(range(20))),
...                 ('c', list(range(20)))])

Get the row by index label from ‘a’ and ‘b’ columns

>>> df.loc[0, ['a', 'b']]
a    0
b    0

Get rows from index 2 to index 5 from ‘a’ and ‘b’ columns.

>>> df.loc[2:5, ['a', 'b']]
   a  b
2  2  2
3  3  3
4  4  4
5  5  5

Get every 3rd row from index 2 to 10 from 'a' and 'b' columns.

>>> df.loc[2:10:3, ['a', 'b']]
    a    b
2   2    2
5   5    5
8   8    8
mean(self, axis=None, skipna=None, level=None, numeric_only=None, **kwargs)

Return the mean of the values for the requested axis.

Parameters
axis : {index (0), columns (1)}

Axis for the function to be applied on.

skipna : bool, default True

Exclude NA/null values when computing the result.

level : int or level name, default None

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.

numeric_only : bool, default None

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

**kwargs

Additional keyword arguments to be passed to the function.

Returns
mean : Series or DataFrame (if level specified)
melt(self, **kwargs)

Unpivots a DataFrame from wide format to long format, optionally leaving identifier variables set.

Parameters
frame : DataFrame
id_vars : tuple, list, or ndarray, optional

Column(s) to use as identifier variables. default: None

value_vars : tuple, list, or ndarray, optional

Column(s) to unpivot. default: all columns that are not set as id_vars.

var_name : scalar

Name to use for the variable column. default: frame.columns.name or ‘variable’

value_name : str

Name to use for the value column. default: ‘value’

Returns
out : DataFrame

Melted result

merge(self, right, on=None, how='inner', left_on=None, right_on=None, left_index=False, right_index=False, lsuffix=None, rsuffix=None, type='', method='hash', indicator=False, suffixes=('_x', '_y'))

Merge GPU DataFrame objects by performing a database-style join operation by columns or indexes.

Parameters
right : DataFrame
on : label or list; defaults to None

Column or index level names to join on. These must be found in both DataFrames.

If on is None and not merging on indexes then this defaults to the intersection of the columns in both DataFrames.

left_on : label or list, or array-like

Column or index level names to join on in the left DataFrame. Can also be an array or list of arrays of the length of the left DataFrame. These arrays are treated as if they are columns.

right_on : label or list, or array-like

Column or index level names to join on in the right DataFrame. Can also be an array or list of arrays of the length of the right DataFrame. These arrays are treated as if they are columns.

left_index : bool, default False

Use the index from the left DataFrame as the join key(s).

right_index : bool, default False

Use the index from the right DataFrame as the join key.

how : str, defaults to 'left'

Only accepts 'left'. left: use only keys from the left frame, similar to a SQL left outer join; preserves key order.

suffixes: Tuple[str, str], defaults to (‘_x’, ‘_y’)

Suffixes applied to overlapping column names on the left and right sides

type : str, defaults to 'hash'
Returns
merged : DataFrame

Examples

>>> import cudf
>>> df_a = cudf.DataFrame()
>>> df_a['key'] = [0, 1, 2, 3, 4]
>>> df_a['vals_a'] = [float(i + 10) for i in range(5)]
>>> df_b = cudf.DataFrame()
>>> df_b['key'] = [1, 2, 4]
>>> df_b['vals_b'] = [float(i+10) for i in range(3)]
>>> df_merged = df_a.merge(df_b, on=['key'], how='left')
>>> df_merged.sort_values('key')  # doctest: +SKIP
   key  vals_a  vals_b
3    0    10.0
0    1    11.0    10.0
1    2    12.0    11.0
4    3    13.0
2    4    14.0    12.0
ndim

Dimension of the data. DataFrame ndim is always 2.

nlargest(self, n, columns, keep='first')

Get the rows of the DataFrame sorted by the n largest values of columns

Notes

Difference from pandas: only a single column is supported in columns.

nsmallest(self, n, columns, keep='first')

Get the rows of the DataFrame sorted by the n smallest values of columns

Difference from pandas: only a single column is supported in columns.
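
For example (illustrative data; a single column name is passed, per the note above):

>>> import cudf
>>> df = cudf.DataFrame()
>>> df['key'] = [0, 1, 2, 3, 4]
>>> df['val'] = [7.0, 1.0, 9.0, 3.0, 5.0]
>>> top2 = df.nlargest(2, 'val')        # rows with the two largest 'val' values
>>> bottom2 = df.nsmallest(2, 'val')    # rows with the two smallest 'val' values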

one_hot_encoding(self, column, prefix, cats, prefix_sep='_', dtype='float64')

Expand a column with one-hot-encoding.

Parameters
column : str

the source column with binary encoding for the data.

prefix : str

the new column name prefix.

cats : sequence of ints

the sequence of categories as integers.

prefix_sep : str

the separator between the prefix and the category.

dtype :

the dtype for the outputs; defaults to float64.

Returns
A new dataframe with new columns appended for each category.

Examples

>>> import pandas as pd
>>> import cudf
>>> pet_owner = [1, 2, 3, 4, 5]
>>> pet_type = ['fish', 'dog', 'fish', 'bird', 'fish']
>>> df = pd.DataFrame({'pet_owner': pet_owner, 'pet_type': pet_type})
>>> df.pet_type = df.pet_type.astype('category')

Create a column with numerically encoded category values

>>> df['pet_codes'] = df.pet_type.cat.codes
>>> gdf = cudf.from_pandas(df)

Create the list of category codes to use in the encoding

>>> codes = gdf.pet_codes.unique()
>>> gdf.one_hot_encoding('pet_codes', 'pet_dummy', codes).head()
  pet_owner  pet_type  pet_codes  pet_dummy_0  pet_dummy_1  pet_dummy_2
0         1      fish          2          0.0          0.0          1.0
1         2       dog          1          0.0          1.0          0.0
2         3      fish          2          0.0          0.0          1.0
3         4      bird          0          1.0          0.0          0.0
4         5      fish          2          0.0          0.0          1.0
partition_by_hash(self, columns, nparts)

Partition the dataframe by the hashed value of data in columns.

Parameters
columns : sequence of str

The names of the columns to be hashed. Must have at least one name.

nparts : int

Number of output partitions

Returns
partitioned: list of DataFrame
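
A short sketch (illustrative data); rows with equal values in the hashed columns land in the same partition.

>>> import cudf
>>> df = cudf.DataFrame()
>>> df['key'] = [0, 1, 2, 3, 4, 5]
>>> df['val'] = [float(i) for i in range(6)]
>>> parts = df.partition_by_hash(['key'], nparts=3)   # list of 3 DataFrames
>>> len(parts)
3
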
pop(self, item)

Return a column and drop it from the DataFrame.

quantile(self, q=0.5, interpolation='linear', columns=None, exact=True)

Return values at the given quantile.

Parameters
q : float or array-like

0 <= q <= 1, the quantile(s) to compute

interpolation : {'linear', 'lower', 'higher', 'midpoint', 'nearest'}

This parameter specifies the interpolation method to use, when the desired quantile lies between two data points i and j. Default ‘linear’.

columns : list of str

List of column names to include.

exact : boolean

Whether to use approximate or exact quantile algorithm.

Returns
DataFrame
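
For example (illustrative data):

>>> import cudf
>>> df = cudf.DataFrame()
>>> df['a'] = [0.0, 1.0, 2.0, 3.0]
>>> df['b'] = [4.0, 5.0, 6.0, 7.0]
>>> med = df.quantile(0.5)                # median of each numeric column
>>> quartiles = df.quantile([0.25, 0.75])
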
query(self, expr, local_dict={})

Query with a boolean expression using Numba to compile a GPU kernel.

See pandas.DataFrame.query.

Parameters
expr : str

A boolean expression. Names in expression refer to columns.

Names starting with @ refer to Python variables

local_dict : dict

A dict containing the local variables to be used in the query.

Returns
filtered : DataFrame

Examples

>>> import cudf
>>> a = ('a', [1, 2, 2])
>>> b = ('b', [3, 4, 5])
>>> df = cudf.DataFrame([a, b])
>>> expr = "(a == 2 and b == 4) or (b == 3)"
>>> print(df.query(expr))
   a  b
0  1  3
1  2  4

DateTime conditionals:

>>> import numpy as np
>>> import datetime
>>> df = cudf.DataFrame()
>>> data = np.array(['2018-10-07', '2018-10-08'], dtype='datetime64')
>>> df['datetimes'] = data
>>> search_date = datetime.datetime.strptime('2018-10-08', '%Y-%m-%d')
>>> print(df.query('datetimes==@search_date'))
                datetimes
1 2018-10-08T00:00:00.000

Using local_dict:

>>> import numpy as np
>>> import datetime
>>> df = cudf.DataFrame()
>>> data = np.array(['2018-10-07', '2018-10-08'], dtype='datetime64')
>>> df['datetimes'] = data
>>> search_date2 = datetime.datetime.strptime('2018-10-08', '%Y-%m-%d')
>>> print(df.query('datetimes==@search_date',
...                local_dict={'search_date': search_date2}))
                datetimes
1 2018-10-08T00:00:00.000
rename(self, mapper=None, columns=None, copy=True, inplace=False)

Alter column labels.

Function / dict values must be unique (1-to-1). Labels not contained in a dict / Series will be left as-is. Extra labels listed don’t throw an error.

Parameters
mapper, columns : dict-like or function, optional

dict-like or function transformations to apply to the column axis' values.

copy : boolean, default True

Also copy underlying data

inplace: boolean, default False

Return new DataFrame. If True, assign columns without copy.

Returns
DataFrame

Notes

Difference from pandas:
  • Support axis=’columns’ only.

  • Not supporting: index, level
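
A small sketch of both mapper styles described above (the column names are illustrative):

>>> import cudf
>>> df = cudf.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
>>> renamed = df.rename(columns={'a': 'x'})   # dict-like mapper
>>> upper = df.rename(str.upper)              # function mapper over column labels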

replace(self, to_replace, value)

Replace values given in to_replace with value.

Parameters
to_replace : numeric, str, list-like or dict

Value(s) to replace.

  • numeric or str:

    • values equal to to_replace will be replaced with value

  • list of numeric or str:

    • If value is also list-like, to_replace and value must be of same length.

  • dict:

Dicts can be used to replace different values in different columns. For example, {'a': 1, 'z': 2} specifies that the value 1 in column a and the value 2 in column z should be replaced with value.

value : numeric, str, list-like, or dict

Value(s) to replace to_replace with. If a dict is provided, then its keys must match the keys in to_replace, and corresponding values must be compatible (e.g., if they are lists, then they must match in length).

Returns
result : DataFrame

DataFrame after replacement.
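
For example (illustrative values), covering the scalar and list forms described above:

>>> import cudf
>>> df = cudf.DataFrame({'a': [0, 1, 2], 'b': [0, 2, 4]})
>>> df.replace(0, 5)                # every 0 becomes 5
>>> df.replace([1, 2], [10, 20])    # list-to-list, matched by position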

select_dtypes(self, include=None, exclude=None)

Return a subset of the DataFrame’s columns based on the column dtypes.

Parameters
include : str or list

which columns to include based on dtypes

exclude : str or list

which columns to exclude based on dtypes
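
For example (illustrative column dtypes):

>>> import cudf
>>> df = cudf.DataFrame({'a': [1, 2, 3], 'b': [1.0, 2.0, 3.0]})
>>> floats_only = df.select_dtypes(include=['float64'])   # keeps column 'b'
>>> no_ints = df.select_dtypes(exclude=['int64'])         # drops column 'a'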

set_index(self, index)

Return a new DataFrame with a new index

Parameters
index : Index, Series-convertible, or str

Index : the new index. Series-convertible : values for the new index. str : name of column to be used as series
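
For example, using a column name as described above (illustrative data):

>>> import cudf
>>> df = cudf.DataFrame({'a': [1, 2, 3], 'b': [10.0, 20.0, 30.0]})
>>> indexed = df.set_index('a')   # column 'a' becomes the index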

shape

Returns a tuple representing the dimensionality of the DataFrame.

sort_index(self, ascending=True)

Sort by the index

sort_values(self, by, ascending=True, na_position='last')

Sort by the values row-wise.

Parameters
by : str or list of str

Name or list of names to sort by.

ascending : bool or list of bool, default True

Sort ascending vs. descending. Specify list for multiple sort orders. If this is a list of bools, must match the length of the by.

na_position : {'first', 'last'}, default 'last'

‘first’ puts nulls at the beginning, ‘last’ puts nulls at the end

Returns
sorted_obj : cuDF DataFrame

Notes

Difference from pandas:
  • Support axis=’index’ only.

  • Not supporting: inplace, kind

Examples

>>> import cudf
>>> a = ('a', [0, 1, 2])
>>> b = ('b', [-3, 2, 0])
>>> df = cudf.DataFrame([a, b])
>>> print(df.sort_values('b'))
   a  b
0  0 -3
2  2  0
1  1  2
tail(self, n=5)

Returns the last n rows as a new DataFrame

Examples

>>> import cudf
>>> df = cudf.DataFrame()
>>> df['key'] = [0, 1, 2, 3, 4]
>>> df['val'] = [float(i + 10) for i in range(5)]  # insert column
>>> print(df.tail(2))
   key   val
3    3  13.0
4    4  14.0
to_arrow(self, preserve_index=True)

Convert to a PyArrow Table.

Examples

>>> import cudf
>>> a = ('a', [0, 1, 2])
>>> b = ('b', [-3, 2, 0])
>>> df = cudf.DataFrame([a, b])
>>> df.to_arrow()
pyarrow.Table
None: int64
a: int64
b: int64
to_dlpack(self)

Converts a cuDF object into a DLPack tensor.

DLPack is an open-source memory tensor structure: dmlc/dlpack.

This function takes a cuDF object and converts it to a PyCapsule object which contains a pointer to a DLPack tensor. This function deep copies the data into the DLPack tensor from the cuDF object.

Parameters
cudf_obj : DataFrame, Series, Index, or Column
Returns
pycapsule_obj : PyCapsule

Output DLPack tensor pointer which is encapsulated in a PyCapsule object.

to_feather(self, path, *args, **kwargs)

Write a DataFrame to the feather format.

Parameters
path : str

File path

to_gpu_matrix(self)

Convert to a numba gpu ndarray

Returns
numba gpu ndarray
to_hdf(self, path_or_buf, key, *args, **kwargs)

Write the contained data to an HDF5 file using HDFStore.

Hierarchical Data Format (HDF) is self-describing, allowing an application to interpret the structure and contents of a file with no outside information. One HDF file can hold a mix of related objects which can be accessed as a group or as individual objects.

In order to add another DataFrame or Series to an existing HDF file please use append mode and a different key.

For more information see the user guide.

Parameters
path_or_buf : str or pandas.HDFStore

File path or HDFStore object.

key : str

Identifier for the group in the store.

mode : {'a', 'w', 'r+'}, default 'a'

Mode to open file:

  • ‘w’: write, a new file is created (an existing file with the same name would be deleted).

  • ‘a’: append, an existing file is opened for reading and writing, and if the file does not exist it is created.

  • ‘r+’: similar to ‘a’, but the file must already exist.

format : {'fixed', 'table'}, default 'fixed'

Possible values:

  • 'fixed': Fixed format. Fast writing/reading. Not-appendable, nor searchable.

  • 'table': Table format. Write as a PyTables Table structure which may perform worse but allow more flexible operations like searching / selecting subsets of the data.

append : bool, default False

For Table formats, append the input data to the existing.

data_columns : list of columns or True, optional

List of columns to create as indexed data columns for on-disk queries, or True to use all columns. By default only the axes of the object are indexed. See Query via Data Columns. Applicable only to format=’table’.

complevel : {0-9}, optional

Specifies a compression level for data. A value of 0 disables compression.

complib : {'zlib', 'lzo', 'bzip2', 'blosc'}, default 'zlib'

Specifies the compression library to be used. As of v0.20.2 these additional compressors for Blosc are supported (default if no compressor specified: ‘blosc:blosclz’): {‘blosc:blosclz’, ‘blosc:lz4’, ‘blosc:lz4hc’, ‘blosc:snappy’, ‘blosc:zlib’, ‘blosc:zstd’}. Specifying a compression library which is not available issues a ValueError.

fletcher32 : bool, default False

If applying compression use the fletcher32 checksum.

dropna : bool, default False

If true, ALL nan rows will not be written to store.

errors : str, default 'strict'

Specifies how encoding and decoding errors are to be handled. See the errors argument for open() for a full list of options.

See also

cudf.io.hdf.read_hdf

Read from HDF file.

cudf.io.parquet.to_parquet

Write a DataFrame to the binary parquet format.

cudf.io.feather.to_feather

Write out feather-format for DataFrames.

to_json(self, path_or_buf=None, *args, **kwargs)

Convert the cuDF object to a JSON string. Note nulls and NaNs will be converted to null and datetime objects will be converted to UNIX timestamps.

Parameters
path_or_buf : string or file handle, optional

File path or object. If not specified, the result is returned as a string.

orient : string

Indication of expected JSON string format.

  • Series
    • default is 'index'
    • allowed values are: {'split', 'records', 'index', 'table'}

  • DataFrame
    • default is 'columns'
    • allowed values are: {'split', 'records', 'index', 'columns', 'values', 'table'}

  • The format of the JSON string
    • 'split' : dict like {'index' -> [index], 'columns' -> [columns], 'data' -> [values]}
    • 'records' : list like [{column -> value}, … , {column -> value}]
    • 'index' : dict like {index -> {column -> value}}
    • 'columns' : dict like {column -> {index -> value}}
    • 'values' : just the values array
    • 'table' : dict like {'schema': {schema}, 'data': {data}} describing the data, and the data component is like orient='records'.

date_format : {None, 'epoch', 'iso'}

Type of date conversion. ‘epoch’ = epoch milliseconds, ‘iso’ = ISO8601. The default depends on the orient. For orient='table', the default is ‘iso’. For all other orients, the default is ‘epoch’.

double_precision : int, default 10

The number of decimal places to use when encoding floating point values.

force_ascii : bool, default True

Force encoded string to be ASCII.

date_unit : string, default 'ms' (milliseconds)

The time unit to encode to, governs timestamp and ISO8601 precision. One of ‘s’, ‘ms’, ‘us’, ‘ns’ for second, millisecond, microsecond, and nanosecond respectively.

default_handler : callable, default None

Handler to call if object cannot otherwise be converted to a suitable format for JSON. Should receive a single argument which is the object to convert and return a serialisable object.

lines : bool, default False

If ‘orient’ is ‘records’ write out line delimited json format. Will throw ValueError if incorrect ‘orient’ since others are not list like.

compression : {'infer', 'gzip', 'bz2', 'zip', 'xz', None}

A string representing the compression to use in the output file, only used when the first argument is a filename. By default, the compression is inferred from the filename.

index : bool, default True

Whether to include the index values in the JSON string. Not including the index (index=False) is only supported when orient is ‘split’ or ‘table’.

See also

cudf.io.json.read_json
to_pandas(self)

Convert to a Pandas DataFrame.

Examples

>>> import cudf
>>> a = ('a', [0, 1, 2])
>>> b = ('b', [-3, 2, 0])
>>> df = cudf.DataFrame([a, b])
>>> type(df.to_pandas())
<class 'pandas.core.frame.DataFrame'>
to_parquet(self, path, *args, **kwargs)

Write a DataFrame to the parquet format.

Parameters
path : str

File path or Root Directory path. Will be used as Root Directory path while writing a partitioned dataset.

compression : {'snappy', 'gzip', 'brotli', None}, default 'snappy'

Name of the compression to use. Use None for no compression.

index : bool, default None

If True, include the dataframe’s index(es) in the file output. If False, they will not be written to the file. If None, the engine’s default behavior will be used.

partition_cols : list, optional, default None

Column names by which to partition the dataset. Columns are partitioned in the order they are given.

to_records(self, index=True)

Convert to a numpy recarray

Parameters
index : bool

Whether to include the index in the output.

Returns
numpy recarray
to_string(self, nrows=NOTSET, ncols=NOTSET)

Convert to string

Parameters
nrows : int

Maximum number of rows to show. If it is None, all rows are shown.

ncols : int

Maximum number of columns to show. If it is None, all columns are shown.

Examples

>>> import cudf
>>> df = cudf.DataFrame()
>>> df['key'] = [0, 1, 2]
>>> df['val'] = [float(i + 10) for i in range(3)]
>>> df.to_string()
'   key   val\n0    0  10.0\n1    1  11.0\n2    2  12.0'
transpose(self)

Transpose index and columns.

Returns
a new (ncol x nrow) dataframe. self is (nrow x ncol)

Notes

Difference from pandas: Not supporting copy because default and only behaviour is copy=True

cudf.multi.concat(objs, axis=0, ignore_index=False)

Concatenate DataFrames, Series, or Indices row-wise.

Parameters
objs : list of DataFrame, Series, or Index
axis : concatenation axis, 0 - index, 1 - columns
ignore_index : bool

Set True to ignore the index of the objs and provide a default range index instead.

Returns
A new object of like type with rows from each object in objs.
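
A minimal sketch; cudf.concat is assumed here to be the public alias of cudf.multi.concat.

>>> import cudf
>>> df1 = cudf.DataFrame({'a': [0, 1]})
>>> df2 = cudf.DataFrame({'a': [2, 3]})
>>> out = cudf.concat([df1, df2], ignore_index=True)   # 4 rows, default range index
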
cudf.reshape.general.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name='value')

Unpivots a DataFrame from wide format to long format, optionally leaving identifier variables set.

Parameters
frame : DataFrame
id_vars : tuple, list, or ndarray, optional

Column(s) to use as identifier variables. default: None

value_vars : tuple, list, or ndarray, optional

Column(s) to unpivot. default: all columns that are not set as id_vars.

var_name : scalar

Name to use for the variable column. default: frame.columns.name or ‘variable’

value_name : str

Name to use for the value column. default: ‘value’

Returns
out : DataFrame

Melted result

Difference from pandas:
  • Does not support ‘col_level’ because cuDF does not have multi-index

Examples

>>> import cudf
>>> import numpy as np
>>> df = cudf.DataFrame({'A': {0: 1, 1: 1, 2: 5},
...                      'B': {0: 1, 1: 3, 2: 6},
...                      'C': {0: 1.0, 1: np.nan, 2: 4.0},
...                      'D': {0: 2.0, 1: 5.0, 2: 6.0}})
>>> cudf.melt(frame=df, id_vars=['A', 'B'], value_vars=['C', 'D'])
     A    B variable value
0    1    1        C   1.0
1    1    3        C
2    5    6        C   4.0
3    1    1        D   2.0
4    1    3        D   5.0
5    5    6        D   6.0

Series

class cudf.dataframe.series.Series(data=None, index=None, name=None, nan_as_null=True, dtype=None)

Data and null-masks.

Series objects are used as columns of DataFrame.

Attributes
cat
data

The gpu buffer for the data

dt
dtype

dtype of the Series

empty
has_null_mask

A boolean indicating whether a null-mask is needed

iloc

For integer-location based selection.

index

The index object

ndim

Dimension of the data.

null_count

Number of null values

nullmask

The gpu buffer for the null-mask

shape

Returns a tuple representing the dimensionality of the Series.

str
valid_count

Number of non-null values

Methods

abs(self)

Absolute value of each element of the series.

append(self, arbitrary)

Append values from another Series or array-like object.

applymap(self, udf[, out_dtype])

Apply an elementwise function to transform the values in the Column.

argsort(self[, ascending, na_position])

Returns a Series of int64 index that will sort the series.

as_mask(self)

Convert booleans to bitmask

astype(self, dtype)

Convert to the given dtype.

ceil(self)

Rounds each value upward to the smallest integral value not less than the original.

count(self[, axis, skipna])

The number of non-null values

cummax(self[, axis, skipna])

Compute the cumulative maximum of the series

cummin(self[, axis, skipna])

Compute the cumulative minimum of the series

cumprod(self[, axis, skipna])

Compute the cumulative product of the series

cumsum(self[, axis, skipna])

Compute the cumulative sum of the series

describe(self[, percentiles, include, exclude])

Compute summary statistics of a Series.

diff(self[, periods])

Calculate the difference between values at positions i and i - N in an array and store the output in a new array.

digitize(self, bins[, right])

Return the indices of the bins to which each value in series belongs.

factorize(self[, na_sentinel])

Encode the input values as integer labels

fillna(self, value[, method, axis, inplace, …])

Fill null values with value.

find_first_value(self, value)

Returns offset of first value that matches

find_last_value(self, value)

Returns offset of last value that matches

floor(self)

Rounds each value downward to the largest integral value not greater than the original.

from_categorical(categorical[, codes])

Creates from a pandas.Categorical

from_masked_array(data, mask[, null_count])

Create a Series with null-mask.

hash_encode(self, stop[, use_name])

Encode column values as ints in [0, stop) using hash function.

hash_values(self)

Compute the hash of values in this column.

isna(self)

Identify missing values in a Series.

isnull(self)

Identify missing values in a Series.

label_encoding(self, cats[, dtype, na_sentinel])

Perform label encoding

masked_assign(self, value, mask)

Assign a scalar value to a series using a boolean mask df[df < 0] = 0

max(self[, axis, skipna, dtype])

Compute the max of the series

mean(self[, axis, skipna, dtype])

Compute the mean of the series

mean_var(self[, ddof])

Compute mean and variance at the same time.

min(self[, axis, skipna, dtype])

Compute the min of the series

nlargest(self[, n, keep])

Returns a new Series of the n largest elements.

notna(self)

Identify non-missing values in a Series.

nsmallest(self[, n, keep])

Returns a new Series of the n smallest elements.

nunique(self[, method, dropna])

Returns the number of unique values of the Series: approximate version, and exact version to be moved to libgdf

one_hot_encoding(self, cats[, dtype])

Perform one-hot-encoding

product(self[, axis, skipna, dtype])

Compute the product of the series

quantile(self, q[, interpolation, exact, …])

Return values at the given quantile.

rename(self[, index, copy])

Alter Series name.

replace(self, to_replace, value)

Replace values given in to_replace with value.

reset_index(self[, drop])

Reset index to RangeIndex

reverse(self)

Reverse the Series

scale(self)

Scale values to [0, 1] in float64

set_index(self, index)

Returns a new Series with a different index.

set_mask(self, mask[, null_count])

Create new Series by setting a mask array.

shift(self[, periods, freq, axis, fill_value])

Shift values of an input array by periods positions and store the output in a new array.

sort_index(self[, ascending])

Sort by the index.

sort_values(self[, ascending, na_position])

Sort by the values.

std(self[, ddof, axis, skipna])

Compute the standard deviation of the series

sum(self[, axis, skipna, dtype])

Compute the sum of the series

tail(self[, n])

Returns the last n rows as a new Series

take(self, indices[, ignore_index])

Return Series by taking values from the corresponding indices.

to_array(self[, fillna])

Get a dense numpy array for the data.

to_dlpack(self)

Converts a cuDF object into a DLPack tensor.

to_frame(self[, name])

Convert Series into a DataFrame

to_gpu_array(self[, fillna])

Get a dense numba device array for the data.

to_hdf(self, path_or_buf, key, *args, …)

Write the contained data to an HDF5 file using HDFStore.

to_json(self[, path_or_buf])

Convert the cuDF object to a JSON string.

to_string(self[, nrows])

Convert to string

unique(self[, method, sort])

Returns unique values of this Series.

value_counts(self[, method, sort])

Returns the frequency counts of unique values in this Series.

values_to_string(self[, nrows])

Returns a list of string for each element.

var(self[, ddof, axis, skipna])

Compute the variance of the series

acos

as_index

asin

atan

copy

cos

deserialize

equals

exp

from_arrow

from_pandas

groupby

head

log

serialize

sin

sqrt

sum_of_squares

tan

to_arrow

to_pandas

unique_k

abs(self)

Absolute value of each element of the series.

Returns a new Series.

append(self, arbitrary)

Append values from another Series or array-like object. Returns a new copy with the index reset.

applymap(self, udf, out_dtype=None)

Apply an elementwise function to transform the values in the Column.

The user function is expected to take one argument and return the result, which will be stored in the output Series. The function cannot reference globals except for other simple scalar objects.

Parameters
udf : function

Wrapped by numba.cuda.jit for call on the GPU as a device function.

out_dtype : numpy.dtype; optional

The dtype for use in the output. By default, the result will have the same dtype as the source.

Returns
result : Series

The mask and index are preserved.
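
For example, with a simple scalar function (illustrative values):

>>> import cudf
>>> import numpy as np
>>> sr = cudf.Series([1.0, 2.0, 3.0])
>>> def add_ten(x):
...     return x + 10.0
>>> sr.applymap(add_ten)                          # 11.0, 12.0, 13.0
>>> sr.applymap(add_ten, out_dtype=np.float32)    # same values, float32 output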

argsort(self, ascending=True, na_position='last')

Returns a Series of int64 index that will sort the series.

Uses Thrust sort.

Returns
result: Series
as_mask(self)

Convert booleans to bitmask

Returns
device array
astype(self, dtype)

Convert to the given dtype.

Returns
If the dtype changed, a new Series is returned by casting each value to the given dtype. If the dtype is not changed, self is returned.
ceil(self)

Rounds each value upward to the smallest integral value not less than the original.

Returns a new Series.

count(self, axis=None, skipna=True)

The number of non-null values

cummax(self, axis=0, skipna=True)

Compute the cumulative maximum of the series

cummin(self, axis=0, skipna=True)

Compute the cumulative minimum of the series

cumprod(self, axis=0, skipna=True)

Compute the cumulative product of the series

cumsum(self, axis=0, skipna=True)

Compute the cumulative sum of the series

data

The gpu buffer for the data

describe(self, percentiles=None, include=None, exclude=None)

Compute summary statistics of a Series. For numeric data, the output includes the minimum, maximum, mean, median, standard deviation, and various quantiles. For object data, the output includes the count, number of unique values, the most common value, and the number of occurrences of the most common value.

Parameters
percentiles : list-like, optional

The percentiles used to generate the output summary statistics. If None, the default percentiles used are the 25th, 50th and 75th. Values should be within the interval [0, 1].

Returns
A DataFrame containing summary statistics of relevant columns from
the input DataFrame.

Examples

Describing a Series containing numeric values.

>>> import cudf
>>> s = cudf.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
>>> print(s.describe())
   stats    values
0  count      10.0
1   mean       5.5
2    std   3.02765
3    min       1.0
4    25%       2.5
5    50%       5.5
6    75%       7.5
7    max      10.0

diff(self, periods=1)

Calculate the difference between values at positions i and i - N in an array and store the output in a new array.

Notes

Diff currently only supports float and integer dtype columns with no null values.

digitize(self, bins, right=False)

Return the indices of the bins to which each value in series belongs.

Parameters
bins : np.array

1-D monotonically increasing array with the same type as this series.

right : bool

Indicates whether interval contains the right or left bin edge.

Returns
A new Series containing the indices.

Notes

Monotonicity of bins is assumed and not checked.
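
For example (illustrative bins):

>>> import cudf
>>> import numpy as np
>>> sr = cudf.Series([0.5, 2.5, 7.0])
>>> bins = np.array([0.0, 2.0, 4.0, 6.0, 8.0])
>>> sr.digitize(bins)   # bin index for each value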

dtype

dtype of the Series

factorize(self, na_sentinel=-1)

Encode the input values as integer labels

Parameters
na_sentinel : number

Value to indicate missing category.

Returns
(labels, cats) : (Series, Series)
  • labels contains the encoded values

  • cats contains the categories in order, such that the N-th item corresponds to code N-1.
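
For example (illustrative values):

>>> import cudf
>>> sr = cudf.Series([10, 30, 10, 20])
>>> labels, cats = sr.factorize()   # labels holds integer codes, cats the categories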

fillna(self, value, method=None, axis=None, inplace=False, limit=None)

Fill null values with value.

Parameters
value : scalar or Series-like

Value to use to fill nulls. If Series-like, null values are filled with the values in corresponding indices of the given Series.

Returns
result : Series

Copy with nulls filled.

find_first_value(self, value)

Returns offset of first value that matches

find_last_value(self, value)

Returns offset of last value that matches

floor(self)

Rounds each value downward to the largest integral value not greater than the original.

Returns a new Series.

classmethod from_categorical(categorical, codes=None)

Creates from a pandas.Categorical

If codes is defined, use it instead of categorical.codes

classmethod from_masked_array(data, mask, null_count=None)

Create a Series with null-mask. This is equivalent to:

Series(data).set_mask(mask, null_count=null_count)

Parameters
data : 1D array-like

The values. Null values must not be skipped. They can appear as garbage values.

mask : 1D array-like of numpy.uint8

The null-mask. Valid values are marked as 1; otherwise 0. The mask bit given the data index idx is computed as:

(mask[idx // 8] >> (idx % 8)) & 1
null_count : int, optional

The number of null values. If None, it is calculated automatically.
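
A small sketch of the bitmask layout described above; bit idx of the mask marks row idx as valid (the values are illustrative).

>>> import numpy as np
>>> import cudf
>>> data = np.array([1, 2, 3, 4], dtype=np.int64)
>>> mask = np.array([0b00001011], dtype=np.uint8)   # rows 0, 1 and 3 valid; row 2 null
>>> sr = cudf.Series.from_masked_array(data, mask)
>>> sr.null_count
1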

has_null_mask

A boolean indicating whether a null-mask is needed

hash_encode(self, stop, use_name=False)

Encode column values as ints in [0, stop) using hash function.

Parameters
stop : int

The upper bound on the encoding range.

use_name : bool

If True then combine hashed column values with hashed column name. This is useful for when the same values in different columns should be encoded with different hashed values.

Returns
result : Series

The encoded Series.

hash_values(self)

Compute the hash of values in this column.

iloc

For integer-location based selection.

Returns
Series containing the elements corresponding to the indices

Examples

>>> import cudf
>>> sr = cudf.Series(list(range(20)))

Get the value at index 1

>>> sr.iloc[1]
1

Get the values at indices 0, 2, 9, and 18

>>> sr.iloc[[0, 2, 9, 18]]
 0    0
 2    2
 9    9
18   18

Get the values using slice indices

>>> sr.iloc[3:10:2]
3    3
5    5
7    7
9    9
index

The index object

isna(self)

Identify missing values in a Series. Alias for isnull.

isnull(self)

Identify missing values in a Series.

label_encoding(self, cats, dtype=None, na_sentinel=-1)

Perform label encoding

Parameters
values : sequence of input values
dtype: numpy.dtype; optional

Specifies the output dtype. If None is given, the smallest possible integer dtype (starting with np.int32) is used.

na_sentinel : number

Value to indicate missing category.

Returns
A sequence of encoded labels with values between 0 and n-1 classes (cats).
masked_assign(self, value, mask)

Assign a scalar value to a series using a boolean mask df[df < 0] = 0

Parameters
value : scalar

scalar value for assignment

mask : cudf Series

Boolean Series

Returns
cudf Series

cuDF Series with the new value set where mask is True.
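
For example, clamping negative values to zero, in the spirit of the df[df < 0] = 0 idiom mentioned above (assuming elementwise comparison yields a boolean Series):

>>> import cudf
>>> sr = cudf.Series([-1.0, 2.0, -3.0])
>>> mask = sr < 0                             # boolean Series
>>> clamped = sr.masked_assign(0.0, mask)     # -1.0 and -3.0 become 0.0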

max(self, axis=None, skipna=True, dtype=None)

Compute the max of the series

mean(self, axis=None, skipna=True, dtype=None)

Compute the mean of the series

mean_var(self, ddof=1)

Compute mean and variance at the same time.

min(self, axis=None, skipna=True, dtype=None)

Compute the min of the series

ndim

Dimension of the data. Series ndim is always 1.

nlargest(self, n=5, keep='first')

Returns a new Series of the n largest elements.

notna(self)

Identify non-missing values in a Series.

nsmallest(self, n=5, keep='first')

Returns a new Series of the n smallest elements.

null_count

Number of null values

nullmask

The gpu buffer for the null-mask

nunique(self, method='sort', dropna=True)

Returns the number of unique values of the Series: approximate version, and exact version to be moved to libgdf

one_hot_encoding(self, cats, dtype='float64')

Perform one-hot-encoding

Parameters
catssequence of values

values representing each category.

dtype : numpy.dtype

specifies the output dtype.

Returns
A sequence of new series for each category. Its length is determined
by the length of cats.
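
For example (illustrative categories):

>>> import cudf
>>> sr = cudf.Series([0, 1, 2, 1])
>>> dummies = sr.one_hot_encoding(cats=[0, 1, 2])   # list of three 0/1 Series
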
product(self, axis=None, skipna=True, dtype=None)

Compute the product of the series

quantile(self, q, interpolation='linear', exact=True, quant_index=True)

Return values at the given quantile.

Parameters
q : float or array-like, default 0.5 (50% quantile)

0 <= q <= 1, the quantile(s) to compute

interpolation : {'linear', 'lower', 'higher', 'midpoint', 'nearest'}

This optional parameter specifies the interpolation method to use, when the desired quantile lies between two data points i and j:

columns : list of str

List of column names to include.

exact : boolean

Whether to use approximate or exact quantile algorithm.

quant_index : boolean

Whether to use the list of quantiles as index.

Returns
DataFrame
rename(self, index=None, copy=True)

Alter Series name.

Change Series.name with a scalar value.

Parameters
index : Scalar, optional

Scalar to alter the Series.name attribute

copy : boolean, default True

Also copy underlying data

Returns
Series
Difference from pandas:
  • Supports scalar values only for changing name attribute

  • Not supporting: inplace, level

replace(self, to_replace, value)

Replace values given in to_replace with value.

Parameters
to_replace : numeric, str or list-like

Value(s) to replace.

  • numeric or str:

    • values equal to to_replace will be replaced with value

  • list of numeric or str:

    • If value is also list-like, to_replace and value must be of same length.

value : numeric, str, list-like, or dict

Value(s) to replace to_replace with.

Returns
result : Series

Series after replacement. The mask and index are preserved.

See also

Series.fillna
reset_index(self, drop=False)

Reset index to RangeIndex

reverse(self)

Reverse the Series

scale(self)

Scale values to [0, 1] in float64

set_index(self, index)

Returns a new Series with a different index.

Parameters
index : Index, Series-convertible

the new index or values for the new index

set_mask(self, mask, null_count=None)

Create new Series by setting a mask array.

This will override the existing mask. The returned Series will reference the same data buffer as this Series.

Parameters
mask : 1D array-like of numpy.uint8

The null-mask. Valid values are marked as 1; otherwise 0. The mask bit given the data index idx is computed as:

(mask[idx // 8] >> (idx % 8)) & 1
null_count : int, optional

The number of null values. If None, it is calculated automatically.

shape

Returns a tuple representing the dimensionality of the Series.

shift(self, periods=1, freq=None, axis=0, fill_value=None)

Shift values of an input array by periods positions and store the output in a new array.

Notes

Shift currently only supports float and integer dtype columns with no null values.

sort_index(self, ascending=True)

Sort by the index.

sort_values(self, ascending=True, na_position='last')

Sort by the values.

Sort a Series in ascending or descending order by some criterion.

Parameters
ascending : bool, default True

If True, sort values in ascending order, otherwise descending.

na_position : {'first', 'last'}, default 'last'

‘first’ puts nulls at the beginning, ‘last’ puts nulls at the end.

Returns
sorted_obj : cuDF Series
Difference from pandas:
  • Not supporting: inplace, kind

Examples

>>> import cudf
>>> s = cudf.Series([1, 5, 2, 4, 3])
>>> s.sort_values()
0    1
2    2
4    3
3    4
1    5
std(self, ddof=1, axis=None, skipna=True)

Compute the standard deviation of the series

sum(self, axis=None, skipna=True, dtype=None)

Compute the sum of the series

tail(self, n=5)

Returns the last n rows as a new Series

Examples

>>> import cudf
>>> ser = cudf.Series([4, 3, 2, 1, 0])
>>> print(ser.tail(2))
3    1
4    0
take(self, indices, ignore_index=False)

Return Series by taking values from the corresponding indices.

to_array(self, fillna=None)

Get a dense numpy array for the data.

Parameters
fillna : str or None

Defaults to None, which will skip null values. If it equals “pandas”, null values are filled with NaNs. Non integral dtype is promoted to np.float64.

Notes

if fillna is None, null values are skipped. Therefore, the output size could be smaller.

to_dlpack(self)

Converts a cuDF object into a DLPack tensor.

DLPack is an open-source memory tensor structure: dmlc/dlpack.

This function takes a cuDF object and converts it to a PyCapsule object which contains a pointer to a DLPack tensor. This function deep copies the data into the DLPack tensor from the cuDF object.

Parameters
cudf_obj : DataFrame, Series, Index, or Column
Returns
pycapsule_obj : PyCapsule

Output DLPack tensor pointer which is encapsulated in a PyCapsule object.
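
A minimal sketch (the values are illustrative; the resulting capsule can be handed to any DLPack-aware consumer):

>>> import cudf
>>> ser = cudf.Series([1, 2, 3])
>>> capsule = ser.to_dlpack()  # PyCapsule wrapping a DLPack tensor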

to_frame(self, name=None)

Convert Series into a DataFrame

Parameters
name : str, default None

Name to be used for the column

Returns
DataFrame

cudf DataFrame
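
For instance (the column name 'col0' is illustrative):

>>> import cudf
>>> ser = cudf.Series([1, 2, 3])
>>> df = ser.to_frame(name='col0')  # single-column DataFrame named 'col0'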

to_gpu_array(self, fillna=None)

Get a dense numba device array for the data.

Parameters
fillna : str or None

See fillna in .to_array.

Notes

If fillna is None, null values are skipped. Therefore, the output size could be smaller.

to_hdf(self, path_or_buf, key, *args, **kwargs)

Write the contained data to an HDF5 file using HDFStore.

Hierarchical Data Format (HDF) is self-describing, allowing an application to interpret the structure and contents of a file with no outside information. One HDF file can hold a mix of related objects which can be accessed as a group or as individual objects.

In order to add another DataFrame or Series to an existing HDF file, please use append mode and a different key.

For more information see the pandas user guide: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#hdf5-pytables

Parameters
path_or_buf : str or pandas.HDFStore

File path or HDFStore object.

key : str

Identifier for the group in the store.

mode : {‘a’, ‘w’, ‘r+’}, default ‘a’

Mode to open file:

  • ‘w’: write, a new file is created (an existing file with the same name would be deleted).

  • ‘a’: append, an existing file is opened for reading and writing, and if the file does not exist it is created.

  • ‘r+’: similar to ‘a’, but the file must already exist.

format : {‘fixed’, ‘table’}, default ‘fixed’

Possible values:

  • ‘fixed’: Fixed format. Fast writing/reading. Not appendable, nor searchable.

  • ‘table’: Table format. Write as a PyTables Table structure which may perform worse but allow more flexible operations like searching / selecting subsets of the data.

append : bool, default False

For Table formats, append the input data to the existing.

data_columns : list of columns or True, optional

List of columns to create as indexed data columns for on-disk queries, or True to use all columns. By default only the axes of the object are indexed. See Query via Data Columns. Applicable only to format=’table’.

complevel : {0-9}, optional

Specifies a compression level for data. A value of 0 disables compression.

complib : {‘zlib’, ‘lzo’, ‘bzip2’, ‘blosc’}, default ‘zlib’

Specifies the compression library to be used. As of v0.20.2 these additional compressors for Blosc are supported (default if no compressor specified: ‘blosc:blosclz’): {‘blosc:blosclz’, ‘blosc:lz4’, ‘blosc:lz4hc’, ‘blosc:snappy’, ‘blosc:zlib’, ‘blosc:zstd’}. Specifying a compression library which is not available issues a ValueError.

fletcher32 : bool, default False

If applying compression use the fletcher32 checksum.

dropna : bool, default False

If true, ALL nan rows will not be written to store.

errors : str, default ‘strict’

Specifies how encoding and decoding errors are to be handled. See the errors argument for open() for a full list of options.

to_json(self, path_or_buf=None, *args, **kwargs)

Convert the cuDF object to a JSON string. Note nulls and NaNs will be converted to null and datetime objects will be converted to UNIX timestamps.

Parameters
path_or_buf : string or file handle, optional

File path or object. If not specified, the result is returned as a string.

orient : string

Indication of expected JSON string format.

  • Series:

    • default is ‘index’

    • allowed values are: {‘split’, ‘records’, ‘index’, ‘table’}

  • DataFrame:

    • default is ‘columns’

    • allowed values are: {‘split’, ‘records’, ‘index’, ‘columns’, ‘values’, ‘table’}

  • The format of the JSON string:

    • ‘split’ : dict like {‘index’ -> [index], ‘columns’ -> [columns], ‘data’ -> [values]}

    • ‘records’ : list like [{column -> value}, … , {column -> value}]

    • ‘index’ : dict like {index -> {column -> value}}

    • ‘columns’ : dict like {column -> {index -> value}}

    • ‘values’ : just the values array

    • ‘table’ : dict like {‘schema’: {schema}, ‘data’: {data}} describing the data, where the data component is like orient='records'.

date_format : {None, ‘epoch’, ‘iso’}

Type of date conversion. ‘epoch’ = epoch milliseconds, ‘iso’ = ISO8601. The default depends on the orient. For orient='table', the default is ‘iso’. For all other orients, the default is ‘epoch’.

double_precision : int, default 10

The number of decimal places to use when encoding floating point values.

force_ascii : bool, default True

Force encoded string to be ASCII.

date_unit : string, default ‘ms’ (milliseconds)

The time unit to encode to, governs timestamp and ISO8601 precision. One of ‘s’, ‘ms’, ‘us’, ‘ns’ for second, millisecond, microsecond, and nanosecond respectively.

default_handler : callable, default None

Handler to call if object cannot otherwise be converted to a suitable format for JSON. Should receive a single argument which is the object to convert and return a serialisable object.

lines : bool, default False

If ‘orient’ is ‘records’ write out line delimited json format. Will throw ValueError if incorrect ‘orient’ since others are not list like.

compression : {‘infer’, ‘gzip’, ‘bz2’, ‘zip’, ‘xz’, None}

A string representing the compression to use in the output file, only used when the first argument is a filename. By default, the compression is inferred from the filename.

index : bool, default True

Whether to include the index values in the JSON string. Not including the index (index=False) is only supported when orient is ‘split’ or ‘table’.
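
A minimal sketch (the file name is illustrative):

>>> import cudf
>>> ser = cudf.Series([1, 2, 3])
>>> json_str = ser.to_json()                      # return the JSON as a string
>>> ser.to_json('series.json', orient='records')  # or write it to a file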

to_string(self, nrows=NOTSET)

Convert to string

Parameters
nrows : int

Maximum number of rows to show. If it is None, all rows are shown.

unique(self, method='sort', sort=True)

Returns unique values of this Series. The default method=’sort’ will be changed to ‘hash’ when that method is implemented.

valid_count

Number of non-null values

value_counts(self, method='sort', sort=True)

Returns the number of occurrences of each unique value in this Series.

values_to_string(self, nrows=None)

Returns a list of strings, one for each element.

var(self, ddof=1, axis=None, skipna=True)

Compute the variance of the series

Groupby

class cudf.groupby.groupby.Groupby(df, by, method='hash', as_index=True, level=None)

Groupby object returned by cudf.DataFrame.groupby().

Methods

agg(self, args)

Invoke aggregation functions on the groups.

apply_multicolumn

apply_multicolumn_mapped

apply_multiindex_or_single_index

copy

count

deepcopy

max

mean

min

sum

agg(self, args)

Invoke aggregation functions on the groups.

Parameters
args : dict, list, str, callable
  • str

    The aggregate function name.

  • list

    List of str of the aggregate function.

  • dict

    key-value pairs of source column name and list of aggregate functions as str.

Returns
result : DataFrame

Notes

Since multi-indexes aren’t supported aggregation results are returned in columns using the naming scheme of aggregation_columnname.
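
As a sketch, using a toy frame like the ones in the examples below:

>>> import cudf
>>> df = cudf.DataFrame()
>>> df['key'] = [0, 0, 1, 1, 2, 2, 2]
>>> df['val'] = [0, 1, 2, 3, 4, 5, 6]
>>> means = df.groupby('key').agg('mean')                   # single aggregation
>>> stats = df.groupby('key').agg({'val': ['min', 'max']})  # per-column list of aggregations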

Groupby.apply(self, function)

Apply a python transformation function over the grouped chunk.

Parameters
func : function

The python transformation function that will be applied on the grouped chunk.

Examples

from cudf import DataFrame
df = DataFrame()
df['key'] = [0, 0, 1, 1, 2, 2, 2]
df['val'] = [0, 1, 2, 3, 4, 5, 6]
groups = df.groupby(['key'], method='cudf')

# Define a function to apply to each row in a group
def mult(df):
  df['out'] = df['key'] * df['val']
  return df

result = groups.apply(mult)
print(result)

Output:

   key  val  out
0    0    0    0
1    0    1    0
2    1    2    2
3    1    3    3
4    2    4    8
5    2    5   10
6    2    6   12
Groupby.apply_grouped(self, function, **kwargs)

Apply a transformation function over the grouped chunk.

This uses numba’s CUDA JIT compiler to convert the Python transformation function into a CUDA kernel, thus will have a compilation overhead during the first run.

Parameters
func : function

The transformation function that will be executed on the CUDA GPU.

incols: list

A list of names of input columns.

outcols: dict

A dictionary of output column names and their dtype.

kwargs : dict

name-value of extra arguments. These values are passed directly into the function.

Examples

from cudf import DataFrame
from numba import cuda
import numpy as np

df = DataFrame()
df['key'] = [0, 0, 1, 1, 2, 2, 2]
df['val'] = [0, 1, 2, 3, 4, 5, 6]
groups = df.groupby(['key'], method='cudf')

# Define a function to apply to each group
def mult_add(key, val, out1, out2):
    for i in range(cuda.threadIdx.x, len(key), cuda.blockDim.x):
        out1[i] = key[i] * val[i]
        out2[i] = key[i] + val[i]

result = groups.apply_grouped(mult_add,
                              incols=['key', 'val'],
                              outcols={'out1': np.int32,
                                       'out2': np.int32},
                              # threads per block
                              tpb=8)

print(result)

Output:

   key  val out1 out2
0    0    0    0    0
1    0    1    0    1
2    1    2    2    3
3    1    3    3    4
4    2    4    8    6
5    2    5   10    7
6    2    6   12    8
Groupby.as_df(self)

Get the intermediate dataframe after shuffling the rows into groups.

Returns
(df, segs) : namedtuple
  • df : DataFrame

  • segs : Series

    Beginning offsets of each group.

Examples

from cudf import DataFrame

df = DataFrame()
df['key'] = [0, 0, 1, 1, 2, 2, 2]
df['val'] = [0, 1, 2, 3, 4, 5, 6]
groups = df.groupby(['key'], method='cudf')

df_groups = groups.as_df()

# DataFrame indexes of group starts
print(df_groups[1])

# DataFrame itself
print(df_groups[0])

Output:

# DataFrame indexes of group starts
0    0
1    2
2    4

# DataFrame itself
   key  val
0    0    0
1    0    1
2    1    2
3    1    3
4    2    4
5    2    5
6    2    6
Groupby.std(self)

Compute the std of each group

Returns
result : DataFrame
Groupby.var(self)

Compute the var of each group

Returns
result : DataFrame
Groupby.sum_of_squares(self)

Compute the sum_of_squares of each group

Returns
result : DataFrame

IO

cudf.io.csv.read_csv(filepath_or_buffer, lineterminator='\n', quotechar='"', quoting=0, doublequote=True, header='infer', mangle_dupe_cols=True, usecols=None, sep=',', delimiter=None, delim_whitespace=False, skipinitialspace=False, names=None, dtype=None, skipfooter=0, skiprows=0, dayfirst=False, compression='infer', thousands=None, decimal='.', true_values=None, false_values=None, nrows=None, byte_range=None, skip_blank_lines=True, comment=None, na_values=None, keep_default_na=True, na_filter=True, prefix=None, index_col=None)

Load and parse a CSV file into a DataFrame

Parameters
filepath_or_buffer : str

Path of file to be read or a file-like object containing the file.

sep : char, default ‘,’

Delimiter to be used.

delimiter : char, default None

Alternative argument name for sep.

delim_whitespace : bool, default False

Determines whether to use whitespace as delimiter.

lineterminator : char, default ‘\n’

Character to indicate end of line.

skipinitialspace : bool, default False

Skip spaces after delimiter.

names : list of str, default None

List of column names to be used.

dtype : list of str or dict of {col: dtype}, default None

List of data types in the same order of the column names or a dictionary with column_name:dtype (pandas style).

quotechar : char, default ‘"’

Character to indicate start and end of quote item.

quoting : str or int, default 0

Controls quoting behavior. Set to one of 0 (csv.QUOTE_MINIMAL), 1 (csv.QUOTE_ALL), 2 (csv.QUOTE_NONNUMERIC) or 3 (csv.QUOTE_NONE). Quoting is enabled with all values except 3.

doublequote : bool, default True

When quoting is enabled, indicates whether to interpret two consecutive quotechar inside fields as single quotechar

header : int, default ‘infer’

Row number to use as the column names. Default behavior is to infer the column names: if no names are passed, header=0; if column names are passed explicitly, header=None.

usecols : list of int or str, default None

Returns subset of the columns given in the list. All elements must be either integer indices (column number) or strings that correspond to column names

mangle_dupe_cols : boolean, default True

Duplicate columns will be specified as ‘X’,’X.1’,…’X.N’.

skiprows : int, default 0

Number of rows to be skipped from the start of file.

skipfooter : int, default 0

Number of rows to be skipped at the bottom of file.

compression : {‘infer’, ‘gzip’, ‘zip’, None}, default ‘infer’

For on-the-fly decompression of on-disk data. If ‘infer’, then detect compression from the following extensions: ‘.gz’,‘.zip’ (otherwise no decompression). If using ‘zip’, the ZIP file must contain only one data file to be read in, otherwise the first non-zero-sized file will be used. Set to None for no decompression.

decimal : char, default ‘.’

Character used as a decimal point.

thousands : char, default None

Character used as a thousands delimiter.

true_values : list, default None

Values to consider as boolean True

false_values : list, default None

Values to consider as boolean False

nrows : int, default None

If specified, maximum number of rows to read

byte_range : list or tuple, default None

Byte range within the input file to be read. The first number is the offset in bytes, the second number is the range size in bytes. Set the size to zero to read all data after the offset location. Reads the row that starts before or at the end of the range, even if it ends after the end of the range.

skip_blank_lines : bool, default True

If True, discard and do not parse empty lines. If False, interpret empty lines as NaN values.

comment : char, default None

Character used as a comments indicator. If found at the beginning of a line, the line will be ignored altogether.

na_values : list, default None

Values to consider as invalid

keep_default_na : bool, default True

Whether or not to include the default NA values when parsing the data.

na_filter : bool, default True

Detect missing values (empty strings and the values in na_values). Passing False can improve performance.

prefix : str, default None

Prefix to add to column numbers when parsing without a header row

index_col : int or string, default None

Column to use as the row labels

Returns
GPU DataFrame object.

Examples

Create a test csv file

>>> import cudf
>>> filename = 'foo.csv'
>>> lines = [
...   "num1,datetime,text",
...   "123,2018-11-13T12:00:00,abc",
...   "456,2018-11-14T12:35:01,def",
...   "789,2018-11-15T18:02:59,ghi"
... ]
>>> with open(filename, 'w') as fp:
...     fp.write('\n'.join(lines)+'\n')

Read the file with cudf.read_csv

>>> cudf.read_csv(filename)
  num1                datetime text
0  123 2018-11-13T12:00:00.000 5451
1  456 2018-11-14T12:35:01.000 5784
2  789 2018-11-15T18:02:59.000 6117
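
Column names and types can also be supplied explicitly; a sketch assuming the same foo.csv as above (the dtype strings are illustrative):

>>> df = cudf.read_csv(filename, skiprows=1,
...                    names=['num1', 'datetime', 'text'],
...                    dtype=['int64', 'date', 'str'])
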
cudf.io.parquet.read_parquet(path, engine='cudf', columns=None, row_group=None, skip_rows=None, num_rows=None, strings_to_categorical=False, *args, **kwargs)

Read a Parquet file into DataFrame

Parameters
path : string or path object

Path of file to be read

engine : {‘cudf’, ‘pyarrow’}, default ‘cudf’

Parser engine to use.

columns : list, default None

If not None, only these columns will be read.

row_group : int, default None

If not None, only the row group with the specified index will be read.

skip_rows : int, default None

If not None, the number of rows to skip from the start of the file.

num_rows : int, default None

If not None, the total number of rows to read.

Returns
DataFrame

Examples

>>> import cudf
>>> df = cudf.read_parquet(filename)
>>> df
  num1                datetime text
0  123 2018-11-13T12:00:00.000 5451
1  456 2018-11-14T12:35:01.000 5784
2  789 2018-11-15T18:02:59.000 6117
cudf.io.parquet.read_parquet_metadata(path)

Read a Parquet file’s metadata and schema

Parameters
path : string or path object

Path of file to be read

Returns
Total number of rows
Number of row groups
List of column names

Examples

>>> import cudf
>>> num_rows, num_row_groups, names = cudf.read_parquet_metadata(filename)
>>> df = [cudf.read_parquet(filename, row_group=i) for i in range(num_row_groups)]
>>> df = cudf.concat(df)
>>> df
  num1                datetime text
0  123 2018-11-13T12:00:00.000 5451
1  456 2018-11-14T12:35:01.000 5784
2  789 2018-11-15T18:02:59.000 6117
cudf.io.parquet.to_parquet(df, path, *args, **kwargs)

Write a DataFrame to the parquet format.

Parameters
path : str

File path or Root Directory path. Will be used as Root Directory path while writing a partitioned dataset.

compression : {‘snappy’, ‘gzip’, ‘brotli’, None}, default ‘snappy’

Name of the compression to use. Use None for no compression.

index : bool, default None

If True, include the dataframe’s index(es) in the file output. If False, they will not be written to the file. If None, the engine’s default behavior will be used.

partition_cols : list, optional, default None

Column names by which to partition the dataset. Columns are partitioned in the order they are given.
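
A minimal sketch (the output path is illustrative):

>>> import cudf
>>> df = cudf.DataFrame()
>>> df['a'] = [1, 2, 3]
>>> cudf.io.parquet.to_parquet(df, 'out.parquet')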

cudf.io.orc.read_orc(path, engine='cudf', columns=None, skip_rows=None, num_rows=None)

Load an ORC object from the file path, returning a DataFrame.

Parameters
path : string

File path

engine : {‘cudf’, ‘pyarrow’}, default ‘cudf’

Parser engine to use.

columns : list, default None

If not None, only these columns will be read from the file.

skip_rows : int, default None

If not None, the number of rows to skip from the start of the file.

num_rows : int, default None

If not None, the total number of rows to read.

kwargs are passed to the engine
Returns
DataFrame

Examples

>>> import cudf
>>> df = cudf.read_orc(filename)
>>> df
  num1                datetime text
0  123 2018-11-13T12:00:00.000 5451
1  456 2018-11-14T12:35:01.000 5784
2  789 2018-11-15T18:02:59.000 6117
cudf.io.json.read_json(path_or_buf, *args, **kwargs)

Convert a JSON string to a cuDF object.

Parameters
path_or_buf : a valid JSON string or file-like, default None

The string could be a URL. Valid URL schemes include http, ftp, s3, gcs, and file. For file URLs, a host is expected. For instance, a local file could be file://localhost/path/to/table.json

orient : string

Indication of expected JSON string format. Compatible JSON strings can be produced by to_json() with a corresponding orient value. The set of possible orients is:

  • 'split' : dict like {index -> [index], columns -> [columns], data -> [values]}

  • 'records' : list like [{column -> value}, ... , {column -> value}]

  • 'index' : dict like {index -> {column -> value}}

  • 'columns' : dict like {column -> {index -> value}}

  • 'values' : just the values array

The allowed and default values depend on the value of the typ parameter.

  • when typ == 'series':

    • allowed orients are {'split', 'records', 'index'}

    • default is 'index'

    • The Series index must be unique for orient 'index'.

  • when typ == 'frame':

    • allowed orients are {'split', 'records', 'index', 'columns', 'values', 'table'}

    • default is 'columns'

    • The DataFrame index must be unique for orients 'index' and 'columns'.

    • The DataFrame columns must be unique for orients 'index', 'columns', and 'records'.

    • 'table' is also an allowed value for the orient argument.

typ : type of object to recover (series or frame), default ‘frame’
dtype : boolean or dict, default True

If True, infer dtypes, if a dict of column to dtype, then use those, if False, then don’t infer dtypes at all, applies only to the data.

convert_axes : boolean, default True

Try to convert the axes to the proper dtypes.

convert_dates : boolean, default True

List of columns to parse for dates; if True, then try to parse date-like columns. Default is True. A column label is date-like if it ends with '_at', it ends with '_time', it begins with 'timestamp', it is 'modified', or it is 'date'.

keep_default_dates : boolean, default True

If parsing dates, then parse the default datelike columns

numpy : boolean, default False

Direct decoding to numpy arrays. Supports numeric data only, but non-numeric column and index labels are supported. Note also that the JSON ordering MUST be the same for each term if numpy=True.

precise_float : boolean, default False

Set to enable usage of higher precision (strtod) function when decoding string to double values. Default (False) is to use fast but less precise builtin functionality

date_unit : string, default None

The timestamp unit to detect if converting dates. The default behaviour is to try and detect the correct precision, but if this is not desired then pass one of ‘s’, ‘ms’, ‘us’ or ‘ns’ to force parsing only seconds, milliseconds, microseconds or nanoseconds respectively.

encoding : str, default ‘utf-8’

The encoding to use to decode py3 bytes.

lines : boolean, default False

Read the file as a json object per line.

chunksize : integer, default None

Return JsonReader object for iteration. See the line-delimited json docs for more information on chunksize. This can only be passed if lines=True. If this is None, the file will be read into memory all at once.

compression : {‘infer’, ‘gzip’, ‘bz2’, ‘zip’, ‘xz’, None}, default ‘infer’

For on-the-fly decompression of on-disk data. If ‘infer’, then use gzip, bz2, zip or xz if path_or_buf is a string ending in ‘.gz’, ‘.bz2’, ‘.zip’, or ‘xz’, respectively, and no decompression otherwise. If using ‘zip’, the ZIP file must contain only one data file to be read in. Set to None for no decompression.

Returns
result : Series or DataFrame, depending on the value of typ.
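
For example, a sketch reading line-delimited JSON records (the file name is illustrative):

>>> import cudf
>>> df = cudf.io.json.read_json('records.json', orient='records', lines=True)
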
cudf.io.json.to_json(cudf_val, path_or_buf=None, *args, **kwargs)

Convert the cuDF object to a JSON string. Note nulls and NaNs will be converted to null and datetime objects will be converted to UNIX timestamps.

Parameters
path_or_buf : string or file handle, optional

File path or object. If not specified, the result is returned as a string.

orient : string

Indication of expected JSON string format.

  • Series:

    • default is ‘index’

    • allowed values are: {‘split’, ‘records’, ‘index’, ‘table’}

  • DataFrame:

    • default is ‘columns’

    • allowed values are: {‘split’, ‘records’, ‘index’, ‘columns’, ‘values’, ‘table’}

  • The format of the JSON string:

    • ‘split’ : dict like {‘index’ -> [index], ‘columns’ -> [columns], ‘data’ -> [values]}

    • ‘records’ : list like [{column -> value}, … , {column -> value}]

    • ‘index’ : dict like {index -> {column -> value}}

    • ‘columns’ : dict like {column -> {index -> value}}

    • ‘values’ : just the values array

    • ‘table’ : dict like {‘schema’: {schema}, ‘data’: {data}} describing the data, where the data component is like orient='records'.

date_format : {None, ‘epoch’, ‘iso’}

Type of date conversion. ‘epoch’ = epoch milliseconds, ‘iso’ = ISO8601. The default depends on the orient. For orient='table', the default is ‘iso’. For all other orients, the default is ‘epoch’.

double_precision : int, default 10

The number of decimal places to use when encoding floating point values.

force_ascii : bool, default True

Force encoded string to be ASCII.

date_unit : string, default ‘ms’ (milliseconds)

The time unit to encode to, governs timestamp and ISO8601 precision. One of ‘s’, ‘ms’, ‘us’, ‘ns’ for second, millisecond, microsecond, and nanosecond respectively.

default_handler : callable, default None

Handler to call if object cannot otherwise be converted to a suitable format for JSON. Should receive a single argument which is the object to convert and return a serialisable object.

lines : bool, default False

If ‘orient’ is ‘records’ write out line delimited json format. Will throw ValueError if incorrect ‘orient’ since others are not list like.

compression : {‘infer’, ‘gzip’, ‘bz2’, ‘zip’, ‘xz’, None}

A string representing the compression to use in the output file, only used when the first argument is a filename. By default, the compression is inferred from the filename.

index : bool, default True

Whether to include the index values in the JSON string. Not including the index (index=False) is only supported when orient is ‘split’ or ‘table’.

See also

cudf.io.json.read_json
cudf.io.feather.read_feather(path, *args, **kwargs)

Load a feather object from the file path, returning a DataFrame.

Parameters
path : string

File path

columns : list, default None

If not None, only these columns will be read from the file.

Returns
DataFrame

Examples

>>> import cudf
>>> df = cudf.read_feather(filename)
>>> df
  num1                datetime text
0  123 2018-11-13T12:00:00.000 5451
1  456 2018-11-14T12:35:01.000 5784
2  789 2018-11-15T18:02:59.000 6117
cudf.io.feather.to_feather(df, path, *args, **kwargs)

Write a DataFrame to the feather format.

Parameters
path : str

File path
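
A minimal sketch (the output path is illustrative):

>>> import cudf
>>> df = cudf.DataFrame()
>>> df['a'] = [1, 2, 3]
>>> cudf.io.feather.to_feather(df, 'out.feather')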

cudf.io.hdf.read_hdf(path_or_buf, *args, **kwargs)

Read from the store, close it if we opened it.

Retrieve pandas object stored in file, optionally based on where criteria

Parameters
path_or_buf : string, buffer or path object

Path to the file to open, or an open HDFStore object. Supports any object implementing the __fspath__ protocol. This includes pathlib.Path and py._path.local.LocalPath objects.

key : object, optional

The group identifier in the store. Can be omitted if the HDF file contains a single pandas object.

mode : {‘r’, ‘r+’, ‘a’}, optional

Mode to use when opening the file. Ignored if path_or_buf is a pandas HDFStore. Default is ‘r’.

where : list, optional

A list of Term (or convertible) objects.

start : int, optional

Row number to start selection.

stop : int, optional

Row number to stop selection.

columns : list, optional

A list of column names to return.

iterator : bool, optional

Return an iterator object.

chunksize : int, optional

Number of rows to include in an iteration when using an iterator.

errors : str, default ‘strict’

Specifies how encoding and decoding errors are to be handled. See the errors argument for open() for a full list of options.

**kwargs

Additional keyword arguments passed to HDFStore.

Returns
item : object

The selected object. Return type depends on the object stored.

See Also
cudf.io.hdf.to_hdf : Write a HDF file from a DataFrame.
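
A sketch of reading back a group previously written with to_hdf (the path and key are illustrative):

>>> import cudf
>>> df = cudf.io.hdf.read_hdf('data.h5', key='df')
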
cudf.io.hdf.to_hdf(path_or_buf, key, value, *args, **kwargs)

Write the contained data to an HDF5 file using HDFStore.

Hierarchical Data Format (HDF) is self-describing, allowing an application to interpret the structure and contents of a file with no outside information. One HDF file can hold a mix of related objects which can be accessed as a group or as individual objects.

In order to add another DataFrame or Series to an existing HDF file, please use append mode and a different key.

For more information see the user guide.

Parameters
path_or_buf : str or pandas.HDFStore

File path or HDFStore object.

key : str

Identifier for the group in the store.

mode : {‘a’, ‘w’, ‘r+’}, default ‘a’

Mode to open file:

  • ‘w’: write, a new file is created (an existing file with the same name would be deleted).

  • ‘a’: append, an existing file is opened for reading and writing, and if the file does not exist it is created.

  • ‘r+’: similar to ‘a’, but the file must already exist.

format : {‘fixed’, ‘table’}, default ‘fixed’

Possible values:

  • ‘fixed’: Fixed format. Fast writing/reading. Not appendable, nor searchable.

  • ‘table’: Table format. Write as a PyTables Table structure which may perform worse but allow more flexible operations like searching / selecting subsets of the data.

append : bool, default False

For Table formats, append the input data to the existing.

data_columns : list of columns or True, optional

List of columns to create as indexed data columns for on-disk queries, or True to use all columns. By default only the axes of the object are indexed. See Query via Data Columns. Applicable only to format=’table’.

complevel : {0-9}, optional

Specifies a compression level for data. A value of 0 disables compression.

complib : {‘zlib’, ‘lzo’, ‘bzip2’, ‘blosc’}, default ‘zlib’

Specifies the compression library to be used. As of v0.20.2 these additional compressors for Blosc are supported (default if no compressor specified: ‘blosc:blosclz’): {‘blosc:blosclz’, ‘blosc:lz4’, ‘blosc:lz4hc’, ‘blosc:snappy’, ‘blosc:zlib’, ‘blosc:zstd’}. Specifying a compression library which is not available issues a ValueError.

fletcher32 : bool, default False

If applying compression use the fletcher32 checksum.

dropna : bool, default False

If true, ALL nan rows will not be written to store.

errors : str, default ‘strict’

Specifies how encoding and decoding errors are to be handled. See the errors argument for open() for a full list of options.

See also

cudf.io.hdf.read_hdf

Read from HDF file.

cudf.io.parquet.to_parquet

Write a DataFrame to the binary parquet format.

cudf.io.feather.to_feather

Write out feather-format for DataFrames.
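
A minimal sketch of writing one object and appending a second one under a different key (paths and keys are illustrative):

>>> import cudf
>>> df = cudf.DataFrame()
>>> df['a'] = [1, 2, 3]
>>> cudf.io.hdf.to_hdf('data.h5', 'df', df)             # write under key 'df'
>>> cudf.io.hdf.to_hdf('data.h5', 'df2', df, mode='a')  # append another object under a new key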

GpuArrowReader

class cudf.comm.gpuarrow.GpuArrowReader

Methods

to_dict(self)

Return a dictionary of Series objects

to_dict(self)

Return a dictionary of Series objects