nvstrings API Reference

nvstrings

class nvstrings.nvstrings(cptr)

Instance manages a list of strings in device memory.

Operations are across all of the strings and their results reside in device memory. Strings in the list are immutable. Methods that modify any string will create a new nvstrings instance.

Methods

add_strings(self, strs)

Add the specified strings to the end of these strings and return a new instance.

byte_count(self[, vals, bdevmem])

Fills the argument with the number of bytes of each string.

capitalize(self)

Capitalize first character of each string.

cat(self[, others, sep, na_rep])

Appends the given strings to this list of strings and returns as new nvstrings.

center(self, width[, fillchar])

Pad the beginning and end of each string to the minimum width.

compare(self, str[, devptr])

Compare each string to the supplied string.

contains(self, pat[, regex, devptr])

Find the specified string within each string.

copy(self)

Return a new instance as a copy of this instance.

count(self, pat[, devptr])

Count occurrences of pattern in each string.

device_memory(self)

Returns the total device memory used by this instance.

endswith(self, pat[, devptr])

Return array of boolean values with True for the strings where the specified string is at the end.

extract(self, pat)

Extract string from the first match of regular expression pat.

extract_record(self, pat)

Extract string from the first match of regular expression pat.

fillna(self, repl)

Create new instance, replacing all nulls with the given string(s).

find(self, sub[, start, end, devptr])

Find the specified string sub within each string.

find_from(self, sub[, starts, ends, devptr])

Find the specified string within each string starting at the specified character positions.

find_multiple(self, strs[, devptr])

Return a ‘matrix’ of find results for each of the strings in the strs parameter.

findall(self, pat)

A new set of nvstrings is created by organizing substring results vertically.

findall_record(self, pat)

Find all occurrences of regular expression pattern in each string.

gather(self, indexes[, count])

Return a new list of strings from this instance.

get(self, i)

Returns the character specified in each string as a new string.

get_cpointer(self)

Returns memory pointer to underlying C++ class instance.

get_info(self)

Return a dictionary of information about the strings in this instance.

get_ipc_data(self)

Returns IPC data handler to underlying C++ object from NVStrings.

hash(self[, devptr])

Returns hash values represented by each string.

htoi(self[, devptr])

Returns integer value represented by each string.

index(self, sub[, start, end, devptr])

Same as find but throws an error if arg is not found in all strings.

insert(self[, start, repl])

Insert the specified string into each string in the specified position.

ip2int(self[, devptr])

Returns integer value represented by each string.

is_empty(self[, devptr])

Return array of boolean values with True for strings that contain at least one character.

isalnum(self[, devptr])

Return array of boolean values with True for strings that contain only alpha-numeric characters.

isalpha(self[, devptr])

Return array of boolean values with True for strings that contain only alphabetic characters.

isdecimal(self[, devptr])

Return array of boolean values with True for strings that contain only decimal characters – those that can be used to extract base10 numbers.

isdigit(self[, devptr])

Return array of boolean values with True for strings that contain only decimal and digit characters.

islower(self[, devptr])

Return array of boolean values with True for strings that contain only lowercase characters.

isnumeric(self[, devptr])

Return array of boolean values with True for strings that contain only numeric characters.

isspace(self[, devptr])

Return array of boolean values with True for strings that contain only whitespace characters.

isupper(self[, devptr])

Return array of boolean values with True for strings that contain only uppercase characters.

join(self[, sep])

Concatenate this list of strings into a single string.

len(self[, devptr])

Returns the number of characters of each string.

ljust(self, width[, fillchar])

Pad the end of each string to the minimum width.

lower(self)

Convert each string to lowercase.

lstrip(self[, to_strip])

Strip leading characters from each string.

match(self, pat[, devptr])

Return array of boolean values where True is set if the specified pattern matches the beginning of the corresponding string.

match_strings(self, strs[, devptr])

Return array of boolean values where True is set for those strings in strs that match exactly to the corresponding strings in this instance.

null_count(self[, emptyisnull])

Returns the number of null strings in this instance.

order(self[, stype, asc, nullfirst, devptr])

Sort this list by name (2) or length (1) or both (3).

pad(self, width[, side, fillchar])

Add specified padding to each string.

partition(self[, delimiter])

Each string is split into two strings on the first delimiter found.

remove_strings(self, indexes[, count])

Remove the specified strings and return a new instance.

repeat(self, repeats)

Appends each string with itself the specified number of times.

replace(self, pat, repl[, n, regex])

Replace a string (pat) in each string with another string (repl).

replace_multi(self, pats, repls[, regex])

Replace multiple strings (pats) in each string with corresponding strings (repls).

replace_with_backrefs(self, pat, repl)

Use the repl back-ref template to create a new string with the extracted elements found using the pat expression.

rfind(self, sub[, start, end, devptr])

Find the specified string within each string.

rindex(self, sub[, start, end, devptr])

Same as rfind but throws an error if arg is not found in all strings.

rjust(self, width[, fillchar])

Pad the beginning of each string to the minimum width.

rpartition(self[, delimiter])

Each string is split into two strings on the first delimiter found.

rsplit(self[, delimiter, n])

A new set of columns (nvstrings) is created by splitting the strings vertically.

rsplit_record(self[, delimiter, n])

Returns an array of nvstrings each representing the split of each individual string.

rstrip(self[, to_strip])

Strip trailing characters from each string.

scalar_scatter(self, str, indexes, count)

Return a new list of strings placing the specified string at the provided indexes.

scatter(self, strs, indexes)

Return a new list of strings combining this instance with the provided strings using the specified indexes.

set_null_bitmask(self, nbuf[, bdevmem])

Store null-bitmask into provided memory.

size(self)

The number of strings managed by this instance.

slice(self, start[, stop, step])

Returns a substring of each string.

slice_from(self[, starts, stops])

Return substring of each string using positions for each string.

slice_replace(self[, start, stop, repl])

Replace the specified section of each string with a new string.

sort(self[, stype, asc, nullfirst])

Sort this list by name (2) or length (1) or both (3).

split(self[, delimiter, n])

A new set of columns (nvstrings) is created by splitting the strings vertically.

split_record(self[, delimiter, n])

Returns an array of nvstrings each representing the split of each individual string.

startswith(self, pat[, devptr])

Return array of boolean values with True for the strings where the specified string is at the beginning.

stod(self[, devptr])

Returns float64 values represented by each string.

stof(self[, devptr])

Returns float values represented by each string.

stoi(self[, devptr])

Returns integer values represented by each string.

stol(self[, devptr])

Returns int64 values represented by each string.

strip(self[, to_strip])

Strip leading and trailing characters from each string.

sublist(self, indexes[, count])

Calls gather()

swapcase(self)

Change each lowercase character to uppercase and vice versa.

timestamp2int(self[, format, units, devptr])

Returns integer value represented by each string.

title(self)

Uppercase the first letter of each word (each letter after a space) and lowercase the rest.

to_booleans(self[, true, devptr])

Returns boolean value represented by each string.

to_host(self)

Copies strings back to CPU memory into a Python array.

to_offsets(self, sbuf, obuf[, nbuf, bdevmem])

Store byte-array of characters encoded in UTF-8 and offsets and optional null-bitmask into provided memory.

translate(self, table)

Translate individual characters to new characters using the provided table.

upper(self)

Convert each string to uppercase.

url_decode(self)

URL-decode each string and return as a new instance.

url_encode(self)

URL-encode each string and return as a new instance.

wrap(self, width)

This will place new-line characters in whitespace so each line is no more than width characters.

zfill(self, width)

Pads the strings with leading zeros.

add_strings(self, strs)

Add the specified strings to the end of these strings and return a new instance.

Parameters
strs : nvstrings or list

1 or more nvstrings objects

Examples

>>> import nvstrings
>>> s1 = nvstrings.to_device(['apple','pear','banana'])
>>> s2 = nvstrings.to_device(['orange','pear'])
>>> s3 = s1.add_strings(s2)
>>> print(s3)
['apple', 'pear', 'banana', 'orange', 'pear']
byte_count(self, vals=0, bdevmem=False)

Fills the argument with the number of bytes of each string. Returns the total number of bytes.

Parameters
vals : memory pointer

Where byte length values will be written. Must be able to hold at least size() of int32 values. None can be specified if only the total count is required.

Examples

>>> import nvstrings
>>> import numpy as np
>>> from librmm_cffi import librmm

Example passing device memory pointer

>>> s = nvstrings.to_device(["abc","d","ef"])
>>> arr = np.arange(s.size(),dtype=np.int32)
>>> d_arr = librmm.to_device(arr)
>>> s.byte_count(d_arr.device_ctypes_pointer.value,True)
>>> print(d_arr.copy_to_host())
[3,1,2]
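
byte_count() and len() only differ for non-ASCII text. The sketch below is a CPU-side illustration in plain Python, not a call into nvstrings, and it assumes each string is stored as UTF-8 (the encoding named under to_offsets()). The variable names are illustrative only.

```python
# CPU-side sketch in plain Python; assumes UTF-8 storage of each string.
strs = ["abc", "d", "\u00e9", "\u65e5\u672c"]  # last two are non-ASCII

# Bytes per string (analogue of byte_count)
byte_counts = [len(s.encode("utf-8")) for s in strs]
# Characters per string (analogue of len)
char_counts = [len(s) for s in strs]

print(byte_counts)       # [3, 1, 2, 6]
print(char_counts)       # [3, 1, 1, 2]
print(sum(byte_counts))  # 12, the total that byte_count is documented to return
```
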
capitalize(self)

Capitalize first character of each string. This only applies to ASCII characters at this time.

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(["hello, friend","goodbye, friend"])
>>> print(s.capitalize())
['Hello, friend', 'Goodbye, friend']
cat(self, others=None, sep=None, na_rep=None)

Appends the given strings to this list of strings and returns as new nvstrings.

Parameters
others : nvstrings or list of nvstrings

Strings to be appended. The number of strings in the arg(s) must match size() of this instance.

sep : str

If specified, this separator will be appended to each string before appending the others.

na_rep : char

This character will take the place of any null strings (not empty strings) in either list.

Examples

>>> import nvstrings
>>> s1 = nvstrings.to_device(['hello', None,'goodbye'])
>>> s2 = nvstrings.to_device(['world','globe', None])
>>> print(s1.cat(s2,sep=':', na_rep='_'))
["hello:world","_:globe","goodbye:_"]
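
The pairing rule can be sketched on the CPU in plain Python, with None standing in for a null string. cat_sketch is a hypothetical helper, not part of nvstrings, and the behavior when na_rep is omitted (a null on either side produces a null) is an assumption rather than something this reference states.

```python
# Plain-Python sketch of element-wise concatenation; not the nvstrings implementation.
def cat_sketch(left, right, sep="", na_rep=None):
    out = []
    for a, b in zip(left, right):
        if na_rep is None and (a is None or b is None):
            out.append(None)  # assumption: nulls propagate when na_rep is not given
            continue
        a = na_rep if a is None else a  # substitute na_rep for nulls
        b = na_rep if b is None else b
        out.append(a + sep + b)
    return out

result = cat_sketch(["hello", None, "goodbye"], ["world", "globe", None],
                    sep=":", na_rep="_")
print(result)  # ['hello:world', '_:globe', 'goodbye:_']
```
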
center(self, width, fillchar=' ')

Pad the beginning and end of each string to the minimum width.

Parameters
width : int

The minimum width of characters of the new string. If the width is smaller than the existing string, no padding is performed.

fillchar : char

The character used to do the padding. Default is space character. Only the first character is used.

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(["hello","goodbye","well"])
>>> print(s.center(7))
[' hello ', 'goodbye', ' well  ']
compare(self, str, devptr=0)

Compare each string to the supplied string. Returns value of 0 for strings that match str. Returns < 0 when first different character is lower than argument string or argument string is shorter. Returns > 0 when first different character is greater than the argument string or the argument string is longer.

Parameters
str : str

String to compare against all strings in this instance.

devptr : GPU memory pointer

Where string result values will be written. Must be able to hold at least size() of int32 values.

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(["hello","world"])
>>> print(s.compare('hello'))
[0,15]
contains(self, pat, regex=True, devptr=0)

Find the specified string within each string. By default, pat is treated as a regex pattern. Returns an array of boolean values with True where pat is found and False where it is not.

Parameters
pat : str

Pattern or string to search for in each string of this instance.

regex : bool

If True, pat is interpreted as a regex string. If False, pat is a literal string to be searched for in each string.

devptr : GPU memory pointer

Optional device memory pointer to hold the results. Must be able to hold at least size() of np.byte values.

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(["hello","there","world"])
>>> print(s.contains('o'))
[True, False, True]
copy(self)

Return a new instance as a copy of this instance.

Examples

>>> import nvstrings
>>> s1 = nvstrings.to_device(['apple','pear','banana'])
>>> s2 = s1.copy()
>>> print(s2)
['apple', 'pear', 'banana']
count(self, pat, devptr=0)

Count occurrences of pattern in each string.

Parameters
pat : str

Pattern to find

devptr : GPU memory pointer

Optional device memory pointer to hold the results. Memory must be able to hold at least size() of int32 values.

device_memory(self)

Returns the total device memory used by this instance.

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(['a'*7])
>>> print(s.device_memory())
24
endswith(self, pat, devptr=0)

Return array of boolean values with True for the strings where the specified string is at the end.

Parameters
pat : str

Pattern to find. Regular expressions are not accepted.

devptr : GPU memory pointer

Optional device memory pointer to hold the results. Memory must be able to hold at least size() of np.byte values.

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(["hello","there","world"])
>>> print(s.endswith('d'))
[False, False, True]
extract(self, pat)

Extract string from the first match of regular expression pat. A new array of nvstrings is created by organizing group results vertically.

Parameters
pat : str

The regex pattern with group capture syntax

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(["a1","b2","c3"])
>>> for result in s.extract(r'([ab])(\d)'):
...     print(result)
["a","b",None]
["1","2",None]
extract_record(self, pat)

Extract string from the first match of regular expression pat. A new array of nvstrings is created for each string in this instance.

Parameters
pat : str

The regex pattern with group capture syntax

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(["a1","b2","c3"])
>>> for result in s.extract_record(r'([ab])(\d)'):
...     print(result)
["a","1"]
["b","2"]
[None,None]
fillna(self, repl)

Create new instance, replacing all nulls with the given string(s).

Parameters
repl : str or nvstrings

String to be used in place of nulls. This may be an empty string but may not be None. This may also be another nvstrings instance with the same size. Corresponding strings are replaced only if null.

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(["hello", None, "goodbye"])
>>> print(s.fillna(''))
['hello', '', 'goodbye']
find(self, sub, start=0, end=None, devptr=0)

Find the specified string sub within each string. Return -1 for those strings where sub is not found.

Parameters
sub : str

String to find

start : int

Beginning of section to search from. Default is 0 (beginning of each string).

end : int

End of section to search. Default is end of each string.

devptr : GPU memory pointer

Optional device memory pointer to hold the results. Memory size must be able to hold at least size() of int32 values.

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(["hello","there","world"])
>>> print(s.find('o'))
[4,-1,1]
find_from(self, sub, starts=0, ends=0, devptr=0)

Find the specified string within each string starting at the specified character positions. The starts and ends parameters are device memory pointers. If specified, each must contain size() of int32 values. Returns -1 for those strings where sub is not found.

Parameters
sub : str

String to find

starts : GPU memory pointer

Pointer to GPU array of int32 values of beginning of sections to search, one per string.

ends : GPU memory pointer

Pointer to GPU array of int32 values of end of sections to search. Use -1 to specify to the end of that string.

devptr : GPU memory pointer

Optional device memory pointer to hold the results. Memory size must be able to hold at least size() of int32 values.

Examples

>>> import nvstrings
>>> import numpy as np
>>> from numba import cuda
>>> s = nvstrings.to_device(["hello","there"])
>>> darr = cuda.to_device(np.asarray([2,3],dtype=np.int32))
>>> print(s.find_from('e',starts=darr.device_ctypes_pointer.value))
[-1,4]
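
The per-string search window can be sketched on the CPU in plain Python, with ordinary lists standing in for the device arrays. find_from_sketch is a hypothetical helper, not part of nvstrings. It follows the parameter description above in treating an end value of -1 as "to the end of that string".

```python
# Plain-Python sketch of a per-string search window; not the nvstrings implementation.
def find_from_sketch(strs, sub, starts=None, ends=None):
    results = []
    for i, s in enumerate(strs):
        start = 0 if starts is None else starts[i]
        end = -1 if ends is None else ends[i]
        stop = len(s) if end < 0 else end  # -1 means search to the end of the string
        results.append(s.find(sub, start, stop))
    return results

found = find_from_sketch(["hello", "there"], "e", starts=[2, 3])
print(found)  # [-1, 4]
```
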
find_multiple(self, strs, devptr=0)

Return a ‘matrix’ of find results for each of the strings in the strs parameter.

Each row is an array of integers identifying the first location of the corresponding provided string.

Parameters
strs : nvstrings

Strings to find in each of the strings in this instance.

devptr : GPU memory pointer

Optional device memory pointer to hold the results. Memory size must be able to hold at least size()*strs.size() of int32 values.

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(["hare","bunny","rabbit"])
>>> t = nvstrings.to_device(["a","e","i","o","u"])
>>> print(s.find_multiple(t))
[[1, 3, -1, -1, -1], [-1, -1, -1, -1, 1], [1, -1, 4, -1, -1]]
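
The shape of the result can be reproduced on the CPU with plain Python lists: one row per string in the instance, one column per string in strs. find_multiple_sketch is a hypothetical helper, not part of nvstrings.

```python
# Plain-Python sketch of the result matrix; not the nvstrings implementation.
def find_multiple_sketch(strs, targets):
    # Row i holds the first position of every target within strs[i], or -1.
    return [[s.find(t) for t in targets] for s in strs]

matrix = find_multiple_sketch(["hare", "bunny", "rabbit"],
                              ["a", "e", "i", "o", "u"])
for row in matrix:
    print(row)
# [1, 3, -1, -1, -1]
# [-1, -1, -1, -1, 1]
# [1, -1, 4, -1, -1]

# Number of int32 slots a devptr buffer would need: size() * strs.size()
print(len(matrix) * len(matrix[0]))  # 15
```
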
findall(self, pat)

A new set of nvstrings is created by organizing substring results vertically.

Parameters
pat : str

The regex pattern to search for substrings

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(["hare","bunny","rabbit"])
>>> for result in s.findall('[ab]'):
...     print(result)
["a","b","a"]
[None,None,"b"]
[None,None,"b"]
findall_record(self, pat)

Find all occurrences of regular expression pattern in each string. A new array of nvstrings is created for each string in this instance.

Parameters
pat : str

The regex pattern used to search for substrings

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(["hare","bunny","rabbit"])
>>> for result in s.findall_record('[ab]'):
...     print(result)
["a"]
["b"]
["a","b","b"]
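
The difference between the record layout of findall_record() and the vertical layout of findall() can be sketched on the CPU with Python's re module. This is an illustration of the two layouts only, and it assumes the pattern behaves the same in Python's regex engine.

```python
import re
from itertools import zip_longest

strs = ["hare", "bunny", "rabbit"]

# Record layout: one list of matches per input string (like findall_record)
records = [re.findall("[ab]", s) for s in strs]
print(records)  # [['a'], ['b'], ['a', 'b', 'b']]

# Vertical layout: the n-th match of every string grouped together (like findall),
# with None where a string has fewer than n matches
columns = [list(col) for col in zip_longest(*records, fillvalue=None)]
print(columns)  # [['a', 'b', 'a'], [None, None, 'b'], [None, None, 'b']]
```
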
gather(self, indexes, count=0)

Return a new list of strings from this instance.

Parameters
indexes : List of ints or GPU memory pointer

0-based indexes of strings to return from an nvstrings object. Values must be of type int32.

count : int

Number of ints if the indexes parameter is a device pointer. Otherwise it is ignored.

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(["hello","there","world"])
>>> print(s.gather([0, 2]))
['hello', 'world']
get(self, i)

Returns the character specified in each string as a new string. The nvstrings returned contains a list of single character strings.

Parameters
i : int

The character position identifying the character in each string to return.

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(["hello world","goodbye","well said"])
>>> print(s.get(0))
['h', 'g', 'w']
get_cpointer(self)

Returns memory pointer to underlying C++ class instance.

get_info(self)

Return a dictionary of information about the strings in this instance. This could be helpful in understanding the makeup of the data and how the operations may perform.

get_ipc_data(self)

Returns IPC data handler to underlying C++ object from NVStrings.

Returns
list

A list containing the IPC handler data from NVstrings.

Examples
>>> import nvstrings
    ..
>>> ipc_data = nvstrings.get_ipc_data()
    ..
>>> print(ipc_data)
    ..
hash(self, devptr=0)

Returns hash values represented by each string.

Parameters
devptr : GPU memory pointer

Where string hash values will be written. Must be able to hold at least size() of uint32 values.

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(["hello","world"])
>>> s.hash()
[99162322, 113318802]
htoi(self, devptr=0)

Returns integer value represented by each string. Each string is interpreted as hex (base-16) characters.

Parameters
devptr : GPU memory pointer

Where resulting integer values will be written. Memory must be able to hold at least size() of int32 values.

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(["1234","ABCDEF","1A2","cafe"])
>>> print(s.htoi())
[4660, 11259375, 418, 51966]
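
The values above can be cross-checked on the CPU with Python's built-in int(), which accepts base 16. This is only a reference for the expected numbers, not how nvstrings computes them.

```python
# CPU-side cross-check of the hex conversion using the standard library.
hex_strs = ["1234", "ABCDEF", "1A2", "cafe"]
values = [int(s, 16) for s in hex_strs]
print(values)  # [4660, 11259375, 418, 51966]
```
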
index(self, sub, start=0, end=None, devptr=0)

Same as find but throws an error if arg is not found in all strings.

Parameters
sub : str

String to find

start : int

Beginning of section to search from. Default is 0 (beginning of each string).

end : int

End of section to search. Default is end of each string.

devptr : GPU memory pointer

Optional device memory pointer to hold the results. Memory size must be able to hold at least size() of int32 values.

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(["hello","world"])
>>> print(s.index('l'))
[2,3]
insert(self, start=0, repl=None)

Insert the specified string into each string in the specified position.

Parameters
start : int

Position at which to insert the new string. Default is the beginning of each string. Specify -1 to insert at the end of each string.

repl : str

String to insert at the specified position.

Examples

>>> import nvstrings
>>> strs = nvstrings.to_device(["abcdefghij","0123456789"])
>>> print(strs.insert(2,'_'))
['ab_cdefghij', '01_23456789']
ip2int(self, devptr=0)

Returns integer value represented by each string. Each string is interpreted as an IPv4 address.

Parameters
devptr : GPU memory pointer

Where resulting integer values will be written. Memory must be able to hold at least size() of uint32 values.

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(["192.168.0.1","10.0.0.1"])
>>> print(s.ip2int())
[3232235521, 167772161]
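
The same integers can be reproduced on the CPU with the standard ipaddress module, which is a convenient way to check results copied back from the device. This is a cross-check only, not the nvstrings implementation.

```python
import ipaddress

# CPU-side cross-check: dotted IPv4 text to its 32-bit unsigned integer value.
ips = ["192.168.0.1", "10.0.0.1"]
values = [int(ipaddress.IPv4Address(s)) for s in ips]
print(values)  # [3232235521, 167772161]

# The first value does not fit in a signed 32-bit integer, which is why the
# output buffer is described in terms of uint32 values.
print(values[0] > 2**31 - 1)  # True
```
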
is_empty(self, devptr=0)

Return array of boolean values with True for strings that contain at least one character.

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(['hello', 'goodbye', '', None])
>>> print(s.is_empty())
[True, True,False,False]
isalnum(self, devptr=0)

Return array of boolean values with True for strings that contain only alpha-numeric characters. Equivalent to: isalpha() or isdigit() or isnumeric() or isdecimal()

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(['1234', 'de', '1.75', '-34', '+9.8', ' '])
>>> print(s.isalnum())
[True, True, False, False, False, False]
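
The "equivalent to" rule reads most naturally per character: a string qualifies when every one of its characters passes at least one of the four tests. The sketch below checks that reading against Python's own str.isalnum() on the CPU. Whether nvstrings applies exactly the same rule is an assumption, and the extra 'a1' entry is added here for illustration.

```python
# Plain-Python sketch of the per-character rule; not the nvstrings implementation.
strs = ['1234', 'de', '1.75', '-34', '+9.8', ' ', 'a1']

def is_alnum_char(c):
    return c.isalpha() or c.isdigit() or c.isnumeric() or c.isdecimal()

per_char = [len(s) > 0 and all(is_alnum_char(c) for c in s) for s in strs]
builtin = [s.isalnum() for s in strs]

print(per_char)  # [True, True, False, False, False, False, True]
print(builtin)   # [True, True, False, False, False, False, True]

# Applying the four tests to the whole string instead would reject 'a1':
print('a1'.isalpha() or 'a1'.isdigit())  # False
```
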
isalpha(self, devptr=0)

Return array of boolean values with True for strings that contain only alphabetic characters.

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(['1234', 'de', '1.75', '-34', '+9.8', ' '])
>>> print(s.isalpha())
[False, True, False, False, False, False]
isdecimal(self, devptr=0)

Return array of boolean values with True for strings that contain only decimal characters – those that can be used to extract base10 numbers.

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(['1234', 'de', '1.75', '-34', '+9.8', ' '])
>>> print(s.isdecimal())
[True, False, False, False, False, False]
isdigit(self, devptr=0)

Return array of boolean values with True for strings that contain only decimal and digit characters.

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(['1234', 'de', '1.75', '-34', '+9.8', ' '])
>>> print(s.isdigit())
[True, False, False, False, False, False]
islower(self, devptr=0)

Return array of boolean values with True for strings that contain only lowercase characters.

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(['hello', 'Goodbye'])
>>> print(s.islower())
[True, False]
isnumeric(self, devptr=0)

Return array of boolean values with True for strings that contain only numeric characters. These include digit and numeric characters.

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(['1234', 'de', '1.75', '-34', '+9.8', ' '])
>>> print(s.isnumeric())
[True, False, False, False, False, False]
isspace(self, devptr=0)

Return array of boolean values with True for strings that contain only whitespace characters.

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(['1234', 'de', '1.75', '-34', '+9.8', ' '])
>>> print(s.isspace())
[False, False, False, False, False, True]
isupper(self, devptr=0)

Return array of boolean values with True for strings that contain only uppercase characters.

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(['hello', 'Goodbye'])
>>> print(s.isupper())
[False, False]
join(self, sep='')

Concatenate this list of strings into a single string.

Parameters
sep : str

This separator will be appended to each string before appending the next.

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(["hello","goodbye"])
>>> print(s.join(sep=':'))
['hello:goodbye']
len(self, devptr=0)

Returns the number of characters of each string.

Parameters
devptr : GPU memory pointer

Where string length values will be written. Must be able to hold at least size() of int32 values.

Examples

>>> import nvstrings
>>> from librmm_cffi import librmm
>>> import numpy as np

Example passing device memory pointer

>>> s = nvstrings.to_device(["abc","d","ef"])
>>> arr = np.arange(s.size(),dtype=np.int32)
>>> d_arr = librmm.to_device(arr)
>>> s.len(d_arr.device_ctypes_pointer.value)
>>> print(d_arr.copy_to_host())
[3,1,2]
ljust(self, width, fillchar=' ')

Pad the end of each string to the minimum width.

Parameters
width : int

The minimum width of characters of the new string. If the width is smaller than the existing string, no padding is performed.

fillchar : char

The character used to do the padding. Default is space character. Only the first character is used.

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(["hello","goodbye","well"])
>>> print(s.ljust(width=6))
['hello ', 'goodbye', 'well  ']
lower(self)

Convert each string to lowercase. This only applies to ASCII characters at this time.

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(["Hello, Friend","Goodbye, Friend"])
>>> print(s.lower())
['hello, friend', 'goodbye, friend']
lstrip(self, to_strip=None)

Strip leading characters from each string.

Parameters
to_strip : str

Characters to be removed from leading edge of each string

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(["oh","hello","goodbye"])
>>> print(s.lstrip('o'))
['h', 'hello', 'goodbye']
match(self, pat, devptr=0)

Return array of boolean values where True is set if the specified pattern matches the beginning of the corresponding string.

Parameters
pat : str

Pattern to find

devptr : GPU memory pointer

Optional device memory pointer to hold the results. Memory size must be able to hold at least size() of np.byte values.

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(["hello","there","world"])
>>> print(s.match('h'))
[True, False, False]
match_strings(self, strs, devptr=0)

Return array of boolean values where True is set for those strings in strs that match exactly to the corresponding strings in this instance.

Parameters
strs : nvstrings

Strings to match against. Number of strings must match those in this instance.

devptr : GPU memory pointer

Optional device memory pointer to hold the results. Memory size must be able to hold at least size() of np.byte or np.int8 values.

Examples

>>> import nvstrings
>>> s1 = nvstrings.to_device(["hello","there","world"])
>>> s2 = nvstrings.to_device(["hello","here","world"])
>>> print(s1.match_strings(s2))
[True, False, True]
null_count(self, emptyisnull=False)

Returns the number of null strings in this instance.

Parameters
emptyisnull : boolean

If True, empty strings are counted as null. Default is False.

Examples

>>> import nvstrings

>>> s = nvstrings.to_device(["abc","",None])
>>> print("nulls",s.null_count())
nulls 1
>>> print("nulls+empty", s.null_count(True))
nulls+empty 2
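
The counting rule is simple enough to mirror on the CPU, which can help when checking host-side data before it is copied to the device. null_count_sketch is a hypothetical helper, with None standing in for a null string.

```python
# Plain-Python sketch of the counting rule; not the nvstrings implementation.
def null_count_sketch(strs, emptyisnull=False):
    # None is always counted; '' is counted only when emptyisnull is True.
    return sum(1 for s in strs if s is None or (emptyisnull and s == ""))

data = ["abc", "", None]
print(null_count_sketch(data))        # 1
print(null_count_sketch(data, True))  # 2
```
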
order(self, stype=2, asc=True, nullfirst=True, devptr=0)

Sort this list by name (2) or length (1) or both (3). This sort only provides the new indexes and does not reorder the managed strings.

Parameters
stype : int

Type of sort to use. If stype is 1, strings are sorted by length. If stype is 2, strings are sorted alphabetically by name. If stype is 3, strings are sorted by length and then alphabetically.

asc : bool

Whether to sort ascending (True) or descending (False)

nullfirst : bool

Null strings are sorted to the beginning by default

devptr : GPU memory pointer

Where index values will be written. Must be able to hold at least size() of int32 values.

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(["aaa", "bb", "aaaabb"])
>>> print(s.order(2))
[0, 2, 1]
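
The three stype values can be compared side by side with a CPU-side sketch that returns sorted index positions without reordering the input, as order() is described to do. order_sketch is a hypothetical helper, it ignores null handling (nullfirst), and it assumes alphabetical order means plain code-point comparison.

```python
# Plain-Python sketch of index-only sorting; not the nvstrings implementation.
def order_sketch(strs, stype=2, asc=True):
    keys = {
        1: lambda s: len(s),       # by length
        2: lambda s: s,            # by name
        3: lambda s: (len(s), s),  # by length, then by name
    }
    key = keys[stype]
    return sorted(range(len(strs)), key=lambda i: key(strs[i]), reverse=not asc)

strs = ["aaa", "bb", "aaaabb"]
by_length = order_sketch(strs, 1)
by_name = order_sketch(strs, 2)
by_both = order_sketch(strs, 3)
print(by_length)  # [1, 0, 2]
print(by_name)    # [0, 2, 1]
print(by_both)    # [1, 0, 2]
```
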
pad(self, width, side='left', fillchar=' ')

Add specified padding to each string. The side parameter is one of 'left', 'right', or 'both'; the default is 'left'.

Parameters
fillchar : char

The character used to do the padding. Default is space character. Only the first character is used.

side : str

One of 'left', 'right', or 'both'. The default is 'left'. 'left' pads on the left, the same as rjust(). 'right' pads on the right, the same as ljust(). 'both' pads equally on the left and right, the same as center().

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(["hello","goodbye","well"])
>>> print(s.pad(7))
['  hello', 'goodbye', '   well']
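
The mapping from side to rjust(), ljust(), and center() can be sketched on the CPU with Python's own str methods. pad_sketch is a hypothetical helper. For 'both', Python's center() may place an odd leftover fill character on a different side than nvstrings does, so only the 'left' and 'right' results are shown.

```python
# Plain-Python sketch of the side-to-method mapping; not the nvstrings implementation.
def pad_sketch(strs, width, side="left", fillchar=" "):
    fill = fillchar[0]  # only the first character is used
    if side == "left":
        return [s.rjust(width, fill) for s in strs]   # fill goes on the left
    if side == "right":
        return [s.ljust(width, fill) for s in strs]   # fill goes on the right
    if side == "both":
        return [s.center(width, fill) for s in strs]  # fill goes on both sides
    raise ValueError("side must be 'left', 'right', or 'both'")

words = ["hello", "goodbye", "well"]
left = pad_sketch(words, 7)
right = pad_sketch(words, 7, side="right")
print(left)   # ['  hello', 'goodbye', '   well']
print(right)  # ['hello  ', 'goodbye', 'well   ']
```
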
partition(self, delimiter=' ')

Each string is split into two strings on the first delimiter found.

Three strings are returned for each string: beginning, delimiter, end.

Parameters
delimiter : str

The character used to locate the split points of each string. Default is space.

Examples

>>> import nvstrings
>>> strs = nvstrings.to_device(["hello world","goodbye","up in arms"])
>>> for s in strs.partition(' '):
...     print(s)
['hello', ' ', 'world']
['goodbye', '', '']
['up', ' ', 'in arms']
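
Python's built-in str.partition() produces the same three-part results for this input, and str.rpartition() shows the effect of searching from the end. This is a CPU-side comparison only. The rpartition lines are limited to strings that contain the delimiter, because how nvstrings fills the three parts when it is absent is not stated here.

```python
# CPU-side comparison using Python's built-in partition methods.
strs = ["hello world", "goodbye", "up in arms"]

first = [list(s.partition(" ")) for s in strs]
print(first)
# [['hello', ' ', 'world'], ['goodbye', '', ''], ['up', ' ', 'in arms']]

# Searching from the end changes the split only when the delimiter repeats.
last = [list(s.rpartition(" ")) for s in strs if " " in s]
print(last)
# [['hello', ' ', 'world'], ['up in', ' ', 'arms']]
```
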
remove_strings(self, indexes, count=0)

Remove the specified strings and return a new instance.

Parameters
indexes : List of ints

0-based indexes of strings to remove from an nvstrings object. If this parameter is a pointer to device memory, the count parameter is required.

count : int

Number of ints if the indexes parameter is a device pointer. Otherwise it is ignored.

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(["hello","there","world"])
>>> print(s.remove_strings([0, 2]))
['there']
repeat(self, repeats)

Appends each string with itself the specified number of times. This returns an nvstrings instance with the new strings.

Parameters
repeats : int

The number of times each string should be repeated. A repeat count of 0 or 1 will just return a copy of each string.

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(["hello","goodbye","well"])
>>> print(s.repeat(2))
['hellohello', 'goodbyegoodbye', 'wellwell']
replace(self, pat, repl, n=-1, regex=True)

Replace a string (pat) in each string with another string (repl).

Parameters
pat : str

String to be replaced. This can also be a regular expression string, but not a compiled regex.

repl : str

String to replace found string in the output instance.

regex : boolean

Set to True if pat is a regular expression string.

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(["hello","goodbye"])
>>> print(s.replace('e', ''))
['hllo', 'goodby']
replace_multi(self, pats, repls, regex=True)

Replace multiple strings (pats) in each string with corresponding strings (repls).

Parameters
pats : list or nvstrings

Strings to be replaced. These can also be regular expression patterns. If so, this must not be an nvstrings instance.

repls : list or nvstrings

Strings to replace found pattern/string. Must be the same number of strings as pats. Alternately, this can be a single str instance and would be used as replacement for each string found.

regex : boolean

Set to True if pats are regular expression strings.

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(["hello","goodbye"])
>>> print(s.replace_multi(['e', 'o'],['E','O']))
['hEllO', 'gOOdbyE']
replace_with_backrefs(self, pat, repl)

Use the repl back-ref template to create a new string with the extracted elements found using the pat expression.

Parameters
patstr

Regex with groupings to identify extract sections. This should not be a compiled regex.

replstr

String template containing back-reference indicators.

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(["A543","Z756"])
>>> print(s.replace_with_backrefs(r'(\d)(\d)', r'V\2\1'))
['V45', 'V57']
rfind(self, sub, start=0, end=None, devptr=0)

Find the specified string within each string. Search from the end of the string.

Return -1 for those strings where sub is not found.

Parameters
substr

String to find

startint

Beginning of section to find. Default is 0 (beginning of each string).

endint

End of section to find. Default is end of each string.

devptrGPU memory pointer

Optional device memory pointer to hold the results.

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(["hello","there","world"])
>>> print(s.rfind('o'))
[4, -1, 1]
rindex(self, sub, start=0, end=None, devptr=0)

Same as rfind, but raises an error unless sub is found in every string.

Parameters
substr

String to find

startint

Beginning of section to search from. Default is 0 (beginning of each string).

endint

End of section to search. Default is end of each string.

devptrGPU memory pointer

Optional device memory pointer to hold the results. Memory size must be able to hold at least size() of int32 values.

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(["hello","world"])
>>> print(s.rindex('l'))
[3, 3]

rjust(self, width, fillchar=' ')

Pad the beginning of each string to the minimum width.

Parameters
widthint

The minimum width of characters of the new string. If the width is smaller than the existing string, no padding is performed.

fillcharchar

The character used to do the padding. Default is space character. Only the first character is used.

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(["hello","goodbye","well"])
>>> print(s.rjust(width=6))
[' hello', 'goodbye', '  well']
rpartition(self, delimiter=' ')

Each string is split into two strings at the first delimiter found when searching from the end of the string.

Three strings are returned for each string: beginning, delimiter, end.

Parameters
delimiterstr

The character used to locate the split points of each string. Default is space.

Examples

>>> import nvstrings
>>> strs = nvstrings.to_device(["hello world","goodbye","up in arms"])
>>> for s in strs.rpartition(' '):
...     print(s)
['hello', ' ', 'world']
['', '', 'goodbye']
['up in', ' ', 'arms']
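For comparison, Python's built-in str.rpartition follows the same three-part convention (beginning, delimiter, end) and also searches from the end. The following is a CPU-side sketch in plain Python, not the nvstrings implementation; the helper name rpartition_all is hypothetical.

```python
# CPU-side analogue using plain Python strings (not the nvstrings implementation).
def rpartition_all(strings, delimiter=" "):
    """Split each string at the LAST occurrence of delimiter into three parts."""
    return [list(s.rpartition(delimiter)) for s in strings]


parts = rpartition_all(["hello world", "goodbye", "up in arms"])
for p in parts:
    print(p)
```

When the delimiter is absent, the whole string lands in the last position and the first two parts are empty, which matches the 'goodbye' row in the example above.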
rsplit(self, delimiter=None, n=-1)

A new set of columns (nvstrings) is created by splitting the strings vertically. Delimiter is searched from the end.

Parameters
delimiterstr

The character used to locate the split points of each string. Default is space.

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(["hello world","goodbye","well said"])
>>> for result in s.rsplit(' '):
...     print(result)
["hello","goodbye","well"]
["world",None,"said"]
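The "vertical" (column-wise) layout can be pictured as a transpose of the per-string split results. The sketch below is a plain-Python illustration under the assumption that missing tokens are filled with None from the left-aligned rows, as the example output suggests; rsplit_columns is a hypothetical helper and not part of nvstrings.

```python
from itertools import zip_longest


# Plain-Python illustration of the column-wise result layout (assumed semantics).
def rsplit_columns(strings, delimiter=" "):
    """Split every string, then transpose rows into columns padded with None."""
    rows = [s.rsplit(delimiter) for s in strings]
    return [list(col) for col in zip_longest(*rows, fillvalue=None)]


columns = rsplit_columns(["hello world", "goodbye", "well said"])
for col in columns:
    print(col)
```

Each returned column holds the token at that position for every input string, so the number of columns equals the largest token count.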
rsplit_record(self, delimiter=None, n=-1)

Returns an array of nvstrings each representing the split of each individual string. The delimiter is searched for from the end of each string.

Parameters
delimiterstr

The character used to locate the split points of each string. Default is space.

nint

Maximum number of strings to return for each split.

Examples

>>> import nvstrings
>>> strs = nvstrings.to_device(["hello world","goodbye","up in arms"])
>>> for s in strs.rsplit_record(' ',2):
...     print(s)
['hello', 'world']
['goodbye']
['up in', 'arms']
rstrip(self, to_strip=None)

Strip trailing characters from each string.

Parameters
to_stripstr

Characters to be removed from trailing edge of each string

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(["oh","hello","goodbye"])
>>> print(s.rstrip('o'))
['oh', 'hell', 'goodbye']
scalar_scatter(self, str, indexes, count)

Return a new list of strings placing the specified string at the provided indexes.

Parameters
strstr

String to be placed

indexesList of ints or GPU memory pointer

0-based indexes of strings indicating the positions where the str parameter should be placed. Existing strings at those positions will be replaced. Values must be of type int32.

countint

Number of elements in the indexes parameter.

Examples

>>> import nvstrings
>>> s1 = nvstrings.to_device(["a","b","c","d"])
>>> print(s1.scalar_scatter("_", [1, 3], 2))
['a', '_', 'c', '_']
scatter(self, strs, indexes)

Return a new list of strings combining this instance with the provided strings using the specified indexes.

Parameters
strsnvstrings

Strings to be combined with this instance.

indexesList of ints or GPU memory pointer

0-based indexes of strings indicating which strings should be replaced by the corresponding element in strs. Values must be of type int32. The number of values should be the same as strs.size(). Repeated indexes will cause undefined results.

Examples

>>> import nvstrings
>>> s1 = nvstrings.to_device(["a","b","c","d"])
>>> s2 = nvstrings.to_device(["e","f"])
>>> print(s1.scatter(s2, [1, 3]))
['a', 'e', 'c', 'f']
set_null_bitmask(self, nbuf, bdevmem=False)

Store null-bitmask into provided memory.

Parameters
nbufmemory address or buffer

Stores null bitmask in arrow format.

bdevmemboolean

Default (False) interprets nbuf as CPU memory.

Examples

>>> import nvstrings
>>> import numpy as np
>>> s = nvstrings.to_device(['a',None,'p','l','e'])
>>> nulls = np.empty(int(s.size()/8)+1, dtype=np.int8)
>>> s.set_null_bitmask(nulls)
>>> print("nulls",nulls.tobytes())
nulls b'\x1d'
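To see where the value b'\x1d' comes from, the bitmask can be reproduced on the CPU. This is a minimal sketch assuming the Arrow convention of one validity bit per element, least-significant bit first, with 1 meaning non-null; arrow_null_bitmask is a hypothetical helper.

```python
# Pure-Python sketch of an Arrow-style validity bitmask (assumed LSB-first layout).
def arrow_null_bitmask(strings):
    """Return bytes where bit i is 1 when strings[i] is not None."""
    mask = bytearray((len(strings) + 7) // 8)
    for i, s in enumerate(strings):
        if s is not None:
            mask[i // 8] |= 1 << (i % 8)
    return bytes(mask)


bitmask = arrow_null_bitmask(['a', None, 'p', 'l', 'e'])
print(bitmask)
```

Elements 0, 2, 3 and 4 are non-null, so the byte is 1 + 4 + 8 + 16 = 29, which is 0x1d.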
size(self)

The number of strings managed by this instance.

Returns
int

number of strings

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(["hello","world"])
>>> print(s.size())
2
slice(self, start, stop=None, step=None)

Returns a substring of each string.

Parameters
startint

Beginning position of the string to extract. Default is the beginning of each string.

stopint

Ending position of the string to extract. Default is end of each string.

stepint

Characters that are to be captured within the specified section. Default is every character.

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(["hello","goodbye"])
>>> print(s.slice(2,5))
['llo', 'odb']
slice_from(self, starts=0, stops=0)

Return substring of each string using positions for each string.

The starts and stops parameters are device memory pointers. If specified, each must contain size() of int32 values.

Parameters
startsGPU memory pointer

Beginning position of each string to extract. Default is the beginning of each string.

stopsGPU memory pointer

Ending position of each string to extract. Default is the end of each string. Use -1 to specify the end of that string.

Examples

>>> import nvstrings
>>> import numpy as np
>>> from numba import cuda
>>> s = nvstrings.to_device(["hello","there"])
>>> darr = cuda.to_device(np.asarray([2,3],dtype=np.int32))
>>> print(s.slice_from(starts=darr.device_ctypes_pointer.value))
['llo','re']
slice_replace(self, start=None, stop=None, repl=None)

Replace the specified section of each string with a new string.

Parameters
startint

Beginning position of the string to replace. Default is the beginning of each string.

stopint

Ending position of the string to replace. Default is end of each string.

replstr

String to insert into the specified position values.

Examples

>>> import nvstrings
>>> strs = nvstrings.to_device(["abcdefghij","0123456789"])
>>> print(strs.slice_replace(2,5,'z'))
['abzfghij', '01z56789']
sort(self, stype=2, asc=True, nullfirst=True)

Sort this list by name (2) or length (1) or both (3). Sorting can help improve performance for other operations.

Parameters
stypeint

Type of sort to use. If stype is 1, strings will be sorted by length. If stype is 2, strings will be sorted alphabetically by name. If stype is 3, strings will be sorted by length and then alphabetically.

ascbool

Whether to sort ascending (True) or descending (False)

nullfirstbool

Null strings are sorted to the beginning by default

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(["aaa", "bb", "aaaabb"])
>>> print(s.sort(3))
['bb', 'aaa', 'aaaabb']
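The three stype values can be mimicked with Python's sorted() and different sort keys. This is a CPU-side sketch for non-null ASCII strings only; it ignores the nullfirst parameter, and sort_like is a hypothetical name.

```python
# CPU-side sketch of the three sort types (nulls are not handled here).
def sort_like(strings, stype=2, asc=True):
    """Sort by length (1), by name (2), or by length then name (3)."""
    if stype == 1:
        key = len
    elif stype == 2:
        key = lambda s: s
    elif stype == 3:
        key = lambda s: (len(s), s)
    else:
        raise ValueError("stype must be 1, 2 or 3")
    return sorted(strings, key=key, reverse=not asc)


by_both = sort_like(["aaa", "bb", "aaaabb"], stype=3)
by_name = sort_like(["aaa", "bb", "aaaabb"], stype=2)
print(by_both)
print(by_name)
```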
split(self, delimiter=None, n=-1)

A new set of columns (nvstrings) is created by splitting the strings vertically.

Parameters
delimiterstr

The character used to locate the split points of each string. Default is space.

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(["hello world","goodbye","well said"])
>>> for result in s.split(' '):
...     print(result)
["hello","goodbye","well"]
["world",None,"said"]
split_record(self, delimiter=None, n=-1)

Returns an array of nvstrings each representing the split of each individual string.

Parameters
delimiterstr

The character used to locate the split points of each string. Default is space.

nint

Maximum number of strings to return for each split.

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(["hello world","goodbye","well said"])
>>> for result in s.split_record(' '):
...     print(result)
["hello","world"]
["goodbye"]
["well","said"]
startswith(self, pat, devptr=0)

Return array of boolean values with True for the strings where the specified string is at the beginning.

Parameters
patstr

Pattern to find. Regular expressions are not accepted.

devptrGPU memory pointer

Optional device memory pointer to hold the results. Memory must be able to hold at least size() of np.byte values.

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(["hello","there","world"])
>>> print(s.startswith('h'))
[True, False, False]
stod(self, devptr=0)

Returns float64 values represented by each string.

Parameters
devptrGPU memory pointer

Where resulting float values will be written. Memory must be able to hold at least size() of float64 values

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(["1234","-876","543.2","-0.12",".55"])
>>> print(s.stod())
[1234.0, -876.0, 543.2000122070312,
 -0.11999999731779099, 0.550000011920929]
stof(self, devptr=0)

Returns float values represented by each string.

Parameters
devptrGPU memory pointer

Where resulting float values will be written. Memory must be able to hold at least size() of float32 values

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(["1234","-876","543.2","-0.12",".55"])
>>> print(s.stof())
[1234.0, -876.0, 543.2000122070312,
 -0.11999999731779099, 0.550000011920929]
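Long trailing digits like these are typical of float32 values printed as Python floats. The sketch below uses only the standard library to round each parsed value to float32 and widen it back; it illustrates the rounding effect and says nothing about how nvstrings parses the text. The helper to_float32 is hypothetical.

```python
import struct


# Standard-library sketch of float32 rounding (not the nvstrings parser).
def to_float32(value):
    """Round a Python float (float64) to the nearest float32 and widen it back."""
    return struct.unpack("f", struct.pack("f", value))[0]


rounded = [to_float32(float(s)) for s in ["1234", "-876", "543.2", "-0.12", ".55"]]
print(rounded)
```

Whole numbers such as 1234 are exactly representable in float32 and survive unchanged, while decimal fractions such as 543.2 pick up a small rounding error.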
stoi(self, devptr=0)

Returns integer values represented by each string.

Parameters
devptrGPU memory pointer

Where resulting integer values will be written. Memory must be able to hold at least size() of int32 values.

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(["1234","-876","543.2","-0.12",".55"])
>>> print(s.stoi())
[1234, -876, 543, 0, 0]
stol(self, devptr=0)

Returns int64 values represented by each string.

Parameters
devptrGPU memory pointer

Where resulting integer values will be written. Memory must be able to hold at least size() of int64 values.

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(["1234","-876","543.2","-0.12",".55"])
>>> print(s.stol())
[1234, -876, 543, 0, 0]
strip(self, to_strip=None)

Strip leading and trailing characters from each string.

Parameters
to_stripstr

Characters to be removed from both ends of each string

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(["oh, hello","goodbye"])
>>> print(s.strip('o'))
['h, hell', 'goodbye']
sublist(self, indexes, count=0)

Calls gather()

swapcase(self)

Change each lowercase character to uppercase and vice versa. This only applies to ASCII characters at this time.

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(["Hello, Friend","Goodbye, Friend"])
>>> print(s.swapcase())
['hELLO, fRIEND', 'gOODBYE, fRIEND']
timestamp2int(self, format=None, units='s', devptr=0)

Returns integer value represented by each string. String is interpreted using the format provided. The values are returned in the units specified based on epoch time and in UTC.

Parameters
formatstr

May include the following strptime specifiers only: %Y,%y,%m,%d,%H,%I,%p,%M,%S,%f,%z. Default format is “%Y-%m-%dT%H:%M:%SZ”.

unitsstr

The units of the returned values. Must be one of the following: Y,M,D,h,m,s,ms,us,ns. Default is ‘s’ for seconds.

devptrGPU memory pointer

Where resulting integer values will be written. Memory must be able to hold at least size() of int64 values.

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(["2019-03-20T12:34:56Z"])
>>> print(s.timestamp2int())
[1553085296]
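The value in this example can be cross-checked on the CPU with Python's datetime module. This is a sketch for the default format and the 's' unit only; timestamp_to_seconds is a hypothetical helper, not an nvstrings function.

```python
from datetime import datetime, timezone


# CPU cross-check for the default format and seconds units.
def timestamp_to_seconds(text, fmt="%Y-%m-%dT%H:%M:%SZ"):
    """Parse text as a UTC timestamp and return whole seconds since the epoch."""
    parsed = datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


seconds = timestamp_to_seconds("2019-03-20T12:34:56Z")
print(seconds)
```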
title(self)

Uppercase the first letter of each word (each letter after a space) and lowercase the rest. This only applies to ASCII characters at this time.

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(["Hello friend","goodnight moon"])
>>> print(s.title())
['Hello Friend', 'Goodnight Moon']
to_booleans(self, true='True', devptr=0)

Returns boolean value represented by each string.

Parameters
truestr

String to use for True values. All others are set to False.

devptrGPU memory pointer

Where resulting integer values will be written. Memory must be able to hold at least size() of uint32 values.

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(["True","False","",None])
>>> print(s.to_booleans())
[True, False, False, None]
to_host(self)

Copies strings back to CPU memory into a Python list.

Returns
list

A list of strings

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(["hello","world"])
>>> h = s.upper().to_host()
>>> print(h)
["HELLO","WORLD"]
to_offsets(self, sbuf, obuf, nbuf=0, bdevmem=False)

Store byte-array of characters encoded in UTF-8 and offsets and optional null-bitmask into provided memory.

Parameters
sbufmemory address or buffer

Strings characters are stored contiguously encoded as UTF-8.

obufmemory address or buffer

Stores array of int32 byte offsets to beginning of each string in sbuf. This should be able to hold size()+1 values.

nbufmemory address or buffer

Optional: stores null bitmask in arrow format.

bdevmemboolean

Default (False) interprets memory pointers as CPU memory.

Examples

>>> import nvstrings
>>> import numpy as np
>>> s = nvstrings.to_device(['a','p','p','l','e'])
>>> values = np.empty(s.size(), dtype=np.int8)
>>> offsets = np.empty(s.size()+1, dtype=np.int32)
>>> s.to_offsets(values,offsets)
>>> print("values",values.tobytes())
values b'apple'
>>> print("offsets",offsets)
offsets [0 1 2 3 4 5]
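The relationship between the character buffer and the offsets array can be illustrated in plain Python. This sketch assumes the layout described above (UTF-8 bytes stored contiguously, with size()+1 byte offsets); utf8_offsets is a hypothetical helper.

```python
# Plain-Python sketch of the contiguous UTF-8 buffer plus offsets layout.
def utf8_offsets(strings):
    """Return (bytes, offsets) where offsets[i] is the byte start of strings[i]."""
    data = bytearray()
    offsets = [0]
    for s in strings:
        data += s.encode("utf-8")
        offsets.append(len(data))  # the end of this string is the start of the next
    return bytes(data), offsets


chars, offsets = utf8_offsets(['a', 'p', 'p', 'l', 'e'])
print(chars, offsets)
```

Because offsets count bytes rather than characters, a multi-byte character such as 'é' advances the offset by 2. The character buffer therefore needs room for the total byte count, which can exceed the number of strings.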
translate(self, table)

Translate individual characters to new characters using the provided table.

Parameters
tabledict

Use str.maketrans() to build the mapping table. Unspecified characters are unchanged.

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(["hello","there","world"])
>>> print(s.translate(str.maketrans('elh','ELH')))
['HELLo', 'tHErE', 'worLd']
upper(self)

Convert each string to uppercase. This only applies to ASCII characters at this time.

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(["Hello, friend","Goodbye, friend"])
>>> print(s.upper())
['HELLO, FRIEND', 'GOODBYE, FRIEND']
url_decode(self)

URL-decode each string and return as a new instance. No format checking is performed. All characters are expected to be encoded as UTF-8 hex values.

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(['A%2FB-C%2FD', 'e%20f.g', '4-5%2C6'])
>>> print(s.url_decode())
['A/B-C/D', 'e f.g', '4-5,6']
url_encode(self)

URL-encode each string and return as a new instance. No format checking is performed. All characters are encoded except for ASCII letters, digits, and these characters: '.', '_', '-', '~'. Encoding converts to hex using UTF-8 encoded bytes.

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(["a/b-c/d","E F.G","1-2,3"])
>>> print(s.url_encode())
['a%2Fb-c%2Fd', 'E%20F.G', '1-2%2C3']
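The same encoding rule is available on the CPU through urllib.parse.quote, which can be handy for checking results. This is a sketch assuming Python 3.7 or later, where quote never encodes letters, digits, '_', '.', '-' or '~'; passing safe="" makes it encode '/' as well. The helper url_encode_all is hypothetical.

```python
from urllib.parse import quote, unquote


# CPU-side comparison using the standard library (Python 3.7+ assumed).
def url_encode_all(strings):
    """Percent-encode everything except ASCII letters, digits and . _ - ~"""
    return [quote(s, safe="") for s in strings]


encoded = url_encode_all(["a/b-c/d", "E F.G", "1-2,3"])
decoded = [unquote(s) for s in encoded]
print(encoded)
print(decoded)
```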
wrap(self, width)

This will place new-line characters in whitespace so each line is no more than width characters. Lines will not be truncated.

Parameters
widthint

The maximum width of characters per line in the new string. If a string is no longer than the width, no newlines will be inserted.

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(["hello there","goodbye all","well ok"])
>>> print(s.wrap(3))
['hello\nthere', 'goodbye\nall', 'well\nok']
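The "lines will not be truncated" behavior resembles Python's textwrap with break_long_words=False, where a word longer than the width is kept whole on its own line. The following is a CPU-side analogue, not the nvstrings implementation; wrap_all is a hypothetical helper.

```python
import textwrap


# CPU-side analogue: wrap at whitespace without breaking long words.
def wrap_all(strings, width):
    """Insert newlines at whitespace so lines fit width where possible."""
    return [
        textwrap.fill(s, width=width, break_long_words=False, break_on_hyphens=False)
        for s in strings
    ]


wrapped = wrap_all(["hello there", "goodbye all", "well ok"], 3)
print(wrapped)
```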
zfill(self, width)

Pads the strings with leading zeros. It will handle prefix sign characters correctly for strings containing leading number characters.

Parameters
widthint

The minimum width of characters of the new string. If the width is smaller than the existing string, no padding is performed.

Examples

>>> import nvstrings
>>> s = nvstrings.to_device(["hello","1234","-9876","+5.34"])
>>> print(s.zfill(width=6))
['0hello', '001234', '-09876', '+05.34']

nvcategory

class nvcategory.nvcategory(cptr)

Instance manages a dictionary of strings (keys) in device memory and a mapping of indexes (values).
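The keys/values split is a dictionary encoding: keys are the sorted unique strings and each value is the index of its string in the keys. The sketch below reproduces this in plain Python for non-null strings, assuming sorted keys as the examples in this section suggest; encode_category and decode_category are hypothetical helpers.

```python
# Plain-Python sketch of dictionary encoding (nulls are not handled here).
def encode_category(strings):
    """Return (keys, values): sorted unique strings and an index per input string."""
    keys = sorted(set(strings))
    position = {k: i for i, k in enumerate(keys)}
    return keys, [position[s] for s in strings]


def decode_category(keys, values):
    """Rebuild the original strings from keys and values."""
    return [keys[v] for v in values]


keys, values = encode_category(["eee", "aaa", "eee", "dddd"])
print(keys, values)
print(decode_category(keys, values))
```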

Methods

add_keys(self, keys[, nulls])

Create new category adding the specified keys and remapping values to the new key indexes.

add_strings(self, nvs)

Create new category incorporating specified strings.

gather(self, indexes[, count])

Return nvcategory instance using the keys from this instance and copying the values from the indexes argument.

gather_and_remap(self, indexes[, count])

Return nvcategory instance using the specified indexes to gather strings from this instance.

gather_numbers(self, indexes, narr[, nulls])

Fill buffer with keys values specified by the given indexes.

gather_strings(self, indexes[, count])

Return nvstrings instance represented using the specified indexes.

get_cpointer(self)

Returns memory pointer to underlying C++ class instance.

indexes_for_key(self, key[, devptr])

Return all index values for given key.

keys(self[, narr])

Return the unique keys for this category.

keys_size(self)

The number of keys.

keys_type(self)

Return string with name of the keys type.

merge_and_remap(self, nvcat)

Create new category incorporating the specified category keys and values.

merge_category(self, nvcat)

Create new category incorporating the specified category keys and values.

remove_keys(self, keys[, nulls])

Create new category removing the specified keys and remapping values to the new key indexes.

remove_strings(self, nvs)

Create new category without the specified strings.

remove_unused_keys(self)

Create new category removing any keys that have no corresponding values.

set_keys(self, keys[, nulls])

Create new category using the specified keys and remapping values to the new key indexes.

size(self)

The number of values.

to_numbers(self, narr[, nulls])

Fill array with key numbers as represented by the values in this instance.

to_strings(self)

Return nvstrings instance represented by the values in this instance.

value(self, str)

Return the category value for the given string.

value_for_index(self, idx)

Return the category value for the given index.

values(self[, devptr])

Return all values for this instance.

values_cpointer(self)

Returns memory pointer to underlying device memory array of int32 values for this instance.

add_keys(self, keys, nulls=None)

Create new category adding the specified keys and remapping values to the new key indexes.

Parameters
keysnvstrings or ndarray

keys to be added to existing keys

nulls: bitmask

ndarray of type uint8

Examples

>>> import nvcategory
>>> import numpy as np
>>> c = nvcategory.from_numbers(np.array([4, 1, 2, 3, 2, 1, 4]))
>>> print(c.keys(),c.values())
[1, 2, 3, 4] [3, 0, 1, 2, 1, 0, 3]
>>> nc = c.add_keys(np.array([2,1,0,3]))
>>> print(nc.keys(),nc.values())
[0, 1, 2, 3, 4] [4, 1, 2, 3, 2, 1, 4]
add_strings(self, nvs)

Create new category incorporating specified strings. This will return a new nvcategory with new key values. The index values will appear as if appended.

Parameters
nvsnvstrings

New strings to be added.

Examples

>>> import nvcategory, nvstrings
>>> s1 = nvstrings.to_device(["eee","aaa","eee","dddd"])
>>> s2 = nvstrings.to_device(["ggg","eee","aaa"])
>>> c1 = nvcategory.from_strings(s1)
>>> c2 = c1.add_strings(s2)
>>> print(c1.keys())
['aaa','dddd','eee']
>>> print(c1.values())
[2, 0, 2, 1]
>>> print(c2.keys())
['aaa','dddd','eee','ggg']
>>> print(c2.values())
[2, 0, 2, 1, 3, 2, 0]
gather(self, indexes, count=0)

Return nvcategory instance using the keys from this instance and copying the values from the indexes argument.

Parameters
indexeslist or GPU memory pointer

List of ints or GPU memory pointer to array of int32 values.

countint

Number of ints if indexes parm is a device pointer. Otherwise it is ignored.

Returns
nvcategory

keys and values based on indexes provided

Examples

>>> import nvcategory
>>> c = nvcategory.to_device(["aa","bb","bb","ff","cc","ff"])
>>> print(c.keys(),c.values())
['aa', 'bb', 'cc', 'ff'] [0, 1, 1, 3, 2, 3]
>>> c = c.gather([1,3,2,3,1,2])
>>> print(c.keys(),c.values())
['aa', 'bb', 'cc', 'ff'] [1, 3, 2, 3, 1, 2]
>>> import numpy as np
>>> nc = nvcategory.from_numbers(np.array([2, 1, 5, 4, 1, 5, 1, 1]))
>>> indexes = np.array([1, 3, 2, 3, 1, 2], dtype=np.int32)
>>> nc1 = nc.gather(indexes)
>>> print(nc1.keys(),nc1.values())
[1, 2, 4, 5] [1, 3, 2, 3, 1, 2]
gather_and_remap(self, indexes, count=0)

Return nvcategory instance using the specified indexes to gather strings from this instance. Index values will be remapped if any keys are not represented. This is equivalent to calling nvcategory.from_strings() using the nvstrings object returned from a call to gather_strings().

Parameters
indexeslist or GPU memory pointer

List of ints or GPU memory pointer to array of int32 values.

countint

Number of ints if indexes parm is a device pointer. Otherwise it is ignored.

Returns
nvcategory

keys and values based on indexes provided

Examples

>>> import nvcategory
>>> c = nvcategory.to_device(["aa","bb","bb","ff","cc","ff"])
>>> print(c.keys(),c.values())
['aa', 'bb', 'cc', 'ff'] [0, 1, 1, 3, 2, 3]
>>> c = c.gather_and_remap([1,3,2,3,1,2])
>>> print(c.keys(),c.values())
['bb', 'cc', 'ff'] [0, 2, 1, 2, 0, 1]
>>> import numpy as np
>>> nc = nvcategory.from_numbers(np.array([2, 1, 5, 4, 1, 5, 1, 1]))
>>> indexes = np.array([1, 3, 2, 3, 1, 2], dtype=np.int32)
>>> nc1 = nc.gather_and_remap(indexes)
>>> print(nc1.keys(),nc1.values())
[2, 4, 5] [0, 2, 1, 2, 0, 1]
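The remapping step can be followed by hand with a small CPU-side sketch: gather the keys named by the indexes, rebuild the key set from what was gathered, and re-index. This is plain Python under the assumption of sorted keys; the function below is a hypothetical stand-in, not the nvcategory method.

```python
# CPU-side sketch of gather followed by key remapping (assumed semantics).
def gather_and_remap(keys, indexes):
    """Gather keys by index, drop unused keys, and re-index the values."""
    gathered = [keys[i] for i in indexes]
    new_keys = sorted(set(gathered))
    position = {k: i for i, k in enumerate(new_keys)}
    return new_keys, [position[g] for g in gathered]


str_keys, str_values = gather_and_remap(['aa', 'bb', 'cc', 'ff'], [1, 3, 2, 3, 1, 2])
num_keys, num_values = gather_and_remap([1, 2, 4, 5], [1, 3, 2, 3, 1, 2])
print(str_keys, str_values)
print(num_keys, num_values)
```

The key 'aa' (and the numeric key 1) is never referenced by the indexes, so it disappears and the remaining values shift down accordingly.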
gather_numbers(self, indexes, narr, nulls=None)

Fill buffer with keys values specified by the given indexes.

Parameters
indexesndarray of type int32

List of integers identifying keys to copy to the output.

narrndarray

Type must match the keys for this instance and hold the number of values indicated by the indexes parameter.

nullsndarray

Bits are set to indicate null and non-null entries in the narr output array.

Examples

>>> import nvcategory
>>> import numpy as np
>>> narr = np.array([2, 1, 1.25, 1.5, 1, 1.25, 1, 1, 2])
>>> nc = nvcategory.from_numbers(narr)
>>> idxs = np.array([0, 2, 0], dtype=np.int32)
>>> nbrs = np.empty([idxs.size], dtype=narr.dtype)
>>> nc.gather_numbers(idxs, nbrs)
>>> nbrs.tolist()
[1.0, 1.5, 1.0]
gather_strings(self, indexes, count=0)

Return nvstrings instance represented using the specified indexes.

Parameters
indexesList of ints or GPU memory pointer

0-based indexes of keys to return as an nvstrings object

countint

Number of ints if indexes parm is a device pointer. Otherwise it is ignored.

Returns
nvstrings

strings list based on indexes

Examples

>>> import nvcategory
>>> c = nvcategory.to_device(["eee","aaa","eee","dddd"])
>>> print(c.keys())
['aaa','dddd','eee']
>>> print(c.values())
[2, 0, 2, 1]
>>> print(c.gather_strings([0,2,0]))
['aaa','eee','aaa']
get_cpointer(self)

Returns memory pointer to underlying C++ class instance.

indexes_for_key(self, key, devptr=0)

Return all index values for given key.

Parameters
keystr or number

key whose values should be returned

devptrGPU memory pointer or ndarray

Where index values will be written. Must be able to hold int32 values for this key.

Examples

>>> import nvcategory
>>> c = nvcategory.to_device(["eee","aaa","eee","dddd"])
>>> print(c.indexes_for_key('aaa'))
[1]
>>> print(c.indexes_for_key('eee'))
[0, 2]
>>> import numpy as np
>>> narr = np.array([2, 1, 1.25, 1.5, 1, 1.25, 1, 1, 2])
>>> nc = nvcategory.from_numbers(narr)
>>> count = nc.indexes_for_key(1)
>>> idxs = np.empty([count], dtype=np.int32)
>>> count = nc.indexes_for_key(1, idxs)
>>> idxs.tolist()
[1, 4, 6, 7]
keys(self, narr=None)

Return the unique keys for this category. String keys are returned as nvstrings instance. Numeric keys require a buffer to fill. Buffer must be able to hold at least keys_size() elements of type keys_type().

Returns
nvstrings or None

Examples

>>> import nvcategory
>>> c = nvcategory.to_device(["eee","aaa","eee","dddd"])
>>> print(c.keys())
['aaa','dddd','eee']
>>> import numpy as np
>>> narr = np.array([2, 1, 1.25, 1.5, 1, 1.25, 1, 1, 2])
>>> nc = nvcategory.from_numbers(narr)
>>> keys = np.empty([nc.keys_size()], dtype=narr.dtype)
>>> nc.keys(keys)
>>> keys.tolist()
[1.0, 1.25, 1.5, 2.0]
keys_size(self)

The number of keys.

Returns
int

Number of keys

Examples

>>> import nvcategory
>>> c = nvcategory.to_device(["eee","aaa","eee","dddd"])
>>> print(c.keys())
['aaa','dddd','eee']
>>> print(c.keys_size())
3
keys_type(self)

Return string with name of the keys type.

Examples

>>> import nvcategory
>>> import numpy as np
>>> narr = np.array([2, 1, 1.25, 1.5, 1, 1.25, 1, 1, 2])
>>> nc = nvcategory.from_numbers(narr)
>>> nc.keys_type()
'float64'
merge_and_remap(self, nvcat)

Create new category incorporating the specified category keys and values. This will return a new nvcategory with new key values. The index values will appear as if appended. Values are appended and will be remapped to the new keys.

Parameters
nvcatnvcategory

New category to be merged.

Examples

>>> import nvcategory
>>> import numpy as np
>>> cat1 = nvcategory.from_numbers(np.array([4, 1, 2, 3, 2, 1, 4]))
>>> print(cat1.keys(),cat1.values())
[1, 2, 3, 4] [3, 0, 1, 2, 1, 0, 3]
>>> cat2 = nvcategory.from_numbers(np.array([2, 4, 3, 0]))
>>> print(cat2.keys(),cat2.values())
[0, 2, 3, 4] [1, 3, 2, 0]
>>> nc = cat1.merge_and_remap(cat2)
>>> print(nc.keys(),nc.values())
[0, 1, 2, 3, 4] [4, 1, 2, 3, 2, 1, 4, 2, 4, 3, 0]
merge_category(self, nvcat)

Create new category incorporating the specified category keys and values. This will return a new nvcategory with new key values. The index values will appear as if appended. Any matching keys will preserve their values and any new keys will get new values.

Parameters
nvcatnvcategory

New category to be merged.

remove_keys(self, keys, nulls=None)

Create new category removing the specified keys and remapping values to the new key indexes. Values with removed keys are mapped to -1.

Parameters
keysnvstrings or ndarray

keys to be removed from existing keys

nulls: bitmask

ndarray of type uint8

Examples

>>> import nvcategory
>>> import numpy as np
>>> c = nvcategory.from_numbers(np.array([4, 1, 2, 3, 2, 1, 4]))
>>> print(c.keys(),c.values())
[1, 2, 3, 4] [3, 0, 1, 2, 1, 0, 3]
>>> nc = c.remove_keys(np.array([4,0]))
>>> print(nc.keys(),nc.values())
[1, 2, 3] [-1, 0, 1, 2, 1, 0, -1]
remove_strings(self, nvs)

Create new category without the specified strings. The returned category will have new set of key values and indexes.

Parameters
nvsnvstrings

strings to be removed.

Examples

>>> import nvcategory, nvstrings
>>> s1 = nvstrings.to_device(["eee","aaa","eee","dddd"])
>>> s2 = nvstrings.to_device(["aaa"])
>>> c1 = nvcategory.from_strings(s1)
>>> c2 = c1.remove_strings(s2)
>>> print(c1.keys())
['aaa','dddd','eee']
>>> print(c1.values())
[2, 0, 2, 1]
>>> print(c2.keys())
['dddd', 'eee']
>>> print(c2.values())
[1, 1, 0]
remove_unused_keys(self)

Create new category removing any keys that have no corresponding values. Values are remapped to match the new keyset.

Examples

>>> import nvcategory
>>> import numpy as np
>>> c = nvcategory.from_numbers(np.array([4, 1, 2, 3, 2, 1, 4]))
>>> print(c.keys(),c.values())
[1, 2, 3, 4] [3, 0, 1, 2, 1, 0, 3]
>>> nc = c.add_keys(np.array([4,0]))
>>> print(nc.keys(),nc.values())
[0, 1, 2, 3, 4] [4, 1, 2, 3, 2, 1, 4]
>>> nc1 = nc.remove_unused_keys()
>>> print(nc1.keys(),nc1.values())
[1, 2, 3, 4] [3, 0, 1, 2, 1, 0, 3]
set_keys(self, keys, nulls=None)

Create new category using the specified keys and remapping values to the new key indexes. Matching names will have remapped values. Values with removed keys are mapped to -1.

Parameters
keysnvstrings or ndarray

keys to be used for new category

nulls: bitmask

ndarray of type uint8

Examples

>>> import nvcategory
>>> import numpy as np
>>> c = nvcategory.from_numbers(np.array([4, 1, 2, 3, 2, 1, 4]))
>>> print(c.keys(),c.values())
[1, 2, 3, 4] [3, 0, 1, 2, 1, 0, 3]
>>> nc = c.set_keys(np.array([2, 4, 3, 0]))
>>> print(nc.keys(),nc.values())
[0, 2, 3, 4] [3, -1, 1, 2, 1, -1, 3]
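The -1 entries follow from looking up each old key in the new key set. The sketch below is a plain-Python illustration assuming sorted keys and -1 for values whose key is no longer present; the function is a hypothetical stand-in for the method.

```python
# Plain-Python sketch of remapping values onto a new key set (assumed semantics).
def set_keys(keys, values, new_keys):
    """Return (sorted new keys, values remapped to them, -1 where the key is gone)."""
    new_keys = sorted(set(new_keys))
    position = {k: i for i, k in enumerate(new_keys)}
    remapped = [position.get(keys[v], -1) if v >= 0 else -1 for v in values]
    return new_keys, remapped


new_keys, new_values = set_keys([1, 2, 3, 4], [3, 0, 1, 2, 1, 0, 3], [2, 4, 3, 0])
print(new_keys, new_values)
```

Here the old key 1 is absent from the new keys, so both positions that held it become -1, while the new key 0 is present but unused.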
size(self)

The number of values.

Returns
int

Number of values

Examples

>>> import nvcategory
>>> c = nvcategory.to_device(["eee","aaa","eee","dddd"])
>>> print(c.values())
[2, 0, 2, 1]
>>> print(c.size())
4
to_numbers(self, narr, nulls=None)

Fill array with key numbers as represented by the values in this instance.

Parameters
narrndarray

Array to fill with numbers. Must be of the same type as the keys for this instance and must be able to hold size() values.

nullsndarray

Array to fill with bits identifying null and non-null entries.

Examples

>>> import nvcategory
>>> import numpy as np
>>> narr = np.array([2, 1, 1.25, 1.5, 1, 1.25, 1, 1, 2])
>>> nc = nvcategory.from_numbers(narr)
>>> nbrs = np.empty([nc.size()], dtype=narr.dtype)
>>> nc.to_numbers(nbrs)
>>> nbrs.tolist()
[2.0, 1.0, 1.25, 1.5, 1.0, 1.25, 1.0, 1.0, 2.0]
to_strings(self)

Return nvstrings instance represented by the values in this instance.

Returns
nvstrings

Full strings list based on values indexes

Examples

>>> import nvcategory
>>> c = nvcategory.to_device(["eee","aaa","eee","dddd"])
>>> print(c.keys())
['aaa','dddd','eee']
>>> print(c.values())
[2, 0, 2, 1]
>>> print(c.to_strings())
['eee','aaa','eee','dddd']
value(self, str)

Return the category value for the given string.

Parameters
strstr

key to retrieve

Examples

>>> import nvcategory
>>> c = nvcategory.to_device(["eee","aaa","eee","dddd"])
>>> print(c.value('aaa'))
0
>>> print(c.value('eee'))
2
value_for_index(self, idx)

Return the category value for the given index.

Parameters
idxint

index value to retrieve

Examples

>>> import nvcategory
>>> c = nvcategory.to_device(["eee","aaa","eee","dddd"])
>>> print(c.value_for_index(3))
1
values(self, devptr=0)

Return all values for this instance.

Parameters
devptrGPU memory pointer

Where index values will be written. Must be able to hold size() of int32 values.

Examples

>>> import nvcategory
>>> c = nvcategory.to_device(["eee","aaa","eee","dddd"])
>>> print(c.values())
[2, 0, 2, 1]
>>> import numpy as np
>>> narr = np.array([2, 1, 1.25, 1.5, 1, 1.25, 1, 1, 2])
>>> nc = nvcategory.from_numbers(narr)
>>> values = np.empty([nc.size()], dtype=np.int32)
>>> nc.values(values)
>>> values.tolist()
[3, 0, 1, 2, 0, 1, 0, 0, 3]
values_cpointer(self)

Returns memory pointer to underlying device memory array of int32 values for this instance.

nvtext

nvtext.contains_strings(strs, tgts, devptr=0)

The tgts strings are searched for within each strs. The returned byte array is 1 for each tgts in strs and 0 otherwise.

Parameters
strsnvstrings

The strings for this operation.

tgtsnvstrings

The strings to check for inside each strs.

devptrGPU memory pointer

Must be able to hold at least strs.size()*tgts.size() of int8 values.

Examples

>>> import nvstrings, nvtext
>>> s = nvstrings.to_device(["hello","goodbye",""])
>>> t = nvstrings.to_device(['o','y'])
>>> n = nvtext.contains_strings(s,t)
>>> print(n)
[[True,False],[True,True],[False,False]]
nvtext.edit_distance(strs, tgt, algo=0, devptr=0)

Compute the edit-distance between strs and tgt. Edit distance is the number of character changes needed to turn one string into the other.

Parameters
strs: nvstrings

The strings for this operation.

tgt: str or nvstrings

The string or strings to compute edit-distance with.

algo: int

0 = Levenshtein

devptr: GPU memory pointer

Must be able to hold at least strs.size() of int32 values.

Examples

>>> import nvstrings, nvtext
>>> s = nvstrings.to_device(["honda","hyundai"])
>>> n = nvtext.edit_distance(s,"honda")
>>> print(n)
[0,3]
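
For readers who want to check results by hand, here is a minimal pure-Python sketch of the Levenshtein distance (algo=0) using the standard two-row dynamic programming table. The function name `levenshtein` is an assumption for illustration, and this is not how the library computes the value on the GPU.

```python
def levenshtein(a, b):
    """Count the insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print([levenshtein(s, "honda") for s in ["honda", "hyundai"]])  # [0, 3]
```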
nvtext.ngrams(tokens, N=2, sep='_')

Generate the n-grams from a set of tokens. You can generate tokens from an nvstrings instance using the tokenize() function.

Parameters
tokens: nvstrings

The tokens for this operation.

N: int

The degree of the n-gram (number of consecutive tokens). Default of 2 for bigrams.

sep: str

The separator to use between tokens within an n-gram. Default is '_'.

Examples

>>> import nvstrings, nvtext
>>> dstrings = nvstrings.to_device(['this is my', 'favorite book'])
>>> tokens = nvtext.tokenize(dstrings)
>>> print(nvtext.ngrams(tokens, N=2, sep='_'))
['this_is', 'is_my', 'my_favorite', 'favorite_book']
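
The sliding-window behaviour can be sketched on the host as follows. The helper `make_ngrams` is hypothetical; it takes an already tokenized Python list and joins each run of n consecutive tokens with the separator.

```python
def make_ngrams(tokens, n=2, sep="_"):
    """Join every run of n consecutive tokens with sep."""
    return [sep.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Tokenize on whitespace first, then slide a window of size n over the tokens.
tokens = [tok for s in ["this is my", "favorite book"] for tok in s.split()]
print(make_ngrams(tokens))
# ['this_is', 'is_my', 'my_favorite', 'favorite_book']
```

Note that n-grams span string boundaries here ('my_favorite'), as in the example output above, because the tokens form a single flat list.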
nvtext.normalize_spaces(strs)

Remove extra whitespace between tokens and trim whitespace from the beginning and the end of each string.

Parameters
strs: nvstrings

The strings for this operation.

Examples

>>> import nvstrings, nvtext
>>> s = nvstrings.to_device(["hello          world"," test string  "])
>>> n = nvtext.normalize_spaces(s)
>>> print(n)
["hello world", "test string"]
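
A rough CPU equivalent, shown only to make the expected result concrete, is to split on whitespace and rejoin with single spaces. The name `normalize_spaces_host` is hypothetical.

```python
def normalize_spaces_host(s):
    """Collapse runs of whitespace to one space and trim both ends."""
    return " ".join(s.split())


strings = ["hello          world", " test string  "]
print([normalize_spaces_host(s) for s in strings])
# ['hello world', 'test string']
```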
nvtext.replace_tokens(strs, tgts, repls, delimiter=None)

The tgts tokens are searched for within each strs and replaced with the corresponding repls if found. Tokens are identified by the delimiter character(s) provided.

Parameters
strs: nvstrings

The strings for this operation.

tgts: nvstrings

The tokens to search for inside each strs.

repls: nvstrings or str

The replacement strings, one for each tgts token. Alternatively, this can be a single str, which is then used as the replacement for every token found.

delimiter: str

The characters used to locate the tokens of each string. Default is whitespace.

Examples

>>> import nvstrings, nvtext
>>> s = nvstrings.to_device(["this is me","theme music",""])
>>> t = nvstrings.to_device(['is','me'])
>>> r = nvtext.replace_tokens(s,t,'_')
>>> print(r)
["this _ _", "theme music", ""]
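
The following host-side sketch illustrates why "theme" is left untouched in the example: only whole tokens are compared against tgts, not substrings. The function `replace_tokens_host` is hypothetical and simplified; in particular, with the default delimiter it rejoins tokens with a single space.

```python
def replace_tokens_host(strs, tgts, repls, delimiter=None):
    """Replace whole tokens that match tgts; other tokens are kept as-is."""
    if isinstance(repls, str):
        repls = [repls] * len(tgts)  # a single str replaces every target
    mapping = dict(zip(tgts, repls))
    joiner = " " if delimiter is None else delimiter
    return [
        joiner.join(mapping.get(tok, tok) for tok in s.split(delimiter))
        for s in strs
    ]


strings = ["this is me", "theme music", ""]
print(replace_tokens_host(strings, ["is", "me"], "_"))
# ['this _ _', 'theme music', '']
```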
nvtext.strings_counts(strs, tgts, devptr=0)

The tgts strings are searched for within each strs. The returned int32 array holds the number of occurrences of each tgts in each strs.

Parameters
strs: nvstrings

The strings for this operation.

tgts: nvstrings

The strings to count inside each strs.

devptr: GPU memory pointer

Must be able to hold at least strs.size()*tgts.size() of int32 values.

Examples

>>> import nvstrings, nvtext
>>> s = nvstrings.to_device(["hello","goodbye",""])
>>> t = nvstrings.to_device(['o','y'])
>>> n = nvtext.strings_counts(s,t)
>>> print(n)
[[1,0],[2,1],[0,0]]
nvtext.token_count(strs, delimiter=' ', devptr=0)

Each string is split into tokens using the provided delimiter. The returned integer array is the number of tokens in each string.

Parameters
strs: nvstrings

The strings for this operation.

delimiter: str

The character used to locate the split points of each string. Default is space.

devptr: GPU memory pointer

Must be able to hold at least strs.size() of int32 values.

Examples

>>> import nvstrings, nvtext
>>> s = nvstrings.to_device(["hello world","goodbye",""])
>>> n = nvtext.token_count(s)
>>> print(n)
[2,1,0]
nvtext.tokenize(strs, delimiter=None)

Each string is split into tokens using the provided delimiter(s). The nvstrings instance returned contains the tokens in the order they were found.

Parameters
strs: nvstrings

The strings for this operation.

delimiter: str or nvstrings or list of strs

The string used to locate the split points of each string. Default is whitespace.

Examples

>>> import nvstrings, nvtext
>>> s = nvstrings.to_device(["hello world",
...                          "goodbye world",
...                          "hello goodbye"])
>>> t = nvtext.tokenize(s)
>>> print(t)
["hello","world","goodbye","world","hello","goodbye"]
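
The link between tokenize() and token_count() can be sketched in plain Python: the flat token list has as many entries as the per-string counts add up to. The helper names below are hypothetical, and dropping empty tokens is an assumption chosen so that an empty string yields a count of 0, as in the token_count example.

```python
def split_tokens(s, delimiter=None):
    """Split one string and drop empty tokens."""
    return [tok for tok in s.split(delimiter) if tok]


def token_count_host(strs, delimiter=None):
    """Number of tokens in each string."""
    return [len(split_tokens(s, delimiter)) for s in strs]


def tokenize_host(strs, delimiter=None):
    """All tokens from all strings, flattened in the order found."""
    return [tok for s in strs for tok in split_tokens(s, delimiter)]


strings = ["hello world", "goodbye world", "hello goodbye"]
print(tokenize_host(strings))
# ['hello', 'world', 'goodbye', 'world', 'hello', 'goodbye']
print(token_count_host(["hello world", "goodbye", ""]))  # [2, 1, 0]
```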
nvtext.tokens_counts(strs, tgts, delimiter=' ', devptr=0)

Each strs is split into tokens using the provided delimiter, and the tgts strings are matched against those tokens. The returned int32 array holds the number of times each tgts occurs as a token in each strs.

Parameters
strs: nvstrings

The strings for this operation.

tgts: nvstrings

The tokens to count inside each strs.

delimiter: str

The character used to locate the split points of each string. Default is space.

devptr: GPU memory pointer

Must be able to hold at least strs.size()*tgts.size() of int32 values.

Examples

>>> import nvstrings, nvtext
>>> s = nvstrings.to_device(["this is me","goodbye me",""])
>>> t = nvstrings.to_device(['is','me'])
>>> n = nvtext.tokens_counts(s,t)
>>> print(n)
[[1,1],[0,1],[0,0]]
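
To make the difference between strings_counts() and tokens_counts() concrete, this CPU-only sketch runs both kinds of count on the same input. The function names are hypothetical stand-ins. `str.count` counts non-overlapping occurrences, which may differ from the library for overlapping matches.

```python
def strings_counts_host(strs, tgts):
    """Count substring occurrences of each target in each string."""
    return [[s.count(t) for t in tgts] for s in strs]


def tokens_counts_host(strs, tgts, delimiter=" "):
    """Count whole-token occurrences of each target in each string."""
    return [[s.split(delimiter).count(t) for t in tgts] for s in strs]


strings = ["this is me", "goodbye me", ""]
targets = ["is", "me"]
print(strings_counts_host(strings, targets))  # [[2, 1], [0, 1], [0, 0]]
print(tokens_counts_host(strings, targets))   # [[1, 1], [0, 1], [0, 0]]
```

The substring count finds "is" twice in "this is me" (once inside "this"), while the token count finds it once.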
nvtext.unique_tokens(strs, delimiter=' ')

Each string is split into tokens using the provided delimiter. The nvstrings instance returned contains the unique tokens.

Parameters
strs: nvstrings

The strings for this operation.

delimiter: str

The character used to locate the split points of each string. Default is space.

Examples

>>> import nvstrings, nvtext
>>> s = nvstrings.to_device(["hello world",
...                          "goodbye world",
...                          "hello goodbye"])
>>> ut = nvtext.unique_tokens(s)
>>> print(ut)
["goodbye","hello","world"]
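
A host-side approximation of the same result is a sorted set of all tokens. The function name `unique_tokens_host` is hypothetical, and the sorted order is an assumption based on the example output above rather than a documented guarantee.

```python
def unique_tokens_host(strs, delimiter=" "):
    """Collect the distinct non-empty tokens across all strings, sorted."""
    tokens = {tok for s in strs for tok in s.split(delimiter) if tok}
    return sorted(tokens)


strings = ["hello world", "goodbye world", "hello goodbye"]
print(unique_tokens_host(strings))  # ['goodbye', 'hello', 'world']
```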