API Reference

IP

clx.ip.hostmask(ips, prefixlen=16)

Compute a column of hostmasks for a column of IP addresses. Addresses must be IPv4. IPv6 not yet supported.

Parameters
  • ips – IP addresses

  • prefixlen (int) – Length of the network prefix, in bits, for IPv4 addresses

Returns

hostmasks

Return type

cudf.Series

Examples

>>> import clx.ip
>>> import cudf
>>> clx.ip.hostmask(cudf.Series(["192.168.0.1","10.0.0.1"]), prefixlen=16)
0    0.0.255.255
1    0.0.255.255
Name: hostmask, dtype: object
clx.ip.int_to_ip(values)

Convert integer column to IP addresses. Addresses must be IPv4. IPv6 not yet supported.

Parameters

values (cudf.Series) – Integers to be converted

Returns

IP addresses

Return type

cudf.Series

Examples

>>> import clx.ip
>>> import cudf
>>> clx.ip.int_to_ip(cudf.Series([3232235521, 167772161]))
0    192.168.0.1
1       10.0.0.1
dtype: object
clx.ip.ip_to_int(values)

Convert string column of IP addresses to integer values. Addresses must be IPv4. IPv6 not yet supported.

Parameters

values (cudf.Series) – IP addresses to be converted

Returns

Integer representations of IP addresses

Return type

cudf.Series

Examples

>>> import clx.ip
>>> import cudf
>>> clx.ip.ip_to_int(cudf.Series(["192.168.0.1","10.0.0.1"]))
0    3232235521
1     167772161
dtype: int64
clx.ip.is_global(ips)

Indicates whether each address is global. Addresses must be IPv4. IPv6 not yet supported.

Parameters

ips – IP addresses

Returns

booleans

Return type

cudf.Series

Examples

>>> import clx.ip
>>> import cudf
>>> clx.ip.is_global(cudf.Series(["127.0.0.1","207.46.13.151"]))
0    False
1    True
dtype: bool
clx.ip.is_ip(ips)

Indicates whether each value is a valid IP address string. Addresses must be IPv4. IPv6 not yet supported.

Parameters

ips – IP addresses

Returns

booleans

Return type

cudf.Series

Examples

>>> import clx.ip
>>> import cudf
>>> clx.ip.is_ip(cudf.Series(["192.168.0.1","10.123.0"]))
0     True
1    False
dtype: bool

clx.ip.is_link_local(ips)

Indicates whether each address is link local. Addresses must be IPv4. IPv6 not yet supported.

Parameters

ips – IP addresses

Returns

booleans

Return type

cudf.Series

Examples

>>> import clx.ip
>>> import cudf
>>> clx.ip.is_link_local(cudf.Series(["127.0.0.1","169.254.123.123"]))
0    False
1    True
dtype: bool
clx.ip.is_loopback(ips)

Indicates whether each address is loopback. Addresses must be IPv4. IPv6 not yet supported.

Parameters

ips – IP addresses

Returns

booleans

Return type

cudf.Series

Examples

>>> import clx.ip
>>> import cudf
>>> clx.ip.is_loopback(cudf.Series(["127.0.0.1","10.0.0.1"]))
0     True
1    False
dtype: bool
clx.ip.is_multicast(ips)

Indicates whether each address is multicast. Addresses must be IPv4. IPv6 not yet supported.

Parameters

ips – IP addresses

Returns

booleans

Return type

cudf.Series

Examples

>>> import clx.ip
>>> import cudf
>>> clx.ip.is_multicast(cudf.Series(["127.0.0.1","224.0.0.0"]))
0    False
1    True
dtype: bool
clx.ip.is_private(ips)

Indicates whether each address is private. Addresses must be IPv4. IPv6 not yet supported.

Parameters

ips – IP addresses

Returns

booleans

Return type

cudf.Series

Examples

>>> import clx.ip
>>> import cudf
>>> clx.ip.is_private(cudf.Series(["127.0.0.1","207.46.13.151"]))
0    True
1    False
dtype: bool
clx.ip.is_reserved(ips)

Indicates whether each address is reserved. Addresses must be IPv4. IPv6 not yet supported.

Parameters

ips – IP addresses

Returns

booleans

Return type

cudf.Series

Examples

>>> import clx.ip
>>> import cudf
>>> clx.ip.is_reserved(cudf.Series(["127.0.0.1","10.0.0.1"]))
0    False
1    False
dtype: bool
clx.ip.is_unspecified(ips)

Indicates whether each address is unspecified. Addresses must be IPv4. IPv6 not yet supported.

Parameters

ips – IP addresses

Returns

booleans

Return type

cudf.Series

Examples

>>> import clx.ip
>>> import cudf
>>> clx.ip.is_unspecified(cudf.Series(["127.0.0.1","10.0.0.1"]))
0    False
1    False
dtype: bool
clx.ip.mask(ips, masks)

Apply a mask to a column of IP addresses. Addresses must be IPv4. IPv6 not yet supported.

Parameters
  • ips – IP addresses

  • masks (cudf.Series) – The host or subnet masks to be applied

Returns

masked IP addresses

Return type

cudf.Series

Examples

>>> import clx.ip
>>> import cudf
>>> input_ips = cudf.Series(["192.168.0.1","10.0.0.1"])
>>> input_masks = cudf.Series(["255.255.0.0", "255.255.0.0"])
>>> clx.ip.mask(input_ips, input_masks)
0    192.168.0.0
1       10.0.0.0
Name: mask, dtype: object
clx.ip.netmask(ips, prefixlen=16)

Compute a column of netmasks for a column of IP addresses. Addresses must be IPv4. IPv6 not yet supported.

Parameters
  • ips – IP addresses

  • prefixlen (int) – Length of the network prefix, in bits, for IPv4 addresses

Returns

netmasks

Return type

cudf.Series

Examples

>>> import clx.ip
>>> import cudf
>>> clx.ip.netmask(cudf.Series(["192.168.0.1","10.0.0.1"]), prefixlen=16)
0    255.255.0.0
1    255.255.0.0
Name: net_mask, dtype: object

Analytics

class clx.analytics.dga_detector.DGADetector(lr=0.001)

This class provides multiple functionalities, such as building, training, and evaluating the RNNClassifier model, to distinguish legitimate domain names from DGA-generated ones.

Methods

evaluate_model(self, detector_dataset)

This function evaluates the trained model to verify its accuracy.

init_model(self[, char_vocab, hidden_size, …])

This function instantiates the RNNClassifier model to train.

predict(self, domains)

This function accepts a cudf series of domains and classifies each domain name as benign or malicious, returning the predicted label for each entry as a cudf series.

train_model(self, detector_dataset)

This function is used for training the RNNClassifier model with a given training dataset.

evaluate_model(self, detector_dataset)

This function evaluates the trained model to verify its accuracy.

Parameters

detector_dataset (DetectorDataset) – Instance holding preprocessed data.

Returns

Model accuracy

Return type

decimal

Examples

>>> dd = DGADetector()
>>> dd.init_model()
>>> dd.evaluate_model(detector_dataset)
Evaluating trained model ...
Test set: Accuracy: 3/4 (0.75)
init_model(self, char_vocab=128, hidden_size=100, n_domain_type=2, n_layers=3)

This function instantiates the RNNClassifier model for training and configures it to run in parallel for scalability.

Parameters
  • char_vocab (int) – Vocabulary size (default: 128 ASCII characters).

  • hidden_size (int) – Hidden size of the network.

  • n_domain_type (int) – Number of domain types.

  • n_layers (int) – Number of network layers.
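
Examples

A minimal usage sketch (not from the original docs), passing the defaults listed above explicitly:

>>> from clx.analytics.dga_detector import DGADetector
>>> dd = DGADetector()
>>> dd.init_model(char_vocab=128, hidden_size=100, n_domain_type=2, n_layers=3)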

predict(self, domains)

This function accepts a cudf series of domains and classifies each domain name as benign or malicious, returning the predicted label for each entry as a cudf series.

Parameters

domains (cudf.Series) – List of domains.

Returns

Predicted results with respect to given domains.

Return type

cudf.Series

Examples

>>> dd.predict(cudf.Series(['nvidia.com', 'dgadomain']))
0    0
1    1
Name: is_dga, dtype: int64
train_model(self, detector_dataset)

This function is used for training the RNNClassifier model with a given training dataset. It returns the total loss, which can be used to gauge model prediction accuracy.

Parameters

detector_dataset (DetectorDataset) – Instance holding preprocessed data.

Returns

Total loss

Return type

float

Examples

>>> from clx.analytics.dga_detector import DGADetector
>>> detector_dataset = ... # DetectorDataset built from partitioned training DataFrames [df1, df2, ...]
>>> dd = DGADetector()
>>> dd.init_model()
>>> dd.train_model(detector_dataset)
1.5728906989097595
class clx.analytics.model.rnn_classifier.RNNClassifier(input_size, hidden_size, output_size, n_layers=1, bidirectional=True)

Methods

forward(self, input, seq_lengths)

Defines the computation performed at every call.

forward(self, input, seq_lengths)

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
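
Examples

A hedged sketch of calling the module instance rather than forward directly (the shapes and values are illustrative assumptions, not from the original docs; input is assumed to be a batch of padded character-index sequences plus their true lengths):

>>> import torch
>>> from clx.analytics.model.rnn_classifier import RNNClassifier
>>> model = RNNClassifier(input_size=128, hidden_size=100, output_size=2, n_layers=3)
>>> batch = torch.zeros(2, 10, dtype=torch.long)  # 2 padded sequences of length 10
>>> lengths = torch.tensor([10, 7])               # assumed true lengths, sorted descending
>>> output = model(batch, lengths)                # call the instance, not .forward()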

clx.analytics.stats.rzscore(series, window)

Calculates rolling z-score

Parameters
  • series (cudf.Series) – Series for which to calculate rolling z-score

  • window (int) – Window size

Returns

Series with rolling z-score values

Return type

cudf.Series

Examples

>>> import clx.analytics.stats
>>> import cudf
>>> sequence = [3,4,5,6,1,10,34,2,1,11,45,34,2,9,19,43,24,13,23,10,98,84,10]
>>> series = cudf.Series(sequence)
>>> zscores_df = cudf.DataFrame()
>>> zscores_df['zscore'] = clx.analytics.stats.rzscore(series, 7)
>>> zscores_df
            zscore
0           null
1           null
2           null
3           null
4           null
5           null
6    2.374423424
7   -0.645941275
8   -0.683973734
9    0.158832461
10   1.847751909
11   0.880026019
12  -0.950835449
13  -0.360593742
14   0.111407599
15   1.228914145
16  -0.074966331
17  -0.570321249
18   0.327849973
19  -0.934372308
20   2.296828498
21   1.282966989
22  -0.795223674

DNS Extractor

clx.dns.dns_extractor.extract_hostnames(url_series)

This function extracts hostnames from the given URLs.

Parameters

url_series (cudf.Series) – URLs to be handled.

Returns

Hostnames extracted from the urls.

Return type

cudf.Series

Examples

>>> from cudf import DataFrame
>>> from clx.dns import dns_extractor as dns
>>> input_df = DataFrame(
...     {
...         "url": [
...             "http://www.google.com",
...             "gmail.com",
...             "github.com",
...             "https://pandas.pydata.org",
...         ]
...     }
... )
>>> dns.extract_hostnames(input_df["url"])
0       www.google.com
1            gmail.com
2           github.com
3    pandas.pydata.org
Name: 0, dtype: object
clx.dns.dns_extractor.generate_tld_cols(hostname_split_df, hostnames, col_len)

This function generates TLD columns.

Parameters
  • hostname_split_df (cudf.DataFrame) – Hostname splits.

  • hostnames (cudf.Series) – Hostnames.

  • col_len (int) – Hostname splits DataFrame column length.

Returns

TLD columns with all combinations.

Return type

cudf.DataFrame

Examples

>>> import cudf
>>> from clx.dns import dns_extractor as dns
>>> hostnames = cudf.Series(["www.google.com", "pandas.pydata.org"])
>>> hostname_splits = dns.get_hostname_split_df(hostnames)
>>> print(hostname_splits)
     2       1       0
0  com  google     www
1  org  pydata  pandas
>>> col_len = len(hostname_splits.columns) - 1
>>> dns.generate_tld_cols(hostname_splits, hostnames, col_len)
     2       1       0 tld2        tld1               tld0
0  com  google     www  com  google.com     www.google.com
1  org  pydata  pandas  org  pydata.org  pandas.pydata.org
clx.dns.dns_extractor.get_hostname_split_df(hostnames)

Find all words and digits between periods.

Parameters

hostnames (cudf.Series) – Hostnames that are being split.

Returns

Hostname splits.

Return type

cudf.DataFrame

Examples

>>> import cudf
>>> from clx.dns import dns_extractor as dns
>>> hostnames = cudf.Series(["www.google.com", "pandas.pydata.org"])
>>> dns.get_hostname_split_df(hostnames)
     2       1       0
0  com  google     www
1  org  pydata  pandas
clx.dns.dns_extractor.parse_url(url_series, req_cols=None)

This function extracts the subdomain, domain, and suffix for each given URL.

Parameters
  • url_series (cudf.Series) – URLs to be handled.

  • req_cols (set(strings)) – Columns requested to extract, such as domain, subdomain, suffix, and hostname.

Returns

Extracted information of requested columns.

Return type

cudf.DataFrame

Examples

>>> from cudf import DataFrame
>>> from clx.dns import dns_extractor as dns
>>> 
>>> input_df = DataFrame(
...     {
...         "url": [
...             "http://www.google.com",
...             "gmail.com",
...             "github.com",
...             "https://pandas.pydata.org",
...         ]
...     }
... )
>>> dns.parse_url(input_df["url"])
            hostname  domain suffix subdomain
0     www.google.com  google    com       www
1          gmail.com   gmail    com          
2         github.com  github    com          
3  pandas.pydata.org  pydata    org    pandas
>>> dns.parse_url(input_df["url"], req_cols={'domain', 'suffix'})
   domain suffix
0  google    com
1   gmail    com
2  github    com
3  pydata    org

Heuristics

clx.heuristics.ports.major_ports(addr_col, port_col, min_conns=1, eph_min=10000)

Find major ports for each address. This is done by computing the mean number of connections across all ports for each address and then filtering out all ports that don't cross this threshold. Also adds a column for the IANA service name corresponding to each port.

Parameters
  • addr_col (cudf.Series) – Column of addresses as strings

  • port_col (cudf.Series) – Column of corresponding port numbers as ints

  • min_conns (int) – Filter out ip:port rows that don’t have at least this number of connections (default: 1)

  • eph_min (int) – Ports greater than or equal to this will be labeled as an ephemeral service (default: 10000)

Returns

DataFrame with columns for address, port, IANA service corresponding to port, and number of connections

Return type

cudf.DataFrame

Examples

>>> import clx.heuristics.ports as ports
>>> import cudf
>>> input_addr_col = cudf.Series(["10.0.75.1","10.0.75.1","10.0.75.1","10.0.75.255","10.110.104.107", "10.110.104.107"])
>>> input_port_col = cudf.Series([137,137,7680,137,7680, 7680])
>>> ports.major_ports(input_addr_col, input_port_col, min_conns=2, eph_min=7000)
            addr  port     service  conns
0      10.0.75.1   137  netbios-ns      2
1 10.110.104.107  7680   ephemeral      2

OSI (Open Source Integration)

class clx.osi.farsight.FarsightLookupClient(server, apikey, limit=None, http_proxy=None, https_proxy=None)

Methods

query_rdata_ip(self, rdata_ip[, before, after])

Query DNSDB for records matching a specific IP address within a given time range.

query_rdata_name(self, rdata_name[, rrtype, …])

Query that matches only a single DNSDB record for the given oname and time range.

query_rrset(self, oname[, rrtype, …])

Batch version of querying DNSDB by a given domain name and time range.

query_rdata_ip(self, rdata_ip, before=None, after=None)

Query DNSDB for records matching a specific IP address within a given time range.

query_rdata_name(self, rdata_name, rrtype=None, before=None, after=None)

Query that matches only a single DNSDB record for the given oname and time range.

query_rrset(self, oname, rrtype=None, bailiwick=None, before=None, after=None)

Batch version of querying DNSDB by a given domain name and time range.
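
Examples

A usage sketch (the server URL and API key are placeholder assumptions):

>>> from clx.osi.farsight import FarsightLookupClient
>>> client = FarsightLookupClient("https://api.dnsdb.info", "your-api-key", limit=1)
>>> client.query_rdata_ip("100.0.0.1")
>>> client.query_rdata_name("www.dnsdb.info")
>>> client.query_rrset("www.dnsdb.info")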

class clx.osi.virus_total.VirusTotalClient(api_key=None, proxies=None)
Attributes
api_key
proxies
vt_endpoint_dict

Methods

domain_report(self, domain)

Retrieve report using domain.

file_report(self, *resource)

The resource argument can be the MD5, SHA-1 or SHA-256 of a file for which you want to retrieve the most recent antivirus report.

file_rescan(self, *resource)

This function rescans the given files.

file_scan(self, file)

This function allows you to send a file for scanning with VirusTotal.

ipaddress_report(self, ip)

Retrieve report using IP address.

put_comment(self, resource, comment)

Post a comment on a file or URL.

scan_big_file(self, files)

Scan files larger than 32MB.

url_report(self, *resource)

The resource argument must be the URL to retrieve the most recent report.

url_scan(self, *url)

This function scans the provided URL with VirusTotal.

domain_report(self, domain)

Retrieve report using domain.

file_report(self, *resource)

The resource argument can be the MD5, SHA-1 or SHA-256 of a file for which you want to retrieve the most recent antivirus report. You may also specify a scan_id returned by the /file/scan endpoint.

file_rescan(self, *resource)

This function rescans the given files. The resource argument can be the MD5, SHA-1 or SHA-256 of the file you want to re-scan.

file_scan(self, file)

This function allows you to send a file for scanning with VirusTotal. Before submitting, it is advisable to retrieve the latest report on the file. The file size limit is 32MB; to submit files up to 200MB in size, you must request a special upload URL using the /file/scan/upload_url endpoint.

ipaddress_report(self, ip)

Retrieve report using IP address.

put_comment(self, resource, comment)

Post a comment on a file or URL.

scan_big_file(self, files)

Scan files larger than 32MB.

url_report(self, *resource)

The resource argument must be the URL to retrieve the most recent report.

url_scan(self, *url)

This function scans the provided URL with VirusTotal.
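
Examples

A usage sketch (the API key, file path, URL, and domain are placeholder assumptions):

>>> from clx.osi.virus_total import VirusTotalClient
>>> client = VirusTotalClient(api_key="your-api-key")
>>> client.file_scan("/path/to/file")
>>> client.url_scan("www.example.com")
>>> client.domain_report("example.com")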

class clx.osi.whois.WhoIsLookupClient(sep=', ', datetime_format='%m-%d-%Y %H:%M:%S')

Methods

whois(self, domains[, arr2str])

Function to access parsed WHOIS data for a given domain.

whois(self, domains, arr2str=True)

Function to access parsed WHOIS data for a given domain.
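
Examples

A usage sketch (the domain list is illustrative):

>>> from clx.osi.whois import WhoIsLookupClient
>>> client = WhoIsLookupClient()
>>> client.whois(["nvidia.com"], arr2str=True)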

Parsers

class clx.parsers.event_parser.EventParser(columns, event_name)

This is an abstract class for all event log parsers.

Parameters
  • columns (set(string)) – Event column names.

  • event_name (string) – Event name

Attributes
columns

List of columns that are being processed.

event_name

Event name defines the type of logs being processed.

Methods

filter_by_pattern(self, df, column, pattern)

Retrieve only the events that satisfy the given regex pattern.

parse(self, dataframe, raw_column)

Abstract method ‘parse’ triggers the parsing functionality.

parse_raw_event(self, dataframe, raw_column, …)

Processes parsing of a specific type of raw event records received as a dataframe.

property columns

List of columns that are being processed.

Returns

Event column names.

Return type

set(string)

property event_name

Event name defines the type of logs being processed.

Returns

Event name

Return type

string

filter_by_pattern(self, df, column, pattern)

Retrieve only the events that satisfy the given regex pattern.

Parameters
  • df (cudf.DataFrame) – Raw events to be filtered.

  • column (string) – Name of the column containing the raw data.

  • pattern (string) – Regex pattern to retrieve events that are required.

Returns

Filtered dataframe.

Return type

cudf.DataFrame
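
Examples

Since EventParser is abstract, a sketch via a concrete subclass (the column name, raw strings, and pattern are illustrative assumptions):

>>> import cudf
>>> from clx.parsers.windows_event_parser import WindowsEventParser
>>> wep = WindowsEventParser()
>>> df = cudf.DataFrame({"raw": ["eventcode=4624 ...", "eventcode=4625 ..."]})
>>> filtered_df = wep.filter_by_pattern(df, "raw", "4624")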

abstract parse(self, dataframe, raw_column)

Abstract method ‘parse’ triggers the parsing functionality. Subclasses are required to implement and execute any parsing pre-processing steps.

parse_raw_event(self, dataframe, raw_column, event_regex)

Processes parsing of a specific type of raw event records received as a dataframe.

Parameters
  • dataframe (cudf.DataFrame) – Raw events to be parsed.

  • raw_column (string) – Name of the column containing the raw data.

  • event_regex (dict) – Required regular expressions for a given event type.

Returns

Parsed information.

Return type

cudf.DataFrame

class clx.parsers.splunk_notable_parser.SplunkNotableParser

This class parses Splunk notable logs.

Methods

parse(self, dataframe, raw_column)

Parses the Splunk notable raw events.

parse(self, dataframe, raw_column)

Parses the Splunk notable raw events.

Parameters
  • dataframe (cudf.DataFrame) – Raw events to be parsed.

  • raw_column (string) – Name of the column containing the raw data.

Returns

Parsed information.

Return type

cudf.DataFrame
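
Examples

A usage sketch (raw_event stands in for an actual Splunk notable log line):

>>> import cudf
>>> from clx.parsers.splunk_notable_parser import SplunkNotableParser
>>> snp = SplunkNotableParser()
>>> raw_df = cudf.DataFrame({"_raw": [raw_event]})  # raw_event: a raw Splunk notable string
>>> parsed_df = snp.parse(raw_df, "_raw")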

class clx.parsers.windows_event_parser.WindowsEventParser(interested_eventcodes=None)

This class parses Windows event logs.

Parameters

interested_eventcodes (set(int)) – This parameter provides the flexibility to parse only the event codes of interest.

Methods

clean_raw_data(self, dataframe, raw_column)

Lower casing and replacing escape characters.

get_columns(self)

Get columns of windows event codes.

parse(self, dataframe, raw_column)

Parses the Windows raw event.

clean_raw_data(self, dataframe, raw_column)

Lower casing and replacing escape characters.

Parameters
  • dataframe (cudf.DataFrame) – Raw events to be parsed.

  • raw_column (string) – Name of the column containing the raw data.

Returns

Clean raw information.

Return type

cudf.DataFrame

get_columns(self)

Get columns of windows event codes.

Returns

Columns of all configured event codes, if no event codes of interest were specified.

Return type

set(string)

parse(self, dataframe, raw_column)

Parses the Windows raw event.

Parameters
  • dataframe (cudf.DataFrame) – Raw events to be parsed.

  • raw_column (string) – Name of the column containing the raw data.

Returns

Parsed information.

Return type

cudf.DataFrame
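
Examples

A usage sketch (raw_event stands in for an actual Windows event string; the event code is illustrative):

>>> import cudf
>>> from clx.parsers.windows_event_parser import WindowsEventParser
>>> wep = WindowsEventParser(interested_eventcodes={4624})
>>> raw_df = cudf.DataFrame({"_raw": [raw_event]})  # raw_event: a raw Windows event string
>>> parsed_df = wep.parse(raw_df, "_raw")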

clx.parsers.zeek.parse_log_file(filepath)

Parse Zeek log file and return cuDF dataframe. Uses header comments to get column names/types and configure parser.

Parameters

filepath (string) – filepath for Zeek log file

Returns

Zeek log dataframe

Return type

cudf.DataFrame
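
Examples

A usage sketch (the file path is illustrative):

>>> from clx.parsers import zeek
>>> conn_df = zeek.parse_log_file("conn.log")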

Workflow

class clx.workflow.workflow.Workflow(name, source=None, destination=None)
Attributes
destination

Dictionary of configuration parameters for the data destination (writer)

name

Name of the workflow for logging purposes.

source

Dictionary of configuration parameters for the data source (reader)

Methods

benchmark(function)

Decorator used to capture a benchmark for a given function

run_workflow(self)

Run workflow.

set_destination(self, destination)

Set destination.

set_source(self, source)

Set source.

stop_workflow(self)

Close workflow.

workflow(self, dataframe)

The pipeline function performs data enrichment on the input data.

benchmark(function)

Decorator used to capture a benchmark for a given function

property destination

Dictionary of configuration parameters for the data destination (writer)

property name

Name of the workflow for logging purposes.

run_workflow(self)

Run workflow. Reader (source) fetches data. Workflow implementation is executed. Workflow output is written to destination.

set_destination(self, destination)

Set destination.

Parameters

destination – dict of configuration parameters for the destination (writer)

set_source(self, source)

Set source.

Parameters

source – dict of configuration parameters for data source (reader)

property source

Dictionary of configuration parameters for the data source (reader)

stop_workflow(self)

Close workflow. This includes calling close() method on reader (source) and writer (destination)

abstract workflow(self, dataframe)

The pipeline function performs data enrichment on the input data. Subclasses must define this function. It must return a GPU dataframe with the enriched data.
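
Examples

A minimal subclass sketch (the enrichment logic and column names are illustrative assumptions):

>>> from clx.workflow.workflow import Workflow
>>> class MyEnrichmentWorkflow(Workflow):
...     def workflow(self, dataframe):
...         dataframe["enriched"] = dataframe["raw"].str.lower()  # illustrative enrichment
...         return dataframe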

class clx.workflow.splunk_alert_workflow.SplunkAlertWorkflow(name, source=None, destination=None, interval='day', threshold=2.5, window=7, raw_data_col_name='_raw')
Attributes
interval

Interval can be set to 'day' or 'hour', over which the z-score is calculated.

raw_data_col_name

Dataframe column name containing raw splunk alert data

threshold

Threshold by which to flag z score.

window

Window size over which the rolling z-score is calculated.

Methods

workflow(self, dataframe)

The pipeline function performs data enrichment on the input data.

property interval

Interval can be set to 'day' or 'hour', over which the z-score is calculated.

property raw_data_col_name

Dataframe column name containing raw splunk alert data

property threshold

Threshold for flagging z-scores. Scores greater than the threshold or less than its negative are flagged.

property window

Window size over which the rolling z-score is calculated.

workflow(self, dataframe)

The pipeline function performs data enrichment on the input data. Subclasses must define this function. It must return a GPU dataframe with the enriched data.
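
Examples

A usage sketch (source_conf and dest_conf are hypothetical reader/writer config dictionaries):

>>> from clx.workflow.splunk_alert_workflow import SplunkAlertWorkflow
>>> saw = SplunkAlertWorkflow("splunk-alerts", source=source_conf, destination=dest_conf,
...                           interval="day", threshold=2.5, window=7)
>>> saw.run_workflow()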

I/O

class clx.io.reader.kafka_reader.KafkaReader(batch_size, consumer, time_window=30)

Reads from Kafka based on config object.

Parameters
  • batch_size – batch size

  • consumer – Kafka consumer

  • time_window – Maximum window of time that queued events will wait before being pushed to the workflow

Attributes
consumer
has_data
time_window

Methods

close(self)

Close Kafka reader

fetch_data(self)

Fetch data from Kafka based on provided config object

close(self)

Close Kafka reader

fetch_data(self)

Fetch data from Kafka based on provided config object
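
Examples

A usage sketch, assuming a confluent_kafka Consumer as the consumer argument (broker address, group id, and topic are illustrative):

>>> from confluent_kafka import Consumer
>>> from clx.io.reader.kafka_reader import KafkaReader
>>> consumer = Consumer({"bootstrap.servers": "localhost:9092", "group.id": "clx"})
>>> consumer.subscribe(["input_topic"])
>>> reader = KafkaReader(batch_size=100, consumer=consumer, time_window=30)
>>> df = reader.fetch_data()
>>> reader.close()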

class clx.io.reader.dask_fs_reader.DaskFileSystemReader(config)

Uses Dask to read from file system based on config object.

Parameters

config – dictionary object of config values for type, input_format, input_path, and dask reader optional keyword args

Methods

close(self)

Close dask reader

fetch_data(self)

Fetch data using dask based on provided config object

close(self)

Close dask reader

fetch_data(self)

Fetch data using dask based on provided config object

class clx.io.reader.fs_reader.FileSystemReader(config)

Uses cudf to read from file system based on config object.

Parameters

config – dictionary object of config values for type, input_format, input_path, and cudf reader optional keyword args

Methods

close(self)

Close cudf reader

fetch_data(self)

Fetch data using cudf based on provided config object

close(self)

Close cudf reader

fetch_data(self)

Fetch data using cudf based on provided config object
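
Examples

A config sketch (keys follow the parameter description above; the values are illustrative assumptions):

>>> from clx.io.reader.fs_reader import FileSystemReader
>>> config = {
...     "type": "fs",
...     "input_format": "csv",
...     "input_path": "/path/to/input.csv",
... }
>>> reader = FileSystemReader(config)
>>> df = reader.fetch_data()
>>> reader.close()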

class clx.io.writer.kafka_writer.KafkaWriter(kafka_topic, batch_size, delimiter, producer)

Publish to Kafka topic based on config object.

Parameters
  • kafka_topic – Kafka topic

  • batch_size – batch size

  • delimiter – delimiter

  • producer – producer

Attributes
delimiter
producer

Methods

close(self)

Close Kafka writer

write_data(self, df)

Publish messages to a Kafka topic

close(self)

Close Kafka writer

write_data(self, df)

Publish messages to a Kafka topic

Parameters

df – dataframe to publish
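
Examples

A usage sketch, assuming a confluent_kafka Producer as the producer argument (broker and topic names are illustrative):

>>> from confluent_kafka import Producer
>>> from clx.io.writer.kafka_writer import KafkaWriter
>>> producer = Producer({"bootstrap.servers": "localhost:9092"})
>>> writer = KafkaWriter("output_topic", batch_size=100, delimiter=",", producer=producer)
>>> writer.write_data(df)  # df: a cudf.DataFrame to publish
>>> writer.close()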

class clx.io.writer.fs_writer.FileSystemWriter(config)

Uses cudf to write to file system based on config object.

Parameters

config – dictionary object of config values for type, output_format, output_path, and cudf writer optional keyword args

Methods

close(self)

Close cudf writer

write_data(self, df)

Write data to file system using cudf based on provided config object

close(self)

Close cudf writer

write_data(self, df)

Write data to file system using cudf based on provided config object
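
Examples

A config sketch (keys follow the parameter description above; the values are illustrative assumptions):

>>> from clx.io.writer.fs_writer import FileSystemWriter
>>> config = {
...     "type": "fs",
...     "output_format": "csv",
...     "output_path": "/path/to/output.csv",
... }
>>> writer = FileSystemWriter(config)
>>> writer.write_data(df)  # df: a cudf.DataFrame produced upstream
>>> writer.close()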