API Reference
IP
-
clx.ip.hostmask(ips, prefixlen=16)
Compute a column of hostmasks for a column of IP addresses. Addresses must be IPv4; IPv6 is not yet supported.
- Parameters
ips – IP addresses
prefixlen (int) – Length of the network prefix, in bits, for IPv4 addresses
- Returns
hostmasks
- Return type
cudf.Series
Examples
>>> import clx.ip
>>> import cudf
>>> clx.ip.hostmask(cudf.Series(["192.168.0.1","10.0.0.1"]), prefixlen=16)
0    0.0.255.255
1    0.0.255.255
Name: hostmask, dtype: object
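For reference, hostmask and netmask values for a given prefix length can be checked on the CPU with the standard-library ipaddress module (not part of clx):

```python
import ipaddress

# For a /16 prefix, the netmask keeps the network bits and the
# hostmask (its bitwise inverse) keeps the host bits.
net = ipaddress.IPv4Network("0.0.0.0/16")
print(net.hostmask)  # 0.0.255.255
print(net.netmask)   # 255.255.0.0
```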
-
clx.ip.int_to_ip(values)
Convert an integer column to IP addresses. Addresses must be IPv4; IPv6 is not yet supported.
- Parameters
values (cudf.Series) – Integers to be converted
- Returns
IP addresses
- Return type
cudf.Series
Examples
>>> import clx.ip
>>> import cudf
>>> clx.ip.int_to_ip(cudf.Series([3232235521, 167772161]))
0    192.168.0.1
1       10.0.0.1
dtype: object
-
clx.ip.ip_to_int(values)
Convert a string column of IP addresses to integer values. Addresses must be IPv4; IPv6 is not yet supported.
- Parameters
values (cudf.Series) – IP addresses to be converted
- Returns
Integer representations of IP addresses
- Return type
cudf.Series
Examples
>>> import clx.ip
>>> import cudf
>>> clx.ip.ip_to_int(cudf.Series(["192.168.0.1","10.0.0.1"]))
0    3232235521
1     167772161
dtype: int64
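Both conversions use the standard big-endian integer encoding of IPv4 addresses, so results can be sanity-checked against the standard-library ipaddress module:

```python
import ipaddress

# IPv4 dotted-quad string <-> 32-bit integer, in both directions.
as_int = int(ipaddress.IPv4Address("192.168.0.1"))
as_str = str(ipaddress.IPv4Address(167772161))
print(as_int)  # 3232235521
print(as_str)  # 10.0.0.1
```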
-
clx.ip.is_global(ips)
Indicates whether each address is global. Addresses must be IPv4; IPv6 is not yet supported.
- Parameters
ips – IP addresses
- Returns
booleans
- Return type
cudf.Series
Examples
>>> import clx.ip
>>> import cudf
>>> clx.ip.is_global(cudf.Series(["127.0.0.1","207.46.13.151"]))
0    False
1     True
dtype: bool
-
clx.ip.is_ip(ips)
Indicates whether each value is a valid IP address string. Addresses must be IPv4; IPv6 is not yet supported.
- Parameters
ips – IP addresses
- Returns
booleans
- Return type
cudf.Series
Examples
>>> import clx.ip
>>> import cudf
>>> clx.ip.is_ip(cudf.Series(["192.168.0.1","10.123.0"]))
0     True
1    False
dtype: bool
-
clx.ip.is_link_local(ips)
Indicates whether each address is link-local. Addresses must be IPv4; IPv6 is not yet supported.
- Parameters
ips – IP addresses
- Returns
booleans
- Return type
cudf.Series
Examples
>>> import clx.ip
>>> import cudf
>>> clx.ip.is_link_local(cudf.Series(["127.0.0.1","169.254.123.123"]))
0    False
1     True
dtype: bool
-
clx.ip.is_loopback(ips)
Indicates whether each address is loopback. Addresses must be IPv4; IPv6 is not yet supported.
- Parameters
ips – IP addresses
- Returns
booleans
- Return type
cudf.Series
Examples
>>> import clx.ip
>>> import cudf
>>> clx.ip.is_loopback(cudf.Series(["127.0.0.1","10.0.0.1"]))
0     True
1    False
dtype: bool
-
clx.ip.is_multicast(ips)
Indicates whether each address is multicast. Addresses must be IPv4; IPv6 is not yet supported.
- Parameters
ips – IP addresses
- Returns
booleans
- Return type
cudf.Series
Examples
>>> import clx.ip
>>> import cudf
>>> clx.ip.is_multicast(cudf.Series(["127.0.0.1","224.0.0.0"]))
0    False
1     True
dtype: bool
-
clx.ip.is_private(ips)
Indicates whether each address is private. Addresses must be IPv4; IPv6 is not yet supported.
- Parameters
ips – IP addresses
- Returns
booleans
- Return type
cudf.Series
Examples
>>> import clx.ip
>>> import cudf
>>> clx.ip.is_private(cudf.Series(["127.0.0.1","207.46.13.151"]))
0     True
1    False
dtype: bool
-
clx.ip.is_reserved(ips)
Indicates whether each address is reserved. Addresses must be IPv4; IPv6 is not yet supported.
- Parameters
ips – IP addresses
- Returns
booleans
- Return type
cudf.Series
Examples
>>> import clx.ip
>>> import cudf
>>> clx.ip.is_reserved(cudf.Series(["127.0.0.1","10.0.0.1"]))
0    False
1    False
dtype: bool
-
clx.ip.is_unspecified(ips)
Indicates whether each address is unspecified. Addresses must be IPv4; IPv6 is not yet supported.
- Parameters
ips – IP addresses
- Returns
booleans
- Return type
cudf.Series
Examples
>>> import clx.ip
>>> import cudf
>>> clx.ip.is_unspecified(cudf.Series(["127.0.0.1","10.0.0.1"]))
0    False
1    False
dtype: bool
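The is_* predicates above mirror the per-address properties defined in the standard-library ipaddress module, which can serve as a single-address CPU reference:

```python
import ipaddress

addr = ipaddress.IPv4Address("127.0.0.1")
print(addr.is_loopback)   # True
print(addr.is_private)    # True
print(addr.is_global)     # False
print(ipaddress.IPv4Address("169.254.123.123").is_link_local)  # True
print(ipaddress.IPv4Address("224.0.0.0").is_multicast)         # True
print(ipaddress.IPv4Address("0.0.0.0").is_unspecified)         # True
```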
-
clx.ip.mask(ips, masks)
Apply a mask to a column of IP addresses. Addresses must be IPv4; IPv6 is not yet supported.
- Parameters
ips – IP addresses
masks (cudf.Series) – The host or subnet masks to be applied
- Returns
masked IP addresses
- Return type
cudf.Series
Examples
>>> import clx.ip
>>> import cudf
>>> input_ips = cudf.Series(["192.168.0.1","10.0.0.1"])
>>> input_masks = cudf.Series(["255.255.0.0", "255.255.0.0"])
>>> clx.ip.mask(input_ips, input_masks)
0    192.168.0.0
1       10.0.0.0
Name: mask, dtype: object
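Masking an address is a bitwise AND of the address and the mask; a single-address sketch with the standard library:

```python
import ipaddress

# 192.168.0.1 AND 255.255.0.0 keeps only the /16 network bits.
ip = int(ipaddress.IPv4Address("192.168.0.1"))
m = int(ipaddress.IPv4Address("255.255.0.0"))
masked = str(ipaddress.IPv4Address(ip & m))
print(masked)  # 192.168.0.0
```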
-
clx.ip.netmask(ips, prefixlen=16)
Compute a column of netmasks for a column of IP addresses. Addresses must be IPv4; IPv6 is not yet supported.
- Parameters
ips – IP addresses
prefixlen (int) – Length of the network prefix, in bits, for IPv4 addresses
- Returns
netmasks
- Return type
cudf.Series
Examples
>>> import clx.ip
>>> import cudf
>>> clx.ip.netmask(cudf.Series(["192.168.0.1","10.0.0.1"]), prefixlen=16)
0    255.255.0.0
1    255.255.0.0
Name: net_mask, dtype: object
Analytics
-
class clx.analytics.dga_detector.DGADetector(lr=0.001)
This class builds, trains, and evaluates an RNNClassifier model to distinguish legitimate domain names from DGA domain names.
Methods
evaluate_model(self, detector_dataset) – Evaluates the trained model to verify its accuracy.
init_model(self[, char_vocab, hidden_size, …]) – Instantiates the RNNClassifier model to train.
predict(self, domains) – Classifies domain names as benign or malicious, returning the predicted label for each as a cudf.Series.
train_model(self, detector_dataset) – Trains the RNNClassifier model with a given training dataset.
-
evaluate_model(self, detector_dataset)
Evaluates the trained model to verify its accuracy.
- Parameters
detector_dataset (DetectorDataset) – Instance holding preprocessed data.
- Returns
Model accuracy
- Return type
decimal
Examples
>>> dd = DGADetector()
>>> dd.init_model()
>>> dd.evaluate_model(detector_dataset)
Evaluating trained model ...
Test set: Accuracy: 3/4 (0.75)
-
init_model(self, char_vocab=128, hidden_size=100, n_domain_type=2, n_layers=3)
Instantiates the RNNClassifier model to train, and configures it to scale across available parallelism.
-
predict(self, domains)
Accepts a cudf.Series of domains and classifies each domain name as benign or malicious, returning the predicted label for each entry as a cudf.Series.
- Parameters
domains (cudf.Series) – List of domains.
- Returns
Predicted results with respect to the given domains.
- Return type
cudf.Series
Examples
>>> dd.predict(['nvidia.com', 'dgadomain'])
0    0
1    1
Name: is_dga, dtype: int64
-
train_model(self, detector_dataset)
Trains the RNNClassifier model with a given training dataset and returns the total loss, which can be used to gauge prediction accuracy.
- Parameters
detector_dataset (DetectorDataset) – Instance holding preprocessed data
- Returns
Total loss
- Return type
float
Examples
>>> from clx.analytics.dga_detector import DGADetector
>>> partitioned_dfs = ...  # partitioned_dfs = [df1, df2, ...] represents the training dataset
>>> dd = DGADetector()
>>> dd.init_model()
>>> dd.train_model(detector_dataset)
1.5728906989097595
-
-
class clx.analytics.model.rnn_classifier.RNNClassifier(input_size, hidden_size, output_size, n_layers=1, bidirectional=True)
Methods
forward(self, input, seq_lengths) – Defines the computation performed at every call.
-
forward(self, input, seq_lengths)
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
clx.analytics.stats.rzscore(series, window)
Calculates the rolling z-score.
- Parameters
series (cudf.Series) – Series for which to calculate the rolling z-score
window (int) – Window size
- Returns
Series with rolling z-score values
- Return type
cudf.Series
Examples
>>> import clx.analytics.stats
>>> import cudf
>>> sequence = [3,4,5,6,1,10,34,2,1,11,45,34,2,9,19,43,24,13,23,10,98,84,10]
>>> series = cudf.Series(sequence)
>>> zscores_df = cudf.DataFrame()
>>> zscores_df['zscore'] = clx.analytics.stats.rzscore(series, 7)
>>> zscores_df
         zscore
0          null
1          null
2          null
3          null
4          null
5          null
6   2.374423424
7  -0.645941275
8  -0.683973734
9   0.158832461
10  1.847751909
11  0.880026019
12 -0.950835449
13 -0.360593742
14  0.111407599
15  1.228914145
16 -0.074966331
17 -0.570321249
18  0.327849973
19 -0.934372308
20  2.296828498
21  1.282966989
22 -0.795223674
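The same statistic can be reproduced on the CPU with only the standard library. This sketch assumes rzscore standardizes each window's last element against the window's mean and population standard deviation, which matches the output above:

```python
import math

def rolling_zscore(values, window):
    """Z-score of each element against its trailing window of `window` values."""
    out = [None] * (window - 1)  # not enough history for the first window-1 rows
    for i in range(window - 1, len(values)):
        w = values[i - window + 1 : i + 1]
        mean = sum(w) / window
        std = math.sqrt(sum((x - mean) ** 2 for x in w) / window)
        out.append((values[i] - mean) / std)
    return out

seq = [3, 4, 5, 6, 1, 10, 34, 2, 1, 11, 45, 34, 2, 9, 19, 43, 24, 13, 23, 10, 98, 84, 10]
print(rolling_zscore(seq, 7)[6])  # ~2.374423424, matching row 6 above
```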
DNS Extractor
-
clx.dns.dns_extractor.extract_hostnames(url_series)
Extracts hostnames from the given URLs.
- Parameters
url_series (cudf.Series) – URLs to be parsed.
- Returns
Hostnames extracted from the URLs.
- Return type
cudf.Series
Examples
>>> from cudf import DataFrame
>>> from clx.dns import dns_extractor as dns
>>> input_df = DataFrame(
...     {
...         "url": [
...             "http://www.google.com",
...             "gmail.com",
...             "github.com",
...             "https://pandas.pydata.org",
...         ]
...     }
... )
>>> dns.extract_hostnames(input_df["url"])
0       www.google.com
1            gmail.com
2           github.com
3    pandas.pydata.org
Name: 0, dtype: object
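A rough CPU analogue using the standard library's urllib.parse; the scheme-prefix workaround is an assumption of this sketch, since urlparse only reports a hostname when a scheme is present:

```python
from urllib.parse import urlparse

def hostname_of(url):
    # urlparse finds no hostname in scheme-less inputs such as "gmail.com",
    # so prepend a placeholder scheme first.
    if "://" not in url:
        url = "http://" + url
    return urlparse(url).hostname

for u in ["http://www.google.com", "gmail.com", "https://pandas.pydata.org"]:
    print(hostname_of(u))
```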
-
clx.dns.dns_extractor.generate_tld_cols(hostname_split_df, hostnames, col_len)
Generates TLD columns.
- Parameters
hostname_split_df (cudf.DataFrame) – Hostname splits.
hostnames (cudf.DataFrame) – Hostnames.
col_len – Number of columns in the hostname splits dataframe, minus one.
- Returns
TLD columns with all combinations.
- Return type
cudf.DataFrame
Examples
>>> import cudf
>>> from clx.dns import dns_extractor as dns
>>> hostnames = cudf.Series(["www.google.com", "pandas.pydata.org"])
>>> hostname_splits = dns.get_hostname_split_df(hostnames)
>>> print(hostname_splits)
     2       1       0
0  com  google     www
1  org  pydata  pandas
>>> col_len = len(hostname_splits.columns) - 1
>>> dns.generate_tld_cols(hostname_splits, hostnames, col_len)
     2       1       0 tld2        tld1               tld0
0  com  google     www  com  google.com     www.google.com
1  org  pydata  pandas  org  pydata.org  pandas.pydata.org
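The generated columns are cumulative suffix joins of the reversed hostname labels; in plain Python the idea looks like this (a sketch, not the clx implementation):

```python
def tld_combinations(hostname):
    # Reverse the labels, then build cumulative dotted suffixes:
    # "www.google.com" -> ["com", "google.com", "www.google.com"]
    parts = hostname.split(".")[::-1]
    out, acc = [], ""
    for p in parts:
        acc = p if not acc else p + "." + acc
        out.append(acc)
    return out

print(tld_combinations("www.google.com"))  # ['com', 'google.com', 'www.google.com']
```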
-
clx.dns.dns_extractor.get_hostname_split_df(hostnames)
Find all words and digits between periods.
- Parameters
hostnames (cudf.Series) – Hostnames that are being split.
- Returns
Hostname splits.
- Return type
cudf.DataFrame
Examples
>>> import cudf
>>> from clx.dns import dns_extractor as dns
>>> hostnames = cudf.Series(["www.google.com", "pandas.pydata.org"])
>>> dns.get_hostname_split_df(hostnames)
     2       1       0
0  com  google     www
1  org  pydata  pandas
-
clx.dns.dns_extractor.parse_url(url_series, req_cols=None)
Extracts the subdomain, domain, and suffix for a given URL.
- Parameters
url_series (cudf.Series) – URLs to be parsed.
req_cols (set(strings)) – Columns requested to extract, such as domain, subdomain, suffix, and hostname.
- Returns
Extracted information of requested columns.
- Return type
cudf.DataFrame
Examples
>>> from cudf import DataFrame
>>> from clx.dns import dns_extractor as dns
>>> input_df = DataFrame(
...     {
...         "url": [
...             "http://www.google.com",
...             "gmail.com",
...             "github.com",
...             "https://pandas.pydata.org",
...         ]
...     }
... )
>>> dns.parse_url(input_df["url"])
            hostname  domain suffix subdomain
0     www.google.com  google    com       www
1          gmail.com   gmail    com
2         github.com  github    com
3  pandas.pydata.org  pydata    org    pandas
>>> dns.parse_url(input_df["url"], req_cols={'domain', 'suffix'})
   domain suffix
0  google    com
1   gmail    com
2  github    com
3  pydata    org
Heuristics
-
clx.heuristics.ports.major_ports(addr_col, port_col, min_conns=1, eph_min=10000)
Find the major ports for each address. This is done by computing the mean number of connections across all ports for each address, then filtering out the ports that don't cross this threshold. Also adds a column with the IANA service name corresponding to each port.
- Parameters
addr_col (cudf.Series) – Column of addresses as strings
port_col (cudf.Series) – Column of corresponding port numbers as ints
min_conns (int) – Filter out ip:port rows that don't have at least this number of connections (default: 1)
eph_min (int) – Ports greater than or equal to this will be labeled as an ephemeral service (default: 10000)
- Returns
DataFrame with columns for address, port, IANA service corresponding to the port, and number of connections
- Return type
cudf.DataFrame
Examples
>>> import clx.heuristics.ports as ports
>>> import cudf
>>> input_addr_col = cudf.Series(["10.0.75.1","10.0.75.1","10.0.75.1","10.0.75.255","10.110.104.107","10.110.104.107"])
>>> input_port_col = cudf.Series([137,137,7680,137,7680,7680])
>>> ports.major_ports(input_addr_col, input_port_col, min_conns=2, eph_min=7000)
             addr  port     service  conns
0       10.0.75.1   137  netbios-ns      2
1  10.110.104.107  7680   ephemeral      2
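A CPU sketch of the logic described above, using only the standard library. The IANA service lookup is replaced with a placeholder string, and keeping rows at or above the per-address mean is this sketch's assumption:

```python
from collections import Counter

def major_ports_ref(addrs, ports, min_conns=1, eph_min=10000):
    conns = Counter(zip(addrs, ports))           # connections per (addr, port)
    per_addr = {}
    for (a, _), n in conns.items():
        per_addr.setdefault(a, []).append(n)
    mean = {a: sum(v) / len(v) for a, v in per_addr.items()}
    rows = []
    for (a, p), n in sorted(conns.items()):
        if n >= min_conns and n >= mean[a]:      # keep ports at or above the mean
            service = "ephemeral" if p >= eph_min else f"port-{p}"  # placeholder
            rows.append((a, p, service, n))
    return rows

addrs = ["10.0.75.1", "10.0.75.1", "10.0.75.1", "10.0.75.255",
         "10.110.104.107", "10.110.104.107"]
ports_col = [137, 137, 7680, 137, 7680, 7680]
print(major_ports_ref(addrs, ports_col, min_conns=2, eph_min=7000))
```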
OSI (Open Source Integration)
-
class clx.osi.farsight.FarsightLookupClient(server, apikey, limit=None, http_proxy=None, https_proxy=None)
Methods
query_rdata_ip(self, rdata_ip[, before, after]) – Queries DNSDB for records matching a specific IP address within the given time range.
query_rdata_name(self, rdata_name[, rrtype, …]) – Queries DNSDB for a single record matching the given oname and time ranges.
query_rrset(self, oname[, rrtype, …]) – Batch version of querying DNSDB by domain name and time ranges.
-
query_rdata_ip(self, rdata_ip, before=None, after=None)
Queries DNSDB for records matching a specific IP address within the given time range.
-
query_rdata_name(self, rdata_name, rrtype=None, before=None, after=None)
Queries DNSDB for a single record matching the given oname and time ranges.
-
query_rrset(self, oname, rrtype=None, bailiwick=None, before=None, after=None)
Batch version of querying DNSDB by domain name and time ranges.
-
-
class clx.osi.virus_total.VirusTotalClient(api_key=None, proxies=None)
- Attributes
- api_key
- proxies
- vt_endpoint_dict
Methods
domain_report(self, domain) – Retrieve a report for the given domain.
file_report(self, \*resource) – Retrieve the most recent antivirus report for a file, identified by MD5, SHA-1, or SHA-256.
file_rescan(self, \*resource) – Rescan the given files.
file_scan(self, file) – Send a file for scanning with VirusTotal.
ipaddress_report(self, ip) – Retrieve a report for the given IP address.
put_comment(self, resource, comment) – Post a comment for a file or URL.
scan_big_file(self, files) – Scan files larger than 32MB.
url_report(self, \*resource) – Retrieve the most recent report for the given URL.
url_scan(self, \*url) – Scan the provided URL with VirusTotal.
-
domain_report(self, domain)
Retrieve a report for the given domain.
-
file_report(self, *resource)
The resource argument can be the MD5, SHA-1 or SHA-256 of a file for which you want to retrieve the most recent antivirus report. You may also specify a scan_id returned by the /file/scan endpoint.
-
file_rescan(self, *resource)
Rescans the given files. The resource argument can be the MD5, SHA-1 or SHA-256 of the file you want to re-scan.
-
file_scan(self, file)
Sends a file for scanning with VirusTotal. Before performing submissions it is advisable to retrieve the latest report on the file. The file size limit is 32MB; in order to submit files up to 200MB in size, it is mandatory to request a special upload URL using the /file/scan/upload_url endpoint.
-
ipaddress_report(self, ip)
Retrieve a report for the given IP address.
-
put_comment(self, resource, comment)
Post a comment for a file or URL.
-
scan_big_file(self, files)
Scan files larger than 32MB.
-
url_report(self, *resource)
The resource argument must be the URL for which to retrieve the most recent report.
-
url_scan(self, *url)
Scans the provided URL with VirusTotal.
Parsers
-
class clx.parsers.event_parser.EventParser(columns, event_name)
This is an abstract class for all event log parsers.
- Parameters
columns (set(string)) – Event column names.
event_name (string) – Event name
- Attributes
columns
List of columns that are being processed.
event_name
Event name defining the type of logs that are being processed.
Methods
filter_by_pattern(self, df, column, pattern) – Retrieve only the events that satisfy the given regex pattern.
parse(self, dataframe, raw_column) – Abstract method ‘parse’ triggers the parsing functionality.
parse_raw_event(self, dataframe, raw_column, …) – Parses a specific type of raw event records received as a dataframe.
-
property columns
List of columns that are being processed.
- Returns
Event column names.
- Return type
set(string)
-
property event_name
Event name defining the type of logs that are being processed.
- Returns
Event name
- Return type
string
-
filter_by_pattern(self, df, column, pattern)
Retrieve only the events that satisfy the given regex pattern.
- Parameters
df (cudf.DataFrame) – Raw events to be filtered.
column (string) – Name of the column containing the raw data.
pattern (string) – Regex pattern used to retrieve the required events.
- Returns
filtered dataframe.
- Return type
cudf.DataFrame
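On the CPU the same filtering is a regex scan per row; clx vectorizes it over a cudf column, but the semantics match this sketch:

```python
import re

events = ["failed login from 10.0.0.5", "heartbeat ok", "failed login from 10.0.0.9"]
pattern = re.compile(r"failed login")
matches = [e for e in events if pattern.search(e)]
print(matches)
```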
-
abstract parse(self, dataframe, raw_column)
Abstract method ‘parse’ triggers the parsing functionality. Subclasses are required to implement and execute any parsing pre-processing steps.
-
parse_raw_event(self, dataframe, raw_column, event_regex)
Parses a specific type of raw event records received as a dataframe.
- Parameters
dataframe (cudf.DataFrame) – Raw events to be parsed.
raw_column (string) – Name of the column containing the raw data.
event_regex (dict) – Required regular expressions for a given event type.
- Returns
parsed information.
- Return type
cudf.DataFrame
-
class clx.parsers.splunk_notable_parser.SplunkNotableParser
This class parses Splunk notable logs.
Methods
parse(self, dataframe, raw_column) – Parses the Splunk notable raw events.
-
parse(self, dataframe, raw_column)
Parses the Splunk notable raw events.
- Parameters
dataframe (cudf.DataFrame) – Raw events to be parsed.
raw_column (string) – Name of the column containing the raw data.
- Returns
parsed information.
- Return type
cudf.DataFrame
-
-
class clx.parsers.windows_event_parser.WindowsEventParser(interested_eventcodes=None)
This class parses Windows event logs.
- Parameters
interested_eventcodes (set(int)) – Set of event codes to restrict parsing to; if not specified, all configured event codes are parsed.
Methods
clean_raw_data(self, dataframe, raw_column) – Lowercases text and replaces escape characters.
get_columns(self) – Get the columns for the windows event codes.
parse(self, dataframe, raw_column) – Parses the Windows raw event.
-
clean_raw_data(self, dataframe, raw_column)
Lowercases text and replaces escape characters.
- Parameters
dataframe (cudf.DataFrame) – Raw events to be parsed.
raw_column (string) – Name of the column containing the raw data.
- Returns
Clean raw information.
- Return type
cudf.DataFrame
-
get_columns(self)
Get the columns for the windows event codes.
- Returns
Columns for all configured eventcodes, or only for the interested eventcodes if specified.
- Return type
set(string)
-
parse(self, dataframe, raw_column)
Parses the Windows raw event.
- Parameters
dataframe (cudf.DataFrame) – Raw events to be parsed.
raw_column (string) – Name of the column containing the raw data.
- Returns
Parsed information.
- Return type
cudf.DataFrame
-
clx.parsers.zeek.parse_log_file(filepath)
Parse a Zeek log file and return a cuDF dataframe. Uses the header comments to get column names/types and configure the parser.
- Parameters
filepath (string) – filepath for the Zeek log file
- Returns
Zeek log dataframe
- Return type
cudf.DataFrame
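Zeek TSV logs declare their schema in '#fields' and '#types' header comments; a minimal standard-library sketch of reading the column names (the sample log content here is illustrative):

```python
import io

def zeek_field_names(f):
    # The '#fields' header comment lists the tab-separated column names.
    for line in f:
        if line.startswith("#fields"):
            return line.rstrip("\n").split("\t")[1:]
    return []

sample = io.StringIO(
    "#separator \\x09\n"
    "#fields\tts\tuid\tid.orig_h\tid.orig_p\n"
    "#types\ttime\tstring\taddr\tport\n"
)
print(zeek_field_names(sample))  # ['ts', 'uid', 'id.orig_h', 'id.orig_p']
```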
Workflow
-
class clx.workflow.workflow.Workflow(name, source=None, destination=None)
- Attributes
destination
Dictionary of configuration parameters for the data destination (writer)
name
Name of the workflow for logging purposes.
source
Dictionary of configuration parameters for the data source (reader)
Methods
benchmark(function) – Decorator used to capture a benchmark for a given function.
run_workflow(self) – Run the workflow.
set_destination(self, destination) – Set the destination.
set_source(self, source) – Set the source.
stop_workflow(self) – Close the workflow.
workflow(self, dataframe) – The pipeline function that performs the data enrichment.
-
benchmark(function)
Decorator used to capture a benchmark for a given function.
-
property destination
Dictionary of configuration parameters for the data destination (writer)
-
property name
Name of the workflow for logging purposes.
-
run_workflow(self)
Run the workflow: the reader (source) fetches data, the workflow implementation is executed, and the workflow output is written to the destination.
-
set_destination(self, destination)
Set the destination.
- Parameters
destination – dict of configuration parameters for the destination (writer)
-
set_source(self, source)
Set the source.
- Parameters
source – dict of configuration parameters for the data source (reader)
-
property source
Dictionary of configuration parameters for the data source (reader)
-
stop_workflow(self)
Close the workflow. This includes calling the close() method on the reader (source) and the writer (destination).
-
abstract workflow(self, dataframe)
The pipeline function performs the data enrichment on the data. Subclasses must define this function, which returns a GPU dataframe with enriched data.
-
class clx.workflow.splunk_alert_workflow.SplunkAlertWorkflow(name, source=None, destination=None, interval='day', threshold=2.5, window=7, raw_data_col_name='_raw')
- Attributes
interval
Interval, either day or hour, by which the z-score will be calculated
raw_data_col_name
Dataframe column name containing the raw Splunk alert data
threshold
Threshold by which to flag the z-score.
window
Window by which to calculate the rolling z-score
Methods
workflow(self, dataframe) – The pipeline function that performs the data enrichment.
-
property interval
Interval, either day or hour, by which the z-score will be calculated
-
property raw_data_col_name
Dataframe column name containing the raw Splunk alert data
-
property threshold
Threshold by which to flag the z-score. Scores greater than threshold or less than -threshold will be flagged.
-
property window
Window by which to calculate the rolling z-score
-
workflow(self, dataframe)
The pipeline function performs the data enrichment on the data. Subclasses must define this function, which returns a GPU dataframe with enriched data.
I/O
-
class clx.io.reader.kafka_reader.KafkaReader(batch_size, consumer, time_window=30)
Reads from Kafka based on a config object.
- Parameters
batch_size – batch size
consumer – Kafka consumer
time_window – Maximum time that queued events will wait before being pushed to the workflow
- Attributes
- consumer
- has_data
- time_window
Methods
close(self) – Close the Kafka reader.
fetch_data(self) – Fetch data from Kafka based on the provided config object.
-
close(self)
Close the Kafka reader.
-
fetch_data(self)
Fetch data from Kafka based on the provided config object.
-
class clx.io.reader.dask_fs_reader.DaskFileSystemReader(config)
Uses Dask to read from the file system based on a config object.
- Parameters
config – dictionary object of config values for type, input_format, input_path, and optional Dask reader keyword args
Methods
close(self) – Close the Dask reader.
fetch_data(self) – Fetch data using Dask based on the provided config object.
-
close(self)
Close the Dask reader.
-
fetch_data(self)
Fetch data using Dask based on the provided config object.
-
class clx.io.reader.fs_reader.FileSystemReader(config)
Uses cudf to read from the file system based on a config object.
- Parameters
config – dictionary object of config values for type, input_format, input_path, and optional cudf reader keyword args
Methods
close(self) – Close the cudf reader.
fetch_data(self) – Fetch data using cudf based on the provided config object.
-
close(self)
Close the cudf reader.
-
fetch_data(self)
Fetch data using cudf based on the provided config object.
-
class clx.io.writer.kafka_writer.KafkaWriter(kafka_topic, batch_size, delimiter, producer)
Publish to a Kafka topic based on a config object.
- Parameters
kafka_topic – Kafka topic
batch_size – batch size
delimiter – delimiter
producer – producer
- Attributes
- delimiter
- producer
Methods
close(self) – Close the Kafka writer.
write_data(self, df) – Publish messages to the Kafka topic.
-
close(self)
Close the Kafka writer.
-
write_data(self, df)
Publish messages to the Kafka topic.
- Parameters
df – dataframe to publish
-
class clx.io.writer.fs_writer.FileSystemWriter(config)
Uses cudf to write to the file system based on a config object.
- Parameters
config – dictionary object of config values for type, output_format, output_path, and optional cudf writer keyword args
Methods
close(self) – Close the cudf writer.
write_data(self, df) – Write data to the file system using cudf based on the provided config object.
-
close(self)
Close the cudf writer.
-
write_data(self, df)
Write data to the file system using cudf based on the provided config object.