10 minutes to CLX

This is a short introduction to CLX geared mainly towards new users of the code.

What are these libraries?

CLX (Cyber Log Accelerators) provides a simple API for security analysts, data scientists, and engineers to quickly get started applying RAPIDS to real-world cyber use cases. CLX uses the GPU dataframe (cuDF) and other RAPIDS packages to execute cybersecurity and information security workflows. The following packages are available:

  • analytics - Machine learning and statistics functionality

  • ip - IPv4 data translation and parsing

  • parsers - Cyber log event parsing

  • io - Input and output features for a workflow

  • workflow - Workflow which receives input data and produces analytical output data

  • osi - Open source integration (VirusTotal, FarsightDB and Whois)

  • dns - TLD extraction

When to use CLX

Use CLX to build your cyber data analytics workflows for a GPU-accelerated environment using RAPIDS. CLX contains common cyber and cyber ML functionality, such as log parsing for specific data sources, cyber data type parsing (e.g., IPv4), and DGA detection. CLX also provides the ability to integrate this functionality into a CLX workflow, which simplifies execution of the series of parsing and ML functions needed for end-to-end use cases.

Log Parsing

CLX provides traditional parsers for some common log types. Here’s an example that parses a Windows Event Log record with event code 5156 (the Windows Filtering Platform has permitted a connection).

[1]:
import cudf
from clx.parsers.windows_event_parser import WindowsEventParser
event = "04/03/2019 11:58:59 AM\\nLogName=Security\\nSourceName=Microsoft Windows security auditing.\\nEventCode=5156\\nEventType=0\\nType=Information\\nComputerName=user234.test.com\\nTaskCategory=Filtering Platform Connection\\nOpCode=Info\\nRecordNumber=241754521\\nKeywords=Audit Success\\nMessage=The Windows Filtering Platform has permitted a connection.\\r\\n\\r\\nApplication Information:\\r\\n\\tProcess ID:\\t\\t4\\r\\n\\tApplication Name:\\tSystem\\r\\n\\r\\nNetwork Information:\\r\\n\\tDirection:\\t\\tInbound\\r\\n\\tSource Address:\\t\\t100.20.100.20\\r\\n\\tSource Port:\\t\\t138\\r\\n\\tDestination Address:\\t100.20.100.30\\r\\n\\tDestination Port:\\t\\t138\\r\\n\\tProtocol:\\t\\t17\\r\\n\\r\\nFilter Information:\\r\\n\\tFilter Run-Time ID:\\t0\\r\\n\\tLayer Name:\\t\\tReceive/Accept\\r\\n\\tLayer Run-Time ID:\\t44"
wep = WindowsEventParser()
df = cudf.DataFrame()
df['raw'] = [event]
result_df = wep.parse(df, 'raw')
result_df.head()
[1]:
service_information_service_id target_account_old_account_name service_service_name group_group_name changed_attributes_account_expires detailed_authentication_information_key_length additional_information_result_code account_information_security_id changed_attributes_user_account_control process_information_caller_process_id ... changed_attributes_old_uac_value attributes_profile_path attributes_user_account_control account_for_which_logon_failed_account_domain account_whose_credentials_were_used_account_domain new_logon_logon_guid service_server attributes_home_directory failure_information_status failure_information_sub_status
0 ...

1 rows × 131 columns
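To make the input format more concrete, here is a minimal sketch in plain Python (no GPU required) of how the header portion of such an event could be split into fields. It is not how WindowsEventParser is implemented; the helper name `parse_header_fields` and the regular expression are assumptions made for illustration, and the sample is a shortened copy of the event above.

```python
import re

# Shortened sample in the same shape as the event above: fields are separated
# by a literal backslash followed by "n" (two characters), not a real newline.
raw_event = (
    "04/03/2019 11:58:59 AM\\nLogName=Security\\nEventCode=5156"
    "\\nComputerName=user234.test.com\\nKeywords=Audit Success"
)

# Hypothetical pattern for illustration; not part of CLX.
HEADER_FIELD = re.compile(r"^(?P<key>[A-Za-z]+)=(?P<value>.*)$")


def parse_header_fields(event):
    """Return the timestamp plus every Key=Value header field, keys lower-cased."""
    parts = event.split("\\n")
    parsed = {"time": parts[0]}
    for part in parts[1:]:
        match = HEADER_FIELD.match(part)
        if match:
            parsed[match.group("key").lower()] = match.group("value")
    return parsed


fields = parse_header_fields(raw_event)
print(fields["eventcode"], fields["computername"])
```

The lower-cased keys mirror the style of the column names in the parsed output (for example `eventcode` and `computername`), but the nested Message sections are deliberately left out of this sketch.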

Cyber Data Types

CLX provides the ability to work with different data types that are specific to cybersecurity, such as IPv4 and DNS. Here’s an example of how to get started.

IPv4

The IPv4 data type is still commonly used and present in log files. The examples below convert IPv4 strings to integers and check whether addresses are multicast. Additional operations are available in the clx.ip module.

Convert IPv4 values to integers

[50]:
import clx.ip
import cudf
df = cudf.Series(["5.79.97.178", "94.130.74.45"])
result_df = clx.ip.ip_to_int(df)
print(result_df)
0      89088434
1    1585596973
dtype: int64
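For readers who want to see where these integers come from, the conversion can be sketched on the CPU in plain Python. This is only an illustration of the arithmetic, not the GPU implementation in clx.ip; the function name `ipv4_to_int` is hypothetical.

```python
import ipaddress


def ipv4_to_int(address):
    """Pack the four octets of a dotted-quad IPv4 string into one integer."""
    value = 0
    for octet in address.split("."):
        # Shift the accumulated value left by one byte and append the next octet.
        value = (value << 8) | int(octet)
    return value


for address in ["5.79.97.178", "94.130.74.45"]:
    # The standard library's ipaddress module performs the same conversion.
    assert ipv4_to_int(address) == int(ipaddress.IPv4Address(address))
    print(address, ipv4_to_int(address))
```

The two values printed here match the cuDF output above (89088434 and 1585596973).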

Check if IPv4 values are multicast

[51]:
import clx.ip
import cudf
df = cudf.Series(["224.0.0.0", "239.255.255.255", "5.79.97.178"])
result_df = clx.ip.is_multicast(df)
print(result_df)
0     True
1     True
2    False
dtype: bool
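The multicast check corresponds to membership in the 224.0.0.0/4 block. A minimal CPU-side sketch using only the Python standard library is shown below; it illustrates the rule rather than the clx.ip implementation, and the helper name `is_multicast_address` is an assumption.

```python
import ipaddress

# IPv4 multicast addresses occupy 224.0.0.0 through 239.255.255.255.
MULTICAST_NETWORK = ipaddress.ip_network("224.0.0.0/4")


def is_multicast_address(address):
    """Return True when the IPv4 address falls inside 224.0.0.0/4."""
    return ipaddress.ip_address(address) in MULTICAST_NETWORK


for address in ["224.0.0.0", "239.255.255.255", "5.79.97.178"]:
    # ipaddress also exposes the same check as the is_multicast property.
    assert is_multicast_address(address) == ipaddress.ip_address(address).is_multicast
    print(address, is_multicast_address(address))
```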

TLD Extraction

CLX can split the hostname of a URL into its subdomain, registered domain, and TLD (suffix), using the public suffix list.

[1]:
import cudf
from clx.dns import dns_extractor as dns

input_df = cudf.DataFrame(
    {
        "url": [
            "http://www.google.com",
            "gmail.com",
            "github.com",
            "https://pandas.pydata.org",
            "http://www.worldbank.org.kg/",
            "waiterrant.blogspot.com",
            "http://forums.news.cnn.com.ac/",
            "http://forums.news.cnn.ac/",
            "ftp://b.cnn.com/",
            "a.news.uk",
            "a.news.co.uk",
            "https://a.news.co.uk",
            "107-193-100-2.lightspeed.cicril.sbcglobal.net",
            "a23-44-13-2.deploy.static.akamaitechnologies.com",
        ]
    }
)
output_df = dns.parse_url(input_df["url"])
output_df.head(14)
[1]:
hostname domain suffix subdomain
0 www.google.com google com www
1 gmail.com gmail com
2 github.com github com
3 pandas.pydata.org pydata org pandas
4 www.worldbank.org.kg worldbank org.kg www
5 waiterrant.blogspot.com waiterrant blogspot.com
6 forums.news.cnn.com.ac cnn com.ac forums.news
7 forums.news.cnn.ac cnn ac forums.news
8 b.cnn.com cnn com b
9 a.news.uk news uk a
10 a.news.co.uk news co.uk a
11 a.news.co.uk news co.uk a
12 107-193-100-2.lightspeed.cicril.sbcglobal.net sbcglobal net 107-193-100-2.lightspeed.cicril
13 a23-44-13-2.deploy.static.akamaitechnologies.com akamaitechnologies com a23-44-13-2.deploy.static
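The idea behind the split can be sketched in plain Python as a longest-suffix match. This is a simplified illustration, not the dns_extractor implementation: `SUFFIXES` is a tiny hand-picked subset standing in for the real public suffix list, and `split_hostname` is a hypothetical helper.

```python
from urllib.parse import urlsplit

# Tiny illustrative subset; the real public suffix list has thousands of entries.
SUFFIXES = {"com", "org", "net", "uk", "co.uk", "kg", "org.kg"}


def split_hostname(url):
    """Split a URL's hostname into subdomain, domain, and suffix."""
    # urlsplit only fills in the host when the URL has a scheme or starts with "//".
    if "://" not in url:
        url = "//" + url
    hostname = urlsplit(url).hostname
    labels = hostname.split(".")
    # Start at 1 so at least one label remains for the domain; trying longer
    # candidates first lets "co.uk" win over "uk".
    for i in range(1, len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in SUFFIXES:
            return {
                "hostname": hostname,
                "domain": labels[i - 1],
                "suffix": candidate,
                "subdomain": ".".join(labels[: i - 1]),
            }
    # No known suffix: report the whole hostname as the domain.
    return {"hostname": hostname, "domain": hostname, "suffix": "", "subdomain": ""}


for url in ["http://www.worldbank.org.kg/", "a.news.co.uk", "gmail.com"]:
    print(split_hostname(url))
```

For these three inputs the sketch agrees with the corresponding rows of the table above; entries such as blogspot.com would need the full suffix list.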

Machine Learning

CLX offers machine learning and statistics functions that are ready to integrate into your CLX workflow.

Calculate a rolling z-score on a given cuDF series, here with a window of 7 values.

[2]:
import clx.analytics.stats
import cudf
sequence = [3,4,5,6,1,10,34,2,1,11,45,34,2,9,19,43,24,13,23,10,98,84,10]
series = cudf.Series(sequence)
zscores_df = cudf.DataFrame()
zscores_df['zscore'] = clx.analytics.stats.rzscore(series, 7)
print(zscores_df)
          zscore
0           null
1           null
2           null
3           null
4           null
5           null
6    2.374423424
7   -0.645941275
8   -0.683973734
9    0.158832461
10   1.847751909
11   0.880026019
12  -0.950835449
13  -0.360593742
14   0.111407599
15   1.228914145
16  -0.074966331
17  -0.570321249
18   0.327849973
19  -0.934372308
20   2.296828498
21   1.282966989
22  -0.795223674
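As a cross-check of what the numbers mean, the same calculation can be sketched on the CPU in plain Python. The sketch assumes that each z-score compares a value with the mean and population standard deviation of the window ending at that value; under that assumption it reproduces the first two non-null values printed above. The function name `rolling_zscore` is hypothetical and this is not the clx.analytics.stats code.

```python
import math


def rolling_zscore(values, window):
    """Z-score of each value against the trailing window that ends at it."""
    scores = []
    for i in range(len(values)):
        if i + 1 < window:
            # Not enough history yet, mirroring the null entries above.
            scores.append(None)
            continue
        chunk = values[i + 1 - window : i + 1]
        mean = sum(chunk) / window
        # Population variance (divide by the window size), an assumption of this sketch.
        variance = sum((x - mean) ** 2 for x in chunk) / window
        std = math.sqrt(variance)
        scores.append((values[i] - mean) / std if std else None)
    return scores


sequence = [3, 4, 5, 6, 1, 10, 34, 2, 1, 11, 45, 34, 2, 9, 19, 43, 24, 13, 23, 10, 98, 84, 10]
zscores = rolling_zscore(sequence, 7)
print(zscores[6], zscores[7])
```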

Workflows

Now that we’ve demonstrated the basics of CLX, let’s tie some of this functionality into a CLX workflow. A workflow is defined as a function that receives a cuDF dataframe, performs some operations on it, and then returns an output cuDF dataframe. In this example, the workflow parses raw Windows Event Log (WinEVT) data.

[61]:
import cudf
from clx.workflow.workflow import Workflow
from clx.parsers.windows_event_parser import WindowsEventParser

wep = WindowsEventParser()

class LogParseWorkflow(Workflow):
    def workflow(self, dataframe):
        output = wep.parse(dataframe, "raw")
        return output

input_df = cudf.DataFrame()
input_df["raw"] = ["04/03/2019 11:58:59 AM\\nLogName=Security\\nSourceName=Microsoft Windows security auditing.\\nEventCode=5156\\nEventType=0\\nType=Information\\nComputerName=user234.test.com\\nTaskCategory=Filtering Platform Connection\\nOpCode=Info\\nRecordNumber=241754521\\nKeywords=Audit Success\\nMessage=The Windows Filtering Platform has permitted a connection.\\r\\n\\r\\nApplication Information:\\r\\n\\tProcess ID:\\t\\t4\\r\\n\\tApplication Name:\\tSystem\\r\\n\\r\\nNetwork Information:\\r\\n\\tDirection:\\t\\tInbound\\r\\n\\tSource Address:\\t\\t100.20.100.20\\r\\n\\tSource Port:\\t\\t138\\r\\n\\tDestination Address:\\t100.20.100.30\\r\\n\\tDestination Port:\\t\\t138\\r\\n\\tProtocol:\\t\\t17\\r\\n\\r\\nFilter Information:\\r\\n\\tFilter Run-Time ID:\\t0\\r\\n\\tLayer Name:\\t\\tReceive/Accept\\r\\n\\tLayer Run-Time ID:\\t44"]
lpw = LogParseWorkflow(name="my-log-parsing-workflow")
lpw.workflow(input_df)
[61]:
member_account_name attributes_password_last_set service_service_name attributes_profile_path account_information_security_id additional_information_transited_services additional_information_caller_computer_name network_information_direction new_logon_account_name changed_attributes_home_drive ... certificate_information_certificate_issuer_name network_information_source_network_address service_information_service_name privileges account_for_which_logon_failed_account_domain network_information_network_address service_server new_account_account_name user_account_name attributes_user_account_control
0 inbound ...

1 rows × 131 columns
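The subclass-and-override pattern used above can be illustrated without CLX or a GPU. The sketch below is a hypothetical stand-in, not the CLX Workflow class: `MiniWorkflow`, its constructor arguments, and the list-based records are all assumptions chosen to show the shape of the design (a base class that handles input and output, and a subclass that supplies only the transformation).

```python
from abc import ABC, abstractmethod


class MiniWorkflow(ABC):
    """Hypothetical stand-in for a workflow base class; not the CLX implementation."""

    def __init__(self, name, read_input, write_output):
        self.name = name
        self._read_input = read_input      # callable returning the input records
        self._write_output = write_output  # callable receiving the output records

    @abstractmethod
    def workflow(self, records):
        """Transform input records into output records."""

    def run_workflow(self):
        # Read, transform, write: the subclass only supplies the middle step.
        records = self._read_input()
        self._write_output(self.workflow(records))


class UppercaseWorkflow(MiniWorkflow):
    def workflow(self, records):
        return [record.upper() for record in records]


collected = []
wf = UppercaseWorkflow(
    "demo-workflow",
    read_input=lambda: ["inbound", "receive/accept"],
    write_output=collected.extend,
)
wf.run_workflow()
print(collected)
```

Keeping I/O in the base class means the same `workflow` method can be called directly on a dataframe, as in the cell above, or driven end to end by a run method.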

A workflow can read input from and write output to different locations, including CSV files and Kafka. To integrate I/O into your workflow, either define your configuration in a workflow.yaml file or pass it as a Python dictionary at instantiation.

The workflow class looks for a configuration file in the following locations, in order:

  • /etc/clx/[workflow-name]/workflow.yaml

  • ~/.config/clx/[workflow-name]/workflow.yaml

To learn more about workflow configurations, visit the CLX Workflow page.
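The lookup order can be sketched as a small helper. This is an illustration only: the function names are hypothetical, and the sketch assumes that the first file found is the one used, which the text above does not state explicitly.

```python
import os
import posixpath


def candidate_config_paths(workflow_name):
    """Return the two documented locations, in the order they are checked."""
    return [
        posixpath.join("/etc/clx", workflow_name, "workflow.yaml"),
        posixpath.join(
            posixpath.expanduser("~"), ".config", "clx", workflow_name, "workflow.yaml"
        ),
    ]


def find_config(workflow_name, exists=os.path.isfile):
    """Return the first candidate path that exists, or None if neither does."""
    # Assumption of this sketch: the first match wins.
    for path in candidate_config_paths(workflow_name):
        if exists(path):
            return path
    return None


print(candidate_config_paths("my-log-parsing-workflow"))
```

The `exists` parameter is there only so the lookup can be exercised without touching the real filesystem.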

To demonstrate the input functionality, we’ll create a small CSV input file.

[62]:
import cudf
input_df = cudf.DataFrame()
input_df["raw"] = ["04/03/2019 11:58:59 AM\\nLogName=Security\\nSourceName=Microsoft Windows security auditing.\\nEventCode=5156\\nEventType=0\\nType=Information\\nComputerName=user234.test.com\\nTaskCategory=Filtering Platform Connection\\nOpCode=Info\\nRecordNumber=241754521\\nKeywords=Audit Success\\nMessage=The Windows Filtering Platform has permitted a connection.\\r\\n\\r\\nApplication Information:\\r\\n\\tProcess ID:\\t\\t4\\r\\n\\tApplication Name:\\tSystem\\r\\n\\r\\nNetwork Information:\\r\\n\\tDirection:\\t\\tInbound\\r\\n\\tSource Address:\\t\\t100.20.100.20\\r\\n\\tSource Port:\\t\\t138\\r\\n\\tDestination Address:\\t100.20.100.30\\r\\n\\tDestination Port:\\t\\t138\\r\\n\\tProtocol:\\t\\t17\\r\\n\\r\\nFilter Information:\\r\\n\\tFilter Run-Time ID:\\t0\\r\\n\\tLayer Name:\\t\\tReceive/Accept\\r\\n\\tLayer Run-Time ID:\\t44"]
input_df.to_csv("alert_data.csv")

Next, create and run the workflow.

[60]:
from clx.workflow.workflow import Workflow
from clx.parsers.windows_event_parser import WindowsEventParser
import os
dirpath = os.getcwd()

source = {
   "type": "fs",
   "input_format": "csv",
   "input_path": os.path.join(dirpath, "alert_data.csv"),
   "schema": ["raw"],
   "delimiter": ",",
   "required_cols": ["raw"],
   "dtype": ["str"],
   "header": 0
}
destination = {
   "type": "fs",
   "output_format": "csv",
   "output_path": os.path.join(dirpath, "alert_data_output.csv")
}
wep = WindowsEventParser()

class LogParseWorkflow(Workflow):
    def workflow(self, dataframe):
        output = wep.parse(dataframe, "raw")
        return output

lpw = LogParseWorkflow(source=source, destination=destination, name="my-log-parsing-workflow")
lpw.run_workflow()

Output data can be read directly from the resulting CSV file.

[66]:
with open('alert_data_output.csv', "r") as f:
    lines = f.readlines()
lines
[66]:
['member_account_name,attributes_password_last_set,service_service_name,attributes_profile_path,account_information_security_id,additional_information_transited_services,additional_information_caller_computer_name,network_information_direction,new_logon_account_name,changed_attributes_home_drive,filter_information_layer_run_time_id,new_logon_security_id,additional_information_result_code,eventcode,changed_attributes_logon_hours,account_information_supplied_realm_name,additional_information_ticket_options,subject_security_id,detailed_authentication_information_key_length,changed_attributes_script_path,changed_attributes_display_name,detailed_authentication_information_transited_services,subject_logon_id,changed_attributes_sam_account_name,network_information_workstation_name,service_information_service_id,subject_account_name,account_information_user_id,new_logon_account_domain,attributes_user_workstations,account_locked_out_account_name,target_account_old_account_name,network_information_protocol,attributes_home_directory,attributes_logon_hours,group_group_domain,changed_attributes_allowedtodelegateto,changed_attributes_user_account_control,network_information_source_port,attributes_user_parameters,network_information_port,application_information_process_id,attributes_sid_history,attributes_new_uac_value,process_process_name,network_information_destination_port,changed_attributes_home_directory,group_security_id,member_security_id,user_account_domain,certificate_information_certificate_serial_number,account_whose_credentials_were_used_account_domain,attributes_account_expires,subject_account_domain,process_information_caller_process_id,process_process_id,target_server_additional_information,process_information_caller_process_name,logon_type,network_information_destination_address,account_whose_credentials_were_used_logon_guid,filter_information_layer_name,additional_information_ticket_encryption_type,network_information_source_address,target_account_account_domain,failure_information_status,failure_information_failure_reason,process_information_process_name,target_account_security_id,filter_information_filter_run_time_id,attributes_allowed_to_delegate_to,changed_attributes_sid_history,account_for_which_logon_failed_security_id,new_account_domain_name,detailed_authentication_information_logon_process,additional_information_privileges,account_information_account_name,user_security_id,process_information_process_id,network_information_client_port,certificate_information_certificate_thumbprint,target_server_target_server_name,attributes_primary_group_id,additional_information_pre_authentication_type,changed_attributes_old_uac_value,account_information_account_domain,account_whose_credentials_were_used_account_name,id,subject_logon_guid,attributes_sam_account_name,detailed_authentication_information_authentication_package,attributes_user_principal_name,target_account_new_account_name,computername,attributes_home_drive,changed_attributes_account_expires,target_account_account_name,application_information_application_name,changed_attributes_primary_group_id,additional_information_failure_code,time,failure_information_sub_status,attributes_display_name,new_account_security_id,changed_attributes_user_principal_name,new_logon_logon_guid,changed_attributes_user_workstations,account_information_logon_guid,new_logon_logon_id,attributes_old_uac_value,changed_attributes_new_uac_value,additional_information_expiration_time,changed_attributes_password_last_set,network_information_client_address,account_for_which_logon_failed_account_name,changed_attributes_profile_path,attributes_script_path,detailed_authentication_information_package_name_ntlm_only,group_group_name,changed_attributes_user_parameters,account_locked_out_security_id,certificate_information_certificate_issuer_name,network_information_source_network_address,service_information_service_name,privileges,account_for_which_logon_failed_account_domain,network_information_network_address,service_server,new_account_account_name,user_account_name,attributes_user_account_control\n',
 ',,,,,,,inbound,,,44,,,5156,,,,,,,,,,,,,,,,,,,17,,,,,,138,,,4,,,,138,,,,,,,,,,,,,,100.20.100.30,,receive/accept,,100.20.100.20,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,system,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,\n']
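Because most of the 131 columns are empty for any single event, it can help to keep only the populated fields when inspecting a row. The sketch below uses the standard library csv module on an in-memory sample; the sample contains just a handful of the columns and values shown above and stands in for the real output file.

```python
import csv
import io

# A few columns and values copied from the output above; the real file has 131 columns.
sample = io.StringIO(
    "member_account_name,network_information_direction,eventcode,"
    "network_information_source_address,service_server\n"
    ",inbound,5156,100.20.100.20,\n"
)

for row in csv.DictReader(sample):
    # Drop the columns that are empty for this event.
    populated = {column: value for column, value in row.items() if value}
    print(populated)
```

To apply the same filter to the real output, replace `sample` with an open file handle for alert_data_output.csv.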

Open Source Threat Intelligence Integration

Often it’s beneficial to integrate open source threat intelligence with collected data. CLX can query VirusTotal, FarsightDB, and Whois directly. An API key is required for the VirusTotal and FarsightDB integrations.

[ ]:
from clx.osi.virus_total import VirusTotalClient
vt_api_key='<virus total apikey goes here>'
vt_client = VirusTotalClient(api_key=vt_api_key)
result = vt_client.url_scan(["virustotal.com"])
[ ]:
from clx.osi.farsight import FarsightLookupClient
server='https://api.dnsdb.info'
fs_api_key='<farsight apikey goes here>'
fs_client = FarsightLookupClient(server, fs_api_key, limit=1)
result = fs_client.query_rrset("www.dnsdb.info")
[6]:
from clx.osi.whois import WhoIsLookupClient
whois_client = WhoIsLookupClient()
whois_result = whois_client.whois(["nvidia.com"])
print(whois_result)
[{'domain_name': 'NVIDIA.COM', 'registrar': 'Safenames Ltd', 'whois_server': 'whois.safenames.net', 'referral_url': None, 'updated_date': '04-23-2019 17:17:03,10-04-2013 20:01:01', 'creation_date': '04-20-1993 04:00:00', 'expiration_date': '04-21-2020 04:00:00', 'name_servers': 'DNS1.P09.NSONE.NET,DNS2.P09.NSONE.NET,NS5.DNSMADEEASY.COM,NS6.DNSMADEEASY.COM,NS7.DNSMADEEASY.COM', 'status': 'clientDeleteProhibited https://icann.org/epp#clientDeleteProhibited,clientTransferProhibited https://icann.org/epp#clientTransferProhibited,serverDeleteProhibited https://icann.org/epp#serverDeleteProhibited,serverTransferProhibited https://icann.org/epp#serverTransferProhibited,serverUpdateProhibited https://icann.org/epp#serverUpdateProhibited', 'emails': 'abuse@safenames.net,wadmpfvzi5ei@idp.email,hostmaster@safenames.net', 'dnssec': 'unsigned', 'name': 'Data protected, not disclosed', 'org': None, 'address': '2701 San Tomas Expressway', 'city': 'Santa Clara', 'state': 'CA', 'zipcode': '95050', 'country': 'US'}]
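Several fields in the Whois result pack multiple values into one comma-separated string. A minimal post-processing sketch in plain Python is shown below. The record is an excerpt copied from the printed result above; the helper names are hypothetical, and the month-day-year date format is an assumption inferred from values such as '04-23-2019'.

```python
from datetime import datetime

# Excerpt of the record printed above.
record = {
    "domain_name": "NVIDIA.COM",
    "updated_date": "04-23-2019 17:17:03,10-04-2013 20:01:01",
    "creation_date": "04-20-1993 04:00:00",
    "name_servers": "DNS1.P09.NSONE.NET,DNS2.P09.NSONE.NET,NS5.DNSMADEEASY.COM,"
                    "NS6.DNSMADEEASY.COM,NS7.DNSMADEEASY.COM",
}

# Assumed layout: month-day-year followed by a 24-hour time.
DATE_FORMAT = "%m-%d-%Y %H:%M:%S"


def split_values(value):
    """Split a comma-separated field into a list of strings."""
    return value.split(",")


def split_dates(value):
    """Split a comma-separated date field into datetime objects."""
    return [datetime.strptime(item, DATE_FORMAT) for item in split_values(value)]


name_servers = split_values(record["name_servers"])
created = split_dates(record["creation_date"])
updated = split_dates(record["updated_date"])
print(len(name_servers), created[0].year, max(updated).date())
```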