10 minutes to CLX¶
This is a short introduction to CLX geared mainly towards new users of the code.
What are these libraries?¶
CLX (Cyber Log Accelerators) provides a simple API for security analysts, data scientists, and engineers to quickly get started applying RAPIDS to real-world cyber use cases. CLX uses the GPU dataframe (cuDF) and other RAPIDS packages to execute cybersecurity and information security workflows. The following packages are available:
analytics - Machine learning and statistics functionality
ip - IPv4 data translation and parsing
parsers - Cyber log event parsing
io - Input and output features for a workflow
workflow - Workflow which receives input data and produces analytical output data
osi - Open source integration (VirusTotal, FarsightDB and Whois)
dns - TLD extraction
When to use CLX¶
Use CLX to build your cyber data analytics workflows for a GPU-accelerated environment using RAPIDS. CLX contains common cyber and cyber ML functionality, such as log parsing for specific data sources, cyber data type parsing (e.g., IPv4), and DGA detection. CLX also provides the ability to integrate this functionality into a CLX workflow, which simplifies execution of the series of parsing and ML functions needed for end-to-end use cases.
Log Parsing¶
CLX provides traditional parsers for some common log types. Here’s an example that parses a Windows Event Log record with event code 5156.
[1]:
import cudf
from clx.parsers.windows_event_parser import WindowsEventParser
event = "04/03/2019 11:58:59 AM\\nLogName=Security\\nSourceName=Microsoft Windows security auditing.\\nEventCode=5156\\nEventType=0\\nType=Information\\nComputerName=user234.test.com\\nTaskCategory=Filtering Platform Connection\\nOpCode=Info\\nRecordNumber=241754521\\nKeywords=Audit Success\\nMessage=The Windows Filtering Platform has permitted a connection.\\r\\n\\r\\nApplication Information:\\r\\n\\tProcess ID:\\t\\t4\\r\\n\\tApplication Name:\\tSystem\\r\\n\\r\\nNetwork Information:\\r\\n\\tDirection:\\t\\tInbound\\r\\n\\tSource Address:\\t\\t100.20.100.20\\r\\n\\tSource Port:\\t\\t138\\r\\n\\tDestination Address:\\t100.20.100.30\\r\\n\\tDestination Port:\\t\\t138\\r\\n\\tProtocol:\\t\\t17\\r\\n\\r\\nFilter Information:\\r\\n\\tFilter Run-Time ID:\\t0\\r\\n\\tLayer Name:\\t\\tReceive/Accept\\r\\n\\tLayer Run-Time ID:\\t44"
wep = WindowsEventParser()
df = cudf.DataFrame()
df['raw'] = [event]
result_df = wep.parse(df, 'raw')
result_df.head()
[1]:
service_information_service_id | target_account_old_account_name | service_service_name | group_group_name | changed_attributes_account_expires | detailed_authentication_information_key_length | additional_information_result_code | account_information_security_id | changed_attributes_user_account_control | process_information_caller_process_id | ... | changed_attributes_old_uac_value | attributes_profile_path | attributes_user_account_control | account_for_which_logon_failed_account_domain | account_whose_credentials_were_used_account_domain | new_logon_logon_guid | service_server | attributes_home_directory | failure_information_status | failure_information_sub_status | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ... |
1 rows × 131 columns
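To make the idea of field extraction concrete, here is a minimal plain-Python sketch that pulls a few fields out of a shortened copy of the sample event with regular expressions. It is an illustration only, not the WindowsEventParser implementation; the `PATTERNS` table and the `extract` helper are hypothetical, and the lowercasing simply mirrors the values shown in the workflow output later on this page.

```python
import re

# Shortened copy of the sample event. "\\n" and "\\t" are literal
# backslash sequences in the text, exactly as in the original string.
event = (
    "04/03/2019 11:58:59 AM\\nLogName=Security\\nEventCode=5156\\n"
    "ComputerName=user234.test.com\\n"
    "Message=The Windows Filtering Platform has permitted a connection.\\r\\n"
    "\\tDirection:\\t\\tInbound\\r\\n"
    "\\tSource Address:\\t\\t100.20.100.20\\r\\n"
    "\\tSource Port:\\t\\t138"
)

# Hypothetical column names and patterns chosen for this illustration.
# In these regexes, \\t matches a literal backslash followed by "t".
PATTERNS = {
    "eventcode": r"EventCode=(\d+)",
    "computername": r"ComputerName=([\w.-]+)",
    "network_information_direction": r"Direction:(?:\\t)+(\w+)",
    "network_information_source_address": r"Source Address:(?:\\t)+([\d.]+)",
    "network_information_source_port": r"Source Port:(?:\\t)+(\d+)",
}


def extract(pattern, text):
    """Return the first captured group in lowercase, or None if absent."""
    match = re.search(pattern, text)
    return match.group(1).lower() if match else None


parsed = {name: extract(pattern, event) for name, pattern in PATTERNS.items()}
for name, value in parsed.items():
    print(f"{name}: {value}")
```

A regex-per-column table like this keeps each field independent, so a log line that lacks a field simply yields an empty value for that column.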
Cyber Data Types¶
CLX provides the ability to work with different data types that are specific to cybersecurity, such as IPv4 and DNS. Here’s an example of how to get started.
IPv4¶
The IPv4 data type is still commonly used and present in log files. Below we demonstrate two of its operations. Additional operations are available in the clx.ip module.
Convert IPv4 values to integers¶
[50]:
import clx.ip
import cudf
df = cudf.Series(["5.79.97.178", "94.130.74.45"])
result_df = clx.ip.ip_to_int(df)
print(result_df)
0 89088434
1 1585596973
dtype: int64
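For readers who want to see what the conversion computes, the following plain-Python sketch packs the four octets into a 32-bit integer and cross-checks the result against the standard library's `ipaddress` module. It runs on the CPU and is not how clx.ip is implemented; the helper name `ipv4_to_int` is made up for this example.

```python
import ipaddress


def ipv4_to_int(address):
    """Pack a dotted-quad IPv4 string into its 32-bit integer value."""
    value = 0
    for part in address.split("."):
        # Shift the accumulated value one octet left and append the next octet.
        value = (value << 8) | int(part)
    return value


ints = {ip: ipv4_to_int(ip) for ip in ["5.79.97.178", "94.130.74.45"]}
for ip, number in ints.items():
    # The standard library agrees with the manual computation.
    assert number == int(ipaddress.IPv4Address(ip))
    print(ip, number)
```

The printed integers match the values in the cuDF output above.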
Check if IPv4 values are multicast¶
[51]:
import clx.ip
import cudf
df = cudf.Series(["224.0.0.0", "239.255.255.255", "5.79.97.178"])
result_df = clx.ip.is_multicast(df)
print(result_df)
0 True
1 True
2 False
dtype: bool
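The multicast check corresponds to membership in the 224.0.0.0/4 block. As a CPU-side sketch using only the standard library (not the clx.ip implementation; `is_multicast` here is a local helper defined for this example), the same three addresses can be tested like this:

```python
import ipaddress

# IPv4 multicast addresses occupy 224.0.0.0 through 239.255.255.255.
MULTICAST_NET = ipaddress.ip_network("224.0.0.0/4")


def is_multicast(address):
    """Return True when the IPv4 address lies in the multicast block."""
    return ipaddress.ip_address(address) in MULTICAST_NET


results = [is_multicast(ip) for ip in ["224.0.0.0", "239.255.255.255", "5.79.97.178"]]
print(results)
```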
TLD Extraction¶
CLX can split the hostname of a URL into its subdomain, registered domain, and TLD (suffix), using the public suffix list.
[1]:
import cudf
from clx.dns import dns_extractor as dns
input_df = cudf.DataFrame(
    {
        "url": [
            "http://www.google.com",
            "gmail.com",
            "github.com",
            "https://pandas.pydata.org",
            "http://www.worldbank.org.kg/",
            "waiterrant.blogspot.com",
            "http://forums.news.cnn.com.ac/",
            "http://forums.news.cnn.ac/",
            "ftp://b.cnn.com/",
            "a.news.uk",
            "a.news.co.uk",
            "https://a.news.co.uk",
            "107-193-100-2.lightspeed.cicril.sbcglobal.net",
            "a23-44-13-2.deploy.static.akamaitechnologies.com",
        ]
    }
)
output_df = dns.parse_url(input_df["url"])
output_df.head(14)
[1]:
hostname | domain | suffix | subdomain | |
---|---|---|---|---|
0 | www.google.com | google | com | www |
1 | gmail.com | gmail | com | |
2 | github.com | github | com | |
3 | pandas.pydata.org | pydata | org | pandas |
4 | www.worldbank.org.kg | worldbank | org.kg | www |
5 | waiterrant.blogspot.com | waiterrant | blogspot.com | |
6 | forums.news.cnn.com.ac | cnn | com.ac | forums.news |
7 | forums.news.cnn.ac | cnn | ac | forums.news |
8 | b.cnn.com | cnn | com | b |
9 | a.news.uk | news | uk | a |
10 | a.news.co.uk | news | co.uk | a |
11 | a.news.co.uk | news | co.uk | a |
12 | 107-193-100-2.lightspeed.cicril.sbcglobal.net | sbcglobal | net | 107-193-100-2.lightspeed.cicril |
13 | a23-44-13-2.deploy.static.akamaitechnologies.com | akamaitechnologies | com | a23-44-13-2.deploy.static |
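The splitting rule behind this table can be sketched in plain Python: find the longest known suffix at the end of the hostname, take the label just before it as the domain, and treat everything further left as the subdomain. The sketch below assumes a tiny hard-coded `SUFFIXES` set standing in for the real public suffix list, and `split_hostname` is a hypothetical helper, not the clx.dns API.

```python
from urllib.parse import urlsplit

# Hypothetical, tiny stand-in for the public suffix list (illustration only).
SUFFIXES = {
    "com", "org", "net", "uk", "co.uk", "ac", "com.ac",
    "kg", "org.kg", "blogspot.com",
}


def split_hostname(url):
    """Split a URL's hostname into domain, suffix, and subdomain."""
    if "://" not in url:
        url = "//" + url  # let urlsplit treat a bare hostname as a netloc
    hostname = urlsplit(url).hostname or ""
    labels = hostname.split(".")
    # Try the longest candidate suffix first, always leaving one label
    # available for the domain.
    for i in range(1, len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in SUFFIXES:
            return {
                "hostname": hostname,
                "domain": labels[i - 1],
                "suffix": candidate,
                "subdomain": ".".join(labels[: i - 1]),
            }
    return {"hostname": hostname, "domain": "", "suffix": "", "subdomain": ""}


for url in ["http://www.worldbank.org.kg/", "waiterrant.blogspot.com", "a.news.co.uk"]:
    print(split_hostname(url))
```

Checking longer suffixes before shorter ones is what makes `org.kg` and `co.uk` win over `kg` and `uk`.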
Machine Learning¶
CLX offers machine learning and statistics functions that are ready to integrate into your CLX workflow.
Calculate a rolling z-score on a given cuDF series.
[2]:
import clx.analytics.stats
import cudf
sequence = [3,4,5,6,1,10,34,2,1,11,45,34,2,9,19,43,24,13,23,10,98,84,10]
series = cudf.Series(sequence)
zscores_df = cudf.DataFrame()
zscores_df['zscore'] = clx.analytics.stats.rzscore(series, 7)
print(zscores_df)
zscore
0 null
1 null
2 null
3 null
4 null
5 null
6 2.374423424
7 -0.645941275
8 -0.683973734
9 0.158832461
10 1.847751909
11 0.880026019
12 -0.950835449
13 -0.360593742
14 0.111407599
15 1.228914145
16 -0.074966331
17 -0.570321249
18 0.327849973
19 -0.934372308
20 2.296828498
21 1.282966989
22 -0.795223674
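As a reference for what these numbers mean, the sketch below computes a rolling z-score in plain Python: for each position, the value minus the mean of the trailing window, divided by that window's standard deviation. It assumes the population standard deviation, which is consistent with the first non-null values printed above; it is a CPU illustration, not the clx.analytics.stats implementation, and `rolling_zscore` is a name chosen for this example.

```python
from statistics import fmean, pstdev


def rolling_zscore(values, window):
    """Z-score of each value relative to its trailing window (None until full)."""
    scores = []
    for i, value in enumerate(values):
        if i + 1 < window:
            scores.append(None)  # not enough history yet
            continue
        chunk = values[i + 1 - window : i + 1]
        std = pstdev(chunk)  # population standard deviation (assumption)
        scores.append((value - fmean(chunk)) / std if std else None)
    return scores


sequence = [3, 4, 5, 6, 1, 10, 34, 2, 1, 11, 45, 34, 2, 9, 19, 43, 24, 13, 23, 10, 98, 84, 10]
scores = rolling_zscore(sequence, 7)
print(scores[:6])
print([round(score, 6) for score in scores[6:9]])
```

The first six entries are empty because a window of 7 needs seven observations before it can produce a score.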
Workflows¶
Now that we’ve demonstrated the basics of CLX, let’s tie some of this functionality into a CLX workflow. A workflow is defined as a function that receives a cuDF dataframe, performs some operations on it, and then returns an output cuDF dataframe. In this example, we show how to parse raw WinEVT data within a workflow.
[61]:
import cudf
from clx.workflow.workflow import Workflow
from clx.parsers.windows_event_parser import WindowsEventParser
wep = WindowsEventParser()
class LogParseWorkflow(Workflow):
    def workflow(self, dataframe):
        output = wep.parse(dataframe, "raw")
        return output
input_df = cudf.DataFrame()
input_df["raw"] = ["04/03/2019 11:58:59 AM\\nLogName=Security\\nSourceName=Microsoft Windows security auditing.\\nEventCode=5156\\nEventType=0\\nType=Information\\nComputerName=user234.test.com\\nTaskCategory=Filtering Platform Connection\\nOpCode=Info\\nRecordNumber=241754521\\nKeywords=Audit Success\\nMessage=The Windows Filtering Platform has permitted a connection.\\r\\n\\r\\nApplication Information:\\r\\n\\tProcess ID:\\t\\t4\\r\\n\\tApplication Name:\\tSystem\\r\\n\\r\\nNetwork Information:\\r\\n\\tDirection:\\t\\tInbound\\r\\n\\tSource Address:\\t\\t100.20.100.20\\r\\n\\tSource Port:\\t\\t138\\r\\n\\tDestination Address:\\t100.20.100.30\\r\\n\\tDestination Port:\\t\\t138\\r\\n\\tProtocol:\\t\\t17\\r\\n\\r\\nFilter Information:\\r\\n\\tFilter Run-Time ID:\\t0\\r\\n\\tLayer Name:\\t\\tReceive/Accept\\r\\n\\tLayer Run-Time ID:\\t44"]
lpw = LogParseWorkflow(name="my-log-parsing-workflow")
lpw.workflow(input_df)
[61]:
member_account_name | attributes_password_last_set | service_service_name | attributes_profile_path | account_information_security_id | additional_information_transited_services | additional_information_caller_computer_name | network_information_direction | new_logon_account_name | changed_attributes_home_drive | ... | certificate_information_certificate_issuer_name | network_information_source_network_address | service_information_service_name | privileges | account_for_which_logon_failed_account_domain | network_information_network_address | service_server | new_account_account_name | user_account_name | attributes_user_account_control | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | inbound | ... |
1 rows × 131 columns
Workflow input and output can be configured in a workflow.yaml file, or you can define your configurations at instantiation within a Python dictionary. Configuration files are looked for in /etc/clx/[workflow-name]/workflow.yaml, then ~/.config/clx/[workflow-name]/workflow.yaml.
To learn more about workflow configurations, visit the CLX Workflow page.
To demonstrate the input functionality, we’ll create a small CSV input file.
[62]:
import cudf
input_df = cudf.DataFrame()
input_df["raw"] = ["04/03/2019 11:58:59 AM\\nLogName=Security\\nSourceName=Microsoft Windows security auditing.\\nEventCode=5156\\nEventType=0\\nType=Information\\nComputerName=user234.test.com\\nTaskCategory=Filtering Platform Connection\\nOpCode=Info\\nRecordNumber=241754521\\nKeywords=Audit Success\\nMessage=The Windows Filtering Platform has permitted a connection.\\r\\n\\r\\nApplication Information:\\r\\n\\tProcess ID:\\t\\t4\\r\\n\\tApplication Name:\\tSystem\\r\\n\\r\\nNetwork Information:\\r\\n\\tDirection:\\t\\tInbound\\r\\n\\tSource Address:\\t\\t100.20.100.20\\r\\n\\tSource Port:\\t\\t138\\r\\n\\tDestination Address:\\t100.20.100.30\\r\\n\\tDestination Port:\\t\\t138\\r\\n\\tProtocol:\\t\\t17\\r\\n\\r\\nFilter Information:\\r\\n\\tFilter Run-Time ID:\\t0\\r\\n\\tLayer Name:\\t\\tReceive/Accept\\r\\n\\tLayer Run-Time ID:\\t44"]
input_df.to_csv("alert_data.csv")
Next, create and run the workflow.
[60]:
import os
from clx.workflow.workflow import Workflow
from clx.parsers.windows_event_parser import WindowsEventParser
dirpath = os.getcwd()
# os.path.join adds the path separator that plain string concatenation would omit
source = {
    "type": "fs",
    "input_format": "csv",
    "input_path": os.path.join(dirpath, "alert_data.csv"),
    "schema": ["raw"],
    "delimiter": ",",
    "required_cols": ["raw"],
    "dtype": ["str"],
    "header": 0
}
destination = {
    "type": "fs",
    "output_format": "csv",
    "output_path": os.path.join(dirpath, "alert_data_output.csv")
}
wep = WindowsEventParser()
class LogParseWorkflow(Workflow):
    def workflow(self, dataframe):
        output = wep.parse(dataframe, "raw")
        return output
lpw = LogParseWorkflow(source=source, destination=destination, name="my-log-parsing-workflow")
lpw.run_workflow()
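The read, transform, write pattern used above can be sketched without a GPU. The following plain-Python example uses a hypothetical `MiniWorkflow` base class (not the CLX Workflow class) whose `run_workflow` method reads a CSV file, passes the rows to a user-defined `workflow` method, and writes the result. Only the `input_path` and `output_path` keys are borrowed from the dictionaries above; everything else is an assumption made for illustration.

```python
import csv
import os
import tempfile


class MiniWorkflow:
    """Hypothetical stand-in for a workflow base class (not the CLX one)."""

    def __init__(self, name, source, destination):
        self.name = name
        self.source = source
        self.destination = destination

    def workflow(self, rows):
        raise NotImplementedError("subclasses define the transformation")

    def run_workflow(self):
        # Read the input, apply the user-defined step, then write the result.
        with open(self.source["input_path"], newline="") as f:
            rows = list(csv.DictReader(f))
        output = self.workflow(rows)
        with open(self.destination["output_path"], "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(output[0].keys()))
            writer.writeheader()
            writer.writerows(output)


class UppercaseWorkflow(MiniWorkflow):
    def workflow(self, rows):
        # Toy transformation standing in for a real parsing step.
        return [{"raw": row["raw"].upper()} for row in rows]


workdir = tempfile.mkdtemp()
source = {"input_path": os.path.join(workdir, "in.csv")}
destination = {"output_path": os.path.join(workdir, "out.csv")}
with open(source["input_path"], "w", newline="") as f:
    f.write("raw\nlogname=security\n")

UppercaseWorkflow("demo", source, destination).run_workflow()
with open(destination["output_path"], newline="") as f:
    result = list(csv.DictReader(f))
print(result)
```

Keeping I/O in the base class means a subclass only has to describe the transformation, which is the same division of labor the LogParseWorkflow example relies on.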
Output data can be read directly from the resulting CSV file.
[66]:
with open('alert_data_output.csv', "r") as f:
    lines = f.readlines()
lines
[66]:
['member_account_name,attributes_password_last_set,service_service_name,attributes_profile_path,account_information_security_id,additional_information_transited_services,additional_information_caller_computer_name,network_information_direction,new_logon_account_name,changed_attributes_home_drive,filter_information_layer_run_time_id,new_logon_security_id,additional_information_result_code,eventcode,changed_attributes_logon_hours,account_information_supplied_realm_name,additional_information_ticket_options,subject_security_id,detailed_authentication_information_key_length,changed_attributes_script_path,changed_attributes_display_name,detailed_authentication_information_transited_services,subject_logon_id,changed_attributes_sam_account_name,network_information_workstation_name,service_information_service_id,subject_account_name,account_information_user_id,new_logon_account_domain,attributes_user_workstations,account_locked_out_account_name,target_account_old_account_name,network_information_protocol,attributes_home_directory,attributes_logon_hours,group_group_domain,changed_attributes_allowedtodelegateto,changed_attributes_user_account_control,network_information_source_port,attributes_user_parameters,network_information_port,application_information_process_id,attributes_sid_history,attributes_new_uac_value,process_process_name,network_information_destination_port,changed_attributes_home_directory,group_security_id,member_security_id,user_account_domain,certificate_information_certificate_serial_number,account_whose_credentials_were_used_account_domain,attributes_account_expires,subject_account_domain,process_information_caller_process_id,process_process_id,target_server_additional_information,process_information_caller_process_name,logon_type,network_information_destination_address,account_whose_credentials_were_used_logon_guid,filter_information_layer_name,additional_information_ticket_encryption_type,network_information_source_address,target_account_account_domain,f
ailure_information_status,failure_information_failure_reason,process_information_process_name,target_account_security_id,filter_information_filter_run_time_id,attributes_allowed_to_delegate_to,changed_attributes_sid_history,account_for_which_logon_failed_security_id,new_account_domain_name,detailed_authentication_information_logon_process,additional_information_privileges,account_information_account_name,user_security_id,process_information_process_id,network_information_client_port,certificate_information_certificate_thumbprint,target_server_target_server_name,attributes_primary_group_id,additional_information_pre_authentication_type,changed_attributes_old_uac_value,account_information_account_domain,account_whose_credentials_were_used_account_name,id,subject_logon_guid,attributes_sam_account_name,detailed_authentication_information_authentication_package,attributes_user_principal_name,target_account_new_account_name,computername,attributes_home_drive,changed_attributes_account_expires,target_account_account_name,application_information_application_name,changed_attributes_primary_group_id,additional_information_failure_code,time,failure_information_sub_status,attributes_display_name,new_account_security_id,changed_attributes_user_principal_name,new_logon_logon_guid,changed_attributes_user_workstations,account_information_logon_guid,new_logon_logon_id,attributes_old_uac_value,changed_attributes_new_uac_value,additional_information_expiration_time,changed_attributes_password_last_set,network_information_client_address,account_for_which_logon_failed_account_name,changed_attributes_profile_path,attributes_script_path,detailed_authentication_information_package_name_ntlm_only,group_group_name,changed_attributes_user_parameters,account_locked_out_security_id,certificate_information_certificate_issuer_name,network_information_source_network_address,service_information_service_name,privileges,account_for_which_logon_failed_account_domain,network_information_network_address
,service_server,new_account_account_name,user_account_name,attributes_user_account_control\n',
',,,,,,,inbound,,,44,,,5156,,,,,,,,,,,,,,,,,,,17,,,,,,138,,,4,,,,138,,,,,,,,,,,,,,100.20.100.30,,receive/accept,,100.20.100.20,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,system,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,\n']
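Because most of the 131 columns are empty for any single event, it can help to keep only the populated ones when inspecting a row. Here is a small sketch using the standard library's `csv` module on a shortened, illustrative copy of the file (a handful of the real column names plus one row), not on the full output:

```python
import csv
import io

# Shortened, illustrative copy of the output file: a header plus one row.
sample = (
    "member_account_name,network_information_direction,eventcode,"
    "network_information_source_address,service_server\n"
    ",inbound,5156,100.20.100.20,\n"
)

row = next(csv.DictReader(io.StringIO(sample)))
# Keep only the columns that actually carry a value for this event.
populated = {column: value for column, value in row.items() if value}
print(populated)
```

To run this against the real file, replace the `io.StringIO(sample)` object with the opened `alert_data_output.csv` file handle.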
Open Source Threat Intelligence Integration¶
Often it’s beneficial to integrate open source threat intelligence with collected data. CLX includes the ability to query VirusTotal, FarsightDB, and Whois directly. An API key is necessary for the VirusTotal and FarsightDB integrations.
Create an account with https://www.virustotal.com
Create an account with https://www.farsightsecurity.com
[ ]:
from clx.osi.virus_total import VirusTotalClient
vt_api_key='<virus total apikey goes here>'
vt_client = VirusTotalClient(api_key=vt_api_key)
result = vt_client.url_scan(["virustotal.com"])
[ ]:
from clx.osi.farsight import FarsightLookupClient
server='https://api.dnsdb.info'
fs_api_key='<farsight apikey goes here>'
fs_client = FarsightLookupClient(server, fs_api_key, limit=1)
result = fs_client.query_rrset("www.dnsdb.info")
[6]:
from clx.osi.whois import WhoIsLookupClient
whois_client = WhoIsLookupClient()
whois_result = whois_client.whois(["nvidia.com"])
print(whois_result)
[{'domain_name': 'NVIDIA.COM', 'registrar': 'Safenames Ltd', 'whois_server': 'whois.safenames.net', 'referral_url': None, 'updated_date': '04-23-2019 17:17:03,10-04-2013 20:01:01', 'creation_date': '04-20-1993 04:00:00', 'expiration_date': '04-21-2020 04:00:00', 'name_servers': 'DNS1.P09.NSONE.NET,DNS2.P09.NSONE.NET,NS5.DNSMADEEASY.COM,NS6.DNSMADEEASY.COM,NS7.DNSMADEEASY.COM', 'status': 'clientDeleteProhibited https://icann.org/epp#clientDeleteProhibited,clientTransferProhibited https://icann.org/epp#clientTransferProhibited,serverDeleteProhibited https://icann.org/epp#serverDeleteProhibited,serverTransferProhibited https://icann.org/epp#serverTransferProhibited,serverUpdateProhibited https://icann.org/epp#serverUpdateProhibited', 'emails': 'abuse@safenames.net,wadmpfvzi5ei@idp.email,hostmaster@safenames.net', 'dnssec': 'unsigned', 'name': 'Data protected, not disclosed', 'org': None, 'address': '2701 San Tomas Expressway', 'city': 'Santa Clara', 'state': 'CA', 'zipcode': '95050', 'country': 'US'}]
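Several fields in the Whois result pack multiple values into one comma-separated string. The sketch below shows one way to post-process such a record in plain Python, using a shortened copy of the values printed above. The date format string is an assumption inferred from those sample values, and `parse_dates` is a helper defined only for this example.

```python
from datetime import datetime

# Shortened copy of the record printed above.
record = {
    "domain_name": "NVIDIA.COM",
    "creation_date": "04-20-1993 04:00:00",
    "updated_date": "04-23-2019 17:17:03,10-04-2013 20:01:01",
    "name_servers": "DNS1.P09.NSONE.NET,DNS2.P09.NSONE.NET,NS5.DNSMADEEASY.COM",
}

DATE_FORMAT = "%m-%d-%Y %H:%M:%S"  # assumed month-day-year, based on the sample


def parse_dates(value):
    """Split a comma-separated date string into datetime objects."""
    return [datetime.strptime(item, DATE_FORMAT) for item in value.split(",")]


name_servers = [server.lower() for server in record["name_servers"].split(",")]
created = parse_dates(record["creation_date"])[0]
last_updated = max(parse_dates(record["updated_date"]))
print(name_servers)
print(created.year, last_updated.date())
```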