CLX DGA Detection

This is an introduction to CLX DGA Detection.

What is DGA Detection?

Domain generation algorithms (DGAs) generate domain names that malware can use to communicate with its command-and-control servers. IP addresses and static domain names are easy to block, so a DGA gives attackers a simple way to generate a large number of domain names and rotate through them to circumvent traditional block lists.
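To make the idea concrete, here is a minimal sketch of a toy DGA. It is illustrative only and does not reproduce any real malware family: the function name `toy_dga`, the seed string, and the hashing scheme are all assumptions made for this example. The point it shows is that anyone who knows the seed and the date can recompute the same list of domains.

```python
import hashlib
from datetime import date


def toy_dga(seed: str, day: date, count: int = 5, length: int = 13, tld: str = ".com"):
    """Generate `count` pseudo-random domains from a shared seed and a date (hypothetical scheme)."""
    domains = []
    for i in range(count):
        # Anyone holding the seed can recompute this digest for a given day.
        digest = hashlib.sha256(f"{seed}-{day.isoformat()}-{i}".encode()).digest()
        # Map each of the first `length` bytes to a lowercase letter.
        label = "".join(chr(ord("a") + b % 26) for b in digest[:length])
        domains.append(label + tld)
    return domains


print(toy_dga("example-seed", date(2020, 1, 1)))
```

Changing the date or the seed yields a different list, which is why a static block list struggles to keep up.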

When to use CLX DGA Detection?

Use CLX DGA Detection to build your own DGA detection model, which can then predict whether a given domain is malicious. This example uses a type of recurrent neural network called the Gated Recurrent Unit (GRU). The CLX and RAPIDS libraries enable users to train their models with up-to-date domain names representative of both benign and DGA-generated strings. Using a CLX workflow, this capability could also be used in production environments.

For a more advanced, in-depth example of CLX DGA Detection, see this Jupyter notebook.

How to train a CLX DGA Detection model

To train a CLX DGA Detection model, you need a training data set that contains a column of domains and a column with each domain's type, which is either 1 (benign) or 0 (malicious).

First, initialize your new model.

[1]:
LR = 0.001
N_LAYERS = 3
CHAR_VOCAB = 128
HIDDEN_SIZE = 100
N_DOMAIN_TYPE = 2  # Two domain types: benign and DGA-generated

from clx.analytics.dga_detector import DGADetector
from clx.analytics.detector_dataset import DetectorDataset

dd = DGADetector(lr=LR)
dd.init_model(
    n_layers=N_LAYERS,
    char_vocab=CHAR_VOCAB,
    hidden_size=HIDDEN_SIZE,
    n_domain_type=N_DOMAIN_TYPE,
)

Next, train your DGA detector. The example below uses a small dataset for demonstration only; in practice you will want a larger training set.

To develop a more expansive training set, these resources are available:

[2]:
import cudf

train_df = cudf.DataFrame()
train_df["domain"] = [
    "google.com",
    "youtube.com",
    "tmall.com",
    "duiwlqeejymdb.com",
    "kofsmyaiufarb.net",
    "xskphhmrlcihr.biz",
    "yahoo.com",
    "linkedin.com",
    "twitter.com",
    "wejaecjhycwss.co.uk",
    "xtorhktvpblmr.info",
    "xvljisbfalkts.com",
]
train_df["type"] = [1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0]

# DetectorDataset converts domain strings to ASCII codes and partitions the dataframe into batches of the given size
train_df = DetectorDataset(train_df, 6)
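The comment in the cell above describes two steps: converting each domain to ASCII codes and partitioning the rows by batch size. The following plain-Python sketch illustrates that idea only; it is not the CLX implementation. The helper names `encode_domain` and `make_batches` and the zero-padding are assumptions made for this example.

```python
def encode_domain(domain: str, max_len: int) -> list:
    """Convert a domain to ASCII codes, right-padded with zeros to max_len (padding is an assumption)."""
    codes = [ord(ch) for ch in domain[:max_len]]
    return codes + [0] * (max_len - len(codes))


def make_batches(domains, labels, batch_size):
    """Yield (encoded_domains, labels) pairs of at most batch_size rows each."""
    max_len = max(len(d) for d in domains)
    for start in range(0, len(domains), batch_size):
        chunk = domains[start:start + batch_size]
        yield [encode_domain(d, max_len) for d in chunk], labels[start:start + batch_size]


domains = ["google.com", "duiwlqeejymdb.com", "yahoo.com", "xtorhktvpblmr.info"]
labels = [1, 0, 1, 0]
batches = list(make_batches(domains, labels, batch_size=2))
print(len(batches))          # 2
print(batches[0][0][0][:6])  # [103, 111, 111, 103, 108, 101], the codes for "google"
```

With the 12-row training set and a batch size of 6, the same partitioning would give two batches.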

Training a model returns the total loss.

[3]:
dd.train_model(train_df)
/opt/conda/envs/rapids/lib/python3.7/site-packages/cudf/io/dlpack.py:82: UserWarning: WARNING: cuDF to_dlpack() produces column-major (Fortran order) output. If the output tensor needs to be row major, transpose the output of this function.
  return cpp_dlpack.to_dlpack(gdf_cols)
[3]:
2.973564386367798

Ideally, you will want to train your model over a number of epochs as detailed in our example DGA Detection notebook.
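As a rough sketch of what multi-epoch training could look like, the loop below calls `train_model` once per epoch and records the returned loss. The helper `train_for_epochs` and the `_StubDetector` class are hypothetical and exist only so the example runs without a GPU; the only behavior taken from this tutorial is that `train_model(dataset)` returns the total loss.

```python
def train_for_epochs(detector, dataset, epochs=5):
    """Call detector.train_model(dataset) once per epoch and collect the losses."""
    losses = []
    for epoch in range(1, epochs + 1):
        total_loss = detector.train_model(dataset)  # returns the total loss, as shown above
        losses.append(total_loss)
        print(f"epoch {epoch}: total loss = {total_loss:.4f}")
    return losses


class _StubDetector:
    """Hypothetical stand-in for a real detector; its loss simply shrinks by 20% per call."""

    def __init__(self):
        self.loss = 3.0

    def train_model(self, dataset):
        self.loss *= 0.8
        return self.loss


losses = train_for_epochs(_StubDetector(), dataset=None, epochs=3)
```

With the objects from this tutorial, the call would presumably be `train_for_epochs(dd, train_df, epochs=...)`; watching whether the loss keeps falling is one simple way to decide how many epochs to run.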

Save a trained model

[4]:
dd.save_model("clx_dga_classifier.pth")

Load a model

Let’s create a new DGA detector and load the saved model from above.

[5]:
dga_detector = DGADetector(lr=0.001)
dga_detector.load_model("clx_dga_classifier.pth")

DGA Inferencing

Use your new model to predict malicious domains.

[6]:
test_df = cudf.DataFrame()
test_df['domain'] = ['facebook.com','ylqblbltqkynb.net']

dga_detector.predict(test_df['domain'])
[6]:
0    1
1    1
Name: is_dga, dtype: int64

Conclusion

The DGA detector in CLX enables users to train their own detection models and to use existing ones. This capability could also be used in conjunction with log parsing efforts if the logs contain domain names. DGA detection done with CLX and RAPIDS keeps data in GPU memory, removing unnecessary copies and conversions and providing a 4x speed advantage over CPU-only implementations, especially with large batch sizes.
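As one illustration of the log parsing use case, the sketch below pulls candidate domain names out of raw log lines with a regular expression so they could be passed to a detector. The log format, the simplified pattern, and the helper name `extract_domains` are assumptions made for this example; this is not a CLX API.

```python
import re

# Simplified pattern: dot-separated labels ending in an alphabetic TLD.
DOMAIN_RE = re.compile(
    r"\b(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,}\b", re.IGNORECASE
)


def extract_domains(log_lines):
    """Return the unique domains found in log_lines, lowercased, in first-seen order."""
    seen = {}
    for line in log_lines:
        for match in DOMAIN_RE.findall(line):
            seen.setdefault(match.lower(), None)
    return list(seen)


# Hypothetical DNS query log lines.
logs = [
    "2020-01-01T00:00:01 query A facebook.com from 10.0.0.5",
    "2020-01-01T00:00:02 query A ylqblbltqkynb.net from 10.0.0.7",
    "2020-01-01T00:00:03 query A facebook.com from 10.0.0.9",
]
print(extract_domains(logs))  # ['facebook.com', 'ylqblbltqkynb.net']
```

The resulting list could be loaded into a `domain` column of a dataframe and passed to `predict` as in the inferencing step.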
