Documenting Malware for Research

Posted on September 08, 2022 in malware • 3 min read

Introduction

When I first began my PhD, my advisor suggested I go through various pieces of malware source code (released openly on the Internet) and compile, execute, understand, and document them for future use.

This paid off: it made it much easier to create ground truth for my datasets and experiments.

Over the years, I've built up a repository of malware source code, along with compilation instructions and documentation on how to run and use the malware -- overall, how the malware works on the inside.

Various colleagues have asked me to open-source this repo to aid education and research in malware analysis and detection.

Why is this difficult?

There are many places to get malware source code from. At the time of this writing, these are two of the most popular sources:

What's great is that these repos contain malware sources as they originally existed when they were leaked or released.

What's not so great is that these malware source repositories are not clean. They contain lots of garbage files (e.g., temporary files, database files, incomplete source files, etc.) and the documentation on how to compile and use them is either incomplete or non-existent.
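As a rough illustration of the cleanup involved, here is a minimal sketch that lists likely clutter in a source tree. The suffix list and the function name find_clutter are my own assumptions for illustration, not something taken from the repos above; adjust the list for each sample.

```python
from pathlib import Path

# Assumed suffixes for temporary and database files; tune per repository.
CLUTTER_SUFFIXES = {".tmp", ".bak", ".swp", ".db"}


def find_clutter(root, suffixes=CLUTTER_SUFFIXES):
    """Return a sorted list of files under root whose suffix looks like clutter."""
    root = Path(root)
    return sorted(
        path
        for path in root.rglob("*")
        # Compare case-insensitively so BUILD.TMP is caught as well.
        if path.is_file() and path.suffix.lower() in suffixes
    )


if __name__ == "__main__":
    # Only report candidates; deleting anything should stay a manual decision.
    for path in find_clutter("."):
        print(path)
```

The sketch only reports candidates, since a file that looks like garbage may still be needed to build a sample.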

In addition, not all malware runs on the same operating system. There is malware for Windows, macOS, Linux, Android, and iOS. Some Windows malware assumes Windows XP, while other samples assume Windows 10. Some malware even assumes a specific version of a piece of software, in order to exploit that particular version. Thus any released code must compile and execute as the malware author intended -- which is not an easy feat. The documentation must include instructions on how to set up complete environments so that the malware will behave as the author expected it to.
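One way to make these environment requirements explicit is to record them per sample and compare them against the analysis machine. The sketch below is only an assumption about how that could look; SampleEnvironment, missing_requirements, and the placeholder software name are hypothetical and are not part of the repo.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SampleEnvironment:
    """Hypothetical record of the environment a sample expects."""
    os_name: str
    os_version: str
    # Maps a software name to the exact version the sample assumes.
    software: dict = field(default_factory=dict)


def missing_requirements(required, actual):
    """Return human-readable mismatches between two environments."""
    problems = []
    if (required.os_name, required.os_version) != (actual.os_name, actual.os_version):
        problems.append(
            f"OS: need {required.os_name} {required.os_version}, "
            f"have {actual.os_name} {actual.os_version}"
        )
    for name, version in required.software.items():
        have = actual.software.get(name)
        if have != version:
            problems.append(f"{name}: need {version}, have {have or 'not installed'}")
    return problems


if __name__ == "__main__":
    # "ExampleReader" is a placeholder, not a real requirement of any sample.
    needed = SampleEnvironment("Windows", "XP", {"ExampleReader": "1.0"})
    machine = SampleEnvironment("Windows", "10")
    for problem in missing_requirements(needed, machine):
        print(problem)
```

An empty result means the machine matches what the sample expects; anything else is a list of what to fix before running it.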

My goal is to create a git repo where I can share these same malware source files, but accompany them with good documentation on how to operate them. The sole purpose is to facilitate an easy way for a researcher to test their solutions (e.g., dynamic analysis, static analysis, malware detection, etc.) on real-world malware samples, straight from the source. This allows the researcher complete control over the sample, so they can experiment with it in a safe environment.

What's the plan?

This blog series will dive deep into the history of malware.

Over time I will release more samples as I clean and document them.

Some posts may be out of order chronologically, and some may be incomplete as I add more malware and content.

Consider this series as a constant work-in-progress.

Where can I follow this?

I will be releasing all source code and documentation here: https://github.com/evandowning/usable-malware

I will be blogging about each sample here on my website: https://www.evandowning.com/tag/usable-malware.html

Final thoughts

This is going to be a long process, and I will be working on it in my free time.

What will take time is documenting how a researcher without access to older versions of Visual Studio (which some of these malware samples require) can compile and use them. I will likely need to update the sample sources to be compatible with the newest versions of Visual Studio. I also plan to write CI/CD pipelines to ensure the malware doesn't break in the future.

If you find incomplete or incorrect documentation, please open an issue in the git repo above.

If you have any contributions, please create a pull request.

I hope this is useful for you.