Malicious JavaScript Detection

This research project focuses on detecting malicious JavaScript code through the combined analysis of obfuscation patterns and string reconstruction techniques. In this study, I developed a dual-model framework that identifies obfuscated scripts and classifies malicious behavior using machine learning, supported by a custom Python library called Atomic Search. This work was conducted as part of my undergraduate thesis and later published as a formal research paper.

Category

Machine Learning

Client

Thesis

Start Date

August 2024

End Date

December 2024

Description

The goal of this research was to address the challenge of detecting malicious JavaScript, especially when attackers deliberately conceal harmful behavior through obfuscation. I built two Random Forest–based machine learning models: the first detects whether a script is obfuscated using 20 engineered features, while the second classifies malicious code using 92 features, including outputs from Atomic Search. The reconstruction technique was crucial because many real-world attacks use concatenation and fragmented strings to conceal high-risk functions. By combining obfuscation analysis with string reconstruction, this project achieved high accuracy and produced a practical detection framework for cybersecurity applications.
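As an illustration of the kind of reconstruction involved, the sketch below folds chains of concatenated string literals so that hidden keywords such as eval become visible again. The regex and the function name are illustrative assumptions of mine, not the Atomic Search API.

import re

# Matches chains of quoted string literals joined by "+", e.g. "ev" + "al".
# Purely illustrative: the real Atomic Search library is more general than
# this regex-based sketch, and its API is not reproduced here.
CONCAT_CHAIN = re.compile(
    r"""(?:(["'])(?:(?!\1).)*\1\s*\+\s*)+(["'])(?:(?!\2).)*\2"""
)

def fold_concatenations(source: str) -> str:
    """Replace literal string concatenations with a single merged literal."""
    def merge(match: re.Match) -> str:
        # Pull out every quoted fragment in the chain and join their contents.
        fragments = re.findall(r"""(["'])((?:(?!\1).)*)\1""", match.group(0))
        return '"' + "".join(text for _, text in fragments) + '"'
    return CONCAT_CHAIN.sub(merge, source)

if __name__ == "__main__":
    obfuscated = 'window["ev" + "al"]("do" + "cum" + "ent.coo" + "kie");'
    print(fold_concatenations(obfuscated))
    # window["eval"]("document.cookie");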

THE STORY

This project began when I encountered the limitations of traditional detection systems, especially when dealing with heavily obfuscated JavaScript samples collected from public repositories. Scripts using techniques like concatenation, splitting, and randomization often bypass simple rule-based detection. This motivated me to design a more comprehensive solution. Throughout the research, I manually analyzed numerous malicious samples, explored entropy patterns, studied syntactic irregularities, and identified behaviors that differentiate benign and malicious code. Developing Atomic Search became a turning point—it allowed me to reconstruct scattered string fragments and extract meaningful features, which significantly boosted the performance of the detection model. The journey was deeply technical, involving dataset building, feature engineering, model tuning, and experimental validation, but it ultimately provided a reliable and reproducible approach to JavaScript threat detection.
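To make that boost concrete, the toy example below counts hits against a hypothetical list of high-risk JavaScript APIs before and after reconstruction; the raw fragments hide every keyword, while the reconstructed script exposes them. The keyword list and helper are placeholders of mine, not the feature set used in the thesis.

# Hypothetical list of high-risk JavaScript APIs; treat these names and the
# resulting counts as placeholders, not the thesis's engineered features.
RISKY_KEYWORDS = ("eval", "unescape", "document.cookie", "document.write",
                  "fromCharCode", "ActiveXObject")

def keyword_hits(script: str) -> dict[str, int]:
    """Count occurrences of each high-risk keyword in a script."""
    return {kw: script.count(kw) for kw in RISKY_KEYWORDS}

if __name__ == "__main__":
    raw = 'window["ev" + "al"]("do" + "cum" + "ent.coo" + "kie");'
    reconstructed = 'window["eval"]("document.cookie");'  # e.g. after folding concatenations
    print(keyword_hits(raw))            # every count is 0: the fragments hide the APIs
    print(keyword_hits(reconstructed))  # eval and document.cookie now register as hits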

MY APPROACH

My approach combined algorithmic design, statistical analysis, and engineering. I first built an obfuscation detection model using features such as entropy, syntactic ratios, and character distributions. I then created the Atomic Search library to reconstruct obfuscated strings and expose hidden malicious patterns. These reconstructed structures were integrated as additional features for the malicious classification model, giving a more complete representation of each script. Both models were trained using Random Forest, tuned with cross-validation, and evaluated on large datasets. The final framework reached 99.10% accuracy for obfuscation detection and 99.52% for malicious classification, showing that combining obfuscation analysis with string reconstruction is far more effective than previous single-stage methods. This research not only strengthened my technical expertise but also contributed a practical cybersecurity tool to the field.
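As a rough illustration of how such a stage could be wired together with scikit-learn, the sketch below computes a few entropy and character-distribution features and scores a Random Forest with 5-fold cross-validation. The feature list, hyperparameters, and helper names are simplifications of mine, not the 20- and 92-feature sets engineered in the thesis.

import math
from collections import Counter

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def shannon_entropy(text: str) -> float:
    """Character-level Shannon entropy; obfuscated scripts tend to score higher."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def extract_features(script: str) -> list[float]:
    """A handful of placeholder features standing in for the engineered sets."""
    length = max(len(script), 1)
    return [
        shannon_entropy(script),                                  # randomness of the source
        script.count("+") / length,                               # concatenation density
        sum(ch.isdigit() for ch in script) / length,              # digit ratio
        sum(not ch.isalnum() and not ch.isspace() for ch in script) / length,  # symbol ratio
        script.count("\\x") / length,                             # hex-escape density
    ]

def evaluate(scripts: list[str], labels: list[int]) -> float:
    """Mean 5-fold cross-validated accuracy of a Random Forest on these features."""
    X = np.array([extract_features(s) for s in scripts])
    y = np.array(labels)
    model = RandomForestClassifier(n_estimators=200, random_state=42)
    return cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()

In the full framework, the second-stage classifier would extend a feature vector like this with the outputs of Atomic Search string reconstruction, which is what lets it see high-risk calls that raw character statistics miss.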