Deobfuscating JavaScript Code Using Character-Based Tokenization

  • A.-G. Sîrbu, Department of Computer Science, Babeș-Bolyai University, 1, M. Kogalniceanu Street, 400084, Cluj-Napoca, Romania

Abstract

Deployed JavaScript code typically goes through minification, a process in which variables are renamed to single-character names and whitespace is removed so that files are smaller and load faster. As a result, the code becomes hard to read and difficult to analyze manually. Because JavaScript experts can still understand minified code, machine learning approaches to deobfuscating it are feasible. We propose a technique that finds a fitting name for each obfuscated variable, one that is intuitive and meaningful given how the variable is used. The technique is based on a Sequence-to-Sequence model that generates the name character by character, so that it can cover all possible variable names. The proposed approach achieves an average exact name generation accuracy of 70.53%, outperforming the state of the art by 12%.
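To illustrate what character-based tokenization of variable names could look like, here is a minimal Python sketch. It is not the model from the paper: the special tokens, function names, and fixed-length padding are assumptions made for illustration, since the abstract does not specify them. The sketch only shows why a character-level vocabulary can represent names that never appeared in the training data.

```python
# Hypothetical special tokens; the paper's actual vocabulary is not given in the abstract.
PAD, SOS, EOS = "<pad>", "<sos>", "<eos>"


def build_vocab(names):
    """Map every character seen in the training names to an integer id."""
    chars = sorted({ch for name in names for ch in name})
    return {tok: i for i, tok in enumerate([PAD, SOS, EOS] + chars)}


def encode(name, vocab, max_len):
    """Turn a name into a fixed-length id sequence: <sos> chars <eos> <pad>..."""
    ids = [vocab[SOS]] + [vocab[ch] for ch in name] + [vocab[EOS]]
    if len(ids) > max_len:
        raise ValueError("name longer than max_len")
    return ids + [vocab[PAD]] * (max_len - len(ids))


def decode(ids, vocab):
    """Invert encode: collect characters until the end-of-sequence marker."""
    inverse = {i: tok for tok, i in vocab.items()}
    chars = []
    for i in ids:
        tok = inverse[i]
        if tok == EOS:
            break
        if tok not in (SOS, PAD):
            chars.append(tok)
    return "".join(chars)


# Example target names (made up for this sketch).
names = ["index", "callback", "userName"]
vocab = build_vocab(names)

# A name seen during training round-trips through the encoding.
assert decode(encode("index", vocab, max_len=12), vocab) == "index"

# A name that was never seen is still representable, because all of its
# characters are in the vocabulary.
assert decode(encode("indexName", vocab, max_len=12), vocab) == "indexName"
```

A word-level vocabulary built from the same three names could only ever output one of those three names. The character-level vocabulary can spell out any name composed of known characters, which is the property the abstract refers to as covering all possible variable names.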

Published
2023-10-18
How to Cite
SÎRBU, A.-G. Deobfuscating JavaScript Code Using Character-Based Tokenization. Studia Universitatis Babeș-Bolyai Informatica, [S.l.], v. 68, n. 2, p. 5-21, Oct. 2023. ISSN 2065-9601. Available at: <https://www.cs.ubbcluj.ro/~studia-i/journal/journal/article/view/90>. Date accessed: 30 June 2024. doi: https://doi.org/10.24193/subbi.2023.2.01.
Section
Articles