Clustering method in protocol reverse engineering for industrial protocols

Posted by ItsWesYo on Wed, 19 Jan 2022 17:45:51 +0100

1, Abstract

In this paper, we propose a method for analyzing the structure of private protocols, which can be applied to industrial protocols. The method consists of six modules: traffic collection, message extraction, message size clustering, message similarity clustering, field extraction and session analysis. We collect traffic from a Schneider Modicon M580 and demonstrate the effectiveness of the proposed method by comparing its results on the collected traffic with those of existing protocol reverse engineering methods (Netzob, AutoReEngine and FieldHunter).

2, Introduction

In this paper, we propose a system for deriving the Modbus/TCP protocol structure using Schneider Modicon equipment, which is representative PLC equipment. The system first classifies the protocol messages according to their size and then clusters them by similarity using the mean shift algorithm. Each resulting group of messages is defined as a message type. For each type, the system uses the contiguous sequence pattern (CSP) algorithm to extract the common substrings, which define the fields. Once the fields are defined, the structure of each message is analyzed. Finally, the order and structure of the message types can be used to identify the message types used in industrial sites, the meaning of fields and the commands transmitted over the network.

The proposed automatic industrial protocol reverse engineering method consists of six stages: traffic collection, message extraction, size-based message clustering, similarity-based message clustering, field extraction and session analysis.

(1) Traffic collection

This step collects the traffic between the engineering workstation (EWS) and the PLC device. To collect traffic, a function is selected on the EWS and executed while the traffic is captured. This process must be performed multiple times so that at least two traffic sets are collected.

(2) Message extraction

The collected traffic sets are stored as pcap files. This step converts each pcap file into the message format used by this method. During extraction, the messages are divided into request and response messages according to their direction, because the message types used in requests differ from those used in responses. The information contained in each message is as follows:
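
As a rough sketch of this step (not the authors' implementation), the snippet below splits already-parsed packets into request and response messages by direction and records where each message came from. The `Message` fields, the `split_by_direction` helper, the input tuple layout and the idea of identifying the PLC by its IP address are all assumptions made for illustration. Parsing the pcap file itself is left out.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Message:
    flow_index: int    # which flow (session) the packet belongs to
    packet_index: int  # position of the packet inside that flow
    direction: str     # "request" (towards the PLC) or "response" (from the PLC)
    payload: bytes     # application-layer bytes of the packet


# Hypothetical input record: (flow_index, packet_index, src_ip, dst_ip, payload)
Packet = Tuple[int, int, str, str, bytes]


def split_by_direction(
    packets: Iterable[Packet], plc_ip: str
) -> Tuple[List[Message], List[Message]]:
    """Return (requests, responses), decided by whether the PLC is the destination."""
    requests: List[Message] = []
    responses: List[Message] = []
    for flow_index, packet_index, src_ip, dst_ip, payload in packets:
        if not payload:
            continue  # skip packets that carry no application data
        if dst_ip == plc_ip:
            requests.append(Message(flow_index, packet_index, "request", payload))
        elif src_ip == plc_ip:
            responses.append(Message(flow_index, packet_index, "response", payload))
    return requests, responses
```

Keeping the flow and packet positions on each message at this point is what later allows the message types to be put back in order during session analysis.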

(3) Size based message clustering

This step assigns types by separating the messages produced by the message extraction step according to their size. Because the extracted messages are divided into requests and responses, this step is performed twice, once for each direction. First, the step receives the request messages; then, the input messages are sorted by size; finally, the sorted messages are grouped into types sequentially, starting with the smallest message size. The same procedure is then applied to the response messages. The detailed steps are shown in the figure below:
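
A minimal sketch of the size-based grouping, assuming each message is available as a `bytes` payload. The function name `cluster_by_size` is hypothetical and the code is only an illustration of the idea, not the original implementation.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def cluster_by_size(payloads: Iterable[bytes]) -> Dict[int, List[bytes]]:
    """Group payloads by length, ordered from the smallest size to the largest."""
    groups: Dict[int, List[bytes]] = defaultdict(list)
    for payload in payloads:
        groups[len(payload)].append(payload)
    # Build a plain dict whose keys are in ascending size order
    return {size: groups[size] for size in sorted(groups)}
```

The function would be called once with the request payloads and once with the response payloads, matching the two passes described above.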

(4) Message clustering based on similarity

This step measures the similarity between messages of the same size and performs a more detailed clustering within each size group. We compared several clustering algorithms, including K-means, UPGMA and mean shift, to find the one that classifies the messages best. After this comparison, we decided to use the mean shift algorithm. The following figure shows the process of classifying messages by similarity with the mean shift algorithm:

This step determines the number of message types in the input messages.
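
To make the idea concrete, here is a small pure-Python mean shift with a flat kernel. It is a sketch under stated assumptions: messages of equal size are treated as numeric vectors of byte values, similarity is taken to be Euclidean distance, and `bandwidth` is a hand-picked parameter. None of these choices is confirmed by the text above, and the names `mean_shift` and `_distance` are hypothetical.

```python
from typing import List, Sequence


def _distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def mean_shift(points: Sequence[Sequence[float]], bandwidth: float,
               max_iter: int = 100, tol: float = 1e-6) -> List[int]:
    """Flat-kernel mean shift; returns one cluster label per input point."""
    modes = []
    for start in points:
        center = list(start)
        for _ in range(max_iter):
            # Points that fall inside the window around the current center
            neighbors = [p for p in points if _distance(p, center) <= bandwidth]
            if not neighbors:
                break
            # Move the center to the mean of those points
            new_center = [sum(col) / len(neighbors) for col in zip(*neighbors)]
            shift = _distance(new_center, center)
            center = new_center
            if shift < tol:
                break
        modes.append(center)

    # Points whose modes ended up close together share a label
    labels: List[int] = []
    centers: List[List[float]] = []
    for mode in modes:
        for label, existing in enumerate(centers):
            if _distance(mode, existing) <= bandwidth / 2:
                labels.append(label)
                break
        else:
            centers.append(mode)
            labels.append(len(centers) - 1)
    return labels
```

For a group of same-size payloads this could be called as `mean_shift([tuple(m) for m in payloads], bandwidth=...)`. Unlike K-means, the number of clusters is not given in advance; it falls out of the number of distinct modes, which is why the step can report how many message types the input contains.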

(5) Field extraction

The field extraction step derives the static and dynamic fields of messages. Static fields are the common substrings shared by all messages of the same type. Dynamic fields are the remaining parts of those messages, that is, everything other than the common substrings.

In this step, the CSP algorithm is used to extract static fields. The CSP algorithm extracts common substrings based on the Apriori algorithm. For static field extraction, the messages of one type are treated as a set of sequences. The CSP algorithm first generates candidates of length 1 from this set of sequences. A minimum support check then separates the candidates that meet the minimum support from those that do not. Candidates that do not meet the minimum support are deleted, and candidates of length 2 are created from those that remain. This process is repeated until the length can no longer be increased. The following figure shows the process of static field extraction:

The CSP algorithm requires a minimum support threshold. The minimum support is the condition a candidate must satisfy to be extended to the next length. In this study, the minimum support is always set to 100%, so that only substrings common to all messages of the same type are extracted.
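
The level-wise procedure can be sketched as follows. This is a simplified, Apriori-style illustration written for this post, not the published CSP implementation: the function names `support` and `csp` are hypothetical, support is counted as the fraction of messages containing a candidate as a contiguous substring, and only maximal patterns are returned at the end.

```python
from typing import List, Set


def support(pattern: bytes, messages: List[bytes]) -> float:
    """Fraction of messages that contain `pattern` as a contiguous substring."""
    return sum(pattern in m for m in messages) / len(messages)


def csp(messages: List[bytes], min_support: float = 1.0) -> List[bytes]:
    """Return the maximal contiguous patterns that meet `min_support`."""
    if not messages:
        return []
    # Length-1 candidates: every distinct byte value seen in the messages
    candidates: Set[bytes] = {bytes([b]) for m in messages for b in m}
    frequent_levels: List[Set[bytes]] = []
    while candidates:
        # Minimum support check: drop candidates that are not frequent enough
        frequent = {c for c in candidates if support(c, messages) >= min_support}
        if not frequent:
            break
        frequent_levels.append(frequent)
        # Join step: extend a by the last byte of b when they overlap,
        # giving candidates that are one byte longer
        candidates = {a + b[-1:] for a in frequent for b in frequent
                      if a[1:] == b[:-1]}
    all_frequent = [p for level in frequent_levels for p in level]
    # Keep only patterns that are not contained in a longer frequent pattern
    maximal = [p for p in all_frequent
               if not any(p != q and p in q for q in all_frequent)]
    return sorted(maximal, key=lambda p: (-len(p), p))
```

With `min_support=1.0` the result corresponds to the 100% setting described above: every returned pattern occurs in every message of the type and is a candidate static field, while the bytes outside those patterns are candidate dynamic fields.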

(6) Session analysis

The session analysis step arranges the derived message types in order, showing the sequence of message types that occurs when a function is executed. Each message records its flow position and packet position, so this information can be used to reconstruct the sequence of message types in a session. The resulting sequences make it possible to analyze the structure of the protocol messages within a flow.
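
A minimal sketch of how such sequences could be rebuilt, assuming each clustered message is available as a `(flow_index, packet_index, message_type)` record. The record layout, the type names and the function `type_sequences` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def type_sequences(
    records: Iterable[Tuple[int, int, str]]
) -> Dict[int, List[str]]:
    """Map each flow to its message types, ordered by packet position."""
    by_flow: Dict[int, List[Tuple[int, str]]] = defaultdict(list)
    for flow_index, packet_index, message_type in records:
        by_flow[flow_index].append((packet_index, message_type))
    # Sort each flow by packet position and keep only the type names
    return {
        flow: [message_type for _, message_type in sorted(items)]
        for flow, items in sorted(by_flow.items())
    }
```

Comparing the sequences obtained from several captures of the same function is one way to see which message types recur in a fixed order.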

3, Experimental evaluation

The performance of the proposed method is evaluated on the results of analyzing the Modbus/TCP protocol. The effectiveness of the method is then verified by comparing its results with those of Netzob and AutoReEngine.

(1) Traffic collection

The function that transfers a project from the PLC to the EWS sends the project information currently running on the PLC to the EWS. We collect traffic and analyze the protocol structure for the following three typical functions. The information collected is as follows:

(2) Message clustering

We manually classified the message types by analyzing the experimental data. The classification results are as follows:

We then applied the message clustering algorithms; the classification results are as follows. We found that the mean shift algorithm produces the classification closest to the manual classification of message types.

(3) Performance evaluation

This paper evaluates performance using two metrics: conciseness and coverage. Conciseness compares the message types extracted by each method with the manually generated ground truth message types of the input data. Coverage evaluates how many of the input messages are covered by the extracted message types.
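
The exact formulas are not given here, so the following is only one plausible reading, expressed as simple ratios. Both definitions are assumptions: conciseness is taken as the number of extracted types divided by the number of ground truth types (closer to 1 is better), and coverage as the share of input messages that match at least one extracted type.

```python
def conciseness(num_extracted_types: int, num_ground_truth_types: int) -> float:
    """Assumed definition: extracted types per ground truth type."""
    return num_extracted_types / num_ground_truth_types


def coverage(num_covered_messages: int, num_messages: int) -> float:
    """Assumed definition: fraction of input messages covered by some type."""
    return num_covered_messages / num_messages
```

Under this reading, a method that splits one true type into many extracted types scores well above 1 on conciseness even if its coverage is complete.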


The following figure shows the conciseness comparison between the proposed method, Netzob and AutoReEngine:

The following figure shows the comparison of coverage values of Netzob, AutoReEngine and the proposed method:

4, Conclusion

This paper presents a method for analyzing the structure of private industrial protocols, which can be used to monitor the network traffic of industrial protocols effectively. The experiments show that existing protocol reverse engineering methods have limitations when analyzing industrial protocols.

As future work, we intend to apply the method to a wider range of industrial protocols in order to develop systems that can be used in industrial sites. In addition, our goal is to develop a system suitable for commercial, private and industrial protocols.

