When scammers "show no martial virtue", this is how we deal with them

Posted by reyjrar on Thu, 03 Mar 2022 22:19:19 +0100

"I am xxx, send me money..." Have you ever received a message like this?

In fact, fraud that exploits information technology has a longer history than you might imagine. Even before the Internet existed, the "Nigerian Prince" scam spread around the world through paper letters and faxes. Today, online fraud built on social engineering reaches victims through many more channels and takes more diverse forms, which makes it very hard to guard against.

Today, we look at the problem from an enterprise's perspective: how to use technical means to stop online swindlers from committing fraud through the company's own platform.

===

Fraudulent users and malicious accounts can cost enterprises billions of dollars in lost revenue every year. Although many enterprises use rule-based filters to block malicious activity in their systems, such filters are often fragile and cannot capture every kind of malicious behavior.

Graph-based techniques, on the other hand, are very effective at detecting fraudsters and malicious users. Fraudsters can adapt their activity to fool rule-based systems or simple feature-based models, but it is hard to fake the graph structure, especially the relationships between users and other entities recorded in transaction or interaction logs. Graph neural networks (GNNs) combine the information in the graph structure with the attributes of users or transactions, learn meaningful representations, and use them to distinguish malicious users and events from legitimate ones.

This article describes how to use Amazon SageMaker and the Deep Graph Library (DGL) to train a GNN model that detects malicious users or fraudulent transactions. If you prefer a fully managed AWS AI service for fraud detection, consider Amazon Fraud Detector, which makes it much easier to identify potentially fraudulent online activity such as fake account creation or online payment fraud.

The rest of this article focuses on data preprocessing and model training with Amazon SageMaker. To train a GNN model, we first need to build a heterogeneous graph from the information in the transaction tables or access logs. A heterogeneous graph is a graph that contains different types of nodes and edges. When nodes represent users or transactions, each node can have several different kinds of relationships with other users or with other entities (such as device identifiers, institutions, applications, and IP addresses).

This solution applies to use cases such as the following:

Financial networks, in which users transact with other users and with specific financial institutions or applications.
Gaming networks, in which users continuously interact with other users as well as with different games or devices.
Social networks, in which users are connected to other users through many different types of links.

The following figure shows the basic structure of a heterogeneous financial transaction network.

A GNN can also incorporate user features (such as demographic information) and transaction features (such as activity frequency). In other words, node and edge features serve as metadata that enriches the heterogeneous graph representation. Once the nodes, relationships, and associated features of the heterogeneous graph are in place, we can train a GNN model that learns to use both the node or edge features and the graph structure to classify nodes as malicious or legitimate. Training is semi-supervised: some nodes in the graph must already be labeled as fraudulent or legitimate. This labeled subset provides the training signal from which the GNN's parameters are learned. The trained model can then predict labels for the remaining unlabeled nodes in the graph.
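
To make the semi-supervised setup concrete, here is a minimal sketch of a loss that is computed over labeled nodes only. The probabilities and labels are made-up toy values, and the function name masked_cross_entropy is an illustrative assumption, not part of DGL or of the original solution.

```python
import math

# Hypothetical model outputs: predicted probability that each node is fraudulent.
predicted_fraud_prob = [0.9, 0.2, 0.6, 0.1, 0.7]
# Node labels: 1 = fraudulent, 0 = legitimate, None = unlabeled.
labels = [1, 0, None, 0, None]


def masked_cross_entropy(probs, node_labels):
    """Mean binary cross-entropy computed over the labeled nodes only."""
    losses = []
    for p, y in zip(probs, node_labels):
        if y is None:
            continue  # unlabeled nodes contribute nothing to the training signal
        losses.append(-math.log(p) if y == 1 else -math.log(1.0 - p))
    return sum(losses) / len(losses)


loss = masked_cross_entropy(predicted_fraud_prob, labels)
# After training, these are the nodes whose labels the model has to predict.
unlabeled = [i for i, y in enumerate(labels) if y is None]
print(round(loss, 4), unlabeled)
```

The unlabeled nodes still take part in message passing through the graph; they are only excluded from the loss.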

Architecture

The solution uses Amazon SageMaker to run both the processing jobs and the training jobs. You can trigger the Amazon SageMaker jobs automatically with an AWS Lambda function that responds to Amazon Simple Storage Service (Amazon S3) PUT events, or trigger them manually by running the cells in the sample Amazon SageMaker notebook. The following figure is a visual representation of this architecture:

The complete implementation is available in the GitHub repo, which also includes an AWS CloudFormation template that launches the whole architecture in your AWS account.
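
As a rough illustration of the Lambda side of the automatic trigger, the sketch below reads an S3 PUT event and collects the URIs of the uploaded objects. It assumes the standard S3 event notification layout; the handler only gathers the locations, and starting the SageMaker job is deliberately left out. The function in the repo may look quite different.

```python
from urllib.parse import unquote_plus


def lambda_handler(event, context):
    """Collect the S3 URIs of newly uploaded objects from an S3 PUT event."""
    uploaded = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        uploaded.append("s3://{}/{}".format(bucket, key))
    # A real function would start the SageMaker processing job here;
    # that call is omitted from this sketch.
    return {"uploaded": uploaded}


# Hypothetical event with a single uploaded object.
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "my-bucket"},
                "object": {"key": "dgl/raw-data/train_transaction.csv"}}}
    ]
}
result = lambda_handler(sample_event, None)
print(result)
```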

Preparing for GNN fraud detection: data preprocessing

This section describes how to preprocess the sample dataset and identify the relationships between nodes in the heterogeneous graph.

Dataset

In this use case, we benchmark the modeling approach on the IEEE-CIS fraud dataset. This is an anonymized dataset containing as many as 500,000 transactions between users. The dataset contains two main tables:

Transactions table: contains information about transactions or interactions between users.
Identity table: contains access log, device, and network information for the users performing the transactions.

We use a subset of these transactions and their labels as the supervision signal for model training. For transactions in the test dataset, the labels are masked during training. The model's task is then clear: predict which of the masked transactions are fraudulent and which are legitimate.

The following example code fetches the data and uploads it to the Amazon S3 bucket from which Amazon SageMaker reads the dataset during preprocessing and training (run it in a Jupyter notebook cell):
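
The label masking described above can be sketched as follows. The transaction IDs, labels, and the 80/20 split are illustrative assumptions; the actual preprocessing script may split the data differently.

```python
# Hypothetical labeled transactions: (TransactionID, isFraud).
transactions = [(1001, 0), (1002, 1), (1003, 0), (1004, 0), (1005, 1)]
train_ratio = 0.8  # assumed split ratio, for illustration only

n_train = int(len(transactions) * train_ratio)
train_ids = [tid for tid, _ in transactions[:n_train]]
test_ids = [tid for tid, _ in transactions[n_train:]]

# Labels the model may see during training.
visible_labels = {tid: label for tid, label in transactions[:n_train]}
# Labels held back for evaluation; the model never sees these while training.
held_out_labels = {tid: label for tid, label in transactions[n_train:]}

print(train_ids, test_ids)
```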

# Replace with an S3 location or local path to point to your own dataset
raw_data_location = 's3://sagemaker-solutions-us-west-2/Fraud-detection-in-financial-networks/data'
bucket = 'SAGEMAKER_S3_BUCKET'
prefix = 'dgl'
input_data = 's3://{}/{}/raw-data'.format(bucket, prefix)
!aws s3 cp --recursive $raw_data_location $input_data
# Set S3 locations to store processed data for training and post-training results and artifacts respectively
train_data = 's3://{}/{}/processed-data'.format(bucket, prefix)
train_output = 's3://{}/{}/output'.format(bucket, prefix)

Although fraudsters try to cover up their malicious activity, it still leaves distinctive patterns in the graph structure, such as high node degree or a tendency for activity to cluster. The following sections explain how to perform feature extraction and graph construction so that the GNN model can use these patterns to predict fraud.
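
As a small illustration of one such pattern, the sketch below counts how many transactions touch each device and flags devices with unusually high degree. The edges and the threshold of 3 are made-up values chosen only for the example.

```python
from collections import Counter

# Hypothetical (transaction, device) edges.
edges = [("t1", "d1"), ("t2", "d1"), ("t3", "d1"), ("t4", "d2")]

# Degree of each device node = number of transactions connected to it.
degree = Counter(device for _, device in edges)

# Arbitrary threshold for this toy example.
high_degree = [device for device, count in degree.items() if count >= 3]
print(high_degree)
```

A GNN does not need such a hand-written rule; it can pick up this kind of structure from the graph itself.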

Feature extraction

Feature extraction consists of numerically encoding the categorical features and applying a series of transformations to the numeric columns. For example, we apply a logarithmic transformation to the transaction amount to capture its relative magnitude, and convert categorical attributes to numeric form with one-hot encoding. For each transaction, the feature vector contains attributes from the transaction table, including the time delta from the previous transaction, name and address matches, and match counts.
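
A minimal sketch of these two transformations, using only the standard library. The rows are toy values, and the choice of base-10 logarithm is an assumption; the solution's preprocessing script uses pandas and may differ in detail.

```python
import math

# Hypothetical rows with one numeric and one categorical column.
rows = [
    {"TransactionAmt": 50.0, "ProductCD": "W"},
    {"TransactionAmt": 1200.0, "ProductCD": "C"},
    {"TransactionAmt": 9.99, "ProductCD": "W"},
]

# Fixed category order so that every row gets the same one-hot layout.
categories = sorted({row["ProductCD"] for row in rows})


def encode(row):
    """Return [log10(amount), one-hot(ProductCD)...] for a single row."""
    log_amount = math.log10(row["TransactionAmt"])
    one_hot = [1.0 if row["ProductCD"] == c else 0.0 for c in categories]
    return [log_amount] + one_hot


features = [encode(row) for row in rows]
print(categories, features)
```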

Graph construction

To build the complete interaction graph, we split the relationship information in the data into one edge list per relationship type. Each edge list is a bipartite graph between transaction nodes and one other entity type. These entity types are the identifying attributes associated with a transaction. For example, we can create an entity type for the kind of card used in the transaction (debit or credit), for the IP address of the device used to complete the transaction, and for the device ID or operating system of that device. The entity types used for graph construction include all attributes in the identity table and a subset of the attributes in the transaction table, such as credit card information or email domain. The heterogeneous graph is then composed of the edge lists for each relationship type together with the feature matrix of the nodes.
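
The splitting step can be sketched as follows. The rows are toy values, skipping missing attributes is an assumption, and the file name pattern simply mirrors the relation_*_edgelist.csv names that appear in the processing output.

```python
from collections import defaultdict

# Hypothetical transaction rows with two identity-like columns.
rows = [
    {"TransactionID": 1, "card4": "visa", "DeviceType": "mobile"},
    {"TransactionID": 2, "card4": "visa", "DeviceType": "desktop"},
    {"TransactionID": 3, "card4": "mastercard", "DeviceType": None},
]
id_cols = ["card4", "DeviceType"]

# One bipartite edge list (transaction -> entity) per relationship type.
edge_lists = defaultdict(list)
for row in rows:
    for col in id_cols:
        value = row[col]
        if value is None:
            continue  # assumed behavior: a missing attribute produces no edge
        edge_lists[col].append((row["TransactionID"], value))

# File names following the pattern seen in the processing output.
file_names = {col: "relation_{}_edgelist.csv".format(col) for col in edge_lists}
print(dict(edge_lists), file_names)
```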

Using Amazon SageMaker Processing

You can use Amazon SageMaker Processing to run the data preprocessing and feature extraction steps. Amazon SageMaker Processing is a feature of Amazon SageMaker that lets you run preprocessing and postprocessing workloads on fully managed infrastructure. For more details, see "Process Data and Evaluate Models" in the Amazon SageMaker documentation.

First, we define the container used by the Amazon SageMaker Processing job. The container must include all the dependencies that the data preprocessing script requires. Because the preprocessing in this use case depends only on the pandas library, a minimal Dockerfile is enough to define the container. See the following code:

FROM python:3.7-slim-buster
RUN pip3 install pandas==0.24.2
ENV PYTHONUNBUFFERED=TRUE
ENTRYPOINT ["python3"]

Run the following code to build the container and push it to an Amazon Elastic Container Registry (Amazon ECR) repository:

import boto3
region = boto3.session.Session().region_name
account_id = boto3.client('sts').get_caller_identity().get('Account')
ecr_repository = 'sagemaker-preprocessing-container'
ecr_repository_uri = '{}.dkr.ecr.{}.amazonaws.com/{}:latest'.format(account_id, region, ecr_repository)
!bash data-preprocessing/container/build_and_push.sh $ecr_repository docker

Once the data preprocessing container is ready, we create an Amazon SageMaker ScriptProcessor, which sets up the processing job environment from the preprocessing container. We then use the ScriptProcessor to run the Python script that implements the data preprocessing inside the environment defined by the container. The processing job ends when the Python script finishes and has saved the preprocessed data back to Amazon S3. The whole process is fully managed by Amazon SageMaker. When running the ScriptProcessor, we can pass arguments to the data preprocessing script to specify which columns of the transaction table are identity columns and which are categorical features. All other columns are assumed to be numeric feature columns. See the following code:

import sagemaker
from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput

# IAM role assumed by the processing job
# (assumes the notebook runs with a SageMaker execution role)
role = sagemaker.get_execution_role()

script_processor = ScriptProcessor(command=['python3'],
                                   image_uri=ecr_repository_uri,
                                   role=role,
                                   instance_count=1,
                                   instance_type='ml.r5.24xlarge')

# --id-cols: columns treated as identity columns; --cat-cols: categorical features
script_processor.run(code='data-preprocessing/graph_data_preprocessor.py',
                     inputs=[ProcessingInput(source=input_data,
                                             destination='/opt/ml/processing/input')],
                     outputs=[ProcessingOutput(destination=train_data,
                                               source='/opt/ml/processing/output')],
                     arguments=['--id-cols', 'card1,card2,card3,card4,card5,card6,ProductCD,addr1,addr2,P_emaildomain,R_emaildomain',
                                '--cat-cols', 'M1,M2,M3,M4,M5,M6,M7,M8,M9'])
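
On the receiving side, a preprocessing script could read these arguments roughly as shown below. This is a hypothetical sketch built on argparse, not the code of graph_data_preprocessor.py; the helper names parse_args and numeric_columns are illustrative.

```python
import argparse


def parse_args(argv):
    """Parse comma-separated column lists passed on the command line."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--id-cols", default="")
    parser.add_argument("--cat-cols", default="")
    args = parser.parse_args(argv)
    id_cols = [c for c in args.id_cols.split(",") if c]
    cat_cols = [c for c in args.cat_cols.split(",") if c]
    return id_cols, cat_cols


def numeric_columns(all_columns, id_cols, cat_cols):
    """Every column that is neither an identity nor a categorical column."""
    return [c for c in all_columns if c not in id_cols and c not in cat_cols]


id_cols, cat_cols = parse_args(["--id-cols", "card1,card2", "--cat-cols", "M1,M2"])
numeric = numeric_columns(["card1", "card2", "M1", "M2", "TransactionAmt"],
                          id_cols, cat_cols)
print(id_cols, cat_cols, numeric)
```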

The following example code lists the output of the Amazon SageMaker Processing job stored in Amazon S3:

from os import path
from sagemaker.s3 import S3Downloader
processed_files = S3Downloader.list(train_data)
print("===== Processed Files =====")
print('\n'.join(processed_files))

Output:
===== Processed Files =====
s3://graph-fraud-detection/dgl/processed-data/features.csv
s3://graph-fraud-detection/dgl/processed-data/relation_DeviceInfo_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_DeviceType_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_P_emaildomain_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_ProductCD_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_R_emaildomain_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_TransactionID_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_addr1_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_addr2_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_card1_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_card2_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_card3_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_card4_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_card5_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_card6_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_01_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_02_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_03_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_04_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_05_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_06_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_07_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_08_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_09_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_10_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_11_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_12_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_13_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_14_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_15_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_16_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_17_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_18_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_19_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_20_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_21_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_22_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_23_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_24_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_25_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_26_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_27_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_28_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_29_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_30_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_31_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_32_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_33_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_34_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_35_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_36_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_37_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_38_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/tags.csv
s3://graph-fraud-detection/dgl/processed-data/test.csv

Each relation_*_edgelist.csv file represents one type of edge used to construct the heterogeneous graph during training. features.csv contains the final transformed features of the transaction nodes, and tags.csv contains the node labels used as the supervision signal for training. test.csv contains the TransactionID values that form the test dataset for evaluating model performance. The labels of these nodes are masked during training so that they cannot influence the model's predictions.

GNN model training

Now we can use the Deep Graph Library (DGL) to create the graph and define the GNN model, and use Amazon SageMaker to launch the infrastructure that trains it. Specifically, we use a relational graph convolutional network (R-GCN) model to learn embeddings for the nodes in the heterogeneous graph, followed by a fully connected layer for the final node classification.

Hyperparameters

To train the GNN model, you also need to fix a set of hyperparameters before training, such as the type of graph to construct, the class of GNN model to use, the network architecture, the optimizer, and the optimization parameters. See the following code:
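
To give a feel for what a relational graph convolution computes, here is a deliberately simplified single layer in plain Python. It assumes every node type has the same feature size, uses mean aggregation per relation, and omits bias terms; the weights and graph are toy values. It is a sketch of the idea, not DGL's implementation or the model in the repo.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rgcn_layer(features, edges_by_relation, weights, self_weight):
    """One relational graph convolution step with mean aggregation and ReLU.

    features: {node: feature vector}
    edges_by_relation: {relation: [(src, dst), ...]}, messages flow src -> dst
    weights: {relation: weight matrix}, one matrix per relation type
    """
    updated = {}
    for node, h in features.items():
        total = matvec(self_weight, h)  # self-connection term
        for relation, edges in edges_by_relation.items():
            neighbours = [src for src, dst in edges if dst == node]
            for src in neighbours:
                message = matvec(weights[relation], features[src])
                for k in range(len(total)):
                    total[k] += message[k] / len(neighbours)
        updated[node] = [max(0.0, v) for v in total]  # ReLU
    return updated


identity = [[1.0, 0.0], [0.0, 1.0]]
features = {"t1": [1.0, 0.0], "t2": [0.0, 1.0], "card": [1.0, 1.0]}
edges_by_relation = {
    "card_to_txn": [("card", "t1"), ("card", "t2")],
    "txn_to_card": [("t1", "card"), ("t2", "card")],
}
weights = {"card_to_txn": [[0.5, 0.0], [0.0, 0.5]], "txn_to_card": identity}

embeddings = rgcn_layer(features, edges_by_relation, weights, identity)
print(embeddings)
```

Because each relation has its own weight matrix, a shared card and a shared device can influence a transaction's embedding in different ways.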

edges = ",".join(map(lambda x: x.split("/")[-1], [file for file in processed_files if "relation" in file]))
params = {'nodes' : 'features.csv',
 'edges': 'relation*.csv',
 'labels': 'tags.csv',
 'model': 'rgcn',
 'num-gpus': 1,
 'batch-size': 10000,
 'embedding-size': 64,
 'n-neighbors': 1000,
 'n-layers': 2,
 'n-epochs': 10,
 'optimizer': 'adam',
 'lr': 1e-2
 }

The preceding code sets only some of the hyperparameters. For more details on the hyperparameters and their default values, see estimator_fns.py in the GitHub repo.
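
The 'edges' hyperparameter is a file name pattern rather than an explicit list. How the training script resolves it is not shown here, so the sketch below is only an assumption: it uses fnmatch to pick the matching files from a shortened, illustrative listing and derives a relation name from each file name.

```python
from fnmatch import fnmatch

# Shortened, illustrative listing of processed files.
processed_files = [
    "s3://graph-fraud-detection/dgl/processed-data/features.csv",
    "s3://graph-fraud-detection/dgl/processed-data/relation_card1_edgelist.csv",
    "s3://graph-fraud-detection/dgl/processed-data/relation_id_01_edgelist.csv",
    "s3://graph-fraud-detection/dgl/processed-data/tags.csv",
]

file_names = [uri.split("/")[-1] for uri in processed_files]
# Keep only the files that match the pattern used in the hyperparameters.
edge_files = [name for name in file_names if fnmatch(name, "relation*.csv")]
# Strip the common prefix and suffix to recover the relation name.
relations = [name[len("relation_"):-len("_edgelist.csv")] for name in edge_files]
print(edge_files, relations)
```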

Training the model with Amazon SageMaker

With the hyperparameters defined, you can start the training job. The training job uses DGL, with MXNet as the backend deep learning framework, to define and train the GNN model. Amazon SageMaker provides framework estimators with the deep learning framework environment already set up, which makes it much easier to train GNN models. For more details on training GNN models with DGL on Amazon SageMaker, see "Train a Deep Graph Network" in the Amazon SageMaker documentation.

Now we create an Amazon SageMaker MXNet estimator and pass in the model training script, the hyperparameters, and the number and type of training instances we want. We then call fit on the estimator and pass it the location of the training data in Amazon S3. See the following code:

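import sagemaker
from sagemaker.mxnet import MXNet

# SageMaker session used to launch the training job
sess = sagemaker.Session()

# Note: train_instance_count and train_instance_type are the parameter names in
# version 1 of the SageMaker Python SDK; version 2 renamed them to
# instance_count and instance_type.
estimator = MXNet(entry_point='train_dgl_mxnet_entry_point.py',
                  source_dir='dgl-fraud-detection',
                  role=role,
                  train_instance_count=1,
                  train_instance_type='ml.p2.xlarge',
                  framework_version="1.6.0",
                  py_version='py3',
                  hyperparameters=params,
                  output_path=train_output,
                  code_location=train_output,
                  sagemaker_session=sess)

# Start training on the processed data stored in Amazon S3
estimator.fit({'train': train_data})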

Results

After training, the GNN model has learned how to distinguish legitimate transactions from fraudulent ones. The training job generates a pred.csv file that contains the model's predictions for the transactions in test.csv. The ROC curve shows the relationship between the true positive rate and the false positive rate at various thresholds, and the area under the curve (AUC) can be used as an evaluation metric. As the figure below shows, the GNN model we trained outperforms both a fully connected feedforward network and gradient boosted trees that use the same features but do not take full advantage of the graph structure.
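
For readers who want to compute the metric themselves, the following sketch calculates AUC from its pairwise definition: the probability that a randomly chosen fraudulent transaction receives a higher score than a randomly chosen legitimate one. The labels and scores are toy values, not results from the trained model.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Hypothetical true labels (1 = fraud) and predicted fraud scores.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.3, 0.4, 0.5, 0.1]
auc = roc_auc(labels, scores)
print(round(auc, 4))
```

The pairwise form is quadratic in the number of transactions, so for a dataset of this size a library routine based on sorting would be the practical choice.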

Summary

In this article, we explained how to build a heterogeneous graph from user transactions and activity, how to use that graph together with other collected features to train a GNN model, and finally how to predict which transactions are fraudulent. We also showed how to use DGL and Amazon SageMaker to define and train GNN models with high predictive performance. For the complete implementation of this project and details of other GNN models, see the GitHub repo.

We also showed how to use Amazon SageMaker Processing to extract meaningful features and relationships from raw transaction logs. You can deploy the CloudFormation template provided with the example and pass in your own dataset to detect malicious users and fraudulent transactions in your data.
