Node Representation Learning Based on Graph Neural Network

Posted by jim.davidson on Tue, 25 Jan 2022 12:30:56 +0100

In this section, we will learn the general workflow of building and training a multilayer graph neural network to generate high-quality node representations and perform high-accuracy node classification.
Our task is to predict the labels of unlabeled nodes based on node attributes (which can be categorical or numerical), edge information, edge attributes (if any), and the known labels of other nodes.

Introduction to Cora Dataset

  • The Cora dataset consists of papers from several machine learning areas, grouped into seven classes: Case_Based, Genetic_Algorithms, Neural_Networks, Probabilistic_Methods, Reinforcement_Learning, Rule_Learning, Theory.
  • Every paper in the dataset cites, or is cited by, at least one other paper in the dataset, for a total of 2708 papers. After stemming and removing stop words, as well as words with document frequency below 10, the final vocabulary contains 1433 words.

The dataset contains two files:

  1. The .content file describes each paper, one per line, in the format <paper_id> <word_attributes> <class_label>:
    • <paper_id>: the unique identifier of a paper, one per paper.
    • <word_attributes>: binary word features (0 or 1) indicating the absence or presence of the corresponding vocabulary word.
    • <class_label>: the category of the paper.
  2. The .cites file contains the citation graph of the dataset, with each line formatted as <ID of cited paper> <ID of citing paper>:
    • <ID of cited paper>: the identifier of the paper being cited.
    • <ID of citing paper>: the identifier of the paper that cites it.

Citations run from right to left: if a line reads paper1 paper2, the corresponding edge is paper2 -> paper1, i.e. paper2 cites paper1.
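
For readers who want to inspect the raw files directly rather than going through a library loader, a minimal parsing sketch is shown below. The cora/ paths and variable names are hypothetical; the raw files are tab-separated, matching the format described above.

# Minimal sketch for parsing the raw Cora files (hypothetical paths).
# .content row: <paper_id> <1433 binary word attributes> <class_label>
# .cites row:   <ID of cited paper> <ID of citing paper>
features, labels = {}, {}
with open('cora/cora.content') as f:
    for line in f:
        parts = line.strip().split('\t')
        paper_id = parts[0]
        features[paper_id] = [int(v) for v in parts[1:-1]]
        labels[paper_id] = parts[-1]

edges = []
with open('cora/cora.cites') as f:
    for line in f:
        cited, citing = line.strip().split('\t')
        edges.append((citing, cited))  # edge direction: citing paper -> cited paper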

MLP, GCN, GAT Node Representation Learning Ability Comparison

Preparation

##Load the dataset
from torch_geometric.datasets import Planetoid
from torch_geometric.transforms import NormalizeFeatures

dataset = Planetoid(root='/home/**/python_file/gnn/dataset', name='Cora', transform=NormalizeFeatures())
data = dataset[0]
print(data)
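
Beyond print(data), the PyTorch Geometric Data object exposes a few summary attributes that make useful sanity checks:

##Quick sanity checks on the loaded graph
print(f'Number of nodes: {data.num_nodes}')
print(f'Number of edges: {data.num_edges}')
print(f'Number of node features: {dataset.num_features}')
print(f'Number of classes: {dataset.num_classes}')
print(f'Number of training nodes: {int(data.train_mask.sum())}')
print(f'Is undirected: {data.is_undirected()}')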

##Node Representation Visualization Helper (t-SNE)
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def visualize(h, color):
    z = TSNE(n_components=2).fit_transform(h.detach().cpu().numpy())
    plt.figure(figsize=(10,10))
    plt.xticks([])
    plt.yticks([])

    plt.scatter(z[:, 0], z[:, 1], s=70, c=color, cmap="Set2")
    plt.show()
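
As a baseline for comparison, the helper can also be applied to the raw bag-of-words features themselves (assuming the dataset has been loaded as above):

##Example usage: project the raw input features with t-SNE
visualize(data.x, color=data.y)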

Application of MLP in Graph Node Classification Task

##MLP Graph Node Classifier
import torch
from torch.nn import Linear
import torch.nn.functional as F

class MLP(torch.nn.Module):
    def __init__(self, hidden_channels):
        super(MLP, self).__init__()
        torch.manual_seed(12345)
        self.lin1 = Linear(dataset.num_features, hidden_channels)
        self.lin2 = Linear(hidden_channels, dataset.num_classes)

    def forward(self, x):
        x = self.lin1(x)
        x = x.relu()
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.lin2(x)
        return x

# Simple training MLP
model = MLP(hidden_channels=16)
print(model)
criterion = torch.nn.CrossEntropyLoss()  # Define loss criterion.
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)  # Define optimizer.
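
As a quick sanity check on the model size, we can count the trainable parameters; with 1433 input features, 16 hidden channels, and 7 classes this should come to (1433 * 16 + 16) + (16 * 7 + 7) = 23063:

##Count trainable parameters as a sanity check
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'Trainable parameters: {num_params}')  # expected: 23063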

def train():
    model.train()
    optimizer.zero_grad()  # Clear gradients.
    out = model(data.x)  # Perform a single forward pass.
    loss = criterion(out[data.train_mask], data.y[data.train_mask])  # Compute the loss solely based on the training nodes.
    loss.backward()  # Derive gradients.
    optimizer.step()  # Update parameters based on gradients.
    return loss

for epoch in range(1, 201):
    loss = train()
    # print(f'Epoch: {epoch:03d}, Loss: {loss:.4f}')

##Test results
def test():
    model.eval()
    out = model(data.x)
    pred = out.argmax(dim=1)  # Use the class with highest probability.
    test_correct = pred[data.test_mask] == data.y[data.test_mask]  # Check against ground-truth labels.
    test_acc = int(test_correct.sum()) / int(data.test_mask.sum())  # Derive ratio of correct predictions.
    return test_acc

test_acc = test()
print(f'Test Accuracy: {test_acc:.4f}')
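
The Planetoid split also provides a validation mask, so the same evaluation logic can be written once for an arbitrary mask. A small generalization (not part of the original code) might look like this:

##Evaluate accuracy on an arbitrary node mask (e.g. val_mask or test_mask)
def accuracy(mask):
    model.eval()
    with torch.no_grad():
        pred = model(data.x).argmax(dim=1)  # class with highest score per node
    correct = (pred[mask] == data.y[mask]).sum()
    return int(correct) / int(mask.sum())

print(f'Val Accuracy: {accuracy(data.val_mask):.4f}')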

Application of GCN in Graph Node Classification Task

##GCN Graph Node Classifier
from torch_geometric.nn import GCNConv

class GCN(torch.nn.Module):
    def __init__(self, hidden_channels):
        super(GCN, self).__init__()
        torch.manual_seed(12345)
        self.conv1 = GCNConv(dataset.num_features, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, dataset.num_classes)

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index)
        x = x.relu()
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.conv2(x, edge_index)
        return x

model = GCN(hidden_channels=16)
print(model)

##Visualize node representations from the untrained model
model = GCN(hidden_channels=16)
model.eval()

out = model(data.x, data.edge_index)
visualize(out, color=data.y)

##Training GCN Classifier
model = GCN(hidden_channels=16)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)
criterion = torch.nn.CrossEntropyLoss()

def train():
      model.train()
      optimizer.zero_grad()  # Clear gradients.
      out = model(data.x, data.edge_index)  # Perform a single forward pass.
      loss = criterion(out[data.train_mask], data.y[data.train_mask])  # Compute the loss solely based on the training nodes.
      loss.backward()  # Derive gradients.
      optimizer.step()  # Update parameters based on gradients.
      return loss

for epoch in range(1, 201):
    loss = train()
    # print(f'Epoch: {epoch:03d}, Loss: {loss:.4f}')

##Test results
def test():
      model.eval()
      out = model(data.x, data.edge_index)
      pred = out.argmax(dim=1)  # Use the class with highest probability.
      test_correct = pred[data.test_mask] == data.y[data.test_mask]  # Check against ground-truth labels.
      test_acc = int(test_correct.sum()) / int(data.test_mask.sum())  # Derive ratio of correct predictions.
      return test_acc

test_acc = test()
print(f'Test Accuracy: {test_acc:.4f}')

##Visualize node representations after training
model.eval()

out = model(data.x, data.edge_index)
visualize(out, color=data.y)

Application of GAT in Graph Node Classification Task

##GAT Graph Node Classifier
import torch
from torch.nn import Linear
import torch.nn.functional as F

from torch_geometric.nn import GATConv

class GAT(torch.nn.Module):
    def __init__(self, hidden_channels):
        super(GAT, self).__init__()
        torch.manual_seed(12345)
        self.conv1 = GATConv(dataset.num_features, hidden_channels)
        self.conv2 = GATConv(hidden_channels, dataset.num_classes)

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index)
        x = x.relu()
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.conv2(x, edge_index)
        return x

model = GAT(hidden_channels=16)
print(model)
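
Note that the GAT paper uses multiple attention heads, while the model above relies on GATConv's default of a single head. A multi-head variant is sketched below for illustration only (the rest of this section keeps the single-head model). With concatenation, the first layer's output dimension becomes hidden_channels * heads:

##Multi-head GAT variant (sketch, for illustration only)
class MultiHeadGAT(torch.nn.Module):
    def __init__(self, hidden_channels, heads=8):
        super(MultiHeadGAT, self).__init__()
        torch.manual_seed(12345)
        # With concat=True (the default), the layer outputs hidden_channels * heads features.
        self.conv1 = GATConv(dataset.num_features, hidden_channels, heads=heads)
        # A single averaging head produces the final class scores.
        self.conv2 = GATConv(hidden_channels * heads, dataset.num_classes,
                             heads=1, concat=False)

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index)
        x = F.elu(x)  # the GAT paper uses ELU rather than ReLU
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.conv2(x, edge_index)
        return x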

##Visualize node representations from the untrained model
model = GAT(hidden_channels=16)
model.eval()

out = model(data.x, data.edge_index)
visualize(out, color=data.y)

##Train GAT Classifier
model = GAT(hidden_channels=16)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)
criterion = torch.nn.CrossEntropyLoss()

def train():
      model.train()
      optimizer.zero_grad()  # Clear gradients.
      out = model(data.x, data.edge_index)  # Perform a single forward pass.
      loss = criterion(out[data.train_mask], data.y[data.train_mask])  # Compute the loss solely based on the training nodes.
      loss.backward()  # Derive gradients.
      optimizer.step()  # Update parameters based on gradients.
      return loss

for epoch in range(1, 201):
    loss = train()
    # print(f'Epoch: {epoch:03d}, Loss: {loss:.4f}')

##Test results
def test():
      model.eval()
      out = model(data.x, data.edge_index)
      pred = out.argmax(dim=1)  # Use the class with highest probability.
      test_correct = pred[data.test_mask] == data.y[data.test_mask]  # Check against ground-truth labels.
      test_acc = int(test_correct.sum()) / int(data.test_mask.sum())  # Derive ratio of correct predictions.
      return test_acc

test_acc = test()
print(f'Test Accuracy: {test_acc:.4f}')

##Visualize node representations after training
model.eval()

out = model(data.x, data.edge_index)
visualize(out, color=data.y)


Comparative analysis of results

  • Test accuracy (without any hyperparameter tuning): ACC(GCN) > ACC(GAT) > ACC(MLP)
  • Reason: the MLP node classifier considers only a node's own attributes and ignores the connections between nodes, so it performs worst. The GCN and GAT classifiers take into account both a node's own attributes and those of its neighbors, so they outperform the MLP. This shows that neighbor information matters for node classification.
  • GCN and GAT differ in how they normalize neighbor information during aggregation (see the formulas after this list):
    • GCN computes the normalization factor from the degrees of the center node and its neighbors, while GAT computes it from the learned similarity (attention) between the center node and its neighbors.
    • GCN's normalization therefore depends on the topology of the graph; different nodes and neighborhoods have different degrees, which may hurt generalization in some applications.
    • GAT's normalization depends on learned similarities between the center node and its neighbors, so it is less tied to the graph's topology and tends to generalize better across tasks.
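
Concretely, writing h_i for the representation of node i, N(i) for its neighborhood, d_i for its degree, and W, a for learnable parameters, the two aggregation rules (following the GCN and GAT papers) are:

% GCN: symmetric degree-based normalization
h_i' = \sigma\Big( \sum_{j \in \mathcal{N}(i) \cup \{i\}} \frac{1}{\sqrt{d_i d_j}}\, W h_j \Big)

% GAT: learned attention coefficients serve as the normalization
\alpha_{ij} = \frac{\exp\big(\mathrm{LeakyReLU}(a^{\top} [W h_i \,\Vert\, W h_j])\big)}
                   {\sum_{k \in \mathcal{N}(i) \cup \{i\}} \exp\big(\mathrm{LeakyReLU}(a^{\top} [W h_i \,\Vert\, W h_k])\big)},
\qquad
h_i' = \sigma\Big( \sum_{j \in \mathcal{N}(i) \cup \{i\}} \alpha_{ij}\, W h_j \Big)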
