Layers
This section lists all of the Layers and Modules.
nn.Linear
new nn.Linear(in_size, out_size, device, bias, xavier)
Applies a linear transformation to the input tensor: the input is matrix-multiplied by the w tensor, and the b tensor is added to the result.
Parameters
- in_size (number) - Size of the last dimension of the input data.
- out_size (number) - Size of the last dimension of the output data.
- device (string) - Device on which the model's calculations will run. Either 'cpu' or 'gpu'.
- bias (boolean) - Whether to use a bias term b.
- xavier (boolean) - Whether to use Xavier initialization on the weights.
Learnable Variables
- w - [input_size, output_size] Tensor.
- b - [output_size] Tensor.
Example
>>> const linear = new nn.Linear(10,15,'gpu');
>>> let x = torch.randn([100,50,10], true, 'gpu');
>>> let y = linear.forward(x);
>>> y.shape
// [100, 50, 15]
nn.MultiHeadSelfAttention
new nn.MultiHeadSelfAttention(in_size, out_size, n_heads, n_timesteps, dropout_prob, device)
Applies a self-attention layer on the input tensor.
- Matrix-multiplies the input by Wk, Wq and Wv, resulting in Key, Query and Value tensors.
- Computes attention by multiplying the Query by the transposed Key.
- Applies Mask, Dropout and Softmax to the attention activations.
- Multiplies the result by the Values.
- Multiplies the result by residual_proj.
- Applies a final Dropout.
Parameters
- in_size (number) - Size of the last dimension of the input data.
- out_size (number) - Size of the last dimension of the output data.
- n_heads (number) - Number of parallel attention heads the data is divided into. in_size must be evenly divisible by n_heads.
- n_timesteps (number) - Number of timesteps computed in parallel by the transformer.
- dropout_prob (number) - Probability of randomly dropping an activation during training (to improve regularization).
- device (string) - Device on which the model's calculations will run. Either 'cpu' or 'gpu'.
Learnable Variables
- Wk - [input_size, input_size] Tensor.
- Wq - [input_size, input_size] Tensor.
- Wv - [input_size, input_size] Tensor.
- residual_proj - [input_size, output_size] Tensor.
Example
>>> const att = new nn.MultiHeadSelfAttention(10, 15, 2, 32, 0.2, 'gpu');
>>> let x = torch.randn([100,50,10], true, 'gpu');
>>> let y = att.forward(x);
>>> y.shape
// [100, 50, 15]
nn.FullyConnected
new nn.FullyConnected(in_size, out_size, dropout_prob, device, bias)
Applies a fully-connected layer on the input tensor.
- Matrix-multiplies the input by Linear layer l1, upscaling the input.
- Passes the tensor through ReLU.
- Matrix-multiplies the tensor by Linear layer l2, downscaling the input.
- Passes the tensor through Dropout.
forward(x: Tensor): Tensor {
  // Upscale the input through the first Linear layer:
  let z = this.l1.forward(x);
  // Apply the ReLU non-linearity:
  z = this.relu.forward(z);
  // Downscale through the second Linear layer:
  z = this.l2.forward(z);
  // Apply Dropout:
  z = this.dropout.forward(z);
  return z;
}
Parameters
- in_size (number) - Size of the last dimension of the input data.
- out_size (number) - Size of the last dimension of the output data.
- dropout_prob (number) - Probability of randomly dropping an activation during training (to improve regularization).
- device (string) - Device on which the model's calculations will run. Either 'cpu' or 'gpu'.
- bias (boolean) - Whether to use a bias term b.
Learnable Variables
- l1 - [input_size, 4*input_size] Tensor.
- l2 - [4*input_size, output_size] Tensor.
Example
>>> const fc = new nn.FullyConnected(10, 15, 0.2, 'gpu');
>>> let x = torch.randn([100,50,10], true, 'gpu');
>>> let y = fc.forward(x);
>>> y.shape
// [100, 50, 15]
nn.Block
new nn.Block(in_size, out_size, n_heads, n_timesteps, dropout_prob, device)
Applies a transformer Block layer on the input tensor.
forward(x: Tensor): Tensor {
  // Pass through Layer Norm and Self Attention:
  let z = x.add(this.att.forward(this.ln1.forward(x)));
  // Pass through Layer Norm and Fully Connected:
  z = z.add(this.fcc.forward(this.ln2.forward(z)));
  return z;
}
Parameters
- in_size (number) - Size of the last dimension of the input data.
- out_size (number) - Size of the last dimension of the output data.
- n_heads (number) - Number of parallel attention heads the data is divided into. in_size must be evenly divisible by n_heads.
- n_timesteps (number) - Number of timesteps computed in parallel by the transformer.
- dropout_prob (number) - Probability of randomly dropping an activation during training (to improve regularization).
- device (string) - Device on which the model's calculations will run. Either 'cpu' or 'gpu'.
Learnable Modules
- nn.MultiHeadSelfAttention - Wk, Wq, Wv, residual_proj.
- nn.LayerNorm - gamma, beta.
- nn.FullyConnected - l1, l2.
- nn.LayerNorm - gamma, beta.
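Example
A usage sketch (the sizes below are illustrative assumptions; in_size is kept equal to out_size so the residual additions in forward have matching shapes):
>>> const block = new nn.Block(64, 64, 4, 32, 0.2, 'gpu');
>>> let x = torch.randn([100, 32, 64], true, 'gpu');
>>> let y = block.forward(x);
>>> y.shape
// [100, 32, 64]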
nn.Embedding
new nn.Embedding(in_size, embed_size)
Embedding table with a number of embeddings equal to the vocabulary size of the model (in_size), and a size of each embedding equal to embed_size. For each element (an integer) in the input tensor, returns the embedding indexed by that integer.
forward(idx: Tensor): Tensor {
  // Get embeddings indexed by input (idx):
  let x = this.E.at(idx);
  return x;
}
Parameters
- in_size (number) - Number of different classes the model can predict (vocabulary size).
- embed_size (number) - Dimension of each embedding generated.
Learnable Parameters
- E - [vocab_size, embed_size] Tensor.
Example
>>> const batch_size = 32;
>>> const number_of_timesteps = 256;
>>> const embed = new nn.Embedding(10, 64);
>>> let x = torch.randint(0, 10, [batch_size, number_of_timesteps]);
>>> let y = embed.forward(x);
>>> y.shape
// [32, 256, 64]
nn.PositionalEmbedding
new nn.PositionalEmbedding(input_size, embed_size)
Embedding table with a number of embeddings equal to the input size of the model (input_size), and a size of each embedding equal to embed_size. For each element in the input tensor, returns the embedding indexed by its position.
forward(idx: Tensor): Tensor {
  // Get dimensions of the input:
  const [B, T] = idx.shape;
  // Get positional embeddings for each element along "T": (Batch, Timesteps) => (Batch, Timesteps, Embed)
  const x = this.E.at([...Array(T).keys()]);
  return x;
}
Parameters
- input_size (number) - Number of different embeddings in the lookup table (size of the input).
- embed_size (number) - Dimension of each embedding generated.
Learnable Parameters
- E - [input_size, embed_size] Tensor.
Example
>>> const batch_size = 32;
>>> const number_of_timesteps = 256;
>>> const embed = new nn.PositionalEmbedding(number_of_timesteps, 64);
>>> let x = torch.randint(0, 10, [batch_size, number_of_timesteps]);
>>> let y = embed.forward(x);
>>> y.shape
// [32, 256, 64]
nn.ReLU
new nn.ReLU()
Rectified Linear Unit activation function. This implementation is leaky for stability. For each element in the incoming tensor:
- If the element is positive, it is left unchanged.
- If the element is negative, it is multiplied by 0.001.
Parameters
- None
Learnable Parameters
- None
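Example
A minimal sketch (assuming torch.tensor builds a tensor from a nested array, in line with the torch API used in the other examples):
>>> const relu = new nn.ReLU();
>>> let x = torch.tensor([[-100, 0.5, 10]]);
>>> let y = relu.forward(x);
>>> y.data
// [[-0.1, 0.5, 10]]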
nn.Softmax
new nn.Softmax()
Softmax activation function. Rescales the data in the input tensor along the dim dimension, so that the sum of the elements along this dimension is one and every element is between zero and one.
forward(x: Tensor, dim = -1): Tensor {
  // Exponentiate the input, then normalize along "dim" so the values sum to one:
  const z = exp(x);
  const out = z.div(z.sum(dim, true));
  return out;
}
Parameters
- None
Learnable Parameters
- None
Example
>>> const softmax = new nn.Softmax();
>>> let x = torch.randn([2,4]);
>>> let y = softmax.forward(x, -1);
>>> y.data
// [[0.1, 0.2, 0.7, 0.0],
// [0.6, 0.1, 0.2, 0.1]]
nn.Dropout
new nn.Dropout(drop_prob: number)
Dropout class. For each element in the input tensor, there is a drop_prob chance that it is set to zero.
Parameters
- drop_prob (number) - Probability to drop each value in input, from 0 to 1.
Learnable Parameters
- None
Example
>>> const dropout = new nn.Dropout(0.5);
>>> let x = torch.ones([2,4]);
>>> let y = dropout.forward(x);
>>> y.data
// [[1, 0, 0, 1],
// [0, 1, 0, 1]]
nn.LayerNorm
new nn.LayerNorm(n_embed: number)
LayerNorm class. Normalizes the data, with a mean of 0 and standard deviation of 1, across the last dimension. This is done as described in the LayerNorm paper.
Parameters
- n_embed (number) - Size of the last dimension of the input.
Learnable Parameters
- gamma (number) - Constant to multiply output by (initialized as 1).
- beta (number) - Constant to add to output (initialized as 0).
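Example
A usage sketch (the shapes are illustrative assumptions; the last dimension of the input must equal n_embed):
>>> const layernorm = new nn.LayerNorm(64);
>>> let x = torch.randn([32, 10, 64]);
>>> let y = layernorm.forward(x);
>>> y.shape
// [32, 10, 64]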
nn.CrossEntropyLoss
new nn.CrossEntropyLoss()
Cross Entropy Loss function. Computes the cross entropy loss between the target and the input tensor.
- First, calculates softmax of input tensor.
- Then, selects the elements of the input corresponding to the correct class in the target.
- Gets the negative log of these elements.
- Adds all of them, and divides by the number of elements.
Parameters
- None
Learnable Parameters
- None
Example
>>> const number_of_classes = 10;
>>> const input_size = 64;
>>> const loss_func = new nn.CrossEntropyLoss();
>>> let x = torch.randn([input_size, number_of_classes]);
>>> let y = torch.randint(0, number_of_classes, [input_size]);
>>> let loss = loss_func.forward(x, y);
>>> loss.data
// 2.3091357
nn.MSELoss
new nn.MSELoss()
Mean Squared Error Loss function. This function calculates the mean squared error between the target and the input tensor, which is a measure of the average squared difference between predictions and actual values.
- First, computes the element-wise difference between the target and the input tensor.
- Next, squares each of these differences to eliminate negative values and emphasize larger errors.
- Finally, sums all the squared differences and divides by the total number of elements to calculate the mean.
Parameters
- None
Learnable Parameters
- None
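Example
A usage sketch of nn.MSELoss (the shapes are illustrative assumptions, and the printed value is omitted because it depends on the random inputs):
>>> const loss_func = new nn.MSELoss();
>>> let x = torch.randn([64, 10]);
>>> let y = torch.randn([64, 10]);
>>> let loss = loss_func.forward(x, y);
>>> loss.data
// a scalar: the mean squared error between x and y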