VisionTransformers

VisionTransformers is a pure Julia package implementing various vision transformer models in Flux.

Available Models

Model | Source                                                                                      | Implemented
------|---------------------------------------------------------------------------------------------|------------
ViT   | An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale                  | ✓
PVT   | Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions  | ✓
CvT   | CvT: Introducing Convolutions to Vision Transformers                                        | ✓
SWIN  | Swin Transformer: Hierarchical Vision Transformer using Shifted Windows                     | ✓
Twins | Twins: Revisiting the Design of Spatial Attention in Vision Transformers                    | ✓
CaiT  | Going deeper with Image Transformers                                                        | ✗

Models

VisionTransformers.ViT (Function)
ViT(config::Symbol; kw...)
ViT(dim::Integer, nheads::Integer, depth::Integer;
    imsize=(224,224), patchsize=(16,16), inchannels=3,
    nclasses=1000, mlp_ratio=4, qkv_bias=true,
    dropout=0.1, attn_dropout=0.1, drop_path=0.0, 
    class_token=false, register_tokens=0)

Construct a Vision Transformer (ViT) model for image classification. The input image is split into non-overlapping patches, each patch is linearly embedded into a dim-dimensional vector, and a sequence of Transformer blocks is applied. Optionally, a learnable class token may be prepended to the patch sequence for classification.

Arguments

  • config: One of :tiny, :small, :base, :large, or :huge.
  • dim: Embedding dimension of the patch tokens.
  • nheads: Number of attention heads in each Transformer block.
  • depth: Number of Transformer blocks in the encoder.
  • imsize: Input image size (height, width). Default (224,224).
  • patchsize: Patch size (height, width). Default (16,16).
  • inchannels: Number of input image channels. Default is 3.
  • nclasses: Number of output classes for classification. Default is 1000.
  • mlp_ratio: Expansion ratio for the hidden dimension of the MLP relative to dim. Default is 4.
  • qkv_bias: Whether to add a bias to query, key, and value projections. Default is true.
  • dropout: Dropout probability applied to embeddings, MLP, and attention outputs. Default is 0.1.
  • attn_dropout: Dropout probability applied to attention weights. Default is 0.1.
  • drop_path: Probability for stochastic depth (drop-path) regularization. Default is 0.0.
  • class_token: Whether to prepend a learnable class token to the patch sequence for classification. If false, global average pooling is applied instead.
  • register_tokens: Number of additional learnable tokens to prepend to the patch sequence. Default is 0.
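As a sanity check on these defaults, the patch-sequence length follows directly from imsize and patchsize. A minimal sketch (npatches is a hypothetical helper for illustration, not part of the package):

```julia
# Number of patch tokens produced by non-overlapping patching
# (illustrative helper; not exported by VisionTransformers).
npatches(imsize, patchsize) = prod(div.(imsize, patchsize))

npatches((224, 224), (16, 16))  # 14 × 14 = 196 tokens
```

With the defaults above, a 224x224 image therefore becomes a sequence of 196 dim-dimensional tokens, plus any class or register tokens.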

VisionTransformers.CvT (Function)
CvT(config::Symbol; kw...)
CvT(dim::Int, depths, nheads;
    inchannels=3, mlp_ratio=4,
    dropout=0.1, drop_path=0.0,
    nclasses=1000)

Construct a Convolutional Vision Transformer (CvT) model for image classification. CvT extends the Vision Transformer (ViT) by replacing linear projections with convolutional projections and by introducing convolutional token embeddings. This design enhances locality and translation invariance while retaining the benefits of Transformer-based global modeling.

Arguments

  • config: One of :B13, :B21, or :W24.
  • dim: Embedding dimension of patch tokens in the first stage.
  • depths: Number of Transformer blocks in each stage.
  • nheads: Number of attention heads in each stage.
  • inchannels: Number of input image channels. Default is 3.
  • mlp_ratio: Expansion ratio for the hidden dimension of the MLP relative to dim. Default is 4.
  • dropout: Dropout probability applied to embeddings, MLP, and attention outputs.
  • drop_path: Probability for stochastic depth (drop-path) regularization.
  • nclasses: Number of output classes for classification. Default is 1000.

VisionTransformers.PVT (Function)
PVT(config::Symbol; kw...)
PVT(embed_dims, depths, nheads;
    mlp_ratios=[8,8,4,4], sr_ratios=[8,4,2,1], qkv_bias=true, dropout=0.1, 
    drop_path=0.0, imsize=224, inchannels=3, nclasses=1000)

Construct a Pyramid Vision Transformer (PVT) model for image classification or feature extraction. PVT builds a hierarchical representation by progressively reducing spatial resolution while increasing the embedding dimension. Spatial Reduction Attention (SRA) is used to make attention efficient on high-resolution feature maps.

Arguments

  • config: One of :tiny, :small, :medium, or :large.
  • embed_dims: Embedding dimension for each stage of the network.
  • depths: Number of transformer blocks in each stage.
  • nheads: Number of attention heads in each stage.
  • mlp_ratios: Expansion ratio for the hidden dimension of the MLP in each stage. Default is [8,8,4,4].
  • sr_ratios: Spatial reduction ratio for SRA in each stage. Default is [8,4,2,1].
  • qkv_bias: Whether to add a bias to query, key, and value projections. Default is true.
  • dropout: Dropout probability applied to MLP and attention outputs.
  • drop_path: Probability for stochastic depth (drop-path) regularization.
  • imsize: Input image size (assumed square). Default is 224.
  • inchannels: Number of input image channels. Default is 3.
  • nclasses: Number of output classes for classification. Default is 1000.
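To see why spatial reduction matters, compare the query and key/value sequence lengths per stage for a 224x224 input. The per-stage strides 4, 2, 2, 2 follow the PVT paper and are an assumption about this implementation; the arithmetic below is only illustrative:

```julia
imsize = 224
stage_strides = [4, 2, 2, 2]                       # assumed downsampling per stage
sr_ratios = [8, 4, 2, 1]                           # SRA defaults listed above
sides = accumulate(÷, stage_strides; init=imsize)  # feature-map side: 56, 28, 14, 7
Lq  = sides .^ 2                                   # query sequence length per stage
Lkv = (sides .÷ sr_ratios) .^ 2                    # key/value length after reduction
```

With these ratios each stage attends over only 49 key/value tokens, so the attention matrix stays small even on the 56x56 first-stage feature map (3136 queries by 49 keys rather than 3136 by 3136).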

VisionTransformers.SWIN (Function)
SWIN(config::Symbol; kw...)
SWIN(embed_dims, depths, nheads;
     window_size=7, position_embedding=true, mlp_ratio=4,
     qkv_bias=true, dropout=0.1, drop_path=0.0,
     inchannels=3, nclasses=1000)

Construct a Swin Transformer model for image classification or feature extraction. The architecture is composed of multiple stages of Shifted Windowed Multi-Head Self-Attention (SW-MSA) and MLP blocks. The model progressively merges patches while increasing channel dimension, enabling hierarchical feature learning.

Arguments

  • config: One of :tiny, :small, :base, or :large.
  • embed_dims: Embedding dimension for each stage of the network.
  • depths: Number of transformer blocks in each stage.
  • nheads: Number of attention heads in each stage.
  • window_size: Spatial size of the attention window. Default is 7.
  • position_embedding: Whether to include learnable relative position embeddings in attention.
  • mlp_ratio: Expansion ratio for the hidden dimension of the MLP relative to the embedding dimension. Default is 4.
  • qkv_bias: Whether to add bias to query, key, and value projections.
  • dropout: Dropout probability applied to MLP and attention outputs.
  • drop_path: Probability for stochastic depth (drop-path) regularization.
  • inchannels: Number of input image channels. Default is 3.
  • nclasses: Number of output classes for classification. Default is 1000.

VisionTransformers.Twins (Function)
Twins(config::Symbol; kw...)
Twins(embed_dims, depths, nheads; 
      window_sizes=[7,7,7,7], sr_ratios=[8,4,2,1], mlp_ratio=4, 
      qkv_bias=true, dropout=0.1, attn_dropout_prob=0.1, 
      drop_path=0.0, inchannels=3, nclasses=1000)

Construct a Twins Vision Transformer model. The architecture consists of alternating local and global attention mechanisms to effectively capture both local and global features in images.

Parameters

  • config: Predefined configuration symbol. One of :small, :base, or :large.
  • embed_dims: Embedding dimensions for each stage.
  • depths: Number of transformer blocks in each stage.
  • nheads: Number of attention heads in each stage.
  • window_sizes: Spatial size of the attention window for each stage. Default is [7,7,7,7].
  • sr_ratios: Spatial reduction ratios for SRAttention in each stage. Default is [8,4,2,1].
  • mlp_ratio: Expansion ratio for the hidden dimension of the MLP relative to the embedding dimension. Default is 4.
  • qkv_bias: Whether to add bias to query, key, and value projections. Default is true.
  • dropout: Dropout probability applied to embeddings, MLP, and attention outputs. Default is 0.1.
  • attn_dropout_prob: Dropout probability applied to attention weights. Default is 0.1.
  • drop_path: Probability for stochastic depth (drop-path) regularization. Default is 0.0.
  • inchannels: Number of input image channels. Default is 3.
  • nclasses: Number of output classes for classification. Default is 1000.

Layers

VisionTransformers.MultiHeadAttention (Type)
MultiHeadAttention(dim::Integer; nheads=8, attn_dropout_prob=0.0, 
                   proj_dropout_prob=0.0, qkv_bias=false)

Construct a multi-head self-attention layer. The input features are linearly projected into query, key, and value tensors. Expects tensors of size C x S x N, where C is the embedding dimension, S is the sequence length, and N is the batch size. Note that S can have more than one dimension.

Arguments

  • dim: Dimensionality of the input feature embeddings.
  • nheads: Number of parallel attention heads. Default is 8.
  • attn_dropout_prob: Dropout probability applied to attention weights.
  • proj_dropout_prob: Dropout probability applied after the output projection.
  • qkv_bias: Whether to add learnable bias terms to the query, key, and value projections.
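For intuition, the core computation can be sketched for a single head in plain Julia. This re-derivation omits the learned projections, the split of dim across nheads, and dropout; it is not the package's implementation:

```julia
# Scaled dot-product attention over C x S x N tensors, single head,
# no projections or dropout (illustrative sketch only).
function sdpa(q::AbstractArray{<:Real,3}, k::AbstractArray{<:Real,3}, v::AbstractArray{<:Real,3})
    C, S, N = size(q)
    y = similar(q)
    for n in 1:N
        scores = (k[:, :, n]' * q[:, :, n]) ./ sqrt(C)  # S x S similarity logits
        α = exp.(scores .- maximum(scores; dims=1))
        α ./= sum(α; dims=1)                            # softmax over the key axis
        y[:, :, n] = v[:, :, n] * α                     # weighted sum of values
    end
    return y
end
```

The actual layer runs nheads of these computations in parallel on dim ÷ nheads slices and mixes the results with a final output projection.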

VisionTransformers.ConvAttention (Type)
ConvAttention(dim::Int; kernel=(3,3), q_stride=(1,1), kv_stride=(1,1), nheads=8, 
              attn_dropout_prob=0.0, proj_dropout_prob=0.0, qkv_bias=false)

A convolutional attention layer as proposed in CvT. The queries, keys, and values are computed using convolutional layers. Computational complexity is reduced by downsampling the keys and values with a kv_stride greater than 1.

Parameters

  • dim: The dimension of the feature embedding.
  • nheads: The number of heads to use for self attention.
  • kernel: The kernel size used in the convolutional projection layers.
  • attn_dropout_prob: Dropout probability in the attention block.
  • proj_dropout_prob: Dropout probability in the projection block.
  • q_stride: Convolutional stride used to compute the query.
  • kv_stride: Convolutional stride used to compute the key and value.
  • qkv_bias: Whether to include a bias term in the convolutional projection layers.

VisionTransformers.SRAttention (Type)
SRAttention(dim::Int; nheads=8, qkv_bias=false, attn_dropout_prob=0.0,
            proj_dropout_prob=0.0, sr_ratio=(1,1), sr_method=:conv)

Construct a Spatial Reduction Attention (SRA) layer as used in PVT and Twins. This is a variant of multi-head self-attention designed to reduce computational cost on high-resolution feature maps by spatially downsampling the keys and values before attention. The reduction can be performed using convolutional or pooling methods.

Arguments

  • dim: Dimensionality of the input feature embeddings.
  • nheads: Number of parallel attention heads. Default is 8.
  • qkv_bias: Whether to add learnable bias terms to query, key, and value projections.
  • attn_dropout_prob: Dropout probability applied to the attention weights.
  • proj_dropout_prob: Dropout probability applied after the output projection.
  • sr_ratio: Spatial reduction ratio (height, width) applied to keys and values. (1,1) means no reduction.
  • sr_method: Method used for spatial reduction. Must be one of :conv or :pool.

VisionTransformers.WindowedAttention (Type)
WindowedAttention(dim::Integer; window_size=(7,7), shift_size=(0,0),
                  position_embedding=false, nheads=8, qkv_bias=false,
                  attn_dropout_prob=0.0, proj_dropout_prob=0.0)

Construct a windowed multi-head self-attention module, as used in the SWIN Transformer. The input is partitioned into non-overlapping windows of size window_size, and attention is computed within each window. Optionally, the windows can be shifted by shift_size to enable cross-window connections.

Arguments

  • dim: Dimensionality of the input feature embeddings.
  • window_size: Spatial size of each attention window (height, width).
  • shift_size: Offset applied to window partitions to enable connections across windows. Default (0,0) means no shift.
  • position_embedding: Whether to include learnable relative position embeddings in the attention computation.
  • nheads: Number of attention heads.
  • qkv_bias: Whether to add learnable bias parameters to the query, key, and value projections.
  • attn_dropout_prob: Dropout probability applied to the attention weights.
  • proj_dropout_prob: Dropout probability applied to the output projection.
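The partitioning step itself is only a reshape. A plausible pure-Julia sketch (window_partition is a hypothetical helper; it assumes the spatial sizes divide evenly by the window size):

```julia
# Split a W x H x C x N feature map into non-overlapping wW x wH windows,
# returning a wW x wH x C x (nwindows * N) array (illustrative sketch).
function window_partition(x::AbstractArray{<:Any,4}, wsize::NTuple{2,Int})
    W, H, C, N = size(x)
    wW, wH = wsize
    x = reshape(x, wW, W ÷ wW, wH, H ÷ wH, C, N)
    x = permutedims(x, (1, 3, 5, 2, 4, 6))
    return reshape(x, wW, wH, C, :)
end
```

Attention is then computed independently within each window slab; shifting the feature map by shift_size before partitioning is what lets information flow between neighbouring windows.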

VisionTransformers.AbsolutePositionEmbedding (Type)
AbsolutePositionEmbedding(dim::Int, imsize::Tuple, patchsize::Tuple; init=zeros32, extra_tokens=0)

Absolute position embedding layer that adds learnable positional embeddings to the input tensor.

Parameters

  • dim: The dimensionality of the embeddings.
  • imsize: A tuple representing the size of the input image (height, width[, depth]).
  • patchsize: A tuple representing the size of the patches (height, width[, depth]).
  • init: A function to initialize the positional embeddings (default is zeros32).
  • extra_tokens: The number of extra tokens to add to the positional embeddings (default is 0).

Input

An input tensor of shape WHCN or WHDCN where C is the embedding dimension and N is the batch size. The embedding will be resampled via bilinear interpolation if the input size does not match the initialized size.

VisionTransformers.VariablePositionEmbedding (Type)
VariablePositionEmbedding(dim::Int, imsize::Tuple, patchsize::Tuple; init=zeros32)

Variable position embedding layer that adds learnable positional embeddings to the input tensor.

Parameters

  • dim: The dimensionality of the embeddings.
  • imsize: A tuple representing the size of the input image (height, width[, depth]).
  • patchsize: A tuple representing the size of the patches (height, width[, depth]).
  • init: A function to initialize the positional embeddings (default is zeros32).

Input

An input tensor of shape WHCN or WHDCN where C is the embedding dimension and N is the batch size. The embedding will be resampled via bilinear interpolation if the input size does not match the initialized size.

VisionTransformers.RelativePositionEmbedding (Type)
RelativePositionEmbedding(dim::Int, nheads::Int, window_size::Tuple; init=zeros32)

Relative position embedding layer that adds learnable relative positional biases to the attention scores.

Parameters

  • dim: The dimensionality of the embeddings.
  • nheads: The number of attention heads.
  • window_size: A tuple representing the size of the attention window (height, width[, depth]).
  • init: A function to initialize the relative positional biases (default is zeros32).

Input

An attention scores tensor of shape [Ww*Wh x Ww*Wh x nheads x 1] where Ww and Wh are the window width and height, respectively. For 3D data, the shape is [Ww*Wh*Wd x Ww*Wh*Wd x nheads x 1] where Wd is the window depth.

Note

We define an overloaded version of Flux.NNlib.apply_attn_bias to apply the relative position embedding to the attention scores. Thus, it is sufficient to pass this layer to Flux.NNlib.dot_product_attention as the bias argument.

Example

# split the packed projection into queries, keys, and values
q, k, v = Flux.chunk(qkv(x), 3; dims=1)
# pass the RelativePositionEmbedding layer directly as the attention bias
y, α = Flux.dot_product_attention(q, k, v, position_embedding; nheads)

VisionTransformers.PEG (Function)
PEG(dim::Int; kernel=(3,3), bias=true)

Position Embedding Generator (PEG) layer that applies a depthwise convolution to the input tensor to encode positional information.

Parameters

  • dim: The number of input and output channels.
  • kernel: The size of the convolution kernel (default is (3,3)).
  • bias: Whether to include a bias term in the convolution (default is true).

VisionTransformers.MLP (Function)
MLP(indims, hiddendims, outdims; dropout=0.0, act=Flux.gelu)

Build a two-layer multi-layer perceptron (MLP).

Parameters

  • indims: The dimension of the input features.
  • hiddendims: The dimension of the hidden features.
  • outdims: The dimension of the output features.
  • dropout: The dropout probability following each Dense layer.
  • act: The activation function following the first Dense layer.
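A dependency-free sketch of the same forward pass (DenseSketch, mlp_sketch, and the tanh-based gelu approximation are illustrative only; dropout is omitted):

```julia
# Plain-Julia sketch of a 2-layer MLP: affine -> activation -> affine.
struct DenseSketch{M,V,F}
    W::M
    b::V
    act::F
end
(d::DenseSketch)(x) = d.act.(d.W * x .+ d.b)

# tanh approximation of GELU, applied elementwise
gelu(x) = 0.5x * (1 + tanh(sqrt(2 / π) * (x + 0.044715x^3)))

function mlp_sketch(indims, hiddendims, outdims; act=gelu)
    l1 = DenseSketch(randn(hiddendims, indims) ./ sqrt(indims), zeros(hiddendims), act)
    l2 = DenseSketch(randn(outdims, hiddendims) ./ sqrt(hiddendims), zeros(outdims), identity)
    return x -> l2(l1(x))
end
```

In transformer blocks this is applied tokenwise, with hiddendims typically set to mlp_ratio times the embedding dimension.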

VisionTransformers.SeparableConv (Function)
SeparableConv(dim, kernel::NTuple{N,Int}, stride::NTuple{N,Int}; bias=false) where N

Build a separable convolution layer consisting of a depthwise convolution followed by a pointwise convolution.

Parameters

  • dim: The number of input and output channels.
  • kernel: The size of the convolution kernel.
  • stride: The stride of the convolution.
  • bias: Whether to include a bias term in the pointwise convolution.

VisionTransformers.Tokens (Type)
Tokens(dim, ntokens; init=rand32)

Learnable token embedding layer that prepends a specified number of learnable tokens to the input sequence.

Parameters

  • dim: The dimensionality of the tokens.
  • ntokens: The number of learnable tokens to prepend.
  • init: A function to initialize the token embeddings (default is rand32).

Input

A tensor of shape C x L x N where C is the embedding dimension, L is the sequence length, and N is the batch size. The output has shape C x (L + ntokens) x N with the learnable tokens prepended to the input sequence.
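The prepend operation amounts to broadcasting the token matrix across the batch and concatenating along the sequence axis; a minimal sketch (prepend_tokens is a hypothetical name, not package code):

```julia
# Prepend a C x ntokens token matrix to every sequence in a C x L x N batch.
function prepend_tokens(tokens::AbstractMatrix, x::AbstractArray{<:Any,3})
    N = size(x, 3)
    t = repeat(reshape(tokens, size(tokens)..., 1), 1, 1, N)  # C x ntokens x N
    return cat(t, x; dims=2)                                  # C x (L + ntokens) x N
end
```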

VisionTransformers.StripTokens (Type)
StripTokens(ntokens)

Layer that removes a specified number of tokens from the beginning of the input sequence.

Parameters

  • ntokens: The number of tokens to remove from the beginning of the sequence.

Utilities

VisionTransformers.img2seq (Function)
img2seq(x::AbstractArray{<:Any,4})

Take a collection of image tokens of size [W x H x C x N] and flatten them into a sequence of size [C x L x N] where L = W * H.


VisionTransformers.seq2img (Function)
seq2img(x::AbstractArray{<:Any,3})

Take a sequence of image tokens of size [C x L x N] and reshape it into an image of size [W x H x C x N], where W = H = sqrt(L).

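One plausible pure-Julia implementation of this pair, useful for checking that the two transforms are exact inverses (the _sketch suffix marks these as illustrative, not the package's functions):

```julia
# Flatten W x H x C x N image tokens into a C x L x N sequence, and back.
img2seq_sketch(x::AbstractArray{<:Any,4}) =
    reshape(permutedims(x, (3, 1, 2, 4)), size(x, 3), size(x, 1) * size(x, 2), size(x, 4))

function seq2img_sketch(x::AbstractArray{<:Any,3})
    C, L, N = size(x)
    W = H = Int(sqrt(L))  # assumes a square token grid, as seq2img does
    return permutedims(reshape(x, C, W, H, N), (2, 3, 1, 4))
end
```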