VisionTransformers

VisionTransformers is a pure Julia package implementing various vision transformer models in Flux.

Available Models

Model | Source                                                                                      | Implemented
------|---------------------------------------------------------------------------------------------|------------
ViT   | An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale                  | ✓
PVT   | Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions  | ✓
CvT   | CvT: Introducing Convolutions to Vision Transformers                                        | ✓
SWIN  | Swin Transformer: Hierarchical Vision Transformer using Shifted Windows                     | ✓
Twins | Twins: Revisiting the Design of Spatial Attention in Vision Transformers                    | ✓
CaiT  | Going deeper with Image Transformers                                                        | ✗

Models

VisionTransformers.ViT (Function)
ViT(config::Symbol; kw...)
ViT(dim::Integer, nheads::Integer, depth::Integer;
    imsize=(224,224), patchsize=(16,16), inchannels=3,
    nclasses=1000, mlp_ratio=4, qkv_bias=true,
    dropout=0.1, attn_dropout=0.1, drop_path=0.0, 
    class_token=false, register_tokens=0)

Construct a Vision Transformer (ViT) model for image classification. The input image is split into non-overlapping patches, each patch is linearly embedded into a dim-dimensional vector, and a sequence of Transformer blocks is applied. Optionally, a learnable class token may be prepended to the patch sequence for classification.

Arguments

  • config: One of :tiny, :small, :base, :large, or :huge.
  • dim: Embedding dimension of the patch tokens.
  • nheads: Number of attention heads in each Transformer block.
  • depth: Number of Transformer blocks in the encoder.
  • imsize: Input image size (height, width). Default (224,224).
  • patchsize: Patch size (height, width). Default (16,16).
  • inchannels: Number of input image channels. Default is 3.
  • nclasses: Number of output classes for classification. Default is 1000.
  • mlp_ratio: Expansion ratio for the hidden dimension of the MLP relative to dim. Default is 4.
  • qkv_bias: Whether to add a bias to query, key, and value projections. Default is true.
  • dropout: Dropout probability applied to embeddings, MLP, and attention outputs. Default is 0.1.
  • attn_dropout: Dropout probability applied to attention weights. Default is 0.1.
  • drop_path: Probability for stochastic depth (drop-path) regularization. Default is 0.0.
  • class_token: Whether to prepend a learnable class token to the patch sequence for classification. If false, global average pooling is applied instead.
  • register_tokens: Number of additional learnable tokens to prepend to the patch sequence. Default is 0.
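As a sanity check on these defaults, the patch-sequence length follows directly from imsize and patchsize. A minimal sketch (npatches is a hypothetical helper for illustration, not part of the package):

```julia
# Number of patch tokens produced by non-overlapping patching
# (illustrative helper; not exported by VisionTransformers).
npatches(imsize, patchsize) = prod(div.(imsize, patchsize))

npatches((224, 224), (16, 16))  # 14 × 14 = 196 tokens
```

With the defaults above, a 224x224 image therefore becomes a sequence of 196 dim-dimensional tokens, plus any class or register tokens.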

VisionTransformers.CvT (Function)
CvT(config::Symbol; kw...)
CvT(dim::Int, depths, nheads;
    inchannels=3, mlp_ratio=4,
    dropout=0.1, drop_path=0.0,
    nclasses=1000)

Construct a Convolutional Vision Transformer (CvT) model for image classification. CvT extends the Vision Transformer (ViT) by replacing linear projections with convolutional projections and by introducing convolutional token embeddings. This design enhances locality and translation invariance while retaining the benefits of Transformer-based global modeling.

Arguments

  • config: One of :B13, :B21, or :W24.
  • dim: Embedding dimension of patch tokens in the first stage.
  • depths: Number of Transformer blocks in each stage.
  • nheads: Number of attention heads in each stage.
  • inchannels: Number of input image channels. Default is 3.
  • mlp_ratio: Expansion ratio for the hidden dimension of the MLP relative to dim. Default is 4.
  • dropout: Dropout probability applied to embeddings, MLP, and attention outputs.
  • drop_path: Probability for stochastic depth (drop-path) regularization.
  • nclasses: Number of output classes for classification. Default is 1000.

VisionTransformers.PVT (Function)
PVT(config::Symbol; kw...)
PVT(embed_dims, depths, nheads;
    mlp_ratios=[8,8,4,4], sr_ratios=[8,4,2,1], qkv_bias=true, dropout=0.1, 
    drop_path=0.0, imsize=224, inchannels=3, nclasses=1000)

Construct a Pyramid Vision Transformer (PVT) model for image classification or feature extraction. PVT builds a hierarchical representation by progressively reducing spatial resolution while increasing the embedding dimension. Spatial Reduction Attention (SRA) is used to make attention efficient on high-resolution feature maps.

Arguments

  • config: One of :tiny, :small, :medium, or :large.
  • embed_dims: Embedding dimension for each stage of the network.
  • depths: Number of transformer blocks in each stage.
  • nheads: Number of attention heads in each stage.
  • mlp_ratios: Expansion ratio for the hidden dimension of the MLP in each stage. Default is [8,8,4,4].
  • sr_ratios: Spatial reduction ratio for SRA in each stage. Default is [8,4,2,1].
  • qkv_bias: Whether to add a bias to query, key, and value projections. Default is true.
  • dropout: Dropout probability applied to MLP and attention outputs.
  • drop_path: Probability for stochastic depth (drop-path) regularization.
  • imsize: Input image size (assumed square). Default is 224.
  • inchannels: Number of input image channels. Default is 3.
  • nclasses: Number of output classes for classification. Default is 1000.
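To see why spatial reduction matters, compare the query and key/value sequence lengths per stage for a 224x224 input. The per-stage strides 4, 2, 2, 2 follow the PVT paper and are an assumption about this implementation; the arithmetic below is only illustrative:

```julia
imsize = 224
stage_strides = [4, 2, 2, 2]                       # assumed downsampling per stage
sr_ratios = [8, 4, 2, 1]                           # SRA defaults listed above
sides = accumulate(÷, stage_strides; init=imsize)  # feature-map side: 56, 28, 14, 7
Lq  = sides .^ 2                                   # query sequence length per stage
Lkv = (sides .÷ sr_ratios) .^ 2                    # key/value length after reduction
```

With these ratios each stage attends over only 49 key/value tokens, so the attention matrix stays small even on the 56x56 first-stage feature map (3136 queries by 49 keys rather than 3136 by 3136).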

VisionTransformers.SWIN (Function)
SWIN(config::Symbol; kw...)
SWIN(embed_dims, depths, nheads;
     window_size=7, position_embedding=true, mlp_ratio=4,
     qkv_bias=true, dropout=0.1, drop_path=0.0,
     inchannels=3, nclasses=1000)

Construct a Swin Transformer model for image classification or feature extraction. The architecture is composed of multiple stages of Shifted Windowed Multi-Head Self-Attention (SW-MSA) and MLP blocks. The model progressively merges patches while increasing channel dimension, enabling hierarchical feature learning.

Arguments

  • config: One of :tiny, :small, :base, or :large.
  • embed_dims: Embedding dimension for each stage of the network.
  • depths: Number of transformer blocks in each stage.
  • nheads: Number of attention heads in each stage.
  • window_size: Spatial size of the attention window. Default is 7.
  • position_embedding: Whether to include learnable relative position embeddings in attention.
  • mlp_ratio: Expansion ratio for the hidden dimension of the MLP relative to the embedding dimension. Default is 4.
  • qkv_bias: Whether to add bias to query, key, and value projections.
  • dropout: Dropout probability applied to MLP and attention outputs.
  • drop_path: Probability for stochastic depth (drop-path) regularization.
  • inchannels: Number of input image channels. Default is 3.
  • nclasses: Number of output classes for classification. Default is 1000.

VisionTransformers.Twins (Function)
Twins(config::Symbol; kw...)
Twins(embed_dims, depths, nheads; 
      window_sizes=[7,7,7,7], sr_ratios=[8,4,2,1], mlp_ratio=4, 
      qkv_bias=true, dropout=0.1, attn_dropout_prob=0.1, 
      drop_path=0.0, inchannels=3, nclasses=1000)

Construct a Twins Vision Transformer model. The architecture consists of alternating local and global attention mechanisms to effectively capture both local and global features in images.

Parameters

  • config: Predefined configuration symbol. One of :small, :base, or :large.
  • embed_dims: Embedding dimensions for each stage.
  • depths: Number of transformer blocks in each stage.
  • nheads: Number of attention heads in each stage.
  • window_sizes: Spatial size of the attention window for each stage. Default is [7,7,7,7].
  • sr_ratios: Spatial reduction ratios for SRAttention in each stage. Default is [8,4,2,1].
  • mlp_ratio: Expansion ratio for the hidden dimension of the MLP relative to the embedding dimension. Default is 4.
  • qkv_bias: Whether to add bias to query, key, and value projections. Default is true.
  • dropout: Dropout probability applied to embeddings, MLP, and attention outputs. Default is 0.1.
  • attn_dropout_prob: Dropout probability applied to attention weights. Default is 0.1.
  • drop_path: Probability for stochastic depth (drop-path) regularization. Default is 0.0.
  • inchannels: Number of input image channels. Default is 3.
  • nclasses: Number of output classes for classification. Default is 1000.

Layers

VisionTransformers.MultiHeadAttention (Type)
MultiHeadAttention(dim::Integer; nheads=8, attn_dropout_prob=0.0, 
                   proj_dropout_prob=0.0, qkv_bias=false)

Construct a multi-head self-attention layer. The input features are linearly projected into query, key, and value tensors. Expects tensors of size C x S x N, where C is the embedding dimension, S is the sequence length, and N is the batch size. Note that S can have more than one dimension.

Arguments

  • dim: Dimensionality of the input feature embeddings.
  • nheads: Number of parallel attention heads. Default is 8.
  • attn_dropout_prob: Dropout probability applied to attention weights.
  • proj_dropout_prob: Dropout probability applied after the output projection.
  • qkv_bias: Whether to add learnable bias terms to the query, key, and value projections.
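For intuition, the core computation can be sketched for a single head in plain Julia. This re-derivation omits the learned projections, the split of dim across nheads, and dropout; it is not the package's implementation:

```julia
# Scaled dot-product attention over C x S x N tensors, single head,
# no projections or dropout (illustrative sketch only).
function sdpa(q::AbstractArray{<:Real,3}, k::AbstractArray{<:Real,3}, v::AbstractArray{<:Real,3})
    C, S, N = size(q)
    y = similar(q)
    for n in 1:N
        scores = (k[:, :, n]' * q[:, :, n]) ./ sqrt(C)  # S x S similarity logits
        α = exp.(scores .- maximum(scores; dims=1))
        α ./= sum(α; dims=1)                            # softmax over the key axis
        y[:, :, n] = v[:, :, n] * α                     # weighted sum of values
    end
    return y
end
```

The actual layer runs nheads of these computations in parallel on dim ÷ nheads slices and mixes the results with a final output projection.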

VisionTransformers.ConvAttention (Type)
ConvAttention(dim::Int; kernel=(3,3), q_stride=(1,1), kv_stride=(1,1), nheads=8, 
              attn_dropout_prob=0.0, proj_dropout_prob=0.0, qkv_bias=false)

A convolutional attention layer as proposed in CvT. The queries, keys, and values are computed using convolutional layers. Computational complexity is reduced by downsampling the keys and values with a kv_stride greater than 1.

Parameters

  • dim: The dimension of the feature embedding.
  • nheads: The number of heads to use for self attention.
  • kernel: The kernel size used in the convolutional projection layers.
  • attn_dropout_prob: Dropout probability in the attention block.
  • proj_dropout_prob: Dropout probability in the projection block.
  • q_stride: Convolutional stride used to compute the query.
  • kv_stride: Convolutional stride used to compute the key and value.
  • qkv_bias: Whether to include a bias term in the convolutional projection layers.

VisionTransformers.SRAttention (Type)
SRAttention(dim::Int; nheads=8, qkv_bias=false, attn_dropout_prob=0.0,
            proj_dropout_prob=0.0, sr_ratio=(1,1), sr_method=:conv)

Construct a Spatial Reduction Attention (SRA) layer as used in PVT and Twins. This is a variant of multi-head self-attention designed to reduce computational cost on high-resolution feature maps by spatially downsampling the keys and values before attention. The reduction can be performed using convolutional or pooling methods.

Arguments

  • dim: Dimensionality of the input feature embeddings.
  • nheads: Number of parallel attention heads. Default is 8.
  • qkv_bias: Whether to add learnable bias terms to query, key, and value projections.
  • attn_dropout_prob: Dropout probability applied to the attention weights.
  • proj_dropout_prob: Dropout probability applied after the output projection.
  • sr_ratio: Spatial reduction ratio (height, width) applied to keys and values. (1,1) means no reduction.
  • sr_method: Method used for spatial reduction. Must be one of :conv or :pool.

VisionTransformers.WindowedAttention (Type)
WindowedAttention(dim::Integer; window_size=(7,7), shift_size=(0,0),
                  position_embedding=false, nheads=8, qkv_bias=false,
                  attn_dropout_prob=0.0, proj_dropout_prob=0.0)

Construct a windowed multi-head self-attention module, as used in the SWIN Transformer. The input is partitioned into non-overlapping windows of size window_size, and attention is computed within each window. Optionally, the windows can be shifted by shift_size to enable cross-window connections.

Arguments

  • dim: Dimensionality of the input feature embeddings.
  • window_size: Spatial size of each attention window (height, width).
  • shift_size: Offset applied to window partitions to enable connections across windows. Default (0,0) means no shift.
  • position_embedding: Whether to include learnable relative position embeddings in the attention computation.
  • nheads: Number of attention heads.
  • qkv_bias: Whether to add learnable bias parameters to the query, key, and value projections.
  • attn_dropout_prob: Dropout probability applied to the attention weights.
  • proj_dropout_prob: Dropout probability applied to the output projection.
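The partitioning step itself is only a reshape. A plausible pure-Julia sketch (window_partition is a hypothetical helper; it assumes the spatial sizes divide evenly by the window size):

```julia
# Split a W x H x C x N feature map into non-overlapping wW x wH windows,
# returning a wW x wH x C x (nwindows * N) array (illustrative sketch).
function window_partition(x::AbstractArray{<:Any,4}, wsize::NTuple{2,Int})
    W, H, C, N = size(x)
    wW, wH = wsize
    x = reshape(x, wW, W ÷ wW, wH, H ÷ wH, C, N)
    x = permutedims(x, (1, 3, 5, 2, 4, 6))
    return reshape(x, wW, wH, C, :)
end
```

Attention is then computed independently within each window slab; shifting the feature map by shift_size before partitioning is what lets information flow between neighbouring windows.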

VisionTransformers.AbsolutePositionEmbedding (Type)
AbsolutePositionEmbedding(dim::Int, imsize::Tuple, patchsize::Tuple; init=zeros32, extra_tokens=0)

Absolute position embedding layer that adds learnable positional embeddings to the input tensor.

Parameters

  • dim: The dimensionality of the embeddings.
  • imsize: A tuple representing the size of the input image (height, width[, depth]).
  • patchsize: A tuple representing the size of the patches (height, width[, depth]).
  • init: A function to initialize the positional embeddings (default is zeros32).
  • extra_tokens: The number of extra tokens to add to the positional embeddings (default is 0).

Input

An input tensor of shape WHCN or WHDCN where C is the embedding dimension and N is the batch size. The embedding will be resampled via bilinear interpolation if the input size does not match the initialized size.

VisionTransformers.VariablePositionEmbedding (Type)
VariablePositionEmbedding(dim::Int, imsize::Tuple, patchsize::Tuple; init=zeros32)

Variable position embedding layer that adds learnable positional embeddings to the input tensor.

Parameters

  • dim: The dimensionality of the embeddings.
  • imsize: A tuple representing the size of the input image (height, width[, depth]).
  • patchsize: A tuple representing the size of the patches (height, width[, depth]).
  • init: A function to initialize the positional embeddings (default is zeros32).

Input

An input tensor of shape WHCN or WHDCN where C is the embedding dimension and N is the batch size. The embedding will be resampled via bilinear interpolation if the input size does not match the initialized size.

VisionTransformers.RelativePositionEmbedding (Type)
RelativePositionEmbedding(dim::Int, nheads::Int, window_size::Tuple; init=zeros32)

Relative position embedding layer that adds learnable relative positional biases to the attention scores.

Parameters

  • dim: The dimensionality of the embeddings.
  • nheads: The number of attention heads.
  • window_size: A tuple representing the size of the attention window (height, width[, depth]).
  • init: A function to initialize the relative positional biases (default is zeros32).

Input

An attention scores tensor of shape [Ww*Wh x Ww*Wh x nheads x 1] where Ww and Wh are the window width and height, respectively. For 3D data, the shape is [Ww*Wh*Wd x Ww*Wh*Wd x nheads x 1] where Wd is the window depth.

Note

We define an overloaded version of Flux.NNlib.apply_attn_bias to apply the relative position embedding to the attention scores. Thus, it is sufficient to pass this layer to Flux.NNlib.dot_product_attention as the bias argument.

Example

# split the packed projection into queries, keys, and values
q, k, v = Flux.chunk(qkv(x), 3; dims=1)
# pass the RelativePositionEmbedding layer directly as the attention bias
y, α = Flux.dot_product_attention(q, k, v, position_embedding; nheads)

VisionTransformers.PEG (Function)
PEG(dim::Int; kernel=(3,3), bias=true)

Position Embedding Generator (PEG) layer that applies a depthwise convolution to the input tensor to encode positional information.

Parameters

  • dim: The number of input and output channels.
  • kernel: The size of the convolution kernel (default is (3,3)).
  • bias: Whether to include a bias term in the convolution (default is true).

VisionTransformers.MLP (Function)
MLP(indims, hiddendims, outdims; dropout=0.0, act=Flux.gelu)

Build a two-layer multi-layer perceptron (MLP).

Parameters

  • indims: The dimension of the input features.
  • hiddendims: The dimension of the hidden features.
  • outdims: The dimension of the output features.
  • dropout: The dropout probability following each Dense layer.
  • act: The activation function following the first Dense layer.
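A dependency-free sketch of the same forward pass (DenseSketch, mlp_sketch, and the tanh-based gelu approximation are illustrative only; dropout is omitted):

```julia
# Plain-Julia sketch of a 2-layer MLP: affine -> activation -> affine.
struct DenseSketch{M,V,F}
    W::M
    b::V
    act::F
end
(d::DenseSketch)(x) = d.act.(d.W * x .+ d.b)

# tanh approximation of GELU, applied elementwise
gelu(x) = 0.5x * (1 + tanh(sqrt(2 / π) * (x + 0.044715x^3)))

function mlp_sketch(indims, hiddendims, outdims; act=gelu)
    l1 = DenseSketch(randn(hiddendims, indims) ./ sqrt(indims), zeros(hiddendims), act)
    l2 = DenseSketch(randn(outdims, hiddendims) ./ sqrt(hiddendims), zeros(outdims), identity)
    return x -> l2(l1(x))
end
```

In transformer blocks this is applied tokenwise, with hiddendims typically set to mlp_ratio times the embedding dimension.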

VisionTransformers.SeparableConv (Function)
SeparableConv(dim, kernel::NTuple{N,Int}, stride::NTuple{N,Int}; bias=false) where N

Build a separable convolution layer consisting of a depthwise convolution followed by a pointwise convolution.

Parameters

  • dim: The number of input and output channels.
  • kernel: The size of the convolution kernel.
  • stride: The stride of the convolution.
  • bias: Whether to include a bias term in the pointwise convolution.

VisionTransformers.Tokens (Type)
Tokens(dim, ntokens; init=rand32)

Learnable token embedding layer that prepends a specified number of learnable tokens to the input sequence.

Parameters

  • dim: The dimensionality of the tokens.
  • ntokens: The number of learnable tokens to prepend.
  • init: A function to initialize the token embeddings (default is rand32).

Input

A tensor of shape C x L x N where C is the embedding dimension, L is the sequence length, and N is the batch size. The output has shape C x (L + ntokens) x N with the learnable tokens prepended to the input sequence.
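The prepend operation amounts to broadcasting the token matrix across the batch and concatenating along the sequence axis; a minimal sketch (prepend_tokens is a hypothetical name, not package code):

```julia
# Prepend a C x ntokens token matrix to every sequence in a C x L x N batch.
function prepend_tokens(tokens::AbstractMatrix, x::AbstractArray{<:Any,3})
    N = size(x, 3)
    t = repeat(reshape(tokens, size(tokens)..., 1), 1, 1, N)  # C x ntokens x N
    return cat(t, x; dims=2)                                  # C x (L + ntokens) x N
end
```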

VisionTransformers.StripTokens (Type)
StripTokens(ntokens)

Layer that removes a specified number of tokens from the beginning of the input sequence.

Parameters

  • ntokens: The number of tokens to remove from the beginning of the sequence.

Utilities

VisionTransformers.img2seq (Function)
img2seq(x::AbstractArray{<:Any,4})

Take a collection of image tokens of size [W x H x C x N] and flatten them into a sequence of size [C x L x N] where L = W * H.


VisionTransformers.seq2img (Function)
seq2img(x::AbstractArray{<:Any,3})

Take a sequence of image tokens of size [C x L x N] and reshape it into an image of size [W x H x C x N], where W = H = sqrt(L).

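One plausible pure-Julia implementation of this pair, useful for checking that the two transforms are exact inverses (the _sketch suffix marks these as illustrative, not the package's functions):

```julia
# Flatten W x H x C x N image tokens into a C x L x N sequence, and back.
img2seq_sketch(x::AbstractArray{<:Any,4}) =
    reshape(permutedims(x, (3, 1, 2, 4)), size(x, 3), size(x, 1) * size(x, 2), size(x, 4))

function seq2img_sketch(x::AbstractArray{<:Any,3})
    C, L, N = size(x)
    W = H = Int(sqrt(L))  # assumes a square token grid, as seq2img does
    return permutedims(reshape(x, C, W, H, N), (2, 3, 1, 4))
end
```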