Convolutional neural networks (CNNs) have made significant strides in medical image
analysis in recent years. However, the local nature of the convolution operator
prevents CNNs from capturing global and long-range interactions. Recently,
Transformers have gained popularity in the computer vision community, including
in medical image segmentation. However, the scalability issues of the
self-attention mechanism and the lack of CNN-like inductive bias have limited
their adoption. In this
work, we present MaxViT-UNet, an encoder-decoder based hybrid vision
Transformer for medical image segmentation. The proposed hybrid decoder, also
based on the MaxViT block, is designed to harness the power of both convolution
and self-attention at each decoding stage with minimal computational
burden. The multi-axis self-attention in each decoder stage helps in
differentiating between object and background regions much more
efficiently. The hybrid decoder block first fuses the lower-level features,
upsampled via transposed convolution, with the skip-connection features from the
hybrid encoder; the fused features are then refined using the multi-axis
attention mechanism. The proposed decoder block is repeated multiple times to accurately
segment the nuclei regions. Experimental results on the MoNuSeg dataset demonstrate the
effectiveness of the proposed technique. MaxViT-UNet outperforms the
previous CNN-only (UNet) and Transformer-only (Swin-UNet) techniques by large
margins of 2.36% and 5.31%, respectively, on the Dice metric.
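
The multi-axis attention mentioned above can be pictured at the level of pixel indices. The following is a minimal sketch, assuming the decoder uses the block (local window) and grid (sparse global) partitions of the original MaxViT block. The function names `block_partition` and `grid_partition`, and the 4x4 feature map with window and grid size 2, are hypothetical choices for illustration and are not details taken from this work. Self-attention would then be applied independently within each group of coordinates.

```python
# Illustrative sketch only: an index-level view of the two partitions that
# multi-axis attention is ASSUMED to use (local windows + sparse global grid).
# Names and sizes are hypothetical, not taken from the MaxViT-UNet code.


def block_partition(height, width, window):
    """Group pixel coordinates into non-overlapping window x window blocks."""
    groups = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            groups.append([(top + r, left + c)
                           for r in range(window)
                           for c in range(window)])
    return groups


def grid_partition(height, width, grid):
    """Group pixel coordinates into sparse grid x grid sets spanning the map."""
    stride_h, stride_w = height // grid, width // grid
    groups = []
    for off_r in range(stride_h):
        for off_c in range(stride_w):
            # Members of one group are spaced stride_h / stride_w pixels apart.
            groups.append([(off_r + a * stride_h, off_c + b * stride_w)
                           for a in range(grid)
                           for b in range(grid)])
    return groups


if __name__ == "__main__":
    local_groups = block_partition(4, 4, window=2)
    global_groups = grid_partition(4, 4, grid=2)
    print("first local window:", local_groups[0])
    print("first global grid: ", global_groups[0])
```

Each block group covers a contiguous neighbourhood, whereas each grid group samples the whole map at a fixed stride, which is how the two partitions are assumed to mix local and global information.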