Transformer-based Object Detection. Detection Transformer (DETR) [2] redefines object detection as a set prediction problem, using a transformer-based architecture to model object relationships via global attention.
It eliminates the need for hand-crafted components like non-maximum suppression by employing learnable object queries and Hungarian matching [18].
To improve efficiency, Deformable DETR [48] introduces sparse deformable attention over multi-scale feature maps, accelerating convergence and improving small-object detection. Various DETR variants [6, 20, 25, 30, 49] further refine performance. Our method builds upon the widely adopted Deformable DETR.
What is DETR?
DETR (DEtection TRansformer) reframes object detection as a set prediction problem using a Transformer architecture. Instead of generating proposals or anchors, it predicts a fixed-size set of objects directly.
Role of Query Vectors
Query vectors are learnable embeddings that act as “slots” for potential objects. Here’s how they work:
1. Input to the Transformer Decoder
DETR has two main parts:
- Encoder: Processes image features extracted by a CNN backbone (e.g., ResNet).
- Decoder: Takes query vectors and attends to encoder outputs to predict objects.
Each query vector interacts with the encoded image features through cross-attention. This means:
- Queries ask: “Is there an object here? What are its properties?”
- The decoder refines these queries into object predictions.
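To make the query mechanism concrete, here is a minimal PyTorch sketch of a DETR-style decoder head. It is illustrative, not the reference implementation: `MiniDETRHead` and all hyperparameters are assumptions, and real DETR additionally uses positional encodings, auxiliary decoding losses, and an MLP box head.

```python
# A minimal sketch of how learnable queries flow through a DETR-style decoder.
# Illustrative only; names and hyperparameters are assumptions, not DETR's code.
import torch
import torch.nn as nn

class MiniDETRHead(nn.Module):
    def __init__(self, d_model=256, num_queries=100, num_classes=91):
        super().__init__()
        # Learnable object queries: one embedding per prediction "slot".
        self.queries = nn.Embedding(num_queries, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        # +1 output for the "no object" class.
        self.class_head = nn.Linear(d_model, num_classes + 1)
        self.box_head = nn.Linear(d_model, 4)  # (cx, cy, w, h), normalized

    def forward(self, memory):
        # memory: encoder output of shape (batch, H*W, d_model)
        b = memory.size(0)
        q = self.queries.weight.unsqueeze(0).expand(b, -1, -1)
        # Cross-attention: each query attends to the encoded image features.
        hs = self.decoder(q, memory)
        return self.class_head(hs), self.box_head(hs).sigmoid()

# Usage: 100 queries yield exactly 100 (class, box) predictions per image.
memory = torch.randn(2, 49, 256)          # fake encoder features
logits, boxes = MiniDETRHead()(memory)    # shapes: (2, 100, 92), (2, 100, 4)
```

Note how the number of predictions is fixed by the query embedding, independent of image content, which leads directly to the next point.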
2. Fixed Number of Queries
- DETR uses a fixed number of queries (e.g., 100).
- Each query corresponds to one predicted object or the “no object” class.
- Why fixed? The decoder always emits the same number of output slots, so the count is chosen to exceed the number of objects expected in any image; surplus queries simply predict “no object”, removing the need for anchors or region proposals.
3. Predictions from Queries
- Each query outputs:
- Class label (including “no object”)
- Bounding box coordinates (normalized center x/y, width, and height)
- Hungarian matching finds a one-to-one assignment between predictions and ground-truth objects during training; unmatched queries are supervised as “no object”.
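The matching step can be sketched with SciPy's Hungarian solver (`linear_sum_assignment`, which is also what the DETR reference code calls). The cost terms below are simplified: the actual matcher additionally includes a generalized-IoU term, and the weight `5.0` is illustrative.

```python
# Simplified sketch of DETR's bipartite matching. The cost is a weighted sum
# of a classification term and an L1 box term; weights here are assumptions.
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes):
    # pred_logits: (num_queries, num_classes+1), pred_boxes: (num_queries, 4)
    # gt_labels: (num_gt,), gt_boxes: (num_gt, 4)
    prob = pred_logits.softmax(-1)
    # Classification cost: negative probability of each ground-truth class.
    cost_class = -prob[:, gt_labels]                   # (num_queries, num_gt)
    # Box cost: L1 distance between predicted and ground-truth boxes.
    cost_box = torch.cdist(pred_boxes, gt_boxes, p=1)  # (num_queries, num_gt)
    cost = cost_class + 5.0 * cost_box
    # Optimal one-to-one assignment; unmatched queries get "no object".
    q_idx, gt_idx = linear_sum_assignment(cost.detach().numpy())
    return q_idx, gt_idx

# Example: 100 predictions matched to 3 ground-truth objects.
q_idx, gt_idx = hungarian_match(torch.randn(100, 92), torch.rand(100, 4),
                                torch.tensor([3, 17, 42]), torch.rand(3, 4))
```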
4. Why Queries Matter in Incremental Learning
- Queries are dynamic and unordered.
- They carry semantic information about objects.
- In domain-incremental learning, queries can shift when the domain changes (new sensor, new conditions).
- ADSC (Active Domain Shift Control) in the paper aligns these queries between old and new domains using optimal transport and knowledge distillation.
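The paper's exact ADSC procedure is not reproduced here; the sketch below only illustrates the general idea of optimal-transport query alignment, using an entropic (Sinkhorn) solver followed by a distillation-style loss. Every name, weight, and design choice in it is an assumption, not the paper's method.

```python
# Generic sketch: softly match old-domain and new-domain query embeddings
# with entropic optimal transport, then distill new queries toward the
# transported old ones. All names and constants are assumptions.
import torch

def sinkhorn(cost, n_iters=50, eps=0.1):
    # Entropic OT with uniform marginals over the two query sets.
    cost = cost / (cost.max() + 1e-8)  # normalize for numerical stability
    K = torch.exp(-cost / eps)
    u = torch.ones(cost.size(0)) / cost.size(0)
    v = torch.ones(cost.size(1)) / cost.size(1)
    a, b = u.clone(), v.clone()
    for _ in range(n_iters):
        a = u / (K @ b)
        b = v / (K.t() @ a)
    return a.unsqueeze(1) * K * b.unsqueeze(0)  # transport plan, (n, m)

def query_alignment_loss(old_q, new_q):
    # old_q, new_q: (num_queries, d_model) from the old and new domain models.
    old_q = old_q.detach()  # the old-domain model serves as a frozen target
    cost = torch.cdist(old_q, new_q, p=2)
    plan = sinkhorn(cost)
    # Barycentric projection: transport old queries onto the new query set,
    # then penalize (L2) the new queries' deviation from those targets.
    target = plan.t() @ old_q / plan.sum(0, keepdim=True).t()
    return ((new_q - target) ** 2).mean()
```

Because queries are unordered, a transport plan (rather than an index-wise pairing) is a natural way to express which old query should supervise which new one.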