Transformer-based Object Detection. Detection Transformer (DETR) [2] redefines object detection as a set prediction problem, using a transformer-based architecture to model object relationships via global attention.
It eliminates the need for hand-crafted components like non-maximum suppression by employing learnable object queries and Hungarian matching [18].
To improve efficiency, Deformable DETR [48] introduces sparse deformable attention over multi-scale feature maps, accelerating convergence and improving small-object detection. Various DETR variants [6, 20, 25, 30, 49] further refine performance. Our method builds upon the widely adopted Deformable DETR.
What is DETR?
DETR (DEtection TRansformer) reframes object detection as a set prediction problem using a Transformer architecture. Instead of generating proposals or anchors, it predicts a fixed-size set of objects directly.
Role of Query Vectors
Query vectors are learnable embeddings that act as “slots” for potential objects. Here’s how they work:
1. Input to the Transformer Decoder
DETR has two main parts:
- Encoder: Processes image features extracted by a CNN backbone (e.g., ResNet).
- Decoder: Takes query vectors and attends to encoder outputs to predict objects.
Each query vector interacts with the encoded image features through cross-attention. This means:
- Queries ask: “Is there an object here? What are its properties?”
- The decoder refines these queries into object predictions.
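To make the query mechanism concrete, here is a minimal PyTorch sketch of a DETR-style decoder head. It is illustrative, not the reference implementation: `MiniDETRHead` and all hyperparameters are assumptions, and real DETR additionally uses positional encodings, auxiliary decoding losses, and an MLP box head.

```python
# A minimal sketch of how learnable queries flow through a DETR-style decoder.
# Illustrative only; names and hyperparameters are assumptions, not DETR's code.
import torch
import torch.nn as nn

class MiniDETRHead(nn.Module):
    def __init__(self, d_model=256, num_queries=100, num_classes=91):
        super().__init__()
        # Learnable object queries: one embedding per prediction "slot".
        self.queries = nn.Embedding(num_queries, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        # +1 output for the "no object" class.
        self.class_head = nn.Linear(d_model, num_classes + 1)
        self.box_head = nn.Linear(d_model, 4)  # (cx, cy, w, h), normalized

    def forward(self, memory):
        # memory: encoder output of shape (batch, H*W, d_model)
        b = memory.size(0)
        q = self.queries.weight.unsqueeze(0).expand(b, -1, -1)
        # Cross-attention: each query attends to the encoded image features.
        hs = self.decoder(q, memory)
        return self.class_head(hs), self.box_head(hs).sigmoid()

# Usage: 100 queries yield exactly 100 (class, box) predictions per image.
memory = torch.randn(2, 49, 256)          # fake encoder features
logits, boxes = MiniDETRHead()(memory)    # shapes: (2, 100, 92), (2, 100, 4)
```

Note how the number of predictions is fixed by the query embedding, independent of image content, which leads directly to the next point.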
2. Fixed Number of Queries
- DETR uses a fixed number of queries (e.g., 100).
- Each query corresponds to one predicted object or the “no object” class.
- Why fixed? The decoder always emits the same number of output slots, so the count is chosen to exceed the number of objects expected in any image; surplus queries simply predict “no object”, removing the need for anchors or region proposals.
3. Predictions from Queries
- Each query outputs:
- Class label (including “no object”)
- Bounding box coordinates (normalized center x/y, width, and height)
- Hungarian matching finds a one-to-one assignment between predictions and ground-truth objects during training; unmatched queries are supervised as “no object”.
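The matching step can be sketched with SciPy's Hungarian solver (`linear_sum_assignment`, which is also what the DETR reference code calls). The cost terms below are simplified: the actual matcher additionally includes a generalized-IoU term, and the weight `5.0` is illustrative.

```python
# Simplified sketch of DETR's bipartite matching. The cost is a weighted sum
# of a classification term and an L1 box term; weights here are assumptions.
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes):
    # pred_logits: (num_queries, num_classes+1), pred_boxes: (num_queries, 4)
    # gt_labels: (num_gt,), gt_boxes: (num_gt, 4)
    prob = pred_logits.softmax(-1)
    # Classification cost: negative probability of each ground-truth class.
    cost_class = -prob[:, gt_labels]                   # (num_queries, num_gt)
    # Box cost: L1 distance between predicted and ground-truth boxes.
    cost_box = torch.cdist(pred_boxes, gt_boxes, p=1)  # (num_queries, num_gt)
    cost = cost_class + 5.0 * cost_box
    # Optimal one-to-one assignment; unmatched queries get "no object".
    q_idx, gt_idx = linear_sum_assignment(cost.detach().numpy())
    return q_idx, gt_idx

# Example: 100 predictions matched to 3 ground-truth objects.
q_idx, gt_idx = hungarian_match(torch.randn(100, 92), torch.rand(100, 4),
                                torch.tensor([3, 17, 42]), torch.rand(3, 4))
```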
4. Why Queries Matter in Incremental Learning
- Queries are dynamic and unordered.
- They carry semantic information about objects.
- In domain-incremental learning, queries can shift when the domain changes (new sensor, new conditions).
- ADSC (Active Domain Shift Control) in the paper aligns these queries between old and new domains using optimal transport and knowledge distillation.
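The paper's exact ADSC procedure is not reproduced here; the sketch below only illustrates the general idea of optimal-transport query alignment, using an entropic (Sinkhorn) solver followed by a distillation-style loss. Every name, weight, and design choice in it is an assumption, not the paper's method.

```python
# Generic sketch: softly match old-domain and new-domain query embeddings
# with entropic optimal transport, then distill new queries toward the
# transported old ones. All names and constants are assumptions.
import torch

def sinkhorn(cost, n_iters=50, eps=0.1):
    # Entropic OT with uniform marginals over the two query sets.
    cost = cost / (cost.max() + 1e-8)  # normalize for numerical stability
    K = torch.exp(-cost / eps)
    u = torch.ones(cost.size(0)) / cost.size(0)
    v = torch.ones(cost.size(1)) / cost.size(1)
    a, b = u.clone(), v.clone()
    for _ in range(n_iters):
        a = u / (K @ b)
        b = v / (K.t() @ a)
    return a.unsqueeze(1) * K * b.unsqueeze(0)  # transport plan, (n, m)

def query_alignment_loss(old_q, new_q):
    # old_q, new_q: (num_queries, d_model) from the old and new domain models.
    old_q = old_q.detach()  # the old-domain model serves as a frozen target
    cost = torch.cdist(old_q, new_q, p=2)
    plan = sinkhorn(cost)
    # Barycentric projection: transport old queries onto the new query set,
    # then penalize (L2) the new queries' deviation from those targets.
    target = plan.t() @ old_q / plan.sum(0, keepdim=True).t()
    return ((new_q - target) ** 2).mean()
```

Because queries are unordered, a transport plan (rather than an index-wise pairing) is a natural way to express which old query should supervise which new one.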