Transformer-based Object Detection. Detection Transformer (DETR) [2] redefines object detection as a set prediction problem, using a transformer-based architecture to model object relationships via global attention.

It eliminates the need for hand-crafted components like non-maximum suppression by employing learnable object queries and Hungarian matching [18].

To improve efficiency, Deformable DETR [48] introduces sparse, deformable attention over multi-scale feature maps, accelerating convergence and improving small-object detection. Various DETR variants [6, 20, 25, 30, 49] further refine performance. Our method is built upon the widely adopted Deformable DETR.

Source: "Dual Domain Control via Active Learning for Remote Sensing Domain," page 2.

The explanation below was generated by Copilot.

What is DETR?

DETR (DEtection TRansformer) reframes object detection as a set prediction problem using a Transformer architecture. Instead of generating proposals or anchors, it predicts a fixed-size set of objects directly.
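To make the "fixed-size set" concrete, here is a minimal PyTorch sketch of the output format a DETR-style model produces; the batch size, query count, and class count are illustrative assumptions, not values from the paper.

```python
import torch

# Assumed example dimensions (91 classes as in COCO; +1 "no object" class below).
batch_size, num_queries, num_classes = 2, 100, 91

# A DETR-style model emits one prediction per query, regardless of how many
# objects the image actually contains:
pred_logits = torch.randn(batch_size, num_queries, num_classes + 1)  # class scores incl. "no object"
pred_boxes = torch.rand(batch_size, num_queries, 4)                  # normalized (cx, cy, w, h)

# Queries whose most likely class is the last index are treated as "no object".
labels = pred_logits.argmax(-1)
is_object = labels != num_classes
print(is_object.sum(dim=1))  # number of detected objects per image
```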


Role of Query Vectors

Query vectors are learnable embeddings that act as “slots” for potential objects. Here’s how they work:

1. Input to the Transformer Decoder

  • DETR has two main parts:

    • Encoder: Processes image features extracted by a CNN backbone (e.g., ResNet).
    • Decoder: Takes query vectors and attends to encoder outputs to predict objects.
  • Each query vector interacts with the encoded image features through cross-attention. This means:

    • Queries ask: “Is there an object here? What are its properties?”
    • The decoder refines these queries into object predictions.

2. Fixed Number of Queries

  • DETR uses a fixed number of queries (e.g., 100).
  • Each query corresponds to one predicted object (or a “no object” class).
  • Why fixed? The model predicts the whole set in one pass, so the number of output slots must be chosen in advance and set larger than the number of objects expected in any image; queries that match nothing predict the "no object" class.
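As a concrete illustration of points 1 and 2, here is a minimal PyTorch sketch of learnable queries cross-attending to encoder features through a standard transformer decoder. Module choices, layer counts, and sizes are illustrative assumptions, not the actual DETR or paper code.

```python
import torch
import torch.nn as nn

hidden_dim, num_queries, num_classes = 256, 100, 91  # illustrative sizes

# Learnable query embeddings: one "slot" per potential object.
query_embed = nn.Embedding(num_queries, hidden_dim)

# Standard transformer decoder; each layer cross-attends to the encoder output.
decoder_layer = nn.TransformerDecoderLayer(d_model=hidden_dim, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)

# Prediction heads shared across queries.
class_head = nn.Linear(hidden_dim, num_classes + 1)  # +1 for "no object"
box_head = nn.Linear(hidden_dim, 4)                  # (cx, cy, w, h), normalized

# `memory` stands in for the encoder output over flattened image features
# (batch, H*W, hidden_dim); random here for illustration.
memory = torch.randn(2, 1064, hidden_dim)

# Broadcast the same learned queries to every image in the batch.
queries = query_embed.weight.unsqueeze(0).expand(memory.size(0), -1, -1)

# Cross-attention: queries "look at" image features and are refined into object slots.
decoded = decoder(tgt=queries, memory=memory)   # (batch, num_queries, hidden_dim)
pred_logits = class_head(decoded)               # (batch, num_queries, num_classes + 1)
pred_boxes = box_head(decoded).sigmoid()        # (batch, num_queries, 4)
```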

3. Predictions from Queries

  • Each query outputs:
    • Class label (including “no object”)
    • Bounding box coordinates
  • During training, Hungarian matching finds a one-to-one assignment between predictions and ground-truth objects by minimizing a combined classification and box cost; queries left unmatched are supervised as "no object".
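A minimal sketch of the matching step above, using SciPy's linear_sum_assignment (the Hungarian algorithm). The cost here is simplified to negative class probability plus L1 box distance; DETR's actual matching cost also includes a generalized-IoU term.

```python
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes):
    """Match predicted queries to ground-truth objects for one image (simplified cost)."""
    probs = pred_logits.softmax(-1)                     # (num_queries, num_classes + 1)
    cost_class = -probs[:, gt_labels]                   # (num_queries, num_gt)
    cost_bbox = torch.cdist(pred_boxes, gt_boxes, p=1)  # (num_queries, num_gt)
    cost = cost_class + cost_bbox
    query_idx, gt_idx = linear_sum_assignment(cost.detach().numpy())
    return query_idx, gt_idx  # matched pairs; unmatched queries are supervised as "no object"

# Illustrative usage: 100 queries, 2 ground-truth objects, random tensors.
pred_logits, pred_boxes = torch.randn(100, 92), torch.rand(100, 4)
gt_labels, gt_boxes = torch.tensor([3, 17]), torch.rand(2, 4)
q_idx, g_idx = hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes)
```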

4. Why Queries Matter in Incremental Learning

  • Queries are dynamic and unordered.
  • They carry semantic information about objects.
  • In domain-incremental learning, queries can shift when the domain changes (new sensor, new conditions).
  • ADSC (Active Domain Shift Control) in the paper aligns these queries between old and new domains using optimal transport and knowledge distillation.
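The paper's exact ADSC procedure is not reproduced here; the following is a hypothetical sketch of the general idea described above: compute an entropic optimal-transport plan (via Sinkhorn iterations) between old-domain and new-domain query embeddings, then distill the new queries toward their transported old-domain counterparts. All function names and hyperparameters are assumptions.

```python
import torch
import torch.nn.functional as F

def sinkhorn_plan(cost, eps=0.1, n_iters=50):
    # Entropic OT plan between two sets with uniform marginals (hypothetical helper).
    n, m = cost.shape
    K = torch.exp(-cost / eps)                      # Gibbs kernel
    a, b = torch.full((n,), 1.0 / n), torch.full((m,), 1.0 / m)
    u, v = torch.ones(n), torch.ones(m)
    for _ in range(n_iters):                        # Sinkhorn iterations
        u = a / (K @ v)
        v = b / (K.t() @ u)
    return u.unsqueeze(1) * K * v.unsqueeze(0)      # transport plan, shape (n, m)

def query_alignment_loss(old_queries, new_queries):
    """Distill new-domain queries toward OT-aligned old-domain queries (sketch)."""
    cost = torch.cdist(old_queries, new_queries, p=2)
    cost = cost / (cost.max() + 1e-8)               # normalize to avoid exp underflow
    with torch.no_grad():
        plan = sinkhorn_plan(cost)
        # Barycentric projection: each new query's target is a plan-weighted
        # average of the old queries.
        targets = (plan / plan.sum(dim=0, keepdim=True)).t() @ old_queries
    return F.mse_loss(new_queries, targets)

# Illustrative usage: 100 queries of dimension 256 from the frozen old model
# and from the current model on the same image.
old_q = torch.randn(100, 256)                       # frozen old-domain queries
new_q = torch.randn(100, 256, requires_grad=True)   # current-model queries
loss = query_alignment_loss(old_q, new_q)
loss.backward()
```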