Publications

2025

  1. GIFT: A Framework for Global Interpretable Faithful Textual Explanations of Vision Classifiers
    Under review, 2025
    A framework for generating global, interpretable textual explanations of vision classifiers, combining counterfactual visual explanations with VLMs and LLMs.
  2. PPT: Pre-Training with Pseudo-Labeled Trajectories for Motion Forecasting
    Under review, 2025
    Pre-training with pseudo-labeled trajectories obtained from offline 3D trackers boosts trajectory prediction models, improving performance, efficiency, and generalization.
  3. Annealed Winner-Takes-All for Motion Forecasting
    Under review, 2025
    Using an annealing loss enhances training stability and performance of state-of-the-art trajectory prediction models.

2024

  1. LLM-wrapper: black-box semantic-aware adaptation of Vision-Language foundation models
    In ECCV Workshop Eval-FoMo, 2024
    LLMs can learn to adapt black-box VLMs to new tasks and domains by wrapping and reasoning over the vision models’ outputs.
  2. ReGentS: Real-World Safety-Critical Driving Scenario Generation Made Stable
    In ECCV Workshop W-CODA, 2024
    ReGentS generates safety-critical driving scenarios with adversarial optimization of real-world trajectories.
  3. Valeo4Cast: A Modular Approach to End-to-End Forecasting
    In ECCV Workshop ROAD++, 2024
    By training and fine-tuning the detection, tracking, and forecasting modules separately, Valeo4Cast achieves first place in the Argoverse 2 Challenge, outperforming the previous year’s winner by +17.1 points.
  4. UniTraj: A Unified Framework for Scalable Vehicle Trajectory Prediction
    In ECCV, 2024
    Unifying the major vehicle trajectory prediction datasets enables studying the impact of data scale and diversity on performance and model generalization.
  5. PointBeV: A Sparse Approach to BeV Predictions
    In CVPR, 2024
    A sparse approach to bird’s-eye-view perception improves performance and computational efficiency by avoiding the uniform allocation of resources across all cells, adapting to the task, the situation, and the compute budget at inference time.
  6. Unsupervised Object Localization in the Era of Self-Supervised ViTs: A Survey
    In IJCV, 2024
    A survey on unsupervised object localization methods leveraging self-supervised pre-trained features, e.g., DINO.
  7. Towards Motion Forecasting with Real-World Perception Inputs: Are End-to-End Approaches Competitive?
    In ICRA, 2024
    This work presents a unified evaluation pipeline for motion forecasting with real-world perception inputs, revealing a performance gap between curated and perception-based data.

2023

  1. OCTET: Object-aware Counterfactual Explanations
    In CVPR, 2023
    Using a spatial- and object-aware generative model enables the generation of counterfactual explanations for deep vision models handling complex scenes with many objects.
  2. Unsupervised Object Localization: Observing the Background to Discover Objects
    In CVPR, 2023
    FOUND trains a single 1x1 convolution on DINO features for unsupervised object segmentation; it runs at 80 FPS on a V100 after only 2 hours of self-training on a single GPU.
  3. LiDARTouch: Monocular metric depth estimation with a few-beam LiDAR
    In CVIU, 2023
    Adding a low-cost LiDAR to a monocular camera setup yields improved metric depth maps in a self-supervised manner.

2022

  1. LaRa: Latents and Rays for Multi-Camera Bird’s-Eye-View Semantic Segmentation
    In CoRL, 2022
    The Perceiver architecture, combined with careful ray encoding, excels at multi-camera fusion and at transforming perspective views into bird’s-eye-view semantic segmentation.
  2. STEEX: Steering Counterfactual Explanations with Semantics
    In ECCV, 2022
    Using a well-structured image generative model unlocks the generation of counterfactual explanations for deep vision models handling high-quality images and complex scenes.
  3. Raising context awareness in motion forecasting
    In CVPR Workshop on Autonomous Driving (WAD), 2022
    Since trajectory prediction models tend to merely extrapolate past motion, CAB promotes the use of HD-map information to address long-tail corner cases.
  4. Explainability of deep vision-based autonomous driving systems: Review and challenges
    Éloi Zablocki*, Hédi Ben-Younes*, Patrick Pérez, and Matthieu Cord
    In IJCV, 2022
    A survey on explainability methods for vision-based autonomous-driving models.
  5. Driving Behavior Explanation with Multi-level Fusion
    Hédi Ben-Younes*, Éloi Zablocki*, Matthieu Cord, and Patrick Pérez
    In Pattern Recognition (PR), 2022
    BEEF is a self-driving model that both drives and explains its decisions with natural language.

2020

  1. Transductive Zero-Shot Learning using Cross-modal CycleGAN
    arXiv, 2020
    Using a cycle-consistency loss reduces the domain shift between visual and textual representations, enhancing performance in zero-shot object recognition.

2019

  1. Context-Aware Zero-Shot Learning for Object Recognition
    In ICML, 2019
    Using visual context boosts zero-shot object recognition.
  2. Incorporating Visual Semantics into Sentence Representations within a Grounded Space
    In EMNLP, 2019
    A careful transfer of visual features to sentence representations enriches the semantics of general-purpose textual representations.

2018

  1. Learning Multi-Modal Word Representation Grounded in Visual Context
    In AAAI, 2018
    Visual context can be used, along with textual context, to learn improved word representations with the skip-gram algorithm.

2017

  1. LIP6@CLEF2017: Multi-Modal Spatial Role Labeling using Word Embeddings
    In CLEF, 2017
    A linear SVM on pooled word representations classifies spatial relations from text and images.