iccv23

ICCV 2021 paper list

2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021.

PointBA: Towards Backdoor Attacks in 3D Point Cloud.
Black-box Detection of Backdoor Attacks with Limited Information and Data.
Rethinking the Backdoor Attacks' Triggers: A Frequency Perspective.
Invisible Backdoor Attack with Sample-Specific Triggers.
CLEAR: Clean-up Sample-Targeted Backdoor in Neural Networks.
Revisiting Adversarial Robustness Distillation: Robust Soft Labels Make Student Better.
Defending against Universal Adversarial Patches by Clipping Feature Norms.
Low Curvature Activations Reduce Overfitting in Adversarial Training.
Practical Relative Order Attack in Deep Ranking.
Cross-Modality Person Re-Identification via Modality Confusion and Center Aggregation.
A Simple Baseline for Weakly-Supervised Scene Graph Generation.
From General to Specific: Informative Scene Graph Generation via Balance Adjustment.
Spatial-Temporal Transformer for Dynamic Scene Graph Generation.
Unconditional Scene Graph Generation.
Graph-to-3D: End-to-End Generation and Manipulation of 3D Scenes Using Scene Graphs.
Few-Shot Visual Relationship Co-Localization.
Salient Object Ranking with Position-Preserved Attention.
Vision-Language Transformer and Query Generation for Referring Segmentation.
Condensing a Sequence to One Informative Frame for Video Recognition.
Refining Action Segmentation with Hierarchical Video Representations.
Region-aware Contrastive Learning for Semantic Segmentation.
Perception-Aware Multi-Sensor Fusion for 3D LiDAR Semantic Segmentation.
Transformer-Based Attention Networks for Continuous Pixel-Wise Prediction.
Point Transformer.
Adaptive Focus for Efficient Video Recognition.
SurfGen: Adversarial 3D Shape Synthesis with Explicit Surface Discriminators.
CTRL-C: Camera calibration TRansformer with Line-Classification.
A Closer Look at Rotation-invariant Deep Point Cloud Analysis.
PU-EVA: An Edge-Vector based Approximation Solution for Flexible-scale Point Cloud Upsampling.
Full-Velocity Radar Returns by Radar-Camera Fusion.
Attack as the Best Defense: Nullifying Image-to-image Translation GANs via Limit-aware Adversarial Attack.
Knowledge-Enriched Distributional Model Inversion Attacks.
Aha! Adaptive History-driven Attack for Decision-based Black-box Models.
Admix: Enhancing the Transferability of Adversarial Attacks.
Bayesian Deep Basis Fitting for Depth Completion with Uncertainty.
Glimpse-Attend-and-Explore: Self-Attention for Active Visual Exploration.
The Surprising Effectiveness of Visual Odometry Techniques for Embodied PointGoal Navigation.
Auxiliary Tasks and Exploration Enable ObjectGoal Navigation.
LookOut: Diverse Multi-Future Prediction and Planning for Self-Driving.
RAIN: Reinforced Hybrid Attention Inference Network for Motion Forecasting.
VolumeFusion: Deep Depth Fusion for 3D Scene Reconstruction.
GP-S3Net: Graph-based Panoptic Sparse Semantic Segmentation Network.
Variational Attention: Propagating Domain-Specific Knowledge for Multi-Domain Learning in Crowd Counting.
Regularizing Nighttime Weirdness: Efficient Self-supervised Monocular Depth Estimation in the Dark.
Self-Supervised Real-to-Sim Scene Generation.
MonteFloor: Extending MCTS for Reconstructing Accurate Large-Scale Floor Plans.
RPVNet: A Deep and Efficient Range-Point-Voxel Fusion Network for LiDAR Point Cloud Segmentation.
HRegNet: A Hierarchical Network for Large-scale Outdoor LiDAR Point Cloud Registration.
Interpretation of Emergent Communication in Heterogeneous Collaborative Embodied Agents.
P2-Net: Joint Description and Detection of Local Features for Pixel and Point Matching.
Deep Hough Voting for Robust Global Registration.
Exploiting Scene Graphs for Human-Object Interaction Detection.
Pose Correction for Highly Accurate Visual Localization in Large-scale Indoor Spaces.
Graspness Discovery in Clutters for Fast and Accurate Grasp Detection.
Episodic Transformer for Vision-and-Language Navigation.
Context-aware Scene Graph Generation with Seq2Seq Transformers.
Exploring Long Tail Visual Relationship Recognition with Large Vocabulary.
Grounding Consistency: Distilling Spatial Common Sense for Precise Visual Relationship Detection.
Topic Scene Graph Generation by Attention Distillation from Caption.
Visual Graph Memory with Unsupervised Representation for Visual Navigation.
Segmentation-grounded Scene Graph Generation.
Exploring Relational Context for Multi-Task Dense Prediction.
Enhanced Boundary Learning for Glass-like Object Segmentation.
Interaction via Bi-directional Graph of Semantic Region Affinity for Scene Parsing.
In-Place Scene Labelling and Understanding with Implicit Scene Representation.
Generative Compositional Augmentations for Scene Graph Prediction.
Visual Distant Supervision for Scene Graph Generation.
MGNet: Monocular Geometric Scene Understanding for Autonomous Driving.
NEAT: Neural Attention Fields for End-to-End Autonomous Driving.
Continual Neural Mapping: Learning An Implicit Scene Representation from Sequential Observations.
The Functional Correspondence Problem.
H2O: A Benchmark for Visual Human-human Object Handover Analysis.
Act the Part: Learning Interaction Strategies for Articulated Object Part Discovery.
Toward Human-Like Grasp: Dexterous Grasping via Semantic Representation of Object-Hand.
Safety-aware Motion Prediction with Unseen Vehicles for Autonomous Driving.
Learnable Boundary Guided Adversarial Training.
Robustness and Generalization via Generative Adversarial Training.
Triggering Failures: Out-Of-Distribution detection by learning from local adversarial attacks in Semantic Segmentation.
RobustNav: Towards Benchmarking Robustness in Embodied Navigation.
VIL-100: A New Dataset and A Baseline Model for Video Instance Lane Detection.
Multi-View Radar Semantic Segmentation.
Structured Bird's-Eye-View Traffic Scene Understanding from Onboard Images.
Road Anomaly Detection by Partial Image Reconstruction with Segmentation Coupling.
AutoShape: Real-Time Shape-Aware Monocular 3D Object Detection.
Robust 2D/3D Vehicle Parsing in Arbitrary Camera Views for CVIS.
Prediction by Anticipation: An Action-Conditional Prediction Method based on Interaction Learning.
Procedure Planning in Instructional Videos via Contextual Modeling and Model-based Policy Learning.
Bifold and Semantic Reasoning for Pedestrian Behavior Prediction.
Learning to drive from a world on rails.
Personalized Trajectory Prediction via Distribution Discrimination.
Crowd Counting With Partial Annotations in an Image.
Excavating the Potential Capacity of Self-Supervised Monocular Depth Estimation.
Spatial Uncertainty-Aware Semi-Supervised Crowd Counting.
FOVEA: Foveated Image Magnification for Autonomous Navigation.
Revealing the Reciprocal Relations between Self-Supervised Stereo and Monocular Depth Estimation.
Perturbed Self-Distillation: Weakly Supervised Large-Scale Point Cloud Semantic Segmentation.
ReDAL: Region-based and Diversity-aware Active Learning for Point Cloud Semantic Segmentation.
Warp-Refine Propagation: Semi-Supervised Auto-labeling via Cycle-consistency.
VMNet: Voxel-Mesh Network for Geodesic-Aware 3D Semantic Segmentation.
Learning Inner-Group Relations on Point Clouds.
Hierarchical Aggregation for 3D Instance Segmentation.
HiFT: Hierarchical Feature Transformer for Aerial Tracking.
SPG: Unsupervised Domain Adaptation for 3D Object Detection via Semantic Point Generation.
4D-Net for Learned Multi-Modal Alignment.
Standardized Max Logits: A Simple yet Effective Approach for Identifying Unexpected Road Obstacles in Urban-Scene Segmentation.
Rethinking 360° Image Visual Attention Modelling with Unsupervised Learning.
Learning of Visual Relations: The Devil is in the Tails.
FLAR: A Unified Prototype Framework for Few-sample Lifelong Active Recognition.
Pose Invariant Topological Memory for Visual Navigation.
THDA: Treasure Hunt Data Augmentation for Semantic Navigation.
Scaling up instance annotation via label propagation.
Scribble-Supervised Semantic Segmentation Inference.
PrimitiveNet: Primitive Instance Segmentation with Local Primitive Embedding under Adversarial Metric.
Deep Metric Learning for Open World Semantic Segmentation.
Continuous Copy-Paste for One-stage Multi-object Tracking and Segmentation.
Hierarchical Disentangled Representation Learning for Outdoor Illumination Estimation and Editing.
DenseTNT: End-to-end Trajectory Prediction from Dense Goal Sets.
LSG-CPD: Coherent Point Drift with Local Surface Geometry for Point Cloud Registration.
Fog Simulation on Real LiDAR Point Clouds for 3D Object Detection in Adverse Weather.
FIERY: Future Instance Prediction in Bird's-Eye View from Surround Monocular Cameras.
Robust Small Object Detection on the Water Surface through Fusion of Camera and Millimeter Wave Radar.
BabelCalib: A Universal Approach to Calibrating Central Cameras.
VSAC: Efficient and Accurate Estimator for H and F.
From Goals, Waypoints & Paths To Long Term Human Trajectory Forecasting.
End-to-End Urban Driving by Imitating a Reinforcement Learning Coach.
Globally Optimal and Efficient Manhattan Frame Estimation by Delimiting Rotation Search Space.
Indoor Scene Generation from a Collection of Semantic-Segmented Depth Images.
Change is Everywhere: Single-Temporal Supervised Object Change Detection in Remote Sensing Imagery.
GRF: Learning a General Radiance Field for 3D Representation and Rendering.
Geometry-based Distance Decomposition for Monocular 3D Object Detection.
Waypoint Models for Instruction-guided Navigation in Continuous Environments.
Active Learning for Lane Detection: A Knowledge Distillation Approach.
GridToPix: Training Embodied Agents with Minimal Supervision.
Hierarchical Object-to-Zone Graph for Object Navigation.
Social NCE: Contrastive Learning of Socially-aware Motion Representations.
ID-Reveal: Identity-aware DeepFake Video Detection.
Multi-Expert Adversarial Attack Detection in Person Re-identification Using Context Inconsistency.
PASS: Protected Attribute Suppression System for Mitigating Bias in Face Recognition.
Ensemble Attention Distillation for Privacy-Preserving Federated Learning.
Adaptive Label Noise Cleaning with Meta-Supervision for Deep Face Recognition.
TransForensics: Image Forgery Localization with Dense Self-Attention.
Exploring Temporal Coherence for More General Video Face Forgery Detection.
Self-supervised Domain Adaptation for Forgery Localization of JPEG Compressed Images.
Learning Self-Consistency for Deepfake Detection.
TransReID: Transformer-based Object Re-Identification.
Learning Bias-Invariant Representation by Cross-Sample Mutual Information Minimization.
BiaSwap: Removing Dataset Bias with Bias-Tailored Swapping Augmentation.
Understanding and Mitigating Annotation Bias in Facial Expression Recognition.
Discover the Unknown Biased Attribute of an Image Classifier.
ICE: Inter-instance Contrastive Encoding for Unsupervised Person Re-identification.
Towards the Unseen: Iterative Text Recognition by Distilling from Errors.
Joint Visual Semantic Reasoning: Multi-Stage Decoder for Text Recognition.
Learning Instance-level Spatial-Temporal Patterns for Person Re-identification.
3D Local Convolutional Neural Networks for Gait Recognition.
Calibrating Concepts and Operations: Towards Symbolic Reasoning on Real Images.
SemIE: Semantically-aware Image Extrapolation.
LocalTrans: A Multiscale Local Transformer Network for Cross-Resolution Homography Estimation.
Diverse Image Style Transfer via Invertible Cross-Space Mapping.
Image Harmonization with Transformer.
Manifold Alignment for Semantically Aligned Style Transfer.
Detection and Continual Learning of Novel Face Presentation Attacks.
Robust Watermarking for Deep Neural Networks via Bi-level Optimization.
Understanding and Evaluating Racial Biases in Image Captioning.
Membership Inference Attacks are Easier on Difficult Problems.
DisUnknown: Distilling Unknown Factors for Disentanglement Learning.
Joint Audio-Visual Deepfake Detection.
Painting from Part.
Benchmarking Ultra-High-Definition Image Super-resolution.
Accelerating Atmospheric Turbulence Simulation via Learned Phase-to-Space Transform.
Click to Move: Controlling Video Generation with Sparse Motion.
Pathdreamer: A World Model for Indoor Navigation.
SLAMP: Stochastic Latent Appearance and Motion Prediction.
Point-Based Modeling of Human Clothing.
iPOKE: Poking a Still Image for Controlled Stochastic Video Synthesis.
Attention-based Multi-Reference Learning for Image Super-Resolution.
Dynamic Cross Feature Fusion for Remote Sensing Pansharpening.
Neural Image Compression via Attentional Multi-scale Back Projection and Frequency Decomposition.
Deep Edge-Aware Interactive Colorization against Color-Bleeding Effects.
Unpaired Learning for High Dynamic Range Image Tone Mapping.
Gait Recognition via Effective Global-Local Feature Representation and Local Temporal Aggregation.
Dressing in Order: Recurrent Person Image Generation for Pose Transfer, Virtual Try-on and Outfit Editing.
Bridging the Gap between Label- and Reference-based Synthesis in Multi-attribute Image-to-Image Translation.
StyleFormer: Real-time Arbitrary Style Transfer via Parametric Style Composition.
Domain-Aware Universal Style Transfer.
Flow-Guided Video Inpainting with Scene Templates.
Training Weakly Supervised Video Frame Interpolation with Events.
Internal Video Inpainting by Implicit Long-range Propagation.
Towards Complete Scene and Regular Shape for Distortion Rectification by Curve-Aware Extrapolation.
Parallel Multi-Resolution Fusion Network for Image Inpainting.
STRIVE: Scene Text Replacement In Videos.
Asymmetric Bilateral Motion Estimation for Video Frame Interpolation.
EgoRenderer: Rendering Human Avatars from Egocentric Camera Images.
Embedding Novel Views in a Single JPEG Image.
Learning a Sketch Tensor Space for Image Inpainting of Man-made Scenes.
OSCAR-Net: Object-centric Scene Graph Attention for Image Attribution.
XVFI: eXtreme Video Frame Interpolation.
ELF-VC: Efficient Learned Flexible-Rate Video Coding.
Occlusion-Aware Video Object Inpainting.
Infinite Nature: Perpetual View Generation of Natural Scenes from a Single Image.
Artificial Fingerprinting for Generative Models: Rooting Deepfake Attribution in Training Data.
Dual Projection Generative Adversarial Networks for Conditional Image Generation.
Latent Transformations via NeuralODEs for GAN-based Image Editing.
Collaging Class-specific GANs for Semantic Image Synthesis.
EigenGAN: Layer-Wise Eigen-Learning for GANs.
HeadGAN: One-shot Neural Head Synthesis and Editing.
Physics-based Differentiable Depth Sensor Simulation.
Towards Vivid and Diverse Image Colorization with Generative Color Prior.
ILVR: Conditioning Method for Denoising Diffusion Probabilistic Models.
Geometry-Free View Synthesis: Transformers and no 3D Priors.
FastNeRF: High-Fidelity Neural Rendering at 200FPS.
KiloNeRF: Speeding up Neural Radiance Fields with Thousands of Tiny MLPs.
Neural Radiance Flow for 4D View Synthesis and Video Processing.
Animatable Neural Radiance Fields for Modeling Dynamic Human Bodies.
Unconstrained Scene Generation with Locally Conditioned Radiance Fields.
Reality Transform Adversarial Generators for Image Splicing Forgery Detection and Localization.
Unsupervised Image Generation with Infinite Generative Adversarial Networks.
Semantically Robust Unpaired Image Translation for Data with Unmatched Semantics Statistics.
LatentCLR: A Contrastive Learning Approach for Unsupervised Discovery of Interpretable Directions.
Toward Spatially Unbiased Generative Models.
Cortical Surface Shape Analysis Based on Alexandrov Polyhedra.
Searching for Controllable Image Restoration Networks.
SIGNET: Efficient Neural Representation for Light Fields.
Modulated Periodic Activations for Generalizable Local Functional Representations.
Neural Strokes: Stylized Line Drawing of 3D Shapes.
From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network.
Image Manipulation Detection by Multi-View Multi-Scale Supervision.
Unaligned Image-to-Image Translation by Learning to Reweight.
CR-Fill: Generative Image Inpainting with Auxiliary Contextual Reconstruction.
Rethinking the Truly Unsupervised Image-to-Image Translation.
Aligning Latent and Image Spaces to Connect the Unconnectable.
Image Inpainting via Conditional Texture and Structure Dual Generation.
MVSNeRF: Fast Generalizable Radiance Field Reconstruction from Multi-View Stereo.
WaveFill: A Wavelet-based Generation Network for Image Inpainting.
PixelSynth: Generating a 3D-Consistent Experience from a Single Image.
Towards Discovery and Attribution of Open-world GAN Generated Images.
GAN-Control: Explicitly Controllable GANs.
GANcraft: Unsupervised 3D Neural Rendering of Minecraft Worlds.
Omni-GAN: On the Secrets of cGANs and Beyond.
Sketch Your Own GAN.
FuseFormer: Fusing Fine-Grained Information in Transformers for Video Inpainting.
Multi-Scale Separable Network for Ultra-High-Definition Video Deblurring.
Instance-wise Hard Negative Example Generation for Contrastive Learning in Unpaired Image-to-Image Translation.
TransferI2I: Transfer Learning for Image-to-Image Translation from Small Datasets.
Deep Halftoning with Reversible Binary Pattern.
Learning High-Fidelity Face Texture Completion without Complete Face Texture.
Diagonal Attention and Style-based GAN for Content-Style Disentanglement in Image Generation and Translation.
Labels4Free: Unsupervised Segmentation using StyleGAN.
DAE-GAN: Dynamic Aspect-aware GAN for Text-to-Image Synthesis.
Detail Me More: Improving GAN's photo-realism of complex scenes.
GAN Inversion for Out-of-Range Images with Geometric Transformations.
Frequency Domain Image Translation: More Photo-realistic, Better Identity-preserving.
Focal Frequency Loss for Image Reconstruction and Synthesis.
From Continuity to Editability: Inverting GANs with Consecutive Images.
Multiple Heads are Better than One: Few-shot Font Generation with Multiple Localized Experts.
VariTex: Variational Neural Face Textures.
Learning Generative Models of Textured 3D Meshes from Real-World Images.
Learning to Stylize Novel Views.
Structure-transformed Texture-enhanced Network for Person Image Synthesis.
3D Human Texture Estimation from a Single Image with Transformers.
Motion-Aware Dynamic Architecture for Efficient Frame Interpolation.
Learned Spatial Representations for Few-shot Talking-Head Synthesis.
Image Synthesis from Layout with Locality-Aware Mask Adaption.
FashionMirror: Co-attention Feature-remapping Virtual Try-on with Sequential Template Poses.
Talk-to-Edit: Fine-Grained Facial Editing via Dialog.
A Latent Transformer for Disentangled Face Editing in Images and Videos.
Learning Object-Compositional Neural Radiance Field for Editable Scene Rendering.
Image Shape Manipulation from a Single Augmented Training Sample.
PIRenderer: Controllable Portrait Image Generation via Semantic Neural Rendering.
Image Synthesis via Semantic Composition.
Class Semantics-based Attention for Action Detection.
CAG-QIL: Context-Aware Actionness Grouping via Q Imitation Learning for Online Temporal Action Localization.
Efficient Action Recognition via Dynamic Knowledge Propagation.
TAM: Temporal Adaptive Module for Video Recognition.
Class-Incremental Learning for Action Recognition in Videos.
Target Adaptive Context Aggregation for Video Scene Graph Generation.
Multi-Modal Multi-Action Video Recognition.
GroupFormer: Group Activity Recognition with Clustered Spatial-Temporal Transformer.
Video Self-Stitching Graph Network for Temporal Action Localization.
Learning Action Completeness from Points for Weakly-supervised Temporal Action Localization.
Elaborative Rehearsal for Zero-shot Action Recognition.
Selective Feature Compression for Efficient Activity Recognition Inference.
Learning Cross-Modal Contrastive Features for Video Domain Adaptation.
D2-Net: Weakly-Supervised Action Localization via Discriminative Embeddings and Denoised Activations.
Assignment-Space-based Multi-Object Tracking and Segmentation.
A Hybrid Video Anomaly Detection Framework via Memory-Augmented Flow Reconstruction and Flow-Guided Frame Prediction.
VidTr: Video Transformer Without Convolutions.
Channel Augmented Joint Learning for Visible-Infrared Recognition.
Generating Masks from Boxes by Mining Spatio-Temporal Consistencies in Videos.
Learning to Track Objects from Unlabeled Videos.
MultiSports: A Multi-Person Video Dataset of Spatio-Temporally Localized Sports Actions.
Relaxed Transformer Decoders for Direct Action Proposal Generation.
Enriching Local and Global Contexts for Temporal Action Localization.
Anticipative Video Transformer.
The Spatio-Temporal Poisson Point Process: A Simple Model for the Alignment of Event Camera Data.
Social Fabric: Tubelet Compositions for Video Relation Detection.
Discovering Human Interactions with Large-Vocabulary Objects via Query and Multi-Scale Detection.
HAA500: Human-Centric Atomic Action Dataset with Curated Videos.
Divide and Conquer for Single-frame Temporal Action Localization.
Learning Target Candidate Association to Keep Track of What Not to Track.
Else-Net: Elastic Semantic Network for Continual Action Recognition from Skeleton Data.
Skeleton Cloud Colorization for Unsupervised 3D Action Representation Learning.
AdaSGN: Adapting Joint Number and Model Size for Efficient Skeleton-Based Action Recognition.
AI Choreographer: Music Conditioned 3D Dance Generation with AIST++.
TRiPOD: Human Trajectory and Pose Dynamics Forecasting in the Wild.
GeomNet: A Neural Network Based on Riemannian Geometries of SPD Matrix Space and Cholesky Space for 3D Skeleton-Based Interaction Recognition.
Consistency-Aware Graph Network for Human Interaction Understanding.
Channel-wise Topology Refinement Graph Convolution for Skeleton-Based Action Recognition.
Evidential Deep Learning for Open Set Action Recognition.
Learn to Match: Automatic Matching Network Design for Visual Tracking.
Self-supervised 3D Skeleton Action Representation Learning with Motion Consistency and Continuity.
Spatially Conditioned Graphs for Detecting Human-Object Interactions.
Generating Smooth Pose Sequences for Diverse Human Motion Prediction.
Motion Prediction using Trajectory Cues.
Self-Supervised 3D Face Reconstruction via Conditional Estimation.
Likelihood-Based Diverse Sampling for Trajectory Forecasting.
Provably Approximated Point Cloud Registration.
Square Root Marginalization for Sliding-Window Bundle Adjustment.
Three Steps to Multimodal Trajectory Prediction: Modality Clustering, Classification and Synthesis.
M3D-VTON: A Monocular-to-3D Virtual Try-On Network.
PCAM: Product of Cross-Attention Matrices for Rigid Registration of Point Clouds.
A General Recurrent Tracking Framework without Real Data.
CAPTRA: CAtegory-level Pose Tracking for Rigid and Articulated Objects from Point Clouds.
Box-Aware Feature Enhancement for Single Object Tracking on Point Clouds.
Voxel-based Network for Shape Completion by Leveraging Edge Generation.
MEDIRL: Predicting the Visual Attention of Drivers via Maximum Entropy Deep Inverse Reinforcement Learning.
Unlimited Neighborhood Interaction for Heterogeneous Trajectory Prediction.
MG-GAN: A Multi-Generator Model Preventing Out-of-Distribution Samples in Pedestrian Trajectory Prediction.
On Exposing the Challenging Long Tail in Future Prediction of Traffic Actors.
Estimating and Exploiting the Aleatoric Uncertainty in Surface Normal Estimation.
SLIM: Self-Supervised LiDAR Scene Flow and Motion Segmentation.
Motion Basis Learning for Unsupervised Deep Homography Estimation with Subspace Projection.
Efficient and Differentiable Shadow Computation for Inverse Problems.
Geometric Granularity Aware Pixel-to-Mesh.
Multiresolution Deep Implicit Functions for 3D Shape Representation.
Motion Guided Attention Fusion to Recognize Interactions from Videos.
Learning Self-Similarity in Space and Time as Generalized Motion for Video Action Recognition.
Learning an Augmented RGB Representation with Cross-Modal Knowledge Distillation for Action Detection.
Object Tracking by Jointly Exploiting Frame and Event Domain.
Encoder-decoder with Multi-level Attention for 3D Human Shape and Pose Estimation.
Sketch2Mesh: Reconstructing and Editing 3D Shapes from Sketches.
SIMstack: A Generative Shape and Instance Model for Unordered Object Stacks.
A-SDF: Learning Disentangled Signed Distance Functions for Articulated Shape Representation.
Planar Surface Reconstruction from Sparse Views.
Discovering 3D Parts from Image Collections.
THUNDR: Transformer-based 3D HUmaN Reconstruction with Markers.
Non-Rigid Neural Radiance Fields: Reconstruction and Novel View Synthesis of a Dynamic Scene From Monocular Video.
CodeNeRF: Disentangled Neural Radiance Fields for Object Categories.
Mesh Graphormer.
I2UV-HandNet: Image-to-UV Prediction Network for Accurate and High-fidelity 3D Hand Mesh Modeling.
DeepGaze IIE: Calibrated prediction in and out-of-domain for state-of-the-art saliency modeling.
Context-Sensitive Temporal Feature Learning for Gait Recognition.
PIAP-DF: Pixel-Interested and Anti Person-Specific Facial Action Unit Detection Net with Discrete Feedback Learning.
Hierarchical Memory Matching Network for Video Object Segmentation.
Towards Interpretable Deep Networks for Monocular Depth Estimation.
GyroFlow: Gyroscope-Guided Unsupervised Optical Flow Learning.
VaPiD: A Rapid Vanishing Point Detector via Learned Optimizers.
Adaptive Surface Normal Constraint for Depth Estimation.
SurfaceNet: Adversarial SVBRDF Estimation from a Single Image.
Sparse Needlets for Lighting Estimation with Spherical Transport Loss.
Towards High Fidelity Monocular Face Reconstruction with Rich Reflectance using Self-supervised Learning and Ray Tracing.
Adaptive confidence thresholding for monocular depth estimation.
DnD: Dense Depth Estimation in Crowded Dynamic Indoor Scenes.
MonoIndoor: Towards Good Practice of Self-Supervised Monocular Depth Estimation for Indoor Environments.
R-MSFM: Recurrent Multi-Scale Feature Modulation for Monocular Depth Estimating.
Boosting Monocular Depth Estimation with Lightweight 3D Point Fusion.
PX-NET: Simple and Efficient Pixel-Wise Training of Photometric Stereo Networks.
Unsupervised Depth Completion with Calibrated Backprojection Layers.
Self-supervised Monocular Depth Estimation for All Day Images using Domain Separation.
Can Scale-Consistent Monocular Depth Be Learned in a Self-Supervised Scale-Invariant Manner?
Holistic Pose Graph: Modeling Geometric Structure among Objects in a Scene using Graph Inference for 3D Object Prediction.
4DComplete: Non-Rigid Motion Estimation Beyond the Observable Surface.
NPMs: Neural Parametric Models for 3D Deformable Shapes.
NeRD: Neural Reflectance Decomposition from Image Collections.
Learning Anchored Unsigned Distance Functions with Gradient Direction Alignment for Single-view Garment Reconstruction.
StructDepth: Leveraging the structural regularities for self-supervised indoor depth estimation.
Deep Implicit Surface Point Prediction Networks.
Fine-grained Semantics-aware Representation Enhancement for Self-supervised Monocular Depth Estimation.
DeepPanoContext: Panoramic 3D Scene Understanding with Holistic Scene Context Graph and Relation-based Optimization.
Bridging Unsupervised and Supervised Depth from Focus via All-in-Focus Supervision.
Geometric Deep Neural Network using Rigid and Non-Rigid Transformations for Human Action Recognition.
Toward Realistic Single-View 3D Object Reconstruction with Unsupervised Learning from Multiple Images.
Patch2CAD: Patchwise Embedding Learning for In-the-Wild Shape Retrieval from a Single Image.
MINE: Towards Continuous Depth MPI with NeRF for Novel View Synthesis.
RetrievalFuse: Neural 3D Scene Reconstruction with a Database.
In-the-Wild Single Camera 3D Reconstruction Through Moving Water Surfaces.
3D Building Reconstruction from Monocular Remote Sensing Images.
Learning Indoor Inverse Rendering with 3D Spatially-Varying Lighting.
Worldsheet: Wrapping the World in a 3D Sheet for View Synthesis from a Single Image.
SLIDE: Single Image 3D Photography with Soft Layering and Depth-aware Inpainting.
RFNet: Recurrent Forward Network for Dense Point Cloud Completion.
PoinTr: Diverse Point Cloud Completion with Geometry-Aware Transformers.
ME-PCN: Point Completion Conditioned on Mask Emptiness.
CSG-Stump: A Learning Friendly CSG-Like Representation for Interpretable Shape Parsing.
Unsupervised Learning of Fine Structure Generation for 3D Point Clouds by 2D Projection Matching.
3DStyleNet: Creating 3D Shapes with Geometric and Texture Style Variations.
3DIAS: 3D Shape Reconstruction with Implicit Algebraic Surfaces.
Sat2Vid: Street-view Panoramic Video Synthesis from a Single Satellite Image.
Structured Outdoor Architecture Reconstruction by Exploration and Classification.
Reconstructing Hand-Object Interactions in the Wild.
Single View Physical Distance Estimation using Human Pose.
SO-Pose: Exploiting Self-Occlusion for Direct 6D Pose Estimation.
EventHands: Real-Time Neural 3D Hand Pose Estimation from an Event Stream.
Uncertainty-Aware Human Mesh Recovery from Video by Learning Part-Based 3D Dynamics.
Gravity-Aware Monocular 3D Human-Object Reconstruction.
Pi-NAS: Improving Neural Architecture Search by Reducing Supernet Training Consistency Shift.
One-pass Multi-view Clustering for Large-scale Data.
Orthogonal Projection Loss.
AdvRush: Searching for Adversarially Robust Neural Architectures.
Learning Latent Architectural Distribution in Differentiable Neural Architecture Search via Variational Information Maximization.
Adaptive Convolutions with Per-pixel Dynamic Filter Atom.
Unifying Nonlocal Blocks for Neural Networks.
BossNAS: Exploring Hybrid CNN-transformers with Block-wisely Self-supervised Neural Architecture Search.
AutoFormer: Searching Transformers for Visual Recognition.
LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference.
Homogeneous Architecture Augmentation for Neural Predictor.
FairNAS: Rethinking Evaluation Fairness of Weight Sharing Neural Architecture Search.
Distilling Optimal Neural Networks: Rapid Search in Diverse Spaces.
Direct Differentiable Augmentation Search.
Product Quantizer Aware Inverted Index for Scalable Nearest Neighbor Search.
Vector Neurons: A General Framework for SO(3)-Equivariant Networks.
Robustness via Cross-Domain Ensembles.
Vision Transformers for Dense Prediction.
Viewpoint Invariant Dense Matching for Visual Geolocalization.
Bayesian Triplet Loss: Uncertainty Quantification in Image Retrieval.
Learning Attribute-driven Disentangled Representations for Interactive Fashion Retrieval.
Telling the What while Pointing to the Where: Multimodal Queries for Image Retrieval.
Video Geo-Localization Employing Geo-Temporal Feature Learning and GPS Trajectory Smoothing.
Face Image Retrieval with Attribute Manipulation.
Instance-level Image Retrieval using Reranking Transformers.
Learning specialized activation functions with the Piecewise Linear Unit.
Self-supervised Product Quantization for Deep Unsupervised Image Retrieval.
Deep Symmetric Network for Underexposed Image Enhancement with Recurrent Attentional Learning.
Deep Relational Metric Learning.
Universal Cross-Domain Retrieval: Generalizing Across Classes and Domains.
Learning by Aligning: Visible-Infrared Person Re-identification using Cross-Modal Correspondences.
Video-based Person Re-identification with Spatial and Temporal Memory Networks.
Pyramid Spatial-Temporal Aggregation for Video-based Person Re-Identification.
ASMR: Learning Attribute-Based Person Search with Adaptive Semantic Margin Regularizer.
Weakly Supervised Person Search with Region Siamese Networks.
PT-CapsNet: A Novel Prediction-Tuning Capsule Network Suitable for Deeper Architectures.
EC-DARTS: Inducing Equalized and Consistent Optimization into DARTS.
Inferring high-resolution traffic accident risk maps based on satellite imagery and GPS trajectories.
LIRA: Learnable, Imperceptible and Robust Backdoor Attacks.
Building-GAN: Graph-Conditioned Architectural Volumetric Design Generation.
Stochastic Transformer Networks with Linear Competing Units: Application to end-to-end SL Translation.
Rethinking Spatial Dimensions of Vision Transformers.
ALADIN: All Layer Adaptive Instance Normalization for Fine-grained Style Similarity.
HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval.
Beyond Road Extraction: A Dataset for Map Update using Aerial Images.
Clothing Status Awareness for Long-Term Person Re-Identification.
Learning to Know Where to See: A Visibility-Aware Approach for Occluded Person Re-identification.
Occluded Person Re-Identification with Single-scale Global Representations.
IDM: An Intermediate Domain Module for Domain Adaptive Person Re-ID.
The Center of Attention: Center-Keypoint Grouping via Attention for Multi-Person Pose Estimation.
Memory-augmented Dynamic Neural Relational Inference.
Occlude Them All: Occlusion-Aware Attention Network for Occluded Person Re-ID.
CM-NAS: Cross-Modality Neural Architecture Search for Visible-Infrared Person Re-Identification.
Explainable Person Re-Identification with Attribute-guided Metric Distillation.
TransPose: Keypoint Localization via Transformer.
Learning with Memory-based Virtual Classes for Deep Metric Learning.
Product1M: Towards Weakly Supervised Instance-Level Product Retrieval via Cross-Modal Pretraining.
DOLG: Single-Stage Image Retrieval with Deep Orthogonal Fusion of Local and Global Features.
Ranking Models in Unlabeled New Environments.
Improving Robustness of Facial Landmark Detection by Defending against Adversarial Attacks.
Online Knowledge Distillation for Efficient Pose Estimation.
DensePose 3D: Lifting Canonical Surface Maps of Articulated Objects to the Third Dimension.
Motion Adaptive Pose Estimation from Compressed Videos.
Neural-GIF: Neural Generalized Implicit Functions for Animating People in Clothing.
Towards Accurate Alignment in Real-time 3D Hand-Mesh Reconstruction.
Full-Body Motion from a Single Head-Mounted Device: Generating SMPL Poses from Partial Observations.
DECA: Deep viewpoint-Equivariant human pose estimation using Capsule Autoencoders.
TravelNet: Self-supervised Physically Plausible Hand Motion Learning from Monocular Color Images.
3D Human Pose Estimation with Spatial and Temporal Transformers.
A Unified 3D Human Motion Synthesis Model via Conditional Variational Auto-Encoder.
Neural TMDlayer: Modeling Instantaneous flow of features via SDE Generators.
Self-supervised Transfer Learning for Hand Mesh Recovery from Binocular Images.
Deep Virtual Markers for Articulated 3D Shapes.
Probabilistic Modeling for Human Mesh Recovery.
SNARF: Differentiable Forward Skinning for Animating Non-Rigid Neural Implicit Shapes.
TeachText: CrossModal Generalized Distillation for Text-Video Retrieval.
Support-Set Based Cross-Supervision for Video Grounding.
TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment.
Aligning Subtitles in Sign Language Videos.
Visual Alignment Constraint for Continuous Sign Language Recognition.
Physics-based Human Motion Estimation and Synthesis from Videos.
Normalized Human Pose Features for Human Action Video Alignment.
EM-POSE: 3D Human Pose Estimation from Sparse Electromagnetic Trackers.
Estimating Egocentric 3D Human Pose in Global Space.
HuMoR: 3D Human Motion Model for Robust Pose Estimation.
Modulated Graph Convolutional Network for 3D Human Pose Estimation.
MSR-GCN: Multi-Scale Residual Graph Convolution Networks for Human Motion Prediction.
Revitalizing Optimization for 3D Human Pose and Shape Estimation: A Sparse Constrained Formulation.
PyMAF: 3D Human Pose and Shape Regression with Pyramidal Mesh Alignment Feedback Loop.
Learning Skeletal Graph Neural Networks for Hard 3D Pose Estimation.
Cherry-Picking Gradients: Learning Low-Rank Embeddings of Visual Data via Differentiable Cross-Approximation.
Learning Deep Local Features with Multiple Dynamic Attentions for Large-Scale Image Retrieval.
Single Image 3D Shape Retrieval via Cross-Modal Instance and Category Contrastive Learning.
Weakly Supervised Text-based Person Re-Identification.
Neural Architecture Search for Joint Human Parsing and Pose Estimation.
Stochastic Scene-Aware Motion Prediction.
SemiHand: Semi-supervised Hand Pose Estimation with Consistency.
Interacting Two-Hand 3D Pose and Shape Reconstruction from Single Color Image.
Learning Motion Priors for 4D Human Body Capture in 3D Scenes.
Contextually Plausible and Diverse 3D Human Motion Prediction.
The Animation Transformer: Visual Correspondence via Segment Matching.
TokenPose: Learning Keypoint Tokens for Human Pose Estimation.
Self-Mutual Distillation Learning for Continuous Sign Language Recognition.
Audio2Gestures: Generating Diverse Gestures from Speech Audio with Conditional Variational Autoencoders.
Hand Image Understanding via Deep Multi-Task Learning.
Learning Causal Representation for Training Cross-Domain Pose Estimator via Generative Interventions.
HandFoldingNet: A 3D Hand Pose Estimation Network Using Multiscale-Feature Guided Folding of a 2D Hand Skeleton.
Learning to Regress Bodies from Images using Differentiable Semantic Rendering.
An Empirical Study of the Collapsing Problem in Semi-Supervised 2D Human Pose Estimation.
Self-Supervised 3D Hand Pose Estimation from monocular RGB via Contrastive Learning.
Hierarchical Kinematic Probability Distributions for 3D Human Shape and Pose Estimation from Images in the Wild.
Space-Time-Separable Graph Convolutional Network for Pose Forecasting.
Probabilistic Monocular 3D Human Pose Estimation with Normalizing Flows.
End-to-End Detection and Pose Estimation of Two Interacting Hands.
Monocular, One-stage, Regression of Multiple 3D People.
Camera Distortion-aware 3D Human Pose Estimation in Video with Optimization-based Meta-Learning.
Shape-aware Multi-Person Pose Estimation from Multi-View Images.
Graph-Based 3D Multi-Person Pose Estimation Using Multi-View Images.
Learning Realistic Human Reposing using Cyclic Self-Supervision with 3D Shape, Pose, and Appearance Consistency.
PARE: Part Attention Regressor for 3D Human Body Estimation.
SOMA: Solving Optical Marker-Based MoCap Automatically.
Hand-Object Contact Consistency Reasoning for Human Grasps Generation.
CPF: Learning a Contact Potential Field to Model the Hand-Object Interaction.
SignBERT: Pre-Training of Hand-Model-Aware Representation for Sign Language Recognition.
Speech Drives Templates: Co-Speech Gesture Synthesis with Learned Templates.
Removing the Bias of Integral Pose Regression.
Keypoint Communities.
ARCH++: Animation-Ready Clothed Human Reconstruction Revisited.
SPEC: Seeing People in the Wild with an Estimated Camera.
Human Pose Regression with Residual Log-likelihood Estimation.
Unsupervised 3D Pose Estimation for Hierarchical Dance Video Recognition.
Egocentric Pose Estimation from Human Vision Span.
EventHPE: Event-based 3D Human Pose and Shape Estimation.
Action-Conditioned 3D Human Motion Synthesis with Transformer VAE.
The Power of Points for Modeling Humans in Clothing.
BioFors: A Large Biomedical Image Forensics Dataset.
FloW: A Dataset and Benchmark for Floating Waste Detection in Inland Waters.
BV-Person: A Large-scale Dataset for Bird-view Person Re-identification.
3D-FRONT: 3D Furnished Rooms with layOuts and semaNTics.
Towards Real-world X-ray Security Inspection: A High-Quality Benchmark And Lateral Inhibition Module For Prohibited Items Detection.
Hypersim: A Photorealistic Synthetic Dataset for Holistic Indoor Scene Understanding.
Common Objects in 3D: Large-Scale Learning and Evaluation of Real-life 3D Category Reconstruction.
UltraPose: Synthesizing Dense Pose with 1 Billion Points by Human-body Decoupling 3D Model.
SynFace: Face Recognition with Synthetic Data.
StereOBJ-1M: Large-scale Stereo Image Dataset for 6D Object Pose Estimation.
Learning to Track with Object Permanence.
MOTSynth: How Can Synthetic Data Help Pedestrian Detection and Tracking?
Learning to Adversarially Blur Visual Object Tracking.
Wanderlust: Online Continual Object Detection in the Real World.
ORBIT: A Real-World Few-Shot Dataset for Teachable Object Recognition.
Separable Flow: Learning Motion Cost Volumes for Optical Flow Estimation.
End-to-End Video Instance Segmentation via Spatial-Temporal Graph Neural Networks.
Omnidata: A Scalable Pipeline for Making Multi-Task Mid-Level Vision Datasets from 3D Scans.
Unidentified Video Objects: A Benchmark for Dense, Open-World Segmentation.
ACDC: The Adverse Conditions Dataset with Correspondences for Semantic Driving Scene Understanding.
Dynamic Surface Function Networks for Clothed Human Bodies.
KoDF: A Large-scale Korean DeepFake Detection Dataset.
Transparent Object Tracking Benchmark.
DepthTrack: Unveiling the Power of RGBD Tracking.
Cloud Transformers: A Universal Approach To Point Cloud Processing Tasks.
Low-Shot Validation: Active Importance Sampling for Estimating Classifier Performance on Rare Categories.
Lifelong Infinite Mixture Model Based on Knowledge-Driven Dirichlet Process.
Learning with Privileged Tasks.
Lipschitz Continuity Guided Knowledge Distillation.
Kernel Methods in Hyperbolic Spaces.
DiagViB-6: A Diagnostic Benchmark Suite for Vision Models in the Presence of Shortcut and Generalization Opportunities.
Do Different Deep Metric Learning Losses Lead to Similar Learned Features?
LoOp: Looking for Optimal Hard Negative Embeddings for Deep Metric Learning.
Contrastive Learning for Label Efficient Semantic Segmentation.
von Mises-Fisher Loss: An Exploration of Embedding Geometries for Supervised Learning.
Webly Supervised Fine-Grained Recognition: Benchmark Datasets and An Approach.
Weakly Supervised Representation Learning with Coarse Labels.
Focus on the Positives: Self-Supervised Learning for Biodiversity Monitoring.
Partner-Assisted Learning for Few-Shot Image Classification.
Space-Time Crop & Attend: Improving Cross-modal Video Representation Learning.
Personalized Image Semantic Segmentation.
Region Similarity Representation Learning.
Impact of Aliasing on Generalization in Deep Convolutional Networks.
Poly-NL: Linear Complexity Non-local Layers With 3rd Order Polynomials.
Not All Operations Contribute Equally: Hierarchical Operation-adaptive Predictor for Neural Architecture Search.
High-Resolution Optical Flow from 1D Attention and Correlation.
Exploring Simple 3D Multi-Object Tracking for Autonomous Driving.
Point-set Distances for Learning Representations of 3D Point Clouds.
SGMNet: Learning Rotation-Invariant Point Cloud Representations via Sorted Gram Matrix.
Temporally-Coherent Surface Reconstruction via Metric-Consistent Atlases.
Learning Spatio-Temporal Transformer for Visual Tracking.
PARTS: Unsupervised segmentation with slots, attention and independence maximization.
Motion-Augmented Self-Training for Video Recognition at Smaller Scale.
ViewNet: Unsupervised Viewpoint Estimation from Conditional Generation.
Curious Representation Learning for Embodied Intelligence.
BuildingNet: Learning to Label 3D Buildings.
Distilling Holistic Knowledge with Graph Neural Networks.
RANK-NOSH: Efficient Predictor-Based Architecture Search via Non-Uniform Successive Halving.
Adversarial Unsupervised Domain Adaptation with Conditional and Label Shift: Infer, Align and Iterate.
Refining activation downsampling with SoftPool.
Warp Consistency for Unsupervised Learning of Dense Correspondences.
Instance Similarity Learning for Unsupervised Feature Representation.
Mean Shift for Self-Supervised Learning.
Rethinking preventing class-collapsing in metric learning with margin-based losses.
Improving Contrastive Learning by Visualizing Feature Transformation.
Video Annotation for Visual Tracking via Selection and Refinement.
Benchmark Platform for Ultra-Fine-Grained Visual Categorization Beyond Human Performance.
ACAV100M: Automatic Curation of Large-Scale Datasets for Audio-Visual Video Representation Learning.
Active Learning for Deep Object Detection via Probabilistic Modeling.
Self-Supervised Pretraining of 3D Features on any Point-Cloud.
Learning Conditional Knowledge Distillation for Degraded-Reference Image Quality Assessment.
Understanding Robustness of Transformers for Image Classification.
Temporal-wise Attention Spiking Neural Networks for Event Streams Classification.
Improving robustness against common corruptions with frequency biased models.
Improve Unsupervised Pretraining for Few-label Transfer.
Self-Supervised Representation Learning from Flow Equivariance.
Geography-Aware Self-Supervised Learning.
Temporal Knowledge Consistency for Unsupervised Visual Representation Learning.
Self-Supervised Visual Representations Learning by Contrastive Mask Prediction.
Contrastive Learning of Image Representations with Cross-Video Cycle-Consistency.
H2O: Two Hands Manipulating Objects for First Person Interaction Recognition.
FloorPlanCAD: A Large-Scale CAD Drawing Dataset for Panoptic Symbol Spotting.
OpenForensics: Large-Scale Challenging Dataset For Multi-Face Forgery Detection And Segmentation In-The-Wild.
LaLaLoc: Latent Layout Localisation in Dynamic, Unvisited Environments.
SketchAA: Abstract Representation for Abstract Sketches.
Efficient Visual Pretraining with Contrastive Detection.
Rethinking Self-supervised Correspondence Learning: A Video Frame-level Similarity Perspective.
Divide and Contrast: Self-supervised Learning from Uncurated Data.
Unsupervised Semantic Segmentation by Contrasting Object Mask Proposals.
Weakly Supervised Contrastive Learning.
Rethinking and Improving Relative Position Encoding for Vision Transformer.
InSeGAN: A Generative Approach to Segmenting Identical Instances in Depth Images.
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows.
Field Convolutions for Surface CNNs.
T-SVDNet: Exploring High-Order Prototypical Correlations for Multi-Source Domain Adaptation.
Co-Scale Conv-Attentional Image Transformers.
Time-Equivariant Contrastive Video Representation Learning.
Modelling Neighbor Relation in Joint Space-Time Graph for Video Correspondence Learning.
Contrasting Contrastive Self-Supervised Representation Learning Pipelines.
Learning Compatible Embeddings.
Clustering by Maximizing Mutual Information Across Views.
Learning Better Visual Data Similarities via New Grouplet Non-Euclidean Embedding.
Deep Matching Prior: Test-Time Optimization for Dense Correspondence.
On Equivariant and Invariant Learning of Object Landmark Representations.
Towards Interpretable Deep Metric Learning with Structural Matching.
Track without Appearance: Learn Box and Tracklet Embedding with Local and Global Motion Patterns for Vehicle Tracking.
Saliency-Associated Object Tracking.
High-Performance Discriminative Tracking with Transformers.
CrowdDriven: A New Challenging Dataset for Outdoor Visual Localization.
Visio-Temporal Attention for Multi-Camera Multi-Target Association.
Human Trajectory Prediction via Counterfactual Analysis.
AgentFormer: Agent-Aware Transformers for Socio-Temporal Multi-Agent Forecasting.
LOKI: Long Term and Key Intentions for Trajectory Prediction.
Learn-to-Race: A Multimodal Control Environment for Autonomous Racing.
Unsupervised Point Cloud Pre-training via Occlusion Completion.
Learning to Estimate Hidden Motions with Global Motion Aggregation.
X-World: Accessibility, Vision, and Autonomy Meet.
A Hierarchical Variational Neural Uncertainty Model for Stochastic Video Prediction.
Dissecting Image Crops.
Video Autoencoder: self-supervised disentanglement of static 3D structure and motion.
Contact-Aware Retargeting of Skinned Motion.
Large Scale Interactive Motion Forecasting for Autonomous Driving: The Waymo Open Motion Dataset.
Seeing Dynamic Scene in the Dark: A High-Quality Video Dataset with Mechatronic Alignment.
UVStyle-Net: Unsupervised Few-shot Learning of 3D Style Similarity Measure for B-Reps.
Learning Facial Representations from the Cycle-consistency of Face.
Joint Inductive and Transductive Learning for Video Object Segmentation.
Do Image Classifiers Generalize Across Time?
Emerging Properties in Self-Supervised Vision Transformers.
An Empirical Study of Training Self-Supervised Vision Transformers.
Concept Generalization in Visual Representation Learning.
SelfReg: Self-supervised Contrastive Regularization for Domain Generalization.
ISD: Self-Supervised Learning by Iterative Similarity Distillation.
On Feature Decorrelation in Self-Supervised Learning.
With a Little Help from My Friends: Nearest-Neighbor Contrastive Learning of Visual Representations.
On Compositions of Transformations in Contrastive Self-Supervised Learning.
Universal-Prototype Enhancing for Few-Shot Object Detection.
SIGN: Spatial-information Incorporated Generative Network for Generalized Zero-shot Semantic Segmentation.
Field-Guide-Inspired Zero-Shot Learning.
Exploiting a Joint Embedding Space for Generalized Zero-Shot Semantic Segmentation.
Universal Representation Learning from Multiple Domains for Few-shot Classification.
Co2L: Contrastive Continual Learning.
Solving Inefficiency of Self-supervised Representation Learning.
Distributional Robustness Loss for Long-tail Learning.
Learning from Noisy Data with Robust Representation Learning.
CoMatch: Semi-supervised Learning with Contrastive Graph Regularization.
Meta-Learning with Task-Adaptive Loss Function for Few-Shot Learning.
Few-Shot and Continual Learning with Attentive Independent Mechanisms.
Few-shot Image Classification: Just Use a Library of Pre-trained Feature Extractors and a Simple Classifier.
Meta Navigator: Search for a Good Adaptation Policy for Few-shot Learning.
Boosting the Generalization Capability in Cross-Domain Few-shot Learning via Noise-enhanced Supervised Autoencoder.
Seasonal Contrast: Unsupervised Pre-Training from Uncurated Remote Sensing Data.
Testing using Privileged Information by Adapting Features with Statistical Dependence.
Densely Guided Knowledge Distillation using Multiple Teacher Assistants.
Rehearsal revealed: The limits and merits of revisiting samples in continual learning.
Always Be Dreaming: A New Approach for Data-Free Class-Incremental Learning.
MT-ORL: Multi-Task Occlusion Relationship Learning.
STEM: An approach to Multi-source Domain Adaptation with Guarantees.
Vector-Decomposed Disentanglement for Domain-Invariant Object Detection.
Partial Video Domain Adaptation with Partial Adversarial Temporal Attentive Network.
Towards Novel Target Discovery Through Open-Set Domain Adaptation.
Me-Momentum: Extracting Hard Confident Examples from Noisily Labeled Data.
Energy-Based Open-World Uncertainty Modeling for Confidence Calibration.
Localized Simple Multiple Kernel K-means.
A Unified Objective for Novel Class Discovery.
Influence Selection for Active Learning.
Semi-Supervised Single-Stage Controllable GANs for Conditional Fine-Grained Image Generation.
Video Pose Distillation for Few-Shot, Fine-Grained Sports Action Recognition.
Long Short View Feature Decomposition via Contrastive Video Representation Learning.
Multi-VAE: Learning Disentangled View-common and View-peculiar Visual Representations for Multi-view Clustering.
Graph Contrastive Clustering.
Information-theoretic regularization for Multi-source Domain Adaptation.
Seeking Similarities over Differences: Similarity-based Domain Alignment for Adaptive Object Detection.
Exploring Robustness of Unsupervised Domain Adaptation in Semantic Segmentation.
Tune it the Right Way: Unsupervised Validation of Domain Adaptation via Soft Neighborhood Density.
Re-energizing Domain Discriminator with Sample Relabeling for Adversarial Domain Adaptation.
A Style and Semantic Memory Mechanism for Domain Generalization.
The Pursuit of Knowledge: Discovering and Localizing Novel Categories using Dual Memory.
Robust Object Detection via Instance-Level Temporal Cycle Confusion.
Knowledge Mining and Transferring for Domain Adaptive Object Detection.
CDS: Cross-Domain Self-supervised Pre-training.
Multi-Anchor Active Domain Adaptation for Semantic Segmentation.
Semantic Concentration for Domain Adaptation.
Uncertainty-aware Pseudo Label Refinery for Domain Adaptive Semantic Segmentation.
Dual Path Learning for Domain Adaptation of Semantic Segmentation.
Multi-Target Adversarial Frameworks for Domain Adaptation in Semantic Segmentation.
Meta-Baseline: Exploring Simple Meta-Learning for Few-Shot Learning.
Coarsely-labeled Data for Better Few-shot Transfer.
Mixture-based Feature Space Learning for Few-shot Image Classification.
On the Importance of Distractors for Few-Shot Classification.
Generalized and Incremental Few-Shot Learning by Explicit Learning and Calibration without Forgetting.
Adaptive Adversarial Network for Source-free Domain Adaptation.
OVANet: One-vs-All Network for Universal Domain Adaptation.
RDA: Robust Domain Adaptation via Fourier Adversarial Attacking.
Generalized Source-free Domain Adaptation.
Active Universal Domain Adaptation.
Confidence Calibration for Domain Generalization under Covariate Shift.
Meta Learning on a Sequence of Imbalanced Domains with Difficulty Awareness.
Gradient Distribution Alignment Certificates Better Adversarial Domain Adaptation.
Contrastive Coding for Active Learning under Class Distribution Mismatch.
Weak Adaptation Learning: Addressing Cross-domain Data Insufficiency with Weak Annotator.
Deep Co-Training with Task Decomposition for Semi-Supervised Domain Adaptation.
Collaborative Learning with Disentangled Features for Zero-shot Domain Adaptation.
A Simple Feature Augmentation for Domain Generalization.
mDALU: Multi-Source Domain Adaptation and Label Unification with Partial Datasets.
Unsupervised Domain Adaptive 3D Detection with Multi-Level Consistency.
Multi-Task Self-Training for Learning General Representations.
A Broad Study on the Transferability of Visual Representations with Contrastive Learning.
Composable Augmentation Encoding for Video Representation Learning.
Relational Embedding for Few-Shot Classification.
Variational Feature Disentangling for Fine-Grained Few-Shot Classification.
BAPA-Net: Boundary Adaptation and Prototype Alignment for Cross-domain Semantic Segmentation.
Divide-and-Assemble: Learning Block-wise Memory for Unsupervised Anomaly Detection.
Deep Transport Network for Unsupervised Video Object Segmentation.
Domain-Invariant Disentangled Network for Generalizable Object Detection.
PIT: Position-Invariant Transform for Cross-FoV Domain Adaptation.
Iterative label cleaning for transductive and semi-supervised few-shot learning.
Simpler is Better: Few-shot Semantic Segmentation with Classifier Weight Transformer.
Discriminative Region-based Multi-Label Zero-Shot Learning.
Mining Latent Classes for Few-shot Segmentation.
Semantics Disentangling for Generalized Zero-Shot Learning.
Learning to Hallucinate Examples from Extrinsic and Intrinsic Supervision.
Curvature Generation in Curved Spaces for Few-Shot Learning.
DeFRCN: Decoupled Faster R-CNN for Few-Shot Object Detection.
Pseudo-loss Confidence Metric for Semi-supervised Few-shot Learning.
Synthesized Feature based Few-Shot Class-Incremental Learning on a Mixture of Subspaces.
Towards Alleviating the Modeling Ambiguity of Unsupervised Monocular 3D Human Pose Estimation.
Unsupervised Layered Image Decomposition into Object Prototypes.
Intrinsic-Extrinsic Preserved GANs for Unsupervised 3D Pose Transfer.
Skeleton2Mesh: Kinematics Prior Injected Unsupervised Human Mesh Recovery.
Self-Supervised Object Detection via Generative Image Synthesis.
Transporting Causal Mechanisms for Unsupervised Domain Adaptation.
LabOR: Labeling Only if Required for Domain Adaptive Semantic Segmentation.
ECACL: A Holistic Framework for Semi-Supervised Domain Adaptation.
Adversarial Robustness for Unsupervised Domain Adaptation.
SENTRY: Selective Entropy Optimization via Committee Consistency for Unsupervised Domain Adaptation.
BiMaL: Bijective Maximum Likelihood Approach to Domain Adaptation in Semantic Scene Segmentation.
Geometric Unsupervised Domain Adaptation for Semantic Segmentation.
Towards Discriminative Representation Learning for Unsupervised Person Re-identification.
Domain Adaptive Semantic Segmentation with Self-Supervised Depth Estimation.
Active Domain Adaptation via Clustering Uncertainty-weighted Embeddings.
A Hierarchical Transformation-Discriminating Generative Model for Few Shot Anomaly Detection.
Unsupervised Few-Shot Action Recognition via Action-Appearance Aligned Meta-Adaptation.
Interaction Compass: Multi-Label Zero-Shot Learning of Human-Object Interactions via Spatial Relations.
LoFGAN: Fusing Local Representations for Few-shot Image Generation.
A Multi-Mode Modulator for Multi-Domain Few-Shot Classification.
Semi-Supervised Learning of Visual Features by Non-Parametrically Predicting View Assignments with Support Samples.
Task-aware Part Mining Network for Few-Shot Learning.
Learning Rare Category Classifiers on a Tight Labeling Budget.
Transductive Few-Shot Classification on the Oblique Manifold.
Binocular Mutual Learning for Improving Few-shot Classification.
DetCo: Unsupervised Contrastive Learning for Object Detection.
Shape Self-Correction for Unsupervised Point Cloud Understanding.
Online Pseudo Label Generation by Hierarchical Cluster Dynamics for Adaptive Person Re-identification.
Unsupervised Dense Deformation Embedding Network for Template-Free Shape Correspondence.
Keep CALM and Improve Visual Feature Attribution.
The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization.
DRÆM - A discriminatively trained reconstruction embedding for surface anomaly detection.
NAS-OoD: Neural Architecture Search for Out-of-Distribution Generalization.
Trash to Treasure: Harvesting OOD Data with Cross-Modal Matching for Open-Set Semi-Supervised Learning.
Semantically Coherent Out-of-Distribution Detection.
Task Switching Network for Multi-task Learning.
Online Continual Learning with Natural Distribution Shifts: An Empirical Study with Visual Data.
Exploring Inter-Channel Correlation for Diversity-preserved Knowledge Distillation.
CCT-Net: Category-Invariant Cross-Domain Transfer for Medical Single-to-Multiple Disease Diagnosis.
Continual Prototype Evolution: Learning Online from Non-Stationary Data Streams.
Self-Supervised Video Representation Learning with Meta-Contrastive Network.
A Simple Baseline for Semi-supervised Semantic Segmentation with Strong Data Augmentation.
Semi-Supervised Semantic Segmentation with Pixel-Level Contrastive Learning from a Class-wise Memory Bank.
GistNet: a Geometric Structure Transfer Network for Long-Tailed Recognition.
Parallel Detection-and-Segmentation Learning for Weakly Supervised Instance Segmentation.
Frequency-Aware Spatiotemporal Transformers for Video Inpainting Detection.
Watch Only Once: An End-to-End Video Action Detection Framework.
Interactive Prototype Learning for Egocentric Action Recognition.
HighlightMe: Detecting Highlights from Human-Centric Videos.
Attention is not Enough: Mitigating the Distribution Discrepancy in Asynchronous Multimodal Sequence Fusion.
TF-Blender: Temporal Feature Blender for Video Object Detection.
Joint Visual and Audio Learning for Video Highlight Detection.
Unified Graph Structured Models for Video Understanding.
Detecting Human-Object Relationships in Videos.
ASCNet: Self-supervised Video Representation Learning with Appearance-Speed Consistency.
Weakly-Supervised Action Segmentation and Alignment via Transcript-Aware Union-of-Subspaces Learning.
Generic Event Boundary Detection: A Benchmark for Event Segmentation.
Video Object Segmentation with Dynamic Memory Networks and Adaptive Object Alignment.
Domain Adaptive Video Segmentation via Temporal Consistency Regularization.
Crossover Learning for Fast Online Video Instance Segmentation.
Searching for Two-Stream Models in Multivariate Space for Video Recognition.
Temporal Action Detection with Multi-level Supervision.
Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos.
Foreground-Action Consistency Network for Weakly Supervised Temporal Action Localization.
Enhancing Self-supervised Video Representation Learning via Multi-level Feature Optimization.
PR-Net: Preference Reasoning for Personalized Video Highlight Detection.
Cross-category Video Highlight Detection via Set-based Learning.
VideoLT: Large-scale Long-tailed Video Recognition.
Temporal Cue Guided Video Highlight Detection with Low-Rank Audio-Visual Fusion.
Contrast and Order Representations for Video Self-supervised Learning.
Online-trained Upsampler for Deep Low Complexity Video Compression.
Group-aware Contrastive Regression for Action Quality Assessment.
Sensor-Guided Optical Flow.
Fooling LiDAR Perception via Adversarial Trajectory Perturbation.
End-to-End Unsupervised Document Image Blind Denoising.
Removing Adversarial Noise in Class Activation Feature Space.
Data-free Universal Adversarial Perturbation and Black-box Attack.
Exploiting Multi-Object Relationships for Detecting Adversarial Attacks in Complex Scenes.
Naturalistic Physical Adversarial Patch for Object Detectors.
On the Robustness of Vision Transformers to Adversarial Examples.
Integer-arithmetic-only Certified Robustness for Quantized Neural Networks.
Batch Normalization Increases Adversarial Vulnerability and Decreases Adversarial Transferability: A Non-Robust Feature Perspective.
Relating Adversarially Robust Generalization to Flat Minima.
Minimal Adversarial Examples for Deep Learning on 3D Point Clouds.
Meta-Attack: Class-agnostic and Model-agnostic Physical Adversarial Attack.
Consistency-Sensitivity Guided Ensemble Black-Box Adversarial Attacks in Low-Dimensional Spaces.
Adversarial Attacks On Multi-Agent Communication.
Reliably fast adversarial training via latent adversarial perturbation.
Meta Gradient Adversarial Attack.
Augmented Lagrangian Adversarial Attacks.
Towards Understanding the Generative Capability of Adversarially Robust Classifiers.
ProFlip: Targeted Trojan Attack with Progressive Bit Flips.
On Generating Transferable Targeted Perturbations.
Parallel Rectangle Flip Attack: A Query-based Black-box Attack against Object Detection.
Adversarial Example Detection Using Latent Neighborhood Graph.
Sample Efficient Detection and Classification of Adversarial Attacks via Self-Supervised Embeddings.
Just One Moment: Structural Vulnerability of Deep Action Recognition against One Frame Attack.
AGKD-BML: Defense Against Adversarial Attack by Attention Guided Knowledge Distillation and Bi-directional Metric Learning.
TkML-AP: Adversarial Attacks to Top-k Multi-Label Learning.
Feature Importance-aware Transferable Adversarial Attacks.
Where are you heading? Dynamic Trajectory Prediction with Expert Goal Examples.
DRIVE: Deep Reinforced Accident Anticipation with Visual Explanation.
Robustness Certification for Point Cloud Models.
A Backdoor Attack against 3D Point Cloud Classifiers.
Q-Match: Iterative Shape Matching via Quantum Annealing.
AdaMML: Adaptive Multi-Modal Learning for Efficient Video Recognition.
OadTR: Online Action Detection with Transformers.
Tripartite Information Mining and Integration for Image Matting.
MultiSiam: Self-supervised Multi-instance Siamese Representation Learning for Autonomous Driving.
Self-Supervised Vessel Segmentation via Adversarial Learning.
Can Shape Structure Features Improve Model Robustness under Diverse Adversarial Settings?
S3VAADA: Submodular Subset Selection for Virtual Adversarial Active Domain Adaptation.
AdvDrop: Adversarial Attack to DNNs by Dropping Information.
Towards Robustness of Deep Neural Networks via Regularization.
Dynamic Divide-and-Conquer Adversarial Training for Robust Semantic Segmentation.
Spatio-Temporal Dynamic Inference Network for Group Activity Recognition.
Interpolation-Aware Padding for 3D Sparse Convolutional Neural Networks.
CPFN: Cascaded Primitive Fitting Networks for High-Resolution Point Clouds.
DRINet: A Dual-Representation Iterative Learning Network for Point Cloud Segmentation.
Differentiable Convolution Search for Point Cloud Processing.
Scaling Semantic Segmentation Beyond 1K Classes on a Single GPU.
Scribble-Supervised Semantic Segmentation by Uncertainty Reduction on Neural Representation and Self-Supervision on Neural Eigenspace.
Weakly Supervised Segmentation of Small Buildings with Point Labels.
A Weakly Supervised Amodal Segmenter with Boundary Uncertainty Estimation.
Graph-BAS3Net: Boundary-Aware Semi-Supervised Segmentation Network with Bilateral Graph Convolution.
Dynamic Network Quantization for Efficient Video Inference.
Predictive Feature Learning for Future Segmentation Prediction.
Weakly Supervised Temporal Anomaly Segmentation with Dynamic Time Warping.
Conditional Diffusion for Interactive Segmentation.
Unsupervised Point Cloud Object Co-segmentation by Co-contrastive Learning and Mutual Attention Sampling.
Unsupervised Segmentation incorporating Shape Prior via Generative Adversarial Networks.
Real-time Instance Segmentation with Discriminative Orientation Maps.
Exploring Cross-Image Pixel Contrast for Semantic Segmentation.
Few-Shot Semantic Segmentation with Cyclic Memory Network.
ECS-Net: Improving Weakly Supervised Semantic Segmentation by Using Connections Between Class Activation Maps.
Pixel Contrastive-Consistent Semi-Supervised Semantic Segmentation.
Segmenter: Transformer for Semantic Segmentation.
From Contexts to Locality: Ultra-high Resolution Image Segmentation via Locality-aware Contextual Correlation.
Complementary Patch for Weakly Supervised Semantic Segmentation.
Mining Contextual Information Beyond Image for Semantic Segmentation.
Boundary-sensitive Pre-training for Temporal Localization in Videos.
Multiview Pseudo-Labeling for Semi-supervised Learning from Video.
Cross-Sentence Temporal and Semantic Relations in Video Activity Localisation.
ISNet: Integrate Image-Level and Semantic-Level Context for Semantic Segmentation.
Self-supervised Video Object Segmentation by Motion Grouping.
Cascade Image Matting with Deformable Graph Refinement.
SOTR: Segmenting Objects with Transformers.
Joint Topology-preserving and Feature-refinement Network for Curvilinear Structure Segmentation.
Specialize and Fuse: Pyramidal Output Representation for Semantic Segmentation.
How Shift Equivariance Impacts Metric Learning for Instance Segmentation.
TempNet: Online Semantic Segmentation on Large-scale Point Cloud Series.
Sparse-to-dense Feature Matching: Intra and Inter domain Cross-modal Learning in Domain Adaptation for 3D Semantic Segmentation.
Persistent Homology based Graph Convolution Network for Fine-grained 3D Shape Segmentation.
ShapeConv: Shape-aware Convolutional Layer for Indoor RGB-D Semantic Segmentation.
AINet: Association Implantation for Superpixel Segmentation.
Self-Mutating Network for Domain Adaptive Segmentation of Aerial Images.
Calibrated Adversarial Refinement for Stochastic Semantic Segmentation.
Generalize then Adapt: Source-Free Domain Adaptive Semantic Segmentation.
C3-SemiSeg: Contrastive Semi-supervised Segmentation via Cross-set Learning and Dynamic Class-balancing.
RECALL: Replay-based Continual Learning in Semantic Segmentation.
The surprising impact of mask-head architecture on novel class segmentation.
Context Decoupling Augmentation for Weakly Supervised Semantic Segmentation.
Unlocking the Potential of Ordinary Classifier: Class-specific Adversarial Erasing Framework for Weakly Supervised Semantic Segmentation.
Leveraging Auxiliary Tasks with Affinity Learning for Weakly Supervised Semantic Segmentation.
Prototypical Matching and Open Set Rejection for Zero-Shot Semantic Segmentation.
Pseudo-mask Matters in Weakly-supervised Semantic Segmentation.
Self-Regulation for Semantic Segmentation.
Hypercorrelation Squeeze for Few-Shot Segmentation.
Re-distributing Biased Pseudo Labels for Semi-supervised Semantic Segmentation: A Baseline Investigation.
Seminar Learning for Click-Level Weakly Supervised Semantic Segmentation.
Instances as Queries.
An Elastica Geodesic Approach with Convexity Shape Prior.
Local Temperature Scaling for Probability Calibration.
RINDNet: Edge Detection for Discontinuity in Reflectance, Illumination, Normal and Depth.
Field of Junctions: Extracting Boundary Structure at Low SNR.
Learning to Cut by Watching Movies.
End-to-End Dense Video Captioning with Parallel Decoding.
ViViT: A Video Vision Transformer.
Multiscale Vision Transformers.
Where2Act: From Pixels to Actions for Articulated 3D Objects.
Toward a Visual Concept Vocabulary for GAN Latent Space.
Online Multi-Granularity Distillation for GAN Compression.
Scaling-up Disentanglement for Image Translation.
DeepCAD: A Deep Generative Network for Computer-Aided Design Models.
Multi-Class Multi-Instance Count Conditioned Adversarial Image Generation.
Harnessing the Conditioning Sensorium for Improved Image Translation.
F-Drop&Match: GANs with a Dead Zone in the High-Frequency Domain.
Dual Contrastive Loss and Attention for GANs.
Orthogonal Jacobian Regularization for Unsupervised Disentanglement in Image Generation.
ReStyle: A Residual-Based StyleGAN Encoder via Iterative Refinement.
When do GANs replicate? On the choice of dataset size.
Generative Layout Modeling using Constraint Graphs.
Extending Neural P-frame Codecs for B-frame Coding.
Searching for Robustness: Loss Learning for Noisy Classification Tasks.
Evolving Search Space for Neural Architecture Search.
AdaAttN: Revisit Attention Mechanism in Arbitrary Neural Style Transfer.
PixelPyramids: Exact Inference Models from Lossless Image Pyramids.
Domain Generalization via Gradient Surgery.
Semantic Perturbations with Normalizing Flows for Improved Generalization.
Robust Trust Region for Weakly Supervised Segmentation.
Paint Transformer: Feed Forward Neural Painting with Stroke Prediction.
Manifold Matching via Deep Metric Learning for Generative Modeling.
A Lazy Approach to Long-Horizon Gradient-Based Meta-Learning.
Self-Knowledge Distillation with Progressive Refinement of Targets.
Bias Loss for Mobile Neural Networks.
SPatchGAN: A Statistical Feature Based Discriminator for Unsupervised Image-to-Image Translation.
Spatio-temporal Self-Supervised Representation Learning for 3D Point Clouds.
Learning Signed Distance Field for Multi-view Surface Reconstruction.
Vis2Mesh: Efficient Mesh Reconstruction from Unstructured Point Clouds of Large Scenes with Learned Virtual View Visibility.
SA-ConvONet: Sign-Agnostic Optimization of Convolutional Occupancy Networks.
JEM++: Improved Techniques for Training JEM.
Collaborative Optimization and Aggregation for Decentralized Domain Generalization and Adaptation.
Generalized Shuffled Linear Regression.
Progressive Correspondence Pruning by Consensus Learning.
Synchronization of Group-labelled Multi-graphs.
Learning with Noisy Labels for Robust Point Cloud Segmentation.
Bootstrap Your Own Correspondences.
Guided Point Contrastive Learning for Semi-supervised Point Cloud Semantic Segmentation.
Progressive Seed Generation Auto-encoder for Unsupervised Point Cloud Learning.
Geometry-Aware Self-Training for Unsupervised Domain Adaptation on Object Point Clouds.
WarpedGANSpace: Finding non-linear RBF paths in GAN latent space.
DRB-GAN: A Dynamic ResBlock Generative Adversarial Network for Artistic Style Transfer.
Gradient Normalization for Generative Adversarial Networks.
Auto Graph Encoder-Decoder for Neural Network Pruning.
GNeRF: GAN-based Neural Radiance Field without Posed Camera.
TMCOSS: Thresholded Multi-Criteria Online Subset Selection for Data-Efficient Autonomous Driving.
Making Higher Order MOT Scalable: An Efficient Approximate Solver for Lifted Disjoint Paths.
Fast Light-field Disparity Estimation with Multi-disparity-scale Cost Aggregation.
UASNet: Uncertainty Adaptive Sampling Network for Deep Stereo Matching.
Learning to Match Features with Seeded Graph Matching Network.
Distilling Global and Local Logits with Densely Connected Relations.
FFT-OT: A Fast Algorithm for Optimal Transportation.
Fusion Moves for Graph Matching.
Faster Multi-Object Segmentation using Parallel Quadratic Pseudo-Boolean Optimization.
Learning to Bundle-adjust: A Graph Network Approach to Faster Optimization of Bundle Adjustment for Vehicular SLAM.
DeepMultiCap: Performance Capture of Multiple Characters Using Sparse Multiview Cameras.
iMAP: Implicit Mapping and Positioning in Real-Time.
On the Limits of Pseudo Ground Truth in Visual Camera Re-localisation.
COTR: Correspondence Transformer for Matching Across Images.
Revisiting Stereo Depth Estimation From a Sequence-to-Sequence Perspective with Transformers.
AA-RMVSNet: Adaptive Aggregation Recurrent Multi-view Stereo Network.
Just a Few Points are All You Need for Multi-view Stereo: A Novel Semi-supervised Learning Method for Multi-view Stereo.
A Confidence-based Iterative Solver of Depths and Surface Normals for Deep Multi-view Stereo.
PatchMatch-RL: Deep MVS with Pixelwise Depth, Normal, and Visibility.
Rational Polynomial Camera Model Warping for Deep Learning Based Satellite Multi-View Stereo Matching.
A Robust Loss for Point Cloud Registration.
Sampling Network Guided Cross-Entropy Method for Unsupervised Point Cloud Registration.
AdaFit: Rethinking Learning-based Normal Estimation on Point Clouds.
(Just) A Spoonful of Refinements Helps the Registration Error Go Down.
Pyramid Point Cloud Transformer for Large-Scale Place Recognition.
Differentiable Surface Rendering via Non-Differentiable Sampling.
Digging into Uncertainty in Self-supervised Multi-view Stereo.
Minimal Cases for Computing the Generalized Relative Pose using Affine Correspondences.
Cross-Descriptor Visual Localization and Mapping.
Stacked Homography Transformations for Multi-View Pedestrian Detection.
DepthInSpace: Exploitation and Fusion of Multiple Video Frames for Structured-Light Depth Estimation.
Matching in the Dark: A Dataset for Matching Image Pairs of Low-light Scenes.
Transfusion: A Novel SLAM Method Focused on Transparent Objects.
SaccadeCam: Adaptive Visual Attention for Monocular Depth Sensing.
ODAM: Object Detection, Association, and Mapping using Posed RGB Video.
Pixel-Perfect Structure-from-Motion with Featuremetric Refinement.
Deep Permutation Equivariant Structure from Motion.
STR-GQN: Scene Representation and Rendering for Unknown Cameras Based on Spatial Transformation Routing.
Learning Efficient Photometric Feature Transform for Multi-view Stereo.
ELLIPSDF: Joint Object Pose and Shape Optimization with a Bi-level Ellipsoid and Signed Distance Function Description.
Calibrated and Partially Calibrated Semi-Generalized Homographies.
Dynamical Pose Estimation.
Gaussian Fusion: Accurate 3D Reconstruction via Geometry-Guided Displacement Interpolation.
Radial Distortion Invariant Factorization for Structure from Motion.
PoGO-Net: Pose Graph Optimization with Graph Neural Networks.
Putting NeRF on a Diet: Semantically Consistent Few-Shot View Synthesis.
Baking Neural Radiance Fields for Real-Time View Synthesis.
Nerfies: Deformable Neural Radiance Fields.
Mip-NeRF: A Multiscale Representation for Anti-Aliasing Neural Radiance Fields.
Self-Calibrating Neural Radiance Fields.
LSD-StructureNet: Modeling Levels of Structural Detail in 3D Part Hierarchies.
3D Shape Generation and Completion through Point-Voxel Diffusion.
ARAPReg: An As-Rigid-As Possible Regularization Loss for Learning Deformable Shape Generators.
Deep Hybrid Self-Prior for Full 3D Mesh Generation.
GTT-Net: Learned Generalized Trajectory Triangulation.
AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis.
Editing Conditional Radiance Fields.
Neural Articulated Radiance Field.
PlenOctrees for Real-time Rendering of Neural Radiance Fields.
BARF: Bundle-Adjusting Neural Radiance Fields.
EPP-MVSNet: Epipolar-assembling based Depth Prediction for Multi-view Stereo.
Multi-view 3D Reconstruction with Transformers.
Dynamic View Synthesis from Dynamic Monocular Video.
Extreme Structure from Motion for Indoor Panoramas without Visual Overlaps.
Pri3D: Can 3D Priors Help 2D Representation Learning?
DeepPRO: Deep Partial Point Cloud Registration of Objects.
3DeepCT: Learning Volumetric Scattering Tomography of Clouds.
Learning Icosahedral Spherical Probability Map Based on Bingham Mixture Model for Vanishing Point Estimation.
Adaptive Surface Reconstruction with Multiscale Convolutional Kernels.
Out-of-Core Surface Reconstruction via Global TGV Minimization.
Scene Synthesis via Uncertainty-Driven Attribute Synchronization.
H3D-Net: Few-Shot High-Fidelity 3D Head Reconstruction.
NerfingMVS: Guided Optimization of Neural Radiance Fields for Indoor Multi-view Stereo.
PR-RRN: Pairwise-Regularized Residual-Recursive Networks for Non-rigid Structure-from-Motion.
UNISURF: Unifying Neural Implicit Surfaces and Radiance Fields for Multi-View Reconstruction.
Minimal Solutions for Panoramic Stitching Given Gravity Prior.
Orthographic-Perspective Epipolar Geometry.
Lightweight Multi-person Total Motion Capture Using Sparse Multi-view Cameras.
MBA-VO: Motion Blur Aware Visual Odometry.
Viewing Graph Solvability via Cycle Consistency.
Feature Interactive Representation for Point Cloud Registration.
4D Cloud Scattering Tomography.
Superpoint Network for Point Cloud Oversegmentation.
SnowflakeNet: Point Cloud Completion by Snowflake Point Deconvolution with Skip-Transformer.
Distinctiveness oriented Positional Equilibrium for Point Cloud Registration.
CanvasVAE: Learning to Generate Vector Graphic Documents.
DeePSD: Automatic Deep Skinning And Pose Space Deformation For 3D Garment Animation.
imGHUM: Implicit Generative Models of 3D Human Shape and Articulated Pose.
Rotation Averaging in a Split Second: A Primal-Dual Method and a Closed-Form for Cycle Graphs.
Structure-from-Sherds: Incremental 3D Reassembly of Axially Symmetric Pots from Unordered and Mixed Fragment Collections.
ZFlow: Gated Appearance Flow-based Virtual Try-on with 3D Priors.
Explain Me the Painting: Multi-Topic Knowledgeable Art Description Generation.
Towards Real-World Prohibited Item Detection: A Large-Scale X-ray Benchmark.
BEV-Net: Assessing Social Distancing Compliance by Joint People Localization and Geometric Reasoning.
SmartShadow: Artistic Shadow Drawing Tool for Line Drawings.
Fast and Efficient DNN Deployment via Deep Gaussian Transfer Learning.
Cluster-Promoting Quantization with Bit-Drop for Minimizing Network Quantization Loss.
Sub-bit Neural Networks: Learning to Compress and Accelerate Binary Neural Networks.
Towards Mixed-Precision Quantization of Neural Networks via Constrained Optimization.
Once Quantization-Aware Training: High Performance Extremely Low-bit Architecture Search.
Dynamic Dual Gating Neural Networks.
Improving Generalization of Batch Whitening by Convolutional Unit Optimization.
Channel-wise Knowledge Distillation for Dense Prediction.
Meta-Aggregator: Learning to Aggregate for 1-bit Graph Neural Networks.
Generalizable Mixed-Precision Quantization via Attribution Rank Preservation.
Improving Neural Network Efficiency via Post-training Quantization with Adaptive Floating-Point.
Distance-aware Quantization.
Improving Low-Precision Network Quantization via Bin Regularization.
RMSMP: A Novel Deep Neural Network Quantization Framework with Row-wise Mixed Schemes and Multiple Precisions.
GDP: Stabilized Neural Network Pruning via Gates with Differentiable Polarization.
Towards Memory-Efficient Neural Networks via Multi-Level in situ Generation.
FATNN: Fast and Accurate Ternary Neural Networks.
HIRE-SNN: Harnessing the Inherent Robustness of Energy-Efficient Deep Spiking Neural Networks by Training with Crafted Input Noise.
ReCU: Reviving the Dead Weights in Binary Neural Networks.
Bit-Mixer: Mixed-precision networks with runtime bit-width selection.
Unsupervised Curriculum Domain Adaptation for No-Reference Video Quality Assessment.
SACoD: Sensor Algorithm Co-Design Towards Efficient CNN-powered Intelligent PhlatCam.
BlockCopy: High-Resolution Video Processing with Block-Sparse Feature Propagation and Online Policies.
MUSIQ: Multi-scale Image Quality Transformer.
Spectral Leakage and Rethinking the Kernel Size in CNNs.
Entropy Maximization and Meta Classification for Out-of-Distribution Detection in Semantic Segmentation.
Pixel Difference Networks for Efficient Edge Detection.
Learning Multiple Pixelwise Tasks Based on Loss Scale Balancing.
NASOA: Towards Faster Task-oriented Online Fine-tuning with a Zoo of Models.
Rethinking Deep Image Prior for Denoising.
BlockPlanner: City Block Generation with Vectorized Graph Representation.
Adaptive Curriculum Learning.
Student Customized Knowledge Distillation: Bridging the Gap Between Student and Teacher.
Self-born Wiring for Neural Trees.
Polarimetric Helmholtz Stereopsis.
DC-ShadowNet: Single-Image Hard and Soft Shadow Removal Using Unsupervised Domain-Classifier Guided Network.
Location-aware Single Image Reflection Removal.
Learning to Remove Refractive Distortions from Underwater Images.
Towards Flexible Blind JPEG Artifacts Removal.
Improving De-raining Generalization via Neural Reorganization.
Weakly-supervised Video Anomaly Detection with Robust Temporal Feature Magnitude Learning.
Adaptive Graph Convolution for Point Cloud Analysis.
The Benefit of Distraction: Denoising Camera-Based Physiological Measurements using Inverse Attention.
A Machine Teaching Framework for Scalable Recognition.
iNAS: Integral NAS for Device-Aware Salient Object Detection.
Full-Duplex Strategy for Video Object Segmentation.
Collaborative Unsupervised Visual Representation Learning from Decentralized Data.
Video Matting via Consistency-Regularized Graph Neural Networks.
Dense Deep Unfolding Network with 3D-CNN Prior for Snapshot Compressive Imaging.
EvIntSR-Net: Event Guided Multiple Latent Frames Reconstruction and Super-resolution.
Panoptic Segmentation of Satellite Image Time Series with Convolutional Temporal Attention Networks.
Attentive and Contrastive Learning for Joint Depth and Motion Field Estimation.
R-SLAM: Optimizing Eye Tracking from Rolling Shutter Video of the Retina.
Out-of-boundary View Synthesis Towards Full-Frame Video Stabilization.
SSH: A Self-Supervised Framework for Image Harmonization.
Achieving on-Mobile Real-Time Super-Resolution with Neural Architecture and Pruning Search.
Deep Blind Video Super-resolution.
Learning A Single Network for Scale-Arbitrary Super-Resolution.
Designing a Practical Degradation Model for Deep Blind Image Super-Resolution.
Real-world Video Super-resolution: A Benchmark Dataset and A Decomposition based Learning Scheme.
Morphable Detector for Object Detection on Demand.
DivAug: Plug-in Automated Data Augmentation with Explicit Diversity Maximization.
Unpaired Learning for Deep Image Deraining with Rain Direction Regularizer.
CANet: A Context-Aware Network for Shadow Removal.
HiNet: Deep Image Hiding by Invertible Network.
Visual Saliency Transformer.
Light Field Saliency Detection with Dual Local Graph Learning and Reciprocative Guidance.
Mitigating Intensity Bias in Shadow Detection via Feature Decomposition and Reweighting.
High-Fidelity Pluralistic Image Completion with Transformers.
Specificity-preserving RGB-D Saliency Detection.
DCT-SNN: Using DCT to Distribute Spatial Information over Time for Low-Latency Spiking Neural Networks.
PnP-DETR: Towards Efficient Visual Analysis with Transformers.
Cross-Patch Graph Convolutional Network for Image Denoising.
Rethinking Coarse-to-Fine Approach in Single Image Deblurring.
Overfitting the Data: Compact Neural Video Delivery via Content-aware Feature Modulation.
RDI-Net: Relational Dynamic Inference Networks.
Low-Rank Tensor Completion by Approximating the Tensor Average Rank.
Extensions of Karger's Algorithm: Why They Fail in Theory and How They Are Useful in Practice.
Rethinking Noise Synthesis and Modeling in Raw Denoising.
Score-Based Point Cloud Denoising.
Real-Time Video Inference on Edge Devices via Adaptive Model Streaming.
Augmenting Depth Estimation with Geospatial Context.
Robust Automatic Monocular Vehicle Speed Estimation for Traffic Surveillance.
SUNet: Symmetric Undistortion Network for Rolling Shutter Correction.
Bringing Events into Video Deblurring with Non-consecutively Blurry Frames.
Efficient Video Compression via Content-Adaptive Super-Resolution.
ResRep: Lossless CNN Pruning via Decoupling Remembering and Forgetting.
A New Journey from SDRTV to HDRTV.
Self-Conditioned Probabilistic Learning of Video Rescaling.
Event Stream Super-Resolution via Spatiotemporal Constraint Learning.
Super-Resolving Cross-Domain Face Miniatures by Peeking at One-Shot Exemplar.
Representative Color Transform for Image Enhancement.
Ultra-High-Definition Image HDR Reconstruction via Collaborative Bilateral Learning.
Adaptive Unfolding Total Variation Network for Low-Light Image Enhancement.
Omniscient Video Super-Resolution.
Federated Learning for Non-IID Data via Unified Feature Learning and Optimization Objective Alignment.
MixMix: All You Need for Data-Free Compression Are Feature and Data Mixing.
Zero-Shot Day-Night Domain Adaptation with a Physics Prior.
Multi-Level Curriculum for Training A Distortion-Aware Barrel Distortion Rectification Model.
Equivariant Imaging: Learning Beyond the Range Space.
Learning Unsupervised Metaformer for Anomaly Detection.
Deep Structured Instance Graph for Distilling Object Detectors.
Learning RAW-to-sRGB Mappings with Inaccurately Aligned Supervision.
RGB-D Saliency Detection via Cascaded Mutual Information Minimization.
Dynamic Attentive Graph Learning for Image Restoration.
Unsupervised Real-World Super-Resolution: A Domain Adaptation Perspective.
Learning Frequency-aware Dynamic Network for Efficient Super-Resolution.
Pyramid Architecture Search for Real-Time Image Deblurring.
Dynamic High-Pass Filtering and Multi-Spectral Attention for Image Super-Resolution.
Context Reasoning Attention Network for Image Super-Resolution.
End-to-end Piece-wise Unwarping of Document Images.
Event-Intensity Stereo: Estimating Depth by the Best of Both Worlds.
ReconfigISP: Reconfigurable Camera Image Processing Pipeline.
Structure-Preserving Deraining with Residue Channel Prior Guidance.
Inverting a Rolling Shutter Camera: Bring Rolling Shutter Images to High Framerate Global Shutter Video.
TransView: Inside, Outside, and Across the Cropping View Boundaries.
Exploring Visual Engagement Signals for Representation Learning.
ALL Snow Removed: Single Image Desnowing Algorithm Using Hierarchical Dual-tree Complex Wavelet Representation and Contradict Channel Loss.
PlaneTR: Structure-Guided Transformers for 3D Plane Recovery.
Light Source Guided Single-Image Flare Removal from Unpaired Data.
Summarize and Search: Learning Consensus-aware Dynamic Convolution for Co-Saliency Detection.
Scene Context-Aware Salient Object Detection.
Uncertainty-Guided Transformer Reasoning for Camouflaged Object Detection.
MFNet: Multi-filter Directive Network for Weakly Supervised Salient Object Detection.
StarEnhancer: Learning Real-Time and Style-Aware Image Enhancement.
Perceptual Variousness Motion Deblurring with Light Global Context Refinement.
STAR: A Structure-aware Lightweight Transformer for Real-time Image Enhancement.
Mutual Affine Network for Spatially Variant Kernel Estimation in Blind Image Super-Resolution.
Learning Dual Priors for JPEG Compression Artifacts Removal.
Hierarchical Conditional Flow: A Unified Framework for Image Super-Resolution and Image Rescaling.
CryoDRGN2: Ab initio neural reconstruction of 3D protein structures from real cryo-EM images.
Self-Supervised Cryo-Electron Tomography Volumetric Image Restoration from Single Noisy Volume with Sparsity Constraint.
Deep survival analysis with longitudinal X-rays for COVID-19.
Mutual-Complementing Framework for Nuclei Detection and Segmentation in Pathology Image.
CDNet: Centripetal Direction Network for Nuclear Instance Segmentation.
Multimodal Co-Attention Transformer for Survival Prediction in Gigapixel Whole Slide Images.
Multi-Class Cell Detection Using Spatial Context Representation.
The Way to my Heart is through Contrastive Learning: Remote Photoplethysmography from Unlabelled Video.
Visual-Textual Attentive Semantic Consistency for Medical Report Generation.
RFNet: Region-aware Fusion Network for Incomplete Multi-modal Brain Tumor Segmentation.
T-AutoML: Automated Machine Learning for Lesion Segmentation using Transformers in 3D Medical Imaging.
Semantic Aware Data Augmentation for Cell Nuclei Microscopical Images with Artificial Neural Networks.
GLoRIA: A Multimodal Global-Local Representation Learning Framework for Label-efficient Medical Image Recognition.
Generative Adversarial Registration for Improved Conditional Deformable Templates.
Recurrent Mask Refinement for Few-Shot Medical Image Segmentation.
Re-Aging GAN: Toward Personalized Face Age Transformation.
Towards Face Encryption by Generating Adversarial Identity Masks.
Retrieve in Style: Unsupervised Facial Feature Transfer and Retrieval.
Disentangled Lifespan Face Synthesis.
FACIAL: Synthesizing Dynamic Talking Face with Implicit Attribute Learning.
End-to-end robust joint unsupervised image alignment and clustering.
Learn to Cluster Faces via Pairwise Classification.
Generalizing Gaze Estimation with Outlier-guided Collaborative Adaptation.
Topologically Consistent Multi-View Face Inference Using Volumetric Sampling.
DAM: Discrepancy Alignment Metric for Face Recognition.
Physics-Enhanced Machine Learning for Virtual Fluorescence Microscopy.
DWKS: A Local Descriptor of Deformations Between Meshes and Point Clouds.
CrackFormer: Transformer Network for Fine-Grained Crack Detection.
CondLaneNet: a Top-to-down Lane Detection Framework Based on Conditional Convolution.
Multi-Echo LiDAR for 3D Object Detection.
Towards Efficient Graph Convolutional Networks for Point Cloud Handling.
Looking here or there? Gaze Following in 360-Degree Images.
Real-time Vanishing Point Detector Integrating Under-parameterized RANSAC and Hough Transform.
Free-form Description Guided 3D Visual Graph Network for Object Grounding in Point Cloud.
VENet: Voting Enhancement Network for 3D Object Detection.
Cross-Encoder for Unsupervised Gaze Representation Learning.
Disentangled Representation for Age-Invariant Face Recognition: A Mutual Information Minimization Perspective.
Fake it till you make it: face analysis in the wild using synthetic data alone.
Teacher-Student Adversarial Depth Hallucination to Improve Face Recognition.
Meta Pairwise Relationship Distillation for Unsupervised Person Re-identification.
Conditional DETR for Fast Training Convergence.
Mutual Supervision for Dense Object Detection.
Reconcile Prediction Consistency for Balanced Object Detection.
Fast Convergence of DETR with Spatially Modulated Co-Attention.
Rethinking Transformer-based Set Prediction for Object Detection.
TransFER: Learning Relation-aware Facial Expression Representations with Transformers.
G-DetKD: Towards General Distillation Framework for Object Detectors via Contrastive and Semantic-guided Feature Imitation.
Disentangled High Quality Salient Object Detection.
SimROD: A Simple Adaptation Method for Robust Object Detection.
DualPoseNet: Category-level 6D Object Pose and Size Estimation Using Dual Pose Network with Refined Learning of Pose Consistency.
Visual Relationship Detection Using Part-and-Sum Transformers with Composite Queries.
FMODetect: Robust Detection of Fast Moving Objects.
Towards Rotation Invariance in Object Detection.
Oriented R-CNN for Object Detection.
TOOD: Task-aligned One-stage Object Detection.
Preservational Learning Improves Self-supervised Medical Image Models by Reconstructing Diverse Contexts.
Collaborative and Adversarial Learning of Focused and Dispersive Representations for Semi-supervised Polyp Segmentation.
Big Self-Supervised Models Advance Medical Image Classification.
Learning Hierarchical Graph Neural Networks for Image Clustering.
FASA: Feature Augmentation and Sampling Adaptation for Long-Tailed Instance Segmentation.
Semi-Supervised Active Learning with Temporal Output Discrepancy.
Training Multi-Object Detector by Estimating Bounding Box Distribution for Input Image.
Normalization Matters in Weakly Supervised Object Localization.
Exploring Classification Equilibrium in Long-Tailed Object Detection.
DiscoBox: Weakly Supervised Instance Segmentation and Semantic Correspondence from Box Supervision.
ICON: Learning Regular Maps Through Inverse Consistency.
Foreground Activation Maps for Weakly Supervised Object Localization.
Learning to Better Segment Objects from Unseen Classes with Unlabeled Videos.
Rethinking Counting and Localization in Crowds: A Purely Point-Based Framework.
Multi-scale Matching Networks for Semantic Correspondence.
Long-Term Temporally Consistent Unpaired Video Translation from Simulated Surgical 3D Data.
Personalized and Invertible Face De-identification by Disentangled Identity Information Manipulation.
GarmentNets: Category-Level Pose Estimation for Garments via Canonical Space Shape Completion.
PICCOLO: Point Cloud-Centric Omnidirectional Localization.
RePOSE: Fast 6D Object Pose Refinement via Deep Texture Rendering.
Exploring Geometry-aware Contrast and Clustering Harmonization for Self-supervised 3D Object Detection.
You Don't Only Look Once: Constructing Spatial-Temporal Memory for Integrated 3D Object Detection and Tracking.
RandomRooms: Unsupervised Pre-training from Synthetic Shapes and Randomized Layouts for 3D Object Detection.
Multi-Source Domain Adaptation for Object Detection.
Query Adaptive Few-Shot Object Detection with Heterogeneous Graph Convolutional Networks.
Continual Learning for Image-Based Camera Localization.
Efficient Large Scale Inlier Voting for Geometric Vision Problems.
Uniformity in Heterogeneity: Diving Deep into Count Interval Partition for Crowd Counting.
Are we Missing Confidence in Pseudo-LiDAR Methods for Monocular 3D Object Detection?
Exploiting sample correlation for crowd counting with multi-expert network.
Towards A Universal Model for Cross-Dataset Crowd Counting.
CrossDet: Crossline Representation for Object Detection.
Detecting Invisible People.
Voxel Transformer for 3D Object Detection.
LIGA-Stereo: Learning LiDAR Geometry Aware Representations for Stereo-based 3D Detector.
Is Pseudo-Lidar needed for Monocular 3D Object detection?
OMNet: Learning Overlapping Mask for Partial-to-Partial Point Cloud Registration.
Multi-Instance Pose Networks: Rethinking Top-Down Pose Estimation.
Geometry Uncertainty Projection Network for Monocular 3D Object Detection.
MLVSNet: Multi-level Voting Siamese Network for 3D Visual Tracking.
Causal Attention for Unbiased Visual Recognition.
ADNet: Leveraging Error-Bias Towards Normal Direction in Face Alignment.
CaT: Weakly Supervised Object Detection with Category Transfer.
End-to-End Semi-Supervised Object Detection with Soft Teacher.
Robust Small-scale Pedestrian Detection with Cued Recall via Memory Learning.
Large-scale Robust Deep AUC Maximization: A New Surrogate Loss and Empirical Studies on Medical Image Classification.
DecentLaM: Decentralized Momentum SGD for Large-batch Deep Training.
Switchable K-class Hyperplanes for Noise-Robust Representation Learning.
Rank & Sort Loss for Object Detection and Instance Segmentation.
Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding.
Dynamic DETR: End-to-End Object Detection with Dynamic Attention.
WB-DETR: Transformer-Based Detector without Backbone.
ELSD: Efficient Line Segment Detector and Descriptor.
Body-Face Joint Detection via Embedding and Head Hook.
Group-Free 3D Object Detection via Transformers.
Gated3D: Monocular 3D Object Detection From Temporal Illumination Cues.
3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds.
RangeDet: In Defense of Range View for LiDAR-based 3D Object Detection.
An End-to-End Transformer Model for 3D Object Detection.
Semi-supervised Active Learning for Semi-supervised Models: Exploit Adversarial Examples with Graph-based Virtual Labels.
TS-CAM: Token Semantic Coupled Attention Map for Weakly Supervised Object Localization.
Boosting Weakly Supervised Object Detection via Learning Bounding Box Adjusters.
PreDet: Large-scale weakly supervised pre-training for detection.
Human Detection and Segmentation via Multi-view Consensus.
Self-Supervised Image Prior Learning with GMM from a Single Noisy Image.
Weakly Supervised 3D Semantic Segmentation Using Cross-Image Consensus and Inter-Voxel Affinity Relations.
Prior to Segment: Foreground Cues for Weakly Annotated Classes in Partially Supervised Instance Segmentation.
Sparse-shot Learning with Exclusive Cross-Entropy for Extremely Many Localisations.
Contrastive Attention Maps for Self-supervised Co-localization.
PR-GCN: A Deep Graph Convolutional Network with Point Refinement for 6D Pose Estimation.
Instance Segmentation in 3D Scenes using Semantic Superpoint Tree Networks.
SGPA: Structure-Guided Prior Adaptation for Category-Level 6D Object Pose Estimation.
GraphFPN: Graph Feature Pyramid Network for Object Detection.
HPNet: Deep Primitive Segmentation Using Hybrid Representations.
Improving 3D Object Detection with Channel-wise Transformer.
Learning Multi-Scene Absolute Pose Regression with Transformers.
Pyramid R-CNN: Towards Better Performance and Adaptability for 3D Object Detection.
The Devil is in the Task: Exploiting Reciprocal Appearance-Localization Features for Monocular 3D Object Detection.
Dual Bipartite Graph Learning: A General Approach for Domain Adaptive Object Detection.
Time-Multiplexed Coded Aperture Imaging: Learned Coded Aperture and Pixel Exposures for Compressive Imaging Systems.
A Hybrid Frequency-Spatial Domain Model for Sparse Image Reconstruction in Scanning Transmission Electron Microscopy.
Multispectral illumination estimation using deep unrolling network.
Incorporating Learnable Membrane Time Constant to Enhance Learning of Spiking Neural Networks.
Single-shot Hyperspectral-Depth Imaging with Learned Diffractive Optics.
Single Image Defocus Deblurring Using Kernel-Sharing Parallel Atrous Convolutions.
Extreme-Quality Computational Imaging via Degradation Framework.
Self-supervised Neural Networks for Spectral Snapshot Compressive Imaging.
Universal and Flexible Optical Aberration Correction Using Deep-Prior Based Deconvolution.
A Simple Framework for 3D Lensless Imaging with Programmable Masks.
Objects as Cameras: Estimating High-Frequency Illumination from Shadows.
Motion Deblurring with Real Events.
Learning Privacy-preserving Optics for Human Pose Estimation.
Event-based Video Reconstruction Using Transformer.
Multitask AET with Orthogonal Tangent Regularity for Dark Object Detection.
COMISR: Compression-Informed Video Super-Resolution.
Super Resolve Dynamic Scene from Continuous Spike Streams.
Unsupervised Non-Rigid Image Distortion Removal via Grid Deformation.
Photon-Starved Scene Inference using Single Photon Cameras.
HDR Video Reconstruction: A Coarse-to-fine Network and A Real-world Benchmark Dataset.
SeLFVi: Self-supervised Light-Field Video Reconstruction from Stereo Video.
Distillation-guided Image Inpainting.
Real-time Image Enhancer via Learnable Spatial-aware 3D Lookup Tables.
Deep Reparametrization of Multi-Frame Super-Resolution and Denoising.
Learning Dynamic Interpolation for Extremely Sparse Light Fields with Wide Baselines.
Virtual light transport matrices for non-line-of-sight imaging.
A Dark Flash Normal Camera.
A Light Stage on Every Desk.
Large Scale Multi-Illuminant (LSMI) Dataset for Developing White Balance Algorithm under Mixed Illumination.
NeuSpike-Net: High Speed Video Reconstruction via Bio-inspired Neuromorphic Cameras.
V-DESIRR: Very Fast Deep Embedded Single Image Reflection Removal.
Variable-Rate Deep Image Compression through Spatially-Adaptive Feature Transform.
Lucas-Kanade Reloaded: End-to-End Super-Resolution from Raw Image Bursts.
Fourier Space Losses for Efficient Perceptual Image Super-Resolution.
C2N: Practical Generative Noise Modeling for Real-World Denoising.
Inference of Black Hole Fluid-Dynamics from Sparse Interferometric Measurements.
What You Can Learn by Staring at a Blank Wall.
Anonymizing Egocentric Videos.
Spatially-Adaptive Image Restoration using Distortion-Guided Networks.
Hybrid Neural Fusion for Full-frame Video Stabilization.
Learning to Reduce Defocus Blur by Realistically Modeling Dual-Pixel Data.
Semantic-embedded Unsupervised Spectral Reconstruction from Single RGB Images in the Wild.
High Quality Disparity Remapping with Two-Stage Warping.
Dynamic CT Reconstruction from Limited Views with Implicit Neural Representations and Parametric Motion Fields.
Hyperspectral Image Denoising with Realistic Data.
How to Train Neural Networks for Flare Removal.
Defocus Map Estimation and Deblurring from a Single Dual-Pixel Image.
Adversarial Attack on Deep Cross-Modal Hamming Retrieval.
COOKIE: Contrastive Cross-Modal Knowledge Sharing Pre-training for Vision-Language Representation.
Auto-Parsing Network for Image Captioning and Visual Question Answering.
Partial Off-policy Learning: Balance Accuracy and Diversity for Human-Oriented Image Captioning.
Hierarchical Graph Attention Network for Few-shot Visual-Semantic Learning.
LocTex: Learning Data-Efficient Visual Representations from Localized Textual Supervision.
Patch Craft: Video Denoising by Deep Modeling and Patch Matching.
N-ImageNet: Towards Robust, Fine-Grained Object Recognition with Event Cameras.
Dual Transfer Learning for Event-based End-task Prediction via Pluggable Event to Image Translation.
Image Retrieval on Real-life Images with Pre-trained Vision-and-Language Models.
Language-Guided Global Image Editing via Cross-Modal Cyclic Mechanism.
Motion-Focused Contrastive Learning of Video Representations.
Viewpoint-Agnostic Change Captioning with Cycle Consistency.
StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery.
TRAR: Routing the Attention Spans in Transformer for Visual Question Answering.
On the hidden treasure of dialog in video question answering.
AESOP: Abstract Encoding of Stories, Objects, and Pictures.
Adversarial VQA: A New Benchmark for Evaluating the Robustness of VQA Models.
Pano-AVQA: Grounded Audio-Visual Question Answering on 360° Videos.
Explainable Video Entailment with Grounded Visual Evidence.
Let's See Clearly: Contaminant Artifact Removal for Moving Cameras.
Dual-Camera Super-Resolution with Aligned Attention Modules.
IICNet: A Generic Framework for Reversible Image Conversion.
Cross-Camera Convolutional Color Constancy.
Describing and Localizing Multiple Changes with Transformers.
IntraTomo: Self-supervised Learning-based Tomography via Sinogram Synthesis and Prediction.
T-Net: Effective Permutation-Equivariant Network for Two-View Correspondence Learning.
Spatial-Temporal Consistency Network for Low-Latency Trajectory Forecasting.
Localize to Binauralize: Audio Spatialization from Visual Sound Source Localization.
Mixed SIGNals: Sign Language Production via a Mixture of Motion Primitives.
Weakly Supervised Relative Spatial Reasoning for Visual Question Answering.
Unified Questioner Transformer for Descriptive Question Generation in Goal-Oriented Visual Dialogue.
Factorizing Perception and Policy for Interactive Instruction Following.
Interpretable Visual Reasoning via Induced Symbolic Space.
Adaptive Hierarchical Graph Reasoning with Semantic Coherence for Video-and-Language Inference.
SAT: 2D Semantics Assisted Training for 3D Visual Grounding.
Weakly Supervised Human-Object Interaction Detection in Video via Contrastive Spatiotemporal Regions.
Ask&Confirm: Active Detail Enriching for Cross-Modal Retrieval with Partial Query.
Learning to Generate Scene Graph from Natural Language Supervision.
Wasserstein Coupled Graph Learning for Cross-Modal Retrieval.
Detector-Free Weakly Supervised Grounding by Separation.
InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring.
MDETR - Modulated Detection for End-to-End Multi-Modal Understanding.
TransVG: End-to-End Visual Grounding with Transformers.
Unsupervised Deep Video Denoising.
Deep 3D Mask Volume for View Synthesis of Dynamic Scenes.
Video Instance Segmentation with a Propose-Reduce Paradigm.
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval.
Multiple Pairwise Ranking Networks for Personalized Video Summarization.
Video Question Answering Using Language-Guided Deep Compressed-Domain Video Feature.
HAIR: Hierarchical Visual-Semantic Relational Reasoning for Video Question Answering.
Just Ask: Learning to Answer Questions from Millions of Narrated Videos.
Env-QA: A Video Question Answering Benchmark for Comprehensive Understanding of Dynamic Environments.
VLGrammar: Grounded Grammar Induction of Vision and Language.
The Road to Know-Where: An Object-and-Room Informed Sequential BERT for Indoor Vision-Language Navigation.
Vision-Language Navigation with Random Environmental Mixup.
Airbert: In-domain Pretraining for Vision-and-Language Navigation.
LapsCore: Language-guided Person Search via Color Reasoning.
Linguistically Routing Capsule Network for Out-of-distribution Visual Question Answering.
Contrast and Classify: Training Robust VQA Models.
Self-Motivated Communication Agent for Real-World Vision-Dialog Navigation.
Greedy Gradient Ensemble for Robust Visual Question Answering.
Beyond Question-Based Biases: Assessing Multimodal Shortcut Learning in Visual Question Answering.
Learning Motion-Appearance Co-Attention for Zero-Shot Video Object Segmentation.
Dynamic Context-Sensitive Filtering Network for Video Salient Object Detection.
Motion Guided Region Message Passing for Video Captioning.
STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding.
Fast Video Moment Retrieval.
MGSampler: An Explainable Sampling Strategy for Video Action Recognition.
Vi2CLR: Video and Image for Visual Contrastive Learning of Representation.
Dense Interaction Learning for Video-based Person Re-identification.
Learning Temporal Dynamics from Cycles in Narrated Video.
Zero-shot Natural Language Video Localization.
Graph Constrained Data Representation Learning for Human Motion Segmentation.
CrossCLR: Cross-modal Contrastive Learning For Multi-modal Video Representations.
UniT: Multimodal Multitask Learning with a Unified Transformer.
Compressing Visual-linguistic Model via Knowledge Distillation.
Unshuffling Data for Improved Generalization in Visual Question Answering.
In Defense of Scene Graphs for Image Captioning.
Synthesis of Compositional Animations from Textual Descriptions.
YouRefIt: Embodied Reference Understanding with Language and Gesture.
Who's Waldo? Linking People Across Text and Images.
Panoptic Narrative Grounding.
LFI-CAM: Learning Feature Importance for Better Visual Explanation.
Finding Representative Interpretations on Convolutional Neural Networks.
Towards Better Explanations of Class Activation Mapping.
Towards Learning Spatially Discriminative Feature Representations.
Shape-Biased Domain Generalization via Shock Graph Embeddings.
Adaptive Boundary Proposal Network for Arbitrary Shape Text Detection.
TGRNet: A Table Graph Reconstruction Network for Table Structure Recognition.
Learning to Discover Reflection Symmetry via Polar Matching Convolution.
Embed Me If You Can: A Geometric Perceptron.
Hypergraph Neural Networks for Hypergraph Matching.
Broaden Your Views for Self-Supervised Video Learning.
e-ViL: A Dataset and Benchmark for Natural Language Explanations in Vision-Language Tasks.
Explanations for Occluded Images.
Explaining Local, Global, And Higher-Order Interactions In Deep Learning.
Better Aggregation in Test-Time Augmentation.
Visual Scene Graphs for Audio Source Separation.
How to Design a Three-Stage Architecture for Audio-Visual Active Speaker Detection in the Wild.
Audio-Visual Floorplan Reconstruction.
MeshTalk: 3D Face Animation from Speech using Cross-Modality Disentanglement.
IDARTS: Interactive Differentiable Architecture Search.
CODEs: Chamfer Out-of-Distribution Examples against Overconfidence Issue.
Transforms based Tensor Robust PCA: Corrupted Low-Rank Tensors Recovery via Convex Optimization.
Predicting with Confidence on Unseen Distributions.
Striking a Balance between Stability and Plasticity for Class-Incremental Learning.
Why Approximate Matrix Square Root Outperforms Accurate SVD in Global Covariance Pooling?
The Right to Talk: An Audio-Visual Transformer Approach.
Interpreting Attributions and Interactions of Adversarial Attacks.
Handwriting Transformers.
De-rendering Stylized Texts.
From Culture to Clothing: Discovering the World Events Behind A Century of Fashion Images.
Beyond Trivial Counterfactual Explanations with Diverse Valuable Explanations.
SCOUTER: Slot Attention-based Classifier for Explainable Image Recognition.
Learning Canonical 3D Object Representation for Fine-Grained Recognition.
Counterfactual Attention Learning for Fine-Grained Visual Categorization and Re-identification.
Effectively Leveraging Attributes for Visual Similarity.
LayoutTransformer: Layout Generation and Completion with Self-attention.
DocFormer: End-to-End Transformer for Document Understanding.
Text is Text, No Matter What: Unifying Text Recognition using Knowledge Distillation.
Detecting Persuasive Atypicality by Modeling Contextual Compatibility.
Spatial and Semantic Consistency Regularizations for Pedestrian Attribute Recognition.
SketchLattice: Latticed Representation for Sketch Manipulation.
Parsing Table Structures in the Wild.
Graph-based Asynchronous Event Processing for Rapid Object Recognition.
End-to-End Trainable Trident Person Search Network Using Adaptive Gradient Propagation.
Walk in the Cloud: Learning Curves for Point Clouds Shape Analysis.
Generating Attribution Maps with Disentangled Masked Backpropagation.
Interpretable Image Recognition by Constructing Transparent Embedding Space.
Attentional Pyramid Pooling of Salient Visual Residuals for Place Recognition.
Grafit: Learning fine-grained image representations with coarse labels.
FaPN: Feature-aligned Pyramid Network for Dense Image Prediction.
Multimodal Knowledge Expansion.
SS-IL: Separated Softmax for Incremental Learning.
Learning to Diversify for Single Domain Generalization.
MixMo: Mixing Multiple Inputs for Multiple Outputs via Deep Subnetworks.
OpenGAN: Open-Set Recognition via Open Data Generation.
Neural Video Portrait Relighting in Real-time via Consistency Modeling.
Global Pooling, More than Meets the Eye: Position Information is Encoded Channel-Wise in CNNs.
FcaNet: Frequency Channel Attention Networks.
TrivialAugment: Tuning-free Yet State-of-the-Art Data Augmentation.
Recursively Conditional Gaussian for Ordinal Unsupervised Domain Adaptation.
Contrastive Multimodal Fusion with TupleInfoNCE.
Statistically Consistent Saliency Estimation.
Influence-Balanced Loss for Imbalanced Visual Classification.
Learning Fast Sample Re-weighting Without Reward Data.
Parametric Contrastive Learning.
Ground-truth or DAER: Selective Re-query of Secondary Information.
Explaining in Style: Training a GAN to explain a classifier in StyleSpace.
Exploiting Explanations for Model Inversion Attacks.
Architecture Disentanglement for Deep Neural Networks.
Adversarial Attacks are Reversible with Natural Supervision.
Shallow Bayesian Meta Learning for Real-World Few-Shot Recognition.
Semantic Diversity Learning for Zero-Shot Multi-label Classification.
Self Supervision to Distillation for Long-Tailed Visual Recognition.
Stochastic Partial Swap: Enhanced Model Generalization and Interpretability for Fine-grained Recognition.
Joint Representation Learning and Novel Category Discovery on Single- and Multi-modal Data.
Visual Transformers: Where Do Transformers Really Belong in Vision Models?
Visformer: The Vision-friendly Transformer.
Incorporating Convolution Designs into Visual Transformers.
Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions.
Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet.
Point Cloud Augmentation with Weighted Local Transformations.
Continual Learning on Noisy Data Streams via Self-Purified Replay.
Aggregation with Feature Detection.
Learning Meta-class Memory for Few-Shot Semantic Segmentation.
Learning to Resize Images for Computer Vision Tasks.
Exploration and Estimation for Model Compression.
Group-wise Inhibition based Feature Regularization for Robust Classification.
MicroNet: Improving Image Recognition with Extremely Low FLOPs.
Amplitude-Phase Recombination: Rethinking Robustness of Convolutional Neural Networks in Frequency Domain.
An Asynchronous Kalman Filter for Hybrid Event Cameras.
Virtual Multi-Modality Self-Supervised Foreground Matting for Human-Object Interaction.
Towers of Babel: Combining Images, Language, and 3D Geometry for Learning Multimodal Vision.
MosaicOS: A Simple and Effective Use of Object-Centric Images for Long-Tailed Object Detection.
Learning Canonical View Representation for 3D Shape Recognition with Arbitrary Views.
Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers.
Vision Transformer with Progressive Sampling.
Scalable Vision Transformers with Hierarchical Pooling.
Conformer: Local Features Coupling Global Representations for Visual Recognition.
CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification.
Zen-NAS: A Zero-Shot NAS for High-Performance Image Recognition.
AutoSpace: Neural Architecture Search with Less Human Interference.
Differentiable Dynamic Wirings for Neural Networks.
BN-NAS: Neural Architecture Search with Batch Normalization.
Multi-modality Associative Bridging through Memory: Speech Sound Recollected from Face Video.
Image2Reverb: Cross-Modal Reverb Impulse Response Synthesis.
Move2Hear: Active Audio-Visual Source Separation.
MAAS: Multi-modal Assignation for Active Speaker Detection.
When Pigs Fly: Contextual Reasoning in Synthetic and Natural Scenes.
Neural Photofit: Gaze-based Mental Image Reconstruction.
Distilling Virtual Examples for Long-tailed Recognition.
Syncretic Modality Collaborative Learning for Visible Infrared Person Re-Identification.
Attack-Guided Perceptual Data Generation for Real-world Re-Identification.
Heterogeneous Relational Complement for Vehicle Re-identification.
Self-supervised Geometric Features Discovery via Interpretable Attention for Vehicle Re-Identification and Beyond.
Residual Attention: A Simple but Effective Method for Multi-Label Recognition.
Dance with Self-Attention: A New Look of Conditional Random Fields on Anomaly Detection in Videos.
Transformer-based Dual Relation Graph for Multi-label Image Recognition.
Spatio-Temporal Representation Factorization for Video-based Person Re-Identification.
Z-Score Normalization, Hubness, and Few-Shot Learning.
Online Refinement of Low-level Feature Based Activation Map for Weakly Supervised Object Localization.
FREE: Feature Refinement for Generalized Zero-Shot Learning.
ACE: Ally Complementary Experts for Solving Long-Tailed Recognition in One-Shot.
Conditional Variational Capsule Network for Open Set Recognition.
Procrustean Training for Imbalanced Deep Learning.
Asymmetric Loss For Multi-Label Classification.
Learning with Noisy Labels via Sparse Regularization.
NGC: A Unified Framework for Learning with Open-World Noisy Data.
CrossNorm and SelfNorm for Generalization under Distribution Shifts.
DTMNet: A Discrete Tchebichef Moments-based Deep Neural Network for Multi-focus Image Fusion.
Going deeper with Image Transformers.
CvT: Introducing Convolutions to Vision Transformers.
GLiT: Neural Architecture Search for Global and Local Image Transformer.
MVTN: Multi-View Transformation Network for 3D Shape Recognition.