| Date | Program | Europe Time (UTC+1) | Beijing Time (UTC+8) | Paper ID | Paper Title |
| 2021.3.7 | Conference Opening | 01:00-02:00 | 08:00-09:00 | | |
| | Keynote 1 by Jiebo Luo | 02:00-03:00 | 09:00-10:00 | | |
| | Best Paper Session (4 papers) | 03:00-05:00 | 10:00-12:00 | 12 | Distilling Knowledge in Causal Inference for Unbiased Visual Question Answering |
| | | | | 58 | Similar Scene Retrieval in Soccer Videos with Weak Annotations by Multimodal Use of Bidirectional LSTM |
| | | | | 73 | Interactive Re-ranking for Cross-modal Retrieval Based on Object-wise Question Answering |
| | | | | 92 | Real-Time Arbitrary Video Style Transfer |
| | Tutorial 1: Bias Issues and Solutions in Recommender System | 07:00-09:00 | 14:00-16:00 | | Bias Issues and Solutions in Recommender System |
| | Demo 1 | 09:00-10:00 | 16:00-17:00 | 142 | Synthesized 3D Model Suggestions with Smartphone Based MR to Modify the PreBuilt Environment: Interior Design |
| | Special Session Poster 1-Multimedia application (8 papers) | 10:00-10:40 | 17:00-17:40 | 18 | An Automated Method with Anchor-Free Detection and U-Shaped Segmentation for Nuclei Instance Segmentation |
| | | | | 22 | Improving face recognition in Surveillance video with judicious selection and fusion of representative frames |
| | | | | 25 | A Multimedia Solution to Motivate Childhood Cancer Patients to Keep Up with Cancer Treatment |
| | | | | 85 | Story Segmentation For News Broadcast Based On Primary Caption |
| | | | | 86 | Intermediate Coordinate based Pose Non-perspective Estimation from Line Correspondences |
| | | | | 132 | Structure-Preserving Extremely Low Light Image Enhancement with Fractional Order Differential Mask Guidance |
| | | | | 136 | Change Detection from Synthetic Aperture Radar Images Based on Deformable Residual Convolutional Neural Networks |
| | | | | 146 | Towards Annotation-Free Evaluation of Cross-Lingual Image Captioning |
| | Poster Session 1 (8 papers) | 11:00-11:40 | 18:00-18:40 | 4 | A Treatment Engine by Multimodal EMR Data |
| | | | | 11 | Storyboard Relational Model for Group Activity Recognition |
| | | | | 26 | Global and Local Feature Alignment for Video Object Detection |
| | | | | 28 | Semantic Feature Augmentation for Fine-grained Visual Categorization with Few-Sample Training |
| | | | | 33 | Destylization of text with decorative elements |
| | | | | 35 | Hierarchical Clustering via Mutual Learning for Unsupervised Person Re-identification |
| | | | | 48 | Robust Visual Tracking via Scale-Aware Localization and Peak Response Strength |
| | | | | 49 | Hungry Networks: 3D Mesh Reconstruction of a Dish and a Plate from a Single Dish Image for Estimating Food Volume |
| | Demo 1-Mirrored | 17:00-18:00 | 00:00-01:00 (+1 day) | 142 | Synthesized 3D Model Suggestions with Smartphone Based MR to Modify the PreBuilt Environment: Interior Design |
| | Special Session Poster 1-Mirrored-Multimedia application (8 papers) | 18:00-18:40 | 01:00-01:40 (+1 day) | 18 | An Automated Method with Anchor-Free Detection and U-Shaped Segmentation for Nuclei Instance Segmentation |
| | | | | 22 | Improving face recognition in Surveillance video with judicious selection and fusion of representative frames |
| | | | | 25 | A Multimedia Solution to Motivate Childhood Cancer Patients to Keep Up with Cancer Treatment |
| | | | | 85 | Story Segmentation For News Broadcast Based On Primary Caption |
| | | | | 86 | Intermediate Coordinate based Pose Non-perspective Estimation from Line Correspondences |
| | | | | 132 | Structure-Preserving Extremely Low Light Image Enhancement with Fractional Order Differential Mask Guidance |
| | | | | 136 | Change Detection from Synthetic Aperture Radar Images Based on Deformable Residual Convolutional Neural Networks |
| | | | | 146 | Towards Annotation-Free Evaluation of Cross-Lingual Image Captioning |
| | Poster Session 1-Mirrored (8 papers) | 19:00-19:40 | 02:00-02:40 (+1 day) | 4 | A Treatment Engine by Multimodal EMR Data |
| | | | | 11 | Storyboard Relational Model for Group Activity Recognition |
| | | | | 26 | Global and Local Feature Alignment for Video Object Detection |
| | | | | 28 | Semantic Feature Augmentation for Fine-grained Visual Categorization with Few-Sample Training |
| | | | | 33 | Destylization of text with decorative elements |
| | | | | 35 | Hierarchical Clustering via Mutual Learning for Unsupervised Person Re-identification |
| | | | | 48 | Robust Visual Tracking via Scale-Aware Localization and Peak Response Strength |
| | | | | 49 | Hungry Networks: 3D Mesh Reconstruction of a Dish and a Plate from a Single Dish Image for Estimating Food Volume |
| Date | Program | Europe Time (UTC+1) | Beijing Time (UTC+8) | Paper ID | Paper Title |
| 2021.3.8 | Keynote 2 by Kristen Grauman | 01:00-02:00 | 08:00-09:00 | | |
| | Tutorial 2: 10 Years of Video Browser Showdown | 02:00-04:00 | 09:00-11:00 | | 10 Years of Video Browser Showdown |
| | Demo 2 | 04:00-05:00 | 11:00-12:00 | 144 | SeekSuspect: Retrieving Suspects from Criminal Datasets using Visual Memory |
| | Oral Session 1 (4 papers) | 07:00-08:20 | 14:00-15:20 | 16 | Incremental Multi-view Object Detection from a Moving Camera |
| | | | | 24 | Low-quality Watermarked Face Inpainting with Discriminative Residual Learning |
| | | | | 31 | Unsupervised learning of co-occurrences for face images retrieval |
| | | | | 32 | EvoGAN: An Evolutionary GAN for Face Aging and Rejuvenation |
| | Oral Session 2 (4 papers) | 08:20-09:40 | 15:20-16:40 | 36 | Self-Supervised Adversarial Learning for Cross-Modal Retrieval |
| | | | | 37 | Multi-Level Expression Guided Attention Network for Referring Expression Comprehension |
| | | | | 45 | Learning Intra-inter Semantic Aggregation for Video Object Detection |
| | | | | 55 | A Multi-Scale Language Embedding Network for Proposal-Free Referring Expression Comprehension |
| | Special Session Poster 2-Multimedia system (8 papers) | 10:00-10:40 | 17:00-17:40 | 23 | Two-stage Structure Aware Image Inpainting Based on Generative Adversarial Network |
| | | | | 39 | Adaptive Feature Aggregation Network for Nuclei Segmentation |
| | | | | 44 | Classification of Multimedia SNS Posts about Tourist Sites Based on Their Focus toward Predicting Eco-Friendly Users |
| | | | | 80 | Table Detection and Cell Segmentation in Online Handwritten Documents with Graph Attention Networks |
| | | | | 97 | Determining Image Age with Rank-Consistent Ordinal Classification and Object-centered Ensemble |
| | | | | 99 | Cross-Modal Learning for Saliency Prediction in Mobile Environment |
| | | | | 125 | Integrating Aspect-aware Interactive Attention and Emotional Position-aware for Multi-aspect Sentiment Analysis |
| | | | | 130 | Pulse Localization Networks with Infrared Camera |
| | Poster Session 2 (8 papers) | 11:00-11:40 | 18:00-18:40 | 52 | A Novel System Architecture and an Automatic Monitoring Method for Remote Production |
| | | | | 60 | Patch Assembly for Real-time Instance Segmentation |
| | | | | 61 | Full-Resolution Encoder–Decoder Networks with Multi-Scale Feature Fusion for Human Pose Estimation |
| | | | | 64 | Graph-based Variational Auto-Encoder for Generalized Zero-Shot Learning |
| | | | | 66 | Fixed-size Video Summarization over Streaming Data via Non-monotone Submodular Maximization |
| | | | | 69 | Multi-focus noisy image fusion based on gradient regularized convolutional sparse representation |
| | | | | 71 | Fixation Guided Network for Salient Object Detection |
| | | | | 83 | RICAPS: Residual Inception and Cascaded Capsule Network for Broadcast Sports Video Classification |
| | Demo 2-Mirrored | 17:00-18:00 | 00:00-01:00 (+1 day) | 144 | SeekSuspect: Retrieving Suspects from Criminal Datasets using Visual Memory |
| | Oral Session 1-Mirrored (4 papers) | 18:00-19:20 | 01:00-02:20 (+1 day) | 16 | Incremental Multi-view Object Detection from a Moving Camera |
| | | | | 24 | Low-quality Watermarked Face Inpainting with Discriminative Residual Learning |
| | | | | 31 | Unsupervised learning of co-occurrences for face images retrieval |
| | | | | 32 | EvoGAN: An Evolutionary GAN for Face Aging and Rejuvenation |
| | Oral Session 2-Mirrored (4 papers) | 19:40-21:00 | 02:40-04:00 (+1 day) | 36 | Self-Supervised Adversarial Learning for Cross-Modal Retrieval |
| | | | | 37 | Multi-Level Expression Guided Attention Network for Referring Expression Comprehension |
| | | | | 45 | Learning Intra-inter Semantic Aggregation for Video Object Detection |
| | | | | 55 | A Multi-Scale Language Embedding Network for Proposal-Free Referring Expression Comprehension |
| | Special Session Poster 2-Mirrored-Multimedia system (8 papers) | 21:00-21:40 | 04:00-04:40 (+1 day) | 23 | Two-stage Structure Aware Image Inpainting Based on Generative Adversarial Network |
| | | | | 39 | Adaptive Feature Aggregation Network for Nuclei Segmentation |
| | | | | 44 | Classification of Multimedia SNS Posts about Tourist Sites Based on Their Focus toward Predicting Eco-Friendly Users |
| | | | | 80 | Table Detection and Cell Segmentation in Online Handwritten Documents with Graph Attention Networks |
| | | | | 97 | Determining Image Age with Rank-Consistent Ordinal Classification and Object-centered Ensemble |
| | | | | 99 | Cross-Modal Learning for Saliency Prediction in Mobile Environment |
| | | | | 125 | Integrating Aspect-aware Interactive Attention and Emotional Position-aware for Multi-aspect Sentiment Analysis |
| | | | | 130 | Pulse Localization Networks with Infrared Camera |
| | Poster Session 2-Mirrored (8 papers) | 22:00-22:40 | 05:00-05:40 (+1 day) | 52 | A Novel System Architecture and an Automatic Monitoring Method for Remote Production |
| | | | | 60 | Patch Assembly for Real-time Instance Segmentation |
| | | | | 61 | Full-Resolution Encoder–Decoder Networks with Multi-Scale Feature Fusion for Human Pose Estimation |
| | | | | 64 | Graph-based Variational Auto-Encoder for Generalized Zero-Shot Learning |
| | | | | 66 | Fixed-size Video Summarization over Streaming Data via Non-monotone Submodular Maximization |
| | | | | 69 | Multi-focus noisy image fusion based on gradient regularized convolutional sparse representation |
| | | | | 71 | Fixation Guided Network for Salient Object Detection |
| | | | | 83 | RICAPS: Residual Inception and Cascaded Capsule Network for Broadcast Sports Video Classification |
| Date | Program | Europe Time (UTC+1) | Beijing Time (UTC+8) | Paper ID | Paper Title |
| 2021.3.9 | Keynote 3 by Bernt Schiele | 01:00-02:00 | 08:00-09:00 | | |
| | Demo 3 | 02:00-03:00 | 09:00-10:00 | 145 | A Large-Scale Image Retrieval System for Everyday Scenes |
| | Oral Session 3 (4 papers) | 03:00-04:20 | 10:00-11:20 | 68 | Overlap Classification Mechanism for Skeletal Bone Age Assessment |
| | | | | 72 | Motion-Transformer: Self-supervised Pre-training for Skeleton-based Action Recognition |
| | | | | 75 | A Background-induced Generative Network with Multi-level Discriminator for Text-to-Image Generation |
| | | | | 76 | WFN-PSC: Weighted-Fusion Network with Poly-Scale Convolution for image dehazing |
| | Oral Session 4 (5 papers) | 07:00-08:40 | 14:00-15:40 | 88 | An Autoregressive Generation Model for Producing Instant Basketball Defensive Trajectory |
| | | | | 104 | Objective Object Segmentation Visual Quality Evaluation based on Pixel-Level and Region-Level Characteristics |
| | | | | 115 | Fixations Based Personal Target Objects Segmentation |
| | | | | 120 | Relationship Graph Learning Network For Visual Relationship Detection |
| | | | | 127 | Graph-Based Motion Prediction for Abnormal Action Detection |
| | Special Session Poster 3-Multimedia analysis and understanding (8 papers) | 09:00-09:40 | 16:00-16:40 | 51 | Scene Graph Generation via Multi-Relation Classification and Cross-modal Attention Coordinator |
| | | | | 54 | Graph Convolution Network with Node Feature Optimization Using Cross Attention for Few-shot Learning |
| | | | | 65 | A Multi-scale Human Action Recognition Method Based on Laplacian Pyramid Depth Motion Images |
| | | | | 77 | Video Scene Detection Based on Link Prediction Using Graph Convolution Network |
| | | | | 78 | Cross-Cultural Design of Facial Expressions for Humanoids-Is There Cultural Difference Between Japan and Denmark? |
| | | | | 119 | Improving auto-encoder novelty detection using channel attention and entropy minimization |
| | | | | 122 | Local Structure Alignment Guided Domain Adaptation with Few Source Samples |
| | | | | 138 | Efficient Inter-image Relation Graph Neural Network Hashing for Scalable Image Retrieval |
| | Poster Session 3 (8 papers) | 10:00-10:40 | 17:00-17:40 | 84 | Transfer Non-stationary Texture with Complex Appearance |
| | | | | 93 | C3VQG: Category Consistent Cyclic Visual Question Generation |
| | | | | 106 | Text-based Visual Question Answering with Knowledge Base |
| | | | | 109 | Attention-Constraint Facial Expression Recognition |
| | | | | 110 | Defense for adversarial videos by Self-adaptive JPEG Compression and Optical Texture |
| | | | | 111 | Fusing CAMs-Weighted Features and Temporal Information for Robust Loop Closure Detection |
| | | | | 123 | Multiplicative Angular Margin Loss for Text-Based Person Search |
| | | | | 129 | Attended Feature Matching for Weakly-supervised Video Relocalization |
| | Demo 3-Mirrored | 17:00-18:00 | 00:00-01:00 (+1 day) | 145 | A Large-Scale Image Retrieval System for Everyday Scenes |
| | Oral Session 3-Mirrored (4 papers) | 18:00-19:20 | 01:00-02:20 (+1 day) | 68 | Overlap Classification Mechanism for Skeletal Bone Age Assessment |
| | | | | 72 | Motion-Transformer: Self-supervised Pre-training for Skeleton-based Action Recognition |
| | | | | 75 | A Background-induced Generative Network with Multi-level Discriminator for Text-to-Image Generation |
| | | | | 76 | WFN-PSC: Weighted-Fusion Network with Poly-Scale Convolution for image dehazing |
| | Oral Session 4-Mirrored (5 papers) | 19:20-21:00 | 02:20-04:00 (+1 day) | 88 | An Autoregressive Generation Model for Producing Instant Basketball Defensive Trajectory |
| | | | | 104 | Objective Object Segmentation Visual Quality Evaluation based on Pixel-Level and Region-Level Characteristics |
| | | | | 115 | Fixations Based Personal Target Objects Segmentation |
| | | | | 120 | Relationship Graph Learning Network For Visual Relationship Detection |
| | | | | 127 | Graph-Based Motion Prediction for Abnormal Action Detection |
| | Special Session Poster 3-Mirrored-Multimedia analysis and understanding (8 papers) | 21:00-21:40 | 04:00-04:40 (+1 day) | 51 | Scene Graph Generation via Multi-Relation Classification and Cross-modal Attention Coordinator |
| | | | | 54 | Graph Convolution Network with Node Feature Optimization Using Cross Attention for Few-shot Learning |
| | | | | 65 | A Multi-scale Human Action Recognition Method Based on Laplacian Pyramid Depth Motion Images |
| | | | | 77 | Video Scene Detection Based on Link Prediction Using Graph Convolution Network |
| | | | | 78 | Cross-Cultural Design of Facial Expressions for Humanoids-Is There Cultural Difference Between Japan and Denmark? |
| | | | | 119 | Improving auto-encoder novelty detection using channel attention and entropy minimization |
| | | | | 122 | Local Structure Alignment Guided Domain Adaptation with Few Source Samples |
| | | | | 138 | Efficient Inter-image Relation Graph Neural Network Hashing for Scalable Image Retrieval |
| | Poster Session 3-Mirrored (8 papers) | 22:00-22:40 | 05:00-05:40 (+1 day) | 84 | Transfer Non-stationary Texture with Complex Appearance |
| | | | | 93 | C3VQG: Category Consistent Cyclic Visual Question Generation |
| | | | | 106 | Text-based Visual Question Answering with Knowledge Base |
| | | | | 109 | Attention-Constraint Facial Expression Recognition |
| | | | | 110 | Defense for adversarial videos by Self-adaptive JPEG Compression and Optical Texture |
| | | | | 111 | Fusing CAMs-Weighted Features and Temporal Information for Robust Loop Closure Detection |
| | | | | 123 | Multiplicative Angular Margin Loss for Text-Based Person Search |
| | | | | 129 | Attended Feature Matching for Weakly-supervised Video Relocalization |