With the rapid growth of video surveillance applications and services, the volume of surveillance video has become extremely large, which makes human monitoring tedious and difficult. Therefore, there is a huge demand for smart surveillance techniques that can perform monitoring in an automatic or semi-automatic way. A number of challenges have arisen in the area of big surveillance data analysis and processing. Firstly, with the huge amount of surveillance video in storage, video analysis tasks such as event detection, action recognition, and video summarization are of increasing importance in applications including events-of-interest retrieval and abnormality detection. Secondly, semantic data (e.g. objects' trajectories and bounding boxes) has become an essential data type in surveillance systems, owing largely to the growth in its size and complexity, which introduces challenging new topics, such as efficient semantic data processing and compression, to the community. Thirdly, with the rapid shift from static, centralized processing to dynamic computing among distributed video processing nodes/cameras, new challenges such as multi-camera analysis, person re-identification, and distributed video processing are emerging. To meet these challenges, there is a great need to extend existing approaches or explore new feasible techniques.

This is the 3rd edition of our workshop. The first two were organized in conjunction with ICME 2019 (Shanghai, China) and ICME 2020 (London, UK).

Scope & Topics

This workshop is intended to provide a forum for researchers and engineers to present their latest innovations and share their experiences on all aspects of the design and implementation of new surveillance video analysis and processing techniques. Topics of interest include, but are not limited to:

  • Action/activity recognition, and event detection in surveillance videos
  • Multi-camera surveillance networks and applications
  • Surveillance scene parsing, segmentation, and analysis
  • Crowd parsing, estimation and analysis
  • Person, group, or object re-identification
  • Summarization and synopsis of surveillance videos
  • Big Data processing in large-scale surveillance systems
  • Distributed, edge and fog computing for surveillance systems
  • Low-resolution video analysis and processing: Recognition and object detection, restoration, denoising, enhancement, super-resolution
  • Scalable surveillance video analysis with fast model inference and low memory footprint
  • Surveillance from multiple modalities, including but not limited to UAVs, satellite imagery, dash cams, and wearables

Call for Papers

Important Dates
    Paper Submission Due Date: March 20, 2021 (extended from March 13, 2021)
    Notification of Acceptance/Rejection: April 6, 2021 (moved from March 27, 2021)
    Camera-Ready Due Date: April 13, 2021
    Workshop Date and Venue: 9 July 2021, 2.00pm - 6.00pm (GMT+08, BJT), Virtual
Format Requirements & Templates
    Length: Papers must be no longer than 6 pages, including all text, figures, and references.
    Format: Workshop papers have the same format as regular papers. See the templates below. Submitted papers do not need to be double-blind.
    Important: A complete paper should be submitted using the above templates.
Submission Details
    Paper Submission Site:
    (Please make sure your paper is submitted to the correct track)
    Submissions may be accompanied by up to 20 MB of supplemental material following the same guidelines as regular and special session papers.
    Review: Reviews will be handled directly by the Organizers and the Technical Program Committee (TPC).
    Presentation guarantee: As with accepted Regular and Special Session papers, accepted Workshop papers must be registered by the author deadline and presented at the conference; otherwise they will not be included in IEEE Xplore. A workshop paper is covered by a full-conference registration only.
    Conference Location: Virtual


Time Talk/Presentation
14.00-14.10 Opening Remarks
14.10-15.00 Invited Keynote: Toward Human-Level General Video Understanding
Yu Qiao (SIAT, CAS)
(12 mins per talk)
Track 1: Large-scale Surveillance Tasks

Hierarchical Attention Image-Text Alignment Network for Person Re-Identification
Kajal Kansal (IIITD)*; A Subramanyam (IIITD); Zheng Wang (National Institute of Informatics); Shin'ichi Satoh (National Institute of Informatics)

Cluster-based Distribution Alignment for Generalizable Person Re-identification
Chengzhang Zhu (Central South University); Zhe Chang (Central South University); Yalong Xiao (Central South University); Beiji Zou (Central South University); Bozhou Li (Central South University); Shu Liu (Central South University)*

Deep4Air: A Novel Deep Learning Framework for Airport Airside Surveillance
Phat Van Thai (Nanyang Technological University)*; Sameer Alam (Nanyang Technological University); Nimrod Lilith (Nanyang Technological University); Phu Tran (Nanyang Technological University); Thanh Binh Nguyen (University of Science)

Dense Point Prediction: A Simple Baseline for Crowd Counting and Localization
Yi Wang (Nanyang Technological University); Xinyu Hou (Nanyang Technological University); Lap-Pui Chau (Nanyang Technological University)*

(12 mins per talk)
Track 2: Detection, Tracking & Recognition for Surveillance

A Dataset and Benchmark of Underwater Object Detection for Robot Picking
Chongwei Liu (Dalian University of Technology); Haojie Li (Dalian University of Technology); Shuchang Wang (Dalian University of Technology); Ming Zhu (Dalian University of Technology); Dong Wang (Dalian University of Technology); Xin Fan (Dalian University of Technology); Zhihui Wang (Dalian University of Technology)*

Oriented Object Detection for Remote Sensing Images Based on Weakly Supervised Learning
Yongqing Sun (NTT, Japan); Ran Jie (Chongqing University of Posts and Telecommunications); Feng Yang (Chongqing Key Laboratory of Signal and Information Processing, Chongqing University of Posts and Telecommunications)*; Chenqiang Gao (Chongqing University of Posts and Telecommunications); Takayuki Kurozumi (NTT Media Intelligence Laboratories); Hideaki Kimata (NTT); Ziqi Ye (Chongqing University of Posts and Telecommunications)

Multi-Object Tracking with Tracked Object Bounding Box Association
Nanyang Yang (Nanyang Technological University); Yi Wang (Nanyang Technological University); Lap-Pui Chau (Nanyang Technological University)*

Generate and Adjust: a Novel Framework for Semi-supervised Pedestrian Attribute Recognition
Xuebo Shan (Peking University Shenzhen Graduate School)*; Peixi Peng (Peking University); Yunpeng Zhai (Peking University Shenzhen Graduate School); Chong Zhang (Peking University Shenzhen Graduate School); Tiejun Huang (Peking University); Yonghong Tian (Peking University)

16.36-16.48 Short Break
(12 mins per talk)
Track 3: Complementary Topics to Surveillance

Correcting Perspective Distortion in Incremental Video Stitching
Yinqi Chen (Jihua Lab); Huicheng Zheng (Sun Yat-sen University)*; Junyu Lin (Sun Yat-sen University)

Topic-guided Local-global Graph Neural Network for Image Captioning
Jichao Kan (University of Sydney)*; Kun Hu (The University of Sydney); Zhiyong Wang (The University of Sydney); Qiuxia Wu (South China University of Technology, China); Markus Hagenbuchner (The University of Wollongong, Australia); Ah Chung Tsoi (University of Wollongong)

Adaptive Multi-Scale Semantic Fusion Network for Zero-Shot Learning
Jing Song (Peking University Shenzhen Graduate School)*; Peixi Peng (Peking University); Yunpeng Zhai (Peking University Shenzhen Graduate School); Chong Zhang (Peking University Shenzhen Graduate School); Yonghong Tian (Peking University)

Global Feature Fusion Attention Network for Single Image Dehazing
Jie Luo (Northwest University); Qirong Bu (Northwest University)*; Lei Zhang (Northwest University); Jun Feng (Northwest University)

17.36-17.45 Closing Remarks

Invited Keynote Speaker

Toward Human-Level General Video Understanding

Abstract: Video understanding is an important yet challenging problem in computer vision. Compared with images, videos include multiple frames with complex motions and dynamic structures. Recent years have witnessed significant progress in video classification, driven by deep learning models and larger video datasets. However, there is a clear gap between human-level understanding and SOTA algorithms. This talk will summarize recent progress in video understanding from the perspective of datasets, tasks, and models. We will also discuss future trends toward human-level General Video Understanding (GVU), including large video datasets with fine-grained tasks, more effective and efficient deep networks, and generalization to long-tailed distributions.

Biodata: Yu Qiao is a Professor with the Shenzhen Institutes of Advanced Technology (SIAT), Chinese Academy of Sciences, and Shanghai AI Laboratory. His research interests include computer vision, deep learning, and bioinformation. He has published more than 180 papers in international journals and conferences, including T-PAMI, IJCV, T-IP, T-SP, CVPR, and ICCV. His H-index is 62, with 25,000+ citations on Google Scholar. He is a recipient of the distinguished paper award at AAAI 2021. He received the first prize of the Guangdong Technological Invention Award and the Jiaxi Lv Young Researcher Award from the Chinese Academy of Sciences. His group was first runner-up in scene recognition at the ImageNet Large Scale Visual Recognition Challenge 2015 and the winner in video classification at the ActivityNet Large Scale Activity Recognition Challenge 2016.


Weiyao Lin
 wylin AT
John See
 johnsee AT
Xiatian Zhu (Eddy)
 eddy.zhuxt AT


Please feel free to send any questions or comments to:
johnsee AT, wylin AT, eddy.zhuxt AT