Emily Zhixuan Zeng
I am Emily Zhixuan Zeng, a PhD student focusing on Machine Learning and Computer Vision at the University of Waterloo.
Work experience
- May-August 2021: NVIDIA - Autonomous Vehicles
Incorporated synthetic data into lane detection model training to target challenging scenarios
- May-August 2020: NVIDIA - Autonomous Vehicles
Time-series light signal detection for autonomous vehicles
- Jan-April 2019: Synapse Technology
Developed and analyzed CNN models for detecting threats in X-ray scans
Projects
- Explaining Diffusion
[Current, in progress] Understanding relationships between concepts learned by image generation models
2024
- Decoding Diffusion: A Scalable Framework for Unsupervised Analysis of Latent Space Biases and Representations Using Natural Language Prompts. E Zhixuan Zeng, Yuhao Chen, and Alexander Wong. In NeurIPS Safe Generative AI Workshop 2024, Dec 2024.
Recent advances in image generation have made diffusion models powerful tools for creating high-quality images. However, their iterative denoising process makes understanding and interpreting their semantic latent spaces more challenging than other generative models, such as GANs. Recent methods have attempted to address this issue by identifying semantically meaningful directions within the latent space. However, they often need manual interpretation or are limited in the number of vectors that can be trained, restricting their scope and utility. This paper proposes a novel framework for unsupervised exploration of diffusion latent spaces. We directly leverage natural language prompts and image captions to map latent directions. This method allows for the automatic understanding of hidden features and supports a broader range of analysis without the need to train specific vectors. Our method provides a more scalable and interpretable understanding of the semantic knowledge encoded within diffusion models, facilitating comprehensive analysis of latent biases and the nuanced representations these models learn. Experimental results show that our framework can uncover hidden patterns and associations in various domains, offering new insights into the interpretability of diffusion model latent spaces.
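To illustrate the general idea of labeling latent directions with natural language (a minimal sketch, not the paper's implementation), the snippet below ranks hypothetical caption prompts against a latent direction by cosine similarity, assuming both have already been embedded into a shared text-image space such as CLIP's:

```python
import numpy as np

def label_direction(direction, prompt_embeddings, prompts):
    """Rank candidate prompts by cosine similarity to a latent direction.
    All vectors are assumed to live in one shared text-image embedding space."""
    d = direction / np.linalg.norm(direction)
    p = prompt_embeddings / np.linalg.norm(prompt_embeddings, axis=1, keepdims=True)
    scores = p @ d
    return sorted(zip(prompts, scores.tolist()), key=lambda t: -t[1])

# Hypothetical inputs: the direction could be the shift in image embeddings
# observed after traversing the latent space; prompts are candidate captions.
rng = np.random.default_rng(0)
direction = rng.normal(size=512)
prompt_embeddings = rng.normal(size=(3, 512))
prompts = ["an old person", "a young person", "a dog"]
for text, score in label_direction(direction, prompt_embeddings, prompts):
    print(f"{score:+.3f}  {text}")
```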
- Understanding the Limitations of Diffusion Concept Algebra Through Food. E Zhixuan Zeng, Yuhao Chen, and Alexander Wong. arXiv preprint arXiv:2406.03582, Dec 2024.
Image generation techniques, particularly latent diffusion models, have exploded in popularity in recent years. Many techniques have been developed to manipulate and clarify the semantic concepts these large-scale models learn, offering crucial insights into biases and concept relationships. However, these techniques are often only validated in conventional realms of human or animal faces and artistic style transitions. The food domain offers unique challenges through complex compositions and regional biases, which can shed light on the limitations and opportunities within existing methods. Through the lens of food imagery, we analyze both qualitative and quantitative patterns within a concept traversal technique. We reveal measurable insights into the model’s ability to capture and represent the nuances of culinary diversity, while also identifying areas where the model’s biases and limitations emerge.
- Second-order XAI
Explaining Explainability: Towards Deeper Actionable Insights into Deep Learning through Second-order Explainability
2023
- Explaining Explainability: Towards Deeper Actionable Insights into Deep Learning through Second-order Explainability. E Zhixuan Zeng*, Hayden Gunraj*, Sheldon Fernandez, and 1 more author. In XAI4CV Workshop, Dec 2023.
Explainability plays a crucial role in providing a more comprehensive understanding of deep learning models’ behaviour. This allows for thorough validation of the model’s performance, ensuring that its decisions are based on relevant visual indicators and not biased toward irrelevant patterns existing in the training data. However, existing methods provide only instance-level explainability, which requires manual analysis of each sample. Such manual review is time-consuming and prone to human biases. To address this issue, the concept of second-order explainable AI (SOXAI) was recently proposed to extend explainable AI (XAI) from the instance level to the dataset level. SOXAI automates the analysis of the connections between quantitative explanations and dataset biases by identifying prevalent concepts. In this work, we explore the use of this higher-level interpretation of a deep neural network’s behaviour to allow us to "explain the explainability" for actionable insights. Specifically, we demonstrate for the first time, via example classification and segmentation cases, that eliminating irrelevant concepts from the training set based on actionable insights from SOXAI can enhance a model’s performance.
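As a rough illustration of moving from instance-level explanations to dataset-level concepts (a sketch under assumed inputs, not the SOXAI implementation), the snippet below clusters hypothetical per-sample explanation embeddings and ranks the resulting "concept" clusters by prevalence:

```python
import numpy as np
from sklearn.cluster import KMeans

def prevalent_concepts(explanation_embeddings, n_concepts=5):
    """Cluster instance-level explanation embeddings (e.g. features pooled over
    high-attribution regions) and rank the clusters by dataset-level prevalence."""
    km = KMeans(n_clusters=n_concepts, n_init=10, random_state=0)
    labels = km.fit_predict(explanation_embeddings)
    counts = np.bincount(labels, minlength=n_concepts)
    order = np.argsort(-counts)
    return order, counts

# Hypothetical stand-in for per-sample explanation features.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 64))
order, counts = prevalent_concepts(embeddings)
print("concept clusters by prevalence:", order, "sizes:", counts[order])
```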
- MetaGraspNet
Robotic grasping dataset and object pose estimation via superquadrics
2023
- ShapeShift: Superquadric-based Object Pose Estimation for Robotic Grasping. E Zhixuan Zeng, Yuhao Chen, and Alexander Wong. In WICV Workshop, Dec 2023.
Object pose estimation is a critical task in robotics for precise object manipulation. However, current techniques heavily rely on a reference 3D object, limiting their generalizability and making it expensive to expand to new object categories. Direct pose predictions also provide limited information for robotic grasping without referencing the 3D model. Keypoint-based methods offer intrinsic descriptiveness without relying on an exact 3D model, but they may lack consistency and accuracy. To address these challenges, this paper proposes ShapeShift, a superquadric-based framework for object pose estimation that predicts the object’s pose relative to a primitive shape which is fitted to the object. The proposed framework offers intrinsic descriptiveness and the ability to generalize to arbitrary geometric shapes beyond the training set.
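For context, the superquadric primitive that such an approach fits is defined by a simple implicit inside-outside function; the sketch below evaluates the standard superquadric formulation with illustrative parameter names (it is not the ShapeShift code):

```python
import numpy as np

def superquadric_inside_outside(points, scale, eps1, eps2):
    """Standard superquadric implicit function: F < 1 inside the primitive,
    F = 1 on its surface, F > 1 outside. Pose is assumed already factored out."""
    x, y, z = (np.abs(points) / np.asarray(scale)).T
    return ((x ** (2.0 / eps2) + y ** (2.0 / eps2)) ** (eps2 / eps1)
            + z ** (2.0 / eps1))

# eps1 = eps2 = 1 gives an ellipsoid; values near 0.1 approach a box.
pts = np.array([[0.0, 0.0, 0.0], [0.1, 0.1, 0.1], [0.4, 0.0, 0.0]])
print(superquadric_inside_outside(pts, scale=(0.2, 0.2, 0.2), eps1=1.0, eps2=1.0))
```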
- MMRNet: Improving Reliability for Multimodal Object Detection and Segmentation for Bin Picking via Multimodal Redundancy. Yuhao Chen, Hayden Gunraj, E Zhixuan Zeng, and 3 more authors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Dec 2023.
Recently, there has been tremendous interest in Industry 4.0 infrastructure to address labor shortages in global supply chains. Deploying artificial intelligence-enabled robotic bin picking systems in the real world has become particularly important for reducing the stress and physical demands on workers while increasing the speed and efficiency of warehouses. To this end, artificial intelligence-enabled robotic bin picking systems may be used to automate order picking, but with the risk of causing expensive damage during an abnormal event such as sensor failure. As such, reliability becomes a critical factor for translating artificial intelligence research to real-world applications and products. In this paper, we propose a reliable object detection and segmentation system with MultiModal Redundancy (MMRNet) for tackling object detection and segmentation for robotic bin picking using data from different modalities. This is the first system that introduces the concept of multimodal redundancy to address sensor failure issues during deployment. In particular, we realize the multimodal redundancy framework with a gate fusion module and dynamic ensemble learning. Finally, we present a new label-free multi-modal consistency (MC) score that utilizes the output from all modalities to measure the overall system output reliability and uncertainty. Through experiments, we demonstrate that in the event of a missing modality, our system provides much more reliable performance than baseline models. We also demonstrate that our MC score is a more reliable indicator of output quality at inference time than the model-generated confidence scores, which are often over-confident.
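As a hedged sketch of what a label-free multimodal consistency measure can look like (not the exact MC score from the paper), the snippet below scores agreement between binary masks predicted independently from each modality via mean pairwise IoU:

```python
import numpy as np

def multimodal_consistency(masks):
    """Label-free consistency proxy: mean pairwise IoU between binary masks
    predicted independently from each modality (e.g. RGB, depth)."""
    masks = [np.asarray(m, dtype=bool) for m in masks]
    ious = []
    for i in range(len(masks)):
        for j in range(i + 1, len(masks)):
            inter = np.logical_and(masks[i], masks[j]).sum()
            union = np.logical_or(masks[i], masks[j]).sum()
            ious.append(inter / union if union else 1.0)
    return float(np.mean(ious))

# Two hypothetical modality outputs that mostly agree -> high consistency.
rgb_mask = np.zeros((64, 64), bool)
rgb_mask[10:40, 10:40] = True
depth_mask = np.zeros((64, 64), bool)
depth_mask[12:40, 10:38] = True
print(multimodal_consistency([rgb_mask, depth_mask]))
```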
2022
- MetaGraspNet: A Large-Scale Benchmark Dataset for Vision-driven Robotic Grasping via Physics-based Metaverse Synthesis. Yuhao Chen, Maximilian Gilles, E Zhixuan Zeng, and 1 more author. In 2022 IEEE CASE, Dec 2022.
Finalist
Autonomous bin picking poses significant challenges to vision-driven robotic systems given the complexity of the problem, ranging from various sensor modalities, to highly entangled object layouts, to diverse item properties and gripper types. Existing methods often address the problem from one perspective. Diverse items and complex bin scenes require diverse picking strategies together with advanced reasoning. As such, building robust and effective machine-learning algorithms for this complex task requires significant amounts of comprehensive, high-quality data. Collecting such data in the real world would be too expensive and time-prohibitive, and therefore intractable from a scalability perspective. To tackle this big, diverse data problem, we take inspiration from the recent rise in the concept of metaverses, and introduce MetaGraspNet, a large-scale photo-realistic bin picking dataset constructed via physics-based metaverse synthesis. The proposed dataset contains 217k RGBD images across 82 different article types, with full annotations for object detection, amodal perception, keypoint detection, manipulation order and ambidextrous grasp labels for a parallel-jaw and vacuum gripper. We also provide a real dataset consisting of over 2.3k fully annotated high-quality RGBD images, divided into 5 difficulty levels and an unseen object set to evaluate different object and layout properties. Finally, we conduct extensive experiments showing that our proposed vacuum seal model and synthetic dataset achieve state-of-the-art performance and generalize to real-world use cases.
- Investigating Use of Keypoints for Object Pose Recognition. E Zhixuan Zeng, Yuhao Chen, and Alexander Wong. In Journal of Computational Vision and Imaging Systems, Dec 2022.
Object pose detection is highly useful for a variety of object manipulation tasks such as robotic grasping and tool handling. Perspective-n-Point matching between keypoints on the objects offers a way to perform pose estimation where the keypoints also provide inherent object information, such as corner locations and object part sections, without the need to reference a separate 3D model. Existing works focus on scenes with little occlusion and limited object categories. In this study, we demonstrate the feasibility of a pose estimation network based on detecting semantically important keypoints on the MetaGraspNet dataset, which contains heavy occlusion and greater scene complexity. We further discuss various challenges in using semantically important keypoints as a way to perform object pose estimation. These challenges include maintaining consistent keypoint definitions, as well as dealing with heavy occlusion and similar visual features.
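For reference, the Perspective-n-Point step that keypoint-based pose estimation relies on can be run with OpenCV once 2D detections are matched to 3D keypoints; the example below uses made-up keypoints and a simulated camera, not MetaGraspNet data:

```python
import numpy as np
import cv2

# Hypothetical semantic keypoints in the object frame (metres).
object_pts = np.array([
    [0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.0, 0.1, 0.0],
    [0.0, 0.0, 0.1], [0.1, 0.1, 0.0], [0.1, 0.0, 0.1],
], dtype=np.float64)

K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])
dist = np.zeros(5)

# Simulate 2D detections by projecting through a known ground-truth pose.
rvec_gt = np.array([0.2, -0.1, 0.05])
tvec_gt = np.array([0.02, -0.03, 0.6])
image_pts, _ = cv2.projectPoints(object_pts, rvec_gt, tvec_gt, K, dist)

# Recover the 6-DoF pose from the 2D-3D keypoint correspondences.
ok, rvec, tvec = cv2.solvePnP(object_pts, image_pts, K, dist)
print(ok, rvec.ravel(), tvec.ravel())
```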
- COVID-Net US-X
Enhanced Deep Neural Network for Detection of COVID-19 Patient Cases from Convex Ultrasound Imaging Through Extended Linear-Convex Ultrasound Augmentation Learning
2024
- COVID-Net L2C-ULTRA: An Explainable Linear-Convex Ultrasound Augmentation Learning Framework to Improve COVID-19 Assessment and Monitoring. E Zhixuan Zeng, Ashkan Ebadi, Adrian Florea, and 1 more author. Sensors, Dec 2024.
While no longer a public health emergency of international concern, COVID-19 remains an established and ongoing global health threat. As the global population continues to face significant negative impacts of the pandemic, there has been an increased usage of point-of-care ultrasound (POCUS) imaging as a low-cost, portable, and effective modality of choice in the COVID-19 clinical workflow. A major barrier to the widespread adoption of POCUS in the COVID-19 clinical workflow is the scarcity of expert clinicians who can interpret POCUS examinations, leading to considerable interest in artificial intelligence-driven clinical decision support systems to tackle this challenge. A major challenge to building deep neural networks for COVID-19 screening using POCUS is the heterogeneity in the types of probes used to capture ultrasound images (e.g., convex vs. linear probes), which can lead to very different visual appearances. In this study, we propose an analytic framework for COVID-19 assessment able to consume ultrasound images captured by linear and convex probes. We analyze the impact of leveraging extended linear-convex ultrasound augmentation learning on producing enhanced deep neural networks for COVID-19 assessment, where we conduct data augmentation on convex probe data alongside linear probe data that have been transformed to better resemble convex probe data. The proposed explainable framework, called COVID-Net L2C-ULTRA, employs an efficient deep columnar anti-aliased convolutional neural network designed via a machine-driven design exploration strategy. Our experimental results confirm that the proposed extended linear-convex ultrasound augmentation learning significantly increases performance, with a gain of 3.9% in test accuracy, 3.2% in AUC, 10.9% in recall, and 4.4% in precision. The proposed method also demonstrates a much more effective utilization of linear probe images through a 5.1% performance improvement in recall when such images are added to the training dataset, while all other methods show a decrease in recall when trained on the combined linear-convex dataset. We further verify the validity of the model by assessing what the network considers to be the critical regions of an image with our contributing clinician.
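To give a feel for what transforming linear-probe images to resemble convex-probe data can involve (a simplified geometric sketch with illustrative parameters, not the augmentation pipeline from the paper), the snippet below warps a rectangular linear scan into a fan-shaped sector:

```python
import numpy as np

def linear_to_convex(img, fov_deg=60.0, apex_offset=40):
    """Warp a linear-probe image (rows = depth, cols = lateral position)
    into a fan-shaped view resembling a convex probe. Nearest-neighbour
    sampling keeps the sketch short; parameters are illustrative."""
    h, w = img.shape
    half_fov = np.deg2rad(fov_deg) / 2.0
    out_h = h + apex_offset
    out_w = int(2 * out_h * np.sin(half_fov)) + 1
    ys, xs = np.mgrid[0:out_h, 0:out_w]
    dx = xs - out_w / 2.0          # lateral distance from the virtual apex
    dy = ys.astype(float)          # depth below the apex
    r = np.hypot(dx, dy)
    theta = np.arctan2(dx, dy)
    rows = np.round(r - apex_offset).astype(int)
    cols = np.round((theta + half_fov) / (2 * half_fov) * (w - 1)).astype(int)
    valid = (rows >= 0) & (rows < h) & (np.abs(theta) <= half_fov)
    out = np.zeros((out_h, out_w), img.dtype)
    out[valid] = img[rows[valid], cols[valid]]
    return out

# Example: warp a hypothetical 256x192 linear scan into a convex-style sector.
fan = linear_to_convex(np.random.default_rng(0).random((256, 192)))
print(fan.shape)
```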
- AutoRead
BASc capstone project. The dramatic automatic audiobook maker: using the power of text-to-speech, we seek to generate suitably dramatic readings of fiction novels. Uses fully automated dataset generation to train a custom text-to-speech model.
Selected publications
- Decoding Diffusion: A Scalable Framework for Unsupervised Analysis of Latent Space Biases and Representations Using Natural Language Prompts. E Zhixuan Zeng, Yuhao Chen, and Alexander Wong. In NeurIPS Safe Generative AI Workshop 2024, Dec 2024.
Recent advances in image generation have made diffusion models powerful tools for creating high-quality images. However, their iterative denoising process makes understanding and interpreting their semantic latent spaces more challenging than other generative models, such as GANs. Recent methods have attempted to address this issue by identifying semantically meaningful directions within the latent space. However, they often need manual interpretation or are limited in the number of vectors that can be trained, restricting their scope and utility. This paper proposes a novel framework for unsupervised exploration of diffusion latent spaces. We directly leverage natural language prompts and image captions to map latent directions. This method allows for the automatic understanding of hidden features and supports a broader range of analysis without the need to train specific vectors. Our method provides a more scalable and interpretable understanding of the semantic knowledge encoded within diffusion models, facilitating comprehensive analysis of latent biases and the nuanced representations these models learn. Experimental results show that our framework can uncover hidden patterns and associations in various domains, offering new insights into the interpretability of diffusion model latent spaces.
- MetaGraspNet: A Large-Scale Benchmark Dataset for Vision-driven Robotic Grasping via Physics-based Metaverse Synthesis. Yuhao Chen, Maximilian Gilles, E Zhixuan Zeng, and 1 more author. In 2022 IEEE CASE, Dec 2022.
Finalist
Autonomous bin picking poses significant challenges to vision-driven robotic systems given the complexity of the problem, ranging from various sensor modalities, to highly entangled object layouts, to diverse item properties and gripper types. Existing methods often address the problem from one perspective. Diverse items and complex bin scenes require diverse picking strategies together with advanced reasoning. As such, building robust and effective machine-learning algorithms for this complex task requires significant amounts of comprehensive, high-quality data. Collecting such data in the real world would be too expensive and time-prohibitive, and therefore intractable from a scalability perspective. To tackle this big, diverse data problem, we take inspiration from the recent rise in the concept of metaverses, and introduce MetaGraspNet, a large-scale photo-realistic bin picking dataset constructed via physics-based metaverse synthesis. The proposed dataset contains 217k RGBD images across 82 different article types, with full annotations for object detection, amodal perception, keypoint detection, manipulation order and ambidextrous grasp labels for a parallel-jaw and vacuum gripper. We also provide a real dataset consisting of over 2.3k fully annotated high-quality RGBD images, divided into 5 difficulty levels and an unseen object set to evaluate different object and layout properties. Finally, we conduct extensive experiments showing that our proposed vacuum seal model and synthetic dataset achieve state-of-the-art performance and generalize to real-world use cases.
- ShapeShift: Superquadric-based Object Pose Estimation for Robotic Grasping. E Zhixuan Zeng, Yuhao Chen, and Alexander Wong. In WICV Workshop, Dec 2023.
Object pose estimation is a critical task in robotics for precise object manipulation. However, current techniques heavily rely on a reference 3D object, limiting their generalizability and making it expensive to expand to new object categories. Direct pose predictions also provide limited information for robotic grasping without referencing the 3D model. Keypoint-based methods offer intrinsic descriptiveness without relying on an exact 3D model, but they may lack consistency and accuracy. To address these challenges, this paper proposes ShapeShift, a superquadric-based framework for object pose estimation that predicts the object’s pose relative to a primitive shape which is fitted to the object. The proposed framework offers intrinsic descriptiveness and the ability to generalize to arbitrary geometric shapes beyond the training set.
- Explaining Explainability: Towards Deeper Actionable Insights into Deep Learning through Second-order Explainability. E Zhixuan Zeng*, Hayden Gunraj*, Sheldon Fernandez, and 1 more author. In XAI4CV Workshop, Dec 2023.
Explainability plays a crucial role in providing a more comprehensive understanding of deep learning models’ behaviour. This allows for thorough validation of the model’s performance, ensuring that its decisions are based on relevant visual indicators and not biased toward irrelevant patterns existing in the training data. However, existing methods provide only instance-level explainability, which requires manual analysis of each sample. Such manual review is time-consuming and prone to human biases. To address this issue, the concept of second-order explainable AI (SOXAI) was recently proposed to extend explainable AI (XAI) from the instance level to the dataset level. SOXAI automates the analysis of the connections between quantitative explanations and dataset biases by identifying prevalent concepts. In this work, we explore the use of this higher-level interpretation of a deep neural network’s behaviour to allow us to "explain the explainability" for actionable insights. Specifically, we demonstrate for the first time, via example classification and segmentation cases, that eliminating irrelevant concepts from the training set based on actionable insights from SOXAI can enhance a model’s performance.
- MMRNet: Improving Reliability for Multimodal Object Detection and Segmentation for Bin Picking via Multimodal Redundancy. Yuhao Chen, Hayden Gunraj, E Zhixuan Zeng, and 3 more authors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Dec 2023.
Recently, there has been tremendous interest in Industry 4.0 infrastructure to address labor shortages in global supply chains. Deploying artificial intelligence-enabled robotic bin picking systems in the real world has become particularly important for reducing the stress and physical demands on workers while increasing the speed and efficiency of warehouses. To this end, artificial intelligence-enabled robotic bin picking systems may be used to automate order picking, but with the risk of causing expensive damage during an abnormal event such as sensor failure. As such, reliability becomes a critical factor for translating artificial intelligence research to real-world applications and products. In this paper, we propose a reliable object detection and segmentation system with MultiModal Redundancy (MMRNet) for tackling object detection and segmentation for robotic bin picking using data from different modalities. This is the first system that introduces the concept of multimodal redundancy to address sensor failure issues during deployment. In particular, we realize the multimodal redundancy framework with a gate fusion module and dynamic ensemble learning. Finally, we present a new label-free multi-modal consistency (MC) score that utilizes the output from all modalities to measure the overall system output reliability and uncertainty. Through experiments, we demonstrate that in the event of a missing modality, our system provides much more reliable performance than baseline models. We also demonstrate that our MC score is a more reliable indicator of output quality at inference time than the model-generated confidence scores, which are often over-confident.