publications
publications by category in reverse chronological order. generated by jekyll-scholar.
2024
- Decoding Diffusion: A Scalable Framework for Unsupervised Analysis of Latent Space Biases and Representations Using Natural Language Prompts. E Zhixuan Zeng, Yuhao Chen, and Alexander Wong. In NeurIPS Safe Generative AI Workshop 2024, Dec 2024.
Recent advances in image generation have made diffusion models powerful tools for creating high-quality images. However, their iterative denoising process makes understanding and interpreting their semantic latent spaces more challenging than in other generative models, such as GANs. Recent methods attempt to address this issue by identifying semantically meaningful directions within the latent space, but they often require manual interpretation or are limited in the number of vectors that can be trained, restricting their scope and utility. This paper proposes a novel framework for unsupervised exploration of diffusion latent spaces. We directly leverage natural language prompts and image captions to map latent directions. This method allows for the automatic understanding of hidden features and supports a broader range of analysis without the need to train specific vectors. Our method provides a more scalable and interpretable understanding of the semantic knowledge encoded within diffusion models, facilitating comprehensive analysis of latent biases and the nuanced representations these models learn. Experimental results show that our framework can uncover hidden patterns and associations in various domains, offering new insights into the interpretability of diffusion model latent spaces.
- COVID-Net L2C-ULTRA: An Explainable Linear-Convex Ultrasound Augmentation Learning Framework to Improve COVID-19 Assessment and Monitoring. E Zhixuan Zeng, Ashkan Ebadi, Adrian Florea, and 1 more author. Sensors, Dec 2024.
While no longer a public health emergency of international concern, COVID-19 remains an established and ongoing global health threat. As the global population continues to face significant negative impacts of the pandemic, there has been increased usage of point-of-care ultrasound (POCUS) imaging as a low-cost, portable, and effective modality of choice in the COVID-19 clinical workflow. A major barrier to the widespread adoption of POCUS in the COVID-19 clinical workflow is the scarcity of expert clinicians who can interpret POCUS examinations, leading to considerable interest in artificial intelligence-driven clinical decision support systems to tackle this challenge. A major challenge to building deep neural networks for COVID-19 screening using POCUS is the heterogeneity in the types of probes used to capture ultrasound images (e.g., convex vs. linear probes), which can lead to very different visual appearances. In this study, we propose an analytic framework for COVID-19 assessment that can consume ultrasound images captured by both linear and convex probes. We analyze the impact of leveraging extended linear-convex ultrasound augmentation learning on producing enhanced deep neural networks for COVID-19 assessment, where we conduct data augmentation on convex probe data alongside linear probe data that have been transformed to better resemble convex probe data. The proposed explainable framework, called COVID-Net L2C-ULTRA, employs an efficient deep columnar anti-aliased convolutional neural network designed via a machine-driven design exploration strategy. Our experimental results confirm that the proposed extended linear-convex ultrasound augmentation learning significantly increases performance, with gains of 3.9% in test accuracy, 3.2% in AUC, 10.9% in recall, and 4.4% in precision. The proposed method also demonstrates a much more effective utilization of linear probe images through a 5.1% performance improvement in recall when such images are added to the training dataset, while all other methods show a decrease in recall when trained on the combined linear-convex dataset. We further verify the validity of the model by assessing what the network considers to be the critical regions of an image with our contributing clinician.
- Understanding the Limitations of Diffusion Concept Algebra Through Food. E Zhixuan Zeng, Yuhao Chen, and Alexander Wong. arXiv preprint arXiv:2406.03582, Dec 2024.
Image generation techniques, particularly latent diffusion models, have exploded in popularity in recent years. Many techniques have been developed to manipulate and clarify the semantic concepts these large-scale models learn, offering crucial insights into biases and concept relationships. However, these techniques are often only validated in conventional realms of human or animal faces and artistic style transitions. The food domain offers unique challenges through complex compositions and regional biases, which can shed light on the limitations and opportunities within existing methods. Through the lens of food imagery, we analyze both qualitative and quantitative patterns within a concept traversal technique. We reveal measurable insights into the model’s ability to capture and represent the nuances of culinary diversity, while also identifying areas where the model’s biases and limitations emerge.
- Beyond the Scoreboard: Advancing Fairness in Athlete Selection with Simulation-Based Tournament Strategies. E. Zhixuan Zeng and Yuhong Zeng. Journal of Computational Vision and Imaging Systems, Apr 2024.
The process of selecting athletes for competitive sports teams is often undermined by the limitations of traditional tournament formats, which can misrepresent the true skill levels of participants. This issue is exemplified by a scenario observed in a table tennis team tryout, where a moderately skilled player advanced to the final round due to consistently facing weaker opponents, while more adept players were eliminated early against stronger competitors. Such occurrences cast doubt on the fairness and effectiveness of single elimination tournaments for player assessment. Addressing these concerns, our study conducts a thorough analysis of various tournament selection strategies, including single elimination, Swiss tournaments, and novel graph and sorting-based methods. By modeling players as Gaussian distributions with established mean skill levels, we simulate match outcomes to quantitatively evaluate the efficiency and accuracy of each strategy. Our evaluation employs two loss functions: Strict Loss, to gauge ranking precision, and Binary Loss, to assess the accuracy in identifying top performers. The experimental results reveal significant insights. Strategies integrating Elo ratings with circular graph approaches show enhanced performance, particularly in larger player groups, while TrueSkill and single elimination exhibit limitations in scalability and nuanced player ranking. The Swiss tournament, although consistent, experiences fluctuations in loss, suggesting areas for refinement. Notably, a novel graph-based strategy emerges as a stable and efficient alternative, underscoring its potential for future research. These findings aim to guide the development of more equitable and precise selection processes in sports team composition.
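The simulation setup described in this abstract can be sketched in a few lines. This is an illustrative toy, not the paper's code: player skills, the performance noise `sigma`, and the single-elimination bracket below are all assumptions chosen for demonstration, and the "miss rate" stands in for the paper's Binary Loss (failure to identify the top performer).

```python
import random

def play(skill_a, skill_b, sigma=1.0):
    """Simulate one match: each player draws a performance from a
    Gaussian around their mean skill; the higher draw wins."""
    return random.gauss(skill_a, sigma) > random.gauss(skill_b, sigma)

def single_elimination(skills):
    """Run one single-elimination bracket with random seeding.
    Player count must be a power of two. Returns the winner's index."""
    bracket = list(range(len(skills)))
    random.shuffle(bracket)
    while len(bracket) > 1:
        bracket = [a if play(skills[a], skills[b]) else b
                   for a, b in zip(bracket[::2], bracket[1::2])]
    return bracket[0]

random.seed(0)
skills = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]  # hypothetical mean skills
trials = 2000
# Binary-style loss: how often does the bracket fail to crown the best player?
misses = sum(single_elimination(skills) != 7 for _ in range(trials))
print(f"best player misses the title in {misses / trials:.0%} of brackets")
```

Even this toy reproduces the abstract's motivating observation: with noisy match outcomes, the strongest player fails to win a single-elimination bracket a substantial fraction of the time, which is why the paper compares alternative formats.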
2023
- ShapeShift: Superquadric-based Object Pose Estimation for Robotic Grasping. E Zhixuan Zeng, Yuhao Chen, and Alexander Wong. In WICV workshop, Apr 2023.
Object pose estimation is a critical task in robotics for precise object manipulation. However, current techniques heavily rely on a reference 3D object, limiting their generalizability and making it expensive to expand to new object categories. Direct pose predictions also provide limited information for robotic grasping without referencing the 3D model. Keypoint-based methods offer intrinsic descriptiveness without relying on an exact 3D model, but they may lack consistency and accuracy. To address these challenges, this paper proposes ShapeShift, a superquadric-based framework for object pose estimation that predicts the object’s pose relative to a primitive shape which is fitted to the object. The proposed framework offers intrinsic descriptiveness and the ability to generalize to arbitrary geometric shapes beyond the training set.
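For context on the primitive shapes ShapeShift fits, the superquadric family is commonly defined by an inside-outside function (Barr's formulation). The sketch below shows that function only; it is background on the shape representation, not the paper's fitting or pose-estimation method, and the parameter names are generic.

```python
def superquadric_F(p, scale, eps):
    """Inside-outside function of a superquadric.
    p:     (x, y, z) point in the superquadric's canonical frame.
    scale: (a1, a2, a3) semi-axis lengths.
    eps:   (e1, e2) shape exponents; e1 = e2 = 1 gives an ellipsoid,
           small exponents give box-like shapes.
    Returns F < 1 inside, F == 1 on the surface, F > 1 outside."""
    x, y, z = p
    a1, a2, a3 = scale
    e1, e2 = eps
    xy = (abs(x / a1) ** (2.0 / e2) + abs(y / a2) ** (2.0 / e2)) ** (e2 / e1)
    return xy + abs(z / a3) ** (2.0 / e1)

# With e1 = e2 = 1 and unit axes, F reduces to x^2 + y^2 + z^2,
# so points on the unit sphere evaluate to exactly 1.
print(superquadric_F((1.0, 0.0, 0.0), (1, 1, 1), (1, 1)))  # 1.0
```

Because a single parameter vector smoothly covers ellipsoids, boxes, and cylinders, a pose expressed relative to a fitted superquadric generalizes across object geometries, which is the property the abstract highlights.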
- Explaining Explainability: Towards Deeper Actionable Insights into Deep Learning through Second-order Explainability. E Zhixuan Zeng*, Hayden Gunraj*, Sheldon Fernandez, and 1 more author. In XAI4CV workshop, Apr 2023.
Explainability plays a crucial role in providing a more comprehensive understanding of deep learning models’ behaviour. This allows for thorough validation of the model’s performance, ensuring that its decisions are based on relevant visual indicators and not biased toward irrelevant patterns existing in training data. However, existing methods provide only instance-level explainability, which requires manual analysis of each sample. Such manual review is time-consuming and prone to human biases. To address this issue, the concept of second-order explainable AI (SOXAI) was recently proposed to extend explainable AI (XAI) from the instance level to the dataset level. SOXAI automates the analysis of the connections between quantitative explanations and dataset biases by identifying prevalent concepts. In this work, we explore the use of this higher-level interpretation of a deep neural network’s behaviour to allow us to "explain the explainability" for actionable insights. Specifically, we demonstrate for the first time, via example classification and segmentation cases, that eliminating irrelevant concepts from the training set based on actionable insights from SOXAI can enhance a model’s performance.
- MMRNet: Improving Reliability for Multimodal Object Detection and Segmentation for Bin Picking via Multimodal Redundancy. Yuhao Chen, Hayden Gunraj, E Zhixuan Zeng, and 3 more authors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Apr 2023.
Recently, there has been tremendous interest in industry 4.0 infrastructure to address labor shortages in global supply chains. Deploying artificial intelligence-enabled robotic bin picking systems in the real world has become particularly important for reducing the stress and physical demands on workers while increasing the speed and efficiency of warehouses. To this end, artificial intelligence-enabled robotic bin picking systems may be used to automate order picking, but with the risk of causing expensive damage during an abnormal event such as sensor failure. As such, reliability becomes a critical factor for translating artificial intelligence research into real-world applications and products. In this paper, we propose a reliable object detection and segmentation system with MultiModal Redundancy (MMRNet) for tackling object detection and segmentation for robotic bin picking using data from different modalities. This is the first system to introduce the concept of multimodal redundancy to address sensor failure issues during deployment. In particular, we realize the multimodal redundancy framework with a gate fusion module and dynamic ensemble learning. Finally, we present a new label-free multimodal consistency (MC) score that utilizes the output from all modalities to measure the overall system output reliability and uncertainty. Through experiments, we demonstrate that in the event of a missing modality, our system provides much more reliable performance compared to baseline models. We also demonstrate that our MC score is a more reliable indicator of output quality during inference than the model-generated confidence scores, which are often over-confident.
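The idea of a label-free consistency score can be illustrated with a simple stand-in: measure how much the per-modality predictions agree with one another, with low agreement flagging a possible sensor problem. The mean pairwise IoU below is an assumed, illustrative formulation, not the paper's exact MC score, and the masks are hypothetical.

```python
def iou(a, b):
    """Intersection-over-union of two pixel sets."""
    return len(a & b) / len(a | b) if a | b else 1.0

def consistency_score(masks):
    """Label-free agreement across modality outputs: the mean pairwise
    IoU of the predicted masks. High agreement suggests the modalities
    corroborate each other; a sudden drop can flag a degraded sensor.
    (Illustrative stand-in for the paper's MC score.)"""
    pairs = [(i, j) for i in range(len(masks)) for j in range(i + 1, len(masks))]
    return sum(iou(masks[i], masks[j]) for i, j in pairs) / len(pairs)

rgb   = {(0, 0), (0, 1), (1, 0), (1, 1)}  # hypothetical RGB-branch mask
depth = {(0, 0), (0, 1), (1, 0)}          # hypothetical depth-branch mask
print(consistency_score([rgb, depth]))    # 0.75
```

The appeal of a score like this is that it needs no ground-truth labels at inference time, which is what lets it serve as a deployment-time reliability signal rather than an offline evaluation metric.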
2022
- MetaGraspNet: A Large-Scale Benchmark Dataset for Vision-driven Robotic Grasping via Physics-based Metaverse Synthesis. Yuhao Chen, Maximilian Gilles, E Zhixuan Zeng, and 1 more author. In 2022 IEEE CASE, Apr 2022.
Finalist
Autonomous bin picking poses significant challenges to vision-driven robotic systems given the complexity of the problem, ranging from various sensor modalities, to highly entangled object layouts, to diverse item properties and gripper types. Existing methods often address the problem from one perspective. Diverse items and complex bin scenes require diverse picking strategies together with advanced reasoning. As such, building robust and effective machine-learning algorithms for this complex task requires significant amounts of comprehensive, high-quality data. Collecting such data in the real world would be prohibitively expensive and time-consuming, and therefore intractable from a scalability perspective. To tackle this big, diverse data problem, we take inspiration from the recent rise of the metaverse concept and introduce MetaGraspNet, a large-scale photo-realistic bin picking dataset constructed via physics-based metaverse synthesis. The proposed dataset contains 217k RGBD images across 82 different article types, with full annotations for object detection, amodal perception, keypoint detection, manipulation order, and ambidextrous grasp labels for a parallel-jaw and vacuum gripper. We also provide a real dataset consisting of over 2.3k fully annotated high-quality RGBD images, divided into 5 levels of difficulty and an unseen object set to evaluate different object and layout properties. Finally, we conduct extensive experiments showing that our proposed vacuum seal model and synthetic dataset achieve state-of-the-art performance and generalize to real-world use cases.
- Investigating Use of Keypoints for Object Pose Recognition. E Zhixuan Zeng, Yuhao Chen, and Alexander Wong. In Journal of Computational Vision and Imaging Systems, Apr 2022.
Object pose detection is a task that is highly useful for a variety of object manipulation tasks such as robotic grasping and tool handling. Perspective-n-Point matching between keypoints on the objects offers a way to perform pose estimation where the keypoints also provide inherent object information, such as corner locations and object part sections, without the need to reference a separate 3D model. Existing works focus on scenes with little occlusion and limited object categories. In this study, we demonstrate the feasibility of a pose estimation network based on detecting semantically important keypoints on the MetaGraspNet dataset, which contains heavy occlusion and greater scene complexity. We further discuss various challenges in using semantically important keypoints as a way to perform object pose estimation. These challenges include maintaining consistent keypoint definitions, as well as dealing with heavy occlusion and similar visual features.