Modern approaches to object detection heavily rely on deep learning models trained end-to-end. Enhancing these models often involves training them on larger, more diverse annotated datasets, a somewhat brute-force yet effective method for performance improvement. However, obtaining precise annotations for object detection, including class names and accurate bounding boxes, is time-consuming and expensive compared with annotating images for classification.
Data augmentation is a strategy for expanding the set of training instances without requiring additional annotations. It transforms the existing images, for example by rotating, resizing, or flipping them, to train more robust object detection models.
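For object detection, a geometric transform has to be applied to the bounding boxes as well as to the pixels, otherwise the labels no longer match the image. The following is a minimal sketch of a horizontal flip, assuming boxes are stored as (x_min, y_min, x_max, y_max) in pixel-edge coordinates and the image is a plain list of pixel rows. The function names are hypothetical and are not taken from the study.

```python
from typing import List, Sequence, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixel-edge coordinates.
Box = Tuple[float, float, float, float]


def hflip_image(image: Sequence[Sequence[int]]) -> List[List[int]]:
    """Mirror a row-major image (a list of pixel rows) left to right."""
    return [list(reversed(row)) for row in image]


def hflip_boxes(boxes: Sequence[Box], image_width: float) -> List[Box]:
    """Mirror boxes across the vertical centre line of the image.

    The old right edge becomes the new left edge, so x_min and x_max
    swap roles; the y coordinates are unchanged.
    """
    flipped = []
    for x_min, y_min, x_max, y_max in boxes:
        flipped.append((image_width - x_max, y_min, image_width - x_min, y_max))
    return flipped


# Toy 1x4 image with a single "object" pixel in the leftmost column.
image = [[1, 0, 0, 0]]
boxes = [(0, 0, 1, 1)]

flipped_image = hflip_image(image)
flipped_boxes = hflip_boxes(boxes, image_width=4)
```

After the flip, the object pixel sits in the rightmost column and the box covers that same column, so the annotation is reused without any new labeling.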
While conventional data augmentation adds variety to the existing images, generative data augmentation goes a step further by introducing new visual content. This approach is reported to significantly improve performance in downstream vision tasks.
Unlike classic data augmentation, generative data augmentation for object detection poses challenges due to the complexity of bounding box labels. AWS AI’s recent study explores the possibility of utilizing diffusion models for generative data augmentation without human annotations. The researchers employ diffusion-based inpainting techniques to create objects within specified bounding boxes, incorporating visual priors and configurable diffusion models for guided text-to-image generation.
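Inpainting models are typically given a binary mask that marks the region to regenerate. As an illustration of how a bounding box annotation could be turned into such a mask, here is a small sketch in pure Python. The function name and the integer, half-open box convention are assumptions made for this example; the study's actual pipeline is not described in enough detail here to reproduce it.

```python
from typing import List, Tuple

# Assumed box format: integer (x_min, y_min, x_max, y_max), with x_max and
# y_max exclusive (half-open), which is a convention chosen for this sketch.
IntBox = Tuple[int, int, int, int]


def box_to_inpaint_mask(width: int, height: int, box: IntBox) -> List[List[int]]:
    """Return a height x width mask with 1 inside the box and 0 elsewhere.

    The box is clipped to the image bounds, so a box that partly leaves the
    image only marks the pixels that actually exist.
    """
    x_min, y_min, x_max, y_max = box
    x0, x1 = max(0, x_min), min(width, x_max)
    y0, y1 = max(0, y_min), min(height, y_max)
    return [
        [1 if (x0 <= x < x1 and y0 <= y < y1) else 0 for x in range(width)]
        for y in range(height)
    ]


# A 4x3 image with a 2x2 box starting at column 1, row 0.
mask = box_to_inpaint_mask(width=4, height=3, box=(1, 0, 3, 2))
```

Because the generated object is confined to the masked box, the original bounding box can in principle be reused as the label for the new image.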
To ensure the augmented images align with the original annotations, the researchers propose a method for calculating CLIP scores. Integrating inpainting-based approaches into the pipeline further accelerates the process.
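A CLIP score is commonly computed as the cosine similarity between an image embedding and a text embedding. The sketch below shows how such a score could be used to discard generated images that do not match their label. It assumes the embeddings have already been produced by a CLIP image and text encoder; the toy vectors, the function names, and the threshold are hypothetical, and the study's own scoring scheme may differ.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_by_score(
    candidates: Sequence[Tuple[str, Sequence[float]]],
    text_embedding: Sequence[float],
    threshold: float,
) -> List[str]:
    """Keep the names of candidates whose similarity to the text meets the threshold."""
    kept = []
    for name, image_embedding in candidates:
        if cosine_similarity(image_embedding, text_embedding) >= threshold:
            kept.append(name)
    return kept


# Toy 2-D stand-ins for CLIP embeddings; real embeddings have hundreds of dimensions.
label_embedding = [1.0, 0.0]
generated = [
    ("aligned", [1.0, 0.0]),    # same direction as the label
    ("unrelated", [0.0, 1.0]),  # orthogonal to the label
    ("partial", [1.0, 1.0]),    # 45 degrees from the label
]

kept = filter_by_score(generated, label_embedding, threshold=0.5)
```

In this toy run the orthogonal candidate is dropped while the other two are kept. In practice the threshold would need to be tuned, since raw CLIP similarities tend to fall in a narrow range.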
The study’s experiments, conducted on various datasets and scenarios, demonstrate promising results. With the YOLOX detector, the reported mAP improvements are 18.0%, 15.6%, and 15.9% on different COCO datasets, 2.9% on the complete PASCAL VOC dataset, and 12.4% on average across downstream datasets.
The researchers also note that this method can complement other data augmentation approaches, which suggests that combining them could improve performance further.