Embedding-based pair generation for contrastive representation learning in audio-visual surveillance data
{{output}}
Smart cities deploy various sensors such as microphones and RGB cameras to collect data to improve the safety and comfort of the citizens. As data annotation is expensive, self-supervised methods such as contrastive learning are used to learn audio-visual repr... ...