Microsoft has launched a preview of Florence, a computer vision model that was first introduced two years ago. Florence understands both images and language, which makes it a “unified” and “multimodal” system. The model is trained on billions of image-text pairs, which gives it a broad range of capabilities, including automatic captioning, background removal, video summarization, and image retrieval. Florence is being used across Microsoft’s own platforms, products, and services: LinkedIn, Microsoft Teams, PowerPoint, Outlook, and Word all rely on the model’s image captioning abilities.
Reddit is also using Florence to generate captions for images on its platform, creating “alt text” so that users with vision challenges can better follow along in threads. Because Florence can generate up to 10,000 tags per image, Reddit will have much more control over how many objects in a picture it can identify, which should help it produce much better captions. Customers are expected to use Florence for a range of other applications in the future, such as detecting defects in manufacturing and enabling self-checkout in retail stores.
Multimodal models are increasingly seen as the best path toward more capable AI systems. These models can understand multiple modalities, such as language and images, or video and audio. They tend to perform better than unimodal models because the additional modalities supply contextual information. They are also more computationally efficient, which leads to faster processing and lower costs on the backend. Microsoft’s Florence is a complete rethinking of vision models, and once translation between images and text is easy and high quality, a world of possibilities opens up.