I built an app called Queryable, which integrates the CLIP model on iOS to search the Photos album OFFLINE. It is available on the App Store today, and I thought it might be helpful to others who are as frustrated with the search function of Photos as I was, so I wrote this article to introduce it.
CLIP (Contrastive Language-Image Pre-Training) is a model proposed by OpenAI in 2021. It encodes images and text into representations that can be compared in the same space. CLIP is also a building block for many text-to-image models (e.g. Stable Diffusion, which uses a CLIP text encoder to represent the prompt).
To run on iOS devices in real time, I made a compromise between performance and model size, and finally chose the ViT-B-32 model, split into a Text Encoder and an Image Encoder:
- Text Encoder will encode any text into a 1x512 vector.
- Image Encoder will encode any image into a 1x512 vector.
We can calculate the proximity of a text sentence and an image by finding the cosine similarity between the text vector and the image vector. The code is roughly as follows:
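import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the ViT-B/32 CLIP model
model, preprocess = clip.load("ViT-B/32", device=device)

# Calculate the image vector and the text vector
image = preprocess(Image.open("photo-of-a-dog.png")).unsqueeze(0).to(device)
text = clip.tokenize(["rainy night"]).to(device)
with torch.no_grad():
    image_feature = model.encode_image(image)
    text_feature = model.encode_text(text)

# Cosine similarity between the two 1x512 vectors
sim = torch.nn.functional.cosine_similarity(image_feature, text_feature)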
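For intuition, cosine similarity is just the dot product of two vectors divided by the product of their lengths. Here is a minimal, dependency-free sketch of it. The 3-dimensional vectors are made-up stand-ins for real 512-dimensional CLIP features, and the variable names are only illustrative:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy stand-ins for CLIP features (real ones have 512 dimensions).
text_feature = [1.0, 0.0, 1.0]
matching_image = [2.0, 0.0, 2.0]   # same direction as the text vector
unrelated_image = [0.0, 3.0, 0.0]  # orthogonal to the text vector

print(cosine_similarity(text_feature, matching_image))
print(cosine_similarity(text_feature, unrelated_image))
```

A vector pointing in the same direction as the text vector scores close to 1 regardless of its length, and an orthogonal one scores 0. This is why only the direction of the CLIP vectors matters when ranking photos.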
Integrate CLIP into iOS
I exported the Text Encoder and Image Encoder to Core ML models using the coremltools library. The final models have a total file size of 300MB. Then, I started writing Swift.
Here is how to do inference with the Text Encoder in Swift:
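import CoreML

// Load the Text Encoder model.
let config = MLModelConfiguration()
let text_encoder = try MLModel(contentsOf: TextEncoderURL, configuration: config)

// Given a prompt, calculate the CLIP text vector for it.
// `encode` is shorthand here for tokenizing the prompt and running the model's prediction.
let text_feature = text_encoder.encode("a dog")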
I split the Text Encoder and Image Encoder into two models because, when you actually use this Photos search app, your input text always changes but the content of the Photos library is fixed. This means that all the image vectors can be computed once and saved in advance. Then, only one text vector has to be computed for each of your searches.
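The split described above (index once, embed only the query at search time) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Queryable's actual code: the photo ids, the 3-dimensional vectors, and the function names `build_index` and `search` are all hypothetical, and the real vectors would come from the two encoders.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def build_index(image_vectors):
    """Done once: store a unit-length vector for every photo id."""
    return {photo_id: normalize(vec) for photo_id, vec in image_vectors.items()}


def search(index, text_vector, top_k=3):
    """Done per query: score every stored vector against one text vector."""
    query = normalize(text_vector)
    scored = [
        (sum(q * v for q, v in zip(query, vec)), photo_id)
        for photo_id, vec in index.items()
    ]
    scored.sort(reverse=True)
    return [photo_id for _, photo_id in scored[:top_k]]


# Hypothetical precomputed image vectors (3 dimensions instead of 512).
index = build_index({
    "IMG_0001": [0.9, 0.1, 0.0],
    "IMG_0002": [0.0, 1.0, 0.2],
    "IMG_0003": [0.7, 0.6, 0.1],
})

# Stand-in for the Text Encoder output of one search prompt.
results = search(index, [1.0, 0.0, 0.0], top_k=2)
print(results)
```

Because the stored vectors are already normalized, each search costs one dot product per photo plus a sort. The expensive Image Encoder never runs at query time.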
Thus, real-time text search over a Photos library of tens of thousands of images becomes possible. Below is a flowchart of how Queryable works:
But how much better is CLIP-based album search than the search function of the iPhone Photos app? The answer is: overwhelmingly better. With CLIP, you can search for a scene in your mind, a tone, an object, or even an emotion conveyed by the image.
To use Queryable, you first need to build the index: the app traverses your album, calculates the image vector for every photo, and stores the results. This happens only ONCE. The total time depends on the number of photos you have; the speed is ~2000 photos per minute on an iPhone 12 mini. When you have new photos, you can manually update the index, which is very fast.
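One plausible reason an update is much faster than the first build is that only photos missing from the index need to be encoded. The sketch below illustrates that idea and the build-time arithmetic. It is an assumption about how such an index could work, not a description of Queryable's internals, and `update_index`, `estimated_minutes`, and `fake_embed` are hypothetical names.

```python
def update_index(index, album, embed):
    """Embed only photos missing from the index; return how many were added."""
    added = 0
    for photo_id, image in album.items():
        if photo_id not in index:
            index[photo_id] = embed(image)
            added += 1
    return added


def estimated_minutes(photo_count, photos_per_minute=2000):
    """Rough build time at the ~2000 photos/minute quoted for an iPhone 12 mini."""
    return photo_count / photos_per_minute


# Hypothetical stand-in for the Image Encoder: records each call.
calls = []


def fake_embed(image):
    calls.append(image)
    return [float(len(image)), 0.0, 0.0]


album = {"IMG_0001": "dog.png", "IMG_0002": "beach.png"}
index = {}
first = update_index(index, album, fake_embed)   # full build
album["IMG_0003"] = "night.png"
second = update_index(index, album, fake_embed)  # only the new photo

print(first, second, len(calls))
print(estimated_minutes(35_000))
```

At the quoted rate, a 35,000-photo library would take roughly 17.5 minutes for the first build if the speed held constant, while an update touches only the photos added since. The sketch does not handle deleted photos.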
The time cost of a search also depends on the number of photos. For <10,000 photos it takes less than 1s. For me, an iPhone 12 mini user with 35,000 photos, each search takes about 2.8s.
I made a video to demonstrate the search capabilities of Queryable:
1. On privacy and security issues.
Queryable is designed as an OFFLINE app that does not require a network connection and will NEVER request network access, thereby avoiding privacy issues.
2. What if my pictures are stored in iCloud?
Because it cannot connect to a network, Queryable can only use the locally cached low-definition versions of the photos in your album. However, the CLIP model itself resizes the input image to a very small size (e.g. 224x224 for ViT-B-32), so a photo stored in iCloud does not actually affect search accuracy; the only difference is that you cannot view its original image in the search results.
- Update: In the latest version, you have the option to grant the app access to the network in order to download photos stored in iCloud. This will only occur when the photo is included in your search results, the original version is stored in iCloud, and you have navigated to the details page and clicked the download icon. Once you grant the permissions, you can close the app, reopen it, and the photos will be automatically downloaded from iCloud.
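To see why the low-definition cache is enough for search, it helps to look at the geometry of the model input. The sketch below assumes the usual CLIP-style preprocessing of resizing the shorter side to 224 and then center-cropping to 224x224. The 480x360 thumbnail size is a made-up example, not a documented iOS cache size, and `clip_input_geometry` is a hypothetical helper.

```python
def clip_input_geometry(width, height, size=224):
    """Return (resized_size, crop_box) for shorter-side resize + center crop."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    if width <= height:
        new_w, new_h = size, int(size * height / width)
    else:
        new_w, new_h = int(size * width / height), size
    left = (new_w - size) // 2
    top = (new_h - size) // 2
    return (new_w, new_h), (left, top, left + size, top + size)


original = clip_input_geometry(4032, 3024)  # full-resolution photo
thumbnail = clip_input_geometry(480, 360)   # hypothetical cached version

print(original)
print(thumbnail)
print(original == thumbnail)
```

Both versions are reduced to the same 224x224 crop of the same region, so the encoder sees nearly the same picture either way, as long as the cached version is not smaller than the model input.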
3. Any requirements for the device?
- iOS 16.0 or above
- iPhone 11 (A13 chip) or later models
4. Have suggestions or product experience issues?
Feel free to contact me by email: myfancoo@gmail dot com.