Update: I have made Queryable free & open-source. This might help you learn how to export Core ML models, as well as how to calculate, store, search, and accelerate queries.
I built an app called Queryable, which integrates the CLIP model on iOS to search the Photos album OFFLINE. It is available on the App Store today, and I thought it might be helpful to others who are as frustrated with the search function of Photos as I was, so I wrote this article to introduce it.
CLIP
CLIP (Contrastive Language-Image Pre-Training) is a model proposed by OpenAI in 2021. CLIP encodes images and text into representations that can be compared in the same embedding space. Many text-to-image models (e.g., Stable Diffusion) build on CLIP, using it to measure the distance between the generated image and the prompt during training.
To run on iOS devices in real time, I made a trade-off between performance and model size and finally chose the ViT-B-32 variant, splitting it into a separate Text Encoder and Image Encoder.
In ViT-B-32:
- The Text Encoder encodes any text into a 1x512-dimensional vector.
- The Image Encoder encodes any image into a 1x512-dimensional vector.
We can measure how close a text sentence is to an image by computing the cosine similarity between their text vector and image vector. The code looks roughly like this:
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
# Load the ViT-B-32 CLIP model and its image preprocessing pipeline
model, preprocess = clip.load("ViT-B/32", device=device)
# Calculate the image vector & the text vector
image = preprocess(Image.open("photo-of-a-dog.png")).unsqueeze(0).to(device)
image_feature = model.encode_image(image)
text_feature = model.encode_text(clip.tokenize("rainy night").to(device))
# Cosine similarity between the two 1x512 vectors
sim = torch.nn.functional.cosine_similarity(image_feature, text_feature)
Integrate CLIP into iOS
I exported the Text Encoder and the Image Encoder to Core ML models using the coremltools library. The two models have a total file size of about 300MB. Then, I started writing Swift.
Here is how to run inference with the Text Encoder in Swift:
// Load the Text Encoder Core ML model.
let text_encoder = try MLModel(contentsOf: TextEncoderURL, configuration: config)
// Given a prompt, calculate the CLIP text vector for it.
// (encode(_:) stands for a helper that tokenizes the prompt and calls prediction(from:).)
let text_feature = text_encoder.encode("a dog")
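For context, MLModel itself has no encode(_:) method, so a call like the one above is necessarily a thin wrapper around MLModel.prediction(from:). Here is a rough sketch of what such a wrapper can look like; the feature names "input_text" and "text_embedding", the 77-token context length, and the tokenizer closure are assumptions made for illustration, not the actual exported model's interface:
import CoreML

// A sketch of a text-encoding wrapper around MLModel.prediction(from:).
// Assumptions: the model takes a 1x77 Int32 token sequence named "input_text" and
// returns a 1x512 vector named "text_embedding"; `tokenizer` is assumed to return
// exactly 77 padded token IDs (CLIP's context length).
func encodeText(_ prompt: String,
                with textEncoder: MLModel,
                tokenizer: (String) -> [Int32]) throws -> MLMultiArray {
    let tokens = tokenizer(prompt)
    let input = try MLMultiArray(shape: [1, 77], dataType: .int32)
    for (i, token) in tokens.enumerated() {
        input[i] = NSNumber(value: token)
    }
    let provider = try MLDictionaryFeatureProvider(
        dictionary: ["input_text": MLFeatureValue(multiArray: input)])
    let output = try textEncoder.prediction(from: provider)
    return output.featureValue(for: "text_embedding")!.multiArrayValue!
}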
The reason I split the Text Encoder and the Image Encoder into two models is that, when you actually use this photo search app, your input text changes with every query, while the content of your Photos library is relatively fixed. This means all the image vectors can be computed once and saved in advance; only the text vector needs to be computed, once per search.
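To make this concrete, here is a minimal Swift sketch of the per-query search step, assuming the image vectors and the text vector are already L2-normalized Float arrays held in memory (the type and function names are illustrative, not Queryable's actual code):
import Foundation
import Accelerate

// A photo entry in the precomputed index.
struct IndexedPhoto {
    let assetIdentifier: String   // PHAsset localIdentifier
    let vector: [Float]           // 512-dim CLIP image vector, L2-normalized
}

// Rank all indexed photos against the query's text vector.
// With normalized vectors, cosine similarity reduces to a dot product.
func topMatches(for textVector: [Float],
                in index: [IndexedPhoto],
                count: Int = 30) -> [(id: String, score: Float)] {
    let scored = index.map { photo -> (String, Float) in
        var dot: Float = 0
        vDSP_dotpr(photo.vector, 1, textVector, 1, &dot, vDSP_Length(textVector.count))
        return (photo.assetIdentifier, dot)
    }
    return scored.sorted { $0.1 > $1.1 }
                 .prefix(count)
                 .map { (id: $0.0, score: $0.1) }
}
Because each query only needs the dot products and a sort, this linear scan stays fast even for tens of thousands of photos.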
Thus, real-time text search across a library of tens of thousands of photos becomes possible. Below is a flowchart of how Queryable works:
Performance
So, compared to the built-in search of the iPhone Photos app, how much better is CLIP-based album search? The answer: overwhelmingly better. With CLIP, you can search for a scene in your mind, a tone, an object, or even an emotion conveyed by the image.
To use Queryable, you first need to build the index, which traverses your album, computes all the image vectors, and stores them. This happens only ONCE. The total time required depends on the number of photos; the speed is about 2,000 photos per minute on an iPhone 12 mini. When you take new photos, you can manually update the index, which is very fast.
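As a rough illustration of that indexing pass (a sketch under assumptions, not Queryable's actual implementation), the idea is to enumerate assets with PhotoKit, request a small thumbnail for each, run it through the Image Encoder, and keep the vectors keyed by asset identifier; the encodeImage closure below stands in for the Image Encoder call:
import Photos
import UIKit

// One-time indexing pass: enumerate all photos, compute a CLIP image vector for
// each, and collect the vectors by asset identifier so they can be persisted.
func buildIndex(encodeImage: (UIImage) -> [Float]) -> [String: [Float]] {
    var index: [String: [Float]] = [:]
    let assets = PHAsset.fetchAssets(with: .image, options: nil)
    let options = PHImageRequestOptions()
    options.isSynchronous = true            // keeps the sketch simple; batch in real code
    options.isNetworkAccessAllowed = false  // stay offline, use the local cache only
    assets.enumerateObjects { asset, _, _ in
        PHImageManager.default().requestImage(
            for: asset,
            targetSize: CGSize(width: 224, height: 224),  // CLIP/ViT-B-32 input size
            contentMode: .aspectFill,
            options: options
        ) { image, _ in
            if let image = image {
                index[asset.localIdentifier] = encodeImage(image)
            }
        }
    }
    return index   // persist to disk afterwards and update incrementally for new photos
}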
The time cost of a search also depends on the number of photos: for fewer than 10,000 photos it takes less than 1 second. For me, an iPhone 12 mini user with 35,000 photos, each search takes about 2.8 seconds.
I made a video to demonstrate the search capabilities of Queryable:
Q&A
1. On privacy and security issues.
Queryable is designed as an OFFLINE app that does not require a network connection and will NEVER request network access, thereby avoiding privacy issues.
2. What if my pictures are stored in iCloud?
Because it cannot connect to the network, Queryable can only use the locally cached, low-resolution versions of your Photos album. However, the CLIP model itself resizes the input image to a very small size (e.g., 224x224 for ViT-B-32), so a photo being stored in iCloud does not really affect search accuracy; you just cannot view the original image in the search results.
- Update: In the latest version, you have the option to grant the app access to the network in order to download photos stored in iCloud. This will only occur when the photo is included in your search results, the original version is stored in iCloud, and you have navigated to the details page and clicked the download icon. Once you grant the permissions, you can close the app, reopen it, and the photos will be automatically downloaded from iCloud.
3. Any requirements for the device?
- iOS 16.0 or above
- iPhone 11 (A13 chip) or later models
4. Have any suggestions or product experience issues?
Feel free to contact me by email: myfancoo@gmail dot com.