Run CLIP on iPhone to Search Photos

Update: I have made Queryable open-source. It may help you learn how to export Core ML models, and how to compute, store, search, and accelerate queries over the image vectors.

I built an app called Queryable, which integrates the CLIP model on iOS to search the Photos album OFFLINE. It is available on the App Store today, and I thought it might be helpful to others who are as frustrated with the built-in search in Photos as I was, so I wrote this article to introduce it.

Queryable website

CLIP

CLIP (Contrastive Language-Image Pre-Training) is a model proposed by OpenAI in 2021. It encodes images and text into representations that can be compared in the same embedding space. CLIP is the basis that many text-to-image models (e.g. Stable Diffusion) use to measure the distance between the generated image and the prompt during training.
Source from OpenAI

To run on iOS devices in real time, I had to trade off performance against model size, and finally chose the ViT-B-32 model and split it into a Text Encoder and an Image Encoder.

In ViT-B-32:

  • The Text Encoder encodes any text into a 1x512-dimensional vector.
  • The Image Encoder encodes any image into a 1x512-dimensional vector.

We can measure how closely a text sentence matches an image by computing the cosine similarity between their two vectors. The pseudocode is roughly as follows:

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the ViT-B-32 CLIP model
model, preprocess = clip.load("ViT-B/32", device=device)

# Calculate the image vector & text vector
image = preprocess(Image.open("photo-of-a-dog.png")).unsqueeze(0).to(device)
image_feature = model.encode_image(image)
text_feature = model.encode_text(clip.tokenize("rainy night").to(device))

# Cosine similarity
sim = torch.cosine_similarity(image_feature, text_feature)

Integrate CLIP into iOS

I exported the Text Encoder and Image Encoder to Core ML models using the coremltools library. The two models have a total file size of about 300MB. Then, I started writing Swift.

Here is, in simplified form, how to run inference with the Text Encoder in Swift:

// Load the Text Encoder model.
let text_encoder = try MLModel(contentsOf: TextEncoderURL, configuration: config)
// Given a prompt, calculate the CLIP text vector for it.
// (`encode` is shorthand for tokenizing the prompt and calling the model's `prediction(from:)`.)
let text_feature = text_encoder.encode("a dog")
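
For readers who want to see what such an `encode` wrapper might contain: with a raw Core ML model you tokenize the prompt yourself, fill an MLMultiArray, and call `prediction(from:)`. The sketch below is only an illustration; the `tokenize` closure and the feature names "text" and "embedding" are assumptions that depend on how your model was exported, not Queryable's actual code.

import CoreML

// Encode a prompt into a 1x512 CLIP text vector using an exported Text Encoder.
func encodeText(_ prompt: String,
                with textEncoder: MLModel,
                tokenize: (String) -> [Int32]) throws -> MLMultiArray? {
    // CLIP's BPE tokenizer (not shown here) produces a fixed 77-token context.
    let input = try MLMultiArray(shape: [1, 77], dataType: .int32)
    for i in 0..<77 { input[i] = 0 }                                   // zero-pad the context
    for (i, token) in tokenize(prompt).prefix(77).enumerated() {
        input[i] = NSNumber(value: token)
    }
    // Input/output feature names are assumptions; check your model's description after export.
    let provider = try MLDictionaryFeatureProvider(dictionary: ["text": MLFeatureValue(multiArray: input)])
    let output = try textEncoder.prediction(from: provider)
    return output.featureValue(for: "embedding")?.multiArrayValue      // 1x512 text vector
}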

The reason I split the Text Encoder and Image Encoder into two models is that, in a Photos search app, the query text changes with every search while the content of the photo library is largely fixed. This means all the image vectors can be computed once and saved in advance, and then only one text vector has to be computed per search.
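
With the image vectors precomputed, a search boils down to ranking them by cosine similarity against the query's text vector. Below is a minimal Swift sketch of that ranking step, assuming the vectors are already L2-normalized [Float] arrays of length 512 (so a dot product equals cosine similarity); the function and parameter names are mine, not Queryable's actual code.

import Accelerate

// Rank precomputed image vectors against one query text vector
// and return the identifiers of the best-matching photos.
func topMatches(for textVector: [Float],
                in imageVectors: [String: [Float]],   // asset identifier -> stored image vector
                count: Int = 30) -> [(id: String, score: Float)] {
    return imageVectors
        .map { (id: $0.key, score: vDSP.dot($0.value, textVector)) }  // cosine similarity
        .sorted { $0.score > $1.score }
        .prefix(count)
        .map { (id: $0.id, score: $0.score) }
}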

Thus, real-time text search across a library of tens of thousands of photos becomes possible. Below is a flowchart of how Queryable works:
How Queryable works

Performance

So, compared to the built-in search in iPhone Photos, how much better is CLIP-based album search? The answer: overwhelmingly better. With CLIP, you can search for a scene you have in mind, a tone, an object, or even an emotion conveyed by the image.

Search for a scene, an object, a tone, or the meaning behind a photo with Queryable

To use Queryable, you first need to build the index, which traverses your album, calculates all the image vectors, and stores them. This happens only ONCE. The total time depends on the number of photos in your library; on an iPhone 12 mini the speed is about 2,000 photos per minute. When you add new photos, you can manually update the index, which is very fast.
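
For the curious, building such an index with PhotoKit roughly means enumerating the library, requesting a small thumbnail of each asset, and passing it to the Image Encoder. The sketch below is an illustration under assumptions, not Queryable's actual code: the `encodeImage` closure stands in for the Core ML Image Encoder and `store` for whatever persistence layer is used.

import Photos
import UIKit

// Traverse the photo library and compute one CLIP vector per asset.
// Assumes photo library access has already been granted.
func buildIndex(encodeImage: @escaping (UIImage) -> [Float],
                store: @escaping (String, [Float]) -> Void) {
    let assets = PHAsset.fetchAssets(with: .image, options: nil)
    let options = PHImageRequestOptions()
    options.isNetworkAccessAllowed = false       // stay offline; the local cache is enough
    options.deliveryMode = .highQualityFormat
    options.isSynchronous = true                 // simplest for a background loop

    assets.enumerateObjects { asset, _, _ in
        // CLIP resizes its input to 224x224 anyway, so a small thumbnail is sufficient.
        PHImageManager.default().requestImage(for: asset,
                                              targetSize: CGSize(width: 224, height: 224),
                                              contentMode: .aspectFill,
                                              options: options) { image, _ in
            guard let image = image else { return }
            store(asset.localIdentifier, encodeImage(image))
        }
    }
}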

The time cost of a search also depends on the size of your library: for fewer than 10,000 photos it takes less than 1s. For me, an iPhone 12 mini user with 35,000 photos, each search takes about 2.8s.

I made a video to demonstrate the search capabilities of Queryable:


Q&A

1. On privacy and security issues.

Queryable is designed as an OFFLINE app that does not require a network connection and will NEVER request network access, thereby avoiding privacy issues.

2. What if my pictures are stored in iCloud?

Because it cannot connect to the network, Queryable can only use the locally cached, low-resolution versions of your photos. However, the CLIP model itself resizes the input image to a very small size (e.g. 224x224 for ViT-B-32), so photos stored in iCloud do not affect search accuracy; you just cannot view the original image in the search results.

- Update: In the latest version, you can optionally grant the app network access in order to download photos stored in iCloud. This happens only when a photo appears in your search results, its original is stored in iCloud, and you open its details page and tap the download icon. Once you grant the permission, you can close the app, reopen it, and the photos will be downloaded from iCloud automatically.
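
For reference, that iCloud behavior maps to a single PhotoKit option. The sketch below shows how a details page might request the original image, with isNetworkAccessAllowed deciding whether PhotoKit may download it from iCloud; this is an assumption about the implementation, not Queryable's exact code.

import Photos
import UIKit

// Request the full-size image for one search result.
// With allowNetwork == false, Photos returns only the local (possibly low-resolution) copy;
// with allowNetwork == true, PhotoKit may download the original from iCloud.
func requestOriginal(for asset: PHAsset,
                     allowNetwork: Bool,
                     completion: @escaping (UIImage?) -> Void) {
    let options = PHImageRequestOptions()
    options.isNetworkAccessAllowed = allowNetwork
    options.deliveryMode = .highQualityFormat
    PHImageManager.default().requestImage(for: asset,
                                          targetSize: PHImageManagerMaximumSize,
                                          contentMode: .aspectFit,
                                          options: options) { image, _ in
        completion(image)
    }
}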

3. Any requirements for the device?

  • iOS 16.0 or above
  • iPhone 11 (A13 chip) or later models

4. Have suggestions or feedback about the product experience?

Feel free to contact me by email: myfancoo@gmail dot com.