VILA Playground

Instructions:

Upload an image, enter a question, and press send. You can also try one of the examples on the left.
The model returns a text response together with a heatmap. The heatmap is generated during the model's forward pass and shows, for each region of the image, the probability that PS3 selects it for high-res patch processing.
Press the download button in the top-right corner of the heatmap or the original image to download the native-resolution version and inspect the details.
Adjust the % High-Res Patch Processed parameter to control the percentage of high-res patches processed by the model. A larger value means more high-res patches are processed, which normally improves accuracy on images with dense information but also increases latency.
Press the Clear Conversation button to clear the conversation and submit another query.
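
To make the relationship between the heatmap and the % High-Res Patch Processed parameter more concrete, here is a minimal sketch of one plausible selection rule: keep the patches with the highest selection probability until the requested percentage is reached. This is an illustration only and is not taken from the PS3 or VILA code. The function name `select_high_res_patches`, the 2D-list heatmap layout, and the round-up rule are all assumptions made for the example.

```python
import math


def select_high_res_patches(prob_map, percent):
    """Return (row, col) coordinates of patches to process at high resolution.

    Hypothetical helper, not part of PS3 or VILA.
    prob_map: 2D list of per-patch selection probabilities (assumed layout).
    percent:  value in [0, 100], standing in for the demo's slider.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")

    # Flatten the map into (probability, row, col) triples.
    scored = [
        (p, r, c)
        for r, row in enumerate(prob_map)
        for c, p in enumerate(row)
    ]

    # Number of patches to keep; rounding up is an assumption, chosen so
    # that any non-zero percentage keeps at least one patch.
    k = math.ceil(len(scored) * percent / 100)

    # Highest probability first; ties are broken by position so the
    # result is deterministic.
    scored.sort(key=lambda t: (-t[0], t[1], t[2]))
    return [(r, c) for _, r, c in scored[:k]]


# Toy 2x3 heatmap: 50% of 6 patches keeps the 3 most probable ones.
heatmap = [
    [0.05, 0.10, 0.80],
    [0.20, 0.90, 0.15],
]
print(select_high_res_patches(heatmap, 50))  # [(1, 1), (0, 2), (1, 0)]
```

Under this sketch, raising the percentage only ever adds patches to the selected set, which matches the accuracy and latency trade-off described above: more patches give more detail to work with but also more computation.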

Limitations:

The patch selection is not fully accurate when locating very detailed information, e.g., when asked about one specific sentence in a document.
The patch selection is not fully accurate for multi-round conversations. This is mainly due to the lack of multi-round patch selection data in our training dataset.
The model tends to select high-res patches around small objects or patterns instead of larger objects. This is a result of our data design, in which the bounding boxes are mostly around small objects. It is also reasonable to some extent, because large objects can usually be recognized without higher resolution.