Examples: Document Understanding, Natural Image, Autonomous Driving, Household, Gaming Agent, UI

Instructions:

  1. Upload an image (or multiple images) and a question, and press send. There are several examples on the left that you can try out.
  2. The model returns a text response together with a heatmap. The heatmap is generated during the model's forward pass and shows, for each image region, the probability that PS3 selects high-res patches there.
  3. To inspect the details, press the download button in the top right corner of the heatmap or the original image to download the native-resolution version.
  4. You can adjust the % High-Res Patch Processed parameter to control the percentage of high-res patches the model processes. A larger value means more high-res patches are processed, which normally improves accuracy on images with dense information but also increases latency.
  5. Press the Clear Conversation button to clear the conversation and submit another query.
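
To make the effect of the % High-Res Patch Processed parameter more concrete, here is a minimal sketch of percentage-based patch selection. It is an illustration only and not PS3's actual implementation: the function name `select_top_patches`, the 2D score grid, and the rounding rule are all assumptions made for this example.

```python
import math


def select_top_patches(scores, percent):
    """Return (row, col) indices of the highest-scoring patches.

    scores:  2D list of per-patch selection scores (a hypothetical heatmap).
    percent: share of patches to keep, from 0 to 100.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")

    # Flatten the grid while remembering each patch's position.
    flat = [(score, r, c)
            for r, row in enumerate(scores)
            for c, score in enumerate(row)]

    # Number of patches to keep (rounding up is an assumption of this sketch).
    k = math.ceil(len(flat) * percent / 100)

    # Highest score first; ties are broken by position for determinism.
    flat.sort(key=lambda t: (-t[0], t[1], t[2]))
    return [(r, c) for _, r, c in flat[:k]]


# Toy heatmap with 2 x 3 patches (made-up values).
heatmap = [
    [0.05, 0.10, 0.80],
    [0.02, 0.60, 0.90],
]
print(select_top_patches(heatmap, 50))  # [(1, 2), (0, 2), (1, 1)]
```

In this toy example, raising `percent` keeps more patches, which mirrors the trade-off described above: more regions are processed at high resolution at the cost of more computation.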

Known Limitations:

  1. Patch selection is not fully accurate when locating very fine-grained information, e.g., when the question is about one specific sentence in a document.
  2. The model tends to select high-res patches around small objects or patterns rather than larger objects. This results from our data design, in which bounding boxes are mostly drawn around small objects. It is also reasonable to some extent, because large objects usually don't need higher resolution to be recognized.