Use text and image encoder separately with onnxruntime

#57 by Frayin

Problem

Hey, thanks for sharing the model. I want to use your CLIP model with onnxruntime on a CPU, but the model seems to be exported with both text and image inputs.
I want to use the text and image encoders separately for inference (the way encode_text and encode_image can be called separately), so I tried to export them myself, but the model fails to export to ONNX format.

What I tried

- Standard torch.onnx.export() with various configurations
- Dynamo-based export (dynamo=True)
- Different opset versions (11, 12, 14)
- Static shapes (no dynamic_axes)
- Custom wrapper classes to isolate the text encoder (roughly as sketched below)
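For the wrapper attempt, the sketch below is roughly what I tried. The model id, the text_model attribute, and its forward signature are my guesses at the internals, so treat them as placeholders:

```python
import torch
from transformers import AutoModel

# Load the full CLIP model once (assumed repo id; adjust to the one you use).
clip_model = AutoModel.from_pretrained("jinaai/jina-clip-v2", trust_remote_code=True).eval()

class TextEncoderWrapper(torch.nn.Module):
    """Exposes only the text tower so it can be exported on its own."""
    def __init__(self, clip_model):
        super().__init__()
        self.text_model = clip_model.text_model  # hypothetical attribute name

    def forward(self, input_ids, attention_mask):
        # Hypothetical call into the text tower; the real signature may differ.
        return self.text_model(input_ids=input_ids, attention_mask=attention_mask)

wrapper = TextEncoderWrapper(clip_model)
dummy_ids = torch.ones(1, 77, dtype=torch.long)
dummy_mask = torch.ones(1, 77, dtype=torch.long)

# This is the call that dies with the IndexError shown below.
torch.onnx.export(
    wrapper,
    (dummy_ids, dummy_mask),
    "text_encoder.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["text_embeddings"],
    opset_version=14,
)
```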

Error

All export attempts fail with:
IndexError: Argument passed to at() was not in the map.
This occurs during TorchScript's peephole optimization pass (_C._jit_pass_peephole).

Question

The only reason I'm doing all of this is that I want to use the text and image encoders separately with onnxruntime, so if you could point me to a way to achieve that, that would be great. Otherwise, could you share some insight on how to export the text and image encoders separately to ONNX? Thank you very much.
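For reference, this is the kind of workflow I'm hoping to end up with; the file names, input names, and shapes below are just placeholders:

```python
import numpy as np
import onnxruntime as ort

# Two independent sessions, each loading only one encoder on CPU.
text_session = ort.InferenceSession("text_encoder.onnx", providers=["CPUExecutionProvider"])
image_session = ort.InferenceSession("image_encoder.onnx", providers=["CPUExecutionProvider"])

# Text side: tokenizer output converted to int64 numpy arrays.
text_embeddings = text_session.run(None, {
    "input_ids": np.ones((1, 77), dtype=np.int64),
    "attention_mask": np.ones((1, 77), dtype=np.int64),
})[0]

# Image side: preprocessed pixel values.
image_embeddings = image_session.run(None, {
    "pixel_values": np.zeros((1, 3, 224, 224), dtype=np.float32),
})[0]
```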

@Frayin did you manage to solve it? I am facing the same issue.

Facing the same issue here. We don't really want to load the full model in memory, and we need to export our own ONNX files for text and vision. Did you manage to solve it?

@thibaut-orn I have managed to compile them separately. Essentially, you want to run everything once to download all the files, then find jinaai/jina-embeddings-v3 and change its config.json to use jinaai/xlm-roberta-flash-implementation-onnx instead of jinaai/xlm-roberta-flash-implementation. There is some explanation on the former project's page about why the latter cannot be exported to ONNX, which could account for your problems. You should also use the hf CLI to download the project and apply some manual patches according to the errors torch produces on export (the project appears to be quite far behind the standard implementation).

After all that, you want to export everything with float32 precision, as ONNX Runtime has REALLY bad support for lower precisions (I just found out that it doesn't even support bfloat16 matrix multiplication, which is probably why the official ONNX release only contains a float32 version). The exported model's outputs are not normalized, so if you want normalized vectors you have to add a custom wrapper module around the encoder so that the exported ONNX graph produces normalized vectors instead of (or along with) the raw ones.
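A minimal sketch of what I mean by the normalization wrapper. The Linear layer below is only a stand-in for the real text or vision tower, and the names and shapes are placeholders:

```python
import torch
import torch.nn.functional as F

class NormalizedEncoder(torch.nn.Module):
    """Wraps an encoder so the exported ONNX graph emits L2-normalized vectors."""
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder

    def forward(self, *args):
        embeddings = self.encoder(*args)
        return F.normalize(embeddings, p=2, dim=-1)

# Stand-in encoder just to show the export call; swap in the real tower,
# cast to float32 as discussed above.
toy_encoder = torch.nn.Linear(16, 8)
wrapped = NormalizedEncoder(toy_encoder).float().eval()

torch.onnx.export(
    wrapped,
    (torch.randn(1, 16),),
    "normalized_encoder.onnx",
    input_names=["inputs"],
    output_names=["embeddings"],
    opset_version=17,
)
```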

Exporting directly with torch_tensorrt would probably produce a working TensorRT model on NVIDIA GPUs, but anyway I will try it first.
