Unleashing the power of ONNX for speedy SBERT inference

Author(s): Swaraj Patil

Originally published on Towards AI.

SBERTalso known as Sentence – BurtSentence embeddings are a widely used approach for obtaining data that aims to preserve relevant information within sentences. However, generating these embeddings can be slow when working with large amounts of data. To address this, one option is to use batch-based encoding to accelerate inference. However, this may not necessarily reduce estimation time. In this Medium blog post, we will explore its application ONNX (Open Neural Network Exchange) framework and how it helps in reducing the inference time of models.

P.S.. It does not shed light on the inner workings of the article ONNX. For more detailed information please consult the official ONNX Documentation.

Let’s start by installing the import library. We can use pip for its installation ONNX

pip install onnx
pip install onnxruntime-gpu
pip install transformers
pip install torch

Once ONNX is installed we verify it using the snippet below

import onnx
print(onnx.__version__)

To get sentence embeddings, we will use imdb databasesource from t Kagal. Specifically, we will focus on “Movie OverviewUse columns to generate embeddings SBERT. The time required to create the embedding will be determined 1000 sentences is present in the dataset.

We will perform two experiments here on both CPU and GPU

estimated time for 1000 use of sentences Vanilla SBERT (CPU).
estimated time for 1000 use of sentences ONNX changed SBERT (CPU).
estimated time for 1000 use of sentences Vanilla SBERT (GPU).
estimated time for 1000 use of sentences ONNX converted SBERT (GPU).

The sentence BERT model that we will consider here is all-minilm-l6-v2

We can apply sentence BERT model hugging face library and sentence transformer Library. The output embeddings from both libraries will be the same. For our experiments, we will use hugging face Library. Remember that when we use hugging face After obtaining the embeddings library, additional post-processing may be required e.g. pooling Or standardization. Various steps can be accessed from the model page hugging face. Follow those steps to get the final sentence embedding.

Let’s first change the model to ONNX Format.

# # Load pretrained model and tokenizer
from transformers import AutoModel, AutoTokenizermodel_name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name, 
do_lower_case=True )
model = AutoModel.from_pretrained(model_name )
#Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
token_embeddings = model_output(0) #First element of model_output contains all token embeddings
input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
temp = torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), 
min=1e-9)
return F.normalize(temp, p=2, dim=1)
# Get the first example data to run the model and export it to ONNX
sample = ('Hey, how are you today?')
inputs = tokenizer(sample,
padding=True,
truncation=True,
return_tensors="pt"
)
## Convert Model to ONNX Format
import os
import torch
device = torch.device("cpu")
# Set model to inference mode, which is required before exporting 
# the model because some operators behave differently in
# inference and training mode.
model.eval()
model.to(device)
output_dir = os.path.join(".", "onnx_models")
if not os.path.exists(output_dir):
os.makedirs(output_dir)
export_model_path = os.path.join(output_dir, 'all_MiniLM_L6-v2.onnx')
with torch.no_grad():
symbolic_names = {0: 'batch_size', 1: 'max_seq_len'}
torch.onnx.export(model, # model being run
args=tuple(inputs.values()), # model input (or a tuple for multiple inputs)
f=export_model_path, # where to save the model (can be a file or file-like object)
opset_version=11, # the ONNX version to export the model to
do_constant_folding=True, # whether to execute constant folding for optimization
input_names=('input_ids', # the model's input names
'attention_mask',
'token_type_ids'),
output_names=('start', 'end'), # the model's output names
dynamic_axes={'input_ids': symbolic_names, # variable length axes
'attention_mask' : symbolic_names,
'token_type_ids' : symbolic_names,
'start' : symbolic_names,
'end' : symbolic_names})
print("Model exported at ", export_model_path)

Now that we have converted it Sentence bert Sample. Let’s get the statistics of the models.

Vanilla SBERT (CPU)

estimated time received for vanilla SBERT model on CPU Can be found using the snippet below.

import time
import pandas as pd
import numpy as npfrom tqdm import tqdm
df = pd.read_csv('./imdb_top_1000.csv', usecols=('Overview'))
total_samples = len(df)
latency = ()
outputs_cpu = ()
with torch.no_grad():
for i in tqdm(range(total_samples)):
data = (df.loc(i, "Overview"))
inputs = tokenizer(data,
padding=True,
truncation=True,
return_tensors="pt"
)
start = time.time()
outputs_cpu.append(mean_pooling(model(**inputs), 
inputs('attention_mask')
).cpu().detach().numpy())
latency.append(time.time() - start)
print("n")
print("PyTorch {} Inference time = {} ms".format(device.type, 
np.round(np.average(latency)*1000, 4)))

100%|██████████| 1000/1000 (00:36<00:00, 27.62it/s)

PyTorch cpu Inference time = 34.2605 ms

ONNX Convert SBERT(CPU)

estimated time received for ONNX SBERT model on CPU Can be found using the snippet below.

import onnxruntime
import numpy as npsess_options = onnxruntime.SessionOptions()
session = onnxruntime.InferenceSession(export_model_path, 
sess_options, 
providers=('CPUExecutionProvider'))
latency = ()
ort_outputs_cpu = ()
for i in tqdm(range(total_samples)):
data = (df.loc(i, "Overview"))
inputs = tokenizer(data,
padding=True,
truncation=True,
return_tensors="pt"
)
ort_inputs = {k:v.cpu().numpy() for k, v in inputs.items()}
start = time.time()
op = session.run(None, ort_inputs)
op = torch.from_numpy(op(0))
ort_outputs_cpu.append(mean_pooling((op),
inputs('attention_mask')
).cpu().detach().numpy())
latency.append(time.time() - start)
print("n")
print("OnnxRuntime {} Inference time = {} ms".format(device.type, 
np.round(np.average(latency)*1000, 4)))

100%|██████████| 1000/1000 (00:16<00:00, 60.80it/s)

OnnxRuntime cpu Inference time = 15.5696 ms

output

outputs_cpu(0)(:,:10) ## Vanilla SBERT CPU Outputarray(((-0.06326339, 0.0414625 , -0.04707527, -0.03361899, -0.02562934, 
0.03499832, 0.00804075, -0.05042004, 0.00215668, -0.03816812)), 
dtype=float32)

ort_outputs_cpu(0)(:,:10) ## Onnx SBERT CPU Outputarray(((-0.06326343, 0.04146247, -0.04707528, -0.033619 , -0.02562926, 
0.03499835, 0.0080408 , -0.05042008, 0.00215669, -0.03816817)),
dtype=float32)

Vanilla SBERT (GPU)

estimated time received for Vanilla SBERT model on gpu Can be found using the snippet below.

device = torch.device("cuda")# Set model to inference mode, which is required before exporting 
# the model because some operators behave differently in
# inference and training mode.
model.eval()
model.to(device)
total_samples = len(df)
latency = ()
outputs_gpu = ()
with torch.no_grad():
for i in tqdm(range(total_samples)):
data = (df.loc(i, "Overview"))
inputs = tokenizer(data,
padding=True,
truncation=True,
return_tensors="pt"
).to(device)
start = time.time()
outputs_gpu.append(mean_pooling(model(**inputs), 
inputs('attention_mask')).cpu().detach().numpy())
latency.append(time.time() - start)
print("n")
print("PyTorch {} Inference time = {} ms".format(device.type, 
np.round(np.average(latency)*1000, 4)))

100%|██████████| 1000/1000 (00:07<00:00, 135.29it/s)

PyTorch cuda Inference time = 6.737 ms

ONNX Convert SBERT (GPU)

estimated time received for ONNX SBERT model on gpu Can be found using the snippet below.

import onnxruntime
import numpy as npsess_options = onnxruntime.SessionOptions()
session = onnxruntime.InferenceSession(export_model_path, 
sess_options, 
providers=('CUDAExecutionProvider'))
latency = ()
ort_outputs_gpu = ()
for i in tqdm(range(total_samples)):
data = (df.loc(i, "Overview"))
inputs = tokenizer(data,
padding=True,
truncation=True,
return_tensors="pt"
).to(device)
ort_inputs = {k:v.cpu().numpy() for k, v in inputs.items()}
start = time.time()
op = session.run(None, ort_inputs)
op = torch.from_numpy(op(0))
ort_outputs_gpu.append(mean_pooling((op), 
inputs('attention_mask').cpu()).cpu().detach().numpy())
latency.append(time.time() - start)
print("n")
print("OnnxRuntime {} Inference time = {} ms".format(device.type, 
np.round(np.average(latency)*1000, 4)))

100%|██████████| 1000/1000 (00:02<00:00, 373.49it/s)

OnnxRuntime cuda Inference time = 1.9466 ms

output

outputs_gpu(0)(:,:10) ## Vanilla SBERT GPU array(((-0.06326333, 0.04146247, -0.0470753 , -0.03361904, -0.02562935,
0.03499833, 0.00804079, -0.05042002, 0.00215669, -0.03816818)),
dtype=float32)

ort_outputs_gpu(0)(:,:10) ## ONNX SBERT GPUarray(((-0.06326336, 0.04146249, -0.04707528, -0.03361899, -0.02562931, 
0.03499832, 0.0080408 , -0.05042004, 0.00215668, -0.03816817)),
dtype=float32)

summary table

conclusion

Based on the results obtained we can see that ONNX-The transformed model takes significantly less time to obtain sentence embeddings without any loss in the data. experiments were conducted on google collab with t4 gpu. Similar or better results can be expected from other hardware.

In ONNXwe can also have quantified version of SBERT. the quantified version will be int8 dtype. One can investigate that also. The Jupyter notebook for the complete experiments has been added to the GitHub repo below for further reference.

GitHub – SP2203/onnx-sbert

Contribute to SP2203/onnx-sbert development by creating an account on GitHub.

github.com

Reference

— — — — — — — — — — — — — — — — — — — — — — — —

If it was useful, consider clapping it, it really helps. I write about ML, AI and technology. Follow me here on Medium so you don’t miss the next one.

📌More from me:

→ Keras implementation of LE-Net

→ AlexNet: Leading the way in modern deep learning

→ Smoothing noisy GNSS dataonn

Published via Towards AI