fin-mpnet-base: Finance Embeddings via Synthetic Fine-Tuning — Julian Mukaj


Hugging Face: mukaj/fin-mpnet-base · 227k+ downloads

General-purpose embedding models are not trained on financial text. They are trained on web crawls, Wikipedia, and broad NLP corpora, none of which look much like an earnings call transcript or a 10-K filing. The vocabulary is different, the query structure is different, and retrieval performance on financial benchmarks reflects that gap. all-mpnet-base-v2 scores 49.96 on FiQA. The previous domain-specific state of the art sat at 56.59. Neither is adequate for production retrieval over financial documents.

fin-mpnet-base reaches 79.91 on the FiQA test set, a 60% relative improvement over the general model, while holding close to its performance on non-financial benchmarks. This write-up covers the synthetic dataset generation pipeline, the fine-tuning decisions that worked and those that did not, and the evaluation setup.


1. Dataset Generation

Synthetic data pipeline. Financial documents are chunked and passed to Mixtral 8x7B, which generates positive and negative retrieval queries per passage, producing 150k+ labeled pairs.

The training data was generated synthetically, following the approach described in Wang et al. (2024). The core idea is to use a capable LLM to generate retrieval-style queries from raw document passages, rather than relying on manually annotated pairs or weakly supervised web data. The paper demonstrates that this approach scales well and produces embeddings competitive with models trained on large labeled datasets.

The seed corpus consisted of financial documents: annual reports, earnings call transcripts, SEC filings, and sustainability reports. Each document was chunked page by page with basic cleaning and filtering to remove boilerplate, tables without sufficient surrounding context, and very short passages. For each retained passage, Mixtral 8x7B was prompted to generate both a positive query (something a user would ask if they were looking for this passage) and a negative query (a plausible but subtly mismatched question). Each row in the resulting dataset contained a query, the source passage, and a positive or negative label.
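The chunk-filtering and prompting steps can be sketched in plain Python. The thresholds, boilerplate markers, and prompt wording below are illustrative assumptions, not the pipeline's published heuristics:

```python
# Boilerplate markers and the minimum-length threshold are assumptions
# for illustration; the actual filtering heuristics were not published.
BOILERPLATE_MARKERS = ("forward-looking statements", "safe harbor", "all rights reserved")

def keep_passage(text: str, min_words: int = 20) -> bool:
    """Retain a passage only if it is long enough and not obvious boilerplate."""
    if len(text.split()) < min_words:
        return False
    lowered = text.lower()
    return not any(marker in lowered for marker in BOILERPLATE_MARKERS)

# Hypothetical prompt template asking Mixtral 8x7B for one positive and
# one hard-negative query per retained passage.
QUERY_PROMPT = (
    "Given the following passage from a financial document, write:\n"
    "1. A question a user searching for this passage would ask.\n"
    "2. A plausible financial question this passage does NOT answer.\n\n"
    "Passage:\n{passage}"
)

passage = ("Net revenue for fiscal 2023 increased 8% to $4.2 billion, "
           "driven primarily by growth in the wealth management segment "
           "and higher net interest income across all regions served.")
if keep_passage(passage):
    prompt = QUERY_PROMPT.format(passage=passage)
```

Each (passage, positive query) and (passage, negative query) pair from the LLM response then becomes one labeled row in the dataset.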

The total dataset reached 150k+ (query, passage) pairs drawn from across the financial document types. The breadth is intentional. Retrieval over annual reports looks different from retrieval over earnings transcripts, and a model trained on only one source type tends to overfit its stylistic patterns.


2. Fine-Tuning

The base model is sentence-transformers/all-mpnet-base-v2, producing 768-dimensional dense embeddings. Fine-tuning used the sentence-transformers library: learning rate 1e-5, Lion optimizer, 10 epochs.

The first attempt used ContrastiveLoss, training on both positive and negative pairs from the generated dataset. Validation performance was poor. The LLM-generated hard negatives were not reliable: Mixtral occasionally produced negatives that were actually relevant to the passage, and sometimes produced ones so obviously unrelated that the model learned nothing from them. Contrastive loss with noisy labels does not converge to a useful representation.
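The failure mode is visible directly in the loss formula. Below is the standard margin-based contrastive loss (the sentence-transformers implementation may differ in scaling, so treat the exact form as an assumption); a relevant pair mislabeled as negative sits at a small distance and therefore incurs a large penalty, actively pushing apart embeddings that should stay close:

```python
def contrastive_loss(d: float, label: int, margin: float = 1.0) -> float:
    """Pairwise contrastive loss: pull labeled positives together,
    push labeled negatives at least `margin` apart."""
    if label == 1:
        return d ** 2
    return max(0.0, margin - d) ** 2

# A Mixtral "negative" that is actually relevant sits close in embedding
# space (small d), but the label says push it away, so the loss is
# largest exactly where the pair should be left alone.
false_negative_loss = contrastive_loss(d=0.1, label=0)  # relevant pair mislabeled
true_negative_loss = contrastive_loss(d=0.9, label=0)   # genuinely unrelated pair
```

With a margin of 1.0, the mislabeled-but-relevant pair contributes a loss of 0.81 versus 0.01 for a genuine negative: the noisy labels dominate the gradient.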

The fix was to drop the negatives entirely and switch to MultipleNegativesRankingLoss. MNR loss treats every other example in the batch as an implicit negative, which removes the dependency on LLM label quality. At sufficient batch size, the batch naturally contains passages unrelated to any given query, and the loss works correctly. Validation metrics improved immediately. The final model was trained on positive (query, passage) pairs only.
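The in-batch mechanism can be sketched in numpy: compute the query-passage cosine similarity matrix for a batch, then apply cross-entropy with the diagonal as the target, so passage i is the positive for query i and every other passage is an implicit negative. The scale factor of 20 matches the sentence-transformers default; the library implementation differs in framework but not in the underlying computation:

```python
import numpy as np

def mnr_loss(query_emb: np.ndarray, passage_emb: np.ndarray, scale: float = 20.0) -> float:
    """MultipleNegativesRankingLoss over a batch: cross-entropy on the
    scaled cosine-similarity matrix, with diagonal entries as labels."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    p = passage_emb / np.linalg.norm(passage_emb, axis=1, keepdims=True)
    scores = scale * (q @ p.T)                   # (batch, batch) similarities
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))
```

When each query's paired passage is the most similar item in the batch, the loss is near zero; misaligned pairings drive it up, which is the gradient signal the model trains on.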

Five thousand query/passage pairs were held out as a validation set throughout training. The evaluator was InformationRetrievalEvaluator, tracking NDCG@10 and MRR@10. Validation performance correlated strongly with FiQA benchmark performance, which gave reasonable confidence the model was not overfitting the synthetic distribution. No FiQA test data was downloaded or used during training. All FiQA scores come from the MTEB library evaluation run after training was complete.
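For intuition about what the evaluator tracks, here is a minimal pure-Python reimplementation of the two metrics under binary relevance (InformationRetrievalEvaluator computes these among several others; this sketch is not the library's code):

```python
import math

def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant document within the top k."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(ranked_ids[:k], start=1)
              if doc_id in relevant_ids)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant_ids), k) + 1))
    return dcg / ideal if ideal > 0 else 0.0
```

For example, if the single relevant passage is ranked second, MRR@10 is 0.5 and NDCG@10 is 1/log2(3) ≈ 0.631.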


3. Evaluation Results

The table below reports MTEB benchmark scores across financial and non-financial tasks. FiQA is the primary financial retrieval benchmark. The other tasks show how much general performance was traded away by domain specialization.

Model                  FiQA    SciFact  AmazonReviews  OnlineBankingIntent  ArguAna
fin-mpnet-base         79.91   65.40    29.12          80.25                49.11
all-mpnet-base-v2      49.96   65.57    31.92          81.86                46.52
Previous SoTA (FiQA)   56.59   -        -              -                    -

FiQA improves substantially. The tradeoffs are small and largely expected. AmazonReviews drops slightly, which is unsurprising given how far consumer reviews are from financial documents. OnlineBankingIntent is essentially unchanged. ArguAna improves. SciFact, academic rather than financial text, holds almost exactly. The model did not sacrifice general retrieval capability; it extended its performance within the target domain.

The model ranked first on the FiQA retrieval task on the MTEB leaderboard at time of release and has accumulated 227k+ downloads since, suggesting it found genuine use in production financial retrieval pipelines where general-purpose models were underperforming.


4. What Worked and What Did Not

The decisive change was abandoning hard negatives. The theoretical case for them is straightforward: contrastive learning benefits from difficult negatives that force fine-grained distinctions. That argument holds when the negatives are reliably hard and reliably wrong. LLM-generated hard negatives in a domain-specific setting satisfy neither condition consistently. Small amounts of label noise in contrastive loss are enough to push the model toward a poor representation. The signal degrades faster than the hard negatives add value.

MNR loss with in-batch negatives is robust to this because it makes no assumptions about which examples are negatives. It treats everything in the batch that is not the paired positive as contrast. At scale, that is sufficient. The cost is losing the benefit of hard negatives entirely, which for a first version of a domain-specialized model is a reasonable trade.

The other effective decision was training across multiple document types rather than a single source. A model trained only on earnings transcripts would retrieve well from earnings transcripts and poorly from filings or sustainability reports. Financial retrieval in practice spans all of these simultaneously, and the corpus breadth is what keeps the model useful across them.

The model is available on Hugging Face at mukaj/fin-mpnet-base and works directly with the sentence-transformers library:

from sentence_transformers import SentenceTransformer

# Load the fine-tuned model and encode text into 768-dimensional embeddings
model = SentenceTransformer('mukaj/fin-mpnet-base')
embeddings = model.encode(["earnings per share grew 12% year on year"])
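Retrieval then reduces to ranking an encoded corpus by cosine similarity to the encoded query. The sketch below uses placeholder 4-dimensional vectors standing in for `model.encode` output, so it runs without downloading the model:

```python
import numpy as np

def rank_passages(query_vec: np.ndarray, corpus_vecs: np.ndarray) -> np.ndarray:
    """Return corpus indices sorted by descending cosine similarity to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    return np.argsort(-(c @ q))

# Placeholder vectors; in practice these come from model.encode(...)
corpus = np.array([[0.9, 0.1, 0.0, 0.1],
                   [0.0, 1.0, 0.2, 0.0],
                   [0.8, 0.2, 0.1, 0.0]])
query = np.array([1.0, 0.1, 0.0, 0.1])
order = rank_passages(query, corpus)  # most similar passage first
```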