Do LLMS or MT engines perform better at translations?

Dan Cho
December 29, 2023
5 minutes
Share this post

Generative AI and Large Language Models (LLMs) are anticipated to revolutionize content services industries and professions by empowering content authors to create multilingual content, streamlining workflows, and facilitating translations.

First, how are LLMs fundamentally different from standard machine translation engines (e.g. Google translate)? 

The key differentiator is the ability to feed context cues to the LLM. Contextual information and cues significantly improve the translation quality of language models compared to standard machine translations. Traditional machine translation models, such as statistical machine translation (SMT), lack the ability to capture complex linguistic nuances and context in the same way that large language models (LLMs) like neural machine translation (NMT) with context cues can.

Here are some key points of comparison:

  1. Contextual Understanding: LLMs with context cues have a better grasp of contextual nuances, enabling them to produce translations that take into account the broader context of a sentence or paragraph. Traditional machine translation may struggle to capture the full meaning of a sentence without such contextual information.
  2. Ambiguity Resolution: Context cues help LLMs resolve ambiguities more effectively. Standard machine translation models may struggle with words or phrases that have multiple meanings, leading to potential inaccuracies in translation.
  3. Idiomatic Expressions: LLMs, especially those with contextual information, can handle idiomatic expressions more adeptly. Traditional machine translation models may produce literal translations that do not capture the idiomatic meaning.
  4. Named Entity Recognition (NER): Context cues assist LLMs in recognizing and translating named entities more accurately. Standard machine translation models may struggle with proper nouns and specific terms, leading to less precise translations.
  5. Domain-specific Knowledge: LLMs with context cues can leverage domain-specific knowledge for more accurate translations in specialized fields. Traditional machine translation models may lack the ability to adapt to specific domains, resulting in less accurate translations for technical or specialized content.
  6. Sentence Flow and Coherence: LLMs maintain better sentence flow and coherence due to their ability to consider context. Standard machine translation may produce translations that feel disjointed or lack overall coherence.
  7. Handling Polysemy: LLMs excel in handling polysemy by considering context, choosing the most appropriate translation for a word based on the broader sentence context. Traditional machine translation models may struggle to differentiate between multiple meanings of a word.

In a recent study, we conducted a performance comparison of eight distinct Large Language Models (LLMs) and variants in Machine Translation (MT) workflows. Our evaluation focused on assessing translation quality for customer support content translated from English into five target languages, namely Arabic, Chinese, Japanese, and Spanish.

Our findings reveal that outputs from LLM-augmented workflows and 'pure LLM prompts' closely approached a high industry standard quality level threshold, sometimes differing by mere tenths of a percentage. Neural Machine Translation (NMT) models barely outperformed all others, including both 'pure LLM' output and configurations combining NMT and LLMs. 

Dan Cho, Co-founder of, notes the particularly impressive results for challenging languages like Arabic, Chinese, and Japanese. While LLMs such as GPT-4 may not yet match the raw translation performance of highly trained NMT engines, they demonstrate impressive proximity to achieving comparable results.

As LLMs become fine-tuned and work their way into the corporate IT stack, their ability to achieve desired translation results with lighter prompting and minimal task-specific training will be a compelling alternative.

“It is easy to imagine a future where LLMs outperform NMT, especially for specific applications, content types, or use cases. We will continue to compare and analyze their performance in the coming months,” adds O’Curran. “It will also be interesting to see the performance of customized LLMs. Similar to MTs, the idea is to fine-tune the model for a specific context, domain, task, or customer requirement to enhance their ability to provide more accurate translations for different use cases.”

Select results from the evaluation of translations by trained MT, generic MT, and LLMs. The darker the red in the cell, the further the translation was from the quality pass threshold. Source word count: 5,000.

Table Notes

DQF-MQM = Dynamic Quality Framework – Multidimensional Quality Metrics
Quality Level 4 = High quality (human translation quality)

The pass threshold is based on the quality level.
Each error is assigned a value based on the error severity, and the quality score is calculated based on the number and severity of errors found in the translation.

Customizing for Greater Accuracy

Tailoring Large Language Models (LLMs) entails refining the model by training it on domain-specific or task-specific data, enhancing its performance in a targeted area. Discover the process and rationale behind fine-tuning LLMs to augment their capacity for delivering more precise translations:

  1. Specialized knowledge in a specific domain: Training Large Language Models (LLMs) with data from distinct domains, such as legal, medical, or technical fields, acquaints the model with domain-specific vocabulary, terminology, and context. This familiarity empowers the LLM to generate translations with greater accuracy within that particular domain of expertise.
  2. Task-specific refinement: LLMs can undergo fine-tuning using task-specific data to enhance their performance in designated translation tasks. For example, if the aim is to improve translations of customer support content, the LLM can be trained on a dataset containing similar customer support conversations. This enables the model to grasp specific patterns, sentence structures, and commonly used phrases in such interactions, resulting in more accurate translations within that specific context.
  3. Alignment with company-specific data: Training LLMs with data unique to a company, such as previously translated content or proprietary glossaries, ensures that the model aligns with the organization’s preferred terminology, style, and tone. This customization guarantees that the LLM produces translations consistent with the company’s brand voice and requirements.
  4. Enhanced contextual awareness: LLMs exhibit robust contextual understanding and text generation capabilities. Through fine-tuning, the model can become more context-aware in specific translation scenarios. For instance, fine-tuning can focus on improving the LLM’s comprehension of idiomatic expressions, cultural nuances, or regional variations, resulting in translations that are more accurate and culturally appropriate.

Moving Multilingual Content Generation Upstream

The incorporation of Large Language Models (LLMs) into content tools and workflows has the potential to reshape the translation industry. Companies can now effortlessly create content in multiple languages simultaneously, streamlining their operations and enhancing overall efficiency. 

LLMs, as a disruptive force in the translation industry, are poised to bring about significant changes. As they continuously improve in accuracy, automation is expected to surge, pushing translation and localization upstream in the content supply chain. 

At Strings, we are at the forefront of advancing AI in global content. Are you prepared to harness the capabilities of LLMs? Connect with us to automate your localization and translation processes, effectively reaching your global audiences. Explore, innovate, and translate more with Strings.

Share this post