Applied UX Research · Q2 2026 · AI Translation

AI Translation Evaluation for Classical Chinese Texts

Iterative evaluation of an AI tool translating classical Chinese (古文) into modern vernacular Chinese across 13 builds — combining usability testing, model comparison, domain-expert review, and App Store build testing.

Role: Product Research Consultant
Method: Model comparison + usability testing
Scale: 13 builds; 10 models; 5,000+ words
Impact: Model, quota, release, and register decisions informed
01 — Snapshot

Project Snapshot

Role
Product Research Consultant
Domain-Expert Evaluator
Timeline
Q2 2026
Product Type
AI-Assisted Classical Chinese Translation Tool
Confidentiality
NDA — Client & Product Anonymized
13 Iterative Builds Tested
10 AI Models Compared
10 Testers Recruited
5,000+ Words of Classical Text Tested
What I did
Evaluated iterative AI translation builds through model comparison, usability testing, and domain-expert review.
What surfaced
Translation quality, UI register, quota behavior, and release-readiness issues that affected product suitability.
What changed
Evidence informed production model choice, usage tiers, App Store readiness, and scholarly interface copy.
02 — Context

Overview

My Role
  • Product Research Consultant
  • Domain-Expert Evaluator
  • Usability Researcher
  • Human-in-the-Loop Reviewer
Research Activities
  • Usability Testing
  • Model Comparison
  • Translation Review & Accuracy Evaluation
  • Human-in-the-Loop Evaluation
  • App Store Build Testing
Outcomes at a Glance
  • Production model selected
  • Usage quotas defined
  • App Store release readiness achieved
  • Interface copy aligned to audience

Context

Platform: iOS Mobile App
Target Audience: Scholars & Students
Content Domain: Classical Chinese Texts
Primary Goal: Improve readability and accessibility

Challenge

  • Translation Quality
  • Readability & Comprehension
  • User Experience
  • Audience Suitability

Evaluating AI translation for historical and literary texts required balancing usability, accuracy, and reader expectations for a specialized audience.

03 — Research

Research Focus

Objectives and guiding questions across 13 iterative product builds.

01
Evaluate Translation Quality
Assess translation accuracy, readability, and usefulness for intended users.
02
Identify Reliability Issues
Understand common quality risks and areas where output may degrade.
03
Assess User Experience
Evaluate whether the product experience supports the needs of its target audience.
04
Validate Product Assumptions
Review key design and business assumptions through testing and analysis.
05
Inform Product Decisions
Generate evidence-based recommendations for future development.

Evaluation Workflow

Research proceeded across 13 iterative builds, with each cycle informing the next.

1
Content Review
Curate and prepare Classical Chinese text samples.
2
Model Evaluation
Compare translation approaches and model outputs.
3
Expert Assessment
Review translation quality and audience suitability.
4
Findings Synthesis
Identify patterns, opportunities, and recurring issues.
5
Product Recommendations
Translate findings into actionable product decisions.

Methods

Translation Evaluation
  • Model Comparison
  • Expert Review
  • Output Quality Assessment
Product Research
  • Usability Testing
  • Workflow Evaluation
  • Build Testing
Content Analysis
  • Translation Quality Review
  • Reading Experience Assessment
  • Text Processing Review
Domain Expertise as Method

Classical Chinese translation requires specialized linguistic and cultural knowledge.

In addition to usability research, I served as a domain-expert evaluator, assessing translation quality, audience suitability, and reading experience for scholarly users.

04 — Analysis

Analysis & Findings

What the evaluation examined — and what it revealed.

Examined
Translation Quality

Model output readability, accuracy, and overall usefulness for scholarly readers.

Learned

The strongest approaches consistently balanced readability with source fidelity — directly informing the production model recommendation.

For example, different models rendered a single classical term three different ways, and only one rendering preserved the original's literary tone.
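One way to make "balancing readability with source fidelity" concrete is to score each passage on both dimensions and rank models by the weaker of the two. The sketch below is illustrative only: the model names, the 1-5 scale, the ratings, and the min-based scoring rule are all assumptions, not the rubric or data used in this project.

```python
from statistics import mean

# Hypothetical expert ratings: model -> list of (readability, fidelity)
# scores per passage, each on an assumed 1-5 scale.
ratings = {
    "model_a": [(5, 2), (4, 3), (5, 2)],
    "model_b": [(4, 4), (4, 5), (3, 4)],
    "model_c": [(2, 5), (3, 5), (2, 4)],
}

def balanced_score(pairs):
    """Average the weaker dimension per passage, so high readability
    cannot compensate for low fidelity (or the reverse)."""
    return mean(min(readability, fidelity) for readability, fidelity in pairs)

def rank_models(all_ratings):
    """Return model names ordered from highest to lowest balanced score."""
    return sorted(all_ratings, key=lambda m: balanced_score(all_ratings[m]), reverse=True)

ranking = rank_models(ratings)
print(ranking[0])
```

Taking the minimum per passage is one design choice among several. A weighted average would reward models that excel on a single dimension, which is the opposite of the balance described above.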

Examined
Reliability & Trust

Recurring output-quality risks across builds and models.

Learned

Specific output patterns undermined reader trust and were flagged for review before release.

Examined
Audience Fit

Interface language and positioning for scholarly users.

Learned

Product copy needed register refinement to match scholarly expectations — adopted by the product team.

Also examined:
  • Build Readiness: subscription, onboarding, and release workflows tested before deployment
  • Usage Calibration: text-length and usage analysis to support product planning
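As a rough illustration of the text-length analysis behind Usage Calibration, percentile cut points over observed passage lengths can suggest where tier boundaries might sit. This is a minimal sketch under stated assumptions: the sample lengths, the choice of the 50th and 90th percentiles, and the function name are hypothetical and do not reflect the product's actual quotas.

```python
from statistics import quantiles

# Hypothetical passage lengths (in characters) submitted during testing.
lengths = [40, 55, 62, 80, 95, 120, 150, 210, 340, 600]

def tier_thresholds(sample, percentiles=(50, 90)):
    """Return the length at each requested percentile as a candidate
    tier boundary. Uses inclusive interpolation over the sample."""
    cuts = quantiles(sample, n=100, method="inclusive")  # 99 cut points: p1..p99
    return {f"p{p}": cuts[p - 1] for p in percentiles}

thresholds = tier_thresholds(lengths)
print(thresholds)
```

A boundary at a high percentile would let most typical passages fit in a lower tier while reserving a higher tier for unusually long texts. Where the cut actually belongs is a product decision that the numbers can only inform.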
Key Insight

Translation quality alone did not determine product suitability.

Audience fit, trustworthiness, usability, and reading experience proved equally important to the evaluation and to the resulting recommendations.

05 — Outcomes

Outcomes

Research contributions delivered across 13 iterative builds.

Deliverables

Framework
Evaluation Framework

Structured approach to assessing AI translation quality across iterative builds.

Report
Model Comparison Report

Evidence-based analysis to support production model selection.

Field Findings
UX & Build Findings

Usability and release-readiness feedback delivered per build cycle.

Strategy
Product Recommendations

Actionable guidance on product direction, copy, and feature priorities.

Impact

Model Selected
Evidence-based recommendation informed production translation engine choice
Quotas Defined
Usage-tier structure calibrated from real text-length and behavior analysis
App Store Ready
Release-readiness issues identified and resolved before deployment
Register Aligned
Interface copy refined to match scholarly audience expectations
06 — Reflection

Reflection

What this project revealed about research practice and AI evaluation.

What Worked
Combining Domain Expertise with UX Research

Embedding specialist knowledge directly into the evaluation process surfaced issues that standard usability methods would have missed.

What I'd Explore Next
Formalising Evaluation for Scale

A structured evaluation framework like this could be adapted for other AI products serving specialist audiences, such as those in legal, medical, or archival domains.

Key Takeaway
Product Suitability Is More Than Quality

For AI products serving niche audiences, evaluation must account for trust, register, and cultural fit — not just accuracy metrics.
