Project Snapshot
Overview
- Product Research Consultant
- Domain Expert Evaluator
- Usability Researcher
- Human-in-the-Loop Reviewer
- Usability Testing
- Model Comparison
- Translation Review & Accuracy Evaluation
- Human-in-the-Loop Evaluation
- App Store Build Testing
- Production model selected
- Usage quotas defined
- App Store release readiness achieved
- Interface copy aligned to audience
Context
Challenge
- Translation Quality
- Readability & Comprehension
- User Experience
- Audience Suitability
Evaluating AI translation for historical and literary texts required balancing usability, accuracy, and reader expectations for a specialized audience.
Research Focus
Objectives and guiding questions across 13 iterative product builds.
Evaluation Workflow
Research proceeded across 13 iterative builds, with each cycle informing the next.
Methods
- Model Comparison
- Expert Review
- Output Quality Assessment
- Usability Testing
- Workflow Evaluation
- Build Testing
- Translation Quality Review
- Reading Experience Assessment
- Text Processing Review
Classical Chinese translation requires specialized linguistic and cultural knowledge.
In addition to usability research, I served as a domain-expert evaluator, assessing translation quality, audience suitability, and reading experience for scholarly users.
Analysis & Findings
What the evaluation examined and what it revealed.
Model output readability, accuracy, and overall usefulness for scholarly readers.
The strongest approaches consistently balanced readability with source fidelity, a finding that directly informed the production model recommendation.
For example, different models rendered a single classical term three different ways, and only one rendering preserved the original's literary tone.
Recurring output-quality risks across builds and models.
Specific output patterns undermined reader trust and were flagged for review before release.
Interface language and positioning for scholarly users.
Product copy needed register refinement to match scholarly expectations; the product team adopted the refinements.
Translation quality alone did not determine product suitability.
Audience fit, trustworthiness, usability, and reading experience proved equally important in evaluation and recommendation decisions.
Outcomes
Research contributions delivered across 13 iterative builds.
Deliverables
Structured approach to assessing AI translation quality across iterative builds.
Evidence-based analysis to support production model selection.
Usability and release-readiness feedback delivered per build cycle.
Actionable guidance on product direction, copy, and feature priorities.
Impact
Reflection
What this project revealed about research practice and AI evaluation.
Embedding specialist knowledge directly into the evaluation process surfaced issues that standard usability methods would have missed.
A structured evaluation framework like this could be adapted for other AI products serving specialist audiences, such as those in legal, medical, or archival domains.
For AI products serving niche audiences, evaluation must account for trust, register, and cultural fit, not just accuracy metrics.