Day 1b: Evaluation and Structured Output

Session Overview

Evaluation and Structured Output

Structured Output: JSON, schemas, parsing
Evaluation Methods
Pointwise Evaluation
Pairwise Evaluation
Best Practices

This session has five main parts. First, we’ll learn how to get structured outputs - instead of free-form text, you’ll get JSON or other structured data that’s easy to parse and use in your applications.

Then we’ll dive into evaluation. Pointwise evaluation scores individual responses - is this answer good or bad? Pairwise evaluation compares two responses - which one is better? And we’ll discuss when to use each method and how to choose the right evaluation approach.

Finally, we’ll cover best practices for documenting and tracking your prompt engineering work, which is essential for iterative improvement.

These skills are what separate prototyping from production. Anyone can get a response from ChatGPT. But building an application that reliably produces good, structured outputs and can evaluate its own performance? That requires these techniques.

Session 2: Evaluation and Structured Output

The Problem with Free-Form Text

Unstructured outputs are hard to use

Response: "The user seems happy based on the 
emojis and positive words in their message."

Requires parsing
Error-prone extraction
Inconsistent formatting
Hard to validate

Structured Output

Getting data, not text

{
  "sentiment": "positive",
  "confidence": 0.85,
  "keywords": ["happy", "excited", "great"]
}

Easy parsing: No fragile text extraction
Integration: Works with databases, APIs, services

Structured output means getting responses in a specific format - usually JSON, but it could be XML, CSV, or any other structured format. Instead of parsing text, you’re parsing data structures.

This makes everything easier. You can validate the structure. You can access fields directly. You can type-check the data. You can use it in your application without fragile text extraction, though you should still handle the occasional malformed response.

Most modern LLM APIs support structured output through schemas. You define what you want, and the model returns data that matches that schema. It’s like having a contract with the model: “I’ll give you this input, you’ll give me this structure.”

Structured output is the difference between a demo and an application. In a demo, you can manually read the response. In an application, you need to process it automatically.
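To make the "process it automatically" step concrete, here is a minimal sketch of parsing and checking the sentiment response shown earlier. The function name `parse_sentiment` and the `REQUIRED_FIELDS` table are our own illustrative choices, not part of any particular LLM API, and the raw string stands in for whatever the model returns.

```python
import json

# Assumed contract for the sentiment example: field name -> accepted Python type(s).
REQUIRED_FIELDS = {"sentiment": str, "confidence": (int, float), "keywords": list}

def parse_sentiment(raw: str) -> dict:
    """Parse a model response that is expected to be a JSON object."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Free-form text such as "The user seems happy..." ends up here.
        raise ValueError(f"response is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("response must be a JSON object")
    for field_name, expected_type in REQUIRED_FIELDS.items():
        if field_name not in data:
            raise ValueError(f"missing field: {field_name}")
        if not isinstance(data[field_name], expected_type):
            raise ValueError(f"wrong type for field: {field_name}")
    return data

# Stand-in for a model response; in a real application this comes from the API call.
raw_response = '{"sentiment": "positive", "confidence": 0.85, "keywords": ["happy", "excited", "great"]}'
result = parse_sentiment(raw_response)
print(result["sentiment"], result["confidence"])
```

Raising an error on a malformed response lets the caller decide whether to retry, fall back, or log the failure, instead of silently passing bad data downstream.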

Schema Example: Product Data

Defining product structure

{
  "type": "object",
  "properties": {
    "name": { "type": "string", "description": "Product name" },
    "category": { "type": "string", "description": "Product category" },
    "price": { "type": "number", "format": "float", "description": "Product price" },
    "features": {
      "type": "array",
      "items": { "type": "string" },
      "description": "Key features of the product"
    },
    "release_date": { "type": "string", "format": "date", "description": "Date the product was released" }
  }
}

Tip

JSON Schema is a declarative language for annotating and validating JSON documents’ structure, constraints, and data types. It helps you standardize and define expectations for JSON data.
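As a rough illustration of what validating against the product schema involves, the sketch below hand-rolls a checker for the small subset of JSON Schema used above (`type`, `properties`, `items`, and the `date` format). It is a simplification for teaching purposes, not a full JSON Schema implementation; in practice you would likely use a dedicated validation library. The function name `find_errors` and the sample products are made up for this example.

```python
from datetime import date

# The product schema from the slide, trimmed to the keywords this sketch checks.
PRODUCT_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "category": {"type": "string"},
        "price": {"type": "number"},
        "features": {"type": "array", "items": {"type": "string"}},
        "release_date": {"type": "string", "format": "date"},
    },
}

# Mapping from JSON Schema type names to Python types (subset only).
PYTHON_TYPES = {"object": dict, "string": str, "number": (int, float), "array": list}

def find_errors(value, schema, path="$"):
    """Return a list of human-readable problems; an empty list means valid."""
    kind = schema["type"]
    # bool is a subclass of int in Python, so reject it explicitly for numbers.
    is_bool_as_number = kind == "number" and isinstance(value, bool)
    if not isinstance(value, PYTHON_TYPES[kind]) or is_bool_as_number:
        return [f"{path}: expected {kind}"]
    errors = []
    if kind == "object":
        for key, sub_schema in schema.get("properties", {}).items():
            if key in value:  # this sketch treats every property as optional
                errors += find_errors(value[key], sub_schema, f"{path}.{key}")
    elif kind == "array" and "items" in schema:
        for index, item in enumerate(value):
            errors += find_errors(item, schema["items"], f"{path}[{index}]")
    elif schema.get("format") == "date":
        try:
            date.fromisoformat(value)
        except ValueError:
            errors.append(f"{path}: not an ISO 8601 date")
    return errors

# Made-up sample data for illustration.
good = {"name": "Trail Lamp", "category": "outdoor", "price": 24.5,
        "features": ["waterproof"], "release_date": "2024-03-01"}
bad = {"name": "Trail Lamp", "price": "24.50",
       "features": ["waterproof", 3], "release_date": "March 2024"}

print(find_errors(good, PRODUCT_SCHEMA))
for problem in find_errors(bad, PRODUCT_SCHEMA):
    print(problem)
```

Reporting the path of each problem (for example `$.features[1]`) makes it easier to see which part of the model's output drifted from the schema.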

Schema Benefits

Guiding the model’s attention

[Figure: Schema guiding model attention to relevant fields]

Evaluation: Why It Matters

Measuring what works

Prompts are experiments
Models can be inconsistent
Production needs reliability

Now let’s talk about evaluation. Prompting is experimental - you try something, see if it works, adjust, try again. But how do you know if it actually works? How do you measure improvement?

Evaluation is the answer. It’s how you measure whether your prompts are producing good results. Without evaluation, you’re flying blind. You might think your prompt is great because you saw one good response, but what about edge cases? What about consistency?

Evaluation is especially important for production systems. You can’t deploy something and hope it works. You need to know, with data, that it’s producing good results consistently.

The whitepaper emphasizes that prompt engineering is an iterative process. You craft and test different prompts, analyze, and document the results. You refine your prompt based on the model’s performance. You keep experimenting until you achieve the desired output.

Evaluation Methods Overview

Automated metrics: Fast, but task-specific
- BLEU for translation
- ROUGE for summarization
- BERTScore for paraphrasing
Human evaluation: Gold standard, but expensive and slow
LLM-as-judge: Scalable, but needs validation
Hybrid: Combine methods for reliability

Note

Human evaluation: At LM Arena, users are shown answers from two anonymous models (e.g., Model A vs. Model B) and vote on which answer is better.

There are several ways to actually perform evaluation. Automated metrics are fast and cheap - things like BLEU for translation, ROUGE for summarization. But they’re task-specific and don’t always correlate with human judgment.

Human evaluation is the gold standard - humans are the best judges of quality. But it’s expensive and slow, so it’s not practical for large-scale evaluation.

LLM-as-judge is a newer approach: use an LLM to evaluate other LLM responses. It’s scalable and can be quite good, but you need to validate it against human evaluation to make sure it’s reliable.

The best approach is often hybrid: use automated metrics for quick feedback, LLM-as-judge for scale, and human evaluation for validation and high-stakes decisions.
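To give a feel for what an automated metric computes, here is a simplified unigram-overlap score in the spirit of ROUGE-1. It is a teaching sketch under our own assumptions (lowercasing and whitespace tokenization), not the official ROUGE implementation, and the function name `unigram_overlap` is hypothetical.

```python
from collections import Counter

def unigram_overlap(reference: str, candidate: str) -> dict:
    """Compare word counts between a reference text and a candidate text."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((ref_counts & cand_counts).values())
    recall = overlap / max(sum(ref_counts.values()), 1)
    precision = overlap / max(sum(cand_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

scores = unigram_overlap("the cat sat on the mat", "the cat lay on the mat")
print(scores)
```

Note how the score only rewards shared words: a candidate that says the same thing in different words would score poorly, which is one reason these metrics do not always track human judgment.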

The key is matching the evaluation method to your needs. For prototyping, maybe LLM-as-judge is enough. For production, you might need human evaluation for critical decisions. For scale, automated metrics might be necessary.

LLM-as-Judge

Using models to evaluate models

Judge Prompt: "Rate this response on a scale 
of 1-5 for correctness, completeness, and clarity."

Response: [The response to evaluate]
Score: 4/5

[Figure: LLM-as-judge showing model evaluating model responses]

LLM-as-judge is a powerful technique: you use an LLM to evaluate other LLM responses. You give the judge LLM the original question, the response to evaluate, and evaluation criteria, and it scores the response.

This is scalable - you can evaluate thousands of responses quickly. And it can be quite good, especially for tasks where the evaluation criteria are clear.

But it’s not perfect. The judge LLM can have biases. It might not catch subtle errors. And it needs to be validated against human evaluation to ensure it’s reliable. Still, for many use cases, it’s a great balance between quality and scale.

The key to effective LLM-as-judge is:
- Clear evaluation criteria
- Good judge prompts
- Validation against human evaluation
- Understanding the judge’s limitations

Use LLM-as-judge for scale, but always validate with human evaluation for critical decisions.
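The judge workflow described above can be sketched roughly as follows. Everything here is illustrative: `call_model` is a hypothetical stand-in for whatever function sends a prompt to your judge model, the "Score: N/5" reply format is an assumption we impose in the prompt, and `fake_judge` exists only so the example runs without an API.

```python
import re
from typing import Callable, List

JUDGE_TEMPLATE = """You are evaluating a response to a user question.
Rate the response on a scale of 1-5 for correctness, completeness, and clarity.

Question: {question}
Response: {response}

Reply with a single line in the form "Score: N/5"."""

def judge(question: str, response: str, call_model: Callable[[str], str]) -> int:
    """Ask the judge model for a score and parse it out of the reply."""
    prompt = JUDGE_TEMPLATE.format(question=question, response=response)
    reply = call_model(prompt)
    match = re.search(r"Score:\s*([1-5])\s*/\s*5", reply)
    if match is None:
        raise ValueError(f"could not parse judge reply: {reply!r}")
    return int(match.group(1))

def agreement_rate(judge_scores: List[int], human_scores: List[int]) -> float:
    """Fraction of items where the judge gave exactly the human's score."""
    matches = sum(j == h for j, h in zip(judge_scores, human_scores))
    return matches / len(human_scores)

def fake_judge(prompt: str) -> str:
    # Placeholder for a real model call; always returns the same score.
    return "Score: 4/5"

score = judge(
    "How do I reset my password?",
    "Click 'Forgot Password' on the login screen.",
    fake_judge,
)
print(score)
print(agreement_rate([4, 3, 5], [4, 2, 5]))
```

Exact-match agreement is the simplest possible validation check. If judge and human scores rarely agree on a labeled sample, that is a signal to revise the judge prompt or criteria before trusting it at scale.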

Pointwise Evaluation

The simplest form of evaluation

Customer Query: "How do I reset my account password, and will I lose my saved progress in the app?"

LLM A: "To reset your password, click 'Forgot Password' on the login screen. We will send an email to your registered address. Changing your password does not delete your account or any of your saved progress."

Note

Evaluation Criteria (LLM B Analysis):

Correct ✅ PASS - Factually accurate per Official Documentation (Section 4.2: Password Management). Password resets are strictly credential updates and do not trigger data wipes.
Complete ✅ PASS - Addresses both parts of the user’s query: 1) The ‘How-to’ instructions for resetting and 2) The specific concern regarding data loss.
Relevant ✅ PASS - Stays strictly on topic. It provides direct instructions and reassurances without drifting into unrelated features like MFA setup or subscription tiers.
Safe ✅ PASS - Follows Company Security Guidelines by instructing the user to use the official ‘Forgot Password’ flow rather than asking the user to provide their current credentials in the chat.

Evaluation: Different ways to score

Multiple scoring dimensions

Pointwise evaluation is useful when you have clear criteria. Is the response correct? Is it complete? Is it in the right format? Is it relevant? Does it meet safety guidelines? These are yes/no or numeric questions that you can score.

There are many ways to score a response pointwise. Correctness is the most obvious - is the answer factually correct? But there are other dimensions too.

Correctness: Is the answer factually correct?
Completeness: Does it cover everything it should?
Format: Does it match structural requirements?
Relevance: Is it actually answering the question?
Safety: Does it meet content guidelines?

Completeness asks whether the response covers everything it should. Format checks if the response matches structural requirements. Relevance checks if it’s actually answering the question. Safety checks if it meets content guidelines.
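One way to record these dimensions is a small pass/fail record per response, plus a per-criterion pass rate across a test set. This is a sketch under our own naming (`PointwiseResult`, `pass_rates`), and the two sample results are invented for illustration. How each check gets filled in (human reviewer, LLM judge, or code) is left open.

```python
from dataclasses import dataclass
from typing import Dict, List

CRITERIA = ("correct", "complete", "format", "relevant", "safe")

@dataclass
class PointwiseResult:
    response_id: str
    checks: Dict[str, bool]  # criterion name -> True (pass) or False (fail)

    @property
    def passed(self) -> bool:
        # A response passes overall only if every criterion passes;
        # a missing criterion counts as a failure.
        return all(self.checks.get(name, False) for name in CRITERIA)

def pass_rates(results: List[PointwiseResult]) -> Dict[str, float]:
    """Share of responses that pass each criterion."""
    return {
        name: sum(r.checks.get(name, False) for r in results) / len(results)
        for name in CRITERIA
    }

# Invented sample results for two responses.
results = [
    PointwiseResult("resp-1", {name: True for name in CRITERIA}),
    PointwiseResult("resp-2", {"correct": True, "complete": False,
                               "format": True, "relevant": True, "safe": True}),
]

print([r.passed for r in results])
print(pass_rates(results))
```

Tracking rates per criterion, rather than one overall number, shows where a prompt is weak: here the made-up data would point at completeness.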

Pairwise Evaluation

Customer Query: “I forgot the PIN to my smart-lock at home. My kids are stuck outside in the rain! How do I override the lock, and is there a master code?”

A vs B:

Response A: “Oh no! To let the kids in, just find a heavy brick and apply it directly to the front window. It’s the ultimate master code! While you’re at it, did you know that rain is actually just the sky crying? Also, I recommend buying a trampoline. It won’t help with the lock, but it’s fun.”

Response B: “For security reasons, there is no universal master code for our smart-locks. To regain access, you must use the physical backup key provided at purchase. If the key is unavailable, you can reset the lock via the mobile app using your biometric login. This will allow you to set a new PIN without deleting your existing settings.”

Pairwise evaluation compares two responses and asks: which one is better? Instead of scoring each response independently, you’re making a relative judgment.

Pairwise evaluation shines when you need to make relative comparisons. If you’re trying to decide between two prompts, pairwise evaluation will tell you which one is better more reliably than trying to score each one independently.

It’s also great for ranking. If you generate multiple responses and want to pick the best one, pairwise evaluation can rank them. You compare each pair, and the response that wins more comparisons is ranked higher.
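The "compare each pair, count wins" idea can be sketched as a simple round-robin. The `prefer` argument is a hypothetical placeholder for whoever makes the comparison (a human rater or an LLM judge); `prefer_shorter` below is a toy stand-in so the example runs, not a sensible quality criterion.

```python
from itertools import combinations
from typing import Callable, List

def rank_by_wins(responses: List[str], prefer: Callable[[str, str], str]) -> List[str]:
    """Rank responses by how many pairwise comparisons each one wins."""
    wins = {index: 0 for index in range(len(responses))}
    for i, j in combinations(range(len(responses)), 2):
        winner = prefer(responses[i], responses[j])
        wins[i if winner == responses[i] else j] += 1
    # Most wins first; ties keep their original order because sorted() is stable.
    order = sorted(wins, key=lambda index: wins[index], reverse=True)
    return [responses[index] for index in order]

def prefer_shorter(a: str, b: str) -> str:
    # Toy comparator for demonstration only.
    return a if len(a) <= len(b) else b

ranking = rank_by_wins(["aaa", "a", "aa"], prefer_shorter)
print(ranking)
```

A full round-robin needs n(n-1)/2 comparisons for n responses, so the cost grows quickly as you add candidates.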

And it’s useful when absolute scores are hard to define. Sometimes it’s easier to say “this is better” than to say “this is 0.85 out of 1.0.” Pairwise evaluation leverages that human intuition.

Best Practices: Document Everything

Learning from iteration

Prompt Card Template

The whitepaper emphasizes the importance of documenting your prompt attempts in full detail so you can learn over time what went well and what did not.

Prompt outputs can differ across models, across sampling settings, and even across different versions of the same model. Moreover, even across identical prompts to the same model, small differences in output sentence formatting and word choice can occur.

We recommend creating a tracking system with fields for:

Name and version: Identify the prompt
Goal: One sentence explanation of the goal
Model: Name and version of the used model
Configuration: Temperature, token limit, top-K, top-P
Prompt: Write out the full prompt
Output: Write out the output or multiple outputs
Performance: OK/NOT OK/SOMETIMES OK
Feedback: What worked, what didn’t
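The fields above map naturally onto a small record type. The sketch below is one possible shape, not a prescribed format: the class name `PromptCard`, the field names, and all sample values (including the model name) are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List

ALLOWED_PERFORMANCE = ("OK", "NOT OK", "SOMETIMES OK")

@dataclass
class PromptCard:
    name: str
    version: str
    goal: str
    model: str
    temperature: float
    token_limit: int
    top_k: int
    top_p: float
    prompt: str
    outputs: List[str] = field(default_factory=list)
    performance: str = "NOT OK"
    feedback: str = ""

    def __post_init__(self):
        # Keep the performance field to the three values from the template.
        if self.performance not in ALLOWED_PERFORMANCE:
            raise ValueError(f"performance must be one of {ALLOWED_PERFORMANCE}")

# Hypothetical example entry.
card = PromptCard(
    name="sentiment-classifier",
    version="v2",
    goal="Classify the sentiment of a user message as JSON.",
    model="example-model-001",
    temperature=0.2,
    token_limit=256,
    top_k=40,
    top_p=0.95,
    prompt="Classify the sentiment of the following message...",
    outputs=['{"sentiment": "positive", "confidence": 0.85}'],
    performance="SOMETIMES OK",
    feedback="Works on short messages; drops the keywords field on long ones.",
)

# One JSON object per line is easy to append to a log file and diff later.
line = json.dumps(asdict(card))
print(line)
```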

The advantage of this approach is that you have a complete record when you inevitably have to revisit your prompting work: to pick it up in the future, to test prompt performance on different versions of a model, or to help debug future errors.

Key Takeaways

Measuring success

Structured output: Get data, not text
Evaluation methods: Choose based on needs
Pointwise evaluation: Score individual responses
Pairwise evaluation: Compare responses
Documentation: Track and learn from iterations

Let’s bring it all together. Evaluation and structured output are what turn LLM interactions into reliable applications.

Structured output gives you data you can work with programmatically. Pointwise evaluation tells you if individual responses are good. Pairwise evaluation helps you compare and improve. Different evaluation methods give you different trade-offs between quality, speed, and cost. And documentation helps you learn from your iterations.

The key is matching the approach to your needs. For prototyping, maybe LLM-as-judge is enough. For production, you might need human evaluation for critical decisions. For scale, automated metrics might be necessary.

Together with the prompting techniques from the first session, you now have the foundation for building real AI applications. You can craft effective prompts, get structured outputs, and evaluate whether they’re working. That’s Day 1 complete.

Remember: prompt engineering is an iterative process. Craft and test different prompts, analyze, and document the results. Refine your prompt based on the model’s performance. Keep experimenting until you achieve the desired output. When you change a model or model configuration, go back and keep experimenting with the previously used prompts.