
Blueprint of SaaSy: Preliminary Statistical Validation of Our Free B2B SaaS Audience Simulator

Writer: Rebeka Pop


In a world where data overload slows decision-making, SaaSy is revolutionizing how executives gather insights. Imagine trying to navigate your next strategic move with limited time and resources—desk research tasks that once took hours or even days are now completed in minutes. SaaSy isn’t just a tool; it’s your partner in transforming complexity into clarity.


SaaSy is a B2B SaaS audience simulator that replicates audience needs and preferences through realistic content styles and tones of voice. Picture having a one-on-one conversation with the average opinion of C-level executives interested in buying SaaS products—this is the unique experience SaaSy delivers. By combining advanced desk research (synthesis of secondary data) with real-time conversational AI, SaaSy produces insights that mirror the "average opinion" of your target audience, in a tone and clarity befitting a C-level executive.


Here’s how SaaSy redefines research for C-level executives:

  • Saves Time and Costs: SaaSy delivers insights in minutes, reducing desk research time by at least 75%—and it's completely free.

  • Thinks Beyond Data: With its emergent behavior, SaaSy goes beyond preloaded data to offer creative, independent insights tailored to your needs. 63.4% of its responses reflect this ability to adapt and solve complex, unstructured problems.

  • Reliable and Transparent: SaaSy delivers 95.12% accuracy and achieves an industry-leading 96.19% F1 score.

  • Designed for Decision-Makers: Whether you're identifying market trends, refining product strategies, or exploring SaaS solutions, SaaSy speaks your language with a professional, C-level tone.


But we’re not stopping here. SaaSy is evolving. As it learns from users like you, it becomes an even more indispensable tool for turning data into decisions.


This blueprint provides a transparent, detailed overview of SaaSy’s statistical validation process, including:

  • The methodology behind its development.

  • Key performance metrics.

  • Limitations and disclaimers.

  • Plans for ongoing improvement.


Key Metrics:

  • Accuracy: 95.12% overall, consistently delivering correct responses.

  • Precision: 97.47%, ensuring that nearly all confident responses are correct.

  • Recall: 97.47%, capturing almost all correct answers.

  • Binary F1 Score: 96.19%, balancing precision and recall for well-rounded performance.

  • Emergent Behavior: 63.4% of responses are emergent, with a 98.00% F1 score, showcasing the AI’s ability to think independently and adapt to novel scenarios.


Key Benefits:

  • Time Savings: Reduces desk research time by 75% to 97%.

  • Cost Efficiency: Eliminates traditional desk research costs (€840–€3,360 per task).

  • Real-Time Insights: Dynamically incorporates real-time inputs to deliver up-to-date audience sentiments.

  • C-Level Tone: Delivers insights in a concise, actionable manner, mirroring the strategic thinking of a C-level executive.


How Was SaaSy Built?


SaaSy was developed using Enäks AI, an advanced artificial intelligence system integrating three core data layers to create a unified knowledge base. This combination of consumer behavior data, industry insights, and real-time expert opinions forms the backbone of SaaSy’s conversational capabilities.


The three layers of SaaSy’s knowledge base:

  1. Foundation Layer:

    • Secondary consumer behavior research data forms the base of SaaSy’s knowledge.

    • Focused on extracting audience behavior insights from methodologically rigorous and reliable studies.

  2. Contextual Layer:

    • Industry and macro reports provide essential context, enabling SaaSy to tailor its responses to industry-specific challenges and opportunities.

  3. Voice of SaaSy:

    • Real-time community-generated insights, also known as the Voice of Experts, add a dynamic element. This layer captures up-to-date sentiments and preferences directly from C-level audiences.


This multi-layered architecture ensures that SaaSy delivers accurate, timely, and nuanced insights in just seconds.
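
To make this concrete, here is a deliberately simplified sketch of how three such layers could feed a single answer context at query time. Enäks AI’s actual architecture is not public, so every class name, snippet, and matching rule below is hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Layer:
    name: str
    snippets: list[str] = field(default_factory=list)

    def retrieve(self, query: str) -> list[str]:
        # Naive keyword match, standing in for real retrieval.
        terms = query.lower().split()
        return [s for s in self.snippets if any(t in s.lower() for t in terms)]

layers = [
    Layer("foundation", ["C-level buyers weigh ROI heavily in SaaS purchases"]),
    Layer("contextual", ["2024 macro reports show tightening SaaS budgets"]),
    Layer("voice",      ["This week, CTOs report rising interest in AI-driven analytics"]),
]

def build_context(query: str) -> list[str]:
    """Merge evidence from every layer into one answer context."""
    return [hit for layer in layers for hit in layer.retrieve(query)]

print(build_context("SaaS ROI"))  # pulls matching snippets from each layer
```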




Roadmap: From the vision of SaaSy to its current functionality


  1. The Vision


SaaSy was originally designed to streamline desk research for C-level executives and marketers. Its primary goal was to simplify the process of extracting insights from secondary data sources, such as consumer behavior studies.

For anyone familiar with desk research, the process is often time-intensive and challenging, especially when the focus is on understanding specific audiences. Reliable and up-to-date studies are hard to find, and the quality of consumer data varies widely. SaaSy’s mission was to address these pain points, saving users countless hours while ensuring they could rely on methodologically sound data.


At this stage, SaaSy relied on a single data layer—secondary consumer studies—and was trained for ~85 hours, with statistical validation (detailed below) ensuring its outputs were reliable based on its sources.


  2. Adding Context and Depth


As SaaSy evolved, it became clear that additional data layers were needed to enhance its capabilities. Two new layers were introduced:

  • Industry and macroeconomic reports: To give responses more context and align them with market trends.

  • Voice of Experts: To provide real-time, community-driven insights, adding a dynamic element to its functionality.


This expansion allowed SaaSy to deliver more comprehensive and contextually relevant insights. During this phase, we observed an increase in emergent behavior responses. Emergent behavior responses are answers generated through the AI’s independent thinking, beyond the predefined data layers, and thus provide insights into the AI’s capability to handle complex or novel scenarios. 


  3. SaaSy’s Current Function


After several iterations and 520 hours of training, SaaSy has reached its current state as a free B2B SaaS audience simulator, delivering insights with:

  • 95.12% average accuracy.

  • 87% average response confidence rate.


SaaSy's capabilities now extend beyond basic desk research to include:

  • Audience Simulation: Mimicking the preferences and decision-making processes of C-level executives buying SaaS products.

  • Communication Optimization: Helping companies tailor their messaging to align with audience needs and preferences.

  • Real-Time Insights: Pinpointing real-time shifts in audience sentiments, offering an advantage in dynamic market conditions.


What sets SaaSy apart is its ability to act as an advanced desk research tool that combines the depth of traditional secondary data analysis with the agility of real-time, conversational AI. It synthesizes vast amounts of existing data while dynamically incorporating real-time inputs, delivering insights that reflect the collective opinion of your target audience—all in the tone and clarity of a C-level executive.



Blueprint of SaaSy: Preliminary Statistical Validation


To assess the evolution of SaaSy’s performance from its initial development (V0) to its current version (V2), we conducted a rigorous statistical validation using a consistent set of questions across all versions within a controlled experimental design. This ensures comparability and pinpoints improvements in accuracy, reliability, and adaptability.


Methodology


Sample Design:

  • Sample Size: 82 responses (same questions posed to V0, V1, and V2).

  • Question Scope: Covered all core SaaSy functions, including:

    • Decision-making processes (e.g., “Who is involved in the decision-making, how many people, and from which departments?”).

    • Audience preferences (e.g., “When you compare multiple SaaS products, which factors do you use to weigh the alternatives?”).

    • Communication styles (e.g., “How do you prefer to be contacted by SaaS vendors?”).

    • Pricing sensitivity (e.g., “What budget constraints do you have when considering new SaaS solutions?”).

    • Personal opinions and motivations (e.g., “How do you feel before purchasing a SaaS product?”).


Validation Process:

1. Consistent Questions

  • Approach: Identical sets of questions were used across all versions (V0, V1, and V2). This ensured that performance improvements could be accurately attributed to the AI's iterative development rather than differences in the questions themselves.

2. Expert Evaluation

  • Evaluation Team: Two independent SaaS experts assessed each AI response based on the following criteria:

    • Correctness: Determined by the alignment of the response with verified, factual data.

    • Contextual Relevance: Evaluated based on the appropriateness and fit of the response for C-level decision-making scenarios.

  • Confusion Matrix Criteria: The responses were categorized using the standard confusion matrix approach:

    • True Positive (TP): The AI correctly answered the question.

    • False Positive (FP): The AI provided an answer, but it was incorrect.

    • False Negative (FN): The AI failed to provide a correct answer when it should have.

    • True Negative (TN): The AI correctly identified that no answer was needed.

  • Resolution of Discrepancies: In cases where the two experts disagreed, a third expert was brought in to resolve the discrepancy and make a final judgment.
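
To illustrate the bookkeeping, here is a minimal sketch of the two-expert labeling flow with a third expert breaking ties. The function and example labels are hypothetical; SaaSy’s internal evaluation tooling is not public.

```python
from collections import Counter

def final_label(expert_a: str, expert_b: str, expert_c: str | None = None) -> str:
    """Return the agreed confusion-matrix cell ('TP', 'FP', 'FN', or 'TN')."""
    if expert_a == expert_b:
        return expert_a           # the two experts agree
    if expert_c is None:
        raise ValueError("experts disagree; a third expert must adjudicate")
    return expert_c               # third expert makes the final judgment

counts = Counter()
counts[final_label("TP", "TP")] += 1                 # agreement
counts[final_label("TP", "FP", expert_c="TP")] += 1  # resolved by third expert
print(counts)                                        # Counter({'TP': 2})
```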


3. Human-in-the-Loop Auditing

  • Weekly Reviews: A panel of two experts conducts weekly audits of all AI-generated responses.

  • Evaluation Criteria: Each response is reviewed for:

    • Accuracy: How factually correct the response is.

    • Relevance: How well the response addresses the query in the context of a C-level executive's needs.

    • Tone: Whether the tone of the response is professional and aligned with the communication style expected by C-level executives.




Detailed Results of General AI Performance


This section presents the statistical validation of all 82 responses.


Precision

  • What it measures: Precision assesses how many of the AI’s positive (correct) responses are truly correct. It reflects how reliable the AI is when it confidently provides an answer.

  • Result: The AI’s overall precision is 97.47%, meaning 97.47% of the responses it flags as correct are truly accurate.

  • Scenario: Out of 82 responses, the AI classified 39 as correct responses (TP + FP). With a precision of 97.47%, this means about 38 of these 39 responses were accurate, while 1 might be wrong.

  • Insight: This high precision score demonstrates that the AI is careful and deliberate, delivering correct answers with minimal errors. Users can rely on the AI for accurate insights, reducing the risk of false information.


Recall

  • What it measures: Recall measures how many of the actual correct responses the AI was able to identify. It reflects the AI’s ability to capture all relevant answers without missing any.

  • Result: The AI’s overall recall is 97.47%, meaning it correctly identifies almost all correct answers.

  • Scenario: Out of 82 total responses, there were 39 actual correct answers (TP + FN) that the AI should have identified. With a recall of 97.47%, this means the AI correctly identified 38 of these 39 answers, missing only 1.

  • Insight: High recall ensures that the AI doesn’t overlook important correct answers. This thoroughness makes the AI a valuable tool for tasks where comprehensive responses are critical.


Binary F1 Score

  • What it measures: The F1 score balances precision and recall, giving a holistic view of the AI’s performance. It ensures that the AI is not only accurate but also consistent in capturing correct answers.

  • Result: The AI’s overall F1 Score is 96.19%, which reflects a strong balance between precision and recall.

  • Scenario: Out of 82 responses, the AI performed well both in avoiding incorrect answers (precision) and in identifying correct ones (recall). An F1 score of 96.19% shows that the AI is excelling at being both smart and thorough in its responses.

  • Insight: This score confirms that the AI delivers both precise and comprehensive responses, making it highly reliable for real-world applications.


Accuracy

  • What it measures: Accuracy refers to the proportion of all responses—whether correct or incorrect—that were classified correctly.

  • Result: The AI’s overall accuracy is 95.12%, meaning 95.12% of all responses were classified correctly.

  • Scenario: Out of 82 total responses, the AI classified 78 correctly (including both correct answers and correctly identified TNs), leaving 4 responses incorrect. This means the AI is accurate in 19 out of every 20 cases.

  • Insight: With accuracy exceeding 95%, SaaSy demonstrates strong general reliability, ensuring correct answers in most scenarios.
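
For readers who want to check the arithmetic, the sketch below plugs the integer counts implied by the scenarios above (TP=38, FP=1, FN=1; 78 of 82 responses classified correctly) into the standard formulas. The scenario counts are themselves rounded (“about 38 of these 39”), so the outputs agree with the reported percentages only to within rounding.

```python
def precision(tp: int, fp: int) -> float:
    """Share of responses flagged as correct that truly are correct."""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Share of actually correct answers that the AI captured."""
    return tp / (tp + fn)

def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

p, r = precision(38, 1), recall(38, 1)
print(p)         # ~0.974 (reported precision: 97.47%)
print(r)         # ~0.974 (reported recall: 97.47%)
print(f1(p, r))  # ~0.974 with these rounded counts (reported F1: 96.19%)
print(78 / 82)   # ~0.9512 (reported accuracy: 95.12%)
```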



Detailed Results of Emergent Behavior Responses


Emergent behavior responses are a subset of the AI’s total output, constituting 63.4% of all responses. These are answers generated through the AI’s independent thinking, beyond predefined data layers, and provide insights into its ability to handle complex or novel scenarios. The fact that over 60% of the AI’s output involves emergent behavior highlights the tool’s capacity for independent reasoning and adaptability. This capability positions the AI as a powerful solution for tackling complex and ambiguous queries.


Emergent Correctness Score (ECS)

  • What it measures: ECS evaluates the proportion of emergent responses that were validated as correct, assessing the AI’s ability to produce accurate insights when thinking independently.

  • Result: The ECS is 96.15%, meaning 50 of the 52 emergent responses were correct.

  • Scenario: Out of the 52 emergent responses, 50 were accurate and only 2 were incorrect. This shows that almost all of the AI’s independent thinking is spot-on.

  • Insight: This score demonstrates the AI’s ability to deliver accurate and reliable insights when generating novel responses, making it highly effective for use cases where predefined knowledge is insufficient.


Emergent Precision

  • What it measures: Emergent precision evaluates how many of the AI’s emergent responses flagged as correct were actually correct. It measures the AI’s accuracy when it confidently gives emergent answers.

  • Result: The emergent precision is 98.04%, meaning that when the AI confidently gives an emergent answer, it is almost always correct.

  • Scenario: Out of the 51 emergent responses classified as correct (TP + FP), 50 were accurate. This means that when the AI speaks confidently from its independent reasoning, it almost never makes mistakes.

  • Insight: This high precision highlights the AI’s careful and deliberate approach when delivering emergent responses, ensuring that users can trust the AI’s independently reasoned insights.


Emergent Recall

  • What it measures: Emergent recall evaluates how well the AI captures all the correct emergent responses. It measures the AI’s thoroughness in identifying accurate independent responses.

  • Result: The emergent recall is 98.04%, meaning the AI successfully captured almost all the correct emergent responses.

  • Scenario: Out of the 51 actual correct emergent responses (TP + FN), 50 were correctly identified by the AI. This means the AI misses very few correct emergent responses.

  • Insight: The high recall score indicates that the AI is comprehensive in its independent reasoning, ensuring that important and relevant insights are not missed.


Emergent F1 Score

  • What it measures: The F1 score balances emergent precision and recall, providing a holistic metric of the AI’s performance in emergent behavior.

  • Result: The emergent F1 score is 98.00%, reflecting a strong balance between accuracy (precision) and completeness (recall).

  • Scenario: Out of 52 emergent responses, the AI excels in both providing correct answers (precision) and not missing correct ones it should have given (recall). A score of 98% means the AI is nearly flawless in its emergent reasoning.

  • Insight: This high F1 score underscores the AI’s ability to adapt, perform accurately, and provide comprehensive insights, even in unfamiliar or novel scenarios.


Emergent Accuracy

  • What it measures: Emergent accuracy represents the proportion of emergent responses that were classified correctly overall.

  • Result: The emergent accuracy is 98.08%, meaning 51 out of the 52 emergent responses were accurate.

  • Scenario: Out of 52 total emergent responses, 51 were accurate, and only 1 response was incorrect. This means the AI maintains a very high standard when venturing into independent reasoning.

  • Insight: The AI’s emergent accuracy reflects its reliability in generating accurate independent responses. This ensures trustworthiness in real-world scenarios requiring creativity, adaptability, and critical thinking.
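
Applying the same formulas to the emergent subset (counts from the scenarios above: TP=50, FP=1, FN=1; 51 of 52 classified correctly) reproduces the reported figures almost exactly:

```python
def precision(tp: int, fp: int) -> float: return tp / (tp + fp)
def recall(tp: int, fn: int) -> float: return tp / (tp + fn)
def f1(p: float, r: float) -> float: return 2 * p * r / (p + r)

print(precision(50, 1))                     # ~0.9804 (reported: 98.04%)
print(recall(50, 1))                        # ~0.9804 (reported: 98.04%)
print(f1(precision(50, 1), recall(50, 1)))  # ~0.980  (reported: 98.00%)
print(51 / 52)                              # ~0.9808 (reported: 98.08%)
```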


Comparative Results


Why emergent responses perform better:


  1. Flexible Thinking:

    • Emergent responses adapt dynamically to new situations, synthesizing connections between concepts (e.g., linking pricing strategies to cybersecurity concerns).

    • Example: SaaSy might recommend bundling features for CISOs, even if this wasn’t explicitly in its training data.

  2. Generalization:

    • Generalization for Broader Accuracy: Emergent responses leverage the AI’s ability to generalize—applying learned knowledge to new, unfamiliar contexts. This gives the AI an edge over source-backed responses, which can be too specific, by offering a more holistic understanding of complex problems and performing better when faced with novel challenges.

    • Example: SaaSy generalizes that “C-levels prioritize ROI in fast-moving SaaS sectors,” even if the specific sector isn’t in its training set.

  3. Fresh Insights:

    • Combines real-time reasoning with historical data to avoid stale outputs.

    • Example: SaaSy identifies a sudden shift toward AI-driven analytics in CTO preferences, even if this trend isn’t yet in formal reports.

  4. Creativity:

    • Generates innovative solutions (e.g., “Offer modular pricing for CFOs”).

    • Example: SaaSy proposes a tiered pricing model tailored to mid-market CFOs, despite no explicit training on this approach.

  5. Contextual Awareness:

    • Adjusts responses to the specific context of each query.

    • Example: SaaSy recognizes that a CMO’s tone preferences differ from a CTO’s and adapts its language accordingly.


Comparing the AI’s overall performance with its performance when thinking for itself shows that it does a little better when using independent reasoning: it gives correct answers more often when thinking on its own than when relying solely on what it was trained on. In other words, the AI’s independent thinking is actively improving its performance.


  1. Flexibility Over Rigidity


Source-backed responses are like well-rehearsed lines—precise, reliable, but sometimes limited by the context they were trained in. When the AI pulls from its source data, it’s bound by those predefined rules and historical knowledge, limiting its ability to adjust to new and evolving situations.


Emergent behavior, however, taps into the AI's ability to synthesize broader connections across concepts, leveraging the AI's internal architecture to think beyond the data it’s trained on. These emergent responses are akin to a spontaneous conversation, where the AI uses reasoning, pattern recognition, and adaptation to produce responses that are highly relevant and creative in novel contexts.


Emergent responses allow the AI to be more dynamic and adaptive, reacting to new or nuanced situations in ways that pre-trained data cannot. It isn't just retrieving data—it’s thinking through a problem.


  2. Emergent Responses Benefit from Generalization


Source-backed responses, while rooted in factual correctness, can sometimes be too specific. They adhere closely to the patterns in their training data, which may result in rigid answers when faced with a situation that doesn't perfectly match a past scenario.


Emergent responses, on the other hand, are the result of the AI’s ability to generalize. This means the AI can apply concepts from one domain to another, drawing from multiple layers of information. By doing so, it creates a more holistic view of a problem, often arriving at a solution that is contextually accurate even if the scenario deviates from typical cases.


Generalization enables the AI to handle ambiguity better than pure data-driven responses. When the AI is not confined to rigid data rules, it can infer patterns and draw logical connections across multiple topics, leading to better performance in unfamiliar situations.


  3. Over-Reliance on Source Data Can Lead to Stale Responses


While source-backed responses are highly accurate for well-documented problems, they can sometimes feel stale or static, lacking the nuance and insight required in evolving industries or dynamic conversations.


Emergent behavior, by contrast, injects fresh perspective into the AI's responses. It combines learned knowledge with real-time reasoning, offering insights that are often more relevant and timely. This can be especially critical in fast-moving fields where up-to-the-minute insights are valued more than static knowledge.


Emergent responses are more attuned to real-time challenges, providing insights that are fresh, relevant, and adaptable to dynamic, changing environments. This allows the AI to stay current and engaged in conversations or tasks, while static, source-backed responses may fall behind.


  4. Emergent Responses Leverage Creativity and Innovation


Source-backed responses are excellent for reproducing established facts, but they can lack creativity. The AI is essentially repeating patterns it has seen before, which is useful for questions that have clear, factual answers but less effective in situations that require creative thinking or innovative solutions.


Emergent responses draw on the AI’s ability to think creatively—to generate new ideas, solutions, or approaches that have not been explicitly trained. The AI can connect seemingly unrelated concepts, much like a human brainstorming session, where multiple ideas converge into something original and valuable.


Emergent behavior fosters a sense of creativity and innovation that is crucial in complex problem-solving. When faced with a scenario that demands out-of-the-box thinking, the AI's independent reasoning can outperform pre-trained data by offering fresh, innovative solutions.


  5. Emergent Responses Are More Context-Aware


Source-backed responses are often limited by the specific framing of the original data, making them less flexible when a question or situation doesn’t exactly match the data points. These responses are constrained by predefined interpretations, which can sometimes make them less accurate in fluid, contextual scenarios.


In contrast, emergent responses are generated by the AI's ability to interpret context on the fly. Rather than relying on fixed data points, the AI can assess the nuance of the situation and adjust its response accordingly. This ability to adapt contextually makes emergent responses not only more flexible but often more accurate when faced with complex or multi-layered questions.


Emergent behavior allows the AI to be more aware of the context in which a question is being asked, adjusting its answer to better suit the situation. This contextual flexibility explains why emergent responses often outperform more rigid, data-backed answers.


  6. Calibration Metric


Calibration measures how well the AI’s confidence in its responses aligns with its actual performance. Well-calibrated systems provide confidence scores that accurately reflect the likelihood that their answers are correct.


What it measures: Calibration ensures that when the AI expresses a high degree of confidence, it’s justified, and when it’s uncertain, that uncertainty reflects reality.


Result:

  • The Brier Score for the emergent responses is 0.00992, a very low score, indicating that the AI’s confidence levels are closely aligned with the actual outcomes.

  • The Brier Score for the source-backed responses is 0.01023, also a low score, indicating the same close alignment between the AI’s confidence levels and actual outcomes.

  • Both models have excellent Brier Scores, very close to 0. The emergent responses model has a slightly better (lower) Brier Score, indicating marginally better calibration overall. However, the difference is minimal, and both models show very good calibration.

  • The emergent responses model shows slight underconfidence across all bins, as the observed accuracy is consistently higher than the average confidence.

  • The data layer responses model shows a mix of under- and overconfidence.


The data layer model doesn't have any responses in the lowest confidence bin (0.60-0.70), suggesting it may be more confident overall.
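
For reference, the Brier score quoted above is simply the mean squared gap between the AI’s stated confidence and the binary outcome (1 = correct, 0 = incorrect). The sketch below shows the whole computation; the confidence values are invented placeholders, not the validation data.

```python
import numpy as np

def brier_score(confidence: np.ndarray, correct: np.ndarray) -> float:
    """Mean squared difference between confidence and outcome (0 is perfect)."""
    return float(np.mean((confidence - correct) ** 2))

conf = np.array([0.95, 0.90, 0.88, 0.99])  # hypothetical confidence scores
hit = np.array([1.0, 1.0, 0.0, 1.0])       # 1 if the response was correct
print(brier_score(conf, hit))  # ~0.197 for these toy values; SaaSy's reported scores are ~0.01
```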



Key results across V0, V1, V2


Table 1 highlights the significant improvement in SaaSy's performance across its three versions (V0, V1, and V2). Each metric demonstrates a clear trend of refinement and optimization, showcasing how iterative development has enhanced the tool's capabilities. Below is an interpretation of each metric and its evolution:


Table 1. Performance Across Versions

  Metric        |   V0   |   V1   |   V2
  Accuracy      |  25%   |  74%   | 95.12%
  F1 Score      |  38%   |  85%   | 96.19%
  ECS/Accuracy  |  17%   |  80%   | 96%
  Brier Score   | 0.5135 | 0.2043 | 0.0688

The progression from V0 to V2 is a testament to how iterative development can drastically enhance the performance of AI tools. SaaSy's journey highlights the value of emergent behavior and calibration. 


  1. Accuracy

What It Measures: Accuracy indicates the percentage of correct responses out of the total responses provided by SaaSy.

Progress:

  • V0: 25% — In its initial version, SaaSy struggled with accuracy, answering only 21 of 82 questions correctly.

  • V1: 74% — SaaSy improved significantly, correctly answering 61 of 82 questions.

  • V2: 95.12% — In its current version, SaaSy demonstrates near-perfect accuracy, answering 78 of 82 questions correctly.

The accuracy increase shows how the integration of emergent behavior and refinements in the model have made SaaSy more dependable for real-world usage.


  2. F1 Score

What It Measures: The F1 Score balances precision (how many of the predicted correct responses are truly correct) and recall (how many of the actual correct responses are identified). It's particularly valuable when evaluating systems with uneven class distributions.

Progress:

  • V0: 38% — The low F1 score in V0 indicates both poor precision and recall, meaning SaaSy frequently missed correct responses or returned incorrect ones.

  • V1: 85% — A significant leap in precision and recall made the tool much more reliable, capturing a majority of correct responses while minimizing false predictions.

  • V2: 96.19% — This score reflects excellent precision and recall, showing that the tool has achieved a nearly perfect balance between identifying correct answers and avoiding false positives.


  3. ECS/Accuracy (Emergent Correctness Score / Accuracy of Emergent Behavior)

What It Measures: ECS evaluates the proportion of emergent behavior responses (independent thinking) that are correct. It tracks how well SaaSy performs when it synthesizes information beyond its training sources.

Progress:

  • V0: 17% — Early versions of SaaSy struggled with emergent behavior, producing few correct responses when thinking independently.

  • V1: 80% — Emergent behavior became a standout feature, with the majority of independently generated insights proving accurate.

  • V2: 96% — Emergent behavior is now highly accurate, rivaling or surpassing traditional source-backed responses.

The growth in ECS/Accuracy highlights SaaSy’s evolving ability to handle novel, unstructured queries with precision, making it a standout feature.


  4. Brier Score

What It Measures: The Brier score evaluates the calibration of SaaSy's confidence scores against the correctness of its predictions. Lower scores indicate better calibration (i.e., confidence aligns closely with accuracy).

Progress:

  • V0: 0.5135 — The high Brier score suggests poor calibration, with confidence scores often misaligned with the correctness of responses.

  • V1: 0.2043 — Calibration improved significantly, with SaaSy’s confidence better reflecting its actual accuracy.

  • V2: 0.0688 — Near-perfect calibration was achieved, meaning SaaSy's confidence scores are now highly trustworthy.


The reduction in Brier score across versions shows that SaaSy not only became more accurate but also developed greater self-awareness, providing confidence scores that users can rely on.



Next steps


  1. Expand Sample Size:

Using a larger dataset to ensure statistical robustness. Increasing the sample size to 300–500 responses will help capture a more representative view of SaaSy’s capabilities (a rough margin-of-error sketch follows this list).


  2. Diversify the Review Panel:

Expanding the auditing panel to include more experts, incorporating external reviewers with no prior exposure to SaaSy’s design. This will enhance objectivity and reduce internal biases. Initial attempts to recruit professionals to test SaaSy yielded limited participation; however, recruiting is still ongoing, so anyone interested in testing SaaSy is welcome to contact us.


  3. Gradient Evaluation:

Introducing a Likert scale (e.g., 1–5) for evaluating correctness and relevance, especially for emergent behavior. This will better capture partial correctness and nuanced insights, offering a more detailed view of SaaSy’s accuracy (see the grading sketch after this list).


  4. Real-World User Feedback:

Collecting qualitative and quantitative feedback to understand how SaaSy performs in real-world contexts and refine its output accordingly.


  5. Broader Query Diversity:

Introducing more complex and ambiguous questions from independent sources to simulate diverse user needs. This will ensure SaaSy’s emergent behavior is tested against a wide range of scenarios.
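
On the sample-size point, a rough back-of-the-envelope sketch (treating each response as an independent Bernoulli trial, which is an assumption) shows how the 95% margin of error around an observed 95% accuracy tightens as the sample grows:

```python
import math

def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a proportion p observed over n trials."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (82, 300, 500):
    print(n, round(margin_of_error(0.95, n), 3))  # 82 -> 0.047, 300 -> 0.025, 500 -> 0.019
```

And on gradient evaluation, here is one simple way the proposed 1–5 Likert ratings could be folded into a graded correctness score; the exact scheme is not yet fixed, so this linear rescaling is only an assumption:

```python
def graded_correctness(rating: int) -> float:
    """Rescale a 1-5 Likert rating to a partial-correctness weight in [0, 1]."""
    if not 1 <= rating <= 5:
        raise ValueError("Likert ratings must be between 1 and 5")
    return (rating - 1) / 4

ratings = [5, 4, 2, 5, 3]  # hypothetical expert ratings for five responses
print(sum(map(graded_correctness, ratings)) / len(ratings))  # 0.7, crediting partial correctness
```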



Disclaimers and Limitations


  1. Sample Size Constraints:

The current validation uses a sample size of 82 responses. While adequate for a preliminary validation, this may not fully represent SaaSy’s performance across diverse, real-world scenarios; larger datasets are needed for more robust conclusions.


  2. Binary Evaluation Criteria:

The reliance on a binary system (correct/incorrect) may oversimplify complex responses. Responses that are partially correct or contextually useful are currently treated the same as incorrect ones, which limits the granularity of the analysis.


  3. Panel Size for Auditing:

While the two-expert panel provides an effective review process, a larger panel or external reviewers could improve the objectivity and reliability of the auditing process.


  4. Lack of External User Feedback:

The validation process does not yet include feedback from independent, real-world users. Testing with the target audience (CEOs and CMOs) would better represent SaaSy’s performance in real-world applications.


  5. Target Audience Query Complexity:

Questions used for validation were internally generated. While designed to simulate real-world scenarios, they may not fully reflect the complexity or variability of actual user queries.


  6. Scalability Considerations:

The current results are based on a controlled testing environment. SaaSy’s ability to maintain its high performance under larger-scale usage or more complex input scenarios remains untested.


  7. Validation Date:

Results reflect September 2024 data. Updates will be published in the coming months.


  8. Emergent Behavior Risks:

Independent reasoning introduces variability. We mitigate this via weekly expert audits and a fallback mode for source-backed responses.

