AI Gone Wrong? This Guy's Job Is to Catch It Before It Happens

by George Anadiotis, April 12th, 2025

As enterprise systems shift from deterministic software to probabilistic AI, forward-thinking organizations leverage proactive quality assessment to maximize value, minimize risk and ensure regulatory compliance


In today’s rapidly evolving technological landscape, ensuring the quality of both traditional software and AI systems has become more critical than ever. Organizations are increasingly relying on complex digital systems to drive innovation and maintain competitive advantage, yet many struggle to effectively evaluate these systems before, during, and after deployment.


Yiannis Kanellopoulos is on the forefront of software and AI system quality assessment. He is the founder and CEO of code4thought, a startup specializing in assessing large-scale software systems and AI applications.


We connected to explore the challenges and opportunities of quality assessment and share insights both for developers and for people responsible for technology decisions within their organizations.

The Two Worlds of System Evaluation

As Kanellopoulos explained, there’s a fundamental difference between traditional software systems and AI applications:


“In the first line of work, we deal with systems that have millions of lines of code behind them and deterministic behavior. And in the other line, we deal with systems that have way less lines of code, but tend to make probabilistic decisions.”


This distinction is crucial for understanding the different approaches needed to evaluate each type of system. Traditional software, however complex, operates according to explicitly programmed rules. AI systems such as machine learning and generative AI applications make decisions based on patterns learned from data, leading to non-deterministic outcomes. Furthermore, the consequences when an AI system misbehaves can be far more severe.
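
To make the contrast concrete, here is a toy sketch (entirely hypothetical, not drawn from code4thought's work): a hard-coded rule returns the same decision for the same input every time, while a learned model returns a probability that shifts whenever its weights are re-learned from new data.

```python
import math

def approve_loan_rule_based(income: float, debt: float) -> bool:
    """Deterministic: an explicitly programmed rule, same input -> same output."""
    return income > 3 * debt

def approve_loan_model(features: list[float], weights: list[float]) -> float:
    """Probabilistic: a toy logistic model returns a score, not a yes/no.
    Its behavior changes whenever the weights are retrained on new data."""
    z = sum(w * x for w, x in zip(weights, features))
    return 1 / (1 + math.exp(-z))  # probability of approval, not a fixed decision
```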


Kanellopoulos used an example from the banking sector to highlight this: assessing the quality of a core banking system vs. assessing credit scoring risks. If a core banking system goes down, this will create disruption, but things will eventually go back to normal. But if a credit risk algorithm exhibits biases, this poses a systemic risk with potentially far-reaching consequences.

From Reactive to Proactive Quality Management

Historically, many organizations have taken a reactive approach to software quality, only investing in assessment when facing significant problems. As Kanellopoulos noted:


“The majority of our projects started when clients were facing issues with their investments on their systems. If you spend a significant amount of money on a project, and the system is not delivering the value it was expected to, or the project is behind schedule or never goes to production. These were the first cases that clients were calling us and asking – can you help? Can you tell us what’s going on here? What do we need to fix? How much time will it take? How much effort?”

From Assessment to Partnership

But the industry is maturing. Forward-thinking organizations now understand that assessing quality from the early stages of development is more cost-effective than trying to fix problems after deployment. This shift toward proactive quality management represents a significant evolution in how businesses approach software development.


Detecting and fixing issues during the design phase is significantly less expensive than addressing them post-deployment. But it’s not just the economics of making software quality a priority that is compelling. Beyond cost savings, early quality assessment also reduces time-to-market by preventing rework and helps ensure that the final product meets business requirements and user expectations.

Kanellopoulos believes that what sets successful quality initiatives apart is the transition from one-off assessments to ongoing partnerships. From the beginning, he noted, clients wanted code4thought to do more than a one-off evaluation: they wanted the company to take ownership of, and accountability for, its recommendations. This collaborative approach ensures that quality improvements actually get implemented.

Metrics and Standards for Software Quality

To provide a structured approach to software quality assessment, code4thought relies on standards. Kanellopoulos mentioned guidelines from organizations such as NIST and IEEE, as well as the CMMI Maturity model. code4thought uses the ISO 25010 standard, which defines key characteristics that high-quality software systems should exhibit, such as Maintainability, Security, Portability, and Performance efficiency.


Within these broad categories, more specific sub-characteristics help teams evaluate particular aspects of software quality. For example, maintainability encompasses:

  • Analyzability: How easily can developers understand the code?
  • Changeability: How efficiently can modifications be made?
  • Testability: How effectively can changes be verified?
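
As a rough illustration of how such sub-characteristics can be turned into measurable signals, the sketch below computes two crude proxies (unit size and branching) with Python's standard ast module. These metrics are invented for illustration and are not code4thought's or SIG's actual quality model.

```python
import ast

def unit_metrics(source: str) -> list[dict]:
    """Per-function length and rough branch count for a Python module."""
    tree = ast.parse(source)
    results = []
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            results.append({
                "function": node.name,
                "lines": node.end_lineno - node.lineno + 1,   # proxy for analyzability
                "branches": sum(isinstance(n, (ast.If, ast.For, ast.While, ast.Try))
                                for n in ast.walk(node)),     # proxy for changeability/testability
            })
    return results

if __name__ == "__main__":
    sample = "def f(x):\n    if x > 0:\n        return x\n    return -x\n"
    print(unit_metrics(sample))   # [{'function': 'f', 'lines': 4, 'branches': 1}]
```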

Research Foundations

code4thought uses a tool called Sigrid, developed by their Dutch partner, Software Improvement Group (SIG). The approach and the software have their roots in academic research, and the connection is deeper than a case of licensing the software. Kanellopoulos has a PhD in Software Quality, used to work for SIG, and was also involved in the development of Sigrid.


As he noted, this is a direct result of R&D and of work that was done together with academic institutions. Both code4thought and SIG maintain ties with academia and research. This metrics- and artifact-based approach ensures that quality assessments are based on validated results rather than subjective opinions.

From Code to Communication

One of the challenges in software quality assessment is translating technical metrics into meaningful insights for different stakeholders. Doing so requires a combination of appropriate KPIs, effective communication skills and good judgement. As Kanellopoulos explained:


“Using those metrics and certain thresholds, we map them to high-level concepts that help us talk to management. These people don’t necessarily need the details, but they want to gain a good idea of the big picture.”


This approach enables both technical teams and business leaders to understand quality issues and make informed decisions about how to address them.
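
The sketch below shows the general pattern of such a mapping: raw metric values are bucketed by thresholds into a simple star rating that management can read at a glance. The thresholds are invented for illustration; they are not the calibrated values code4thought or SIG actually use.

```python
def star_rating(value: float, thresholds: list[float]) -> int:
    """Map a metric where lower is better onto a 1-5 star scale."""
    stars = 5
    for limit in sorted(thresholds):   # each threshold crossed costs one star
        if value > limit:
            stars -= 1
    return max(stars, 1)

# e.g. percentage of code living in functions longer than 50 lines
oversized_code_pct = 18.0
print(star_rating(oversized_code_pct, thresholds=[5, 10, 20, 40]))  # -> 3
```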



Integration with Development Workflows

Modern software quality assessment tools can be integrated directly into continuous integration/continuous deployment (CI/CD) pipelines, providing real-time feedback on code quality. So whenever there is a commit, the code can be analyzed on the fly.


This integration enables teams to address quality issues as part of their regular development process rather than treating quality assessment as a separate, occasional activity. Even though Sigrid is SaaS, Kanellopoulos noted that the code never leaves the client’s premises.
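
As a sketch of what a commit-time quality gate can look like (this shows only the pattern, not Sigrid's actual integration), the script below analyzes the Python files touched by the latest commit and returns a non-zero exit code, failing the CI job, when a simple threshold is exceeded.

```python
import ast
import subprocess
import sys

MAX_OVERSIZED_FUNCTIONS = 0  # hypothetical quality-gate threshold

def changed_python_files() -> list[str]:
    """Files modified in the latest commit, according to git."""
    out = subprocess.run(["git", "diff", "--name-only", "HEAD~1"],
                         capture_output=True, text=True, check=True).stdout
    return [f for f in out.splitlines() if f.endswith(".py")]

def oversized_functions(path: str) -> int:
    """Stand-in analysis: count functions longer than 50 lines."""
    tree = ast.parse(open(path, encoding="utf-8").read())
    return sum(1 for n in ast.walk(tree)
               if isinstance(n, ast.FunctionDef)
               and (n.end_lineno - n.lineno + 1) > 50)

if __name__ == "__main__":
    issues = sum(oversized_functions(f) for f in changed_python_files())
    print(f"oversized functions introduced: {issues}")
    sys.exit(1 if issues > MAX_OVERSIZED_FUNCTIONS else 0)
```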

The Challenge of AI System Evaluation

AI systems present unique evaluation challenges compared to traditional software. These systems have many more “moving parts”:

  • Data and different datasets
  • Machine learning models and weights
  • Software components
  • Deployment infrastructure


This complexity requires a different evaluation approach that accounts for the non-deterministic nature of AI systems and their heavy dependence on data quality.

Containment and Versioning

code4thought has developed its own AI evaluation platform, iQ4AI. Initially, it focused on traditional machine learning classification problems. For these systems, evaluation typically focuses on performance metrics and potential biases in the model’s decisions.
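
For a classification model, the performance side of such an evaluation can be as simple as the sketch below, which applies scikit-learn's standard metrics to a held-out test set; this is illustrative only, not iQ4AI's code.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # held-out test labels (toy data)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # the model's predictions on the same rows

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
```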

When evaluating AI systems, precise versioning is essential:


“When we work with a client, we ask for the testing data they have themselves to test their models, and we try also to map the version of the model with a version of the dataset and the audit that we do; everything gets a time stamp.”


This versioning approach creates a “contained” evaluation that provides a clear snapshot of system performance at a specific point in time.
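
A minimal sketch of that idea, with hypothetical file names: fingerprint the model artifact and the client-provided test dataset, then time-stamp the audit record so every result can be traced back to exactly what was evaluated.

```python
import hashlib
import json
from datetime import datetime, timezone

def sha256_of(path: str) -> str:
    """Content hash used as a version fingerprint for an artifact."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

audit_record = {
    "model_version": sha256_of("model.pkl"),        # hypothetical model artifact
    "dataset_version": sha256_of("test_data.csv"),  # hypothetical test dataset
    "metrics": {"accuracy": 0.91},                  # placeholder results of this audit run
    "timestamp": datetime.now(timezone.utc).isoformat(),
}

with open("audit_log.jsonl", "a", encoding="utf-8") as log:
    log.write(json.dumps(audit_record) + "\n")
```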

Drift and Continuous Monitoring vs. One-Time Audits

Unlike traditional software, AI systems can change behavior rapidly as new data flows in, even without explicit code changes. One of the primary concerns with deployed machine learning systems is “drift” – changes in model behavior over time due to evolving input data:


“We tend to monitor the behavior of the model… by taking a series of statistics and metrics related to the performance of the model. That’s how we try to identify the so-called drift.”


Detecting drift early allows organizations to retrain or adjust their models before performance degrades significantly. This makes ongoing monitoring essential:


“The challenge here is that the AI system changes, evolves continuously… These things tend to change more drastically and more rapidly and more often. Monitoring their behavior by doing one audit per year, for example, may give a sense of control or assurance but is not enough.”
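
One common way to operationalize such monitoring (not necessarily the statistics code4thought uses) is to compare the distribution of a model input or output score in production against a reference sample, for example with a two-sample Kolmogorov-Smirnov test:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference_scores = rng.normal(loc=0.60, scale=0.10, size=5_000)   # scores at validation time
production_scores = rng.normal(loc=0.52, scale=0.12, size=5_000)  # scores on recent traffic

statistic, p_value = ks_2samp(reference_scores, production_scores)
if p_value < 0.01:
    print(f"Possible drift detected (KS statistic = {statistic:.3f})")
else:
    print("No significant distribution shift detected")
```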

Regulation as a Driver for AI Quality Assessment

In contrast to software system quality, most clients are at this point more interested in one-off AI audits than in continuous monitoring. But a shift in this approach is under way.


Regulatory frameworks like the EU AI Act and the New York City Bias Audit Law are driving increased interest in AI evaluation. code4thought’s clients include organizations that need to comply with legislation like the NYC Bias Audit Law, under which employers who use automated employment decision tools must audit them for bias, publish the results on their websites, and notify employees and job candidates that such tools are being used.
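
The core calculation behind such an audit is easy to illustrate: selection rates per demographic group and impact ratios, each group's rate divided by the highest group's rate. The numbers below are invented, and a real audit involves considerably more than this sketch.

```python
selected = {"group_a": 120, "group_b": 45}    # candidates advanced by the tool (toy data)
assessed = {"group_a": 400, "group_b": 250}   # candidates assessed per group (toy data)

rates = {g: selected[g] / assessed[g] for g in assessed}
best = max(rates.values())

for group, rate in rates.items():
    ratio = rate / best
    flag = "review" if ratio < 0.8 else "ok"   # 0.8 echoes the common four-fifths rule of thumb
    print(f"{group}: selection rate = {rate:.2f}, impact ratio = {ratio:.2f} ({flag})")
```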


These regulations are turning what was once considered “exotic” into a standard business requirement. The proliferation of AI models and applications, and of the startups that develop them, also means that AI due diligence is another line of business: clients interested in acquiring AI startups need to have those startups’ AI systems evaluated.

Assessment vs. Audit

There’s an important distinction to be made between audits and assessments:


“I don’t like the word auditor… because usually it means something which is very dry, very formula-based… and doesn’t leave room for creativity or for going out of the perimeter and doing some more work because you think it will help your client,” said Kanellopoulos.


This perspective highlights the value of a more comprehensive assessment approach that goes beyond compliance checkboxes to provide genuine business value. At the same time, however, Kanellopoulos acknowledges that regulatory compliance is a driver for adopting quality assurance policies, which can lead to significant benefits.

Evaluating Generative AI

When it comes to generative AI applications, evaluation is meaningful for systems that go beyond simple API calls to foundation models. The clients code4thought deals with are typically using RAG (Retrieval-Augmented Generation) on top of an LLM, usually OpenAI’s. Besides the retrieval part, there is usually additional logic and a user interface that presents information to the user.


These more complex architectures require evaluation approaches that examine each layer of the system. code4thought’s approach is to focus on the fundamentals. Framing GenAI as an NLP application, iQ4AI relies on metrics from that domain, and focuses on embeddings.
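
As one example of an embedding-based check (illustrative only, not code4thought's published methodology), the sketch below measures how close a generated answer is to the retrieved context it is supposed to be grounded in, using the sentence-transformers library; a low similarity is a signal to inspect the answer for unsupported content.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

retrieved_context = "The refund policy allows returns within 30 days of purchase."
generated_answer = "Customers can return items for a refund up to 30 days after buying them."

# Embed both texts and compare them in embedding space
embeddings = model.encode([retrieved_context, generated_answer], convert_to_tensor=True)
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()

print(f"cosine similarity (answer vs. context): {similarity:.2f}")
```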

Kanellopoulos noted that the methodology they use is published, and they welcome feedback and collaboration. He believes in a community-driven approach to GenAI evaluation, because as adoption grows, so does the need for robust evaluation.

The Need for AI Literacy

The explosion of generative AI tools like ChatGPT has dramatically increased awareness of AI, but not necessarily understanding. This can lead to misapplication of AI technologies and unrealistic expectations.


“We recently had a meeting with a client, and it was obvious to me that all the people, when talking about AI, they meant GenAI. And everything else [for them] is simply software,” said Kanellopoulos.


Kanellopoulos thinks that the relatively recent mainstream adoption of AI technologies contributes to this knowledge gap:


“All the people who are my age and above, we’ve never seen while we were studying or in our first years at work, real-life AI applications. That might explain why we see this lack of literacy in people – even software engineers. On the other hand, I think it’s never too late to start training your people in the proper way.”


There are six pillars of AI Literacy, with ‘Create’ showing a significant, positive effect on all others

Organizations need to invest in AI literacy programs to ensure that decision-makers understand the appropriate applications and limitations of different AI approaches.

The Intersection of Software Engineering and GenAI

Being someone who works on the intersection of software engineering and AI, Kanellopoulos offers a nuanced perspective on tools like GitHub Copilot that have sparked excitement about AI-powered coding:


“I think that GenAI is a great tool to help developers document code, or understand code. But I don’t necessarily see them as productivity tools. When I was doing my PhD, which was in the area of program comprehension, there were statistics showing that 90% of the time of a developer is spent on reading and understanding code.”


While AI can accelerate certain aspects of development, it’s important to recognize that writing code is just one part of the software development lifecycle. The rest is communicating with people, analysis, design, and testing. Optimizing the time you spend writing code doesn’t mean that you optimize the whole software development lifecycle, Kanellopoulos noted.

The long-term effects of AI Code Generation

A crucial consideration for AI-generated code is long-term maintainability:


“What happens to this part of code we generate via Copilot, if now we need to change it? We need to support a new feature or we found a bug and it needs fixing… Who’s going to do it? Are you going to give it another prompt to compile or are you going to do it yourself?”


This question highlights the need to consider the entire software lifecycle when evaluating the benefits of AI-assisted development. In addition to questions about the quality of the AI-generated code, there are also questions about the mid- to long-term effects on software developers.


Could loosening, or removing, ownership of the code lead to skill degradation? How will a generation of developers used to working with tools like Copilot get the experience and insights needed to make critical architectural decisions, or develop innovative solutions? AI-assisted code generation is an evolving practice with many aspects that require thoughtful evaluation.

A Balanced Approach to Quality and Innovation

The conversation with Yiannis Kanellopoulos underscores the importance of robust quality assessment for both traditional software and AI systems. As organizations increasingly rely on complex digital systems, they need structured approaches to ensure these systems deliver the expected value safely and reliably.


For executives and decision-makers, the key takeaways include:

🚀 Invest early in quality: Assessing quality from the design phase is more cost-effective than fixing problems after deployment.

📊 Use established standards: Frameworks like ISO 25010 provide structured approaches to evaluating software quality.

♻️ Consider the full lifecycle: Quality isn’t just about the initial deployment but also about long-term maintainability and adaptability.

⚖️ Prepare for regulation: Emerging regulatory frameworks will likely make formal AI evaluation mandatory for many applications.

🧠 Invest in AI literacy: Ensure your team understands the appropriate applications and limitations of different AI approaches.


By taking a balanced, informed approach to software and AI quality, organizations can harness the power of these technologies while mitigating risks and ensuring sustainable value creation.


As Kanellopoulos aptly noted regarding AI tools for software development: “It needs careful design… The software developer profession is going to change drastically. The point is how we adapt, how we reap the benefits of tools like Copilot, and how we can maintain trust in the resulting code.”


The same principle applies to all technological adoption: success lies not in blindly embracing new technologies but in thoughtfully integrating them into well-designed processes that maintain quality and trust.


For more stories like this, subscribe to the Orchestrate all the Things newsletter:

https://linkeddataorchestration.com/orchestrate-all-the-things/newsletter/