LLM applications are frustratingly difficult to test because their output is probabilistic. However, testing is crucial for customer-facing applications to ensure the reliability of generated answers. So, how does one effectively test an LLM app?

Enter Confident AI's DeepEval: a comprehensive open-source LLM evaluation framework with excellent developer experience.

Key features of DeepEval:
- Ease of use: Very similar to writing unit tests with pytest.
- Comprehensive suite of metrics: 14+ research-backed metrics for relevancy, hallucination, etc., including label-less standard metrics, which can quantify your bot's performance even without labeled ground truth! All you need is the input and output from the bot. See the list of metrics and required data in the image below!
- Custom metrics: Tailor your evaluation process by defining your own metrics as your business requires.
- Synthetic data generator: Create an evaluation dataset synthetically to bootstrap your tests.

My recommendations for LLM evaluation:
- Use OpenAI GPT-4 as the metric model as much as possible.
- Test dataset generation: Use the DeepEval Synthesizer to generate a comprehensive set of realistic questions!
- Bulk evaluation: If you are running multiple metrics on multiple questions, generate the responses once, store them in a pandas data frame, and calculate all the metrics in bulk with parallelization.
- Quantify hallucination: I love the faithfulness metric, which indicates how much of the generated output is factually consistent with the context provided by the retriever in RAG!
- CI/CD: Run these tests automatically in your CI/CD pipeline to ensure that no code change or prompt change breaks anything.
- Guardrails: Some high-speed tests can be run on every API call in a post-processor before responding to the user. Leave the slower tests for CI/CD.
🌟 DeepEval GitHub: https://lnkd.in/g9VzqPqZ 🔗 DeepEval Bulk evaluation: https://lnkd.in/g8DQ9JAh Let me know in the comments if you have other ways to test LLM output systematically! Follow me for more tips on building successful ML and LLM products! Medium: https://lnkd.in/g2jAJn5 X: https://lnkd.in/g_JbKEkM #generativeai #llm #nlp #artificialintelligence #mlops #llmops
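The bulk-evaluation recommendation above (generate responses once, then score them with every metric in parallel) can be sketched with the standard library alone. Everything here is a stand-in chosen for illustration: `fake_bot`, `relevancy`, and `faithfulness` are hypothetical toy functions based on word overlap, not DeepEval metrics or APIs, and a plain list of dicts takes the place of the pandas data frame.

```python
from concurrent.futures import ThreadPoolExecutor

QUESTIONS = ["how long is the refund window", "do you ship abroad"]
CONTEXTS = [
    "our refund window is 30 days from delivery",
    "we ship to the eu and the uk",
]

def fake_bot(question):
    """Hypothetical stand-in for the LLM app under test."""
    canned = {
        "how long is the refund window": "the refund window is 30 days",
        "do you ship abroad": "i like turtles",
    }
    return canned[question]

def relevancy(row):
    """Toy metric: share of question words that reappear in the answer."""
    q, a = set(row["question"].split()), set(row["answer"].split())
    return len(q & a) / len(q)

def faithfulness(row):
    """Toy metric: share of answer words supported by the retrieved context."""
    a, c = set(row["answer"].split()), set(row["context"].split())
    return len(a & c) / len(a)

METRICS = {"relevancy": relevancy, "faithfulness": faithfulness}

def evaluate_in_bulk(questions, contexts):
    # Step 1: call the bot exactly once per question and keep the output.
    rows = [
        {"question": q, "context": c, "answer": fake_bot(q)}
        for q, c in zip(questions, contexts)
    ]
    # Step 2: score every (row, metric) pair in parallel.
    jobs = [(i, name) for i in range(len(rows)) for name in METRICS]
    with ThreadPoolExecutor(max_workers=4) as pool:
        scores = pool.map(lambda job: METRICS[job[1]](rows[job[0]]), jobs)
        for (i, name), score in zip(jobs, scores):
            rows[i][name] = round(score, 2)
    return rows

if __name__ == "__main__":
    for row in evaluate_in_bulk(QUESTIONS, CONTEXTS):
        print(row["question"], row["relevancy"], row["faithfulness"])
```

Threads are used on the assumption that real metrics spend most of their time waiting on a model API; the same structure should carry over once the toy metrics are swapped for real ones.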
Automated Testing Frameworks
Explore the best LinkedIn content from expert professionals.
-
-
The first open-source implementation of the paper that will change automatic test generation is now available!

In February, Meta published a paper introducing a tool to automatically increase test coverage, guaranteeing improvements over an existing code base. This is a big deal, but Meta didn't release the code.

Fortunately, we now have Cover-Agent, an open-source tool you can install that implements Meta's paper to generate unit tests automatically: https://lnkd.in/eCitDjin

I recorded a quick video showing Cover-Agent in action. There are two things I want to mention:

1. Automatically generating unit tests is not new, but doing it right is difficult. If you ask ChatGPT to do it, you'll get duplicate, non-working, and meaningless tests that don't improve your code. Meta's solution only generates unique tests that run and increase code coverage.

2. People who write tests before writing the code (TDD) will find this less helpful. That's okay. Not everyone does TDD, but we all need to improve test coverage.

There are many good and bad applications of AI, but this is one I'm looking forward to making part of my life.
-
The AI Coding Revolution Is Here, But Are We Testing for It?

As AI-assisted development reshapes how we build software, I've been thinking a lot about something that is talked about often but doesn't always get the focus it deserves: automated testing.

At JPMorganChase, we're embracing AI coding tools to accelerate delivery, reduce toil, and empower our teams to focus on the work that matters, reducing cognitive load of repetitive tasks. But speed without safety is just risk in disguise.

Here's what I believe every leader (and this is broader than technology) needs to consider right now:

• AI writes code faster than humans can review it manually. If your testing strategy is still largely manual, you're already behind. AI-generated code can introduce subtle logic errors, security vulnerabilities, or edge-case failures that look perfectly reasonable on the surface. Automated testing is no longer a best practice, it's a non-negotiable safeguard.

• Test coverage is your new quality contract. When AI is your co-developer, the test suite becomes the specification. If you can't describe expected behavior in a test, you can't trust what the AI builds. Investing in robust unit, integration, and regression testing frameworks is investing in the integrity of your entire delivery pipeline.

• Shift-left testing amplifies AI's value. It doesn't slow it down. Some worry that rigorous testing will negate the speed gains from AI coding. The opposite is true. When automated tests are embedded early in the development lifecycle, AI tools can iterate faster, self-correct, and validate outputs in real time. Testing enables velocity; it doesn't constrain it.

• Your teams need to evolve alongside the tools. The best teams of tomorrow won't just write code. They'll architect test strategies, evaluate AI outputs critically, and build systems that are observable and verifiable by design. We owe it to our teams to invest in this skill evolution now.
At the scale we operate, serving millions of customers, the cost of a defect isn't just technical. It's trust. And trust, once broken, is hard to rebuild. AI is a force multiplier. But multiplying without a strong foundation multiplies risk just as fast as it multiplies output. Build fast. Test smarter. Ship with confidence. I'd love to hear how other leaders are thinking about quality engineering in the age of AI. What's working for your teams? #AIEngineering #SoftwareTesting
-
MIT: 95% of Gen-AI pilots are failing. Here’s what the 5% winners do differently (steal this):

1. Start with work, not with models. Winners redesign jobs and workflows before they ship bots. Tooling follows a process, not the other way around.
2. Tie “individual value” to “org value.” If employees don’t feel AI making their work easier, the org won’t see returns. Make competence, autonomy, and collaboration the first-class metrics.
3. Go narrow, then scale. Document a few repeatable use cases (claims triage, reconciliation, collections) with unit-economics, then templatize. IBM and others have long warned: the hard part is scaling, not the POC.
4. Measure real productivity lift. In support, gen-AI has shown ~14% productivity gains at scale, with the biggest boost for junior reps. Instrument your pilots to prove (or kill) value fast.
5. Invest in org learning. Top performers combine AI learning with organizational learning, training, feedback loops, change management: not just prompt libraries.
6. Data + governance ≫ model of the month. Most stalls are from data quality, integration, and risk controls, not “we need the newest model.” Treat AI as infrastructure (monitoring, access, privacy), not a feature.

A ruthless pilot checklist (copy/paste it):
- Clear problem owner with P&L accountability
- Baseline + target unit economics (AHT, defect rate, $/ticket…)
- Change plan: job redesign, SOPs, training, incentives
- Observability: evals, drift, hallucination gates, feedback loops
- Scale plan: integration into systems of record; security sign-off

If your AI pilot isn’t changing how work is done, it’s not a pilot, it’s a demo with better lighting.
-
Been experimenting with AI tools in testing for a while now. Here's what I'm seeing in the real world.

Where AI is genuinely helping:
- Locator generation - Tools analyzing your app and suggesting stable locators. Saves hours compared to manual inspection. Example: Instead of spending 20 mins finding the perfect CSS/XPath, AI suggests 5 options in seconds with stability scores.
- Test code generation - Writing boilerplate test cases from user stories or requirements. Not perfect, but gets you 70% there. You still need to review and fix, but it's faster than starting from scratch.
- Analyzing test failures - AI reading stack traces and logs to pinpoint why tests failed. Instead of digging through 500 lines of logs, it tells you "API timeout on line 47" in 10 seconds.
- Visual testing at scale - Catching UI changes across browsers/devices that humans might miss.
- Test data generation - Creating realistic test data for different scenarios. Need 100 test users with valid emails, phone numbers, addresses? Done in seconds.

Where AI is overpromised and underdelivering:
- "AI will write all your tests" - Nope. It writes basic happy path tests. Edge cases? Complex business logic? Still needs human brains.
- "No-code test automation" - Sounds great until the AI-generated test breaks and you can't debug it because you don't understand the code it wrote.
- Self-healing tests - Yes, it can update some selectors automatically. But it also "fixes" tests that should actually fail, hiding real bugs.
- 100% accurate defect prediction - AI says "this area is risky" based on code changes. Sometimes right, often wrong. Don't skip testing based on AI predictions alone.
- Replacing manual exploratory testing - AI follows patterns. Humans find weird unexpected bugs.

Real examples from my experience:
- Win: Used AI to convert 50 manual test cases into automation scripts. Took 3 hours instead of 3 days. Still spent 4 hours reviewing and fixing.
- Fail: Tried "AI-powered" test maintenance tool. It auto-updated 30 tests after a UI change. 22 were correct. 8 were broken and I didn't notice for 2 days. Lost time debugging those false positives.
- Win: AI analyzing our failed test suite every morning. Started getting Slack messages like "12 tests failed due to database connection timeout, not code issues."
- Fail: Spent $$/month on an AI tool that "predicts which tests to run." Ran the wrong tests, missed critical bugs.

My honest take: AI is a tool, not magic.

Use it for:
- Repetitive boring tasks (updating selectors, generating data)
- First draft of test scripts (but YOU review)
- Analyzing large amounts of data (logs, failures, patterns)

Don't use it for:
- Final decision making on test coverage
- Replacing your understanding of the application
- Skipping code reviews of AI-generated tests
- Blindly trusting "self-healing" without verification

Bottom line: AI saves me about 20-30% time on specific tasks. You still need to know testing, understand your app, and think critically.

#AIInTesting
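The morning failure summary mentioned above ("12 tests failed due to database connection timeout, not code issues") can be approximated with a small triage script. This is a rule-based stand-in for the AI step, not the tool from the post: the regex rules, test names, and log lines are all made up for illustration.

```python
import re
from collections import Counter

# Hypothetical root-cause rules, checked in order; the first match wins.
RULES = [
    ("database connection timeout", re.compile(r"connection .*timed out", re.I)),
    ("locator not found", re.compile(r"NoSuchElementException")),
    ("assertion failure, likely a real code issue", re.compile(r"AssertionError")),
]

# Hypothetical failures from one nightly run: test name -> last log line.
FAILURES = {
    "test_checkout": "OperationalError: connection to server timed out",
    "test_login": "OperationalError: connection to server timed out",
    "test_search": "AssertionError: expected 10 results, got 9",
    "test_profile": "NoSuchElementException: unable to locate element #save",
    "test_export": "Segmentation fault",
}

def classify_failure(log_text):
    """Return the first matching root cause, or 'unclassified'."""
    for cause, pattern in RULES:
        if pattern.search(log_text):
            return cause
    return "unclassified"

def count_causes(failures):
    """Count failed tests per root cause."""
    return Counter(classify_failure(text) for text in failures.values())

def summarize(failures):
    """Build one summary line per cause, most frequent first."""
    counts = count_causes(failures)
    return [f"{n} test(s) failed: {cause}" for cause, n in counts.most_common()]

if __name__ == "__main__":
    for line in summarize(FAILURES):
        print(line)
```

Anything that lands in "unclassified" still needs a human to read the log, in line with the post's point that the final judgment stays with the tester.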
-
How to start verifying a simple RTL design (Ex. Memory Controller) in UVM --> 1st Part:

The very first step is to understand the specification thoroughly, including each of the details below:
a. Working principle
b. I/O ports and signal-level description
c. Submodule details
d. Whether any signal depends on another, etc.
e. Design detail understanding

The next step is to build the Verification Plan and the Feature / Testcase plan. List the features that need to be verified. Some of them are below:
✓ Check whether clock and reset are working correctly.
✓ Check the correctness of the data by writing a value to a single register and reading the same value back.
✓ Repeat the process for all registers.
✓ Check the default value of each register by reading the data.
✓ Check the controller's ability to handle multiple write and read transactions and back-to-back writes.
✓ Check the correctness of WO registers by reading from them.
✓ Check the correctness of RO registers by writing to them.
✓ Check that address translation and address decoding are handled correctly.
✓ Check invalid address access by reading and writing invalid or out-of-range addresses.
✓ Check register accessibility in the locked condition.
✓ Check the controller's capability to access a single register from multiple addresses in different submaps.
✓ Check the enablement and quirky register accessibility.

The above is an indicative list. Apart from the items mentioned above, one also needs to add the cases below:
a. Stress testing.
b. Use callbacks to inject errors.
c. Multiple regression.
d. Corner cases.

The next step, in order to verify the planned features, is to code the following UVCs / objects along with the interface:
a. Transaction
b. Driver
c. Monitor
d. Sequencer
e. Agent
f. Sequence item
g. Configuration
h. Env
i. Test

#vlsi #asic #electronics #engineering
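Several items from the feature list above (default values, write/read-back, RO and WO behaviour, invalid addresses) can be prototyped as a small reference model before any UVM code is written. The sketch below is Python, not SystemVerilog or UVM, and it bakes in assumptions a real specification may contradict: writes to RO registers are silently ignored, WO registers read back as zero, and out-of-range accesses return an `"ERROR"` status. The register map is hypothetical.

```python
class RegisterModel:
    """Toy reference model: each register has an access policy and a reset value."""

    def __init__(self, spec):
        # spec maps address -> (access, reset_value); access is "RW", "RO" or "WO".
        self.spec = spec
        self.values = {addr: reset for addr, (_, reset) in spec.items()}

    def write(self, addr, data):
        if addr not in self.spec:
            return "ERROR"
        if self.spec[addr][0] != "RO":  # assumption: RO writes are ignored
            self.values[addr] = data
        return "OK"

    def read(self, addr):
        if addr not in self.spec:
            return "ERROR", None
        if self.spec[addr][0] == "WO":  # assumption: WO reads back as zero
            return "OK", 0
        return "OK", self.values[addr]

def run_checks(dut, spec, pattern=0xA5, bad_addr=0xFF):
    """Run a few checks from the feature plan; return check name -> pass/fail."""
    results = {}
    # Default values must be read before any write disturbs them.
    results["reset_values"] = all(
        dut.read(addr) == ("OK", 0 if access == "WO" else reset)
        for addr, (access, reset) in spec.items()
    )
    rw_ok = ro_ok = wo_ok = True
    for addr, (access, reset) in spec.items():
        dut.write(addr, pattern)
        _, data = dut.read(addr)
        if access == "RW":
            rw_ok &= data == pattern  # written value must read back
        elif access == "RO":
            ro_ok &= data == reset    # write must not change the value
        else:
            wo_ok &= data == 0        # value must not be readable
    results.update(rw_write_read=rw_ok, ro_ignores_write=ro_ok, wo_reads_zero=wo_ok)
    results["invalid_address"] = (
        dut.read(bad_addr)[0] == "ERROR" and dut.write(bad_addr, pattern) == "ERROR"
    )
    return results

# Hypothetical register map for a tiny memory controller.
SPEC = {0x00: ("RW", 0x00), 0x04: ("RO", 0x1F), 0x08: ("WO", 0x00)}

if __name__ == "__main__":
    print(run_checks(RegisterModel(SPEC), SPEC))
```

Here the model checks itself, which only proves the checks are consistent; in a real flow the `dut` argument would be replaced by transactions driven into the RTL, with the model acting as the scoreboard's expected-value source.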
-
🚨 BREAKING: McKinsey just revealed why 2/3 of “AI initiatives” never make it past the pilot stage.

Everyone’s “using AI” now. But in most companies, most of it lives in:
- One enthusiastic team’s sandbox
- A few rogue Zapier automations
- And a dead Slack channel called #ai-experiments

McKinsey’s latest State of AI report shows:
→ ~2/3 of companies never scale beyond pilots
→ Only a small minority see real, enterprise-level impact

Accenture and BCG are seeing the same thing with their clients: The problem isn’t tools. It’s the missing bridge between.

When we go into companies (50–5000+ employees) and do AI automation audits, we see the same patterns on repeat:
- No owner after the pilot ends
- No success metric (beyond “it looks impressive”)
- No fallback when the AI gets it wrong
- No change management (so teams quietly ignore it)
- No monitoring (so the first failure kills trust)

So we started asking a different question: “What if we treated AI automations like products, not like internal experiments?”

For our clients, we now run every AI initiative through a Pilot → Production Pipeline:

1. Business Case First
If we can’t tie the automation to revenue, margin, or a key SLA, we don’t build it.

2. Owner + KPI, Upfront
One person. One metric. One timeline. No “innovation in general”.

3. Production-Ready Spec
Trigger, data sources, edge cases, escalation rules, and what happens when it fails.

4. Shadow Mode Before Go-Live
The AI runs in parallel to humans for 2–4+ weeks so we track everything and see real-world breakage before customers do.

5. Monitoring & Change Management
Dashboards, feedback loops, and comms so the frontline knows what changed, why it’s better & how to escalate.
To make this easier, I packaged the exact framework we use with clients into a Pilot-to-Production AI Checklist you can plug into your next project: ✅ 30-point Pilot → Production checklist ✅ The 5 non-negotiables before you ship any AI automation ✅ A one-page spec template you can send to any internal team or vendor ✅ A simple scoring model to kill weak pilots early Comment “CHECKLIST” and I’ll send it to you. (Make sure you’re connected)
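The shadow-mode step described above (the AI runs in parallel to humans while everything is tracked) can be sketched as a thin wrapper that always returns the human decision and only logs the AI's. All names, the refund rules, and the go-live thresholds below are hypothetical and not taken from the post's checklist.

```python
from dataclasses import dataclass, field

@dataclass
class ShadowLog:
    """Collects paired human/AI decisions during the shadow period."""
    records: list = field(default_factory=list)

    def record(self, case_id, human, ai):
        self.records.append(
            {"case": case_id, "human": human, "ai": ai, "match": human == ai}
        )

    def agreement_rate(self):
        return sum(r["match"] for r in self.records) / len(self.records)

    def disagreements(self):
        return [r["case"] for r in self.records if not r["match"]]

def handle_case(case_id, payload, human_decide, ai_decide, log):
    """Serve the human decision; run the AI in the shadow and log both."""
    human = human_decide(payload)
    try:
        ai = ai_decide(payload)  # never shown to the customer
    except Exception as exc:     # an AI crash is logged as real-world breakage
        ai = f"ERROR: {type(exc).__name__}"
    log.record(case_id, human, ai)
    return human

def ready_for_go_live(log, min_agreement=0.95, min_cases=100):
    """Hypothetical gate: enough cases seen and high enough agreement."""
    return len(log.records) >= min_cases and log.agreement_rate() >= min_agreement

# Hypothetical refund rules that differ on the boundary and on missing data.
def human_rule(payload):
    return "approve" if payload.get("amount", 0) < 100 else "escalate"

def ai_rule(payload):
    return "approve" if payload["amount"] <= 100 else "escalate"

if __name__ == "__main__":
    log = ShadowLog()
    cases = {"c1": {"amount": 50}, "c2": {"amount": 100}, "c3": {"amount": 150}, "c4": {}}
    for case_id, payload in cases.items():
        handle_case(case_id, payload, human_rule, ai_rule, log)
    print(log.agreement_rate(), log.disagreements(), ready_for_go_live(log))
```

The disagreement list is the useful output here: each entry is either an edge case for the production-ready spec or a reason to keep the human in the loop.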
-
If your automation needs constant babysitting, read this ⬇️⬇️⬇️ Automation is supposed to save time, but many QAs spend hours and hours every week fixing broken tests. Here’s why. Most traditional test automation works like an old GPS with hard-coded routes. You program it step by step: ↳ turn left at this exact sign ↳ stop at this exact light ↳ turn right at this exact building Now imagine the city changes slightly… just one sign gets renamed or a road shifts, or a building is redesigned… the GPS fails and your route is broken! That’s what happens when your UI changes. But what if your automation understood the destination instead? KaneAI by TestMu AI works exactly like a modern GPS. You don’t script every turn. Just a simple description of the goal is enough. KaneAI builds the test flow, runs it across web, mobile and APIs, adapts automatically when UI elements change and even generates tests directly from JIRA tickets. 👀 The focus is on the intent, not fragile instructions. For QA and engineering teams, it means: ☆ Faster releases ☆ Less test maintenance ☆ More confidence in deployments Automation FINALLY works the way it was always meant to. If you want testing that adapts with your product (not against it), KaneAI is definitely worth exploring: https://lnkd.in/ggeMdAf9 What’s your team’s “here we go again” moment in QA? I bet every team has (at least) one
-
🚨 Public Service Announcement: If you're building LLM-based applications for internal business use, especially for high-risk functions, this is for you.

Define Context Clearly
------------------------
📋 Document the purpose, expected behavior, and users of the LLM system.
🚩 Note any undesirable or unacceptable behaviors upfront.

Conduct a Risk Assessment
----------------------------
🔍 Identify potential risks tied to the LLM (e.g., misinformation, bias, toxic outputs, etc.), and be as specific as possible.
📊 Categorize risks by impact on stakeholders or organizational goals.

Implement a Test Suite
------------------------
🧪 Ensure evaluations include relevant test cases for the expected use.
⚖️ Use benchmarks but complement them with tests tailored to your business needs.

Monitor Risk Coverage
-----------------------
📈 Verify that test inputs reflect real-world usage and potential high-risk scenarios.
🚧 Address gaps in test coverage promptly.

Test for Robustness
---------------------
🛡 Evaluate performance on varied inputs, ensuring consistent and accurate outputs.
🗣 Incorporate feedback from real users and subject matter experts.

Document Everything
----------------------
📑 Track risk assessments, test methods, thresholds, and results.
✅ Justify metrics and thresholds to enable accountability and traceability.

#psa #llm #testingandevaluation #responsibleAI #AIGovernance

Patrick Sullivan, Khoa Lam, Bryan Ilg, Jeffery Recker, Borhane Blili-Hamelin, PhD, Dr. Benjamin Lange, Dinah Rabe, Ali Hasan
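The "Monitor Risk Coverage" and "Document Everything" steps above can be made concrete with a small report that maps test cases to identified risks, flags risks with no tests, and records which cases fall below their threshold. The risk names come from the post's own examples; the test cases, scores, and thresholds are invented for illustration.

```python
# Risks named in the risk assessment, with a hypothetical minimum score each.
THRESHOLDS = {"misinformation": 0.80, "bias": 0.90, "toxic_output": 0.95}

# Hypothetical evaluation results: each test case targets one risk.
TEST_CASES = [
    {"id": "t1", "risk": "misinformation", "score": 0.92},
    {"id": "t2", "risk": "misinformation", "score": 0.71},
    {"id": "t3", "risk": "toxic_output", "score": 0.99},
]

def coverage_gaps(thresholds, cases):
    """Risks from the assessment that no test case exercises."""
    covered = {case["risk"] for case in cases}
    return sorted(set(thresholds) - covered)

def failing_cases(thresholds, cases):
    """Test cases whose score is below the threshold for their risk."""
    return [c["id"] for c in cases if c["score"] < thresholds[c["risk"]]]

def build_report(thresholds, cases):
    """One record to file alongside the risk assessment for traceability."""
    gaps = coverage_gaps(thresholds, cases)
    failures = failing_cases(thresholds, cases)
    return {
        "thresholds": dict(thresholds),
        "uncovered_risks": gaps,
        "failing_cases": failures,
        "release_ok": not gaps and not failures,
    }

if __name__ == "__main__":
    print(build_report(THRESHOLDS, TEST_CASES))
```

Treating an uncovered risk as a release blocker, as `release_ok` does here, is one possible policy; a team might instead accept a documented gap for low-impact risks.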