Software Testing Best Practices

Explore top LinkedIn content from expert professionals.

  • Arvind Jain (LinkedIn Influencer)
    81,769 followers

    Two strikingly similar headlines surfaced this past week that should make every leader pause:
    • “Companies Are Pouring Billions Into A.I. It Has Yet to Pay Off.” (New York Times)
    • “Companies Are Pouring Billions Into AI. Here’s Why They’re Not Seeing Returns” (Forbes)

    The NYT points to the human side: employees resist tools they don’t trust. Forbes focuses on the technical side: most AI still can’t understand the context of work. Both are true, and they’re related.

    When AI lacks context, employees lose trust. It can’t tell the latest doc from last year’s draft. It summarizes a customer conversation but drops the follow-ups buried in the thread. It pulls a response from Slack while ignoring the context in Google Drive. Employees realize it creates more work than it saves, and stop using it. Pilots stall, deployments fade, and projects slide into the “trough of disillusionment,” as the NYT describes. Unfortunately, that's the reality for many organizations.

    At Glean, we work hard to make sure AI understands the enterprise context the way a human does. If a subject matter expert says something, I trust it more. If something’s old, I double-check it. That’s how people think, and it’s how AI should work too. Yet every enterprise has its own documentation culture and quirks, so sometimes we struggle at first. But we persist and co-develop with customers until the system reaches the quality they need. Then we take those learnings to make it work automatically for the next customer.

    We’ve seen this approach deliver measurable impact for customers:
    • Booking.com: Glean Agents give teams faster access to customer insights, cutting video production time by 75% and doubling monthly output.
    • Confluent: Glean’s AI-powered search saves 15,000+ hours/month, boosts support satisfaction by 13%, and cuts ticket investigation time by 10 minutes.
    • Fortune 100 telecom company: Glean surfaces instant knowledge during support calls, reducing call resolution time by 17 seconds across 800+ agents.
    • Leading global consultancy: Glean Agents automate RFP workflows, cutting consulting project proposals from 4 weeks to a few hours (97% faster).
    • Wealthsimple: Glean gives employees instant access to policies and knowledge, driving $1M+ in annual productivity gains.

    When AI understands the real context of work, across people, tools, and workflows, employees trust it and use it. Instead of falling into the trough of disillusionment, companies climb a slope toward productivity gains and real ROI.

  • Juan Sequeda

    Principal Data Strategist & Researcher at ServiceNow (data.world acq); co-host of Catalog & Cocktails, the honest, no-bs, non-salesy data podcast. 20 years working in Knowledge Graphs & Ontologies (way before it was cool)

    21,117 followers

    Knowledge Graphs as a source of trust for LLM-powered enterprise question answering

    That has been our position since we began researching, over 2 years ago, how knowledge graphs increase the accuracy of LLM-powered question answering systems! The intersection of knowledge graphs and large language models (LLMs) isn’t theoretical anymore. It's been a game-changer for enterprise question answering, and now everyone is talking about it and many are doing it. 🚀

    This new paper summarizes the lessons we learned implementing this technology at data.world and working with customers, and outlines the opportunities for future research contributions and where the industry needs to go (guess where the data.world AI Lab is focusing). Sneak peek and link in the comments.

    Lessons Learned
    ✅ Knowledge engineering is essential but underutilized: Across organizations, it’s often sporadic and inconsistent, leading to assumptions and misalignment. It’s time to systematize this critical work.
    ✅ Explainability builds trust: Showing users exactly how an answer is derived, including auto-corrections, increases transparency and confidence.
    ✅ Governance matters: Aligning answers with an organization’s business glossary ensures consistency and clarity.
    ✅ Avoid “boiling the ocean”: Don’t tackle too many questions at once. A pay-as-you-go approach ensures meaningful progress without overwhelm.
    ✅ Testing matters: Non-deterministic systems like LLMs require new frameworks to test ambiguity and validate responses effectively.

    Where the Industry Needs to Go
    🌟 Simplified knowledge engineering: Tools and methodologies must make this foundational work easier for everyone.
    🌟 User-centric explainability: Different users have different needs, so we need to focus on “explainable to whom?”.
    🌟 Testing non-deterministic systems: The deterministic testing models of yesterday won’t cut it. We need innovative frameworks to ensure quality in LLM-powered software applications.
    🌟 Small semantics vs. large semantics: Semantics is increasingly referenced in industry in the context of “semantic layers” for BI and analytics. Let’s close the gap between small semantics (fact/dimension modeling) and large semantics (ontologies, taxonomies).
    🌟 Multi-agent systems: Break down the problem into smaller, more manageable components. Should one agent deal with the core task of answering questions and managing ambiguity, or should these be split into separate agents?

    This research reflects our commitment to co-innovate with customers to solve real-world challenges in enterprise AI.

    💬 What do you think? How are knowledge graphs shaping your AI strategies?

  • Armand Ruiz (LinkedIn Influencer)

    building AI systems @meta

    207,111 followers

    “A Survey on LLM-as-a-Judge” outlines what could become a foundational shift in how we evaluate AI systems, and the paper is very insightful. The idea is simple, but profound: use LLMs not just to generate content, but to judge it across tasks like summarization, reasoning, classification, and beyond.

    Why does this matter? Because traditional evaluation methods no longer scale:
    - Human reviews are expensive, inconsistent, and hard to reproduce.
    - Automatic metrics like BLEU and ROUGE fail to capture meaning, nuance, or utility.

    LLM-as-a-Judge offers a compelling alternative: scalable, nuanced, and surprisingly aligned with expert judgment when done right.

    What makes this paper stand out is the depth and structure it brings to a chaotic space. It:
    1. Defines a clear taxonomy of evaluation methods (scoring, pairwise, yes/no, multi-choice)
    2. Details the full pipeline from prompt design to model selection to post-processing
    3. Surfaces real risks (biases, hallucinations, format brittleness) and proposes mitigation strategies
    4. Introduces benchmarks and best practices for evaluating the evaluators themselves

    In short, it turns a loose idea into a playbook. In the enterprise, “LLM-as-a-Judge” could soon underpin everything from agentic workflows to data labeling, model selection, and QA. It’s a new infrastructure layer, and it demands as much rigor as the models it oversees.

    Highly recommend reading the full paper if you’re building or deploying GenAI at scale.

    Link to paper: https://lnkd.in/gsVf6_Zh
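
To make the pairwise method from the taxonomy above concrete, here is a minimal sketch in Python. It assumes a hypothetical judge callable that returns "A" or "B"; the two stand-in judges are toys, not real LLM calls. Asking twice with the answer order swapped is one common mitigation for position bias, not something the post itself prescribes.

```python
from typing import Callable

# Hypothetical judge signature: (prompt, answer_a, answer_b) -> "A" or "B".
Judge = Callable[[str, str, str], str]


def pairwise_verdict(judge: Judge, prompt: str, first: str, second: str) -> str:
    """Ask the judge twice with the answers swapped.

    Returns "first" or "second" when both runs agree on the same answer,
    and "tie" when the verdict flips with the presentation order.
    """
    run1 = judge(prompt, first, second)   # `first` is shown as answer A
    run2 = judge(prompt, second, first)   # `first` is shown as answer B
    if run1 == "A" and run2 == "B":
        return "first"
    if run1 == "B" and run2 == "A":
        return "second"
    return "tie"


def length_biased_judge(prompt: str, a: str, b: str) -> str:
    """Toy stand-in: prefers the longer answer (equal lengths go to A)."""
    return "A" if len(a) >= len(b) else "B"


def position_biased_judge(prompt: str, a: str, b: str) -> str:
    """Toy stand-in: always prefers whichever answer is shown first."""
    return "A"
```

With the toy judges, `pairwise_verdict(position_biased_judge, ...)` always collapses to "tie", which is the point: a verdict that depends only on position carries no signal and should not count as a win.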

  • Aishwarya Srinivasan (LinkedIn Influencer)
    636,622 followers

    If you’re an AI engineer trying to optimize your LLMs for inference, here’s a quick guide for you 👇

    Efficient inference isn’t just about faster hardware; it’s a multi-layered design problem. From how you compress prompts to how your memory is managed across GPUs, everything impacts latency, throughput, and cost.

    Here’s a structured taxonomy of inference-time optimizations for LLMs:

    1. Data-Level Optimization
    Reduce redundant tokens and unnecessary output computation.
    → Input Compression:
     - Prompt Pruning: remove irrelevant history or system tokens
     - Prompt Summarization: use model-generated summaries as input
     - Soft Prompt Compression: encode static context using embeddings
     - RAG: replace long prompts with retrieved documents plus compact queries
    → Output Organization:
     - Pre-structure output to reduce decoding time and minimize sampling steps

    2. Model-Level Optimization
    (a) Efficient Structure Design
    → Efficient FFN Design: use gated or sparsely-activated FFNs (e.g., SwiGLU)
    → Efficient Attention: FlashAttention, linear attention, or sliding window for long context
    → Transformer Alternatives: e.g., Mamba, Reformer for memory-efficient decoding
    → Multi-Query/Grouped-Query Attention: share keys/values across heads to reduce KV cache size
    → Low-Complexity Attention: replace full softmax with approximations (e.g., Linformer)
    (b) Model Compression
    → Quantization:
     - Post-Training: no retraining needed
     - Quantization-Aware Training: better accuracy, especially below 8-bit
    → Sparsification:
     - Weight Pruning, Sparse Attention
    → Structure Optimization:
     - Neural Architecture Search, Structure Factorization
    → Knowledge Distillation:
     - White-box: student learns from the teacher's internal states
     - Black-box: student mimics the teacher's outputs only
    → Dynamic Inference: adaptive early exits or skipping blocks based on input complexity

    3. System-Level Optimization
    (a) Inference Engine
    → Graph & Operator Optimization: use ONNX, TensorRT, BetterTransformer for op fusion
    → Speculative Decoding: use a smaller model to draft tokens, validate with the full model
    → Memory Management: KV cache reuse, paging strategies (e.g., PagedAttention in vLLM)
    (b) Serving System
    → Batching: group requests with similar lengths for throughput gains
    → Scheduling: token-level preemption (e.g., TGI, vLLM schedulers)
    → Distributed Systems: use tensor, pipeline, or model parallelism to scale across GPUs

    My Two Cents 🫰
    → Always benchmark end-to-end latency, not just token decode speed
    → For production, 8-bit or 4-bit quantized models with MQA and PagedAttention give the best price/performance
    → If using long context (>64k), consider sliding-window attention plus RAG, not full dense memory
    → Use speculative decoding and batching for chat applications with high concurrency
    → LLM inference is a systems problem. Optimizing it requires thinking holistically, from tokens to tensors to threads.

    Image inspo: A Survey on Efficient Inference for Large Language Models

    ----
    Follow me (Aishwarya Srinivasan) for more AI insights!
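
A quick back-of-the-envelope calculation shows why sharing keys/values across heads matters for the KV cache. The sketch below uses the standard size estimate (keys plus values, per layer, per KV head, per token); the model dimensions are made-up illustrative numbers, not any specific model's configuration.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1,
                   bytes_per_value: int = 2) -> int:
    """Estimate KV cache size in bytes.

    The leading factor 2 accounts for storing both a key tensor and a value
    tensor per layer. bytes_per_value=2 corresponds to fp16/bf16 storage.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


# Hypothetical model: 32 layers, 32 query heads, head_dim 128, 4096-token context.
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)  # one KV head per query head
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=4096)   # 8 shared KV heads
mqa = kv_cache_bytes(n_layers=32, n_kv_heads=1, head_dim=128, seq_len=4096)   # a single shared KV head

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 1024**2:.0f} MiB per sequence")
```

Under these assumed dimensions the per-sequence cache shrinks linearly with the number of KV heads, which is what frees memory for larger batches.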

  • Greg Coquillo (LinkedIn Influencer)

    AI Infrastructure Product Leader | Scaling GPU Clusters for Frontier Models | Microsoft Azure AI & HPC | Former AWS, Amazon | Startup Investor | LinkedIn Top Voice | I build the infrastructure that allows AI to scale

    232,175 followers

    "The LLM works great." Works great… according to what? That's the question most AI teams skip, and it's why so many models look brilliant in demos and fall apart in production. Testing an LLM isn't one thing. It's six, and using only one of them is how trust quietly breaks.

    Here are the 6 methods for testing LLM output quality 👇
    🔹 Human Evaluation - the gold standard for nuance, tone, subtle errors. Slow and costly, but irreplaceable.
    🔹 Automated Metrics - BLEU, ROUGE, BERTScore, perplexity. Fast and repeatable, weak on meaning.
    🔹 Adversarial & Red-Teaming - stress tests for jailbreaks, prompt injection, hallucinations. Critical before launch.
    🔹 LLM-as-a-Judge - a strong model grades outputs. Scales human-like judgment cheaply (watch for bias).
    🔹 Task-Specific Evaluation - custom datasets that mirror production. Measures real business value.
    🔹 Benchmark Testing - MMLU, HellaSwag, GSM8K, HumanEval. Comparable across models; may miss real-world tasks.

    The takeaway: no single method covers everything. Layer them.

    Save this if you build with LLMs. Which do you trust most? 👇

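The "fast and repeatable, weak on meaning" trade-off of overlap metrics in the post above can be seen in a few lines. This is a simplified ROUGE-1-style unigram F1 written from scratch for illustration (whitespace tokenization, no stemming); it is not the official ROUGE implementation, and the example sentences are invented.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1-style score: F1 over overlapping lowercase words."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped count of shared words
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "the refund was approved"
contradiction = "the refund was denied"         # opposite meaning, high overlap
paraphrase = "your money will be returned"      # same meaning, no shared words

print(unigram_f1(contradiction, reference))
print(unigram_f1(paraphrase, reference))
```

The contradicting answer outscores the faithful paraphrase because the metric only counts shared words, which is why the post recommends layering such metrics with the other methods.
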
  • Marie Stephen Leo

    Data & AI Director | Scaled customer facing Agentic AI @ Sephora | AI Coding | RecSys | NLP | CV | MLOps | LLMOps | GCP | AWS

    16,138 followers

    LLM applications are frustratingly difficult to test due to their probabilistic nature. However, testing is crucial for customer-facing applications to ensure the reliability of generated answers. So, how does one effectively test an LLM app?

    Enter Confident AI's DeepEval: a comprehensive open-source LLM evaluation framework with excellent developer experience.

    Key features of DeepEval:
    - Ease of use: Very similar to writing unit tests with pytest.
    - Comprehensive suite of metrics: 14+ research-backed metrics for relevancy, hallucination, etc., including label-less standard metrics, which can quantify your bot's performance even without labeled ground truth! All you need is input and output from the bot. See the list of metrics and required data in the image below!
    - Custom Metrics: Tailor your evaluation process by defining your custom metrics as your business requires.
    - Synthetic data generator: Create an evaluation dataset synthetically to bootstrap your tests.

    My recommendations for LLM evaluation:
    - Use OpenAI GPT-4 as the metric model as much as possible.
    - Test Dataset Generation: Use the DeepEval Synthesizer to generate a comprehensive set of realistic questions!
    - Bulk Evaluation: If you are running multiple metrics on multiple questions, generate the responses once, store them in a pandas data frame, and calculate all the metrics in bulk with parallelization.
    - Quantify hallucination: I love the faithfulness metric, which indicates how much of the generated output is factually consistent with the context provided by the retriever in RAG!
    - CI/CD: Run these tests automatically in your CI/CD pipeline to ensure every code change and prompt change doesn't break anything.
    - Guardrails: Some high-speed tests can be run on every API call in a post-processor before responding to the user. Leave the slower tests for CI/CD.

    🌟 DeepEval GitHub: https://lnkd.in/g9VzqPqZ
    🔗 DeepEval Bulk evaluation: https://lnkd.in/g8DQ9JAh

    Let me know in the comments if you have other ways to test LLM output systematically! Follow me for more tips on building successful ML and LLM products!

    Medium: https://lnkd.in/g2jAJn5
    X: https://lnkd.in/g_JbKEkM

    #generativeai #llm #nlp #artificialintelligence #mlops #llmops
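
The bulk-evaluation recommendation above (generate once, then score everything in parallel) can be sketched with the standard library alone. Everything here is a stand-in: `hypothetical_bot` and the two toy metrics are invented for illustration and are not DeepEval APIs. In practice the rows could live in a pandas data frame and the metrics would be slower, LLM-backed ones.

```python
from concurrent.futures import ThreadPoolExecutor


def hypothetical_bot(question: str) -> str:
    """Stand-in for the LLM application under test."""
    return f"Answer to: {question}"


def non_empty(row: dict) -> float:
    """Toy metric: 1.0 if the bot returned any text."""
    return 1.0 if row["answer"].strip() else 0.0


def mentions_question(row: dict) -> float:
    """Toy metric: 1.0 if the answer echoes the question text."""
    return 1.0 if row["question"] in row["answer"] else 0.0


METRICS = {"non_empty": non_empty, "mentions_question": mentions_question}


def evaluate_in_bulk(questions, bot=hypothetical_bot, metrics=METRICS, max_workers=4):
    # Step 1: call the bot exactly once per question and keep the response.
    rows = [{"question": q, "answer": bot(q)} for q in questions]
    # Step 2: score every (row, metric) pair in parallel against the stored answers.
    jobs = [(i, name, fn) for i in range(len(rows)) for name, fn in metrics.items()]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(lambda job: job[2](rows[job[0]]), jobs))
    # Step 3: attach each score to its row under the metric's name.
    for (i, name, _), score in zip(jobs, scores):
        rows[i][name] = score
    return rows
```

Separating generation from scoring means adding a new metric later never triggers another round of (slow, paid) bot calls; threads are a reasonable default here because real metric calls are typically I/O-bound API requests.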

  • Japneet Sachdeva

    Automation Lead | Instructor | Mentor | Check out my courses on Udemy & TopMate

    132,341 followers

    Why Most Automation Frameworks Break in 6 Months (and How to Prevent It)

    I recently reviewed the automation setup of a $50M product team. 73% of their tests were failing randomly. The truth? Most automation frameworks, especially Selenium-based ones, fail for the same predictable reasons. Here are the 4 patterns I see again and again (and the fixes that actually work):

    1️⃣ The “Everything Lives in One Folder” Problem
    ❌ UI tests, APIs, utils, configs: everything mixed together
    Fix: Create clear packages: UI, API, POJOs, services, utilities
    Why it matters: New engineers should be productive in hours, not weeks.

    2️⃣ Hardcoded Data Everywhere
    ❌ URLs, credentials, test data sitting inside the test files
    Fix: Externalise everything (env configs + test data files)
    Real benefit: Switching from dev → QA → prod becomes a single command.

    3️⃣ No POJOs for API Payloads
    ❌ Raw JSON strings and manually built requests
    Fix: Use POJOs + schema validation for request/response models
    Outcome: Cleaner tests and a framework that stays maintainable long term.

    4️⃣ Debugging Takes Forever
    ❌ “Test failed” with no context, no screenshot, no timeline
    Fix: Add reporting (Extent / Allure) + screenshots + API logs
    Impact: Debug time drops from hours → minutes.

    ----
    What a Scalable Framework Actually Looks Like
    The setups that survive beyond 6 months usually include:
    - Clear UI/API/POJO separation
    - Environment-based configurations
    - Rich visual reporting
    - Docker + CI/CD support
    - Optional BDD for business-friendly readability

    It’s not about “automation scripts.” It’s about building a software system that grows with your team, from test #10 to test #1000.

    If you want the sample folder structure I recommend, drop “FRAMEWORK” in the comments and I’ll share it.

    -x-x-
    Become a future-proof full-stack QA automation engineer who uses AI: https://lnkd.in/g7tn6Uif

    #japneetsachdeva
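
As a rough illustration of fix 2️⃣ (externalised, environment-based configuration), here is a minimal Python sketch. The post's own stack appears to be Java-based, so this is only an analogy; the `TEST_ENV` variable, the `config/<env>.json` layout, and the example URLs are all hypothetical names chosen for this example.

```python
import json
import os

# Hypothetical fallback settings so the sketch runs without any files on disk.
# In a real framework these would live only in external files such as config/qa.json.
DEFAULT_CONFIGS = {
    "dev": {"base_url": "https://dev.example.test", "timeout_s": 5},
    "qa": {"base_url": "https://qa.example.test", "timeout_s": 10},
}


def load_config(env=None, config_dir=None):
    """Return settings for one environment.

    The environment comes from the `env` argument, else the TEST_ENV
    environment variable (an assumed name), else "dev".
    """
    env = env or os.environ.get("TEST_ENV", "dev")
    if config_dir is not None:
        # External file per environment, e.g. <config_dir>/qa.json
        path = os.path.join(config_dir, f"{env}.json")
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    if env not in DEFAULT_CONFIGS:
        raise ValueError(f"unknown environment: {env!r}")
    return dict(DEFAULT_CONFIGS[env])
```

Tests then read `load_config()["base_url"]` instead of a hardcoded URL, so switching environments is a matter of setting `TEST_ENV=qa` before the run. Credentials would normally come from a secrets store rather than these files.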

  • Yuvraj Vardhan

    Technical Lead | Test Automation

    19,159 followers

    Automation is more than just clicking a button

    While automation tools can simulate human actions, they don't possess human instincts to react to various situations. Understanding the limitations of automation is crucial to avoid blaming the tool for our own scripting shortcomings.

    📌 Encountering Unexpected Errors: Automation tools cannot intuitively interpret error messages or auto-resume test cases after a failure. Testers must investigate execution reports, refer to screenshots or logs, and provide precise instructions to handle unexpected errors effectively.

    📌 Test Data Management: Automation testing relies heavily on test data. Ensuring the availability and accuracy of test data is vital for reliable testing. Testers must consider how the automation script interacts with the test data, whether it retrieves data from databases, files, or APIs. Additionally, generating test data dynamically can enhance test coverage and provide realistic scenarios.

    📌 Dynamic Elements and Timing: Web applications often contain dynamic elements that change over time, such as advertisements or real-time data. Testers need to use techniques like dynamic locators or explicit waits to handle these elements effectively. Timing issues, such as synchronization problems between application responses and script execution, can also impact test results and require careful consideration.

    📌 Maintenance and Adaptability: Automation scripts need regular maintenance to stay up-to-date with application changes. As the application evolves, UI elements, workflows, or data structures might change, causing scripts to fail. Testers should establish a process for script maintenance and ensure scripts are adaptable to accommodate future changes.

    📌 Test Coverage and Risk Assessment: Automation testing should not aim for 100% test coverage in all scenarios. Testers should perform risk assessments and prioritize critical functionalities or high-risk areas for automation. Balancing automation and manual testing is crucial for achieving comprehensive test coverage.

    📌 Test Environment Replication: Replicating the test environment ensures that the automation scripts run accurately and produce reliable results. Testers should pay attention to factors such as hardware, software versions, configurations, and network conditions to create a robust and representative test environment.

    📌 Continuous Integration and Continuous Testing: Integrating automation testing into a continuous integration and continuous delivery (CI/CD) pipeline can accelerate the software development lifecycle. Automation scripts can be triggered automatically after each code commit, providing faster feedback on the application's stability and quality.

    Let's go beyond just clicking a button and embrace automation testing as a strategic tool for software quality and efficiency.

    #automationtesting #automation #testautomation #softwaredevelopment #softwaretesting #softwareengineering #testing
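
For the timing point above, the usual alternative to fixed sleeps is to poll for a condition with a timeout. The helper below is a generic, tool-agnostic sketch (the name `wait_until` and its parameters are invented here); Selenium users would normally reach for its built-in `WebDriverWait` instead.

```python
import time


def wait_until(condition, timeout_s: float = 5.0, poll_interval_s: float = 0.1):
    """Poll `condition` until it returns a truthy value or the timeout expires.

    Returns the truthy value so callers can use it (e.g. a located element).
    Raises TimeoutError if the condition never becomes truthy in time.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout_s} seconds")
        time.sleep(poll_interval_s)
```

A test would call something like `wait_until(lambda: page_has_loaded())`, where `page_has_loaded` is whatever readiness check the application offers. The script proceeds as soon as the application is ready and fails with a clear error when it never is, rather than passing or failing depending on how long an arbitrary sleep happened to be.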

  • Philipp Schmid

    Agents & Gemini API, MTS at Google DeepMind 🔵 prev: Tech Lead at Hugging Face, AWS ML Hero 🤗 Sharing my own views and AI News

    165,825 followers

    How biased are LLMs when you use them for synthetic data generation and as an LLM-as-a-Judge to evaluate? Answer: Significantly biased. 👀

    The “Preference Leakage: A Contamination Problem in LLM-as-a-judge” paper shows that a judge LLM can prefer its “own” data when the data generator is the same LLM, a model from the same family, or even a previous version.

    Experiments:
    1️⃣ Use an LLM (e.g., GPT-4, Gemini) to generate synthetic responses to a set of prompts (e.g., UltraFeedback).
    2️⃣ Fine-tune different "student" models (e.g., Mistral, Qwen) on the synthetic data.
    3️⃣ Evaluation: Use multiple "judge" LLMs to perform pairwise comparisons of these student models on benchmarks (e.g., Arena-Hard, AlpacaEval 2.0).
    4️⃣ Bias: Calculate and analyze the Preference Leakage Score (PLS) across different scenarios (same model, inheritance, same family).

    PLS measures how much more often a judge LLM prefers a student model trained on its own data compared to how other judges rate that student. If both teachers give similar grades to both students, PLS is low (fair judging). If teachers give better grades to their own students, PLS is high (biased judging).

    Insights
    💡 LLMs show a bias towards student models trained on data generated by themselves.
    📈 Model size matters: Larger models (14B vs 7B) show stronger preference leakage.
    🧪 Supervised fine-tuning (SFT) leads to the highest PLS (23.6%); DPO reduces it (5.2%).
    ❓ PLS is higher in subjective tasks (e.g., writing) than in objective ones.
    🧑🧑🧒🧒 Relationship bias: Same model > inheritance > same family in terms of leakage severity.
    🌊 Data mixing helps but doesn't solve it: Even 10% synthetic data shows detectable leakage.
    ✅ Use multiple independent judges and mix with human evaluation.

    Paper: https://lnkd.in/eupf2Vyx
    Github: https://lnkd.in/eeDdrEXb
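
The teacher/student intuition above can be turned into a small calculation. Note the assumption: this sketch computes a simplified "leakage gap" (how much higher a judge rates its own student than the other judge does), invented here for illustration. It is not the paper's exact PLS formula, which is not reproduced in the post, and the win rates below are made-up numbers.

```python
def win_rate(verdicts):
    """Fraction of pairwise comparisons a student won (verdicts are booleans)."""
    return sum(verdicts) / len(verdicts)


def leakage_gap(wr):
    """Simplified, illustrative leakage measure for two judges "A" and "B".

    wr[judge][student] is the win rate that `judge` assigns to the student
    fine-tuned on data generated by model `student`. The gap is the average
    of how much each judge over-rates its own student relative to the
    other judge's rating of that same student.
    """
    gap_a = wr["A"]["A"] - wr["B"]["A"]  # judge A's bonus for student A
    gap_b = wr["B"]["B"] - wr["A"]["B"]  # judge B's bonus for student B
    return (gap_a + gap_b) / 2


# Made-up win rates: both judges agree on both students (fair judging).
fair = {"A": {"A": 0.5, "B": 0.5}, "B": {"A": 0.5, "B": 0.5}}
# Made-up win rates: each judge favours the student trained on its own data.
biased = {"A": {"A": 0.75, "B": 0.25}, "B": {"A": 0.25, "B": 0.75}}

print(leakage_gap(fair))
print(leakage_gap(biased))
```

A gap near zero means the judges agree regardless of whose data trained the student; a large positive gap is the pattern the post calls biased judging, and a reason to add independent judges or human review.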
