Advanced Computer Vision Techniques

Explore top LinkedIn content from expert professionals.

  • View profile for Bhavishya Pandit

    Turning AI into enterprise value | $20 M in Business Impact | Speaker - MHA/IITs/IIMs/NITs | Google AI Expert | 50 Million+ views | MS in ML - UoA

    85,789 followers

    You can now generate infinite-length videos!? Yes, literally infinite. Let me quickly explain why it's a problem to begin with: AI models generate videos frame by frame, and each new frame depends on the previous one. The problem? Tiny errors stack up. By frame 100, your subject starts distorting. By frame 500, everything's a mess 💩 This happens because the model was trained on clean data, but during generation, it has to build on top of its own imperfect outputs. That gap kills quality over time [vanishing gradient analogy]. Plus, existing methods only handle one prompt, so you get repetitive scenes with no real story progression. Here's where Stable Video Infinity from EPFL shines 💡: Instead of fighting errors, it learns from them. The breakthrough is Error-Recycling Fine-Tuning. During training, the model deliberately injects its own past errors into clean frames, watches what goes wrong, and figures out how to fix it. Here's the process: → inject historical errors to simulate real generation conditions → predict where drift will happen → bank those errors in memory → learn to correct them before they compound. This creates three powerful results: • Videos can extend infinitely without quality collapse • Scene transitions happen naturally with controllable storylines • Works with multiple conditions like audio, skeleton poses, and text streams They've generated 10-minute Tom & Jerry videos from a single image. Not stitched clips, but continuous generation. The efficiency comes from only training LoRA adapters, not the full model. You can customise it without massive computing. The challenges? Real-time streaming isn't there yet. The model generates clip-by-clip with bidirectional attention for quality, which means you can't stream live outputs instantly. You still need decent hardware to train custom versions, though inference is manageable. 
And while error recycling is clever, the model needs to bank enough error patterns during training to handle diverse scenarios. But the future's interesting. They're working on Wan 2.2 5B-based SVI and true streaming generation. If they can achieve real-time inference while maintaining quality, this becomes viable for live content creation and gaming. The bigger idea here is training models on their own mistakes, rather than just clean data. That could apply beyond video to any autoregressive generation task. What's the longest AI-generated video you've successfully created without quality degradation, and what method did you use? Follow me, Bhavishya Pandit, for honest takes on AI breakthroughs that actually work 🔥
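The inject, predict, bank, and correct loop described in the post can be illustrated with a deliberately tiny toy. This is a sketch under stated assumptions, not Stable Video Infinity's training code: the "model" is a hypothetical one-dimensional linear predictor, the "frames" are samples of a sine wave, and the function name and all hyperparameters are made up for illustration.

```python
import math
import random

def train_with_error_recycling(clean, steps=2000, lr=0.05, inject_prob=0.5,
                               bank_size=64, seed=0):
    """Toy next-frame predictor (pred = w * prev + b) trained with recycled errors."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    error_bank = []  # past prediction errors, reused as input corruption
    for _ in range(steps):
        t = rng.randrange(len(clean) - 1)
        prev, target = clean[t], clean[t + 1]
        # Inject a historical error so the input looks like the model's own
        # imperfect output rather than a clean frame.
        if error_bank and rng.random() < inject_prob:
            prev += rng.choice(error_bank)
        pred = w * prev + b
        err = pred - target  # the target is always the clean next frame
        # One SGD step on the squared error.
        w -= lr * err * prev
        b -= lr * err
        # Bank the new error, keeping only the most recent bank_size entries.
        error_bank.append(err)
        if len(error_bank) > bank_size:
            error_bank.pop(0)
    return w, b, error_bank

clean_frames = [math.sin(0.1 * i) for i in range(200)]
w, b, bank = train_with_error_recycling(clean_frames)
print(f"w={w:.3f} b={b:.3f} banked errors={len(bank)}")
```

The point of the toy is only the data flow: the model sees inputs corrupted by its own earlier mistakes but is always scored against the clean target, so it is pushed toward corrections rather than toward compounding drift.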

  • View profile for Henry Ajder
    Henry Ajder is an Influencer

    AI and Deepfake Cartographer

    17,371 followers

    OpenAI's Sora is dominating the news, but Tencent's latest generative video model Hunyuan has been much less discussed. Here's why I think it's significant: Hunyuan is a 13bn parameter model providing text-to-video, avatar animation, and notably video-to-audio capabilities. Tencent claims outputs are "comparable to, if not superior to", other leading generative video models, with independent evaluations finding Hunyuan outperformed Runway Gen-3 alpha and Luma 1.6. I've found the quality impressive but inconsistent. Compared to Sora, the outputs lagged on motion fluidity and human subjects, although others have had better results. So why is it so significant? Hunyuan is open source and represents the most powerful and dynamic open generative video model currently available. There has been progress in OS generative video (such as Mochi 1), but most advances/multi-functional capabilities are seen in proprietary/closed models. Accessible closed models like Sora may be in the hands of many users right now, but open models like Hunyuan unlock the foundations for the global OS community to experiment and develop novel applications. As we've seen with other open generative models and modalities, these permutations could reshape how we view what's possible with generative video, for better and/or for worse. https://lnkd.in/eKYc6KvW

  • View profile for Arjun Jain

    Founder & CEO, Fast Code AI | Research-grade AI for enterprises | Dad

    37,580 followers

    #MIT's new "Radial Attention" makes Generative Video 4.4x cheaper to train and 3.7x faster to run. Here's why: The problem with current AI video? It's BRUTALLY expensive. Every frame must "pay attention" to every other frame. With thousands of frames, costs grow quadratically. Training one model? $100K+ Running it? Painfully slow. Massachusetts Institute of Technology, NVIDIA, Princeton, UC Berkeley, Stanford, and First Intelligence just changed the game. Their breakthrough insight: Video attention works like physics. - Sound gets quieter with distance - Light dims as it travels - Heat dissipates over space Turns out, AI video tokens follow the same rules. Why waste compute power on distant, irrelevant connections? Enter Radial Attention: Instead of checking EVERY connection: • Nearby frames → full attention • Distant frames → sparse attention • Computation scales as n log n, not quadratically Technical result: O(n log n) vs O(n²) Translation: MASSIVE efficiency gains Real-world results on production models: 📊 HunyuanVideo (Tencent): • 2.78x training speedup • 2.35x inference speedup 📊 Mochi 1: • 1.78x training speedup • 1.63x inference speedup Quality? Maintained or IMPROVED. What this unlocks: 4x longer videos, same resources 4.4x cheaper training costs 3.7x faster generation Works with existing models (no retraining!) And, MIT open-sourced everything: https://lnkd.in/gETYw8eT The bigger picture: The internet is transforming. BEFORE: A place to store videos from the real world NOW: A machine that generates synthetic content on demand Think about it: • TikTok filled with AI-generated content • YouTube creators using AI for entire videos • Streaming services producing personalized shows • Educational content generated for each student This changes everything. Remember when only big tech could afford image AI? 2020: GPT-3 → Only OpenAI 2022: Stable Diffusion → Everyone 2024: Midjourney everywhere Video AI is next.
Radial Attention probably just accelerated the timeline. The future isn't coming. It's here. And it's more accessible than ever. Want to ride this wave? → Follow me for weekly AI breakthroughs → Share if this opened your eyes → Try the code: https://lnkd.in/gETYw8eT What will YOU create when video AI costs 4x less? #AI #VideoGeneration #MachineLearning #TechInnovation #FutureOfContent
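The O(n log n) versus O(n²) comparison in the post can be made concrete by counting attended pairs under a distance-dependent sparsity rule. The sketch below is a one-dimensional toy, not the mask from the Radial Attention paper: the rule (keep one key per power-of-two distance band) and the names `keeps` and `count_kept_pairs` are assumptions chosen only to show how decaying density changes the cost.

```python
def keeps(i, j):
    """Toy rule: should query position i attend to key position j?"""
    d = abs(i - j)
    if d == 0:
        return True
    # Distances in [2^b, 2^(b+1)) share a stride of 2^b, so the farther away
    # a key is, the more sparsely that region is sampled.
    stride = 1 << (d.bit_length() - 1)
    return j % stride == 0

def count_kept_pairs(n):
    """Number of (query, key) pairs kept by the toy rule for n positions."""
    return sum(1 for i in range(n) for j in range(n) if keeps(i, j))

for n in (64, 256, 1024):
    sparse, dense = count_kept_pairs(n), n * n
    print(f"n={n}: kept {sparse} of {dense} pairs ({sparse / dense:.1%})")
```

Each distance band contributes at most one key per side, and there are about log2(n) bands, so every query keeps on the order of log n keys. That is where the n log n total comes from in this toy; the real method applies a related idea to spatiotemporal video tokens.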

  • View profile for Sahar Mor

    I help researchers and builders make sense of AI | ex-Stripe | aitidbits.ai | Angel Investor

    42,183 followers

    The last few weeks have been huge for open-source video generation and research. After two years of limited usability in open-source video generative models, we’re finally seeing major advancements. These new models outperform commercial ones, including Runway, Pika Labs, and Luma Labs. —> Mochi, released by Genmo a week ago, ranks 2nd among top generative video models, permitting commercial use with an Apache 2.0 license —> CogVideoX-5B from Tsinghua University, released last month, supports both text2video and image2video, allowing commercial use for companies with <1M users —> Allegro from Rhymes AI is a small model capable of generating a wide range of content, from human close-ups to diverse, dynamic scenes, permitting commercial use with an Apache 2.0 license Also, over the last few weeks, Meta announced MovieGen for generating HD personalized videos with synchronized audio, and Peking University openly released Pyramid Flow. On the proprietary generative video side of things, Runway released a new tool for transforming simple video and voice inputs into expressive character performances, Pika Labs released Pikaffects to transform video subjects with surreal effects, and Luma Labs announced API access to its generative video models. As for OpenAI’s Sora, who knows, it might launch soon after the US elections are out of the way (this Tuesday). Generative video models leaderboard https://lnkd.in/grWJDVkd Links to the mentioned models are in the comments.

  • View profile for Sione Palu

    Machine Learning Applied Research

    37,973 followers

    Transformer architecture excels at capturing long-range dependencies in data but suffers from quadratic complexity, making it inefficient for long sequences. State Space Models (SSMs), particularly those used in Mamba-style architectures, address this limitation by replacing the self-attention mechanism with linear recurrent layers. This approach allows the models to handle long sequences with linear complexity, making them significantly more scalable and efficient while retaining the powerful modeling capabilities of Transformers. The primary advantage of SSMs is their computational efficiency, which allows them to process much longer sequences than traditional Transformers. This efficiency makes them ideal for tasks that require analyzing high-resolution data, such as medical imaging. For example, SSMs have been adapted for computer vision (S4ND, VMamba) and have shown promising results in medical image segmentation, including liver pathologies. The development of robust computer-assisted segmentation algorithms for liver tissue is crucial for the early detection of liver pathologies like cirrhosis and cancer. Currently, this process is manual, time-consuming, and prone to error. Automated tools built with efficient architectures like SSMs can provide consistent, quantitative analysis, helping to improve diagnostic accuracy and patient outcomes by reliably identifying subtle changes in the liver. Building on the advances highlighted above, the authors of [1] introduced RMA-Mamba, a new architecture designed for medical image segmentation. This model builds on Vision State Space (VSS) models by incorporating a specialized Reverse Mamba Attention (RMA) module to effectively capture both local details and global context. RMA-Mamba combines the efficient sequence modeling of Vision Mamba (VMamba) with the targeted feature refinement of its RMA module. 
This dual approach allows it to handle complex morphological patterns and long-range dependencies with computational efficiency. #MedicalInformatics The architecture has been shown to achieve state-of-the-art performance in pathological liver segmentation from both MRI and CT scans. When tested on a new cirrhotic liver dataset (CirrMRI600+), RMA-Mamba achieved a Dice coefficient of 92.08%. It also demonstrated strong performance on the cancerous liver segmentation dataset (LiTS), with a Dice score of 92.9%. The links to the preprint [1] and the #Python GitHub repository are posted in the comments.
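For readers unfamiliar with the Dice coefficient quoted above, here is a minimal sketch of how it is computed for binary masks. The flattened list representation and the convention of returning 1.0 when both masks are empty are assumptions of this example, not details taken from the RMA-Mamba paper.

```python
def dice_coefficient(pred, target):
    """Dice = 2 * |pred AND target| / (|pred| + |target|) for binary masks.

    pred and target are equal-length sequences of 0/1 values, for example a
    flattened segmentation mask and its ground truth.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    size_sum = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if size_sum == 0:
        return 1.0  # both masks empty: treated as perfect agreement here
    return 2.0 * intersection / size_sum

predicted = [1, 1, 0, 0, 1, 0]
ground_truth = [1, 0, 0, 0, 1, 1]
print(f"Dice = {dice_coefficient(predicted, ground_truth):.4f}")
```

A score of 92.08% therefore means that twice the overlap between predicted and reference liver regions is about 92% of their combined size.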

  • View profile for Zain Khalpey, MD, PhD, FACS

    Professor & Director of Artificial Heart & Robotic Cardiac Surgery Programs | Network Director Of Artificial Intelligence | Chief Medical AI Officer |#AIinHealthcare

    81,124 followers

    New research in JACC: Advances shows that the eye may offer a powerful, noninvasive window into coronary artery disease detection. In a multicenter study of 383 patients, deep learning models trained on retinal images were able to identify CAD with strong performance, outperforming traditional clinical risk scores, particularly in intermediate-risk patients where clinical uncertainty is highest. When retinal imaging was combined with clinical indicators using a multimodal AI approach, diagnostic accuracy improved further, achieving an AUC of 0.91 with over 92 percent sensitivity. Because retinal and coronary vessels share similar vascular origins, microvascular changes captured by OCT and OCTA appear to reflect underlying coronary disease. AI enables these subtle patterns to be translated into scalable, radiation-free screening and risk stratification tools. This work points toward a future where cardiovascular risk can be assessed earlier, more safely, and more equitably, especially in settings where invasive testing is limited. Multimodal AI may be key to shifting CAD detection upstream and personalizing prevention before clinical events occur. 🔗 https://lnkd.in/gWJUU447 Follow Zain Khalpey, MD, PhD, FACS for more on AI & Healthcare. #AIinHealthcare #Cardiology #CoronaryArteryDisease #PreventiveCardiology #DigitalHealth #MedicalAI #MultimodalAI #DeepLearning #NonInvasiveDiagnostics #RetinalImaging #OCTA #OCT #CardiovascularHealth #RiskStratification #PrecisionMedicine #ClinicalInnovation #HealthEquity #CVImaging

  • View profile for Bo Wang

    Senior Vice President @ Xaira Therapeutics; Chief Artificial Intelligence Scientist @ UHN; Associate Professor @ University of Toronto; CIFAR AI Chair @ Vector Institute ; Twitter : @BoWang87

    21,697 followers

    Yann LeCun's vision: machines should learn like humans — by building internal world models, not reconstructing every pixel. We just validated this idea at the largest scale ever attempted in cardiac ultrasound. Introducing EchoJEPA — the first world model for medical video.🔥 🫀 18M echocardiograms 👥 300K patients 🧠 Learns heart dynamics — not imaging noise The problem: Ultrasound is messy. Speckle, shadows, attenuation. Most pretraining objectives end up modeling the scanner, not the heart. The idea: Stop reconstructing pixels. Predict latent structure instead. EchoJEPA discards what’s unpredictable and locks onto what matters clinically: ➡️ chamber geometry ➡️ wall motion ➡️ valve dynamics The results (frozen encoder, no fine-tuning): • 20% ↓ error in LVEF • 17% ↓ error in RVSP • 79% accuracy with 1% labels (vs 42% for baselines w/ 100%) • 2% degradation under acoustic artifacts (vs 17%) • Zero-shot pediatric transfer beats all fine-tuned models Why this works: When we project embeddings: ❌ prior methods → diffuse, entangled clusters ✅ EchoJEPA → clean anatomical organization Structure separated from acquisition noise. 📄 Paper: https://lnkd.in/gPxhQpCR 💻 Code: https://lnkd.in/gQ-i6yMx Huge credit to Alif Munim, who pushed JEPA thinking into medical video and led this effort 💥 Guidance from AI at Meta (Quentin Garrido, Koustuv Sinha) Co-authors: Adib Fallahpour Teodora Szasz Ahmadreza Attarpour, PhD etc! Teams: University Health Network Amazon Web Services (AWS) University of Toronto UChicago Medicine University of California, San Francisco Philips This is representation learning for physiology, not pixels.

  • View profile for Philipp Schmid

    Agents & Gemini API, MTS at Google DeepMind 🔵 prev: Tech Lead at Hugging Face, AWS ML Hero 🤗 Sharing my own views and AI News

    165,825 followers

    Yesterday, we released MedGemma, an open medical vision-language model for healthcare! Built on Google DeepMind Gemma 3, it advances medical understanding across images and text, significantly outperforming generalist models of similar size. MedGemma is one of the best open models under 50B! How MedGemma Was Trained: 1️⃣ Fine-tuned the Gemma 3 vision encoder (SigLIP) on over 33 million medical image-text pairs (radiology, dermatology, pathology, etc.) to create the specialized MedSigLIP, including some general data to prevent catastrophic forgetting. 2️⃣ Further pre-trained Gemma 3 Base by mixing in the medical image data (using the new MedSigLIP encoder) to ensure the text and vision components could work together effectively. 3️⃣ Distilled knowledge from a larger "teacher" model, using a mix of general and medical text-based question-answering datasets. 4️⃣ Applied reinforcement learning, similar to Gemma 3, on medical imaging and text data; RL led to better generalization than standard supervised fine-tuning for these multimodal tasks. Insights: - 💡 Outperforms Gemma 3 on medical tasks, with 15-18% improvements in chest X-ray classification. - 🏆 Competes with, and sometimes surpasses, much larger models like GPT-4o. - 🥇 Sets a new state-of-the-art for MIMIC-CXR report generation. - 🩺 Reduces errors in EHR information retrieval by 50% after fine-tuning. - 🧠 The 27B model outperforms human physicians in a simulated agent task. - 🤗 Openly released to accelerate development in healthcare AI. - 🔬 Reinforcement Learning was found to be better for multimodal generalization. Paper: https://lnkd.in/dBTiH_cJ Model: https://lnkd.in/dnyxWPju

  • View profile for Niels Rogge

    Machine Learning Engineer at ML6 & Hugging Face

    70,872 followers

    Let's go!! Meta released a new video LLM on Hugging Face, and it sets a new SOTA (state-of-the-art) for open-source video understanding. 🔥 The model is called LongVU, a new multimodal large language model capable of processing long videos (for things like answering questions about them, summarizing them, identifying important passages, etc). LongVU is capable of processing very long videos thanks to various clever compression techniques, which progressively reduce the number of tokens used to represent a video (and which a Transformer needs to process in parallel). First, the authors employ DINOv2, a self-supervised image model open-sourced by Meta as well, to remove redundant frames that exhibit high feature similarity across time. Next, features for the remaining frames are combined with features from SigLIP, an important vision encoder open-sourced by Google. The large language model (text decoder part of LongVU) is conditioned on these features. Next, after temporal reduction, the authors employ spatial reduction (reducing the width and height dimensions of certain video features). Based on the embeddings of the text query (e.g. "What did this man put on the pizza?"), less important frames get their features' resolution reduced, whereas the most important frames' features keep their original resolution. Finally, spatial token compression (STC) is performed to further reduce the number of tokens. This is based on a technique where a non-overlapping window is slid over the tokens, and tokens which exhibit high cosine similarity with the first frame of each window are removed. In terms of performance, the model gets SOTA results on EgoSchema, MVBench, VideoMME, and MLVU. Only on VideoMME is the gap with closed-source models (GPT-4o and Gemini) still large, but it's quite impressive to see the results.
Resources: * paper: https://lnkd.in/eG-rC8Fg * Gradio demo for you to try: https://lnkd.in/e8Ey9ci7 * checkpoints: https://lnkd.in/eJr8WWQB * project page: https://lnkd.in/ee7dirPR #huggingface #video #largelanguagemodels #generativeai #ai
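The windowed, similarity-based pruning described in the post can be sketched in a few lines. This is an illustrative approximation only, not LongVU's implementation: the feature vectors, the window size, the 0.95 threshold, and the function names are all assumptions, and the real model works on DINOv2 and SigLIP features rather than hand-written lists.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def compress_by_window(features, window=4, threshold=0.95):
    """Return the indices of the features kept after windowed pruning.

    The sequence is split into non-overlapping windows. The first feature of
    each window is always kept as the anchor; any other feature in the window
    is dropped when its cosine similarity to the anchor exceeds the threshold.
    """
    kept = []
    for start in range(0, len(features), window):
        anchor = features[start]
        kept.append(start)
        for i in range(start + 1, min(start + window, len(features))):
            if cosine_similarity(features[i], anchor) <= threshold:
                kept.append(i)
    return kept

frame_features = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0],
                  [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
print(compress_by_window(frame_features, window=3))
```

The same function can stand in for either frame-level or token-level pruning in this toy: what changes is only whether each vector represents a whole frame or a single spatial token.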

  • View profile for Massimiliano Viola

    ML @Bedrock Robotics | Ex Stanford, ETH Zurich | Computer Vision • 3D • Generative Models

    14,734 followers

    This DEFINITELY flew under the radar: just a few days ago, AI at Meta released V-JEPA 2.1, taking a massive step toward closing the gap between image and video domains. For a long time, image backbones were the only option for solving dense vision tasks. This model disagrees, showing that universal spatial understanding also emerges from large-scale video models! 🎥 Quick recap on V-JEPA: it is a joint embedding predictive architecture built on a classic teacher-student setup. The teacher sees the full video, and its weights slowly update as an exponential moving average of the student. The student sees a masked input and predicts the latent features of the missing regions rather than reconstructing them in pixel space. What changed between V1 and V2 was largely a matter of scale. The encoder grew to a 1B-parameter ViT-g, the dataset from 2M to 22M videos, training got longer and progressive, and clips were pushed to higher temporal and spatial resolution. V2 also introduced images into the mix via temporal duplication, training on 1M ImageNet samples. But the difference between V2 and V2.1 is conceptual, on top of just scaling. Sure, they pushed the model to 2B parameters and expanded the image dataset from 1M to 142M, but the real breakthrough lies in the training loss. In V-JEPA 2, supervision was only applied to the masked regions, despite the predictor outputting a token for every input, masked or not. Thus, the visible tokens were free to ignore local structure and aggregate global information if that would minimize the loss, similar to register tokens. V-JEPA 2.1 fixes this by extending supervision to the visible tokens too. Every patch, masked or visible, now has a training signal forcing it to encode where things actually are in space and time. This results in feature maps that look nothing like before: spatially structured, semantically coherent, and temporally consistent. 
Looking at the features below, you would almost think this is some small variant of DINOv3 (with due respect), except these results came from video pretraining! 🤯 This feature quality obviously translates to downstream tasks. Motion benchmarks got only a small buff, but spatial tasks are where the gains are staggering, with improvements ranging anywhere from 30 to 95%. The idea that we now basically have a SOTA image encoder baked into video features is crazy to me, and as someone working with video models on a daily basis, I could not be happier to put this to the test and distill it down into even smaller and faster variants than the smallest 80M. Resources are down in the comments. Try it out if you were using the previous version, and let me know how it goes! ⏬
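The two mechanisms described in the post, the exponential-moving-average teacher and the change from masked-only to all-token supervision, can be sketched as follows. This is a toy under stated assumptions, not Meta's code: weights and features are plain Python lists, the L1 distance and the momentum value are choices made for this example, and the function names are hypothetical.

```python
def ema_update(teacher, student, momentum=0.99):
    """Move each teacher weight a small step toward the matching student weight."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]

def latent_loss(predicted, target, masked, supervise_visible):
    """Mean L1 distance between predicted and teacher features, per token.

    predicted, target: one feature vector per token.
    masked: True where the student did not see the token.
    supervise_visible: False scores masked tokens only; True scores every token.
    """
    total, count = 0.0, 0
    for p, t, is_masked in zip(predicted, target, masked):
        if is_masked or supervise_visible:
            total += sum(abs(a - b) for a, b in zip(p, t)) / len(p)
            count += 1
    return total / count if count else 0.0

pred = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
teacher_feats = [[0.0, 0.0], [1.0, 3.0], [5.0, 2.0]]
mask = [False, True, False]

print(latent_loss(pred, teacher_feats, mask, supervise_visible=False))
print(latent_loss(pred, teacher_feats, mask, supervise_visible=True))
print(ema_update([1.0], [0.0], momentum=0.9))
```

In the masked-only setting, the error on the third token is invisible to the loss, which is the freedom the post says visible tokens used to ignore local structure. Scoring every token removes that freedom.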
