MBEN07/Web_Data_Integration_Engine

Web Data Integration Engine

Project Overview

The Web Data Integration Engine is a data integration and web application system that aggregates and matches video content from three social media platforms (Facebook, Instagram, and YouTube). It works from data scraped from these platforms, matches videos across platforms by TF-IDF title similarity and duration, and provides a unified interface to search and display the integrated data.

This project was developed as part of Master's-level coursework in Data Integration and demonstrates the practical application of data scraping, data matching algorithms, and web development.


Objective

The primary objectives of this project are to:

  1. Extract video metadata and statistics from multiple social media platforms (Facebook, Instagram, YouTube)
  2. Match videos across platforms based on title similarity and duration
  3. Integrate disparate data sources into a unified, normalized dataset
  4. Provide a user-friendly web interface to search and visualize integrated video data
  5. Display comparative statistics and engagement metrics across all platforms for the same video content

Project Structure

Web_Data_Integration_Engine/
├── README.md                    # Project documentation
├── interface.py                 # Flask web application backend
├── matching.ipynb               # Jupyter notebook for data matching & integration logic
├── mapped_data.csv             # Output file with integrated/matched data
├── Rapport.docx                # Project report documentation
├── templates/                  # HTML templates for web interface
│   ├── index.html             # Search page
│   ├── result.html            # Results display page
│   └── image.jfif             # Background image asset
├── scapped_data/              # Raw scraped data from social media platforms
│   ├── Facebook_data.csv      # Raw Facebook video data
│   ├── Instagram_data.csv     # Raw Instagram video data
│   └── YouTube_data.csv       # Raw YouTube video data
└── __pycache__/               # Python cache directory

Key Technologies & Libraries

Data Processing

  • pandas: Data manipulation and analysis
  • numpy: Numerical computing
  • scikit-learn: Machine learning algorithms (TF-IDF, Cosine Similarity)

Text Processing

  • langid: Language identification
  • TfidfVectorizer: Converting text to TF-IDF vectors
  • cosine_similarity: Computing similarity between text vectors

Web Framework

  • Flask: Lightweight Python web framework
  • Jinja2: Template rendering engine (integrated with Flask)

Additional Libraries

  • chardet: Character encoding detection

Data Processing Pipeline

1. Data Collection (scapped_data/)

The system collects data from three social media platforms:

Facebook Data (Facebook_data.csv)

  • Columns: Link, Date, Views, Duration, Likes, Comments, Title
  • Aggregated Facebook video content and engagement metrics

Instagram Data (Instagram_data.csv)

  • Columns: video_title, likes_count, comments_count, comments_list, DATE, heure, duration, vue
  • Video content from Instagram accounts

YouTube Data (YouTube_data.csv)

  • Columns: Titre de la vidéo, Catégorie, Date de publication, Durée, Likes, Dislikes, Vues, Nombres de Commentaires, Commentaires, Lien de la vidéo
  • Video content and engagement metrics from YouTube

2. Data Normalization (matching.ipynb)

Column Renaming: Standardizes column names across all datasets for consistency

YouTube: 'Titre de la vidéo' → 'Titre'
Instagram: 'video_title' → 'Titre'
Facebook: 'Title' → 'Titre'
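
As an illustration of the renaming step, here is a minimal sketch using pandas. The three source column names come from the lists above; the helper name `standardize_title_column` and the sample rows are hypothetical and not taken from the notebook.

```python
import pandas as pd

# Title column of each raw file, as listed in the Data Collection section.
TITLE_COLUMNS = {
    "youtube": "Titre de la vidéo",
    "instagram": "video_title",
    "facebook": "Title",
}


def standardize_title_column(df: pd.DataFrame, platform: str) -> pd.DataFrame:
    """Return a copy of df whose platform-specific title column is named 'Titre'."""
    return df.rename(columns={TITLE_COLUMNS[platform]: "Titre"})


# Made-up single-row frames standing in for the real CSV files.
youtube = pd.DataFrame({"Titre de la vidéo": ["Journal du soir"], "Likes": [10]})
instagram = pd.DataFrame({"video_title": ["Journal du soir"], "likes_count": [4]})
facebook = pd.DataFrame({"Title": ["Journal du soir"], "Likes": [7]})

youtube = standardize_title_column(youtube, "youtube")
instagram = standardize_title_column(instagram, "instagram")
facebook = standardize_title_column(facebook, "facebook")
```

After this step all three frames expose the title under the same name, so later steps can address `df["Titre"]` regardless of platform.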

Data Cleaning:

  • Removes platform-specific prefixes and special characters
  • Removes hashtags (#) from titles
  • Handles missing values and data type conversions
  • Fixes edge cases in view counts and comment counts
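
The title-cleaning rules could look roughly like the sketch below. It covers only hashtag and special-character removal; the platform-specific prefixes are not listed in this README, so they are left out. The function name `clean_title` and the lowercasing step are assumptions made for illustration.

```python
import re


def clean_title(raw: str) -> str:
    """Normalize a title: drop '#', replace other symbols with spaces, collapse whitespace."""
    text = raw.replace("#", "")           # keep the hashtag word, drop the marker
    text = re.sub(r"[^\w\s]", " ", text)  # punctuation and other symbols become spaces
    text = re.sub(r"\s+", " ", text)      # collapse runs of whitespace
    return text.strip().lower()           # lowercasing is an assumption of this sketch


cleaned = clean_title("#Info | Journal  du Soir!")
accented = clean_title("Météo: #Maroc")
```

Because `\w` is Unicode-aware for Python strings, accented letters such as "é" survive the cleaning.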

Duration Normalization:

  • Converts time formats (HH:MM:SS, MM:SS) to seconds
  • Enables comparison across platforms with different time representations
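
A minimal sketch of the duration conversion, assuming durations arrive as colon-separated strings (HH:MM:SS, MM:SS, or plain seconds). The function name `duration_to_seconds` is hypothetical.

```python
def duration_to_seconds(value: str) -> int:
    """Convert 'HH:MM:SS', 'MM:SS', or 'SS' to a number of seconds."""
    parts = [int(p) for p in value.strip().split(":")]
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"unsupported duration format: {value!r}")
    seconds = 0
    for part in parts:
        # Each step shifts the running total up one unit (hours -> minutes -> seconds).
        seconds = seconds * 60 + part
    return seconds


short_clip = duration_to_seconds("07:30")
long_clip = duration_to_seconds("01:02:03")
```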

3. Data Matching Algorithm

Uses a multi-criteria matching approach:

TF-IDF Based Title Matching:

  • Creates TF-IDF vectors for all video titles across platforms
  • Computes cosine similarity matrix between titles
  • Sets threshold: similarity score > 0.3
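
The title-matching step can be sketched with scikit-learn as follows. The titles are made-up examples and only two platforms are shown; the 0.3 threshold is the one stated above. Fitting one vectorizer on all titles, so that every platform shares the same vocabulary, is an assumption of this sketch.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

SIMILARITY_THRESHOLD = 0.3  # threshold stated in this README

# Made-up, already cleaned titles.
youtube_titles = ["journal du soir edition du 15 mai", "meteo du week end"]
facebook_titles = ["meteo week end", "journal du soir 15 mai"]

# Fit on all titles so both platforms share one vocabulary and one set of IDF weights.
vectorizer = TfidfVectorizer()
vectorizer.fit(youtube_titles + facebook_titles)

yt_vectors = vectorizer.transform(youtube_titles)
fb_vectors = vectorizer.transform(facebook_titles)

# similarity[i, j] compares YouTube title i with Facebook title j.
similarity = cosine_similarity(yt_vectors, fb_vectors)

candidates = [
    (i, j)
    for i in range(similarity.shape[0])
    for j in range(similarity.shape[1])
    if similarity[i, j] > SIMILARITY_THRESHOLD
]
```

Here `candidates` holds the index pairs whose titles pass the threshold; the duration check described next would then filter them further.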

Duration-Based Validation:

  • Calculates absolute duration differences
  • Sets threshold: duration difference < 3 seconds
  • Ensures matches are temporally consistent

Matching Logic: For each potential triplet (Instagram, YouTube, Facebook):

  • Validates title similarity for all three pairs (Instagram-YouTube, Instagram-Facebook, YouTube-Facebook)
  • Validates duration similarity for all three pairs
  • Only creates a match when ALL conditions are met

Output: DataFrame containing matched videos across all three platforms
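
The triplet logic above can be sketched as below. This is not the notebook's code: the function name `match_triplets`, the `(title, seconds)` tuple layout, and the sample records are assumptions, and the sketch returns index triplets rather than a DataFrame.

```python
from itertools import product

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

TITLE_THRESHOLD = 0.3   # cosine similarity must exceed this
DURATION_THRESHOLD = 3  # durations must differ by less than this many seconds


def match_triplets(instagram, youtube, facebook):
    """Each argument is a list of (title, seconds). Returns (ig, yt, fb) index triplets."""
    all_titles = [title for title, _ in instagram + youtube + facebook]
    sim = cosine_similarity(TfidfVectorizer().fit_transform(all_titles))
    yt_off = len(instagram)          # row offset of YouTube titles in sim
    fb_off = yt_off + len(youtube)   # row offset of Facebook titles in sim

    def compatible(row_a, row_b, rec_a, rec_b):
        close_title = sim[row_a, row_b] > TITLE_THRESHOLD
        close_duration = abs(rec_a[1] - rec_b[1]) < DURATION_THRESHOLD
        return close_title and close_duration

    matches = []
    for i, y, f in product(range(len(instagram)), range(len(youtube)), range(len(facebook))):
        ig, yt, fb = instagram[i], youtube[y], facebook[f]
        # All three pairs must pass both checks.
        if (compatible(i, yt_off + y, ig, yt)
                and compatible(i, fb_off + f, ig, fb)
                and compatible(yt_off + y, fb_off + f, yt, fb)):
            matches.append((i, y, f))
    return matches


# Made-up records: the news item lines up everywhere, the weather clip does not.
instagram = [("journal du soir 15 mai", 450), ("meteo week end", 60)]
youtube = [("journal du soir edition du 15 mai", 451), ("meteo du week end", 95)]
facebook = [("journal du soir 15 mai", 449), ("meteo week end", 61)]

matches = match_triplets(instagram, youtube, facebook)
```

Enumerating every triplet costs time proportional to the product of the three dataset sizes, which is acceptable for small scraped files but would need pruning (for example, filtering pairs first) on larger ones.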

4. Data Integration (mapped_data.csv)

Combines matched data into a unified dataset with:

  • All columns from matched videos
  • Deduplicated column names
  • Removed unnecessary columns (duration conversions, platform links, time metadata)
  • Added channel_name field for organizational tracking

Final Columns Include:

  • channel_name
  • ID_vidéo
  • Titre (Video Title)
  • Date_Publication
  • Durée (Duration in seconds)
  • Platform-specific metrics:
    • Youtube_Likes, Youtube_Vues, Youtube_Nombre_Commentaires, Youtube_Commentaires
    • Instagram_Likes, Instagram_Vues, Instagram_Nombre_Commentaires, Instagram_Commentaires
    • Facebook_Likes, Facebook_Vues, Facebook_Commentaires
  • Catégorie (Category)
  • Youtube_Lien

Web Application Features

Frontend Interface

Home Page (index.html)

  • Clean, centered search interface
  • Input fields for:
    • Channel name (text input)
    • Publication date (date picker)
  • Background image with modern styling
  • Submit button to search

Results Page (result.html)

  • Displays matched videos for the selected channel and date
  • Features for each video:
    • Video Player: Embedded YouTube player for viewing
    • Video Info: Title, Category, Duration
    • Cross-Platform Statistics:
      • YouTube Stats: Likes, Views, Comments, Comment text
      • Facebook Stats: Likes, Views, Comments
      • Instagram Stats: Likes, Views, Comments, Comment text
    • Video Count: Total number of videos found for the date

Backend API (interface.py)

Flask Routes:

  • GET /: Renders home search page (index.html)
  • POST /results: Processes search query and returns filtered results (result.html)

Search Logic:

  1. Receives channel_name and date from form submission
  2. Filters mapped_data.csv by channel_name match
  3. Parses publication date (YYYY-MM-DD format)
  4. Returns matching videos with all integrated data
  5. Converts data to dictionary format for template rendering
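
The filtering part of these steps might look like the following sketch, shown without the Flask route so it can run on its own. The column names `channel_name` and `Date_Publication` come from the final column list above; the function name `search_videos`, the exact-match comparison on the channel name, and the sample rows are assumptions.

```python
import pandas as pd


def search_videos(data: pd.DataFrame, channel_name: str, date: str) -> list:
    """Return rows for one channel and one publication day as a list of dicts."""
    target = pd.to_datetime(date, format="%Y-%m-%d")  # date string from the form
    # Unparseable dates become NaT and never match; normalize() drops the time of day.
    published = pd.to_datetime(data["Date_Publication"], errors="coerce").dt.normalize()
    mask = (data["channel_name"] == channel_name) & (published == target)
    return data[mask].to_dict(orient="records")  # list of dicts for the template


# Made-up rows standing in for mapped_data.csv.
data = pd.DataFrame({
    "channel_name": ["MEDI1 TV", "MEDI1 TV", "Other"],
    "Titre": ["a", "b", "c"],
    "Date_Publication": ["2023-05-15", "2023-05-16", "2023-05-15"],
})

results = search_videos(data, "MEDI1 TV", "2023-05-15")
```

In the application, a function like this would be called from the POST /results handler with the submitted form values, and its return value passed to result.html.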

Data Processing:

  • ISO8601 to HH:MM:SS duration conversion (utility function included)
  • Date parsing and filtering using pandas datetime functionality
  • HTML unescaping for proper rendering
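
For reference, an ISO 8601 duration conversion can be written as in the sketch below. It handles only the time part (for example `PT7M30S`), not day or week components, and the function name `iso8601_to_hms` is hypothetical rather than the name used in interface.py.

```python
import re

# Matches durations such as PT1H2M3S, PT7M30S, or PT45S (time part only).
_ISO_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def iso8601_to_hms(duration: str) -> str:
    """Convert an ISO 8601 time duration to a zero-padded HH:MM:SS string."""
    match = _ISO_DURATION.match(duration)
    if match is None or duration == "PT":
        raise ValueError(f"unsupported duration: {duration!r}")
    # Missing components (e.g. no hours) default to zero.
    hours, minutes, seconds = (int(group) if group else 0 for group in match.groups())
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


formatted = iso8601_to_hms("PT7M30S")
```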

Installation & Setup

Prerequisites

  • Python 3.7+
  • pip (Python package manager)

Step 1: Install Dependencies

pip install pandas numpy scikit-learn langid flask chardet

Alternatively, the notebook includes its own install step, which covers only chardet:

pip install chardet

Step 2: Prepare Data

Ensure the following files are present in the project directory:

  • scapped_data/Facebook_data.csv
  • scapped_data/Instagram_data.csv
  • scapped_data/YouTube_data.csv

Step 3: Run Data Integration

Execute the Jupyter notebook matching.ipynb to:

  1. Load raw data from all three platforms
  2. Normalize column names and clean data
  3. Run matching algorithm
  4. Generate mapped_data.csv

This can be done via:

jupyter notebook matching.ipynb
# Then run all cells

Step 4: Launch Web Application

python interface.py

The application will start on http://localhost:5000/


Usage Guide

Searching for Videos

  1. Navigate to Home Page: Open http://localhost:5000/ in your browser
  2. Enter Search Criteria:
    • Channel Name (e.g., "MEDI1 TV")
    • Publication Date (e.g., 2023-05-15)
  3. Click "Chercher" (Search): Submit the form
  4. View Results: See all matched videos for the selected channel and date with cross-platform statistics

Understanding Results

  • Total Count: Number of videos found for the specified date
  • Video Player: Click to watch the video on YouTube
  • Left Panel (YouTube Stats): Engagement metrics from YouTube
  • Middle Panel (Facebook Stats): Engagement metrics from Facebook
  • Right Panel (Instagram Stats): Engagement metrics from Instagram

Data Flow Diagram

┌─────────────────────────────┐
│  Social Media Platforms     │
│ (YouTube, Facebook, Inst)   │
└────────────┬────────────────┘
             │ Data Scraping
             ↓
┌─────────────────────────────┐
│   Raw Data (CSV files)      │
│ • YouTube_data.csv          │
│ • Facebook_data.csv         │
│ • Instagram_data.csv        │
└────────────┬────────────────┘
             │ matching.ipynb
             │ (Normalization)
             ↓
┌─────────────────────────────┐
│  Cleaned & Normalized Data  │
│ (Standardized columns)      │
└────────────┬────────────────┘
             │ Data Matching
             │ (TF-IDF + Duration)
             ↓
┌─────────────────────────────┐
│   Matched Data              │
│ (Videos across platforms)   │
└────────────┬────────────────┘
             │ Data Integration
             ↓
┌─────────────────────────────┐
│    mapped_data.csv          │
│ (Unified dataset)           │
└────────────┬────────────────┘
             │ Flask Backend
             ↓
┌─────────────────────────────┐
│   Web Interface             │
│ • Search Page (index.html)  │
│ • Results Page (result.html)│
└─────────────────────────────┘

Key Algorithms

TF-IDF (Term Frequency-Inverse Document Frequency)

Used to convert video titles into numerical vectors:

  • Term Frequency: How often a word appears in a document
  • Inverse Document Frequency: How unique/rare a word is across all documents
  • Result: Importance weight for each term in each title
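
To make the weighting concrete, the sketch below fits scikit-learn's `TfidfVectorizer` on three made-up titles and reads back the learned IDF values. With the vectorizer's default settings, a word that appears in every title gets the lowest weight and a word that appears in only one gets the highest.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up titles: "journal" is in all three, "du" in two, "meteo" in one.
titles = [
    "journal du soir",
    "journal du matin",
    "journal meteo",
]

vectorizer = TfidfVectorizer()
weights = vectorizer.fit_transform(titles)  # one row per title, one column per term

# Map each term to its inverse document frequency.
idf = {term: vectorizer.idf_[index] for term, index in vectorizer.vocabulary_.items()}
```

In this example `idf["journal"]` is smaller than `idf["du"]`, which is smaller than `idf["meteo"]`, so rare words contribute more to a title's vector than common ones.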

Cosine Similarity

Measures similarity between two TF-IDF vectors:

  • Range: 0 to 1 for TF-IDF vectors, whose components are never negative (0 = no shared terms, 1 = same direction, i.e. identical term weighting)
  • Formula: similarity = (A · B) / (||A|| × ||B||)
  • Used to find similar titles across platforms
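
The formula can be checked by hand with a few lines of plain Python. This is only an illustration of the arithmetic; the project itself relies on scikit-learn's `cosine_similarity`, and the function name `cosine` is hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity (A · B) / (||A|| × ||B||) for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here: an all-zero vector matches nothing
    return dot / (norm_a * norm_b)


same = cosine([1, 1, 0], [1, 1, 0])      # identical vectors
disjoint = cosine([1, 0, 0], [0, 1, 0])  # no shared terms
partial = cosine([1, 1, 0], [1, 0, 0])   # one of two terms shared
```

The three cases give approximately 1, exactly 0, and about 0.71 respectively.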

Multi-Criteria Matching

Combines multiple signals for robust matching:

  1. Title similarity (cosine similarity > 0.3)
  2. Duration similarity (difference < 3 seconds)
  3. All three platform pairs must match

Example Output

Input (Search)

  • Channel Name: MEDI1 TV
  • Date: 2023-05-15

Output (Sample Results)

Chaîne: MEDI1 TV
Date: 2023-05-15
Nombre total de vidéos trouvées: 3

Video 1: "Breaking News: Important Update"
  Catégorie: News
  Durée: 450s
  
  YouTube Stats: 
    Likes: 2.5k
    Vues: 45k
    Commentaires: 150
  
  Facebook Stats:
    Likes: 3.2k
    Vues: 52k
    Commentaires: 200
  
  Instagram Stats:
    Likes: 1.8k
    Vues: 28k
    Commentaires: 85

Limitations & Future Enhancements

Current Limitations

  1. Duplicate Column Handling: Currently removes duplicates; could preserve and differentiate platform-specific metrics
  2. Fixed Thresholds: TF-IDF (0.3) and duration (3 seconds) thresholds are hardcoded; could be configurable
  3. No User Authentication: Web interface has no login system
  4. Static Data: Updates require re-running the notebook
  5. Single Channel: Currently filters for "MEDI1 TV"; could support multiple channels

Potential Enhancements

  • Real-time Updates: Implement automated data scraping and matching
  • Machine Learning: Use more sophisticated matching algorithms (e.g., LSTM, BERT embeddings)
  • Database Integration: Replace CSV with SQL database for scalability
  • Advanced Analytics: Add trend analysis, sentiment analysis, engagement forecasting
  • API Interface: Expose data through REST API
  • Caching: Implement caching for improved performance
  • User Interface: Enhanced frontend with filtering, sorting, and export capabilities
  • Error Handling: Comprehensive error handling and logging
  • Testing: Unit tests and integration tests

Project Insights

What the Data Reveals

  1. Cross-Platform Presence: Identifies videos present across multiple platforms, indicating strategic content distribution
  2. Engagement Variations: Shows how the same video performs differently on each platform
  3. Audience Differences: Platform-specific engagement metrics reveal different audience behaviors
  4. Content Strategy: Helps understand which content types succeed on which platforms

Matching Accuracy

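No measured figures are reported; qualitatively, the TF-IDF + duration matching approach is expected to give:

  • Precision: High (few false positives, because a match must pass both the title check and the duration check)
  • Recall: Moderate (videos whose titles or durations vary between platforms may be missed)
  • Tuning the thresholds trades one metric against the other, depending on requirements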

Technical Challenges & Solutions

Challenge                        | Solution
---------------------------------|------------------------------------------------------------------
Platform-specific column naming  | Standardized column naming during normalization
Varied duration formats          | Converted all to seconds for comparison
Special characters in titles     | Removed platform prefixes and special symbols
Missing data                     | Handled with pandas null checks
Performance with large datasets  | TF-IDF vectorization is efficient; CSV suitable for current scale

Authors & Credits

Project: Web Data Integration Engine
Level: Master's Degree (M2) - Data Integration Workshop
Institution: [Your University]
Course: Intégration Données (Data Integration)
Date: 2023-2024


License

This project is developed for educational purposes as part of Master's level coursework.


Contact & Support

For questions, issues, or contributions, please refer to the project documentation or contact the development team.


Appendix: File Descriptions

File                | Purpose
--------------------|----------------------------------------------------------------
interface.py        | Flask application for web interface and routing
matching.ipynb      | Jupyter notebook containing data processing and matching logic
mapped_data.csv     | Output file with integrated data (generated by notebook)
index.html          | Search page template
result.html         | Results display template
Facebook_data.csv   | Raw data from Facebook
Instagram_data.csv  | Raw data from Instagram
YouTube_data.csv    | Raw data from YouTube
Rapport.docx        | Project report documentation

Last Updated: May 2024
Project Status: Active
