Web Data Integration Engine

Project Overview

The Web Data Integration Engine is a data integration and web application system that aggregates and matches video content from three social media platforms (Facebook, Instagram, and YouTube). The system scrapes data from these platforms, matches videos across platforms using TF-IDF title similarity combined with a duration check, and provides a unified interface to search and display the integrated data.

This project was developed as part of Master's-level coursework in Data Integration. It demonstrates the practical application of data scraping, data matching algorithms, and web development.

Objective

The primary objectives of this project are to:

Extract video metadata and statistics from multiple social media platforms (Facebook, Instagram, YouTube)
Match videos across platforms based on title similarity and duration
Integrate disparate data sources into a unified, normalized dataset
Provide a user-friendly web interface to search and visualize integrated video data
Display comparative statistics and engagement metrics across all platforms for the same video content

Project Structure

Web_Data_Integration_Engine/
├── README.md                    # Project documentation
├── interface.py                 # Flask web application backend
├── matching.ipynb               # Jupyter notebook for data matching & integration logic
├── mapped_data.csv             # Output file with integrated/matched data
├── Rapport.docx                # Project report documentation
├── templates/                  # HTML templates for web interface
│   ├── index.html             # Search page
│   ├── result.html            # Results display page
│   └── image.jfif             # Background image asset
├── scapped_data/              # Raw scraped data from social media platforms
│   ├── Facebook_data.csv      # Raw Facebook video data
│   ├── Instagram_data.csv     # Raw Instagram video data
│   └── YouTube_data.csv       # Raw YouTube video data
└── __pycache__/               # Python cache directory

Key Technologies & Libraries

Data Processing

pandas: Data manipulation and analysis
numpy: Numerical computing
scikit-learn: Machine learning algorithms (TF-IDF, Cosine Similarity)

Text Processing

langid: Language identification
TfidfVectorizer: Converting text to TF-IDF vectors
cosine_similarity: Computing similarity between text vectors

Web Framework

Flask: Lightweight Python web framework
Jinja2: Template rendering engine (integrated with Flask)

Additional Libraries

chardet: Character encoding detection

Data Processing Pipeline

1. Data Collection (scapped_data/)

The system collects data from three social media platforms:

Facebook Data (Facebook_data.csv)

Columns: Link, Date, Views, Duration, Likes, Comments, Title
Aggregated Facebook video content and engagement metrics

Instagram Data (Instagram_data.csv)

Columns: video_title, likes_count, comments_count, comments_list, DATE, heure, duration, vue
Video content from Instagram accounts

YouTube Data (YouTube_data.csv)

Columns: Titre de la vidéo, Catégorie, Date de publication, Durée, Likes, Dislikes, Vues, Nombres de Commentaires, Commentaires, Lien de la vidéo
Video content and engagement metrics from YouTube

2. Data Normalization (matching.ipynb)

Column Renaming: Standardizes column names across all datasets for consistency

YouTube: 'Titre de la vidéo' → 'Titre'
Instagram: 'video_title' → 'Titre'
Facebook: 'Title' → 'Titre'
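The renaming step can be sketched with pandas as follows. This is a minimal illustration, not the notebook's actual code: the `RENAME_MAPS` dictionary and the `normalize_columns` helper are hypothetical names, and only the title column from the rules above is mapped.

```python
import pandas as pd

# Column maps taken from the renaming rules above; only the title column is shown.
# RENAME_MAPS and normalize_columns are hypothetical names used for illustration.
RENAME_MAPS = {
    "youtube": {"Titre de la vidéo": "Titre"},
    "instagram": {"video_title": "Titre"},
    "facebook": {"Title": "Titre"},
}


def normalize_columns(df: pd.DataFrame, platform: str) -> pd.DataFrame:
    """Return a copy of df with the platform's title column renamed to 'Titre'."""
    return df.rename(columns=RENAME_MAPS[platform])


# Tiny made-up frame standing in for YouTube_data.csv.
youtube = pd.DataFrame({"Titre de la vidéo": ["Journal du soir"], "Likes": [10]})
print(normalize_columns(youtube, "youtube").columns.tolist())
```

Keeping the maps in one dictionary makes it easy to add the remaining columns (likes, views, duration) in the same place.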

Data Cleaning:

Removes platform-specific prefixes and special characters
Removes hashtags (#) from titles
Handles missing values and data type conversions
Fixes edge cases in view counts and comment counts
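A title-cleaning step of this kind might look like the sketch below. The exact rules used in the notebook are not listed here, so this version makes two assumptions: only the `#` symbol is dropped (the tag text is kept), and every character that is not a letter, digit, underscore, or whitespace counts as "special". The `clean_title` name is hypothetical.

```python
import re


def clean_title(title: str) -> str:
    """Drop '#' symbols and other special characters, then collapse whitespace."""
    # Assumption: the hashtag text is kept, only the '#' symbol is removed.
    no_hash = title.replace("#", " ")
    # \w is Unicode-aware in Python 3, so accented letters such as 'é' are kept.
    no_special = re.sub(r"[^\w\s]", " ", no_hash)
    return re.sub(r"\s+", " ", no_special).strip()


print(clean_title("Journal du soir | #MEDI1TV  #Info"))
# → Journal du soir MEDI1TV Info
```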

Duration Normalization:

Converts time formats (HH:MM:SS, MM:SS) to seconds
Enables comparison across platforms with different time representations
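The duration conversion can be sketched as a small helper. The `duration_to_seconds` name is hypothetical, and the sketch assumes durations arrive as colon-separated strings in `HH:MM:SS` or `MM:SS` form, as described above.

```python
def duration_to_seconds(value: str) -> int:
    """Convert 'HH:MM:SS' or 'MM:SS' (or plain 'SS') to a number of seconds."""
    parts = [int(part) for part in value.strip().split(":")]
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"Unsupported duration format: {value!r}")
    seconds = 0
    # Each step shifts the running total by one base-60 position.
    for part in parts:
        seconds = seconds * 60 + part
    return seconds


print(duration_to_seconds("07:30"))     # → 450
print(duration_to_seconds("01:02:03"))  # → 3723
```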

3. Data Matching Algorithm

Uses a multi-criteria matching approach:

TF-IDF Based Title Matching:

Creates TF-IDF vectors for all video titles across platforms
Computes cosine similarity matrix between titles
Sets threshold: similarity score > 0.3
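The title-matching step can be sketched with scikit-learn as shown below. The sample titles are made up, and fitting one vectorizer on the titles of both platforms is an assumption about how the shared vocabulary is built; the 0.3 threshold is the one stated above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up, already-cleaned titles standing in for two of the platforms.
instagram_titles = ["journal du soir edition complete", "meteo du jour"]
youtube_titles = ["journal du soir edition complete", "match de football resume"]

# Fit on all titles so both platforms share one vocabulary and one set of IDF weights.
vectorizer = TfidfVectorizer()
vectorizer.fit(instagram_titles + youtube_titles)
instagram_vectors = vectorizer.transform(instagram_titles)
youtube_vectors = vectorizer.transform(youtube_titles)

# sim[i, j] is the cosine similarity between Instagram title i and YouTube title j.
sim = cosine_similarity(instagram_vectors, youtube_vectors)

THRESHOLD = 0.3
candidates = [
    (i, j)
    for i in range(sim.shape[0])
    for j in range(sim.shape[1])
    if sim[i, j] > THRESHOLD
]
print(candidates)
```

Only the pair of identical titles passes the threshold here; the two titles that share just the word "du" score well below 0.3.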

Duration-Based Validation:

Calculates absolute duration differences
Sets threshold: duration difference < 3 seconds
Ensures matches are temporally consistent

Matching Logic: For each potential triplet (Instagram, YouTube, Facebook):

Validates title similarity for all three pairs (Instagram-YouTube, Instagram-Facebook, YouTube-Facebook)
Validates duration similarity for all three pairs
Only creates a match when ALL conditions are met

Output: DataFrame containing matched videos across all three platforms
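The triplet logic above can be sketched as a plain loop over all index combinations. The function name `match_triplets`, its argument layout (three pairwise similarity matrices and three lists of durations in seconds), and the sample numbers are assumptions made for this illustration; the notebook may organize the data differently.

```python
from itertools import product

import numpy as np

TITLE_THRESHOLD = 0.3     # cosine similarity must be strictly greater
DURATION_THRESHOLD = 3    # absolute difference in seconds must be strictly smaller


def match_triplets(sim_ig_yt, sim_ig_fb, sim_yt_fb, dur_ig, dur_yt, dur_fb):
    """Return (instagram, youtube, facebook) index triplets passing all six checks."""
    matches = []
    for i, y, f in product(range(len(dur_ig)), range(len(dur_yt)), range(len(dur_fb))):
        titles_ok = (
            sim_ig_yt[i, y] > TITLE_THRESHOLD
            and sim_ig_fb[i, f] > TITLE_THRESHOLD
            and sim_yt_fb[y, f] > TITLE_THRESHOLD
        )
        durations_ok = (
            abs(dur_ig[i] - dur_yt[y]) < DURATION_THRESHOLD
            and abs(dur_ig[i] - dur_fb[f]) < DURATION_THRESHOLD
            and abs(dur_yt[y] - dur_fb[f]) < DURATION_THRESHOLD
        )
        if titles_ok and durations_ok:
            matches.append((i, y, f))
    return matches


# Made-up example: two Instagram videos, one YouTube video, one Facebook video.
matches = match_triplets(
    sim_ig_yt=np.array([[0.9], [0.8]]),
    sim_ig_fb=np.array([[0.85], [0.7]]),
    sim_yt_fb=np.array([[0.95]]),
    dur_ig=[450, 460],
    dur_yt=[451],
    dur_fb=[449],
)
print(matches)  # → [(0, 0, 0)]
```

The second Instagram video has similar titles but is 9 seconds longer than the YouTube video, so it is rejected by the duration check. Note that the exhaustive loop grows with the product of the three dataset sizes.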

4. Data Integration (mapped_data.csv)

Combines matched data into a unified dataset with:

All columns from matched videos
Deduplicated column names
Removed unnecessary columns (duration conversions, platform links, time metadata)
Added channel_name field for organizational tracking

Final Columns Include:

channel_name
ID_vidéo
Titre (Video Title)
Date_Publication
Durée (Duration in seconds)
Platform-specific metrics:
- Youtube_Likes, Youtube_Vues, Youtube_Nombre_Commentaires, Youtube_Commentaires
- Instagram_Likes, Instagram_Vues, Instagram_Nombre_Commentaires, Instagram_Commentaires
- Facebook_Likes, Facebook_Vues, Facebook_Commentaires
Catégorie (Category)
Youtube_Lien
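How matched rows could be combined into platform-prefixed columns is sketched below. This is an illustration under several assumptions: each normalized frame is assumed to expose `Titre`, `Likes`, and `Vues` columns, the shared title is taken from the YouTube row, and the `integrate` helper is a hypothetical name. Only a subset of the final columns is built.

```python
import pandas as pd


def integrate(matches, instagram, youtube, facebook):
    """Build one row per matched triplet with platform-prefixed metric columns."""
    rows = []
    for i, y, f in matches:
        # Assumption: the YouTube title is used as the shared title.
        row = {"Titre": youtube.loc[y, "Titre"]}
        sources = (("Youtube", youtube, y), ("Instagram", instagram, i), ("Facebook", facebook, f))
        for prefix, frame, idx in sources:
            for column in ("Likes", "Vues"):
                row[f"{prefix}_{column}"] = frame.loc[idx, column]
        rows.append(row)
    return pd.DataFrame(rows)


# Made-up single-row frames standing in for the normalized datasets.
instagram = pd.DataFrame({"Titre": ["journal"], "Likes": [18], "Vues": [28]})
youtube = pd.DataFrame({"Titre": ["journal"], "Likes": [25], "Vues": [45]})
facebook = pd.DataFrame({"Titre": ["journal"], "Likes": [32], "Vues": [52]})

unified = integrate([(0, 0, 0)], instagram, youtube, facebook)
print(unified.columns.tolist())
```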

Web Application Features

Frontend Interface

Home Page (index.html)

Clean, centered search interface
Input fields for:
- Channel name (text input)
- Publication date (date picker)
Background image with modern styling
Submit button to search

Results Page (result.html)

Displays matched videos for the selected channel and date
Features for each video:
- Video Player: Embedded YouTube player for viewing
- Video Info: Title, Category, Duration
- Cross-Platform Statistics:
  - YouTube Stats: Likes, Views, Comments, Comment text
  - Facebook Stats: Likes, Views, Comments
  - Instagram Stats: Likes, Views, Comments, Comment text
- Video Count: Total number of videos found for the date

Backend API (interface.py)

Flask Routes:

GET /: Renders home search page (index.html)
POST /results: Processes search query and returns filtered results (result.html)
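The two routes can be sketched as a minimal Flask application. To keep the example self-contained, inline template strings stand in for `index.html` and `result.html`; their markup is hypothetical, and the real `interface.py` presumably renders the files in `templates/`. The form field names `channel_name` and `date` are the ones named under Search Logic.

```python
from flask import Flask, render_template_string, request

app = Flask(__name__)

# Hypothetical inline stand-ins for templates/index.html and templates/result.html.
INDEX_HTML = """<form method="post" action="/results">
  <input type="text" name="channel_name">
  <input type="date" name="date">
  <button type="submit">Chercher</button>
</form>"""
RESULT_HTML = "<p>{{ channel_name }} / {{ date }}</p>"


@app.route("/", methods=["GET"])
def index():
    """Render the search page."""
    return render_template_string(INDEX_HTML)


@app.route("/results", methods=["POST"])
def results():
    """Read the submitted form fields and render the results page."""
    channel_name = request.form.get("channel_name", "")
    date = request.form.get("date", "")
    return render_template_string(RESULT_HTML, channel_name=channel_name, date=date)
```

The routes can be exercised without starting a server through Flask's test client, for example `app.test_client().post("/results", data={"channel_name": "MEDI1 TV", "date": "2023-05-15"})`.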

Search Logic:

Receives channel_name and date from form submission
Filters mapped_data.csv by channel_name match
Parses publication date (YYYY-MM-DD format)
Returns matching videos with all integrated data
Converts data to dictionary format for template rendering
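The filtering steps above can be sketched as a standalone pandas function. The `search_videos` name is hypothetical, and the sketch assumes an exact match on `channel_name` and a `Date_Publication` column that pandas can parse; the real backend may compare values differently.

```python
import pandas as pd


def search_videos(data: pd.DataFrame, channel_name: str, date: str) -> list:
    """Return rows for one channel and one publication date as a list of dicts."""
    wanted = pd.to_datetime(date, format="%Y-%m-%d").date()
    # Unparseable dates become NaT and therefore never match.
    published = pd.to_datetime(data["Date_Publication"], errors="coerce").dt.date
    mask = (data["channel_name"] == channel_name) & (published == wanted)
    return data[mask].to_dict(orient="records")


# Made-up rows standing in for mapped_data.csv.
data = pd.DataFrame(
    {
        "channel_name": ["MEDI1 TV", "MEDI1 TV", "Other"],
        "Date_Publication": ["2023-05-15", "2023-05-16", "2023-05-15"],
        "Titre": ["A", "B", "C"],
    }
)
videos = search_videos(data, "MEDI1 TV", "2023-05-15")
print(len(videos), videos[0]["Titre"])
```

A list of dictionaries is convenient for Jinja2 templates, which can loop over the records and read fields by name.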

Data Processing:

ISO 8601 to HH:MM:SS duration conversion (utility function included)
Date parsing and filtering using pandas datetime functionality
HTML unescaping for proper rendering
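A duration utility of this kind might look like the sketch below. It is not the function from `interface.py`: the `iso8601_to_hms` name is hypothetical, and the sketch handles only the time part of an ISO 8601 duration (values such as `PT7M30S`), not days or larger units.

```python
import re

# Matches the time part of an ISO 8601 duration, e.g. PT1H2M3S, PT7M30S, PT45S.
_ISO_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def iso8601_to_hms(duration: str) -> str:
    """Convert an ISO 8601 time duration such as 'PT7M30S' to 'HH:MM:SS'."""
    match = _ISO_DURATION.match(duration)
    if match is None or duration == "PT":
        raise ValueError(f"Unsupported ISO 8601 duration: {duration!r}")
    # Missing components (for example no hours) default to zero.
    hours, minutes, seconds = (int(group) if group else 0 for group in match.groups())
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


print(iso8601_to_hms("PT7M30S"))   # → 00:07:30
print(iso8601_to_hms("PT1H2M3S"))  # → 01:02:03
```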

Installation & Setup

Prerequisites

Python 3.7+
pip (Python package manager)

Step 1: Install Dependencies

pip install pandas numpy scikit-learn langid flask chardet

Alternatively, if the other packages are already installed, install only chardet, which the notebook installs on its own:

pip install chardet

Step 2: Prepare Data

Ensure the following files are present in the project directory:

scapped_data/Facebook_data.csv
scapped_data/Instagram_data.csv
scapped_data/YouTube_data.csv

Step 3: Run Data Integration

Execute the Jupyter notebook matching.ipynb to:

Load raw data from all three platforms
Normalize column names and clean data
Run matching algorithm
Generate mapped_data.csv

This can be done via:

jupyter notebook matching.ipynb
# Then run all cells

Step 4: Launch Web Application

python interface.py

The application will start on http://localhost:5000/

Usage Guide

Searching for Videos

Navigate to Home Page: Open http://localhost:5000/ in your browser
Enter Search Criteria:
- Channel Name (e.g., "MEDI1 TV")
- Publication Date (e.g., 2023-05-15)
Click "Chercher" (Search): Submit the form
View Results: See all matched videos for the selected channel and date with cross-platform statistics

Understanding Results

Total Count: Number of videos found for the specified date
Video Player: Click to watch the video on YouTube
Left Panel (YouTube Stats): Engagement metrics from YouTube
Middle Panel (Facebook Stats): Engagement metrics from Facebook
Right Panel (Instagram Stats): Engagement metrics from Instagram

Data Flow Diagram

┌─────────────────────────────┐
│  Social Media Platforms     │
│ (YouTube, Facebook, Inst)   │
└────────────┬────────────────┘
             │ Data Scraping
             ↓
┌─────────────────────────────┐
│   Raw Data (CSV files)      │
│ • YouTube_data.csv          │
│ • Facebook_data.csv         │
│ • Instagram_data.csv        │
└────────────┬────────────────┘
             │ matching.ipynb
             │ (Normalization)
             ↓
┌─────────────────────────────┐
│  Cleaned & Normalized Data  │
│ (Standardized columns)      │
└────────────┬────────────────┘
             │ Data Matching
             │ (TF-IDF + Duration)
             ↓
┌─────────────────────────────┐
│   Matched Data              │
│ (Videos across platforms)   │
└────────────┬────────────────┘
             │ Data Integration
             ↓
┌─────────────────────────────┐
│    mapped_data.csv          │
│ (Unified dataset)           │
└────────────┬────────────────┘
             │ Flask Backend
             ↓
┌─────────────────────────────┐
│   Web Interface             │
│ • Search Page (index.html)  │
│ • Results Page (result.html)│
└─────────────────────────────┘

Key Algorithms

TF-IDF (Term Frequency-Inverse Document Frequency)

Used to convert video titles into numerical vectors:

Term Frequency: How often a word appears in a document
Inverse Document Frequency: How unique/rare a word is across all documents
Result: Importance weight for each term in each title

Cosine Similarity

Measures similarity between two TF-IDF vectors:

Range: 0 to 1 for TF-IDF vectors, whose components are never negative (0 = no terms in common, 1 = same direction, i.e. identical term weighting)
Formula: similarity = (A · B) / (||A|| × ||B||)
Used to find similar titles across platforms
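The formula can be checked by hand on two small vectors. The values below are made up for the arithmetic and are not real TF-IDF weights: with A = (3, 4, 0) and B = (4, 3, 0), the dot product is 24 and both norms are 5, so the similarity is 24 / 25 = 0.96.

```python
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: (A · B) / (||A|| × ||B||)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


a = np.array([3.0, 4.0, 0.0])
b = np.array([4.0, 3.0, 0.0])
print(round(cosine(a, b), 2))  # → 0.96
```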

Multi-Criteria Matching

Combines multiple signals for robust matching:

Title similarity (cosine similarity > 0.3)
Duration similarity (difference < 3 seconds)
All three platform pairs must match

Example Output

Input (Search)

Channel Name: MEDI1 TV
Date: 2023-05-15

Output (Sample Results)

Chaîne: MEDI1 TV
Date: 2023-05-15
Nombre total de vidéos trouvées: 3

Video 1: "Breaking News: Important Update"
  Catégorie: News
  Durée: 450s
  
  YouTube Stats: 
    Likes: 2.5k
    Vues: 45k
    Commentaires: 150
  
  Facebook Stats:
    Likes: 3.2k
    Vues: 52k
    Commentaires: 200
  
  Instagram Stats:
    Likes: 1.8k
    Vues: 28k
    Commentaires: 85

Limitations & Future Enhancements

Current Limitations

Duplicate Column Handling: Currently removes duplicates; could preserve and differentiate platform-specific metrics
Fixed Thresholds: TF-IDF (0.3) and duration (3 seconds) thresholds are hardcoded; could be configurable
No User Authentication: Web interface has no login system
Static Data: Updates require re-running the notebook
Single Channel: Currently filters for "MEDI1 TV"; could support multiple channels

Potential Enhancements

Real-time Updates: Implement automated data scraping and matching
Machine Learning: Use more sophisticated matching algorithms (e.g., LSTM, BERT embeddings)
Database Integration: Replace CSV with SQL database for scalability
Advanced Analytics: Add trend analysis, sentiment analysis, engagement forecasting
API Interface: Expose data through REST API
Caching: Implement caching for improved performance
User Interface: Enhanced frontend with filtering, sorting, and export capabilities
Error Handling: Comprehensive error handling and logging
Testing: Unit tests and integration tests

Project Insights

What the Data Reveals

Cross-Platform Presence: Identifies videos present across multiple platforms, indicating strategic content distribution
Engagement Variations: Shows how the same video performs differently on each platform
Audience Differences: Platform-specific engagement metrics reveal different audience behaviors
Content Strategy: Helps understand which content types succeed on which platforms

Matching Accuracy

The TF-IDF + duration matching approach is expected to behave as follows (a qualitative assessment; no measured figures are reported here):

Precision: High (few false positives due to dual criteria)
Recall: Moderate (some variations in titles/durations may be missed)
Tuning thresholds can improve either metric based on requirements
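If a hand-labelled set of true matches were available, precision and recall could be measured as in the sketch below. The `precision_recall` helper and the triplets are hypothetical and made up for illustration; they are not results from this project.

```python
def precision_recall(predicted: set, actual: set) -> tuple:
    """Compute precision and recall for predicted matches against labelled ones."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Made-up triplets of (instagram, youtube, facebook) row indices.
predicted = {(0, 0, 0), (1, 2, 1)}
actual = {(0, 0, 0), (2, 1, 2), (3, 3, 3)}

precision, recall = precision_recall(predicted, actual)
print(precision, round(recall, 3))
```

Re-running such a measurement while varying the title and duration thresholds would show the trade-off between the two metrics directly.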

Technical Challenges & Solutions

| Challenge | Solution |
| --- | --- |
| Platform-specific column naming | Standardized column naming during normalization |
| Varied duration formats | Converted all to seconds for comparison |
| Special characters in titles | Removed platform prefixes and special symbols |
| Missing data | Handled with pandas null checks |
| Performance with large datasets | TF-IDF vectorization is efficient; CSV suitable for current scale |

Authors & Credits

Project: Web Data Integration Engine
Level: Master's Degree (M2) - Data Integration Workshop
Institution: [Your University]
Course: Intégration Données (Data Integration)
Date: 2023-2024

License

This project is developed for educational purposes as part of Master's level coursework.

Contact & Support

For questions, issues, or contributions, please refer to the project documentation or contact the development team.

Appendix: File Descriptions

| File | Purpose |
| --- | --- |
| `interface.py` | Flask application for web interface and routing |
| `matching.ipynb` | Jupyter notebook containing data processing and matching logic |
| `mapped_data.csv` | Output file with integrated data (generated by notebook) |
| `index.html` | Search page template |
| `result.html` | Results display template |
| `Facebook_data.csv` | Raw data from Facebook |
| `Instagram_data.csv` | Raw data from Instagram |
| `YouTube_data.csv` | Raw data from YouTube |
| `Rapport.docx` | Project report documentation |

Last Updated: May 2024
Project Status: Active
