The Web Data Integration Engine is a data integration and web application system that aggregates and matches video content from three social media platforms (Facebook, Instagram, and YouTube). It scrapes data from these platforms, matches videos across them using TF-IDF title similarity combined with a duration check, and provides a unified interface to search and display the integrated data.
This project was developed as part of Master's-level coursework in Data Integration and demonstrates the practical application of data scraping, data matching algorithms, and web development.
The primary objectives of this project are to:
- Extract video metadata and statistics from multiple social media platforms (Facebook, Instagram, YouTube)
- Match videos across platforms based on title similarity and duration
- Integrate disparate data sources into a unified, normalized dataset
- Provide a user-friendly web interface to search and visualize integrated video data
- Display comparative statistics and engagement metrics across all platforms for the same video content
Web_Data_Integration_Engine/
├── README.md # Project documentation
├── interface.py # Flask web application backend
├── matching.ipynb # Jupyter notebook for data matching & integration logic
├── mapped_data.csv # Output file with integrated/matched data
├── Rapport.docx # Project report documentation
├── templates/ # HTML templates for web interface
│ ├── index.html # Search page
│ ├── result.html # Results display page
│ └── image.jfif # Background image asset
├── scapped_data/ # Raw scraped data from social media platforms
│ ├── Facebook_data.csv # Raw Facebook video data
│ ├── Instagram_data.csv # Raw Instagram video data
│ └── YouTube_data.csv # Raw YouTube video data
└── __pycache__/ # Python cache directory
- pandas: Data manipulation and analysis
- numpy: Numerical computing
- scikit-learn: Machine learning algorithms (TF-IDF, Cosine Similarity)
- langid: Language identification
- TfidfVectorizer (from scikit-learn): Converting text to TF-IDF vectors
- cosine_similarity (from scikit-learn): Computing similarity between text vectors
- Flask: Lightweight Python web framework
- Jinja2: Template rendering engine (integrated with Flask)
- chardet: Character encoding detection
The system collects data from three social media platforms:
Facebook Data (Facebook_data.csv)
- Columns: Link, Date, Views, Duration, Likes, Comments, Title
- Aggregated Facebook video content and engagement metrics
Instagram Data (Instagram_data.csv)
- Columns: video_title, likes_count, comments_count, comments_list, DATE, heure, duration, vue
- Video content from Instagram accounts
YouTube Data (YouTube_data.csv)
- Columns: Titre de la vidéo, Catégorie, Date de publication, Durée, Likes, Dislikes, Vues, Nombres de Commentaires, Commentaires, Lien de la vidéo
- Video content and engagement metrics from YouTube
Column Renaming: Standardizes column names across all datasets for consistency
YouTube: 'Titre de la vidéo' → 'Titre'
Instagram: 'video_title' → 'Titre'
Facebook: 'Title' → 'Titre'
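
The renaming step can be illustrated with a small dependency-free sketch. Only the three title mappings below come from this README; the `RENAME_MAPS` structure and the `rename_columns` helper are hypothetical names chosen for illustration, and rows are shown as plain dictionaries rather than DataFrame rows.

```python
# Per-platform rename maps. Only the title mappings are documented in this README;
# any further entries would be assumptions.
RENAME_MAPS = {
    "youtube": {"Titre de la vidéo": "Titre"},
    "instagram": {"video_title": "Titre"},
    "facebook": {"Title": "Titre"},
}


def rename_columns(row, platform):
    """Return a copy of `row` with platform-specific keys mapped to shared names."""
    mapping = RENAME_MAPS[platform]
    # Keys without a mapping are kept unchanged.
    return {mapping.get(key, key): value for key, value in row.items()}


row = {"video_title": "Journal du soir", "likes_count": "120"}
print(rename_columns(row, "instagram"))
```

With pandas, the same mapping dictionary could be passed to `DataFrame.rename(columns=mapping)`.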
Data Cleaning:
- Removes platform-specific prefixes and special characters
- Removes hashtags (#) from titles
- Handles missing values and data type conversions
- Fixes edge cases in view counts and comment counts
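
A minimal sketch of the title-cleaning step follows. The exact rules used in the notebook are not documented here, so this version makes assumptions: it strips the `#` character and other punctuation (keeping the tag word itself), collapses whitespace, and lowercases the result. The function name `clean_title` is hypothetical.

```python
import re


def clean_title(title):
    """Normalize a raw title for matching (assumed rules, see the note above)."""
    if title is None:
        return ""
    # Replace anything that is not a letter, digit, underscore or whitespace
    # with a space. This also removes the '#' of hashtags.
    text = re.sub(r"[^\w\s]", " ", str(title))
    # Collapse runs of whitespace left behind by the substitution.
    text = re.sub(r"\s+", " ", text)
    return text.strip().lower()


print(clean_title("#Breaking News: Important Update!"))
```

Because `\w` is Unicode-aware in Python 3, accented letters such as "é" are preserved.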
Duration Normalization:
- Converts time formats (HH:MM:SS, MM:SS) to seconds
- Enables comparison across platforms with different time representations
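
The conversion to seconds can be sketched as below. The function name `duration_to_seconds` is hypothetical, and accepting a bare seconds value (no colon) is an assumption added for convenience.

```python
def duration_to_seconds(value):
    """Convert 'HH:MM:SS' or 'MM:SS' (or a bare seconds count) to integer seconds."""
    parts = [int(p) for p in str(value).strip().split(":")]
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"Unsupported duration format: {value!r}")
    seconds = 0
    # Each step shifts the running total by one base-60 position.
    for part in parts:
        seconds = seconds * 60 + part
    return seconds


print(duration_to_seconds("07:30"))     # MM:SS
print(duration_to_seconds("01:02:03"))  # HH:MM:SS
```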
Uses a multi-criteria matching approach:
TF-IDF Based Title Matching:
- Creates TF-IDF vectors for all video titles across platforms
- Computes cosine similarity matrix between titles
- Sets threshold: similarity score > 0.3
Duration-Based Validation:
- Calculates absolute duration differences
- Sets threshold: duration difference < 3 seconds
- Ensures matches are temporally consistent
Matching Logic: For each potential triplet (Instagram, YouTube, Facebook):
- Validates title similarity for all three pairs (Instagram-YouTube, Instagram-Facebook, YouTube-Facebook)
- Validates duration similarity for all three pairs
- Only creates a match when ALL conditions are met
Output: DataFrame containing matched videos across all three platforms
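
The triplet rule described above can be sketched in plain Python. This is an illustration under assumptions, not the notebook's code: records are dictionaries with a `Titre` key and a hypothetical `seconds` key, and a simple word-overlap (Jaccard) score stands in for TF-IDF cosine similarity so the example stays dependency-free. The two thresholds are the ones stated in this README.

```python
from itertools import product

TITLE_THRESHOLD = 0.3   # README: similarity score > 0.3
MAX_DURATION_GAP = 3    # README: duration difference < 3 seconds


def jaccard(a, b):
    """Stand-in similarity (word-set overlap); the project uses TF-IDF cosine similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def pair_ok(x, y, similarity):
    """A pair passes only if both the title and the duration criteria hold."""
    return (similarity(x["Titre"], y["Titre"]) > TITLE_THRESHOLD
            and abs(x["seconds"] - y["seconds"]) < MAX_DURATION_GAP)


def match_triplets(instagram, youtube, facebook, similarity=jaccard):
    """Return (instagram, youtube, facebook) triplets where all three pairs pass."""
    matches = []
    for ig, yt, fb in product(instagram, youtube, facebook):
        if (pair_ok(ig, yt, similarity)
                and pair_ok(ig, fb, similarity)
                and pair_ok(yt, fb, similarity)):
            matches.append((ig, yt, fb))
    return matches


instagram = [{"Titre": "journal du soir medi1", "seconds": 450}]
youtube = [{"Titre": "journal du soir", "seconds": 451},
           {"Titre": "meteo du jour", "seconds": 450}]
facebook = [{"Titre": "le journal du soir", "seconds": 449}]
print(len(match_triplets(instagram, youtube, facebook)))
```

Iterating over every triplet costs time proportional to the product of the three dataset sizes, which is acceptable for small scraped datasets but would need pruning (for example, filtering pairs first) at larger scale.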
Combines matched data into a unified dataset with:
- All columns from matched videos
- Deduplicated column names
- Removed unnecessary columns (duration conversions, platform links, time metadata)
- Added `channel_name` field for organizational tracking
Final Columns Include:
- channel_name
- ID_vidéo
- Titre (Video Title)
- Date_Publication
- Durée (Duration in seconds)
- Platform-specific metrics:
  - Youtube_Likes, Youtube_Vues, Youtube_Nombre_Commentaires, Youtube_Commentaires
  - Instagram_Likes, Instagram_Vues, Instagram_Nombre_Commentaires, Instagram_Commentaires
  - Facebook_Likes, Facebook_Vues, Facebook_Commentaires
- Catégorie (Category)
- Youtube_Lien
Home Page (index.html)
- Clean, centered search interface
- Input fields for:
  - Channel name (text input)
  - Publication date (date picker)
- Background image with modern styling
- Submit button to search
Results Page (result.html)
- Displays matched videos for the selected channel and date
- Features for each video:
  - Video Player: Embedded YouTube player for viewing
  - Video Info: Title, Category, Duration
  - Cross-Platform Statistics:
    - YouTube Stats: Likes, Views, Comments, Comment text
    - Facebook Stats: Likes, Views, Comments
    - Instagram Stats: Likes, Views, Comments, Comment text
- Video Count: Total number of videos found for the date
Flask Routes:
- GET `/`: Renders home search page (index.html)
- POST `/results`: Processes search query and returns filtered results (result.html)
Search Logic:
- Receives channel_name and date from form submission
- Filters `mapped_data.csv` by channel_name match
- Parses publication date (YYYY-MM-DD format)
- Returns matching videos with all integrated data
- Converts data to dictionary format for template rendering
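
The filtering part of the search logic can be sketched without Flask or pandas. The column names `channel_name` and `Date_Publication` come from the final column list in this README; the function name `search_videos`, the exact-match comparison on the channel name, and the assumption that stored dates begin with a `YYYY-MM-DD` prefix are all hypothetical.

```python
from datetime import datetime


def search_videos(rows, channel_name, date_text):
    """Return the rows for one channel published on one day (date_text is 'YYYY-MM-DD')."""
    wanted = datetime.strptime(date_text, "%Y-%m-%d").date()
    results = []
    for row in rows:
        # Assumes the stored date starts with 'YYYY-MM-DD'; any time suffix is ignored.
        published = datetime.strptime(row["Date_Publication"][:10], "%Y-%m-%d").date()
        if row["channel_name"] == channel_name and published == wanted:
            results.append(row)
    return results


rows = [
    {"channel_name": "MEDI1 TV", "Date_Publication": "2023-05-15", "Titre": "Journal du soir"},
    {"channel_name": "MEDI1 TV", "Date_Publication": "2023-05-16", "Titre": "Météo"},
]
print(search_videos(rows, "MEDI1 TV", "2023-05-15"))
```

The returned list of dictionaries is already in a shape that a template can iterate over.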
Data Processing:
- ISO8601 to HH:MM:SS duration conversion (utility function included)
- Date parsing and filtering using pandas datetime functionality
- HTML unescaping for proper rendering
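
As an illustration of the duration utility, the sketch below converts an ISO 8601 duration such as `PT7M30S` to `HH:MM:SS`. It is not the function from `interface.py`: the name `iso8601_to_hms` is hypothetical, and only hour, minute, and second components are handled (durations with a day part are rejected). HTML unescaping is shown with the standard library's `html.unescape`.

```python
import html
import re

# Matches durations of the form PT#H#M#S, where every component is optional.
_ISO_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def iso8601_to_hms(value):
    """Convert an ISO 8601 duration like 'PT1H2M3S' to 'HH:MM:SS'."""
    match = _ISO_DURATION.match(value)
    if not match:
        raise ValueError(f"Unsupported ISO 8601 duration: {value!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


print(iso8601_to_hms("PT7M30S"))
# Turn HTML entities in scraped text back into plain characters.
print(html.unescape("L&#39;info &amp; vous"))
```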
- Python 3.7+
- pip (Python package manager)
`pip install pandas numpy scikit-learn langid flask chardet`

Or using the requirements installation from the notebook:

`pip install chardet`

Ensure the following files are present in the project directory:
- `scapped_data/Facebook_data.csv`
- `scapped_data/Instagram_data.csv`
- `scapped_data/YouTube_data.csv`
Execute the Jupyter notebook matching.ipynb to:
- Load raw data from all three platforms
- Normalize column names and clean data
- Run matching algorithm
- Generate `mapped_data.csv`
This can be done via:
`jupyter notebook matching.ipynb` (then run all cells)

Start the web application:

`python interface.py`

The application will start on http://localhost:5000/
- Navigate to Home Page: Open http://localhost:5000/ in your browser
- Enter Search Criteria:
  - Channel Name (e.g., "MEDI1 TV")
  - Publication Date (e.g., 2023-05-15)
- Click "Chercher" (Search): Submit the form
- View Results: See all matched videos for the selected channel and date with cross-platform statistics
- Total Count: Number of videos found for the specified date
- Video Player: Click to watch the video on YouTube
- Left Panel (YouTube Stats): Engagement metrics from YouTube
- Middle Panel (Facebook Stats): Engagement metrics from Facebook
- Right Panel (Instagram Stats): Engagement metrics from Instagram
┌─────────────────────────────┐
│ Social Media Platforms │
│ (YouTube, Facebook, Inst) │
└────────────┬────────────────┘
│ Data Scraping
↓
┌─────────────────────────────┐
│ Raw Data (CSV files) │
│ • YouTube_data.csv │
│ • Facebook_data.csv │
│ • Instagram_data.csv │
└────────────┬────────────────┘
│ matching.ipynb
│ (Normalization)
↓
┌─────────────────────────────┐
│ Cleaned & Normalized Data │
│ (Standardized columns) │
└────────────┬────────────────┘
│ Data Matching
│ (TF-IDF + Duration)
↓
┌─────────────────────────────┐
│ Matched Data │
│ (Videos across platforms) │
└────────────┬────────────────┘
│ Data Integration
↓
┌─────────────────────────────┐
│ mapped_data.csv │
│ (Unified dataset) │
└────────────┬────────────────┘
│ Flask Backend
↓
┌─────────────────────────────┐
│ Web Interface │
│ • Search Page (index.html) │
│ • Results Page (result.html)│
└─────────────────────────────┘
Used to convert video titles into numerical vectors:
- Term Frequency: How often a word appears in a document
- Inverse Document Frequency: How unique/rare a word is across all documents
- Result: Importance weight for each term in each title
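
To make the weighting concrete, here is a dependency-free sketch of TF-IDF over a handful of titles. It is a stand-in for scikit-learn's `TfidfVectorizer`, not a reimplementation of it: tokenization is a plain whitespace split, no vector normalization is applied, and the smoothed IDF formula `ln((1 + N) / (1 + df)) + 1` is an assumption, since this README does not state which variant is used. The function name `tfidf_vectors` is hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(titles):
    """Return one {term: weight} dict per title, using tf * smoothed idf."""
    docs = [title.lower().split() for title in titles]
    n_docs = len(docs)
    # Document frequency: number of titles in which each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}
    # Term frequency is the raw count of the term within the title.
    return [{term: count * idf[term] for term, count in Counter(doc).items()}
            for doc in docs]


titles = ["journal du soir", "journal du matin", "meteo du jour"]
for vector in tfidf_vectors(titles):
    print(vector)
```

In this example "du" appears in every title and gets the lowest weight, while words unique to one title, such as "soir", get the highest.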
Measures similarity between two TF-IDF vectors:
- Range: 0 to 1 for TF-IDF vectors, whose components are never negative (0 = no terms in common, 1 = same direction, i.e. identical term weighting)
- Formula: similarity = (A · B) / (||A|| × ||B||)
- Used to find similar titles across platforms
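
The formula can be written out directly for sparse vectors stored as `{term: weight}` dictionaries. This is a plain-Python sketch for illustration (the project relies on scikit-learn's `cosine_similarity`); the function name `cosine` and the choice to return 0.0 for an all-zero vector are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors given as {term: weight} dicts."""
    # Dot product: only terms present in both vectors contribute.
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty titles
    return dot / (norm_a * norm_b)


print(cosine({"journal": 1.0, "soir": 1.0}, {"journal": 1.0}))
```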
Combines multiple signals for robust matching:
- Title similarity (cosine similarity > 0.3)
- Duration similarity (difference < 3 seconds)
- All three platform pairs must match
- Channel Name: `MEDI1 TV`
- Date: `2023-05-15`
Chaîne: MEDI1 TV
Date: 2023-05-15
Nombre total de vidéos trouvées: 3
Video 1: "Breaking News: Important Update"
Catégorie: News
Durée: 450s
YouTube Stats:
Likes: 2.5k
Vues: 45k
Commentaires: 150
Facebook Stats:
Likes: 3.2k
Vues: 52k
Commentaires: 200
Instagram Stats:
Likes: 1.8k
Vues: 28k
Commentaires: 85
- Duplicate Column Handling: Currently removes duplicates; could preserve and differentiate platform-specific metrics
- Fixed Thresholds: TF-IDF (0.3) and duration (3 seconds) thresholds are hardcoded; could be configurable
- No User Authentication: Web interface has no login system
- Static Data: Updates require re-running the notebook
- Single Channel: Currently filters for "MEDI1 TV"; could support multiple channels
- Real-time Updates: Implement automated data scraping and matching
- Machine Learning: Use more sophisticated matching algorithms (e.g., LSTM, BERT embeddings)
- Database Integration: Replace CSV with SQL database for scalability
- Advanced Analytics: Add trend analysis, sentiment analysis, engagement forecasting
- API Interface: Expose data through REST API
- Caching: Implement caching for improved performance
- User Interface: Enhanced frontend with filtering, sorting, and export capabilities
- Error Handling: Comprehensive error handling and logging
- Testing: Unit tests and integration tests
- Cross-Platform Presence: Identifies videos present across multiple platforms, indicating strategic content distribution
- Engagement Variations: Shows how the same video performs differently on each platform
- Audience Differences: Platform-specific engagement metrics reveal different audience behaviors
- Content Strategy: Helps understand which content types succeed on which platforms
The TF-IDF + Duration matching approach is expected to behave as follows (qualitative estimates, not measured results):
- Precision: High (few false positives due to dual criteria)
- Recall: Moderate (some variations in titles/durations may be missed)
- Tuning thresholds can improve either metric based on requirements
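
If a hand-labeled set of true matches were available, precision and recall could be computed as in the sketch below. The identifiers in the example are made up for illustration and the function name `precision_recall` is hypothetical; no such evaluation is reported in this README.

```python
def precision_recall(predicted, actual):
    """Compute (precision, recall) for predicted matches against labeled true matches."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Illustrative identifiers only: each tuple is (instagram_id, youtube_id, facebook_id).
predicted = [("ig1", "yt1", "fb1"), ("ig2", "yt2", "fb2")]
actual = [("ig1", "yt1", "fb1"), ("ig3", "yt3", "fb3"), ("ig4", "yt4", "fb4")]
print(precision_recall(predicted, actual))
```

Re-running this after changing the similarity or duration threshold would show which direction each metric moves.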
| Challenge | Solution |
|---|---|
| Platform-specific column naming | Standardized column naming during normalization |
| Varied duration formats | Converted all to seconds for comparison |
| Special characters in titles | Removed platform prefixes and special symbols |
| Missing data | Handled with pandas null checks |
| Performance with large datasets | TF-IDF vectorization is efficient; CSV suitable for current scale |
Project: Web Data Integration Engine
Level: Master's Degree (M2) - Data Integration Workshop
Institution: [Your University]
Course: Intégration Données (Data Integration)
Date: 2023-2024
This project is developed for educational purposes as part of Master's level coursework.
For questions, issues, or contributions, please refer to the project documentation or contact the development team.
| File | Purpose |
|---|---|
| `interface.py` | Flask application for web interface and routing |
| `matching.ipynb` | Jupyter notebook containing data processing and matching logic |
| `mapped_data.csv` | Output file with integrated data (generated by notebook) |
| `index.html` | Search page template |
| `result.html` | Results display template |
| `Facebook_data.csv` | Raw data from Facebook |
| `Instagram_data.csv` | Raw data from Instagram |
| `YouTube_data.csv` | Raw data from YouTube |
| `Rapport.docx` | Project report documentation |
Last Updated: May 2024
Project Status: Active
