CLD6000: Data Preprocessing and Model Tuning Pipeline
Problem Definition and Scope
The CLD6000 project focuses on Natural Language Processing (NLP) and Feature Engineering as part of the Contemporary Problem Analysis. The main deliverables include:
- A robust NLP pipeline.
- Analysis of textual data.
- Feature extraction methodologies.
- A technical implementation that does not overlap with classification work.
Core Goals
- Develop and validate a data preprocessing pipeline.
- Ensure proper separation of concerns (e.g., preprocessing vs classification tasks).
- Maintain a single source of truth in the feature engineering workflow.
Pipeline Breakdown
1. Data Preprocessing
Data Cleaning
- Identify and Handle Missing Values:
  - Replace missing numerical values with the mean or median (or the mode for discrete values).
  - Remove rows with missing categorical values, or impute them (e.g., with the most frequent category).
- Outlier Handling:
  - Use statistical thresholds (e.g., IQR or z-scores) to detect and handle extreme values.
- Noise Removal:
  - Eliminate irrelevant or inconsistent entries that may bias the analysis.
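The missing-value and outlier steps above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the project's actual implementation: the function names (`impute_median`, `iqr_bounds`, `drop_outliers`), the sample `ages` column, and the fence multiplier `k=1.5` are all assumptions made for the example.

```python
from statistics import median, quantiles

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def drop_outliers(values, k=1.5):
    """Keep only the values that fall inside the IQR fences."""
    lower, upper = iqr_bounds(values, k)
    return [v for v in values if lower <= v <= upper]

# Hypothetical numeric column with one missing entry and one extreme value.
ages = [23, 25, None, 27, 24, 26, 120]
filled = impute_median(ages)     # the None becomes the median of the rest
cleaned = drop_outliers(filled)  # 120 falls outside the fences and is dropped
```

The median is used for imputation here because, unlike the mean, it is not pulled upward by the extreme value 120. Whether outliers should be dropped, capped, or kept is a judgment call that depends on the dataset.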
Data Normalization
- Standardization:
  - Rescale numerical data to have a mean of 0 and a standard deviation of 1.
- Scaling:
  - Scale values to the range 0 to 1 (min-max scaling), e.g., for algorithms sensitive to magnitude differences.
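Both rescaling approaches reduce to one-line formulas: standardization computes (x - mean) / standard deviation, and 0-to-1 scaling computes (x - min) / (max - min). The sketch below shows them with the standard library only. The function names and the sample `lengths` list are hypothetical, and the use of the population standard deviation (`pstdev`) is an assumption; a sample standard deviation would give slightly different values.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale to mean 0 and (population) standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no spread; map everything to 0.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def min_max_scale(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical feature, e.g. token counts per document.
lengths = [2, 4, 4, 4, 5, 5, 7, 9]
z = standardize(lengths)
scaled = min_max_scale(lengths)
```

In a train/test setting, the mean, standard deviation, minimum, and maximum would normally be computed on the training data only and then reused for the test data, so that no information leaks from the held-out set.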
Data Transformation
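- Encoding:
  - Convert categorical data into numerical format using methods such as:
    - One-Hot Encoding: Each category becomes its own binary (0/1) column, so each value is represented as a binary vector.
    - Label Encoding: Assign a unique integer to each category. This is best suited to ordinal data, since the integers imply an order.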
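The two encodings can be compared side by side with a small sketch. It uses plain Python rather than any particular library; the function names and the `colors` sample are hypothetical, and sorting the categories alphabetically is just one assumed way to make the mapping deterministic.

```python
def label_encode(values):
    """Map each distinct category to an integer (alphabetical order)."""
    categories = sorted(set(values))
    mapping = {cat: idx for idx, cat in enumerate(categories)}
    return [mapping[v] for v in values], mapping

def one_hot_encode(values):
    """Return one binary vector per value, plus the column order used."""
    categories = sorted(set(values))
    vectors = [[1 if v == cat else 0 for cat in categories] for v in values]
    return vectors, categories

# Hypothetical categorical column.
colors = ["red", "green", "blue", "green"]
labels, mapping = label_encode(colors)     # integers per category
vectors, columns = one_hot_encode(colors)  # one 0/1 column per category
```

Here the label encoding assigns blue=0, green=1, red=2, which suggests an order that colors do not have. One-hot encoding avoids that at the cost of one column per category.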