Natural language processing unit outline
Lesson 38: Text data analysis (foundation)
Focus: Getting text ready for ML - preprocessing and exploration
| Topic | Purpose |
|---|---|
| Text preprocessing | Tokenization, normalization, stemming, lemmatization, stopword removal |
| Data exploration | Word frequency, distributions, word clouds |
| Basic classification | Naive Bayes text classification (no deep learning) |
| Rule-based sentiment | TextBlob, VADER (lexicon-based, not learned embeddings) |
Key concepts: Cleaning and understanding text data before vectorization
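As a concrete reference for this lesson, here is a minimal preprocessing-and-classification sketch. It assumes NLTK (tokenization, stopwords, lemmatization, VADER) and scikit-learn (Naive Bayes); the toy sentences and labels are illustrative only, not the lesson's dataset.

```python
# Minimal text preprocessing + Naive Bayes sketch (NLTK and scikit-learn assumed).
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer          # a PorterStemmer could be swapped in for stemming
from nltk.tokenize import word_tokenize
from nltk.sentiment import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# One-time NLTK resource downloads (resource names vary slightly across NLTK versions).
for pkg in ("punkt", "punkt_tab", "stopwords", "wordnet", "vader_lexicon"):
    nltk.download(pkg, quiet=True)

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    """Lowercase, tokenize, drop stopwords and punctuation, lemmatize."""
    tokens = word_tokenize(text.lower())
    return [lemmatizer.lemmatize(t) for t in tokens
            if t.isalpha() and t not in stop_words]

# Toy labelled data (hypothetical examples).
docs = ["I loved this movie, great acting",
        "Terrible plot and awful pacing",
        "A wonderful, touching story",
        "Boring and painfully slow"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

cleaned = [" ".join(preprocess(d)) for d in docs]

# Bag-of-words counts feed a multinomial Naive Bayes classifier (no deep learning).
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(cleaned)
clf = MultinomialNB().fit(X, labels)
test = vectorizer.transform([" ".join(preprocess("great story, loved it"))])
print(clf.predict(test))  # expected: [1]

# Rule-based sentiment: VADER scores text from its lexicon, no training needed.
sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("I loved this movie"))
```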
Lesson 39: Text vectorization (sparse to dense intro)
Focus: Converting text to numbers - from simple to sophisticated
| Topic | Purpose |
|---|---|
| One-hot encoding | Simplest representation, limitations (sparsity) |
| Bag-of-Words | Word counts, introduces CountVectorizer |
| TF-IDF | Term weighting that down-weights common words so distinctive terms stand out; improves on raw counts |
| Word2Vec intro | Concept of dense embeddings, training basics |
Builds on 38: Uses preprocessed text as input
Key concepts: Progression from sparse to dense representations
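A minimal sketch of the sparse-to-dense progression, assuming scikit-learn for the count and TF-IDF vectorizers and gensim for Word2Vec; the toy corpus and vector size are illustrative placeholders.

```python
# Sparse-to-dense sketch: CountVectorizer, TfidfVectorizer (scikit-learn) and a tiny Word2Vec (gensim).
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from gensim.models import Word2Vec

corpus = ["the cat sat on the mat",
          "the dog sat on the log",
          "cats and dogs are pets"]

# Bag-of-words: raw term counts per document (sparse matrix).
bow = CountVectorizer()
X_counts = bow.fit_transform(corpus)
print(bow.get_feature_names_out())
print(X_counts.toarray())

# TF-IDF: the same counts, reweighted so words common to many documents count less.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)
print(X_tfidf.toarray().round(2))

# Word2Vec: learn small dense vectors from the same toy corpus.
sentences = [doc.split() for doc in corpus]
w2v = Word2Vec(sentences, vector_size=16, window=2, min_count=1, epochs=50, seed=1)
print(w2v.wv["cat"].shape)                 # dense 16-dimensional vector
print(w2v.wv.most_similar("cat", topn=2))  # nearest neighbours in the embedding space
```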
Lesson 40: Distributed representations (embeddings in practice)
Focus: Using and extending word embeddings
| Topic | Purpose |
|---|---|
| Pre-trained Word2Vec | Loading Google News vectors, similarity |
| Document vectors | Averaging word embeddings for documents |
| Doc2Vec | Learning document-level embeddings directly |
Builds on 39: Applies Word2Vec concept at scale
Key concepts: Pre-trained embeddings, word-to-document extension
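A minimal sketch of pre-trained embeddings, averaged document vectors, and Doc2Vec, assuming gensim. The lesson loads the Google News Word2Vec model ("word2vec-google-news-300"); the smaller GloVe model used here keeps the download quick, and the API is the same.

```python
# Pre-trained embeddings and document vectors sketch (gensim assumed).
import numpy as np
import gensim.downloader as api
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Small pre-trained model for a quick test; swap in "word2vec-google-news-300" for the lesson's vectors.
wv = api.load("glove-wiki-gigaword-50")
print(wv.most_similar("paris", topn=3))

def doc_vector(tokens, keyed_vectors):
    """Average the word vectors of in-vocabulary tokens (simple document vector)."""
    vecs = [keyed_vectors[t] for t in tokens if t in keyed_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(keyed_vectors.vector_size)

print(doc_vector("the cat sat on the mat".split(), wv).shape)  # (50,)

# Doc2Vec learns document-level embeddings directly instead of averaging word vectors.
docs = ["the cat sat on the mat", "the dog sat on the log", "cats and dogs are pets"]
tagged = [TaggedDocument(words=d.split(), tags=[i]) for i, d in enumerate(docs)]
d2v = Doc2Vec(tagged, vector_size=16, min_count=1, epochs=40)
print(d2v.infer_vector("a cat on a mat".split()).shape)  # (16,)
```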
Lesson 41: Machine translation and document search
Focus: Applying embeddings to sequence tasks
| Topic | Purpose |
|---|---|
| Sequence preprocessing | Converting text to integer sequences and padding them for neural networks |
| Encoder-decoder intro | Basic seq2seq architecture |
| Translation task | English-to-French translation with an LSTM |
Builds on 40: Uses embeddings as input to sequence models
Key concepts: Sequence-to-sequence architecture, encoder-decoder pattern
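A minimal sketch of the sequence preprocessing and encoder-decoder pattern, assuming Keras/TensorFlow; the vocabulary sizes, dimensions, and example sentences are hypothetical placeholders, and training is omitted.

```python
# Encoder-decoder (seq2seq) sketch in Keras; all sizes are placeholders.
from tensorflow.keras.layers import TextVectorization, Input, Embedding, LSTM, Dense
from tensorflow.keras.models import Model

# --- Sequence preprocessing: text -> integer sequences, padded to one length ---
eng = ["i am cold", "she is happy", "we are very late today"]
vectorize = TextVectorization(output_sequence_length=6)  # pads/truncates every sentence to 6 tokens
vectorize.adapt(eng)
print(vectorize(eng).numpy())                            # each row is a padded integer sequence

# --- Encoder-decoder architecture ---
src_vocab, tgt_vocab, emb_dim, units = 1000, 1200, 64, 128

enc_in = Input(shape=(None,))                            # source-language token ids
enc_emb = Embedding(src_vocab, emb_dim)(enc_in)
_, state_h, state_c = LSTM(units, return_state=True)(enc_emb)   # final states summarize the source

dec_in = Input(shape=(None,))                            # target tokens, shifted right during training
dec_emb = Embedding(tgt_vocab, emb_dim)(dec_in)
dec_seq, _, _ = LSTM(units, return_sequences=True, return_state=True)(
    dec_emb, initial_state=[state_h, state_c])           # decoder starts from the encoder's states
probs = Dense(tgt_vocab, activation="softmax")(dec_seq)  # per-step distribution over target vocabulary

model = Model([enc_in, dec_in], probs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```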
Lesson 42: Sequence models
Focus: RNN architectures for sequences
| Topic | Purpose |
|---|---|
| RNN fundamentals | Recurrent connections, vanishing gradients |
| LSTM | Gates, memory cells |
| Bi-directional RNNs | Context from both directions |
Builds on 41: Deeper dive into the LSTM used in the encoder-decoder
Key concepts: Recurrence, memory, handling variable-length sequences
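A minimal sketch contrasting the three recurrent cores inside the same classifier skeleton, assuming Keras/TensorFlow; vocabulary size, sequence length, and layer widths are placeholders.

```python
# RNN-family sketch (Keras assumed): SimpleRNN vs LSTM vs Bidirectional LSTM.
from tensorflow.keras.layers import Embedding, SimpleRNN, LSTM, Bidirectional, Dense
from tensorflow.keras.models import Sequential

vocab_size, emb_dim, max_len = 5000, 64, 100   # placeholder vocabulary and padded sequence length

def build(recurrent_core):
    """Same classifier skeleton; only the recurrent core changes."""
    return Sequential([
        Embedding(vocab_size, emb_dim),
        recurrent_core,
        Dense(1, activation="sigmoid"),        # binary classification head
    ])

models = {
    "SimpleRNN": build(SimpleRNN(32)),              # plain recurrence; vanishing gradients limit long-range memory
    "LSTM": build(LSTM(32)),                        # input/forget/output gates plus a memory cell
    "BiLSTM": build(Bidirectional(LSTM(32))),       # reads left-to-right and right-to-left, concatenating both
}

for name, m in models.items():
    m.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    m.build(input_shape=(None, max_len))            # a batch of padded integer sequences
    print(f"{name}: {m.count_params():,} parameters")
```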
Lesson 43: Attention mechanism (Transformers and BERT)
Focus: Modern architectures that replaced RNNs
| Topic | Purpose |
|---|---|
| Attention concept | Why attention improves seq2seq |
| Transformer architecture | Self-attention, positional encoding |
| Pre-trained transformers | BERT, GPT - contextual embeddings |
| Fine-tuning | Classification with transformers |
Builds on 42: Attention addresses RNN limitations (sequential processing, long-range dependencies)
Key concepts: Self-attention, positional encoding, transfer learning with transformers
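A minimal sketch of contextual embeddings and the fine-tuning setup, assuming the Hugging Face transformers library with PyTorch; the model name and sentences are illustrative choices, not necessarily the lesson's.

```python
# Pre-trained transformer sketch (Hugging Face transformers + PyTorch assumed).
import torch
from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Contextual embeddings: the same word gets different vectors in different sentences.
encoder = AutoModel.from_pretrained("bert-base-uncased")
batch = tokenizer(["The bank approved the loan.",
                   "We sat on the river bank."], padding=True, return_tensors="pt")
with torch.no_grad():
    hidden = encoder(**batch).last_hidden_state    # (batch, seq_len, 768) contextual token vectors
print(hidden.shape)

# Fine-tuning setup: the same pre-trained encoder with a fresh classification head on top.
classifier = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)
logits = classifier(**batch).logits                # head is untrained here, so logits are random
print(logits.shape)                                # (2, 2)
```

Training the classification head (and optionally the encoder) on labelled data is the transfer-learning step this lesson builds toward.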