Feature engineering

Feature Engineering in Machine Learning Model Development

Introduction

Feature engineering is the process of creating, selecting, transforming, and optimizing features to improve the performance of a machine learning model. It is one of the most important steps in the ML development workflow and can determine whether a project succeeds or fails.

Two sayings are well known in the ML community: "Garbage in, garbage out" and "Better features, better models". However sophisticated the algorithm, a model trained on poor features is unlikely to reach its best possible performance.

Why Feature Engineering Matters

Improves model performance: good features help the model learn more efficiently, especially when data is limited
Reduces complexity: can allow a simpler model and lower computational cost
Increases interpretability: meaningful features make the model easier to understand
Handles heterogeneous data: converts data from different sources into a consistent form
Injects domain knowledge: brings expert knowledge into the model through features

Common Feature Engineering Techniques

1. Feature Transformation

1.1. Basic arithmetic transformations

# Example with pandas (df is an existing DataFrame)
import pandas as pd

# Create a new feature by addition
df['total_income'] = df['salary'] + df['bonus']

# Multiplication to compute an area
df['area'] = df['width'] * df['height']

# Ratios (a zero denominator yields inf, so check for zeros first)
df['income_per_person'] = df['household_income'] / df['household_size']
df['price_to_earning_ratio'] = df['price'] / df['earning']

1.2. Mathematical transformations

import numpy as np

# Logarithm: useful for skewed distributions
df['log_price'] = np.log1p(df['price'])  # log1p(x) = log(1 + x), so it is defined at x = 0

# Square root: a milder compression than log, for non-negative values
df['sqrt_distance'] = np.sqrt(df['distance'])

# Square: for nonlinear relationships that grow quickly
df['age_squared'] = df['age'] ** 2
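
To see why these transformations matter, the following minimal sketch compares the skewness of a right-skewed column before and after the square-root and log transforms. The data is synthetic (log-normal) and the variable names are illustrative assumptions, not part of the original examples.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "price" data (log-normal); purely illustrative
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=5_000), name="price")

# Sample skewness: 0 for a symmetric distribution, large and positive for a long right tail
skew_raw = prices.skew()
skew_sqrt = np.sqrt(prices).skew()
skew_log = np.log1p(prices).skew()

print(f"raw: {skew_raw:.2f}, sqrt: {skew_sqrt:.2f}, log1p: {skew_log:.2f}")
```

On data like this, the log transform usually brings the distribution closest to symmetric, with the square root in between.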

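1.3. Binning

# Equal-width binning: 5 intervals of equal length
df['age_bin'] = pd.cut(df['age'], bins=5, labels=False)

# Equal-frequency binning: 4 bins with roughly the same number of rows
df['income_quantiles'] = pd.qcut(df['income'], q=4, labels=False)

# Custom binning with domain-specific edges
age_bins = [0, 18, 35, 50, 65, 100]
age_labels = ['Child', 'Young Adult', 'Adult', 'Middle Age', 'Senior']
# include_lowest=True keeps age 0 in the first bin; values outside 0-100 become NaN
df['age_group'] = pd.cut(df['age'], bins=age_bins, labels=age_labels, include_lowest=True)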
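
The difference between equal-width and equal-frequency binning is easiest to see on data with an outlier. The sketch below uses a small made-up series; the values are illustrative only.

```python
import pandas as pd

# Nine small values and one outlier
ages = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Equal-width: the outlier stretches the range, so most values share one bin
equal_width = pd.cut(ages, bins=3, labels=False)

# Equal-frequency: bin edges follow the quantiles, so the bins stay balanced
equal_freq = pd.qcut(ages, q=2, labels=False)

width_counts = equal_width.value_counts().sort_index()
freq_counts = equal_freq.value_counts().sort_index()
print(width_counts.to_dict())  # rows per equal-width bin
print(freq_counts.to_dict())   # rows per equal-frequency bin
```

Equal-width bins are easy to explain but sensitive to outliers, while quantile bins adapt to the data at the cost of edges that change from dataset to dataset.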

2. Time-based Features

# Assumes df['date'] is a datetime column
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['day_of_week'] = df['date'].dt.dayofweek  # 0 = Monday, 6 = Sunday
df['is_weekend'] = (df['day_of_week'] >= 5).astype(int)  # vectorized, no apply needed
df['quarter'] = df['date'].dt.quarter
df['is_month_end'] = df['date'].dt.is_month_end.astype(int)

# Time since a reference date (e.g., to compute an age or a holding period)
import datetime as dt
reference_date = dt.datetime(2025, 1, 1)
# Negative for dates before the reference date
df['days_since_ref'] = (df['date'] - reference_date).dt.days

2.1. Cyclical Features

A particularly useful technique for time data is cyclical encoding. For example, December (12) and January (1) are far apart numerically, yet they are adjacent in the yearly cycle. Mapping each value to a point on a circle with sine and cosine preserves that adjacency.

# Cyclical encoding for day of week (dt.dayofweek returns 0-6, Monday = 0)
df['day_of_week_sin'] = np.sin(2 * np.pi * df['day_of_week']/7)
df['day_of_week_cos'] = np.cos(2 * np.pi * df['day_of_week']/7)

# Cyclical encoding for month (1-12)
df['month_sin'] = np.sin(2 * np.pi * df['month']/12)
df['month_cos'] = np.cos(2 * np.pi * df['month']/12)

# Cyclical encoding for hour of day (0-23)
df['hour'] = df['date'].dt.hour
df['hour_sin'] = np.sin(2 * np.pi * df['hour']/24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour']/24)
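
As a quick check of the idea, the sketch below encodes three months and compares distances in the (sin, cos) plane. The helper name `encode_cyclical` is a hypothetical name introduced here for illustration.

```python
import numpy as np

def encode_cyclical(values, period):
    """Map values with the given period onto the unit circle as (sin, cos) pairs."""
    angle = 2 * np.pi * np.asarray(values, dtype=float) / period
    return np.column_stack([np.sin(angle), np.cos(angle)])

months = np.array([1, 6, 12])  # January, June, December
encoded = encode_cyclical(months, period=12)
jan, jun, dec = encoded

# Euclidean distance between the encoded points
dist_dec_jan = np.linalg.norm(dec - jan)
dist_jun_jan = np.linalg.norm(jun - jan)
print(f"Dec-Jan: {dist_dec_jan:.3f}, Jun-Jan: {dist_jun_jan:.3f}")
```

December ends up much closer to January than June does, which the raw month numbers (12 vs 1) would not show. Both the sine and the cosine column are needed: either one alone maps two different months to the same value.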

3. Interaction Features

Interaction features describe the relationship between two or more features.

# Simple product
df['area'] = df['width'] * df['height']

# Combine categorical variables (both columns must be strings)
df['region_product'] = df['region'] + '_' + df['product']

# More complex interactions
df['price_per_room'] = df['price'] / df['rooms']
df['price_per_sqm_by_location'] = df['price_per_sqm'] * df['location_score']

To generate interactions systematically:

from sklearn.preprocessing import PolynomialFeatures

# Keep the original features and add every pairwise product
# (interaction_only=True skips squared terms such as x1^2)
poly = PolynomialFeatures(degree=2, include_bias=False, interaction_only=True)
X_poly = poly.fit_transform(X)

4. Text Features

4.1. Bag of Words

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(stop_words='english', max_features=1000)
X_bow = vectorizer.fit_transform(df['text'])

4.2. TF-IDF

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
X_tfidf = vectorizer.fit_transform(df['text'])
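
The sketch below contrasts the two vectorizers on a tiny made-up corpus. It relies on scikit-learn's defaults (smoothed IDF and L2-normalized rows); the documents are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log", "cats and dogs"]

# Bag of words: raw term counts per document
bow = CountVectorizer()
X_counts = bow.fit_transform(docs)

# TF-IDF: counts reweighted so that terms found in fewer documents get a higher IDF
tfidf = TfidfVectorizer()
X_weights = tfidf.fit_transform(docs)

idf_cat = tfidf.idf_[tfidf.vocabulary_["cat"]]  # appears in 1 document
idf_the = tfidf.idf_[tfidf.vocabulary_["the"]]  # appears in 2 documents
row_norms = np.linalg.norm(X_weights.toarray(), axis=1)

print(X_counts.shape, idf_cat > idf_the, row_norms)
```

Both matrices are sparse and share the same shape; only the weights differ. Fit the vectorizer on the training texts and call `transform` on new texts so that the vocabulary and IDF values come from the training data alone.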

4.3. Embeddings

# Use pre-trained word embeddings
import gensim.downloader as api

# Load the pre-trained word2vec model (a large download, over 1 GB, on first use)
word2vec_model = api.load("word2vec-google-news-300")

# Build a simple document embedding by averaging the word vectors
def document_embedding(text, model):
    words = text.lower().split()
    # Keep only words that exist in the model's vocabulary
    valid_words = [word for word in words if word in model.key_to_index]
    if len(valid_words) > 0:
        return np.mean([model[word] for word in valid_words], axis=0)
    else:
        # No known words: fall back to a zero vector
        return np.zeros(model.vector_size)

# Apply to the dataset (each cell holds one vector)
df['text_embedding'] = df['text'].apply(lambda x: document_embedding(x, word2vec_model))

5. Categorical Features

5.1. One-Hot Encoding

# Using pandas
df_encoded = pd.get_dummies(df, columns=['color', 'size'], drop_first=True)

# Or using scikit-learn
from sklearn.preprocessing import OneHotEncoder

# sparse_output replaces the older `sparse` argument (scikit-learn >= 1.2)
encoder = OneHotEncoder(sparse_output=False, drop='first')
X_encoded = encoder.fit_transform(df[['color', 'size']])

5.2. Target Encoding

import numpy as np
import pandas as pd

# Naive version: mean of the target per category
# (leaks the target if computed on the same rows the model is trained on)
target_means = df.groupby('category')['target'].mean()
df['category_encoded'] = df['category'].map(target_means)

# Safer version: out-of-fold (K-fold) encoding to limit data leakage
from sklearn.model_selection import KFold

def target_encode(train_df, test_df, column, target, n_folds=5):
    # Work on copies so the caller's dataframes are not modified
    train_df = train_df.copy()
    test_df = test_df.copy()
    encoded_col = f'{column}_encoded'

    # Global mean: fallback for categories not seen in a fold or in the train set
    global_mean = train_df[target].mean()

    kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)

    # Encode each validation fold with target means from the other folds only
    oof = pd.Series(np.nan, index=train_df.index)
    for train_idx, val_idx in kf.split(train_df):
        fold_train = train_df.iloc[train_idx]
        fold_val = train_df.iloc[val_idx]
        means = fold_train.groupby(column)[target].mean()
        # KFold yields positional indices, so write back with iloc, not loc
        oof.iloc[val_idx] = fold_val[column].map(means).fillna(global_mean).to_numpy()
    train_df[encoded_col] = oof

    # Encode the test set with target means from the full train set
    means = train_df.groupby(column)[target].mean()
    test_df[encoded_col] = test_df[column].map(means).fillna(global_mean)

    return train_df, test_df

5.3. Count/Frequency Encoding

# Count how many times each value occurs
counts = df['category'].value_counts()
df['category_count'] = df['category'].map(counts)

# Frequency encoding: share of rows per category
freq = counts / len(df)
df['category_freq'] = df['category'].map(freq)
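
When a separate test set exists, the frequencies should come from the training data only. The following sketch shows one way to do that; the tiny `train` and `test` frames are made up, and mapping unseen categories to 0.0 is an assumption that suits this example rather than a universal rule.

```python
import pandas as pd

train = pd.DataFrame({"category": ["a", "a", "b", "c", "a", "b"]})
test = pd.DataFrame({"category": ["a", "c", "d"]})  # "d" never appears in train

# Learn the frequencies on the training set only
freq = train["category"].value_counts(normalize=True)

train["category_freq"] = train["category"].map(freq)
# Categories unseen during training map to NaN; here they are treated as frequency 0
test["category_freq"] = test["category"].map(freq).fillna(0.0)

print(test)
```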

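5.4. Ordinal Encoding (for ordered categories)

from sklearn.preprocessing import OrdinalEncoder

# LabelEncoder is meant for target labels and sorts classes alphabetically
# ('L', 'M', 'S', 'XL'), which would break the size order.
# OrdinalEncoder accepts the intended order explicitly.
encoder = OrdinalEncoder(categories=[['S', 'M', 'L', 'XL']])
df['size_encoded'] = encoder.fit_transform(df[['size']]).ravel()  # 'S', 'M', 'L', 'XL' -> 0, 1, 2, 3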
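
The ordering issue is easy to verify. The sketch below encodes the same made-up size column with `LabelEncoder` and with an `OrdinalEncoder` that is given the category order explicitly.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

sizes = pd.DataFrame({"size": ["S", "M", "L", "XL", "M"]})

# LabelEncoder assigns codes in alphabetical order: L=0, M=1, S=2, XL=3
alphabetical = LabelEncoder().fit_transform(sizes["size"])

# OrdinalEncoder with explicit categories keeps the real order: S=0, M=1, L=2, XL=3
ordinal = OrdinalEncoder(categories=[["S", "M", "L", "XL"]])
ordered = ordinal.fit_transform(sizes[["size"]]).ravel()

print(alphabetical.tolist())
print(ordered.tolist())
```

A plain dictionary with `Series.map` (for example `{'S': 0, 'M': 1, 'L': 2, 'XL': 3}`) achieves the same result when a scikit-learn transformer is not needed.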

6. Aggregation Features

Especially useful for hierarchical or grouped data.

# Suppose we have transaction data and want customer-level features
# Aggregate by customer_id

# Number of transactions
transaction_count = df.groupby('customer_id').size().reset_index(name='transaction_count')

# Total amount spent
total_spent = df.groupby('customer_id')['amount'].sum().reset_index(name='total_spent')

# Average transaction amount
avg_amount = df.groupby('customer_id')['amount'].mean().reset_index(name='avg_amount')

# Time between consecutive transactions of the same customer
df = df.sort_values(['customer_id', 'transaction_date'])
df['prev_date'] = df.groupby('customer_id')['transaction_date'].shift(1)
df['days_since_prev'] = (df['transaction_date'] - df['prev_date']).dt.days

# Average time between transactions per customer (NaN for customers with a single transaction)
avg_days = df.groupby('customer_id')['days_since_prev'].mean().reset_index(name='avg_days_between_transactions')

# Combine all aggregated features
customer_features = transaction_count.merge(total_spent, on='customer_id')
customer_features = customer_features.merge(avg_amount, on='customer_id')
customer_features = customer_features.merge(avg_days, on='customer_id')
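
The same customer-level table can be built in a single pass with pandas named aggregation, which avoids the chain of merges. This is a sketch on made-up transactions, not a replacement for the step-by-step version.

```python
import pandas as pd

transactions = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "amount": [10.0, 20.0, 30.0, 5.0, 15.0],
    "transaction_date": pd.to_datetime(
        ["2024-01-01", "2024-01-04", "2024-01-10", "2024-02-01", "2024-02-03"]
    ),
})

# Days since the previous transaction of the same customer
transactions = transactions.sort_values(["customer_id", "transaction_date"])
transactions["days_since_prev"] = (
    transactions.groupby("customer_id")["transaction_date"].diff().dt.days
)

# Named aggregation: output_column=(input_column, function)
customer_features = transactions.groupby("customer_id").agg(
    transaction_count=("amount", "count"),  # counts non-null amounts
    total_spent=("amount", "sum"),
    avg_amount=("amount", "mean"),
    avg_days_between_transactions=("days_since_prev", "mean"),
).reset_index()

print(customer_features)
```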

7. Window Features

Especially useful for time series data.

# Assumes the time series is already sorted by time
# Rolling window: aggregate over a sliding window

# Mean over the last 7 rows, current row included (7 days only if there is one row per day)
# If 'value' is the prediction target, apply .shift(1) first so the current value is not used
df['rolling_mean_7d'] = df['value'].rolling(window=7).mean()

# Standard deviation over the same 7-row window
df['rolling_std_7d'] = df['value'].rolling(window=7).std()

# Cumulative sum
df['cumulative_sum'] = df['value'].cumsum()

# Percentage change from the previous value
df['pct_change'] = df['value'].pct_change() * 100

# Lag features
for i in range(1, 4):
    df[f'lag_{i}'] = df['value'].shift(i)

# Difference features
df['diff_1'] = df['value'].diff(1)

# Expanding window: mean of all values up to and including the current row
df['expanding_mean'] = df['value'].expanding().mean()
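
When the table holds several series (for example one per store) and `value` is also what the model predicts, window features are typically computed per group and from past rows only. The sketch below assumes a hypothetical `store` column and made-up numbers.

```python
import pandas as pd

sales = pd.DataFrame({
    "store": ["A"] * 5 + ["B"] * 5,
    "value": [1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 20.0, 30.0, 40.0, 50.0],
})

# Lag within each store, so store B never sees store A's last value
sales["lag_1"] = sales.groupby("store")["value"].shift(1)

# Rolling mean of the previous 3 rows: shift(1) excludes the current row
sales["rolling_mean_3_prev"] = sales.groupby("store")["value"].transform(
    lambda s: s.shift(1).rolling(window=3).mean()
)

print(sales)
```

The first rows of each store are NaN because there is not enough history yet; whether to drop or impute them depends on the model.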

8. Dimensionality Reduction

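When there are too many features, dimensionality reduction techniques can help.

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.preprocessing import StandardScaler

# These methods are scale-sensitive, so standardize first
# (X is the numeric feature matrix, y the class labels)
X_scaled = StandardScaler().fit_transform(X)

# PCA: n_components cannot exceed min(n_samples, n_features)
pca = PCA(n_components=10)
X_pca = pca.fit_transform(X_scaled)

# t-SNE: mainly for 2-D/3-D visualization; it has no transform() for new data
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)

# LDA (supervised): n_components must be <= min(n_classes - 1, n_features)
lda = LDA(n_components=2)
X_lda = lda.fit_transform(X_scaled, y)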
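
Rather than fixing the number of PCA components by hand, scikit-learn also accepts a target fraction of explained variance. The sketch below uses synthetic data generated from two latent factors; the 0.95 threshold is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 6 correlated features driven by 2 latent factors plus small noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X_demo = latent @ mixing + 0.05 * rng.normal(size=(200, 6))

X_scaled = StandardScaler().fit_transform(X_demo)

# A float in (0, 1) keeps the smallest number of components
# whose cumulative explained variance reaches that fraction
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X_scaled)

print(pca.n_components_, pca.explained_variance_ratio_.cumsum())
```

Inspecting `explained_variance_ratio_` this way gives a data-driven starting point, which can then be confirmed by checking downstream model performance.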

Best Practices in Feature Engineering

1. Understand the data and the problem

Before creating features, make sure you understand:

The nature of the data
The domain knowledge
The meaning of the existing features
The goal of the model

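2. Explore the data first (EDA)

import matplotlib.pyplot as plt
import seaborn as sns

# Summary statistics of the numeric columns
df.describe()

# Correlation between numeric features
# (numeric_only=True avoids errors on text columns in pandas >= 2.0)
correlation_matrix = df.corr(numeric_only=True)
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.show()

# Distribution of each numeric feature per target class (assumes a categorical 'target' column)
for feature in df.select_dtypes(include='number').columns:
    if feature != 'target':
        plt.figure(figsize=(10, 6))
        sns.boxplot(x='target', y=feature, data=df)
        plt.title(f'{feature} vs Target')
        plt.show()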

3. Evaluate features

from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

# X is a numeric DataFrame of features, y the target
task_type = 'classification'  # or 'regression'

# Mutual information between each feature and the target
if task_type == 'classification':
    mi = mutual_info_classif(X, y)
else:
    mi = mutual_info_regression(X, y)

# Collect the scores in a dataframe
mi_df = pd.DataFrame({'Feature': X.columns, 'MI_Score': mi})
mi_df = mi_df.sort_values('MI_Score', ascending=False)

# Use a Random Forest to estimate feature importance
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

if task_type == 'classification':
    model = RandomForestClassifier(random_state=42)
else:
    model = RandomForestRegressor(random_state=42)

model.fit(X, y)
importances = model.feature_importances_

# Collect the importances in a dataframe
importance_df = pd.DataFrame({'Feature': X.columns, 'Importance': importances})
importance_df = importance_df.sort_values('Importance', ascending=False)

4. Prevent data leakage

A common mistake in feature engineering is accidentally including information from the future or other leaked information, meaning anything that will not be available at prediction time.

To avoid this:

Split the data into train and test sets first, then fit every preprocessing and feature engineering step on the training set only and apply it unchanged to the test set
Use techniques such as K-fold encoding for target encoding
Take particular care with time series data: never use information from the future
Use a pipeline to keep the process consistent

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

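# Build a pipeline that chains the feature transformation and the model
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier())
])

# fit() learns the scaler and the model on the training data only;
# predict() reuses the fitted scaler on the test data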
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
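
A pipeline becomes more valuable under cross-validation, because every fold refits the preprocessing on its own training part. The sketch below is one possible setup for mixed numeric and categorical columns; the synthetic data, column names, and the choice of `LogisticRegression` are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data: the target depends on age only
rng = np.random.default_rng(42)
n = 200
X = pd.DataFrame({
    "income": rng.lognormal(10, 0.5, n),
    "age": rng.integers(18, 70, n),
    "region": rng.choice(["north", "south", "east"], n),
})
y = (X["age"] > 40).astype(int)

# Scale numeric columns and one-hot encode the categorical one
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["income", "age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["region"]),
])

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# Each of the 5 folds fits the scaler and encoder on its training part only
scores = cross_val_score(pipeline, X, y, cv=5)
print(scores.mean())
```

Fitting the scaler on the full dataset before cross-validation would let statistics from the validation rows influence training, which is exactly the leakage the list above warns about.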

5. Computational efficiency

When working with large datasets, keep computational cost in mind:

Avoid creating large numbers of unnecessary features
Use dimensionality reduction techniques when needed
Precompute and store computationally expensive features instead of recalculating them

6. Experiment and validate

Try several different feature engineering approaches
Use a validation set or cross-validation to measure their impact
Track the improvement contributed by each feature engineering technique
Combine different techniques to find the best solution
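
One way to measure the impact of a single engineered feature is to cross-validate the same model with and without it. The sketch below uses synthetic data in which the target depends on the product of two inputs; all names and numbers are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: price is driven by area = width * height
rng = np.random.default_rng(0)
width = rng.uniform(1, 10, 300)
height = rng.uniform(1, 10, 300)
price = 3.0 * width * height + rng.normal(0, 1.0, 300)

X_base = np.column_stack([width, height])                 # raw features only
X_eng = np.column_stack([width, height, width * height])  # plus the engineered area

# Same model and same folds; only the feature set differs
r2_base = cross_val_score(LinearRegression(), X_base, price, cv=5, scoring="r2").mean()
r2_eng = cross_val_score(LinearRegression(), X_eng, price, cv=5, scoring="r2").mean()

print(f"baseline R2: {r2_base:.3f}, with area: {r2_eng:.3f}")
```

Keeping the model and the folds fixed isolates the effect of the feature, so a feature that does not improve the cross-validated score can be dropped with some confidence.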

Automating Feature Engineering

1. Featuretools

import featuretools as ft

# Create an EntitySet
es = ft.EntitySet(id="customer_data")

# Add the dataframes to the EntitySet
es.add_dataframe(dataframe=customers_df, dataframe_name="customers", index="customer_id")
es.add_dataframe(dataframe=transactions_df, dataframe_name="transactions", index="transaction_id")

# Add the relationship (one customer has many transactions)
es.add_relationship(parent_dataframe_name="customers", parent_column_name="customer_id",
                   child_dataframe_name="transactions", child_column_name="customer_id")

# Run Deep Feature Synthesis; with features_only=False it returns
# the computed feature matrix and the list of feature definitions
feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name="customers",
                                      max_depth=2, features_only=False)

2. Feature-engine

from feature_engine.transformation import LogTransformer
from feature_engine.creation import RelativeFeatures
from feature_engine.encoding import OneHotEncoder

# Log transform for skewed features (values must be strictly positive)
log_transformer = LogTransformer(variables=['price', 'income'])
X_transformed = log_transformer.fit_transform(X)

# Create ratios against a reference column
# (RelativeFeatures replaces CombineWithReferenceFeature in recent feature-engine versions)
combiner = RelativeFeatures(
    variables=['rooms', 'bathrooms'],
    reference=['area'],
    func=['div']
)
X_combined = combiner.fit_transform(X)

# One-hot encoding of the 10 most frequent categories per variable
# (drop_last only applies when top_categories is None, so it is omitted here)
encoder = OneHotEncoder(top_categories=10)
X_encoded = encoder.fit_transform(X)

3. tsfresh (for time series data)

from tsfresh import extract_features, select_features
from tsfresh.utilities.dataframe_functions import impute

# Extract features from the time series data (one output row per id)
extracted_features = extract_features(timeseries_df, column_id="id", column_sort="time")

# Impute missing and infinite values in place
impute(extracted_features)

# Keep the features that are relevant to the target
# (y is a pandas Series indexed by the same ids)
selected_features = select_features(extracted_features, y)

Conclusion

Feature engineering is part art and part science, and it requires both technical skill and domain understanding. Time invested in feature engineering often pays off more than time spent tuning the model algorithm.

An effective feature engineering process includes:

Understanding the data and the problem
Applying domain knowledge to create meaningful features
Trying different techniques and measuring their impact
Combining manual and automated methods

Although deep learning is reducing the need for manual feature engineering in some fields, these techniques remain very important in many problems, especially when working with structured data or under resource constraints.
