Quick Start Guide
=================

Basic Usage
-----------

Here's a simple example to get started with EnsembleSet:

.. code-block:: python

    import pandas as pd
    import ensembleset.dataset as ds

    # Create or load your training data
    train_df = pd.DataFrame({
        'feature1': [1.0, 2.0, 3.0, 4.0, 5.0],
        'feature2': [10.0, 20.0, 30.0, 40.0, 50.0],
        'feature3': [100.0, 200.0, 300.0, 400.0, 500.0],
        'label': [0, 1, 0, 1, 0]
    })

    # Initialize the DataSet
    data_ensemble = ds.DataSet(
        label='label',
        train_data=train_df,
        data_directory='./ensemble_output'
    )

    # Generate ensemble datasets
    output_file = data_ensemble.make_datasets(
        n_datasets=10,      # Generate 10 different datasets
        frac_features=0.1,  # Use 10% of features per step
        n_steps=5           # Apply 5 engineering steps per dataset
    )

    print(f"Datasets saved to: {output_file}")

With Test Data
--------------

Include test data to generate aligned training and testing datasets:

.. code-block:: python

    import ensembleset.dataset as ds

    # Initialize with both training and testing data
    data_ensemble = ds.DataSet(
        label='label_column_name',
        train_data=train_df,
        test_data=test_df,
        data_directory='./ensemble_output'
    )

    # Generate ensemble datasets
    output_file = data_ensemble.make_datasets(
        n_datasets=10,
        frac_features=0.1,
        n_steps=5
    )

The same feature engineering pipeline applied to training data will be applied
to testing data, with all transformations fitted on training data only to
prevent data leakage.

With String Features
--------------------

Specify categorical string features that need encoding:

.. code-block:: python

    import ensembleset.dataset as ds

    # Initialize with string features specified
    data_ensemble = ds.DataSet(
        label='target',
        train_data=train_df,
        test_data=test_df,
        string_features=['category_col', 'group_col'],
        data_directory='./ensemble_output',
        ensembleset_base_name='my_ensemble'
    )

    # Generate ensemble datasets
    output_file = data_ensemble.make_datasets(
        n_datasets=10,
        frac_features=0.2,
        n_steps=3
    )

String features will be encoded using either one-hot or ordinal encoding
before numerical feature engineering methods are applied.

Understanding Parameters
------------------------

``n_datasets``
    Number of dataset variations to generate. Each will have a unique random
    sequence of feature engineering methods applied.

``frac_features``
    Fraction of features (0.0 to 1.0) to randomly select for each feature
    engineering step. For example, 0.1 means 10% of available features. The
    selection is re-randomized for each step.

``n_steps``
    Number of feature engineering steps to apply in sequence for each dataset.
    Each step randomly selects a method from the available techniques.

Output Format
-------------

Generated datasets are saved to HDF5 format with the following structure:

.. code-block:: text

    ensembleset.h5
    ├── train
    │   ├── labels          # Training labels array
    │   ├── dataset_0       # First training dataset
    │   ├── dataset_1       # Second training dataset
    │   └── ...
    └── test
        ├── labels          # Testing labels array
        ├── dataset_0       # First testing dataset
        ├── dataset_1       # Second testing dataset
        └── ...

Reading Generated Datasets
---------------------------

Load datasets from the HDF5 file using h5py:

.. code-block:: python

    import h5py
    import numpy as np

    with h5py.File('ensemble_output/ensembleset.h5', 'r') as f:
        # Load training data for first dataset
        train_labels = np.array(f['train/labels'])
        train_features = np.array(f['train/dataset_0'])

        # Load testing data for first dataset
        test_labels = np.array(f['test/labels'])
        test_features = np.array(f['test/dataset_0'])

Next Steps
----------

* See :doc:`api` for detailed API documentation
* See :doc:`feature_catalog` for descriptions of all feature engineering methods
* See :doc:`configuration` for configuration options
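
One possible next step is to load every generated dataset rather than only
``dataset_0``. The sketch below assumes the file follows the layout shown under
Output Format (a ``train`` or ``test`` group holding ``labels`` plus members
named ``dataset_0``, ``dataset_1``, and so on). The helper name
``load_all_datasets`` is hypothetical and is not part of EnsembleSet; it uses
only standard h5py and NumPy calls.

.. code-block:: python

    import h5py
    import numpy as np


    def load_all_datasets(path, split="train"):
        """Return (labels, list of feature arrays) for one split of the file.

        Hypothetical helper: assumes the group layout described in this guide.
        """
        with h5py.File(path, "r") as f:
            group = f[split]
            labels = np.array(group["labels"])

            # Iterating over an h5py group yields member names. Sort them by
            # their numeric suffix so dataset_10 comes after dataset_2.
            names = sorted(
                (name for name in group if name.startswith("dataset_")),
                key=lambda name: int(name.split("_")[1]),
            )
            features = [np.array(group[name]) for name in names]

        return labels, features

Calling ``load_all_datasets('ensemble_output/ensembleset.h5')`` would then
return the training labels and a list with one feature array per generated
dataset; passing ``split='test'`` reads the testing group in the same way.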