Initial commit
skills/umap-learn/references/api_reference.md
# UMAP API Reference

## UMAP Class

`umap.UMAP(n_neighbors=15, n_components=2, metric='euclidean', n_epochs=None, learning_rate=1.0, init='spectral', min_dist=0.1, spread=1.0, low_memory=True, set_op_mix_ratio=1.0, local_connectivity=1.0, repulsion_strength=1.0, negative_sample_rate=5, transform_queue_size=4.0, a=None, b=None, random_state=None, metric_kwds=None, angular_rp_forest=False, target_n_neighbors=-1, target_metric='categorical', target_metric_kwds=None, target_weight=0.5, transform_seed=42, transform_mode='embedding', force_approximation_algorithm=False, verbose=False, unique=False, densmap=False, dens_lambda=2.0, dens_frac=0.3, dens_var_shift=0.1, output_dens=False, disconnection_distance=None, precomputed_knn=(None, None, None))`

Finds a low-dimensional embedding of the data that approximates the underlying manifold.

### Core Parameters

#### n_neighbors (int, default: 15)

Size of the local neighborhood used for manifold approximation. Larger values give a more global view of the manifold, while smaller values preserve more local structure. Values generally fall in the range 2 to 100.

**Tuning guidance:**
- Use 2-5 for very local structure
- Use 10-20 for balanced local/global structure (typical)
- Use 50-100 to emphasize global structure
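
To make the role of `n_neighbors` more concrete, the sketch below shows one way a neighborhood of size `k` can be turned into fuzzy membership weights: a per-point bandwidth `sigma` is tuned so that the neighbor weights sum to `log2(k)`. This is a simplified pure-Python illustration, not the library's code. `membership_weights` and `smooth_knn_sigma` are hypothetical names, and `rho` is taken to be the distance to the nearest neighbor, which corresponds to `local_connectivity=1`.

```python
import math


def membership_weights(dists, rho, sigma):
    """Weight of each neighbor: 1.0 up to distance rho, then exponential decay."""
    return [math.exp(-max(d - rho, 0.0) / sigma) for d in dists]


def smooth_knn_sigma(dists, n_iter=64, tol=1e-5):
    """Binary-search sigma so the weights of the k neighbors sum to log2(k).

    `dists` holds one point's distances to its k nearest neighbors
    (the point itself excluded). Hypothetical helper for illustration only.
    """
    target = math.log2(len(dists))
    rho = min(dists)  # the nearest neighbor always gets weight 1.0
    lo, hi, sigma = 0.0, math.inf, 1.0
    for _ in range(n_iter):
        total = sum(membership_weights(dists, rho, sigma))
        if abs(total - target) < tol:
            break
        if total > target:      # weights too large: shrink sigma
            hi = sigma
            sigma = (lo + hi) / 2.0
        else:                   # weights too small: grow sigma
            lo = sigma
            sigma = sigma * 2.0 if hi == math.inf else (lo + hi) / 2.0
    return rho, sigma


neighbor_dists = [1.0, 2.0, 3.0, 4.0]
rho, sigma = smooth_knn_sigma(neighbor_dists)
weights = membership_weights(neighbor_dists, rho, sigma)
```

Under this scheme a larger `k` raises the target sum, so more distant neighbors keep a non-negligible weight, which is one intuition for why larger `n_neighbors` values emphasize global structure.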

#### n_components (int, default: 2)

Dimension of the embedding space. Unlike t-SNE, UMAP scales well with increasing embedding dimensions.

**Common values:**
- 2-3: Visualization
- 5-10: Clustering preprocessing
- 10-100: Feature engineering for downstream ML

#### metric (str or callable, default: 'euclidean')

Distance metric to use. Accepts:
- A string naming one of UMAP's built-in metrics (most mirror the names used by `scipy.spatial.distance`)
- A custom callable that takes two 1-D arrays and returns a float (must be compiled with Numba)

**Common metrics:**
- `'euclidean'`: Standard Euclidean distance (default)
- `'manhattan'`: L1 distance
- `'cosine'`: Cosine distance (good for text/document vectors)
- `'correlation'`: Correlation distance
- `'hamming'`: Hamming distance (for binary data)
- `'jaccard'`: Jaccard distance (for binary/set data)
- `'dice'`: Dice distance
- `'canberra'`: Canberra distance
- `'braycurtis'`: Bray-Curtis distance
- `'chebyshev'`: Chebyshev distance
- `'minkowski'`: Minkowski distance (specify `p` with `metric_kwds`)
- `'precomputed'`: Use a precomputed distance matrix
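
As a reminder of what the `p` parameter of `'minkowski'` controls, here is a small pure-Python version of the distance. It is only an illustration: `minkowski` is a local helper, not a UMAP function. With UMAP itself the value would be supplied through `metric_kwds`, for example `metric_kwds={'p': 3}`.

```python
def minkowski(x, y, p=2.0):
    """Minkowski distance: p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


d_manhattan = minkowski([0.0, 0.0], [3.0, 4.0], p=1)
d_euclidean = minkowski([0.0, 0.0], [3.0, 4.0], p=2)
```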

#### min_dist (float, default: 0.1)

Effective minimum distance between embedded points. Controls how tightly points are packed together. Smaller values result in clumpier embeddings.

**Tuning guidance:**
- Use 0.0 for clustering applications
- Use 0.1-0.3 for visualization (balanced)
- Use 0.5-0.99 for loose structure preservation

#### spread (float, default: 1.0)

Effective scale of embedded points. Together with `min_dist`, it determines how clumped or spread out the embedded points are.
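
`min_dist` and `spread` act through a smooth curve that maps embedding distance to membership strength. The sketch below assumes the usual form `1 / (1 + a * d^(2b))` for that curve and compares it with the piecewise target it is fitted to. The values of `A` and `B` are the approximate ones commonly reported for the defaults (`min_dist=0.1`, `spread=1.0`); they are hard-coded here as an assumption rather than fitted, and the function names are illustrative.

```python
import math


def target_curve(d, min_dist=0.1, spread=1.0):
    """Piecewise target: flat at 1.0 up to min_dist, exponential decay after."""
    if d <= min_dist:
        return 1.0
    return math.exp(-(d - min_dist) / spread)


def smooth_curve(d, a, b):
    """Differentiable approximation of the target curve."""
    return 1.0 / (1.0 + a * d ** (2.0 * b))


A, B = 1.577, 0.895  # approximate fit for min_dist=0.1, spread=1.0 (assumed)

for d in (0.05, 0.5, 1.0, 2.0):
    print(f"d={d:4.2f}  target={target_curve(d):.3f}  smooth={smooth_curve(d, A, B):.3f}")
```

Lowering `min_dist` shortens the flat part of the target, so points may sit closer together; raising `spread` slows the decay, so the layout stretches out.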

### Training Parameters

#### n_epochs (int, default: None)

Number of training epochs. If None, a default is chosen from the dataset size: 500 epochs for small datasets and 200 for large ones.

**Manual tuning:**
- Smaller datasets may need 500+ epochs
- Larger datasets may converge with 200 epochs
- More epochs give a more thoroughly optimized layout at the cost of longer training
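
The automatic choice can be summarized in a few lines. This is a sketch under the assumption, based on recent umap-learn releases, that the cutoff between "small" and "large" is 10,000 samples; `default_n_epochs` is a hypothetical helper, not part of the API.

```python
def default_n_epochs(n_samples, small_cutoff=10_000):
    """Epoch count picked when n_epochs=None (the cutoff value is an assumption)."""
    return 500 if n_samples <= small_cutoff else 200


small = default_n_epochs(2_000)      # small dataset: more epochs
large = default_n_epochs(250_000)    # large dataset: fewer epochs
```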

#### learning_rate (float, default: 1.0)

Initial learning rate for the SGD optimizer. Higher values lead to faster convergence but may overshoot optimal solutions.

#### init (str or np.ndarray, default: 'spectral')

Initialization method for the embedding:
- `'spectral'`: Use spectral embedding (default, usually best)
- `'random'`: Random initialization
- `'pca'`: Initialize with PCA
- numpy array: Custom initialization (shape: (n_samples, n_components))

### Advanced Structural Parameters

#### local_connectivity (int or float, default: 1.0)

Number of nearest neighbors assumed to be locally connected. Higher values give more connected manifolds.

#### set_op_mix_ratio (float, default: 1.0)

Interpolation between fuzzy union and fuzzy intersection when combining the local fuzzy simplicial sets. A value of 1.0 uses pure union, 0.0 uses pure intersection.
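
The interpolation can be written out directly. The sketch below is a pure-Python illustration on a tiny dense matrix of directed membership weights (the library works on sparse matrices); `mix_fuzzy_sets` is a hypothetical name, and the fuzzy union and intersection are assumed to be the probabilistic ones, `a + b - ab` and `ab`.

```python
def mix_fuzzy_sets(weights, mix_ratio=1.0):
    """Symmetrize directed membership weights.

    mix_ratio=1.0 gives the fuzzy union, 0.0 the fuzzy intersection.
    """
    n = len(weights)
    mixed = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            both = weights[i][j] * weights[j][i]            # fuzzy intersection
            either = weights[i][j] + weights[j][i] - both   # fuzzy union
            mixed[i][j] = mix_ratio * either + (1.0 - mix_ratio) * both
    return mixed


# Point 0 sees point 1 with weight 0.8; point 1 sees point 0 with weight 0.5.
directed = [[0.0, 0.8],
            [0.5, 0.0]]
union = mix_fuzzy_sets(directed, mix_ratio=1.0)
intersection = mix_fuzzy_sets(directed, mix_ratio=0.0)
```

With pure union an edge survives if either endpoint considers the other a neighbor; with pure intersection both must agree, which tends to produce a sparser, more fragmented graph.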

#### repulsion_strength (float, default: 1.0)

Weighting applied to negative samples in low-dimensional embedding optimization. Higher values push embedded points further apart.

#### negative_sample_rate (int, default: 5)

Number of negative samples to select per positive sample. Higher values lead to greater repulsion between points and more spread-out embeddings but increase computational cost.

### Supervised Learning Parameters

#### target_n_neighbors (int, default: -1)

Number of nearest neighbors to use when constructing the target simplicial set. If -1, uses the `n_neighbors` value.

#### target_metric (str, default: 'categorical')

Distance metric for target values (labels):
- `'categorical'`: For classification tasks
- Any other metric for regression tasks

#### target_weight (float, default: 0.5)

Weight applied to target information vs. data structure. Range 0.0 to 1.0:
- 0.0: Weights predominantly on the data structure
- 0.5: Balanced (default)
- 1.0: Places a strong emphasis on the labels
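
One way to see what `target_weight` does for categorical targets: edges between points with different labels are down-weighted by a factor that shrinks as `target_weight` grows. The sketch below assumes the scheme used in recent umap-learn releases (a "far distance" of `2.5 / (1 - target_weight)`, and the label `-1` meaning "unlabeled"); treat the constants as assumptions. `edge_factor` is an illustrative name, not a library function.

```python
import math


def edge_factor(label_i, label_j, target_weight=0.5, unknown_dist=1.0):
    """Multiplier applied to a graph edge given the labels of its endpoints."""
    if label_i == -1 or label_j == -1:      # unlabeled point: mild penalty
        return math.exp(-unknown_dist)
    if label_i == label_j:                  # same class: edge kept as-is
        return 1.0
    # Different classes: the penalty grows with target_weight (assumed constants)
    far_dist = 2.5 / (1.0 - target_weight) if target_weight < 1.0 else 1.0e12
    return math.exp(-far_dist)


same = edge_factor(0, 0)
diff_default = edge_factor(0, 1, target_weight=0.5)
diff_strong = edge_factor(0, 1, target_weight=0.9)
```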

### Transform Parameters

#### transform_queue_size (float, default: 4.0)

Controls how aggressively nearest neighbors are searched for during transform operations. Larger values improve transform accuracy but increase memory usage and computation time.

#### transform_seed (int, default: 42)

Random seed for transform operations. Ensures reproducibility of transform results.

#### transform_mode (str, default: 'embedding')

What `transform()` returns for new data:
- `'embedding'`: The embedded coordinates of the new points (default)
- `'graph'`: The nearest neighbor graph linking the new points to the training data, instead of coordinates

### Performance Parameters

#### low_memory (bool, default: True)

Whether to use a memory-efficient implementation. Set to False only if memory is not a constraint and you want faster performance.

#### verbose (bool, default: False)

Whether to print progress messages during fitting.

#### unique (bool, default: False)

Whether to deduplicate the rows of the data before embedding. Set to True if the data contains many exact duplicates, especially when a point is repeated more often than `n_neighbors`, since copies of the same point can otherwise land in different regions of the embedding.

#### force_approximation_algorithm (bool, default: False)

Force use of approximate nearest neighbor search even for small datasets, which by default use exact distance computation. Useful when results need to be consistent across dataset sizes.

#### angular_rp_forest (bool, default: False)

Whether to use an angular random projection forest for nearest neighbor search. Can improve performance for normalized data in high dimensions.

### DensMAP Parameters

DensMAP is a variant that preserves local density information.

#### densmap (bool, default: False)

Whether to use the DensMAP algorithm instead of standard UMAP. Preserves local density in addition to topological structure.

#### dens_lambda (float, default: 2.0)

Weight of the density preservation term in DensMAP optimization. Higher values emphasize density preservation.

#### dens_frac (float, default: 0.3)

Fraction of training epochs (between 0 and 1) that use the density-augmented objective in DensMAP. The first `1 - dens_frac` fraction of epochs optimizes the standard UMAP objective before the density term is introduced.
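
In the reference implementation, `dens_frac` is applied to the training schedule rather than to the samples. The sketch below illustrates that schedule, assuming the density term is switched on once the fraction of completed epochs exceeds `1 - dens_frac`; `density_epochs` is a hypothetical helper written for this example.

```python
def density_epochs(n_epochs, dens_frac=0.3):
    """Epoch indices (0-based) during which the density term is active."""
    return [n for n in range(n_epochs) if (n + 1) / n_epochs > 1.0 - dens_frac]


active = density_epochs(200, dens_frac=0.3)
first_active, n_active = active[0], len(active)
```

With 200 epochs and the default `dens_frac=0.3`, the first 140 epochs run plain UMAP and the remaining 60 add the density term.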
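
#### dens_var_shift (float, default: 0.1)

Small constant added to the variance of the local radii in the embedding when computing the density correlation objective. It prevents numerical instability from dividing by a very small number.

#### output_dens (bool, default: False)

Whether to output local radii (an inverse measure of local density) in addition to the embedding. When True, `fit_transform()` returns a tuple `(embedding, rad_orig, rad_emb)`, and the radii are also stored in the `rad_orig_` and `rad_emb_` attributes.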

### Other Parameters

#### a (float, default: None)

Scale parameter of the curve `1 / (1 + a * d^(2b))` that converts embedding distance `d` into membership strength. If None, determined automatically from `min_dist` and `spread`.

#### b (float, default: None)

Exponent parameter of the same curve. If None, determined automatically from `min_dist` and `spread`.

#### random_state (int, RandomState instance, or None, default: None)

Random state for reproducibility. Set to an integer for reproducible results.

#### metric_kwds (dict, default: None)

Additional keyword arguments for the distance metric.

#### disconnection_distance (float, default: None)

Distance threshold at or beyond which points are treated as disconnected when building the graph. If None, uses the upper bound of the metric when one is known, otherwise infinity.

#### precomputed_knn (tuple, default: (None, None, None))

Precomputed k-nearest neighbors as (knn_indices, knn_dists, knn_search_index). Useful for reusing expensive computations.

## Methods

### fit(X, y=None)

Fit the UMAP model to the data.

**Parameters:**
- `X`: array-like, shape (n_samples, n_features) - Training data
- `y`: array-like, shape (n_samples,), optional - Target values for supervised dimension reduction

**Returns:**
- `self`: Fitted UMAP object

**Attributes set:**
- `embedding_`: The embedded representation of training data
- `graph_`: Fuzzy simplicial set approximation to the manifold
- `_raw_data`: Copy of the training data
- `_small_data`: Whether the dataset is considered small
- `_metric_kwds`: Processed metric keyword arguments
- `_n_neighbors`: Actual n_neighbors used
- `_initial_alpha`: Initial learning rate
- `_a`, `_b`: Curve parameters

### fit_transform(X, y=None)

Fit the model and return the embedded representation.

**Parameters:**
- `X`: array-like, shape (n_samples, n_features) - Training data
- `y`: array-like, shape (n_samples,), optional - Target values for supervised dimension reduction

**Returns:**
- `X_new`: array, shape (n_samples, n_components) - Embedded data

### transform(X)

Transform new data into the existing embedded space.

**Parameters:**
- `X`: array-like, shape (n_samples, n_features) - New data to transform

**Returns:**
- `X_new`: array, shape (n_samples, n_components) - Embedded representation of new data

**Important notes:**
- The model must be fitted before calling transform
- Transform quality depends on similarity between training and test distributions
- For significantly different data distributions, consider Parametric UMAP

### inverse_transform(X)

Transform data from the embedded space back to the original data space.

**Parameters:**
- `X`: array-like, shape (n_samples, n_components) - Embedded data points

**Returns:**
- `X_new`: array, shape (n_samples, n_features) - Reconstructed data in original space

**Important notes:**
- Computationally expensive operation
- Works poorly outside the convex hull of the training embedding
- Reconstruction quality varies by region

### update(X)

Update the model with new data. Allows incremental fitting.

**Parameters:**
- `X`: array-like, shape (n_samples, n_features) - New data to incorporate

**Returns:**
- `self`: Updated UMAP object

**Note:** Experimental feature; it may not preserve all properties of batch training.

## Attributes

### embedding_

array, shape (n_samples, n_components) - The embedded representation of the training data.

### graph_

scipy.sparse.csr_matrix - The weighted adjacency matrix of the fuzzy simplicial set approximation to the manifold.

### _raw_data

array - Copy of the raw training data.

### _sparse_data

bool - Whether the training data was sparse.

### _small_data

bool - Whether the dataset was considered small (uses different algorithm for small datasets).

### _input_hash

str - Hash of the input data for caching purposes.

### _knn_indices

array - Indices of k-nearest neighbors for each training point.

### _knn_dists

array - Distances to k-nearest neighbors for each training point.

### _rp_forest

list - Random projection forest used for approximate nearest neighbor search.

## ParametricUMAP Class

`umap.ParametricUMAP(encoder=None, decoder=None, parametric_reconstruction=False, autoencoder_loss=False, reconstruction_validation=None, dims=None, batch_size=None, n_training_epochs=1, loss_report_frequency=10, optimizer=None, keras_fit_kwargs={}, **kwargs)`

Parametric UMAP using neural networks to learn the embedding function.

### Additional Parameters (beyond UMAP)

#### encoder (tensorflow.keras.Model, default: None)

Keras model for encoding data to embeddings. If None, uses default 3-layer architecture with 100 neurons per layer.

#### decoder (tensorflow.keras.Model, default: None)

Keras model for decoding embeddings back to data space. Only used if parametric_reconstruction=True.

#### parametric_reconstruction (bool, default: False)

Whether to use parametric reconstruction. Requires decoder model.

#### autoencoder_loss (bool, default: False)

Whether to include reconstruction loss in the optimization. Requires decoder model.

#### reconstruction_validation (tuple, default: None)

Validation data (X_val, y_val) for monitoring reconstruction loss during training.

#### dims (tuple, default: None)

Input dimensions for the encoder network. Required if providing custom encoder.

#### batch_size (int, default: None)

Batch size for neural network training. If None, determined automatically.

#### n_training_epochs (int, default: 1)

Number of training epochs for the neural networks. More epochs improve quality but increase training time.

#### loss_report_frequency (int, default: 10)

How often to report loss during training.

#### optimizer (tensorflow.keras.optimizers.Optimizer, default: None)

Keras optimizer for training. If None, uses Adam with learning_rate parameter.

#### keras_fit_kwargs (dict, default: {})

Additional keyword arguments passed to the Keras fit() method.

### Methods

Same as UMAP class, but transform() and inverse_transform() use learned neural networks for faster inference.

## Utility Functions

These functions are defined in the `umap.umap_` module rather than at the top level of the package.

### umap.umap_.nearest_neighbors(X, n_neighbors, metric, metric_kwds={}, angular=False, random_state=None)

Compute k-nearest neighbors for the data.

**Returns:** (knn_indices, knn_dists, rp_forest)

### umap.umap_.fuzzy_simplicial_set(X, n_neighbors, random_state, metric, metric_kwds={}, knn_indices=None, knn_dists=None, angular=False, set_op_mix_ratio=1.0, local_connectivity=1.0, apply_set_operations=True, verbose=False, return_dists=None)

Construct fuzzy simplicial set representation of the data.

**Returns:** (graph, sigmas, rhos), where `graph` is the fuzzy simplicial set as a sparse matrix; the neighbor distances are appended when `return_dists` is set

### umap.umap_.simplicial_set_embedding(data, graph, n_components, initial_alpha, a, b, gamma, negative_sample_rate, n_epochs, init, random_state, metric, metric_kwds, densmap, densmap_kwds, output_dens, output_metric, output_metric_kwds, euclidean_output, parallel=False, verbose=False)

Perform the optimization to find a low-dimensional embedding.

**Returns:** (embedding, aux_data), where `aux_data` is a dict holding the local radii when `output_dens` is set

### umap.umap_.find_ab_params(spread, min_dist)

Fit the `a` and `b` parameters of the UMAP curve from `spread` and `min_dist`.

**Returns:** (a, b) tuple

## AlignedUMAP Class

`umap.AlignedUMAP(n_neighbors=15, n_components=2, metric='euclidean', alignment_regularisation=1e-2, alignment_window_size=3, **kwargs)`

UMAP variant for aligning multiple related datasets.

### Additional Parameters

#### alignment_regularisation (float, default: 1e-2)

Strength of alignment regularization between datasets.

#### alignment_window_size (int, default: 3)

Number of adjacent datasets to align.

### Methods

#### fit(X, y=None, **fit_params)

Fit the model to a sequence of related datasets.

**Parameters:**
- `X`: list of arrays - List of datasets to align, in order
- `relations`: list of dicts (required keyword argument) - One dict per consecutive pair of datasets, mapping row indices in dataset `i` to the corresponding row indices in dataset `i + 1`

**Returns:**
- `self`: Fitted model
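
`AlignedUMAP.fit` needs to know which rows in consecutive datasets refer to the same entity, supplied through the `relations` keyword argument. A minimal sketch of building that mapping from per-row identifiers follows. `build_relations` and the identifier lists are hypothetical and only illustrate the expected shape: one dict per consecutive pair of datasets, mapping a row index in one dataset to a row index in the next.

```python
def build_relations(id_lists):
    """Build index-to-index dicts linking each dataset to the next one.

    `id_lists[k]` holds one identifier per row of dataset k. A row is
    related to the row in the following dataset with the same identifier.
    """
    relations = []
    for current_ids, next_ids in zip(id_lists, id_lists[1:]):
        next_index = {sample_id: j for j, sample_id in enumerate(next_ids)}
        relations.append(
            {i: next_index[sample_id]
             for i, sample_id in enumerate(current_ids)
             if sample_id in next_index}
        )
    return relations


# Three snapshots whose rows only partly overlap
ids = [["a", "b", "c"], ["b", "c", "d"], ["d", "b"]]
relations = build_relations(ids)
```

The resulting list has one entry fewer than the number of datasets and would be passed as `mapper.fit(datasets, relations=relations)`.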

### Attributes

#### embeddings_

list of arrays - List of aligned embeddings, one per input dataset.

## Usage Examples

### Basic Usage with All Common Parameters

```python
import umap

# Standard 2D visualization embedding
reducer = umap.UMAP(
    n_neighbors=15,      # Balance local/global structure
    n_components=2,      # Output dimensions
    metric='euclidean',  # Distance metric
    min_dist=0.1,        # Minimum distance between points
    spread=1.0,          # Scale of embedded points
    random_state=42,     # Reproducibility
    n_epochs=200,        # Training iterations (None = auto)
    learning_rate=1.0,   # SGD learning rate
    init='spectral',     # Initialization method
    low_memory=True,     # Memory-efficient mode
    verbose=True         # Print progress
)

# `data` is an array-like of shape (n_samples, n_features)
embedding = reducer.fit_transform(data)
```

### Supervised Learning

```python
import umap

# Train with labels for class separation
reducer = umap.UMAP(
    n_neighbors=15,
    target_weight=0.5,            # Balance data structure vs labels
    target_metric='categorical',  # Metric for labels
    random_state=42
)

# `labels` holds one class label per row of `data`
embedding = reducer.fit_transform(data, y=labels)
```

### Clustering Preprocessing

```python
import umap

# Optimized for clustering
reducer = umap.UMAP(
    n_neighbors=30,      # More global structure
    min_dist=0.0,        # Allow tight packing
    n_components=10,     # Higher dimensions for density
    metric='euclidean',
    random_state=42
)

embedding = reducer.fit_transform(data)
```

### Custom Distance Metric

```python
import umap
from numba import njit

@njit()
def custom_distance(x, y):
    """Custom distance function (must be Numba-compatible)"""
    # Sum of absolute differences, i.e. Manhattan distance
    result = 0.0
    for i in range(x.shape[0]):
        result += abs(x[i] - y[i])
    return result

reducer = umap.UMAP(metric=custom_distance)
embedding = reducer.fit_transform(data)
```

### Parametric UMAP with Custom Architecture

```python
import tensorflow as tf
from umap.parametric_umap import ParametricUMAP

# Number of input features, taken from the training array
input_dim = data.shape[1]

# Define custom encoder
encoder = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(input_dim,)),
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(2)  # Output dimension
])

# Define decoder for reconstruction
decoder = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(2,)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(input_dim)
])

# Train parametric UMAP with autoencoder
embedder = ParametricUMAP(
    encoder=encoder,
    decoder=decoder,
    dims=(input_dim,),
    parametric_reconstruction=True,
    autoencoder_loss=True,
    n_training_epochs=10,
    batch_size=128,
    n_neighbors=15,
    min_dist=0.1,
    random_state=42
)

embedding = embedder.fit_transform(data)
new_embedding = embedder.transform(new_data)
reconstructed = embedder.inverse_transform(embedding)
```

### DensMAP for Density Preservation

```python
import umap

# Preserve local density information
reducer = umap.UMAP(
    densmap=True,        # Enable DensMAP
    dens_lambda=2.0,     # Weight of density preservation
    dens_frac=0.3,       # Fraction of epochs that use the density term
    output_dens=True,    # Also return local radii
    n_neighbors=15,
    min_dist=0.1,
    random_state=42
)

# With output_dens=True, fit_transform returns a tuple
embedding, rad_orig, rad_emb = reducer.fit_transform(data)

# The same radii are stored on the fitted model
original_radii = reducer.rad_orig_  # Local radii in original space
embedded_radii = reducer.rad_emb_   # Local radii in embedded space
```

### Aligned UMAP for Time Series

```python
from umap import AlignedUMAP

# Multiple related datasets (e.g., different time points)
datasets = [day1_data, day2_data, day3_data, day4_data]

# Relations map row indices in one dataset to row indices in the next.
# Assumption for this example: every dataset holds the same samples in the
# same row order, so each relation is the identity mapping.
relations = [
    {i: i for i in range(len(datasets[k]))}
    for k in range(len(datasets) - 1)
]

# Align embeddings
mapper = AlignedUMAP(
    n_neighbors=15,
    alignment_regularisation=1e-2,  # Alignment strength
    alignment_window_size=2,        # Align with adjacent datasets
    n_components=2,
    random_state=42
)

mapper.fit(datasets, relations=relations)

# Access aligned embeddings
aligned_embeddings = mapper.embeddings_
# aligned_embeddings[0] is day1 embedding
# aligned_embeddings[1] is day2 embedding, etc.
```