# Similarity Search

Aeon provides tools for finding similar patterns within and across time series, including subsequence search, motif discovery, and approximate nearest neighbors.

## Subsequence Nearest Neighbors (SNN)

Find the most similar subsequences within a time series.

### MASS Algorithm

- `MassSNN` - Mueen's Algorithm for Similarity Search
- Uses FFT-based normalized cross-correlation to measure similarity
- Computes the full distance profile efficiently
- **Use when**: You need exact nearest-neighbor distances on large series

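To make the idea behind MASS concrete, the sketch below computes a z-normalized distance profile with an FFT-based sliding dot product. It is a plain NumPy illustration, not aeon's implementation; the function name `mass_distance_profile` and the synthetic data are assumptions made for this example.

```python
import numpy as np


def mass_distance_profile(series, query):
    """Z-normalized Euclidean distance from `query` to every window of `series`."""
    series = np.asarray(series, dtype=float)
    query = np.asarray(query, dtype=float)
    n, m = len(series), len(query)

    # Z-normalize the query (assumes the query is not constant).
    q = (query - query.mean()) / query.std()

    # Sliding dot products of q with every window, computed as an
    # FFT-based convolution of the series with the reversed query.
    size = n + m
    conv = np.fft.irfft(np.fft.rfft(series, size) * np.fft.rfft(q[::-1], size), size)
    dots = conv[m - 1 : n]

    # Rolling mean and standard deviation of each window via cumulative sums.
    csum = np.concatenate(([0.0], np.cumsum(series)))
    csum_sq = np.concatenate(([0.0], np.cumsum(series**2)))
    mean = (csum[m:] - csum[:-m]) / m
    var = (csum_sq[m:] - csum_sq[:-m]) / m - mean**2
    std = np.sqrt(np.maximum(var, 1e-12))  # guard against flat windows

    # With a z-normalized query, d^2 = 2m * (1 - dot / (m * std)).
    dist_sq = 2 * m * (1 - dots / (m * std))
    return np.sqrt(np.maximum(dist_sq, 0.0))


rng = np.random.default_rng(0)
series = rng.normal(size=300)
query = series[120:170]  # plant the query inside the series
profile = mass_distance_profile(series, query)
print(profile.shape, int(np.argmin(profile)))
```

The FFT replaces one dot product per window with a single convolution, which is where the speedup over a window-by-window loop comes from.
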
### STOMP-Based Motif Discovery

- `StompMotif` - Discovers recurring patterns (motifs)
- Finds the top-k most similar subsequence pairs
- Based on matrix profile computation
- **Use when**: You want to discover repeated patterns

### Brute Force Baseline

- `DummySNN` - Exhaustive distance computation
- Computes the distance from the query to every candidate subsequence
- **Use when**: The series is small, or you need an exact baseline to validate faster methods

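As a rough picture of what an exhaustive baseline does, the following sketch compares a query against every window one at a time. It uses only NumPy; `znorm` and `brute_force_distance_profile` are hypothetical helper names for this example, and z-normalized Euclidean distance is assumed as the measure.

```python
import numpy as np


def znorm(x):
    """Z-normalize a 1D array (assumes it is not constant)."""
    return (x - x.mean()) / x.std()


def brute_force_distance_profile(series, query):
    """Compare the query against every window, one window at a time."""
    m = len(query)
    q = znorm(query)
    n_windows = len(series) - m + 1
    return np.array(
        [np.linalg.norm(znorm(series[i : i + m]) - q) for i in range(n_windows)]
    )


series = np.sin(np.linspace(0, 20, 500))
query = series[200:250]  # a window taken from the series itself
profile = brute_force_distance_profile(series, query)
best = int(np.argmin(profile))
print(f"Best match at index {best}")
```

Each window costs O(m) work, so the whole profile costs O(n·m). That is acceptable for short series and useful as a correctness check for faster methods.
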
## Collection-Level Search

Find similar time series across collections.

### Approximate Nearest Neighbors (ANN)

- `RandomProjectionIndexANN` - Locality-sensitive hashing
- Uses random projections with cosine similarity
- Builds an index for fast approximate search
- **Use when**: The collection is large and speed matters more than exactness

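The general technique of random-projection hashing can be sketched in a few lines of NumPy. This is an illustration of sign-based locality-sensitive hashing under stated assumptions, not a description of how `RandomProjectionIndexANN` is implemented; `build_index`, `query_index`, and the synthetic collection are hypothetical.

```python
import numpy as np


def build_index(X, n_bits=8, seed=0):
    """Hash each row of X to a bit pattern given by random hyperplanes."""
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(X.shape[1], n_bits))
    codes = X @ planes > 0  # one sign bit per hyperplane
    buckets = {}
    for i, code in enumerate(codes):
        buckets.setdefault(code.tobytes(), []).append(i)
    return planes, buckets


def query_index(x, X, planes, buckets, k=5):
    """Rank only the series that share the query's bucket, by cosine similarity."""
    candidates = np.array(buckets.get((x @ planes > 0).tobytes(), []), dtype=int)
    if candidates.size == 0:
        return candidates
    C = X[candidates]
    sims = C @ x / (np.linalg.norm(C, axis=1) * np.linalg.norm(x))
    return candidates[np.argsort(-sims)[:k]]


rng = np.random.default_rng(1)
X = rng.normal(size=(200, 50))  # 200 synthetic series of length 50
planes, buckets = build_index(X)
neighbors = query_index(X[3], X, planes, buckets, k=5)
print(neighbors)
```

Series with a small angle between them tend to fall on the same side of each hyperplane and so share a bucket. The search is approximate because a true neighbor that lands in a different bucket is never examined.
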
## Quick Start: Motif Discovery

```python
from aeon.similarity_search import StompMotif
import numpy as np

# Create time series with repeated patterns
pattern = np.sin(np.linspace(0, 2*np.pi, 50))
y = np.concatenate([
    pattern + np.random.normal(0, 0.1, 50),
    np.random.normal(0, 1, 100),
    pattern + np.random.normal(0, 0.1, 50),
    np.random.normal(0, 1, 100)
])

# Find top-3 motifs
motif_finder = StompMotif(window_size=50, k=3)
motifs = motif_finder.fit_predict(y)

# motifs contains indices of motif occurrences
for i, (idx1, idx2) in enumerate(motifs):
    print(f"Motif {i+1} at positions {idx1} and {idx2}")
```

## Quick Start: Subsequence Search

```python
from aeon.similarity_search import MassSNN
import numpy as np

# Time series to search within
y = np.sin(np.linspace(0, 20, 500))

# Query subsequence
query = np.sin(np.linspace(0, 2, 50))

# Find nearest subsequences
searcher = MassSNN()
distances = searcher.fit_transform(y, query)

# Find best match
best_match_idx = np.argmin(distances)
print(f"Best match at index {best_match_idx}")
```

## Quick Start: Approximate NN on Collections

```python
from aeon.similarity_search import RandomProjectionIndexANN
from aeon.datasets import load_classification

# Load time series collection
X_train, _ = load_classification("GunPoint", split="train")

# Build index
ann = RandomProjectionIndexANN(n_projections=8, n_bits=4)
ann.fit(X_train)

# Find approximate nearest neighbors
query = X_train[0]
neighbors, distances = ann.kneighbors(query, k=5)
```

## Matrix Profile

The matrix profile is a fundamental data structure for many similarity search tasks. The key terms are:

- **Distance Profile**: Distances from a query to all subsequences
- **Matrix Profile**: For each subsequence, the distance to its nearest non-trivial neighbor among the other subsequences
- **Motif**: The pair of subsequences with the smallest distance
- **Discord**: The subsequence whose nearest-neighbor distance is largest (a candidate anomaly)

```python
import numpy as np
from aeon.similarity_search import StompMotif

# y is a 1D series, e.g. the one built in the motif quick start
# Compute matrix profile and find motifs/discords
mp = StompMotif(window_size=50)
mp.fit(y)

# Access matrix profile
profile = mp.matrix_profile_
profile_indices = mp.matrix_profile_index_

# Find discords (anomalies)
discord_idx = np.argmax(profile)
```

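The definitions above can be checked with a brute-force computation in plain NumPy. The sketch below is quadratic in the series length and is meant only to show what the matrix profile contains, not how STOMP computes it; `naive_matrix_profile` is a hypothetical name, and the default exclusion zone of half the window length is an assumption.

```python
import numpy as np


def naive_matrix_profile(series, m, exclusion=None):
    """Nearest-neighbor distance and index for every window (assumes no flat windows)."""
    exclusion = m // 2 if exclusion is None else exclusion
    windows = np.lib.stride_tricks.sliding_window_view(series, m)
    mean = windows.mean(axis=1, keepdims=True)
    std = windows.std(axis=1, keepdims=True)
    z = (windows - mean) / std

    # Each z-normalized window has squared norm m, so d^2 = 2m - 2 * <a, b>.
    dist = np.sqrt(np.maximum(2 * m - 2 * (z @ z.T), 0.0))

    # Ignore trivial matches: a window compared with itself or a close overlap.
    idx = np.arange(len(z))
    dist[np.abs(idx[:, None] - idx[None, :]) <= exclusion] = np.inf

    profile_index = dist.argmin(axis=1)
    return dist[idx, profile_index], profile_index


rng = np.random.default_rng(0)
pattern = np.sin(np.linspace(0, 2 * np.pi, 50))
y = np.concatenate([
    pattern + rng.normal(0, 0.1, 50),
    rng.normal(0, 1, 100),
    pattern + rng.normal(0, 0.1, 50),
    rng.normal(0, 1, 100),
])

profile, profile_index = naive_matrix_profile(y, m=50)
motif_a = int(np.argmin(profile))      # one member of the motif pair
motif_b = int(profile_index[motif_a])  # its nearest neighbor
discord = int(np.argmax(profile))      # window farthest from any other
print(motif_a, motif_b, discord)
```

The motif pair should line up with the two planted copies of the sine pattern, while the discord falls somewhere in the noise.
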
## Algorithm Selection

- **Exact subsequence search**: MassSNN
- **Motif discovery**: StompMotif
- **Anomaly detection**: Matrix profile (see anomaly_detection.md)
- **Fast approximate search**: RandomProjectionIndexANN
- **Small data**: DummySNN for exact results

## Use Cases

### Pattern Matching

Find where a pattern occurs in a long series:

```python
import numpy as np
from aeon.similarity_search import MassSNN

# Find heartbeat pattern in ECG data
# (ecg_data, heartbeat_pattern, and threshold are placeholders for your own values)
searcher = MassSNN()
distances = searcher.fit_transform(ecg_data, heartbeat_pattern)
occurrences = np.where(distances < threshold)[0]
```

### Motif Discovery

Identify recurring patterns:

```python
from aeon.similarity_search import StompMotif

# Find repeated behavioral patterns (activity_data is a placeholder series)
motif_finder = StompMotif(window_size=100, k=5)
motifs = motif_finder.fit_predict(activity_data)
```

### Time Series Retrieval

Find similar time series in a database:

```python
from aeon.similarity_search import RandomProjectionIndexANN

# Build searchable index (time_series_database is a placeholder collection)
ann = RandomProjectionIndexANN()
ann.fit(time_series_database)

# Query for similar series
neighbors, distances = ann.kneighbors(query_series, k=10)
```

## Best Practices

1. **Window size**: Critical parameter for subsequence methods
   - Too small: Captures noise
   - Too large: Misses fine-grained patterns
   - Rule of thumb: Match the expected pattern length; without prior knowledge, start at 10-20% of the series length

2. **Normalization**: Most methods assume z-normalized subsequences
   - Handles amplitude and offset variations
   - Focuses the comparison on shape similarity

3. **Distance metrics**: Different metrics for different needs
   - Euclidean: Fast, shape-based
   - DTW: Handles temporal warping
   - Cosine: Scale-invariant

4. **Exclusion zone**: For motif discovery, exclude trivial matches
   - Typically set to 0.5-1.0 × window_size
   - Prevents overlapping occurrences from being reported as matches

5. **Performance**:
   - MASS computes a distance profile in O(n log n), versus O(n·m) for brute force with query length m
   - ANN trades accuracy for speed
   - GPU acceleration available for some methods

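The normalization and distance-metric points above can be seen on a small example. The sketch below compares two subsequences with the same shape but different amplitude and offset; it uses only NumPy, and `znorm` and `cosine_distance` are hypothetical helpers written for this illustration.

```python
import numpy as np


def znorm(x):
    """Z-normalize a 1D array (assumes it is not constant)."""
    return (x - x.mean()) / x.std()


def cosine_distance(a, b):
    """One minus the cosine of the angle between a and b."""
    return 1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b))


t = np.linspace(0, 2 * np.pi, 50)
a = np.sin(t)
b = 3.0 * np.sin(t) + 10.0  # same shape, larger amplitude, shifted upward

raw = np.linalg.norm(a - b)                   # dominated by the offset
normed = np.linalg.norm(znorm(a) - znorm(b))  # close to zero: shapes match
cos_scaled = cosine_distance(a, 3.0 * a)      # close to zero: scale-invariant
cos_shifted = cosine_distance(a, b)           # large: cosine is not offset-invariant

print(f"raw={raw:.2f} znorm={normed:.2e} "
      f"cos_scaled={cos_scaled:.2e} cos_shifted={cos_shifted:.2f}")
```

Z-normalization removes both amplitude and offset, while cosine distance removes only amplitude. That difference is worth keeping in mind when choosing a metric for series with drifting baselines.
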