Similarity Search
Aeon provides tools for finding similar patterns within and across time series, including subsequence search, motif discovery, and approximate nearest neighbors.
Subsequence Nearest Neighbors (SNN)
Find the most similar subsequences within a time series.
MASS Algorithm
MassSNN - Mueen's Algorithm for Similarity Search
- Fast normalized cross-correlation for similarity
- Computes distance profile efficiently
- Use when: Need exact nearest neighbor distances, large series
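To make the idea behind MASS concrete, here is a minimal NumPy sketch of an FFT-based distance profile under z-normalized Euclidean distance. It is an illustration of the technique, not aeon's implementation; the function name `mass_distance_profile` and the synthetic data are assumptions made for this example.

```python
import numpy as np

def mass_distance_profile(series, query):
    """Z-normalized Euclidean distance from `query` to every window of `series`.

    Hypothetical helper for illustration; assumes 1D float arrays and a
    non-constant query.
    """
    n, m = len(series), len(query)
    q = (query - query.mean()) / query.std()

    # Sliding dot products via FFT: linear convolution with the reversed query.
    fft_len = 1 << (n + m - 1).bit_length()
    conv = np.fft.irfft(
        np.fft.rfft(series, fft_len) * np.fft.rfft(q[::-1], fft_len), fft_len
    )
    dots = conv[m - 1 : n]  # dots[i] = q . series[i:i+m]

    # Rolling mean and standard deviation of each window via cumulative sums.
    csum = np.cumsum(np.insert(series, 0, 0.0))
    csum2 = np.cumsum(np.insert(series**2, 0, 0.0))
    mean = (csum[m:] - csum[:-m]) / m
    var = (csum2[m:] - csum2[:-m]) / m - mean**2
    std = np.sqrt(np.maximum(var, 1e-12))

    # q has zero mean and unit variance, so d^2 = 2m - 2 * (q . window) / std.
    dist_sq = 2 * m - 2 * dots / std
    return np.sqrt(np.maximum(dist_sq, 0.0))

rng = np.random.default_rng(0)
series = rng.normal(size=500)
query = series[120:170].copy()  # plant the query so the best match is known

profile = mass_distance_profile(series, query)
best = int(np.argmin(profile))
print(f"Best match at index {best}")
```

The FFT replaces the per-window dot product, which is where the speedup over an exhaustive scan comes from.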
STOMP-Based Motif Discovery
StompMotif - Discovers recurring patterns (motifs)
- Finds top-k most similar subsequence pairs
- Based on matrix profile computation
- Use when: Want to discover repeated patterns
Brute Force Baseline
DummySNN - Exhaustive distance computation
- Computes all pairwise distances
- Use when: Small series, need exact baseline
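As a point of reference, an exhaustive scan can be written in a few lines of NumPy. The sketch below is not aeon's `DummySNN`; the helper names `znorm` and `brute_force_profile` and the synthetic series are assumptions for illustration.

```python
import numpy as np

def znorm(x):
    """Scale a window to zero mean and unit variance."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def brute_force_profile(series, query):
    """Compare the query against every window, one window at a time."""
    m = len(query)
    q = znorm(query)
    n_windows = len(series) - m + 1
    profile = np.empty(n_windows)
    for i in range(n_windows):
        profile[i] = np.linalg.norm(q - znorm(series[i : i + m]))
    return profile

rng = np.random.default_rng(1)
series = rng.normal(size=300)
query = series[200:240].copy()  # known location of the exact match

profile = brute_force_profile(series, query)
print(f"Best match at index {int(np.argmin(profile))}")
```

Each window costs O(m) work, so the scan is simple to verify but slows down quickly as the series grows.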
Collection-Level Search
Find similar time series across collections.
Approximate Nearest Neighbors (ANN)
RandomProjectionIndexANN - Locality-sensitive hashing
- Uses random projections with cosine similarity
- Builds index for fast approximate search
- Use when: Large collection, speed more important than exactness
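The random-projection idea can be sketched independently of aeon: each series is reduced to the sign pattern of its projections onto random hyperplanes, and series with a small angle between them tend to share most bits. The code below is a simplified illustration that ranks by Hamming distance between signatures; the names `planes`, `signature`, and `approx_neighbors`, and the choice of 32 bits, are assumptions and do not reflect how `RandomProjectionIndexANN` is built internally.

```python
import numpy as np

rng = np.random.default_rng(0)
collection = rng.normal(size=(200, 64))  # 200 synthetic series of length 64

n_bits = 32  # hypothetical signature length
planes = rng.normal(size=(n_bits, collection.shape[1]))

def signature(x):
    """One bit per hyperplane: which side of the plane the series falls on."""
    return (x @ planes.T) > 0

codes = signature(collection)  # boolean array of shape (200, 32)

def approx_neighbors(query, k=5):
    """Rank stored series by Hamming distance between bit signatures."""
    hamming = np.count_nonzero(codes != signature(query), axis=1)
    return np.argsort(hamming, kind="stable")[:k]

# A slightly perturbed copy of series 17 should hash close to the original.
query = collection[17] + rng.normal(scale=0.01, size=64)
print(approx_neighbors(query, k=5))
```

Comparing short bit signatures is much cheaper than comparing full series, at the cost of occasionally missing a true neighbor.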
Quick Start: Motif Discovery
from aeon.similarity_search import StompMotif
import numpy as np
# Create time series with repeated patterns
pattern = np.sin(np.linspace(0, 2*np.pi, 50))
y = np.concatenate([
    pattern + np.random.normal(0, 0.1, 50),
    np.random.normal(0, 1, 100),
    pattern + np.random.normal(0, 0.1, 50),
    np.random.normal(0, 1, 100)
])
# Find top-3 motifs
motif_finder = StompMotif(window_size=50, k=3)
motifs = motif_finder.fit_predict(y)
# motifs contains indices of motif occurrences
for i, (idx1, idx2) in enumerate(motifs):
    print(f"Motif {i+1} at positions {idx1} and {idx2}")
Quick Start: Subsequence Search
from aeon.similarity_search import MassSNN
import numpy as np
# Time series to search within
y = np.sin(np.linspace(0, 20, 500))
# Query subsequence
query = np.sin(np.linspace(0, 2, 50))
# Find nearest subsequences
searcher = MassSNN()
distances = searcher.fit_transform(y, query)
# Find best match
best_match_idx = np.argmin(distances)
print(f"Best match at index {best_match_idx}")
Quick Start: Approximate NN on Collections
from aeon.similarity_search import RandomProjectionIndexANN
from aeon.datasets import load_classification
# Load time series collection
X_train, _ = load_classification("GunPoint", split="train")
# Build index
ann = RandomProjectionIndexANN(n_projections=8, n_bits=4)
ann.fit(X_train)
# Find approximate nearest neighbors
query = X_train[0]
neighbors, distances = ann.kneighbors(query, k=5)
Matrix Profile
The matrix profile is a fundamental data structure for many similarity search tasks:
- Distance Profile: Distances from a query to all subsequences
- Matrix Profile: For each subsequence, the distance to its nearest neighbor elsewhere in the series
- Motif: Pair of subsequences with the minimum distance between them
- Discord: Subsequence whose nearest-neighbor distance is largest (anomaly candidate)
from aeon.similarity_search import StompMotif
import numpy as np
# Compute matrix profile and find motifs/discords
# (y is the 1D series built in the motif discovery quick start)
mp = StompMotif(window_size=50)
mp.fit(y)
# Access matrix profile
profile = mp.matrix_profile_
profile_indices = mp.matrix_profile_index_
# Find discords (anomalies): the subsequence farthest from its nearest neighbor
discord_idx = np.argmax(profile)
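The definitions above can be checked with a naive matrix profile written directly in NumPy. This is a quadratic-time sketch for small series, not the STOMP algorithm and not aeon's code; `naive_matrix_profile`, its `exclusion` parameter, and the synthetic data are assumptions for illustration.

```python
import numpy as np

def naive_matrix_profile(series, m, exclusion=None):
    """Matrix profile under z-normalized Euclidean distance (naive version)."""
    if exclusion is None:
        exclusion = m // 2
    windows = np.lib.stride_tricks.sliding_window_view(series, m)
    z = (windows - windows.mean(axis=1, keepdims=True)) / windows.std(
        axis=1, keepdims=True
    )
    n = len(z)
    profile = np.empty(n)
    index = np.empty(n, dtype=int)
    for i in range(n):
        d = np.linalg.norm(z - z[i], axis=1)  # distance profile of window i
        lo, hi = max(0, i - exclusion), min(n, i + exclusion + 1)
        d[lo:hi] = np.inf  # mask trivial matches around i
        index[i] = np.argmin(d)
        profile[i] = d[index[i]]
    return profile, index

rng = np.random.default_rng(0)
pattern = np.sin(np.linspace(0, 2 * np.pi, 30))
series = np.concatenate([
    pattern + rng.normal(0, 0.05, 30),  # first occurrence at index 0
    rng.normal(0, 1, 60),
    pattern + rng.normal(0, 0.05, 30),  # second occurrence at index 90
    rng.normal(0, 1, 60),
])

profile, index = naive_matrix_profile(series, m=30)
motif_a = int(np.argmin(profile))    # motif: smallest nearest-neighbor distance
motif_b = int(index[motif_a])
discord = int(np.argmax(profile))    # discord: largest nearest-neighbor distance
print(f"Motif pair near {motif_a} and {motif_b}, discord at {discord}")
```

The motif pair should land on or next to the two planted occurrences, while the discord falls wherever the noise happens to be least like anything else in the series.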
Algorithm Selection
- Exact subsequence search: MassSNN
- Motif discovery: StompMotif
- Anomaly detection: Matrix profile (see anomaly_detection.md)
- Fast approximate search: RandomProjectionIndexANN
- Small data: DummySNN for exact results
Use Cases
Pattern Matching
Find where a pattern occurs in a long series:
# Find heartbeat pattern in ECG data
# (ecg_data, heartbeat_pattern, and threshold are supplied by the user)
searcher = MassSNN()
distances = searcher.fit_transform(ecg_data, heartbeat_pattern)
occurrences = np.where(distances < threshold)[0]
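A plain threshold usually returns runs of adjacent indices, because windows that overlap a good match are also close to the query. One way to reduce these to distinct occurrences is a greedy pass that takes the best match and masks its neighborhood. The helper `pick_occurrences` and the toy distances below are hypothetical and only illustrate the idea.

```python
import numpy as np

def pick_occurrences(distances, threshold, exclusion):
    """Greedy selection: take the best match, mask its neighbors, repeat."""
    d = np.asarray(distances, dtype=float).copy()
    picked = []
    while True:
        i = int(np.argmin(d))
        if d[i] >= threshold:  # nothing left below the threshold
            break
        picked.append(i)
        d[max(0, i - exclusion) : i + exclusion + 1] = np.inf
    return sorted(picked)

# Toy distance profile with two clusters of low values.
distances = np.array([5.0, 0.2, 0.1, 0.3, 5.0, 5.0, 0.4, 0.15, 5.0, 5.0])

print(np.where(distances < 1.0)[0])                              # [1 2 3 6 7]
print(pick_occurrences(distances, threshold=1.0, exclusion=2))   # [2, 7]
```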
Motif Discovery
Identify recurring patterns:
# Find repeated behavioral patterns
motif_finder = StompMotif(window_size=100, k=5)
motifs = motif_finder.fit_predict(activity_data)
Time Series Retrieval
Find similar time series in a database:
# Build searchable index
ann = RandomProjectionIndexANN()
ann.fit(time_series_database)
# Query for similar series
neighbors, distances = ann.kneighbors(query_series, k=10)
Best Practices
- Window size: Critical parameter for subsequence methods
  - Too small: Captures noise
  - Too large: Misses fine-grained patterns
  - Rule of thumb: Match the expected pattern length; 10-20% of series length is a starting point when it is unknown
- Normalization: Most methods assume z-normalized subsequences
  - Removes amplitude and offset differences
  - Focuses the comparison on shape
- Distance metrics: Different metrics for different needs
  - Euclidean: Fast, shape-based
  - DTW: Handles temporal warping
  - Cosine: Scale-invariant
- Exclusion zone: For motif discovery, exclude trivial matches
  - Typically set to 0.5-1.0 × window_size
  - Prevents overlapping windows from being reported as separate occurrences
- Performance:
  - MASS computes a distance profile in O(n log n), versus O(n·m) for a brute-force scan with query length m
  - ANN trades accuracy for speed
  - GPU acceleration available for some methods
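The effect of normalization and metric choice is easy to see on a toy example. In the sketch below, two windows share the same shape but differ in scale and offset; the helper `znorm` and the data are assumptions for illustration.

```python
import numpy as np

def znorm(x):
    """Scale a window to zero mean and unit variance."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

t = np.linspace(0, 2 * np.pi, 50)
a = np.sin(t)
b = 3.0 * np.sin(t) + 10.0  # same shape, different amplitude and offset

raw = np.linalg.norm(a - b)                   # dominated by the offset
normed = np.linalg.norm(znorm(a) - znorm(b))  # close to zero: shapes agree

# Cosine similarity ignores scale, but not offset.
scaled = 3.0 * a
cosine = a @ scaled / (np.linalg.norm(a) * np.linalg.norm(scaled))

print(f"raw={raw:.2f}, z-normalized={normed:.2e}, cosine={cosine:.3f}")
```

Z-normalization removes both scale and offset, which is why it is the usual default when only the shape of a subsequence matters.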