snf.cv.snf_gridsearch

snf.cv.snf_gridsearch(*data, metric='sqeuclidean', mu=None, K=None, n_clusters=None, t=20, folds=3, n_perms=1000, normalize=True, seed=None)

Performs grid search for SNF hyperparameters mu, K, and n_clusters

Uses folds-fold cross-validation (CV) to subsample the data, then performs a grid search over the mu, K, and n_clusters hyperparameters for SNF on each subsample. The left-out sample in each CV fold is not used for testing; it is simply removed.
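The subsampling described above can be sketched with NumPy alone. This is a minimal illustration, not the library's implementation: the helper name fold_subsamples is hypothetical, and it assumes contiguous, unshuffled folds, which the text above does not specify.

```python
import numpy as np


def fold_subsamples(n_samples, folds=3):
    """Return, for each fold, the sample indices kept once that fold is removed.

    Hypothetical helper; assumes contiguous, unshuffled folds.
    """
    indices = np.arange(n_samples)
    # Split the indices into `folds` nearly equal held-out chunks
    held_out = np.array_split(indices, folds)
    # For each fold, keep everything except the held-out chunk
    return [np.setdiff1d(indices, out) for out in held_out]


kept = fold_subsamples(10, folds=3)
for i, idx in enumerate(kept):
    print(f"fold {i}: keeps {idx.size} of 10 samples")
```

Each subsample would then be searched over the full parameter grid on its own; the removed indices are never scored.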

Parameters:
  • *data ((N, M) array_like) – Raw data arrays, where N is samples and M is features.
  • metric (str or list-of-str, optional) – Distance metrics to compute on data. Must be one of the available metrics in scipy.spatial.distance.pdist. If a list is provided for data, a list of equal length may be supplied here. Default: ‘sqeuclidean’
  • mu (array_like, optional) – Array of mu values to search over. Default: np.arange(0.35, 1.05, 0.05)
  • K (array_like, optional) – Array of K values to search over. Default: np.arange(5, N // 2, 5)
  • n_clusters (array_like, optional) – Array of cluster numbers to search over. Default: np.arange(2, N // 20)
  • t (int, optional) – Number of iterations for SNF. Default: 20
  • folds (int, optional) – Number of folds to use for cross-validation. Default: 3
  • n_perms (int, optional) – Number of permutations for generating z-score of silhouette (affinity) to assess reliability of SNF clustering output. Default: 1000
  • normalize (bool, optional) – Whether to normalize (z-score) data arrays before constructing affinity matrices. Each feature is separately normalized. Default: True
  • seed (int, optional) – Random seed. Default: None
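The default grids for mu, K, and n_clusters depend on the number of samples N. The sketch below evaluates the default expressions listed above for two sample sizes; default_grids is a hypothetical helper written for illustration, and it takes the documented defaults literally rather than reproducing the library's code.

```python
import numpy as np


def default_grids(n_samples):
    """Build search grids from the documented default expressions (hypothetical helper)."""
    mu = np.arange(0.35, 1.05, 0.05)
    K = np.arange(5, n_samples // 2, 5)
    n_clusters = np.arange(2, n_samples // 20)
    return mu, K, n_clusters


# With N = 100 samples, K runs from 5 to 45 and n_clusters over 2, 3, 4
mu, K, n_clusters = default_grids(100)
print(K)
print(n_clusters)

# With N = 40 samples, N // 20 equals 2, so the default n_clusters grid is empty
_, _, small_n_clusters = default_grids(40)
print(small_n_clusters.size)
```

Taken at face value, these expressions give an empty n_clusters grid whenever N is below 60, so for small datasets it is probably safer to pass n_clusters explicitly.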
Returns:

  • grid_zaff ((F,) list of (S, K, C) np.ndarray) – One array per CV fold, where F is the number of folds and S, K, and C are the numbers of mu, K, and n_clusters values searched, respectively. The entries in the individual arrays correspond to the z-scored silhouette (affinity) for each parameter combination.
  • grid_labels ((F,) list of (S, K, C, N) np.ndarray) – One array per CV fold, with F, S, K, and C as for grid_zaff. The N entries along the last dimension correspond to the cluster labels for the given parameter combination.
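
One way to work with the returned values is to average the z-scored affinity across folds and look up the parameter combination with the highest mean. The sketch below uses synthetic random arrays with the documented shapes in place of real output, and the grid values are hypothetical. Averaging followed by an argmax is only one possible selection rule, not one prescribed by this function.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical search grids, i.e. the arrays that would be passed as mu, K, n_clusters
mu = np.array([0.4, 0.6, 0.8])
K = np.array([10, 20])
n_clusters = np.array([2, 3, 4])
folds, n_samples = 3, 30

# Synthetic stand-ins with the documented shapes (not real SNF output)
grid_shape = (mu.size, K.size, n_clusters.size)
grid_zaff = [rng.normal(size=grid_shape) for _ in range(folds)]
grid_labels = [rng.integers(0, 4, size=grid_shape + (n_samples,))
               for _ in range(folds)]

# Average the z-scored affinity over folds, giving one (S, K, C) array
mean_zaff = np.mean(grid_zaff, axis=0)

# Locate the parameter combination with the highest mean z-affinity
s, k, c = np.unravel_index(np.argmax(mean_zaff), mean_zaff.shape)
best = {"mu": mu[s], "K": K[k], "n_clusters": n_clusters[c]}
print(best)

# Cluster labels for that combination in the first fold
labels_fold0 = grid_labels[0][s, k, c]
print(labels_fold0.shape)
```

Because each fold is computed on a different subsample, the label arrays are indexed per fold here rather than combined across folds.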