Estimate the sigma-equivalent compactness of a dataset
Source: R/estimate_sigma_equivalent.R
This function calibrates the observed AMD peak of real data against synthetic datasets generated with varying levels of isotropic Gaussian noise (\(\sigma\)). For each candidate \(\sigma\), synthetic data are generated with the same number of samples, dimensionality, and number of clusters as the real data. The AMD peak of each synthetic dataset is computed, and the sigma-equivalent value is defined as the \(\sigma\) whose synthetic AMD peak best matches the real AMD peak (either by interpolation or nearest match).
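The two matching strategies described above can be sketched in base R. This is only an illustration under stated assumptions: the calibration table holds made-up values, `outside_grid` is a hypothetical name, and `approx()` is used as one plausible way to interpolate. The package's internal interpolation and extrapolation logic is not shown on this page and may differ.

```r
# Hypothetical calibration table: candidate sigmas and the AMD peak of each
# synthetic dataset (illustrative values, not package output).
table_sigma_amd <- data.frame(
  sigma    = c(1, 2, 4, 8, 16),
  amd_peak = c(0.90, 0.75, 0.55, 0.30, 0.10)
)
amd_real_peak <- 0.60

# Nearest match: the grid sigma whose synthetic peak is closest to the real one.
best_i   <- which.min(abs(table_sigma_amd$amd_peak - amd_real_peak))
sigma_eq <- table_sigma_amd$sigma[best_i]

# Interpolation: treat sigma as a function of the AMD peak and read it off at
# the real peak. With the default rule = 1, approx() returns NA when the real
# peak lies outside the range of synthetic peaks.
sigma_equivalent <- approx(
  x    = table_sigma_amd$amd_peak,
  y    = table_sigma_amd$sigma,
  xout = amd_real_peak
)$y

# Assumed flag: TRUE when the real peak is not bracketed by the grid.
outside_grid <- is.na(sigma_equivalent)
```

With these illustrative numbers the nearest match falls on a grid point, while the interpolated value lies between the two grid sigmas whose peaks bracket the real peak.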
Usage
estimate_sigma_equivalent(
real_data,
its,
nin,
nsp,
k_opt = NULL,
sigmas,
iter_max = 20,
make_plot = FALSE,
return_plot = TRUE,
quiet = TRUE,
open_device_each = FALSE,
device_width = 7,
device_height = 5,
cube_size = 100,
method = c("interpolate", "nearest"),
standardize = FALSE,
seed_base = 7,
plot_sigma_curves = FALSE
)
Arguments
- real_data
A numeric matrix or data frame of samples × features.
- its
Number of random initialisations per AMD computation.
- nin
Minimum number of clusters to evaluate.
- nsp
Maximum number of clusters to evaluate.
- k_opt
Optional; the optimal number of clusters for the real data. If NULL, it is estimated internally.
- sigmas
Numeric vector of candidate \(\sigma\) values to evaluate.
- iter_max
Maximum number of iterations for fuzzy c-means.
- make_plot
Logical; if TRUE, produce a comparative plot of \(\sigma\) vs synthetic AMD peaks.
- return_plot
Logical; if TRUE, return the comparative plot object.
- quiet
Logical; suppress console output from synthetic data generation.
- open_device_each
Logical; if TRUE, open a new graphics device for each sigma-curve plot (when plot_sigma_curves = TRUE).
- device_width, device_height
Size of graphics device for sigma-curve plots.
- cube_size
Side length of the hypercube used to place synthetic centroids.
- method
Method for estimating sigma-equivalent: "interpolate" or "nearest".
- standardize
Logical; if TRUE, standardise synthetic data.
- seed_base
Base seed for reproducibility.
- plot_sigma_curves
Logical; if TRUE, plot the AMD curve for each candidate \(\sigma\).
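To make the roles of `sigmas`, `cube_size`, and `seed_base` more concrete, the following is a minimal sketch of how one synthetic dataset could be built: centroids placed uniformly in a hypercube, plus isotropic Gaussian noise. `make_synthetic()` is a hypothetical helper written for this illustration, not a function of the package, and the package's actual generator (including how it derives seeds for each \(\sigma\)) may work differently.

```r
# Hypothetical helper, not part of the package: k Gaussian clusters in d
# dimensions, centroids drawn uniformly in a hypercube of side cube_size.
make_synthetic <- function(n, d, k, sigma, cube_size = 100, seed = 7) {
  set.seed(seed)  # note: this resets the global RNG state

  # One centroid per cluster, uniform in [0, cube_size]^d.
  centroids <- matrix(runif(k * d, min = 0, max = cube_size), nrow = k, ncol = d)

  # Assign samples to clusters as evenly as possible, in random order.
  labels <- sample(rep_len(seq_len(k), n))

  # Isotropic Gaussian noise: the same standard deviation in every dimension.
  noise <- matrix(rnorm(n * d, mean = 0, sd = sigma), nrow = n, ncol = d)

  list(
    data      = centroids[labels, , drop = FALSE] + noise,
    labels    = labels,
    centroids = centroids
  )
}

# Same shape as a real dataset with 150 samples, 3 features, and 3 clusters.
syn <- make_synthetic(n = 150, d = 3, k = 3, sigma = 5)
dim(syn$data)
```

Larger values of `sigma` relative to `cube_size` make the clusters overlap more, which is the property the calibration over candidate \(\sigma\) values relies on.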
Value
A list containing:
- amd_real_peak
AMD peak of the real dataset.
- k_opt
Optimal number of clusters for the real data.
- table_sigma_amd
Data frame of \(\sigma\) vs synthetic AMD peaks.
- sigma_equivalent
Interpolated sigma-equivalent value.
- sigma_eq
Nearest-match sigma on the explored grid.
- extrapolated
Logical; whether interpolation required extrapolation.
- plot_comparative
Comparative plot object (if requested).
- best_i
Index of best-matching sigma.
- best_sigma
Best-matching sigma value.
- best_res_syn
Full AMD results for the best synthetic dataset.
- best_df_curve
Data frame of the AMD curve for the best sigma.