Title: | Techniques for Evaluating Clustering |
---|---|
Description: | This package runs different clustering packages and compares their results to determine which algorithm behaves best for the data provided. See Martos, L.A.P., García-Vico, Á.M., González, P. et al. (2023) <doi:10.1007/s13748-022-00294-2> "Clustering: an R library to facilitate the analysis and comparison of cluster algorithms", Martos, L.A.P., García-Vico, Á.M., González, P. et al. "A Multiclustering Evolutionary Hyperrectangle-Based Algorithm" <doi:10.1007/s44196-023-00341-3> and Martos, L.A.P., García-Vico, Á.M., González, P. et al. "An Evolutionary Fuzzy System for Multiclustering in Data Streaming" <doi:10.1016/j.procs.2023.12.058>. |
Authors: | Luis Alfonso Perez Martos [aut, cre] |
Maintainer: | Luis Alfonso Perez Martos <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.7.10 |
Built: | 2025-02-17 04:31:48 UTC |
Source: | https://github.com/laperez/clustering |
Filter a clustering object, returning a new clustering object.
Generates a new filtered clustering object.
## S3 method for class 'clustering'
clustering[condition = TRUE]
clustering |
The clustering object. |
condition |
Expression to filter the clustering object. |
This function allows you to filter the data set by a given
evaluation metric. The evaluation metrics available are:
Algorithm, Distance, Clusters, Data, Var, Time, Entropy,
Variation_information, Precision, Recall, F_measure, Fowlkes_mallows_index,
Connectivity, Dunn, Silhouette and TimeAtt.
A clustering object filtered according to the input parameters.
result <- Clustering::clustering(df = Clustering::basketball, algorithm = 'clara', min = 3, max = 4, metrics = c('Precision', 'Recall'))
result[Precision > 0.14 & Recall > 0.11]
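A condition such as `Precision > 0.14 & Recall > 0.11` refers to column names rather than to variables in the workspace. The following base-R sketch shows one way an S3 `[` method can evaluate such an expression against a results table. The class `toy_clustering`, the constructor `new_toy` and the data are hypothetical and only illustrate the idea; this is not the package's implementation.

```r
# Hypothetical toy class; NOT the Clustering package's implementation.
new_toy <- function(df) structure(list(result = df), class = "toy_clustering")

# `[` method: capture the unevaluated condition and evaluate it
# with the columns of the stored results table in scope.
`[.toy_clustering` <- function(x, condition = TRUE) {
  expr <- substitute(condition)
  keep <- eval(expr, x$result, parent.frame())
  keep <- rep_len(as.logical(keep), nrow(x$result))
  # Rows where the condition is NA are dropped.
  new_toy(x$result[keep & !is.na(keep), , drop = FALSE])
}

# Made-up results table with some of the column names listed above.
toy <- new_toy(data.frame(
  Algorithm = c("clara", "clara", "gmm"),
  Clusters  = c(3, 4, 3),
  Precision = c(0.10, 0.20, 0.30),
  Recall    = c(0.15, 0.12, 0.05)
))

filtered <- toy[Precision > 0.14 & Recall > 0.11]
filtered$result
```

Only the second row satisfies both parts of the condition, so the filtered object keeps a single row and remains of the same class.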
Method that allows us to run the main algorithm through a graphical interface instead of the console.
appClustering()
This method generates a graphical user interface for running the clustering algorithm without having to know its parameters. It is simple to operate: we can change the values and quickly see the resulting behavior.
GUI with the parameters of the algorithm and their representation in tables and graphs.
This data set contains a series of statistics about basketball players:
data(basketball)
A data frame with 96 observations on 5 variables:
average number of assists per minute
height of the player
time played by the player
age of the player in years
average number of points per minute
KEEL, <http://www.keel.es/>
For each algorithm, finds the attributes that have the best external classification.
Method that looks for the external attributes that are best classified, using the Var column. In this way we discard the remaining attributes and work only with those that give the best response for the algorithm in question.
best_ranked_external_metrics(df)
df |
Matrix or data frame with the result of running the clustering algorithm. |
Returns a data.frame with the best classified external attributes.
result <- Clustering::clustering(df = cluster::agriculture, min = 4, max = 4, algorithm = 'clara', metrics = c("Recall"))
Clustering::best_ranked_external_metrics(df = result)
For each algorithm, finds the attributes that have the best internal classification.
Method that looks for those internal attributes that are better classified, making use of the Var column. In this way we discard the attributes and only work with those that give the best response to the algorithm in question.
best_ranked_internal_metrics(df)
df |
Matrix or data frame with the result of running the clustering algorithm. |
Returns a data.frame with the best classified internal attributes.
result <- Clustering::clustering(df = cluster::agriculture, min = 4, max = 5, algorithm = 'gmm', metrics = c("Recall"))
Clustering::best_ranked_internal_metrics(df = result)
A manufacturer of automotive accessories provides hardware, e.g. nuts, bolts, washers and screws, to fasten the accessory to the car or truck. Hardware is counted and packaged automatically. Specifically, bolts are dumped into a large metal dish. A plate that forms the bottom of the dish rotates counterclockwise. This rotation forces bolts to the outside of the dish and up along a narrow ledge. Due to the vibration of the dish caused by the spinning bottom plate, some bolts fall off the ledge and back into the dish. The ledge spirals up to a point where the bolts are allowed to drop into a pan on a conveyor belt. As a bolt drops, it passes by an electronic eye that counts it. When the electronic counter reaches the preset number of bolts, the rotation is stopped and the conveyor belt is moved forward.
data(bolts)
A data frame with 40 observations on 8 variables:
the order in which the data were collected
a speed setting that controls the speed of rotation of the plate at the bottom of the dish
total number of bolts (TOTAL) to be counted
a second speed setting that is used to change the speed of rotation (usually slowing it down) for the last few bolts
the number of bolts to be counted at this second speed
the sensitivity of the electronic eye
the measured response: the time, in seconds
in order to put times on an equal footing, the response to be analyzed is the time to count 20 bolts
There are several adjustments on the machine that affect its operation. These include: a speed setting that controls the speed of rotation (SPEED1Integer) of the plate at the bottom of the dish, a total number of bolts (TOTAL) to be counted, a second speed setting (SPEED2Integer) that is used to change the speed of rotation (usually slowing it down) for the last few bolts, the number of bolts to be counted at this second speed (NUMBER2Integer), and the sensitivity of the electronic eye (SENSInteger). The sensitivity setting is to ensure that the correct number of bolts is counted. Packaging too few bolts causes customer complaints; packaging too many increases costs. For each run conducted in this experiment the correct number of bolts was counted. From an engineering standpoint, if the correct number of bolts is counted, the sensitivity should not affect the time to count bolts. The measured response is the time (TIMEReal), in seconds, it takes to count the desired number of bolts. In order to put times on an equal footing, the response to be analyzed is the time to count 20 bolts (T20BOLTReal). Below are the data for 40 combinations of settings. RUNinteger is the order in which the data were collected.
KEEL, <http://www.keel.es/>
Discovering the behavior of attributes in a set of clustering packages based on evaluation metrics.
clustering( path = NULL, df = NULL, packages = NULL, algorithm = NULL, min = 3, max = 4, metrics = NULL )
path |
The path of the file. |
df |
data matrix or data frame, or dissimilarity matrix. |
packages |
character vector with the packages that run the algorithm. |
algorithm |
character vector with the algorithms implemented within the
package. |
min |
An integer with the minimum number of clusters. This value is
necessary to indicate the minimum number of clusters when grouping the data.
The default value is 3. |
max |
An integer with the maximum number of clusters. This value is
necessary to indicate the maximum number of clusters when grouping the data.
The default value is 4. |
metrics |
Character vector with the metrics implemented to evaluate the
distribution of the data in clusters. |
This algorithm evaluates how the attributes of a dataset or a set of datasets behave under different clustering algorithms. To do this, it is necessary to indicate the type of evaluation you want to make on the distribution of the data. To execute the algorithm it is necessary to indicate the number of clusters (min and max), the algorithms (algorithm) or packages (packages) that we want to cluster with, and the metrics (metrics).
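The role of min and max can be pictured as a loop over cluster counts, with one row of results recorded per run. The sketch below uses `stats::kmeans` and the built-in `iris` data purely as stand-ins, and the columns `Time` and `WithinSS` are illustrative choices; it is an assumption-laden illustration, not how the package runs its algorithms.

```r
# Illustrative stand-in: one k-means run per cluster count in min_k:max_k.
set.seed(1)
x <- scale(iris[, 1:4])   # numeric attributes only, standardized
min_k <- 3
max_k <- 4

runs <- do.call(rbind, lapply(min_k:max_k, function(k) {
  # Time the run and keep the fitted model for later inspection.
  elapsed <- system.time(fit <- kmeans(x, centers = k, nstart = 5))[["elapsed"]]
  data.frame(
    Algorithm = "kmeans",
    Clusters  = k,
    Time      = elapsed,
    WithinSS  = fit$tot.withinss   # total within-cluster sum of squares
  )
}))

runs
```

Each row corresponds to one cluster count, which is the shape of table that the filtering, sorting and grouping functions in this package operate on.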
A matrix with the result of running all the metrics of the algorithms contained in the packages indicated. We also obtain information with the types of metrics, algorithms and packages executed.
result |
A list with the algorithms, metrics and variables defined in the execution of the algorithm. |
has_internal_metrics |
Boolean field indicating whether there are internal metrics such as Dunn, silhouette and connectivity. |
has_external_metrics |
Boolean field indicating whether there are external metrics such as precision, recall, F-measure, entropy, variation of information and Fowlkes-Mallows. |
algorithms_execute |
Character vector with the algorithms executed, as given in the parameters. |
measures_execute |
Character vector with the measures executed, as given in the parameters. |
Clustering::clustering( df = cluster::agriculture, min = 3, max = 3, algorithm='clara', metrics=c('Precision') )
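External metrics such as precision and recall compare a clustering against known class labels. As a minimal sketch, the base-R function below uses one common pair-counting definition: a pair of observations is a true positive when both share a cluster and a class. The function name `pair_precision_recall` and the data are hypothetical, and the package may compute these metrics differently.

```r
# Pair-counting precision and recall (one common definition;
# an assumption here, not necessarily the package's formula).
pair_precision_recall <- function(clusters, classes) {
  idx <- utils::combn(length(clusters), 2)   # all unordered pairs
  same_cluster <- clusters[idx[1, ]] == clusters[idx[2, ]]
  same_class   <- classes[idx[1, ]] == classes[idx[2, ]]
  tp <- sum(same_cluster & same_class)    # together in both
  fp <- sum(same_cluster & !same_class)   # together only in the clustering
  fn <- sum(!same_cluster & same_class)   # together only in the labels
  c(precision = tp / (tp + fp), recall = tp / (tp + fn))
}

# Made-up assignment of six observations.
clusters <- c(1, 1, 1, 2, 2, 2)
classes  <- c("a", "a", "b", "b", "b", "b")
scores <- pair_precision_recall(clusters, classes)
scores
```

For this toy input there are 4 true-positive pairs, 2 false positives and 3 false negatives, giving a precision of 4/6 and a recall of 4/7.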
Method to convert columns to ordinal.
convert_toOrdinal(df)
df |
data frame with the results. |
The data frame converted to ordinal.
Method that calculates which algorithm and which metric behaves best for the datasets provided.
evaluate_best_validation_external_by_metrics(df, metric)
df |
Data matrix or data frame with the result of running the clustering algorithm. |
metric |
String with the metric. |
This method groups the data by algorithm and distance measure, instead of obtaining the best attribute from the data set.
A data.frame with the algorithms classified by measures of dissimilarity.
result <- Clustering::clustering(df = cluster::agriculture, min = 4, max = 5, algorithm = 'kmeans_rcpp', metrics = c("F_measure"))
Clustering::evaluate_best_validation_external_by_metrics(result, 'F_measure')
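Grouping by algorithm and distance measure can be pictured with base R's `aggregate`. In the sketch below the results table and its values are made up, and taking the maximum per group is an assumption about what "best" means for F_measure; it is not the package's own code.

```r
# Made-up results: several runs per algorithm and distance measure.
results <- data.frame(
  Algorithm = c("kmeans_rcpp", "kmeans_rcpp", "kmeans_rcpp", "gmm", "gmm", "gmm"),
  Distance  = c("euclidean", "euclidean", "manhattan",
                "euclidean", "manhattan", "manhattan"),
  Clusters  = c(4, 5, 4, 4, 4, 5),
  F_measure = c(0.40, 0.55, 0.35, 0.62, 0.48, 0.51)
)

# Best (here: highest) F_measure for each Algorithm/Distance pair.
best <- aggregate(F_measure ~ Algorithm + Distance, data = results, FUN = max)
best <- best[order(best$Algorithm, best$Distance), ]
best
```

Dropping `Distance` from the formula would group by algorithm alone, regardless of the dissimilarity measure used.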
Method that calculates which algorithm and which metric behaves best for the datasets provided.
evaluate_best_validation_internal_by_metrics(df, metric)
df |
Data matrix or data frame with the result of running the clustering algorithm. |
metric |
It's a string with the metric to evaluate. |
This method groups the data by algorithm and distance measure, instead of obtaining the best attribute from the data set.
A data.frame with the algorithms classified by measures of dissimilarity.
result <- Clustering::clustering(df = cluster::agriculture, min = 4, max = 5, algorithm = 'gmm', metrics = c("Precision", "Connectivity"))
Clustering::evaluate_best_validation_internal_by_metrics(result, "Connectivity")
Method that calculates which algorithm behaves best for the datasets provided.
evaluate_validation_external_by_metrics(df)
df |
data matrix or data frame with the result of running the clustering algorithm. |
It groups the results of the execution by algorithms.
A data.frame with all the algorithms that obtain the best results regardless of the dissimilarity measure used.
result <- Clustering::clustering(df = cluster::agriculture, min = 4, max = 4, algorithm = 'kmeans_arma', metrics = c("Precision"))
Clustering::evaluate_validation_external_by_metrics(result)
Method that calculates which algorithm behaves best for the datasets provided.
evaluate_validation_internal_by_metrics(df)
df |
data matrix or data frame with the result of running the clustering algorithm. |
It groups the results of the execution by algorithms.
A data.frame with all the algorithms that obtain the best results regardless of the dissimilarity measure used.
result <- Clustering::clustering(df = cluster::agriculture, min = 4, max = 5, algorithm = 'kmeans_rcpp', metrics = c("Recall", "Silhouette"))
Clustering::evaluate_validation_internal_by_metrics(result)
Clustering::evaluate_validation_internal_by_metrics(result$result)
Method that exports the results of the external measurements to a file in LaTeX format.
export_file_external(df, path = NULL)
df |
Data frame with the results of the external validations, to be exported as a table in LaTeX format. |
path |
String with the path to a directory where the file in LaTeX format is to be stored. |
When working in LaTeX and a table of results is needed, this method exports the results of the clustering algorithm to LaTeX.
A file in LaTeX format with the results of the external metrics.
result <- Clustering::clustering(df = cluster::agriculture, min = 4, max = 5, algorithm = 'gmm', metrics = c("Precision"))
Clustering::export_file_external(result)
file.remove("external_data.tex")
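To give an idea of what exporting a results table to LaTeX involves, here is a minimal base-R sketch that writes a data frame as a `tabular` environment. The helper `write_latex_table`, the file name `toy_external.tex` and the data are hypothetical; the layout that the package actually writes may differ.

```r
# Hypothetical helper: write a data frame as a LaTeX tabular.
# Special characters such as "_" or "&" are NOT escaped in this sketch.
write_latex_table <- function(df, path) {
  cols <- unname(lapply(df, as.character))
  body <- do.call(paste, c(cols, sep = " & "))   # one string per row
  lines <- c(
    paste0("\\begin{tabular}{", strrep("l", ncol(df)), "}"),
    paste0(paste(names(df), collapse = " & "), " \\\\"),
    "\\hline",
    paste0(body, " \\\\"),
    "\\end{tabular}"
  )
  writeLines(lines, path)
  invisible(path)
}

# Made-up results, written to a temporary directory.
results <- data.frame(Algorithm = c("clara", "gmm"), Precision = c(0.25, 0.5))
out <- file.path(tempdir(), "toy_external.tex")
write_latex_table(results, out)
tex <- readLines(out)
tex
```

Writing to `tempdir()` keeps the working directory clean, which is the same concern that the `file.remove()` call in the example above addresses.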
Method that exports the results of the internal measurements to a file in LaTeX format.
export_file_internal(df, path = NULL)
df |
Data frame with the results of the internal validations, to be exported as a table in LaTeX format. |
path |
String with the path to a directory where the file in LaTeX format is to be stored. |
When working in LaTeX and a table of results is needed, this method exports the results of the clustering algorithm to LaTeX.
A file in LaTeX format with the results of the internal metrics.
result <- Clustering::clustering(df = cluster::agriculture, min = 4, max = 5, algorithm = 'gmm', metrics = c("Recall", "Dunn"))
Clustering::export_file_internal(result)
file.remove("internal_data.tex")
Graphical representation of the evaluation measures grouped by cluster.
plot_clustering(df, metric)
df |
data matrix or data frame with the result of running the clustering algorithm. |
metric |
String with the name of the metric selected for evaluation. |
In some cases the data must be reviewed or filtered before it can be selected, and graphical representations make this task much easier. This method lets us filter the data by metric and view it graphically.
Generates an image with the distribution of the clusters by metric.
result <- Clustering::clustering(df = cluster::agriculture, min = 4, max = 5, algorithm = 'gmm', metrics = c("Precision"))
Clustering::plot_clustering(result, c("Precision"))
It is used for obtaining the results of an algorithm indicated as a parameter grouped by number of clusters.
result_external_algorithm_by_metric(df, metric)
df |
data matrix or data frame with the result of running the clustering algorithm. |
metric |
It's a string with the metric to evaluate. |
A data.frame with the results of the algorithm indicated as parameter.
result <- Clustering::clustering(df = cluster::agriculture, min = 4, max = 5, algorithm = 'gmm', metrics = c("Precision"))
Clustering::result_external_algorithm_by_metric(result, 'Precision')
It is used for obtaining the results of an algorithm indicated as a parameter grouped by number of clusters.
result_internal_algorithm_by_metric(df, metric)
df |
data matrix or data frame with the result of running the clustering algorithm. |
metric |
It's a string with the metric to evaluate. |
A data.frame with the results of the algorithm indicated as parameter.
result <- Clustering::clustering(df = cluster::agriculture, min = 4, max = 5, algorithm = 'gmm', metrics = c("Recall", "Silhouette"))
Clustering::result_internal_algorithm_by_metric(result, 'Silhouette')
This function receives a clustering object and sorts the columns by parameter. By default it performs sorting by the algorithm field.
## S3 method for class 'clustering'
sort(x, decreasing = TRUE, ...)
x |
A clustering object. |
decreasing |
A logical indicating if the sort should be increasing or decreasing. By default, decreasing. |
... |
Additional parameters such as "by", a string with the name of the
evaluation measure to order by. |
The additional argument in "..." is the 'by' argument, which is a
string with the name of the evaluation measure to order by. Valid values are:
Algorithm, Distance, Clusters, Data, Var, Time, Entropy,
Variation_information, Precision, Recall, F_measure, Fowlkes_mallows_index,
Connectivity, Dunn, Silhouette and TimeAtt.
Another clustering object with the evaluation measures sorted.
result <- Clustering::clustering(df = cluster::agriculture, min = 4, max = 4, algorithm = 'gmm', metrics = 'Recall')
sort(result, FALSE, 'Recall')
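The effect of sorting by an evaluation measure can be sketched in base R with `order`. The function `sort_results` and the data below are hypothetical stand-ins that mirror the `decreasing` and `by` arguments described above; this is not the package's method.

```r
# Hypothetical stand-in: order the rows of a results table by one column.
sort_results <- function(df, decreasing = TRUE, by = "Algorithm") {
  stopifnot(by %in% names(df))
  df[order(df[[by]], decreasing = decreasing), , drop = FALSE]
}

# Made-up results table.
results <- data.frame(
  Algorithm = c("gmm", "gmm", "gmm"),
  Var       = c(1, 2, 3),
  Recall    = c(0.3, 0.1, 0.2)
)

ascending <- sort_results(results, decreasing = FALSE, by = "Recall")
ascending
```

With `decreasing = FALSE` the rows come back ordered from the lowest to the highest Recall, so the row with `Var` equal to 2 comes first.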
The data provided are daily stock prices from January 1988 through October 1991, for ten aerospace companies.
data(stock)
A data frame with 950 observations on 10 variables:
company1 details
company2 details
company3 details
company4 details
company5 details
company6 details
company7 details
company8 details
company9 details
company10 details
KEEL, <http://www.keel.es/>
The study was performed at the 2nd Department of Medicine, 1st Faculty of Medicine of Charles University and Charles University Hospital. The data were transferred to electronic form by the European Centre of Medical Informatics, Statistics and Epidemiology of Charles University and Academy of Sciences.
data(stulong)
A data frame with 1417 observations on 5 variables.
Height
Weight
Blood pressure I systolic (mm Hg)
Blood pressure I diastolic (mm Hg)
Percentage cholesterol in mg
KEEL, <http://www.keel.es/>
Method for filtering external columns of a dataset.
transform_dataset(df)
df |
Data frame with clustering results. |
Data frame filtered with the columns of the external measurements.
Exists internal measure
Method for filtering internal columns of a dataset.
transform_dataset_internal(df)
df |
data frame with clustering results. |
Data frame filtered with the columns of the internal measurements.
Exists internal measure
One of the best-known test data sets in machine learning. This data set describes several situations in which the weather is or is not suitable for playing sports, depending on the current outlook, temperature, humidity and wind.
data(weather)
A data frame with 14 observations on 5 variables:
sunny, overcast, rainy
hot, mild, cool
high, normal
true, false
yes, no
KEEL, <http://www.keel.es/>