Data science and big data processing in R: representations and software

Septem Riza, Lala

Supervisors:

Francisco Herrera Triguero (Supervisor)
José Manuel Benítez Sánchez (Supervisor)

University of defense: Universidad de Granada

Date of defense: 17 July 2015

Committee:

Antonio González Muñoz (Chair)
Manuel Gómez Olmedo (Secretary)
Matías Gámez Martínez (Member)
Luciano Sánchez Ramos (Member)
Antonio Peregrín Rubio (Member)

Type: Thesis

Teseo: 388363

Abstract

The main objective of this thesis is the development of high-quality, easy-to-use software modules for representing, creating, and managing system models and for data analysis. Since it has become a de facto standard, R is the platform of choice. The packages developed implement techniques based on fuzzy systems, rough sets, and fuzzy rough sets. In addition, a universal representation framework for fuzzy rule-based systems is introduced. Finally, the implementation of random forests and random ferns for tackling Big Data is discussed. In line with these objectives, the research produced the following results:

1. The "frbs" package: an R package implementing the most relevant types of fuzzy rule-based systems, along with a selection of machine-learning algorithms to build them. The package focuses on classification and regression tasks. It also includes a mechanism that lets human experts construct a model. It is available on CRAN: http://cran.r-project.org/package=frbs and on the project website: http://sci2s.ugr.es/dicits/software/FRBS.
2. The "RoughSets" package: an R package implementing algorithms based on rough set theory and fuzzy rough set theory for knowledge representation and data analysis. It includes tools for managing missing values, discretization, feature selection, and instance selection, for both classification and regression tasks. It is available on CRAN: http://cran.r-project.org/package=RoughSets and on the project website: http://sci2s.ugr.es/dicits/software/RoughSets.
3. frbsPMML: a universal representation framework for fuzzy rule-based systems, based on the Predictive Model Markup Language. Two software libraries for managing this representation are also implemented: an extension of the "frbs" package and the Java package "frbsJpmml".
4. The "SparkFernTreeR" package: an R package implementing random forests and random ferns for Big Data processing. It is built on top of the Big Data frameworks Apache Hadoop and Apache Spark.