Design and architecture
========================

The package design can be found in the figure below.

.. image:: ../../img/package.png

Although the library contains several folders, the only ones that matter to users are the ``types`` folder and the ``methods`` folder. The other folders contain auxiliary functions used by some of the methods. The data types are especially worth exploring, as they are essential to understanding how the package works, along with the common interface of the methods.

Data types
------------

The package provides types for annotation datasets and ground truth datasets, as they usually follow the same structure. These types are used by all the methods, so you need to convert your annotation dataset to the format accepted by the algorithm.

The package supports three types of annotations, each provided as a Scala case class, which makes it possible to detect errors at compile time when using the algorithms:

* ``BinaryAnnotation``: a Dataset of this type provides three columns:

  * The ``example`` column (i.e., the example for which the annotation is made).
  * The ``annotator`` column (representing the annotator that made the annotation).
  * The ``value`` column (the value of the annotation, which can be either 0 or 1).

* ``MulticlassAnnotation``: The difference from ``BinaryAnnotation`` is that the ``value`` column can take more than two values, in the range from 0 to the total number of values.
* ``RealAnnotation``: In this case, the ``value`` column can take any numeric value.

You can easily convert an annotation dataframe with columns ``example``, ``annotator`` and ``value`` to a typed dataset with the following instruction:

.. code-block:: scala

   // Needs an implicit Encoder in scope, typically via the Spark session implicits
   val typedData = untypedData.as[RealAnnotation]

For labels, the package provides five types, two of which are probabilistic. The three non-probabilistic types are:

* ``BinaryLabel``. A dataset with two columns: ``example`` and ``value``. The column ``value`` is a binary number (0 or 1).
* ``MulticlassLabel``. A dataset with the same structure as the previous one, but where the column ``value`` can take more than two values, as in ``MulticlassAnnotation``.
* ``RealLabel``. In this case, the column ``value`` can take any numeric value.

The probabilistic types are used by some algorithms to provide more information about the confidence of each class value for a specific example.

* ``BinarySoftLabel``. A dataset with two columns: ``example`` and ``prob``. The column ``prob`` represents the probability of the example being positive.
* ``MultiSoftLabel``. A dataset with three columns: ``example``, ``class`` and ``prob``. This last column represents the probability of the example taking the class in the column ``class``.

Methods
---------

All implemented methods are in the ``methods`` subpackage and are mostly independent of each other. The ``MajorityVoting`` algorithms are the only exception, as most of the other methods use them in the initialization step. Apart from that, each algorithm is implemented in its own file, which makes it easier to extend the package with new algorithms.

Although independent, all algorithms have a similar interface, which facilitates their use. To execute an algorithm, the user normally calls the ``apply`` method of the model (which in Scala is equivalent to applying the object itself), as shown below:

.. code-block:: scala

   ...
   val model = IBCC(annotations)
   ...

When the algorithm finishes, it returns a model object containing the ground truth estimates and any other estimates that depend on the chosen algorithm. The only algorithm that does not follow this pattern is ``MajorityVoting``, which has methods for each of the class types and also for obtaining probabilistic labels. See the `API Docs `_ for details.
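
To make the column layout of the binary types more concrete, here is a minimal sketch in plain Scala collections, without Spark. It is not the package's implementation: the case classes are local stand-ins that only mirror the documented column names, the field types (``Long``, ``Int``, ``Double``) are assumptions, and ``MajorityVotingSketch`` with its two functions is a hypothetical helper written for this illustration. Counting ties as positive is also an assumption.

.. code-block:: scala

   // Local stand-ins mirroring the documented columns (field types are assumptions)
   final case class BinaryAnnotation(example: Long, annotator: Long, value: Int)
   final case class BinaryLabel(example: Long, value: Int)
   final case class BinarySoftLabel(example: Long, prob: Double)

   // Hypothetical helper, not part of the package
   object MajorityVotingSketch {

     // Probability of the positive class: fraction of annotators that answered 1
     def softLabels(annotations: Seq[BinaryAnnotation]): Seq[BinarySoftLabel] =
       annotations
         .groupBy(_.example)
         .map { case (example, group) =>
           BinarySoftLabel(example, group.count(_.value == 1).toDouble / group.size)
         }
         .toSeq
         .sortBy(_.example)

     // Hard label: 1 when at least half of the annotators answered 1 (tie rule assumed)
     def hardLabels(annotations: Seq[BinaryAnnotation]): Seq[BinaryLabel] =
       softLabels(annotations).map(s => BinaryLabel(s.example, if (s.prob >= 0.5) 1 else 0))

     // Two examples: three annotators on example 0, two on example 1
     val sample: Seq[BinaryAnnotation] = Seq(
       BinaryAnnotation(0L, 0L, 1),
       BinaryAnnotation(0L, 1L, 1),
       BinaryAnnotation(0L, 2L, 0),
       BinaryAnnotation(1L, 0L, 0),
       BinaryAnnotation(1L, 1L, 0)
     )
   }

Calling ``MajorityVotingSketch.hardLabels(MajorityVotingSketch.sample)`` yields one ``BinaryLabel`` per example, while ``softLabels`` yields the matching ``BinarySoftLabel`` rows. This shows how annotation rows, which carry an ``annotator`` column, collapse into label rows, which do not.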