Examples¶
On this page you can find examples for several of the algorithms in the library. You can find the data used in the examples in the GitHub repository.
MajorityVoting¶
The example below shows how to use the MajorityVoting algorithm for estimating the ground truth for a binary target variable.
import com.enriquegrodrigo.spark.crowd.methods.MajorityVoting
import com.enriquegrodrigo.spark.crowd.types.BinaryAnnotation
val exampleFile = "data/binary-ann.parquet"
val exampleDataBinary = spark.read.parquet(exampleFile).as[BinaryAnnotation]
val muBinary = MajorityVoting.transformBinary(exampleDataBinary)
muBinary.show()
The method returns a result similar to this one:
+-------+-----+
|example|value|
+-------+-----+
| 26| 0|
| 29| 1|
| 474| 0|
| 964| 1|
| 65| 0|
| 191| 0|
| 418| 1|
....
MajorityVoting algorithms assume that all annotators are equally accurate, so they choose the most frequent annotation as the ground truth label. Because of this, they return only the ground truth estimate, with no measure of annotator quality.
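The idea behind majority voting can be sketched in a few lines of plain Scala (the library itself operates on Spark Datasets; the case class and function names here are illustrative only):

```scala
// Minimal sketch of majority voting: for each example, pick the most
// frequent annotation value among its annotators.
case class Annotation(example: Int, annotator: Int, value: Int)

def majorityVote(anns: Seq[Annotation]): Map[Int, Int] =
  anns.groupBy(_.example).map { case (ex, group) =>
    // most frequent value for this example
    val winner = group.groupBy(_.value).maxBy(_._2.size)._1
    ex -> winner
  }

val anns = Seq(
  Annotation(0, 0, 1), Annotation(0, 1, 0), Annotation(0, 2, 1),
  Annotation(1, 0, 0), Annotation(1, 1, 0), Annotation(1, 2, 1)
)
majorityVote(anns) // Map(0 -> 1, 1 -> 0)
```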
The data file in this example follows the format of the BinaryAnnotation type:
example, annotator, value
0, 0, 1
0, 1, 0
0, 2, 1
...
In this example, we use a .parquet data file, which is usually a good option in terms of efficiency. However, we do not limit the types of files you can use, as long as they can be converted to typed datasets of BinaryAnnotation, MulticlassAnnotation or RealAnnotation.
Note, however, that the algorithms assume there are no missing examples or annotators. Specifically, MajorityVoting can make predictions both for discrete classes (BinaryAnnotation and MulticlassAnnotation) and for continuous-valued target variables (RealAnnotation). You can find information about these methods in the API Docs.
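For continuous-valued targets, the natural majority-voting analogue is to average the annotations for each example. As a plain-Scala illustration (not the library's implementation):

```scala
// Continuous analogue of majority voting: average the annotations
// received by each example.
case class RealAnn(example: Int, annotator: Int, value: Double)

def meanVote(anns: Seq[RealAnn]): Map[Int, Double] =
  anns.groupBy(_.example).map { case (ex, g) =>
    ex -> g.map(_.value).sum / g.size
  }

val anns = Seq(RealAnn(0, 0, 1.0), RealAnn(0, 1, 3.0), RealAnn(1, 0, 2.0))
meanVote(anns) // Map(0 -> 2.0, 1 -> 2.0)
```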
DawidSkene¶
This algorithm is one of the most recommended, both for its simplicity and for the good results it generally obtains.
import com.enriquegrodrigo.spark.crowd.methods.DawidSkene
import com.enriquegrodrigo.spark.crowd.types.MulticlassAnnotation
val exampleFile = "examples/data/multi-ann.parquet"
val exampleData = spark.read.parquet(exampleFile).as[MulticlassAnnotation]
val mode = DawidSkene(exampleData, eMIters=10, eMThreshold=0.001)
val pred = mode.getMu().as[MulticlassLabel]
val annprec = mode.getAnnotatorPrecision()
In the implementation, two parameters control the algorithm's execution: the maximum number of EM iterations and the threshold for the change in likelihood. The execution stops when the number of iterations reaches the maximum or when the change in likelihood falls below the threshold. You do not need to provide these parameters, as they have default values.
Once executed, the model provides an estimation of the ground truth and an estimation of the quality of each annotator, in the form of a confusion matrix. This information can be obtained as shown in the example.
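The stopping rule described above can be sketched in plain Scala (the function and the toy likelihood step are illustrative, not the library's internal EM loop):

```scala
// Sketch of the two stopping criteria: the loop ends when either the
// iteration cap is reached or the likelihood improvement drops below
// the threshold.
def emLoop(maxIters: Int, threshold: Double, step: Double => Double): Int = {
  var iters = 0
  var logLik = Double.NegativeInfinity
  var improvement = Double.PositiveInfinity
  while (iters < maxIters && improvement >= threshold) {
    val newLogLik = step(logLik)     // one E-step + M-step would go here
    improvement = newLogLik - logLik
    logLik = newLogLik
    iters += 1
  }
  iters
}

// Converges after the second iteration: the improvement (0.0001) falls
// below the threshold (0.001).
emLoop(10, 0.001, ll => if (ll.isNegInfinity) -10.0 else ll + 0.0001)
```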
GLAD¶
The GLAD algorithm is interesting as it provides both annotator accuracies and example difficulties obtained solely from the annotations. An example of how to use it can be found below.
import com.enriquegrodrigo.spark.crowd.methods.Glad
import com.enriquegrodrigo.spark.crowd.types.BinaryAnnotation
val annFile = "data/binary-ann.parquet"
val annData = spark.read.parquet(annFile).as[BinaryAnnotation]
val mode = Glad(annData,
eMIters=5, //Maximum number of iterations of EM algorithm
eMThreshold=0.1, //Threshold for likelihood changes
gradIters=30, //Gradient descent max number of iterations
gradThreshold=0.5, //Gradient descent threshold
gradLearningRate=0.01, //Gradient descent learning rate
alphaPrior=1, //Alpha first value (GLAD specific)
betaPrior=1) //Beta first value (GLAD specific)
val pred = mode.getMu().as[BinarySoftLabel]
val annprec = mode.getAnnotatorPrecision()
val diffic = mode.getInstanceDifficulty()
This model, as implemented in the library, is only compatible with binary class problems. It has a higher number of free parameters than the previous algorithm, but we provide default values for all of them for convenience. The meaning of each parameter is commented in the example above, as well as in the API Docs. The annotator precision is given as a vector, with an entry for each annotator. The difficulty is given in the form of a DataFrame, with a difficulty value for each example. For more information, you can consult the documentation and/or the paper.
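In the GLAD paper, the probability that an annotator labels an example correctly is modeled as a sigmoid of the product of the annotator ability (alpha) and the inverse example difficulty (beta). A quick numeric sketch of that relationship (illustrative, not the library's internals):

```scala
// GLAD-style correctness probability: sigmoid(alpha_j * beta_i), where
// alpha_j is annotator ability and beta_i is inverse example difficulty.
def sigmoid(x: Double): Double = 1.0 / (1.0 + math.exp(-x))

def pCorrect(alpha: Double, beta: Double): Double = sigmoid(alpha * beta)

pCorrect(2.0, 3.0) // skilled annotator, easy example -> close to 1
pCorrect(2.0, 0.0) // infinitely hard example -> 0.5, a coin flip
```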
RaykarBinary, RaykarMulti and RaykarCont¶
We implement the three variants of this algorithm: two for discrete target variables (RaykarBinary and RaykarMulti) and one for continuous target variables (RaykarCont).
These algorithms have in common that they can use features to estimate the ground truth and even learn a linear model. They can also use prior information about the annotators, which is useful for placing more confidence in certain annotators. The next example shows how to use these priors to indicate that the first annotator is highly trusted and that the second annotator is not reliable.
import com.enriquegrodrigo.spark.crowd.methods.RaykarBinary
import com.enriquegrodrigo.spark.crowd.types.BinaryAnnotation
val exampleFile = "data/binary-data.parquet"
val annFile = "data/binary-ann.parquet"
val exampleData = spark.read.parquet(exampleFile)
val annData = spark.read.parquet(annFile).as[BinaryAnnotation]
//Preparing priors
val nAnn = annData.map(_.annotator).distinct.count().toInt
val a = Array.fill[Double](nAnn,2)(2.0) //Uniform prior
val b = Array.fill[Double](nAnn,2)(2.0) //Uniform prior
//Give first annotator more confidence
a(0)(0) += 1000
b(0)(0) += 1000
//Give second annotator less confidence
a(1)(1) += 1000
b(1)(1) += 1000
//Applying the learning algorithm
val mode = RaykarBinary(exampleData, annData,
eMIters=5,
eMThreshold=0.001,
gradIters=100,
gradThreshold=0.1,
gradLearning=0.1,
a_prior=Some(a), b_prior=Some(b))
//Get BinarySoftLabel with the class predictions
val pred = mode.getMu().as[BinarySoftLabel]
//Annotator precision matrices
val annprec = mode.getAnnotatorPrecision()
Apart from the features matrix and the priors, the meaning of the parameters is the same as in the previous examples. The priors are matrices of dimension A by 2, where A is the number of annotators; each row holds the hyperparameters of a Beta distribution for one annotator. The a_prior gives prior information about the ability of the annotators to correctly classify a positive example, and the b_prior does the same for negative examples. More information about this method, as well as about the variants for discrete and continuous target variables, can be found in the API Docs.
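The effect of the large pseudo-counts in the example can be seen from the mean of a Beta distribution, a / (a + b): loading one hyperparameter with pseudo-counts pulls the prior belief about an annotator's reliability towards 1 or towards 0. A plain-Scala illustration (not the library's internal update):

```scala
// Mean of a Beta(a, b) distribution: a / (a + b). Adding 1000
// pseudo-counts to one side makes the prior strongly informative.
def betaMean(a: Double, b: Double): Double = a / (a + b)

betaMean(2.0, 2.0)    // weak symmetric prior -> 0.5
betaMean(1002.0, 2.0) // trusted annotator -> close to 1
betaMean(2.0, 1002.0) // unreliable annotator -> close to 0
```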
CATD¶
This method estimates continuous-valued target variables from annotations.
import com.enriquegrodrigo.spark.crowd.methods.CATD
import com.enriquegrodrigo.spark.crowd.types.RealAnnotation
sc.setCheckpointDir("checkpoint")
val annFile = "examples/data/cont-ann.parquet"
val annData = spark.read.parquet(annFile).as[RealAnnotation]
//Applying the learning algorithm
val mode = CATD(annData, iterations=5,
threshold=0.1,
alpha=0.05)
//Get the ground truth estimation
val pred = mode.mu
//Annotator weights
val annprec = mode.weights
It returns a model from which you can obtain the ground truth estimation as well as the weight assigned to each annotator (a higher weight means a better annotator). The algorithm uses the iterations and threshold parameters to control the execution, and also alpha, which is a parameter of the model (check the API Docs for more information).
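The role of the annotator weights can be sketched as a weighted average of the annotations for an example, with higher-weight annotators counting more (this is an illustration only; CATD derives the weights from chi-squared confidence intervals, see the paper):

```scala
// Weighted estimate of a continuous ground truth from per-annotator
// values and weights.
def weightedEstimate(values: Seq[Double], weights: Seq[Double]): Double = {
  require(values.size == weights.size)
  values.zip(weights).map { case (v, w) => v * w }.sum / weights.sum
}

// The annotator with weight 3.0 pulls the estimate towards its value.
weightedEstimate(Seq(1.0, 5.0), Seq(3.0, 1.0)) // 2.0
```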