Quick Start

You can easily start using spark-crowd through our docker image or through spark-packages. See Installation for all the installation alternatives (such as how to add the package as a dependency in your project).
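For instance, if your project is built with sbt, adding the dependency would look roughly like the sketch below. The coordinates are taken from the spark-packages commands later in this guide; the resolver URL is an assumption, so check Installation for the authoritative instructions.

// build.sbt sketch (the resolver URL is an assumption, not taken from this guide)
resolvers += "Spark Packages Repo" at "https://repos.spark-packages.org/"
libraryDependencies += "com.enriquegrodrigo" % "spark-crowd_2.11" % "0.2.1"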

Start with our docker image

The quickest way to try out the package is through the provided docker image, which ships with the latest version of the package; you do not need to install any other software apart from docker itself.

docker pull enriquegrodrigo/spark-crowd

With this image, you can run the examples provided along with the package. For example, to run DawidSkeneExample.scala, use:

docker run --rm -it -v $(pwd)/:/home/work/project enriquegrodrigo/spark-crowd DawidSkeneExample.scala

You can also open a spark-shell with the library preloaded:

docker run --rm -it -v $(pwd)/:/home/work/project enriquegrodrigo/spark-crowd

This way, you can test your code directly. You will not benefit from the distributed execution of Apache Spark, but you can still use the algorithms with medium-sized datasets, since docker can use several cores on your machine.
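For instance, as a quick smoke test in the preloaded shell, you could load the example annotations file used later in this guide (a sketch; it assumes the examples folder is reachable from the container's working directory):

import com.enriquegrodrigo.spark.crowd.types.MulticlassAnnotation

// Read the bundled annotations and peek at a few rows
val ann = spark.read.parquet("examples/data/multi-ann.parquet").as[MulticlassAnnotation]
ann.show(5)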

Start with spark-packages

If you have an installation of Apache Spark, you can open a spark-shell with our package preloaded using:

spark-shell --packages com.enriquegrodrigo:spark-crowd_2.11:0.2.1

Likewise, you can submit an application that uses spark-crowd to your cluster with:

spark-submit --packages com.enriquegrodrigo:spark-crowd_2.11:0.2.1 application.scala

You do not need a cluster of computers to use this option: Apache Spark can be installed locally, so you can also execute the code on your local machine. For more information on how to install Apache Spark, please refer to its homepage.

Basic usage

Once you have chosen a way to run the package, import the method that you want to use as well as the types for your data, as shown below:

import com.enriquegrodrigo.spark.crowd.methods.DawidSkene
import com.enriquegrodrigo.spark.crowd.types.{MulticlassAnnotation, MulticlassLabel}

val exampleFile = "examples/data/multi-ann.parquet"

// Load the annotations as a typed Dataset (in the spark-shell, the
// SparkSession `spark` and its implicits are already in scope)
val exampleData = spark.read.parquet(exampleFile).as[MulticlassAnnotation]

// Apply the learning algorithm
val model = DawidSkene(exampleData)

// Get a MulticlassLabel Dataset with the class predictions
val pred = model.getMu().as[MulticlassLabel]

// Annotator precision matrices
val annprec = model.getAnnotatorPrecision()

You can find a description of the code below:

  1. First the method and the types are imported, in this case DawidSkene together with MulticlassAnnotation and MulticlassLabel. The types are needed because the package API only accepts typed datasets for the annotations; a sketch of these case classes appears after this list.
  2. Then the data file (provided with the package) is loaded; it contains annotations for different examples. As you can see, the example uses the as method to convert the Spark DataFrame into a typed Spark Dataset (of type MulticlassAnnotation).
  3. To execute the algorithm and obtain a result, you apply the method name directly to the data. This call returns a DawidSkeneModel, which includes several methods for obtaining results from the algorithm.
  4. The method getMu returns the ground truth estimates made by the model.
  5. Finally, getAnnotatorPrecision returns the annotator quality estimated by the model.
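For reference, the two case classes involved presumably look like the sketch below; the field names are assumptions inferred from how the types are used above, so consult the package's types for the authoritative definitions.

// Hypothetical sketch of the typed rows, not copied from the package source
case class MulticlassAnnotation(example: Long, annotator: Long, value: Int) // one annotator's vote on one example
case class MulticlassLabel(example: Long, value: Int) // estimated true class for one example

With pred and annprec in scope, pred.show() and annprec.show() print the estimated labels and the annotator precision matrices directly in the shell.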

You can consult the models implemented in this package in Methods, where you can find a link to the original article for each algorithm.