Comparison with other packages

Other packages implement similar methods in other languages, but with different goals in mind. To our knowledge, there are two software packages with the goal of learning from crowdsourced data:

  • Ceka: a Java software package based on WEKA, with a great number of methods that can be used to learn from crowdsourced data.
  • Truth inference in Crowdsourcing: a collection of methods in Python to learn from crowdsourced data.

Both are useful packages when dealing with crowdsourced data, with a focus mainly on research. In contrast, spark-crowd is useful not only in research but also in production. It provides a clear usage interface as well as software tests for all of its methods, with high test coverage. Moreover, its methods have been implemented with a focus on scalability, so it is useful in a wide variety of situations. This section compares the methods over a set of datasets, taking into account both the quality of the models and the execution time.


For this performance test we use simulated datasets of increasing size:

  • binary1-4: simulated binary class datasets with 10K, 100K, 1M and 10M instances respectively. Each of them has 10 simulated annotations per instance, and the ground truth for each example is known (but not used in the learning process). The accuracy shown in the tables is obtained over this known ground truth.
  • cont1-4: simulated continuous target variable datasets with 10K, 100K, 1M and 10M instances respectively. Each of them has 10 simulated annotations per instance, and the ground truth for each example is known (but not used in the learning process). The Mean Absolute Error is obtained over this known ground truth.
  • crowdscale: a real multiclass dataset from the Crowdsourcing at Scale challenge. The data comprises 98979 instances, each evaluated by at least 5 annotators, for a total of 569375 answers. Ground truth is available for only 0.3% of the data, and this subset is used for evaluation.
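As a rough illustration of the simulated binary setting described above, the following sketch generates 10 annotations per instance, aggregates them by majority voting and measures accuracy against the known ground truth. The generation procedure is an assumption made for this example (every annotator is given the same hypothetical `reliability`), and the function names are illustrative; the datasets used in the comparison may have been simulated differently.

```python
import random
from collections import Counter


def simulate_binary_annotations(n_instances, n_annotators=10, reliability=0.75, seed=0):
    """Return (truth, annotations) for a simulated binary crowdsourcing task.

    `reliability` is a hypothetical probability that an annotator gives the
    correct label; it is an assumption of this sketch.
    """
    rng = random.Random(seed)
    truth = [rng.randint(0, 1) for _ in range(n_instances)]
    annotations = []  # (example, annotator, value) triples
    for example, label in enumerate(truth):
        for annotator in range(n_annotators):
            value = label if rng.random() < reliability else 1 - label
            annotations.append((example, annotator, value))
    return truth, annotations


def majority_voting(annotations):
    """Aggregate the annotations of each example by the most frequent label."""
    votes = {}
    for example, _, value in annotations:
        votes.setdefault(example, Counter())[value] += 1
    # Ties are resolved in favour of the larger label (an arbitrary choice).
    return {ex: max(c, key=lambda v: (c[v], v)) for ex, c in votes.items()}


def accuracy(predicted, truth):
    """Fraction of examples whose aggregated label matches the ground truth."""
    hits = sum(1 for ex, label in enumerate(truth) if predicted[ex] == label)
    return hits / len(truth)


truth, annotations = simulate_binary_annotations(n_instances=2000)
predicted = majority_voting(annotations)
print(f"annotations: {len(annotations)}, accuracy: {accuracy(predicted, truth):.3f}")
```

The ground truth is only used in `accuracy`, never in the aggregation step, which mirrors the evaluation protocol stated for the simulated datasets.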

All datasets are available through this link


To compare our methods with Ceka, we used two of the main methods implemented in both packages, MajorityVoting and DawidSkene. Ceka and spark-crowd also implement GLAD and Raykar’s algorithms. However, in Ceka these algorithms are implemented as wrappers around other libraries. The library for the GLAD algorithm is not available on our platform, as it is distributed as a Windows EXE file, and the wrapper for Raykar’s algorithms does not accept any configuration parameters.

We provide the results of the execution of these methods in terms of accuracy (Acc) and execution time in seconds (t1). For our package, we also include the execution time on a cluster (tc) with 3 executor nodes of 10 cores and 30 GB of memory each. In the tables, X marks a result that could not be obtained.

Comparison with Ceka
            MajorityVoting                      DawidSkene
            Ceka            spark-crowd         Ceka            spark-crowd
Dataset     Acc     t1      Acc     t1    tc    Acc     t1      Acc     t1    tc
binary1     0.931   21      0.931   11    7     0.994   57      0.994   31    32
binary2     0.936   15983   0.936   11    7     0.994   49259   0.994   60    51
binary3     X       X       0.936   21    8     X       X       0.994   111   69
binary4     X       X       0.936   54    37    X       X       0.994
crowdscale  0.88    10458   0.9     13    7     0.89    30999   0.9033  447   86

Regarding accuracy, both packages achieve comparable results. Regarding execution time, however, spark-crowd obtains significantly better results on all datasets, especially on the bigger ones, where it can solve problems that Ceka is not able to.
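To make the size of the difference concrete, the short sketch below computes the speedup (baseline time divided by spark-crowd time) from the t1 values of the Ceka comparison table. The `times` dictionary and the `speedup` helper are names introduced only for this example.

```python
# Execution times in seconds (t1 columns) copied from the Ceka comparison table.
# Datasets that Ceka could not process (marked X) are left out.
times = {
    ("MajorityVoting", "binary1"): (21, 11),
    ("MajorityVoting", "binary2"): (15983, 11),
    ("MajorityVoting", "crowdscale"): (10458, 13),
    ("DawidSkene", "binary1"): (57, 31),
    ("DawidSkene", "binary2"): (49259, 60),
    ("DawidSkene", "crowdscale"): (30999, 447),
}


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


for (method, dataset), (ceka, spark_crowd) in times.items():
    print(f"{method:15s} {dataset:11s} speedup = {speedup(ceka, spark_crowd):8.1f}x")
```

For example, MajorityVoting on binary2 takes 15983 seconds with Ceka and 11 seconds with spark-crowd, a speedup of 1453x.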

Truth inference in crowdsourcing

Now we compare spark-crowd with the methods implemented by the authors of Truth inference in Crowdsourcing (Truth-inf in the tables). Although these methods can certainly be used to compare and test algorithms, integrating them into a larger ecosystem might be difficult, as the authors do not provide a software package structure. Nevertheless, as it is an available package with a great number of methods, a comparison with it is advisable.

For the experimentation, the same datasets and the same environments are used. In this case, a higher number of models can be compared, as most of the methods are written in Python. However, the methods can only be applied to binary or continuous target variables. As far as we know, the use of multiclass target variables is not possible. Moreover, the use of feature information in Raykar’s methods is also unsupported.

First, we compare the algorithms capable of learning from binary classes. In this category, MajorityVoting, DawidSkene, GLAD and IBCC are compared. For each dataset, the results in terms of accuracy (Acc) and time (in seconds) are obtained. The table below shows the results for MajorityVoting and DawidSkene. Both packages obtain the same results in terms of accuracy. For the smaller datasets, the overhead imposed by parallelism makes Truth-inf a better choice, at least in terms of execution time. However, as the datasets grow, and especially on the two largest ones, the speedup obtained by our package is notable. In the case of DawidSkene, the Truth-inf package is not able to complete the execution on the largest dataset because of memory constraints (marked M in the table).

Comparison with the Truth inference in Crowdsourcing package
            MajorityVoting                      DawidSkene
            Truth-inf       spark-crowd         Truth-inf       spark-crowd
Dataset     Acc     t1      Acc     t1    tc    Acc     t1      Acc     t1    tc
binary1     0.931   1       0.931   11    7     0.994   12      0.994   31    32
binary2     0.936   8       0.936   11    7     0.994   161     0.994   60    51
binary3     0.936   112     0.936   21    8     0.994   1705    0.994   111   69
binary4     0.936   2908    0.936   57    37    M       M       0.994   703   426

Next we show the results for GLAD and IBCC. As can be seen, both packages obtain similar results in terms of accuracy. Regarding execution time, for the GLAD algorithm they obtain comparable results on the two smaller datasets (with a slight speedup for spark-crowd on binary2). However, for this algorithm, Truth-inf is not able to complete the execution on the two largest datasets. In the case of IBCC, the speedup starts to be noticeable from the second dataset on, and Truth-inf did not complete the execution on the last dataset.

Comparison with the Truth inference in Crowdsourcing package (2)
            GLAD                                IBCC
            Truth-inf       spark-crowd         Truth-inf       spark-crowd
Dataset     Acc     t1      Acc     t1    tc    Acc     t1      Acc     t1    tc
binary1     0.994   1185    0.994   1568  1547  0.994   22      0.994   74    67
binary2     0.994   4168    0.994   2959  2051  0.994   372     0.994   97    76
binary3     X       X       0.491   600   226   0.994   25764   0.994   203   129
binary4     X       X       0.974   2407  1158  X       X       X       1529  823

Note that the performance of the GLAD algorithm seems to degrade on the bigger datasets. This may be due to the number of parameters the algorithm needs to estimate. One way to improve the estimation is to decrease the learning rate, which makes the algorithm slower, as it needs many more iterations to obtain a good solution. This makes the algorithm unsuitable for several big data contexts. To tackle this kind of problem, we developed an enhancement, CGLAD, which is included in this package (see the last section of this page for the results of other methods in the package, as well as this enhancement).
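The trade-off between learning rate and number of iterations can be seen on a toy problem. The sketch below is not GLAD: it runs plain gradient descent on the function f(x) = x**2, chosen only for illustration, and counts the steps needed to reach a fixed tolerance for several learning rates. The function name and its parameters are hypothetical.

```python
def gradient_descent_iterations(learning_rate, start=1.0, tolerance=1e-6, max_iter=1_000_000):
    """Count the gradient steps needed to minimise f(x) = x**2 from `start`."""
    x = start
    for iteration in range(1, max_iter + 1):
        x -= learning_rate * 2 * x  # the gradient of x**2 is 2x
        if abs(x) < tolerance:
            return iteration
    return max_iter


for lr in (0.1, 0.01, 0.001):
    print(f"learning rate {lr}: {gradient_descent_iterations(lr)} iterations")
```

On this toy function, each tenfold reduction of the learning rate multiplies the number of iterations by roughly ten. This is the kind of cost that makes a small learning rate expensive when every iteration has to process a large dataset.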

Next we analyze methods that are able to learn from continuous target variables: MajorityVoting (mean), CATD and PM (with mean initialization). We show the results in terms of MAE (Mean Absolute Error) and time (in seconds). The results for MajorityVoting and CATD can be found in the table below.

Comparison with the Truth inference in Crowdsourcing package on continuous target variables
            MajorityVoting (mean)               CATD
            Truth-inf       spark-crowd         Truth-inf       spark-crowd
Dataset     MAE     t1      MAE     t1    tc    MAE     t1      MAE     t1    tc
cont1       1.234   1       1.234   6     8     0.324   207     0.324   25    28
cont2       1.231   8       1.231   7     9     0.321   10429   0.321   26    24
cont3       1.231   74      1.231   12    13    X       X       0.322   42    38
cont4       1.231   581     1.231   56    23    X       X       0.322   247   176

As you can see in the table, both packages obtain similar results regarding MAE. Regarding execution time, the implementation of MajorityVoting from the Truth-inf package obtains good results, especially on the smallest dataset. It is worth pointing out that, for the smallest datasets, the overhead imposed by parallelism makes the execution time of our package a little worse in comparison. However, as the datasets increase in size, the speedup obtained by our package is notable, even for MajorityVoting, which is computationally less complex. Regarding CATD, Truth-inf does not seem able to solve the two bigger problems in a reasonable time, whereas our package solves them in a small amount of time. Even for the smaller datasets, our package obtains a high speedup in comparison to Truth-inf for CATD.
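As an illustration of how the MAE figures are obtained, the sketch below aggregates continuous annotations by their mean, as MajorityVoting (mean) does, and computes the Mean Absolute Error against a known ground truth. The toy data and the function names are assumptions made for this example and are not taken from either package.

```python
from statistics import mean


def mean_aggregation(annotations):
    """Average the annotated values of each example."""
    values = {}
    for example, _, value in annotations:
        values.setdefault(example, []).append(value)
    return {example: mean(vals) for example, vals in values.items()}


def mean_absolute_error(predicted, truth):
    """Average absolute difference between the estimate and the ground truth."""
    return mean(abs(predicted[example] - target) for example, target in truth.items())


# Toy data: (example, annotator, value) triples and a known ground truth.
annotations = [(0, 0, 1.0), (0, 1, 2.0), (0, 2, 3.0), (1, 0, 4.0), (1, 1, 6.0)]
truth = {0: 2.5, 1: 5.0}

estimates = mean_aggregation(annotations)
print(estimates)                              # {0: 2.0, 1: 5.0}
print(mean_absolute_error(estimates, truth))  # 0.25
```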

In the table below you can find the results for the PM and PMTI algorithms.

Comparison with the Truth inference in Crowdsourcing package on continuous target variables (2)
            PM                                  PMTI
            Truth-inf       spark-crowd         Truth-inf       spark-crowd
Dataset     MAE     t1      MAE     t1    tc    MAE     t1      MAE     t1    tc
cont1       0.495   77      0.495   57    51    0.388   139     0.388   68    61
cont2       0.493   8079    0.495   76    57    0.386   14167   0.386   74    58
cont3       X       X       0.494   130   97    X       X       0.387   143   98
cont4       X       X       0.494   769   421   X       X       0.387   996   475

Although the results are similar, PMTI, the modification of the original PM algorithm implemented in Truth-inf, seems to be more accurate. Even on the smallest datasets, our package obtains a slight speedup. As the datasets increase in size, our package obtains a much higher speedup.

Other methods

To complete our experimentation, we next focus on methods implemented by our package that are not implemented by Ceka or Truth-inf. These methods are the full implementation of Raykar’s algorithms (taking into account the features of the instances) and the enhancement over the GLAD algorithm. As a note, Truth-inf implements a version of Raykar’s algorithms that does not use the features of the instances. First, we show the results obtained by Raykar’s methods for discrete target variables.

Other methods implemented in spark-crowd. Raykar’s methods for discrete target variables.
            RaykarBinary            RaykarMulti
            spark-crowd             spark-crowd
Dataset     Acc     t1      tc      Acc     t1      tc
binary1     0.994   65      63      0.994   167     147
binary2     0.994   92      74      0.994   241     176
binary3     0.994   181     190     0.994   532     339
binary4     0.994   1149    560     0.994   4860    1196

Next we show the Raykar method for tackling continuous target variables.

Other methods implemented in spark-crowd. Raykar method for continuous target variables.
Dataset     Acc     t1      tc
cont1       0.994   31      32
cont2       0.994   60      51
cont3       0.994   111     69
cont4       0.994   703     426

Finally, we show the results for the CGLAD algorithm. As you can see, it obtains similar results to the GLAD algorithm in the smallest instances but it performs much better in the larger ones. Regarding execution time, CGLAD obtains a high speedup in the cases where accuracy results for both algorithms are similar.

Other methods implemented in spark-crowd. CGLAD, an enhancement over the GLAD algorithm.
Dataset     Acc     t1      tc
binary1     0.994   128     128
binary2     0.995   233     185
binary3     0.995   1429    607
binary4     0.995   17337   6190