Comparison with other packages

Other packages implement similar methods in other languages, although with different goals in mind. To our knowledge, there are two software packages aimed at learning from crowdsourced data:

  • Ceka: a Java software package, based on WEKA, that provides a large number of methods for learning from crowdsourced data.
  • Truth inference in Crowdsourcing: a collection of methods in Python for learning from crowdsourced data.

Both are useful packages when dealing with crowdsourced data, with a focus mainly on research. In contrast, spark-crowd is useful not only in research but also in production. It provides a clear usage interface, as well as software tests for all of its methods with high test coverage. Moreover, its methods have been implemented with a focus on scalability, so the package is useful in a wide variety of situations. This section compares the methods over a set of datasets, taking into account both the quality of the models and their execution time.
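
As an illustration of that interface, the following minimal sketch applies majority voting from a spark-shell session (where spark is predefined); the file path is hypothetical, and the method and type names follow the package documentation:

    import com.enriquegrodrigo.spark.crowd.methods.MajorityVoting
    import com.enriquegrodrigo.spark.crowd.types._
    import spark.implicits._

    // Hypothetical Parquet file with (example, annotator, value) triples.
    val annotations = spark.read.parquet("data/binary-ann.parquet").as[BinaryAnnotation]

    // Majority-voting consensus over the annotations.
    val consensus = MajorityVoting.transformBinary(annotations)
    consensus.show()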

Data

For this performance test we use simulated datasets of increasing size, as well as a real-world dataset:

  • binary1-4: simulated binary-class datasets with 10K, 100K, 1M and 10M instances respectively. Each has 10 simulated annotations per instance, and the ground truth for each example is known (but not used in the learning process). The accuracy shown in the tables is computed against this known ground truth (a sketch of this evaluation appears after this list).
  • cont1-4: simulated continuous target variable datasets with 10K, 100K, 1M and 10M instances respectively. Each has 10 simulated annotations per instance, and the ground truth for each example is known (but not used in the learning process). The Mean Absolute Error (MAE) is computed against this known ground truth.
  • crowdscale: a real multiclass dataset from the Crowdsourcing at Scale challenge. The data comprises 98,979 instances, each evaluated by at least 5 annotators, for a total of 569,375 answers. Ground truth is available for only 0.3% of the instances, which is used for evaluation.

All datasets are available through this link.
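
As a sketch of the evaluation protocol, accuracy is simply the fraction of instances on which the consensus label matches the known ground truth (MAE is computed analogously for the continuous datasets); the file and column names here are hypothetical:

    import org.apache.spark.sql.functions.avg
    import spark.implicits._

    // Hypothetical files: consensus predictions and the known ground truth.
    val pred  = spark.read.parquet("binary1-consensus.parquet") // columns: example, value
    val truth = spark.read.parquet("binary1-gt.parquet")        // columns: example, gt

    // Accuracy: fraction of instances where the consensus matches the truth.
    val acc = pred.join(truth, "example")
      .select(($"value" === $"gt").cast("double").as("hit"))
      .agg(avg($"hit"))
      .first()
      .getDouble(0)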

CEKA

To compare our methods with Ceka, we used two of the main methods implemented in both packages: MajorityVoting and DawidSkene. Ceka and spark-crowd both also implement the GLAD and Raykar algorithms. However, in Ceka these algorithms are implemented through wrappers around other libraries. The library for the GLAD algorithm is not available on our platform, as it is distributed as a Windows EXE file, and the wrapper for Raykar’s algorithms does not admit any configuration parameters.

We provide the results of executing these methods in terms of accuracy (Acc) and execution time in seconds (t1). For our package, we also include the execution time on a cluster (tc) with 3 executor nodes, each with 10 cores and 30GB of memory.
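
For reference, the cluster configuration used for the tc column can be requested through standard Spark properties when creating the session; a minimal sketch (the application name is arbitrary, and resource-manager details are omitted):

    import org.apache.spark.sql.SparkSession

    // 3 executors with 10 cores and 30GB of memory each, as in the tc runs.
    val spark = SparkSession.builder()
      .appName("spark-crowd-benchmarks")
      .config("spark.executor.instances", "3")
      .config("spark.executor.cores", "10")
      .config("spark.executor.memory", "30g")
      .getOrCreate()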

Comparison with Ceka

            MajorityVoting                        DawidSkene
            Ceka          spark-crowd             Ceka           spark-crowd
Dataset     Acc    t1     Acc    t1     tc        Acc    t1      Acc     t1    tc
binary1     0.931  21     0.931  11     7         0.994  57      0.994   31    32
binary2     0.936  15983  0.936  11     7         0.994  49259   0.994   60    51
binary3     X      X      0.936  21     8         X      X       0.994   111   69
binary4     X      X      0.936  54     37        X      X       0.994   703   426
crowdscale  0.88   10458  0.9    13     7         0.89   30999   0.9033  447   86

(X: the execution could not be completed on that dataset.)

Regarding accuracy, both packages achieve comparable results. Regarding execution time, however, spark-crowd obtains significantly better results across all datasets, especially on the larger ones, where it can solve problems that Ceka cannot.

Truth inference in Crowdsourcing

Now we compare spark-crowd with the methods implemented by the authors of Truth inference in Crowdsourcing. Although their code can certainly be used to compare and test algorithms, integrating these methods into a larger ecosystem might be difficult, as the authors do not provide a software package structure. Nevertheless, as it is an available collection with a great number of methods, a comparison is worthwhile.

For this experiment, the same datasets and the same environments are used. In this case, a larger number of models can be compared, as most of the methods are written in Python. However, the methods can only be applied to binary or continuous target variables; as far as we know, multiclass target variables are not supported. Moreover, the use of feature information in Raykar’s methods is also unsupported.

First, we compare the algorithms capable of learning from binary classes. In this category, MajorityVoting, DawidSkene, GLAD and IBCC are compared. For each dataset, we report accuracy (Acc) and execution time (in seconds). The table below shows the results for MajorityVoting and DawidSkene. Both packages obtain the same results in terms of accuracy. For the smaller datasets, the overhead imposed by parallelism makes Truth-inf the better choice, at least in terms of execution time. However, as the datasets grow, and especially for the last two, the speedup obtained by our algorithms is notable. In the case of DawidSkene, the Truth-inf package is not able to complete the execution on the largest dataset because of memory constraints.

Comparison with the Truth inference in Crowdsourcing package

            MajorityVoting                        DawidSkene
            Truth-inf     spark-crowd             Truth-inf      spark-crowd
Dataset     Acc    t1     Acc    t1     tc        Acc    t1      Acc    t1     tc
binary1     0.931  1      0.931  11     7         0.994  12      0.994  31     32
binary2     0.936  8      0.936  11     7         0.994  161     0.994  60     51
binary3     0.936  112    0.936  21     8         0.994  1705    0.994  111    69
binary4     0.936  2908   0.936  57     37        M      M       0.994  703    426

(M: the execution failed due to memory constraints.)
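
For reference, the sketch below shows how these two methods are invoked in spark-crowd, following the interface in the package documentation (the input file is hypothetical; binary data is handled here through the multiclass annotation type):

    import com.enriquegrodrigo.spark.crowd.methods.{DawidSkene, MajorityVoting}
    import com.enriquegrodrigo.spark.crowd.types._
    import spark.implicits._

    val annotations = spark.read.parquet("binary1-ann.parquet").as[MulticlassAnnotation]

    // Majority-voting baseline.
    val mvLabels = MajorityVoting.transformMulticlass(annotations)

    // Dawid-Skene EM model: consensus labels plus annotator quality estimates.
    val model   = DawidSkene(annotations)
    val labels  = model.getMu()                  // estimated ground truth
    val quality = model.getAnnotatorPrecision()  // per-annotator quality estimates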

Next we show the results for GLAD and IBCC. As can be seen, both packages obtain similar results in terms of accuracy. Regarding execution time, they obtain comparable results for the GLAD algorithm on the two smaller datasets (with a slight speedup for spark-crowd on binary2). However, Truth-inf is not able to complete the GLAD execution on the two largest datasets. In the case of IBCC, the speedup becomes noticeable from the second dataset onwards. Note also that Truth-inf did not complete the IBCC execution on the last dataset.

Comparison with the Truth inference in Crowdsourcing package (2)

            GLAD                                  IBCC
            Truth-inf     spark-crowd             Truth-inf      spark-crowd
Dataset     Acc    t1     Acc    t1     tc        Acc    t1      Acc    t1     tc
binary1     0.994  1185   0.994  1568   1547      0.994  22      0.994  74     67
binary2     0.994  4168   0.994  2959   2051      0.994  372     0.994  97     76
binary3     X      X      0.491  600    226       0.994  25764   0.994  203    129
binary4     X      X      0.974  2407   1158      X      X       X      1529   823
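
For completeness, a GLAD run follows the same pattern; a sketch in which the accessor names (including getInstanceDifficulty) are taken from the package documentation and should be read as assumptions:

    import com.enriquegrodrigo.spark.crowd.methods.Glad
    import com.enriquegrodrigo.spark.crowd.types._
    import spark.implicits._

    val annotations = spark.read.parquet("binary1-ann.parquet").as[BinaryAnnotation]

    val model       = Glad(annotations)
    val labels      = model.getMu()                   // consensus soft labels
    val reliability = model.getAnnotatorPrecision()   // per-annotator reliability
    val difficulty  = model.getInstanceDifficulty()   // per-instance difficulty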

Note that the performance of the GLAD algorithm seems to degrade on the bigger datasets. This may be due to the number of parameters the algorithm needs to estimate. One way to improve the estimation is to decrease the learning rate, but this makes the algorithm slower, as it needs many more iterations to reach a good solution. This makes the algorithm unsuitable for several big data contexts. To tackle this kind of problem, we developed an enhancement, CGLAD, which is included in this package (see the last section of this page for results of the other methods in the package, as well as of this enhancement).

Next we analyze the methods that are able to learn from continuous target variables: MajorityVoting (mean), CATD and PM (with mean initialization). We show the results in terms of Mean Absolute Error (MAE) and execution time (in seconds). The results for MajorityVoting and CATD can be found in the table below.

Comparison with the Truth inference in Crowdsourcing package on continuous target variables

            MajorityVoting (mean)                 CATD
            Truth-inf     spark-crowd             Truth-inf      spark-crowd
Dataset     MAE    t1     MAE    t1     tc        MAE    t1      MAE    t1     tc
cont1       1.234  1      1.234  6      8         0.324  207     0.324  25     28
cont2       1.231  8      1.231  7      9         0.321  10429   0.321  26     24
cont3       1.231  74     1.231  12     13        X      X       0.322  42     38
cont4       1.231  581    1.231  56     23        X      X       0.322  247    176
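
A minimal sketch of how the continuous-target methods are invoked (the file name is hypothetical, and the accessor on the CATD model is an assumption based on the pattern of the other methods):

    import com.enriquegrodrigo.spark.crowd.methods.{CATD, MajorityVoting}
    import com.enriquegrodrigo.spark.crowd.types._
    import spark.implicits._

    val annotations = spark.read.parquet("cont1-ann.parquet").as[RealAnnotation]

    // MajorityVoting for continuous targets: the mean annotation per instance.
    val meanLabels = MajorityVoting.transformReal(annotations)

    // CATD weights annotators by their estimated reliability; getMu() is
    // assumed to expose the aggregated estimations, as in the other methods.
    val catdLabels = CATD(annotations).getMu()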

As the table shows, both packages obtain similar results in terms of MAE. Regarding execution time, the MajorityVoting implementation in the Truth-inf package obtains good results, especially on the smallest dataset; as before, the overhead imposed by parallelism makes the execution time of our package slightly worse on the smaller datasets. However, as the datasets increase in size, the speedup obtained by our package is notable even for MajorityVoting, which is computationally the least complex method. Regarding CATD, Truth-inf does not seem able to solve the two larger problems in a reasonable time, whereas our package solves them quickly. Even on the smaller datasets, our package obtains a high speedup over Truth-inf for CATD.

The table below shows the results for the PM and PMTI algorithms.

Comparison with the Truth inference in Crowdsourcing package on continuous target variables (2)

            PM                                    PMTI
            Truth-inf     spark-crowd             Truth-inf      spark-crowd
Dataset     MAE    t1     MAE    t1     tc        MAE    t1      MAE    t1     tc
cont1       0.495  77     0.495  57     51        0.388  139     0.388  68     61
cont2       0.493  8079   0.495  76     57        0.386  14167   0.386  74     58
cont3       X      X      0.494  130    97        X      X       0.387  143    98
cont4       X      X      0.494  769    421       X      X       0.387  996    475

Although the results are similar, the modification of the original algorithm implemented in Truth-inf (PMTI) seems to be more accurate. Even for the smallest sizes, our package obtains a slight speedup, and as the datasets increase in size, it achieves a much higher one.

Other methods

To complete our experimentation, we next focus on methods implemented in our package that are available in neither Ceka nor Truth-inf: the full implementation of Raykar’s algorithms (which takes the features of the instances into account) and our enhancement of the GLAD algorithm. As noted above, Truth-inf implements a version of Raykar’s algorithms that does not use the instance features. First, we show the results obtained by Raykar’s methods for discrete target variables.

Other methods implemented in spark-crowd. Raykar’s methods for discrete target variables.

            RaykarBinary           RaykarMulti
            spark-crowd            spark-crowd
Dataset     Acc    t1     tc       Acc    t1     tc
binary1     0.994  65     63       0.994  167    147
binary2     0.994  92     74       0.994  241    176
binary3     0.994  181    190      0.994  532    339
binary4     0.994  1149   560      0.994  4860   1196
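
Unlike the Truth-inf version, the full Raykar implementation takes a second table with the instance features. A minimal sketch, assuming the two-dataset interface described in the package documentation (file names are hypothetical):

    import com.enriquegrodrigo.spark.crowd.methods.RaykarBinary
    import com.enriquegrodrigo.spark.crowd.types._
    import spark.implicits._

    // Annotations plus a second table with one row of features per instance.
    val annotations = spark.read.parquet("binary1-ann.parquet").as[BinaryAnnotation]
    val features    = spark.read.parquet("binary1-data.parquet")

    // Raykar’s method jointly learns a classifier over the features and the
    // annotator quality estimates.
    val model  = RaykarBinary(annotations, features)
    val labels = model.getMu()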

Next, we show the results of the Raykar method for continuous target variables.

Other methods implemented in spark-crowd. Raykar method for continuous target variables.

            RaykarCont
            spark-crowd
Dataset     MAE    t1     tc
cont1       0.994  31     32
cont2       0.994  60     51
cont3       0.994  111    69
cont4       0.994  703    426

Finally, we show the results for the CGLAD algorithm. As can be seen, it obtains results similar to the GLAD algorithm on the smallest instances and performs much better on the larger ones. Regarding execution time, CGLAD obtains a high speedup in the cases where the accuracy results of both algorithms are similar.

Other methods implemented in spark-crowd. CGLAD, an enhancement of the GLAD algorithm.

            CGLAD
            spark-crowd
Dataset     Acc    t1     tc
binary1     0.994  128    128
binary2     0.995  233    185
binary3     0.995  1429   607
binary4     0.995  17337  6190
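
Invoking the enhancement follows the same pattern as GLAD; a sketch in which the CGlad entry point and accessor are assumed from the package:

    import com.enriquegrodrigo.spark.crowd.methods.CGlad
    import com.enriquegrodrigo.spark.crowd.types._
    import spark.implicits._

    val annotations = spark.read.parquet("binary1-ann.parquet").as[BinaryAnnotation]

    // CGLAD reduces the number of parameters that must be estimated,
    // addressing the scaling issue of GLAD discussed above.
    val model  = CGlad(annotations)
    val labels = model.getMu()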