Traditionally, in domains such as market research, user subjectivity has been accessed using qualitative techniques such as surveys, interviews and focus groups. The proliferation of user-generated content on the Web 2.0 provides new opportunities for capturing people's appraisals, feelings and opinions. However, the sheer scale of text data generated on the Web poses obvious practical challenges for classifying user subjectivity using traditional qualitative techniques.

In this paper we investigated the utility of different classification schemes for emotive language analysis with the aim of providing experimental justification for the choice of scheme for classifying emotions in free text. We compared six schemes: (1) Ekman's six basic emotions, (2) Plutchik's wheel of emotion, (3) Watson and Tellegen's Circumplex theory of affect, (4) the Emotion Annotation and Representation Language (EARL), (5) WordNet–Affect, and (6) free text. To measure their utility, we investigated their ease of use by human annotators as well as the performance of supervised machine learning. We assembled a corpus of 500 emotionally charged text documents. The corpus was annotated manually using an online crowdsourcing platform with five independent annotators per document.

Assuming that classification schemes with a better balance between completeness and complexity are easier to interpret and use, we expected such schemes to be associated with higher inter–annotator agreement. We used Krippendorff's alpha coefficient to measure inter–annotator agreement, according to which the six classification schemes were ranked as follows: (1) six basic emotions (α = 0.483), (2) wheel of emotion (α = 0.410), (3) Circumplex (α = 0.312), (4) EARL (α = 0.286), (5) free text (α = 0.205), and (6) WordNet–Affect (α = 0.202). However, correspondence analysis of annotations across the schemes highlighted that basic emotions are oversimplified representations of complex phenomena and as such are likely to lead to invalid interpretations, which are not necessarily reflected by high inter-annotator agreement.

To complement the results of the quantitative analysis, we used semi–structured interviews to gain a qualitative insight into how annotators interacted with and interpreted the chosen schemes. The size of the classification scheme was highlighted as a significant factor affecting annotation. In particular, the scheme of six basic emotions was perceived as having insufficient coverage of the emotion space, forcing annotators to often resort to inferior alternatives. On the opposite end of the spectrum, large schemes such as WordNet–Affect were linked to choice fatigue, which incurred significant cognitive effort in choosing the best annotation.

In the second part of the study, we used the annotated corpus to create six training datasets, one for each scheme. The training data were used in cross–validation experiments to evaluate classification performance in relation to the different schemes. According to the F-measure, the classification schemes were ranked as follows: (1) six basic emotions (F = 0.410), (2) Circumplex (F = 0.341), (3) wheel of emotion (F = 0.293), (4) EARL (F = 0.254), (5) free text (F = 0.159), and (6) WordNet–Affect (F = 0.158).

Not surprisingly, the smallest scheme was ranked the highest by both criteria. Therefore, out of the six schemes studied here, six basic emotions are best suited for emotive language analysis. However, both quantitative and qualitative analysis highlighted its major shortcoming – the oversimplification of positive emotions, which are all conflated into happiness. Further investigation is needed into ways of better balancing positive and negative emotions.
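The Krippendorff's alpha coefficient used to rank the schemes by inter-annotator agreement can be sketched for nominal labels in plain Python; the function name and the toy emotion labels below are illustrative, not data from the study.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha(items):
    """Krippendorff's alpha for nominal data.

    `items` is a list of label lists, one per annotated unit
    (e.g. the labels assigned to one document by its annotators).
    Units with fewer than two labels carry no agreement
    information and are skipped.
    """
    # Coincidence matrix: o[(c, k)] accumulates ordered label pairs
    # within each unit, weighted by 1 / (m_u - 1).
    o = Counter()
    for labels in items:
        m = len(labels)
        if m < 2:
            continue
        for c, k in permutations(labels, 2):
            o[(c, k)] += 1.0 / (m - 1)
    # Marginal label totals and grand total.
    n_c = Counter()
    for (c, _k), v in o.items():
        n_c[c] += v
    n = sum(n_c.values())
    # Observed vs. expected disagreement (nominal metric: c != k).
    d_o = sum(v for (c, k), v in o.items() if c != k)
    d_e = sum(n_c[c] * n_c[k] for c in n_c for k in n_c if c != k) / (n - 1)
    return 1.0 - d_o / d_e

# Example: three documents, two annotators each (hypothetical labels).
# krippendorff_alpha([["joy", "joy"], ["joy", "anger"], ["anger", "anger"]])
```

Alpha is 1 for perfect agreement, 0 at chance level, and negative for systematic disagreement, which makes the reported ranking (0.483 down to 0.202) directly comparable across schemes of different sizes.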
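The cross-validation experiments scored by F-measure can likewise be sketched as follows; the macro-averaged F1, the stride-based fold split and the majority-class baseline are illustrative stand-ins, since the abstract does not specify the classifiers or averaging used in the study.

```python
from collections import Counter

def f_measure(gold, predicted):
    """Macro-averaged F1 over the label set observed in `gold`."""
    scores = []
    for label in set(gold):
        tp = sum(1 for g, p in zip(gold, predicted) if g == p == label)
        fp = sum(1 for g, p in zip(gold, predicted) if p == label and g != label)
        fn = sum(1 for g, p in zip(gold, predicted) if g == label and p != label)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)

def majority_baseline(train):
    """Train a trivial classifier predicting the most frequent label."""
    label = Counter(lbl for _, lbl in train).most_common(1)[0][0]
    return lambda doc: label

def cross_validate(docs, labels, train_fn, k=10):
    """k-fold cross-validation returning the mean F-measure.

    Folds are taken by stride for simplicity; real experiments
    would shuffle (and possibly stratify) the data first.
    """
    fold_scores = []
    for i in range(k):
        fold = range(i, len(docs), k)
        test_idx = set(fold)
        train = [(docs[j], labels[j]) for j in range(len(docs))
                 if j not in test_idx]
        model = train_fn(train)
        gold = [labels[j] for j in fold]
        pred = [model(docs[j]) for j in fold]
        fold_scores.append(f_measure(gold, pred))
    return sum(fold_scores) / len(fold_scores)
```

One dataset per scheme would be passed through `cross_validate` with the same learner, so the resulting F-measures differ only in the classification scheme, as in the ranking reported above.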