Abstract:
|
The availability of unstructured big data, such as that produced by social media, has spurred growing methodological interest in text analysis and in the related pre-processing phases. Several works have recently studied the impact of different pre-processing treatments on text classification. This impact, however, has rarely been studied when the goal of the research is to define a topic-oriented dictionary that can be used to select messages on a given topic from a large set of unlabelled texts. This selection is a crucial phase: careful message filtering is a prerequisite for starting and properly developing any type of textual analysis. In this paper, we aim to build a dictionary on the environment. Starting from a verified list of Twitter Official Social Accounts, we evaluate whether and how different pre-processing treatments (and their combinations) affect the final dictionary.
|