KDIR 2016 Abstracts


Full Papers
Paper Nr: 9
Title:

StorylineViz: A [Space, Time, Actors, Motion] Segmentation Method for Visual Text Exploration

Authors:

Iwona Dudek and Jean-Yves Blaise

Abstract: Supporting knowledge discovery through visual means is a hot research topic in the field of visual analytics in general, and a key issue in the analysis of textual data sets. In that context, the StorylineViz study aims at developing a generic approach to narrative analysis, supporting the identification of significant patterns inside textual data, and ultimately knowledge discovery and sensemaking. It builds on a text segmentation procedure through which sequences of situations are extracted. A situation is defined by a quadruplet of components: actors, space, time and motion. The approach aims at facilitating visual reasoning on the structure, rhythm, patterns and variations of heterogeneous texts in order to enable comparative analysis, and to summarise how the space/time/actors/motion components are organised inside a given narrative. It encompasses issues that are rooted in Information Sciences (visual analytics, knowledge representation) and issues that more closely relate to Digital Humanities (comparative methods and analytical reasoning on textual content, support in teaching and learning, cultural mediation).

Paper Nr: 20
Title:

Keyword-based Approach for Lyrics Emotion Variation Detection

Authors:

Ricardo Malheiro, Hugo Gonçalo Oliveira, Paulo Gomes and Rui Pedro Paiva

Abstract: This research addresses the role of the lyrics in the context of music emotion variation detection. To accomplish this task, we create a system to detect the predominant emotion expressed by each sentence (verse) of the lyrics. The system employs Russell’s emotion model and contains 4 sets of emotions, one associated with each quadrant. To detect the predominant emotion in each verse, we propose a novel keyword-based approach, which receives a sentence (verse) and classifies it into the appropriate quadrant. To tune the system parameters, we created a 129-sentence training dataset from 68 songs. To validate our system, we created a separate ground truth containing 239 sentences (verses) from 44 songs, annotated manually with an average of 7 annotations per sentence. The system attains an F-measure of 67.4%.
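As a rough illustration of such a keyword-based quadrant classifier, the following Python sketch counts matches against per-quadrant keyword sets; the keyword lists and the tie-breaking behaviour are invented here for illustration and are not the authors' actual lexicon.

```python
# Hypothetical per-quadrant keyword sets in the spirit of Russell's model;
# these words are illustrative, not the paper's lexicon.
QUADRANT_KEYWORDS = {
    "Q1_happy":   {"joy", "love", "dance", "sunshine"},   # high valence, high arousal
    "Q2_angry":   {"rage", "fight", "scream", "hate"},    # low valence, high arousal
    "Q3_sad":     {"cry", "lonely", "tears", "goodbye"},  # low valence, low arousal
    "Q4_relaxed": {"calm", "dream", "gentle", "peace"},   # high valence, low arousal
}

def classify_verse(verse: str) -> str:
    """Return the quadrant whose keyword set best overlaps the verse's tokens."""
    tokens = set(verse.lower().split())
    scores = {q: len(tokens & kws) for q, kws in QUADRANT_KEYWORDS.items()}
    return max(scores, key=scores.get)

assert classify_verse("I cry these lonely tears for you") == "Q3_sad"
```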

Paper Nr: 21
Title:

Classification and Regression of Music Lyrics: Emotionally-Significant Features

Authors:

Ricardo Malheiro, Renato Panda, Paulo Gomes and Rui Pedro Paiva

Abstract: This research addresses the role of lyrics in the music emotion recognition process. Our approach is based on several state-of-the-art features complemented by novel stylistic, structural and semantic features. To evaluate our approach, we created a ground-truth dataset containing 180 song lyrics, annotated according to Russell’s emotion model. We conduct four types of experiments: regression, and classification by quadrant, arousal and valence categories. Compared to the state-of-the-art features (n-grams, the baseline), adding the other features, including the novel ones, improved the F-measure from 68.2%, 79.6% and 84.2% to 77.1%, 86.3% and 89.2%, respectively, for the three classification experiments. To study the relation between features and emotions (quadrants), we performed experiments to identify the best features for describing and discriminating between arousal hemispheres and valence meridians. To further validate these experiments, we built a validation set comprising 771 lyrics extracted from the AllMusic platform, achieving 73.6% F-measure in the classification by quadrants. Regarding regression, results show that, compared to similar studies for audio, we achieve a similar performance for arousal and a much better performance for valence.

Paper Nr: 32
Title:

A Multi-Layer System for Semantic Textual Similarity

Authors:

Ngoc Phuoc An Vo and Octavian Popescu

Abstract: Building a system able to cope with the various phenomena that fall under the umbrella of semantic similarity is far from trivial. It is almost always the case that the performance of a system does not vary consistently or predictably from corpus to corpus. We analyzed the source of this variance and found that it is related to the word-pair similarity distribution among the topics in the various corpora. We then used this insight to construct a 4-module system that takes into consideration not only string and semantic word similarity, but also word alignment and sentence structure. The system consistently achieves an accuracy which is very close to the state of the art, or establishes a new state of the art. The system is based on a multi-layer architecture and is able to deal with heterogeneous corpora which may not have been generated by the same distribution.

Paper Nr: 46
Title:

Unsupervised Irony Detection: A Probabilistic Model with Word Embeddings

Authors:

Debora Nozza, Elisabetta Fersini and Enza Messina

Abstract: The automatic detection of figurative language, such as irony and sarcasm, is one of the most challenging tasks in Natural Language Processing (NLP). This is because machine learning methods can easily be misled by the presence of words that have a strong polarity but are used ironically, meaning that the opposite polarity was intended. In this paper, we propose an unsupervised framework for domain-independent irony detection. In particular, to derive an unsupervised Topic-Irony Model (TIM), we build upon an existing probabilistic topic model initially introduced for sentiment analysis purposes. Moreover, in order to improve its generalization abilities, we take advantage of Word Embeddings to obtain a domain-aware ironic orientation of words. This is the first work that addresses this task in an unsupervised setting and the first study of the topic-irony distribution. Experimental results show that TIM is comparable to, and sometimes even better than, supervised state-of-the-art approaches for irony detection. Moreover, when integrating the probabilistic model with word embeddings (TIM+WE), promising results are obtained in a more complex, real-world scenario.

Paper Nr: 47
Title:

A Machine Learning Approach for Layout Inference in Spreadsheets

Authors:

Elvis Koci, Maik Thiele, Oscar Romero and Wolfgang Lehner

Abstract: Spreadsheet applications are among the most used tools for content generation and presentation in industry and on the Web. In spite of this success, there is no comprehensive approach to automatically extract and reuse the richness of data maintained in this format. The biggest obstacle is the lack of awareness of the structure of the data in spreadsheets, which otherwise could provide the means to automatically understand and extract knowledge from these files. In this paper, we propose a classification approach to discover the layout of tables in spreadsheets. To this end, we focus on the cell level, considering a wide range of features not covered before by related work. We evaluated the performance of our classifiers on a large dataset covering three different corpora from various domains. Finally, our work includes a novel technique for detecting and repairing incorrectly classified cells in a post-processing step. The experimental results show that our approach delivers very high accuracy, bringing us a crucial step closer to automatic table extraction.

Paper Nr: 52
Title:

Automatic Text Summarization by Non-topic Relevance Estimation

Authors:

Ignacio Arroyo-Fernández, Juan-Manuel Torres-Moreno, Gerardo Sierra and Luis Adrián Cabrera-Diego

Abstract: We investigate a novel framework for Automatic Text Summarization in which underlying language-use features are learned from a minimal sample corpus. We argue that the low complexity of this kind of feature allows us to rely on the generalization ability of a learning machine, rather than on diverse human-abstracted summaries. In this way, our method reliably estimates a relevance measure for predicting summary candidature scores, regardless of the topics in unseen documents. Our output summaries are comparable to the state of the art. Thus we show that, in order to extract meaningful summaries, what is crucial is not what is being said, but rather how it is being said.

Paper Nr: 55
Title:

Discovering Data Lineage from Data Warehouse Procedures

Authors:

Kalle Tomingas, Priit Järv and Tanel Tammet

Abstract: We present a method to calculate component dependencies and data lineage from the database structure and a large set of associated procedures and queries, independently of actual data in the data warehouse. The method relies on the probabilistic estimation of the impact of data in queries. We present a rule system supporting the efficient calculation of the transitive closure. The dependencies are categorized, aggregated and visualized to address various planning and decision support problems. System performance is evaluated and analysed over several real-life datasets.
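The transitive-closure step described above can be illustrated with a minimal Python sketch; the component names and the direct-dependency map are invented, and the paper's probabilistic impact estimation is omitted for brevity.

```python
def transitive_closure(direct: dict) -> dict:
    """For each component, collect everything reachable through the
    direct-dependency edges, i.e. its full upstream lineage."""
    closure = {}

    def reach(node):
        if node in closure:
            return closure[node]
        closure[node] = set()
        for dep in direct.get(node, ()):
            closure[node].add(dep)
            closure[node] |= reach(dep)   # pull in the dependency's own lineage
        return closure[node]

    for node in direct:
        reach(node)
    return closure

# Toy lineage chain: report <- mart <- staging <- source
direct = {"report": {"mart"}, "mart": {"staging"}, "staging": {"source"}}
assert transitive_closure(direct)["report"] == {"mart", "staging", "source"}
```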

Paper Nr: 59
Title:

A Linear-dependence-based Approach to Design Proactive Credit Scoring Models

Authors:

Roberto Saia and Salvatore Carta

Abstract: The main aim of a credit scoring model is the classification of loan customers into two classes, reliable and unreliable customers, on the basis of their potential capability to keep up with their repayments. Nowadays, credit scoring models are increasingly in demand, due to the growth of consumer credit. Such models are usually designed on the basis of past loan applications and used to evaluate new ones. Their definition represents a hard challenge for different reasons, the most important of which is the imbalanced class distribution of the data (i.e., the number of default cases is much smaller than that of the non-default cases), which reduces the effectiveness of the most widely used approaches (e.g., neural networks, random forests, and so on). The Linear Dependence Based (LDB) approach proposed in this paper offers a twofold advantage: first, it evaluates a new loan application on the basis of the linear dependence of its vector representation on a matrix composed of the vector representations of past non-default applications; by using only one class of data, it overcomes the imbalanced class distribution issue. Second, since it does not exploit the defaulting loans, it can operate in a proactive manner, also addressing the cold-start problem. We validate our approach on two real-world datasets characterized by a strongly imbalanced distribution of data, comparing its performance with that of one of the best state-of-the-art approaches: random forests.
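A minimal sketch of the core idea, under the assumption that linear dependence is measured by a least-squares residual against the non-default history (the paper's exact formulation may differ, and the toy vectors below are invented):

```python
import numpy as np

def ldb_score(A: np.ndarray, x: np.ndarray) -> float:
    """Residual of the best linear combination of past non-default
    applications (rows of A) approximating the new application x.
    A near-zero residual means x is (almost) linearly dependent on them."""
    coeffs, *_ = np.linalg.lstsq(A.T, x, rcond=None)
    return float(np.linalg.norm(A.T @ coeffs - x))

# Toy history of two non-default applications in a 3-feature space.
A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 1.0]])

x_dependent = A[0] + 0.5 * A[1]        # lies in the span of the history
x_novel = np.array([0.0, 0.0, 1.0])    # does not lie in the span

assert ldb_score(A, x_dependent) < 1e-8   # approximately dependent
assert ldb_score(A, x_novel) > 0.1        # clearly independent
```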

Paper Nr: 68
Title:

Estimating Sentiment via Probability and Information Theory

Authors:

Kevin Labille, Sultan Alfarhood and Susan Gauch

Abstract: Opinion detection and opinion analysis are challenging but important tasks. Such sentiment analysis can be done using traditional supervised learning methods, such as naive Bayes classification and support vector machines (SVM), or using unsupervised, lexicon-based approaches. Because lexicon-based sentiment analysis methods make use of an opinion dictionary, i.e., a list of opinion-bearing or sentiment words, sentiment lexicons play a key role. Our work focuses on the task of generating such a lexicon. We propose several novel methods to automatically generate a general-purpose sentiment lexicon using a corpus-based approach. While most existing methods generate a lexicon using a list of seed sentiment words and a domain corpus, our work differs by generating a lexicon from scratch, using probabilistic techniques and information-theoretic text mining techniques on a large, diverse corpus. We conclude by presenting an ensemble method that combines the two approaches. We evaluate and demonstrate the effectiveness of our methods by utilizing the various automatically generated lexicons during sentiment analysis. When used for sentiment analysis, our best single lexicon achieves an accuracy of 87.60% and the ensemble approach achieves 88.75% accuracy, both statistically significant improvements over the 81.60% obtained with a widely-used sentiment lexicon.
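One common information-theoretic route to a corpus-based lexicon scores each word by its pointwise mutual information (PMI) with document polarity; the following toy sketch illustrates that general idea with an invented corpus and does not reproduce the authors' exact methods.

```python
import math
from collections import Counter

# Invented toy corpus of polarity-labelled documents.
docs = [
    ("great food and great service", "pos"),
    ("great movie truly wonderful", "pos"),
    ("terrible food awful service", "neg"),
    ("awful plot terrible acting", "neg"),
]

word_label, word_count, label_count = Counter(), Counter(), Counter()
total = 0
for text, label in docs:
    for w in text.split():
        word_label[(w, label)] += 1
        word_count[w] += 1
        label_count[label] += 1
        total += 1

def sentiment_score(word: str) -> float:
    """PMI(word, pos) - PMI(word, neg); positive => positive-leaning word."""
    def pmi(label):
        joint = word_label[(word, label)] / total
        if joint == 0:
            return float("-inf")
        return math.log2(joint / ((word_count[word] / total) * (label_count[label] / total)))
    return pmi("pos") - pmi("neg")

assert sentiment_score("great") > 0     # appears only in positive docs
assert sentiment_score("terrible") < 0  # appears only in negative docs
```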

Short Papers
Paper Nr: 10
Title:

Putting Web Tables into Context

Authors:

Katrin Braunschweig, Maik Thiele, Elvis Koci and Wolfgang Lehner

Abstract: Web tables are a valuable source of information used in many application areas. However, to exploit Web tables it is necessary to understand their content and intention, which is impeded by their ambiguous semantics and inconsistencies. Therefore, additional context information, e.g. the text in which the tables are embedded, is needed to support the table understanding process. In this paper, we propose a novel contextualization approach that 1) splits the table context into topically coherent paragraphs, 2) provides a similarity measure that is able to match each paragraph to the table in question and 3) ranks these paragraphs according to their relevance. Each step is accompanied by an experimental evaluation on real-world data showing that our approach is feasible and effectively identifies the most relevant context for a given Web table.

Paper Nr: 11
Title:

DTATG: An Automatic Title Generator based on Dependency Trees

Authors:

Liqun Shao and Jie Wang

Abstract: We study automatic title generation for a given block of text and present a method called DTATG to generate titles. DTATG first extracts a small number of central sentences that convey the main meanings of the text and are in a suitable structure for conversion into a title. DTATG then constructs a dependency tree for each of these sentences and removes certain branches using a Dependency Tree Compression Model we devise. We also devise a title test to determine if a sentence can be used as a title. If a trimmed sentence passes the title test, then it becomes a title candidate. DTATG selects the title candidate with the highest ranking score as the final title. Our experiments showed that DTATG can generate adequate titles. We also showed that DTATG-generated titles have higher F1 scores than those generated by the previous methods.

Paper Nr: 17
Title:

Bootstrapping a Semantic Lexicon on Verb Similarities

Authors:

Shaheen Syed, Marco Spruit and Melania Borit

Abstract: We present a bootstrapping algorithm to create a semantic lexicon from a list of seed words and a corpus that was mined from the web. We exploit extraction patterns to bootstrap the lexicon and use collocation statistics to dynamically score new lexicon entries. Extraction patterns are subsequently scored by calculating their conditional probability in relation to a non-related text corpus. We find that highly domain-related verbs achieve the highest accuracy, and that collocation statistics affect the accuracy both positively and negatively during the bootstrapping runs.

Paper Nr: 18
Title:

Heterogeneous Ensemble for Imaginary Scene Classification

Authors:

Saleh Alyahyan, Majed Farrash and Wenjia Wang

Abstract: In data mining, identifying the best individual technique to achieve very reliable and accurate classification has always been considered an important but non-trivial task. This paper presents a novel approach, a heterogeneous ensemble technique, to avoid that task and also increase the accuracy of classification. It combines models that are generated by methodologically different learning algorithms and selected with different rules that utilize both the accuracy of the individual models and the diversity among them. The key strategy is to select the most accurate model among all the generated models as the core model, and then select a number of models that are more diverse from the most accurate model to build the heterogeneous ensemble. The framework of the proposed approach has been implemented and tested on real-world data to classify imaginary scenes. The results show that our approach outperforms other state-of-the-art methods, including Bayesian networks, SVM and AdaBoost.
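The selection strategy can be sketched as follows; the model names, their predictions, and the disagreement-based diversity measure are illustrative assumptions, not the authors' exact rules.

```python
def accuracy(preds, truth):
    return sum(p == t for p, t in zip(preds, truth)) / len(truth)

def disagreement(a, b):
    """Fraction of instances on which two models disagree (a simple diversity proxy)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def select_ensemble(model_preds: dict, truth, k: int):
    """Pick the most accurate model as the core, then the k models that
    disagree with the core most often (i.e. are most diverse from it)."""
    core = max(model_preds, key=lambda m: accuracy(model_preds[m], truth))
    others = [m for m in model_preds if m != core]
    diverse = sorted(others,
                     key=lambda m: disagreement(model_preds[m], model_preds[core]),
                     reverse=True)[:k]
    return [core] + diverse

truth = [1, 0, 1, 1, 0]
model_preds = {
    "svm":   [1, 0, 1, 1, 0],   # perfectly accurate -> becomes the core
    "bayes": [1, 0, 1, 0, 0],   # very similar to the core
    "tree":  [0, 1, 0, 0, 1],   # very different from the core
}
assert select_ensemble(model_preds, truth, 1) == ["svm", "tree"]
```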

Paper Nr: 23
Title:

Enriching What-If Scenarios with OLAP Usage Preferences

Authors:

Mariana Carvalho and Orlando Belo

Abstract: Nowadays, enterprise managers involved in decision-making processes struggle with numerous problems related to the market position or business reputation of their companies. Owning the right, high-quality set of information is a crucial factor for developing business activities and consequently gaining competitive advantages in business arenas. However, retrieving information is not enough. The possibility to simulate hypothetical scenarios without harming the business, using What-If analysis tools, and to retrieve highly refined information is an interesting way of achieving such advantages. In this paper, we propose an approach for helping to optimize enterprise decision processes using What-If analysis scenarios combined with OLAP usage preferences. We designed and developed a specific piece of software that aims to discover the best recommendations for the parameters of What-If analysis scenarios using OLAP usage preferences, incorporating user experience in the definition and analysis of a target decision-making scenario.

Paper Nr: 28
Title:

Summarization of Spontaneous Speech using Automatic Speech Recognition and a Speech Prosody based Tokenizer

Authors:

György Szaszák, Máté Ákos Tündik and András Beke

Abstract: This paper addresses speech summarization of highly spontaneous speech. The audio signal is transcribed using an Automatic Speech Recognizer, which operates at relatively high word error rates due to the complexity of the recognition task and the high spontaneity of the speech. An analysis is carried out to assess the propagation of speech recognition errors into syntactic parsing. We also propose an automatic, speech prosody based audio tokenization approach and compare it to human performance. The resulting sentence-like tokens are analysed by the syntactic parser to support ranking based on thematic terms and sentence position. The thematic term score is computed in two ways: TF-IDF and Latent Semantic Indexing. The sentence scores are calculated as a linear combination of the thematic term score and a positional score. The summary is generated from the top 10 candidates. Results show that prosody based tokenization reaches average human performance and that speech recognition errors propagate moderately into syntactic parsing (POS tagging and dependency parsing). Nouns prove to be quite error-resistant. Audio summarization achieves 0.62 recall and 0.79 precision, with an F-measure of 0.68, compared to a human reference. A subjective test was also carried out on a Likert scale. All results apply to spontaneous Hungarian.
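The ranking step can be illustrated with a small Python sketch, assuming TF-IDF for the thematic term score; the weight `alpha`, the positional formula, and the toy sentences are invented for illustration.

```python
import math

def tfidf_score(sentence, doc_sentences):
    """Sum of TF-IDF weights of the sentence's unique tokens (thematic score)."""
    tokens = sentence.split()
    score = 0.0
    for w in set(tokens):
        tf = tokens.count(w) / len(tokens)
        df = sum(1 for s in doc_sentences if w in s.split())
        score += tf * math.log(len(doc_sentences) / df)
    return score

def positional_score(index, n_sentences):
    # Earlier sentences score higher in this toy formula.
    return 1.0 - index / n_sentences

def sentence_score(sentence, index, doc_sentences, alpha=0.7):
    """Linear combination of the thematic (TF-IDF) and positional scores."""
    return (alpha * tfidf_score(sentence, doc_sentences)
            + (1 - alpha) * positional_score(index, len(doc_sentences)))

doc = ["stock markets fell sharply today",
       "analysts expect further volatility",
       "in other news the weather was mild"]
ranked = sorted(range(len(doc)), key=lambda i: sentence_score(doc[i], i, doc),
                reverse=True)
assert ranked == [0, 1, 2]  # equal thematic scores here, so position decides
```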

Paper Nr: 34
Title:

Result Diversity for RDF Search

Authors:

Hiba Arnaout and Shady Elbassuoni

Abstract: RDF repositories are typically searched using triple-pattern queries. Often, triple-pattern queries return too many results, making it difficult for users to find the most relevant ones. To remedy this, some recent works have proposed relevance-based ranking models for triple-pattern queries. However, it is often the case that the top-ranked results are homogeneous. In this paper, we propose a framework to diversify the results of triple-pattern queries over RDF datasets. We first define different notions of result diversity in the setting of RDF. We then develop an approach to result diversity based on Maximal Marginal Relevance. Finally, we develop a diversity-aware evaluation metric based on the Discounted Cumulative Gain and use it on a benchmark of 100 queries over DBpedia.
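Maximal Marginal Relevance re-ranking, as used above, can be sketched as follows; the relevance and similarity values are invented toy numbers.

```python
def mmr_rerank(candidates, relevance, similarity, k, lam=0.5):
    """Greedily pick k results, trading relevance against similarity to the
    already selected set: argmax lam*rel(c) - (1-lam)*max_sim(c, selected)."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def mmr(c):
            max_sim = max((similarity[c][s] for s in selected), default=0.0)
            return lam * relevance[c] - (1 - lam) * max_sim
        best = max(remaining, key=mmr)
        selected.append(best)
        remaining.remove(best)
    return selected

relevance = {"a": 0.9, "b": 0.85, "c": 0.5}
similarity = {"a": {"b": 0.95, "c": 0.1},
              "b": {"a": 0.95, "c": 0.1},
              "c": {"a": 0.1, "b": 0.1}}
# "b" is nearly a duplicate of "a", so MMR prefers the diverse "c" second.
assert mmr_rerank(["a", "b", "c"], relevance, similarity, k=2) == ["a", "c"]
```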

Paper Nr: 37
Title:

Sentiment Analysis of Breast Cancer Screening in the United States using Twitter

Authors:

Kai O. Wong, Faith G. Davis, Osmar R. Zaïane and Yutaka Yasui

Abstract: Whether or not U.S. women follow the recommended breast cancer screening guidelines is related to the perceived benefits and harms of the procedure. Twitter is a rich source of subjective information containing individuals’ sentiment towards public health interventions/technologies. Using our modified version of the Hutto and Gilbert (2014) sentiment classifier, we described the temporal, geospatial, and thematic patterns of public sentiment towards breast cancer screening in 8 months of tweets (n=64,524) in the U.S. To examine how sentiment was related to screening uptake behaviour, we investigated and identified significant associations between breast cancer screening sentiment (via Twitter) and breast cancer screening uptake (via BRFSS) at the state level.

Paper Nr: 39
Title:

An Active Learning Approach for Ensemble-based Data Stream Mining

Authors:

Rabaa Alabdulrahman, Herna Viktor and Eric Paquet

Abstract: Data streams, where an instance is only seen once and where a limited amount of data can be buffered for processing at a later time, are omnipresent in today’s real-world applications. In this context, adaptive online ensembles that are able to learn incrementally have been developed. However, the issue of handling data that arrives asynchronously has not received enough attention. Often, the true class label arrives only after a time-lag, which is problematic for existing adaptive learning techniques. It is not realistic to require that all class labels be made available at training time. This issue is further complicated by the presence of late-arriving, slowly changing dimensions (i.e., late-arriving descriptive attributes). The aim of active learning is to construct accurate models when few labels are available; thus, active learning has been proposed as a way to obtain such missing labels in a data stream classification setting. To this end, this paper introduces an active online ensemble (AOE) algorithm that extends online ensembles with an active learning component. Our experimental results demonstrate that our AOE algorithm builds accurate models with much smaller ensemble sizes than traditional ensemble learning algorithms. Further, our models are constructed from small, incremental data sets, thus reducing the number of examples required to build accurate ensembles.

Paper Nr: 40
Title:

Prediction of Company's Trend based on Publication Statistics and Sentiment Analysis

Authors:

Fumiyo Fukumoto, Yoshimi Suzuki, Akihiro Nonaka and Karman Chan

Abstract: This paper presents a method for predicting a company’s research and development (R&D) trends in its business area. We used three types of data collections, i.e., scientific papers, open patents, and newspaper articles, to estimate temporal changes in trends in a company’s business area. We used frequency counts of scientific papers and open patents published over time. For news articles, we applied sentiment analysis to extract positive news reports related to the company’s business areas, and counted their frequencies. For each company, we then created temporal changes based on these frequency statistics. For each business area, we clustered these temporal changes. Finally, we estimated prediction models for each cluster. The results show that the model obtained by combining the three data collections is effective for predicting a company’s future trends; in particular, the results show that SP clustering contributes to the overall performance.

Paper Nr: 42
Title:

Explanation Retrieval in Semantic Networks - Understanding Spreading Activation based Recommendations

Authors:

Vanessa N. Michalke and Kerstin Hartig

Abstract: Spreading Activation is a well-known semantic search technique to determine the relevance of nodes in a semantic network. When used for decision support, meaningful explanations of semantic search results are crucial for the user’s acceptance and trust. Usually, explanations are generated based on the original network. Indeed, the data accumulated during the spreading activation process contains semantically extremely valuable information. Therefore, our approach exploits the so-called spread graph, a specific data structure that comprises the spreading progress data. In this paper, we present a three-step explanation retrieval method based on spread graphs. We show how to retrieve the most relevant parts of a network by minimization and extraction techniques and formulate meaningful explanations. The evaluation of the approach is then performed with a prototypical decision support system for automotive safety analyses.

Paper Nr: 58
Title:

Introducing the Key Stages for Addressing Multi-perspective Summarization

Authors:

Elena Lloret

Abstract: Generating summaries from evaluative text (e.g., reviews) is a challenging task, in which available metadata is hardly exploited, thus leading to the creation of very generic and biased summaries. In this paper, the novel task of multi-perspective summarization is introduced. The key stages for generating this type of summaries are defined, and a preliminary analysis of their feasibility is conducted. The main novelties of this study include: i) the linguistic treatment of the text at the level of basic information units, instead of sentences; and ii) the analysis carried out over the distribution of opinions and topics.

Paper Nr: 60
Title:

Efficient Distraction Detection in Surveillance Video using Approximate Counting over Decomposed Micro-streams

Authors:

Avi Bleiweiss

Abstract: Mining techniques for infinite data streams often store synoptic information about the most recently observed data elements. Motivated by space-efficient solutions, our work exploits approximate counting over a fixed-size sliding window to detect distraction events in video. We propose a model that transforms the incoming video sequence inline into an orthogonal set of thousands of binary micro-streams, and for each of the bit streams we estimate, at every timestamp, the number of ones in a preceding sub-window interval. On window-bound frames, we further extract a compact feature representation, a bag of count-of-ones occurrences, to facilitate effective query of transitive similarity samples. Despite its simplicity, our prototype demonstrates robust knowledge discovery that supports the intuition of a context-neutral window summary. To evaluate our system, we use real-world scenarios from an online video surveillance repository.
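The per-micro-stream counting step can be sketched as below; for clarity this toy version keeps exact counts over the window (whereas the paper uses approximate counting for space efficiency), and the stream values are invented.

```python
from collections import deque

class SlidingOnesCounter:
    """Count of ones in the last `window` bits of one binary micro-stream."""

    def __init__(self, window: int):
        self.window = window
        self.bits = deque(maxlen=window)  # oldest bit drops out automatically
        self.ones = 0

    def push(self, bit: int) -> int:
        if len(self.bits) == self.window:
            self.ones -= self.bits[0]     # bit about to be evicted
        self.bits.append(bit)
        self.ones += bit
        return self.ones

counter = SlidingOnesCounter(window=4)
counts = [counter.push(b) for b in [1, 1, 0, 1, 1, 0, 0, 0]]
assert counts == [1, 2, 2, 3, 3, 2, 2, 1]
```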

Paper Nr: 63
Title:

Learning Diagnosis from Electronic Health Records

Authors:

Ioana Barbantan and Rodica Potolea

Abstract: In an attempt to build a complete solution for a medical assistive decision support system, we propose a complex flow that integrates a sequence of modules targeting the different data engineering tasks. This solution can analyse any type of unstructured medical document, which is processed by applying specific NLP steps followed by semantic analysis, leading to the identification of medical concepts and thus imposing a structure on the input documents. The data collection, document pre-processing, concept extraction, and correlation modules were investigated in our previous work, where we proposed original solutions for each. Using the collected and structured representation of the medical records, informed decisions regarding the health status of the patients can be made. The current paper focuses on the prediction module, which joins all the components in a logical flow and is completed with the suggested diagnosis classification for the patient. The accuracy of 81.25% obtained on the medical documents supports the strength of our proposed strategy.

Paper Nr: 65
Title:

Unsupervised Classification of Opinions

Authors:

Itu Vlad Vasile, Rodica Potolea and Mihaela Dinsoreanu

Abstract: Opinion mining is gaining more interest thanks to the ever-growing amount of data available on the internet. This work proposes an unsupervised approach that clusters opinions into fine-grained ranges. The approach is able to generate its own seed words, which improves its applicability to the context at hand and eliminates the need for user input. Furthermore, we devise a strategy for computing the influence of valence shifters and negations on opinion words. The method is general enough to perform well while reducing subjectivity to a minimum.

Paper Nr: 76
Title:

Power to the People! - Meta-Algorithmic Modelling in Applied Data Science

Authors:

Marco Spruit and Raj Jagesar

Abstract: This position paper first defines the research field of applied data science at the intersection of domain expertise, data mining, and engineering capabilities, with particular attention to analytical applications. We then propose a meta-algorithmic approach for applied data science with societal impact based on activity recipes. Our people-centred motto from an applied data science perspective translates to design science research which focuses on empowering domain experts to sensibly apply data mining techniques through prototypical software implementations supported by meta-algorithmic recipes.

Paper Nr: 79
Title:

Evaluation of Statistical Text Normalisation Techniques for Twitter

Authors:

Phavanh Sosamphan, Veronica Liesaputra, Sira Yongchareon and Mahsa Mohaghegh

Abstract: One of the major challenges in the era of big data is how to ‘clean’ the vast amount of data, particularly from micro-blog websites like Twitter. Twitter messages, called tweets, are commonly written in ill-formed ways, including abbreviations, repeated characters, and misspelled words. These ‘noisy tweets’ require text normalisation techniques to detect such errors and convert the tweets into more accurate English sentences. Several techniques have been proposed to solve these issues; however, each technique possesses some limitations and therefore cannot achieve good overall results. This paper aims to evaluate individual existing statistical normalisation methods and their possible combinations in order to find the best combination that can efficiently clean noisy tweets at the character level, handling abbreviations, repeated letters and misspelled words. Tested on our Twitter sample dataset, the best combination achieves a Bilingual Evaluation Understudy (BLEU) score of 88% and a Word Error Rate (WER) of 7%, both considered better than the baseline model.

Posters
Paper Nr: 3
Title:

AToMRS: A Tool to Monitor Recommender Systems

Authors:

André Costa, Tiago Cunha and Carlos Soares

Abstract: Recommender systems arose in response to the excess of available online information. These systems suggest, to a given individual, items that may be relevant to them. The monitoring and evaluation of these systems are fundamental to the proper functioning of many business-related services. The goal of this paper is to create a tool capable of collecting, aggregating and supervising the results obtained from the evaluation of recommender systems. To achieve this goal, a multi-granularity approach is developed and implemented in order to organize the different levels of the problem. This tool also aims to tackle the lack of mechanisms that enable visual assessment of the performance of recommender system algorithms. A functional prototype of the application is presented, with the purpose of validating the solution’s concept.

Paper Nr: 5
Title:

Discovering the Deep Web through XML Schema Extraction

Authors:

Yasser Saissi, Ahmed Zellou and Ali Idri

Abstract: The part of the web accessible to search engines contains a vast amount of information. However, another part of the web, called the deep web, is accessible only through its associated HTML forms and contains much more information. The integration of deep web content presents many challenges that are not fully addressed by existing deep web access approaches. Integrating deep web data requires knowing the schema describing each deep web source. This paper presents our approach to extracting the XML schema that describes a selected deep web source. The extracted XML schema is then used to integrate the associated deep web source into a mediation system. The principle of our approach is to apply both a static and a dynamic analysis to the HTML forms giving access to the selected deep web source. We describe the algorithms of our approach and compare it to other existing approaches.

Paper Nr: 6
Title:

Grammar and Dictionary based Named-entity Linking for Knowledge Extraction of Evidence-based Dietary Recommendations

Authors:

Tome Eftimov, Barbara Koroušić Seljak and Peter Korošec

Abstract: To help people follow the new knowledge about healthy diets that arrives daily in newly published scientific reports, we present a grammar- and dictionary-based named-entity linking method that can be used for knowledge extraction of evidence-based dietary recommendations. The method consists of two phases: the first combines entity detection with the determination of a set of candidates for each entity, and the second performs candidate selection. We evaluate our method using a corpus of one-sentence dietary recommendations provided by the World Health Organization and the U.S. National Library of Medicine. The corpus consists of 50 dietary recommendations and 10 sentences that are not related to dietary recommendations. For 47 of the 50 dietary recommendations the proposed method extracts all the useful knowledge; for the remaining 3, only the information for one entity is missing. For the 10 sentences that are not dietary recommendations, the method extracts no entities, as expected.
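The first phase the abstract describes (entity detection plus candidate generation) can be sketched as a greedy longest-match dictionary lookup. All names and the toy food dictionary below are illustrative assumptions, not taken from the paper:

```python
def detect_entities(tokens, dictionary):
    """Greedy longest-match lookup of dictionary phrases in a token list.

    Returns (start, end, candidates) spans: each detected entity together
    with its set of candidate concepts, mimicking phase one of the method.
    """
    spans = []
    i = 0
    max_len = max((len(p) for p in dictionary), default=0)
    while i < len(tokens):
        match = None
        # try the longest possible phrase first, then shrink
        for j in range(min(len(tokens), i + max_len), i, -1):
            phrase = tuple(tokens[i:j])
            if phrase in dictionary:
                match = (i, j, dictionary[phrase])
                break
        if match:
            spans.append(match)
            i = match[1]
        else:
            i += 1
    return spans

# toy dictionary mapping phrases to candidate concepts (hypothetical labels)
dictionary = {
    ("saturated", "fat"): ["NUTRIENT:saturated_fat"],
    ("fat",): ["NUTRIENT:fat"],
    ("salt",): ["NUTRIENT:sodium", "FOOD:salt"],
}
tokens = "limit intake of saturated fat and salt".split()
spans = detect_entities(tokens, dictionary)
```

The second phase (candidate selection) would then pick one concept per span, e.g. using the grammar context; that step is omitted here.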

Paper Nr: 12
Title:

Skip Search Approach for Mining Probabilistic Frequent Itemsets from Uncertain Data

Authors:

Takahiko Shintani, Tadashi Ohmori and Hideyuki Fujita

Abstract: As data mining finds wider application, data uncertainty must be taken into account. In this paper, we study mining probabilistic frequent itemsets from uncertain data under the Possible World Semantics. Since each tuple in probabilistic data has an existential probability, the support of an itemset is a probability mass function (pmf). We propose a skip search approach that reduces the evaluation of support pmfs for redundant itemsets. Our approach starts evaluating support pmfs at the average length of the candidate itemsets. When an evaluated itemset is not probabilistic frequent, all of its supersets are deleted from the candidate itemsets and one of its subsets is selected as the next candidate to evaluate. When an evaluated itemset is probabilistic frequent, one of its supersets is selected as the next candidate to evaluate. Furthermore, our approach evaluates the support pmf by difference calculus using already evaluated itemsets. Thus, our approach reduces both the number of candidate itemsets whose support pmf must be evaluated and the cost of evaluating each support pmf. Finally, we show the effectiveness of our approach through experiments.
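The support pmf the abstract refers to can be computed by a standard Poisson-binomial dynamic programme over the tuples' existential probabilities. This is a minimal sketch of the underlying definition, not the authors' optimised difference-calculus implementation:

```python
def support_pmf(probs):
    """pmf of an itemset's support under Possible World Semantics.

    probs[i] is the existential probability that tuple i contains the
    itemset; pmf[k] is the probability that exactly k tuples contain it.
    """
    pmf = [1.0]
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, q in enumerate(pmf):
            nxt[k] += q * (1 - p)    # world where the tuple is absent
            nxt[k + 1] += q * p      # world where the tuple is present
        pmf = nxt
    return pmf

def is_probabilistic_frequent(probs, minsup, tau):
    """An itemset is probabilistic frequent iff P(support >= minsup) >= tau."""
    return sum(support_pmf(probs)[minsup:]) >= tau
```

The skip search strategy then decides, from this frequent/not-frequent outcome, whether to move to a superset or a subset next.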

Paper Nr: 16
Title:

Success Prediction System for Student Counseling using Data Mining

Authors:

Jörg Frochte and Irina Bernst

Abstract: We present a framework for using data mining of central exam data to predict student success in bachelor degree courses. For the prediction, a supervised learning approach is used, based on successful and unsuccessful student biographies. We develop a traffic-light rating system and present results for two different kinds of bachelor degree courses, one in economics and one in engineering. We discuss applications for students and student counseling institutions, as well as the limitations arising from information privacy aspects, especially under the conditions governing data mining in Germany.

Paper Nr: 22
Title:

Detecting User Emotions in Twitter through Collective Classification

Authors:

İbrahim İleri and Pinar Karagoz

Abstract: The explosion in the use of social networks has generated a large amount of data, including user opinions about a variety of subjects. Many text-based techniques have been proposed in the literature for classifying the sentiment of user postings. As a continuation of sentiment analysis, there are also studies on emotion analysis. Because many different emotions need to be dealt with, the problem becomes more complicated as the number of emotions to be detected increases. In this study, a different, user-centric approach to emotion detection is considered: connected users may be more likely to hold similar emotions, so leveraging relationship information can complement the emotion inference task in social networks. Employing Twitter as a source of experimental data and working with the proposed collective classification algorithm, the emotions of users are predicted in a collaborative setting.

Paper Nr: 29
Title:

Graph-based Rating Prediction using Eigenvector Centrality

Authors:

Dmitry Dolgikh and Ivan Jelínek

Abstract: Most recommender systems rely on statistical correlations in past, explicitly given user ratings for items (e.g. collaborative filtering). However, when data on past rating activity is insufficient, these systems face difficulties in rating prediction; this situation is commonly known as the cold-start problem. This paper describes how a graph-based representation and Social Network Analysis can help deal with the cold-start problem. We propose a method to predict a user rating based on the hypothesis that the rating of a node in the network corresponds to the ratings of the most important nodes connected to it. The proposed method was applied to three MovieLens datasets to evaluate rating prediction performance. The results obtained show the competitiveness of our method.
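The hypothesis in the abstract can be sketched with power-iteration eigenvector centrality and a centrality-weighted average over a node's rated neighbours. The graph, weights and prediction rule below are illustrative assumptions, not the paper's exact method:

```python
def eigenvector_centrality(adj, iters=200, tol=1e-12):
    """Power iteration for eigenvector centrality on an undirected graph
    given as {node: {neighbour: weight}}. The +c[n] self-loop shift keeps
    the iteration from oscillating on bipartite graphs."""
    c = {n: 1.0 for n in adj}
    for _ in range(iters):
        nxt = {n: c[n] + sum(c[m] * w for m, w in adj[n].items()) for n in adj}
        norm = sum(v * v for v in nxt.values()) ** 0.5
        nxt = {n: v / norm for n, v in nxt.items()}
        if max(abs(nxt[n] - c[n]) for n in adj) < tol:
            return nxt
        c = nxt
    return c

def predict_rating(neighbours, ratings, centrality):
    """Predict a rating as the centrality-weighted average of the known
    ratings of the connected nodes (illustrative prediction rule)."""
    num = sum(centrality[n] * ratings[n] for n in neighbours if n in ratings)
    den = sum(centrality[n] for n in neighbours if n in ratings)
    return num / den if den else None

# toy graph: user b is connected to raters a and c
adj = {"a": {"b": 1.0}, "b": {"a": 1.0, "c": 1.0}, "c": {"b": 1.0}}
cent = eigenvector_centrality(adj)
pred = predict_rating(["a", "c"], {"a": 4.0, "c": 2.0}, cent)
```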

Paper Nr: 30
Title:

Adopting Privacy Regulations in a Data Warehouse - A Case of the Anonimity versus Utility Dilemma

Authors:

Chaïm van Toledo and Marco Spruit

Abstract: This paper investigates how privacy can be protected in a data warehouse while, at the same time, an organisation tries to be as open as possible. First, we perform a literature review of relevant techniques and methods to preserve privacy and show that k-anonymity can be applied to comply with an organisation’s requirements. Then, we propose two design strategies for adopting privacy regulations within a data warehouse. The first proposal shows that, during the ETL process, a data transformation can be performed to effectively realise anonymised records in a business intelligence environment. The second proposal shows that anonymisation can also be arranged with views and a HAVING clause.
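The k-anonymity property and the effect of the HAVING-clause strategy can be sketched as follows (toy records and column names are illustrative, not from the paper):

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """True iff every combination of quasi-identifier values occurs in at
    least k rows, so no individual can be singled out by those columns."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)
    return all(count >= k for count in groups.values())

def k_anonymous_view(rows, quasi_identifiers, k):
    """Keep only groups of size >= k: roughly what a
    GROUP BY ... HAVING COUNT(*) >= k view achieves."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)
    return [r for r in rows
            if groups[tuple(r[q] for q in quasi_identifiers)] >= k]

rows = [
    {"zip": "123", "age": "30-40", "diagnosis": "flu"},
    {"zip": "123", "age": "30-40", "diagnosis": "cold"},
    {"zip": "999", "age": "20-30", "diagnosis": "flu"},
]
qi = ["zip", "age"]
```

Dropping the small groups trades utility (lost rows) for anonymity, which is exactly the dilemma named in the title.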

Paper Nr: 33
Title:

A Conceptual Model of Actors and Interactions for the Knowledge Discovery Process

Authors:

Lauri Tuovinen

Abstract: The knowledge discovery process is traditionally viewed as a sequence of operations to be applied to data; the human aspect of the process is seldom taken into account, and when it is, it is mainly the roles and actions of domain and technology experts that are considered. However, non-experts can also play an important role in knowledge discovery, and furthermore, the role of technology in the process may also be substantially expanded from what it traditionally has been, with special software facilitating interactions among human actors and even operating as an actor in its own right. This diversification of the knowledge discovery process is helpful in finding tenable solutions to the new problems presented by the current deluge of digital data, but only if the process model used to manage the process adequately represents the variety of forms that the process can take. The paper addresses this requirement by presenting a conceptual model that can be used to describe different types of knowledge discovery processes in terms of the actors involved and the interactions they have with one another. Additionally, the paper discusses how the interactions can be facilitated to provide effective support for each different type of process. As a future perspective, the paper considers the implications of intelligent software taking on responsibilities traditionally reserved for human actors.

Paper Nr: 36
Title:

Exploration and Visualization of Big Graphs - The DBpedia Case Study

Authors:

Enrico G. Caldarola, Antonio Picariello, Antonio M. Rinaldi and Marco Sacco

Abstract: Data and information visualization is increasingly becoming strategic for the exploration and explanation of large data sets. The Big Data paradigm pushes for new technological solutions to deal with the big volume and big variety of today's data. Not surprisingly, a plethora of new tools have emerged, each with pros and cons, but all espousing the cause of the "Bigness of Data". In this paper, we take one of these emerging tools, namely Neo4J, and stress its capabilities to import, query and visualize data from a big case study: DBpedia. We describe each step of this study, focusing on the strategies used to overcome the various problems, mainly due to the intricate nature of the case study and its volume. We consider both the intensional schema of DBpedia and its extensional part in order to obtain the best result in its visualization. Finally, we attempt to define some criteria to simplify the large-scale visualization of DBpedia, providing some examples and the considerations that have arisen. The ultimate goal of this work is to investigate techniques and approaches to gain more insight from the visual representation and analytics of large graph databases.

Paper Nr: 45
Title:

An Analysis of Online Twitter Sentiment Surrounding the European Refugee Crisis

Authors:

David Pope and Josephine Griffith

Abstract: Using existing natural language and sentiment analysis techniques, this study explores different dimensions of mood states of tweet content relating to the refugee crisis in Europe. The study has two main goals. The first goal is to compare the mood states of negative emotion, positive emotion, anger and anxiety across two populations (English and German speaking). The second goal is to discover if a link exists between significant real-world events relating to the refugee crisis and online sentiment on Twitter. Gaining an insight into this comparison and relationship can help us firstly, to better understand how these events shape public attitudes towards refugees and secondly, how online expressions of emotion are affected by significant events.

Paper Nr: 50
Title:

Efficient Social Network Multilingual Classification using Character, POS n-grams and Dynamic Normalization

Authors:

Carlos-Emiliano González-Gallardo, Juan-Manuel Torres-Moreno, Azucena Montes Rendón and Gerardo Sierra

Abstract: In this paper we describe a dynamic normalization process applied to multilingual social network documents (Facebook and Twitter) to improve the performance of the author profiling task for short texts. After the normalization process, character n-grams and POS-tag n-grams are obtained to extract all the possible stylistic information encoded in the documents (emoticons, character flooding, capital letters, references to other users, hyperlinks, hashtags, etc.). Experiments with SVM achieved up to 90% performance.
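The normalization and character n-gram steps can be sketched as below. The placeholder tags and flooding rule are assumed categories for illustration (the paper's exact scheme may differ), and POS tagging is omitted:

```python
import re

def normalize(text):
    """Toy dynamic normalization: map Twitter-specific tokens to
    placeholder tags and collapse character flooding to two letters."""
    text = re.sub(r"https?://\S+", "<URL>", text)    # hyperlinks
    text = re.sub(r"@\w+", "<USER>", text)           # user references
    text = re.sub(r"#\w+", "<HASHTAG>", text)        # hashtags
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)       # 'sooooo' -> 'soo'
    return text

def char_ngrams(text, n):
    """All overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]
```

The resulting n-gram counts would then feed an SVM, as in the experiments the abstract reports.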

Paper Nr: 53
Title:

Recommending Groups to Users based Both on Their Textual and Image Posts

Authors:

Elias Oliveira, Howard Roatti, Gustavo Ramos Lima and Patrick Marques Ciarelli

Abstract: This article focuses on the recommendation of Facebook groups to users, based on each user's posting profile on Facebook. To accomplish this task, texts and images provided by users are used as sources of information, and the experiments showed that the combination of these two types of information gives results better than or equal to those obtained when the data are used separately. The approach proposed in this paper is simple and promising for recommending Facebook groups.

Paper Nr: 61
Title:

The Longest Common Subsequence Distance using a Complexity Factor

Authors:

Octavian Lucian Hasna and Rodica Potolea

Abstract: In this paper we study the classic longest common subsequence problem and use the length of the longest common subsequence as a similarity measure between two time series. We propose an original algorithm for computing the approximate length of the LCSS that uses a discretization step, a complexity-invariant factor and a dynamic threshold for skipping part of the computation.
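For reference, the exact quantity the paper approximates is the classic LCSS length, computable with the standard dynamic programme below (the paper's discretization, complexity factor and skipping are not reproduced here):

```python
def lcss_length(a, b):
    """Length of the longest common subsequence of sequences a and b.

    Classic O(len(a) * len(b)) dynamic programme, kept to two rows of
    the table since only the previous row is ever needed.
    """
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]
```

Normalising this length by the series lengths turns it into the similarity measure mentioned in the abstract.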

Paper Nr: 64
Title:

Syllabification with Frequent Sequence Patterns - A Language Independent Approach

Authors:

Adrian Bona, Camelia Lemnaru and Rodica Potolea

Abstract: In this paper we show how words represented as sequences of syllables can provide valuable patterns for achieving language-independent syllabification. We present a novel approach to word syllabification based on frequent pattern mining, as well as a more general framework for syllabification. Preliminary evaluations on Romanian and English words indicated a word-level accuracy of around 77% for Romanian and around 70% for English. However, we believe the method can be refined to improve performance.

Paper Nr: 66
Title:

Using Deep Learning in Arabic-English Cross Language Information Retrieval

Authors:

Omar Attia, Michael Azmy, Ahmed Abu Emeira, Karim El Azzouni, Omar Hussein, Nagwa M. El-Makky and Khaled Nagi

Abstract: In this paper, we apply a combination of keyword-based information retrieval with a latent semantic-based model in Arabic-English Cross-Language Information Retrieval (CLIR). We aim at enabling Arabic-speaking users to access English content using their native language. Due to the complex morphology of Arabic, we use deep learning for both Arabic and English text to extract the deep semantics in the given languages and then use canonical correlation analysis (CCA) to project the semantic representations into a shared space in which retrieval can be done easily. We evaluate our system on selected articles from Wikipedia and show that the proposed combination can improve the state-of-the-art keyword-based CLIR performance.

Paper Nr: 71
Title:

Development of Domains and Keyphrases along Years

Authors:

Yaakov HaCohen-Kerner and Meir Hassan

Abstract: This paper presents a methodology (including a detailed algorithm, various development concepts and measures, and stopword lists) for measuring the development of domains and keyphrases over the years. The examined corpus contains 1020 articles accepted for full presentation at PACLIC over the last 18 years. The experimental results for 5 chosen domains (digital humanities, language resources, machine translation, sentiment analysis and opinion mining, and social media) suggest that development trends of domains and keyphrases can be efficiently measured. Top bigrams and trigrams were found to be effective for identifying general trends in NLP domains.

Paper Nr: 72
Title:

Gender Clustering of Blog Posts using Distinguishable Features

Authors:

Yaakov HaCohen-Kerner, Yarden Tzach and Ori Asis

Abstract: The aim of this research is to find out how to perform effective clustering by gender of unlabeled personal blog posts written in English. Given a gender-labeled blog corpus and a blog corpus that is not gender-labeled, we extracted from the labeled corpus distinguishable unigrams for both males and females. Then, we defined two general features that represent the relative frequencies of the distinguishable male and female unigrams (the males' frequency and the females' frequency). The best distinguishing feature was found to be the males' frequency with a ratio factor of at least 1.4 times that of the females. This feature leads to an accuracy rate of 83.7% for gender clustering of the unlabeled blog corpus. To the best of our knowledge, this study presents two novelties: (1) it is the first study to cluster blog posts by gender, and (2) it clusters an unlabeled corpus using distinguishable features extracted from a labeled corpus.
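The two general features and the 1.4 ratio rule from the abstract can be sketched as follows. The word lists and the exact assignment rule are illustrative; only the two relative-frequency features and the 1.4 threshold come from the abstract:

```python
def gender_features(tokens, male_words, female_words):
    """The two general features: relative frequencies of the
    distinguishable male and female unigrams in a post."""
    n = len(tokens) or 1
    m = sum(t in male_words for t in tokens) / n
    f = sum(t in female_words for t in tokens) / n
    return m, f

def cluster_by_ratio(tokens, male_words, female_words, ratio=1.4):
    """Assign a cluster when one frequency is at least `ratio` times the
    other (a sketch of using the best distinguishing feature)."""
    m, f = gender_features(tokens, male_words, female_words)
    if m > 0 and m >= ratio * f:
        return "male"
    if f > 0 and f >= ratio * m:
        return "female"
    return "unknown"
```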

Paper Nr: 77
Title:

The Flightschedule Profiler: An Attempt to Synthetise Visually an Airport’s Flight Offer in Time and Space

Authors:

Jean-Yves Blaise and Iwona Dudek

Abstract: Online route planners and travel reservation systems have become part of our everyday lives in recent years. Such sites, originating from the airlines themselves or oriented towards "search and compare" tasks, do provide valuable services. But the very nature of the queries users formulate (ultimate result: one flight) limits the type of information one can expect to retrieve, and in particular does not allow one to get an overall view of an airport's flight offer over time and in space. In this contribution we introduce a proof-of-concept visualisation that sums up, in a synthetic way, the [where to, when to] profile of an airport: its realm of possibilities. The visualisation acts as an upstream service, independent of any actual reservation loop: its main role is to help unveil significant spatio-temporal patterns (densities and continuity over time, for instance). The prototype is implemented on a real-life data set: the winter 2013/2014 schedule of the airport in Nice. Ultimately, beyond a discussion of the issue and of the pluses and minuses of the prototype, this position paper questions the way travel data is presented, and as such can promote debate over the potential impact of information visualisation solutions in that context.

Paper Nr: 81
Title:

Rating Prediction with Contextual Conditional Preferences

Authors:

Aleksandra Karpus, Tommaso Di Noia, Paolo Tomeo and Krzysztof Goczyla

Abstract: Exploiting contextual information is considered a good way to improve the quality of recommendations, aiming at suggesting more relevant items for a specific context. On the other hand, recommender systems research still strives to solve the cold-start problem, namely the situation where not enough information about users and their ratings is available. In this paper we propose a new rating prediction algorithm to face the cold-start system scenario, based on a user interest model called contextual conditional preferences. We present results obtained on three publicly available data sets in comparison with several state-of-the-art baselines. We show that the use of contextual conditional preferences improves prediction accuracy, even when all users have provided only a little feedback and hence a small amount of data is available.

Paper Nr: 84
Title:

Exploring Urban Tree Site Planting Selection in Mexico City through Association Rules

Authors:

Héctor Javier Vázquez and Mihaela Juganaru-Mathieu

Abstract: In this paper we present an exploration of association rules to determine planting sites from urban trees' characteristics. In a first step, itemsets and rules are generated using the unsupervised Apriori algorithm and are quickly characterized in terms of tree planting sites. In a second step, planting sites are fixed as target values to establish rules (a supervised version of the Apriori algorithm). An original approach is also presented and validated for predicting the planting site of a species.
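The unsupervised first step relies on the standard Apriori algorithm, which can be sketched as below (the toy transactions are illustrative, not the paper's tree data):

```python
def apriori(transactions, minsup):
    """Frequent itemsets by level-wise Apriori: generate candidates by
    joining frequent sets of the previous level, keep those whose
    support (number of containing transactions) reaches minsup."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(itemset <= t for t in transactions)

    items = {i for t in transactions for i in t}
    level = {frozenset([i]) for i in items if support(frozenset([i])) >= minsup}
    result = {}                         # frequent itemset -> support
    while level:
        for s in level:
            result[s] = support(s)
        size = len(next(iter(level))) + 1
        # candidate generation: union of two frequent sets one item larger
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        level = {c for c in candidates if support(c) >= minsup}
    return result

# toy transactions: characteristics observed together per tree
txns = [("bench", "tree", "park"), ("tree", "park"), ("tree", "street")]
frequent = apriori(txns, minsup=2)
```

Rules such as {tree} => {park} would then be derived from these itemsets by comparing supports; the supervised second step restricts the rule consequent to planting-site items.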

Paper Nr: 85
Title:

Anti-cancer Drug Activity Prediction by Ensemble Learning

Authors:

Ertan Tolan and Mehmet Tan

Abstract: Personalized cancer treatment is an ever-evolving approach due to the complexity of cancer. As part of personalized therapy, the effectiveness of a drug on a cell line is measured. However, these experiments are laborious and expensive. To surmount these difficulties, computational methods are used on the available data sets. In the present study, we treat this as a regression problem and design an ensemble model that combines three different regression models to reduce the prediction error for each drug-cell line pair. Two major data sets were used to evaluate our method. The results of this evaluation show that the predictions of the ensemble method are significantly better than those of the individual models. Furthermore, we report our model's cytotoxicity predictions for the drug-cell line pairs that do not appear in the original data sets.

Paper Nr: 86
Title:

Computing Ideal Number of Test Subjects - Sensorial Map Parametrization

Authors:

Jessica Dacleu Ndengue, Mihaela Juganaru-Mathieu and Jenny Faucheu

Abstract: A sensory analysis was carried out, using a special Napping table, on two different sets of products in order to investigate the texture perception of materials; the tests were done using a human panel. The collected data were analyzed through multiple factor analysis (MFA), which is a particular case of principal component analysis (PCA). The aim of this study is to determine the minimum number of subjects in the human panel that can guarantee a meaningful statistical analysis of the data and thus allow a better understanding of the sensory results. We built a particular function that measures the similarity between two representations (two matrices) computed from the output of the Napping table. Based on this function, and using the whole datasets, an algorithm able to measure the robustness is implemented. On the two datasets, we found that a minimum of 10 to 12 subjects seems to ensure a stable and robust statistical analysis of the sensory results.

Paper Nr: 87
Title:

Big Data and Deep Analytics Applied to the Common Tactical Air Picture (CTAP) and Combat Identification (CID)

Authors:

Ying Zhao, Tony Kendall and Bonnie Johnson

Abstract: Accurate combat identification (CID) enables warfighters to locate and identify critical airborne objects as friendly, hostile or neutral with high precision. The current CID processes include processing and analysing data from a vast network of sensors, platforms, and decision makers. CID plays an important role in generating the Common Tactical Air Picture (CTAP) which provides situational awareness to air warfare decision-makers. The Big “CID” Data and complexity of the problem pose challenges as well as opportunities. In this paper, we discuss CTAP and CID challenges and some Big Data and Deep Analytics solutions to address these challenges. We present a use case using a unique deep learning method, Lexical Link Analysis (LLA), which is able to associate heterogeneous data sources for object recognition and anomaly detection, both of which are critical for CTAP and CID applications.

Paper Nr: 88
Title:

SPARQL Query Generation based on RDF Graph

Authors:

Mohamed Kharrat, Anis Jedidi and Faiez Gargouri

Abstract: Data retrieval is becoming more difficult due to the heterogeneity and huge amount of data flowing on the Web. On the other hand, novice users cannot handle querying languages (e.g., SPARQL) or knowledge-based techniques. To simplify the querying process, we introduce in this paper a proposal for automatic SPARQL query generation based on user-supplied keywords. The construction of a SPARQL query is based on the top relevant RDF subgraph, selected from our RDF triplestore. The latter relies on our defined semantic network and on our contextual schema, both published in two papers from our previous studies. We evaluate 50 queries using three measures. The results show an F-score of about 50%. This proposal is already implemented as a web interface, and the whole query interpretation and processing is performed through this interface.
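The overall idea of turning keywords into a SPARQL query via a relevant subgraph can be sketched as a toy stand-in (keyword matching over triples and simple triple-pattern templating; the paper's semantic network and contextual schema are not modelled here):

```python
def build_sparql(keywords, triples):
    """Select triples whose subject, predicate or object mentions a
    keyword, then emit a SPARQL query over those triple patterns."""
    relevant = [(s, p, o) for s, p, o in triples
                if any(k.lower() in (s + " " + p + " " + o).lower()
                       for k in keywords)]
    patterns = [f'?s <{p}> "{o}" .' for _, p, o in relevant] or ["?s ?p ?o ."]
    return "SELECT ?s WHERE {\n  " + "\n  ".join(patterns) + "\n}"

# hypothetical triples from an RDF store
triples = [("m1", "hasDirector", "Kubrick"), ("m1", "hasGenre", "Drama")]
query = build_sparql(["kubrick"], triples)
```

A real generator would rank candidate subgraphs and bind literals versus URIs properly; this sketch only shows the keyword-to-pattern mapping.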

Paper Nr: 89
Title:

A Novel Clustering Algorithm to Capture Utility Information in Transactional Data

Authors:

Piyush Lakhawat, Mayank Mishra and Arun Somani

Abstract: We design and develop a novel clustering algorithm to capture utility information in transactional data. Transactional data is a special type of categorical data in which transactions can be of varying length. A key objective of all categorical data analysis is pattern recognition; therefore, transactional clustering algorithms focus on capturing information on high-frequency patterns from the data in the clusters. Recently, utility information for category types has been added to the transactional data model for a more realistic representation of data. As a result, the key information of interest has become high-utility patterns instead of high-frequency patterns. To the best of our knowledge, no existing clustering algorithm for transactional data captures utility information in the clusters found. Along with our new clustering rationale, we also develop corresponding metrics for evaluating the quality of the clusters found. Experiments on real datasets show that the clusters found by our algorithm successfully capture the high-utility patterns in the data. Comparative experiments with other clustering algorithms further illustrate the effectiveness of our algorithm.
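The shift from high-frequency to high-utility patterns mentioned in the abstract rests on the standard utility definition sketched below (toy data and names are illustrative, not the paper's):

```python
def itemset_utility(itemset, transactions, utilities):
    """Total utility of an itemset: for every transaction containing all
    of its items, add each item's external utility times the quantity
    recorded in that transaction."""
    total = 0
    for t in transactions:              # t maps item -> quantity bought
        if all(i in t for i in itemset):
            total += sum(utilities[i] * t[i] for i in itemset)
    return total

# toy transactions with quantities, and per-item external utilities
txns = [{"a": 2, "b": 1}, {"a": 1}, {"b": 3}]
util = {"a": 5, "b": 2}
```

A high-utility pattern can have low frequency (few, large, valuable purchases), which is why frequency-oriented clustering misses it.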

Short Papers
Paper Nr: 90
Title:

Lexicon Expansion System for Domain and Time Oriented Sentiment Analysis

Authors:

Nuno Guimaraes, Luis Torgo and Alvaro Figueira

Abstract: In sentiment analysis, the polarity of a text is often assessed by resorting to sentiment lexicons, which usually consist of verbs and adjectives with an associated positive or negative value. However, in short informal texts like tweets or web comments, the absence of such words does not necessarily indicate that the text lacks opinion. Tweets like "First Paris, now Brussels... What can we do?" imply opinion despite not using words present in sentiment lexicons, owing instead to the general sentiment or public opinion associated with terms in a specific time and domain. In order to complement general sentiment dictionaries with such domain- and time-specific terms, we propose a novel system for lexicon expansion that automatically extracts the most relevant and up-to-date terms in several different domains and then assesses their sentiment through Twitter. Experimental results show 82% accuracy in extracting domain- and time-specific terms and 80% in correct polarity assessment. These results provide evidence that our lexicon expansion system can extract terms and determine their sentiment for domain- and time-specific corpora in a fully automatic fashion.