KDIR 2017 Abstracts


Full Papers
Paper Nr: 1
Title:

Top-k Keyword Search over Wikipedia-based RDF Knowledge Graphs

Authors:

Hrag Yoghourdjian, Shady Elbassuoni, Mohamad Jaber and Hiba Arnaout

Abstract: Effective keyword search over RDF knowledge graphs is still an ongoing endeavor. Most existing techniques have their own limitations in terms of the unit of retrieval, the type of queries supported or the basis on which the results are ranked. In this paper, we develop a novel retrieval model for general keyword queries over Wikipedia-based RDF knowledge graphs. Our model retrieves the top-k scored subgraphs for a given keyword query. To do this, we develop a scoring function for RDF subgraphs and then we deploy a graph searching algorithm that only retrieves the top-k scored subgraphs for the given query based on our scoring function. We evaluate our retrieval model and compare it to state-of-the-art approaches using YAGO, a large Wikipedia-based RDF knowledge graph.

Paper Nr: 2
Title:

An Information Theory Subspace Analysis Approach with Application to Anomaly Detection Ensembles

Authors:

Marcelo Bacher, Irad Ben-Gal and Erez Shmueli

Abstract: Identifying anomalies in multi-dimensional datasets is an important task in many real-world applications. A special case arises when anomalies are occluded in a small set of attributes (i.e., subspaces) of the data and not necessarily over the entire data space. In this paper, we propose a new subspace analysis approach named Agglomerative Attribute Grouping (AAG) that aims to address this challenge by searching for subspaces that comprise highly correlated attributes. Such correlations among attributes represent a systematic interaction among the attributes that can better reflect the behavior of normal observations and hence can be used to improve the identification of future abnormal data samples. AAG relies on a novel multi-attribute metric derived from information theory measures of partitions to evaluate the “information distance” between groups of data attributes. The empirical evaluation demonstrates that AAG outperforms state-of-the-art subspace analysis methods, when they are used in anomaly detection ensembles, both in cases where anomalies are occluded in relatively small subsets of the available attributes and in cases where anomalies represent a new class (i.e., novelties). Finally, and in contrast to existing methods, AAG does not require any tuning of parameters.
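The abstract does not give AAG's exact multi-attribute metric, but a standard information-theoretic distance on partitions in the same family is the variation of information, VI(X, Y) = H(X) + H(Y) - 2·I(X; Y). A minimal illustrative sketch for two discrete attributes:

```python
from collections import Counter
from math import log2

def entropy(column):
    """Shannon entropy (in bits) of a discrete attribute given as a list."""
    n = len(column)
    return -sum((c / n) * log2(c / n) for c in Counter(column).values())

def variation_of_information(x, y):
    """VI(X, Y) = H(X) + H(Y) - 2 * I(X; Y), a true metric on partitions.
    Illustrative only: the paper's metric generalizes to attribute groups."""
    joint = entropy(list(zip(x, y)))            # joint entropy H(X, Y)
    mutual = entropy(x) + entropy(y) - joint    # mutual information I(X; Y)
    return entropy(x) + entropy(y) - 2 * mutual

a = [0, 0, 1, 1]
b = [1, 1, 0, 0]          # a relabelling of a: distance 0
c = [0, 1, 0, 1]          # independent of a: maximal distance here
assert variation_of_information(a, b) == 0.0
assert variation_of_information(a, c) > variation_of_information(a, b)
```

Attributes that are mere relabellings of one another sit at distance zero, which is why agglomerating by such a metric groups systematically interacting attributes together.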

Paper Nr: 3
Title:

Closed Contour Extraction based on Structural Models and Their Likelihood Analysis

Authors:

Jiayin Liu and Yuan Wu

Abstract: In this paper, we describe a new algorithm for extracting closed contours inside images by introducing three basic structural models to describe all potentially closed contour candidates and their likelihood analysis to eliminate pixels of non-closed contours. To further enhance the performance of its closed contour extraction, a post-processing method based on edge intensity analysis is also added to the proposed algorithm to reduce false positives. To illustrate its effectiveness and efficiency, we applied the proposed algorithm to the casting defect detection problem and carried out extensive experiments organized in three phases. The results support that the proposed algorithm outperforms existing representative techniques in extracting closed contours for a range of images, including artificial images, standard casting defect images from ASTM (American Society for Testing and Materials) and real casting defect images collected directly from industrial lines. Experimental results also illustrate that the proposed algorithm achieves a certain level of robustness in casting defect detection in noisy environments.

Paper Nr: 5
Title:

On Deep Learning in Cross-Domain Sentiment Classification

Authors:

Giacomo Domeniconi, Gianluca Moro, Andrea Pagliarani and Roberto Pasolini

Abstract: Cross-domain sentiment classification consists in distinguishing positive and negative reviews of a target domain by using knowledge extracted and transferred from a heterogeneous source domain. Cross-domain solutions aim at overcoming the costly pre-classification of each new training set by human experts. Despite the potential business relevance of this research thread, the existing ad hoc solutions do not yet scale to large real-world text sets. Scalable Deep Learning techniques have been effectively applied to in-domain text classification, by training and categorising documents belonging to the same domain. This work analyses the cross-domain efficacy of a well-known unsupervised Deep Learning approach for text mining, called Paragraph Vector, comparing its performance with a method based on Markov Chains developed ad hoc for cross-domain sentiment classification. The experiments show that, once enough data is available for training, Paragraph Vector achieves accuracy equivalent to the Markov Chain method both in-domain and cross-domain, despite having no explicit transfer learning capability. The outcome suggests that combining Deep Learning with transfer learning techniques could be a breakthrough for ad hoc cross-domain sentiment solutions in big data scenarios. This view is supported by a simple multi-source experiment we carried out to improve transfer learning, which increases the accuracy of cross-domain sentiment classification.

Paper Nr: 13
Title:

Accurate Continuous and Non-intrusive User Authentication with Multivariate Keystroke Streaming

Authors:

Abdullah Alshehri, Frans Coenen and Danushka Bollegala

Abstract: In this paper, we demonstrate a novel mechanism for continuous authentication of computer users using keystroke dynamics. The mechanism models keystroke timing features, Flight time (the time between consecutive keys) and Hold time (the duration of a key press), as a multivariate time series which serves to dynamically capture typing patterns in real/continuous time. The proposed method differs from previous approaches to continuous authentication using keystroke dynamics, which are founded on feature vector representations that limit real-time analysis, due to the computationally expensive processing of the vectors, and also yield poor authentication accuracy. The proposed mechanism is compared to a feature vector based approach, taken from the literature, over two datasets. The results indicate superior performance of the proposed multivariate time series mechanism for continuous authentication using keystroke dynamics.
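As a rough sketch of the two timing channels the abstract names, Hold and Flight times can be derived from raw keystroke events. The (key, press_time, release_time) event format here is an assumption for illustration, not the paper's data format:

```python
def keystroke_series(events):
    """Convert (key, press_time, release_time) events into a two-channel
    multivariate series: hold time per key and flight time between
    consecutive keys. The event tuple format is an assumption."""
    holds = [release - press for _, press, release in events]
    flights = [events[i + 1][1] - events[i][2] for i in range(len(events) - 1)]
    return holds, flights

events = [("h", 0.00, 0.08), ("i", 0.15, 0.21), ("!", 0.30, 0.35)]
holds, flights = keystroke_series(events)
assert len(holds) == 3 and len(flights) == 2   # n keys, n-1 gaps
```

The two lists together form the multivariate stream that the proposed mechanism matches in continuous time, rather than collapsing them into a fixed feature vector.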

Paper Nr: 20
Title:

Novel Semantics-based Distributed Representations for Message Polarity Classification using Deep Convolutional Neural Networks

Authors:

Abhinay Pandya and Mourad Oussalah

Abstract: Unsupervised learning of distributed representations (word embeddings) obviates the need for task-specific feature engineering for various NLP applications. However, such representations learned from massive text datasets do not faithfully represent the finer semantic information in the feature space required by specific applications. This is owing to the fact that (a) models learning such representations ignore the linguistic structure of the sentences, (b) they fail to capture polysemous usages of the words, and (c) they ignore pre-existing semantic information from manually-created ontologies. In this paper, we propose three semantics-based distributed representations of words and phrases as features for message polarity classification: Sentiment-Specific Multi-Word Expressions Embeddings (SSMWE) are sentiment-encoded distributed representations of multi-word expressions (MWEs); Sense-Disambiguated Word Embeddings (SDWE) are sense-specific distributed representations of words; and WordNet Embeddings (WNE) are distributed representations of the hypernym and hyponym of the correct sense of a given word. We examine the effects of these features incorporated in a convolutional neural network (CNN) model for evaluation on the SemEval benchmarked dataset. Our approach of using these novel features yields a 14.24% improvement in the macro-averaged F1 score on SemEval datasets over existing methods. While we have shown promising results in Twitter sentiment classification, we believe that the method is general enough to be applied to many NLP applications where finer semantic analysis is required.

Paper Nr: 26
Title:

Passage Level Evidence for Effective Document Level Retrieval

Authors:

Ghulam Sarwar, Colm O'Riordan and John Newell

Abstract: Several researchers have considered the use of passages within documents as useful units of representation, as individual passages may accurately capture the topic of discourse in a document. In this work, each document is indexed as a series of unique passages. We explore and analyse a number of similarity measures which take into account the similarity at passage level with the aim of improving the quality of the answer set. We define a number of such passage level approaches and compare their performance. Mean average precision (MAP) and precision at k documents (P@k) are used as measures of the quality of the approaches. The results show that for the different test collections, the rank of a passage is a useful measure, and when used separately or in conjunction with the document score it can give better results compared to other passage or document level similarity approaches.
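The abstract's central idea, blending a document-level score with passage-level evidence such as the rank of a document's best passage, can be sketched as follows. The reciprocal-rank blend and the `alpha` weight are illustrative choices, not the paper's exact combination:

```python
def rerank(doc_scores, best_passage_rank, alpha=0.5):
    """Blend a document-level score with the reciprocal rank of the
    document's best passage (1 = top-ranked passage). Documents with no
    retrieved passage get zero passage-level credit."""
    def combined(d):
        return alpha * doc_scores[d] + (1 - alpha) / best_passage_rank.get(d, float("inf"))
    return sorted(doc_scores, key=combined, reverse=True)

docs = {"d1": 0.4, "d2": 0.6}
ranks = {"d1": 1, "d2": 5}        # d1 holds the top-ranked passage
assert rerank(docs, ranks) == ["d1", "d2"]
```

Here d1's weaker document score is overcome by its top-ranked passage, which is the behaviour the abstract reports: passage rank, alone or combined with the document score, improves the answer set.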

Paper Nr: 27
Title:

Predicting Future Interests in a Research Paper Recommender System using a Community Centric Tree of Concepts Model

Authors:

Modhi Al Alshaikh, Gulden Uchyigit and Roger Evans

Abstract: Our goal in this paper is to predict a user’s future interests in the research paper domain. Content-based recommender systems can recommend a set of papers that relate to a user’s current interests. However, they may not be able to predict a user’s future interests. Collaborative filtering approaches may predict a user’s future interests for movies, music or e-commerce domains. However, existing collaborative filtering approaches are not appropriate for the research paper domain, because they depend on large numbers of user ratings which are not available in the research paper domain. In this paper, we present a novel collaborative filtering method that does not depend on user ratings. Our novel method computes the similarity between users according to user profiles which are represented using the dynamic normalized tree of concepts model using the 2012 ACM Computing Classification System (CCS) ontology. Further, a community-centric tree of concepts is generated and used to make recommendations. Offline evaluations are performed using the BibSonomy dataset. Our model is compared with two baselines. The results show that our model significantly outperforms the two baselines and avoids the problem of sparsity.

Paper Nr: 35
Title:

Deriving Realistic Mathematical Models from Support Vector Machines for Scientific Applications

Authors:

Andrea Murari, Emmanuele Peluso, Saeed Talebzadeh, Pasqualino Gaudio, Michele Lungaroni, Ondrej Mikulin, Jesus Vega and Michela Gelfusa

Abstract: In many scientific applications, it is necessary to perform classification, which means discrimination between examples belonging to different classes. Machine Learning tools have proved to be highly effective at this task and can achieve very high success rates. On the other hand, the “realism” and interpretability of their results are very low, limiting their applicability. In this paper, a method to derive manageable equations for the hypersurface between classes is presented. The main objective consists of formulating the results of machine learning tools in a way that represents the actual “physics” behind the phenomena under investigation. The proposed approach is based on a suitable combination of Support Vector Machines and Symbolic Regression via Genetic Programming; it has been investigated with a series of systematic numerical tests, for different types of equations and classification problems, and tested with various experimental databases. The obtained results indicate that the proposed method permits finding a good trade-off between the accuracy of the classification and the complexity of the derived mathematical equations. Moreover, the derived models can be tuned to reflect the actual phenomena, providing a very useful tool to bridge the gap between data, machine learning tools and scientific theories.

Paper Nr: 39
Title:

Efficient and Effective Single-Document Summarizations and a Word-Embedding Measurement of Quality

Authors:

Liqun Shao, Hao Zhang, Ming Jia and Jie Wang

Abstract: Our task is to generate an effective summary for a given document with specific real-time requirements. We use the softplus function to enhance keyword rankings to favor important sentences, based on which we present a number of summarization algorithms using various keyword extraction and topic clustering methods. We show that our algorithms meet the real-time requirements and yield the best ROUGE recall scores on DUC-02 over all previously known algorithms. To evaluate the quality of summaries without human-generated benchmarks, we define a measure called WESM based on word embedding using Word Mover’s Distance. We show that the orderings of the ROUGE and WESM scores of our algorithms are highly comparable, suggesting that WESM may serve as a viable alternative for measuring the quality of a summary.
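A minimal sketch of the softplus enhancement the abstract mentions: softplus(x) = log(1 + eˣ) keeps every keyword weight positive while amplifying strong keywords, and sentences are then scored by their summed enhanced weights. This illustrates the idea only; it is not the paper's exact scoring function:

```python
from math import exp, log

def softplus(x):
    """softplus(x) = log(1 + e^x): smooth, positive, monotone increasing."""
    return log(1 + exp(x))

def score_sentence(words, keyword_rank):
    """Score a sentence by its summed softplus-enhanced keyword weights.
    keyword_rank maps keywords to raw rank scores (e.g. from a keyword
    extractor such as TextRank)."""
    return sum(softplus(keyword_rank[w]) for w in words if w in keyword_rank)

ranks = {"summary": 2.0, "algorithm": 1.5}
on_topic = ["the", "summary", "algorithm"]
off_topic = ["the", "other", "story"]
assert score_sentence(on_topic, ranks) > score_sentence(off_topic, ranks)
```

Sentences containing highly ranked keywords dominate the ordering, which is what lets the summarizer favor important sentences.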

Paper Nr: 46
Title:

A Clustering based Prediction Scheme for High Utility Itemsets

Authors:

Piyush Lakhawat, Mayank Mishra and Arun Somani

Abstract: We strongly believe that the current Utility Itemset Mining (UIM) problem model can be extended with a key modeling capability of predicting future itemsets based on prior knowledge of clusters in the dataset. Information in transactions fairly representative of a cluster type is more a characteristic of the cluster type than of the entire data. Subjecting such transactions to the common threshold in the UIM problem leads to information loss. We identify that an implicit use of the cluster structure of data in the UIM problem model will address this limitation. We achieve this by introducing a new clustering based utility in the definition of the UIM problem model and modifying the definitions of absolute utilities based on it. This enhances the UIM model by including a predictive aspect, thereby enabling cluster-specific patterns to emerge while still mining the inter-cluster patterns. By performing experiments on two real data sets we verify that our proposed predictive UIM problem model extracts more useful information than the current UIM model with high accuracy.

Paper Nr: 48
Title:

Detecting and Assessing Contextual Change in Diachronic Text Documents using Context Volatility

Authors:

Christian Kahmann, Andreas Niekler and Gerhard Heyer

Abstract: Terms in diachronic text corpora may exhibit a high degree of semantic dynamics that is only partially captured by the common notion of semantic change. The new measure of context volatility that we propose models the degree by which terms change context in a text collection over time. The computation of context volatility for a word relies on the significance values of its co-occurring terms and the corresponding co-occurrence ranks in sequential time spans. We define a baseline and present an efficient computational approach in order to overcome problems related to computational issues in the data structure. Results are evaluated both on synthetic documents, which are used to simulate contextual changes, and on a real example based on British newspaper texts. The data and software are available at https://git.informatik.uni-leipzig.de/mam10cip/KDIR.git.
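The intuition behind context volatility, that a word's co-occurrence ranks vary strongly over time spans when its context shifts, can be sketched with a simplified stand-in measure (the paper's actual computation also uses co-occurrence significance values):

```python
def context_volatility(rank_history):
    """rank_history maps each co-occurring term to its rank among the
    target word's co-occurrences in consecutive time spans. Averaging the
    rank range per co-term is a simplified stand-in for the paper's measure."""
    ranges = [max(ranks) - min(ranks) for ranks in rank_history.values()]
    return sum(ranges) / len(ranges)

# A word with a stable context vs. one whose co-occurrence ranks jump around.
stable = {"tax": [1, 1, 2], "law": [2, 2, 1]}
shifting = {"tax": [1, 9, 3], "law": [8, 1, 7]}
assert context_volatility(shifting) > context_volatility(stable)
```

A term whose top co-occurrences barely move across spans scores near zero, while a term undergoing contextual change scores high, which is exactly the separation the measure is designed to detect.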

Paper Nr: 53
Title:

AlQuAnS – An Arabic Language Question Answering System

Authors:

Mohamed Nabil, Ahmed Abdelmegied, Yasmin Ayman, Ahmed Fathy, Ghada Khairy, Mohammed Yousri, Nagwa El-Makky and Khaled Nagi

Abstract: Building Arabic Question Answering systems is a challenging problem compared to their English counterparts due to several limitations inherent in the Arabic language and the scarceness of available Arabic training datasets. In our proposed Arabic Question Answering system, we combine several previously successful algorithms and add a novel approach to the answer extraction process that has not been used by any Arabic Question Answering system before. We use the state-of-the-art MADAMIRA Arabic morphological analyser for preprocessing questions and retrieved passages. We also enhance and extend the question classification and use Explicit Semantic Analysis (ESA) in the passage retrieval process to rank passages that most probably contain the correct answer. We also introduce a new answer extraction pattern, which matches the patterns formed according to the question type with the sentences in the retrieved passages in order to provide the correct answer. A performance evaluation study shows that our system gives promising results compared to other existing Arabic Question Answering systems, especially with the newly introduced answer extraction module.

Short Papers
Paper Nr: 4
Title:

Personalized Web Search via Query Expansion based on User’s Local Hierarchically-Organized Files

Authors:

Gianluca Moro, Roberto Pasolini and Claudio Sartori

Abstract: Users of Web search engines generally express information needs with short and ambiguous queries, leading to irrelevant results. Personalized search methods improve users’ experience by automatically reformulating queries before sending them to the search engine or by rearranging received results, according to their specific interests. A user profile is often built from previous queries, clicked results or, in general, from the user’s browsing history; different topics must be distinguished in order to obtain an accurate profile. It is quite common that a set of user files, locally stored in sub-directories, is organized by the user into a coherent taxonomy corresponding to his or her own topics of interest, but only a few methods leverage this potentially useful source of knowledge. We propose a novel method where a user profile is built from those files, specifically considering their consistent arrangement in directories. A bag of keywords is extracted for each directory from the text documents within it. We can infer the topic of each query and expand it by adding the corresponding keywords, in order to obtain a more targeted formulation. Experiments are carried out using benchmark data through a repeatable systematic process, in order to evaluate objectively how much our method can improve the relevance of query results when applied upon a third-party search engine.
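The pipeline the abstract describes, per-directory keyword bags, topic inference for a query, and query expansion, can be sketched as follows. The directory names, keywords, and overlap-based topic inference are illustrative assumptions:

```python
def expand_query(query, dir_keywords, k=3):
    """Infer the query's topic as the directory whose keyword bag overlaps
    the query most, then append up to k of that directory's keywords.
    dir_keywords maps a directory name to keywords ordered by importance."""
    terms = set(query.lower().split())
    best = max(dir_keywords, key=lambda d: len(terms & set(dir_keywords[d])))
    extra = [w for w in dir_keywords[best] if w not in terms][:k]
    return query + " " + " ".join(extra)

dirs = {
    "photography": ["camera", "lens", "aperture", "exposure"],
    "cooking": ["recipe", "oven", "flour"],
}
assert expand_query("best lens", dirs) == "best lens camera aperture exposure"
```

The expanded query carries the directory's vocabulary to the third-party search engine, giving the ambiguous original query a topical anchor without any server-side profile.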

Paper Nr: 6
Title:

Learning to Predict the Stock Market Dow Jones Index Detecting and Mining Relevant Tweets

Authors:

Giacomo Domeniconi, Gianluca Moro, Andrea Pagliarani and Roberto Pasolini

Abstract: Stock market analysis is a primary interest for finance and a challenging task that has always attracted many researchers. Historically, this task was accomplished by means of trend analysis, but in recent years text mining has emerged as a promising way to predict stock price movements. Indeed, previous works showed not only a strong correlation between financial news and their impact on the movements of stock prices, but also that the analysis of social network posts can help to predict them. These latest methods are mainly based on complex techniques to extract the semantic content and/or the sentiment of social network posts. Differently, in this paper we describe a method to predict the Dow Jones Industrial Average (DJIA) price movements based on simpler mining techniques and text similarity measures, in order to detect and characterise relevant tweets that lead to increments and decrements of the DJIA. Considering the high level of noise in social network data, we also introduce a noise detection method based on a two-step classification. We tested our method on 10 million Twitter posts spanning one year, achieving an accuracy of 88.9% in the daily Dow Jones prediction, which is, to the best of our knowledge, the best result among approaches based on social networks in the literature.

Paper Nr: 11
Title:

Content-based Recommender System using Social Networks for Cold-start Users

Authors:

Alan V. Prando, Felipe G. Contratres, Solange N. A. Souza and Luiz S. de Souza

Abstract: Recommender systems have been widely applied to e-commerce to help customers find products to purchase. Cold-start is characterized by the inability to recommend due to the lack of sufficient ratings. In fact, solutions for the cold-start problem have been proposed for different contexts, but the problem is still unsolved. This paper presents a Recommender System (RS) for new e-commerce users that uses only their interactions in social networks to understand their preferences. The proposed RS applies a content-based approach and improves the experience of new users by recommending specific products in a preferred category identified for the user by analysing their data from the social network. Therefore, it combines three social network elements: (1) direct user posts (e.g., "tweets" from Twitter and "posts" from Facebook), (2) content "likes" (e.g., the "like" option on a "post" or "tweet" posted by another user), and (3) page "likes" (e.g., the "like" option on a Facebook page). The proposed RS was tested on a retail e-commerce site, which usually has not only a large range of product categories, but also many products within these categories. The difficulty in predicting a product increases sharply with a greater number of categories and products. According to the experiment conducted, the proposed RS demonstrated to be a reasonable alternative to cold-start, i.e., for users accessing e-commerce for the very first time.

Paper Nr: 14
Title:

Entity Search/Match in Relational Databases

Authors:

Minlue Wang, Valeriia Haberland, Andrew Martin, John Howroyd and John Mark Bishop

Abstract: We study an entity search/match problem that requires retrieved tuples to match the same entity as an input entity query. We assume the input queries are of the same type as the tuples in a materialised relational table. Existing keyword search over relational databases focuses on assembling tuples from a variety of relational tables in order to respond to a keyword query. The entity queries in this work differ from keyword queries in two ways: (i) an entity query roughly refers to an entity that contains a number of attribute values, e.g. a product entity or an address entity; (ii) there might be redundant or incorrect information in the entity queries that could lead to misinterpretations of the queries. In this paper, we propose a transformation that first converts a free-text entity query into a multi-valued structured query, and two retrieval methods are proposed in order to generate a set of candidate tuples from the database. The retrieval methods essentially formulate SQL queries against the database given the multi-valued structured query. The results of a comprehensive evaluation on a large-scale database (more than 29 million tuples) and two real-world datasets showed that our methods achieve a good trade-off between generating correct candidates and retrieval time compared to baseline approaches.

Paper Nr: 15
Title:

Consumer Engagement Characteristics in Mobile Advertising

Authors:

Lonneke Brakenhoff and Marco Spruit

Abstract: Advertising on mobile devices is becoming increasingly important as the possibilities regarding design and context become more extensive. This research focuses on the characteristics of design and context in mobile advertisements, structured according to the CRISP-DM process model. First, we describe their key concepts and relevant theoretical background. Then, we design the Mobile Advertising Effectiveness Framework for Consumer Engagement (MAEF4CE), which relates medium types, creative attributes, ad formats, device-specific ads, and brand visibility as mobile advertisement characteristics. Finally, we uncover the combination of characteristics that elicits optimal consumer engagement in mobile advertisements in a real-time bidding dataset.

Paper Nr: 17
Title:

Improving Document Clustering Performance: The Use of an Automatically Generated Ontology to Augment Document Representations

Authors:

Stephen Bradshaw, Colm O'Riordan and Daragh Bradshaw

Abstract: Clustering documents is a common task in a range of information retrieval systems and applications. Many approaches for improving the clustering process have been proposed. One approach is the use of an ontology to better inform the classifier of word context, by expanding the items to be clustered. WordNet is commonly cited as an appropriate source from which to draw the additional terms; however, it may not be sufficient to achieve strong performance. We have two aims in this paper: first, we show that the use of WordNet may lead to suboptimal performance. This problem may be accentuated when a document set has been drawn from comments made in social forums, due to the unstructured nature of online conversations compared to standard document sets. Second, we propose a novel method which involves constructing a bespoke ontology that facilitates better clustering. We present a study of clustering applied to a sample of threads from a social forum and investigate the effectiveness of the application of these methods.

Paper Nr: 24
Title:

Exploiting Meta Attributes for Identifying Event Related Hashtags

Authors:

Sreekanth Madisetty and Maunendra Sankar Desarkar

Abstract: Users in social media often participate in discussions regarding different events happening in the physical world (e.g., concerts, conferences, festivals) by posting messages, replying to or forwarding messages related to such events. In various applications like event recommendation, event reporting, etc. it might be useful to find user discussions related to such events from social media. Finding event related hashtags can be useful for this purpose. In this paper, we focus on the problem of finding relevant hashtags for a given event. Features are defined to identify the event related hashtags. We specifically look for features that use similarities of the hashtags with the event metadata attributes. A learning to rank algorithm is applied to learn the importance weights of the features towards the task of predicting the relevance of a hashtag to the given event. We experimented on events from four different categories (namely, Award ceremonies, E-commerce events, Festivals, and Product launches). Experimental results show that our method significantly outperforms the baseline methods.

Paper Nr: 25
Title:

Exploring Mediatoil Imagery: A Content-based Approach

Authors:

Sahil Saroop, Herna L. Viktor, Patrick McCurdy and Eric Paquet

Abstract: Debates over Canada’s energy future with its oil sands have become a flashpoint of public interest. Stakeholders have identified advantages, such as economic benefit and global energy demand, and drawbacks, notably environmental and social challenges. This research focuses on discovering how various organizations employ graphics, images and videos in the media, in order to further our understanding of the context and evolution of the oil sands discourse since the late 1960s. To this end, we created the open-source Mediatoil database, which contains images from six categories of imagery, namely graphics, machines, people, landscape, protest and open-pit. We further created the Mediatoil-IR content-based image retrieval system, which utilizes SURF descriptors and bags of features. We illustrate how the Mediatoil-IR system was used to explore and contrast the imagery used by the various stakeholders, within a multi-class learning setting. Our experimental results show that dividing the images into sub-categories is beneficial for retrieval and classification.

Paper Nr: 28
Title:

A Novel Short-term and Long-term User Modelling Technique for a Research Paper Recommender System

Authors:

Modhi Al Alshaikh, Gulden Uchyigit and Roger Evans

Abstract: Modelling users’ interests accurately is an important aspect of recommender systems. However, this is a challenge as users’ behaviour can vary in different domains. For example, users’ reading behaviour of research papers follows a different pattern to users’ reading of online news articles. In the case of research papers, our analysis of users’ reading behaviour shows that there are breaks in reading whereas the reading of news articles is assumed to be more continuous. In this paper, we present a novel user modelling method for representing short-term and long-term user’s interests in recommending research papers. The short-term interests are modelled using a personalised dynamic sliding window which is able to adapt its size according to the ratio of concepts per paper read by the user rather than purely time-based methods. Our long-term model is based on selecting papers that represent user’s longer term interests to build his/her profile. Existing methods for modelling user’s short-term and long-term interests do not adequately take into consideration erratic reading behaviours over time that are exhibited in the research paper domain. We conducted evaluations of our short-term and long-term models and compared them with the performance of three existing methods. The evaluation results show that our models significantly outperform the existing short-term and long-term methods.

Paper Nr: 40
Title:

RQAS: A Rapid QA Scheme with Exponential Elevations of Keyword Rankings

Authors:

Cheng Zhang and Jie Wang

Abstract: We present a rapid question-answering scheme (RQAS) for constructing a chatbot over specific domains with data in the format of question-answer pairs (QAPs). We want RQAS to return the most appropriate answer to a user question in real time. RQAS is based on TF-IDF scores and exponential elevations of keyword rankings generated by TextRank, which overcome the common problems of selecting an answer based solely on the highest similarity between the user question and the questions in the existing QAPs, making it easy for general QAS builders to construct a QAS for a given application. Experiments show that RQAS is both effective and efficient regardless of the size of the underlying QAP dataset.
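To illustrate the flavour of "exponential elevation" of keyword rankings when matching a user question against stored QAPs, the sketch below weights each overlapping term by exp(rank). The weighting, the toy QAPs, and the rank values are hypothetical; the abstract's scheme combines this idea with TF-IDF scores:

```python
from math import exp

def match_question(user_q, qaps, keyword_rank):
    """Return the (question, answer) pair whose question best matches
    user_q, weighting each overlapping term by exp(rank) so TextRank-style
    keywords dominate plain term overlap."""
    def sim(question):
        overlap = set(user_q.split()) & set(question.split())
        return sum(exp(keyword_rank.get(t, 0.0)) for t in overlap)
    return max(qaps, key=lambda pair: sim(pair[0]))

qaps = [("how do I reset my password", "Use the account settings page."),
        ("what are your opening hours", "9am to 5pm, Monday to Friday.")]
ranks = {"password": 2.0, "hours": 2.0}
question, answer = match_question("forgot my password help", qaps, ranks)
assert answer == "Use the account settings page."
```

Because unranked terms contribute only exp(0) = 1, a single elevated keyword like "password" outweighs several incidental word overlaps, which mitigates the pitfall of picking an answer by raw question similarity alone.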

Paper Nr: 43
Title:

Towards Rhythmicity Analysis of Text using Empirical Mode Decomposition

Authors:

Robertas Damasevicius, Jurgita Kapociute-Dzikiene and Marcin Wozniak

Abstract: The rhythmicity characteristics of written text are still an under-researched topic, as opposed to similar research in the speech analysis domain. The paper presents a method for text deconstruction into text modes using Empirical Mode Decomposition (EMD). First, the text is encoded into a numerical sequence using a mapping table. Next, the resulting numerical sequence is decomposed into Intrinsic Mode Functions (IMFs) using EMD. The resulting text modes provide a basis for further analysis of a text as well as of specific characteristics of the language of the text itself. The text modes are used further to derive measures of text complexity (cardinality) and rhythmicity (frequency) as well as visual representations (scalograms, convograms), which can provide important insights into the structure of the text itself. The application of EMD to text analysis makes it possible to decompose text into basic harmonics, which can be attributed to structural units of the text such as syllables, words, verses and stanzas. Higher-order harmonics, however, can be observed only in rhymed types of text such as poetry.

Paper Nr: 47
Title:

Software Interestingness Trigger by Social Network Automation

Authors:

Iaakov Exman, Avihu Harush and Yakir Winograd

Abstract: Social Networking software is perceived as a fast way to engage people with potential interest in a certain activity. However, interestingness has been defined as a product of domain relevance – arbitrarily determined by a person’s tastes – and surprise – determined by how unexpected an activity is relative to average activities in the domain. Therefore, one must manage two unknowns to trigger the desired interestingness: the person tags and the activity description. The person tags are obtained from an application ontology characterizing the chosen domain. The activity description tries to generate surprise by sharpening what differentiates it from conventional activities. This paper describes a software tool based upon a social networking infrastructure and illustrates its quasi-automated usage with inputs of the relevant tags and the surprising activity for a specific domain, viz. marketing of an event within a conference. Preliminary results are analysed and discussed at length.

Paper Nr: 50
Title:

Generating Appropriate Question-Answer Pairs for Chatbots using Data Harvested from Community-based QA Sites

Authors:

Wenjing Yang and Jie Wang

Abstract: Community-based question-answering web sites (CQAW) contain rich collections of question-answer pages, where a single question often has multiple answers written by different authors covering different aspects. We study how to harvest new question-answer pairs from CQAWs so that each question-answer pair addresses just one aspect and is suitable for chatbots over a specific domain. In particular, we first extract all answers to a question from a CQAW site using DOM-tree similarities and features of answer areas, and then cluster the answers using LDA. Next, we form a sub-question for each cluster using a small number of top keywords in the given cluster together with the keywords in the original question. We select the best answer to the sub-question based on user ratings and the similarities of answers to the sub-question. Experimental results show that our approach is effective.
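
The cluster-then-sub-question steps can be sketched as below. This is an illustrative toy, not the paper's pipeline: greedy keyword-overlap (Jaccard) clustering stands in for LDA, and the answers, stop-word list and threshold are invented.

```python
# Sketch: cluster answers by keyword overlap, then form one sub-question per cluster.
from collections import Counter

STOP = {"the", "a", "an", "is", "to", "of", "and", "you", "it", "for"}

def tokens(text):
    """Content words of a text (lowercased, stop words removed)."""
    return {w for w in text.lower().split() if w not in STOP}

def cluster_answers(answers, threshold=0.2):
    """Greedy clustering: join an answer to the first cluster it overlaps with."""
    clusters = []
    for ans in answers:
        t = tokens(ans)
        for cl in clusters:
            rep = tokens(cl[0])
            if len(t & rep) / max(1, len(t | rep)) >= threshold:
                cl.append(ans)
                break
        else:
            clusters.append([ans])
    return clusters

def sub_question(question, cluster, k=2):
    """Form a sub-question from the top-k cluster keywords plus the original question."""
    counts = Counter(w for ans in cluster for w in tokens(ans))
    top = [w for w, _ in counts.most_common(k)]
    return question.rstrip("?") + " regarding " + ", ".join(top) + "?"

answers = [
    "install the driver first and reboot",
    "reboot after you install the driver",
    "check the cable connection instead",
]
clusters = cluster_answers(answers)
questions = [sub_question("How do I fix my printer?", cl) for cl in clusters]
```

The two driver-related answers land in one cluster and the cable answer in another, yielding one aspect-specific sub-question per cluster.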

Paper Nr: 51
Title:

Knowledge based Automatic Summarization

Authors:

Andrey Timofeyev and Ben Choi

Abstract: This paper describes a knowledge based system for automatic summarization. The knowledge based system creates abstractive summaries of texts by generalizing new concepts, detecting main topics, and composing new sentences. The system is built on the Cyc development platform, which comprises the world's largest ontology of common sense knowledge and a reasoning engine. The system is able to generate coherent and topically related new sentences by using the syntactic structures and semantic features of the given documents, the knowledge base, and the reasoning engine. The system first performs knowledge acquisition by extracting the syntactic structure of each sentence in the given documents, and by mapping the words and the relationships between words into the Cyc knowledge base. Next, it performs knowledge discovery using the Cyc ontology and inference engine. New concepts are abstracted by exploring the ontology of the mapped concepts. Main topics are identified based on the clustering of the concepts. Then, the system performs knowledge representation for human readers by creating new English sentences to summarize the key concepts and the relationships between them. The structures of the composed sentences extend beyond subject-predicate-object triplets by allowing adjective and adverb modifiers. The system was tested on various documents and webpages. The test results showed that the system is capable of creating new sentences that include generalized concepts not mentioned in the original text and of combining information from different parts of the text to form a summary.

Paper Nr: 52
Title:

Conceptual Process Models and Quantitative Analysis of Classification Problems in Scrum Software Development Practices

Authors:

Leon Helwerda, Frank Niessink and Fons J. Verbeek

Abstract: We propose a novel classification method that integrates into existing agile software development practices by collecting data records generated by software and tools used in the development process. We extract features from the collected data and create visualizations that provide insights, and feed the data into a prediction framework consisting of a deep neural network. The features and results are validated against conceptual frameworks that model the development methodologies as similar processes in other contexts. Initial results show that the visualization and prediction techniques provide promising outcomes that may help development teams and management gain better understanding of past events and future risks.

Posters
Paper Nr: 10
Title:

User-to-User Recommendation using the Concept of Movement Patterns: A Study using a Dating Social Network

Authors:

Mohammed Al-Zeyadi, Frans Coenen and Alexei Lisitsa

Abstract: Dating Social Networks (DSN) have become a popular platform for people looking for potential romantic partners. However, the main challenge is the size of the dating network in terms of the number of registered users, which makes it impossible for users to conduct extensive searches. DSN systems thus make recommendations, typically based on user profiles, preferences and behaviours. The provision of effective User-to-User recommendation systems has thus become an essential part of successful dating networks. To date the most commonly used recommendation technique is founded on the concept of collaborative filtering. In this paper an alternative approach, founded on the concept of Movement Patterns, is presented. A movement pattern is a three-part pattern that captures the “traffic” (messaging) between vertices (users) in a DSN. The idea is that these patterns capture the behaviour of users within a DSN while at the same time capturing the associated profile and preference data. The idea has been built into a User-to-User recommender system, the RecoMP system. The system has been evaluated, by comparing its operation with a collaborative filtering system (the RecoCF system), using a data set from the Chinese Jiayuan.com DSN comprising 548,395 vertices. The reported evaluation demonstrates that very successful results can be produced, with a best average F-score of 0.961.
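
The three-part movement-pattern idea can be sketched as follows. This is a hypothetical illustration only: the profiles, message log and scoring rule are invented toy data, not the RecoMP system or the Jiayuan.com data.

```python
# Sketch: harvest (sender profile, edge, receiver profile) patterns from a
# message log, then recommend candidates whose profile matches frequent patterns.
from collections import Counter

profiles = {
    "u1": ("m", "20s"), "u2": ("f", "20s"),
    "u3": ("m", "30s"), "u4": ("f", "30s"),
}
messages = [("u1", "u2"), ("u3", "u2"), ("u1", "u4"), ("u3", "u4")]

# Count three-part patterns: (sender profile, "msg", receiver profile).
patterns = Counter((profiles[s], "msg", profiles[r]) for s, r in messages)

def recommend(user, candidates, top=2):
    """Rank candidates by how often users with this profile messaged theirs."""
    scored = [(patterns[(profiles[user], "msg", profiles[c])], c)
              for c in candidates if c != user]
    return [c for score, c in sorted(scored, reverse=True)[:top] if score > 0]

suggestions = recommend("u1", ["u2", "u3", "u4"])
```

Here "u1" (a male in his 20s) is recommended the two female users, because the observed traffic contains patterns from his profile to theirs, while "u3" never appears as a receiver of such a pattern.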

Paper Nr: 12
Title:

Early NPL Warning for SME Credit Risk: An Experimental Study

Authors:

Sacide Kalayci and Secil Arslan

Abstract: In credit risk, besides assessing the risk of credit applications, it is critical to take proactive decisions by foreseeing the risk of a non-performing loan (NPL). In Turkey, recent reports demonstrate that among credit categories such as consumer, corporate, and small and medium-sized enterprise (SME) loans, SMEs show the highest NPL ratios. This paper focuses on SME credit behavioural scoring to develop an early NPL warning system for the period after the credit is released. Utilizing application scoring features together with behavioural scoring features, an experimental study classifies SME customers as non-performing or performing during the lifetime of the credit. The proposed system aims to issue a warning six months ahead of the NPL state. The Random Forest (RF) algorithm is implemented for NPL state classification of active SME credits. The accuracy of the RF algorithm is compared with that of other machine learning algorithms such as Logistic Regression, Support Vector Machines and Decision Trees. It has been observed that the accuracy of the RF model increases when different SME credit product features are added to the model. An accuracy of 82.25% is achieved with RF, which outperforms all other alternative algorithms.
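
The classification setup can be sketched with a minimal random-forest-style ensemble. This is an illustrative toy under stated assumptions: the two features (days overdue, utilization ratio), the labels, and the use of bootstrap-sampled decision stumps with majority voting are invented stand-ins for the paper's RF model and SME dataset.

```python
# Sketch: bootstrap-sampled decision stumps + majority vote for NPL classification.
import random

# Each row: (days_overdue, utilization_ratio); label 1 = non-performing (NPL).
X = [(5, 0.2), (10, 0.3), (90, 0.9), (120, 0.8), (60, 0.7), (3, 0.1)]
y = [0, 0, 1, 1, 1, 0]

def fit_stump(rows, labels):
    """Pick the (feature, threshold) with the fewest errors for 'value > t -> 1'."""
    best = None
    for f in range(2):
        values = sorted({r[f] for r in rows})
        for t in [values[0] - 1] + values:
            err = sum((1 if r[f] > t else 0) != l for r, l in zip(rows, labels))
            if best is None or err < best[0]:
                best = (err, f, t)
    return best[1], best[2]

def fit_forest(rows, labels, n_trees=7, seed=42):
    """Draw a bootstrap sample per tree and fit one stump on each sample."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx]))
    return forest

def predict(forest, row):
    """Majority vote across the stumps."""
    votes = sum(1 if row[f] > t else 0 for f, t in forest)
    return 1 if votes * 2 > len(forest) else 0

forest = fit_forest(X, y)
risky = predict(forest, (150, 0.95))
safe = predict(forest, (2, 0.05))
```

A real system would use full decision trees, many more behavioural features, and a labelled six-month-ahead target, but the bootstrap-plus-vote structure is the same.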

Paper Nr: 19
Title:

Towards Software Experts for Knowledge Discovery - A Process Model for Knowledge Consolidation

Authors:

Lauri Tuovinen

Abstract: The process of knowledge discovery in databases (KDD) is traditionally driven by human experts with in-depth knowledge of the technology used and the domain in which it is being applied. The role of technology in this process is to passively execute the computations specified by the experts, and the role of non-expert humans is limited to a few narrowly defined special cases. However, there are many scenarios where ordinary individuals would benefit from applying KDD to their own personal data, if only they had the means to do so. Meanwhile, KDD experts are looking for more advanced tools and methods capable of coping with the challenges posed by big data. Both these needs would be addressed by autonomous software experts capable of taking on some of the responsibilities of human KDD experts, but there are several obstacles that need to be cleared before the implementation of such experts is feasible. One of these is that while there is a widely accepted process model for knowledge discovery, there is not one for knowledge consolidation: the process of integrating a KDD result with established domain knowledge. This paper explores the requirements of the knowledge consolidation process and outlines a process model based on how the concept of knowledge is understood in KDD. Furthermore, it evaluates the state of the art and attempts to estimate how far away we are from achieving the necessary technology level to implement at least one major step of the process in software. Finally, the options available for making significant advances in the near future are discussed.

Paper Nr: 21
Title:

Hadoop-based Framework for Information Extraction from Social Text

Authors:

Ferdaous Jenhani, Mohamed Salah Gouider and Lamjed Bensaid

Abstract: Social data analysis has become a real business requirement given the frequent use of social media as a new business strategy. However, the volume, velocity and variety of social data challenge its storage and processing. In a previous contribution [11, 12], we proposed an event extraction system that addressed only data variety and did not handle the volume and velocity dimensions, so it cannot be considered a big data system. In this work, we port the previously proposed system to a parallel and distributed framework in order to reduce task complexity and scale up to continuously growing volumes of data. We propose two loosely coupled Hadoop clusters for entity recognition and event extraction. In experiments, we carried out time and accuracy tests to check the performance of the system in extracting drug abuse behavioural events from 1,000,000 tweets. The Hadoop-based system achieves better performance than the old system.
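
The map/reduce split that such a port relies on can be sketched in Hadoop Streaming style. This is a hypothetical illustration: the entity lexicon and tweets are invented, and a simple dictionary lookup stands in for the paper's entity recognition and event extraction models.

```python
# Sketch: mapper emits (entity, 1) per lexicon hit; reducer sums per key,
# mirroring the map -> shuffle/sort -> reduce phases of a Hadoop job.
from itertools import groupby

DRUG_LEXICON = {"opioid", "heroin", "fentanyl"}  # assumed entity dictionary

def mapper(tweets):
    """Emit (entity, 1) for each lexicon hit in each tweet."""
    for tweet in tweets:
        for word in tweet.lower().split():
            if word in DRUG_LEXICON:
                yield (word, 1)

def reducer(pairs):
    """Sum counts per key; input must be sorted by key (the shuffle phase)."""
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield (key, sum(v for _, v in group))

tweets = ["heroin use rising", "fentanyl and heroin seized", "nice weather"]
counts = dict(reducer(mapper(tweets)))
```

On a cluster, the mapper and reducer would run as separate Streaming tasks over HDFS splits; here the shuffle is simulated with `sorted`.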

Paper Nr: 29
Title:

Mining Hot Research Topics based on Complex Network Analysis - A Case Study on Regenerative Medicine

Authors:

Rong-Qiang Zeng, Hong-Shen Pang, Xiao-Chu Qin, Yi-Bing Song, Yi Wen, Zheng-Yin Hu, Ning Yang, Hong-Mei Guo and Qian Li

Abstract: In order to mine the hot research topics of a certain field, we propose a hypervolume-based selection algorithm based on complex network analysis, which employs a hypervolume indicator to select the hot research topics from the network in the considered field. We carry out experiments in the field of regenerative medicine, and the experimental results indicate that our proposed method can effectively find the hot research topics in this field. The performance analysis sheds light on ways to achieve further improvements.
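
A two-dimensional hypervolume indicator of the kind such selection relies on can be sketched as follows. This is a generic illustration, not the paper's algorithm: the two objectives per topic (e.g. network centrality and growth rate) and the topic values are invented assumptions.

```python
# Sketch: 2-D hypervolume (area dominated w.r.t. a reference point, maximization).

def hypervolume_2d(points, ref=(0.0, 0.0)):
    """Area jointly dominated by a set of 2-D points above the reference point."""
    # Sweep points by x descending; dominated points add no new strip.
    pts = sorted(points, key=lambda p: (-p[0], -p[1]))
    area, prev_y = 0.0, ref[1]
    for x, y in pts:
        if y > prev_y:                     # this point adds a new horizontal strip
            area += (x - ref[0]) * (y - prev_y)
            prev_y = y
    return area

# Assumed topic scores: (centrality, growth rate), both to be maximized.
topics = {"stem cells": (0.9, 0.5), "scaffolds": (0.4, 0.8), "weak": (0.2, 0.1)}
hv = hypervolume_2d(list(topics.values()))
```

A hypervolume-based selection would then, for instance, rank each topic by how much the total hypervolume drops when that topic is removed; the "weak" topic above is dominated and contributes nothing.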

Paper Nr: 32
Title:

Aposentu: A Social Semantic Platform for Hotels

Authors:

Gavina Baralla, Simona Ibba and Riccardo Zenoni

Abstract: The tourism business has become competitive and dynamic, and it is essential to adapt both to customers' satisfaction and to the market's changing needs. A hotel owner faces three big challenges: attracting new guests to the location, managing the hotel in the most performant way, and adapting the online marketing strategy using several channels such as online travel agencies (OTAs), meta-search engine websites or other players from the sharing economy. All these channels can complicate the hotelier's life, making the tourism market more complex and, in some areas, decoupling prices from seasons and availability. In addition, consumer-generated content (CGC) influences the market and revenue management, making web reputation even more important. This paper presents Aposentu, an innovative tool that integrates all the components required to manage a hotel successfully. The platform is based on cloud computing technology and offers a dashboard with many innovative functionalities. By using semantic tools, sentiment analysis and complex network metrics, the platform allows hoteliers to become more competitive in the tourism industry. Moreover, administrative complexity will be reduced, which will facilitate the management of accommodation.

Paper Nr: 33
Title:

A Personalized Learning Recommendation System Architecture for Learning Management System

Authors:

Thoufeeq Ahmed Syed, Vasile Palade, Rahat Iqbal and Smitha Sunil Kumaran Nair

Abstract: The information on the web is ever increasing and it is becoming difficult for students to find appropriate information or relevant learning material to satisfy their needs. Technology Enhanced Learning (TEL) is an area which covers all technologies that improve students' learning. Effective Personal Learning Recommendation Systems (PLRS) will not only reduce the burden of information overload by recommending relevant learning material matching students' interests, but also provide them with the "right" information at the "right" time and in the "right" way. In this paper, we first present a detailed analysis of existing TEL recommendation systems and identify the challenges that exist in developing and evaluating the datasets. Then, we propose an architecture for a PLRS that aims to support students via a Learning Management System (LMS) in finding relevant material in order to enhance the student learning experience. We also propose a methodology for building our own collaborative dataset via learning management systems (LMS) and educational repositories. This dataset will enhance student learning by recommending learning materials based on former students' competence qualifications. The proposed dataset offers information on the usage of more than 19,296 resources from 628 courses, apart from data from social learner networks (forums, blogs, wikis and chats), which constitutes another 3,600 stored files. Finally, we also present some future challenges and a roadmap for developing TEL PLRSs.

Paper Nr: 34
Title:

Evaluating Open Source Business Intelligence Tools using OSSpal Methodology

Authors:

Tânia Ferreira, Isabel Pedrosa and Jorge Bernardino

Abstract: Business Intelligence (BI) is a set of techniques and tools that transform raw data into meaningful information. BI helps business managers make better decisions, which translates into a competitive advantage. Open source tools have the main advantage of not increasing costs for companies, although it is necessary to choose a tool appropriate to their specific needs. For a more precise evaluation of open source BI tools, the OSSpal assessment methodology was applied, which combines quantitative and qualitative evaluation measures. Using the OSSpal methodology, this paper compares four of the top business intelligence tools: BIRT, Jaspersoft, Pentaho and SpagoBI.
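
An OSSpal-style score combines per-category ratings with category weights, which can be sketched as below. All category names, weights and ratings here are illustrative assumptions, not the paper's measured values.

```python
# Sketch: weighted sum of 1-5 category ratings, OSSpal style.

WEIGHTS = {                      # assumed category weights, summing to 1.0
    "functionality": 0.30,
    "operational_software_characteristics": 0.25,
    "support_and_service": 0.15,
    "documentation": 0.10,
    "software_technology_attributes": 0.10,
    "community_and_adoption": 0.10,
}

def osspal_score(ratings):
    """Weighted sum of the tool's 1-5 category ratings."""
    return sum(WEIGHTS[cat] * r for cat, r in ratings.items())

# Hypothetical ratings for one tool, for illustration only.
tool_ratings = {
    "functionality": 4, "operational_software_characteristics": 4,
    "support_and_service": 3, "documentation": 4,
    "software_technology_attributes": 4, "community_and_adoption": 5,
}
score = osspal_score(tool_ratings)
```

Comparing tools then reduces to computing this score per tool under the same weights, alongside the qualitative assessment.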

Paper Nr: 37
Title:

On Web Based Sentence Similarity for Paraphrasing Detection

Authors:

Mourad Oussalah and Panos Kostakos

Abstract: Semantic similarity measures play vital roles in information retrieval, natural language processing and paraphrase detection. With the growing number of plagiarism cases in both the commercial and research communities, designing efficient tools and approaches for paraphrase detection becomes crucial. This paper contrasts a web-based approach, which analyses snippets returned by a search engine, with a WordNet-based measure. Several refinements of the web-based approach are investigated and compared. The approaches are evaluated and discussed with respect to the Microsoft paraphrase dataset.
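
The snippet-based idea can be sketched with a simple word-overlap measure. This is a generic illustration under stated assumptions: the snippets are hard-coded stand-ins for live search-engine results, and Jaccard overlap stands in for whichever snippet similarity refinement the paper evaluates.

```python
# Sketch: estimate sentence similarity from word overlap between the
# search-result snippets retrieved for each sentence.

def word_set(text):
    """Lowercased word set of a text."""
    return set(text.lower().split())

def snippet_similarity(snippets_a, snippets_b):
    """Jaccard overlap between the word sets of two snippet collections."""
    a = set().union(*(word_set(s) for s in snippets_a))
    b = set().union(*(word_set(s) for s in snippets_b))
    return len(a & b) / len(a | b) if a | b else 0.0

# Assumed snippets, as if returned by a search engine for two sentences.
snips1 = ["the cat sat on the mat", "a cat on a mat"]
snips2 = ["the cat rested on the mat", "cat lying on mat"]
sim = snippet_similarity(snips1, snips2)
```

A live system would query a search-engine API with each sentence and compare the returned snippets; a WordNet-based measure would instead compare the sentences' word senses directly.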

Paper Nr: 38
Title:

On Spam Filtering Classification: A Majority Voting-like Approach

Authors:

Youngsu Dong, Mourad Oussalah and Lauri Lovén

Abstract: Despite improvements in filtering tools and information security, spam still causes substantial damage to public and private organizations. In this paper, we present a majority-voting-based approach to identify spam messages. A new methodology for building a majority voting classifier is presented and tested. The results on the SpamAssassin dataset indicate a non-negligible improvement over the state of the art, which paves the way for further development and applications.
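
The voting structure can be sketched as follows. This is a toy illustration, not the paper's classifier: three invented rule-based checks stand in for the trained base learners, and the messages are made up.

```python
# Sketch: majority voting over several base spam "classifiers".

def has_spam_words(msg):
    """Base rule 1: message contains a known spam keyword (assumed list)."""
    return any(w in msg.lower() for w in ("free", "winner", "viagra"))

def shouting(msg):
    """Base rule 2: more than half of the letters are upper case."""
    letters = [c for c in msg if c.isalpha()]
    return bool(letters) and sum(c.isupper() for c in letters) / len(letters) > 0.5

def has_link(msg):
    """Base rule 3: message contains a URL."""
    return "http://" in msg or "https://" in msg

CLASSIFIERS = [has_spam_words, shouting, has_link]

def majority_vote(msg):
    """Label a message spam when more than half of the base classifiers agree."""
    votes = sum(clf(msg) for clf in CLASSIFIERS)
    return votes * 2 > len(CLASSIFIERS)

spam = majority_vote("FREE WINNER click http://spam.example now")
ham = majority_vote("Lunch at noon tomorrow?")
```

In the paper's setting, the base learners would be trained classifiers rather than hand-written rules, but the aggregation step is the same.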

Paper Nr: 44
Title:

Data Warehousing in the Cloud: Amazon Redshift vs Microsoft Azure SQL

Authors:

Pedro Joel Ferreira, Ana Almeida and Jorge Bernardino

Abstract: A data warehouse enables the analysis of large amounts of information that typically comes from an organization's transactional systems (OLTP). However, today's data warehouse systems do not have the capacity to handle the massive amounts of data currently produced, which is where cloud computing comes in. Cloud computing is a model that enables ubiquitous, on-demand access to a set of shared or non-shared computing resources (such as networks, servers, or storage) that can be quickly provisioned or released with a simple request and without human intervention. In this model, resources are almost unlimited and, working together, they provide very high computing power that can and should be used for the most varied purposes. From the combination of these two concepts emerges the cloud data warehouse. It advances the way traditional data warehouse systems are defined by allowing their sources to be located anywhere, as long as they are accessible through the Internet, while taking advantage of the great computational power of a cloud infrastructure. In this paper, we study two of the most popular cloud data warehousing solutions on the market: Amazon Redshift and Microsoft Azure SQL Data Warehouse.

Paper Nr: 49
Title:

Opinion Mining Meets Decision Making: Towards Opinion Engineering

Authors:

Klemens Schnattinger and Heike Walterscheid

Abstract: We introduce a methodology for opinion mining based on recent approaches for natural language processing and machine learning. To select and rank the relevant opinions, decision making based on weighted description logics is introduced. Therefore, we propose an architecture called OMA (Opinion Mining Architecture) that integrates these approaches of our methodology in a common framework. First results of a study on opinion mining with OMA in the financial sector are presented.