Accepted

BLOMA: Explain Collaborative Filtering via Boosted Local Rank-One Matrix Approximation

Chongming Gao, Shuai Yuan, Zhong Zhang, Hongzhi Yin, Junming Shao

BLOMA

Matrix Approximation (MA) is a powerful technique in context-aware recommendation systems. By exploiting both the rating data and auxiliary social/item networks, it represents the users and items as well-regularized low-rank latent factors so as to capture user preferences and item attributes. However, there are two main problems in the prevalent MA framework. The most urgent one is that the latent factor is out of explanation, which hampers the understanding of the reasons behind recommendations. Besides, traditional MA methods produce user/item factors globally, which fails to capture the idiosyncrasies of users/items. In this paper, we propose a model called Boosted Local rank-One Matrix Approximation (BLOMA). The core idea is to locally and sequentially approximate the residual matrix (which represents the unexplained part obtained from previous stage) by rank-one sub-matrix. By leveraging social networks and item relationships, the sub-matrix with distinct topic is easily extracted. The experiments show our method has good explanations that other methods never have. Meanwhile, by comparing with state-of-the-art algorithms, we show our model has well-matched prediction accuracy.
Accepted by DASFAA’19.

Semantic Trajectory Compression via Multi-resolution Synchronization-based Clustering

Chongming Gao, Yi Zhao, Ruizhi Wu, Qinli Yang, Junming Shao

image-TraCompression

With the pervasive use of location-aware devices and rapid development of location sensing technology, trajectory data has been generated in diverse fields. How to store and manage these large amounts of data is a non-trivial task, and thus trajectory compression has gained increasing attention in recent years. As traditional compression algorithms often treat trajectories as sequences of lines in geometric space, the global statistics and the semantics embedded in trajectories are not well considered. Inspired by the powerful concept of synchronization, in this paper we introduce a new semantic trajectory compression approach, called \textsc{CascadeSync}, to yield multi-resolution trajectory abstractions with semantic enrichment. The basic idea is to introduce a multi-resolution synchronization-based clustering model to produce semantic regions-of-interest (ROIs) in a hierarchical way. Specifically, by imposing constraints on points with semantic information in the interaction model, all neighboring points with similar semantics will group together automatically. Afterwards, each trajectory is compressed as a sequence of semantic ROIs and is further represented as a hierarchical ROI network. Beyond, we further extend our model on the data stream setting. The extensive experiments on synthetic data and four real-world data sets have demonstrated the effectiveness and efficiency of our proposed model.
Submitted to Knowledge-Based Systems (KBS), still under review. [PDF]

Towards Robust Arbitrarily Oriented Subspace Clustering

Zhong Zhang, Chongming Gao, Qinli Yang, Junming Shao

SC

Clustering high-dimensional data is challenging since meaningful clusters usually hide in the arbitrarily oriented subspaces, and thus classical clustering algorithms like k-means tend to fail in such case. Subspace clustering has thus attracted growing attention in last decade, and many algorithms have been proposed such as ORCLUS and 4C. However, existing approaches are usually sensitive to global and/or local noisy points, and the overlapping subspace clusters are little explored. Beyond, these approaches usually involve the exhaustive local search for correlated points or subspaces, which is time consuming. To deal with these problems, in this paper, we introduce a new subspace clustering algorithm, called RAOSC, which formulates the Robust Arbitrarily Oriented Subspace Clustering as a group structure low-rank optimization problem. RAOSC is able to recover subspace clusters from a sea of noise while noise and overlapping points can be naturally identified during the optimization process. Unlike existing low-rank based subspace clustering methods, RAOSC can explicitly produce the subspaces of clusters without any prior knowledge of subspace dimensionality and do not need a post-processing for clustering. Extensive experiments on both synthetic and real-world data sets have demonstrated that RAOSC allows yielding high-quality clusterings and outperforms many state-of-the-art algorithms.
Accepted by DASFAA’19.

SemiSync: Semi-supervised Clustering by Synchronization

Zhong Zhang, DiDi Kang, Chongming Gao, Junming Shao

SC

In this paper, we consider the semi-supervised clustering problem, where the prior knowledge is formalized as the Cannot-Link(CL) and Must-Link(ML) pairwise constraints. Previous works usually solve this problem by introducing additional penalty terms to satisfy the constraints and/or learning a constraint-based distance metric. In this paper, we solve this problem from a novel perspective: synchronization. Inspired by the emerging synchronization models, we propose a new semi-supervised clustering method, called SemiSync. The basic idea is to regard the data points as a set of (constrained) phase oscillators, and simulate their dynamics to form clusters automatically with a proposed constrained interaction model. Thanks to the powerful concept and the desirable property of synchronization, SemiSync allows dynamically propagating the constraints to unlabelled data points driven by their local data distributions, which effectively boosts the clustering performance even if little prior knowledge is available. Besides, it allows us to employ an effective strategy to reduce the computation load. Extensive experiments on synthetic and real-world data sets have demonstrated the effectiveness of our proposed method.
Accepted by DASFAA’19.

Synchronization-inspired Co-clustering and Its Application to Gene Expression Data

Junming Shao, Chongming Gao, Wei Zeng, Jingkuan Song, Qinli Yang

Cosync

In this paper, we propose a new synchronization-inspired co-clustering algorithm by dynamic simulation, called CoSync, which aims to discover biologically relevant subgroups embedding in a given gene expression data matrix. The basic idea is to view a gene expression data matrix as a dynamical system, and the weighted two-sided interactions are imposed on each element of the matrix from both aspects of genes and conditions, resulting in the values of all element in a co-cluster synchronizing together. Experiments show that our algorithm allows uncovering high-quality co-clusterings embedded in gene expression data sets and has its superiority over many state-of-the-art algorithms.
Accepted by ICDM’17. [PDF][Code]

Under Review

Semantic Trajectory Representation and Retrieval via Hierarchical Embedding

Chongming Gao, Chen Huang, Zhong Zhang, Qinli Yang, Junming Shao

image-TraEmbedding

Trajectory mining has gained growing attentions due to its wide emerging applications, such as location-based service, urban computing and movement behavior analysis. One critical and fundamental mining task is to retrieve certain locations or trajectories that satisfy particular patterns. However, it becomes a tricky problem to query various semantic patterns from traditional trajectory databases, since most existing approaches mainly represent the trajectory as a collection of geographic and temporal features, and the latent semantic properties are little considered. In this paper, we introduce a new semantic trajectory representation method, incorporating trajectory structures, temporal information as well as domain knowledge to make efficient semantic retrieval possible. Specifically, to extract structures from disordered raw trajectories, a synchronization-based model is first introduced to identify multi-resolution regions of interest (ROIs). Relying on the hierarchical ROI network, a hierarchical embedding model is further proposed to embed ROIs as well as trajectories as continuous vectors. Most importantly, the metric in this embedding vector space is tailored to express the semantic similarity. As a result, users can easily retrieve desirable ROIs or trajectories by computing the Euclidean distance among embedded vectors. Empirical experiments on synthetic data and four real-world data sets show that our approach extracts semantic trajectory information effectively, which allows retrieving more similar and interpretable trajectories, by comparing with four state-of-the-art similarity metrics, including DTW, LCSS, EDR and Fr\’echet distance.
Submitted to Data Mining and Knowledge Discovery (DMKD), still under review. [PDF]

To Be Submitted

AntClu: Any-time Trajectory Clustering via Dynamic Trend Representation

Yi Zhao, Chongming Gao, Ruizhi Wu, Qinli Yang, Junming Shao

antclu

In recent years, clustering trajectory data has been extensively explored to discover similar patterns of moving objects. Existing approaches, often cluster whole life-span trajectories into several groups according to some trajectory similarities such as dynamic time warping and edit distance. However, the trajectory of a given moving object is dynamic and evolved over time. Exploring the dynamic grouping patterns of moving objects (e.g., the expanding, shrinking, emerging or disappearing of clusters) over time thus offers a more dedicated venue to analyze the evolved moving patterns. To address this problem, in this paper, we propose a new any-time trajectory clustering algorithm, called AntClu, building upon the concepts of automatic dynamic trend representation and density-based online clustering. The basic idea is to learn a dynamic representation for each trajectory to capture “current trend”, and then cluster these “trends” in an online setting. Therefore, AntClu is capable of clustering trajectories at any time, and time-changing clusters are available whenever the request comes. More importantly, unlike traditional data stream clustering approaches or online learning, AntClu is also independent of time-window. The experimental results on real-world data sets further demonstrate its effectiveness and efficiency.
Prepared to be submitted. [PDF]