IfIS at AutoUni Roadshow México 2015

The Institute for Information Systems at Technische Universität Braunschweig contributes to the Roadshow México 2015 by Volkswagen AutoUni.

 

Topic: Making Regional Data Centers ready for Big Data Mining

In many companies, data mining technology is already applied successfully to gather business intelligence from large data collections. The hope is that the insights and trends derived from raw data serve as a sound basis for later decision making, product development, and innovation. But the problem of making sense of ever-growing amounts of data lies not only in the algorithms' scalability, but also in the dynamic nature of how insights are gained, refined, and, more often than not, subsequently discarded in the light of fresh incoming data. So what are the problems and challenges that data mining technologies need to face?

After completing the course, participants will have detailed knowledge of common methods for Big Data analytics. They will be able to critically discuss the impact and usefulness, but also the pitfalls, of analytical techniques. All participants need a background in computer science (CS) or management information systems (MIS). Knowledge of basic algorithms and data structures is expected.

Basic Resources for Self-Study

The following content is designed as a recorded lecture for self-study covering the most common data mining techniques:

Data Mining Overview & Association Rule Mining — Slides - Print Slides - Video
Sequence Pattern Mining & Time Series — Slides - Print Slides - Video
Classification — Slides - Print Slides - Video
Clustering — Slides - Print Slides - Video
Meta-Algorithms for Classification — Slides - Print Slides - Video

The following paper presents an overview of the main techniques discussed in the self-study part on data mining, and should be used as a starting point for each algorithm. Please refer to the citations of each technique for more specific information:

X. Wu, V. Kumar, J. Quinlan, J. Ghosh, Q. Yang, H. Motoda, G. McLachlan, A. Ng, B. Liu, P. Yu, Z. Zhou, M. Steinbach, D. Hand, D. Steinberg: Top 10 algorithms in data mining. Journal of Knowledge and Information Systems, Vol. 14(1), Springer, 2008.

Research Resources for the Group Phase

Now consider that you are dealing with Big Data: large volumes of data coming in at high velocity, in a vast variety of formats and types, and often of questionable veracity. Indeed, the capability to collect data from a growing number of sensors and devices, in a growing number of formats, has significantly outpaced the capability to process, analyze, store, and ultimately understand the resulting data sets. Prominent examples are social networks like Facebook, Twitter, or YouTube, which encourage users to generate loads of new content every minute. But the same holds on a somewhat smaller scale for company data stored in data warehousing infrastructures: everything is stored for later decision support, including every detail of each product produced, every business process monitored, and every customer interaction recorded. But what does that mean for the evolution of data mining techniques?

It means that a static way of mining data is no longer possible, and this directly affects today's data mining algorithms. Algorithms need to be scalable in the sense of distributed execution (e.g., MapReduce). Algorithms have to become dynamic, with a focus on how data evolves over time (streaming data). Algorithms have to become robust to changing data formats and to errors in the data.
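
To make the scalability requirement more concrete, the following minimal sketch counts frequent item pairs (a basic building block of association rule mining) in a MapReduce style. It is an illustration only and not part of the course material: the function names, the toy transactions, and the min_support threshold are assumptions, and the "workers" are simulated by a plain loop in a single process.

```python
from collections import Counter
from functools import reduce
from itertools import combinations


def map_partition(transactions):
    """Map step: count co-occurring item pairs within one data partition."""
    counts = Counter()
    for items in transactions:
        # Sort and deduplicate so each pair has one canonical form.
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial results by summing their counts."""
    return left + right


def frequent_pairs(partitions, min_support):
    """Return all item pairs that occur in at least min_support transactions."""
    # Each partition is processed independently; in a real cluster these
    # calls would run on different machines (here: a simple loop).
    partial = [map_partition(p) for p in partitions]
    total = reduce(reduce_counts, partial, Counter())
    return {pair: n for pair, n in total.items() if n >= min_support}


# Hypothetical toy data: two partitions with two transactions each.
partitions = [
    [["bread", "milk"], ["bread", "butter", "milk"]],
    [["milk", "butter"], ["bread", "milk"]],
]
result = frequent_pairs(partitions, min_support=2)
print(result)
```

Because the map step never looks beyond its own partition and the reduce step only sums counts, the partitions can be processed in any order and on any number of machines. Algorithms whose intermediate results cannot be merged this easily are much harder to scale, which is one of the questions raised by the papers listed under Scalability.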

The following papers serve as starting points for group discussions:

Scalability

M. J. Zaki, C.-T. Ho, R. Agrawal: Parallel classification for data mining on shared-memory multiprocessors. 15th International Conference on Data Engineering (ICDE), Sydney, Australia, 1999.
S. Parthasarathy, M. J. Zaki, M. Ogihara, W. Li: Parallel data mining for association rules on shared-memory systems. Journal of Knowledge and Information Systems, Vol. 3(1), Springer, 2001.
R. Jin, G. Yang, G. Agrawal: Shared Memory Parallelization of Data Mining Algorithms: Techniques, Programming Interface, and Performance. IEEE Transactions on Knowledge and Data Engineering (TKDE), vol. 17(1), IEEE, 2005.
J. Dean and S. Ghemawat: MapReduce: simplified data processing on large clusters. Communications of the ACM (CACM), vol. 51(1), ACM, 2008.
A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker: A comparison of approaches to large-scale data analysis. In Procs. of ACM SIGMOD International Conference on Management of Data (SIGMOD), Providence, RI, USA, 2009.

Dynamic Behavior

C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu: A framework for clustering evolving data streams. In International Conference on Very Large Data Bases (VLDB), Berlin, Germany, 2003.
D. Kifer, S. Ben-David, and J. Gehrke: Detecting change in data streams. In International Conference on Very Large Data Bases (VLDB), Toronto, Canada, 2004.
X. Song, C.-Y. Lin, B. L. Tseng, and M.-T. Sun: Modeling and predicting personal information dissemination behavior. In Procs. of 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), Chicago, IL, USA, 2005.
M. Spiliopoulou, I. Ntoutsi, Y. Theodoridis, and R. Schult: MONIC: modeling and monitoring cluster transitions. In Procs. of 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), Philadelphia, PA, USA, 2006.
A. Bifet and R. Gavalda: Learning from Time-Changing Data with Adaptive Windowing. In SIAM International Conference on Data Mining (SDM), Minneapolis, MN, USA, 2007.

Robustness

A. Deshpande, C. Guestrin, S. R. Madden, J. M. Hellerstein, and W. Hong: Model-driven data acquisition in sensor networks. In International Conference on Very Large Data Bases (VLDB), Toronto, Canada, 2004.
V. S. Sheng, F. Provost, and P. G. Ipeirotis: Get another label? improving data quality and data mining using multiple, noisy labelers. In Procs. of 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), Las Vegas, NV, USA, 2008.
H. Becker, M. Naaman, and L. Gravano: Learning similarity metrics for event identification in social media. In Procs. of ACM International Conference on Web Search and Data Mining (WSDM), New York, NY, USA, 2010.
A. Artikis, M. Weidlich, A. Gal, V. Kalogeraki, and D. Gunopulos: Self-adaptive event recognition for intelligent transport management. In IEEE International Conference on Big Data, Santa Clara, CA, USA, 2013.
X. Wang, X. Luo, H. Liu: Measuring the veracity of web event via uncertainty. Journal of Systems and Software, Vol. 102, Elsevier, 2015.

The group phase will finish on Wednesday, July 1st, with final presentations by all groups showing how Big Data affects different types of data mining algorithms, including a research-based overview of the state of the art.