Course Code
ΨΣ-ΔΚ-521
Type of Course
Mandatory [M]
Semester
2nd Semester
ECTS Credits
7.5
Objective
The main objective of this course is to present modern techniques, systems and platforms for Big Data management and scalable data analytics. Emphasis is placed on scalability, efficiency and fault tolerance across the complete Big Data life cycle, from data acquisition and integration to data processing and interpretation. Another important direction is data analytics over diverse data types, including text, web data and social data. Students are expected to acquire strong technical skills in Big Data management and to become familiar with algorithms and methods for data analytics at scale.
Course Contents
- Big Data: Basic concepts, applications, use cases, definitions, the 6 Vs (Volume, Variety, Velocity, Veracity, Validity and Volatility), opportunities and research challenges, requirements for Big Data management platforms. The Big Data analysis pipeline: data acquisition and recording, information extraction and cleaning, data aggregation, integration and representation, query processing, data modeling and analysis, interpretation. Challenges related to Big Data: heterogeneity and incompleteness, scale, timeliness, security and privacy, human collaboration.
- Batch-style processing of Big Data: Scalability, efficiency, fault-tolerance, programming solutions for Big Data analysis, MapReduce/Hadoop, HDFS, the Hadoop ecosystem, HBase, declarative querying, high-level query languages (Hive, Pig), Apache Mahout.
- Real-time processing of Big Data: Stream processing, real-time processing, main-memory data management systems, programming with Storm, high-level abstractions over Storm (Trident).
- Trends in Big Data management: NoSQL stores, key-value stores, document stores (MongoDB, CouchDB), extensible record stores (Google’s BigTable, Cassandra), modern techniques in Big Data management, data exploration, in-memory processing, in-situ processing, data visualization, novel platforms (incl. Pregel, Dremel, Giraph, F1, HANA).
- Scalable machine learning techniques: Unsupervised learning (representative clustering algorithms, the stream clustering problem), supervised learning (decision trees, support vector machines), semi-supervised learning algorithms.
- Social network data analytics: Social data, representations, management, challenges of social network data management, structural properties of social networks (centrality, degree, balance), interesting problems in social network analysis (community detection, interesting node discovery, node classification, discovery of information flows, node influence).
- Web analytics: Search algorithms, ranking, link analysis (PageRank, HITS), analyzing website traffic such as click streams, referrals, keywords, page views, and drop rates, advertising on the Web.
- Recommendation systems: Content-based systems, collaborative filtering systems, personalization, data mining techniques for large-scale recommendation systems, evaluation of recommendation systems, applications of recommendation systems.
- Analytics and mathematics: Mathematical tools and analytics, data science, modeling and analysis of large-scale data, predictive analytics, statistical analysis, regression analysis, applied statistics, sampling, time series.
- Application areas of analytics: Business value of analytics, data-driven decision-making, healthcare analytics, analytics adoption model, analysis of scientific data.
- Jagadish, H. V., Gehrke, J., Labrinidis, A., Papakonstantinou, Y., Patel, J. M., Ramakrishnan, R., Shahabi, C. (2014): Big Data and Its Technical Challenges. Communications of the ACM, Vol. 57 No. 7, pages 86-94.
- Cattell, R. (2010): Scalable SQL and NoSQL data stores. ACM SIGMOD Record, Volume 39 Issue 4, December 2010, pages 12-27.
- White, T. (2010): Hadoop: The Definitive Guide, 2nd Edition. O’Reilly Media/Yahoo Press, ISBN: 9781449389734.
- Leskovec, J., Rajaraman, A., Ullman, J.: Mining of Massive Datasets. Cambridge University Press.
Additional Readings
- Golab, L., Özsu, M.T. (2010): Data Stream Management. Morgan & Claypool Publishers, Synthesis Lectures on Data Management.
- Aggarwal, C.C. (2011): Social Network Data Analytics, Springer, ISBN: 978-1-4419-8462-3.
- Mohan, C. (2013): History Repeats Itself: Sensible and NonsenSQL Aspects of the NoSQL Hoopla. Proceedings of EDBT’13, Genoa, Italy.
- The Beckman Report on Database Research (http://beckman.cs.wisc.edu/), October 14-15, 2013.
- Selected research articles