Big Data
- GB: $2^{30}$ B
- TB, PB, EB, ZB
- data newly generated globally
- 2006: 180 EB
- 2001: 1.8 ZB
- 2020: 35 ZB
- Data Mining Examples:
- supermarket transactions
- valuable customers
- network instrusion
- gene data
- medical data
- web
- financial data
- software data
- usage data
- data from sensors
- data of arts
- audio generation
- Data mining: non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns from huge volume of data
- KDD: Knowledge Discovery in Data
- 狭义:Data mining is a core step of KDD
- 广义:KDD = DM
- data mining tasks: descriptive or predictive
- characterization
- discrimination
- association
- clustering
- classification
- regression
- outlier analysis
- trend and evolution analysis
数据仓库概念
- Property
- Subject-Oriented
- Integrated
- Time-variant
- Nonvolatile
- Data Cube: lattice of cuboids (not a paradigm of actual physical storage)
- Multidimensional Data Model
- fact table: contains the facts as well as keys to each of the related dimension tables
- schema
- star schema: 事实表+维表
- snowflake schema
- fact constalleation schema
- concept hierarchy
- schema hierarchy: Total or partial order
- set-grouping hierarchy
OLAP
OnLine Analysis Process
- OLAM architecture
- Layer 4: User Interface
- Layer 3: OLAP
- Layer 2: Data Cube
- Layer 1: Databases or Data Warehouse
- OLAP
- roll-up: climbing up a concept hierarchy
- drill-down: step down hierarchy
- slice: selection on one dimension
- dice: select on multiple dimension
- pivot: rotate the data axes in view
- Measure (度量)
- 分布的:设数据被划分为 n 个集合,函数在每一部分上的计算得到一个聚集值。如果将函数用于 n 个聚集值得到的结果,与将函数用于所有数据得到的结果一样
- 代数的:具有 $M$ 个分布聚集函数的代数函数
- 整体的
- OLAP server
- ROLAP 关系型
- Optimization technique
- Operational DBMS
- MOLAP 多维
- HOLAP 混合
- ROLAP 关系型
分块技术
-
$n$ 维数据立方体,每维有 $L_i$ 概念分层
$$T=\prod_{i=1}^n(L_i+1)$$
-
多路数组聚集: dimensions should be sorted according to their sizes in ascending order
数据仓库使用
- 信息处理:支持查询和基本的统计分析,并使用交叉表、表、图表或图进行报告。数据仓库信息处理的当前趋势是构造低代价的基于网络的存取工具,然后与网络浏览器集成在一起
- 分析处理:支持基本的 OLAP 操作,包括切片与切块、下钻、上卷和转轴。一般地,它在汇总的 和细节的历史数据上操作。与信息处理相比,联机分析处理的主要优势是它支持数据仓库的多维数据分析
- 数据挖掘:支持知识发现,包括找出隐藏的模式和关联,构造分析模型,进行分类和预测,并用可视化工具提供挖掘结果
Attribute-oriented induction
- Data generalization: a process which abstracts a large set of task- relevant data in a database from a relatively low conceptual level to higher conceptual levels
- control generalization: how high an attribute should be generalized