Datasets & Attributes
Summarized notes from Introduction to Data Mining (Int'l ed), Chapter 2
Dataset types
Kind | Subtypes |
---|---|
Record | data matrix (all values are numeric), document data, transaction data (data with sets of items e.g. grocery items) |
Graph | graph databases, molecular structures, webpages |
Ordered | spatial data, temporal data, sequential data, genetic sequence data |
Attribute types
Dataset attributes can be categorized based on the types of meaningful operations they support:
Category | Kind | Operations | Examples | Transformations |
---|---|---|---|---|
nominal | categorical, qualitative | \(=\) \(\neq\) | zip code, id number, eye color | any permutation of values |
ordinal | categorical, qualitative | \(=\) \(\neq\) \(<\) \(>\) | letter grades, shirt size | order preserving change |
interval | numeric, quantitative | \(=\) \(\neq\) \(<\) \(>\) \(+\) \(-\) | dates, °C, °F | new_value = \(a *\) old_value \(+ b\) |
ratio | numeric, quantitative | \(=\) \(\neq\) \(<\) \(>\) \(+\) \(-\) \(\times\) \(\div\) | °K, money, counts, age, length, mass | new_value = \(a *\) old_value |
Attributes can be discrete (boolean, finite count, etc.) or continuous (real number values: age, mass, length) separately from the above categorization
Proximity measures
Attribute type | Dissimilarity | Similarity |
---|---|---|
Nominal | \begin{equation} d = \begin{cases} 0 & \text{if } x = y \newline 1 & \text{if } x \neq y \end{cases} \end{equation} | \begin{equation} s = \begin{cases} 1 & \text{if } x = y \newline 0 & \text{if } x \neq y \end{cases} \end{equation} |
Ordinal | \(\displaystyle d = \frac{\lvert x - y \rvert}{n - 1}\) (values mapped to integers \(0\) to \(n - 1\), where \(n\) is the number of values) | \(s = 1 - d\) |
Interval, ratio | \(d = \lvert x - y \rvert\) | \(s = -d\), \(\displaystyle s = \frac{1}{1 + d}\), \(s = e^{-d}\), \(\displaystyle s = 1 - \frac{d - d_{\min}}{d_{\max} - d_{\min}}\) |
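The interval/ratio row above lists several ways to turn a dissimilarity into a similarity. A minimal Python sketch of those transforms (the function names are my own, not from the text):

```python
import math

def dissimilarity(x, y):
    """Interval/ratio dissimilarity: d = |x - y|."""
    return abs(x - y)

# The similarity transforms from the table, each mapping a
# dissimilarity d >= 0 to a similarity (larger = more alike).
def sim_neg(d):
    return -d

def sim_inverse(d):
    return 1.0 / (1.0 + d)

def sim_exp(d):
    return math.exp(-d)

def sim_minmax(d, d_min, d_max):
    # rescales d into [0, 1] and flips it; needs the min/max
    # dissimilarity over all pairs in the dataset
    return 1.0 - (d - d_min) / (d_max - d_min)

d = dissimilarity(2.0, 5.0)  # d = 3.0
print(sim_inverse(d))        # 0.25
```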
other (metric) dissimilarity/distance measures: Euclidean, Minkowski, Mahalanobis, correlation
properties of metric distance measures:
- \(d(x, y) = 0\) only if \(x = y\)
- symmetric: \(d(x, y) = d(y, x)\)
- triangle inequality: \(d(x, z) \leq d(x, y) + d(y, z)\)
properties of similarity measures:
- \(s(x, y) = 1\) only if \(x = y\)
- symmetry: \(s(x, y) = s(y, x)\) for all \(x\) and \(y\)
The choice of the right measure depends on the domain (see also the note on invariance below):
- results must be consistent with domain knowledge
- other considerations: tolerance to noise, outliers, ability to discover more patterns
Vector similarity
For binary vectors:
```
x = 1 0 0 0 0 0 0 0 0 0
y = 0 0 0 0 0 0 1 0 0 1
```
Simple matching = total matches / total attributes
\(\displaystyle \frac{f_{11}+f_{00}}{f_{00}+f_{01}+f_{10}+f_{11}} = \frac{0+7}{2+1+0+7} = 0.7\)
Jaccard coefficient = total non-zero matches / total non-zero attributes
\(\displaystyle \frac{f_{11}}{f_{01}+f_{10}+f_{11}} = \frac{0}{2+1+0} = 0\)
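The two coefficients above can be sketched in Python; `smc_and_jaccard` is a name of my own, and the example vectors are consistent with the f-counts in the computations:

```python
def smc_and_jaccard(x, y):
    """Simple matching coefficient and Jaccard for equal-length binary vectors."""
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    f00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    f10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    f01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    smc = (f11 + f00) / (f00 + f01 + f10 + f11)
    # Jaccard ignores 0-0 matches entirely
    jaccard = f11 / (f01 + f10 + f11) if (f01 + f10 + f11) else 0.0
    return smc, jaccard

x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(smc_and_jaccard(x, y))  # (0.7, 0.0)
```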
For numeric vectors, use cosine similarity:
```
x = 3 2 0 5 0 0 0 2 0 0
y = 1 0 0 0 0 0 0 1 0 2
```
\(\displaystyle \cos(x, y) = \frac{x \cdot y}{\Vert x \Vert \cdot \Vert y \Vert} = \frac{3*1 + 2*1}{(3^2 + 2^2 + 5^2 + 2^2)^{0.5} \cdot (1^2 + 1^2 + 2^2)^{0.5}} = 0.315\)
Invariance
Behavior when a variable is scaled (multiplied by a constant) or translated (a constant is added):
- correlation is invariant to both scaling and translation
- cosine similarity is invariant to scaling but not to translation
- Euclidean distance will change in both cases
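The three claims can be checked numerically; a minimal sketch with plain-Python implementations of the measures (function names are my own):

```python
import math

def corr(x, y):
    """Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def cos_sim(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def euclid(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 3.0, 5.0, 7.0]
scaled = [2.0 * v for v in y]    # scaling
shifted = [v + 10.0 for v in y]  # translation

# correlation: invariant to both
print(abs(corr(x, y) - corr(x, scaled)) < 1e-9)    # True
print(abs(corr(x, y) - corr(x, shifted)) < 1e-9)   # True
# cosine: invariant to scaling, not to translation
print(abs(cos_sim(x, y) - cos_sim(x, scaled)) < 1e-9)   # True
print(abs(cos_sim(x, y) - cos_sim(x, shifted)) > 1e-3)  # True
# Euclidean distance: changes in both cases
print(abs(euclid(x, y) - euclid(x, scaled)) > 1e-3)     # True
```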
Quality Issues
Poor quality data negatively affects data processing tasks
Issue | Description |
---|---|
noise | modification of original value |
outliers | either noise or goal of analysis (anomaly) |
missing values | information not collected or not applicable => resolutions: eliminate records or variable, estimate missing values, ignore during analysis |
duplicated records | major issue when merging datasets |
wrong data | measurement error |
fake data | purposely or artificially generated records |
Preprocessing
Possible preprocessing steps
Aggregation
combines two or more attributes into one, to reduce the number of attributes or reduce variance
Sampling
reduces the number of instances when obtaining the full dataset is too expensive or time-consuming. The sample must be representative, i.e. have the same properties as the full data.
Types of sampling:
- simple random sampling
- equal probability of selecting any particular item
- can be without replacement or with replacement
- stratified sampling
- split data into partitions, then draw random samples from each partition
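Both sampling schemes can be sketched with the standard library (a minimal sketch; function names and the fixed seed are my own):

```python
import random

def simple_random_sample(data, n, seed=0):
    """Simple random sampling without replacement: each item equally likely."""
    rng = random.Random(seed)
    return rng.sample(data, n)

def stratified_sample(data, key, per_stratum, seed=0):
    """Split data into strata by `key`, then draw a random sample from each."""
    rng = random.Random(seed)
    strata = {}
    for item in data:
        strata.setdefault(key(item), []).append(item)
    sample = []
    for items in strata.values():
        sample.extend(rng.sample(items, min(per_stratum, len(items))))
    return sample

# imbalanced data: 100 records of class "a", 10 of class "b"
records = [("a", i) for i in range(100)] + [("b", i) for i in range(10)]
print(len(simple_random_sample(records, 20)))                              # 20
print(len(stratified_sample(records, key=lambda r: r[0], per_stratum=5)))  # 10
```

Stratified sampling guarantees the rare class "b" is represented, which a small simple random sample may miss.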
Discretization
Converts a continuous attribute to a discrete (ordinal) attribute; can be applied in both supervised and unsupervised settings. Methods: equal frequency, equal interval width, K-means.
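The two unsupervised methods can be sketched in a few lines (function names are my own; `equal_width_bins` assumes max > min):

```python
def equal_width_bins(values, k):
    """Assign each value to one of k equal-width intervals (assumes max > min)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Assign bins so that each bin holds roughly the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    per_bin = len(values) / k
    for rank, i in enumerate(order):
        bins[i] = min(int(rank / per_bin), k - 1)
    return bins

vals = [1, 2, 3, 4, 100]
print(equal_width_bins(vals, 2))      # [0, 0, 0, 0, 1]  (outlier dominates the range)
print(equal_frequency_bins(vals, 2))  # [0, 0, 0, 1, 1]
```

The example shows the usual trade-off: equal width is skewed by the outlier 100, while equal frequency balances the bin counts.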
Binarization
Maps a continuous or categorical attribute to one or more binary variables, e.g. by introducing one binary attribute per categorical value (a continuous attribute can be discretized first and then encoded).
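A minimal sketch of one-value-per-binary-attribute encoding (one-hot); the function name and sample data are my own:

```python
def one_hot(values):
    """Map a categorical attribute to one binary variable per distinct value."""
    categories = sorted(set(values))  # fixed column order: sorted distinct values
    return [[1 if v == c else 0 for c in categories] for v in values]

colors = ["red", "green", "blue", "green"]
print(one_hot(colors))
# columns are ['blue', 'green', 'red']:
# [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```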
Transformation
Also: normalization, standardization; maps old values to new values such that each old value can be identified with a new value. Examples of transformation functions: \(x^k\), \(e^x\), \(\log(x)\), \(\lvert x \rvert\), correlation
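Two common transformations, sketched in plain Python (function names are my own; `standardize` uses the population standard deviation):

```python
import math

def min_max_normalize(values):
    """Rescale values linearly into [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Shift to zero mean and unit (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

vals = [2.0, 4.0, 6.0, 8.0]
print(min_max_normalize(vals))  # [0.0, 0.333..., 0.666..., 1.0]
```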
Dimensionality reduction
Serves to avoid the curse of dimensionality:
- increased dimensionality causes data to become sparse in space => density and distance become less meaningful (not good for clustering)
- reduce amount of time and memory needed for computation
- ease visualization and eliminate irrelevant features and noise
- techniques: principal components analysis (PCA), singular value decomposition, supervised and non-linear techniques
- goal: find projection that captures the largest amount of variance in data
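A minimal PCA sketch via the eigendecomposition of the covariance matrix, assuming NumPy is available (the function name and sample data are my own):

```python
import numpy as np

def pca(X, k):
    """Project rows of X onto the k directions of largest variance."""
    Xc = X - X.mean(axis=0)                 # center each attribute
    cov = np.cov(Xc, rowvar=False)          # covariance matrix of the attributes
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
    components = eigvecs[:, ::-1][:, :k]    # top-k eigenvectors (largest variance)
    return Xc @ components

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # 100 instances, 5 attributes
Z = pca(X, 2)
print(Z.shape)  # (100, 2)
```

The first projected dimension carries at least as much variance as the second, matching the stated goal.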
Feature subset selection
Another dimensionality reduction method, whose purpose is to remove redundant and irrelevant features
Feature creation
Create new attributes that capture the important information more efficiently than the original attributes. Some methods:
- feature extraction: e.g. extracting edges from images
- feature construction: e.g. dividing mass by volume to get density
- mapping data to new space: Fourier and wavelet analysis