Datasets & Attributes
Summarized notes from Introduction to Data Mining (Int'l ed), Chapter 2
Dataset types
Kind | Subtypes |
---|---|
Record | data matrix (all values are numeric), document data, transaction data (data with sets of items e.g. grocery items) |
Graph | graph databases, molecular structures, webpages |
Ordered | spatial data, temporal data, sequential data, genetic sequence data |
Attribute types
Dataset attributes can be categorized based on the types of meaningful operations they support:
Category | Kind | Operations | Examples | Transformations |
---|---|---|---|---|
nominal | categorical, qualitative | \(=\) \(\neq\) | zip code, id number, eye color | any permutation of values |
ordinal | categorical, qualitative | \(=\) \(\neq\) \(<\) \(>\) | letter grades, shirt size | order preserving change |
interval | numeric, quantitative | \(=\) \(\neq\) \(<\) \(>\) \(+\) \(-\) | dates, °C, °F | new_value = \(a *\) old_value \(+ b\) |
ratio | numeric, quantitative | \(=\) \(\neq\) \(<\) \(>\) \(+\) \(-\) \(\times\) \(\div\) | °K, money, counts, age, length, mass | new_value = \(a *\) old_value |
Attributes can be discrete (boolean, finite count, etc.) or continuous (real number values: age, mass, length) separately from the above categorization
Proximity measures
Attribute type | Dissimilarity | Similarity |
---|---|---|
Nominal | \begin{equation} d = \begin{cases} 0 & \text{if } x = y \newline 1 & \text{if } x \neq y \end{cases} \end{equation} | \begin{equation} s = \begin{cases} 1 & \text{if } x = y \newline 0 & \text{if } x \neq y \end{cases} \end{equation} |
Ordinal | \(\displaystyle d = \frac{\lvert x - y \rvert}{n - 1}\) (values mapped to integers \(0\) to \(n - 1\), where \(n\) is the number of values) | \(s = 1 - d\) |
Interval, ratio | \(d = \lvert x - y \rvert\) | \(s = -d\), \(\displaystyle s = \frac{1}{1 + d}\), \(s = e^{-d}\), \(\displaystyle s = 1 - \frac{d - d_{\min}}{d_{\max} - d_{\min}}\) |
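The interval/ratio row above lists several ways to turn a dissimilarity into a similarity. A minimal Python sketch of those transforms (the function names are my own, not from the text):

```python
import math

def dissimilarity(x, y):
    """Interval/ratio dissimilarity: d = |x - y|."""
    return abs(x - y)

# The similarity transforms from the table, each mapping a
# dissimilarity d >= 0 to a similarity (larger = more alike).
def sim_neg(d):
    return -d

def sim_inverse(d):
    return 1.0 / (1.0 + d)

def sim_exp(d):
    return math.exp(-d)

def sim_minmax(d, d_min, d_max):
    # rescales d into [0, 1] and flips it; needs the min/max
    # dissimilarity over all pairs in the dataset
    return 1.0 - (d - d_min) / (d_max - d_min)

d = dissimilarity(2.0, 5.0)  # d = 3.0
print(sim_inverse(d))        # 0.25
```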
other (metric) dissimilarity/distance measures: Euclidean, Minkowski, Mahalanobis, correlation
properties of metric distance measures:
- \(d(x, y) = 0\) only if \(x = y\)
- symmetric: \(d(x, y) = d(y, x)\)
- triangle inequality: \(d(x, z) \leq d(x, y) + d(y, z)\)
properties of similarity measures:
- \(s(x, y) = 1\) only if \(x = y\)
- symmetry: \(s(x, y) = s(y, x)\) for all \(x\) and \(y\)
The choice of the right measure depends on the domain (see also the note on invariance below):
- results must be consistent with domain knowledge
- other considerations: tolerance to noise, outliers, ability to discover more patterns
Vector similarity
For binary vectors:
```
x = 1 0 0 0 0 0 0 0 0 0
y = 0 0 0 0 0 0 1 0 0 1
```
Simple matching = total matches / total attributes
\(\displaystyle \frac{f_{11}+f_{00}}{f_{00}+f_{01}+f_{10}+f_{11}} = \frac{0+7}{2+1+0+7} = 0.7\)
Jaccard coefficient = total non-zero matches / total non-zero attributes
\(\displaystyle \frac{f_{11}}{f_{01}+f_{10}+f_{11}} = \frac{0}{2+1+0} = 0\)
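The two coefficients above can be sketched in Python; `smc_and_jaccard` is a name of my own, and the example vectors are consistent with the f-counts in the computations:

```python
def smc_and_jaccard(x, y):
    """Simple matching coefficient and Jaccard for equal-length binary vectors."""
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    f00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    f10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    f01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    smc = (f11 + f00) / (f00 + f01 + f10 + f11)
    # Jaccard ignores 0-0 matches entirely
    jaccard = f11 / (f01 + f10 + f11) if (f01 + f10 + f11) else 0.0
    return smc, jaccard

x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(smc_and_jaccard(x, y))  # (0.7, 0.0)
```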
For numeric vectors, use cosine similarity:
```
x = 3 2 0 5 0 0 0 2 0 0
y = 1 0 0 0 0 0 0 1 0 2
```
\(\displaystyle \cos(x, y) = \frac{x \cdot y}{\Vert x \Vert \cdot \Vert y \Vert} = \frac{3*1 + 2*1}{(3^2 + 2^2 + 5^2 + 2^2)^{0.5} \cdot (1^2 + 1^2 + 2^2)^{0.5}} = 0.315\)
Invariance
Behavior when a variable is scaled (multiplied by a constant) or translated (a constant is added):
- correlation is invariant to both scaling and translation
- cosine similarity is invariant to scaling but not to translation
- Euclidean distance will change in both cases
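The three claims can be checked numerically; a minimal sketch with plain-Python implementations of the measures (function names are my own):

```python
import math

def corr(x, y):
    """Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def cos_sim(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def euclid(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 3.0, 5.0, 7.0]
scaled = [2.0 * v for v in y]    # scaling
shifted = [v + 10.0 for v in y]  # translation

# correlation: invariant to both
print(abs(corr(x, y) - corr(x, scaled)) < 1e-9)    # True
print(abs(corr(x, y) - corr(x, shifted)) < 1e-9)   # True
# cosine: invariant to scaling, not to translation
print(abs(cos_sim(x, y) - cos_sim(x, scaled)) < 1e-9)   # True
print(abs(cos_sim(x, y) - cos_sim(x, shifted)) > 1e-3)  # True
# Euclidean distance: changes in both cases
print(abs(euclid(x, y) - euclid(x, scaled)) > 1e-3)     # True
```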
Quality Issues
Poor quality data negatively affects data processing tasks
Issue | Description |
---|---|
noise | modification of original value |
outliers | either noise or goal of analysis (anomaly) |
missing values | information not collected or not applicable => resolutions: eliminate records or variable, estimate missing values, ignore during analysis |
duplicated records | major issue when merging datasets |
wrong data | measurement error |
fake data | purposely or artificially generated records |
Preprocessing
Possible preprocessing steps
Aggregation
combines two or more attributes into one, to reduce the number of attributes or reduce variance
Sampling
reduces the number of instances when obtaining the full dataset is too expensive or time-consuming. The sample must be representative, i.e. have the same properties as the full data.
Types of sampling:
- simple random sampling
- equal probability of selecting any particular item
- can be without replacement or with replacement
- stratified sampling
- split data into partitions, then draw random samples from each partition
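Both sampling schemes can be sketched with the standard library (a minimal sketch; function names and the fixed seed are my own):

```python
import random

def simple_random_sample(data, n, seed=0):
    """Simple random sampling without replacement: each item equally likely."""
    rng = random.Random(seed)
    return rng.sample(data, n)

def stratified_sample(data, key, per_stratum, seed=0):
    """Split data into strata by `key`, then draw a random sample from each."""
    rng = random.Random(seed)
    strata = {}
    for item in data:
        strata.setdefault(key(item), []).append(item)
    sample = []
    for items in strata.values():
        sample.extend(rng.sample(items, min(per_stratum, len(items))))
    return sample

# imbalanced data: 100 records of class "a", 10 of class "b"
records = [("a", i) for i in range(100)] + [("b", i) for i in range(10)]
print(len(simple_random_sample(records, 20)))                              # 20
print(len(stratified_sample(records, key=lambda r: r[0], per_stratum=5)))  # 10
```

Stratified sampling guarantees the rare class "b" is represented, which a small simple random sample may miss.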
Discretization
Converts a continuous attribute to a discrete (ordinal) attribute; can be applied in both supervised and unsupervised settings. Methods: equal frequency, equal interval width, K-means.
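The two unsupervised methods can be sketched in a few lines (function names are my own; `equal_width_bins` assumes max > min):

```python
def equal_width_bins(values, k):
    """Assign each value to one of k equal-width intervals (assumes max > min)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Assign bins so that each bin holds roughly the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    per_bin = len(values) / k
    for rank, i in enumerate(order):
        bins[i] = min(int(rank / per_bin), k - 1)
    return bins

vals = [1, 2, 3, 4, 100]
print(equal_width_bins(vals, 2))      # [0, 0, 0, 0, 1]  (outlier dominates the range)
print(equal_frequency_bins(vals, 2))  # [0, 0, 0, 1, 1]
```

The example shows the usual trade-off: equal width is skewed by the outlier 100, while equal frequency balances the bin counts.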
Binarization
Maps a continuous or categorical attribute to one or more binary variables, e.g. by introducing one binary attribute per categorical value (a continuous attribute can be discretized first and then encoded).
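A minimal sketch of one-value-per-binary-attribute encoding (one-hot); the function name and sample data are my own:

```python
def one_hot(values):
    """Map a categorical attribute to one binary variable per distinct value."""
    categories = sorted(set(values))  # fixed column order: sorted distinct values
    return [[1 if v == c else 0 for c in categories] for v in values]

colors = ["red", "green", "blue", "green"]
print(one_hot(colors))
# columns are ['blue', 'green', 'red']:
# [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```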
Transformation
Also: normalization, standardization; maps old values to new values such that each old value can be identified with a new value. Examples of transformation functions: \(x^k\), \(e^x\), \(\log(x)\), \(\lvert x \rvert\), correlation
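Two common transformations, sketched in plain Python (function names are my own; `standardize` uses the population standard deviation):

```python
import math

def min_max_normalize(values):
    """Rescale values linearly into [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Shift to zero mean and unit (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

vals = [2.0, 4.0, 6.0, 8.0]
print(min_max_normalize(vals))  # [0.0, 0.333..., 0.666..., 1.0]
```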
Dimensionality reduction
Serves to avoid the curse of dimensionality:
- increased dimensionality causes data to become sparse in space => density and distance become less meaningful (not good for clustering)
- reduce amount of time and memory needed for computation
- ease visualization and eliminate irrelevant features and noise
- techniques: principal components analysis (PCA), singular value decomposition, supervised and non-linear techniques
- goal: find projection that captures the largest amount of variance in data
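A minimal PCA sketch via the eigendecomposition of the covariance matrix, assuming NumPy is available (the function name and sample data are my own):

```python
import numpy as np

def pca(X, k):
    """Project rows of X onto the k directions of largest variance."""
    Xc = X - X.mean(axis=0)                 # center each attribute
    cov = np.cov(Xc, rowvar=False)          # covariance matrix of the attributes
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
    components = eigvecs[:, ::-1][:, :k]    # top-k eigenvectors (largest variance)
    return Xc @ components

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # 100 instances, 5 attributes
Z = pca(X, 2)
print(Z.shape)  # (100, 2)
```

The first projected dimension carries at least as much variance as the second, matching the stated goal.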
Feature subset selection
Another dimensionality reduction method, whose purpose is to remove redundant and irrelevant features
Feature creation
Create new attributes that capture the important information more efficiently than the original attributes. Some methods:
- feature extraction: e.g. extracting edges from images
- feature construction: e.g. dividing mass by volume to get density
- mapping data to new space: Fourier and wavelet analysis