Datasets & Attributes

Summarized notes from Introduction to Data Mining (Int'l ed), Chapter 2

Dataset types

| Kind    | Subtypes |
|---------|----------|
| Record  | data matrix (all values numeric), document data, transaction data (data with sets of items, e.g. grocery items) |
| Graph   | graph databases, molecular structures, webpages |
| Ordered | spatial data, temporal data, sequential data, genetic sequence data |

Attribute types

Dataset attributes can be categorized based on the types of meaningful operations they support:

| Category | Kind | Operations | Examples | Transformations |
|----------|------|------------|----------|-----------------|
| nominal  | categorical, qualitative | \(=\), \(\neq\) | zip code, ID number, eye color | any permutation of values |
| ordinal  | categorical, qualitative | \(=\), \(\neq\), \(<\), \(>\) | letter grades, shirt size | any order-preserving change |
| interval | numeric, quantitative | \(=\), \(\neq\), \(<\), \(>\), \(+\), \(-\) | dates, °C, °F | \(\text{new\_value} = a \cdot \text{old\_value} + b\) |
| ratio    | numeric, quantitative | \(=\), \(\neq\), \(<\), \(>\), \(+\), \(-\), \(\times\), \(\div\) | temperature in kelvin, money, counts, age, length, mass | \(\text{new\_value} = a \cdot \text{old\_value}\) |

Separately from the above categorization, attributes can be discrete (boolean, counts, etc.) or continuous (real-valued: age, mass, length).

Proximity measures

|                 | Dissimilarity | Similarity |
|-----------------|---------------|------------|
| Nominal         | \(d = \begin{cases} 0 & \text{if } x = y \newline 1 & \text{if } x \neq y \end{cases}\) | \(s = \begin{cases} 1 & \text{if } x = y \newline 0 & \text{if } x \neq y \end{cases}\) |
| Ordinal         | \(\displaystyle d = \frac{\lvert x - y \rvert}{n - 1}\) (values mapped to integers \(0, \dots, n-1\), where \(n\) is the number of values) | \(s = 1 - d\) |
| Interval, ratio | \(d = \lvert x - y \rvert\) | \(s = -d\)     \(\displaystyle s = \frac{1}{1 + d}\)     \(s = e^{-d}\)     \(\displaystyle s = 1 - \frac{d - d_{\min}}{d_{\max} - d_{\min}}\) |
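A minimal Python sketch of the ordinal row of this table, assuming shirt sizes mapped to the integers 0 to n−1 (the value names are made up):

```python
# Ordinal proximity from the table: map the n ordered values to 0..n-1,
# then d = |x - y| / (n - 1) and s = 1 - d.
sizes = ["S", "M", "L", "XL"]
rank = {v: i for i, v in enumerate(sizes)}

def ordinal_dissimilarity(a: str, b: str) -> float:
    n = len(sizes)
    return abs(rank[a] - rank[b]) / (n - 1)

d = ordinal_dissimilarity("S", "L")  # 2/3
s = 1 - d                            # 1/3
```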
  • other (metric) dissimilarity/distance measures: Euclidean, Minkowski, Mahalanobis; also correlation-based measures

  • properties of metric distance measures:

    • positivity: \(d(x, y) \ge 0\) for all \(x\) and \(y\), and \(d(x, y) = 0\) only if \(x = y\)
    • symmetry: \(d(x, y) = d(y, x)\) for all \(x\) and \(y\)
    • triangle inequality: \(d(x, z) \le d(x, y) + d(y, z)\)
  • properties of similarity measures:

    • \(s(x, y) = 1\) only if \(x = y\)
    • symmetry: \(s(x, y) = s(y, x)\) for all \(x\) and \(y\)
  • the choice of the right measure depends on the domain; see also the note on invariance below

    • results must be consistent with domain knowledge
    • other considerations: tolerance to noise and outliers, ability to discover more types of patterns

Vector similarity

For binary vectors:

x = 1 0 0 0 0 0 0 0 0 0
y = 0 0 0 0 0 0 1 0 0 1

Simple matching coefficient = number of matches / number of attributes, where \(f_{ab}\) counts the attributes on which \(x\) has value \(a\) and \(y\) has value \(b\):

\(\displaystyle \text{SMC} = \frac{f_{11}+f_{00}}{f_{00}+f_{01}+f_{10}+f_{11}} = \frac{0+7}{7+2+1+0} = 0.7\)


Jaccard coefficient = number of 1–1 matches / number of attributes that are non-zero in at least one vector:

\(\displaystyle J = \frac{f_{11}}{f_{01}+f_{10}+f_{11}} = \frac{0}{2+1+0} = 0\)
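A minimal Python sketch of both coefficients for the example vectors above (the function name is my own):

```python
import numpy as np

def binary_similarities(x, y):
    """Simple matching and Jaccard coefficients for two binary vectors."""
    x, y = np.asarray(x), np.asarray(y)
    f11 = int(np.sum((x == 1) & (y == 1)))
    f00 = int(np.sum((x == 0) & (y == 0)))
    f10 = int(np.sum((x == 1) & (y == 0)))
    f01 = int(np.sum((x == 0) & (y == 1)))
    smc = (f11 + f00) / (f00 + f01 + f10 + f11)
    jaccard = f11 / (f01 + f10 + f11)
    return smc, jaccard

x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(binary_similarities(x, y))  # (0.7, 0.0)
```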


For numeric vectors, use cosine similarity:

x = 3 2 0 5 2 0
y = 1 0 0 0 1 2

\(\displaystyle \cos(x, y) = \frac{x \cdot y}{\Vert x \Vert \cdot \Vert y \Vert} = \frac{3*1 + 2*1}{(3^2 + 2^2 + 5^2 + 2^2)^{0.5} \cdot (1^2 + 1^2 + 2^2)^{0.5}} = 0.315\)
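The same computation in Python, as a minimal sketch:

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine of the angle between two numeric vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = [3, 2, 0, 5, 2, 0]
y = [1, 0, 0, 0, 1, 2]
print(round(cosine_similarity(x, y), 3))  # 0.315
```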

Invariance

Behavior of each measure when a variable is scaled (multiplied by a constant) or translated (a constant is added), demonstrated in the sketch after the list:

  • correlation is invariant to both scaling and translation
  • cosine similarity is invariant to scaling but not to translation
  • Euclidean distance will change in both cases
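A quick numeric check of all three claims, using made-up vectors:

```python
import numpy as np

x = np.array([3.0, 2.0, 0.0, 5.0, 2.0, 0.0])
y = np.array([1.0, 0.0, 0.0, 0.0, 1.0, 2.0])
x2 = 4 * x + 10  # scaled and translated copy of x

def cos(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# True: correlation is invariant to scaling and translation.
print(np.isclose(np.corrcoef(x, y)[0, 1], np.corrcoef(x2, y)[0, 1]))
# True: cosine is invariant to scaling alone.
print(np.isclose(cos(4 * x, y), cos(x, y)))
# False: cosine is not invariant to translation.
print(np.isclose(cos(x2, y), cos(x, y)))
# False: Euclidean distance changes in both cases.
print(np.isclose(np.linalg.norm(x - y), np.linalg.norm(x2 - y)))
```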

Quality Issues

Poor quality data negatively affects data processing tasks

| Issue | Notes |
|-------|-------|
| noise | modification of original values |
| outliers | either noise or the goal of the analysis (anomaly detection) |
| missing values | information not collected or not applicable; resolutions: eliminate records or variables, estimate missing values, ignore during analysis |
| duplicated records | major issue when merging datasets |
| wrong data | measurement error |
| fake data | purposely or artificially generated records |

Preprocessing

Possible preprocessing steps

Aggregation

Combines two or more attributes (or objects) into one, to reduce the number of attributes/objects or to reduce variance.
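A minimal pandas sketch with made-up daily data; aggregating days into months reduces the record count and smooths out day-to-day variance:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
daily = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=90, freq="D"),
    "sales": rng.normal(loc=100, scale=20, size=90),
})

# Aggregate 90 daily records into 3 monthly means.
monthly = daily.groupby(daily["date"].dt.to_period("M"))["sales"].mean()
print(daily["sales"].std(), monthly.std())  # monthly means vary far less
```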

Sampling

Reduces the number of instances when obtaining or processing the full data is too expensive or time-consuming. The sample must be representative, i.e. have (approximately) the same properties as the full data. Both schemes below are sketched in code after the list.

Types of sampling:

  • simple random sampling
    • equal probability of selecting any particular item
    • can be without replacement or with replacement
  • stratified sampling
    • split data into partitions, then draw random samples from each partition
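A minimal sketch of both schemes, assuming a made-up label array as the stratification variable:

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(1000)   # stand-in for row indices of the full dataset
labels = data % 3        # stand-in for a class/stratum per row

# Simple random sampling, without replacement
# (set replace=True for sampling with replacement).
simple = rng.choice(data, size=99, replace=False)

# Stratified sampling: partition by label, then draw a random
# sample from each partition.
stratified = np.concatenate(
    [rng.choice(data[labels == k], size=33, replace=False) for k in range(3)]
)
```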

Discretization

Converts a continuous attribute to a discrete (ordinal) attribute; can be applied in both supervised and unsupervised settings. Methods: equal frequency, equal interval width, K-means.
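A minimal numpy sketch of the two unsupervised methods named above, on made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(size=1000)  # a continuous attribute

# Equal interval width: 4 bins of identical width across the range.
edges_width = np.linspace(values.min(), values.max(), num=5)
codes_width = np.digitize(values, edges_width[1:-1])  # ordinal codes 0..3

# Equal frequency: edges at the quartiles, so each bin holds ~250 values.
edges_freq = np.quantile(values, [0.25, 0.5, 0.75])
codes_freq = np.digitize(values, edges_freq)
```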

Binarization

Maps a continuous or categorical attribute to one or more binary variables; example:

Category    Integer    x1  x2  x3  x4  x5
awful         0        1   0   0   0   0
poor          1        0   1   0   0   0
OK            2        0   0   1   0   0
good          3        0   0   0   1   0
great         4        0   0   0   0   1
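A minimal sketch reproducing the table's encoding:

```python
import numpy as np

categories = ["awful", "poor", "OK", "good", "great"]
index = {c: i for i, c in enumerate(categories)}

def one_hot(value):
    """Map a category to the binary variables x1..x5 from the table."""
    vec = np.zeros(len(categories), dtype=int)
    vec[index[value]] = 1
    return vec

print(one_hot("OK"))  # [0 0 1 0 0]
```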

Transformation

Also called normalization or standardization; maps the entire set of values of an attribute to a new set of values such that each old value can be identified with one of the new values. Examples of transformation functions: \(x^k\), \(e^x\), \(\log x\), \(\lvert x \rvert\), correlation.
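Two common normalizations as a minimal sketch (the values are made up):

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

standardized = (x - x.mean()) / x.std()        # zero mean, unit variance
min_max = (x - x.min()) / (x.max() - x.min())  # rescaled to [0, 1]
```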

Dimensionality reduction

Used to avoid the curse of dimensionality (a PCA sketch follows the list):

  • increased dimensionality causes data to become sparse in space => density and distance become less meaningful (not good for clustering)
  • reduce amount of time and memory needed for computation
  • ease visualization and eliminate irrelevant features and noise
  • techniques: principal components analysis (PCA), singular value decomposition, supervised and non-linear techniques
  • goal: find projection that captures the largest amount of variance in data
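A minimal PCA sketch via SVD on made-up data, using nothing beyond numpy:

```python
import numpy as np

def pca_project(X, k):
    """Project X onto its top-k principal components.

    Center the data; the right singular vectors of the centered matrix
    are the directions of maximal variance.
    """
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
print(pca_project(X, 2).shape)  # (100, 2)
```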

Feature subset selection

Another dimensionality reduction method, whose purpose is to remove redundant and irrelevant features

Feature creation

Create new attributes that capture the important information more efficiently than the original attributes. Some methods:

  • feature extraction: e.g. extracting edges from images
  • feature construction: e.g. dividing mass by volume to get density (sketched after the list)
  • mapping data to new space: Fourier and wavelet analysis
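A minimal sketch of the feature-construction example, with made-up measurements:

```python
import numpy as np

mass = np.array([10.0, 4.0, 8.0])   # hypothetical measurements
volume = np.array([2.0, 1.0, 4.0])

density = mass / volume  # one constructed feature replaces two originals
```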