Introduction to Data Science

Basic Concepts

Cases

Cases aka rows, observations, records, units, instances

Variables

Variables aka columns, fields, attributes, features, measurements, properties, parameters

Variable Categories

Categorical (qualitative) variables (aka factors, discrete)
- Nominal variables (no order)
- Ordinal variables (order)
- Cyclic variables (circular order, e.g. days of the week, months of the year, etc.)
Quantitative (numerical) variables
- continuous variables
- discrete variables
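
As a rough illustration of how these categories can map onto R types (all values below are made up for illustration), nominal variables are commonly stored as factors, ordinal variables as ordered factors, and quantitative variables as numeric or integer vectors:

```r
# Nominal: categories with no inherent order
colour <- factor(c("red", "blue", "red", "green"))

# Ordinal: categories with a meaningful order
size <- factor(c("small", "large", "medium", "small"),
               levels = c("small", "medium", "large"),
               ordered = TRUE)

# Quantitative: continuous (double) and discrete (integer) values
height_cm  <- c(172.5, 160.2, 181.0, 168.4)
n_siblings <- c(0L, 2L, 1L, 3L)

is.ordered(colour)   # FALSE
is.ordered(size)     # TRUE
size < "large"       # comparisons are allowed for ordered factors
```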

Variable roles

Explanatory aka x, predictor, independent, input variable
Response aka y, target, dependent, outcome, output variable
Identifiers aka key, index, unique identifier, unique key, unique index, row name, ID
Confounding - a third variable that affects the relationship between the explanatory and response variables aka lurking, hidden, nuisance, spurious variable
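
A small simulation can make the idea of a confounder concrete. This is only a sketch with simulated data: z is a hypothetical confounder that drives both x and y, while x has no direct effect on y.

```r
set.seed(1)
n <- 1000
z <- rnorm(n)          # confounder
x <- z + rnorm(n)      # explanatory variable, driven by z
y <- z + rnorm(n)      # response, driven by z but not by x

cor(x, y)                   # clearly positive although x has no effect on y
coef(lm(y ~ x))["x"]        # slope looks non-zero when z is ignored
coef(lm(y ~ x + z))["x"]    # slope is close to zero once z is included
```

Ignoring z makes x look predictive of y; including z in the model removes most of that apparent effect.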

Association & Causation

Association: a statistical relationship between two variables
Causation: a cause-and-effect relationship between two variables

Randomised experiment

The control group is the group that receives no treatment
The treatment group is the group that receives the treatment
A placebo is a fake treatment; a placebo group receives it in place of the real treatment
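
In a randomised experiment the subjects are allocated to the groups by chance. A minimal sketch of such an allocation, assuming ten hypothetical subjects split evenly between two groups:

```r
set.seed(42)
subjects <- paste0("S", 1:10)                               # hypothetical subject IDs
group <- sample(rep(c("control", "treatment"), each = 5))   # shuffle the group labels
assignment <- data.frame(subject = subjects, group = group)

assignment
table(assignment$group)   # 5 subjects in each group
```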

Observational & Experimental Studies

Observational study: the researcher observes and measures the variables of interest, but does not attempt to influence the responses
Experimental study: the researcher applies a treatment and then observes the effect of the treatment on the response variable

Cardinality

The cardinality of a variable is the number of distinct (unique) values it takes
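
A short sketch of counting distinct values in base R; the vector is made up, and iris is a data set that ships with R:

```r
x <- c("a", "b", "a", "c", "b", "a")   # made-up values
length(unique(x))                      # 3 distinct values: "a", "b", "c"

# Cardinality of every column in a data frame
sapply(iris, function(col) length(unique(col)))
```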

Population & Sample

Population: the entire group of individuals or instances about whom we hope to learn
Sample: a subset of the population
- Sampling bias: a sample that is not representative of the population
- Response bias: a systematic favouring of certain outcomes that occurs when individuals do not respond truthfully or are asked misleading questions in a study.
- Non-response bias: a systematic favouring of certain outcomes that occurs when the individuals who choose to participate in a study differ from those who choose not to participate.
- Haphazard and Random

Simple Random Sample
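
In a simple random sample every case has the same chance of being selected. A minimal sketch using base R's sample(), assuming a made-up population of 100 IDs:

```r
set.seed(123)
population <- 1:100                    # hypothetical population of IDs
srs <- sample(population, size = 10)   # sampling without replacement is the default
srs

# The same idea applied to the rows of a data frame (mtcars ships with R)
mtcars[sample(nrow(mtcars), size = 5), ]
```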

Standard Deviation

The standard deviation is a measure of the amount of variation or dispersion of a set of values.

Features

A low standard deviation indicates that the values tend to be close to the mean of the set, while a high standard deviation indicates that the values are spread out over a wider range. In other words, the data are more dispersed and the distribution curve is wider.
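
Two made-up vectors with the same mean but different spread illustrate the point:

```r
tight <- c(9, 10, 11)
wide  <- c(0, 10, 20)

mean(tight); mean(wide)   # both means are 10
sd(tight)                 # 1
sd(wide)                  # 10
```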

Sample Standard deviation (SD)

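# Example data
x <- c(2, 3, 5, 6, 7, 8, 9, 10, 12, 15)

# Sample standard deviation (divides by n - 1)
sample_sd <- sd(x)

# Population standard deviation (divides by n); sd() and var() always use n - 1
population_sd <- sqrt(mean((x - mean(x))^2))
# or, equivalently
n <- length(x)
population_sd <- sd(x) * sqrt((n - 1) / n)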

Z Score (you need to be able to work this out with a z-score table)

A Z-score is a numerical measurement that describes a value’s relationship to the mean of a group of values. Z-score is measured in terms of standard deviations from the mean. If a Z-score is 0, it indicates that the data point’s score is identical to the mean score. A Z-score of 1.0 would indicate a value that is one standard deviation from the mean. Z-scores may be positive or negative, with a positive value indicating the score is above the mean and a negative score indicating it is below the mean.
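
The definition translates directly into R. The sketch below uses made-up values, and pnorm() and qnorm() stand in for a z-table lookup under a normal model:

```r
x <- c(23, 24, 26, 26, 28, 29, 30, 31, 33, 34)   # made-up sample
z <- (x - mean(x)) / sd(x)                       # same as as.numeric(scale(x))
z

# In place of a z-table lookup:
pnorm(1.0)      # proportion of a normal distribution below z = 1, about 0.8413
qnorm(0.975)    # z-score with 97.5% of the distribution below it, about 1.96
```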

Mode

# Calculate Mode
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

data <- c(2, 19, 44, 44, 44, 51, 56, 78, 86, 99, 99)
mode_value <- Mode(data)

Median

data <- c(23, 24, 26, 26, 28, 29, 30, 31, 33, 34)
result <- median(data)

Weighted Median

weighted.median(values, weights) # values and weights are vectors of the same length; weighted.median() is not in base R and is assumed here to come from an add-on package such as spatstat

Mean

Arithmetic Mean

x <- c(3, 7, 5, 13, 20, 23, 39, 23, 40, 23, 14, 12, 59, 23)
mean(x)

Geometric Mean

x <- c(8, 9, 4, 1, 6, 4, 6, 2, 5)
exp(mean(log(x)))  # one method
prod(x)^(1/length(x))  # second method
psych::geometric.mean(x) # third method

Weighted Mean

x <- c(3, 7, 5, 13, 20, 23, 39, 23, 40, 23, 14, 12, 59, 23)
weights <- c(3.1, 1.3, 2.4, 1.0, 3.5, 3.5, 1.1, 1.3, 1.6, 1.9, 4.1, 2.4, 1.4, 0.2)
weighted.mean(x, weights) # base R; matrixStats::weightedMean(x, weights) is an alternative

Harmonic Mean

x <- c(8, 9, 4, 1, 6, 4, 6, 2, 5)
1/mean(1/x) # one method
psych::harmonic.mean(x) # second method

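Trimmed Mean

    data <- c(2, 3, 5, 6, 7, 8, 9, 10, 12, 15)
    # Fraction to trim from each end (with 10 values, 5% would remove nothing)
    trim <- 0.1
    # Number of extreme values to drop from each end; floor() matches mean(x, trim = )
    trim_amount <- floor(length(data) * trim)
    # Sort the data
    sorted_data <- sort(data)
    # Drop the lowest and highest 10% of the values
    trimmed_data <- sorted_data[(trim_amount + 1):(length(sorted_data) - trim_amount)]
    # Compute the trimmed mean
    trimmed_mean <- mean(trimmed_data)

    # Alternatively, use mean() with its trim argument (a fraction from 0 to 0.5)
    mean(data, trim = trim)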

Range

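The highest value minus the lowest value

# Define a vector
x <- c(2, 3, 5, 6, 7, 8, 9, 10, 12, 15)
# range() returns the minimum and the maximum
range_x <- range(x)
# The range itself is the difference between the two
diff(range_x)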

Inter-quartile range (IQR)

The 75th percentile minus the 25th percentile

# quantile() example
x <- c(2, 3, 5, 6, 7, 8, 9, 10, 12, 15)
quantile(x, probs = c(0.25, 0.75)) # returns the first and third quartiles
IQR(x) # third quartile minus first quartile

Median absolute deviation (MAD)

mad(students$score)
# or
1.4826 * median( abs(students$score - median(students$score)) )
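
The students data frame is not defined in these notes, so the sketch below uses hypothetical scores instead. The constant 1.4826 rescales the MAD so that it estimates the standard deviation when the data are normally distributed; setting constant = 1 gives the raw MAD.

```r
scores <- c(52, 61, 64, 67, 70, 71, 75, 78, 83, 98)   # hypothetical scores

mad(scores)                                     # scaled MAD (default constant = 1.4826)
1.4826 * median(abs(scores - median(scores)))   # the same value computed by hand
mad(scores, constant = 1)                       # raw, unscaled MAD
```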

Average absolute deviation (from the mean)

lsr::aad(students$score) 
# or

# Define a vector
x <- c(2, 3, 5, 6, 7, 8, 9, 10, 12, 15)
# Compute the mean
mean_x <- mean(x)
# Compute the absolute deviation of each data point from the mean
absolute_deviations <- abs(x - mean_x)
# Compute the average absolute deviation
average_absolute_deviation <- mean(absolute_deviations)

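Covariance

Covariance measures the linear relationship between two variables. Its sign gives the direction of the relationship: a positive covariance indicates a positive relationship, a negative covariance indicates a negative relationship, and a covariance close to zero indicates little or no linear relationship. A larger absolute value points to a stronger relationship, but the magnitude also depends on the units of the two variables, so strength is better compared using the correlation coefficient.

# Define two variables
x <- c(1, 2, 3, 4, 5)
y <- c(2, 3, 5, 7, 6)

# Compute the covariance
covariance_xy <- cov(x, y)

# Print the result
print(covariance_xy)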

Correlation coefficient

The covariance is also related to the correlation coefficient, which is a measure of the linear relationship between two variables. The correlation coefficient is calculated by dividing the covariance by the product of the standard deviations of the two variables.

Formula: cov(x, y) = cor(x, y) * sd(x) * sd(y), i.e. the covariance of x and y equals their correlation coefficient multiplied by the product of their standard deviations

# Define two variables
x <- c(1, 2, 3, 4, 5)
y <- c(2, 3, 5, 7, 6)

# Compute the correlation coefficient
correlation_xy <- cor(x, y)

# Print the result
print(correlation_xy)
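
The relationship between covariance and correlation can be checked numerically with the same two vectors:

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 3, 5, 7, 6)

cov(x, y)                      # 3
cor(x, y) * sd(x) * sd(y)      # also 3
cov(x, y) / (sd(x) * sd(y))    # equals cor(x, y)
```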

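Covariance Matrix

# Define a multivariate data set with one column per variable
# (filled column by column: column 1 is 1..5, column 2 is 2, 3, 5, 7, 6)
data <- matrix(c(1, 2, 3, 4, 5, 2, 3, 5, 7, 6), nrow = 5)

# Compute the covariance matrix
covariance_matrix <- cov(data)

# Print the result
print(covariance_matrix)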

Variation Ratio: proportion of cases different to the mode

variationRatio <- function(x) {
  freq <- table(x)                   #tabulate the frequencies 
  maxfreq <- max(freq)               #record maximum freq
  vr <- 1 - maxfreq / sum(freq)
  vr                                 #return result
}
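
A usage sketch follows. The function is restated compactly so that the snippet runs on its own, and the data are the values from the mode example:

```r
variationRatio <- function(x) {
  freq <- table(x)             # frequency of each distinct value
  1 - max(freq) / sum(freq)    # share of cases that are not the mode
}

data <- c(2, 19, 44, 44, 44, 51, 56, 78, 86, 99, 99)
variationRatio(data)   # mode 44 occurs 3 times out of 11, so 1 - 3/11, about 0.727
```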

Find high outliers and low outliers

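# Generate some random data
data <- rnorm(100)

# Compute the lower quartile (Q1) and the upper quartile (Q3)
Q1 <- quantile(data, 0.25)
Q3 <- quantile(data, 0.75)

# Compute the inter-quartile range (named iqr so it does not mask the IQR() function)
iqr <- Q3 - Q1

# Define the thresholds for high and low outliers
high_threshold <- Q3 + 1.5 * iqr
low_threshold <- Q1 - 1.5 * iqr

# Find the high and low outliers
high_outliers <- data[data > high_threshold]
low_outliers <- data[data < low_threshold]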

An example function to find which side the outliers fall on

find_outlier_position <- function(min_val, q1, median_val, q3, max_val) {
  # Compute the IQR (inter-quartile range) of the box plot
  iqr <- q3 - q1
  
  # Compute the lower and upper bounds beyond which values count as outliers
  lower_bound <- q1 - 1.5 * iqr
  upper_bound <- q3 + 1.5 * iqr
  
  # Check whether the minimum and maximum fall outside those bounds
  is_lower_outlier <- min_val < lower_bound
  is_upper_outlier <- max_val > upper_bound
  
  # Return the result according to where the outliers are
  if (is_lower_outlier && is_upper_outlier) {
    return("Both Sides")  # outliers on both sides
  } else if (is_lower_outlier) {
    return("Lower Side")  # outliers at the low end
  } else if (is_upper_outlier) {
    return("Upper Side")  # outliers at the high end
  } else {
    return("No Outlier")  # no outliers
  }
}

Some Models

ARIMA Model

ARIMA (AutoRegressive Integrated Moving Average) is a generalization of an autoregressive moving average (ARMA) model. Both of these models are fitted to time series data either to better understand the data or to predict future points in the series (forecasting).
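
A minimal sketch of fitting and forecasting with base R's arima(), using lh, a short time series that ships with R. The order argument is c(p, d, q): the number of autoregressive terms, the degree of differencing, and the number of moving-average terms. The AR(1) order here is chosen only for illustration, not as a recommended model.

```r
# Fit an AR(1) model: p = 1, d = 0, q = 0
fit <- arima(lh, order = c(1, 0, 0))
fit

# Forecast the next 5 points
forecast <- predict(fit, n.ahead = 5)
forecast$pred   # point forecasts
forecast$se     # standard errors of the forecasts
```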

Basic R

Shortcuts

option + - : <-
ctrl + shift + m : %>% (or |> when the native pipe option is enabled in RStudio)

Tidyverse - dplyr

Rows

filter() - to select rows based on some conditions

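# flights comes from the nycflights13 package
library(dplyr)
library(nycflights13)

# Operators for conditions on columns/variables: < > <= >= == != %in% & | !
flights |>
  filter(month == 1 & day == 1)
# %in% is a shortcut for combining == with |
flights |>
  filter(month %in% c(1, 2))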

arrange() - to reorder rows based on some variable

# Works like ORDER BY in SQL
flights |>
  arrange(year, month, day, desc(dep_time))

distinct() - to select unique rows

flights |> 
  distinct(origin, dest)
flights |> 
  distinct(origin, dest, .keep_all = TRUE)
flights |>
  count(origin, dest, sort = TRUE)

mutate() - to add new variables

flights |> 
  mutate(speed = distance / air_time * 60)
  # optional arguments to mutate(): .before = 1, .after = day, .keep = "used"

select() - to select columns

flights |>
  select(year:day, dep_delay, arr_delay) # select columns from year to day, plus two more
flights |>
  select(starts_with("arr"), ends_with("time")) # names starting with "arr" or ending with "time"
flights |>
  select(contains("arr")) # select columns with "arr" anywhere in the name
flights |>
  select(where(is.character)) # select columns of character type
flights |>
  select(-contains("arr")) # drop columns with "arr" in the name

relocate() - to move columns

flights |>
    relocate(dep_delay, .after = day)

rename() - to rename columns

flights |>
    rename(departure_time = dep_time) # new_name = old_name; the new name must not already exist

group_by() and summarize() - to group rows

flights |> 
  group_by(month) |> # can be multiple variables
  summarize(
    avg_delay = mean(dep_delay, na.rm = TRUE),
    n = n() # which returns the number of rows in each group
  )

slice() - to select rows by their positions

df |> slice_head(n = 1) takes the first row from each group.
df |> slice_tail(n = 1) takes the last row in each group.
df |> slice_min(x, n = 1) takes the row with the smallest value of column x.
df |> slice_max(x, n = 1) takes the row with the largest value of column x.
df |> slice_sample(n = 1) takes one random row.

ungroup() - to remove grouping

flights |> 
  group_by(year, month, day) |> 
  ungroup()


near(x, c(1, 2)) # compares floating-point numbers within a small tolerance (safer than ==)
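
A brief sketch of why near() is useful: exact comparison with == can fail because of floating-point rounding, while near() allows a small tolerance. The values are chosen only for illustration.

```r
library(dplyr)

sqrt(2)^2 == 2        # FALSE because of floating-point rounding
near(sqrt(2)^2, 2)    # TRUE: equal within a small tolerance

x <- c(0.1 + 0.2, 1)
x == c(0.3, 1)        # FALSE  TRUE
near(x, c(0.3, 1))    # TRUE   TRUE
```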

is.na() - works with any type of vector and returns TRUE for missing values and FALSE for everything else:

flights |> 
  filter(is.na(dep_time))

if_else() - a vectorized if() function that is useful when working with data frames:

flights |> 
  mutate(dep_type = if_else(dep_time < 1200, "morning", "afternoon"))

case_when() - a vectorized generalization of if_else() for several conditions; conditions are checked in order and the first one that is TRUE determines the result:

flights |> 
  mutate(dep_type = case_when(
    dep_time < 600 ~ "early",
    dep_time < 1200 ~ "morning",
    dep_time < 1800 ~ "afternoon",
    TRUE ~ "evening"
  ))

Confidence Interval

The examples here use the wagon dataset (its price column) to calculate confidence intervals.

Some functions

get_mean_se <- function(x, repeats = 1000, size = 10, replace = FALSE){
  # estimate the standard error of the sample mean by repeated sampling
  # x: variable of interest
  # repeats: number of sampling iterations
  # size: sample size
  # replace: whether to sample with replacement
  se <- sd(replicate(repeats, mean(sample(x, size, replace = replace))))
  return(se)
}

get_prop_se <- function(x, c, repeats = 1000, size = 10, replace = FALSE){
  # estimate the standard error of a sample proportion by repeated sampling
  # x: variable of interest
  # c: category of interest
  # repeats: number of sampling iterations
  # size: sample size
  # replace: whether to sample with replacement
  se <- sd(replicate(repeats, sum(sample(x, size, replace = replace) == c)/size))
  return(se)
}

An example

# get the mean of price
xbar <- mean(wagon$price)

# using bootstrap to get the standard error
se <- get_mean_se(x = wagon$price, size = length(wagon$price), replace = TRUE)

ci <- c(xbar-2*se, xbar+2*se) # Using vector in R to mimic tuple in Python
sprintf("We are 95%% confident that the true price of this certain type of used sports wagon is between $%.0f and $%.0f.", ci[1], ci[2])

Using Percentile Method to calculate the confidence interval

# bootstrap distribution
boot.wagon <- replicate(1000, mean(sample(wagon$price, 67, replace = TRUE)))


data.frame(boot.wagon)
hist(boot.wagon)
# 95% confidence interval is from 2.5th percentile to 97.5th percentile
quantile(x = boot.wagon, probs = c(0.025, 0.975))
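
Since the wagon dataset is not included in these notes, the following self-contained sketch uses simulated prices as a hypothetical stand-in (67 values, matching the sample size used above) and compares the percentile interval with the mean plus or minus 2 standard errors:

```r
set.seed(2024)
# Hypothetical stand-in for wagon$price: 67 simulated prices
price <- round(rnorm(67, mean = 15000, sd = 3000))

# Bootstrap distribution of the sample mean
boot_means <- replicate(1000, mean(sample(price, length(price), replace = TRUE)))

# Percentile method: 2.5th to 97.5th percentile of the bootstrap means
ci_percentile <- quantile(boot_means, probs = c(0.025, 0.975))

# Standard-error method: mean plus or minus 2 bootstrap standard errors
ci_2se <- mean(price) + c(-2, 2) * sd(boot_means)

ci_percentile
ci_2se
```

When the bootstrap distribution is roughly symmetric and bell-shaped, the two intervals should be close to each other.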