[베이지안] A unified Bayesian inference on treatment means with order constraints

Notice

Recent Posts

Recent Comments

Link

« 2025/07 »
일	월	화	수	목	금	토
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	31

Tags more

Archives

Today

Total

관리 메뉴

수리통계 분석 코딩 실습

[베이지안] A unified Bayesian inference on treatment means with order constraints 본문

통계 분석/통계 실습

[베이지안] A unified Bayesian inference on treatment means with order constraints

얼려먹는 요구르트 2024. 3. 27. 19:37

✔ 모수 차이를 검증하는 A unified Bayesian inference on treatment means with order constraints

논문을 알아보자!

[process]

[Model setting] - [Prior-Posterior] - [Code]

[1] Model Setting

Order restrictions

1) simple increasing order

$\mu_{1} \le \cdots \mu_{k-1} \le \mu_{k}$

2) simple tree order

$\mu_{1} \ge \mu_{j} , j=2 ,\cdots, k-1$

3) umbrella order: g를 기준으로 양쪽으로 값이 줄어드는 경우

$\mu_{1} \le \cdots \mu_{g} \ge \mu_{g+1} \cdots \mu_{k}$

라고 할 수 있다. 본 논문은 1)과 같이 단조 증가 값의 $\mu$들이 주어졌을 때,

이에 대한 모수 추정 모델링 과정을 담고 있다.

1)의 simple order의 경우

$\delta = (\delta_{1}, \cdots, \delta_{k-1})$로 연속적인 값의 diff로 $\delta$를 정의한다.

즉, $\delta_{i} = \mu_{i+1} - \mu_{i}, \quad i = 1, \cdots, k-1$

→ $\mu_{i} = \mu_{1} + \delta_{1} + \cdots + \delta_{i-1}$ for $i = 2,\cdots, k-1$

그러므로, $\mu_{i}$ ↔ $\delta_{i} \ge 0, i = 1, \cdots, k-1$의 제한 조건을 갖게된다.

이때, $\theta = c( \theta_{1}, \cdots, \theta_{k} ) = (\mu_{1}, \cdots, \delta_{1}, \cdots, \delta_{k-1} )$ 라고 두자.

그럼, 추정하고자 하는 모수 $\theta ↔ \mu$ 는 회귀분석 모델링으로 추정 가능해진다.

즉 Model- setting이,

[ANOVA Model]

$y_{ij} = \mu_{i} + \epsilon_{ij}, i = 1,\cdots, k, j= 1,\cdots, n_{i}$ $\epsilon_{ij} \sim N(0, \sigma^2)$

with order constraints (monotone increasing)

$\mu_{1} \ge \mu_{2} \cdots \ge \mu_{k}$

그냥 $\mu$로 두면 회귀분석 세팅 하에 모수 추정이 불가 해지므로

(※ 회귀분석 세팅이 이점인 이유는 normal가정으로 모수 분포를 가정해서 추정할 수 있음!)

$\theta$로 바꾸어서 회귀분석 세팅으로 바꿔준다.

Transform

$\theta_{1} = \mu_{1}$

$\theta_{i} = \mu_{i} - \mu_{i-1} , i \le 2$

그러므로, $y_{ij}$에 대한 추정 값은

$y_{ij} = \theta_{1} + \theta_{2} \cdots + \theta_{k} + \epsilon_{ij} $

↔ $Y = X\theta + \epsilon$

으로 회귀분석 모형으로 바꿀 수 있다.

이때, $Y = (y_{11},\cdots, y_{1n_{1}}, \cdots, y_{k1},\cdots, y_{kn_{k}})$

$ \epsilon = ( \epsilon_{11}, \cdots, \epsilon_{1n_{1}}, \cdots, \epsilon_{k1}, \cdots, \epsilon_{kn_{k}}) $

의 order에 대한 정보를 $X$ matrix에 담아 정의한다.

$\theta_{i} \ge 0, i = 2, \cdots, k$

라고 둘 수 있다.

$$Y \sim N(X\theta, \sigma^2 I) I(\theta_{i} \ge 0 , i = 2, \cdots, k)$$

의 가정하에 모수($\theta, \sigma^2)$를 추정한다. 근데, 알고자하는 $\theta$값을 생각하면, $theta$의 $\sigma$, 즉, $\tau$값을 같이 추정한다.

즉 정리하면

$\mu_{i}$ ↔ $\theta_{i}$
$\sigma$ : $\epsilon$의 분산값
$\tau_{i}$: $\theta_{i}$의 분산값

($\theta_{i}, \tau_{i})$, $\sigma^2$

을 구하고자 한다!

이때, ($\theta_{i}, \tau_{i}) \perp \sigma^2$ 이다!

[2] Prior - Posterior

[Prior]

모수값: $\theta$

$\theta_{1} = \mu_{1} \sim N(\mu_{10}, \sigma^2_{10}), \; \mu_{10} = \bar{y_{1}} = mean(y_{1j})$

$\sigma^2_{10} = c \cdot var(y_{1j}) = c \cdot s_{1}^2, \quad c:constant$

$\theta_{i} \, \sim N(0, \tau_{i}^2), \quad i = 2, \cdots, k$

↔ $N(\mu_{0},\Sigma_{0}), \, \mu_{0} = c(\bar{y_{1}}, 0, \cdots, 0), \, \Sigma_{0} = diag(c \cdot s^2_{1}, \tau_{2}^2, \cdots, \tau_{k}^2 )$

$\tau_{i}^2 \, \sim IG(a_{0}, b_{0})$

$\sigma_{i}^2 \, \sim IG(a, b)$

▪ hyper parameter는 어떻게 고를 까?

a = # of samples in prior information

1개의 샘플 만큼의 영향력을 준다라고 생각하면, posterior에 근거해서 a의 값을 지정할 수 있다.

즉, $\frac{1}{2}$이 posterior의 값으로 a에 대해 더해지는 값으로 있으므로, a의 초기값은 $\frac{1}{2}$로 둘 수 있다.

b는 IG의 mode의 식을 이용해서 추정하는데 a,b는 $\sigma$에서 추정하는 IG의 모수값이므로,

b는 mode식인 $\frac{b}{a+1} = \hat{\sigma}^2$를 이용해서 쓴다. (i.e $b = \hat{\sigma}^2 \cdot (a+1)$)

마찬가지로, $a_{0}$도 unit information prior인 샘플 1개에 대응된다고 생각해서

$a_{0} = \frac{1}{2}$로 두고,

$b_{0}$는 IG에서 추정하는 값이므로 마찬가지로 mode를 $\tau$에 대응되는 값으로 추정하면 되는데,

$b_{0} =\frac{ \sum \limits_{i=2}^k \hat{\tau_i}^2}{k-1} \cdot (a_{0}+1)$

로 둘 수 있다.

각각의 hat값은 아래의 Note를 이용해서 구한다!

Note
$ \hat{\tau_i}^2 = \widehat{Var}(\bar{y_{i}} - \bar{y_{i-1}}) = \widehat{Var}(\bar{y_{i}}) + \widehat{Var}(\bar{y_{i-1}})
= \frac{s_{i}^2}{n_{i}} + \frac{s_{i-1}^2}{n_{i-1}}$
$\hat{\sigma}^2$ = common variance = $ \frac{\sum \limits_{i=1}^k s_i^2}{k}$

[Posterior]

모수 추정은 아래와 같은 세팅아래 깁스 샘플링을 진행하면 된다!

$$\begin{align} l(\theta, \sigma^2 | Y) \pi(\theta | \tau^2) \pi(\sigma^2) \pi(\tau^2) \; & \propto \; (\sigma^2)^{- \frac{1}{2} \sum \limits_{i=1}^k n_{i}} \cdot exp\left([-\frac{1}{2\sigma^2}]\right)(Y-X\theta)^{'}(Y-X\theta) \\ & \times exp[-\frac{1}{2}(\theta - \mu_{0})\Sigma_{0}^{-1}(\theta - \mu_{0})] \prod_{i=2}^k (\tau_{i}^2)^(\frac{1}{2}) \\ & \times (\sigma^2)^{-a-1}exp[-\frac{b}{a}] \prod \limits_{i=2}^k (\tau_{i}^2)^{-a_{0}-1}exp[-\frac{b_{0}}{\tau_{i}^2}] \\ & \times I(\theta_{i} \le 0, \, i = 2, \cdots, k) \end{align}$$

Full conditional posterior

$$ \theta | others \; \sim \; N(\mu^{\pi}, \Sigma^{\pi})I(\theta_{i} \le 0, \; i = 2, \cdots, k) $$

$$ \sigma^2 | others \; \sim \; IG(a + \frac{1}{2}\sum \limits_{i=1}^k n_{i}, b + \frac{1}{2}(Y-X\theta)^{'}(Y - X\theta) $$

$$ \tau_{i}^2 | others \; \sim \; IG(a_{0} + \frac{1}{2}, b_{0} +\frac{1}{2} \theta_{i}^2) , \; i = 2, \cdots, k $$

where

$$ \Sigma^{ \pi} = ( \frac{1}{\sigma^2} \cdot X^{'}X + \Sigma_{0}^{-1} )^{-1} $$

$$ \mu^{\pi} = \Sigma^{\pi} (\frac{1}{\sigma^2}X^{'}Y + \Sigma_{0}^{-1} \mu_{0} )$$

초기값과 하이퍼파라미터만 잘 설정해서!

모델의 추정 값은 아래와 같이 하면 된다.

[초기값, 하이퍼 파라미터, 데이터로 주는 값 구분하기]

빨간색 - 초기값으로 주는 값

초록색 - 데이터로 계산해서 주는 값

파란색 - 하이퍼 파라미터

[3] Code

[setting] - [sampling sudo-code] - [code] - [분석]

1) setting

Rumas_bone의 data를 이용해 $\mu$의 order increasing 모수 추정 값을 구해보자

그렇다면, $Y, X, \theta$ 는 어떻게 정의 해야할까?

$Y$의 정의를 보고 살펴보자. 데이터는 아래와 같은 의미를 갖는다.

즉, $mu_{i}$들은 각 열의 값으로, 이 값들이 추정하고자 하는 값이고, 이때, $n_{k}$는 데이터의 각 열의 Nan값이 없는 값들의 개수를 말한다. 그러므로 X는

각 열들의 Nan이 없는 개수를 세서 그 개수만큼 1과 0을 rep한 뒤에 쌓아주면 된다.

여기선, 추정 모수의 수(= 열의수)가 4개 이므로, k = 4이고,

4개의 묶음이 1과 0의 값으로 y의 Nan값에 따라 쌓일 것이다.

각 열의 값이 지금은 모두 full인 상황이므로,

$n_{i}$들은 데이터 행의 수랑 같을 텐데,

$n_{1} = n_{2} = n_{3} = n_{4} = dim(data)[1]$와 같이 서술할 수 있다.

그러므로 X는

각 20개씩 80x4의 dimension을 갖는 행렬이 된다.

다른 하이퍼파라미터, 초기값, 모델은 코드를 통해 살펴보자.

샘플링 과정을 다음과 같은 순서로 살펴볼 수 있다.

2) sampling sudo-code

Sketch of the code

1. read data

2. prepare X according to the constraints

3. set hyper-parameters

4. set initials, nwarm, N

5. repeat iter = 1, N

1) compute $\mu^{\pi}, \Sigma^{\pi}$ given $\sigma^{2}, \tau^{2}$

2) repeat i = 1,...,k

a) compute the full conditional posterior mean and sd of $\theta_{i}$

b) generate a sample of $\theta_{i}$ from its full conditional posterior dist

3) repeat i = 2,...,k

generate a sample of $\tau_{i}$ from its full conditional posterior dist

4) generate a sample of $\sigma^2$ from its full conditional posterior dist

5) save the samples if iter > nwarm

6. convergence diagnostics: ACF plot, tsplot

7. posterior inference using the samples

3) code

1. read data

# read data
setwd("C:/Users/none/Desktop/대학원/24-1/bayesian")
data <- read.csv('./data/Ramus_bone.csv')
data <- data[,-1] # id 제거

2. prepare X according to the constraints

k = dim(data)[2]; n_num <- colSums(!is.na(data[,1:k]))
n<- sum(n_num)

# make X
X <- matrix(0,n,k, byrow = TRUE)
dim(X) # 80 x 4
X[1:n_num[1],1] <- 1
for(j in 1:(k-1) ){
  X[ ((sum(n_num[1:j])+1) : sum(n_num[1:(j+1)])), (1:(j+1)) ]  <- 1
}

3. set hyper-parameters

# 3. set hyper-parameters
s.sq <- apply(data,2,var)
sigma.sq.hat <- sum(s.sq)/k
tau.sq <- rep(NaN, (k-1) )
for(j in 2:k){
  tau.sq[j-1] <- (s.sq[j-1]/n_num[j-1] + s.sq[j]/n_num[j])
}

Y <- as.vector(data)

a <- 1/2; b <- sigma.sq.hat * (a+1)
a0 <- 1/2; b0 <- sum(tau.sq)/(k-1) * (a0 +1)
c<-10; 
mu0 <- c(mean(Y[1:20]),rep(0,k-1))
Sig0 <- diag(c(c*s.sq[1],tau.sq))

XtX <- t(X)%*%X
XtY <- t(X)%*%Y

4. set initials, nwarm, N

# 4. set initial, nwarm, N
N=100000
nwarm =5000

theta.samples <- matrix(0,N,k, byrow = TRUE)
sigsq.samples <- rep(0,N)
tausq.samples <- matrix(0, N, k-1, byrow = TRUE)

mu.hat <- apply(data,2,mean)
theta <-c(mu.hat[1],diff(mu.hat))
sigsq <- sigma.sq.hat
tausq <- tau.sq

5. repeat iter = 1, N

1) compute $\mu^{\pi}, \Sigma^{\pi}$ given $\sigma^{2}, \tau^{2}$

2) repeat i = 1,...,k

a) compute the full conditional posterior mean and sd of $\theta_{i}$

b) generate a sample of $\theta_{i}$ from its full conditional posterior dist

3) repeat i = 2,...,k

generate a sample of $\tau_{i}$ from its full conditional posterior dist

4) generate a sample of $\sigma^2$ from its full conditional posterior dist

5) save the samples if iter > nwarm

# sampling
for(iter in (1:(N+nwarm))){

  Sigpi <- solve(1/sigsq * XtX + solve(Sig0))
  Mupi <- Sigpi %*% (1/sigsq * XtY + solve(Sig0)%*%mu0)
  
  for(i in (1:k)){
    
    cond.mean <- cmnorm(mean = Mupi, sigma = Sigpi, given_ind = c(1:k)[-i],
                        given_x = theta[-i])$mean
    cond.sd <- cmnorm(mean = Mupi, sigma = Sigpi, given_ind = c(1:k)[-i],
                        given_x = theta[-i])$sigma
    
    if(i==1){ theta[i] <- rtruncnorm(1, a = -Inf, b = Inf, cond.mean, cond.sd)}
    
    else{ theta[i] <- rtruncnorm(1,a=0, cond.mean, cond.sd)
          tausq[i-1] <- 1/rgamma(1, a0 + 1/2, b0 + 1/2*(theta[i]^2)) }
  }

  Xtheta <- X%*%theta
  sigsq<- 1/rgamma(1, a+1/2*n, b+1/2*t(Y-Xtheta)%*%(Y-Xtheta))
  

  if(iter > nwarm){
    theta.samples[iter-nwarm,] <- theta
    tausq.samples[iter-nwarm,] <- tausq
    sigsq.samples[iter-nwarm ] <- sigsq
  }
}

4) 분석

[1] $\theta_{1}$의 범위는 제한이 없지만 $\theta_{2},\theta_{3}, \theta_{4}$의 경우 0보다 크거나 같다는 제한 조건이 있어, 완전히 noraml을 따르진 않는다.

$\tau^2$, $\sigma^2$

$\theta$의 분산 분포($\tau$)가 상당히 치우쳐져있는 것으로 보인다.

[2] overlay of $\mu_i$ s

$\theta$ vec를 기준으로 $ \mu_{i}$들은 $i \le 2$에 대해 각각 diff의 sum으로 표기되므로, $\mu$ 샘플은 $\theta$ 샘플을 이용해 더해준다. 그럼 $\theta$의 분포가 $\theta_{1} = \mu_{1}$로 두고, $ \theta_{i}, i \le 2$ 부턴 $ \mu_{i} - \mu_{i-1}$값으로 지정해두고 각각 $ \mu_{i}$ 들의 값차이가 많이 크지 않으므로 0에 몰려있는 분포와 50( $ \hat{\mu_{1}}$ )의 값에 mean값을 갖는 분포로 형성되어 있다. 따라서 이 샘플로 그려진 $ \mu$의 샘플은 오른쪽처럼 쌍봉의 형태로 합쳐져서 그려지는 것 같다.

[3] posterior mean and sd

$\tau^2$ 값이 상당히 크게 추정된 것으로 보인다

apply(theta.samples,2,mean) # [1] 49.6190145  0.2297443  0.3650927  0.3983493
apply(theta.samples,2,sd)   # [1] 0.1757398 0.1529326 0.2343158 0.2528525


apply(tausq.samples,2,mean) # [1] 12.57043 17.27771 14.31146
apply(tausq.samples,2,sd)   # [1]  500.3993 1121.9690  723.0620

mean(sigsq.samples)       # [1] 7.306086
sqrt(var(sigsq.samples))  # [1] 1.187315

[4] 5%,97.5% quantile

round(apply(theta.samples,2,quantile, probs = c(0.025,0.975)),4)
round(apply(tausq.samples,2,quantile, probs = c(0.025,0.975)),4)
round(apply(sigsq.samples,2,quantile, probs = c(0.025,0.975)),4)