Construction of Empirical Models (20-25%)
1. Estimate failure time and loss distributions using:
a) Kaplan-Meier estimator, including approximations for large data sets
2 concrete situations: 
Insurance:  we observe the loss amount per policy.  A loss is left truncated when it is below the deductible (it is never reported) and right censored when it exceeds the policy limit (only the limit is recorded).
Mortality table:  we observe the age at death for each person.  Left truncation happens at the age at which the person is first observed; right censoring happens at the age of last observation if the person is still alive then.
Symbols:  For each observation {$i$},
{$d_i$} = truncation point for that observation (0 if no truncation);
{$x_i$} = the observed value, if the observation wasn't censored;
{$u_i$} = the censoring point, if the observation was censored (so each observation has either an {$x_i$} or a {$u_i$}, not both).
Then group and relabel the distinct {$x_i$}'s as {$y_1 < y_2 < \ldots < y_k$}, with {$y_j$} occurring {$s_j$} times.  Divide up the data according to the {$y_j$}'s by defining the risk set to be
{$r_j = \left(\text{#} x_i \text{ and } u_i \geq y_j \right) - \left(\text{#} d_i \geq y_j \right)$}
which is the same as
{$r_j = \left(\text{#} d_i < y_j \right) - \left(\text{#} x_i \text{ and } u_i < y_j \right) $}
So in our situations, the risk set counts
the difference between a) # of policies with observed or censored loss amount >= {$y_j$} and b) # of policies with deductible >= {$y_j$};
the difference between a) # of people entering the study before the age of {$y_j$} and b) # of people who died or left the study before the age of {$y_j$} (i.e., the # of people being observed alive just before age {$y_j$}).
Recursively, given {$r_{j-1}$} (with {$r_0 = 0$}),
{$r_j = r_{j-1} + \left(\text{#} d_i \in [y_{j-1}, y_j) \right)  - \left(\text{#} x_i =y_{j-1}\right) - \left(\text{#} u_i \in [y_{j-1}, y_j) \right) $}
And so we define the Kaplan-Meier product-limit estimator as follows
{$S(t)=1$}, for {$t\in[0, y_1)$},
{$S(t)=\frac{r_1-s_1}{r_1}$} (probability of surviving past {$y_1$}), for {$t\in[y_1, y_2)$},
{$S(t)=S(y_1)\frac{r_2-s_2}{r_2}$} (probability of surviving past {$y_2$}), for {$t\in[y_2, y_3)$},...
{$S(t)=\prod_{i=1}^k\frac{r_i-s_i}{r_i}$} (probability of surviving past {$y_k$}), for {$t\in[y_k,\infty)$}.
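The data does not determine the estimate beyond the largest observation, so instead of holding the last value constant we can define the last line to be 0, or something in between that decays exponentially to 0 as {$t\rightarrow\infty$}.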
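The risk-set count and the product-limit formula above can be sketched in a few lines of Python. This is only an illustration: the function name `kaplan_meier`, the record layout `(d, value, censored)` and the data are assumptions made for the example, and every truncation point is assumed to lie strictly below the corresponding value.

```python
from fractions import Fraction

def kaplan_meier(records):
    """Kaplan-Meier steps from (d, value, censored) records.

    d        -- truncation point of the observation (0 if none)
    value    -- x_i if censored is False, u_i if censored is True
    censored -- True when the value is a censoring point
    Assumes d < value for every record (an assumption of this sketch).
    Returns a list of (y_j, r_j, s_j, S(y_j)).
    """
    # Distinct uncensored values y_1 < y_2 < ... < y_k.
    ys = sorted({value for _, value, censored in records if not censored})
    survival = Fraction(1)  # S(t) = 1 before the first y_j
    steps = []
    for y in ys:
        # s_j: number of uncensored observations equal to y_j.
        s = sum(1 for _, value, censored in records
                if not censored and value == y)
        # r_j: entered before y_j, and neither observed nor censored before y_j.
        r = sum(1 for d, value, _ in records if d < y <= value)
        survival *= Fraction(r - s, r)
        steps.append((y, r, s, survival))
    return steps

# Hypothetical data, made up for illustration only.
example = [(0, 2, False), (0, 3, True), (1, 3, False),
           (0, 5, False), (2, 5, False), (0, 6, True)]
for y, r, s, surv in kaplan_meier(example):
    print(y, r, s, surv)
```

Exact fractions are used so that each step can be checked by hand against {$\frac{r_j-s_j}{r_j}$}.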
Large Data Sets:  Sometimes the data set is so large that it's not reasonable to keep track of all the {$y_j$}'s.  So instead we work with intervals of interest:  let {$c_0 < c_1 < \ldots < c_k$} be the boundaries of such intervals.  
Example:  Mortality study on a large population; {$c_j$}'s can be integer ages.
For {$i = 1, \ldots, k,$} let
{$d_i$} = # of truncated observations where the truncation point is in {$ [c_{i-1}, c_{i} ) $};
{$x_i$} = # of uncensored observations in {$(c_{i-1}, c_{i}]$};
{$u_i$} = # of censored observations with values in {$(c_{i-1}, c_{i}]$};
(Note indexing here is a little different from the book, in order to match the rest of the chapter.)  
Two approaches:
I.  All truncation occurs at the beginning of the interval, and all censoring occurs at the end of the interval.  
(For example, all lives enter and leave study using their insuring age.)
Define the risk set to be
{$r_1 = d_1$}, 
{$r_2 = r_1 - (x_1+u_1) + d_2$}, ...,
{$r_i = r_{i-1} - (x_{i-1}+u_{i-1}) + d_i$}, ...,
II.  Truncation and censoring occur uniformly throughout the interval, and the uncensored observations occur in the middle of the interval.
(For example, more detailed data gathering would invariably give something closer to this scenario than the former.)
Define the risk set to be
{$r_1 = (d_1-u_1)/2$}, 
{$r_2 = d_1 - (x_1+u_1) + (d_2-u_2)/2$}, ...,
{$r_i =\sum_{l=1}^{i-1} (d_{l} - (x_{l}+u_{l})) + (d_i-u_i)/2$}, ...,
Define a Kaplan-Meier type survival function on the boundary points by
{$S(c_1)= 1-\frac{x_1}{r_1}$},
{$S(c_2)= S(c_1)\left(1-\frac{x_2}{r_2} \right)$},
{$S(c_j)= \prod_{i=1}^j \left(1-\frac{x_i}{r_i}\right)$}.
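It follows that the probability that someone who is alive at {$c_j$} does not survive past {$c_{j+1}$} is
{$\displaystyle{q_j = \frac{S(c_j)-S(c_{j+1})}{S(c_j)} = \frac{x_{j+1}}{r_{j+1}} }$}, where {$x_{j+1}$} and {$r_{j+1}$} are the count and risk set for the interval {$(c_j, c_{j+1}]$}.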
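The two risk-set conventions and the boundary survival values can be compared side by side with a short sketch. The function names and the interval counts below are hypothetical; the lists `d`, `x`, `u` hold the per-interval counts in the indexing used in these notes (entry `i` belongs to the interval ending at {$c_{i+1}$} because Python lists start at 0).

```python
def risk_sets_endpoints(d, x, u):
    """Approach I: entrants arrive at the start of each interval,
    censoring happens at the end of each interval."""
    risk, carried = [], 0   # carried = lives still under observation
    for di, xi, ui in zip(d, x, u):
        risk.append(carried + di)
        carried += di - xi - ui
    return risk

def risk_sets_uniform(d, x, u):
    """Approach II: entrants and censoring are spread uniformly over
    each interval, so half of (d_i - u_i) is credited to the interval."""
    risk, carried = [], 0
    for di, xi, ui in zip(d, x, u):
        risk.append(carried + (di - ui) / 2)
        carried += di - xi - ui
    return risk

def survival_at_boundaries(x, risk):
    """S(c_1), S(c_2), ... as running products of (1 - x_i / r_i)."""
    values, s = [], 1.0
    for xi, ri in zip(x, risk):
        s *= 1 - xi / ri
        values.append(s)
    return values

# Hypothetical counts for three intervals, made up for illustration.
d = [100, 20, 10]   # entrants per interval
x = [5, 4, 6]       # uncensored observations (e.g. deaths)
u = [10, 8, 12]     # censored observations (e.g. withdrawals)

r_one = risk_sets_endpoints(d, x, u)
r_two = risk_sets_uniform(d, x, u)
print(r_one, survival_at_boundaries(x, r_one))
print(r_two, survival_at_boundaries(x, r_two))
```

For the single-decrement probabilities discussed next, the same functions can be reused by swapping which counts are passed as `x` and which as `u`.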
Moreover, when the uncensored observations {$x_j$} can be split into more refined categories, we can calculate single-decrement probabilities.  For example, in a mortality study someone can either withdraw or die in a time interval.  We can then calculate
single-decrement mortality probabilities {$q_{j}^{'(d)}$}, by treating the death counts as uncensored observations and withdrawals as censored observations;
single-decrement withdrawal probabilities {$q_{j}^{'(w)}$}, by treating the withdrawal counts as uncensored observations and deaths as censored observations.
b) Nelson-Aalen estimator
This estimates the cumulative hazard rate function as follows:
{$\hat{H}(t)=0$}, for {$t\in[0, y_1)$},
{$\hat{H}(t)=\frac{s_1}{r_1}$}, for {$t\in[y_1, y_2)$},
{$\hat{H}(t)=\hat{H}(y_1) + \frac{s_2}{r_2}$}, for {$t\in[y_2, y_3)$},...
{$\hat{H}(t)=\sum_{i=1}^k\frac{s_i}{r_i}$}, for {$t\in[y_k,\infty)$}.
Then {$\hat{S}(t) =  e^{-\hat{H}(t)}$}.
For the end piece {$t\in[y_k,\infty)$},  we can define {$\hat{S}(t)$} to be 0, or something in between that decays exponentially to 0 as {$t\rightarrow\infty$}.
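As a rough illustration of the sums above, the following sketch accumulates {$\hat{H}$} and converts it to {$\hat{S}$}. The function name `nelson_aalen` and the table of {$(y_j, r_j, s_j)$} values are assumptions made for the example.

```python
import math

def nelson_aalen(risk_table):
    """risk_table: list of (y_j, r_j, s_j) in increasing order of y_j.
    Returns a list of (y_j, H_hat(y_j), S_hat(y_j))."""
    cumulative_hazard = 0.0
    steps = []
    for y, r, s in risk_table:
        cumulative_hazard += s / r                  # add s_j / r_j at each y_j
        steps.append((y, cumulative_hazard, math.exp(-cumulative_hazard)))
    return steps

# Hypothetical risk sets and death counts, made up for illustration.
table = [(2, 5, 1), (3, 5, 1), (5, 3, 2)]
for y, h, s in nelson_aalen(table):
    print(y, round(h, 4), round(s, 4))
```

Since {$e^{-a} \geq 1-a$} for every {$a$}, each factor {$e^{-s_j/r_j}$} is at least {$1 - s_j/r_j$}, so on the same risk table this survival estimate is never below the Kaplan-Meier product.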
c) Kernel density estimators
First we need to discuss empirical distributions:  
Suppose the data is not grouped and there is no censoring or truncation.  Out of a total of {$n$} observations, let {$y_1 < \ldots < y_k$} be the distinct data points, with {$y_i$} occurring {$s_i$} times.  Then {$S(t)=1-F(t),$} where
{$F(t)=0$}, for {$t\in(-\infty, y_1)$},
{$F(t)=\frac{s_1}{n}$}, for {$t\in[y_1, y_2)$},
{$F(t)=F(y_1) + \frac{s_2}{n}$}, for {$t\in[y_2, y_3)$},...
{$F(t)=\sum_{i=1}^k\frac{s_i}{n} = 1$}, for {$t\in[y_k,\infty)$}.
Note each {$y_i$} is assigned probability {${p(y_i)}= s_i/n$}.
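A small sketch of the empirical distribution, assuming complete ungrouped data; the function name `empirical_distribution` and the sample are made up for illustration.

```python
from collections import Counter
from fractions import Fraction

def empirical_distribution(data):
    """Return (y_i, p(y_i), F(y_i)) for each distinct data point y_i."""
    n = len(data)
    counts = Counter(data)          # s_i for each distinct value
    cumulative = Fraction(0)
    table = []
    for y in sorted(counts):
        p = Fraction(counts[y], n)  # p(y_i) = s_i / n
        cumulative += p             # F jumps by p(y_i) at y_i
        table.append((y, p, cumulative))
    return table

# Hypothetical sample, made up for illustration.
for row in empirical_distribution([1, 2, 2, 5]):
    print(row)
```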
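A kernel density estimator smooths the empirical distribution by spreading the probability at each {$y_i$} over a neighborhood of {$y_i$}.  The estimated density is {$\hat{f}(t) = \sum_{i=1}^k p(y_i) k_{y_i}(t)$}, and the corresponding distribution function is
{$\hat{F}(t) = \sum_{i=1}^k p(y_i) K_{y_i}(t)$},
where {$k_{y_i}(t)$} is a kernel pdf and {$K_{y_i}(t)$} is its cdf.  The kernel can be defined by a chosen pdf with a certain bandwidth {$b$}:
A uniform kernel is a rectangle centered at each {$y_i$}; its pdf has width {$2b$} and height {$1/(2b)$};
A triangular kernel is a triangle centered at each {$y_i$}; its pdf has width {$2b$} and height {$1/b$}.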
The book also mentions a gamma kernel, where the parameters are given by {$\alpha$} and {$y_i/\alpha$} (so each kernel has mean {$y_i$}).  For the 2 kernels with bandwidths, the larger the bandwidth is, the smoother the resulting kernel density estimator.  For gamma kernels, the smaller {$\alpha$} is, the smoother the resulting kernel density estimator.
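To make the uniform and triangular kernels concrete, here is a minimal sketch of the density version of the estimator, in which each data point contributes its kernel pdf with weight {$1/n$} (so repeated points automatically get weight {$s_i/n$}). The function names and the sample are hypothetical.

```python
def uniform_kernel_pdf(t, y, b):
    """Rectangle of width 2b and height 1/(2b) centered at y."""
    return 1 / (2 * b) if abs(t - y) <= b else 0.0

def triangular_kernel_pdf(t, y, b):
    """Triangle of width 2b and peak height 1/b centered at y."""
    distance = abs(t - y)
    return (b - distance) / b ** 2 if distance < b else 0.0

def kernel_density(t, data, b, kernel_pdf):
    """Estimated density at t: average of the kernel pdfs over the data."""
    return sum(kernel_pdf(t, y, b) for y in data) / len(data)

# Hypothetical sample and bandwidth, made up for illustration.
sample = [1, 2, 2, 5]
print(kernel_density(2, sample, 1, uniform_kernel_pdf))
print(kernel_density(2, sample, 1, triangular_kernel_pdf))
```

Re-running with a larger `b` spreads each point's probability further, which is the smoothing effect described above.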
For grouped data, i.e., if the data is recorded by ranges of observed values instead of single, precise values, we need to use ogives and histograms.  Instead of {$y_i$}, let {$c_0 < \ldots < c_k$} be the boundary values of the intervals of observations, and let {$s_1, \ldots, s_k$} be the number of observations in {$(c_0, c_1], \ldots, (c_{k-1}, c_k]$}, with {$n = s_1 + \ldots + s_k$}.  Then {$S(t)=1-F(t),$} where {$F(t)$} is the ogive,
{$F(c_0)=0$}, 
{$F(c_1)=\frac{s_1}{n}$},
{$F(c_2)=\frac{s_1+s_2}{n}$},...
{$F(c_k)=\sum_{i=1}^k\frac{s_i}{n} = 1$}, 
and the values of {$F$} between the {$c_i$}'s are connected linearly.  Take the derivative to get the histogram {$f(t)$}:
{$f(t)=0$}, for {$t\in(-\infty, c_0)$},
{$f(t)=\frac{s_1}{n(c_1-c_0)}$}, for {$t\in[c_0, c_1)$},
{$f(t)=\frac{s_2}{n(c_2-c_1)}$}, for {$t\in[c_1, c_2)$},...
{$f(t)=\frac{s_k}{n(c_k-c_{k-1})}$}, for {$t\in[c_{k-1},c_k)$},
{$f(t)=0$} (or undefined), for {$t\in[c_k,\infty)$}.
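The ogive and histogram can be evaluated with a short sketch like the one below. The function names, boundaries and counts are assumptions for illustration; `c` holds {$c_0, \ldots, c_k$} and `s` holds {$s_1, \ldots, s_k$} (so `s[0]` is {$s_1$}).

```python
from bisect import bisect_right

def ogive(t, c, s):
    """Piecewise-linear distribution function through the points (c_j, F(c_j))."""
    n = sum(s)
    if t <= c[0]:
        return 0.0
    if t >= c[-1]:
        return 1.0
    j = bisect_right(c, t)                  # c[j-1] <= t < c[j]
    below = sum(s[:j - 1])                  # observations in earlier intervals
    fraction = (t - c[j - 1]) / (c[j] - c[j - 1])
    return (below + fraction * s[j - 1]) / n

def histogram(t, c, s):
    """Derivative of the ogive: constant on each interval, 0 outside [c_0, c_k)."""
    n = sum(s)
    if t < c[0] or t >= c[-1]:
        return 0.0
    j = bisect_right(c, t)
    return s[j - 1] / (n * (c[j] - c[j - 1]))

# Hypothetical grouped data, made up for illustration.
boundaries = [0, 10, 25, 50]
counts = [30, 50, 20]
print(ogive(17.5, boundaries, counts), histogram(17.5, boundaries, counts))
```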
2. Estimate the variance of estimators and confidence intervals for failure time and loss distributions.
General setup:  Given a data set, name an estimator, use it to estimate some probabilistic quantity of the RV associated with the data set, then use some general theory to find or estimate the mean and variance of the estimator itself.
a.  Given a complete data set of {$n$} points, the empirical estimate of {$S(x)$} is {$S_n(x)= Y/n$},
where {$Y$} is the # of observations that are greater than {$x$}.  Then {$Y$} has a binomial distribution with parameters {$n$} and {$S(x)$}.  (Check:  view {$Y$} as counting the # of successes in {$n$} tosses of a biased coin, where success means the observation is greater than {$x$}.  The probability of that happening is precisely {$S(x)$}.)
{$$E[S_n(x)] = \frac{1}{n} E[Y] = \frac{1}{n} (n S(x)) = S(x); $$}
which shows {$S_n(x)$} is unbiased.
{$$Var[S_n(x)] = \frac{1}{n^2} Var[Y] = \frac{1}{n^2} (nS(x)(1-S(x))) = \frac{S(x)(1-S(x))}{n} \approx \frac{S_n(x)(1-S_n(x))}{n}; $$}
This is the same thing as
{$$\displaystyle{Var[S_n(x)] \approx \frac{ (\text{# of observations } > x)(\text{# of observations } <= x)}{n^3}}. $$}
Note that {$Var[S_n(x)] \rightarrow 0$} as {$n \rightarrow \infty$}, showing that {$S_n(x)$} is consistent.
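The count form of the variance estimate is easy to check numerically. The sketch below assumes a complete data set; the function name and the data are hypothetical.

```python
def empirical_survival_with_variance(data, x):
    """Return (S_n(x), estimated Var[S_n(x)]) for complete data."""
    n = len(data)
    above = sum(1 for value in data if value > x)   # Y = # of observations > x
    s_n = above / n
    variance = above * (n - above) / n ** 3          # Y (n - Y) / n^3
    return s_n, variance

# Hypothetical complete sample, made up for illustration.
losses = [1, 3, 4, 4, 7, 9, 10, 12, 15, 20]
print(empirical_survival_with_variance(losses, 5))
```

Here 6 of the 10 values exceed 5, so the estimate is 0.6 with estimated variance {$6 \cdot 4 / 10^3 = 0.024$}, which agrees with {$S_n(x)(1-S_n(x))/n$}.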
b.  Empirical estimate of (say) {$q_2 = \frac{S(2) - S(3)}{S(2)}$} is {$\hat{q_2}=\frac{S_n(2) - S_n(3)}{S_n(2)} = \frac{Y}{X}$}, where 
{$X$} = # of people alive at duration 2,
{$Y$} = # of deaths between duration 2 and 3.  
In general it's not possible to calculate {$E[\hat{q_2}]$} and {$Var[\hat{q_2}]$}, because {$X = 0$} has positive probability and would make the denominator zero; an alternative is to calculate the mean and variance conditional on, say, {$X=k$}.  Note that in this notation,
{$$\hat{q_2}=\frac{\text{# of deaths between 2 and 3}}{k}. $$}
Note also that, conditional on {$X = k$}, {$Y$} is Binomial with parameters {$k$} and {$(S(2)-S(3))/S(2)$}, so {$Y/k$} can be treated as a new empirical estimate based on {$k$} observations, and as in the case above, 
{$$E[\hat{q_2}| \text{# of people alive at }2 = k] = E[Y/k] = \frac{1}{k}\frac{k(S(2)-S(3))}{S(2)} =  \frac{S(2) - S(3)}{S(2)};$$}
showing that the estimate is unbiased.  
{$$Var[\hat{q_2} | \text{# of people alive at }2 = k] = Var[Y/k] \approx \frac{\text{(# of deaths between 2 and 3)(}k -\text{# of deaths between 2 and 3)}}{k^3}. $$}
c.  Suppose we have ogives and histograms (i.e., the data is grouped). Let {$x \in (c_{j-1}, c_j)$}.  Then we can define an estimator for {$S(x)$} by
{$$S_n(x) = 1 - \frac{1}{n} (Y + t(x)Z ),$$}
where 
- {$Y = \text{# of observations } <= c_{j-1},$}
 - {$Z = \text{# of observations } > c_{j-1} \text{ and } <= c_{j},$}
 - {$t(x) = \frac{x-c_{j-1}}{c_j-c_{j-1}}.$}
 
Then
{$$E(S_n(x))= 1- \frac{1}{n} (E(Y) + t(x) E(Z)) =  1- \frac{1}{n} \cdot \left[n (1-S(c_{j-1})) + n t(x) (S(c_{j-1})-S(c_j)) \right] = (1-t(x))S(c_{j-1})+t(x)S(c_j);$$}
This is the linear interpolation between {$S(c_{j-1})$} and {$S(c_j)$}, which equals {$S(x)$} only when {$S$} is linear on the interval, so {$S_n$} is biased in general (unlike the previous 2 examples). Also,
{$$Var(S_n(x)) = \frac{1}{n^2} \left[Var(Y) + t(x)^2Var(Z) + 2t(x) Cov(Y,Z)\right],$$}
where
{$Var(Y) =n S(c_{j-1})(1-S(c_{j-1})) ,$}
{$Var(Z) =n (S(c_{j-1})-S(c_j))(1-S(c_{j-1})+S(c_j)),$}
{$Cov(Y,Z) =E(YZ)-E(Y)E(Z) = - n (1-S(c_{j-1})) (S(c_{j-1})-S(c_j)).$}
(See discussion of the covariance of 2 binomial RVs here).
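In practice the unknown {$S(c_{j-1})$} and {$S(c_j)$} in these formulas are replaced by their empirical values, {$1 - Y/n$} and {$1 - (Y+Z)/n$}. The sketch below does that plug-in step; the function name, argument names and numbers are assumptions made for illustration.

```python
def grouped_survival_estimate(n, y, z, c_low, c_high, x):
    """Ogive-based estimate of S(x) for x in (c_low, c_high), with a
    plug-in variance estimate.

    n -- total number of observations
    y -- # of observations <= c_low
    z -- # of observations in (c_low, c_high]
    """
    t = (x - c_low) / (c_high - c_low)
    s_n = 1 - (y + t * z) / n
    # Plug-in versions of Var(Y), Var(Z), Cov(Y, Z):
    # 1 - S(c_low) is replaced by y/n and S(c_low) - S(c_high) by z/n.
    var_y = y * (n - y) / n
    var_z = z * (n - z) / n
    cov_yz = -y * z / n
    variance = (var_y + t ** 2 * var_z + 2 * t * cov_yz) / n ** 2
    return s_n, variance

# Hypothetical grouped counts, made up for illustration.
print(grouped_survival_estimate(n=100, y=30, z=50, c_low=10, c_high=25, x=17.5))
```

With these numbers {$t(x) = 0.5$}, the estimate is {$1 - (30 + 25)/100 = 0.45$}, and the variance estimate is {$(21 + 6.25 - 15)/100^2 = 0.001225$}.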
d.  Suppose we have a discrete RV, a set of {$n$} data points, and we want to estimate {$p(x)$} for an observed value {$x$}.  Then {$p_n(x) = N/n$} is the empirical estimator of {$p$}, where
{$N = \text{# of times } x \text{ was observed in the data.}$}
Then {$N$} is binomial with parameters {$n$} and {$p(x)$}, and
{$$E[p_n(x)] = \frac{1}{n} E[N] = \frac{1}{n} (n p(x)) = p(x); $$}
which shows {$p_n(x)$} is unbiased.
The variance of the estimator is
{$$Var[p_n(x)] = \frac{1}{n^2} Var[N] = \frac{1}{n^2} (np(x)(1-p(x))) = \frac{p(x)(1-p(x))}{n} \approx \frac{p_n(x)(1-p_n(x))}{n}; $$}
note that this goes to 0 as {$n \rightarrow \infty$}, showing that {$p_n$} is consistent.
e.  Given a quantity {$\theta$} that we're trying to estimate, once we have an estimator {$\hat{\theta}$} and have worked out its expected value and variance, we can obtain an approximate {$1-\alpha$} confidence interval from the normal approximation:
{$$1-\alpha \approx Prob (-z_{\frac{\alpha}{2}} \leq \frac{\hat{\theta}-\theta}{\sqrt{Var(\hat\theta)}} \leq z_{\frac{\alpha}{2}} ).$$}
For example, in the case of estimating {$p$} with {$p_n$} (evaluated at a fixed value, as in part d above), we have
{$$1-\alpha \approx Prob (-z_{\frac{\alpha}{2}} \leq \frac{p_n-p}{\sqrt{p(1-p)/n}} \leq z_{\frac{\alpha}{2}} );$$}
Solve for p to get the interval.  We can simplify it even further by approximating the {$p$}'s in the denominator with {$p_n$}.
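Both versions of the interval can be computed directly. In the sketch below (hypothetical function names and inputs), the first function uses the simplification with {$p_n$} in the denominator, and the second solves the quadratic inequality {$(p_n - p)^2 \leq z_{\alpha/2}^2 \, p(1-p)/n$} for {$p$}.

```python
import math
from statistics import NormalDist

def interval_simplified(p_n, n, alpha=0.05):
    """Interval with p replaced by p_n inside the standard error."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * math.sqrt(p_n * (1 - p_n) / n)
    return p_n - half_width, p_n + half_width

def interval_solved(p_n, n, alpha=0.05):
    """Interval from solving (p_n - p)^2 <= z^2 p (1 - p) / n for p."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    scale = 1 + z ** 2 / n
    center = (p_n + z ** 2 / (2 * n)) / scale
    half_width = z * math.sqrt(p_n * (1 - p_n) / n + z ** 2 / (4 * n ** 2)) / scale
    return center - half_width, center + half_width

# Hypothetical estimate and sample size, made up for illustration.
print(interval_simplified(0.3, 100))
print(interval_solved(0.3, 100))
```

The two intervals get closer as {$n$} grows; the solved version is not symmetric about {$p_n$}.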
f.
3. Apply the following concepts in estimating failure time and loss distribution:
- a) Unbiasedness
 - b) Consistency
 - c) Mean squared error