James’s Page | Actuary / CExamEmpModels

October 10, 2011, at 06:16 PM by 38.106.150.109 -

Changed lines 50-51 from:

{$r_i =\sum_{i=1}^{i-1} (r_{i} - (x_{i}+u_{i})) + (d_i-u_i)/2$}, ...,

to:

{$r_i =\sum_{j=1}^{i-1} (d_{j} - (x_{j}+u_{j})) + (d_i-u_i)/2$}, ...,
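As a rough illustration of the approach II risk set in this revision, the sketch below carries forward the net count from earlier intervals (entrants minus uncensored and censored exits) and adds half of the current interval's entrants net of censorings. The function name `risk_sets_approach_two` and the sample counts are hypothetical, chosen only for illustration.

```python
def risk_sets_approach_two(d, x, u):
    """Risk sets when truncation and censoring are spread uniformly over each interval.

    d[i], x[i], u[i] are the truncated, uncensored and censored counts
    for interval i+1 (Python lists are 0-based).
    """
    risk_sets = []
    carried = 0  # sum over earlier intervals of d_j - (x_j + u_j)
    for d_i, x_i, u_i in zip(d, x, u):
        risk_sets.append(carried + (d_i - u_i) / 2)
        carried += d_i - (x_i + u_i)
    return risk_sets


# Hypothetical counts for three intervals.
r = risk_sets_approach_two(d=[100, 20, 10], x=[5, 8, 6], u=[10, 12, 9])
print(r)
```

The first two entries reproduce {$r_1 = (d_1-u_1)/2$} and {$r_2 = d_1 - (x_1+u_1) + (d_2-u_2)/2$}.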

September 20, 2011, at 08:55 PM by 38.106.150.109 -

Changed lines 155-157 from:

Given a quantity {$\theta$} that we're trying to estimate, once we have an estimator {$\hat{\theta}$} and have figured out its expected value and variance, we can obtain a {$1-\alpha$} confidence interval using the following formula:

{$$1-\alpha \approx Prob \left(-z_{\frac{\alpha}{2}} \leq \frac{\hat{\theta}-\theta}{\sqrt{Var(\hat\theta)}} \leq z_{\frac{\alpha}{2}} \right)$$}

to:

e. Given a quantity {$\theta$} that we're trying to estimate, once we have an estimator {$\hat{\theta}$} and have figured out its expected value and variance, we can obtain a {$1-\alpha$} confidence interval using the following formula:

{$$1-\alpha \approx Prob (-z_{\frac{\alpha}{2}} \leq \frac{\hat{\theta}-\theta}{\sqrt{Var(\hat\theta)}} \leq z_{\frac{\alpha}{2}} ).$$}

For example, in the case of estimating {$p$} with {$p_n$} (valued at 2 as above), we have

{$$1-\alpha \approx Prob (-z_{\frac{\alpha}{2}} \leq \frac{p_n-p}{\sqrt{p(1-p)/n}} \leq z_{\frac{\alpha}{2}} );$$}

Solve for p to get the interval. We can simplify it even further by approximating the {$p$}'s in the denominator with {$p_n$}.
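A minimal sketch of both versions of this interval, assuming the normal approximation holds. `solved_interval` solves the inequality for {$p$} exactly (a quadratic in {$p$}), while `wald_interval` uses the simplification of replacing {$p$} in the denominator with {$p_n$}. Both function names and the sample values are hypothetical.

```python
import math
from statistics import NormalDist


def wald_interval(p_n, n, alpha=0.05):
    """Simplified interval: p in the denominator is replaced by p_n."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * math.sqrt(p_n * (1 - p_n) / n)
    return p_n - half_width, p_n + half_width


def solved_interval(p_n, n, alpha=0.05):
    """Solve (p_n - p)^2 <= z^2 p (1 - p) / n for p.

    Rearranged: (1 + z^2/n) p^2 - (2 p_n + z^2/n) p + p_n^2 <= 0,
    so the interval endpoints are the two roots of the quadratic.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    a = 1 + z * z / n
    b = -(2 * p_n + z * z / n)
    c = p_n * p_n
    root = math.sqrt(b * b - 4 * a * c)
    return (-b - root) / (2 * a), (-b + root) / (2 * a)


# Hypothetical example: x observed 30 times in n = 100 data points.
print(wald_interval(0.3, 100))
print(solved_interval(0.3, 100))
```

The two intervals are close for moderate {$p_n$} and large {$n$}; the solved version is not symmetric about {$p_n$}.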

f.

September 20, 2011, at 08:48 PM by 38.106.150.109 -

Changed lines 156-157 from:

{$$1-\alpha \approx Prob \left(-z_{\frac{\alpha}{2}} \leq \frac{\hat{\theta}-\theta}{\sqrt{Var(\hat\theta)}}{} \right)$$}

to:

{$$1-\alpha \approx Prob \left(-z_{\frac{\alpha}{2}} \leq \frac{\hat{\theta}-\theta}{\sqrt{Var(\hat\theta)}} \leq z_{\frac{\alpha}{2}} \right)$$}

September 20, 2011, at 08:47 PM by 38.106.150.109 -

Changed line 147 from:

{$N = \text{# of times } x { was observed in the data.}$}

to:

{$N = \text{# of times } x \text{ was observed in the data.}$}

Changed lines 155-156 from:

to:

Given a quantity {$\theta$} that we're trying to estimate, once we have an estimator {$\hat{\theta}$} and have figured out its expected value and variance, we can obtain a {$1-\alpha$} confidence interval using the following formula:

{$$1-\alpha \approx Prob \left(-z_{\frac{\alpha}{2}} \leq \frac{\hat{\theta}-\theta}{\sqrt{Var(\hat\theta)}}{} \right)$$}

September 20, 2011, at 08:29 PM by 38.106.150.109 -

Changed lines 111-112 from:

{$\displaystyle{Var[S_n(x)] \approx \frac{ (\text{# of observations } > x)(\text{(# of observations } <= x)}{n^3}} $}.

to:

{$$\displaystyle{Var[S_n(x)] \approx \frac{ (\text{# of observations } > x)(\text{# of observations } <= x)}{n^3}}. $$}

Note that {$Var[S_n(x)] \rightarrow 0$} as {$n \rightarrow \infty$}, which, together with unbiasedness, shows that {$S_n(x)$} is consistent.
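For a concrete check of the counting form of the variance estimate, here is a small sketch for a complete data set. The function name `empirical_survival` and the data values are hypothetical.

```python
def empirical_survival(data, x):
    """Return S_n(x) and the estimated variance of S_n(x) for complete data."""
    n = len(data)
    greater = sum(1 for value in data if value > x)  # Y = # of observations > x
    s_n = greater / n
    # (# of observations > x)(# of observations <= x) / n^3
    var_hat = greater * (n - greater) / n**3
    return s_n, var_hat


# Hypothetical sample of 8 observations.
sample = [1, 2, 2, 3, 5, 7, 8, 10]
s_n, var_hat = empirical_survival(sample, 3)
print(s_n, var_hat)  # 0.5 0.03125
```

The counting form agrees with {$S_n(x)(1-S_n(x))/n$}: here {$0.5 \cdot 0.5 / 8 = 0.03125$}.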

September 20, 2011, at 08:28 PM by 38.106.150.109 -

Changed line 108 from:

which shows {$$S_n(x)$$} is unbiased.

to:

which shows {$S_n(x)$} is unbiased.

Changed line 137 from:

{$$Var(S_n(x)) = \frac{1}{n^2} \cdot Var(Y) + t(x)^2Var(Z) + 2t(x) Cov(Y,Z),$$}

to:

{$$Var(S_n(x)) = \frac{1}{n^2} \left[Var(Y) + t(x)^2Var(Z) + 2t(x) Cov(Y,Z)\right],$$}

Changed lines 142-150 from:

(See discussion of the Covariance of 2 binomial RVs here).

{$E[S_n(x)] = S(x) \approx S_n(x); $}

{$E[S_n(x)] = S(x) \approx S_n(x);j $}

to:

(See discussion of the Covariance of 2 binomial RVs here).

d. Suppose we have a discrete RV, a set of {$n$} data points, and we want to estimate {$p(x)$} for an observed value {$x$}. Then {$p_n(x) = N/n$} is the empirical estimator of {$p$}, where

{$N = \text{# of times } x { was observed in the data.}$}

Then {$N$} is binomial with parameters {$n$} and {$p(x)$}, and {$$E[p_n(x)] = \frac{1}{n} E[N] = \frac{1}{n} (n p(x)) = p(x); $$}

which shows {$p_n(x)$} is unbiased.

The variance of the estimator is

{$$Var[p_n(x)] = \frac{1}{n^2} Var[N] = \frac{1}{n^2} (np(x)(1-p(x))) = \frac{p(x)(1-p(x))}{n} \approx \frac{p_n(x)(1-p_n(x))}{n}; $$}

note that this goes to 0 as {$n \rightarrow \infty$}, showing that {$p_n$} is consistent.

September 20, 2011, at 08:19 PM by 38.106.150.109 -

Changed lines 141-143 from:

{$Cov(Y,Z) =E(YZ)-E(Y)E(Z) = - n^2 (1-S(c_{j-1}) (S(c_{j-1})-S(c_j)).$}

to:

{$Cov(Y,Z) =E(YZ)-E(Y)E(Z) = - n (1-S(c_{j-1})) (S(c_{j-1})-S(c_j)).$}

(See discussion of the Covariance of 2 binomial RVs here).
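To see how this covariance enters the grouped-data variance, the sketch below evaluates {$S_n(x) = 1 - \frac{1}{n}(Y + t(x)Z)$} and a plug-in estimate of {$\frac{1}{n^2}[Var(Y) + t(x)^2 Var(Z) + 2t(x)Cov(Y,Z)]$}. It assumes that {$S(c_{j-1})$} and {$S(c_{j-1})-S(c_j)$} are replaced by their empirical values {$1 - Y/n$} and {$Z/n$}; the function name `ogive_survival` and the sample counts are hypothetical.

```python
def ogive_survival(boundaries, counts, x):
    """Ogive estimate of S(x) and a plug-in variance estimate for grouped data.

    boundaries: c_0 < c_1 < ... < c_k
    counts[j-1]: number of observations in (c_{j-1}, c_j]
    """
    n = sum(counts)
    for j in range(1, len(boundaries)):
        if boundaries[j - 1] < x <= boundaries[j]:
            break
    else:
        raise ValueError("x must lie inside the grouped range")

    y = sum(counts[: j - 1])  # Y = # of observations <= c_{j-1}
    z = counts[j - 1]         # Z = # of observations in (c_{j-1}, c_j]
    t = (x - boundaries[j - 1]) / (boundaries[j] - boundaries[j - 1])
    s_n = 1 - (y + t * z) / n

    s_prev = 1 - y / n        # plug-in for S(c_{j-1})
    p_z = z / n               # plug-in for S(c_{j-1}) - S(c_j)
    var_y = n * s_prev * (1 - s_prev)
    var_z = n * p_z * (1 - p_z)
    cov_yz = -n * (1 - s_prev) * p_z
    var_hat = (var_y + t * t * var_z + 2 * t * cov_yz) / n**2
    return s_n, var_hat


# Hypothetical grouped data: 5, 3 and 2 observations in (0,10], (10,20], (20,30].
print(ogive_survival([0, 10, 20, 30], [5, 3, 2], 15))
```

As a sanity check, setting {$t(x) = 1$} reduces {$Y + Z$} to a single binomial count with parameters {$n$} and {$1 - S(c_j)$}, and the three terms combine to the usual binomial variance.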

September 19, 2011, at 10:07 PM by 38.106.150.109 -

Changed lines 137-139 from:

{$$Var(S_n(x)) = $$}

to:

{$$Var(S_n(x)) = \frac{1}{n^2} \cdot Var(Y) + t(x)^2Var(Z) + 2t(x) Cov(Y,Z),$$}

where

{$Var(Y) =n S(c_{j-1})(1-S(c_{j-1})) ,$}

{$Var(Z) =n (S(c_{j-1})-S(c_j))(1-S(c_{j-1})+S(c_j)),$}

{$Cov(Y,Z) =E(YZ)-E(Y)E(Z) = - n^2 (1-S(c_{j-1}) (S(c_{j-1})-S(c_j)).$}

September 19, 2011, at 09:25 PM by 38.106.150.109 -

Changed lines 133-137 from:

{$$E(S_n(x))= 1- \frac{1}{n} (E(Y) + t(x) E(Z)) $$} {$$= 1- \frac{1}{n} \cdot \left[n (1-S(c_{j-1}) + n t(x) (S(c_{j-1})-S(c_j)) \right]$$} {$$ = (1-t(x))S(c_{j-1})+t(x)S(c_j)$$;}

to:

{$$E(S_n(x))= 1- \frac{1}{n} (E(Y) + t(x) E(Z)) = 1- \frac{1}{n} \cdot \left[n (1-S(c_{j-1})) + n t(x) (S(c_{j-1})-S(c_j)) \right] = (1-t(x))S(c_{j-1})+t(x)S(c_j);$$}

September 19, 2011, at 09:24 PM by 38.106.150.109 -

Changed lines 134-137 from:

{$$= 1- \frac{1}{n} \cdot \left[n (1-S(c_{j-1}) $$} {$$ + n t(x) (S(c_{j-1})-S(c_j)) \right] = (1-t(x))S(c_{j-1})+t(x)S(c_j)$$;}

to:

{$$= 1- \frac{1}{n} \cdot \left[n (1-S(c_{j-1}) + n t(x) (S(c_{j-1})-S(c_j)) \right]$$} {$$ = (1-t(x))S(c_{j-1})+t(x)S(c_j)$$;}

September 19, 2011, at 09:23 PM by 38.106.150.109 -

Changed lines 134-136 from:

{$$= 1- \frac{1}{n} \cdot \left[n (1-S(c_{j-1}) + n t(x) (S(c_{j-1})-S(c_j)) \right] = (1-t(x))S(c_{j-1})+t(x)S(c_j)$$;}

to:

{$$= 1- \frac{1}{n} \cdot \left[n (1-S(c_{j-1}) $$} {$$ + n t(x) (S(c_{j-1})-S(c_j)) \right] = (1-t(x))S(c_{j-1})+t(x)S(c_j)$$;}

September 19, 2011, at 09:23 PM by 38.106.150.109 -

Changed lines 131-135 from:

{$$what??$$}

{$$E(S_n(x)) = 1- \frac{1}{n} (E(Y) + t(x) E(Z)) = 1- \frac{1}{n} \cdot \left[n (1-S(c_{j-1}) + n t(x) (S(c_{j-1})-S(c_j)) \right] = (1-t(x))S(c_{j-1})+t(x)S(c_j)$$;}

to:

{$$E(S_n(x))= 1- \frac{1}{n} (E(Y) + t(x) E(Z)) $$} {$$= 1- \frac{1}{n} \cdot \left[n (1-S(c_{j-1}) + n t(x) (S(c_{j-1})-S(c_j)) \right] = (1-t(x))S(c_{j-1})+t(x)S(c_j)$$;}

September 19, 2011, at 09:22 PM by 38.106.150.109 -

Changed lines 131-133 from:

{$$E(S_n(x)) = 1- \frac{1}{n} (E(Y) + t(x) E(Z)) = 1- \frac{1}{n} \cdot \left[n (1-S(c_{j-1}) + n t(x) (S(c_{j-1})-S(c_j)) \right] = (1-t(x))S(c_{j-1}+t(x)S(c_j)$$;}

to:

{$$what??$$}

{$$E(S_n(x)) = 1- \frac{1}{n} (E(Y) + t(x) E(Z)) = 1- \frac{1}{n} \cdot \left[n (1-S(c_{j-1}) + n t(x) (S(c_{j-1})-S(c_j)) \right] = (1-t(x))S(c_{j-1})+t(x)S(c_j)$$;}

Changed lines 142-146 from:

{$E[S_n(x)] = S(x) \approx S_n(x); $}

to:

{$E[S_n(x)] = S(x) \approx S_n(x);j $}

September 19, 2011, at 09:03 PM by 38.106.150.109 -

Changed lines 131-134 from:

{$$E(S_n(x) = 1- ]frac{1}{n} (E(Y) + t(x) E(Z)) = 1 - \frac{1}{n} \cdot \left[n (1-S(c_{j-1}) + n t(x) (S(c_{j-1})-S(c_j)) \right] = (1-t(x))S(c_{j-1}+t(x)S(c_j)$$;}

to:

{$$E(S_n(x)) = 1- \frac{1}{n} (E(Y) + t(x) E(Z)) = 1- \frac{1}{n} \cdot \left[n (1-S(c_{j-1}) + n t(x) (S(c_{j-1})-S(c_j)) \right] = (1-t(x))S(c_{j-1}+t(x)S(c_j)$$;}

September 19, 2011, at 09:02 PM by 38.106.150.109 -

Changed lines 131-134 from:

{$$E(S_n(x) = 1- E(Y) + t(x) E(Z) = 1- \frac{1}{n} \cdot n(1-S(c_{j-1}) + n t(x) (S(c_{j-1})-S(c_j)) (1-t(x))S(c_{j-1}+t(x)S(c_j) = (1-t(x))S(c_{j-1}+t(x)S(c_j)$$;}

to:

{$$E(S_n(x) = 1- ]frac{1}{n} (E(Y) + t(x) E(Z)) = 1 - \frac{1}{n} \cdot \left[n (1-S(c_{j-1}) + n t(x) (S(c_{j-1})-S(c_j)) \right] = (1-t(x))S(c_{j-1}+t(x)S(c_j)$$;}

September 19, 2011, at 08:59 PM by 38.106.150.109 -

Changed lines 117-119 from:

It's not possible to calculate {$E[\hat{q_2}]$} and {$Var[\hat{q_2}]$}, because {$X = 0$} would make the denominator zero; an alternative is to calculate the conditional variance given, say, {$X=k$}. Note that we can estimate the resulting RV, {$Y/k$}, with a new empirical estimate {$S_k(x) = \text{# of deaths between 2 and 3}/k$}, where is Binomial with parameters {$k$} and {$(S(2)-S(3))/S(2)$}, so that as the case above,

to:

It's not possible to calculate {$E[\hat{q_2}]$} and {$Var[\hat{q_2}]$}, because {$X = 0$} would make the denominator zero; an alternative is to calculate the conditional variance given, say, {$X=k$}. Note that in this notation,

{$$\hat{q_2}=\frac{\text{# of deaths between 2 and 3}}{k}. $$}

Note also that we can estimate the resulting RV, {$Y/k$}, with a new empirical estimate {$S_k(x) = \text{# of deaths between 2 and 3}/k$}, where {$Y$} is binomial with parameters {$k$} and {$(S(2)-S(3))/S(2)$}, so that, as in the case above,

Changed lines 121-122 from:

showing that the estimate is unbiased. Note that {$$\hat{q_2}=\frac{\text{# of deaths between 2 and 3}}{k}. $$}

to:

showing that the estimate is unbiased.

Changed line 131 from:

{$$E(S_n(x) = 1- E(Y) + t(x) E(Z) = 1- \frac{1}{n} \cdot n(1-S(c_(j-1)) + n t(x) (S(c_{j-1})-S(c_j)) (1-t(x))S(c_{j-1}+t(x)S(c_j) = (1-t(x))S(c_{j-1}+t(x)S(c_j)$$;}

to:

{$$E(S_n(x) = 1- E(Y) + t(x) E(Z) = 1- \frac{1}{n} \cdot n(1-S(c_{j-1}) + n t(x) (S(c_{j-1})-S(c_j)) (1-t(x))S(c_{j-1}+t(x)S(c_j) = (1-t(x))S(c_{j-1}+t(x)S(c_j)$$;}

September 19, 2011, at 08:55 PM by 38.106.150.109 -

Changed line 124 from:

{$$S_n(x) = 1 - \frac{1}{n} Y + t(x)Z,$$}

to:

{$$S_n(x) = 1 - \frac{1}{n} (Y + t(x)Z ),$$}

Changed line 130 from:

{$$E(S_n(x) = 1- E(Y) + t(x) E(Z) = (1-t(x))S(c_{j-1}+t(x)S(c_j) (1-t(x))S(c_{j-1}+t(x)S(c_j)$$;}

to:

{$$E(S_n(x) = 1- E(Y) + t(x) E(Z) = 1- \frac{1}{n} \cdot n(1-S(c_(j-1)) + n t(x) (S(c_{j-1})-S(c_j)) (1-t(x))S(c_{j-1}+t(x)S(c_j) = (1-t(x))S(c_{j-1}+t(x)S(c_j)$$;}

September 19, 2011, at 08:43 PM by 38.106.150.109 -

Changed lines 107-108 from:

{$$E[S_n(x)] = \frac{1}{n} E[Y] = \frac{1}{n} (n S(x)) = S(x) \approx S_n(x); $$}

to:

{$$E[S_n(x)] = \frac{1}{n} E[Y] = \frac{1}{n} (n S(x)) = S(x); $$}

which shows {$$S_n(x)$$} is unbiased.

Changed lines 118-120 from:

{$$E[\hat{q_2}| \text{# of people alive at }2 = k] = E[Y/k] = \frac{1}{k}\frac{k(S(2)-S(3))}{S(2)}\approx \frac{S_n(2) - S_n(3)}{S_n(2)}=\frac{\text{# of deaths between 2 and 3}}{k}; $$}

to:

{$$E[\hat{q_2}| \text{# of people alive at }2 = k] = E[Y/k] = \frac{1}{k}\frac{k(S(2)-S(3))}{S(2)} = \frac{S(2) - S(3)}{S(2)};$$} showing that the estimate is unbiased. Note that {$$\hat{q_2}=\frac{\text{# of deaths between 2 and 3}}{k}. $$}

Changed lines 123-135 from:

c. If we have ogives and histograms (i.e., the data is grouped), {$E[S_n(x)] = S(x) \approx S_n(x); $}

to:

c. Suppose we have ogives and histograms (i.e., the data is grouped). Let {$x \in (c_{j-1}, c_j)$}. Then we can define an estimator for {$S(x)$} by

{$$S_n(x) = 1 - \frac{1}{n} Y + t(x)Z,$$} where

{$Y = \text{# of observations } <= c_{j-1},$}

{$Z = \text{# of observations } > c_{j-1} \text{ and } <= c_{j},$}

{$t(x) = \frac{x-c_{j-1}}{c_j-c_{j-1}}.$}

Then

{$$E(S_n(x) = 1- E(Y) + t(x) E(Z) = (1-t(x))S(c_{j-1}+t(x)S(c_j) (1-t(x))S(c_{j-1}+t(x)S(c_j)$$;}

This shows {$S_n$} is biased (unlike the previous 2 examples). Also,

{$$Var(S_n(x)) = $$}

{$E[S_n(x)] = S(x) \approx S_n(x); $}

September 19, 2011, at 08:13 PM by 38.106.150.109 -

Changed lines 117-119 from:

{$$E[\hat{q_2}| \text{# of people alive at }2 = k] = E[Y/k] = \frac{1}{k}\frac{k(S(2)-S(3))}{S(2)}\approx = \frac{S_n(2) - S_n(3)}{S_n(2)}=\frac{\text{# of deaths between 2 and 3}}{\text{# of people alive at 2}}; $$} {$$Var[\hat{q_2} | \text{# of people alive at }2 = k] = Var[Y/k] \approx \frac {S_k(x)(1-S_k(x)}{k}=Var[\hat{q_2} | \text{# of people alive at }2 = k] \approx \frac{\text{(# of deaths between 2 and 3)(}k -\text{# of deaths between 2 and 3)}}{k^3}. $$}

to:

{$$E[\hat{q_2}| \text{# of people alive at }2 = k] = E[Y/k] = \frac{1}{k}\frac{k(S(2)-S(3))}{S(2)}\approx \frac{S_n(2) - S_n(3)}{S_n(2)}=\frac{\text{# of deaths between 2 and 3}}{k}; $$} {$$Var[\hat{q_2} | \text{# of people alive at }2 = k] = Var[Y/k] \approx \frac{\text{(# of deaths between 2 and 3)(}k -\text{# of deaths between 2 and 3)}}{k^3}. $$}
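A short numerical sketch of the conditional estimate and its variance, given that {$k$} people are alive at duration 2. The function name `conditional_q_hat` and the counts are hypothetical.

```python
def conditional_q_hat(deaths, k):
    """Estimate q_2 and its conditional variance given k lives at duration 2."""
    q_hat = deaths / k
    # (# of deaths between 2 and 3)(k - # of deaths between 2 and 3) / k^3
    var_hat = deaths * (k - deaths) / k**3
    return q_hat, var_hat


# Hypothetical counts: 3 deaths between durations 2 and 3 among k = 20 lives.
q_hat, var_hat = conditional_q_hat(3, 20)
print(q_hat, var_hat)
```

This is the same binomial argument as in case a., with {$k$} in place of {$n$}: the result equals {$\hat{q_2}(1-\hat{q_2})/k$}.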

September 19, 2011, at 08:11 PM by 38.106.150.109 -

Changed lines 117-121 from:

{$E[\hat{q_2}| \text{# of people alive at }2 = k] = E[Y/k] = \frac{1}{k}\frac{k(S(2)-S(3))}{S(2)}\approx = \frac{S_n(2) - S_n(3)}{S_n(2)}=\frac{\text{# of deaths between 2 and 3}}{\text{# of people alive at 2}}; $}

{$$Var[\hat{q_2} | \text{# of people alive at }2 = k] = Var[Y/k] \approx \frac {S_k(x)(1-S_k(x)}{k};$$}

This is the same as

{$$\displaystyle{Var[\hat{q_2} | \text{# of people alive at }2 = k] \approx \frac{\text{(# of deaths between 2 and 3)(}k -\text{# of deaths between 2 and 3)}}{k^3}}. $$}

to:

{$$E[\hat{q_2}| \text{# of people alive at }2 = k] = E[Y/k] = \frac{1}{k}\frac{k(S(2)-S(3))}{S(2)}\approx = \frac{S_n(2) - S_n(3)}{S_n(2)}=\frac{\text{# of deaths between 2 and 3}}{\text{# of people alive at 2}}; $$} {$$Var[\hat{q_2} | \text{# of people alive at }2 = k] = Var[Y/k] \approx \frac {S_k(x)(1-S_k(x)}{k}=Var[\hat{q_2} | \text{# of people alive at }2 = k] \approx \frac{\text{(# of deaths between 2 and 3)(}k -\text{# of deaths between 2 and 3)}}{k^3}. $$}

September 19, 2011, at 08:10 PM by 38.106.150.109 -

Changed line 112 from:

b. Empirical estimate of (say) {$q_2 = \frac{S(3) - S(2)}{S(2)}$} is {$\hat{q_2}=\frac{S_n(3) - S_n(2)}{S_n(2)} = \frac{Y}{X}$}, where

to:

b. Empirical estimate of (say) {$q_2 = \frac{S(2) - S(3)}{S(2)}$} is {$\hat{q_2}=\frac{S_n(2) - S_n(3)}{S_n(2)} = \frac{Y}{X}$}, where

Changed lines 116-117 from:

It's not possible to calculate {$E[\hat{q_2}]$} and {$Var[\hat{q_2}]$}, because {$X = 0$} would make the denominator zero; an alternative is to calculate the conditional variance given, say, {$X=k$}. Note that we can estimate the resulting RV, {$Y/k$}, with its own empirical estimate {$S_k(x) =Y/k$}, so that as the case above,

{$E[\hat{q_2}| \text{# of people alive at }2 = k] = E[Y/k]\approx \frac{S_k(x)}{k} = \frac{\text{# of deaths between 2 and 3}}{\text{# of people alive at 2}}; $}

to:

It's not possible to calculate {$E[\hat{q_2}]$} and {$Var[\hat{q_2}]$}, because {$X = 0$} would make the denominator zero; an alternative is to calculate the conditional variance given, say, {$X=k$}. Note that we can estimate the resulting RV, {$Y/k$}, with a new empirical estimate {$S_k(x) = \text{# of deaths between 2 and 3}/k$}, where is Binomial with parameters {$k$} and {$(S(2)-S(3))/S(2)$}, so that as the case above,

{$E[\hat{q_2}| \text{# of people alive at }2 = k] = E[Y/k] = \frac{1}{k}\frac{k(S(2)-S(3))}{S(2)}\approx = \frac{S_n(2) - S_n(3)}{S_n(2)}=\frac{\text{# of deaths between 2 and 3}}{\text{# of people alive at 2}}; $}

September 19, 2011, at 07:52 PM by 38.106.150.109 -

Changed lines 107-108 from:

{$E[S_n(x)] = S(x) \approx S_n(x); $}

{$Var[S_n(x)] = \frac{1}{n^2} Var[Y] = S(x)(1-S(x))/n \approx S_n(x)(1-S_n(x))/n; $}

to:

{$$E[S_n(x)] = \frac{1}{n} E[Y] = \frac{1}{n} (n S(x)) = S(x) \approx S_n(x); $$} {$$Var[S_n(x)] = \frac{1}{n^2} Var[Y] = \frac{1}{n^2} (nS(x)(1-S(x))) = \frac{S(x)(1-S(x))}{n} \approx \frac{S_n(x)(1-S_n(x))}{n}; $$}

September 19, 2011, at 06:47 PM by 38.106.150.109 -

Changed line 118 from:

{$Var[\hat{q_2} | \text{# of people alive at }2 = k] = Var[Y/k] \approx \frac {S_k(x)(1-S_k(x)}{k}$};

to:

{$$Var[\hat{q_2} | \text{# of people alive at }2 = k] = Var[Y/k] \approx \frac {S_k(x)(1-S_k(x)}{k};$$}

Changed lines 120-121 from:

{$\displaystyle{Var[\hat{q_2} | \text{# of people alive at }2 = k] \approx \frac{\text{(# of deaths between 2 and 3)(}k -\text{# of deaths between 2 and 3)}}{k^3}} $}.

to:

{$$\displaystyle{Var[\hat{q_2} | \text{# of people alive at }2 = k] \approx \frac{\text{(# of deaths between 2 and 3)(}k -\text{# of deaths between 2 and 3)}}{k^3}}. $$}

September 19, 2011, at 06:44 PM by 38.106.150.109 -

Changed lines 110-111 from:

{$\displaystyle{{$Var[S_n(x)] \approx \frac{\text{(# of observations } > x)(\text{# of observations } <= x)}{n^3}} $}.

to:

{$\displaystyle{Var[S_n(x)] \approx \frac{ (\text{# of observations } > x)(\text{(# of observations } <= x)}{n^3}} $}.

Changed line 117 from:

{$E[\hat{q_2}| \text{# of people alive at }2 = k] = E[Y/k]\approx \frac{S_k(x)}{k} = \frac{\text{# of deaths between 2 and 3}}{\text{# of people alive at 2}}}; $}

to:

{$E[\hat{q_2}| \text{# of people alive at }2 = k] = E[Y/k]\approx \frac{S_k(x)}{k} = \frac{\text{# of deaths between 2 and 3}}{\text{# of people alive at 2}}; $}

September 19, 2011, at 06:42 PM by 38.106.150.109 -

Changed lines 115-116 from:

{$E[\hat{q_2}] = ?? \approx S_n(x); $}

It's not possible to calculate {$Var[\hat{q_2}]$}, because {$X = 0$} would make the denominator zero; an alternative is to calculate the conditional variance given, say, {$X=k$}. Note that we can estimate the resulting RV, {$Y/k$}, with its own empirical estimate {$S_k(x) =Y/k$}, so that as the case above,

to:

It's not possible to calculate {$E[\hat{q_2}]$} and {$Var[\hat{q_2}]$}, because {$X = 0$} would make the denominator zero; an alternative is to calculate the conditional variance given, say, {$X=k$}. Note that we can estimate the resulting RV, {$Y/k$}, with its own empirical estimate {$S_k(x) =Y/k$}, so that as the case above,

{$E[\hat{q_2}| \text{# of people alive at }2 = k] = E[Y/k]\approx \frac{S_k(x)}{k} = \frac{\text{# of deaths between 2 and 3}}{\text{# of people alive at 2}}}; $}

September 19, 2011, at 06:11 PM by 38.106.150.109 -

Changed lines 110-111 from:

{$\displaystyle{\frac{\text{(# of observations } > x)(\text{# of observations } <= x}{n^3}} $}.

to:

{$\displaystyle{{$Var[S_n(x)] \approx \frac{\text{(# of observations } > x)(\text{# of observations } <= x)}{n^3}} $}.

Changed lines 118-119 from:

This is {$\displaystyle{\frac{\text{(# of deaths between 2 and 3)(}k -\text{# of deaths between 2 and 3)}}{k^3}} $}.

to:

This is the same as

{$\displaystyle{Var[\hat{q_2} | \text{# of people alive at }2 = k] \approx \frac{\text{(# of deaths between 2 and 3)(}k -\text{# of deaths between 2 and 3)}}{k^3}} $}.

September 19, 2011, at 06:09 PM by 38.106.150.109 -

Changed line 105 from:

Given a complete data set of {$n$} points, the empirical estimate of {$S(x)$} is {$S_n(x)= Y/n$},

to:

a. Given a complete data set of {$n$} points, the empirical estimate of {$S(x)$} is {$S_n(x)= Y/n$},

Changed lines 109-117 from:

Empirical estimate of (say) {$q_2 = \frac{S(3) - S(2)}{S(2)}$} is {$\hat{q_2}=\frac{S_n(3) - S_n(2)}{S_n(2)} = \frac{Y}{X}$}, where

{$X$} = # of people alive at duration 2,

{$Y$} = # of deaths between duration 2 and 3.

{$E[\hat{q_2}] = ?? \approx S_n(x); $}

It's not possible to calculate {$Var[\hat{q_2}]$}, because {$X = 0$} would make the denominator zero; an alternative is to calculate the conditional variance given, say, {$X=k$}. Note that we can estimate the resulting RV, {$Y/k$}, with its own empirical estimate {$S_k(x) =Y/k$}, so that as the case above,

{$Var[\hat{q_2} | \text{# of people alive at }2 = k] = Var[Y/k] \approx \frac {S_k(x)(1-S_k(x)}{k}$};

This is {$\displaystyle{\frac{\text{(# of deaths between 2 and 3)(}k -\text{# of deaths between 2 and 3)}}{k^3}} $}

to:

This is the same thing as

{$\displaystyle{\frac{\text{(# of observations } > x)(\text{# of observations } <= x}{n^3}} $}.

b. Empirical estimate of (say) {$q_2 = \frac{S(3) - S(2)}{S(2)}$} is {$\hat{q_2}=\frac{S_n(3) - S_n(2)}{S_n(2)} = \frac{Y}{X}$}, where

{$X$} = # of people alive at duration 2,

{$Y$} = # of deaths between duration 2 and 3.

{$E[\hat{q_2}] = ?? \approx S_n(x); $}

It's not possible to calculate {$Var[\hat{q_2}]$}, because {$X = 0$} would make the denominator zero; an alternative is to calculate the conditional variance given, say, {$X=k$}. Note that we can estimate the resulting RV, {$Y/k$}, with its own empirical estimate {$S_k(x) =Y/k$}, so that as the case above,

{$Var[\hat{q_2} | \text{# of people alive at }2 = k] = Var[Y/k] \approx \frac {S_k(x)(1-S_k(x)}{k}$};

This is {$\displaystyle{\frac{\text{(# of deaths between 2 and 3)(}k -\text{# of deaths between 2 and 3)}}{k^3}} $}.

c. If we have ogives and histograms (i.e., the data is grouped), {$E[S_n(x)] = S(x) \approx S_n(x); $}

Changed lines 123-128 from:

{$E[S_n(x)] = S(x) \approx S_n(x); $}

1). When {$S(t)$} comes from an empirical estimator, then we can view

to:

September 18, 2011, at 09:08 PM by 38.106.150.109 -

Changed lines 115-116 from:

{$Var[\hat{q_2} | \text{# of people alive at }2 = k] = Var[Y/k] \approx \frac {S_k(x)(1-S_k(x)}{k} = \frac{\text{(# of deaths between 2 and 3)(}k -\text{# of deaths between 2 and 3)}}{k^3} $}

to:

{$Var[\hat{q_2} | \text{# of people alive at }2 = k] = Var[Y/k] \approx \frac {S_k(x)(1-S_k(x)}{k}$};

This is {$\displaystyle{\frac{\text{(# of deaths between 2 and 3)(}k -\text{# of deaths between 2 and 3)}}{k^3}} $}

September 18, 2011, at 09:07 PM by 38.106.150.109 -

Changed lines 115-116 from:

{$Var[\hat{q_2 | \text{# of people alive at }2 = k} | ] = Var[Y/k] \approx \frac {S_k(x)(1-S_k(x)}{k} = \frac{\text{(# of deaths between durations 2 and 3)(}k -\text{# of deaths between durations 2 and 3}}{k^3} $}

to:

{$Var[\hat{q_2} | \text{# of people alive at }2 = k] = Var[Y/k] \approx \frac {S_k(x)(1-S_k(x)}{k} = \frac{\text{(# of deaths between 2 and 3)(}k -\text{# of deaths between 2 and 3)}}{k^3} $}

September 18, 2011, at 09:06 PM by 38.106.150.109 -

Changed lines 108-112 from:

{$Var[S_n(x)] = S(x)(1-S(x))/n \approx S_n(x)(1-S_n(x))/n; $}

Empirical estimate of (say) {$q_2 = \frac{S(3) - S(2)}{S(2)}$} is {$\hat{q_2}=\frac{S_n(3) - S_n(2)}{S_n(2)} = \frac{Y}{n-X}$}, where

{$X$} = # of deaths between duration 0 and 2,

{$Y$} = # of deaths between duration 2 and 3.

to:

{$Var[S_n(x)] = \frac{1}{n^2} Var[Y] = S(x)(1-S(x))/n \approx S_n(x)(1-S_n(x))/n; $}

Empirical estimate of (say) {$q_2 = \frac{S(3) - S(2)}{S(2)}$} is {$\hat{q_2}=\frac{S_n(3) - S_n(2)}{S_n(2)} = \frac{Y}{X}$}, where

{$X$} = # of people alive at duration 2,

{$Y$} = # of deaths between duration 2 and 3.

Changed lines 114-116 from:

It's not possible to calculate {$Var[\hat{q_2}]$}, because {$X = 30$} would make the denominator zero; an alternative is to calculate the conditional variance

{$Var[\hat{q_2} | X= 29 ]$ = Var[] S(x)(1-S(x))/n \approx S_n(x)(1-S_n(x))/n; $}

to:

It's not possible to calculate {$Var[\hat{q_2}]$}, because {$X = 0$} would make the denominator zero; an alternative is to calculate the conditional variance given, say, {$X=k$}. Note that we can estimate the resulting RV, {$Y/k$}, with its own empirical estimate {$S_k(x) =Y/k$}, so that as the case above,

{$Var[\hat{q_2 | \text{# of people alive at }2 = k} | ] = Var[Y/k] \approx \frac {S_k(x)(1-S_k(x)}{k} = \frac{\text{(# of deaths between durations 2 and 3)(}k -\text{# of deaths between durations 2 and 3}}{k^3} $}

September 17, 2011, at 10:26 AM by 38.106.150.109 -

September 17, 2011, at 10:24 AM by 38.106.150.109 -

Changed line 114 from:

It's not possible to calculate {$Var[\hat{q_2}]$}, because {$X = 30$} would make the denominator zero; an alternative is to calculate

to:

It's not possible to calculate {$Var[\hat{q_2}]$}, because {$X = 30$} would make the denominator zero; an alternative is to calculate the conditional variance

September 17, 2011, at 10:22 AM by 38.106.150.109 -

Changed lines 105-106 from:

Given a complete data set of {$n$} points, the empirical estimate of {$S(x)$} is {$S_n(x)= Y/n$},

to:

Given a complete data set of {$n$} points, the empirical estimate of {$S(x)$} is {$S_n(x)= Y/n$},

Changed line 110 from:

Empirical estimate of (say) {$q_2)$} is {$\hat{q_2}=\frac{S_n(3) - S_n(2)}{S_n(2)} = \frac{Y}{n-X}$}, where

to:

Empirical estimate of (say) {$q_2 = \frac{S(3) - S(2)}{S(2)}$} is {$\hat{q_2}=\frac{S_n(3) - S_n(2)}{S_n(2)} = \frac{Y}{n-X}$}, where

Changed lines 113-115 from:

{$E[\hat{q_2}] = S(x) \approx S_n(x); $}

{$Var[S_n(x)] = S(x)(1-S(x))/n \approx S_n(x)(1-S_n(x))/n; $}

to:

{$E[\hat{q_2}] = ?? \approx S_n(x); $}

It's not possible to calculate {$Var[\hat{q_2}]$}, because {$X = 30$} would make the denominator zero; an alternative is to calculate

{$Var[\hat{q_2} | X= 29 ]$ = Var[] S(x)(1-S(x))/n \approx S_n(x)(1-S_n(x))/n; $}

September 16, 2011, at 10:57 PM by 38.106.150.109 -

Changed lines 105-106 from:

Given a complete data set with {$n$} data points with some unknown survival function {$S(x)$}, we can define the empirical estimator {$S_n(x)$} as above; one way to see this is

{$S_n(x) = Y/n$},

to:

Given a complete data set of {$n$} points, the empirical estimate of {$S(x)$} is {$S_n(x)= Y/n$},

Deleted lines 107-109:

Then

Empirical estimate of {$S(x)$} is {$S_n(x)$};

Changed lines 111-112 from:

Empirical estimate of (say) {$q_2)$} is {$\frac{S_n(3) - S_n(2)}{S_n(2)}$};

{$E[S_n(x)] = S(x) \approx S_n(x); $}

to:

Empirical estimate of (say) {$q_2)$} is {$\hat{q_2}=\frac{S_n(3) - S_n(2)}{S_n(2)} = \frac{Y}{n-X}$}, where

{$X$} = # of deaths between duration 0 and 2,

{$Y$} = # of deaths between duration 2 and 3.

{$E[\hat{q_2}] = S(x) \approx S_n(x); $}

September 16, 2011, at 10:46 PM by 38.106.150.109 -

Changed line 105 from:

Given a complete data set with {$n$} data points with some unknown survival function {$S(x)$}, we can define the empirical estimator {$S_n(x)$} as above; then

to:

Given a complete data set with {$n$} data points with some unknown survival function {$S(x)$}, we can define the empirical estimator {$S_n(x)$} as above; one way to see this is

Changed lines 107-122 from:

where {$Y$} is the # of observations that are greater than {$x$}. Then {$Y$} is a binomial distribution with parameters {$n$} and {$S(x)$}.

to:

where {$Y$} is the # of observations that are greater than {$x$}. Then {$Y$} has a binomial distribution with parameters {$n$} and {$S(x)$}. (Check: view {$Y$} as counting the # of successes in {$n$} tosses of a biased coin, where a success means the observation is greater than {$x$}. The probability of that happening is precisely {$S(x)$}.)

Then

Empirical estimate of {$S(x)$} is {$S_n(x)$};

{$E[S_n(x)] = S(x) \approx S_n(x); $}

{$Var[S_n(x)] = S(x)(1-S(x))/n \approx S_n(x)(1-S_n(x))/n; $}

Empirical estimate of (say) {$q_2)$} is {$\frac{S_n(3) - S_n(2)}{S_n(2)}$};

{$E[S_n(x)] = S(x) \approx S_n(x); $}

{$Var[S_n(x)] = S(x)(1-S(x))/n \approx S_n(x)(1-S_n(x))/n; $}

{$E[S_n(x)] = S(x) \approx S_n(x); $}

September 16, 2011, at 10:10 PM by 38.106.150.109 -

September 16, 2011, at 07:26 PM by 24.148.27.45 -

Added lines 103-104:

General setup: Given a data set, name an estimator, use it to estimate some probabilistic value of the RV associated with the data set, then use some general theory to find or estimate the mean and variance of the estimator itself.

September 16, 2011, at 07:16 PM by 24.148.27.45 -

Deleted lines 101-103:

Changed lines 103-108 from:

to:

Given a complete data set with {$n$} data points with some unknown survival function {$S(x)$}, we can define the empirical estimator {$S_n(x)$} as above; then

{$S_n(x) = Y/n$},

where {$Y$} is the # of observations that are greater than {$x$}. Then {$Y$} is a binomial distribution with parameters {$n$} and {$S(x)$}.

1). When {$S(t)$} comes from an empirical estimator, then we can view

September 16, 2011, at 07:03 PM by 24.148.27.45 -

Changed lines 60-64 from:

single-decrement mortality probabilities {$q_{j}^{'(d)}$}, by treating the death counts as uncensored observations and withdrawls as cencored observations;

single-decrement withdrawl probabilities {$q_{j}^{'(w)}$}, by treating the withdrawl counts as uncensored observations and deaths as cencored observations.

to:

single-decrement mortality probabilities {$q_{j}^{'(d)}$}, by treating the death counts as uncensored observations and withdrawals as censored observations;

single-decrement withdrawal probabilities {$q_{j}^{'(w)}$}, by treating the withdrawal counts as uncensored observations and deaths as censored observations.

September 16, 2011, at 07:02 PM by 24.148.27.45 -

Changed lines 57-58 from:

{$y_j = \frac{S(c_j)-S(c_{j+1}}{S(c_j)} = \frac{x_j}{r_j} $}.

to:

{$\displaystyle{q_j = \frac{S(c_j)-S(c_{j+1})}{S(c_j)} = \frac{x_{j+1}}{r_{j+1}} }$}.

September 16, 2011, at 07:00 PM by 24.148.27.45 -

Changed line 28 from:

Large Data Sets: Sometimes the data is so large that it's not reasonable to keep track of all the {$y_j$}'s. So instead we have intervals of interest: Let {$c_o < c_1 < \ldot < c_k$} be boundaries of such intervals.

to:

Large Data Sets: Sometimes the data set is so large that it's not reasonable to keep track of all the {$y_j$}'s. So instead we have intervals of interest: Let {$c_0 < c_1 < \ldots < c_k$} be boundaries of such intervals.

September 16, 2011, at 06:59 PM by 24.148.27.45 -

Deleted line 28:

Changed line 32 from:

{$d_i$} = # of truncated observations where the truncation point is in {$[c_{i-1}, c_{i})$};

to:

{$d_i$} = # of truncated observations where the truncation point is in {$ [c_{i-1}, c_{i} ) $};

Changed lines 37-51 from:

I. All truncation occurs at the beginning of the interval, and all censoring occurs at the end of the interval.

(For example, all lives enter and leave study using their insuring age.)

Define the risk set to be

{$r_1 = d_1$},

{$r_2 = r_1 - (x_1+u_1) + d_2$}, ...,

{$r_i = r_{i-1} - (x_{i-1}+u_{i-1}) + d_i$}, ...,

II. Truncation and censoring occur uniformly throughout the interval, and the uncensored observations occur in the middle of the interval.

(For example, more detailed data gathering would invariably give something closer to this scenario than the former.)

Define the risk set to be

{$r_1 = (d_1-u_1)/2$},

{$r_2 = d_1 - (x_1+u_1) + (d_2-u_2)/2$}, ...,

{$r_i =\sum_{i=1}^{i-1} (r_{i} - (x_{i}+u_{i})) + (d_i-u_i)/2$}, ...,

to:

I. All truncation occurs at the beginning of the interval, and all censoring occurs at the end of the interval.

(For example, all lives enter and leave study using their insuring age.)

Define the risk set to be

{$r_1 = d_1$},

{$r_2 = r_1 - (x_1+u_1) + d_2$}, ...,

{$r_i = r_{i-1} - (x_{i-1}+u_{i-1}) + d_i$}, ...,

II. Truncation and censoring occur uniformly throughout the interval, and the uncensored observations occur in the middle of the interval.

(For example, more detailed data gathering would invariably give something closer to this scenario than the former.)

Define the risk set to be

{$r_1 = (d_1-u_1)/2$},

{$r_2 = d_1 - (x_1+u_1) + (d_2-u_2)/2$}, ...,

{$r_i =\sum_{i=1}^{i-1} (r_{i} - (x_{i}+u_{i})) + (d_i-u_i)/2$}, ...,

September 16, 2011, at 06:56 PM by 24.148.27.45 -

September 16, 2011, at 06:52 PM by 24.148.27.45 -

Added lines 45-52:

II. Truncation and censoring occur uniformly throughout the interval, and the uncensored observations occur in the middle of the interval.

(For example, more detailed data gathering would invariably give something closer to this scenario than the former.)

Define the risk set to be

{$r_1 = (d_1-u_1)/2$},

{$r_2 = d_1 - (x_1+u_1) + (d_2-u_2)/2$}, ...,

{$r_i =\sum_{i=1}^{i-1} (r_{i} - (x_{i}+u_{i})) + (d_i-u_i)/2$}, ...,

Changed lines 57-58 from:

to:

It follows that the probability that someone is alive at {$c_j$}, but does not survive past {$c_{j+1}$}, is

{$y_j = \frac{S(c_j)-S(c_{j+1}}{S(c_j)} = \frac{x_j}{r_j} $}.

Moreover, when we can sort the uncensored observations {$x_j$} into more refined categories, we can calculate single-decrement probabilities. For example, in a mortality study someone can either withdraw or die in a time interval. We can then calculate

single-decrement mortality probabilities {$q_{j}^{'(d)}$}, by treating the death counts as uncensored observations and withdrawls as cencored observations;

single-decrement withdrawl probabilities {$q_{j}^{'(w)}$}, by treating the withdrawl counts as uncensored observations and deaths as cencored observations.

September 16, 2011, at 06:18 PM by 24.148.27.45 -

Changed lines 28-32 from:

Large Data Sets: Sometimes the data is so large that it's not reasonable to keep track of all the {$y_j$}'s. So instead we have intervals of interest: Let {$c_o < c_1 < \ldot < c_k$} be boundaries of such intervals. For {$i = 1, \ldots, n,$} let

to:

Large Data Sets: Sometimes the data is so large that it's not reasonable to keep track of all the {$y_j$}'s. So instead we have intervals of interest: Let {$c_0 < c_1 < \ldots < c_k$} be boundaries of such intervals.

Example: Mortality study on a large population; {$c_j$}'s can be integer ages.

For {$i = 1, \ldots, n,$} let

Changed lines 36-41 from:

Then the risk set is

to:

(Note indexing here is a little different from the book, in order to match the rest of the chapter.)

Two approaches:

I. All truncation occurs at the beginning of the interval, and all censoring occurs at the end of the interval.

(For example, all lives enter and leave study using their insuring age.)

Define the risk set to be

Changed lines 45-48 from:

Example: Mortality study on a large population; {$c_j$}'s can be integer ages.

to:

Define a Kaplan-Meier type survival function on the boundary points by

{$S(c_1)= 1-\frac{x_1}{r_1}$},

{$S(c_2)= S(c_1)\left(1-\frac{x_2}{r_2} \right)$},

{$S(c_j)= \prod_{i=1}^j \left(1-\frac{x_i}{r_i}\right)$}.

September 16, 2011, at 05:33 PM by 24.148.27.45 -

Changed line 10 from:

{$u_i$} = censored value for that observation.

to:

{$u_i$} = censored value for that observation.

Added lines 28-39:

Large Data Sets: Sometimes the data is so large that it's not reasonable to keep track of all the {$y_j$}'s. So instead we have intervals of interest: Let {$c_0 < c_1 < \ldots < c_k$} be boundaries of such intervals. For {$i = 1, \ldots, n,$} let

{$d_i$} = # of truncated observations where the truncation point is in {$[c_{i-1}, c_{i})$};

{$x_i$} = # of uncensored observations in {$(c_{i-1}, c_{i}]$};

{$u_i$} = # of censored observations with values in {$(c_{i-1}, c_{i}]$};

Then the risk set is

{$r_1 = d_1$},

{$r_2 = r_1 - (x_1+u_1) + d_2$}, ...,

{$r_i = r_{i-1} - (x_{i-1}+u_{i-1}) + d_i$}, ...,

Example: Mortality study on a large population; {$c_j$}'s can be integer ages.
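The recursion for the interval risk sets above can be sketched in a few lines of Python. This is only an illustration under the assumption that truncation counts {$d_i$}, uncensored counts {$x_i$}, and censored counts {$u_i$} are supplied as equal-length lists; the function name `interval_risk_sets` and the sample counts are made up for the example.

```python
def interval_risk_sets(d, x, u):
    """Risk sets r_1, ..., r_k from per-interval counts.

    d[i], x[i], u[i] are the numbers of truncated, uncensored and
    censored observations for interval i+1 (lists are 0-indexed).
    Uses r_1 = d_1 and r_i = r_{i-1} - (x_{i-1} + u_{i-1}) + d_i.
    """
    r = []
    for i in range(len(d)):
        if i == 0:
            r.append(d[0])
        else:
            r.append(r[i - 1] - (x[i - 1] + u[i - 1]) + d[i])
    return r


# Hypothetical counts for three intervals.
d = [100, 20, 10]
x = [5, 6, 4]
u = [10, 8, 12]
risk_sets = interval_risk_sets(d, x, u)
# Interval ratios x_i / r_i, the building blocks of a product-type survival estimate.
ratios = [xi / ri for xi, ri in zip(x, risk_sets)]
```

With these counts the risk sets come out as 100, 105, and 101.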

September 15, 2011, at 04:44 PM by 209.174.60.10 -

Added lines 1-73:

Construction of Empirical Models (20-25%)

1. Estimate failure time and loss distributions using:

a) Kaplan-Meier estimator, including approximations for large data sets

2 concrete situations:

Insurance: observing loss amount per policy. Left truncation when loss < deductible; right censoring when loss > policy limit

Mortality table: observing age at death for each person. Left truncation happens at the age at which the person is first observed; right censoring happens at the age of last observation if the person is still alive then.

Symbols: For each observation {$i$},

{$d_i$} = truncation point for that observation (0 if no truncation);

{$x_i$} = the observed value, if it wasn't censored;

{$u_i$} = censored value for that observation.

Then group and relabel the {$x_i$}'s into {$y_j$} each occurring {$s_j$} times. Divide up the data according to the {$y_j$}'s by defining the risk set to be

{$r_j = \left(\text{#} x_i \text{ and } u_i \geq y_j \right) - \left(\text{#} d_i \geq y_j \right)$}

which is the same as

{$r_j = \left(\text{#} d_i < y_j \right) - \left(\text{#} x_i \text{ and } u_i < y_j \right) $}

So in our situations, the risk set counts

the difference between a) # of policies with observed loss amount >= {$y_j$} and b) # of policies with deductible >= {$y_j$};

the difference between a) # of people entering the study before the age of {$y_j$} and b) # of people who died or left the study before the age of {$y_j$} (i.e., the # of people being observed alive at a certain age {$y_j$})

Recursively, given {$r_{j-1}$},

{$r_j = r_{j-1} + \left(\text{#} d_i \in [y_{j-1}, y_j) \right) - \left(\text{#} x_i =y_{j-1}\right) - \left(\text{#} u_i \in [y_{j-1}, y_j) \right) $}

And so we define the Kaplan-Meier limit estimator as follows

{$S(t)=1$}, for {$t\in[0, y_1)$},

{$S(t)=\frac{r_1-s_1}{r_1}$} (probability of surviving past {$y_1$}), for {$t\in[y_1, y_2)$},

{$S(t)=S(y_1)\frac{r_2-s_2}{r_2}$} (probability of surviving past {$y_2$}), for {$t\in[y_2, y_3)$},...

{$S(t)=\prod_{i=1}^k\frac{r_i-s_i}{r_i}$} (probability of surviving past {$y_k$}), for {$t\in[y_k,\infty)$}.

Or we can define the last line to be 0, or something in between that decays exponentially to 0 as {$t\rightarrow\infty$}.
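As a rough sketch of the product-limit construction above, the following Python computes {$s_j$}, {$r_j$}, and the estimate at each distinct uncensored value. It assumes each observation is given as a hypothetical tuple `(d, value, is_censored)`, where `value` is {$x_i$} or {$u_i$}; the function name and the sample data are illustrative only. Exact fractions are used so the products can be checked by hand.

```python
from fractions import Fraction


def kaplan_meier(records):
    """records: list of (d, value, is_censored) tuples.

    Returns a list of (y_j, s_j, r_j, S(y_j)) for each distinct
    uncensored value y_j, in increasing order.
    """
    death_times = sorted({v for d, v, c in records if not c})
    rows = []
    surv = Fraction(1)
    for y in death_times:
        # s_j: number of uncensored observations equal to y_j
        s = sum(1 for d, v, c in records if not c and v == y)
        # r_j: observations with truncation point below y_j whose
        # value (x_i or u_i) is at least y_j
        r = sum(1 for d, v, c in records if d < y <= v)
        surv *= Fraction(r - s, r)
        rows.append((y, s, r, surv))
    return rows


# Hypothetical data: five observations, two censored, one truncated at 1.
records = [
    (0, 1, False),
    (0, 2, False),
    (0, 2, True),
    (0, 3, False),
    (1, 4, True),
]
rows = kaplan_meier(records)
```

For this sample the risk sets are 4, 4, and 2, and the estimates at the three uncensored values are 3/4, 9/16, and 9/32.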

b) Nelson-Åalen estimator

This estimates the cumulative hazard rate function as follows:

{$\hat{H}(t)=0$}, for {$t\in[0, y_1)$},

{$\hat{H}(t)=\frac{s_1}{r_1}$}, for {$t\in[y_1, y_2)$},

{$\hat{H}(t)=\hat{H}(y_1) + \frac{s_2}{r_2}$}, for {$t\in[y_2, y_3)$},...

{$\hat{H}(t)=\sum_{i=1}^k\frac{s_i}{r_i}$}, for {$t\in[y_k,\infty)$}.

Then {$\hat{S}(t) = e^{-\hat{H}(t)}$}.

For the end piece {$t\in[y_k,\infty)$}, we can define {$\hat{S}(t)$} to be 0, or something in between that decays exponentially to 0 as {$t\rightarrow\infty$}.
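The cumulative sums above translate directly into code. The sketch below assumes the counts {$s_j$} and risk sets {$r_j$} have already been computed and are passed as lists; `nelson_aalen` is an illustrative name, not a library function.

```python
import math


def nelson_aalen(s, r):
    """Cumulative hazard estimates and the implied survival values.

    s[j], r[j]: number of uncensored observations and risk set at the
    (j+1)-th distinct uncensored value.
    Returns (H, S) where H[j] = sum of s_i / r_i up to j and
    S[j] = exp(-H[j]).
    """
    H = []
    total = 0.0
    for s_j, r_j in zip(s, r):
        total += s_j / r_j
        H.append(total)
    S = [math.exp(-h) for h in H]
    return H, S


# Hypothetical counts and risk sets.
H, S = nelson_aalen([1, 1, 1], [4, 4, 2])
```

Here the cumulative hazard steps through 0.25, 0.5, and 1.0, so the last survival value is {$e^{-1}$}.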

c) Kernel density estimators

First we need to discuss empirical distributions:

Suppose the data is not grouped and there is no censoring or truncation. Out of a total of {$n$} observations, let {$y_i$} be distinct data points each occurring {$s_i$} times. Then {$S(t)=1-F(t),$} where

{$F(t)=0$}, for {$t\in(-\infty, y_1)$},

{$F(t)=\frac{s_1}{n}$}, for {$t\in[y_1, y_2)$},

{$F(t)=F(y_1) + \frac{s_2}{n}$}, for {$t\in[y_2, y_3)$},...

{$F(t)=\sum_{i=1}^k\frac{s_i}{n} = 1$}, for {$t\in[y_k,\infty)$}.

Note each {$y_i$} is assigned probability {${p(y_i)}= s_i/n$}.

A kernel density estimator is defined as

{$\hat{F}(t) = \sum_{i=1}^k p(y_i) K_{y_i}(t)$},

where {$K_{y_i}(t)$} are kernel functions that can be defined by a chosen pdf with a certain bandwidth {$b$}:

A uniform kernel is a rectangle centered at each {$y_i$}; its pdf has width {$2b$} and height {$1/(2b)$};

A triangular kernel is a triangle centered at each {$y_i$}; its pdf has width {$2b$} and height {$1/b$}.

Book also mentions a gamma kernel where the parameters are given by {$\alpha$} and {$y_i/\alpha$}. For the 2 kernels with bandwidths, the larger the bandwidth is, the smoother the resulting kernel density estimator. For gamma kernels, the smaller {$\alpha$} is, the smoother the resulting kernel density estimator.
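To make the uniform and triangular kernels concrete, here is a small Python sketch that evaluates the smoothed density, i.e. the weighted sum of kernel pdfs with weights {$p(y_i)=s_i/n$}. It is an illustration only: the function names, the sample points, and the choice to evaluate the density (rather than the distribution function) are assumptions made for the example.

```python
def uniform_kernel(t, y, b):
    """Uniform kernel pdf: height 1/(2b) on [y - b, y + b], 0 elsewhere."""
    return 1.0 / (2 * b) if abs(t - y) <= b else 0.0


def triangular_kernel(t, y, b):
    """Triangular kernel pdf: peak 1/b at y, falling linearly to 0 at y +/- b."""
    return max(0.0, b - abs(t - y)) / (b * b)


def kernel_density(t, points, counts, b, kernel):
    """Sum of p(y_i) * k_{y_i}(t), with p(y_i) = s_i / n."""
    n = sum(counts)
    return sum((s / n) * kernel(t, y, b) for y, s in zip(points, counts))


# Hypothetical data: distinct values 1, 2, 4 occurring 1, 2, 1 times.
points = [1, 2, 4]
counts = [1, 2, 1]
f_uniform = kernel_density(1.5, points, counts, 1.0, uniform_kernel)
f_triangular = kernel_density(2.0, points, counts, 1.0, triangular_kernel)
```

With bandwidth 1, the uniform version gives 0.375 at {$t=1.5$} and the triangular version gives 0.5 at {$t=2$}. Increasing `b` spreads each point's weight over a wider range, which is the smoothing effect described above.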

For grouped data, i.e., when observations are recorded by range rather than as single, precise values, we need to use ogives and histograms. Instead of {$y_i$}, let {$c_0, \ldots, c_k$} be the boundary values of intervals of observations, and {$s_1, \ldots, s_k$} be the number of observations in {$(c_0, c_1), \ldots, (c_{k-1}, c_k)$}. Then {$S(t)=1-F(t),$} where {$F(t)$} is the ogive,

{$F(c_0)=0$},

{$F(c_1)=\frac{s_1}{n}$},

{$F(c_2)=\frac{s_1+s_2}{n}$},...

{$F(c_k)=\sum_{i=1}^k\frac{s_i}{n} = 1$},

and the values of {$F$} between the {$c_i$}'s are connected linearly. Take the derivative to get the histogram {$f(t)$}:

{$f(t)=0$}, for {$t\in(-\infty, c_0)$},

{$f(t)=\frac{s_1}{n(c_1-c_0)}$}, for {$t\in[c_0, c_1)$},

{$f(t)=\frac{s_2}{n(c_2-c_1)}$}, for {$t\in[c_1, c_2)$},...

{$f(t)=\frac{s_k}{n(c_k-c_{k-1})}$}, for {$t\in[c_{k-1},c_k)$},

{$f(t)=0$} (or undefined), for {$t\in[c_k,\infty)$}.
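The ogive and histogram can be sketched as two short Python functions. This assumes the boundaries {$c_0, \ldots, c_k$} and the counts {$s_1, \ldots, s_k$} are given as lists `c` and `s`; the names and the sample data are illustrative, and the histogram is taken to be 0 (rather than undefined) outside {$[c_0, c_k)$}.

```python
def ogive(t, c, s):
    """Piecewise-linear F(t) through the points (c_j, (s_1 + ... + s_j) / n)."""
    n = sum(s)
    if t <= c[0]:
        return 0.0
    if t >= c[-1]:
        return 1.0
    cum = 0
    for j in range(1, len(c)):
        if t < c[j]:
            # Linear interpolation inside (c_{j-1}, c_j)
            frac = (t - c[j - 1]) / (c[j] - c[j - 1])
            return (cum + s[j - 1] * frac) / n
        cum += s[j - 1]


def histogram(t, c, s):
    """Derivative of the ogive: s_j / (n (c_j - c_{j-1})) on [c_{j-1}, c_j)."""
    n = sum(s)
    for j in range(1, len(c)):
        if c[j - 1] <= t < c[j]:
            return s[j - 1] / (n * (c[j] - c[j - 1]))
    return 0.0


# Hypothetical grouped data: 10 observations in three intervals.
c = [0, 10, 20, 40]
s = [2, 5, 3]
F_15 = ogive(15, c, s)
f_15 = histogram(15, c, s)
```

For this sample, {$F(15)=0.45$} (halfway between 0.2 and 0.7) and {$f(15)=0.05$}.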

2. Estimate the variance of estimators and confidence intervals for failure time and loss distributions.

3. Apply the following concepts in estimating failure time and loss distribution:

a) Unbiasedness

b) Consistency

c) Mean squared error

CExamEmpModels

Actuary.CExamEmpModels History

Construction of Empirical Models (20-25%)