Actuary.CExamEmpModels History
Changed lines 50-51 from:
to:
Changed lines 155-157 from:
Given a quantity {$\theta$} that we're trying to estimate, once we have an estimator {$\hat{\theta}$} and have figured out its expected value and variance, we can obtain a {$1-\alpha$} confidence interval using the following formula:
{$$1-\alpha \approx Prob \left(-z_{\frac{\alpha}{2}} \leq \frac{\hat{\theta}-\theta}{\sqrt{Var(\hat\theta)}} \leq z_{\frac{\alpha}{2}} \right)$$} to:
e. Given a quantity {$\theta$} that we're trying to estimate, once we have an estimator {$\hat{\theta}$} and have figured out its expected value and variance, we can obtain a {$1-\alpha$} confidence interval using the following formula:
{$$1-\alpha \approx Prob \left(-z_{\frac{\alpha}{2}} \leq \frac{\hat{\theta}-\theta}{\sqrt{Var(\hat\theta)}} \leq z_{\frac{\alpha}{2}} \right).$$}
For example, in the case of estimating {$p$} with {$p_n$} (valued at 2 as above), we have
{$$1-\alpha \approx Prob \left(-z_{\frac{\alpha}{2}} \leq \frac{p_n-p}{\sqrt{p(1-p)/n}} \leq z_{\frac{\alpha}{2}} \right);$$}
Solve for {$p$} to get the interval. We can simplify it even further by approximating the {$p$}'s in the denominator with {$p_n$}, which gives the interval {$p_n \pm z_{\frac{\alpha}{2}} \sqrt{p_n(1-p_n)/n}$}.
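As a rough illustration of the simplified interval, here is a minimal Python sketch. The function name and the sample numbers (20 occurrences out of 100 observations) are hypothetical, and it assumes the normal approximation with {$p$} replaced by {$p_n$} in the denominator.

```python
import math
from statistics import NormalDist


def proportion_confidence_interval(count, n, alpha=0.05):
    """Approximate (1 - alpha) interval for p, using p_n = count / n.

    Assumes the normal approximation, with p replaced by p_n
    inside the standard error.
    """
    p_n = count / n
    # z_{alpha/2}: upper alpha/2 quantile of the standard normal
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * math.sqrt(p_n * (1 - p_n) / n)
    return p_n - half_width, p_n + half_width


# Hypothetical data: x was observed 20 times in n = 100 observations.
lower, upper = proportion_confidence_interval(20, 100)
print(round(lower, 4), round(upper, 4))
```

Solving the inequality exactly for {$p$} (without the substitution) leads to a quadratic in {$p$} and a slightly different interval; the sketch covers only the simplified version.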
f.
Changed lines 156-157 from:
{$$1-\alpha \approx Prob \left(-z_{\frac{\alpha}{2}} \leq \frac{\hat{\theta}-\theta}{\sqrt{Var(\hat\theta)}}{} \right)$$} to:
{$$1-\alpha \approx Prob \left(-z_{\frac{\alpha}{2}} \leq \frac{\hat{\theta}-\theta}{\sqrt{Var(\hat\theta)}} \leq z_{\frac{\alpha}{2}} \right)$$} Changed line 147 from:
to:
Changed lines 155-156 from:
to:
Given a quantity {$\theta$} that we're trying to estimate, once we have an estimator {$\hat{\theta}$} and have figured out its expected value and variance, we can obtain a {$1-\alpha$} confidence interval using the following formula:
{$$1-\alpha \approx Prob \left(-z_{\frac{\alpha}{2}} \leq \frac{\hat{\theta}-\theta}{\sqrt{Var(\hat\theta)}}{} \right)$$} Changed lines 111-112 from:
to:
{$$\displaystyle{Var[S_n(x)] \approx \frac{ (\text{# of observations } > x)(\text{# of observations } \leq x)}{n^3}}. $$} Note that {$Var[S_n(x)] \rightarrow 0$} as {$n \rightarrow \infty$}, showing that {$S_n(x)$} is consistent.
Changed line 108 from:
which shows {$$S_n(x)$$} is unbiased.
to:
which shows {$S_n(x)$} is unbiased.
Changed line 137 from:
{$$Var(S_n(x)) = \frac{1}{n^2} \cdot Var(Y) + t(x)^2Var(Z) + 2t(x) Cov(Y,Z),$$} to:
{$$Var(S_n(x)) = \frac{1}{n^2} \left[Var(Y) + t(x)^2Var(Z) + 2t(x) Cov(Y,Z)\right],$$}
Changed lines 142-150 from:
(See discussion of the Covariance of 2 binomial RVs here).
{$E[S_n(x)] = S(x) \approx S_n(x); $}
to:
d. Suppose we have a discrete RV, a set of {$n$} data points, and we want to estimate {$p(x)$} for an observed value {$x$}. Then {$p_n(x) = N/n$} is the empirical estimator of {$p$}, where
{$N = \text{# of times } x \text{ was observed in the data.}$}
Then {$N$} is binomial with parameters {$n$} and {$p(x)$}, and {$$E[p_n(x)] = \frac{1}{n} E[N] = \frac{1}{n} (n p(x)) = p(x); $$} which shows {$p_n(x)$} is unbiased.
The variance of the estimator is
{$$Var[p_n(x)] = \frac{1}{n^2} Var[N] = \frac{1}{n^2} (np(x)(1-p(x))) = \frac{p(x)(1-p(x))}{n} \approx \frac{p_n(x)(1-p_n(x))}{n}; $$} note that this goes to 0 as {$n \rightarrow \infty$}, showing that {$p_n(x)$} is consistent.
Changed lines 141-143 from:
to:
(See discussion of the Covariance of 2 binomial RVs here).
Changed lines 137-139 from:
{$$Var(S_n(x)) = $$} to:
{$$Var(S_n(x)) = \frac{1}{n^2} \cdot Var(Y) + t(x)^2Var(Z) + 2t(x) Cov(Y,Z),$$} where
{$Var(Y) =n S(c_{j-1})(1-S(c_{j-1})) ,$}
{$Var(Z) =n (S(c_{j-1})-S(c_j))(1-S(c_{j-1})+S(c_j)),$}
{$Cov(Y,Z) =E(YZ)-E(Y)E(Z) = - n (1-S(c_{j-1})) (S(c_{j-1})-S(c_j)).$}
Changed lines 133-137 from:
{$$E(S_n(x))= 1- \frac{1}{n} (E(Y) + t(x) E(Z)) $$} {$$= 1- \frac{1}{n} \cdot \left[n (1-S(c_{j-1}) + n t(x) (S(c_{j-1})-S(c_j)) \right]$$} {$$ = (1-t(x))S(c_{j-1})+t(x)S(c_j)$$;} to:
{$$E(S_n(x))= 1- \frac{1}{n} (E(Y) + t(x) E(Z)) = 1- \frac{1}{n} \cdot \left[n (1-S(c_{j-1})) + n t(x) (S(c_{j-1})-S(c_j)) \right] = (1-t(x))S(c_{j-1})+t(x)S(c_j);$$}
Changed lines 134-137 from:
{$$= 1- \frac{1}{n} \cdot \left[n (1-S(c_{j-1}) $$} {$$ + n t(x) (S(c_{j-1})-S(c_j)) \right] = (1-t(x))S(c_{j-1})+t(x)S(c_j)$$;} to:
{$$= 1- \frac{1}{n} \cdot \left[n (1-S(c_{j-1}) + n t(x) (S(c_{j-1})-S(c_j)) \right]$$} {$$ = (1-t(x))S(c_{j-1})+t(x)S(c_j)$$;} Changed lines 134-136 from:
{$$= 1- \frac{1}{n} \cdot \left[n (1-S(c_{j-1}) + n t(x) (S(c_{j-1})-S(c_j)) \right] = (1-t(x))S(c_{j-1})+t(x)S(c_j)$$;} to:
{$$= 1- \frac{1}{n} \cdot \left[n (1-S(c_{j-1}) $$} {$$ + n t(x) (S(c_{j-1})-S(c_j)) \right] = (1-t(x))S(c_{j-1})+t(x)S(c_j)$$;} Changed lines 131-135 from:
{$$what??$$} {$$E(S_n(x)) = 1- \frac{1}{n} (E(Y) + t(x) E(Z)) = 1- \frac{1}{n} \cdot \left[n (1-S(c_{j-1}) + n t(x) (S(c_{j-1})-S(c_j)) \right] = (1-t(x))S(c_{j-1})+t(x)S(c_j)$$;} to:
{$$E(S_n(x))= 1- \frac{1}{n} (E(Y) + t(x) E(Z)) $$} {$$= 1- \frac{1}{n} \cdot \left[n (1-S(c_{j-1}) + n t(x) (S(c_{j-1})-S(c_j)) \right] = (1-t(x))S(c_{j-1})+t(x)S(c_j)$$;} Changed lines 131-133 from:
{$$E(S_n(x)) = 1- \frac{1}{n} (E(Y) + t(x) E(Z)) = 1- \frac{1}{n} \cdot \left[n (1-S(c_{j-1}) + n t(x) (S(c_{j-1})-S(c_j)) \right] = (1-t(x))S(c_{j-1}+t(x)S(c_j)$$;} to:
{$$what??$$} {$$E(S_n(x)) = 1- \frac{1}{n} (E(Y) + t(x) E(Z)) = 1- \frac{1}{n} \cdot \left[n (1-S(c_{j-1}) + n t(x) (S(c_{j-1})-S(c_j)) \right] = (1-t(x))S(c_{j-1})+t(x)S(c_j)$$;} Changed lines 142-146 from:
to:
Changed lines 131-134 from:
{$$E(S_n(x) = 1- ]frac{1}{n} (E(Y) + t(x) E(Z)) = 1 - \frac{1}{n} \cdot \left[n (1-S(c_{j-1}) + n t(x) (S(c_{j-1})-S(c_j)) \right] = (1-t(x))S(c_{j-1}+t(x)S(c_j)$$;} to:
{$$E(S_n(x)) = 1- \frac{1}{n} (E(Y) + t(x) E(Z)) = 1- \frac{1}{n} \cdot \left[n (1-S(c_{j-1}) + n t(x) (S(c_{j-1})-S(c_j)) \right] = (1-t(x))S(c_{j-1}+t(x)S(c_j)$$;} Changed lines 131-134 from:
{$$E(S_n(x) = 1- E(Y) + t(x) E(Z) = 1- \frac{1}{n} \cdot n(1-S(c_{j-1}) + n t(x) (S(c_{j-1})-S(c_j)) (1-t(x))S(c_{j-1}+t(x)S(c_j) = (1-t(x))S(c_{j-1}+t(x)S(c_j)$$;} to:
{$$E(S_n(x) = 1- ]frac{1}{n} (E(Y) + t(x) E(Z)) = 1 - \frac{1}{n} \cdot \left[n (1-S(c_{j-1}) + n t(x) (S(c_{j-1})-S(c_j)) \right] = (1-t(x))S(c_{j-1}+t(x)S(c_j)$$;} Changed lines 117-119 from:
It's not possible to calculate {$E[\hat{q_2}]$} and {$Var[\hat{q_2}]$}, because {$X = 0$} would make the denominator zero; an alternative is to calculate the conditional variance given, say, {$X=k$}. Note that we can estimate the resulting RV, {$Y/k$}, with a new empirical estimate {$S_k(x) = \text{# of deaths between 2 and 3}/k$}, where is Binomial with parameters {$k$} and {$(S(2)-S(3))/S(2)$}, so that as the case above,
to:
It's not possible to calculate {$E[\hat{q_2}]$} and {$Var[\hat{q_2}]$}, because {$X = 0$} would make the denominator zero; an alternative is to calculate the conditional mean and variance given, say, {$X=k$}. Note that in this notation,
{$$\hat{q_2}=\frac{\text{# of deaths between 2 and 3}}{k}. $$} Note also that we can estimate the resulting RV, {$Y/k$}, with a new empirical estimate {$S_k(x) = \text{# of deaths between 2 and 3}/k$}, where {$Y$} is Binomial with parameters {$k$} and {$(S(2)-S(3))/S(2)$}, so that, as in the case above,
Changed lines 121-122 from:
showing that the estimate is unbiased. Note that {$$\hat{q_2}=\frac{\text{# of deaths between 2 and 3}}{k}. $$} to:
showing that the estimate is unbiased. Changed line 131 from:
{$$E(S_n(x) = 1- E(Y) + t(x) E(Z) = 1- \frac{1}{n} \cdot n(1-S(c_(j-1)) + n t(x) (S(c_{j-1})-S(c_j)) (1-t(x))S(c_{j-1}+t(x)S(c_j) = (1-t(x))S(c_{j-1}+t(x)S(c_j)$$;} to:
{$$E(S_n(x) = 1- E(Y) + t(x) E(Z) = 1- \frac{1}{n} \cdot n(1-S(c_{j-1}) + n t(x) (S(c_{j-1})-S(c_j)) (1-t(x))S(c_{j-1}+t(x)S(c_j) = (1-t(x))S(c_{j-1}+t(x)S(c_j)$$;} Changed line 124 from:
{$$S_n(x) = 1 - \frac{1}{n} Y + t(x)Z,$$} to:
{$$S_n(x) = 1 - \frac{1}{n} (Y + t(x)Z ),$$} Changed line 130 from:
{$$E(S_n(x) = 1- E(Y) + t(x) E(Z) = (1-t(x))S(c_{j-1}+t(x)S(c_j) (1-t(x))S(c_{j-1}+t(x)S(c_j)$$;} to:
{$$E(S_n(x) = 1- E(Y) + t(x) E(Z) = 1- \frac{1}{n} \cdot n(1-S(c_(j-1)) + n t(x) (S(c_{j-1})-S(c_j)) (1-t(x))S(c_{j-1}+t(x)S(c_j) = (1-t(x))S(c_{j-1}+t(x)S(c_j)$$;} Changed lines 107-108 from:
{$$E[S_n(x)] = \frac{1}{n} E[Y] = \frac{1}{n} (n S(x)) = S(x) \approx S_n(x); $$} to:
{$$E[S_n(x)] = \frac{1}{n} E[Y] = \frac{1}{n} (n S(x)) = S(x); $$} which shows {$$S_n(x)$$} is unbiased.
Changed lines 118-120 from:
{$$E[\hat{q_2}| \text{# of people alive at }2 = k] = E[Y/k] = \frac{1}{k}\frac{k(S(2)-S(3))}{S(2)}\approx \frac{S_n(2) - S_n(3)}{S_n(2)}=\frac{\text{# of deaths between 2 and 3}}{k}; $$} to:
{$$E[\hat{q_2}| \text{# of people alive at }2 = k] = E[Y/k] = \frac{1}{k}\frac{k(S(2)-S(3))}{S(2)} = \frac{S(2) - S(3)}{S(2)};$$} showing that the estimate is unbiased. Note that {$$\hat{q_2}=\frac{\text{# of deaths between 2 and 3}}{k}. $$} Changed lines 123-135 from:
to:
c. Suppose we have ogives and histograms (i.e., the data is grouped). Let {$x \in (c_{j-1}, c_j)$}. Then we can define an estimator for {$S(x)$} by
{$$S_n(x) = 1 - \frac{1}{n} Y + t(x)Z,$$} where
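{$Y$} = # of observations {$\leq c_{j-1}$},
{$Z$} = # of observations in {$(c_{j-1}, c_j]$},
{$t(x) = \frac{x - c_{j-1}}{c_j - c_{j-1}}$}.
Then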
{$$E(S_n(x) = 1- E(Y) + t(x) E(Z) = (1-t(x))S(c_{j-1}+t(x)S(c_j) (1-t(x))S(c_{j-1}+t(x)S(c_j)$$;} This shows {$S_n$} is biased (unlike the previous 2 examples). Also,
{$$Var(S_n(x)) = $$} {$E[S_n(x)] = S(x) \approx S_n(x); $} Changed lines 117-119 from:
{$$E[\hat{q_2}| \text{# of people alive at }2 = k] = E[Y/k] = \frac{1}{k}\frac{k(S(2)-S(3))}{S(2)}\approx = \frac{S_n(2) - S_n(3)}{S_n(2)}=\frac{\text{# of deaths between 2 and 3}}{\text{# of people alive at 2}}; $$} {$$Var[\hat{q_2} | \text{# of people alive at }2 = k] = Var[Y/k] \approx \frac {S_k(x)(1-S_k(x)}{k}=Var[\hat{q_2} | \text{# of people alive at }2 = k] \approx \frac{\text{(# of deaths between 2 and 3)(}k -\text{# of deaths between 2 and 3)}}{k^3}. $$} to:
{$$E[\hat{q_2}| \text{# of people alive at }2 = k] = E[Y/k] = \frac{1}{k}\frac{k(S(2)-S(3))}{S(2)}\approx \frac{S_n(2) - S_n(3)}{S_n(2)}=\frac{\text{# of deaths between 2 and 3}}{k}; $$} {$$Var[\hat{q_2} | \text{# of people alive at }2 = k] = Var[Y/k] \approx \frac{\text{(# of deaths between 2 and 3)(}k -\text{# of deaths between 2 and 3)}}{k^3}. $$} Changed lines 117-121 from:
{$$Var[\hat{q_2} | \text{# of people alive at }2 = k] = Var[Y/k] \approx \frac {S_k(x)(1-S_k(x)}{k};$$} This is the same as
{$$\displaystyle{Var[\hat{q_2} | \text{# of people alive at }2 = k] \approx \frac{\text{(# of deaths between 2 and 3)(}k -\text{# of deaths between 2 and 3)}}{k^3}}. $$} to:
{$$E[\hat{q_2}| \text{# of people alive at }2 = k] = E[Y/k] = \frac{1}{k}\frac{k(S(2)-S(3))}{S(2)}\approx = \frac{S_n(2) - S_n(3)}{S_n(2)}=\frac{\text{# of deaths between 2 and 3}}{\text{# of people alive at 2}}; $$} {$$Var[\hat{q_2} | \text{# of people alive at }2 = k] = Var[Y/k] \approx \frac {S_k(x)(1-S_k(x)}{k}=Var[\hat{q_2} | \text{# of people alive at }2 = k] \approx \frac{\text{(# of deaths between 2 and 3)(}k -\text{# of deaths between 2 and 3)}}{k^3}. $$} Changed line 112 from:
b. Empirical estimate of (say) {$q_2 = \frac{S(3) - S(2)}{S(2)}$} is {$\hat{q_2}=\frac{S_n(3) - S_n(2)}{S_n(2)} = \frac{Y}{X}$}, where
to:
b. Empirical estimate of (say) {$q_2 = \frac{S(2) - S(3)}{S(2)}$} is {$\hat{q_2}=\frac{S_n(2) - S_n(3)}{S_n(2)} = \frac{Y}{X}$}, where
Changed lines 116-117 from:
It's not possible to calculate {$E[\hat{q_2}]$} and {$Var[\hat{q_2}]$}, because {$X = 0$} would make the denominator zero; an alternative is to calculate the conditional variance given, say, {$X=k$}. Note that we can estimate the resulting RV, {$Y/k$}, with its own empirical estimate {$S_k(x) =Y/k$}, so that as the case above,
{$E[\hat{q_2}| \text{# of people alive at }2 = k] = E[Y/k]\approx \frac{S_k(x)}{k} = \frac{\text{# of deaths between 2 and 3}}{\text{# of people alive at 2}}; $}
to:
It's not possible to calculate {$E[\hat{q_2}]$} and {$Var[\hat{q_2}]$}, because {$X = 0$} would make the denominator zero; an alternative is to calculate the conditional variance given, say, {$X=k$}. Note that we can estimate the resulting RV, {$Y/k$}, with a new empirical estimate {$S_k(x) = \text{# of deaths between 2 and 3}/k$}, where is Binomial with parameters {$k$} and {$(S(2)-S(3))/S(2)$}, so that as the case above,
{$E[\hat{q_2}| \text{# of people alive at }2 = k] = E[Y/k] = \frac{1}{k}\frac{k(S(2)-S(3))}{S(2)}\approx = \frac{S_n(2) - S_n(3)}{S_n(2)}=\frac{\text{# of deaths between 2 and 3}}{\text{# of people alive at 2}}; $}
Changed lines 107-108 from:
to:
{$$E[S_n(x)] = \frac{1}{n} E[Y] = \frac{1}{n} (n S(x)) = S(x) \approx S_n(x); $$} {$$Var[S_n(x)] = \frac{1}{n^2} Var[Y] = \frac{1}{n^2} (nS(x)(1-S(x)) = \frac{S(x)(1-S(x))}{n} \approx \frac{S_n(x)(1-S_n(x))}{n}; $$} Changed line 118 from:
to:
{$$Var[\hat{q_2} | \text{# of people alive at }2 = k] = Var[Y/k] \approx \frac {S_k(x)(1-S_k(x)}{k};$$} Changed lines 120-121 from:
to:
{$$\displaystyle{Var[\hat{q_2} | \text{# of people alive at }2 = k] \approx \frac{\text{(# of deaths between 2 and 3)(}k -\text{# of deaths between 2 and 3)}}{k^3}}. $$} Changed lines 110-111 from:
to:
Changed line 117 from:
to:
Changed lines 115-116 from:
{$E[\hat{q_2}] = ?? \approx S_n(x); $}
It's not possible to calculate {$Var[\hat{q_2}]$}, because {$X = 0$} would make the denominator zero; an alternative is to calculate the conditional variance given, say, {$X=k$}. Note that we can estimate the resulting RV, {$Y/k$}, with its own empirical estimate {$S_k(x) =Y/k$}, so that as the case above,
to:
It's not possible to calculate {$E[\hat{q_2}]$} and {$Var[\hat{q_2}]$}, because {$X = 0$} would make the denominator zero; an alternative is to calculate the conditional variance given, say, {$X=k$}. Note that we can estimate the resulting RV, {$Y/k$}, with its own empirical estimate {$S_k(x) =Y/k$}, so that as the case above,
{$E[\hat{q_2}| \text{# of people alive at }2 = k] = E[Y/k]\approx \frac{S_k(x)}{k} = \frac{\text{# of deaths between 2 and 3}}{\text{# of people alive at 2}}}; $}
Changed lines 110-111 from:
to:
Changed lines 118-119 from:
This is {$\displaystyle{\frac{\text{(# of deaths between 2 and 3)(}k -\text{# of deaths between 2 and 3)}}{k^3}} $}.
to:
This is the same as
{$\displaystyle{Var[\hat{q_2} | \text{# of people alive at }2 = k] \approx \frac{\text{(# of deaths between 2 and 3)(}k -\text{# of deaths between 2 and 3)}}{k^3}} $}.
Changed line 105 from:
Given a complete data set of {$n$} points, the empirical estimate of {$S(x)$} is {$S_n(x)= Y/n$},
to:
a. Given a complete data set of {$n$} points, the empirical estimate of {$S(x)$} is {$S_n(x)= Y/n$},
Changed lines 109-117 from:
to:
This is the same thing as
{$\displaystyle{\frac{\text{(# of observations } > x)(\text{# of observations } <= x}{n^3}} $}.
b. Empirical estimate of (say) {$q_2 = \frac{S(3) - S(2)}{S(2)}$} is {$\hat{q_2}=\frac{S_n(3) - S_n(2)}{S_n(2)} = \frac{Y}{X}$}, where
{$X$} = # of people alive at duration 2,
{$Y$} = # of deaths between duration 2 and 3.
{$E[\hat{q_2}] = ?? \approx S_n(x); $}
It's not possible to calculate {$Var[\hat{q_2}]$}, because {$X = 0$} would make the denominator zero; an alternative is to calculate the conditional variance given, say, {$X=k$}. Note that we can estimate the resulting RV, {$Y/k$}, with its own empirical estimate {$S_k(x) =Y/k$}, so that as the case above,
{$Var[\hat{q_2} | \text{# of people alive at }2 = k] = Var[Y/k] \approx \frac {S_k(x)(1-S_k(x)}{k}$};
This is {$\displaystyle{\frac{\text{(# of deaths between 2 and 3)(}k -\text{# of deaths between 2 and 3)}}{k^3}} $}.
c. If we have ogives and histograms (i.e., the data is grouped), {$E[S_n(x)] = S(x) \approx S_n(x); $}
Changed lines 105-106 from:
to:
Given a complete data set of {$n$} points, the empirical estimate of {$S(x)$} is {$S_n(x)= Y/n$},
Changed line 110 from:
to:
Changed lines 113-115 from:
to:
Changed lines 105-106 from:
Given a complete data set with {$n$} data points with some unknown survival function {$S(x)$}, we can define the empirical estimator {$S_n(x)$} as above; one way to see this is
{$S_n(x) = Y/n$},
to:
Deleted lines 107-109:
Then
Empirical estimate of {$S(x)$} is {$S_n(x)$};
Changed lines 111-112 from:
to:
Changed line 105 from:
Given a complete data set with {$n$} data points with some unknown survival function {$S(x)$}, we can define the empirical estimator {$S_n(x)$} as above; then
to:
Given a complete data set with {$n$} data points with some unknown survival function {$S(x)$}, we can define the empirical estimator {$S_n(x)$} as above; one way to see this is
Changed lines 107-122 from:
where {$Y$} is the # of observations that are greater than {$x$}. Then {$Y$} is a binomial distribution with parameters {$n$} and {$S(x)$}.
to:
where {$Y$} is the # of observations that are greater than {$x$}. Then {$Y$} has a binomial distribution with parameters {$n$} and {$S(x)$}. (Check: view {$Y$} as counting the # of successes in {$n$} biased coin tosses, where a success means the observation is greater than {$x$}. The probability of that happening is precisely {$S(x)$}.)
Then
Empirical estimate of {$S(x)$} is {$S_n(x)$};
{$E[S_n(x)] = S(x) \approx S_n(x); $}
{$Var[S_n(x)] = S(x)(1-S(x))/n \approx S_n(x)(1-S_n(x))/n; $}
Empirical estimate of (say) {$q_2)$} is {$\frac{S_n(3) - S_n(2)}{S_n(2)}$};
{$E[S_n(x)] = S(x) \approx S_n(x); $}
{$Var[S_n(x)] = S(x)(1-S(x))/n \approx S_n(x)(1-S_n(x))/n; $}
{$E[S_n(x)] = S(x) \approx S_n(x); $}
{$E[S_n(x)] = S(x) \approx S_n(x); $}
{$E[S_n(x)] = S(x) \approx S_n(x); $}
Added lines 103-104:
General setup: Given a data set, name an estimator, use it to estimate some probabilistic value of the RV associated with the data set, then use some general theory to find or estimate the mean and variance of the estimator itself.
Deleted lines 101-103:
Changed lines 103-108 from:
to:
Given a complete data set with {$n$} data points with some unknown survival function {$S(x)$}, we can define the empirical estimator {$S_n(x)$} as above; then
{$S_n(x) = Y/n$},
where {$Y$} is the # of observations that are greater than {$x$}. Then {$Y$} is a binomial distribution with parameters {$n$} and {$S(x)$}.
1). When {$S(t)$} comes from an empirical estimator, we can view
Changed lines 45-48 from:
Example: Mortality study on a large population; {$c_j$}'s can be integer ages. to:
Changed line 10 from:
to:
Added lines 28-39:
Example: Mortality study on a large population; {$c_j$}'s can be integer ages.
Added lines 1-73:
Construction of Empirical Models (20-25%)
1. Estimate failure time and loss distributions using:
a) Kaplan-Meier estimator, including approximations for large data sets
2 concrete situations:
Insurance: observing loss amount per policy. Left truncation when loss < deductible; right censoring when loss > policy limit
Mortality table: observing age of death for each person. Left truncation happens at the age the person is first observed; right censoring happens at the age of last observation if the person is still alive then.
Symbols: For each observation {$i$},
{$d_i$} = truncation point for that observation (0 if no truncation);
{$x_i$} = the observed value, if it wasn't censored;
{$u_i$} = censored value for that observation.
Then group the uncensored values {$x_i$} and relabel the distinct ones as {$y_1 < y_2 < \cdots < y_k$}, with {$y_j$} occurring {$s_j$} times. Divide up the data according to the {$y_j$}'s by defining the risk set to be
{$r_j = \left(\text{#} x_i \text{ and } u_i \geq y_j \right) - \left(\text{#} d_i \geq y_j \right)$}
which is the same as
{$r_j = \left(\text{#} d_i < y_j \right) - \left(\text{#} x_i \text{ and } u_i < y_j \right) $}
So in our situations, the risk set counts
the difference between a) # of policies with observed loss amount >= {$y_j$} and b) # of policies with deductible >= {$y_j$};
the difference between a) # of people entering the study before the age of {$y_j$} and b) # of people who died or left observation before the age of {$y_j$} (i.e., the # of people being observed alive at a certain age {$y_j$})
Recursively, with {$r_0 = 0$} and {$y_0 = 0$},
{$r_j = r_{j-1} + \left(\text{#} d_i \in [y_{j-1}, y_j) \right) - \left(\text{#} x_i =y_{j-1}\right) - \left(\text{#} u_i \in [y_{j-1}, y_j) \right) $}
And so we define the Kaplan-Meier limit estimator as follows
{$S(t)=1$}, for {$t\in[0, y_1)$},
{$S(t)=\frac{r_1-s_1}{r_1}$} (probability of surviving past {$y_1$}), for {$t\in[y_1, y_2)$},
{$S(t)=S(y_1)\frac{r_2-s_2}{r_2}$} (probability of surviving past {$y_2$}), for {$t\in[y_2, y_3)$},...
{$S(t)=\prod_{i=1}^k\frac{r_i-s_i}{r_i}$} (probability of surviving past {$y_k$}), for {$t\in[y_k,\infty)$}.
Or we can define the last line to be 0, or something in between that decays exponentially to 0 as {$t\rightarrow\infty$}.
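The risk-set and product-limit formulas above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name, the `(d, value, censored)` tuple layout, and the six sample observations are all hypothetical, and the survival probability is taken to be 1 before the first observed value.

```python
from fractions import Fraction


def kaplan_meier(observations):
    """Kaplan-Meier table from (d, value, censored) tuples.

    d        : truncation point d_i (0 if no truncation)
    value    : x_i if censored is False, u_i if censored is True
    Returns a list of (y_j, s_j, r_j, S(y_j)) rows.
    """
    # Distinct uncensored values y_1 < y_2 < ... < y_k
    ys = sorted({v for _, v, c in observations if not c})
    table = []
    surv = Fraction(1)  # survival probability before y_1
    for y in ys:
        # s_j = number of uncensored observations equal to y_j
        s = sum(1 for _, v, c in observations if not c and v == y)
        # r_j = (# x_i and u_i >= y_j) - (# d_i >= y_j)
        r = (sum(1 for _, v, _ in observations if v >= y)
             - sum(1 for d, _, _ in observations if d >= y))
        surv *= Fraction(r - s, r)
        table.append((y, s, r, surv))
    return table


# Hypothetical sample: two late entrants (d = 1, d = 2) and two censored values.
data = [
    (0, 1, False), (0, 2, False), (0, 2, True),
    (1, 3, False), (0, 3, False), (2, 4, True),
]
for row in kaplan_meier(data):
    print(row)
```

Exact fractions are used so that the products {$\frac{r_j - s_j}{r_j}$} can be compared directly with a hand calculation.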
b) Nelson-Åalen estimator
This estimates the cumulative hazard rate function as follows:
{$\hat{H}(t)=0$}, for {$t\in[0, y_1)$},
{$\hat{H}(t)=\frac{s_1}{r_1}$}, for {$t\in[y_1, y_2)$},
{$\hat{H}(t)=\hat{H}(y_1) + \frac{s_2}{r_2}$}, for {$t\in[y_2, y_3)$},...
{$\hat{H}(t)=\sum_{i=1}^k\frac{s_i}{r_i}$}, for {$t\in[y_k,\infty)$}.
Then {$\hat{S}(t) = e^{-\hat{H}(t)}$}.
For the end piece {$t\in[y_k,\infty)$}, we can define {$\hat{S}(t)$} to be 0, or something in between that decays exponentially to 0 as {$t\rightarrow\infty$}.
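A minimal Python sketch of the cumulative sum above, assuming the triples {$(y_j, s_j, r_j)$} have already been computed. The function name and the three sample rows are hypothetical.

```python
import math


def nelson_aalen(risk_table):
    """Nelson-Aalen estimates from (y_j, s_j, r_j) rows sorted by y_j.

    Returns a list of (y_j, H_hat(y_j), S_hat(y_j)) rows,
    where S_hat = exp(-H_hat).
    """
    rows = []
    cumulative_hazard = 0.0  # H_hat(t) = 0 before y_1
    for y, s, r in risk_table:
        cumulative_hazard += s / r
        rows.append((y, cumulative_hazard, math.exp(-cumulative_hazard)))
    return rows


# Hypothetical risk-set table: (y_j, s_j, r_j)
example = nelson_aalen([(1, 1, 4), (2, 1, 4), (3, 2, 3)])
for y, h, s in example:
    print(y, round(h, 4), round(s, 4))
```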
c) Kernel density estimators
First we need to discuss empirical distributions:
Suppose the data is not grouped and there is no censoring or truncation. Out of a total of {$n$} observations, let {$y_i$} be distinct data points each occurring {$s_i$} times. Then {$S(t)=1-F(t),$} where
{$F(t)=0$}, for {$t\in(-\infty, y_1)$},
{$F(t)=\frac{s_1}{n}$}, for {$t\in[y_1, y_2)$},
{$F(t)=F(y_1) + \frac{s_2}{n}$}, for {$t\in[y_2, y_3)$},...
{$F(t)=\sum_{i=1}^k\frac{s_i}{n} = 1$}, for {$t\in[y_k,\infty)$}.
Note each {$y_i$} is assigned probability {${p(y_i)}= s_i/n$}.
A kernel density estimator is defined as
{$\hat{F}(t) = \sum_{i=1}^k p(y_i) K_{y_i}(t)$},
where {$K_{y_i}(t)$} are kernel functions that can be defined by a chosen pdf with a certain bandwidth {$b$}:
A uniform kernel is a rectangle centered at each {$y_i$} and has pdf with width {$2b$} and height {$1/(2b)$};
A triangular kernel is a triangle centered at each {$y_i$} and has pdf with width {$2b$} and height {$1/b$}.
Book also mentions a gamma kernel where the parameters are given by {$\alpha$} and {$y_i/\alpha$}. For the 2 kernels with bandwidths, the larger the bandwidth is, the smoother the resulting kernel density estimator. For gamma kernels, the smaller {$\alpha$} is, the smoother the resulting kernel density estimator.
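To make the uniform and triangular kernels concrete, here is a small Python sketch of the density version of the estimator, {$\hat{f}(t) = \sum_i p(y_i) k_{y_i}(t)$}, where {$k_{y_i}$} is the kernel pdf. The function names and the sample data are hypothetical, and the sketch works with densities rather than with the cumulative {$\hat{F}$}.

```python
def uniform_pdf(t, y, b):
    """Rectangle centered at y: width 2b, height 1/(2b)."""
    return 1 / (2 * b) if y - b <= t <= y + b else 0.0


def triangular_pdf(t, y, b):
    """Triangle centered at y: width 2b, peak height 1/b."""
    return max(0.0, (b - abs(t - y)) / b ** 2)


def kernel_density(t, points, b, kernel_pdf):
    """Kernel-smoothed density at t.

    points: dict mapping each distinct value y_i to its count s_i,
    so that p(y_i) = s_i / n.
    """
    n = sum(points.values())
    return sum((s / n) * kernel_pdf(t, y, b) for y, s in points.items())


# Hypothetical data: y = 1 once, y = 3 twice, y = 4 once; bandwidth b = 1.
sample = {1: 1, 3: 2, 4: 1}
print(kernel_density(3, sample, 1, uniform_pdf))
print(kernel_density(3, sample, 1, triangular_pdf))
```

Re-running with a larger `b` spreads each point's mass over a wider interval, which is the smoothing effect described above.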
For grouped data, i.e., if the data is recorded by ranges of observed values instead of single, precise values, we need to use ogives and histograms. Instead of {$y_i$}, let {$c_0, \ldots, c_k$} be the boundary values of intervals of observations, and {$s_1, \ldots, s_k$} be the number of observations in {$(c_0, c_1), \ldots, (c_{k-1}, c_k)$}. Then {$S(t)=1-F(t),$} where {$F(t)$} is the ogive,
{$F(c_0)=0$},
{$F(c_1)=\frac{s_1}{n}$},
{$F(c_2)=\frac{s_1+s_2}{n}$},...
{$F(c_k)=\sum_{i=1}^k\frac{s_i}{n} = 1$},
and the values of {$F$} between the {$c_i$}'s are connected linearly. Take the derivative to get the histogram {$f(t)$}:
{$f(t)=0$}, for {$t\in(-\infty, c_0)$},
{$f(t)=\frac{s_1}{n(c_1-c_0)}$}, for {$t\in[c_0, c_1)$},
{$f(t)=\frac{s_2}{n(c_2-c_1)}$}, for {$t\in[c_1, c_2)$},...
{$f(t)=\frac{s_k}{n(c_k-c_{k-1})}$}, for {$t\in[c_{k-1},c_k)$},
{$f(t)=0$} (or undefined), for {$t\in[c_k,\infty)$}.
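The ogive and histogram can be sketched in Python as below. The function names, boundaries, and counts are hypothetical, and the sketch takes {$f(t)=0$} (rather than undefined) outside {$[c_0, c_k)$}.

```python
def ogive(t, boundaries, counts):
    """Ogive F(t): linear interpolation between the points (c_j, F(c_j)).

    boundaries: [c_0, ..., c_k]; counts: [s_1, ..., s_k].
    """
    n = sum(counts)
    if t <= boundaries[0]:
        return 0.0
    if t >= boundaries[-1]:
        return 1.0
    cumulative = 0  # observations at or below the current lower boundary
    for j in range(1, len(boundaries)):
        lower, upper = boundaries[j - 1], boundaries[j]
        if t < upper:
            fraction = (t - lower) / (upper - lower)
            return (cumulative + counts[j - 1] * fraction) / n
        cumulative += counts[j - 1]


def histogram(t, boundaries, counts):
    """Histogram f(t) = s_j / (n (c_j - c_{j-1})) on [c_{j-1}, c_j)."""
    n = sum(counts)
    for j in range(1, len(boundaries)):
        lower, upper = boundaries[j - 1], boundaries[j]
        if lower <= t < upper:
            return counts[j - 1] / (n * (upper - lower))
    return 0.0  # outside [c_0, c_k)


# Hypothetical grouped data: 5, 10 and 5 observations in (0,10), (10,30), (30,50).
c = [0, 10, 30, 50]
s = [5, 10, 5]
print(ogive(20, c, s), histogram(20, c, s))
```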
2. Estimate the variance of estimators and confidence intervals for failure time and loss distributions.
3. Apply the following concepts in estimating failure time and loss distribution: