"I have to pay a certain sum, which I have collected in my pocket. I take the bills and coins out of my pocket and give them to the creditor in the order I find them until I have reached the total sum. This is the Riemann integral. But I can proceed differently. After I have taken all the money out of my pocket I order the bills and coins according to identical values and then I pay the several heaps one after the other to the creditor. This is my integral." - Henri L. Lebesgue
0 Introduction
I can still remember and recite the hypotheses that Richard Courant$^{[1]}$ would sometimes invoke in his expository material as is part of his expository profile - suppose we were given a sectionally smooth function, in the sense that the first-order derivative exists and is continuous everywhere with the exception of a finite set of points. Then, certain convergence theorems will hold as a result.$^{[2]}$ But more than that is that uniform convergence results would also hold on sufficiently well-behaved closed sets as disjoint from those points of insufficient well-behavedness.$^{[3]}$ Eventually, one discovers that many such classical results can actually be reformulated in terms of the theory of measurable functions as initiated by the likes of Borel and Lebesgue during the late 19th century and early 20th century,$^{[4]}$ that, there exists an entire systematic general theory that allows the extracting and controlling of sets of points that are the sites of insufficient well-behavedness of certain classes of functions.
In this blog post, we intend to provide a number of such results, along with the quick motivating and development of certain mathematical objects as central to such a general theory, particularly in the measurable functions and the Lebesgue integral. Non-trivial applications will also be demonstrated at the end of this blog post.
1 The Littlewood Heuristics
For the so-called "Littlewood's three principles" as so named by Stein & Shakarchi in their Real Analysis: Measure Theory, Integration, & Hilbert Spaces, essentially achieved is "perfection", in a somewhat crude sense to be sure, for a certain class of functions in the measurable functions (for the moment, our notion of "measurable functions" are only as general as those measurable functions that map from finite products of finite dimensional real Euclidean spaces to the real-line).$^{[5]}$ That is, as long as the functions that we deal with are measurable, and as long as the sets that we deal with are measurable, then we can realize the following,
I. Measurable sets can be approximated by finite unions of intervals, with the exception of a set of measure 0.
II. If a sequence of measurable functions converges pointwise to a limit function, then such a sequence of measurable functions converges uniformly to the limit function, with the exception of a set of measure 0.
III. Measurable functions are continuous, with the exception of a set of measure 0.
One must clarify that the above heuristics are only intended in the sense of real analysis given our current self-imposed limitations, that is, we adopt the conventional topology of the reals, and, that the sigma-algebra, the structure that clarifies the measure-theoretic structure, is that of a Borel sigma-algebra (i.e., it can be provided by the intersection of all sigma-algebras that contain all open sets of the reals). Additionally, we also state that our notion of "measurability" of sets is provided in the sense of the Lebesgue-Borel measure of finite dimensional real Euclidean spaces (and possibly their finite products). And so, for measurable sets $E$, they can be characterized by the fact that, for arbitrarily small real $\epsilon > 0$, one can always find some open set $G$ containing $E$ and some closed set $F$ contained in $E$ such that $m \left( G \setminus F \right) < \epsilon$ for $m$ the Lebesgue-Borel measure.$^{[6]}$ Now, we provide the theorems that justify the Littlewood heuristics,
Theorem 1.1 For $k, n \in \mathbb{N}$, given some $E \subset \mathbb{R}^k$ a bounded measurable set of finite nonnegative measure and $Q_n \subset \mathbb{R}^k$ closed cubes, for arbitrarily small real $\epsilon > 0$, one can always find some $F = \bigcup_{n = 1}^{N} Q_n$ for sufficiently large $N \in \mathbb{N}$ such that $m \left( E \triangle F \right) < \epsilon$.
PROOF. Since the proof is reasonably straightforward, we will set out to principally identify subtleties. For the Lebesgue-Borel measure given our current hypotheses, we can always find a countable union of closed cubes $\bigcup_{n =1}^{\infty} Q_n$ that cover $E$ and that differ arbitrarily little from $E$ in measure,$^{[7]}$ and so we can immediately give, for $m$ the Lebesgue-Borel measure and $F = \bigcup_{n = 1}^{N} Q_n$,
$$\begin{align} m \left( E \triangle F \right) &= m \left( E \setminus F \right) + m \left( F \setminus E \right) \\ & \leq m \left( \bigcup_{n = N + 1}^{\infty}Q_n \right) + m \left( \bigcup_{n = 1}^{N} Q_n \setminus E \right) \\ & \leq \bigcup_{n = N + 1}^{\infty} m \left( Q_n \right) + \bigcup_{n = 1}^{N} m \left( Q_n \right) - m\left( E \right) \\ &= \bigcup_{n = 1}^{\infty} m \left( Q_n \right) - m(E) \end{align}$$
We note that the countable subadditive property of the measure has been utilized. Then, since we have that the last line above can be made arbitrarily small, the argument is complete. $\square$
So, part I. of the Littlewood heuristics is verified as a special case of Theorem 1.1 which provides further clarity to the statement, and where the real analytic character may appear somewhat obvious since, after all, we have specifically exploited the properties of the Borel sigma-algebra, that is, properties of those measurable open sets of finite dimensional real Euclidean spaces.$^{[8]}$ Indeed, for part I. of the Littlewood heuristics, it does not simply suffice that one can find some arbitrary $\bigcup_{n = 1}^{\infty}C_n$ such that $\bigcup_{n = 1}^{\infty} m \left( C_n \right) - m(E)$ can be made arbitrarily small since $\bigcup_{n = 1}^{\infty} C_n$ may only overlap with $E$, from the perspective of the measure $m$, in somewhat "pathological" fashions,$^{[9]}$ that, in part I. of the Littlewood heuristics, we specifically insisted upon the symmetric difference $\triangle$ (and in which case, $\bigcup_{n = 1}^{\infty} Q_n$ is not a finite union). We now proceed to,
Theorem 1.2 (Egorov's Theorem) For $l \in \mathbb{N}$, given some $E \subset \mathbb{R}^l$ a bounded measurable set of finite nonnegative measure, if we have a sequence of measurable functions $\{ f_i \}_{i = 1}^{\infty}$ that converges almost everywhere to some limit function $f$ on $E$, then, for any arbitrarily small real $\epsilon > 0$, one can always find some set $E_{\epsilon} \subset E$ with $m \left( E \setminus E_{\epsilon} \right) < \epsilon$ such that the sequence of measurable functions $\{ f_i \}_{i = 1}^{\infty}$ converges uniformly to $f$ on $E_{\epsilon}$.
PROOF. First, since the sequence of functions converges almost everywhere and we intend to demonstrate uniform convergence with the exception of a set of measure 0, we can merely focus our attention on those $x \in E$ such that $f_i(x)$ converges to $f$ for $i \rightarrow \infty$. So, without losing generality, we instead suppose that $\{ f_i \}_{i = 1}^{\infty}$ converges to $f$ everywhere, and, for $k, n \in \mathbb{N}$, we can then consider sets of objects provided by,
$$E_{k}^{n} := \left\{ x \in E \, \middle| \, \forall i \in \mathbb{N}, \left( \left| f_i(x) - f(x) \right| < \frac{1}{n} \right) \land (i > k) \right\}$$
That is, the sets $E_{k}^{n}$ consists of exactly those objects $x$ such that $i > k$, simultaneously as $\left| f_i(x) - f(x) \right| < \frac{1}{n}$. Now, since sets of the form $E_{k}^{n}$ increases to $E$ in the sense that, for all $k \in \mathbb{N}$, $E_{k}^{n} \subset E_{k + 1}^{n}$ with $\lim_{K \rightarrow \infty} \bigcup_{k = 1}^{K} E_{k}^{n} = E$ as, by hypothesis, $f_i \rightarrow f$ everywhere, lower semi-continuity$^{[10]}$ then allows us to conclude that $\lim_{k \rightarrow \infty} m \left( E_{k}^{n} \right) = m(E)$ in the sense of, for arbitrarily small real $\epsilon > 0$ and fixed $n \in \mathbb{N}$, one can always find some sufficiently large $K$ such that for all $k > K$, we have that $m \left( E \setminus E_{k}^{n} \right) < \frac{\epsilon}{2^n}$. Then, in recalling that $f_i \rightarrow f$ everywhere, we also ultimately require $\left( \frac{1}{n} \rightarrow 0 \right) \Rightarrow \left( n \rightarrow \infty \right)$, and so we then provide the following set,
$$E_{\epsilon} = \bigcap_{n = 1}^{\infty} E_{k}^{n}$$
Indeed, for such a set, we have that,
$$\begin{align} m \left( E \setminus E_{\epsilon} \right) &= m \left( E \setminus \bigcap_{n = 1}^{\infty} E_{k}^{n} \right) \\ &= m \left( \bigcup_{n = 1}^{\infty} \left( E \setminus E_{k}^{n} \right) \right) \\ & \leq \sum_{n = 1}^{\infty} m(E \setminus E_{k}^{n}) \\ &< \epsilon \sum_{n = 1}^{\infty} \frac{1}{2^n} \end{align}$$
For further clarity, we comment that in the 2nd line above, we have made use of deMorgan's law. In the 3rd line above, the countable subadditivity property of the measure has been used. And, in the 4th line above, a convergent geometric sum $\sum_{n = 1}^{\infty} \frac{1}{2^n}$ is provided which converges to $\frac{1}{2} \left( \frac{1}{1 - \frac{1}{2}} \right) = 1$. That is, we can conclude that $m \left( E \setminus E_{\epsilon} \right) < \epsilon$. $\square$
There are a few subtleties in the above proof. Notice, we had essentially defined sets $E_{\epsilon}$ consisting of exactly those objects $x \in E$ such that uniform convergence is demonstrated, and, all that remained was to demonstrate that such a set coincides with $E$ almost everywhere in the sense that, for any arbitrarily small real $\epsilon > 0$, one can always find some $E_{\epsilon}$ such that $m \left( E \setminus E_{\epsilon} \right) < \epsilon$ for $\{ f_i \}_{i = 1}^{\infty}$ uniformly convergent everywhere on $E_{\epsilon}$. One can recall a certain definition of uniform convergence, along with a recitation of sets $E_{k}^{n}$ for the sake of comparison, where, for all $x$ in the domain of $f$, we have,$^{[11]}$
$$\left( \{ f_n \}_{n = 1}^{\infty} \rightrightarrows f \right) := \forall \epsilon \in \{ \mathbb{R} \setminus (-\infty, 0] \}, \exists N \in \mathbb{N}, \forall x \in \mathbb{R}, \forall n \in \mathbb{N}, (n > N) \Rightarrow \left( \left| f_n(x) - f(x) \right| < \epsilon \right)$$
$$E_{k}^{n} := \left\{ x \in E \, \middle| \, \forall i \in \mathbb{N}, \left( \left| f_i(x) - f(x) \right| < \frac{1}{n} \right) \land (i > k) \right\}$$
In this way, part II. of the Littlewood heuristics is also verified. Although, there is a slight difference, since we had initially stated part II. of the Littlewood heuristics as a special case of Egorov's theorem in that we did not relax the condition of pointwise convergence to that of almost everywhere pointwise convergence, whereas Egorov's theorem allows such a relaxing of the hypothesis. Nonetheless, we then transition to the next part in,
Theorem 1.3 (Luzin's Theorem) For $k \in \mathbb{N}$, given some $E \subset \mathbb{R}^k$ a bounded measurable set of finite nonnegative measure, if we have a measurable function $f$ defined on $E$, then, for any arbitrarily small real $\epsilon > 0$, one can always find some $F_{\epsilon} \subset E$ such that $f$ is continuous on $F_{\epsilon}$ with $m \left( E \setminus F_{\epsilon} \right) < \epsilon$.$^{[12]}$
PROOF. Since measurable functions can be approximated by simple functions to an arbitrary degree of accuracy almost everywhere,$^{[13]}$ for $f$ measurable, one can provide a sequence of simple functions $\{ f_i \}_{i = 1}^{\infty}$ defined on $E$ for $\lim_{i \rightarrow \infty} f_i = f$ almost everywhere and such that the functions $f_i$'s are continuous with the exception of corresponding sets $E_i$ of measure 0. Then, by Egorov's theorem, one can find some set $E_{\epsilon}$ such that $\{ f_i \}_{i = 1}^{\infty}$ converges uniformly to $f$ on $E_{\epsilon}$. Finally, $\{ f_i \}_{i = 1}^{\infty}$ also converges uniformly on $F_{\epsilon} = E_{\epsilon} \setminus \bigcup_{i = 1}^{\infty} E_i$,$^{[14]}$ and, moreover, the functions $f_i$ for all $i$ are continuous on $F_{\epsilon}$,$^{[15]}$ so, uniform convergence of continuous functions on $F_{\epsilon}$ allows us to ascertain the continuity of $f$ on $F_{\epsilon}$ due to the uniform limit theorem. $\square$
And so, part III. of the Littlewood heuristics is demonstrated. Now, we would like to provide an informal discussion regarding the Lebesgue integral. As Lebesgue had described it in quite an intuitive fashion, the contrast between the Lebesgue integral and the Riemann integral comes in the sense that the Riemann integral imposes unnecessarily strict restrictions pertaining to the provided partitions of the domains of definition, whereas the Lebesgue integral does not impose such restrictions. Then, the greater generality in the definition of the Lebesgue integral essentially allows one to be able to provide "partitions" via subsets consisting of "identical" objects in the sense that the objects satisfy the same properties that we intend. In fact, in the proof of Ergorov's theorem as provided above, we had made use of sets $E_{k}^{n}$ that consist of those $x \in E$ that satisfy the same properties. In the sense of Lebesgue's intuitive description, the "coins" $x$ are arranged into their corresponding sets $E_{k}^{n}$ according to their identical properties, and here lies one of the principle intuition of measurable functions and their analogous integrals in the Lebesgue integrals.
2 Applications in Approximation Theory: Simple Functions
Suppose we were provided a problem of trying to determine a function in the description of particular phenomena in the sciences or engineering - unless further information was provided, say, in the form of a generalized theoretical framework as in Newtonian mechanics, trying to derive such a function may seem ill-posed and potentially impossible. Yet, approximations are possible if only there were appropriate sampling techniques, that is, we would sample functional values on objects of the domain of the hypothetical function so as to recover more and more "nodes", and, many approximation methods become open and appropriate. As many will know, there is an entire branch of approximation theory that comes in interpolation polynomials, however, for the purpose of this blog post, we would like to describe the simple functions instead. So, consider the following,
Theorem 2.1 Given a measurable function $f$ as defined on $\mathbb{R}$, one can always find a sequence of simple functions $\{ \phi_n \}_{n = 1}^{\infty}$ with $\left| \phi_{n_1} \right| \leq \left| \phi_{n_2} \right|$ whenever $n_1 < n_2$ such that $\lim_{n \rightarrow \infty} \phi_{n} = f$ pointwise almost everywhere.$^{[16]}$
PROOF. Without losing generality, we can consider the case where $f$ is non-negative since, if $f$ is not non-negative, then it can be decomposed into non-negative functions $f_1$ and $f_2$ in the sense of $f = f_1 - f_2$ for $f_1 = \max(f, 0)$ and $f_2 = \max(-f, 0)$ with both $f_1$ and $f_2$ measurable.$^{[17]}$ Thus, for $f$ non-negative, first consider the set of truncated functions $\{ F_n \}$ for $n \in \mathbb{N}$ as defined on closed intervals such that $F_n$ is defined to be $f(x)$ for $\left( x \in [-n, n] \right) \land \left( f(x) \leq n \right)$, defined to be $n$ for $\left( x \in [-n, n] \right) \land \left( f(x) > n \right)$, and $0$ otherwise. Then, consider the following sets for $k \in \mathbb{N}$ and $0 \leq k < n \cdot 2^n$,
$$E_{n, k} = \left\{ x \in [-n, n] \middle| \frac{k}{2^n} \leq F_n < \frac{k + 1}{2^n} \right\}$$
So we provide the functions $\phi_n$ in the form of,
$$\phi_n = \sum_{k} \frac{k}{2^n} \mathbb{1}_{E_{n, k}}$$
In the above, $\mathbb{1}_{E_{n, k}}$ denotes the indicator functions defined on sets $E_{n, k}$. As further remarks to introduce clarity in the above argument, we can note that $\lim_{n \rightarrow \infty} F_n = f$, and, that for arbitrarily small real $\epsilon > 0$, one can always find some sufficiently large $n$ such that $\left| F_n - \phi_n \right| < \epsilon$ almost everywhere, and so a simple application of the triangle inequality from elementary analysis completes the argument. $\square$
Notice that the functions $\phi_n$ provided in the above proof are quite reminiscent of the step functions, and since, after all, such functions are generalizations of the step functions. Additionally, we feel it appropriate to comment on a particular subtlety in that the function $f$ was required to be measurable - the measurability property has been utilized in the recovery of measurable pre-images, thus, that it makes sense to define indicator functions on the sets $E_{n, k}$, after all, we were trying to recover simple functions as approximations, and we can be reminded that simple functions are provided by finite linear combinations of indicator functions as defined on measurable sets. In recalling Lebesgue's intuitive description, what the proof had essentially achieved is to collect "coins" $x$ of the same type into their corresponding sets $E_{n, k}$. Incidentally, if we had defined $\phi_n = \sum_{k} \frac{k + 1}{2^n} \mathbb{1}_{E_{n, k}}$, then our sequence of simple functions could be decreasing rather than increasing.
Before moving to the next section, I would like to point the attention of the interested reader to a particularly interesting demonstration of such a kind of mathematical phenomenon in the applied setting, and in the industrial setting no less. Such comes in a kind of machinery sometimes described as "3D CNC metal sheet bending machines", where, a collection of rods would exert varying pressures so as to bend steel sheets into particular surfaces - such is exactly a demonstration, although a special case, of the possibility that measurable functions can always be approximated by simple functions, where we repeat, for simple functions being functions that are provided by finite linear combinations of indicator functions as defined on appropriate collections of subsets of the domain.
In fact, for such kinds of 3D CNC metal sheet bending machines, we have that we desire certain surfaces for sheets of metals to conform to, and so, instead, a discrete approximation to it, in the form of a simple function, is provided and mimicked via a collection of rods as extended at different relative heights, somewhat akin to what one can observe regarding the natural phenomenon of the "Giant's Causeway", just with greater fineness. We can note finally however that the demonstration by such kinds of 3D CNC metal sheet bending machines provides special cases of simple functions, that of Riemann sums. But then, if Riemann sums suffice, why bother to continue to pursue more general notions of mathematical objects? Are we merely pursuing unnecessary generalizations?$^{[18]}$ Does it make sense to suppose that Riemann sums, and even Riemann integrals, suffice completely in "real life"? Well, we now provide examples in transitioning to section 3 of this blog post where Riemann sums do not suffice.
3 Applications in Stochastic Analysis: Markov's Inequality, Chebyshev's Inequality, & Monte Carlo Integration
But, finally, we want to illustrate the utilitarian aspects of the Lebesgue integral in a fashion that the Riemann integral cannot - in the example of the ubiquitous Monte Carlo integration. Prior, we provide a number of prerequisites in,
Theorem 3.1 (Markov's Inequality) Given a real random variable $X$ and any arbitrarily small real $\epsilon > 0$, if we have that $f$ is a real non-negative monotonically increasing function that maps from $[0, \infty]$ to $[0, \infty]$, then the following Markov's inequality always hold,
$$P\left( \left| X \right| \geq \epsilon \right) \leq \frac{E \left[ f \left( \left| X \right| \right) \right]}{f(\epsilon)}$$
PROOF. First, for arbitrarily small real $\epsilon > 0$, consider the following inequality,
$$E[f(|X|)] \geq E \left[ f( \left| X \right| ) \mathbb{1}_{\{ f( \left| X \right| ) \geq f(\epsilon) \}} \right]$$
Where in the above, $\mathbb{1}_{\{ f( \left| X \right| ) \geq f(\epsilon) \}}$ denotes the indicator function as defined on the set of objects of the domain of $f$ such that $f( \left| X \right| ) \geq f(\epsilon)$. Since such a set of objects is a subset of the domain of the function, and since $f$ is a real non-negative function, the above inequality is clearly satisfied. Then, due to the fact that $f$ is a non-negative monotonically increasing function, we have that $f(\epsilon)$ is a lower bound for $f$ for the set of objects of the domain of $f$ such that $f( \left| X \right| ) \geq f(\epsilon)$, and so,
$$E \left[ f( \left| X \right| ) \mathbb{1}_{\{ f( \left| X \right| ) \geq f(\epsilon) \}} \right] \geq E \left[ f(\epsilon) \mathbb{1}_{\{ f( \left| X \right| ) \geq f(\epsilon) \}} \right]$$
Then, finally, we have,
$$\begin{align} E \left[ f(\epsilon) \mathbb{1}_{\{ f( \left| X \right| ) \geq f(\epsilon) \}} \right] &= \int f(\epsilon) \mathbb{1}_{\{ f( \left| X \right| ) \geq f(\epsilon) \}} dP \\ &= f(\epsilon) P \left( \left| X \right| \geq \epsilon \right) \end{align}$$
In the above, the last equality is justified as $f$ is monotonically increasing, that is, $f( \left| X \right| ) \geq f(\epsilon)$ implies $\left| X \right| \geq \epsilon$.Then, we obtain $E \left[ f \left( \left| X \right| \right) \right] \geq f(\epsilon) P \left( \left| X \right| \geq \epsilon \right)$, and a mere rearranging thus completes the proof. $\square$
We note again that in the above proof, we have identified measurable subsets of the sigma-algebra consisting of exactly those objects that all satisfy the same properties, where, in the above case, the property was provided by $f( \left| X \right|) \geq f(\epsilon)$.
Theorem 3.2 (Chebyshev's Inequality) For $X$, $\epsilon$, and $f$ as in Theorem 3.1, if $X \in \mathcal{L}^2$ with respect to an appropriate probability measure as defined on the real-line, then the following Chebyshev's inequality always hold,
$$P \left( \left| X - E[X] \right| \geq \epsilon \right) \leq \frac{\text{Var}[X]}{\epsilon^2}$$
PROOF. Chebyshev's inequality can be given as almost an immediate consequence of Markov's inequality, where, we opt for $f(x) = x^2$ in Markov's inequality, and so we obtain,
$$P\left( \left| X \right| \geq \epsilon \right) \leq \frac{E \left[ \left| X \right|^2 \right]}{\epsilon^2}$$
Finally, it suffices to replace $X$ with $X' - E[X']$, and to recall the identity $\text{Var[X]} = E \left[ \left( X - E[X] \right)^2 \right]$. $\square$
Theorem 3.3 (Strong Law of Large Numbers) Given a set of real random variables $\{ X_n \}$ with $X_n \in \mathcal{L}^2$ for all $n \in \mathbb{N}$, are pairwise independent, and are identically distributed, we have that the following is satisfied,
$$P \left[ \limsup_{N \rightarrow \infty} \left| \frac{\sum_{n = 1}^N \left( X_n - E[X_n] \right)}{N} \right| = 0 \right] = 1$$
PROOF. Without losing generality, we suppose that the random variables $\{ X_n \}$ for all $n \in \mathbb{N}$ are non-negative. For arbitrary real $\epsilon > 0$, we set $k_n = \lfloor (1 + \epsilon)^n \rfloor$ where $\lfloor \cdot \rfloor$ denotes the lower integer floor, and consider,
$$\sum_{n = 1}^{\infty} P \left[ \left| \frac{X_1 + \dots + X_{k_n}}{k_n} - E \left[ X_1 \right] \right| \geq (1 + \epsilon)^{-\frac{n}{4}} \right]$$
An application of Chebyshev's inequality in Theorem 3.2 then gives an upper bound in,
$$\begin{align} \sum_{n = 1}^{\infty} \left( 1 + \epsilon \right)^{\frac{n}{2}} \text{Var} \left[ \frac{X_1 + \dots + X_{k_n}}{k_n} \right] &= \sum_{n = 1}^{\infty} \left( 1 + \epsilon \right)^{\frac{n}{2}} \frac{1}{k_n^2} \text{Var} \left[ X_1 + \dots + X_{k_n} \right] \\ &= \sum_{n = 1}^{\infty} \left( 1 + \epsilon \right)^{\frac{n}{2}} \frac{1}{k_n} \text{Var} \left[ X_1 \right] \\ &< 2 \sum_{n = 1}^{\infty} \left( 1 + \epsilon \right)^{-\frac{n}{2}} \text{Var} \left[ X_1 \right] \end{align}$$
Where the second equality above is justified as the random variables $\{ X_i \}$ are pairwise independent and therefore uncorrelated. Furthermore, since $\sum_{n = 1}^{\infty} \left( 1 + \epsilon \right)^{-\frac{n}{2}}$ provides a convergent geometric series for $(1 + \epsilon)^{-2} < 1$ and since $\text{Var}[X_1]$ is bounded as $X_1$ belong to $\mathcal{L}^2$, we have boundedness. That is, we have that,
$$\sum_{n = 1}^{\infty} \left( 1 + \epsilon \right)^{\frac{n}{2}} \text{Var} \left[ \frac{X_1 + \dots + X_{k_n}}{k_n} \right] < \infty$$
Implying, more importantly,
$$\sum_{n = 1}^{\infty} P \left[ \left| \frac{X_1 + \dots + X_{k_n}}{k_n} - E \left[ X_1 \right] \right| \geq (1 + \epsilon)^{-\frac{n}{4}} \right] < \infty$$
By an application of the Borel-Cantelli lemma,$^{[19]}$ we then conclude that, for any arbitrarily small $\epsilon'$, there exists some $N \in \mathbb{N}$ such that for all $n > N$, we have that subsets of the sigma-algebra satisfying $\left( \frac{X_1 + \dots + X_{k_n}}{k_n} - E \left[ X_1 \right] \right) \geq (1 + \epsilon)^{-\frac{n}{4}}$ are sets probability measure less than $\epsilon'$, and so, we have, almost surely,
$$\left| \frac{X_1 + \dots + X_{k_n}}{k_n} - E \left[ X_1 \right] \right| < (1 + \epsilon)^{-\frac{n}{4}}$$
In other words, almost surely, for $n \rightarrow \infty$, we obtain,
$$\limsup_{n \rightarrow \infty} \left| \frac{X_1 + \dots + X_{k_n}}{k_n} - E \left[ X_1 \right] \right| = 0$$
Now, given some positive integer $l$, we find some $n$ such that $k_n \leq l \leq k_{n + 1}$ and where we also have that $k_{n + 1} = \lfloor (1 + \epsilon)(1 + \epsilon)^{n} \rfloor \leq (1 + \epsilon) \lfloor (1 + \epsilon)^{n} \rfloor + \epsilon \lfloor (1 + \epsilon)^n \rfloor = (1 + 2 \epsilon)k_n$ for sufficiently large $n$, which, when combined with $\limsup_{n \rightarrow \infty} \left| \frac{X_1 + \dots + X_{k_n}}{k_n} - E \left[ X_1 \right] \right| = 0$, finally gives,
$$\begin{align} \limsup_{l \rightarrow \infty} \left| \frac{X_1 + \dots + X_l}{l} - E[X_1] \right| & \leq \limsup_{n \rightarrow \infty} \left| \frac{X_1 + \dots + X_{k_{n + 1}}}{k_{n}} - E[X_1] \right| \\ & \leq \limsup_{n \rightarrow \infty} \left| \left( 1 + 2 \epsilon \right) \left( \frac{X_1 + \dots + X_{k_{n + 1}}}{k_{n + 1}} \right) - E[X_1] \right| \\ & \leq \limsup_{n \rightarrow \infty} \left| \frac{X_1 + \dots + X_{k_{n + 1}}}{k_{n + 1}} - E[X_1] \right| + 2 \epsilon \limsup_{n \rightarrow \infty} \left| \frac{X_1 + \dots + X_{k_{n + 1}}}{k_{n + 1}} \right| \\ & \leq 2 \epsilon E[X_1] \end{align}$$
And the above holds almost surely. $\square$
As a result, we realize the following as almost an immediate application of the strong law of large numbers stated above, in the form of,
Theorem 3.4 (Monte Carlo Integration) For i.i.d. real random variables $\{ X_n \}$ defined on the unit interval $[0, 1]$ with uniform distributions on $[0, 1]$, we have that, almost surely,
$$\lim_{N \rightarrow \infty} \frac{1}{N} \sum_{n = 1}^{N} f(X_n) = \int_{0}^{1} f(x) dx$$
PROOF. This is an immediate consequence of the strong law of large numbers provided in Theorem 3.3, since, for example, $E\left[ \frac{1}{N} \sum_{n = 1}^{N}f(X_n) \right] = \frac{1}{N} \sum_{n = 1}^{N}E[f(X_n)] = \frac{1}{N} \left( NE[f(X_1)] \right) = \int_{0}^{1} f(x) dx$. That is,
$$P \left[ \limsup_{N \rightarrow \infty} \left| \frac{1}{N} \sum_{n = 1}^N f \left( X_n \right) - \int_{0}^{1} f(x) dx \right| = 0 \right] = 1$$
$\square$
Thus completes our final result. Indeed, in the construction that had led to the stochastic sampling technique in Monte Carlo integration, the definition of the Lebesgue integral had been utilized in such a fashion that it cannot be reduced into the Riemann integral, contrary to the case of the previous section. One can be reminded here yet again of the restrictions imposed by the Riemann integral, in that one requires a certain style of partitioning of the domain of the integrand in such a fashion that provided unambiguously are collections of non-intersecting, non-degenerate, intervals, as given one after the other - that is, the Riemann integral is not provided by linear combinations of functional values as defined on degenerate points of the domain. Even such techniques as simple as the stochastic uniform sampling of the unit interval reject such a possibility as the sets to be provided are points of the domain which are, at the very least, not non-degenerate intervals.
APPENDIX
Appendix I: A Short Story - An Analytic Number Theorist's Challenge
"I already informed you, that problem is an impossible problem! The entire roundtable of analytic number theorists had recognized such a problem to be impossible immediately upon inspection. If you want the problem solved, you need to provide us more explicit information, end of."
In the background, the rapid clicks of a running supercomputer, a mechanical one, pierce through the momentary silence. For the analytic number theorists, the formulation of conjectures is conventional, but there must be some quantitative grounding at the very least, and in the miraculous era of Steam-Punk, many fantasies have never appeared so within reach.
"That can't be. How is it that I have noticed recurring patterns in so many phenomena? There must be something that lies underneath! I may lack the analytic number theoretic skills to see it to its end but I really do think my intuition-"
"Intuition can only bring you so far, Mr. Lyn."
In the few seconds that had passed, another thousand iterations have been verified by the divinely industrious machine as its gears whirl and click.
"I'm sorry to retort you in this fashion, Mr. Lyn. But, you must understand our position. You want us to find the limit of an arithmetic sum of a family of functions, yet, no explicit information for the family of functions is provided. This is simply not possible."
BEEP BEEP BEEP- a sheet of paper slithers through a slit of the mechanical computer, producing sets of numbers, data, that would indicate a normal distribution in the limit.
"Wait... what was the machine trying to demonstrate?"
"Changing the topic now are we? Well, if you must know, it was merely compiling the cognitive scores of millions of people of New London. And it just so happens, here is another example for your consideration. We cannot derive, explicitly, the limit of the arithmetic sum of the cognitive scores of millions of people if we do not gather the relevant data directly because we do not know what we do not know. This is not astrology or numerology, Mr. Lyn, I'm sorry."
"Actually, forgive my persistence, sir, but I think I may have solved a general class of problems... what if I told you, there could exist an entire mathematical subfield that deals with the science of uncertainty?"
"Preposterous."
FOOTNOTES
[1] Richard Courant is the author of such fantastic expositions as Methods of Mathematical Physics: Volume I, Methods of Mathematical Physics: Volume II, Introduction to Calculus and Analysis: Volume I, Introduction to Calculus and Analysis: Volume II, and What Is Mathematics? An Elementary Approach to Ideas and Methods, which has no doubt influenced many generations of STEM human capital. One will discover many works that have cited some of Courant's own works and thus his "downstream" influence is even greater, whether in such mainstream expositions as those of Vladimir Zorich's in Mathematical Analysis I and Mathematical Analysis II, or even in popular research articles including in James Kajiya's 1986 article titled The Rendering Equation.
[2] Consider page 604 of Courant & John's Introduction to Calculus and Analysis: Volume I, pages 70-71 of Courant & Hilbert's Methods of Mathematical Physics: Volume I, or, page 11 of Courant's Dirichlet's Principle, Conformal Mapping, and Minimal Surfaces. That, Courant has a tendency of investigating results for classes of functions that are sectionally smooth. Incidentally, Courant had implied that to investigate such a class of functions provides a legitimate alternative to Lebesgue's theory of real functions in Courant & Hilbert's Methods of Mathematical Physics: Volume I - an approach that requires less the relatively modern theory of measures, but one that sacrifices generality. We comment further that Courant's notion of "smooth" functions is distinct from its usage in many conventional differential geometry subcultures, where "smoothness" alternatively denotes infinitely diffeomorphic functions.
[3] Although the references as indicated in footnote [2] already provide some examples, we can provide a general perspective. For continuous real functions, if only we were to restrict their domains to those of closed intervals, then we would retrieve uniformly continuous functions. As a result, we can always reach for well-behavedness via such an approach. To engage with subtleties further, we are not merely extracting finite numbers of isolated points of insufficient well-behavedness, but extracting entire neighborhoods around these isolated points, leaving their compliments, in the domain of relevant functions, possibly as finite unions of closed intervals.
[4] Consider Jahnke's A History of Analysis.
[5] We loosely remind the reader that measurable functions are simply functions where, for each measurable set in the image space of such functions, one can always recover measurable sets as pre-images. Then, one recognizes that the particular sigma-algebras are important, and that different classes of functions may have different kinds of sigma-algebras as generated by their image spaces and their domains. Such is analogous with continuous functions, where, in the context of continuous functions, we concern ourselves with open sets rather than measurable sets, and topologies rather than sigma-algebras. For additional details, measurable sets are simply sets that belong to sigma-algebras. Although, in the real analysis of the early 20th century, measurable sets, and functions, are typically identified more intuitively - for instance, see Lebesgue's historic definition of measurable functions in DEFINITION 9 of page 287 of Jahnke's A History of Analysis.
[6] See for instance page 21 of Stein & Shakarchi's Real Analysis: Measure Theory, Integration, & Hilbert Spaces.
[7] See for instance page 33 of Klenke's Probability Theory: A Comprehensive Course on "outer regularity" and "inner regularity". Compare with footnote [6].
[8] Actually, we had not even justified the fact that, given the conventional Borel sigma-algebra of the reals, one can cover measurable sets with countable unions of cubes, whether they'd be open or closed. In fact, a common heuristic is simply this - the conventional Borel sigma-algebra of $\mathbb{R}$ can be generated by the class of all closed intervals, or, the class of all open intervals, where, for $\mathbb{R}^k$ with $k > 1$, we would graduate to cubes. And, since sigma-algebras are closed with respect to countable unions, the result is thus quickly justified in an intuitive fashion since such kinds of measurable sets can always be given, in some fashion, in terms of some countable union of cubes.
[9] As an example, consider $[0, 1)$ and $[1, 2]$, where although both sets possess the same measure $1$, they clearly do not overlap as they are disjoint. Here, the symmetric difference produces $[0, 1) \triangle [1, 2] = [0, 2]$. Thus, we see that $m \left( [0, 1) \right) - m \left( [1, 2] \right) = 0$, whereas $m \left( [0, 1) \triangle [1, 2] \right) = 2$. That is, if we were interested in some kind of a measurement of the "similarity" of "forms" in making use of the Lebesgue-Borel measure of the real-line, it would appear that the symmetric difference is more appropriate - we may feel it inappropriate to state that $[0, 1)$ and $[1, 2]$ overlaps, from the perspective of the measure $m$, even if $m \left( [0, 1) \right) - m \left( [1, 2] \right) = 0$. It may feel even more inappropriate to state that $[0, 1)$ can be given as $[1, 2]$ with the exception of a set of measure 0 simply due to $m \left( [0, 1) \right) - m \left( [1, 2] \right) = 0$.
[10] See for instance page 20 of Stein & Shakarchi's Real Analysis: Measure Theory, Integration, & Hilbert Spaces. Alternatively, see page 16 of Klenke's Probability Theory: A Comprehensive Course.
[11] Consider, compare and contrast the definition of convergence, with that of uniform convergence, for $x$ in the domain of $f$,
$$\left( \{ f_n \}_{n = 1}^{\infty} \rightarrow f \right) := \forall \epsilon \in \{ \mathbb{R} \setminus (-\infty, 0] \}, \forall x \in \mathbb{R}, \exists N \in \mathbb{N}, \forall n \in \mathbb{N}, (n > N) \Rightarrow \left( \left| f_n(x) - f(x) \right| < \epsilon \right)$$
$$\left( \{ f_n \}_{n = 1}^{\infty} \rightrightarrows f \right) := \forall \epsilon \in \{ \mathbb{R} \setminus (-\infty, 0] \}, \exists N \in \mathbb{N}, \forall x \in \mathbb{R}, \forall n \in \mathbb{N}, (n > N) \Rightarrow \left( \left| f_n(x) - f(x) \right| < \epsilon \right)$$
What is the difference? It is in the order of $\exists N$ and $\forall x$. For the conventional definition of pointwise convergence, one can always find some $N$ once $x$ has been selected, and thus the notion of point, wise, convergence, since such a $N$ is selected after the selection of points $x$ and may be dependant on points $x$ for certain classes of functions, as in the obvious class of convergent functions that are not uniformly convergent. In the definition of sets $E_{k}^{n}$, $k$ plays the analogous role as $N$ and was already selected prior to the selection of points $x$ for every $E_{k}^{n}$ - hence the uniform convergence, where, for the sake of brevity, sets $E_{k}^{n}$ have even been indexed by $k$. Such classic subtleties can aptly demonstrate the utilitarian uses of mathematical logical notation in analysis. Incidentally, the difference between the notion of continuity, and uniform continuity, can be clarified in such a fashion as well.
[12] Actually, a generalization of Luzin's theorem allows one to conclude equivalence. That is, if it is true that, for any arbitrarily small real $\epsilon > 0$, one can always find some compact set $F_{\epsilon}$ for $m \left( E \setminus F_{\epsilon} \right) < \epsilon$ and such that $f$ is continuous everywhere on $F_{\epsilon}$, then one can conclude that $f$ is in fact measurable - notice that we had required $F_{\epsilon}$ to be compact. One can see, for instance, page 92 of Bogachev & Smolyanov's Real and Functional Analysis.
[13] See Theorem 2.1 in the second section titled 2 Applications in Approximation Theory: Simple Functions. Although Theorem 2.1 was stated for functions defined on $\mathbb{R}$, its extension to $\mathbb{R}^k$ is particularly straightforward.
[14] Recall that the countable union of sets of measure 0 is also a set of measure 0.
[15] To reiterate, continuous on $F_{\epsilon}$, where $F_{\epsilon}$ may not necessarily be convex. Many are no doubt quite familiar with continuous real functions as defined on convex intervals from elementary analysis, however, such a key property of convexity of the domain may actually become non-trivial in certain further topics of analysis.
[16] We note that the sequence of simple functions is indexed by a countable set. This is important as, if we had not provided a countable index set, then $\bigcup_{j = 1}^{\infty} E_j$ would not necessarily be a set of measure 0. Indeed, we use a sequence as opposed to a "net", that is, a topological net in the sense of topological filters where index sets are not assumed to be countable.
[17] Actually, the $\max(\cdot, \cdot)$ and $\min(\cdot, \cdot)$ "operators" are sometimes identified in contexts of "lattice operations". And, it is known that, as is sometimes described, measurability of measurable functions is closed with respect to lattice operations. For instance, see page 15 of Rudin's Real & Complex Analysis, or, page 40 of Klenke's Probability Theory: A Comprehensive Course. And, as to an example of a conventional exposition that uses the term "lattice" in such a context, see page 9 of Yosida's Functional Analysis. Also, we comment that there is such a notion as a "sigma-frame", stated here possibly for future reference.
[18] Consider another example in the notion of a "cohomology" in general. Some time ago, I thought there existed quite a few examples that illustrate the non-trivial utilitarian uses of such notions as cohomology in mathematical physics, until I realized that all of the examples that I had accessed are merely special cases, that are, for instance, de Rham cohomologies and what not, of differential geometry. As a result, one sometimes cannot help but feel, if all the special cases suffice for applications, how could one justify the more general notion of cohomology, of algebraic topology, from a utilitarian perspective? Of course, many perspectives and arguments exist, and one perspective comes in that abstraction provides a means of introducing further information-theoretic efficiency, and in reducing working memory load. More to come concerning such discussions.
[19] By an application of the Borel-Cantelli lemma, we formally conclude that,
$$P \left[ \limsup_{n \rightarrow \infty} \left( \left| \frac{X_1 + \dots + X_{k_n}}{k_n} - E \left[ X_1 \right] \right| \geq (1 + \epsilon)^{-\frac{n}{4}} \right) \right] = 0$$
And so, by taking a measure-theoretic interpretation to probability theory, the influences of Borel, and Cantelli, can be felt.
REFERENCES
Bogachev, V. I., & Smolyanov, O. G. (2020). Real and Functional Analysis. Springer Cham. https://doi.org/10.1007/978-3-030-38219-3
Courant, R. (1977). Dirichlet's Principle, Conformal Mapping, and Minimal Surfaces. Springer New York. (Original work published 1950). https://doi.org/10.1007/978-1-4612-9917-2
Courant, R., & Hilbert, D. (1989). Methods of Mathematical Physics: Volume I. Wiley-VCH Verlag GmbH & Co. KGaA. (Original work published 1953)
Courant, R., & John, F. (1989). Introduction to Calculus and Analysis: Volume I. Springer New York. (Original work published 1965). https://doi.org/10.1007/978-1-4613-8955-2
Jahnke, H. N. (2003). A History of Analysis. American Mathematical Society; London Mathematical Society. (Original work published 1999). https://doi.org/10.1090/hmath/024
Klenke, A. (2020). Probability Theory: A Comprehensive Review (3rd ed.). Springer Cham. https://doi.org/10.1007/978-3-030-56402-5
Rudin, W. (1987). Real & Complex Analysis (3rd ed.). McGraw-Hill.
Stein, E. M., & Shakarchi, R. (2005). Real Analysis: Measure Theory, Hilbert Spaces, & Integration. Princeton University Press.
Yosida, K. (1995). Functional Analysis. Springer Berlin. (Original work published 1980). https://doi.org/10.1007/978-3-642-61859-8
Zorich, V. A. (2016). Mathematical Analysis II (2nd ed., R. Cooke, & O. Paniagua, Trans). Springer Berlin. (Original work published 2012). https://doi.org/10.1007/978-3-662-48993-2