Reference documentation for deal.II version 8.4.1


This program was contributed by Katharina Kormann and Martin Kronbichler.
The algorithm for the matrix-vector product is based on the article A generic interface for parallel cell-based finite element operator application by Martin Kronbichler and Katharina Kormann, Computers and Fluids 63:135–147, 2012, and the paper "Parallel finite element operator application: Graph partitioning and coloring" by Katharina Kormann and Martin Kronbichler in: Proceedings of the 7th IEEE International Conference on e-Science, 2011.
This example shows how to implement a matrix-free method, that is, a method that does not explicitly store the matrix elements, for a second-order Poisson equation with variable coefficients on a hypercube. The elliptic equation will be solved with a multigrid method.
The major motivation for matrix-free methods is the fact that access to main memory (i.e., for objects that do not fit into the cache) has become the bottleneck in scientific computing today: To perform a matrix-vector product, modern CPUs spend far more time waiting for data to arrive from memory than on actually doing the floating point multiplications and additions. Thus, if we could substitute looking up matrix elements in memory by recomputing them (or rather, the operator represented by these entries), we may win in terms of overall run time even if this requires a significant number of additional floating point operations. That said, a trivial implementation will not achieve this; one really needs to look at what it takes to make it happen. This tutorial program (and the papers referenced above) shows how one can implement such a scheme and demonstrates the speedup that can be obtained.
In this example, we consider the Poisson problem
\begin{eqnarray*}  -\nabla \cdot a(\mathbf x) \nabla u &=& 1, \\ u &=& 0 \quad \text{on}\ \partial \Omega \end{eqnarray*}
where \(a(\mathbf x)\) is a variable coefficient. Below, we explain how to implement a matrixvector product for this problem without explicitly forming the matrix. The construction can, of course, be done in a similar way for other equations as well.
We choose as domain \(\Omega=[0,1]^3\) and \(a(\mathbf x)=\frac{1}{0.05 + 2\|\mathbf x\|^2}\). Since the coefficient is symmetric around the origin but the domain is not, we will end up with a non-symmetric solution.
In order to find out how we can write a code that performs a matrix-vector product, but does not need to store the matrix elements, let us start by looking at how a finite element matrix A is assembled:
\begin{eqnarray*} A = \sum_{\mathrm{cell}=1}^{\mathrm{n\_cells}} P_\mathrm{cell,loc-glob}^T A_\mathrm{cell} P_\mathrm{cell,loc-glob}. \end{eqnarray*}
In this formula, the matrix P_{cell,loc-glob} is a rectangular matrix that defines the index mapping from local degrees of freedom in the current cell to the global degrees of freedom. The information from which this operator can be built is usually encoded in the local_dof_indices variable we have always used in the assembly of matrices. Moreover, A_{cell} denotes the cell operation associated with A.
If we are to perform a matrix-vector product, we can hence use that
\begin{eqnarray*} y &=& A\cdot u = \left(\sum_{\mathrm{cell}=1}^{\mathrm{n\_cells}} P_\mathrm{cell,loc-glob}^T A_\mathrm{cell} P_\mathrm{cell,loc-glob}\right) \cdot u \\ &=& \sum_{\mathrm{cell}=1}^{\mathrm{n\_cells}} P_\mathrm{cell,loc-glob}^T A_\mathrm{cell} u_\mathrm{cell} \\ &=& \sum_{\mathrm{cell}=1}^{\mathrm{n\_cells}} P_\mathrm{cell,loc-glob}^T v_\mathrm{cell}, \end{eqnarray*}
where u_{cell} are the values of u at the degrees of freedom of the respective cell, and v_{cell}=A_{cell}u_{cell} correspondingly for the result. A naive attempt to implement the local action of the Laplacian would hence be to use the following code:
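The code block referred to here is not reproduced in this extract. As a self-contained illustration of the same steps (a hedged plain-C++ sketch using arrays instead of deal.II's FEValues and FullMatrix, on a 1D mesh of linear elements rather than the tutorial's 3D setup), the naive cell-wise product could look like this:

```cpp
#include <array>
#include <cassert>
#include <vector>

// Sketch of the naive cell-wise product y = sum_cells P^T A_cell (P u) on a
// 1D mesh of linear elements with unit-sized cells: shape-function gradients
// are the constants -1 and +1, and a single quadrature point with weight 1
// integrates the constant integrand exactly.
std::vector<double> naive_cell_product(const std::vector<double> &src,
                                       const unsigned int n_cells)
{
  const unsigned int dofs_per_cell = 2, n_q_points = 1;
  const double shape_grad[2][1] = {{-1.0}, {1.0}}; // grad of phi_i at q-point
  const double JxW[1] = {1.0};
  std::vector<double> dst(src.size(), 0.0);
  for (unsigned int cell = 0; cell < n_cells; ++cell)
    {
      // generate the local matrix: three nested loops over i, j, q
      double cell_matrix[2][2] = {{0.0, 0.0}, {0.0, 0.0}};
      for (unsigned int i = 0; i < dofs_per_cell; ++i)
        for (unsigned int j = 0; j < dofs_per_cell; ++j)
          for (unsigned int q = 0; q < n_q_points; ++q)
            cell_matrix[i][j] += shape_grad[i][q] * shape_grad[j][q] * JxW[q];

      // local-to-global index map, playing the role of P_{cell,loc-glob}
      const std::array<unsigned int, 2> local_dof_indices = {cell, cell + 1};

      // extract src (action of P), multiply by A_cell, scatter-add (P^T)
      for (unsigned int i = 0; i < dofs_per_cell; ++i)
        for (unsigned int j = 0; j < dofs_per_cell; ++j)
          dst[local_dof_indices[i]] +=
            cell_matrix[i][j] * src[local_dof_indices[j]];
    }
  return dst;
}
```

The two inner steps mirror the formula above exactly: gather, local multiply, scatter-add.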
Here we neglected boundary conditions as well as any hanging nodes we may have, though neither would be very difficult to include using the ConstraintMatrix class. Note how we first generate the local matrix in the usual way as a sum over all quadrature points for each local matrix entry. To form the actual product as expressed in the above formula, we extract the values of src at the cell-related degrees of freedom (the action of P_{cell,loc-glob}), multiply by the local matrix (the action of A_{cell}), and finally add the result to the destination vector dst (the action of P_{cell,loc-glob}^{T}, added over all the elements). It is not more difficult than that, in principle.
While this code is completely correct, it is very slow. For every cell, we generate a local matrix, which takes three nested loops with loop length equal to the number of local degrees of freedom to compute. The multiplication itself is then done by two nested loops, which is much cheaper than building the matrix.
One way to improve this is to realize that conceptually the local matrix can be thought of as the product of three matrices,
\begin{eqnarray*} A_\mathrm{cell} = B_\mathrm{cell}^T D_\mathrm{cell} B_\mathrm{cell}, \end{eqnarray*}
where for the example of the Laplace operator the (q*dim+d,i)-th element of B_{cell} is given by fe_values.shape_grad(i,q)[d]. This matrix consists of dim*n_q_points rows and dofs_per_cell columns. The matrix D_{cell} is diagonal and contains the values fe_values.JxW(q) * coefficient_values[q] (or, rather, dim copies of each of these values). This kind of representation of finite element matrices can often be found in the engineering literature.
When the cell-based matrix is applied to a vector
\begin{eqnarray*} A_\mathrm{cell}\cdot u_\mathrm{cell} = B_\mathrm{cell}^T D_\mathrm{cell} B_\mathrm{cell} \cdot u_\mathrm{cell}, \end{eqnarray*}
one would then never form the matrix-matrix products, but rather multiply with the vector from right to left so that only three successive matrix-vector products are formed. This removes the three nested loops in the calculation of the local matrix. What happens is as follows: We first transform the vector of values on the local dofs to a vector of gradients on the quadrature points. In the second loop, we multiply these gradients by the integration weight. The third loop applies the second gradient (in transposed form), so that we get back to a vector of (Laplacian) values on the cell dofs. This reduces the complexity of the work on one cell from something like \(\mathcal {O}(\mathrm{dofs\_per\_cell}^3)\) to \(\mathcal {O}(\mathrm{dofs\_per\_cell}^2)\).
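As a sketch of this right-to-left evaluation (plain C++ with nested std::vector matrices, an illustration rather than the tutorial's actual kernels), applying A_cell = B^T D B as three successive matrix-vector products might look like:

```cpp
#include <cassert>
#include <vector>

// Apply A_cell * u = B^T (D (B u)) as three successive matrix-vector
// products instead of forming the local matrix first. B has n_rows rows
// (gradient components at quadrature points) and n_cols columns (local
// dofs); D is diagonal. The work is O(dofs^2) instead of the O(dofs^3)
// needed to build A_cell explicitly.
std::vector<double>
apply_BtDB(const std::vector<std::vector<double>> &B,
           const std::vector<double> &D, // diagonal of D_cell
           const std::vector<double> &u)
{
  const std::size_t n_rows = B.size(), n_cols = B[0].size();
  std::vector<double> t(n_rows, 0.0);
  for (std::size_t q = 0; q < n_rows; ++q) // t = B u: gradients at q-points
    for (std::size_t j = 0; j < n_cols; ++j)
      t[q] += B[q][j] * u[j];
  for (std::size_t q = 0; q < n_rows; ++q) // t = D t: weights/coefficient
    t[q] *= D[q];
  std::vector<double> v(n_cols, 0.0);
  for (std::size_t q = 0; q < n_rows; ++q) // v = B^T t: test-function step
    for (std::size_t i = 0; i < n_cols; ++i)
      v[i] += B[q][i] * t[q];
  return v;
}
```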
The bottleneck in the above code is the operations done by the call to FEValues::reinit for every cell, which take about as much time as the other steps together (at least if the mesh is unstructured; deal.II can recognize that the gradients are often unchanged on structured meshes). That is certainly not ideal and we would like to do better than this. What the reinit function does is to calculate the gradient in real space by transforming the gradient on the reference cell using the Jacobian of the transformation from real to reference cell. This is done for each basis function on the cell, for each quadrature point. The Jacobian does not depend on the basis function, but it is different on different quadrature points in general. If you only build the matrix once as we have done in all previous tutorial programs, there is nothing one can do about the need to call FEValues::reinit on every cell since this transformation has to be done when we want to compute the local matrix elements.
However, in a matrixfree implementation, we are not interested in applying the matrix only once. Rather, in iterative solvers, we need to expect that we have to apply the matrix many times, and so we can think about whether we may be able to cache something between different applications. On the other hand, we realize that we must not cache too much data since otherwise we get back to the situation where memory access becomes the dominating factor.
The trick is now to factor out the Jacobian transformation and first apply the gradient on the reference cell only. That transforms the vector of values on the local dofs to a vector of gradients on the quadrature points. There, we first apply the Jacobian that we factored out from the gradient, then we apply the weights of the quadrature, and we apply the transposed Jacobian for preparing the third loop which again uses the gradients on the unit cell.
Let us again write this in terms of matrices. Let the matrix B_{cell} denote the cell-related gradient matrix, with each row containing the values on the quadrature points. It is constructed by a matrix-matrix product as
\begin{eqnarray*} B_\mathrm{cell} = J_\mathrm{cell} B_\mathrm{ref\_cell}, \end{eqnarray*}
where B_{ref_cell} denotes the gradient on the reference cell and J_{cell} denotes the Jacobian transformation from unit to real cell (in the language of transformations, the operation represented by J_{cell} is a covariant transformation). J_{cell} is block-diagonal, and the block size equals the dimension of the problem. Each diagonal block is the Jacobian transformation that goes from the reference cell to the real cell.
Putting things together, we find that
\begin{eqnarray*} A_\mathrm{cell} = B_\mathrm{cell}^T D_\mathrm{cell} B_\mathrm{cell} = B_\mathrm{ref\_cell}^T J_\mathrm{cell}^T D_\mathrm{cell} J_\mathrm{cell} B_\mathrm{ref\_cell}, \end{eqnarray*}
so we calculate the product (starting the local product from the right)
\begin{eqnarray*} v_\mathrm{cell} = B_\mathrm{ref\_cell}^T J_\mathrm{cell}^T D_\mathrm{cell} J_\mathrm{cell} B_\mathrm{ref\_cell} u_\mathrm{cell}, \quad v = \sum_{\mathrm{cell}=1}^{\mathrm{n\_cells}} P_\mathrm{cell,loc-glob}^T v_\mathrm{cell}. \end{eqnarray*}
Note how we create an additional FEValues object for the reference cell gradients and how we initialize it to the reference cell. The actual derivative data is then applied by the inverse, transposed Jacobians (deal.II calls the Jacobian matrix from unit to real cell inverse_jacobian, because the transformation direction in deal.II is from real to unit cell).
Finally, we use tensor product basis functions, and now that we have separated out the gradient on the reference cell B_{ref_cell}, we can exploit the tensor-product structure to further reduce the complexity. We illustrate this in two space dimensions, but the same technique can be used in higher dimensions. On the reference cell, the basis functions are of the tensor product form \(\phi(x,y) = \varphi_i(x) \varphi_j(y)\). The part of the matrix B_{ref_cell} that computes the first component has the form \(B_\mathrm{sub\_cell}^x = B_\mathrm{grad,x} \otimes B_\mathrm{val,y}\), where B_{grad,x} and B_{val,y} contain the evaluation of all the 1D basis functions on all the 1D quadrature points. Forming a matrix U with U(j,i) containing the coefficient belonging to basis function \(\varphi_i(x) \varphi_j(y)\), we get \((B_\mathrm{grad,x} \otimes B_\mathrm{val,y})u_\mathrm{cell} = B_\mathrm{val,y} U B_\mathrm{grad,x}^T\). This reduces the complexity for computing this product from \(p^4\) to \(2 p^3\), where p-1 is the degree of the finite element (i.e., equivalently, p is the number of shape functions in each coordinate direction), or from \(p^{2d}\) to \(d p^{d+1}\) in general.
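A minimal sketch of this sum-factorization identity in plain C++ (the matrix names G, V, U mirror B_grad,x, B_val,y, and U from the text; this is an illustration, not deal.II's implementation):

```cpp
#include <cassert>
#include <vector>

using Mat = std::vector<std::vector<double>>;

// Sum factorization in 2D: apply (G ⊗ V) to u_cell, where G and V hold the
// 1D shape gradients/values at the 1D quadrature points. With
// U(j,i) = u[i*n + j], the result is V * (U * G^T): two small 1D-sized
// matrix-matrix products, O(p^3) work, instead of one product with the full
// (m*m) x (n*n) Kronecker matrix, O(p^4) work.
Mat sum_factorized(const Mat &G, const Mat &V, const Mat &U)
{
  const std::size_t m = G.size(), n = G[0].size();
  Mat T(n, std::vector<double>(m, 0.0)); // T = U * G^T
  for (std::size_t j = 0; j < n; ++j)
    for (std::size_t qx = 0; qx < m; ++qx)
      for (std::size_t i = 0; i < n; ++i)
        T[j][qx] += U[j][i] * G[qx][i];
  Mat R(m, std::vector<double>(m, 0.0)); // R = V * T
  for (std::size_t qy = 0; qy < m; ++qy)
    for (std::size_t qx = 0; qx < m; ++qx)
      for (std::size_t j = 0; j < n; ++j)
        R[qy][qx] += V[qy][j] * T[j][qx];
  return R; // R[qy][qx] = sum_{i,j} G(qx,i) V(qy,j) u[i*n+j]
}
```

In d dimensions the same trick is applied once per coordinate direction, giving the \(d\,p^{d+1}\) count quoted above.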
Implementing a matrix-free and cell-based finite element operator requires a somewhat different design compared to the usual matrix assembly codes shown in previous tutorial programs. The data structures for doing this are the MatrixFree class that collects all data and issues a (parallel) loop over all cells and the FEEvaluation class that evaluates finite element basis functions by making use of the tensor product structure.
The implementation of the matrix-free matrix-vector product shown in this tutorial is slower than a matrix-vector product using a sparse matrix for linear elements, but faster for all higher order elements thanks to the reduced complexity due to the tensor product structure and due to less memory transfer during computations. The impact of reduced memory transfer is particularly beneficial when working on a multicore processor where several processing units share access to memory. In that case, an algorithm which is computation bound will show almost perfect parallel speedup, whereas an algorithm that is bound by memory transfer might not achieve similar speedup (even when the work is perfectly parallel and one could expect perfect scaling like in sparse matrix-vector products). An additional gain with this implementation is that we do not have to build the sparse matrix itself, which can also be quite expensive depending on the underlying differential equation. Moreover, the above framework is simple to generalize to nonlinear operations, as we demonstrate in step-48.
Above, we have gone to significant lengths to implement a matrix-vector product that does not actually store the matrix elements. In many user codes, however, one wants more than just performing some number of matrix-vector products: one wants to do as few of these operations as possible when solving linear equation systems. In theory, we could use the CG method without preconditioning; however, that would not be very efficient. Rather, one uses preconditioners for improving speed. On the other hand, most of the more frequently used preconditioners such as SSOR, ILU, or algebraic multigrid (AMG) cannot be used here because their implementation requires knowledge of the elements of the system matrix.
One solution is to use multigrid methods as shown in step-16. They are known to be very fast, and they are suitable for our purpose since they can be designed based purely on matrix-vector products. All one needs to do is to find a smoother that works with matrix-vector products only (our choice requires knowledge of the diagonal entries of the matrix, though). One such candidate would be a damped Jacobi iteration, but that is often not sufficiently good at damping high-frequency errors. Ultimately, we use a Chebyshev preconditioner here. It can be seen as an extension of the Jacobi method by using Chebyshev polynomials. With degree zero, the Jacobi method with optimal damping parameter is retrieved, whereas higher order corrections improve the smoothing properties if some parameters are suitably chosen. The effectiveness of Chebyshev smoothing in multigrid has been demonstrated, e.g., in the article M. Adams, M. Brezina, J. Hu, R. Tuminaro. Parallel multigrid smoothers: polynomial versus Gauss–Seidel, J. Comput. Phys. 188:593–610, 2003. This publication also identifies one more advantage of Chebyshev smoothers that we exploit here, namely that they are easy to parallelize, whereas SOR/Gauss–Seidel smoothing relies on substitutions, for which a naive parallelization works on diagonal sub-blocks of the matrix, thereby decreasing efficiency (for more detail see, e.g., Y. Saad, Iterative Methods for Sparse Linear Systems, SIAM, 2nd edition, 2003, chapters 11 & 12).
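The degree-zero member of this smoother family, a damped Jacobi sweep, needs exactly the ingredients named above: the operator application and the matrix diagonal. A hedged sketch (function and parameter names are illustrative, not deal.II API):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Damped Jacobi smoothing, x <- x + omega * D^{-1} (b - A x), using only a
// matrix-free operator application A_times (any callable returning A*x) and
// the stored diagonal. The Chebyshev smoother used in the tutorial can be
// viewed as the higher-polynomial-degree generalization of this sweep.
template <typename MatVec>
void damped_jacobi(const MatVec &A_times,
                   const std::vector<double> &diagonal,
                   const std::vector<double> &b,
                   std::vector<double> &x,
                   const double omega,
                   const unsigned int n_sweeps)
{
  for (unsigned int s = 0; s < n_sweeps; ++s)
    {
      const std::vector<double> Ax = A_times(x); // matrix-free product
      for (std::size_t i = 0; i < x.size(); ++i)
        x[i] += omega * (b[i] - Ax[i]) / diagonal[i];
    }
}
```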
The implementation into the multigrid framework is then straightforward. The multigrid implementation in this program is based on a simplified version of step-16 that disregards adaptivity.
The computational kernels for evaluation in FEEvaluation are written in a way to optimally use computational resources. Indeed, they operate not on double data types, but on something we call VectorizedArray (check, e.g., the return type of FEEvaluationBase::get_value, which is VectorizedArray for a scalar element and a Tensor of VectorizedArray for a vector finite element). VectorizedArray is a short array of doubles or floats whose length depends on the particular computer system in use. For example, systems based on x86-64 support the streaming SIMD extensions (SSE), where the processor's vector units can process two doubles (or four single-precision floats) with one CPU instruction. Newer processors with support for the so-called advanced vector extensions (AVX) with 256-bit operands can process four doubles and eight floats, respectively. Vectorization is a single-instruction/multiple-data (SIMD) concept, that is, one CPU instruction is used to process multiple data values at once. Often, finite element programs do not use vectorization explicitly, as the benefits of this concept lie only in arithmetic-intensive operations. The bulk of typical finite element workloads is memory bandwidth limited (operations on sparse matrices and vectors), where the additional computational power is useless.
Behind the scenes, optimized BLAS packages might heavily rely on vectorization, though. Also, optimizing compilers might automatically transform loops involving standard code into more efficient vectorized form. However, the data flow must be very regular in order for compilers to produce efficient code. For example, already the automatic vectorization of the prototype operation that benefits from vectorization, matrix-matrix products, fails on most compilers (as of writing this tutorial in early 2012, neither gcc-4.6 nor the Intel compiler v. 12 manage to produce useful vectorized code for the FullMatrix::mmult function, not even in the simpler case where the matrix bounds are compile-time constants instead of run-time constants). The main reason for this is that the information to be processed at the innermost loop (that is where vectorization is applied) is not necessarily a multiple of the vector length, leaving parts of the resources unused. Moreover, the data that can potentially be processed together might not be laid out in a contiguous way in memory, or not with the necessary alignment to the address boundaries needed by the processor. Or the compiler might not be able to prove that data arrays do not overlap when loading several elements at once.
In the matrix-free implementation in deal.II, we have therefore chosen to apply vectorization at the level which is most appropriate for finite element computations: The cell-wise computations are typically exactly the same for all cells (except for reading from and writing to vectors), and hence SIMD can be used to process several cells at once. In all that follows, you can think of a VectorizedArray as holding data from several cells. Remember that it is not related to the spatial dimension and the number of elements, e.g., in a Tensor or Point.
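To make the idea concrete, here is a toy stand-in for VectorizedArray (not the deal.II class, just an illustration of lanes holding data from several cells):

```cpp
#include <array>
#include <cassert>

// Minimal stand-in for VectorizedArray<double>: each arithmetic operation
// acts on one lane per cell, so the same cell-local computation runs for
// `width` cells at once (a compiler can map the fixed-length lane loops to
// SIMD instructions). The lane index enumerates cells, not space dimensions.
template <int width>
struct MiniVectorizedArray
{
  std::array<double, width> lane;

  MiniVectorizedArray operator+(const MiniVectorizedArray &other) const
  {
    MiniVectorizedArray result;
    for (int l = 0; l < width; ++l)
      result.lane[l] = lane[l] + other.lane[l];
    return result;
  }
  MiniVectorizedArray operator*(const MiniVectorizedArray &other) const
  {
    MiniVectorizedArray result;
    for (int l = 0; l < width; ++l)
      result.lane[l] = lane[l] * other.lane[l];
    return result;
  }
};
```

Writing the cell kernel once against such a type, instead of plain double, is what lets the same code process two cells with SSE or four cells with AVX.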
Note that vectorization depends on the CPU that is used for deal.II. In order to generate the fastest kernels of FEEvaluation for your computer, you should compile deal.II with the so-called native processor variant. When using the gcc compiler, it can be enabled by setting the variable CMAKE_CXX_FLAGS to "-march=native" in the cmake build settings (on the command line, specify -DCMAKE_CXX_FLAGS="-march=native"; see the deal.II README for more information). Similar options exist for other compilers.
First include the necessary files from the deal.II library.
This includes the data structures for the efficient implementation of matrix-free methods or more generic finite element operators with the class MatrixFree.
To be efficient, the operations performed in the matrixfree implementation require knowledge of loop lengths at compile time, which are given by the degree of the finite element. Hence, we collect the values of the two template parameters that can be changed at one place in the code. Of course, one could make the degree of the finite element a runtime parameter by compiling the computational kernels for all degrees that are likely (say, between 1 and 6) and selecting the appropriate kernel at run time. Here, we simply choose second order \(Q_2\) elements and choose dimension 3 as standard.
We define a variable coefficient function for the Poisson problem. It is similar to the function in step-5 but we use the form \(a(\mathbf x)=\frac{1}{0.05 + 2\|\mathbf x\|^2}\) instead of a discontinuous one. It is merely to demonstrate the possibilities of this implementation, rather than making much sense physically. We define the coefficient in the same way as functions in earlier tutorial programs. There is one new function, namely a value method with template argument number.
This is the new function mentioned above: Evaluate the coefficient for the abstract type number. It might be just a usual double, but it can also be a somewhat more complicated type that we call VectorizedArray. This data type is essentially a short array of doubles, as discussed in the introduction, that holds data from several cells. For example, we evaluate the coefficient shown here not on a simple point as usually done, but we hand it a Point<dim,VectorizedArray<double> > point, which is actually a collection of two points in the case of SSE2. Do not confuse the entries in VectorizedArray<double> with the different coordinates of the point. Indeed, the data is laid out such that p[0] returns a VectorizedArray<double>, which in turn contains the x-coordinate for the first point and the second point. You may access the coordinates individually using, e.g., p[0][j], j=0,1, but it is recommended to define operations on a VectorizedArray as much as possible in order to make use of vectorized operations.
In the function implementation, we assume that the number type overloads basic arithmetic operations, so we just write the code as usual. The standard functions value and value_list that are virtual functions contained in the base class are then computed from the templated function with double type, in order to avoid duplicating code.
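A hedged sketch of such a templated coefficient evaluation in plain C++ (using std::array instead of deal.II's Point<dim,number>; the function name is illustrative):

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

// The coefficient a(x) = 1 / (0.05 + 2|x|^2) from this program, templated on
// the number type: the same body serves a plain double and a SIMD type like
// VectorizedArray<double>, provided that type overloads +, *, and /.
template <typename number, std::size_t dim>
number coefficient_value(const std::array<number, dim> &p)
{
  number norm_square = number(0.);
  for (std::size_t d = 0; d < dim; ++d)
    norm_square = norm_square + p[d] * p[d];
  return number(1.) / (number(0.05) + number(2.) * norm_square);
}
```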
The following class, called LaplaceOperator, implements the differential operator. For all practical purposes, it is a matrix, i.e., you can ask it for its size (member functions m() and n()) and you can apply it to a vector (the various variants of the vmult() function). The difference to a real matrix of course lies in the fact that this class does not actually store the elements of the matrix, but only knows how to compute the action of the operator when applied to a vector.
In this program, we want to make use of the data cache for finite element operator application that is integrated in deal.II. The main class that collects all data is called MatrixFree. It contains mapping information (Jacobians) and index relations between local and global degrees of freedom. It also contains constraints like the ones from Dirichlet boundary conditions (or hanging nodes, if we had any). Moreover, it can issue a loop over all cells in parallel, where it makes sure that only cells are worked on that do not share any degree of freedom (this makes the loop thread-safe when writing into destination vectors). This is a more advanced strategy compared to the WorkStream class described in the Parallel computing with multiple processors accessing shared memory module, which serializes operations that might not be thread-safe. Of course, to not destroy thread-safety, we have to be careful when writing into class-global structures.
First comes the implementation of the matrix-free class. It provides some standard information we expect for matrices (like returning the dimensions of the matrix), it implements matrix-vector multiplications in several forms (transposed and untransposed), and it provides functions for initializing the structure with data. The class has three template arguments, one for the dimension (as many deal.II classes carry), one for the degree of the finite element (which we need to enable efficient computations through the FEEvaluation class), and one for the underlying scalar type. We want to use double numbers (i.e., double precision, 64-bit floating point) for the final matrix, but floats (single precision, 32-bit floating point numbers) for the multigrid level matrices (as that is only a preconditioner, and floats can be worked with twice as fast).
In this class, we store the actual MatrixFree object, the variable coefficient that is evaluated at all quadrature points (so that we don't have to recompute it during matrix-vector products), and a vector that contains the diagonal of the matrix that we need for the multigrid smoother. We choose to let the user provide the diagonal in this program, but we could also integrate a function in this class to evaluate the diagonal. Unfortunately, this forces us to define matrix entries at two places, once when we evaluate the product and once for the diagonal, but the work is still much less than when we compute sparse matrices.
As a side note, if we implemented several different operations on the same grid and degrees of freedom (like a mass matrix and a Laplace matrix), we would have to have two classes like the current one for each of the operators (maybe with a common base class). However, in that case, we would not store a MatrixFree object in this class to avoid doing the expensive work of precomputing everything MatrixFree stores twice. Rather, we would keep this object in the main class and simply store a reference.
Storing the coefficient values in terms of VectorizedArray<number> requires care: Here, we use the deal.II table class, which is prepared to hold the data with the correct alignment. However, storing it in, e.g., std::vector<VectorizedArray<number> > is not possible with vectorization: A certain alignment of the data with the memory address boundaries is required (essentially, a VectorizedArray that is 16 bytes long as in SSE needs to start at a memory address that is divisible by 16). The table class (as well as the AlignedVector class it is based on) makes sure that this alignment is respected, whereas std::vector can in general not, which may lead to segmentation faults at strange places on some systems or suboptimal performance on other systems.

This is the constructor of the LaplaceOperator class. All it does is to subscribe to the general deal.II Subscriptor scheme that makes sure that we do not delete an object of this class as long as it is used somewhere else, e.g. in a preconditioner.
The next functions return the number of rows and columns of the global matrix (i.e. the dimensions of the operator this class represents, the point of this tutorial program was, after all, that we don't actually store the elements of the rows and columns of this operator). Since the matrix is square, the returned numbers are the same. We get the number from the vector partitioner stored in the data field (a partitioner distributes elements of a vector onto a number of different machines if programs are run in parallel; since this program is written to run on only a single machine, the partitioner will simply say that all elements of the vector – or, in the current case, all rows and columns of a matrix – are stored on the current machine).
Once we have created the multigrid dof_handler and the constraints, we can call the reinit function for each level of the multigrid routine (and the active cells). The main purpose of the reinit function is to set up the MatrixFree instance for the problem. Also, the coefficient is evaluated. For this, we need to activate the update flag in the AdditionalData field of MatrixFree that enables the storage of quadrature point coordinates in real space (by default, it only caches data for gradients (inverse transposed Jacobians) and JxW values). Note that if we call the reinit function without specifying the level (i.e., giving level = numbers::invalid_unsigned_int), we have told the class to loop over the active cells.
We also set one option regarding task parallelism. We choose to use the partition_color strategy, which is based on a subdivision of cells into partitions where cells in partition \(k\) (or, more precisely, the degrees of freedom on these cells) only interact with cells in partitions \(k-1\), \(k\), and \(k+1\). Within each partition, cells are colored in such a way that cells with the same color do not share degrees of freedom and can, therefore, be worked on at the same time without interference. This determines a task dependency graph that is scheduled by the Intel Threading Building Blocks library. Another option would be the strategy partition_partition, which performs better when the grid is more unstructured. We could also manually set the size of chunks that form one task in the scheduling process by setting tasks_block_size, but the default strategy to let the function decide works well already.
To initialize the coefficient, we directly give it the Coefficient class defined above and then select the method coefficient_function.value with vectorized number (which the compiler can deduce from the point data type). The use of the FEEvaluation class (and its template arguments) will be explained below.
Here comes the main function of this class, the evaluation of the matrix-vector product (or, in general, a finite element operator evaluation). This is done in a function that takes exactly four arguments, the MatrixFree object, the destination and source vectors, and a range of cells that are to be worked on. The method cell_loop in the MatrixFree class will internally call this function with some range of cells that is obtained by checking which cells can be worked on simultaneously so that write operations do not cause any race condition. Note that the total range of cells as visible in this class is usually not equal to the number of (active) cells in the triangulation. In fact, "cell" may be the wrong term to begin with, since it is rather a collection of quadrature points from several cells, and the MatrixFree class groups the quadrature points of several cells into one block to enable a higher degree of vectorization. The number of such "cells" is stored in MatrixFree and can be queried through MatrixFree::n_macro_cells(). Compared to the deal.II cell iterators, in this class all cells are laid out in a plain array with no direct knowledge of level or neighborship relations, which makes it possible to index the cells by unsigned integers.
The implementation of the Laplace operator is quite simple: First, we need to create an object FEEvaluation that contains the computational kernels and has data fields to store temporary results (e.g. gradients evaluated on all quadrature points on a collection of a few cells). Note that temporary results do not use a lot of memory, and since we specify template arguments with the element order, the data is stored on the stack (without expensive memory allocation). Usually, one only needs to set two template arguments, the dimension as the first argument and the degree of the finite element as the second argument (this is equal to the number of degrees of freedom per dimension minus one for FE_Q elements). However, here we also want to be able to use float numbers for the multigrid preconditioner, which is the last (fifth) template argument. Therefore, we cannot rely on the default template arguments and must consequently also fill the third and fourth fields. The third argument specifies the number of quadrature points per direction and has a default value equal to the degree of the element plus one. The fourth argument sets the number of components (one can also evaluate vector-valued functions in systems of PDEs, but the default is a scalar element), and finally the last argument sets the number type.
Next, we loop over the given cell range and then we continue with the actual implementation:
- Read the source vector (read_dof_values), including the resolution of constraints. This stores \(u_\mathrm{cell}\) as described in the introduction.
- In the loop over quadrature points, get_gradient applies the Jacobian and returns the gradient in real space. Then, we just need to multiply by the (scalar) coefficient, and let the function submit_gradient apply the second Jacobian (for the test function) and the quadrature weight and Jacobian determinant (JxW). Note that the submitted gradient is stored in the same data field where it is read from in get_gradient. Therefore, you need to make sure not to read from the same quadrature point again after having called submit_gradient on that particular quadrature point. In general, it is a good idea to copy the result of get_gradient when it is used more than once.
- Finally, the local contributions are added into the result vector with distribute_local_to_global, the same name as the corresponding function in the ConstraintMatrix (only that we now store the local vector in the FEEvaluation object, as are the indices between local and global degrees of freedom).

Now to the vmult function that is called externally: In addition to what we do in a vmult_add function further down, we set the destination to zero first. The transposed matrix-vector product is needed for well-defined multigrid preconditioner operations. Since we solve a Laplace problem, this is the same operation, and we just refer to the vmult operation.
This function implements the loop over all cells. This is done with the cell_loop function of the MatrixFree class, which takes the operator() of this class with arguments MatrixFree, OutVector, InVector, cell_range. Note that we could also use a simple function as the local operation in case we had constant coefficients (all we need then is the MatrixFree object, the vectors, and the cell range), but since the coefficient is stored in a variable of this class, we cannot use that variant here. The cell loop is automatically performed on several threads if multithreading is enabled (this class uses a quite elaborate algorithm to work on cells that do not share any degrees of freedom that could possibly give rise to race conditions, using the dynamic task scheduler of the Intel Threading Building Blocks).
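The structure of such a cell loop can be illustrated without any deal.II machinery. The following self-contained sketch (all names invented for illustration; not the deal.II API) applies the 1D Laplace operator discretized with linear elements in a matrix-free fashion: for each cell it gathers the local source values (the analogue of read_dof_values), applies the 2x2 local stiffness matrix on the fly, and adds the local result into the global destination vector (the analogue of distribute_local_to_global). The real cell_loop does the same conceptually, but vectorizes over batches of cells and schedules them on several threads.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Matrix-free application of the 1D Laplace operator on a uniform mesh of
// linear elements with mesh size h (boundary conditions are handled by the
// caller). The local stiffness matrix is (1/h) * [ 1 -1; -1 1 ] and is
// applied on the fly instead of being assembled into a global matrix.
std::vector<double> laplace_vmult(const std::vector<double> &src, const double h)
{
  const std::size_t n_cells = src.size() - 1;
  std::vector<double> dst(src.size(), 0.0);
  for (std::size_t cell = 0; cell < n_cells; ++cell)
    {
      // gather local values of the source vector ("read_dof_values")
      const double u0 = src[cell], u1 = src[cell + 1];
      // apply the local stiffness matrix on the fly
      const double v0 = (u0 - u1) / h, v1 = (u1 - u0) / h;
      // scatter the local result into the global vector
      // ("distribute_local_to_global")
      dst[cell]     += v0;
      dst[cell + 1] += v1;
    }
  return dst;
}
```

Applying this to the nodal values of u(x) = x yields zero at all interior nodes, since the contributions of neighboring cells cancel for a globally linear function.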
After the cell loop, we need to touch the constrained degrees of freedom: Since the assembly loop automatically resolves constraints (just as the ConstraintMatrix::distribute_local_to_global call does), it does not compute any contribution for constrained degrees of freedom. In other words, the entries for constrained DoFs remain zero after the first part of this function, as if the matrix had empty rows and columns for constrained degrees of freedom. On the other hand, iterative solvers like CG only work for nonsingular matrices, so we have to modify the operation on constrained DoFs. The easiest way to do that is to pretend that the subblock of the matrix that corresponds to constrained DoFs is the identity matrix, in which case application of the matrix would simply copy the elements of the right hand side vector into the left hand side. In general, however, one needs to make sure that the diagonal entries of this subblock are of the same order of magnitude as the diagonal elements of the rest of the matrix. Here, the domain extent is of unit size, so we can simply choose unit size. If we had domains that are far away from unit size, we would need to choose a number that is close to the size of other diagonal matrix entries, so that these artificial eigenvalues do not change the eigenvalue spectrum (and make convergence with CG more difficult).
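In code, this fix is only a few lines. The following self-contained sketch (plain C++ with hypothetical names, not the deal.II API) shows the post-processing step: after a matrix-vector product that leaves constrained rows at zero, copying the corresponding source entries into the destination is exactly the action of an identity sub-block on the constrained degrees of freedom.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// After a matrix-free cell loop that ignores constrained degrees of freedom,
// their entries in dst are still zero. Pretending that the constrained
// sub-block of the matrix is the identity amounts to copying the source
// entries into the destination:
void fix_constrained_dofs(const std::vector<std::size_t> &constrained,
                          const std::vector<double>      &src,
                          std::vector<double>            &dst)
{
  for (const std::size_t i : constrained)
    dst[i] = src[i]; // identity on the constrained sub-block
}
```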
The next function is used to return entries of the matrix. Since this class is not intended to store the matrix entries, it would make no sense to provide access to all of them. However, diagonal entries are explicitly needed for the implementation of the Chebyshev smoother that we intend to use in the multigrid preconditioner. Therefore, this matrix is equipped with a vector that stores the diagonal.
Regarding the calculation of the diagonal, we expect the user to provide a vector with the diagonal entries (and we will compute them in the code below). We only need it for the level matrices of the multigrid method, not the system matrix (since we only need these diagonals for the multigrid smoother). Since we fill only the unconstrained entries, we have to set the constrained entries to one in order to avoid the same problems as discussed above.
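To make this concrete, here is a self-contained sketch (hypothetical names, not the deal.II API) that computes the diagonal of the 1D Laplace operator with linear elements on a uniform mesh in a matrix-free fashion: it accumulates the local diagonal contributions cell by cell and then sets the two constrained (Dirichlet boundary) entries to one as discussed above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Matrix-free computation of the diagonal of the 1D Laplace operator with
// linear elements: the local stiffness matrix (1/h) * [ 1 -1; -1 1 ] has
// diagonal entries 1/h, which are accumulated cell by cell. The two boundary
// dofs are constrained (homogeneous Dirichlet) and are set to one, matching
// the identity sub-block used in the matrix-vector product.
std::vector<double> laplace_diagonal(const std::size_t n_cells, const double h)
{
  std::vector<double> diag(n_cells + 1, 0.0);
  for (std::size_t cell = 0; cell < n_cells; ++cell)
    {
      diag[cell]     += 1.0 / h;
      diag[cell + 1] += 1.0 / h;
    }
  diag.front() = 1.0; // constrained (Dirichlet) entries set to one
  diag.back()  = 1.0;
  return diag;
}
```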
Eventually, we provide a function that calculates how much memory this class uses. We just need to sum up the memory consumption in the MatrixFree object and the memory for storing the other member variables. As a remark: In 3D and for Cartesian meshes, most memory is consumed for storing the vector indices on the local cells (corresponding to local_dof_indices). For general (non-Cartesian) meshes, the cached Jacobian transformation consumes most memory.
This class is based on the one in step-16. However, we replaced the SparseMatrix<double> class by our matrix-free implementation, which means that we can also skip the sparsity patterns. Notice that we define the LaplaceOperator class with the degree of the finite element as template argument (the value is defined at the top of the file), and that we use float numbers for the multigrid level matrices.
The class also has a member variable to keep track of all the time we spend on setting up the entire chain of data before we actually go about solving the problem. In addition, there is an output stream (disabled by default) that can be used to output details for the individual setup operations instead of only the summary that is printed by default.
When we initialize the finite element, we of course have to use the degree specified at the top of the file as well (otherwise, an exception will be thrown at some point, since the computational kernel defined in the templated LaplaceOperator class and the information from the finite element read out by MatrixFree will not match).
This is the function of step-16 with relevant changes due to the LaplaceOperator class. We do not use adaptive grids, so we do not have to compute edge matrices. Thus, all we do is implement Dirichlet boundary conditions through the ConstraintMatrix, set up the (one-dimensional) quadrature that should be used by the matrix-free class, and call the initialization functions.
In the process, we output data on both the run time of the program as well as on memory consumption, where we output memory data in megabytes (1 million bytes).
Next, initialize the matrices for the multigrid method on all the levels. The function MGTools::make_boundary_list returns, for each multigrid level, which degrees of freedom are located on a Dirichlet boundary; we force these DoFs to have value zero by adding a homogeneous constraint to the ConstraintMatrix object for each of them with ConstraintMatrix::add_line. Once this is done, we close the ConstraintMatrix on each level so it can be used to read out indices internally in the MatrixFree.
The assemble function is significantly reduced compared to step-16. All we need to do is to assemble the right hand side. That is the same as in many other tutorial programs. In the end, we condense the constraints from Dirichlet boundary conditions away from the right hand side.
Here is another assemble function. Again, it is simpler than assembling matrices. We need to compute the diagonal of the Laplace matrices on the individual levels, send the final matrices to the LaplaceOperator class, and we need to compute the full matrix on the coarsest level (since that is inverted exactly in the deal.II multigrid implementation).
The solution process again looks like step-16. We now use a Chebyshev smoother instead of SOR (SOR would be very difficult to implement because we do not have the matrix elements available explicitly, and it is difficult to make it work efficiently in parallel). The multigrid classes provide a simple interface for using the Chebyshev smoother, which is defined in a preconditioner class: MGSmootherPrecondition.
Then, we initialize the smoother with our level matrices and the mandatory additional data for the Chebyshev smoother. We use quite a high degree here (6), since matrix-vector products are comparably cheap and more parallel than the level-transfer operations. We choose to smooth out a range of \([1.2 \hat{\lambda}_{\max}/10,1.2 \hat{\lambda}_{\max}]\) in the smoother, where \(\hat{\lambda}_{\max}\) is an estimate of the largest eigenvalue. In order to compute that eigenvalue, the Chebyshev initialization performs a few steps of a CG algorithm without preconditioner. Since the highest eigenvalue is usually the easiest one to find and a rough estimate is enough, we choose 10 iterations.
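For illustration, the following self-contained sketch implements the classical three-term Chebyshev recurrence on a small model problem, the tridiagonal matrix tridiag(-1, 2, -1), whose eigenvalues lie in (0, 4). deal.II's actual smoother additionally preconditions with the matrix diagonal and estimates \(\hat{\lambda}_{\max}\) itself, so all names and parameters here are illustrative. Smoothing the range \([\lambda_{\max}/10, \lambda_{\max}]\) means the recurrence damps precisely the eigenmodes in that interval, which is what makes it a good smoother.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// y = A x for the model matrix A = tridiag(-1, 2, -1).
Vec tridiag_vmult(const Vec &x)
{
  const std::size_t n = x.size();
  Vec y(n);
  for (std::size_t i = 0; i < n; ++i)
    y[i] = 2.0 * x[i] - (i > 0 ? x[i - 1] : 0.0) - (i + 1 < n ? x[i + 1] : 0.0);
  return y;
}

// Classical Chebyshev iteration for A x = b, damping error components whose
// eigenvalues lie in [lambda_min, lambda_max] (three-term recurrence).
void chebyshev_smooth(Vec &x, const Vec &b,
                      const double lambda_min, const double lambda_max,
                      const unsigned int steps)
{
  const double theta  = 0.5 * (lambda_max + lambda_min); // center of range
  const double delta  = 0.5 * (lambda_max - lambda_min); // half width
  const double sigma1 = theta / delta;
  double       rho    = 1.0 / sigma1;

  const std::size_t n = x.size();
  Vec r(n), d(n);
  const Vec Ax = tridiag_vmult(x);
  for (std::size_t i = 0; i < n; ++i)
    {
      r[i] = b[i] - Ax[i]; // initial residual
      d[i] = r[i] / theta; // initial update direction
    }
  for (unsigned int k = 0; k < steps; ++k)
    {
      for (std::size_t i = 0; i < n; ++i)
        x[i] += d[i];
      const Vec Ad = tridiag_vmult(d);
      for (std::size_t i = 0; i < n; ++i)
        r[i] -= Ad[i];
      const double rho_next = 1.0 / (2.0 * sigma1 - rho);
      for (std::size_t i = 0; i < n; ++i)
        d[i] = rho_next * rho * d[i] + 2.0 * rho_next / delta * r[i];
      rho = rho_next;
    }
}
```

Applied to a highly oscillatory right-hand side, a few sweeps reduce the residual strongly, while the smooth error components are left for the coarser multigrid levels to handle.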
Finally, write out the memory consumption of the Multigrid object (or rather, of its most significant components, since there is no built-in function for the total multigrid object), then create the solver object and solve the system. This is very easy, and we do not even see any difference in the solve process compared to step-16. The magic is all hidden behind the implementation of the LaplaceOperator::vmult operation. Note that we print out the solve time and the accumulated setup time to standard output in any case, whereas detailed times for the setup operations are only printed if the detail_times flag in the constructor is changed.
Here is the data output, which is a simplified version of step-5. We use the standard VTU (= compressed VTK) output for each grid produced in the refinement process.
The function that runs the program is very similar to the one in step-16. We perform fewer refinement steps in 3D compared to 2D, but that's it.
The main function
This is as in most other programs.
Since this example solves the same problem as step-5 (except for a different coefficient), there is little to say about the solution. We show a picture anyway, illustrating the size of the solution through both isocontours and volume rendering:
Of more interest is to evaluate some aspects of the multigrid solver. When we run this program in 2D for quadratic ( \(Q_2\)) elements, we get the following output:
As in step-16, we see that the number of CG iterations remains constant with increasing number of degrees of freedom. We can also see that the various objects we have to store for the multigrid method on the individual levels of our mesh together take up more than twice as much memory as the matrix on the finest level. For the present example, about half the memory consumption of the multigrid objects is due to the level transfer matrices, and the other half is consumed by the matrix-free objects (and there, mainly the indices and the variable coefficient).
Not much changes if we run the program in three spatial dimensions, with the exception that the multilevel objects now take up some more memory (because the level transfer matrices are denser) and the computing times are somewhat larger:
In order to understand the capabilities of the matrix-free implementation, we compare the performance of the 3d example above with a SparseMatrix implementation and measure the computation times for both the initialization of the problem (distribute DoFs, set up and assemble matrices, set up multigrid structures) and the actual solution, for the matrix-free variant and the variant based on sparse matrices. We base the preconditioner on float numbers and the actual matrix and vectors on double numbers, as shown above. Tests are run on an Intel Core i7-2620M notebook processor (two cores and AVX support, i.e., four operations on doubles can be done with one CPU instruction, which is heavily used in FEEvaluation), in optimized mode. The example makes use of multithreading, so both cores are actually used.
n_dofs       Sparse matrix                  Matrix-free implementation
             Setup + assemble    Solve      Setup + assemble    Solve
125          0.0048s             0.00075s   0.0025s             0.00033s
729          0.014s              0.0022s    0.0026s             0.0018s
4,913        0.10s               0.012s     0.017s              0.013s
35,937       0.80s               0.14s      0.11s               0.071s
274,625      5.93s               1.05s      0.82s               0.51s
2,146,689    46.7s               8.44s      6.39s               3.77s
The table clearly shows that the matrix-free implementation is twice as fast for the solver, and more than six times as fast when it comes to initialization costs. As the problem size is made a factor 8 larger, we note that the times usually go up by a factor of eight, too (as the solver iterations are constant at 5). There are two deviations. The first is in the sparse matrix between 5k and 36k degrees of freedom, where the time increases by a factor of 12. This is the threshold at which the cache in the processor can no longer hold all the data necessary for the matrix-vector products, and all matrix elements must be fetched from main memory. The second deviation is the times for the matrix-free solve, which increase by less than a factor of 8. This is because of more parallelism from more cells, exploited by the (involved) dynamic task scheduling approach taken in the cell loop of the MatrixFree class. Note that about 30% of the time in the matrix-free solver is spent on restriction and prolongation, which use sparse matrices. So the speedup could be even better if all parts were done efficiently.
Of course, this picture does not necessarily translate to all cases, as there are problems where knowledge of matrix entries enables much better solvers (as happens when the coefficient varies more strongly than in the above example). Moreover, it also depends on the computer system. The present system has good memory performance, so sparse matrices perform comparably well. Nonetheless, the matrix-free implementation gives a nice speedup already for the \(Q_2\) elements used in this example. This becomes particularly apparent for time-dependent or nonlinear problems where sparse matrices would need to be reassembled over and over again, which becomes much easier with this class. And of course, thanks to the better complexity of the products, the method gains increasingly larger advantages when the order of the elements increases (the matrix-free implementation has costs \(4d^2p\) per degree of freedom, compared to \(2p^d\) for the sparse matrix, so it will win anyway for order 4 and higher in 3d).
Above, we have shown figures for second-order finite elements. Our implementation gains more compared to sparse matrices if higher-order elements are used. However, FE_Q elements with equidistant nodes are badly conditioned as the order increases. In this case, the smoother and the multigrid solver break down. Clustering the nodes close to the element boundaries resolves this problem (and the multigrid solver converges in 5 or 6 iterations also for very high order). Elements with this property are the Gauss-Lobatto FE_Q elements, which are presented in step-48.