AUTOMATIC PARALLEL OCTREE GRID GENERATION SOFTWARE WITH AN EXTENSIBLE SOLVER FRAMEWORK AND A FOCUS ON URBAN SIMULATION

The development of an automatic, dynamic, parallel, Cartesian, linear forest-of-octree grid generator and partial differential equation (PDE) solver framework is presented. This research is bundled into an application programmed in C++ which uses MPI for distributed parallelism. The application is named paros, which stands for PARallel Octree Solver. In its current implementation, the application provides a 'zeroth' order representation of the target geometry, and as such, no cut-cell algorithm, projection method, or immersed boundary condition is implemented. In this case, 'zeroth' order means that no geometry is ever exactly represented in the final computational mesh: an octree element is either completely in the domain or entirely outside of it. Any element that contains or is intersected by a geometry facet is removed from the final mesh, which results in a 'blocky' or 'stepped' geometry representation and simplifies boundary computations. The computational octree mesh creation is completely parallel and automated. The algorithm is dynamic in the sense that the mesh is repartitioned dynamically throughout the grid generation process to maintain optimal load balancing during all phases of the mesh genesis. A linear octree data structure is used to store the octree mesh elements and is leveraged for optimal load balancing. An additional hierarchical octree is used to significantly improve algorithms that suffer under this linear storage paradigm. This work focuses on, but is not limited to, applications related to urban simulations and may be applied to plume/contaminant propagation. Within the PDE solution framework, a cell-centered, incompressible, unsteady Navier-Stokes solver with an energy term to account for thermal buoyancy is implemented.

There is considerable interest in the role CFD can play in optimizing responses to such events [1][2][3][4][5]. It is valuable to be able to predict where and how a potentially toxic contaminant might spread once released into a city environment in order to provide first responders the means to perform focused disaster mitigation.
This type of application does not have to be limited to nefarious events; city planners may be interested in air quality control as it relates to pollution [6][7][8][9][10][11][12]. For example, Huber et al. [9] use FLUENT to simulate the events of September 11, 2001 in Manhattan in order to study the transport and dispersion of smoke and particles into the surrounding metropolitan area. Additionally, Hanna et al. [10] compare and contrast the simulations from five different CFD codes in the Madison Square Garden area of Manhattan, New York. Four of these CFD models were used to plan the Department of Homeland Security's Urban Dispersion Program MSG05 experiment [13].
Numerous models and software packages exist that specialize in performing urban simulations. Eichhorn created MISKAM [14,15], and Los Alamos National Laboratory developed the Quick Urban & Industrial Complex (QUIC) Dispersion Modeling System [16]. QUIC is a fast-response urban dispersion model that may run on a laptop; it combines a three-dimensional wind field model, a transport and dispersion model, and a pressure solver into an application capable of running neighborhood-scale simulations in minutes. It is also able to simulate chemical, biological, and radiological agent dispersion. The Institute of Meteorology and Climatology at the Leibniz Universität in Hannover, Germany has developed PALM: A PArallelized Large-Eddy Simulation Model for Atmospheric and Oceanic Flows [17,18]; while it is not designed specifically for urban simulations, it has been used to simulate large swaths of Japan by Kanda et al. [19]. Also, Keck et al. [20,21] used PALM to simulate the city of Macau, China using a 6 km by 5 km computational domain with a minimum spacing of one meter. A finite volume application called CFD-Urban, based on CFD-ACE+ [22], was developed by Coirier et al. [23]. Löhner created FEFLO-CFD [24], and Chan and Leach at Lawrence Livermore National Laboratory developed FEM3MP [25]; these are two finite element applications for urban simulation.
The Defense Science and Technology Laboratory in the UK created the Urban Dispersion Model (UDM) [26][27][28]. The Defense Threat Reduction Agency in the United States developed the Urban Hazard Prediction Assessment Capability (Urban HPAC) transport and dispersion model [29,30].

Literature Review
This research is mainly focused on the generation of urban scale computational meshes.
As such, the particulars of the various approaches these software models take with regard to the physics simulation will not be emphasized in this review. What follows is an overview of how some of the aforementioned software models address the difficulties associated with the generation of the computational grid. PALM [18] uses a Cartesian topography based on the mask method described by Briscolini and Santangelo [31]; the spatial discretization is uniform in the horizontal direction but allows for vertical stretching. PALM uses 2.5-dimensional topographical input, e.g., digital elevation model (DEM) data, which means that building geometries are simply represented by building outlines or footprints and an associated height. Using this input data, the mesh elements are separated into elements either 100% interior to the flow domain (fluid elements) or 100% exterior (obstacle elements); thus, the computational domain comprises three unique sub-domains: elements without boundary faces (interior elements), elements with boundary faces (next to an obstacle) within which wall functions are used, and elements within an obstacle that are excluded from any computations. This implies that buildings are only approximated in the mesh and precludes exact discretization of any building, especially those with holes, overhangs, or any significant three-dimensionality. The computational domain is parallelized by a uniform two-dimensional domain decomposition in the horizontal direction with equally sized sub-domains. MISKAM [14,15] also uses a Cartesian aligned hexahedral mesh to define its domain and, like PALM, buildings and obstructions are blocked out such that elements are either entirely within or entirely outside the computational mesh, resulting in similarly blocky discretizations. However, unlike PALM, MISKAM allows for three-dimensionality, e.g., spires, domes, or sloped roofs, and holes/flow-through areas in buildings, e.g., arcades or passageways, as well as anisotropy in all three dimensions.
The Windows software, WinMISKAM [32], is a wrapper around the core MISKAM model and provides a graphical user interface (GUI) that enables digitizing of building geometry from bitmaps as well as an interactive meshing interface to aid in the creation of the mesh. QUIC [16] also uses Cartesian grids to discretize its domain; the mesh is uniform in the horizontal directions but may be non-uniform in the vertical direction. QUIC also allows for a coarser outer grid which may be used to extend the domain well outside of the neighborhood of the building geometry without having to maintain the same refinement in the farfield; it is unclear, however, exactly how this is implemented. There are two ways to define the building geometry: via a 'City Builder' GUI interface or using an ESRI ArcGIS shapefile [33]; there is support for vegetation and buildings with courtyards as well as experimental support for parking garages and bridges. Like many other similar applications, a QUIC mesh element is either completely in or completely out of the computational domain, which results in non-grid-aligned buildings being stair-stepped. QUIC is one of the few applications (among those that do not use a pre-built mesh) that support importing terrain data instead of assuming a flat ground; implementation details are unclear, however. CFD-Urban [23] provides two different meshing solutions: unstructured prismatic extrusion and prismatic quadtree based approaches. Cityscape geometry is generated directly from geographical information system (GIS) data, and buildings may be either explicitly resolved in the mesh or implicitly modeled by introducing drag source terms into the equations within cells that would contain buildings. The prismatic quadtree/octree mesh approach, explained by Coirier and Kim [34], creates a single root Cartesian element spanning the domain of interest.
This root element is then sub-divided in the horizontal directions only in a quadtree manner until a specified element size is reached within a target region of the domain. A 2:1 balance is enforced to keep the mesh 'smooth' by requiring that no element is ever more than twice the size of any of its neighboring elements. Upon completion of this quadtree refinement, the root octree cell contains only one element in the vertical direction; these cells are then recursively refined in the vertical direction until the desired number of layers in the z-direction has been obtained. These layers are subsequently mapped to a vertical clustering in order to properly resolve the buildings and atmospheric boundary layer (ABL). This results in what can be most easily thought of as an extruded quadtree mesh. Using this meshing approach, buildings are modeled implicitly by computing element porosity using the provided GIS data; elements with a porosity under a certain threshold are eliminated from the mesh. CFD-Urban provides optional terrain mapping functionality as well, where the lowest vertical plane in the grid may be projected onto the underlying terrain, which may be provided using several standard digital elevation data formats. Two different element mapping approaches are available: a displacement method and a compression method. The displacement method simply displaces the vertical stack of elements a distance prescribed by the terrain data at the current (x, y) location, whereas the compression method keeps the uppermost vertical plane at its constant height and compresses the elements between the terrain and the upper boundary. Coirier and Kim claim that the displacement method is preferred because the compression method reduces mesh quality in terms of skew angle [34]. FEFLO-Urban [24] uses unstructured body-fitted tetrahedral meshes or, for large urban simulations, models the building geometry using an embedded boundary approach [35] on unstructured tetrahedral meshes. Camelli et al.
[36] describe its use in both modes: a body-fitted tetrahedral mesh was used to model the flow and dispersion patterns around Tysons Corner in Fairfax County, Virginia, while a much larger area around Madison Square Garden in Manhattan was simulated using the embedded boundary paradigm to dramatically reduce the time-to-simulation. FEM3MP [25,37] offers a 'simplified CFD approach' wherein there is an option to explicitly resolve a select few important buildings, or none at all, and virtually model the remaining buildings with drag forces implemented as sink terms in the mean momentum equations [25]. The underlying mesh is a structured mesh that may be stretched and distorted and is relatively coarse in regions containing the virtual building geometry. The immersed boundary grid generator TOMMIE [38] generates potentially anisotropic Cartesian meshes from geometry in ASCII stereolithography (STL) format and has been used in urban-scale simulations of downtown Oklahoma City [39,40]. Meteodyn [41] offers the commercial software UrbaWind [42,43], which claims to provide automatic, flow-aligned mesh generation and refinement within a parallel unstructured multigrid solver framework. Technical details are not readily available since this is a commercial code, but it appears to use hierarchical meshes and perhaps some sort of immersed boundary condition for the building geometries. See Fahssis et al. [44] and Kalmikov et al. [45,46] for examples of the use of UrbaWind in the literature.

Motivation
There are many difficulties that must be overcome to reliably and accurately perform CFD simulations at the urban environment scale, not the least of which is adequately modeling the cityscape geometry. One of the biggest problems one faces when attempting to simulate a CFD problem at a cityscape scale is the generation of the computational mesh. It is a non-trivial task to accurately model a cityscape geometry in a manner amenable to the generation of quality volume meshes. In fact, manual generation of grids on these types of geometries can easily take weeks, even months. To further exacerbate the problem, viscous meshes adequate to accurately model viscous stresses on solid surfaces at a cityscape scale strain the ability of most grid generation and CFD software in terms of memory as well as CPU requirements. This is the case even in today's high performance computing (HPC) environments.
However, in many cityscape scale simulations, the CFD practitioner may only be interested in the macro-behavior of the flow physics. The micro-behavior of the flow, while it does play a significant role in the macro-behavior, may not be the primary interest of the simulation. This may be the case when considering how a contaminant propagates throughout a city landscape and when the primary interest lies in knowing approximately where a contaminant goes, especially when evacuation considerations are of primary concern. In this case, it is less important that the micro-physics be modeled accurately and more important that the macro-physics are approximately correct. When this is the case, certain liberties may be taken, both with regard to the mesh generation, and the CFD implementation.
Regarding mesh generation, one of the easiest aspects that may be leveraged toward this end is the viscous grid spacing; if the accuracy of the micro-physics may be sacrificed to gain a computational edge, the boundary layer may be under-resolved or not resolved at all. In recent simulations involving Chattanooga, TN and Washington, DC (see Nichols et al. [47,48]), the computational mesh had a nominal viscous spacing of around one meter on all viscous surfaces.
This research is mainly interested in 'zeroth' order geometric accuracy; this does not imply that the spatial and temporal accuracy of the solution algorithms is poor. In fact, the spatial and temporal accuracy are at least first order and up to second order in the current implementation of this research. What 'zeroth' order geometric accuracy does mean is that the geometry may only be approximated in the mesh. The primary goal of the application is to approximate the macro-physics in lieu of resolving the micro-physics. With this in mind, the boundary layer mesh may be exceedingly under-resolved or even non-existent as long as appropriate steps are taken to ensure that the viscous flow physics are accounted for approximately. Furthermore, one could even imagine a scenario where the accurate representation of the cityscape geometry could be sacrificed for a 'close-enough' representation. Given these considerations, one attractive solution to the geometrically 'zeroth' order meshing problem is non-conformal, octree style meshes.
Octree style meshes are not a new technology and date back to at least 1984 [50,51]. Other early work, although technically not tree-based, includes that of Quirk [52][53][54] and Berger [55].
Furthermore, numerous successful applications that solve partial differential equations (PDEs) on octree-style meshes have been developed [74]. Most of the aforementioned software, however, is designed to be geometry conforming and must take appropriate steps to accurately represent the specified geometry using unstructured near-body meshes, projection methods, or other techniques.
Optimally parallelizing the octree mesh generation process is a separate problem in and of itself, and several approaches have been taken to achieve this goal. One popular method used to obtain optimally parallelized and load balanced octrees is the use of a 'linear' octree storage paradigm. Linear octrees store only the leaf elements of the tree in linear arrays. Space-filling curves, like Morton or Z-order curves and Hilbert curves, are often used to convert the multidimensional octree data into one-dimensional arrays; to obtain ideal parallelization, these arrays are split evenly between all available processors. This results in functionally perfect load balancing in terms of elements-per-processor; each processor has the same number of elements, plus or minus one. Additionally, where appropriate, weighting schemes may be used to give certain elements more or less importance, and the sum of the weights may be distributed evenly among the processors in order to achieve load balancing by a different metric. While linear octrees are extremely attractive from the parallelization point of view, they do present a problem in contrast with their hierarchical counterparts: the ability to traverse the tree hierarchically is lost. In this regard, linear octrees are inferior to hierarchical octrees, as tree traversal is significantly slower. Hierarchical octrees, on the other hand, offer fast traversal, but parallelizing these data structures is much more difficult. Karman and Betro [61] obtained distributed parallelism on hierarchical octree meshes by uniformly refining a single root octree, or several root octrees, on a single processor until there exist at least as many octants as the target number of parallel processes. In the event that there are exactly as many octants as parallel processes, each processor is assigned one of the resulting octants.
If the ratio of leaf octants to processors is exactly two or four, each group of eight leaf octants having the same parent is divided between four or two processors, respectively; otherwise, METIS [75,76] is used to distribute the available octants. This is an acceptable way to generate an initial partitioning of a hierarchical octree data structure, but it does not address the all-but-inevitable load balancing issues that will arise upon further mesh refinement.
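The space-filling-curve ordering and even element split described above can be sketched in a few lines of C++. This is an illustrative sketch only (the function names are hypothetical, and this is not the implementation used by any of the cited software): a Morton key is built by interleaving the bits of the three octant coordinates, and the plus-or-minus-one split follows directly from integer division.

```cpp
#include <cassert>
#include <cstdint>

// Interleave the low 10 bits of x, y, and z into a 30-bit Morton (Z-order)
// key; sorting octants by this key linearizes the octree along the curve.
uint64_t morton3d(uint32_t x, uint32_t y, uint32_t z) {
    uint64_t key = 0;
    for (int i = 0; i < 10; ++i) {
        key |= (uint64_t)((x >> i) & 1) << (3 * i);
        key |= (uint64_t)((y >> i) & 1) << (3 * i + 1);
        key |= (uint64_t)((z >> i) & 1) << (3 * i + 2);
    }
    return key;
}

// Number of octants owned by `rank` when n octants are split across p
// ranks: every rank gets n/p, and the first n%p ranks get one extra,
// so counts differ by at most one.
int64_t octants_on_rank(int64_t n, int rank, int p) {
    return n / p + (rank < n % p ? 1 : 0);
}
```

For example, with ten octants on three ranks, the counts come out 4, 3, 3: functionally perfect load balancing in the elements-per-processor sense.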
While the efficient, automatic, and parallel generation of an octree style computational mesh is a significant part of this research, it is not the only aspect that must be considered. The other major aspect of this work is the application of the resulting mesh to the simulation of large urban environments. Thus, the physics of the problem must be identified and analyzed. At the time of this writing, the National Centers for Environmental Information (NCEI) at the National Oceanic and Atmospheric Administration (NOAA) [77] reports that the windiest location in the U.S. over a 30-year average is Mount Washington, NH, with a mean wind speed of 35.1 mph.
However, the windiest cities reported have much lower 30-year means; the top three cities are Dodge City, KS, Amarillo, TX, and Cheyenne, WY, with mean wind speeds of 13.9, 13.5, and 12.9 mph, respectively. These data were last updated in 2008 [78]. This type of low-speed fluid flow can be considered functionally incompressible; however, buoyancy effects should be considered because pavement, concrete, and buildings can be significantly hotter than the air surrounding them.
Turbulence must also be taken into consideration and an appropriate turbulence model chosen that adequately resolves the physics without introducing excessive overhead. Therefore, the goal of this research is to develop a parallel, geometrically zeroth order, incompressible Navier-Stokes (NS) solver using an automatically generated Cartesian octree grid that approximates the urban geometry as discussed throughout this introduction. This application takes advantage of the open source software p4est [63], which provides a framework for parallel octree grid generation upon which a parallel Navier-Stokes solver will be built.

Outline
The remainder of this dissertation is organized as follows. In Chapter 2 the techniques, algorithms, and design choices that were considered and implemented during the development of the automatic, parallel, linear forest-of-octree grid generator are explained and demonstrated.
In Chapter 3 the governing equations are presented and the solution methodology explained. In Chapter 4 various validation cases are presented in order to instill confidence in the quality of the implementation of the research. Furthermore, several large-scale urban demonstration cases are shown in order to exhibit the practical applicability of the code. Finally, in Chapter 5, the research is summarized and recommendations for future work are given.

GRID GENERATION METHODOLOGY
This chapter explains the methods used to automatically generate the parallel octree grid and discusses various aspects of the procedure that pose difficulties or present significant benefits in the overall design process.

Grid Generation With p4est
The open source p4est library [63] is employed in this project as the framework for the parallel octree grid generation tool. The p4est library generates a 'forest-of-octrees' mesh by splitting the computational domain into multiple single octrees via a provided connectivity deemed the macromesh or macrostructure. Each of the macroelements in the macromesh is itself a single octree, deemed the micromesh or microstructure, and may be arbitrarily refined independently of its neighboring octree elements. Figure 2.1 demonstrates what a simple macromesh comprising eight trees might look like, as well as an example of a corresponding micromesh. The macromesh, or connectivity, is allowed to take on almost any form as long as each macroface and macroedge common between two neighboring macroelements is shared completely or not at all; this implies that any structured mesh may be used as a connectivity for p4est. In this way, p4est allows for very general connectivities, providing functionality for periodicities and rotations in the connectivity and even degeneracies, e.g., a hexahedron with a collapsed face. This generality makes it possible to represent very general computational domains. Notably, p4est stores each octree linearly: only the leaf octants are kept, with no ancestor information retained [63]. This distinction concerning the octree storage paradigm is important and leads to some very distinct advantages, as well as some disadvantages, which will be discussed later.
The parallel partitioning of the forest is trivially accomplished by simply dividing the linear array of octants into p segments, where p corresponds to the number of parallel partitions; this results in an optimal partitioning such that every process owns the same number of octants, plus or minus one. The partitioning may also be weighted to give certain elements more importance/weight during the partitioning. The p4est library is written in the C programming language, is designed with distributed parallelism in mind using MPI, and provides the following high-level functionality, discussed at length by Burstedde et al. [63]:
• New: Create a new forest.
• Refine: Refine the forest based on user defined callbacks.
• Coarsen: Coarsen the forest based on user defined callbacks.
• Partition: Repartition the octants evenly, or by user-defined weights, across all processes.
• Balance: Enforce the 2:1 size condition between neighboring octants.
• Nodes: Create unique global node numbers.
Note that New, Refine, and Coarsen are processor local and do not require communication, whereas the remaining functions require varying degrees of parallel communication.
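The even and weighted partitionings of the linear octant array can be sketched as follows. The helper name is hypothetical and this is not the p4est API; the idea is simply to cut the running sum of octant weights into p equal shares, which reduces to the plus-or-minus-one element split when all weights are equal.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assign each octant in the space-filling-curve-ordered array to a rank by
// cutting the prefix sum of weights into p equal shares. Each octant is
// assigned by the midpoint of its weight interval so shares stay balanced.
std::vector<int> weighted_partition(const std::vector<double>& w, int p) {
    double total = 0.0;
    for (double wi : w) total += wi;
    std::vector<int> owner(w.size());
    double cum = 0.0;
    for (std::size_t i = 0; i < w.size(); ++i) {
        double mid = cum + 0.5 * w[i];       // midpoint of octant i's share
        int r = (int)(mid * p / total);
        owner[i] = r < p ? r : p - 1;        // guard against rounding at the end
        cum += w[i];
    }
    return owner;
}
```

With ten unit-weight octants on two ranks this yields a 5/5 split; with weights {3, 1, 1, 1} the first (heavy) octant alone balances the remaining three.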

Connectivity/Macromesh
For the current project, a connectivity spanning the computational domain must be provided to p4est by the user. However, only very simple connectivities are required for this application and, as such, none of the exceedingly generic connectivities discussed in the previous section are ever required. The only connectivities necessary for this project are very simple structured connectivities which may be provided in a few different ways: by providing a structured 'skeleton' mesh, a bounding box around the geometry, or explicitly specifying the forest dimensions. The bounding box approach is the main modus operandi and results in isotropic elements only; however, manually specifying the forest dimensions gives finer control over the number of trees in the forest and additionally allows for anisotropic mesh generation. This manual approach simply takes the bounding box extracted from the geometry and splits it into the desired number of trees in each of the coordinate directions. Lastly, supplying a structured mesh as a 'skeleton' for p4est is a useful way to provide the connectivity when the geometry is planar and Cartesian aligned. Using a structured mesh in this way guarantees that the geometry is exactly represented in the final grid and is useful mostly for simple test cases like a flat plate, cavity cases, step cases, etc. when the geometry must not be approximated and anisotropic grids are required. Grids generated in this way take as input a structured ASCII AFLR3 UGRID [80] mesh; each hexahedron in the UGRID mesh becomes a single octree in the forest-of-octrees grid. Internally, the UGRID mesh is converted into the connectivity required by p4est and defines the computational domain. These structured UGRID files may be anisotropic but must be planar and Cartesian aligned. In this manner, all elements in the connectivity will exist in the final octree mesh.
In cases where it is not possible to use a planar, Cartesian aligned skeleton mesh for the connectivity, i.e., for most real cases, bounding box approaches must be used. These methods require that the discretized watertight geometry be provided via an Xpatch [81] facet file. However, any watertight geometry representation could in theory be used if the geometry library was extended to support it. The bounding box is extracted from the provided geometry representation and is subsequently subdivided into some number of isotropic or anisotropic hexahedra to create a 'brick' connectivity for p4est. When isotropic elements are desired, the bounding box is tweaked/stretched in any of the three directions if necessary such that the resulting connectivity will be isotropic. It should be noted, though, that the bounding box is never shrunk to guarantee isotropy; it is only stretched. This assures that the provided geometry is always present in the resulting octree grid.
However, if isotropic elements are not necessary, the generation of the connectivity is easier; the desired forest dimensions may be specified at runtime, and the bounding box is simply subdivided into said dimensions.
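One plausible scheme for the isotropic bounding-box treatment described above can be sketched as follows. All names and the exact policy are assumptions for illustration: the tree edge length h is fixed from the shortest box extent and a requested tree count k along that axis, and every other extent is stretched (never shrunk) up to an integer multiple of h.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>

struct Brick {
    std::array<int, 3> n;        // number of trees in each direction
    std::array<double, 3> extent; // stretched box extents (>= input extents)
    double h;                     // isotropic tree edge length
};

// Build an isotropic brick connectivity from bounding-box extents L,
// with k trees along the shortest axis. Extents are only ever stretched,
// so the input geometry always remains inside the resulting brick.
Brick isotropic_brick(std::array<double, 3> L, int k) {
    double lmin = std::min({L[0], L[1], L[2]});
    Brick b;
    b.h = lmin / k;
    for (int d = 0; d < 3; ++d) {
        b.n[d] = (int)std::ceil(L[d] / b.h - 1e-12); // tolerate exact multiples
        b.extent[d] = b.n[d] * b.h;                  // >= L[d]: stretched only
    }
    return b;
}
```

For a 3.5 x 2 x 2 box with two trees along the shortest axis, h = 1 and the x extent is stretched from 3.5 to 4.0 so the brick remains isotropic.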
The main disadvantage of these bounding box approaches is that the geometry must be provided as a watertight discretization; for cityscape geometries, this may pose a significant problem. One way that this could be significantly alleviated is by extracting the geometry from available geographic information system (GIS) data like digital elevation model (DEM), triangulated irregular network (TIN), shapefile, keyhole markup language (KML), or GeoTIFF data, to name just a few. PALM [17,18], QUIC [16], and MISKAM [14,15] all provide functionality to generate their grids this way. One disadvantage of this approach, however, is that significant building three-dimensionality may not be properly represented in these formats, so buildings with curved surfaces or complicated architecture may be difficult or impossible to represent or even approximate.

Refinement of the Micromesh
Once the brick connectivity has been created using the bounding box, the forest-of-octrees mesh must be refined to capture the geometry. This refinement is achieved by iterating over the forest using the refinement function provided by p4est, which itself calls a custom geometry refinement callback function. For each element in the forest, the callback function determines if the element contains or is intersected by any of the geometry facets. If so, the element is refined if it has not yet reached the maximum allowed refinement level defined by the user. The approach p4est employs to achieve this is different from what is normally used to refine an octree since the p4est grid is stored linearly, i.e., only octree leaves are stored, and no information about ancestors/descendants is kept.
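A refinement decision of the kind described above might be sketched with a conservative bounding-box overlap test. All names here are illustrative, and the actual intersection test in the code may be more precise (e.g., exact triangle-box intersection); overlapping axis-aligned boxes are simply a cheap, conservative stand-in.

```cpp
#include <array>
#include <cassert>
#include <vector>

struct Box { std::array<double, 3> lo, hi; };

// Conservative axis-aligned overlap test: two boxes are disjoint exactly
// when they are separated along at least one coordinate axis.
bool boxes_overlap(const Box& a, const Box& b) {
    for (int d = 0; d < 3; ++d)
        if (a.hi[d] < b.lo[d] || b.hi[d] < a.lo[d]) return false;
    return true;
}

// Hypothetical callback logic: refine an octant if it has not reached the
// maximum level and its box overlaps the box of at least one geometry facet.
bool should_refine(const Box& octant, int level, int max_level,
                   const std::vector<Box>& facet_boxes) {
    if (level >= max_level) return false;
    for (const Box& f : facet_boxes)
        if (boxes_overlap(octant, f)) return true;
    return false;
}
```

Note that, as the surrounding text explains, a naive callback of this form must scan every facet for every leaf visited, which is what the optimizations below are designed to avoid.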
The more traditional approach, which takes advantage of the hierarchical octree data structure, is to pass facets one at a time down the tree hierarchically to determine which element(s) contain them; this hierarchical traversal is optimal for searching an octree structure but is difficult to parallelize.
On the other hand, linear octree data structures are trivial to parallelize by dividing the linear array of leaf octants among the available processors. This functionality comes, however, at the cost of a less efficient, more complicated traversal [82]. Because no ancestor information is stored, each leaf element must in principle be tested against every geometry facet; there is no hierarchy with which to eliminate subsets of facets that are not contained by 'parent' elements. The p4est approach is clearly sub-optimal and even prohibitive for large geometries. There are, however, a few things that can be done to make this inefficiency less problematic. One involves dynamic repartitioning during the refinement process, and the other uses a temporary hierarchical octree to store geometry facets for rapid retrieval. These optimizations are explained below.
There are two refinement modes in p4est: recursive and non-recursive. The recursive mode will refine an element and also consider the eight newly created children elements for refinement recursively, whereas the non-recursive mode will replace an element with its eight children elements but will not consider any of the newly created elements for refinement. Each of these approaches has certain advantages, but the non-recursive mode is preferred for geometry refinement for one important reason: it allows for dynamic re-partitioning after each single non-recursive refinement pass. This eliminates potentially large load imbalances caused by certain processors having refined their portions of the forest more than others. Experience during this research shows that this non-recursive approach can be more than an order of magnitude faster than its recursive alternative during geometry refinement.
This application further alleviates the octree traversal problem by creating a single, relatively coarse, hierarchical octree that stores the entire geometry. This intermediate octree may be used to quickly generate a list of facets contained in a specific spatial area. The p4est refinement callback uses this hierarchical geometry octree each time it visits a new element to quickly generate a list of facets spatially local to the current element, thus greatly reducing the number of facets it must query for intersection. This procedure could be improved by creating subsets of processor local facets, so that no processor ever needs to process facets outside of its spatial partition.
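The hierarchical geometry octree can be illustrated with a simplified, fixed-depth sketch (all names hypothetical, with facet bounding boxes standing in for the facets themselves): facets are pushed down into every child whose box they overlap, and a query for an element's box then touches only the branches that spatially overlap it.

```cpp
#include <array>
#include <cassert>
#include <vector>

struct Box { std::array<double, 3> lo, hi; };

bool overlap(const Box& a, const Box& b) {
    for (int d = 0; d < 3; ++d)
        if (a.hi[d] < b.lo[d] || b.hi[d] < a.lo[d]) return false;
    return true;
}

struct Node {
    Box box;
    std::vector<int> facets;   // facet ids, stored at the leaves
    std::vector<Node> children; // empty for a leaf
};

// Push facet `id` (with bounding box fb) down `depth` levels, splitting
// nodes lazily into octants; a facet lands in every overlapping leaf.
void insert_facet(Node& n, int id, const Box& fb, int depth) {
    if (depth == 0) { n.facets.push_back(id); return; }
    if (n.children.empty()) {
        for (int i = 0; i < 8; ++i) {   // split the node's box into octants
            Box c;
            for (int d = 0; d < 3; ++d) {
                double mid = 0.5 * (n.box.lo[d] + n.box.hi[d]);
                bool high = (i >> d) & 1;
                c.lo[d] = high ? mid : n.box.lo[d];
                c.hi[d] = high ? n.box.hi[d] : mid;
            }
            n.children.push_back(Node{c, {}, {}});
        }
    }
    for (Node& c : n.children)
        if (overlap(c.box, fb)) insert_facet(c, id, fb, depth - 1);
}

// Collect the facets of every branch that spatially overlaps the query box.
void query(const Node& n, const Box& q, std::vector<int>& out) {
    if (!overlap(n.box, q)) return;
    out.insert(out.end(), n.facets.begin(), n.facets.end());
    for (const Node& c : n.children) query(c, q, out);
}
```

A facet spanning several leaves appears in each of them, so a real implementation would deduplicate the query result; the point of the sketch is only that the query visits a small spatial neighborhood instead of the whole facet list.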
Elements containing or intersected by elements of the geometry will be refined provided that they have not already reached the maximum octree level allowed. The software provides another way to specify a refinement stopping criterion for finer control over this refinement process. This approach causes the refinement to be based on the size of the current octree element as it compares to the size of the intersecting geometry elements. It may be undesirable for the octree to refine further in regions where the octree elements are already sized commensurately with the surrounding geometry facets. The user may specify a factor that scales the ratio of the average dimensions of both the octree element and geometry facets to give maximum control over this process. This logic may also be disabled such that any intersected octree element will be refined down to the maximum allowable level regardless of the relative size of the nearby geometry facets.
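The size-commensurate stopping rule might look like the following sketch. The exact ratio definition and the semantics of the user factor are assumptions here; the idea is simply that refinement stops once the octant edge length is already comparable to the nearby facet size.

```cpp
#include <cassert>
#include <cmath>

// Edge length of an octant at `level` refinements below a root tree of
// edge length root_h: each refinement halves the edge.
double octant_h(double root_h, int level) {
    return root_h / std::pow(2.0, level);
}

// Hypothetical stopping rule: do not refine further once the octant is
// sized commensurately with the intersecting facets, scaled by a user
// factor (factor = 0 would disable the test and refine to max level).
bool stop_refining(double root_h, int level, double facet_avg_dim,
                   double factor) {
    return octant_h(root_h, level) <= factor * facet_avg_dim;
}
```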
There are situations when parts of the geometry that are important for water tightness, extent definition, etc., are not necessarily important for use during mesh refinement. One specific example of this is farfield and sometimes ground boundaries; while these are important for the definition of the geometry/computational domain, it is undesirable to have mesh refinement in these areas, or at least undesirable to have the same level of refinement in these areas as in areas near other solid boundaries. This issue has been addressed herein by the addition of a boundary attribute specifying that certain boundaries are not to be used for mesh refinement. The attribute is called 'norefine' and any boundary marked as a 'norefine' boundary will be ignored during the geometry refinement callback execution, which ensures that mesh refinement will not happen in areas of the domain where it is undesirable.

Quality Control
Once the refinement callback has completed, the forest-of-octrees has been refined down to a maximum allowable level anywhere it is intersected by geometry facets. This refined forest is of low quality, as the 2:1 balance requirement is not yet enforced. Figure 2.4 demonstrates the nature of this low quality; notice that near the spherical body, element sizes vary dramatically over short distances, and many elements have neighbors whose levels in the octree differ by more than one. This is an unbalanced octree, and the next step is to balance the refined forest; p4est also provides a balancing function, and this function provides all necessary logic required to enforce the specified balance throughout the forest in parallel. The balancing algorithms use non-iterative communication afforded by the concept of the 'insulation layer' to optimally balance the forest across processor boundaries; this approach and the associated algorithms are explained in detail by Sundar et al. [83]. When the 2:1 balancing p4est function has completed, the p4est mesh is refined and balanced wherever it is near geometry. This mesh is of much higher quality and is now balanced. The main problem at this point is that the elements change size very rapidly as they grow away from the geometry surfaces; every time an element has a neighbor at a level different from its own, their volume ratio is eight. As such, it is desirable to control the number of layers of uniformly sized elements that must exist at a certain level before the refinement level may change.
The application requires the user to specify this number of layers, deemed 'buffer layers', which allows for finer control over the gradation of element sizes. These buffer regions make for a higher quality mesh. Note that, because octree refinement creates two layers at a time, precise control over these buffer regions is not possible. For example, if there is only one layer of elements at level x and the user specifies a two-layer buffer, the elements adjacent to the elements at level x will be refined, causing there to be three layers of elements at level x. This is the nature of octree meshes and cannot be overcome. Quality control is implemented by looping over the balanced mesh.

Support for areas of custom refinement has also been implemented. Two different types of custom refinement are supported: bounding boxes and baffles. Bounding box custom refinement allows a user to specify that the region in space bounded by a given box should be at a user-specified refinement level. Baffle custom refinement has been implemented such that a user may input any arbitrarily shaped geometry in space, not just a plane, and any elements intersecting said geometry will be refined to a specified level. This allows for finer control over refinement in wake regions or other regions away from geometry.

Mesh Extraction and Boundary Recovery
Now that the grid has been generated, boundaries must be found and marked. This 'marking' process is one of the last phases of the grid generation process. There are two ways the marking may occur, depending on which mode is used to generate the p4est octree mesh. If a skeleton UGRID mesh is provided, there are no elements cut by geometry; thus, faces must be marked. However, if a bounding box is used, there will be mesh elements intersected by geometry facets, and thus elements must be marked as either in or out of the geometry, or cut by it. Since the latter approach uses a bounding box to define the computational domain, it is highly likely that some resulting elements in the final octree discretization are outside of the domain of interest. For this reason, one must be able to query whether a certain element is in or out of the geometry in order to determine which elements in the computational mesh are actually inside the geometry and, thus, are a part of the computational domain. This querying logic is provided by the library handling the geometry via a ray-tracing algorithm†. Another p4est callback is used to mark either the mesh elements as cut, if the geometry is not Cartesian aligned and provided by an Xpatch [81] facet file, or their faces, if a Cartesian geometry is provided via a UGRID [80] structured skeleton mesh. This callback works much like the geometry refinement callback. It iterates over the p4est mesh elements and, using the temporary hierarchical geometry octree, determines which facets, if any, are spatially local to the element. If the method finds that the element contains or is intersected by any geometry facet or facets, that element is marked as cut. This marking process is slightly different, and actually easier, when a UGRID skeleton file has been provided. In this mode all elements are interior to the mesh, so no element needs to be marked as cut. Instead, boundary elements are found, and their exposed faces are marked appropriately using tags from the adjacent geometry facet(s).

† The geometry library that provides this ray-tracing functionality is an in-house library developed by Dr. Steve L. Karman Jr. at the University of Tennessee at Chattanooga SimCenter.
At this point, the p4est forest-of-octrees structure is complete, and a mesh object must be extracted. This mesh provides element-to-element connectivity information, which is not explicitly available from the native p4est data. Functionality is provided via the p4est API to loop through the mesh while providing necessary information about all neighbors (e.g., the neighbor's level, metrics, etc.). It is important to note that this mesh contains all elements in the p4est forest-of-octrees, even elements that are outside of the geometry. So, this algorithm requires that the elements be given a 'status', i.e., they are marked as either part of the computational mesh or not part of it. There are four valid statuses for elements in the p4est mesh: in, out, cut, or boundary. The boundary status is the only status that might need explaining. It marks elements that contain boundary faces; these elements are also in, but they are either next to elements marked cut, in the case of a mesh generated using a bounding box, or one or more of their faces have no neighbors, in the case of a mesh generated using a UGRID skeleton connectivity. Element statuses are generated via a flood-fill algorithm. This procedure starts by iterating over the p4est mesh until an element with an unknown status is located. The actual status of this element is then determined by querying whether the element is in or out of the geometry. Note that the only possible statuses in the mesh at this point are either cut or unknown; all unknown elements should be delimited by cut elements. Then the algorithm 'floods' through this element's neighbors and its neighbors' neighbors, etc., marking their statuses the same as its own and stopping only when it finds elements with cut statuses. This flood-fill pass may occur multiple times, depending on the geometry, until all elements have one of the four valid statuses. Also, boundary elements are marked during this pass.
When an element is adjacent to a cut element and it would normally be assigned an in status, it is marked instead as a 'boundary' element. At this point, it is important to note that this flood-fill algorithm poses a load-balancing challenge as the amount and locality of work depends highly on the geometry. This is a difficult challenge to overcome and will be explored in more depth later.
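The status flood-fill can be sketched on a toy 2D grid (the real code works on the 3D p4est mesh, and the `inOut` oracle here is a stand-in for the ray-tracing geometry library). Note how each unknown region delimited by cut cells costs a single geometry query, and how in elements adjacent to cut elements are promoted to boundary. All names are illustrative.

```cpp
#include <cassert>
#include <queue>
#include <utility>
#include <vector>

// Toy sketch of the status flood-fill.  CUT cells are pre-marked; UNKNOWN
// regions delimited by CUT cells are flooded from a seed whose true status
// comes from one in/out geometry query per region.
enum Status { UNKNOWN, CUT, IN, OUT, BOUNDARY };

void floodFill(std::vector<std::vector<Status>>& g,
               bool (*inOut)(int r, int c)) {
    int R = (int)g.size(), C = (int)g[0].size();
    for (int r = 0; r < R; ++r)
        for (int c = 0; c < C; ++c) {
            if (g[r][c] != UNKNOWN) continue;
            Status s = inOut(r, c) ? IN : OUT;   // one geometry query per region
            std::queue<std::pair<int,int>> q;
            g[r][c] = s; q.push({r, c});
            while (!q.empty()) {
                auto [cr, cc] = q.front(); q.pop();
                int dr[] = {1, -1, 0, 0}, dc[] = {0, 0, 1, -1};
                for (int k = 0; k < 4; ++k) {
                    int nr = cr + dr[k], nc = cc + dc[k];
                    if (nr < 0 || nr >= R || nc < 0 || nc >= C) continue;
                    // an 'in' element touching a cut element becomes 'boundary'
                    if (g[nr][nc] == CUT && s == IN) g[cr][cc] = BOUNDARY;
                    if (g[nr][nc] != UNKNOWN) continue;  // stop at cut/visited
                    g[nr][nc] = s; q.push({nr, nc});
                }
            }
        }
}
```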
Now that the mesh has been generated, it is important to note that the manner in which the forest is partitioned needs to change for optimal solver load balancing. Before this point, during the mesh generation, it usually makes sense for each processor to have an equal number of mesh elements and for this uniform partitioning to be maintained dynamically as much as possible. This element balancing is especially important during parts of the algorithm when many elements might be added to the mesh. Now, however, load balancing must be obtained by a different metric. At this point, the simulation is beginning, and only elements contained in the computational mesh, i.e., elements with in or boundary statuses, have any work associated with them. For this reason, the partitioning approach now only gives weight to elements with these active statuses. With this partitioning scheme, processors may have dramatically different total element counts, but as long as they own equal numbers of active elements, load balancing is achieved because inactive elements are simply ignored in the solver.

Approximate Nature of Resulting Computational Mesh
One important distinction that must be made about the computational meshes produced during this grid generation process is that the geometry is never exactly represented in the final mesh, i.e., this algorithm does not implement a cut-cell or projection method. Any element in the forest that contains or is intersected by any part of the geometry is eliminated from the computational mesh. This procedure is different from grids that use a cut-cell or projection method, which allow for exact representation of any geometry. This accurate geometry representation is accomplished in a cut-cell method by modifying any element that is cut by geometry so that the boundary facet is included in the mesh. This approach obviously creates a mesh that contains more than just hexahedral elements near boundaries and has the potential to create poor quality 'sliver' or 'small' elements as well [84, 85]. Projection methods are different from cut-cell methods; however, they tend to exhibit similar difficulties. Since one of the goals of this project is to quickly generate 'zeroth' order solutions on cityscape geometries, it was decided that exact geometry representation was neither necessary nor desirable. Also, even highly curved geometries may be fairly accurately approximated by allowing the forest to refine down to very fine levels. Figure 2.7 demonstrates how the accuracy of the representation improves as the maximum level in the octree mesh is allowed to increase.
However, by eliminating elements that are cut by geometry, there is the potential to end up with a highly undesirable 'stair-stepping' feature when increasingly larger elements are cut by geometry and eliminated from the final computational mesh. To combat this, a second boundary attribute has been added. Boundaries that do not require refinement, but that are in areas where this 'stair-stepping' character of the grid is undesirable, may be marked with the 'keep' attribute. Any element intersected by a boundary marked with this attribute will be marked directly as a boundary element in lieu of a cut element, forcing it to be a part of the final computational mesh and eliminating the problematic 'stair-stepping' feature. Figure 2.8b shows how the 'keep' attribute improves the quality of the mesh.
Another place that the geometry-approximating characteristic of this software may produce less than ideal grids is in urban geometries where the street canyons or alleys are not Cartesian aligned. This leads to sub-optimal discretization of buildings in the mesh as seen in Figure 2.9.
While it is certainly not possible to completely fix this problem in any real situation, it may be

Parallelism and Parallel Performance
As opposed to hierarchical octree data structures, linear octree storage provides for optimal parallelism. In order to maintain this optimal parallelism, however, octant repartitioning must occur at appropriate times during the mesh creation process. As noted in Section 2.1, Refine is a processor-local routine, so any time a refinement callback is executed, an octant imbalance is likely.
Furthermore, the Balance function has the potential to create many new octants while ensuring 2:1 balance, and, even though it employs communication during its execution, it does nothing to ensure proper repartitioning. Therefore, calls to the p4est Partition function should be made after every call to the Refine and Balance functions. These two p4est functions are used liberally, especially during the quality refinement stages. A parallel framework is provided by p4est within which the current application must function; thus, for ease of use, all data that may be required during parallel communication should reside within the p4est data structures. Custom user data may be created both at the forest level and at the individual octant level within p4est. Custom data not unique to each octant may be stored at the forest level, whereas custom data that may vary from octant to octant must be stored in each individual element; e.g., the connectivity/macromesh and geometry are stored at the forest level, whereas the solution state vector, among other things, is stored in each of the elements in the micromesh. Using the Ghost function, p4est provides an interprocess neighbor mapping used to synchronize parallel data across partitions, as well as ways to perform said synchronization.
One great advantage of this approach to parallel grid generation is that the domain is not statically decomposed a priori; it is decomposed dynamically throughout the generation process.
As mentioned earlier, one place where this dynamic decomposition is particularly advantageous is during the geometry refinement stage. During this phase of the grid generation process, the grid goes through the most dynamic changes and transforms from a simple connectivity of root octants into a nearly final version of the mesh. As this refinement process evolves, areas of the mesh that are near important parts of the geometry can see drastic changes in element counts. For this reason, the refinement process is done non-recursively a single level at a time with repartitioning occurring after each pass. Using this software, a mesh is generated on a large urban geometry, shown in Figure 2.10, which comprises 927 buildings and 4.25 million facets. Unfortunately, given the optimal partitioning of the grid afforded by p4est in terms of elements per processor, there are still algorithms that are difficult to load-balance. The flood-fill algorithm, among other methods that are highly dependent on the geometry, is a good example of this and may play a role in the sub-optimal speedup shown in Figure 2.11. These algorithms are much more sensitive to the geometry when it comes to load-balancing; while all processors have functionally the same number of elements, they might have very different percentages of the geometry which can cause significant load-imbalances. Repartitioning schemes that take proximity to geometry into account as a weighting for the partitions will improve this imbalance. 
One such scheme that has been explored during this research creates a 'distance' field by looping over the mesh and marking each element with the number of elements, or the number of levels, that lie between it and the nearest target boundary; i.e., an element adjacent to a target boundary would receive a level value of one, while all of its neighbors that are not directly adjacent to the boundary would receive a value of two, and all of their neighbors not adjacent to any elements marked with a value of one would get marked three, etc. Figure 2.12 shows an example of the distance map used to repartition the mesh for better load balancing during the flood-fill algorithm. By choosing a weighting that favors elements near the target boundaries, i.e., with level values close to one, the resulting partitions can be shown to contain approximately equal numbers of these elements. As seen in Figure 2.13, this approach does produce better load balances for these geometry-centric algorithms, but it does not completely alleviate the problem. In the case of the flood-fill algorithm, it can be shown that the load balancing is directly proportional to the number of individual buildings each process owns. This is because, for each building in a single partition, the owning processor must query the underlying geometry to determine whether an element is inside or outside of the computational domain at least once per building. The root cause of this sub-optimal load balancing stems from the fact that, while each processor can be influenced to own an approximately equal number of elements near target boundaries, it is difficult to enforce that each processor contains approximately equal numbers of individual buildings. Therefore, some processors are required to make many more queries to the geometry library than others. This inefficiency was explored further during this research, and a method to significantly improve this load imbalance was discovered.
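The distance field described above is, in essence, a multi-source breadth-first sweep seeded at the boundary-adjacent elements. A minimal sketch on a 1D strip of elements (the real code walks octree face neighbors; all names are invented):

```cpp
#include <cassert>
#include <queue>
#include <vector>

// Sketch of the 'distance' field on a toy 1D strip of n elements: elements
// adjacent to a target boundary get value 1, their neighbors 2, and so on.
std::vector<int> distanceField(int n, const std::vector<int>& boundaryAdjacent) {
    std::vector<int> dist(n, 0);               // 0 = not yet assigned
    std::queue<int> q;
    for (int i : boundaryAdjacent) { dist[i] = 1; q.push(i); }
    while (!q.empty()) {
        int i = q.front(); q.pop();
        for (int j : {i - 1, i + 1})           // strip neighbors
            if (j >= 0 && j < n && dist[j] == 0) {
                dist[j] = dist[i] + 1;
                q.push(j);
            }
    }
    return dist;
}
```

A partition weight that decays with this distance then concentrates geometry-adjacent elements evenly across ranks.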
As previously discussed, the source of the imbalance is caused by certain processors owning disproportionately more buildings than others. As such, it is desirable to weight the forest in such a way that all processors can own approximately the same number of buildings in their partitions. In order to count the number of buildings each processor owns, each building must be identifiable as a single, unique structure in the geometry. Logic has been added to the geometry library that is able to take a single geometry boundary tag that is the same for all buildings and split it into one unique boundary tag for each distinct building contained in the geometry. In the event that the buildings were not tagged separately from the terrain, one could extend this logic to identify where buildings are located based on the angle between adjacent facets. Now that each building has a unique boundary tag, each processor may count the number of buildings it owns and identify which elements contain facets belonging to each building. Since each building should 'weigh' the same, regardless of its size, the following scheme has been implemented. After the forest has been refined and the intersected elements marked as cut, the forest is traversed, and the maximum number of elements intersecting any single building is counted. Once this number is known, all elements intersecting this building are assigned a weight of one while all other element-per-building counts are normalized by this number in such a way that the sum total of all element weights per building are approximately equal. For example, if the maximum number of elements-per-building is 1000, each of these 1000 elements will be assigned a weight of one, and if there is a building discretized by 500 elements, each of these elements will get a weight of two, while each element associated with a building discretized by 100 elements will get a weight of ten, etc. 
Note that only intersected elements are assigned weights using this approach. Repartitioning the forest using this new metric results in a much more uniform building distribution. The required geometry query used to determine if an element is in or out of the geometry executes in constant time; however, even though the new building distribution is much improved, it is not completely uniform, and, therefore, neither is the load balance. Another of the outstanding problems is that a single process may own a single building, but it is not guaranteed to own the entire building, and, as such, multiple geometry queries will need to be executed for the building in question instead of only one (see Table 2). To be clear, this load-balancing problem is very geometry-specific; geometries that do not exhibit the same type of complexity as urban geometries, i.e., ones that do not require hundreds or thousands of 'in/out' queries to the geometry library, should not suffer as badly from this load imbalance.
Finally, in Figure 2.16, the improved speedup curve is compared to the original. Notice that the speedup curve is indeed better but still exhibits the same behavior at the higher end of the curve. This seems to point to the fact that communication imbalances are the likely culprit in this case and that the overhead incurred by the load-balancing logic may outweigh the benefits as the problem scales. The assignment of elements to each process also varies dynamically during the execution of the parallel code. This is in stark contrast to the traditional approach described above. So, while it is possible that the optimal partitioning provided by the underlying p4est code does yield optimal cache lines, which may contribute to the observed 'super-linear' speedup, it might not be a fair comparison considering that this may not be a conventional application of a speedup study. Nevertheless, the data does not lie: the observed run-times exhibit very good trends regardless of whether this is a proper speedup study.
To support an earlier statement, that the sub-optimal speedup towards the higher end of the curve (as the processor count approaches its maximum) may be related to low per-process element counts, the same case has been re-run twice on 128, 256, and 512 processors; each time, the maximum level allowed in the forest is increased by one.
The maximum level allowed in the original mesh is 9, the next larger case allows a maximum of 10 levels, and in the final case 11 levels of refinement are allowed. The original mesh contains 6.2 million elements, the next mesh ~24.8 million elements, and the final mesh ~57.3 million elements. Table 2.2 shows the approximate element-per-process count for all three cases; the two larger cases both contain significantly more elements on each process than the original and should exhibit improved communication-to-work ratios (or surface-to-volume ratios). It is also worth noting that, since this process is so dynamic in nature, the end product is not exactly the same as the number of processors varies. This slight variation in the final mesh is likely not significant and can be explained by the fact that the 'path' taken to obtain the final product is indeed different for each processor count. However, it is well known that numerical solutions to PDEs can be grid-dependent, and so any variance in a grid caused simply by changing the number of processors should be quantified to enable a user to understand how changing the number of processors in a simulation might affect the overall flow solution.

Adaptive Mesh Refinement
This framework also provides solution-based adaptive mesh refinement (AMR); the mesh may be refined and coarsened based on user-specifiable threshold values of a user-specified solution feature. This functionality is built into the basic p4est framework, and all that must be provided are three callback functions: a refinement callback, a coarsening callback, and an agglomeration callback.
The refinement and coarsening callbacks must simply provide the appropriate logic to determine whether the current element should be refined, or the current family of elements should be coarsened, based on the particular solution feature and the input threshold values. The agglomeration callback must also be provided to each of the refinement and coarsening callbacks and is used to either assign solution values to the eight new child elements based on the solution state of the parent element (in the case of refinement) or average/agglomerate the solution data from eight child elements into a parent element (in the case of coarsening). Solution gradient data is used to extrapolate solution data from a parent element onto each of its child elements during refinement if it is available, whereas a simple averaging procedure is used to combine the solution data from all eight child elements into their parent during coarsening. After the refinement and coarsening passes are made through the mesh, the 2:1 balance restriction is almost certainly no longer enforced; thus, the balancing function must be invoked. The same agglomeration callback provided to the refinement and coarsening functions must also be provided here during the balancing pass to allow the redistribution of solution data while the mesh changes during balancing.
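The agglomeration callback's two jobs can be sketched for a single scalar field (the real callback handles the full state vector; function names and the centroid geometry are illustrative): gradient-based extrapolation of the parent value to the eight child centroids on refinement, and a plain volume average of the children on coarsening.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Sketch of AMR solution transfer for one scalar on a cube of edge h
// centered at the origin; child centroids sit at (+/- h/4, +/- h/4, +/- h/4).
struct Vec3 { double x, y, z; };

// Refinement: extrapolate parent value to each child using the parent gradient.
std::array<double, 8> prolong(double parentVal, Vec3 grad, double h) {
    std::array<double, 8> child{};
    int k = 0;
    for (int i : {-1, 1}) for (int j : {-1, 1}) for (int l : {-1, 1})
        child[k++] = parentVal + 0.25 * h * (i * grad.x + j * grad.y + l * grad.z);
    return child;
}

// Coarsening: equal child volumes, so agglomeration is a plain average.
double restrictToParent(const std::array<double, 8>& child) {
    double sum = 0.0;
    for (double v : child) sum += v;
    return sum / 8.0;
}
```

Note that restriction exactly undoes prolongation for a linear field, which is the usual conservation sanity check.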
This logic does allow for boundary refinement; however, currently the representation of the geometry in the mesh never actually changes. Existing boundary faces may be subdivided potentially multiple times, but the original planar mesh face will always remain in the mesh. This also means that when boundary elements are refined, none of the new elements are checked for geometry intersection, which means the geometry must be adequately resolved in the original mesh as the original representation will never change during AMR. Boundary coarsening is a feature that may be valuable to include, but the logic required for its implementation is significantly more difficult than that required for boundary refinement.
At runtime the user may specify the function used for AMR as well as the threshold values used to determine when refinement or coarsening is required. Currently, only vorticity magnitude is supported for solution mesh adaptation, but adding new functions is a trivial exercise. Also, the frequency at which AMR occurs during the solution iteration is another parameter that the user may specify at runtime. In this manner the AMR method may be called periodically (e.g., once every five or ten time steps) as the user sees fit. Figure 2.19 demonstrates the AMR functionality in action on the revised simple frigate shape (SFS2) geometry as the grid changes dynamically.

The urban simulations targeted by this research involve flows where buoyancy may play a significant role. As such, the appropriate physics are modeled by the unsteady, incompressible, Navier-Stokes (NS) equations. However, there is a complication that arises within the context of the incompressible NS equations: the pressure becomes decoupled from the system when incompressibility is enforced, and, in order to maintain the hyperbolicity required by most time-marching CFD schemes, pressure must be recoupled to the other equations.
The pseudo-compressibility method is the approach used herein to address the problems related to pressure decoupling; see Chorin [86] and Taylor [87] for details about this method. Furthermore, as heat transfer/buoyancy is important in the simulation of plume propagation, the energy equation cannot be ignored, and requisite steps must be taken to ensure proper coupling with the energy term. This work mimics the work of Kress [88] and Er [89]. This chapter defines the governing equations utilized, as well as the numerical formulation, spatial and temporal discretizations, and source and boundary terms used for their solution.

Governing Equations
In this section, the Navier-Stokes equations will be given, starting with the most general compressible form of the equations, which will subsequently be simplified and augmented until the required incompressible NS equations with pseudo-compressibility and heat transfer terms are presented. The resulting equations are a subset of the compressible NS equations.

Compressible Navier-Stokes Equations
The Navier-Stokes (NS) equations can be written in conservative form as equation (3.1), where Ω is a closed domain. The vector of conservative flow variables, Q, and the inviscid and viscous Cartesian flux vectors, F and F_v, are defined in terms of ρ, p, and E, which denote the fluid density, pressure, and specific total energy, respectively. The vector u = (u, v, w) represents the Cartesian velocity vector. The pressure is determined by the definition of total energy and the equation of state for an ideal gas, where γ is defined as the ratio of specific heats, which is 1.4 for air. Temperature and thermal conductivity are represented by T and κ, respectively, and are related to the total energy and velocity through Pr and Pr_T, the 'laminar' and turbulent Prandtl numbers, which are set to 0.72 and 0.9, respectively. The fluid viscous stress tensor, τ, is defined for a Newtonian fluid using the summation convention. Subscripts i, j, k refer to the Cartesian coordinate components of x = (x, y, z), and δ_ij is the Kronecker delta. In addition, µ refers to the fluid dynamic viscosity and is obtained via Sutherland's law, while µ_T denotes a turbulence eddy viscosity, which is obtained from a turbulence model. The source term, S, in equation (3.1) may take on many forms and often contains body force terms as well as turbulence model terms.
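The display equations referenced here did not survive extraction. As a reference point, the standard statements that the surrounding definitions describe can be sketched as follows (a reconstruction of textbook formulas, not necessarily the author's exact notation):

```latex
% Conservative integral form, matching equation (3.1):
\frac{\partial}{\partial t}\int_\Omega Q \, dV
  + \oint_{\partial\Omega} \left( \vec{F} - \vec{F}_v \right) \cdot \hat{n} \, dA
  = \int_\Omega S \, dV ,
\qquad
Q = \left( \rho,\; \rho u,\; \rho v,\; \rho w,\; \rho E \right)^T .

% Pressure from total energy and the ideal-gas equation of state:
p = (\gamma - 1)\left( \rho E - \tfrac{1}{2}\,\rho\,\vec{u}\cdot\vec{u} \right) .

% Newtonian viscous stress tensor with eddy viscosity:
\tau_{ij} = (\mu + \mu_T)\left( \frac{\partial u_i}{\partial x_j}
  + \frac{\partial u_j}{\partial x_i}
  - \frac{2}{3}\,\delta_{ij}\,\frac{\partial u_k}{\partial x_k} \right) .
```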

Incompressible Navier-Stokes Equations
Note that the viscous Cartesian flux vector F_v remains as defined for the compressible equations. Note, however, that the stress tensor for these incompressible equations, τ_ij, is slightly different because the equations have been divided through by ρ∞. In the incompressible form of τ_ij, ν = µ/ρ is the kinematic viscosity of the fluid.

Pseudo-Compressible Navier-Stokes Equations
For the incompressible NS equations, the pseudo-compressibility parameter β may, in theory, be made arbitrarily large [87]. However, in reality, stability, convergence, and accuracy are the limiting factors which put an upper bound on its value. Pan and Chakravarthy [90] report a normal range is from 1 to 10. Hyams [91] reports that a value of 15 is typical. Palacios et al. [92] demonstrate results using values ranging from 1 to 20, as well as quantify the effect of the value of β on convergence and solution accuracy. So, the pseudo-compressibility formulation of the unsteady, incompressible NS equations can be written in conservative form as equation (3.1), with redefined Q and F; note that F_v does not change in form from equation (3.8).
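The redefined vectors were lost in extraction. A hedged reconstruction, following the standard Chorin-style pseudo-compressibility formulation (x-direction inviscid flux shown; this may differ in detail from the author's notation):

```latex
% Pseudo-compressible solution vector and x-direction inviscid flux:
Q = \begin{pmatrix} p \\ u \\ v \\ w \end{pmatrix}, \qquad
F = \begin{pmatrix} \beta u \\ u^2 + p \\ u v \\ u w \end{pmatrix} ,

% so that the continuity equation becomes an evolution equation for pressure:
\frac{\partial p}{\partial t} + \beta\, \nabla \cdot \vec{u} = 0 ,
```

which recouples pressure to the velocity field and restores hyperbolicity for time-marching schemes.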

Accounting for Buoyancy
Since thermal buoyancy is an important part of this research as it applies to urban simulation, appropriate considerations must be taken to couple an energy equation to the pseudo-compressible NS equations. This coupling is achieved by the addition of a temperature equation; the solution vector is augmented with the fluid temperature T. While the density variation as a result of a temperature gradient is small, it is not negligible. Following Er [89] and Kress [88], the Boussinesq approximation is applied to the momentum equation that acts in the direction of gravity. The Boussinesq approximation [93] treats the fluid density as a constant in every term of this momentum equation except for the source term, where it is multiplied by the acceleration due to gravity g; there, density is treated as a function of temperature. The source term may be written for gravity acting in the y direction, or in general in terms of e_g = (e_gx, e_gy, e_gz), a unit vector pointing in the direction of gravitational acceleration.
Using a truncated Taylor series expansion, density may be expressed as a function of temperature, where ρ∞ is the density of the fluid at temperature T∞ and α_T is the coefficient of thermal expansion of the fluid at constant pressure. The momentum equation may then be rewritten in terms of p*, which absorbs the hydrostatic pressure into the definition of pressure and is defined as p* = p/ρ∞ + gy if gravity acts in the y direction, or p* = p/ρ∞ − gB, where B = x · e_g, for an arbitrary gravitational direction.
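The expansion and the resulting source term were displayed equations that did not survive extraction. The standard Boussinesq relations consistent with this paragraph (my reconstruction; signs assume e_g points along gravitational acceleration) are:

```latex
% Density expanded about the reference state:
\rho \approx \rho_\infty \left[ 1 - \alpha_T \left( T - T_\infty \right) \right],
\qquad
\alpha_T = -\frac{1}{\rho} \left( \frac{\partial \rho}{\partial T} \right)_p .

% Momentum source after dividing by rho_infty and absorbing the hydrostatic
% part into p^*:
S_{\mathrm{mom}} = -\, g \, \alpha_T \left( T - T_\infty \right) \vec{e}_g ,
```

so that fluid warmer than the reference temperature receives a source directed opposite to gravity, i.e., a buoyant force.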

Non-Dimensionalization
The NS equations, in the pseudo-incompressible buoyant form defined by equation (3.1) and equations (3.19) and (3.20), will now be non-dimensionalized. To this end, the following definitions for the non-dimensional forms of the constituent variables will be used.
where the subscript ∞ signifies a reference value and the circumflex, or hat (ˆ), represents a non-dimensional quantity. In these equations, the non-dimensional parameters are the Reynolds, Prandtl, and Grashof numbers. An additional non-dimensional parameter, the Rayleigh number, will be used later and is defined as the product of the Grashof and Prandtl numbers: Ra = Gr Pr. By a simple calculation, the coefficient on the source term vector may be written in terms of these parameters. Although not included here, any other form of the NS equations may also be non-dimensionalized following a similar approach.
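The definitions of these parameters were lost with the display equations. Their conventional forms, which the surrounding text is consistent with (a hedged reconstruction; the characteristic temperature difference ΔT is an assumption), are:

```latex
Re = \frac{\rho_\infty U_\infty L}{\mu_\infty}, \qquad
Pr = \frac{\mu\, c_p}{\kappa}, \qquad
Gr = \frac{g\, \alpha_T\, \Delta T\, L^3}{\nu^2}, \qquad
Ra = Gr\, Pr .

% The buoyancy source coefficient is then commonly written as
\frac{Gr}{Re^2} \;=\; \frac{Ra}{Pr \, Re^2} .
```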

Turbulence Modeling
The motivation of this research is to quickly generate flow field solutions in an urban setting where viscous spacing can be very large and drag forces are neither important nor desired; thus, certain liberties may be taken with the choice of turbulence models. Since the overall effect of blockages, e.g., buildings and other features, is of main interest, and it is intractable to properly resolve the boundary layer physics, the use of typical turbulence models that are designed to accurately simulate surface phenomena may be overkill. As such, simple large eddy simulation (LES) models seem a better fit and, consequently, the Smagorinsky LES turbulence model has been chosen. In 1963 Joseph Smagorinsky proposed a turbulence model, one which now bears his name, to simulate atmospheric circulation [94]. This model is algebraic, and its simplicity makes it easy to implement and also makes it faster than other standard turbulence models. Furthermore, Nichols [95] lauds the effectiveness of this model as compared to other standard models like the SAS, SST, and Wilcox stress-ω models. Nichols [95] concluded that the Smagorinsky model performed as well as the SST and Wilcox stress-ω models on the surface-mounted cube case and better than the SAS model. The Smagorinsky model directly computes the eddy viscosity, where C_s is the 'Smagorinsky constant' and is set to 0.17 for this research. ∆ is a length scale or filter width related to the size of the grid and is defined to be ∛V, where V is the volume of the grid element. Lastly, |S| = √(2 S_ij S_ij) represents the magnitude of the strain rate tensor.
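The per-element Smagorinsky evaluation is simple enough to sketch directly; the function name and argument layout are invented, but the formula ν_t = (C_s ∆)² |S| with ∆ = ∛V and |S| = √(2 S_ij S_ij) is the standard model stated above.

```cpp
#include <cassert>
#include <cmath>

// Sketch of the Smagorinsky eddy viscosity for one grid element.
// gradU[i][j] = du_i/dx_j; volume is the element volume V; cs defaults to 0.17.
double smagorinskyNuT(double gradU[3][3], double volume, double cs = 0.17) {
    double delta = std::cbrt(volume);        // filter width: cube root of volume
    double ss = 0.0;                         // accumulates 2 * S_ij * S_ij
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j) {
            double sij = 0.5 * (gradU[i][j] + gradU[j][i]);  // strain rate tensor
            ss += 2.0 * sij * sij;
        }
    return cs * delta * cs * delta * std::sqrt(ss);  // (Cs*Delta)^2 * |S|
}
```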

Spatial Discretization on a p4est Grid
The integral form of the partial differential equations given in equation (3.1) is discretized over the computational domain, Ω, using a cell-centered, finite-volume method on a Cartesian octree style grid. Octree meshes contain hanging nodes in all but the unusual case of a uniformly refined mesh; hanging nodes occur when an element in the mesh has neighbors at levels different from its own. Figure 3.1 demonstrates the concept of a hanging node. Octree meshes used for finite volume discretizations are generally required to be 2:1 balanced in order to maintain reasonable volume/area ratios. This 2:1 balancing requires that neighboring elements differ by at most one level in their octree. This requirement implies that neighboring elements never have volumes that differ by more than a factor of eight. Extra care must be taken when calculating fluxes through a face of an element when its neighbor across the face is at a different level: the flux must be calculated through the smaller of the two faces. Higher order reconstruction must also be done with care as the extrapolation occurs in different directions in each cell instead of along a single vector.

Solution Methodology
This section explains the solution methodology used in the current research as well as specifics about certain algorithms implemented therein. The flow solver is written in C++ and is implicit, time-accurate, and designed with a 'plug-and-play' style framework, which means that any PDE may be solved as long as key methods (inviscid and viscous fluxes, Jacobians, etc.) are derived that describe the appropriate physics. The resulting application is named paros (PARallel Octree Solver).

Spatial Discretization
Spatial discretization is obtained herein using a cell-centered, finite-volume method on the p4est octree meshes described in Section 2.1. This approach is much like the node-centered approach described by Hyams [91], except for the cases where neighboring elements are at different levels in the octree, which implies that there will be hanging nodes. These cases must be approached with more care as was discussed in Section 3.4. If it is assumed that the solution state Q is constant within each control volume, the volume-averaged state vector is q_i = (1/V_i) ∫_{V_i} Q dV. Thus the spatial terms discretized over each control volume, i, may be expressed as V_i ∂q_i/∂t + R_i = 0, where V_i represents the volume of the i-th control volume, ∂q_i/∂t the change in the conservative variables in the i-th cell with respect to time, and R the spatial residual containing all numerical fluxes and, where applicable, any source terms. In general for this research R = R_inviscid + R_viscous + S.

Fluxes
The inviscid term in equation (3.1), ∇·F, is integrated over the control volume and converted into a surface integral by Gauss' divergence theorem: ∫_V ∇·F dV = ∮_{∂V} F·n̂ dA. The surface integral is approximated discretely as a sum over the faces defining the cell-centered control volume, ∮_{∂V} F·n̂ dA ≈ Σ_{f∈F} Φ_f · n_f, where n_f is the unnormalized face area vector and F represents all the faces that make up the surface of the control volume. The inviscid fluxes, Φ_f, are approximated at each interface using the HLLC flux [96] formulated for the incompressible NS equations [97]. The solution variables can be reconstructed with first or second order accuracy; consequently, the fluxes may be first or second order in space as well.
Using the same technique, the viscous term, ∇·F_v, is discretized via the following summation: Σ_{f∈F} F_v,f · n_f, where F_v,f is the viscous flux evaluated using solution data at face f between two adjacent control volumes. For the incompressible NS equations, the only solution data required for the viscous flux are solution gradients. Also, it is typically considered more accurate to use weighted gradients for the viscous fluxes when they are available [91,98]. The default approach in this research is to use the weighted least-squares gradients for these calculations if they are available. The gradients are obtained at the face by using either a simple weighted average of the gradient data of the current element and neighboring element or by using a directional derivative approach. See Hyams [91] for a thorough definition of the directional derivative. Using a weighted average is vital in this octree cell-centered paradigm because, when a face neighbor is at a different level in the octree, the interface location is twice as far from the centroid of the larger element as it is from the centroid of the smaller element; refer to Figures 3.1 and 3.2 for an illustration of this problem in two dimensions. Thus, a simple average is inferior to a weighted average because the interface is not halfway between both centroids.
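The distance-weighted face averaging described above can be sketched as follows; this is an illustrative fragment (names are not taken from paros), assuming each side's gradient is weighted by the opposite centroid-to-face distance, i.e., linear interpolation to the face midpoint.

```cpp
#include <array>
#include <cmath>

using Vec3 = std::array<double, 3>;

// Distance-weighted average of two cell-centered gradients at a shared face.
// In a 2:1-balanced octree the face midpoint is twice as far from the large
// element's centroid as from the small one's, so a simple 50/50 average is
// biased; weighting by the *opposite* distance interpolates to the face.
Vec3 weightedFaceGradient(const Vec3& gradLeft, double distLeft,
                          const Vec3& gradRight, double distRight) {
    const double wL = distRight / (distLeft + distRight);
    const double wR = distLeft / (distLeft + distRight);
    return {wL * gradLeft[0] + wR * gradRight[0],
            wL * gradLeft[1] + wR * gradRight[1],
            wL * gradLeft[2] + wR * gradRight[2]};
}
```

For equal distances this reduces to the simple average, so the weighted form is a strict generalization that costs almost nothing extra.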

Higher Order Spatial Accuracy
A spatially second order accurate reconstruction is implemented in this work by using solution gradient data to extrapolate cell-centered solution data to the flux interface. The extrapolation is performed using a truncated Taylor series expansion as follows: φ_f = φ_c + ∇φ_c · s, where φ_f is any one of the components of Q extrapolated to the face, φ_c is a component of Q at the cell centroid, and s is a vector from the cell centroid to the flux interface. That is, s = x_f − x_c, where x_f is the midpoint of the flux interface and x_c is the cell centroid. Green's theorem and the least-squares method are used in this work to obtain cell-centered solution gradients.
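The extrapolation above is a one-line dot product in code; the following sketch uses illustrative names.

```cpp
#include <array>

using Vec3 = std::array<double, 3>;

// Second-order reconstruction to a face: phi_f = phi_c + grad(phi) . s,
// where s = x_f - x_c runs from the cell centroid to the face midpoint.
double extrapolateToFace(double phiCell, const Vec3& gradPhi,
                         const Vec3& cellCentroid, const Vec3& faceMidpoint) {
    double phiFace = phiCell;
    for (int d = 0; d < 3; ++d)
        phiFace += gradPhi[d] * (faceMidpoint[d] - cellCentroid[d]);
    return phiFace;
}
```

Note that on a 2:1 face the two adjacent cells extrapolate along different vectors s, which is why the hanging-node cases require the extra care discussed earlier.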

Green's Theorem
Green's theorem is defined as ∇φ = (1/V) ∮_{∂V} φ n̂ dA, where again φ represents any component of the solution vector Q on the surface of the volume V and n̂ is a unit surface normal. Assuming that the components of Q vary linearly (or ∇Q is constant) over V and replacing the surface integral with a summation over the faces of V, the discrete form of equation (3.33) is ∇φ ≈ (1/V) Σ_{f∈F} φ_f n_f, where n_f is the area vector belonging to face f. Also, φ_f is obtained by a weighted average of solution values from either side of the face f; the weighting procedure used here is a distance weighted average, which is essentially a linear interpolation of the solution data from each cell centroid to the face midpoint.
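The discrete Green-Gauss sum can be sketched directly from the formula above; the struct and function names below are illustrative stand-ins, not paros API.

```cpp
#include <array>
#include <cmath>
#include <vector>

using Vec3 = std::array<double, 3>;

struct FaceData {
    double phi; // distance-weighted face value of the solution component
    Vec3 area;  // outward face-area vector (magnitude equals the face area)
};

// Discrete Green-Gauss gradient: grad(phi) ~ (1/V) * sum_f phi_f * n_f.
Vec3 greenGaussGradient(double cellVolume, const std::vector<FaceData>& faces) {
    Vec3 grad{0.0, 0.0, 0.0};
    for (const FaceData& f : faces)
        for (int d = 0; d < 3; ++d)
            grad[d] += f.phi * f.area[d];
    for (int d = 0; d < 3; ++d)
        grad[d] /= cellVolume;
    return grad;
}
```

For a closed control volume with a linearly varying field this sum is exact, which is consistent with the constant-gradient assumption stated above.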

Least-Squares Gradient
The least-squares method for computing solution gradients is a much more complicated approach. It may be derived generally using the following truncated Taylor series to express a solution value at a neighboring cell center, e_j, in terms of the solution data in a central cell, e_i: φ_j = φ_i + ∇φ_i · (x_j − x_i). Writing this equation once for every neighboring cell, e_j, usually leads to an overdetermined system of equations, represented by equation (3.36), that must be solved in order to get an approximation of the cell-centered solution gradient ∇φ_i. Each equation in the system may also be weighted, producing weighted least-squares gradients. A common approach is to weight each equation by the inverse of the distance between the two adjacent cell centroids. Notice that since the higher order terms of the Taylor series expansion have been dropped, the least-squares method described here is only able to exactly calculate gradients of a linear function.
Since the system in equation (3.36) is overdetermined, a least-squares solution procedure must be utilized. If the system is represented by Ax = b, a least-squares solution may be obtained by solving the system attained by pre-multiplying each side of the system by the transpose of A, i.e., AᵀAx = Aᵀb.
In order to accomplish this, the matrix A is factorized using a QR algorithm, which employs the Gram-Schmidt process for orthogonalization. This process is quite complicated to derive but, in the end, turns out to be relatively simple to implement. For a thorough derivation of this method, as well as some important implementation issues, see Appendix A of Hyams' dissertation [91]. Regarding the Green-Gauss method discussed above, there may be ways to alleviate or improve its shortcomings by exploring different ways to 'weight' cell-centered values when averaging them to the face; currently, a distance-weighted average is computed, but a volume-weighted approach could be tested, and there are likely other more sophisticated methods that could be used as well.
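For illustration only, an unweighted least-squares gradient can be obtained by forming and solving the 3×3 normal-equations system AᵀAx = Aᵀb directly with Cramer's rule; the production approach described above factorizes A with Gram-Schmidt QR instead, and all names here are assumed for the sketch.

```cpp
#include <array>
#include <cmath>
#include <vector>

using Vec3 = std::array<double, 3>;

struct Neighbor {
    Vec3 dx;     // x_j - x_i, neighbor centroid minus cell centroid
    double dphi; // phi_j - phi_i
};

// Unweighted least-squares gradient via the normal equations A^T A x = A^T b.
Vec3 leastSquaresGradient(const std::vector<Neighbor>& nbrs) {
    double M[3][3] = {{0}}; // A^T A
    Vec3 rhs{0, 0, 0};      // A^T b
    for (const Neighbor& n : nbrs)
        for (int r = 0; r < 3; ++r) {
            for (int c = 0; c < 3; ++c) M[r][c] += n.dx[r] * n.dx[c];
            rhs[r] += n.dx[r] * n.dphi;
        }
    auto det3 = [](const double m[3][3]) {
        return m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
             - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
             + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]);
    };
    const double det = det3(M);
    Vec3 grad{};
    for (int k = 0; k < 3; ++k) { // Cramer's rule, column k replaced by rhs
        double Mk[3][3];
        for (int r = 0; r < 3; ++r)
            for (int c = 0; c < 3; ++c)
                Mk[r][c] = (c == k) ? rhs[r] : M[r][c];
        grad[k] = det3(Mk) / det;
    }
    return grad;
}
```

As the text notes, this recovers gradients of linear functions exactly; the QR route is preferred in practice because the normal equations square the condition number of A.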

Boundary Gradients
Boundary gradients are handled differently than usual in this research.

Temporal Discretization
The temporal term in equation (3.1) remains to be approximated after the spatial terms have been discretized. Following Beam and Warming [99], and Hyams [91], the following generalized difference expression is used for the temporal discretization: ∆q^n = [θ∆t/(1+φ)] ∂(∆q^n)/∂t + [∆t/(1+φ)] ∂q^n/∂t + [φ/(1+φ)] ∆q^{n−1}, where ∆q^n = q^{n+1} − q^n and the parameters θ and φ select the type and order of the time integration. Using equation (3.28) to eliminate the time derivative terms gives a relation in which the spatial residual appears at time levels n and n+1 and, for a moving grid, the element volume appears at multiple time levels. Since in this research the grid is static, V does not vary in time and thus the Geometric Conservation Law (GCL) does not need to be applied. Thus, equation (3.41) simplifies to [(1+φ)V/∆t] ∆q^n − [φV/∆t] ∆q^{n−1} + θR^{n+1} + (1−θ)R^n = 0. For a good explanation of how the GCL terms would be added, see Hyams [91]. In order to maintain a divergence free velocity field for incompressible flow regimes, the contribution from the time derivative term (and potentially the GCL term) is explicitly left out of the continuity equation.

Time Evolution
Equation (3.42) is the final form of the temporal and spatial discretization; the implicit nature of the temporal discretization requires that the spatial residual, R, be evaluated at time level n + 1. Since the solution at this time level is unknown, Newton's method will be used to linearize the equations about the known solution state Q^n. Note that the following derivation follows closely with that of Hyams [91]. From equation (3.42) the following equation may be derived: F^{n+1} = [(1+φ)V/∆t] ∆Q^n − [φV/∆t] ∆Q^{n−1} + θR^{n+1} + (1−θ)R^n = 0, where F^{n+1} is the function that should be driven to zero by Newton's method. F^{n+1} is now expanded using a Taylor series about a known level (n+1, m): F^{n+1,m+1} = F^{n+1,m} + (∂F/∂Q)^{n+1,m} ∆Q^m + O(∆Q²), where ∆Q^m = Q^{n+1,m+1} − Q^{n+1,m}. Lastly, since the left hand side of the equation above is zero upon convergence of the Newton's method, and after dropping the O(∆Q²) term, equation (3.44) may be written as (∂F/∂Q)^{n+1,m} ∆Q^m = −F^{n+1,m}. Expanding these terms and differentiating F yields the final, full expression for Newton's method: [(1+φ)V/∆t I + θ(∂R/∂Q)^{n+1,m}] ∆Q^m = −F^{n+1,m}, where ∂R/∂Q is the Jacobian matrix.
The sparse matrix system A x = b, represented by equation (3.46), is solved iteratively using an 'improved' Symmetric Gauss-Seidel method, which is explained in detail by Hyams [91] in Section 4.4.2.
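The basic symmetric Gauss-Seidel idea, a forward sweep followed by a backward sweep per iteration, can be sketched on a small dense system. The solver used in this work is an 'improved' sparse variant (see Hyams [91]); this dense version with illustrative names only conveys the structure.

```cpp
#include <cmath>
#include <vector>

// Symmetric Gauss-Seidel for A x = b: each sweep performs one forward and
// one backward Gauss-Seidel pass, updating x in place.
void symmetricGaussSeidel(const std::vector<std::vector<double>>& A,
                          const std::vector<double>& b,
                          std::vector<double>& x, int sweeps) {
    const int n = static_cast<int>(b.size());
    auto relaxRow = [&](int i) {
        double sigma = b[i];
        for (int j = 0; j < n; ++j)
            if (j != i) sigma -= A[i][j] * x[j];
        x[i] = sigma / A[i][i]; // assumes nonzero diagonal
    };
    for (int s = 0; s < sweeps; ++s) {
        for (int i = 0; i < n; ++i) relaxRow(i);      // forward pass
        for (int i = n - 1; i >= 0; --i) relaxRow(i); // backward pass
    }
}
```

In the actual solver the matrix is stored sparsely and the unknowns are the block ∆Q updates of Newton's method, but the sweep pattern is the same.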

Boundary Conditions
This section gives an overview of the boundary condition implementation. Finite volume schemes use the concept of 'ghost' elements (ghost nodes for node-centered schemes and ghost cells for cell-centered schemes) to facilitate the implementation of boundary conditions. These ghost elements store appropriate solution data, which are used to reconstruct solution values required at the boundary, sufficient to enforce the specified boundary condition. Since this research implements a cell-centered scheme, the boundary values specified by each particular boundary condition are reflected into the ghost element in such a way as to enforce that the average of the ghost and interior elements equals the boundary value unless otherwise noted, e.g., q_g = 2q_b − q_i, where q_g is the ghost-cell value, q_i the adjacent interior-cell value, and q_b the specified boundary value.
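The reflection rule follows directly from requiring the ghost/interior average to equal the boundary value; solving (q_g + q_i)/2 = q_b for q_g gives the sketch below (names illustrative).

```cpp
// Ghost-cell reflection for a Dirichlet-style boundary value:
// enforce (q_ghost + q_interior)/2 == q_boundary  =>  q_ghost = 2 q_b - q_i.
double ghostValue(double qBoundary, double qInterior) {
    return 2.0 * qBoundary - qInterior;
}
```

A zero-gradient (Neumann) condition is the degenerate case q_ghost = q_interior, so both condition types fit the same ghost-cell machinery.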

Flow-Through/Far-Field Boundaries
Far-field boundaries are implemented using the standard Characteristic Variable Boundary Conditions (CVBCs), wherein the governing equations are rewritten in the direction normal to the boundary, and standard characteristic variable analysis is used to decouple the system. For a detailed explanation of this method see Section 4.5.2 of Hyams' dissertation [91].

Inviscid Solid Wall Boundaries
Inviscid solid wall boundary conditions are imposed by forcing tangential boundary flow; this is done during the evaluation of the solid wall flux by enforcing u · n = un_x + vn_y + wn_z = 0, where u is the fluid velocity vector and n is the boundary face area vector. This implies that the only contributions to the inviscid flux at these boundaries come from the pressure terms in the momentum equations. Neumann boundary conditions are used for the pressure and temperature terms on slip boundaries; the condition on pressure is usually a zero pressure gradient; however, in buoyant cases, the Neumann condition for pressure, presented in Section 3.5.6.4, must be used instead. For this reason, it has been adopted as the default method used to evaluate inviscid fluxes on solid walls.

No-Slip Solid Wall Boundaries
No-slip walls have a straightforward implementation; wall velocities are imposed directly, and this implementation supports imposing moving boundaries. The same Neumann boundary conditions for the pressure and temperature terms used in the inviscid/slip wall boundary condition defined above are used for no-slip boundaries as well. Since boundary data are not explicitly solved for in this cell-centered paradigm, nothing special must be done to the Jacobian for this boundary condition.

Buoyant Solid Wall Boundary Condition
In cases where buoyancy is important, an additional Neumann boundary condition for pressure on wall boundaries must be implemented. This boundary condition is outlined by Kress [88].
In this case, in lieu of using a zero pressure gradient, the buoyant pressure gradient is set equal to the buoyant source term explained in Section 3.5.7. Namely, ∂p/∂η = (Gr/Re²)T, where η represents the direction normal to the wall. This boundary condition must be used on both slip and no-slip walls in simulations where buoyancy is important.

Percent-of-Slip Wall Boundaries
In certain situations neither the no-slip nor slip wall boundary conditions are fully applicable. For these cases a percent-of-slip condition, which blends the slip and no-slip wall behaviors, is provided. In practice, this boundary condition proves to be very useful. With a full no-slip boundary condition the velocity field does not mix very well; i.e., cross flows are inhibited and alleyways often do not 'feel' the effect of the bulk flow. However, when this percent-of-slip boundary condition is used, much better mixing is observed. This will be demonstrated in Chapter 4. However, it should be noted that using this approximation is obviously a 'modeling' issue, and more research is needed to thoroughly 'validate' its use in urban environments that deviate substantially from those explored in the present work.
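One plausible realization of such a blend is sketched below. This is an assumption about the blending form, not the paros implementation: the tangential slip-wall velocity is scaled by a fraction α ∈ [0, 1], so α = 1 recovers full slip and α = 0 recovers no-slip.

```cpp
#include <array>

using Vec3 = std::array<double, 3>;

// Hypothetical percent-of-slip wall velocity: project the interior velocity
// onto the wall tangent plane (the slip-wall velocity), then scale by alpha.
// alpha = 1 -> full slip; alpha = 0 -> no-slip (zero wall velocity).
Vec3 percentOfSlipVelocity(const Vec3& uInterior, const Vec3& unitNormal,
                           double alpha) {
    double un = 0.0; // normal component u . n
    for (int d = 0; d < 3; ++d) un += uInterior[d] * unitNormal[d];
    Vec3 uWall{};
    for (int d = 0; d < 3; ++d)
        uWall[d] = alpha * (uInterior[d] - un * unitNormal[d]);
    return uWall;
}
```

Either way the normal velocity component is removed, so the wall remains impermeable; only the tangential behavior is interpolated between the two limits.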

Custom Inflow Boundary Conditions
This application supports two custom inflow boundary conditions: a parabolic profile (for internal flow situations) and a power law profile (for atmospheric boundary layer type applications).
All parameters for both inflow profiles are specifiable at runtime, and the interface is general enough to allow for easy addition of new custom inflow boundary conditions.
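The two profile shapes named above can be sketched as simple functions of wall distance; the parameter names are illustrative stand-ins for the runtime-specifiable inputs.

```cpp
#include <cmath>

// Parabolic inflow for internal flows: zero at the walls (y = 0 and y = h),
// maximum uMax at the channel centerline y = h/2.
double parabolicInflow(double y, double h, double uMax) {
    return 4.0 * uMax * (y / h) * (1.0 - y / h);
}

// Power-law inflow for atmospheric-boundary-layer-type applications:
// uRef at reference height zRef, scaled by (z / zRef)^alpha.
double powerLawInflow(double z, double zRef, double uRef, double alpha) {
    return uRef * std::pow(z / zRef, alpha);
}
```

A runtime-configurable interface then only needs to map a profile name plus its parameters onto one of these evaluators per boundary face.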

Source Terms
The source term, S, in equation (3.1) accounts for thermal buoyancy; integrated over a control volume it takes the form S = (Gr/Re²)V T e_g, where V is the volume of the current element, T is its temperature, and e_g is the normalized vector pointing in the direction of gravitational acceleration.
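Per element, this source contribution is a single scaled vector; the sketch below uses illustrative names and the Gr/Re² coefficient noted earlier in the non-dimensionalization.

```cpp
#include <array>

using Vec3 = std::array<double, 3>;

// Buoyant source contribution for one element: S = (Gr/Re^2) * V * T * e_g,
// where e_g is the unit vector in the direction of gravity.
Vec3 buoyancySource(double Gr, double Re, double cellVolume, double T,
                    const Vec3& gravityDir) {
    const double coeff = (Gr / (Re * Re)) * cellVolume * T;
    return {coeff * gravityDir[0], coeff * gravityDir[1], coeff * gravityDir[2]};
}
```

This term is added only to the momentum residual; as discussed for the wall boundary condition, the same Gr/Re² scaling appears in the Neumann pressure condition so that the two balance at buoyant walls.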

CHAPTER 4 NUMERICAL RESULTS
This chapter will present and discuss various test cases used to validate the flow solver implemented for this project.

Laminar Flat Plate
The flat plate is one of the fundamental test cases used to validate any implementation of a Navier-Stokes solver. The theoretical solution to laminar flat plate flow was derived by Paul Richard Heinrich Blasius [100] in 1908. The Blasius solution is the de facto laminar Navier-Stokes test case and will be used here to verify the current implementation.
A structured UGRID mesh is provided to p4est as a skeleton for this test case and is run without any extra refinement. The flat plate is two-dimensional, two units long, and two-sided; it is floating in free space with a farfield extending five units away from the plate in the x and y directions. In the z direction there are four planes, corresponding to three elements, in order to ensure that there is one slice of elements completely interior to the mesh.

Cavity Validation Cases
This section presents various validation cases run on a square cavity geometry. The following test cases are designed to test aspects of the code related to convection and heat transfer.
The geometry is a simple unit square extruded a small distance in the z direction so the simulation may be treated as two-dimensional. These cases were run using two different types of grids: a uniform mesh, i.e., all elements were required to be at the same refinement level, and a non-uniform mesh containing elements at different refinement levels. The uniform cases were run using almost the same 81 × 81 × 2 structured mesh that Kress [88] and Er [89] used to obtain their results, except that it contains one more plane in the z direction in order to have a stack of completely interior elements. Both grids have normal/viscous spacing of 1 × 10⁻³. This is easy to accomplish with the uniform grid; the non-uniform grid, however, requires more work. The non-uniform grid was designed to have three levels of refinement such that the elements at level three were isotropic cubes achieving the target normal spacing. As such, the non-uniform meshes had a z dimension of 8 × 10⁻³ with levels three, two, and one each containing eight, four, and two elements in the z direction respectively. Given the way these grids are designed, they exhibit dramatically different cell counts: the uniform grid comprises 19,200 elements while 429,192 elements compose the non-uniform grid. This explicitly demonstrates a potential disadvantage of using isotropic, non-uniform octree meshes, as their cell counts can grow very quickly. In these cases, both types of meshes are used to compare the accuracy of paros run on two fundamentally different mesh types. The case using the non-uniform mesh exercises parts of the code that the uniform mesh does not: namely the logic that deals with hanging nodes/faces.

Natural Convection
The natural convection case is designed to test the heat transfer/buoyancy ability of the code.
The case is defined as follows: the flow is initialized at rest, and the x-min no-slip solid wall is kept at a constant non-dimensional temperature of one, while the x-max wall is kept at a constant non-dimensional temperature of zero. Both the x-min and x-max walls have a zero pressure gradient condition enforced. The y-min and y-max walls are adiabatic and have the non-zero Neumann buoyancy source term condition enforced, ∂p/∂η = (Gr/Re²)T, as described in Section 3.5.6.4. The z-min and z-max walls have symmetry boundary conditions applied. Figure 4.4 shows a schematic of the geometry, as well as the initial and boundary conditions for this natural convection cavity case. Table 4.1 shows that the current results match well with these values as well as with those reported by Kress [88] and Er [89]. Velocity profiles along cavity center lines are shown below, and there is excellent agreement between the results using both the anisotropic and isotropic meshes as well as the results presented by Kress [88].

Forced Convection
Presented here is the canonical lid driven cavity test case used to validate the convective terms of the code. The x-min, x-max, y-min, and y-max boundaries are no-slip solid walls and have zero pressure and temperature gradient conditions enforced. The y-max boundary has an imposed non-dimensional velocity of u = 1.0 and w = v = 0.0, which drives the flow in the cavity.
Again, the z-min and z-max walls have symmetry boundary conditions applied. The velocity is initialized to zero everywhere else. The geometry, initial, and boundary condition schematic corresponding to this case is given in Figure 4.8.
The case is run viscous and steady and, depending on the Reynolds number, may have multiple recirculation zones. Ghia et al. [102] and Mandal et al. [103] study this case in depth.

Mixed Convection
The mixed convection test case presented here is designed to simultaneously test both the convective terms and the heat transfer/buoyancy capability of the code. This problem is set up almost exactly the same as the forced convection case, presented in Section 4.2.2, except now the moving top boundary is also maintained at a constant non-dimensional temperature of one, making it an isothermal boundary in lieu of an adiabatic boundary. The Reynolds number is 1000, and since there is heat transfer, a Grashof number is required; for this case the Grashof number is 1 × 10 6 .
Furthermore, the top and bottom walls have the Neumann buoyant source term condition applied as described in Section 4.2.1. Once again, the geometry, initial, and boundary condition schematic is shown in Figure 4.12. The steady state solution for this mixed convection case is shown in   Iwatsu [104] and Agrawal et al. [105] have studied mixed convection in a cavity extensively; the current work is compared to their results and is shown in Figures 4.14 and 4.15. While agreement with previous results by Iwatsu [104] and Agrawal et al. [105] is not as good as the agreement seen in both the natural and forced cavity cases presented earlier in this chapter, similar levels of disagreement are observed in the results presented by Kress [88].

Square Cylinder
This section presents laminar validation cases-both steady and unsteady-on a square cylinder. Although the square cylinder is not as ubiquitous as its circular counterpart, it is judged to be a very good test case given the Cartesian aligned nature of the research herein. It is recognized that validation of a numerical result is usually performed using information with a 'pedigree;' e.g., measurements with known and/or accepted uncertainties. However, it is difficult to identify laminar test cases-especially unsteady ones-similar to the unit cavity problems presented herein that are available on Cartesian aligned geometries. Consequently, the decision has been made to use computed results from a trusted computational methodology as a standard for comparison (Tenasi, see Hyams et al. [106] and Taylor et al. [107]).
The mesh used for these square cylinder validation cases is a structured UGRID mesh. The cylinder is a unit square centered in a square mesh dimensioned 100 × 100. Zdravkovich [108] shows that for a circular cylinder there exists an empirical relationship between recirculation length, L_r, and Reynolds number; Sharma reports that this curve fit has a maximum deviation of 5%.
Various steady square cylinder cases were run, and recirculation lengths were compared against the results above. As seen in Figure 4.17, the results compare well with those of Sharma.
Note that, in this plot, the data corresponding to the results presented by Zdravkovich [108] were obtained for a circular cylinder. Reported values of the critical Reynolds number vary with simulation parameters such as blockage ratio and inflow and outflow boundary placement. Sohankar et al. [111] report the critical Reynolds number between 50 and 55 for a blockage ratio of 5%; Breuer et al. [109] cite a critical Reynolds number of approximately 60 for a blockage ratio of 12.5%; and Sharma et al. [110] report a critical Reynolds number of 50 with a blockage ratio of 5%. Note that Sohankar et al. [112] imply that if the wall boundaries are sufficiently far away from the cylinder, i.e., a blockage ratio less than 5%, the effect of these boundaries on the flow near the body is minimal. Figure 4.18 and Table 4.2 demonstrate the unsteady results as they compare against those of the University of Tennessee SimCenter at Chattanooga code Tenasi [106,107]; the Strouhal number associated with vortex shedding is compared to assess the time accuracy of the simulation. The Strouhal number is defined as St = fL/U, where f is the shedding frequency computed using forces exerted on the square cylinder by the steady-periodic flow, L is a characteristic length, and U is the free stream fluid velocity.
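The Strouhal computation itself is a one-liner once the shedding period has been extracted from the force history; the names below are illustrative.

```cpp
// Strouhal number St = f * L / U, with the shedding frequency f taken as the
// reciprocal of the period of the steady-periodic force signal.
double strouhalNumber(double sheddingPeriod, double charLength,
                      double freestreamVelocity) {
    const double f = 1.0 / sheddingPeriod;
    return f * charLength / freestreamVelocity;
}
```

In practice the period would be measured from successive peaks (or a Fourier transform) of the lift-force history on the cylinder.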

Turbulent Flow Over a Surface-Mounted Cube in a Tunnel
This final validation case examines the validity of the implemented Smagorinsky LES turbulence model by comparing simulation results against the experiment of Martinuzzi and Tropea [113] as well as results obtained using Tenasi [106,107]. Martinuzzi and Tropea explore the effects of various surface-mounted, prismatic, and fully three-dimensional obstacles in fully developed turbulent channel flow. The specific case used for this validation is that of a cube of dimension H = 1 mounted in a channel with dimensions l = 24H, w = 16H, and h = 2H. The cube is mounted in the center of the channel 5H downstream of the inlet. This setup is chosen because it is a widely studied test case with ample experimental data that exhibits fully three-dimensional, highly turbulent flow. Furthermore, the features of this flow are representative of those that might occur around buildings in an urban environment. Since the Smagorinsky turbulence model is likely insufficient to accurately model the type of turbulent flow exhibited by this test case, good agreement with experimental data is not expected. For this reason, Tenasi is also used to simulate this test case using the same Smagorinsky turbulence model, boundary, and initial conditions, as well as an analogous grid in order to have baseline information for comparison against the current results.
The grid used for this simulation is completely structured; however, two distinct meshes were built: one for paros and one for Tenasi. The version of the grid designed for Tenasi was built in the traditional manner, meaning that the grid exported from the grid generation software was used directly in Tenasi with no further refinement or modification. However, the grid for paros had to be designed for a target refinement level of two in order to keep the 'tree' count manageable. As such, the element dimensions of the original structured mesh designed for Tenasi were reduced by approximately a factor of four and the off-the-wall spacing increased by the same factor to create the skeleton mesh/connectivity used as the initial discretization for paros. The target normal spacing is 0.002 non-dimensional units, so the initial off-the-wall spacing for the skeleton mesh is 0.016. This design causes the final mesh used by paros to have a slightly less smooth variation of elements as they grow away from the boundaries, because every macroelement is uniformly refined twice to create the final mesh. However, it is anticipated that this difference in the final meshes should have minimal effect on the accuracy of the results. The Reynolds number, based on the height, H, of the cube, is 40,000. A parabolic velocity profile is imposed at the inflow boundary, and the exit boundary condition is implemented such that the back/exit pressure is the reference/free-stream pressure, and the fluid is allowed to simply exit the boundary. All other boundaries are no-slip solid walls. This case is run spatially first order for several hundred iterations; second order spatial accuracy is then activated, and the simulation is run for a few hundred more iterations to obtain an initial state for unsteady time stepping. After the unsteady solution evolves several hundred iterations into a physically realistic state, time averaging begins.
The period over which to time average the unsteady solution is chosen in order to mimic the approach taken by Hajjawi [114], who time averages over a non-dimensional time period of 50 units by setting ∆t = 0.05 and averaging over 1000 iterations. Unfortunately, for this grid, paros proved to be unstable using time steps of this size. Instead, a time step of ∆t = 0.01 is found to be stable, and time averaging is performed over 5000 iterations. Experience suggests that dual time stepping [115] could alleviate this stability problem. As seen in these profiles, neither set of simulation results compares well with experiment, although the overall trends are captured reasonably well. The lack of detailed agreement is due to several reasons, two of which are that neither the grid nor the turbulence model is sufficient to accurately resolve the turbulent physics inherent to this problem. Also, there are distinct differences between the time-averaged profiles produced by paros and Tenasi; there are numerous differences between the implementation specifics in these two codes that could play a role in the observed discrepancies.
Notably, Tenasi is a node-centered solver while paros is cell-centered. Also, there are significant differences between the two codes regarding how the surface gradients are computed, and Tenasi has been validated much more thoroughly than the current implementation of paros.

Large Scale Urban Simulation
This section will demonstrate the ability of the application as it is applied to a large urban geometry. Multiple cases will be run, compared, and summarized in this section. These cases will be unsteady, with and without buoyancy. In addition, cases exercising the contaminant transport functionality will be shown. The geometry used for these test cases is the same one used for the parallel grid generation speedup analysis previously shown in Section 2.3 and contains 927 buildings and 4.25 million facets. Refer to Figure 2.10 which shows this geometry.

Buoyant Versus Non-Buoyant Cases
The results in this section will demonstrate the differences in the unsteady solutions obtained with and without buoyancy.

Buoyant Versus Non-Buoyant Cases Including Contaminant Transport
This section presents a case similar to one given by Nichols et al. [5].

This research has produced an application that demonstrates excellent parallel grid generation speedup behavior and results in a computational mesh that is optimally load balanced. This application does not use cut-cell or projection methods, nor does it implement immersed boundary type algorithms; this means that the resulting computational mesh does not accurately represent any input geometry and is 'blocky' in nature. However, the software does provide a mechanism whereby certain specific geometries may be represented exactly in the final mesh. This mode is possible only when the geometry is itself planar and Cartesian aligned, e.g., a flat plate, square cylinder, forward or backward facing step, etc. A challenge pertaining to load balancing in certain algorithms used during the grid generation phase, specifically geometry-centric algorithms, has been identified, explained, and improved significantly.
The forest-of-octrees structure is stored using a linear storage paradigm implemented using Morton encoding. Ramifications of this storage paradigm are far reaching, in both positive and negative ways. The main advantage is the functionally perfect load balancing it affords in terms of elements-per-process. One of the major drawbacks of this decision is the loss of hierarchical traversal and the benefits that come with hierarchical data structures. Certain portions of this application suffer significantly from the inability to traverse the octree structure hierarchically. In order to lessen this impact, a single, coarser hierarchical octree is created in the geometry library and stores the geometry facets. This work leverages both types of octree structures during the creation of the computational mesh: the linear paradigm is used as a data structure to store mesh elements because of its optimal parallel partitioning properties, while the hierarchical paradigm is leveraged for speed during geometry queries that benefit from spatial 'knowledge'. This mixed approach has proven to be an important feature, as algorithms that benefit from the hierarchical storage do so significantly, sometimes gaining speedups greater than an order of magnitude.
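The Morton (Z-order) key that underlies the linear storage paradigm interleaves the bits of the three coordinates, so that sorting elements by key yields a space-filling-curve ordering that can be split evenly across processes. A standard bit-interleaving sketch (function names illustrative, not the p4est/paros internals):

```cpp
#include <cstdint>

// 3-D Morton encoding: interleave the low 21 bits of x, y, z so that each
// coordinate's bits land 3 positions apart in the 63-bit key.
uint64_t mortonEncode3D(uint32_t x, uint32_t y, uint32_t z) {
    auto spread = [](uint64_t v) {
        v &= 0x1fffffULL; // keep 21 bits so 3*21 = 63 bits fit in the key
        v = (v | v << 32) & 0x001f00000000ffffULL;
        v = (v | v << 16) & 0x001f0000ff0000ffULL;
        v = (v | v << 8)  & 0x100f00f00f00f00fULL;
        v = (v | v << 4)  & 0x10c30c30c30c30c3ULL;
        v = (v | v << 2)  & 0x1249249249249249ULL;
        return v;
    };
    return spread(x) | (spread(y) << 1) | (spread(z) << 2);
}
```

Because nearby coordinates produce nearby keys, contiguous key ranges correspond to spatially compact element sets, which is what makes the equal-count partitioning of the sorted linear octree both trivial and cache-friendly.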
Within the PDE solution framework, the buoyant incompressible NS PDEs are implemented using the pseudo-compressibility approach. This implementation is thoroughly tested and validated using various canonical test cases appropriate to the physics in question. The software is then successfully tested on a large scale urban geometry, both with and without heat and contaminant transfer.
To the author's knowledge, this is the first application of its type to use a linear forest-of-octrees data structure for its mesh generation. The forest-of-trees aspect of the mesh generation affords greater flexibility with respect to the size and shape of mesh elements; it provides mechanisms to easily and naturally create meshes with isotropic or anisotropic elements without stringent requirements on the shape of the root element. For example, the only way to have isotropic elements in a single tree structure is if the root element itself is isotropic and, conversely, the only way to have anisotropic elements in the mesh using a single tree structure is if the root element is anisotropic. These restrictions go away when using a forest-of-trees structure, i.e., an anisotropic forest can include isotropic mesh elements and vice versa. Furthermore, the octree aspect of the application makes it possible to approximate buildings with significant three dimensionality, e.g.
domes, spires, overhangs, or even bridges, whereas many other similar applications only support '2.5' dimensional building geometry.
The advances made in load balancing the flood-fill routine also provide tangible benefits. The approach developed significantly improves the balance of work during the flood-fill algorithm and leads to faster grid generation times. Although the method yields diminishing returns as the number of processes increases, it is still a valuable contribution.

Recommendations for Future Work
While this software has the potential to become a valuable tool with many areas of application, it is still very young and immature. There are many aspects of this research that would benefit from further development, refinement, validation, and research.

Grid Generation Considerations
The current adaptive mesh refinement (AMR) functionality is sufficiently robust, but the coarsening logic needs to be extended to support the coarsening of boundary elements. The addition of this feature is encumbered by the fact that the initial discretization of the geometry is also the final discretization; i.e., any boundary face present in the original discretization must always be present in any subsequent form of the grid. These faces may be refined or coarsened as long as the underlying shape of the initial discretization is maintained; unfortunately, this is more difficult to guarantee during the coarsening phase of AMR. The application also does not currently support exact discretization of curved geometry; all geometry is approximated by explicitly removing any mesh element that intersects the geometry, resulting in 'blocky' or 'stepped' computational meshes. A cut-cell, immersed boundary, or projection method should be implemented in order to represent arbitrary geometry accurately. These methods are well established, and their implementation within the application should not be exceedingly difficult. Mesh movement would also be a valuable addition. Before it is added, however, the robustness of the application will need to be explored to ensure the computational mesh can be regenerated reliably every time geometry moves; currently the mesh is static once it has been created, except when AMR is being used. This should not be a monumental task, but it is one that must be approached with thought and care.
Another limitation of this application within the context of urban simulation is the manner in which the cityscape geometry must be represented. Currently, the buildings and terrain must be represented with a water-tight tessellation that the software uses to define the computational domain and determine the in/out status of elements therein. Obtaining such a water-tight geometry may be a difficult task in and of itself and may prove as prohibitive as manually generating the computational domain. There are a number of ways this could be mitigated. For example, many similar applications leverage commonly available GIS data directly for the definition of the geometry, without the need to generate a water-tight representation. Furthermore, for significant urban areas, even if a water-tight representation exists, it will likely be too large for each process to load into memory; a client-server model could be implemented such that a single, larger-memory machine loads the geometry and handles all related queries. This approach may introduce a bottleneck, but given that the grid generation portion of the run-time is small relative to the simulation time, this drawback should not present a significant problem.
Another aspect of this application that could use refinement pertains to urban simulations with significant terrain features: e.g., mountainous regions or places where humans have significantly modified the landscape during urban planning and development. Given the grid generator's inability to exactly represent curved geometries, terrain poses an interesting problem, especially in regions near far-field boundaries where large mesh elements are desired. Succinctly, the 'blocky' or 'stepped' nature of the mesh creates large, physically unrealistic features near areas of significant terrain curvature. This makes urban areas with appreciable terrain variation more difficult to simulate and has the potential to adversely affect solution quality because of the non-physical boundaries present in the final computational domain. The implementation of proper boundary-resolving mesh generation would clearly eliminate this problem but may add computational overhead. Alternatively, it may be possible to treat elements near terrain boundaries as porous media in such a way as to model the effect of the terrain on the flow without explicitly resolving it.
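One common way to realize such a porous-media treatment (an assumption here, not something implemented in this work) is to add a Darcy-Forchheimer drag source to the momentum equations in terrain-adjacent cells:

```latex
S_i = -\frac{\nu}{K}\, u_i \;-\; \frac{C_F}{\sqrt{K}}\,\lvert \mathbf{u} \rvert\, u_i,
```

where \(K\) is a prescribed permeability and \(C_F\) a form-drag coefficient. Small \(K\) drives the local velocity toward zero, mimicking solid terrain, while large \(K\) leaves the flow essentially unchanged, allowing a smooth blend across the stepped terrain boundary.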

Computational Considerations
An original goal of this research was to develop a method for rapid simulation of an urban disaster scenario in order to provide real-time feedback to first responders. While the current state of the application does provide a significant advantage in automation for such an event, the time-to-solution still requires concentrated optimization. Part of this goal is realized by the nature of the computational mesh generated by the software, wherein geometry is under-resolved and full viscous resolution is relaxed in favor of smaller representative meshes that afford an acceptable level of solution fidelity. Effort could be spent maximizing the unsteady time-step while maintaining stability in order to simulate longer physical time periods in a minimum number of time-steps.
One way this could be improved is by implementing dual time-stepping. However, the most ground stands to be gained by taking advantage of hardware solutions and leveraging multiple appropriate parallelization paradigms. The state-of-the-art in high performance computing (HPC) is moving toward heterogeneous systems that take advantage of both distributed and shared memory paradigms, as well as accelerators, to attain maximum parallel performance. Significant effort must be put forth to take advantage of shared memory and accelerator paradigms within the context of a distributed memory application; non-trivial algorithmic and programming changes must be thought out and implemented. However, in order to achieve this level of parallel efficiency, these options must be explored thoroughly.
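The dual time-stepping idea mentioned above can be sketched as follows; this is the standard BDF2-in-pseudo-time form, with notation assumed rather than taken from the solver. Each physical time step is converged by iterating in pseudo-time \(\tau\):

```latex
\frac{\partial Q}{\partial \tau}
+ \frac{3\,Q^{n+1} - 4\,Q^{n} + Q^{n-1}}{2\,\Delta t}
+ R\!\left(Q^{n+1}\right) = 0,
```

where \(R\) is the spatial residual. \(Q^{n+1}\) is iterated in pseudo-time until \(\partial Q/\partial \tau \to 0\), recovering a second-order implicit physical time step while permitting convergence-acceleration techniques (local pseudo-time stepping, multigrid) within each physical step, which is what enables larger stable \(\Delta t\).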
There are outstanding load-balancing issues that do not pose significant bottlenecks but could benefit from some focused attention. The main difficulty is in achieving optimal load balancing in the parts of the application that deal directly with geometry and geometry-centric functionality, e.g., boundary condition implementation, surface force calculations, etc.
This application provides nearly perfect load balancing for volume-to-volume computations, but it is more difficult to ensure that all processes are assigned an equal amount of boundary work. This problem is not specific to this application, however; it is a challenge common to many applications that perform these types of computations.
The contaminant transport functionality of the code works well, but for more realistic plume simulations the application should be modified to support non-neutrally buoyant contaminants. Many contaminants of interest are either heavier or lighter than air, and this difference significantly affects how a plume propagates: heavier plumes tend to sink and spread outward more, while lighter plumes rise more quickly and spread outward less. Integrating these effects into the solver will result in more accurate plume propagation simulations.
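One standard route to non-neutral buoyancy (an assumed extension, not the solver's current formulation) is to transport the contaminant concentration \(c\) as a scalar and add a solutal term alongside the thermal one in the momentum source:

```latex
\frac{\partial c}{\partial t} + u_j \frac{\partial c}{\partial x_j}
= \frac{1}{Re\,Sc}\,\frac{\partial^2 c}{\partial x_j\,\partial x_j},
\qquad
S_i = -\left(\frac{Gr}{Re^2}\, T + \frac{Gr_c}{Re^2}\, c\right) e_i,
```

where \(Sc\) is a Schmidt number and \(Gr_c\) a solutal Grashof number whose sign reflects whether the contaminant is lighter or heavier than air. This mirrors the thermal Boussinesq source derived in the appendix, so the existing buoyancy machinery could plausibly be reused.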
This appendix will give more details about how the buoyant Navier-Stokes equations used during this research are derived. I am indebted to Dr. D. Steven Nichols for his contributions to this derivation; this entire appendix is an almost verbatim replication of notes he provided to me outlining this approach.
To start out, the equations will be derived assuming that gravity acts in the negative y-direction; in this case only the y-momentum equation needs to be considered, as it is the only equation modified by the addition of buoyancy terms. Starting with the compressible y-momentum equation with a gravitational source term,
\[
\frac{\partial(\rho v)}{\partial t} + \frac{\partial(\rho u v)}{\partial x} + \frac{\partial(\rho v^2)}{\partial y} + \frac{\partial(\rho v w)}{\partial z} = -\frac{\partial p}{\partial y} + \mu\,\nabla^2 v - \rho g,
\]
assuming incompressibility and dividing through by the constant reference density \(\rho_\infty\) yields the incompressible y-momentum equation
\[
\frac{\partial v}{\partial t} + \frac{\partial(uv)}{\partial x} + \frac{\partial(v^2)}{\partial y} + \frac{\partial(vw)}{\partial z} = -\frac{1}{\rho_\infty}\frac{\partial p}{\partial y} + \nu\,\nabla^2 v - \frac{\rho}{\rho_\infty}\, g.
\]
Thus, the source term vector, \(S\), in equation (3.1) is
\[
S = \begin{bmatrix} 0 \\ -\frac{\rho}{\rho_\infty}\, g \\ 0 \end{bmatrix}
\]
or, in another way,
\[
S = \begin{bmatrix} 0 \\ -(1 + \hat{\rho})\, g \\ 0 \end{bmatrix}, \qquad \hat{\rho} = \frac{\rho - \rho_\infty}{\rho_\infty}. \tag{A.4}
\]
By the Boussinesq approximation [93], the density may be written as a function of temperature as
\[
\hat{\rho} = -\alpha_T (T - T_\infty). \tag{A.5}
\]
Substituting equation (A.5) into equation (A.4) and absorbing the constant part of the gravity term into the pressure yields
\[
\frac{\partial v}{\partial t} + \frac{\partial(uv)}{\partial x} + \frac{\partial(v^2)}{\partial y} + \frac{\partial(vw)}{\partial z} = -\frac{\partial p^*}{\partial y} + \nu\,\nabla^2 v + \alpha_T (T - T_\infty)\, g,
\]
where \(p^* = \frac{p}{\rho_\infty} + g y\) is the modified pressure. Applying the non-dimensionalization discussed in Section 3.2 to \(p^*\) and the y-momentum equation using equations (3.21) yields the following non-dimensional equations,
\[
\hat{p}^* = \hat{p} + \frac{\hat{y}}{Fr^2} \tag{A.9}
\]
and
\[
\frac{\partial \hat{v}}{\partial \hat{t}} + \frac{\partial(\hat{u}\hat{v})}{\partial \hat{x}} + \frac{\partial(\hat{v}^2)}{\partial \hat{y}} + \frac{\partial(\hat{v}\hat{w})}{\partial \hat{z}} = -\frac{\partial \hat{p}^*}{\partial \hat{y}} + \frac{1}{Re}\,\hat{\nabla}^2 \hat{v} - \frac{\hat{\rho}}{Fr^2},
\]
where \(Fr = \frac{u_\infty}{\sqrt{gL}}\) is the Froude number. At this point, the source term vector is
\[
S = \begin{bmatrix} 0 \\ -\frac{\hat{\rho}}{Fr^2} \\ 0 \end{bmatrix}.
\]
The source term needs to be rewritten in terms of temperature, the quantity that drives the buoyancy, in order to arrive at the final form used in this work.
Start with the dimensional source term,
\[
-\hat{\rho}\, g = \alpha_T (T - T_\infty)\, g
\]
(note that \(\hat{\rho}\) is already non-dimensional), and recall that during the non-dimensionalization this term is multiplied by \(\frac{L}{u_\infty^2}\). Thus, dropping the hat (ˆ) notation, the source term vector may be written as
\[
S = \begin{bmatrix} 0 \\ \frac{\alpha_T\, \Delta T\, T}{Fr^2} \\ 0 \end{bmatrix}.
\]
Recall that \(\frac{\alpha_T\, \Delta T\, T}{Fr^2}\) may easily be written as \(T\,\frac{Gr}{Re^2}\), which is the final form used in this dissertation. This is the derivation for the simple case where gravity is aligned with a coordinate direction, specifically the negative y-direction.
A generalization is outlined below that allows gravity to be aligned arbitrarily in lieu of choosing a Cartesian axis. Begin with the compressible momentum equations written in tensor form with an unaligned gravity vector,
\[
\frac{\partial(\rho u_i)}{\partial t} + \frac{\partial(\rho u_i u_j)}{\partial x_j} = -\frac{\partial p}{\partial x_i} + \frac{\partial \tau_{ij}}{\partial x_j} + \rho g\, e_i. \tag{A.14}
\]
The gravity vector is \(\mathbf{g} = g\,\mathbf{e}\), where \(\mathbf{e}\) is a unit vector pointing in the direction of gravity and \(g \approx 9.81\ \mathrm{m\,s^{-2}}\) or \(g \approx 32.19\ \mathrm{ft\,s^{-2}}\). In order to recover the original formulation with gravity aligned in the negative y-direction, \(\mathbf{e}\) may be written as
\[
\mathbf{e} = \begin{bmatrix} 0 \\ -1 \\ 0 \end{bmatrix}.
\]
Applying the Boussinesq approximation as before and dividing through by \(\rho_\infty\) gives the non-dimensional incompressible momentum equations
\[
\frac{\partial u_i}{\partial t} + \frac{\partial(u_i u_j)}{\partial x_j} = -\frac{\partial p^*}{\partial x_i} + \frac{1}{Re}\,\frac{\partial^2 u_i}{\partial x_j\, \partial x_j} + \frac{\hat{\rho}}{Fr^2}\, e_i,
\]
where \(\frac{\hat{\rho}}{Fr^2}\, e_i = -\frac{\alpha_T\, \Delta T\, T}{Fr^2}\, e_i\), \(p^* = p - \frac{\hat{\delta}}{Fr^2}\), and \(\hat{\delta} = \frac{\delta}{L} = \frac{\mathbf{x}}{L} \cdot \mathbf{e}\). Finally, the final form of the source term vector for an arbitrarily aligned gravitational vector is
\[
S = -\frac{Gr}{Re^2}\, T\, \mathbf{e}.
\]
Using this approach, \(\delta\) is the signed distance between the control volume centroid and a plane perpendicular to \(\mathbf{e}\) that passes through the origin. This plane does not have to be specified explicitly; only the 'down' direction must be known.