Data development process Software and algorithms The methods listed in this section are all implemented in the statistical programming language R (https://www.r-project.org/) and the more general-purpose programming language Python (https://www.python.org/) combined with the Esri ArcGIS Python Application Program Interface (API) (https://developers.arcgis.com/python/) and Spatial Analyst (http://www.esri.com/software/arcgis/extensions/spatialanalyst). The ArcGIS code invoked here is used only during the ETL process for tasks such as clipping and projecting raster data, while the analytical work is done using R. Ideally, all of the software code invoked by this project would be open-source and available for audit, but the proprietary Esri ArcGIS was chosen for the convenience of its availability and implementation. Other GIS software, such as GRASS GIS (https://grass.osgeo.org/), would be an ideal alternative, given its open-source implementation. The Python programming language has seen widespread adoption in the environmental sciences, and has been developed under an open-source license (Lin, 2012). The Python code used to implement processing and derivation methods for each input layer is made available on GitHub (https://github.com/), a popular web site used for sharing open-source software code. The Python code is also commented with citations to papers that describe the processing methods. Collected Data Products The input products downloaded for use in the model are listed below. The general workflow for each data input is to 1. Download data from provider 2. Pre-process the data a. Project, if necessary, into Albers Equal Area projection b. Clip data to remove any cells outside the REACCH bounding box c. Write the resulting raster to disk Two of the input data products involve processing beyond this general workflow. Using the digital elevation model, we derive layers for slope and topographic wetness index. The point-location soil samples require more extensive processing, described below. Once this workflow is complete for each input, the covariate raster data are combined into a comma-separated text file that can be loaded in R. The input products are shown with spatial and temporal information in Table 1. The downloaded and processed input data are available in the downloadable package for this project. Study Area The REACCH study area is a polygonal area of about 93,800 km2 located within Northern Idaho, Eastern Washington, and Northeastern Oregon (Fig 1, left). This area encompasses the major cereal crop growing region of the inland northwestern United States. To minimize the occurrence of edge effects during statistical modeling, a bounding rectangle (122° W, 49° N, 115.5° W, 44° N) of approximately 277,000 km2 was constructed containing parts of Idaho, Montana, Oregon, and Washington (Fig. 1, right). The REACCH bounding rectangle is the spatial extent to which other data products are clipped during pre-processing. Carbon: USDA NCSS The USDA National Cooperative Soil Survey produces a Soil Characterization database that includes soil organic carbon (percent weight), bulk density (g/cm3), and percent rock content sampling at point locations throughout the US. Samples are taken per soil horizon, nominally for the full depth of the soil profile, but not all analyses are performed for all horizons at every location. The NCSS SOC samples are used as ground-truthing data to train the random forest model. Because the NCSS data are collected by disparate agencies across the study region, their form is more heterogeneous than other data used in the model and require more intensive processing to prepare for use. The data are downloaded as a set of comma-separated value (CSV) files for each county that intersects the REACCH bounding box. The CSV files for each county are joined using identifiers for each observation, and then values of interest (SOC percent weight, soil bulk density, and rock content) are extracted for each. Samples lacking any of the values of interest are discarded. Soil organic carbon content (g/m2) is computed following the method described by Bliss et al. (1995) for each soil layer and summed to find a total SOC value for each sample location. Sample locations outside the REACCH bounding box are removed and all remaining samples are added to a point-based shapefile. The process is repeated for each county, building upon the shapefile until all counties are processed. Soil: USDA NRCS gSSURGO The USDA Natural Resources Conservation Service (NRCS) produces the Gridded Soil Survey Geographic Database, a 10 m spatial resolution gridded dataset with associated tabular data describing soil series and associated characteristics across the US. The gSSURGO dataset is based upon the SSURGO polygon vector dataset, a product of rasterizing the polygons. From the NRCS fact sheet on gSSURGO, “The raster map data have a 10-meter cell size that approximates the [SSURGO] vector polygons in an Albers Equal Area projection,” which is to say, the vector polygons of SSURGO have been divided into 10 m grid cells and all the values of each polygon transferred to each grid cell (USDA NRCS, 2016). Because gSSURGO is based upon county soil survey data, the temporal scale varies depending on the update frequency of the counties. The associated tabular data include a measure of soil organic carbon (g/m2) and are used primarily as a basis for comparison of the model output of the scorpan process. The data are downloaded as a spatial grid for each state (Idaho, Oregon, Washington) and are processed by mosaicking the state grids together, then extracting the grid cells from within the area of the REACCH bounding box, and finally the SOC value from the associated table is joined to the grid. Climate: GRIDMET Given the importance of precipitation on SOC dynamics, a precipitation and temperature dataset has been included from Abatzoglou’s Gridded Surface Meteorological Data (GRIDMET) (Abatzoglou, 2013). Mean annual temperature and mean annual precipitation have been shown to be significant predictors of SOC variability (Morrow, 2014). These data are 4 km spatial resolution raster data describing average precipitation and minimum and maximum temperature from 1979-2010. Using these data with a 4 km spatial resolution involves commission of the Ecological Fallacy, however, the importance of these climate variables on carbon dynamics in combination with the difficulty of obtaining higher resolution data make these data the best currently available for the purpose. The data are downloaded from the Northwest Knowledge Network’s (https://www.northwestknowledge.net) Thematic Real-time Environmental Distributed Data Services (THREDDS) server (http://www.unidata.ucar.edu/software/thredds/current/tds/), which allows the user to specify spatial and temporal bounds as well as aggregation criteria. Therefore, the downloaded data already represent the correct variables and spatial extent and require no pre-processing other than re-projection. Organisms: USDA NASS NCDL The USDA National Agricultural Statistics Service (NASS) produces the National Cropland Data Layer (NCDL) yearly as a 30 m spatial resolution grid aligned to the 30 m National Elevation Dataset (NED) grid. The NCDL algorithmically classifies individual grid cells into agricultural cover types using satellite imagery, supervised classification techniques, and ground truthing. Since the area of interest of this study area is primarily agricultural land, the NCDL crop categories provide detailed information about land cover within the region. The data are downloaded as a spatial grid for each state (Idaho, Oregon, Washington) and are processed by mosaicking the state grids together, then extracting the grid cells from within the area of the REACCH bounding box. Due to limitations of the random forest implementation in R, categorical inputs to random forest models are limited to a maximum of 53 classes (https://cran.r-project.org/web/packages/randomForest/NEWS). The clipped NCDL layer has 84 cover classes, which means that some classes must be collapsed (combined to form less specific classes). Care has been taken to avoid collapsing the classes of primary interest to cereal production and to prefer collapse of classes that are sparsely represented (or completely absent) within the REACCH study area. Note that this aggregation of classes is done dynamically within the R code for the modeling; the cover classes stored on disk are left in their original form. Table 2 shows a mapping of classes that have been collapsed into more general groupings. Relief: USGS NED The USGS National Elevation Dataset (NED) is a 30 m spatial resolution gridded digital elevation model (DEM) for the US. The varying topography of the study area influences erosional and depositional patterns of SOC across the landscape. Several products are derived from the NED: slope, a depression-filled DEM (O'Callaghan and Mark, 1984), flow accumulation and flow direction (Jenson and Domingue, 1988), and topographic wetness index (TWI) (Quinn et al., 1991; Moore et al., 1993; Gessler et al., 1995). Flow accumulation and direction are intermediate layers in this model; only elevation, slope, and TWI are used as model inputs. The data are downloaded as a spatial grid for each state (Idaho, Oregon, Washington) and are processed by mosaicking the state grids together, then extracting the grid cells from within the area of the REACCH bounding box. Sinks are filled (Reuter et al., 2009) and a slope grid is generated, followed by flow direction, flow accumulation, and TWI. Parent material and Age: USGS Aeroradiometric Grids The USGS aeroradiometric grids are 2 km spatial resolution maps of potassium, thorium, and uranium concentration in the top 30 cm of the Earth’s surface (Duval et al., 2005). The aeroradiometric data are thought to be a convenient proxy for parent materials (Bierwirth et al., 1996; Gessler et al., 1995). The data were collected between 1973 and 1981 by various contractors at various spatial resolutions, and were combined into a single dataset for the conterminous United States by the USGS with some data loss (Hill et al., 2009). Because of the heterogeneous composition of these data, their variable spatial and temporal resolutions, and their age relative to the rest of the input products, they may serve as a weak proxy for present-day parent material, but are nonetheless included as the best nationwide aeroradiometric product that currently exists. As with the climate data, the mismatch in spatial resolution between these data and other gridded inputs represents commission of the Ecological Fallacy. The data are downloaded as a spatial grid for each element (potassium, thorium, uranium) for the continental US and are processed by extracting the grid cells from within the area of the REACCH bounding box.