<h1>Using AI to extract root detections</h1>
<p>2023-11-01</p>
<script type="text/javascript" id="MathJax-script" async="" src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js">
</script>
<h2 id="introduction">Introduction</h2>
<p>At Hi Fidelity Technologies, we were interested in measuring root growth. Roots
are critical for nutrient and water uptake, but they have long been ignored
because roots are difficult to measure. We invented and developed a novel
device called RootTracker, which could measure root growth using something akin
to capacitance touch sensing. That may sound easy, but it was a slog. The
basic problem is that roots are made up of mostly water, and they are in a
medium that is mostly water (and opaque). How do you detect water in water?</p>
<p>Here I am going to walk you through two different answers to that question. The
first is a physics answer, while the second is a data science answer. I’ll
spend most of my time on the data science answer, since that is where I did the
most work, but I’ve got to give a shout-out to my colleague, <a href="https://www.linkedin.com/in/jeffrey-aguilar-2940867/">Jeff
Aguilar</a> and his physics-based
approach, since it is what originally cracked the problem.</p>
<h2 id="background">Background</h2>
<p>Below is a picture of a RootTracker device. There is a circuit encased in
urethane at the top, which houses the microprocessors that drive the device.
Emanating out of the urethane are printed circuit board “paddles”, which have 22
gold-plated electrodes running down the edge. An individual electrode is
charged up, and then the voltage at that electrode is measured. This is done for all
22 electrodes on all 12 paddles for 2 different charging times every 5 minutes,
resulting in something like a 2x12x22 dimensional time series of voltages. We
want to figure out how to use those voltages to identify when a root passes near
a sensor.</p>
<p><img src="../../07/images/roottracker.jpg" alt="" /></p>
<h2 id="using-physics-to-identify-roots">Using physics to identify roots</h2>
<p>This problem was first dealt with by <a href="https://www.linkedin.com/in/jeffrey-aguilar-2940867/">Jeff
Aguilar</a>, Hi Fidelity’s
lead engineer. Jeff has a physics background, and it was incredibly
informative to see a talented physicist like Jeff think. Jeff would tackle a
problem using a mix of both theory and experimentation. Theory, in the sense of
trying to define models or metrics that capture the dynamics of whatever is
being observed. Experimentation, in the sense of trying to identify and
bifurcate the factors that might be driving the system. He was also a master of
visualization, making plots or movies to gain intuition. Let us start with the
experimentation and then go to the theory.</p>
<h3 id="experimentation">Experimentation</h3>
<p>One of our biggest challenges was identifying ground truth. How do you identify
when a root passes by a sensor if you cannot see it? Jeff came up with an
ingenious solution to that problem, dubbed an “ant farm experiment.”</p>
<p>Below is a picture of an “ant farm.” It consisted of a 3d-printed frame, which
sandwiched soil between two pieces of plexiglass. A corn seed was placed in the
tiny box on top. When it germinated, its early roots would grow through the ant
farm.</p>
<p><img src="ant-farm-mock-up.jpg" alt="" /></p>
<p>Embedded in the soil were electrodes, which were attached to a microcontroller
to record their voltages. The electrodes can be seen here as the blue-to-green
lines in the left hand side of the image. On the right you can see the voltages
over time, color coded to match up with their corresponding electrodes. Jeff
placed a camera on either side of the ant farm and then synced the subsequent
video with that of the time series so that you could monitor the point on the
time series that corresponded to the current frame of the video. In that way,
one could watch a root grow past a sensor and then see what it did in the time
series. This methodology convinced us that a root growing past a
sensor could cause a disturbance in the observed voltages.</p>
<p><img src="ant-farm-2.jpg" alt="" /></p>
<h3 id="theory">Theory</h3>
<p>The time series were largely smooth, but they displayed a variety of behavior.
In addition to changes when a root grew past, the humidity of the soil could
impact the signal, along with shifts in the soil, which were not uncommon.
What, specifically, indicated the presence of a root? To answer that, Jeff
developed a model circuit whose parameters included the local resistance and
capacitance at a sensor. One could transform the voltage data to these
resistance and capacitance parameters, so that one in effect was monitoring
local resistance and capacitance at each sensor over time. Constructing videos
in R-C space and then flagging times that corresponded to root activity found in
the aforementioned videos, Jeff was able to identify “signatures” for when a
root grew past a sensor. I am being very succinct in my description here, but
that is the basic idea.</p>
<p>This was all a tremendous achievement, but it raises the question: how can we use
AI/ML/statistics to pick the optimal “signature” of a root passing a sensor?
That’s what we want to describe next.</p>
<h2 id="using-data-science-to-identify-roots">Using data science to identify roots</h2>
<p>Our aim at Hi Fidelity Genetics was to measure root growth in real world
experiments, either in the greenhouse or in the field. As before, we run into
the same problem: how does one identify ground truth when one cannot see when a
root passes a sensor? Here we will adopt a different approach than the “ant
farm” experiment. Instead, we will conduct an experiment in which some
devices have plants in them and some do not.</p>
<p>Let’s think about this for a second, because it is not obvious that it will
work. We are collecting a 2x12x22 dimensional time series. Let’s say there are
8000 time points. In that case, we effectively have 2x12x22x8000 covariates for
a binary (yes plant/no plant) response. We can observe tens to hundreds of
plants. Thus, we have relatively few cases against a large number of
covariates. Thankfully, using intuition guided by our knowledge of the physics
approach, we <em>can</em> find a way to do this. And, we can also recover individual
root touches in the process, which is to say the phenotype of interest, not just
the binary response!</p>
<h3 id="data-set">Data set</h3>
<p>The data that we will be using throughout is from a greenhouse experiment that
took place over the course of 1 month. We tested devices (like the one pictured
in the Background section) with maize, wheat, soybean, cotton, and tomato plants
in them as well as devices with no plants. There were 24 devices per treatment,
including for the no plant control. For our purposes here, we will only focus
on maize vs no plant.</p>
<p>Soil was placed in pots. RootTracker devices were inserted into the soil and a
seed was placed at the center of the device, except for those devices with no
plant. All pots were hand watered as necessary.</p>
<h3 id="data-exploration">Data exploration</h3>
<p>One of the first things that jumps out when looking at raw voltage data is that
the two measurements are often highly correlated. Below, we randomly pick three
electrodes from a device and plot the voltages for the two different charge
times against one another. (The charge times are 1 and 255 microseconds
respectively.)</p>
<center>Voltages for the two charging times plotted against one another for three electrodes</center>
<p><img src="V255-V1.jpg" alt="" /></p>
<p>In effect, the voltages seem to be moving along a common manifold (a line in
this case). At first glance, it seems like we might only need to use one of the
voltages — why use both when they are highly correlated? But what we know
from the physics approach is that key information comes from when there is
movement off of the common manifold, which is to say when the voltages are
diverging in some way. We want to get at that information. While the lines
seem to have a very similar slope and position, they are not identical. Thus,
we will treat each individually.</p>
<p>For each paddle and electrode, we compute the principal components decomposition
of the time series of the two voltage measurements. Most of the variation will
be in the first component, which captures the joint movement of the two time
series. The second component captures how much the two time series are
converging or diverging. A key element here is we want to orient the principal
components decomposition so that the second component always corresponds to
movement down and to the right and the first component always corresponds to
movement up and to the right.</p>
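As a sketch, that per-electrode oriented decomposition might look like the following (numpy only; the function and variable names are mine, and synthetic data stands in for the real voltages):

```python
import numpy as np

def oriented_pca(v_short, v_long):
    """PCA of the 2D voltage series for one electrode, oriented so that
    PC1 points up-and-right (joint movement of the two voltages) and
    PC2 points down-and-right (divergence of the two voltages)."""
    X = np.column_stack([v_short, v_long])
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc.T))   # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    if eigvecs[0, 0] < 0:            # PC1: up and to the right
        eigvecs[:, 0] *= -1
    if eigvecs[0, 1] < 0:            # PC2: down and to the right
        eigvecs[:, 1] *= -1
    return Xc @ eigvecs, eigvals     # scores (new coordinates) and variances

# Synthetic, highly correlated voltages standing in for the two charge times
rng = np.random.default_rng(1)
common = np.cumsum(rng.normal(size=500)) * 0.01
v1 = 2.0 + common + rng.normal(scale=0.001, size=500)
v255 = 2.5 + common + rng.normal(scale=0.001, size=500)
scores, variances = oriented_pca(v1, v255)
```

Under this orientation, `variances[1]` is the second-component variance whose log is the global statistic compared across devices.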
<p>Below we plot the (log) variance found in the second principal component for
each paddle and electrode. Keep in mind this is a global statistic constructed
using all of the time points available. We see clear separation between the two
groups, which is a good sign that monitoring the second component will indeed
tell us something about when a root is passing by an electrode.</p>
<p><img src="log-variance-second-component.png" alt="" /></p>
<p>Let us conceptualize the 2D time series as the trajectory of a particle over
time. What we have done so far is, for each paddle and electrode, to choose a
convenient and informative set of coordinates for tracking the trajectory. Now,
we want to summarize the particle’s movement through time. To that end we will
compute some statistics on a rolling basis. In particular, for each coordinate
we will keep track of the mean and standard deviation of the instantaneous
velocity over a set window, e.g. 2 hours. (We also measure the angular
velocity in 2D.)</p>
<p>Below we plot the time series of these and other statistics: the mean location
of coord. 1, the mean location of coord. 2, the mean velocity of coord. 1, the
mean velocity of coord. 2, the standard deviation of the velocity of coord. 1,
the standard deviation of the velocity of coord. 2, the mean angular velocity,
and the standard deviation in the angular velocity. I may call a velocity
“momentum”, since the ideas are interchangeable.</p>
<center>Example of rolling statistics for one electrode</center>
<p><img src="rolling-statistics.png" alt="" /></p>
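A minimal sketch of how such rolling velocity statistics might be computed (numpy only; the window length of 24 samples, i.e. 2 hours at one reading every 5 minutes, and the names are assumptions):

```python
import numpy as np

def rolling_velocity_stats(coord, window=24):
    """Rolling mean and standard deviation of the instantaneous velocity
    (first difference) of one coordinate of the trajectory."""
    vel = np.diff(coord)   # instantaneous velocity
    windows = np.lib.stride_tricks.sliding_window_view(vel, window)
    return windows.mean(axis=1), windows.std(axis=1)

rng = np.random.default_rng(2)
coord2 = np.cumsum(rng.normal(scale=0.1, size=1000))   # e.g. coord. 2 over time
mean_vel, std_vel = rolling_velocity_stats(coord2)
```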
<p>For each of these statistics, I found the 1st and 99th percentiles globally, then
computed how long each device spent below the 1st (“low”) or above the 99th
(“high”) percentile of each statistic. Averaging over each group (corn / no plant), one finds that
the extreme negative values for the mean velocity of the second component and
the extreme high values for the standard deviation of the velocity of the second
component are the most informative statistics.</p>
<p>All of this led to the following algorithm:</p>
<ol>
<li>Identify periods of “anomalous” behavior, which is to say when there is large
negative velocity in the second principal component. (We also enforce that
the standard deviation must be above zero, but not too high, which
corresponds to when we have a steady negative velocity.)</li>
<li>Combine periods of time that are very close to one another in time or are on
adjacent electrodes at similar times on the same paddle.</li>
<li>Remove periods that occur simultaneously across paddles, which might
correspond to when the plants are watered or some other global phenomenon.
Also remove periods that are too short.</li>
<li>Either integrate the total anomalous time or count the number of anomalous
periods.</li>
</ol>
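Step (1) and the run-length filtering from step (3) could be sketched as follows (the thresholds, names, and data here are illustrative, not the production values):

```python
import numpy as np

def flag_anomalous(mean_vel, std_vel, vel_cut=-0.25, std_lo=0.01, std_hi=0.5,
                   min_len=3):
    """Flag windows with a large negative mean velocity and a moderate
    (nonzero but bounded) standard deviation, then drop runs shorter than
    `min_len` samples."""
    flag = (mean_vel < vel_cut) & (std_vel > std_lo) & (std_vel < std_hi)
    out = flag.copy()
    # Locate run boundaries: +1 at each run start, -1 one past each run end
    edges = np.flatnonzero(np.diff(np.r_[0, flag.astype(int), 0]))
    for start, stop in zip(edges[::2], edges[1::2]):
        if stop - start < min_len:
            out[start:stop] = False
    return out

mean_vel = np.array([-0.3] * 5 + [0.0] * 5 + [-0.3] * 2)
std_vel = np.full(12, 0.1)
flags = flag_anomalous(mean_vel, std_vel)   # the short trailing run is dropped
```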
<p><strong>Doing that separates the two groups!</strong> Below we plot both metrics, either
total anomalous time or number of anomalous periods. By either approach we
see corn separate from the no plant control (NPC).</p>
<p><img src="mom2-phenotypes.png" alt="" /></p>
<p>At this point, this is all a bit ad hoc. For instance, what are the optimal
cutoffs for flagging a detection? We want to do some actual statistics to show
that we can optimize this process, which is what we tackle next.</p>
<h3 id="using-ai-to-flag-roots">Using AI to flag roots</h3>
<p>The first element of the algorithm above is to identify periods when the
velocity of the second component is excessively low. Mathematically speaking,
this is pretty straightforward; it is</p>
\[f(x) = \begin{cases} 1, \; \text{ if } x < \tau \\ 0, \; \text{ else }.
\end{cases}\]
<p>We can approximate this function via a sigmoid. In particular, if we let
\(\sigma(x) = 1 / (1 + e^{-x})\), and \(g(x; \kappa) = \sigma(- \kappa (x -
\tau))\) then</p>
\[g(x; \kappa) \rightarrow f(x) \text{ as } \kappa \rightarrow \infty.\]
<p>In other words \(g\) is a soft threshold approximation of \(f\). In the
language of neural networks, this is a linear layer followed by a sigmoid
activation function. Thus, it should perhaps not be surprising that we can use
this model to separate the two groups.</p>
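The convergence of the soft threshold to the hard indicator is easy to check numerically (the cutoff \(\tau = -0.25\) here is just illustrative):

```python
import numpy as np

def g(x, tau=-0.25, kappa=10.0):
    """Soft threshold sigma(-kappa * (x - tau)): a smooth approximation of
    the indicator f(x) = 1 if x < tau else 0."""
    z = np.clip(-kappa * (x - tau), -50.0, 50.0)   # clip to avoid overflow
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([-1.0, -0.5, -0.26, -0.24, 0.0, 1.0])
hard = (x < -0.25).astype(float)   # the hard indicator f
soft = g(x, kappa=1e4)             # large kappa: nearly identical to f
```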
<p>To be completely clear, we use a slightly more complicated model, but doing
that, we can replicate the work above. In the language of PyTorch, our model
(M1) is the following</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>nn.Sequential(
SoftThreshold(),
RemoveSimultaneous(),
Integrate(sum_dim = (1, 2, 3)),
nn.Linear(1,1),
nn.Flatten(start_dim=0)
)
</code></pre></div></div>
<p>where <code class="language-plaintext highlighter-rouge">SoftThreshold</code> is akin to \(g\) above, <code class="language-plaintext highlighter-rouge">RemoveSimultaneous</code> tries to
remove periods of high activation across multiple paddles (effectively (3) from
above), and <code class="language-plaintext highlighter-rouge">Integrate</code> (approximately) integrates the time above the threshold.</p>
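The custom layers are not spelled out in this post; as a rough sketch (not the original implementation), <code class="language-plaintext highlighter-rouge">SoftThreshold</code> might look something like this in PyTorch, with a learnable cutoff \(\tau\) and an assumed fixed sharpness \(\kappa\):

```python
import torch
import torch.nn as nn

class SoftThreshold(nn.Module):
    """Sketch of a soft-threshold layer: sigmoid(-kappa * (x - tau)).
    Parameter names and defaults are assumptions for illustration."""
    def __init__(self, tau=-0.25, kappa=10.0):
        super().__init__()
        self.tau = nn.Parameter(torch.tensor(float(tau)))  # learnable cutoff
        self.kappa = kappa                                 # fixed sharpness

    def forward(self, x):
        return torch.sigmoid(-self.kappa * (x - self.tau))

layer = SoftThreshold()
y = layer(torch.tensor([[-1.0, 0.0, 1.0]]))   # near 1 below tau, near 0 above
```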
<p>A slight modification (M2) of this is:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>nn.Sequential(
SoftThreshold(),
RemoveSimultaneous(),
Integrate(sum_dim = (1, 2, 3)),
nn.Flatten(start_dim=1),
nn.BatchNorm1d(1, momentum = 0.0),
nn.Flatten(start_dim=0)
)
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">BatchNorm1d</code> layer is equivalent to the linear layer, but resolves a
technical issue that occurs when using the linear layer. Our batch consists of
the whole data set.</p>
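The equivalence is easy to verify: with <code class="language-plaintext highlighter-rouge">momentum = 0.0</code> and a full-data batch, <code class="language-plaintext highlighter-rouge">BatchNorm1d</code> applies the affine map \(y = \gamma (x - \mu)/\sqrt{\sigma^2 + \epsilon} + \beta\), which can be rewritten as a <code class="language-plaintext highlighter-rouge">Linear(1,1)</code> layer. A quick numpy check (numbers are arbitrary):

```python
import numpy as np

x = np.random.default_rng(0).normal(5.0, 3.0, size=46)
gamma, beta, eps = 1.0, 0.0, 1e-5   # BatchNorm's scale, shift, and epsilon
mean, var = x.mean(), x.var()

# BatchNorm in train mode on a full-data batch
y_bn = gamma * (x - mean) / np.sqrt(var + eps) + beta

# The same map written as a Linear(1,1) layer: y = w * x + b
w = gamma / np.sqrt(var + eps)
b = beta - w * mean
y_lin = w * x + b
```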
<p>Instantiating the model M1 using values similar to those used in (1) above, we
see that we can nearly separate the two groups.</p>
<p><img src="ground-truth-vs-anomalous-time.png" alt="" /></p>
<p>If we extract the activations from the very first layer, which is effectively
the indicator being above the threshold, and filter as in steps (2), (3), and
(4) we can completely separate the two groups and produce a picture very similar
to the one above.</p>
<center>M1 separation of two groups after filtering</center>
<p><img src="m1-anomalous-time.png" alt="" /></p>
<p>Lastly, using (M2), we can actually learn the parameters of interest. Below is
the ground truth vs. log-odds plot after having optimized the parameters.</p>
<p><img src="ground-truth-vs-log-odds-m2.png" alt="" /></p>
<p>We have almost separated the two groups without even doing steps (2), (3), and
(4). Comparing the cutoff value for the mean velocity using our ad hoc method
compared to what we can learn via optimization, we find that the optimal
parameter is slightly lower.</p>
<table>
<thead>
<tr>
<th>ad hoc threshold</th>
<th>learned threshold</th>
</tr>
</thead>
<tbody>
<tr>
<td>-0.25</td>
<td>-0.29</td>
</tr>
</tbody>
</table>
<p>Interestingly, though the ad hoc approach uses a log variance between -2.0 and
-1.0, after optimizing it lands between -3.27 and -2.93, considerably lower and
quite a narrow range (0.20 to 0.23 in terms of standard deviation).</p>
<p>(Note, we have not used cross-validation here because we are fitting a 5
parameter model using 46 cases.)</p>
<h1 id="conclusion">Conclusion</h1>
<p>It took some work, but we have shown that it is possible to create a model that
learns its parameters solely from a binary response (yes plant / no plant) to
recover root detection information. This proof-of-concept shows that one can
extend this approach to construct larger models, e.g. deep neural networks, to
learn more complex patterns that indicate the presence of a root or not.</p>
<p>If you are curious, the analysis above employs the following notebooks and a
custom Python package (which is not yet published):</p>
<ul>
<li><a href="analysis-01-dynamics.html">analysis-01-dynamics</a></li>
<li><a href="inference-01-dynamics-01-univariate.html">inference-01-dynamics-01-univariate</a></li>
<li><a href="inference-01-dynamics-02-bivariate-A.html">inference-01-dynamics-02-bivariate-A</a></li>
<li><a href="inference-01-dynamics-02-bivariate-B.html">inference-01-dynamics-02-bivariate-B</a></li>
</ul>

<h1>Root system architecture impacts row crop emissions</h1>
<p>2023-09-12</p>

<p>At Hi Fidelity Technologies, our root phenotyping device (called
RootTracker) was used to interrogate many scientific questions, but
there was one question that intrigued us most of all: <strong>do differences
in root system architecture lead to differences in greenhouse gas
emissions?</strong></p>
<p>Why is this plausible in the first place? Let us consider maize.
Maize requires a large nitrogen application, often ammonium nitrate.
While it is possible to target the application of fertilizer to a
specific location, usually it is just sprayed onto the soil surface.
Some of that nitrogen will be absorbed by the plant, but a large
portion of it will not. Ultimately, the unabsorbed fertilizer runs
into water or decomposes, giving off nitrous oxide (N2O) in the
process.</p>
<p>Nitrous oxide is a major greenhouse gas. It is the third largest
contributor after carbon dioxide (CO2) and methane, and is nearly 300
times more potent than CO2 (<a href="https://www.ipcc.ch/report/ar4/wg1/changes-in-atmospheric-constituents-and-radiative-forcing/">IPCC</a> and <a href="https://www.science.org/doi/10.1126/science.1179571">Wuebbles</a>). Row crop
agriculture is by far the largest source of anthropogenic N2O
emissions (<a href="https://www.nature.com/articles/nclimate1458">Reay et al.</a> and the <a href="https://www.epa.gov/ghgemissions/inventory-us-greenhouse-gas-emissions-and-sinks-1990-2019">EPA</a>).</p>
<p>If one could identify cultivars with root system architecture that did
a better job acquiring fertilizer, then one could reduce N2O emissions.
The plants would be better and so would the environment.</p>
<p>And it is not only maize that is a major N2O emitter. Soybean
production results in N2O emissions as well. Together, soybean and
corn make up an enormous amount of US agriculture — over 75% of
acres planted (<a href="https://usda.library.cornell.edu/concern/publications/j098zb09z?locale=en">USDA</a>). Clearly, finding cultivars with lower
emissions profiles could reduce N2O emissions in a major way.</p>
<h1 id="experiment">Experiment</h1>
<p>The experiment took place in the summer of 2021 and 2022 near Ames,
IA. I will only go into the results for 2021, since that year of work
was completely public- and self-funded and made use of proprietary
germplasm, but the results for 2022 aligned with the 2021 results.</p>
<p>We grew maize at one location using two proprietary hybrids (i.e. two
types of maize seed) — HFG1071 and HFG1111. There were four blocks,
one plot per variety within each block. These four blocks tried to
capture variance across both previous crop and soil type. Each
plot had 5 rows. 25 equally spaced RootTracker devices were placed on
plants within the plot. I will use the terms rep and block
interchangeably here.</p>
<p><img src="isu-n2o-field.jpg" alt="" /></p>
<p>In the present post, I won’t go into the details of how RootTracker
measures root growth, which can be found
<a href="../../../2023/07/26/rootmodel.html">here</a>. The key thing is that
RootTracker measures the amount of root growth over time in terms of
the number of root touches per day (within a certain depth band).</p>
<p>Measuring N2O is tricky. Traditionally, and the way we did it in this
experiment, one puts a “collar” in each plot, which is just a wide,
short tube, like a bit of 12” PVC. To measure the N2O emissions, one
attaches a lid to the tube for a while, and then pulls out some gas,
which is sent to the lab for an assay. (In our 2022 experiment, we
were actually able to use a laser to take instantaneous in-field
measurements.) Part of the challenge is that N2O emissions are known
to be sporadic with certain events like rainfall leading to large
belches of N2O. Further, emissions outside of the growing season may
matter as well. In 2021, we were able to take roughly weekly
measurements in June and July.</p>
<h1 id="results">Results</h1>
<p>As mentioned previously, N2O emissions can be event driven. For
instance, rainfall leads to big belches of N2O. Conversely, when it
is dry for extended periods, emissions are likely minimal.
We saw this in 2021, when July was a very dry month and we recorded
minimal emissions. For that reason, I will focus on June when there
was enough rainfall to produce non-trivial emissions data.</p>
<p>Below you will find the June, 2021 emissions vs. root growth. The
“rate” of root growth is the number of detections divided by amount of
device uptime within a period of interest (in days). We normalize by
time because the devices can have slightly different uptimes —
e.g. a battery might need to be replaced, so a device does not collect
data for several hours. For each rep and variety, I have computed a
trimmed mean (10% symmetric trim) of rates over all devices in a plot.</p>
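The rate and trimmed-mean computation is straightforward; here is a sketch with made-up device numbers (the 10% symmetric trim discards the lowest and highest 10% of devices in a plot):

```python
import numpy as np

def trimmed_mean(x, trim=0.10):
    """Symmetric trimmed mean: drop the lowest and highest `trim` fraction."""
    x = np.sort(np.asarray(x, dtype=float))
    k = int(np.floor(trim * len(x)))
    return x[k:len(x) - k].mean() if k > 0 else x.mean()

# Detection rate per device: detections / uptime (days); hypothetical numbers,
# with one outlying device (120 detections)
detections = np.array([30, 42, 38, 55, 41, 120, 36, 44, 40, 39])
uptime_days = np.array([28, 30, 29, 30, 30, 30, 27, 30, 30, 30])
rates = detections / uptime_days

plot_rate = trimmed_mean(rates, 0.10)   # robust to the outlying device
```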
<p><img src="isu-n2o-results.jpg" alt="" /></p>
<p>Two things immediately jump out:</p>
<ol>
<li>On an absolute basis, there is an inverse relationship between the
amount of emissions and amount of roots. In other words, more
roots correlates with less emissions, regardless of variety or
block.</li>
<li>For 3/4 of the blocks, HFG1071 has more root growth and less
emissions, so we also see this pattern on a relative basis. And
for the one block in which this does not occur, both hybrids have
relatively low root growth.</li>
</ol>
<p>We can quantify these differences. Regarding the first point, after
running a linear regression we find that for every additional root
grown per day the monthly emissions are reduced by 0.083 kg/ha
(p=0.18). If we remove the outlying point, then this becomes 0.067
kg/ha (p=0.049) — a smaller effect size, but with less uncertainty.</p>
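A sketch of the slope estimate behind that first point, using ordinary least squares on hypothetical plot-level data (the numbers below are illustrative, not the real measurements):

```python
import numpy as np

# Hypothetical data: trimmed-mean root rate (roots/day) vs June N2O
# emissions (kg/ha) for 8 plots (4 blocks x 2 hybrids)
rate = np.array([1.2, 1.5, 1.8, 2.1, 1.0, 1.4, 1.9, 2.3])
n2o = np.array([0.30, 0.27, 0.24, 0.20, 0.33, 0.29, 0.22, 0.18])

# OLS fit: emissions = a + b * rate
X = np.column_stack([np.ones_like(rate), rate])
coef, *_ = np.linalg.lstsq(X, n2o, rcond=None)
a, b = coef

resid = n2o - X @ coef
s2 = resid @ resid / (len(rate) - 2)                # residual variance
se_b = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])   # std. error of the slope
t = b / se_b                                        # t statistic for H0: b = 0
```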
<p>Regarding the second point, we want to understand if varietal
differences in root growth could be used to drive differences in
emissions. Running a mixed model where variety is a fixed effect and
the rep is a random effect, we find that HFG1111 has 25% less root
growth compared to HFG1071 (0.41 roots/day less than HFG1071’s 1.68
roots/day, p=1e-5). In other words, the two hybrid
varieties do seem to have noticeably different patterns of root growth
in this experiment.</p>
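A sketch of that second analysis using statsmodels’ <code class="language-plaintext highlighter-rouge">MixedLM</code>, on synthetic data shaped like the experiment (4 reps, 2 varieties, several devices per plot; all numbers are made up for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: each rep (block) gets a random shift; HFG1111 is assigned
# a lower baseline rate than HFG1071, mimicking the reported effect
rng = np.random.default_rng(3)
rows = []
for rep in range(4):
    rep_effect = rng.normal(scale=0.2)
    for variety, base in [("HFG1071", 1.68), ("HFG1111", 1.27)]:
        for _ in range(10):
            rows.append({"rep": rep, "variety": variety,
                         "rate": base + rep_effect + rng.normal(scale=0.3)})
df = pd.DataFrame(rows)

# Variety as a fixed effect, rep as a random intercept
model = smf.mixedlm("rate ~ variety", df, groups=df["rep"]).fit()
print(model.params)
```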
<p>Keep in mind that this is one month in one location in one year. We
should be careful to draw conclusions from such a limited experiment.
As with yield, interactions with the environment can have a big impact
on results, and we have seen cultivar by environment interactions in
other experiments. But the early evidence does lend itself to the
idea that root system architecture impacts emissions and that this
might be exploited to choose cultivars with lower greenhouse gas costs.
(Sadly, we had to shut our doors before we could adequately test this
hypothesis.)</p>
<h1 id="conclusion">Conclusion</h1>
<p>This is just one example of an analysis I would conduct at Hi Fidelity
Technologies. We ran many experiments both for ourselves and for
clients. Other questions we addressed include:</p>
<ul>
<li>How do root systems respond to drought, and how does that vary with
genetics?</li>
<li>Do biostimulant seed treatments impact root growth, and is there an
interaction with the environment?</li>
<li>How does herbicide tolerance manifest itself in root growth?</li>
</ul>
<p>Each experiment we ran demanded its own unique analysis, but there are
certain tools that are frequently used. The linear model and variants
are ubiquitous: mixed models if there are random effects, quantile
regression if there are outliers or potentially an unusual
distribution of error terms, or spatial models when trying to capture
variations due to soil or other slowly varying environmental factors.</p>
<p>While I am not a partisan for any particular statistical philosophy,
one of the nice things about the Bayesian perspective is that it
easily encompasses all of these methods in one framework. And one of
the major advantages of working with a linear model is that it
provides interpretable results. The most important thing at the end
of the day is being able to capture a result in a narrative that is
easy to understand.</p>

<h1>Data engineering for RootTracker</h1>
<p>2023-08-09</p>

<p>One of the nice things about working at a young startup is that you
get to wear multiple hats. For instance, at Hi Fidelity Genetics /
Technologies, my title was data scientist, but there were times where
I got to be a data engineer as well, which was a lot of fun! Here I
will describe our data stack and offer a few lessons learned along the
way.</p>
<h1 id="background">Background</h1>
<p>Hi Fidelity Genetics developed a root phenotyping device called
RootTracker. The device has electrodes arrayed around a root crown in
a cylindrical fashion. As seen in the image below, there were 12
vertical, printed circuit board (PCB) “paddles”, each of which had 22
electrodes. The device measured voltage at these electrodes for three
different charging parameters at regular intervals. These voltages
were ultimately transformed into root detections — that is the
device detected a root at a certain time in a certain place. (You can
read more about the set up in our paper <a href="https://academic.oup.com/plphys/article/187/3/1117/6328791">“Capturing in-field root
system dynamics with
RootTracker”</a>.)</p>
<p><img src="../../07/images/roottracker.jpg" alt="" /></p>
<p>The devices used a radio to send data to a base station. This data was
aggregated across several devices, which was then uploaded as a single
file to Amazon Web Services (AWS). There are probably several ways
we could have improved our data management at the point of the base
station, but we will take the single file uploads from the base
station as our starting point for describing our data stack. The
whole thing evolved in a rather organic way and what I describe below
mostly describes where we ended, as opposed to where we started. I
will also make some modifications to avoid delving into historical or
technical details.</p>
<h1 id="schema">Schema</h1>
<p>Instead of giving a technical description of the schema, let me
describe the objects that were modeled, since that provides a better
overview. As mentioned previously, we had <code class="language-plaintext highlighter-rouge">Device</code>s that measured
voltages. Each paddle on a device sent a <code class="language-plaintext highlighter-rouge">Measurement</code> that reported
the device and paddle that the measurements came from, the charging
parameters, and 22 voltages, one for each sensor on a paddle. A
<code class="language-plaintext highlighter-rouge">Trial</code> refers to a specific experiment. Each <code class="language-plaintext highlighter-rouge">Trial</code> was associated
with a set of device <code class="language-plaintext highlighter-rouge">Deployment</code>s, effectively linking a <code class="language-plaintext highlighter-rouge">Device</code>
with a <code class="language-plaintext highlighter-rouge">Trial</code> and recording some additional information in the
process, like the location of the <code class="language-plaintext highlighter-rouge">Device</code>. Each <code class="language-plaintext highlighter-rouge">Device</code> possessed
firmware, which controlled how the device operated along with
statically set network information. Because that firmware could
change, there were separate <code class="language-plaintext highlighter-rouge">Software</code> records that tracked the
firmware version and the configuration of the <code class="language-plaintext highlighter-rouge">Device</code>.</p>
<p>In agricultural experiments, it is standard to have a “field book”
that tracks the <code class="language-plaintext highlighter-rouge">Treatment</code>s used across the field. A <code class="language-plaintext highlighter-rouge">Treatment</code> was
always a categorical (potentially ordered) variable.</p>
<p>An algorithm processed the raw data to produce detections. (The story
of the detection algorithm is quite interesting in and of itself.)
The first version of our <code class="language-plaintext highlighter-rouge">Detection</code>s recorded the time, paddle, and
electrode where it occurred. Later, we realized that we also needed
to include quality control metrics to identify periods when the
detection algorithm might be unreliable. That led to a more
generic notion of <code class="language-plaintext highlighter-rouge">Feature</code>s, which were time-indexed
multidimensional arrays derived from the raw data.</p>
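A minimal sketch of how a few of these objects might be declared with SQLAlchemy (the column names here are guesses for illustration; the real schema surely differed):

```python
from sqlalchemy import (Column, Float, ForeignKey, Integer, String,
                        create_engine)
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class Device(Base):
    __tablename__ = "device"
    id = Column(Integer, primary_key=True)
    barcode = Column(String, unique=True, nullable=False)

class Trial(Base):
    __tablename__ = "trial"
    id = Column(Integer, primary_key=True)
    name = Column(String, nullable=False)

class Deployment(Base):
    """Links a Device to a Trial and records extra info like location."""
    __tablename__ = "deployment"
    id = Column(Integer, primary_key=True)
    device_id = Column(Integer, ForeignKey("device.id"), nullable=False)
    trial_id = Column(Integer, ForeignKey("trial.id"), nullable=False)
    latitude = Column(Float)
    longitude = Column(Float)
    device = relationship(Device)
    trial = relationship(Trial)

engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)
```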
<h1 id="api">API</h1>
<p>Like our discussion of the schema above, we describe how to interact
with objects from the schema — the API — at a high level.</p>
<p>In some cases, the API was very transparent. For instance, one could
get <code class="language-plaintext highlighter-rouge">Trial</code>s, <code class="language-plaintext highlighter-rouge">Device</code>s, and <code class="language-plaintext highlighter-rouge">Deployment</code>s using a variety of filters.
Inserting those objects was (mostly) just a matter of populating their
fields.</p>
<p>In other cases, the API was doing a lot under the hood. For instance,
our field scientist designed the experiments and kept track of additional
experiment data using a traditional field book, which we can think of
as a spreadsheet. Each row of the field book was one experimental
unit (a plant, in the case of a RootTracker trial). The
columns of the field book were effectively covariates — treatments
or observations, like the variety, growth stage, location, or device
barcode of a plant. One could send a row from a field book to the API
to update or insert the treatments. However, the treatments were
recorded in a key-value fashion, so in actuality the API would be
updating or inserting several records in one go. Conversely, when
exporting a field book, one would have to gather many records to
recapitulate a single row of the field book.</p>
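The key-value expansion might look something like this (a sketch only; the column names, such as <code class="language-plaintext highlighter-rouge">barcode</code>, are assumptions):

```python
# Columns that identify the experimental unit rather than treatments
RESERVED = {"row", "range", "barcode"}

def expand_field_book_row(row: dict) -> list:
    """Turn one field-book row into several key-value treatment records."""
    unit = {k: row[k] for k in RESERVED if k in row}
    return [{"unit": unit, "name": key, "value": value}
            for key, value in row.items() if key not in RESERVED]

records = expand_field_book_row(
    {"row": 3, "range": 7, "barcode": "RT-0042",
     "variety": "HFG1071", "growth_stage": "V4"})
```

Exporting goes the other way: gather all treatment records for a unit and pivot them back into one spreadsheet row.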
<p>One could get <code class="language-plaintext highlighter-rouge">Measurement</code>s transparently by the device barcode,
time, and charging parameters, but we had a bespoke procedure for
inserting <code class="language-plaintext highlighter-rouge">Measurement</code>s since they came from a tarballed file of
aggregated measurements. In that case it made more sense to process a
whole file together, instead of each measurement within a file
individually.</p>
<h1 id="implementation">Implementation</h1>
<p>Based on our team’s experience, we used SQL for our database
(<a href="https://www.postgresql.org/">PostgreSQL</a> and
<a href="https://www.sqlalchemy.org/">SQLAlchemy</a>, specifically). Eventually,
we also developed an alternative approach for storing the
<code class="language-plaintext highlighter-rouge">Measurement</code>s and <code class="language-plaintext highlighter-rouge">Feature</code>s — using <a href="https://iceberg.apache.org/">Apache
Iceberg</a> /
<a href="https://tabular.io/">Tabular</a>.</p>
<p>We employed the <a href="https://www.openapis.org/">OpenAPI</a> framework to
specify our API. We found the free <a href="https://editor.swagger.io/">Swagger
editor</a> to be sufficient for our purposes.
We used <a href="https://flask.palletsprojects.com/">Flask</a> to implement the
API. While our system was only used internally, we did ultimately
develop the capability to control access to the API using Keycloak.</p>
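<p>For flavor, here is a small, hypothetical fragment of an OpenAPI specification for one such endpoint; the path and parameter names are illustrative, not our actual spec:</p>

```yaml
# Illustrative OpenAPI 3 fragment, in the spirit of the spec described above
openapi: "3.0.3"
info:
  title: RootTracker data API
  version: "1.0"
paths:
  /deployments:
    get:
      summary: List deployments, optionally filtered
      parameters:
        - name: trial_id
          in: query
          schema:
            type: string
        - name: device_barcode
          in: query
          schema:
            type: string
      responses:
        "200":
          description: Matching deployments
```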
<p>An aside… obviously, cost is an important factor when deciding how
to put together a system. On AWS, it is more expensive to run an SQL
database than it is to retrieve data from S3 storage. Why is that and
what is the difference?</p>
<p>When you run an Instance on AWS, it is like turning on a computer.
You have a CPU, RAM, and a hard drive. The hard drive is the key for
our current discussion. A hard drive has a file system. You can
combine several hard drives using RAID to create a larger file system
with insurance against failure; however, there is a limit to how big
you can make the file system. That becomes problematic if you are
doing something like crawling the web, as the
data collected is so copious.</p>
<p>To overcome that problem, a new approach was developed with Apache
Hadoop being the canonical software. Hadoop uses commodity hardware
to scale a file system to arbitrary size. To do that the Hadoop file
system gives up some of the POSIX requirements that are found in a
typical file system. Hadoop also allows one to compute statistics and
similar computations for a dataset on this file system, which is
nontrivial when you consider that it is a highly decentralized system.</p>
<p>In terms of how all this relates to options on AWS, an AWS
Instance or a persistent SQL database uses the old school hard drive,
whereas AWS S3 is effectively Hadoop (or a successor of Hadoop).</p>
<p>Naturally, one still wants to organize data stored on something like
S3. The original solution to that was Apache Hive. Data is stored as
flat files, which possess some sort of schema. Those files can be
indexed on a value that is constant within the file or on metadata
about fields of the file. Data can be retrieved using an SQL-like
language. Apache Iceberg is a successor to Hive that aims to solve
some of its shortcomings. The bottom line is that one can store,
retrieve, and compute statistics for large amounts of structured data
using a more cost effective form of storage, which can also obviate
any worries about how large that data may grow.</p>
<h1 id="conclusion">Conclusion</h1>
<p>Looking back on this, the idea of design keeps wandering through my
mind.</p>
<p>If we think about architecture, like architecture of buildings, an
architect needs to know about the limitations of the materials used in
the design, but does not need to know how to do carpentry, masonry,
etc. The architect provides the blueprint and then the subcontractors
do the work. In some sense though, the building is done when the
blueprint is done.</p>
<p>Within the context of a system, you could make a similar split. The
blueprint is the database schema and API documentation. The
subcontractors are the people and software used to implement the API
and data storage. I am sure there are cases where the requirements of
the system are so clear that one can write the “blueprint” down ahead
of time. But in our case, when one is creating something totally new
and trying to do it quickly, it becomes necessary to draw the
blueprint and build the house simultaneously. It is not surprising
that when doing that there is a cost, which is refactoring the
codebase or even rewriting portions of it.</p>Jesse WindleOne of the nice things about working at a young start up is that you get to wear multiple hats. For instance, at Hi Fidelity Genetics / Technologies, my title was data scientist, but there were times where I got to be a data engineer as well, which was a lot of fun! Here I will describe our data stack and offer a few lessons learned along the way.Inferring root growth using RootTracker2023-07-26T00:00:00+00:002023-07-26T00:00:00+00:00/2023/07/26/rootmodel<script type="text/javascript" id="MathJax-script" async="" src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js">
</script>
<p>At the startup I used to work for, Hi Fidelity Genetics, I helped
invent a <a href="https://patents.google.com/patent/US11293910B2/">patented</a>
technology to measure root growth using impedance sensing called
RootTracker. If we conceptualize a root as a random walk, it was like
our device could observe one point along this trajectory. Given only
that information, we wanted to describe differences in root structure
between varieties and recapitulate root growth over time.</p>
<p>This post is based on the paper, <a href="../root-modeling.pdf">“Inferring monocotyledon crown root
trajectories from limited data.”</a></p>
<h1 id="background">Background</h1>
<h2 id="roottracker">RootTracker</h2>
<p>Below is a picture of our circular RootTracker. It has a cylindrical
symmetry. The green, vertical “paddles” are printed circuit board.
Each paddle has 22 gold plated electrodes that act as sensors running
down the side. You place a seed in the center of a device and let it
grow. We measure voltages at the sensors and convert them to
detections using an algorithm. We describe this and early data in
more detail in our paper <a href="https://academic.oup.com/plphys/article/187/3/1117/6328791">“Capturing in-field root system dynamics
with
RootTracker”</a>.</p>
<p><img src="../images/roottracker.jpg" alt="" /></p>
<p>Using RootTracker detection data, we could quantify differences
between varieties and over time. However, it does not provide an
actual model-based representation of root growth. While there are
several quite complex models out there to simulate root growth, like
<a href="https://rootmodels.gitlab.io/">OpenSimRoot</a>, they are not applicable
to our data, which is rather limited. These simulation-based models
seek to recapitulate root growth in the most realistic way possible.
Parameters of those models can be tweaked to learn how they impact
hypothetical root growth, but they cannot be easily used to make
inferences. In contrast, our attempt at modeling started with the
goal of making inferences and from there we tried to build a model
that could recapitulate root growth.</p>
<h2 id="monocot-root-growth">Monocot root growth</h2>
<p>In the picture above we have drawn a corn plant inside of the device.
The system works with both monocots (like corn, wheat, and rice) and
dicots (like soybean and cotton). For the purposes of modeling
though, we will restrict our attention to monocots. In the black and
white image below, we see a monocot (young wheat) on the left and a
dicot (young lupin) on the right. Dicots have a tap root off of which
roots grow. In contrast, monocots have crown roots that emerge at the
stem of the plant near the soil surface, effectively giving
us an origin for root emergence, which is useful for modeling. In the
color image, we see a cartoon of monocot root growth. RootTracker can
detect both crown roots and lateral roots. For the sake of
simplicity, we just assume that all roots detected are crown roots.
The images below are reproduced from <a href="http://plantsinaction.science.uq.edu.au">Plants in
Action</a>, published by the
Australian Society of Plant Scientists.</p>
<p><img src="../images/monocot-dicot.png" alt="" /> <img src="../images/monocot-root-anatomy-small.png" alt="" /></p>
<h1 id="methods">Methods</h1>
<p>Data we use here came from an experiment at Alamance Community College
in Alamance, NC in early 2022. The aim of the experiment was to
compare root growth of maize, wheat, soybean, cotton, and tomato.
Here we will just focus on maize and wheat. (We use the terms maize
and corn interchangeably.) Five gallon pots were filled with soil. A
RootTracker device was placed into the soil and then a seed was placed
in the soil at the center of the RootTracker. Each treatment had 24
RootTrackers.</p>
<h1 id="data-exploration">Data Exploration</h1>
<p>Using just our detection data and no modeling, we can examine how root
growth changes over time. Below we plot maize root growth over time,
after having been smoothed using a Gaussian process. The image on the
left is fit using cross validation, while the image on the right uses
hyperparameters that impose more smoothing. In both images there
seems to be more root growth at the middle depths early on and then
root growth concentrates at greater depths as time goes on. Further,
we see two major periods of root growth, the first around days 10-14 and
the second around days 24-28. Having a periodic pattern to root growth
makes sense for monocots, since several crown roots emerge
simultaneously in what are called whorls.</p>
<p><img src="../images/gaussian-process-plots-1.png" alt="" /></p>
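<p>For intuition, the kind of smoothing used in these plots can be sketched with a generic RBF-kernel Gaussian process smoother in NumPy. This is an illustrative sketch with made-up data and hyperparameters, not the code or settings behind the figures:</p>

```python
import numpy as np

def rbf_kernel(a, b, length_scale=2.0, sigma=1.0):
    """Squared-exponential kernel between two sets of 1-d inputs."""
    d = a[:, None] - b[None, :]
    return sigma**2 * np.exp(-0.5 * (d / length_scale) ** 2)

def gp_smooth(t, y, t_star, noise=0.5, length_scale=2.0):
    """Posterior mean of a GP regression, i.e. a smoothed version of y."""
    K = rbf_kernel(t, t, length_scale) + noise**2 * np.eye(len(t))
    K_star = rbf_kernel(t_star, t, length_scale)
    return K_star @ np.linalg.solve(K, y)

# Toy daily detection intensities over 30 days with two bursts of growth
t = np.arange(30.0)
y = np.exp(-0.5 * ((t - 12) / 2) ** 2) + np.exp(-0.5 * ((t - 26) / 2) ** 2)
smooth = gp_smooth(t, y, t, length_scale=4.0)  # larger scale, more smoothing
```

<p>Increasing the length scale plays the role of the stronger smoothing hyperparameters in the right-hand image.</p>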
<h1 id="model">Model</h1>
<p>Our model is motivated by gravitropism. A root grows in a given
direction for a period and then changes direction, presumably a
direction that is a little steeper than before. Below we have a
picture of that process. We considered two different approaches to
modeling. First, we considered modeling changes in the slope, \(m_i,
i = 1, 2, 3\). If the changes in slope are generally downward, then
we get a root that moves generally downward. Second, we considered
modeling changes in the angle \(\theta_i, i = 1, 2, 3\). If the
changes in angle are generally downward, then we get a root that moves
generally downward. (The choice of three pieces here is arbitrary.)</p>
<p><img src="../images/root-trajectory-explained.png" alt="" /></p>
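<p>A minimal simulation of the slope-change idea looks something like the following; the segment length and slope increments are illustrative values, not our fitted parameters:</p>

```python
import math
import random

def simulate_root(n_segments=3, seg_length=5.0, m0=0.2, seed=0):
    """Simulate a 2-d root trajectory as connected line segments whose
    slope (depth change per unit horizontal distance) tends to steepen,
    mimicking gravitropism. Returns the (x, depth) breakpoints."""
    rng = random.Random(seed)
    x, depth, m = 0.0, 0.0, m0
    points = [(x, depth)]
    for _ in range(n_segments):
        dx = seg_length / math.sqrt(1 + m * m)  # keep segment length fixed
        x += dx
        depth += m * dx
        points.append((x, depth))
        m += abs(rng.gauss(0.5, 0.2))  # slope changes are generally downward
    return points

traj = simulate_root()
```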
<p>We modeled the time of emergence separately from the depth. In
particular, we modeled the counts using a binomial model where the
time-varying probability of root emergence was modeled on the log-odds
scale using a Gaussian process with a periodic kernel. In the left plot
below we show the expected number of roots to emerge each day. You can see
that we capture the two periods of heightened root growth between days
10-14 and 24-28. The plot on the right shows the posterior for the
period parameter, which is roughly two weeks.</p>
<p><img src="../images/p_binom_mean_and_per_hist-corn.png" alt="" /></p>
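<p>The emergence part of the model can be sketched as a forward simulation: a latent function with a periodic kernel on the log-odds scale drives daily binomial counts. The hyperparameters and trial count below are made up for illustration:</p>

```python
import numpy as np

def periodic_kernel(t, period=14.0, length_scale=1.0, sigma=1.0):
    """Exponentiated-sine-squared kernel, which repeats every `period` days."""
    d = np.abs(t[:, None] - t[None, :])
    return sigma**2 * np.exp(
        -2.0 * np.sin(np.pi * d / period) ** 2 / length_scale**2
    )

rng = np.random.default_rng(0)
days = np.arange(30.0)
K = periodic_kernel(days) + 1e-8 * np.eye(len(days))  # jitter for stability
f = rng.multivariate_normal(np.full(len(days), -3.0), K)  # log-odds per day
p = 1.0 / (1.0 + np.exp(-f))  # emergence probability per day
n_trials = 22 * 12  # say, one trial per electrode on a device
counts = rng.binomial(n=n_trials, p=p)
expected = n_trials * p  # expected roots emerging each day
```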
<p>We also introduce time varying parameters when modeling the depth
distribution. We tried several alternatives for modeling the changing
slopes or angles. Below we show the results when modeling the changing
slopes using a skew normal distribution. You can see that both the
mean and shape of these distributions change over time as the root
growth goes from shallower to deeper.</p>
<p><img src="../images/m23-p-sn_all-corn-small.png" alt="" /></p>
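<p>For reference, the skew normal density has a simple closed form, \(\frac{2}{\omega}\phi(z)\Phi(\alpha z)\) with \(z = (x - \xi)/\omega\), sketched here with illustrative parameter values:</p>

```python
import math

def skew_normal_pdf(x, loc=0.0, scale=1.0, shape=0.0):
    """Skew normal density: (2/scale) * phi(z) * Phi(shape * z).
    shape = 0 recovers the normal; shape < 0 skews left, shape > 0 right."""
    z = (x - loc) / scale
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    Phi = 0.5 * (1.0 + math.erf(shape * z / math.sqrt(2.0)))
    return 2.0 / scale * phi * Phi

# The density integrates to one for any shape; check numerically.
xs = [-8.0 + 16.0 * i / 2000 for i in range(2001)]
area = sum(skew_normal_pdf(x, shape=-2.0) for x in xs) * (16.0 / 2000)
```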
<h1 id="recapitulating-root-growth">Recapitulating root growth</h1>
<p>We can bring all of this statistical modeling together to recapitulate
root growth. Below, we show a video of “canonical” root systems from
this experiment for corn (top) and wheat (bottom).</p>
<iframe src="https://drive.google.com/file/d/1tYYcOHJAJUyKGpEtj37sf5UaYZBsaGEI/preview" width="700" height="400" allow="autoplay"></iframe>Jesse WindleEasy estimators using automatic differentiation2023-06-09T00:00:00+00:002023-06-09T00:00:00+00:00/2023/06/09/gmm-fun<script type="text/javascript" id="MathJax-script" async="" src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js">
</script>
<p>In the <code class="language-plaintext highlighter-rouge">gmmfun</code> package we explore the generalized method of moments (GMM) using automatic differentiation (AD). GMM is a likelihood-free way of estimating population parameters. AD is kind of like symbolic differentiation; it is software that can create the gradient of an expression and evaluate it. We employ <a href="https://jax.readthedocs.io/en/latest/index.html">Jax</a> for AD. After reading this introduction, I would suggest walking through our example <a href="https://github.com/jwindle/gmmfun/tree/main/notebooks">Jupyter Notebooks</a> to see how it all works.</p>
<h1 id="gmm-for-population-parameter-estimation">GMM for population parameter estimation</h1>
<p><a href="https://faculty.washington.edu/ezivot">Eric Zivot</a> has an accessible <a href="https://faculty.washington.edu/ezivot/econ583/gmm.pdf">summary</a> of GMM. We recapitulate the essential parts here.</p>
<p>The idea is the following: you have \(K\) functions, or moment conditions, that satisfy</p>
\[\mathbb{E}[g_i(X; \theta_0)] = 0, i = 1, \ldots, K\]
<p>(We will use capital letters like \(X\) to denote a random variable and lower case letters like \(x\) to denote a realization from the random variable’s distribution.)</p>
<p>The sample equivalent of this is</p>
\[\bar g_i(\theta) = \frac{1}{n} \sum_{t=1}^{n} g_i(x_t, \theta) = 0.\]
<p>Because we may have more moment conditions than parameters, we may not be able to solve this exactly using sampled data. An obvious thing to do is to minimize the squared error of these sample moment conditions.</p>
<p>Because it will be useful later, we actually want to think about minimizing the squared error using a symmetric, positive definite weighting matrix \(W\). That is, we want to minimize</p>
\[J_n(\theta, W) = n \bar g' W \bar g.\]
<p>The choice of \(W\) can have a dramatic effect on the efficiency of the estimator, but, for any choice of \(W\), solving for the \(\theta\) that minimizes \(J_n\) will produce a consistent estimate as \(n \rightarrow \infty\). The most efficient estimator arises when \(n W\) is the inverse of the asymptotic variance of \(\bar g\) as \(n \rightarrow \infty\), under the true value \(\theta = \theta_0\). Assuming we have IID data,</p>
\[Var[ \frac{1}{n} \sum_{t=1}^{n} g(X_t, \theta) ] = \frac{1}{n} Var[ g(X_1, \theta) ].\]
<p>Let \(\hat S\) be an estimator of the variance term,</p>
\[\hat S(\theta) = V_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} g(x_i, \theta) g(x_i, \theta)'\]
<p>and let \(\hat W = \hat S^{-1}\). Then we can iteratively cycle through values of \(\hat \theta\) and \(\hat S\) by:</p>
\[\hat \theta = \underset{\theta}{\text{argmin}} \; J_n(\theta, \hat S)\]
<p>and</p>
\[\hat S = V_n(\hat \theta)\]
<p>to arrive at the estimator \(\hat \theta\) that arises using the optimal weight matrix, approximately speaking.</p>
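<p>As a toy illustration of this iteration (my own example, not one of the gmmfun notebooks), consider estimating \((\mu, \sigma)\) of a normal distribution from its first three moments, so that \(K = 3\) and \(L = 2\):</p>

```python
import numpy as np
from scipy.optimize import minimize

def gbar(theta, x):
    """Sample moment conditions for N(mu, sigma^2) via its first 3 moments:
    E X = mu, E X^2 = mu^2 + sigma^2, E X^3 = mu^3 + 3 mu sigma^2."""
    mu, sigma = theta
    m = np.array([mu, mu**2 + sigma**2, mu**3 + 3 * mu * sigma**2])
    G = np.stack([x, x**2, x**3], axis=1) - m  # n x K matrix of g(x_t, theta)
    return G.mean(axis=0), G

def J(theta, x, W):
    """The GMM objective J_n(theta, W) = n * gbar' W gbar."""
    g, _ = gbar(theta, x)
    return len(x) * g @ W @ g

rng = np.random.default_rng(1)
x = rng.normal(2.0, 1.5, size=5000)

theta, W = np.array([1.0, 1.0]), np.eye(3)
for _ in range(3):  # cycle between theta-hat and S-hat
    theta = minimize(J, theta, args=(x, W), method="Nelder-Mead").x
    _, G = gbar(theta, x)
    S = (G.T @ G) / len(x)  # estimate of Var[g(X_1, theta)]
    W = np.linalg.inv(S)
```

<p>After a few cycles, the estimate settles close to the true \((\mu, \sigma) = (2, 1.5)\).</p>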
<p>Lastly, and critically, when we have \(K\) moment conditions and only \(L\) parameters, then asymptotically, \(J_n\) converges to a \(\chi^2_{K-L}\) distribution under the null hypothesis. Thus, we can use \(J_n\) as a statistic to create a p-value:</p>
\[p = 1 - CDF_{\chi^2}(J_n(\hat \theta, \hat S), \; df=K - L).\]
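<p>The overidentification test is then one line; the value of \(J_n\) below is made up for illustration:</p>

```python
from scipy.stats import chi2

K, L = 3, 2  # number of moment conditions and parameters
J_n = 1.7    # illustrative value of the minimized GMM objective
p_value = 1.0 - chi2.cdf(J_n, df=K - L)  # large p: moments look consistent
```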
<h1 id="moment-conditions-for-estimating-population-parameters">Moment conditions for estimating population parameters</h1>
<p>We will be using the moment generating function (MGF) to define our moment conditions. The MGF is defined as</p>
\[M(t) = \mathbb{E}[e^{t X}]\]
<p>and has the property \(\mathbb{E}[X^k] = M^{(k)}(0)\) under certain regularity conditions. Thus, the obvious moment conditions are:</p>
\[g_i(x, \theta) = x^i - M_\theta^{(i)}(0), \quad i = 1, \ldots, K,\]
<p>where \(M^{(i)}\) is the \(i\)th derivative of the moment generating function with respect to \(t\) and \(\theta\) is the parameter vector. AD makes computing \(M^{(i)}\) trivial.</p>
<p>The code is very simple and can be found <a href="https://github.com/jwindle/gmmfun">here</a>.</p>
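<p>The heart of the trick is that repeated application of <code class="language-plaintext highlighter-rouge">jax.grad</code> yields the derivatives of the MGF at zero. Here is a minimal sketch for the normal distribution (my own toy version, not the package code):</p>

```python
import jax
import jax.numpy as jnp

def mgf(t, mu=2.0, sigma=1.5):
    """MGF of N(mu, sigma^2): exp(mu t + sigma^2 t^2 / 2)."""
    return jnp.exp(mu * t + 0.5 * sigma**2 * t**2)

def mgf_derivatives(f, k):
    """First k derivatives of f at 0, via repeated automatic differentiation."""
    out = []
    for _ in range(k):
        f = jax.grad(f)
        out.append(float(f(0.0)))
    return out

moments = mgf_derivatives(mgf, 2)  # [E X, E X^2] for the chosen mu, sigma
```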
<h2 id="known-asymptotic-variance">Known asymptotic variance</h2>
<p>We can actually go a step further here by computing the asymptotic variance directly. Suppose we want to compute the covariance between the \(i\)th and \(j\)th moment conditions. Letting \(\mu_i = M^{(i)}(0)\), \(i=1, \ldots, 2K\) we have</p>
\[\mathbb{E}[(X^i - \mu_i)(X^j - \mu_j)] =
\mathbb{E}[X^{i+j}] - \mu_i \mu_j = \mu_{i+j} - \mu_i \mu_j.\]
<p>In other words, for the price of computing not \(K\) derivatives, but the first \(2K\) derivatives, we can compute the asymptotic variance for a given parameter \(\theta\) directly.</p>
<p>In our method of moments discussion above, that allows us to replace \(\hat S(\theta)\) with \(S(\theta)\) where</p>
\[S_{ij}(\theta) = M_\theta^{(i+j)}(0) - M_\theta^{(i)}(0) M_\theta^{(j)}(0),\]
<p>the exact asymptotic variance for a given value of \(\theta\).</p>
<h1 id="uses">Uses</h1>
<p>The motivation for this work arose when considering the distribution of a linear combination of iid random variables, and estimating the underlying parameters governing that distribution. In particular, suppose one observes \(Y = v' X\), where \(X\) is a vector of iid random variables and \(v\) is a constant vector. We want to estimate the parameters governing the distribution of the components of \(X\).</p>
<p>When working in the Bayesian setting, either one would need to marginalize out \(X\) to arrive at a closed form likelihood of the observed data, or one would need to generate posterior samples of those latent variables themselves in order to generate estimates of the parameters. The latter can lead to Markov chain Monte Carlo samplers with high autocorrelation in the parameter samples.</p>
<p>However, if we know the moment generating function of \(X\), then we can use it to easily construct the moment generating function of \(Y\). We can then apply the GMM method outlined above to estimate the parameters of the distribution governing the components of \(X\).</p>
<h1 id="gmmfun-package">gmmfun package</h1>
<p>We have written the <code class="language-plaintext highlighter-rouge">gmmfun</code> package to implement these methods and a series of notebooks that covers:</p>
<ul>
<li>How to use the package</li>
<li>How the asymptotics change if one estimates the asymptotic variance or computes it directly</li>
<li>And how one can use these methods to estimate the parameters of iid \(X_i, i = 1, \ldots, K\) when one observes the linear combination \(Y = v' X\)</li>
</ul>
<p>The repo for the package can be found <a href="https://github.com/jwindle/gmmfun">here</a>.</p>Jesse WindleGaussian conditioned on a piecewise affine, continuous function2023-03-22T00:00:00+00:002023-03-22T00:00:00+00:00/2023/03/22/ctgauss<script type="text/javascript" id="MathJax-script" async="" src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js">
</script>
<p>Here is a problem that arose while we were modeling root growth in
monocots (e.g. maize).</p>
<p>Suppose that a priori \(X \sim N(\mu, I_n)\). Subsequently, we learn
that \(\ell(X) = 0\) where \(\ell\) is a piecewise affine, continuous
function. How does one sample \((X \vert \ell(X) = 0)\)?</p>
<p>It turns out that you can sample this conditional distribution using
Hamiltonian Monte Carlo (HMC). I have written up the methods in a
<a href="https://arxiv.org/abs/2303.12185">paper</a>, if you want to check out the details, and put together
a Python package called <a href="https://github.com/jwindle/ctgauss">ctgauss</a> that does the sampling.</p>
<p>This lets you do things like sample \(X \sim (N(0, \Sigma) \; \vert \;
\|X\|_1 = 1)\). In the figure below you will find this example for
two different covariance structures. You can even use a piecewise
quadratic log density, if you want.</p>
<p><img src="../images/onenorm-example.jpg" alt="onenorm-example" /></p>
<h1 id="approach">Approach</h1>
<p>In the paper, you will find that I have synthesized the work of
<a href="https://arxiv.org/abs/1208.4118">Pakman and Paninski</a>, who deal with truncations, and
<a href="https://proceedings.neurips.cc/paper/2015/file/8303a79b1e19a194f1875981be5bdb6f-Paper.pdf">Mohasel Afshar and Domke</a>, who deal with steps, and then
place that within the context of a particle moving on a manifold
within a higher dimensional space.</p>
<p>The key to all of this is that we know the exact dynamics of a
particle when the log-density of the distribution of interest is
piecewise quadratic. A piece on which our function is defined is
determined by a finite number of hyperplanes. Since the dynamics are
known exactly on that piece, we can compute the hypothetical time to
reaching each boundary, the smallest time being the one that matters,
since that is the one the particle will hit.</p>
<p>It is then just a matter of understanding the physics. If the
particle encounters a hard wall or a change in potential that is
greater than its kinetic energy, it will be reflected. Otherwise, it
will “climb” over the potential, but lose some momentum in the
process.</p>
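<p>That rule can be written down in a few lines. Below is an illustrative one-dimensional version of the momentum update at a potential jump \(\Delta U\), in the spirit of Mohasel Afshar and Domke; this is a sketch of the physics, not the ctgauss code:</p>

```python
import math

def cross_boundary(p_perp, delta_U):
    """Update the momentum component perpendicular to a boundary when a
    particle hits a potential jump of size delta_U. Returns the new
    perpendicular momentum: refract (losing momentum) if the kinetic
    energy exceeds the jump, otherwise reflect."""
    kinetic = 0.5 * p_perp**2
    if kinetic > delta_U:
        # Enough energy: climb over, keeping direction but losing momentum
        return math.copysign(math.sqrt(p_perp**2 - 2.0 * delta_U), p_perp)
    # Not enough energy (or a hard wall, delta_U = inf): reflect
    return -p_perp
```

<p>A hard wall is the limiting case \(\Delta U = \infty\), which always reflects, recovering the truncated-Gaussian dynamics of Pakman and Paninski.</p>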
<h1 id="when-to-use-ctgauss">When to use ctgauss</h1>
<p>If you hear “HMC”, you should immediately think
<a href="https://mc-stan.org/">Stan</a>, which is an excellent piece of software
for Bayesian inference using Hamiltonian Monte Carlo. In many
modeling scenarios, Stan will work, even with truncations or for some
non-smooth densities (see e.g. <a href="https://mc-stan.org/docs/functions-reference/index.html">Stan’s function
reference</a>,
section 3.7). For instance, you can sample approximately from the
distribution above using the model below with data, e.g. <code class="language-plaintext highlighter-rouge">list(N=3,
y=1.)</code>, if you relax the requirement that the sample lie exactly
on the surface \(\|X\|_1 =1\).</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>data {
int<lower=1> N;
real<lower=0> y;
}
parameters {
vector[N] x;
}
transformed parameters {
real z = sum(fabs(x));
}
model {
for (i in 1:N) {
x[i] ~ normal(0.0, 1.0);
}
y ~ normal(z, 0.01);
}
</code></pre></div></div>
<p>However, to the best of our knowledge and as of this writing, Stan may
be unable to accommodate some use cases, like arbitrary truncations by
hyperplane in multiple dimensions or when the piecewise affine,
continuous function used for conditioning is complex, in which case
<a href="https://github.com/jwindle/ctgauss">ctgauss</a> may be useful. (Or if you really, really want to
sample on \(\ell\) and not near it.)</p>
<p>For instance, generalizing the example above, we can do the same thing
for any n-sided top. Below we plot various samples on a 6-sided top
using ctgauss.</p>
<p><img src="../images/ntop-example.jpg" alt="ntop-example" /></p>
<h1 id="conclusion">Conclusion</h1>
<p>The bottom line is, if you can use Stan, use it. If you have a weird
use case, then consider <a href="https://github.com/jwindle/ctgauss">ctgauss</a>. Another upside to
<a href="https://github.com/jwindle/ctgauss">ctgauss</a> is that it is easy to understand the code base. The
<a href="https://arxiv.org/abs/2303.12185">paper</a> includes an appendix with the algorithm, and that
algorithm is closely mapped to what is actually implemented in Python,
making it easy to hack the code, if you want.</p>
<p>Let me know if you have any questions!</p>Jesse Windle