Predict fish phenology: nested
Morgan Sparks, Bryan M. Maitland
Source:vignettes/Predict_phenology_nested.Rmd
Predict_phenology_nested.Rmd
Overview
hatchR is designed to be flexible to achieve many applications. However, by virtue of being built as a scripting application, hatchR is able to tackle very large datasets relatively quickly and efficiently. Below we demonstrate an example of a nested dataset with multiple sites that include multiple years of data.
First, we load packages:
library(hatchR)
library(purrr) # for mapping functions
library(tidyr) # for nesting data
library(dplyr) # for data manipulation
library(lubridate) # for date manipulation
library(ggplot2) # for plotting
library(ggridges) # for plotting
Example Data
We’ll use a dataset of water temperature data from Central Idaho. These data are downloaded (and modified to long format) from Isaak et al. (2018) and generally cover the Boise, Payette, Clearwater, and upper Salmon River watersheds. The map below shows the locations of 226 water temperature monitoring sites overlaid on an August stream temperature scenario for the 29 600 km network in the study area.
To get started lets take a quick look at the data.
# look at the first few rows
idaho
#> # A tibble: 773,681 × 3
#> date site temp_c
#> <dttm> <chr> <dbl>
#> 1 2010-12-01 00:00:00 PIBO_1345 0.07
#> 2 2010-12-01 00:00:00 PIBO_1346 0.53
#> 3 2010-12-01 00:00:00 PIBO_1350 0.47
#> 4 2010-12-01 00:00:00 PIBO_1368 0.1
#> 5 2010-12-01 00:00:00 PIBO_1375 0.33
#> # ℹ 773,676 more rows
# count number of unique sites
idaho |>
pull(site) |>
unique() |>
length()
#> [1] 226
You can see there are 226 sites with 773,681 individual records of water temperature.
For this application, we’ll be thinking about these sites at putative bull trout (Salvelinus conflentus) spawning habitat. Bull trout are generally not observed where mean August temperatures are above 13 °C. So, we’ll first filter sites out for those cooler than 13 °C.
# create a vector of site names with temps at or below 13 C
bull_trout_sites <- idaho |>
mutate(month = month(date)) |> #make a month column (numeric)
filter(month == 8) |> # filter out Aug.
group_by(site) |> # apply grouping by site
summarise(mean_aug_temp = mean(temp_c)) |>
filter(mean_aug_temp <= 13) |> # keep only sites 13 C or cooler
pull(site) |>
unique()
# we now have a list of 139 sites
length(bull_trout_sites)
#> [1] 139
# only keep sites in our vector of bull trout sites
idaho_bt <- idaho |>
filter(site %in% bull_trout_sites)
# still lots of data!
idaho_bt
#> # A tibble: 473,595 × 3
#> date site temp_c
#> <dttm> <chr> <dbl>
#> 1 2010-12-01 00:00:00 PIBO_1345 0.07
#> 2 2010-12-01 00:00:00 PIBO_1346 0.53
#> 3 2010-12-01 00:00:00 PIBO_1350 0.47
#> 4 2010-12-01 00:00:00 PIBO_1368 0.1
#> 5 2010-12-01 00:00:00 PIBO_1375 0.33
#> # ℹ 473,590 more rows
Next we’ll want to do some data checks to make sure everything looks alright.
# lets look at a couple individual sites
PIBO_1345 <- idaho_bt |> filter(site == "PIBO_1345")
# looks nice
plot_check_temp(PIBO_1345,
dates = date,
temperature = temp_c)
# order by sample date
PIBO_1345 |> arrange(date)
#> # A tibble: 3,601 × 3
#> date site temp_c
#> <dttm> <chr> <dbl>
#> 1 2010-12-01 00:00:00 PIBO_1345 0.07
#> 2 2010-12-01 00:00:00 PIBO_1345 0.07
#> 3 2010-12-02 00:00:00 PIBO_1345 0.07
#> 4 2010-12-02 00:00:00 PIBO_1345 0.07
#> 5 2010-12-03 00:00:00 PIBO_1345 0.06
#> # ℹ 3,596 more rows
# looks like there are multiple records per day
# so we need to summarize those in the larger dataset (idaho_bt)
Nested Dataframes
The package tidyr allows us to use a functionality
called nested data (tidyr::nest()
). For our datatset, we
can think of it in terms of a dataframe made up of a bunch of smaller
dataframes where the identifier for separating the data is the name in
the site
column. These sub dataframes then are the records
for each site in our example and we can separate them to programatically
so that we can use the same function across them without skipping into
the next data field. It utilizes the same approach we applied with
purrr::map()
but allows us to separate the function across
individual datasets stored in our larger dataframe.
In this first example we want to summarize our data by day, which we can do using the following code.
First let’s just look at what nesting does before we actually make an object.
idaho_bt |>
group_by(site) |> # we group by site
nest() |> # nest our grouped data
head()
#> # A tibble: 6 × 2
#> # Groups: site [6]
#> site data
#> <chr> <list>
#> 1 PIBO_1345 <tibble [3,601 × 2]>
#> 2 PIBO_1346 <tibble [3,570 × 2]>
#> 3 PIBO_1350 <tibble [3,571 × 2]>
#> 4 PIBO_1368 <tibble [3,581 × 2]>
#> 5 PIBO_1375 <tibble [3,582 × 2]>
#> # ℹ 1 more row
The resulting data structure is tibble with a new data column called data. The data column is actually a list and each row contains its own individual tibble (dataframe).
isaak_summ_bt <- idaho_bt |>
group_by(site) |>
nest() |>
mutate(
summ_obj = map(
data,
summarize_temp,
temperature = temp_c,
dates = date
)
) |>
select(site, summ_obj)
# look at first rows of full object
isaak_summ_bt
#> # A tibble: 139 × 2
#> # Groups: site [139]
#> site summ_obj
#> <chr> <list>
#> 1 PIBO_1345 <tibble [1,826 × 2]>
#> 2 PIBO_1346 <tibble [1,826 × 2]>
#> 3 PIBO_1350 <tibble [1,826 × 2]>
#> 4 PIBO_1368 <tibble [1,826 × 2]>
#> 5 PIBO_1375 <tibble [1,826 × 2]>
#> # ℹ 134 more rows
# use purrr::pluck() the first site to see its structure
isaak_summ_bt |> pluck("summ_obj", 1)
#> # A tibble: 1,826 × 2
#> date daily_temp
#> <date> <dbl>
#> 1 2010-12-01 0.07
#> 2 2010-12-02 0.07
#> 3 2010-12-03 0.06
#> 4 2010-12-04 0.06
#> 5 2010-12-05 0.07
#> # ℹ 1,821 more rows
Since we’ll be operating on these as nests we will keep them as a
nest, however if you wanted to change back to the original dataframe
format it’s easy with tidyr::unnest()
.
isaak_summ_bt |> unnest(cols = summ_obj)
#> # A tibble: 253,814 × 3
#> # Groups: site [139]
#> site date daily_temp
#> <chr> <date> <dbl>
#> 1 PIBO_1345 2010-12-01 0.07
#> 2 PIBO_1345 2010-12-02 0.07
#> 3 PIBO_1345 2010-12-03 0.06
#> 4 PIBO_1345 2010-12-04 0.06
#> 5 PIBO_1345 2010-12-05 0.07
#> # ℹ 253,809 more rows
Data Check
We can do a last data check to make sure we have continuous data.
First we’ll use a smaller example to show how it works and then expand to our full dataset.
# Pull data from one sits
PIBO_1345_summ <- isaak_summ_bt |>
filter(site == "PIBO_1345") |>
unnest(cols = "summ_obj")
# We create a column that either contains NA, TRUE, or FALSE
# NA is for first data
# TRUE is if the difference between one row and the row preceding it is 1
# FALSE is the difference is not 1
PIBO_1345_summ |>
mutate(diff = c(NA, diff(date)) == 1) |>
filter(diff == FALSE) # since the output is empty there are no FALSE in diff
#> # A tibble: 0 × 4
#> # Groups: site [0]
#> # ℹ 4 variables: site <chr>, date <date>, daily_temp <dbl>, diff <lgl>
# We can do the same to our isaak_summ_bt dataset
# only difference here is we are mapping with an anonymous function hence the
# ~{..., .x$date...} syntax.
# ~{} tells us it's an anonymous function while the .x allows to us to
# call the column from whatever data is piped in
isaak_summ_bt |>
mutate(diff = map(
summ_obj, ~ {
c(NA, diff(.x$date) == 1)
}
)
) |>
unnest(cols = c(summ_obj, diff)) |>
filter(diff == FALSE)
#> # A tibble: 0 × 4
#> # Groups: site [0]
#> # ℹ 4 variables: site <chr>, date <date>, daily_temp <dbl>, diff <lgl>
# all continuous!
Mapping Across Nested Data
Now that we have our data in the format we want and we’re confident that it doesn’t have any gaps we can start to map our hatchR functions onto the data.
First we need to make a vector of spawn dates and get our model set up.
# spawn datest
spawn_dates <- map(
c(2011:2014), # year vector to map for custom function
function(year) { # custom paste function
c(
paste0(year, "-09-01"),
paste0(year, "-09-15"),
paste0(year, "-09-30")
)
}
) |>
unlist()
# bull trout hatch model
bt_hatch <- model_select(
development_type = "hatch",
author = "Austin et al. 2017",
species = "bull trout",
model = "MM"
)
Then we can map it across our nested dataframe.
hatch_res <- isaak_summ_bt |>
mutate(
dev_period = map2(
summ_obj, spawn_dates,
predict_phenology,
temperature = daily_temp,
model = bt_hatch,
dates = date
) |>
map_df("dev.period") |>
list()
) |>
select(site, dev_period) |> # just select the columns we want
unnest(cols = c(dev_period)) |> # unnest everything
mutate(days_to_hatch = stop - start) # make a new column of days to hatch
hatch_res
#> # A tibble: 1,668 × 4
#> # Groups: site [139]
#> site start stop days_to_hatch
#> <chr> <date> <date> <drtn>
#> 1 PIBO_1345 2011-09-01 2011-11-11 71 days
#> 2 PIBO_1345 2011-09-15 2011-12-27 103 days
#> 3 PIBO_1345 2011-09-30 2012-02-07 130 days
#> 4 PIBO_1345 2012-09-01 2012-11-26 86 days
#> 5 PIBO_1345 2012-09-15 2013-01-04 111 days
#> # ℹ 1,663 more rows
It’s not output here, but this will output a boatload of warnings that some fish aren’t hatching. However, it’s pretty safe to assume that some of these habitats are actually too cold for bull trout and those are the places where they won’t have hatched. If we were trying to make management decisions about these fish or testing a some kind of hypothesis it would be worth following up with why those fish weren’t hatching, but for now, we’ll rest on the assumption it’s just too cold.
A couple notes about what we did above. First we use
purrr::map2()
instead of purrr::pmap()
but
they are interchangeable here as long as you set up your variable grid
for purrr::pmap()
. Secondly, we’re actually piping the
output of the purrr::map2()
into
purrr::map_df()
function so we can just pull out the
dev.period
list element out of the
predict_phenology()
output. Otherwise we’d end up with a
list of the 4 outputs from predict_phenology()
, which would
end up being a pretty big object in your memory. So better just to store
the 1x2 vector that is dev.period
. Lastly, the
purrr::map_df()
is piped into a list because the nested
dataframe doesn’t know how to hold onto multiple dimension object (the
1x2 vector), it just wants a singular list containing all the
elements.
With that in mind we can plot our data.
Briefly, we’ll add an extra set of columns to define whether a fish spawned early, mid, or late and then plot the distributions across those time periods.
hatch_res |>
mutate(day = day(start)) |>
mutate(spawn_time = case_when(
day == 1 ~ "Early",
day == 15 ~ "Mid",
day == 30 ~ "Late"
)) |>
mutate(spawn_time = factor(
spawn_time, levels = c("Late", "Mid", "Early"),
ordered = TRUE)
) |>
ggplot(aes(
x = as.integer(days_to_hatch),
y = spawn_time,
fill = spawn_time,
color = spawn_time
)) +
geom_density_ridges(alpha = 0.9) +
scale_fill_brewer(palette = "Blues", direction = 1) +
scale_color_brewer(palette = "Blues", direction = 1) +
labs(x = "Days to hatch", y = "Spawn time") +
theme_classic() +
theme(legend.position = "none")