Predict fish phenology: nested

Overview

hatchR is designed to be flexible to achieve many applications. However, by virtue of being built as a scripting application, hatchR is able to tackle very large datasets relatively quickly and efficiently. Below we demonstrate an example of a nested dataset with multiple sites that include multiple years of data.

First, we load packages:

library(hatchR)
library(purrr)      # for mapping functions
library(tidyr)      # for nesting data
library(dplyr)      # for data manipulation
library(lubridate)  # for date manipulation
library(ggplot2)    # for plotting
library(ggridges)   # for plotting

Example Data

We’ll use a dataset of water temperature data from Central Idaho. These data are downloaded (and modified to long format) from Isaak et al. (2018) and generally cover the Boise, Payette, Clearwater, and upper Salmon River watersheds. The map below shows the locations of 226 water temperature monitoring sites overlaid on an August stream temperature scenario for the 29 600 km network in the study area.

To get started lets take a quick look at the data.

# look at the first few rows
idaho
#> # A tibble: 773,681 × 3
#>   date                site      temp_c
#>   <dttm>              <chr>      <dbl>
#> 1 2010-12-01 00:00:00 PIBO_1345   0.07
#> 2 2010-12-01 00:00:00 PIBO_1346   0.53
#> 3 2010-12-01 00:00:00 PIBO_1350   0.47
#> 4 2010-12-01 00:00:00 PIBO_1368   0.1 
#> 5 2010-12-01 00:00:00 PIBO_1375   0.33
#> # ℹ 773,676 more rows

# count number of unique sites
idaho |> 
  pull(site) |>
  unique() |> 
  length()
#> [1] 226

You can see there are 226 sites with 773,681 individual records of water temperature.

For this application, we’ll be thinking about these sites at putative bull trout (Salvelinus confluentus) spawning habitat. Bull trout are generally not observed where mean August temperatures are above 13 °C. So, we’ll first filter sites out for those cooler than 13 °C.

# create a vector of site names with temps at or below 13 C
bull_trout_sites <- idaho |> 
  mutate(month = month(date)) |> #make a month column (numeric)
  filter(month == 8) |> # filter out Aug.
  group_by(site) |> # apply grouping by site
  summarise(mean_aug_temp = mean(temp_c)) |> 
  filter(mean_aug_temp <= 13) |> # keep only sites 13 C or cooler
  pull(site) |> 
  unique()

# we now have a list of 139 sites
length(bull_trout_sites)
#> [1] 139

# only keep sites in our vector of bull trout sites
idaho_bt <- idaho |> 
  filter(site %in% bull_trout_sites) 

# still lots of data!
idaho_bt
#> # A tibble: 473,595 × 3
#>   date                site      temp_c
#>   <dttm>              <chr>      <dbl>
#> 1 2010-12-01 00:00:00 PIBO_1345   0.07
#> 2 2010-12-01 00:00:00 PIBO_1346   0.53
#> 3 2010-12-01 00:00:00 PIBO_1350   0.47
#> 4 2010-12-01 00:00:00 PIBO_1368   0.1 
#> 5 2010-12-01 00:00:00 PIBO_1375   0.33
#> # ℹ 473,590 more rows

Next we’ll want to do some data checks to make sure everything looks alright.

# lets look at a couple individual sites
PIBO_1345 <- idaho_bt |> filter(site == "PIBO_1345")

# looks nice
plot_check_temp(PIBO_1345,
                dates = date,
                temperature = temp_c)


# order by sample date
PIBO_1345 |> arrange(date)
#> # A tibble: 3,601 × 3
#>   date                site      temp_c
#>   <dttm>              <chr>      <dbl>
#> 1 2010-12-01 00:00:00 PIBO_1345   0.07
#> 2 2010-12-01 00:00:00 PIBO_1345   0.07
#> 3 2010-12-02 00:00:00 PIBO_1345   0.07
#> 4 2010-12-02 00:00:00 PIBO_1345   0.07
#> 5 2010-12-03 00:00:00 PIBO_1345   0.06
#> # ℹ 3,596 more rows
# looks like there are multiple records per day 
# so we need to summarize those in the larger dataset (idaho_bt)

Nested Dataframes

The package tidyr allows us to use a functionality called nested data (tidyr::nest()). For our datatset, we can think of it in terms of a dataframe made up of a bunch of smaller dataframes where the identifier for separating the data is the name in the site column. These sub dataframes then are the records for each site in our example and we can separate them to programatically so that we can use the same function across them without skipping into the next data field. It utilizes the same approach we applied with purrr::map() but allows us to separate the function across individual datasets stored in our larger dataframe.

In this first example we want to summarize our data by day, which we can do using the following code.

First let’s just look at what nesting does before we actually make an object.

idaho_bt |> 
  group_by(site) |>  # we group by site
  nest() |> # nest our grouped data
  head()
#> # A tibble: 6 × 2
#> # Groups:   site [6]
#>   site      data                
#>   <chr>     <list>              
#> 1 PIBO_1345 <tibble [3,601 × 2]>
#> 2 PIBO_1346 <tibble [3,570 × 2]>
#> 3 PIBO_1350 <tibble [3,571 × 2]>
#> 4 PIBO_1368 <tibble [3,581 × 2]>
#> 5 PIBO_1375 <tibble [3,582 × 2]>
#> # ℹ 1 more row

The resulting data structure is tibble with a new data column called data. The data column is actually a list and each row contains its own individual tibble (dataframe).

isaak_summ_bt <- idaho_bt |> 
  group_by(site) |> 
  nest() |> 
  mutate(
    summ_obj = map(
      data, 
      summarize_temp,
      temperature = temp_c,
      dates = date
      )
    ) |> 
  select(site, summ_obj) 

# look at first rows of full object
isaak_summ_bt
#> # A tibble: 139 × 2
#> # Groups:   site [139]
#>   site      summ_obj            
#>   <chr>     <list>              
#> 1 PIBO_1345 <tibble [1,826 × 2]>
#> 2 PIBO_1346 <tibble [1,826 × 2]>
#> 3 PIBO_1350 <tibble [1,826 × 2]>
#> 4 PIBO_1368 <tibble [1,826 × 2]>
#> 5 PIBO_1375 <tibble [1,826 × 2]>
#> # ℹ 134 more rows

# use purrr::pluck() the first site to see its structure
isaak_summ_bt |> pluck("summ_obj", 1)
#> # A tibble: 1,826 × 2
#>   date       daily_temp
#>   <date>          <dbl>
#> 1 2010-12-01       0.07
#> 2 2010-12-02       0.07
#> 3 2010-12-03       0.06
#> 4 2010-12-04       0.06
#> 5 2010-12-05       0.07
#> # ℹ 1,821 more rows

Since we’ll be operating on these as nests we will keep them as a nest, however if you wanted to change back to the original dataframe format it’s easy with tidyr::unnest().

isaak_summ_bt |> unnest(cols = summ_obj)
#> # A tibble: 253,814 × 3
#> # Groups:   site [139]
#>   site      date       daily_temp
#>   <chr>     <date>          <dbl>
#> 1 PIBO_1345 2010-12-01       0.07
#> 2 PIBO_1345 2010-12-02       0.07
#> 3 PIBO_1345 2010-12-03       0.06
#> 4 PIBO_1345 2010-12-04       0.06
#> 5 PIBO_1345 2010-12-05       0.07
#> # ℹ 253,809 more rows

Data Check

We can do a last data check to make sure we have continuous data.

First we’ll use a smaller example to show how it works and then expand to our full dataset.

# Pull data from one site
PIBO_1345_summ <- isaak_summ_bt |>
  filter(site == "PIBO_1345") |>
  unnest(cols = "summ_obj")

# We can then use the hatchR function to check that our data are continuous
# We see from the message they are!
check_continuous(data = PIBO_1345_summ, dates = date)
#> ℹ No breaks were found. All clear!

# To demonstrate an example of our data not being continuous, we remove the 100th row
check_continuous(data = PIBO_1345_summ[-100,], dates = date)
#> Warning: ! Data not continuous
#> ℹ Breaks found at rows:
#> ℹ 100

Next we can map() our check_continuous() across our nested data set.


# Map check_continuous() across isaak_summ_bt
# We'll only do the first 5 nested objects so the output doesn't get too long

isaak_summ_bt[1:5,] |> # run on first five nested dataframes for convenience
                       # remove [1:5,] to run on all nested objects 
  mutate(diff = map( # mutate a dummy diff column to run map
    summ_obj, # nested data object (akin to "data =" argument in normal function)
    check_continuous, # function name, no parentheses
    dates = date # specify dates argument with column name of dates in summ_obj
  )
  )
#> ℹ No breaks were found. All clear!
#> ℹ No breaks were found. All clear!
#> ℹ No breaks were found. All clear!
#> ℹ No breaks were found. All clear!
#> ℹ No breaks were found. All clear!
#> # A tibble: 5 × 3
#> # Groups:   site [5]
#>   site      summ_obj             diff  
#>   <chr>     <list>               <list>
#> 1 PIBO_1345 <tibble [1,826 × 2]> <NULL>
#> 2 PIBO_1346 <tibble [1,826 × 2]> <NULL>
#> 3 PIBO_1350 <tibble [1,826 × 2]> <NULL>
#> 4 PIBO_1368 <tibble [1,826 × 2]> <NULL>
#> 5 PIBO_1375 <tibble [1,826 × 2]> <NULL>

Mapping Across Nested Data

Now that we have our data in the format we want and we’re confident that it doesn’t have any gaps we can start to map our hatchR functions onto the data.

First we need to make a vector of spawn dates and get our model set up.

# spawn dataest
spawn_dates <- map(
  c(2011:2014), # year vector to map for custom function
  function(year) { # custom paste function
    c(
      paste0(year, "-09-01"),
      paste0(year, "-09-15"),
      paste0(year, "-09-30")
      )
    }
  ) |> 
  unlist()

# bull trout hatch model
bt_hatch <- model_select(
  development_type = "hatch",
  author = "Austin et al. 2019",
  species = "bull trout",
  model = "MM"
)

Then we can map it across our nested dataframe.

hatch_res <- isaak_summ_bt |> 
  mutate(
    dev_period = map2(
      summ_obj, spawn_dates, 
      predict_phenology,
      temperature = daily_temp,
      model = bt_hatch,
      dates = date
      ) |> 
      map_df("dev_period") |>
      list()
    ) |> 
  select(site, dev_period) |> # just select the columns we want
  unnest(cols = c(dev_period)) |> # unnest everything
  mutate(days_to_hatch = stop - start) # make a new column of days to hatch

hatch_res
#> # A tibble: 1,668 × 4
#> # Groups:   site [139]
#>   site      start      stop       days_to_hatch
#>   <chr>     <date>     <date>     <drtn>       
#> 1 PIBO_1345 2011-09-01 2011-11-11  71 days     
#> 2 PIBO_1345 2011-09-15 2011-12-27 103 days     
#> 3 PIBO_1345 2011-09-30 2012-02-07 130 days     
#> 4 PIBO_1345 2012-09-01 2012-11-26  86 days     
#> 5 PIBO_1345 2012-09-15 2013-01-04 111 days     
#> # ℹ 1,663 more rows

It’s not output here, but this will output a boatload of warnings that some fish aren’t hatching. However, it’s pretty safe to assume that some of these habitats are actually too cold for bull trout and those are the places where they won’t have hatched. If we were trying to make management decisions about these fish or testing a some kind of hypothesis it would be worth following up with why those fish weren’t hatching, but for now, we’ll rest on the assumption it’s just too cold.

A couple notes about what we did above. First we use purrr::map2() instead of purrr::pmap() but they are interchangeable here as long as you set up your variable grid for purrr::pmap(). Secondly, we’re actually piping the output of the purrr::map2() into purrr::map_df() function so we can just pull out the dev_period list element out of the predict_phenology() output. Otherwise we’d end up with a list of the 4 outputs from predict_phenology(), which would end up being a pretty big object in your memory. So better just to store the 1x2 vector that is dev_period. Lastly, the purrr::map_df() is piped into a list because the nested dataframe doesn’t know how to hold onto multiple dimension object (the 1x2 vector), it just wants a singular list containing all the elements.

With that in mind we can plot our data.

Briefly, we’ll add an extra set of columns to define whether a fish spawned early, mid, or late and then plot the distributions across those time periods.

hatch_res |>
  mutate(day = day(start)) |> 
  mutate(spawn_time = case_when(
    day == 1 ~ "Early",
    day == 15 ~ "Mid",
    day == 30 ~ "Late"
  )) |>
  mutate(spawn_time = factor(
    spawn_time, levels = c("Late", "Mid", "Early"), 
    ordered = TRUE)
    ) |> 
  ggplot(aes(
    x = as.integer(days_to_hatch), 
    y = spawn_time, 
    fill = spawn_time, 
    color = spawn_time
    )) + 
  geom_density_ridges(alpha = 0.9) +
  scale_fill_brewer(palette = "Blues", direction = 1) +
  scale_color_brewer(palette = "Blues", direction = 1) +
  labs(x = "Days to hatch", y = "Spawn time") +
  theme_classic() +
  theme(legend.position = "none")

References