Skip to contents

Overview

hatchR is an R package that allows users to predict hatch and emergence timing for wild fishes, as well as additional tools to aid in those analyses. hatchR is intended to bridge the analytic gap of taking statistical models developed in hatchery settings (e.g., Beacham and Murray 1990) and applying them to real world temperature data sets using the effective value framework developed by Sparks et al. (2019).

In this vignette, we:

  • describes input data requirements
  • provide recommendations for importing data
  • review basic data checks
  • preview the full hatchR workflow covered in other vignettes

Input Data Requirements

Water temperature datasets for wild fishes are often either (1) already summarized by day (i.e., mean daily temperature) or, (2) in a raw format from something like a HOBO TidbiT data logger where readings are taken multiple times per day, which can be summarized into a mean daily temperatures. Alternatively, daily water temperature predictions from novel statistical models (e.g., Siegel et al. 2023) could be similarly implemented.

Fundamentally, hatchR assumes you have input data with two columns:

  1. date: a vector of dates (and often times) of a temperature measurement (must be class Date or POSIXct, which Tibbles print as <date> and <dttm>, respectively).
  2. temperature: a vector of associated temperature measurement (in centigrade).

Additional columns are allowed, but these are required.

We expect your data to look something like this:

date temperature
2000-01-01 2.51
2000-07-01 16.32
2000-12-31 3.13

hatchR assumes you’ve checked for missing records or errors in your data because it can work with gaps, so it’s important to go through the data checks discussed below, as well as your own validity checks.

hatchR can use values down to freezing (e.g., 0 °C), which returns extremely small effective values, and time to hatch or emerge may be > 1 year. In these cases, we suggest users consider how much of that data type is reasonable with their data. To learn more about this, see the section on negative temperatures in the Predict fish phenology: Basic vignette.

Prerequisites: Dates and Times

Numeric temperature values are simple to work with in R, but dates and time can be tricky. Here we provide a brief overview of how to work with dates and times in R, but refer the user to Chapter 17 in R for Data Science (Wickham, Çetinkaya-Rundel, and Grolemund 2023) for a more in-depth discussion.

The lubridate package makes it easier to work with dates and times in R. It comes installed with hatchR, and can be loaded with:

There are three types of date/time data that refer to an instant in time:

  • A date, which Tibbles print as <date>
  • A time within a day, which Tibbles print as <time>
  • A date-time is a date plus time, which Tibbles print as <dttm>. Base R calls these POSIXct, but that’s not a helpful name.

You can use lubridate::today() or lubridate::now() to get the current date or date-time:

today()
#> [1] "2024-12-16"
now()
#> [1] "2024-12-16 22:07:04 UTC"

In the context of hatchR, the ways you are likely to create a date/time are:

  1. reading a file into R with readr::read_csv()
  2. from a string (e.g., if data was read into R with read.csv())
  3. from individual components (year, month, day, hour, minute, second)
Reading in dates from a file

When reading in a CSV file with readr::read_csv(), readr (which also comes installed with hatchR) will automatically parse (recognize) dates and date-times if they are in the form “YYYY-MM-DD” or “YYYY-MM-DD HH:MM:SS”. These are ISO8601 date (<date>) and date-time (<dttm>) formats, respectively. ISO8601 is an international standard for writing dates where the components of a date are organized from biggest to smallest separated by -. Below, we first load the readr package and then read in a CSV file with dates in the form “YYYY-MM-DD” and “YYYY-MM-DD HH:MM:SS”:

csv <- "
  date,datetime
  2022-01-02,2022-01-02 05:12
"
read_csv(csv)
#> # A tibble: 1 × 2
#>   date       datetime           
#>   <date>     <dttm>             
#> 1 2022-01-02 2022-01-02 05:12:00

If your dates are in a different format, you’ll need to use col_types plus col_date() or col_datetime() along with a standard date-time format (see Table 17.1 in R for Data Science (Wickham, Çetinkaya-Rundel, and Grolemund 2023) for a list of all date format options).

From strings

If you you read in a CSV file using read.csv() from base R, date columns will be formatted as a characters (<char>; e.g., "2000-09-01" or "2000-09-01 12:00:00"). You will have convert to this column to a <date> or <dttm>. lubridate helper functions attempt to automatically determine the format once you specify the order of the component. To use them, identify the order in which year, month, and day appear in your dates, then arrange “y”, “m”, and “d” in the same order. That gives you the name of the lubridate function that will parse your date. For date-time, add an underscore and one or more of “h”, “m”, and “s” to the name of the parsing function.

ymd("2017-01-31")
#> [1] "2017-01-31"
mdy("January 31st, 2017")
#> [1] "2017-01-31"
mdy_hm("01/31/2017 08:01")
#> [1] "2017-01-31 08:01:00 UTC"
ymd_hms("2017-01-31 20:11:59")
#> [1] "2017-01-31 20:11:59 UTC"
From individual components

Sometimes you have the components of a date in separate columns. You can use make_date() or make_datetime() to combine them into a date or date-time.

To show this, we’ll use the flights dataset that comes with nycflights13, which is only installed alongside hatchR when dependencies = TRUE in install.packages(). We also make use of helper functions from dplyr to select() and mutate() columns.

flights |> 
  select(year, month, day) |>
  mutate(date = make_date(year, month, day))
#> # A tibble: 336,776 × 4
#>    year month   day date      
#>   <int> <int> <int> <date>    
#> 1  2013     1     1 2013-01-01
#> 2  2013     1     1 2013-01-01
#> 3  2013     1     1 2013-01-01
#> 4  2013     1     1 2013-01-01
#> 5  2013     1     1 2013-01-01
#> # ℹ 336,771 more rows

flights |> 
  select(year, month, day, hour, minute) |> 
  mutate(departure = make_datetime(year, month, day, hour, minute))
#> # A tibble: 336,776 × 6
#>    year month   day  hour minute departure          
#>   <int> <int> <int> <dbl>  <dbl> <dttm>             
#> 1  2013     1     1     5     15 2013-01-01 05:15:00
#> 2  2013     1     1     5     29 2013-01-01 05:29:00
#> 3  2013     1     1     5     40 2013-01-01 05:40:00
#> 4  2013     1     1     5     45 2013-01-01 05:45:00
#> 5  2013     1     1     6      0 2013-01-01 06:00:00
#> # ℹ 336,771 more rows

Importing your data

Using readr::read_csv()

As noted in the prerequisites section, we recommend loading your data into R using readr::read_csv(). We can load readr using:

In this and other vignettes, we will use two datasets that come installed with the package: crooked_river and woody_island. Each dataset is available to the user as R objects once hatchR is installed and attached (see ?crooked_river or ?woody_island for more information). The raw example data (.csv files) are stored in the extdata/ directory installed alongside hatchR. We may store the file paths to example data:

path_cr <- system.file("extdata/crooked_river.csv", package = "hatchR")
path_wi <- system.file("extdata/woody_island.csv", package = "hatchR")

After specifying path_* (using system.file()), we load the example dataset into R:

crooked_river <- read_csv(path_cr)
woody_island <- read_csv(path_wi)

We check the crooked_river dataset by running either str() or tibble::glimpse() to see the structure of the data. glimpse() is a little like str() applied to a data frame, but it tries to show you as much data as possible. We prefer tibble::glimpse() because it is more compact and easier to read. tibble also comes installed with hatchR.

Now we can check the structure of the crooked_river dataset:

glimpse(crooked_river)
#> Rows: 1,826
#> Columns: 2
#> $ date   <dttm> 2010-12-01, 2010-12-02, 2010-12-03, 2010-12-04, 2010-12-05, 20…
#> $ temp_c <dbl> 1.1638020, 1.3442852, 1.2533443, 1.0068728, 1.2899153, 1.229158…
glimpse(woody_island)
#> Rows: 735
#> Columns: 2
#> $ date   <chr> "8/11/1990", "8/12/1990", "8/13/1990", "8/14/1990", "8/15/1990"…
#> $ temp_c <dbl> 25.850000, 23.308333, 18.533333, 15.350000, 13.966667, 11.35833…

For your own data, assuming you have a .csv file in a data folder in your working directory called your_data.csv, you would call:

library(readr)
library(tibble)
your_data <- read_csv("data/your_data.csv")
glimpse(your_data)

Using read.csv()

If you import your data in with functions like read.csv() or read.table(), date columns will be formatted as a characters (<chr>). You will have convert to this column to a <date> or <dttm> type, and we recommend you do this using lubridate, which makes dealing with date a litter easier.

Below is an example of how you might do this. First, we check the structure of the crooked_river and woody_island datasets:

crooked_river <- read.csv(path_cr)
woody_island <- read.csv(path_wi)

glimpse(crooked_river) # note date column imported as a character (<chr>)
#> Rows: 1,826
#> Columns: 2
#> $ date   <chr> "2010-12-01T00:00:00Z", "2010-12-02T00:00:00Z", "2010-12-03T00:…
#> $ temp_c <dbl> 1.1638020, 1.3442852, 1.2533443, 1.0068728, 1.2899153, 1.229158…
glimpse(woody_island) # note date column imported as a character (<chr>)
#> Rows: 735
#> Columns: 2
#> $ date   <chr> "8/11/1990", "8/12/1990", "8/13/1990", "8/14/1990", "8/15/1990"…
#> $ temp_c <dbl> 25.850000, 23.308333, 18.533333, 15.350000, 13.966667, 11.35833…

Then, we convert the date columns using lubridate:

# if your date is in the form "2000-09-01 12:00:00"
crooked_river$date <- ymd_hms(crooked_river$date)

# if your date is in the form "2000-09-01"
woody_island$date <- mdy(woody_island$date) 

glimpse(crooked_river)
#> Rows: 1,826
#> Columns: 2
#> $ date   <dttm> 2010-12-01, 2010-12-02, 2010-12-03, 2010-12-04, 2010-12-05, 20…
#> $ temp_c <dbl> 1.1638020, 1.3442852, 1.2533443, 1.0068728, 1.2899153, 1.229158…
glimpse(woody_island)
#> Rows: 735
#> Columns: 2
#> $ date   <date> 1990-08-11, 1990-08-12, 1990-08-13, 1990-08-14, 1990-08-15, 19…
#> $ temp_c <dbl> 25.850000, 23.308333, 18.533333, 15.350000, 13.966667, 11.35833…

Data Checks

After data are loaded and dates are formatted correctly, we can complete a series of checks to ensure that the temperature data is in the correct format and that there are no missing values.

hatchR provides several helper functions to assist with this process.

Visualize your data

hatchR comes with the function plot_check_temp() to visualize your imported data and verify nothing strange happened during your import process. The function outputs a ggplot2 object which can be subsequently modified by the user. The arguments temp_min = and temp_max = can be used to custom set thresholds for expected temperature ranges (defaults are set at 0 and 25 °C). Here is an example using the built-in dataset crooked_river:

plot_check_temp(data = crooked_river, 
                dates = date, 
                temperature = temp_c, 
                temp_min = 0, 
                temp_max = 12)

Summarize Data

If you imported raw data with multiple recordings per day, hatchR has a built in function to summarize those data to a daily average mean called summarize_temp(). The output of the function is a tibble with mean daily temperature and its corresponding.

Below, we simulate a dataset with 30 minute temperature recordings for a year and summarize the data to daily means.

# set seed for reproducibility
set.seed(123)

# create vector of date-times for a year at 30 minute intervals
dates <- seq(
  from = ymd_hms("2000-01-01 00:00:00"),
  to = ymd_hms("2000-12-31 23:59:59"), 
  by = "30 min"
  )

# simualte temperature data
fake_data <- tibble(
  date = dates,
  temp = rnorm(n = length(dates), mean = 10, sd = 3) |> abs()
)

# check it
glimpse(fake_data)
#> Rows: 17,568
#> Columns: 2
#> $ date <dttm> 2000-01-01 00:00:00, 2000-01-01 00:30:00, 2000-01-01 01:00:00, 2…
#> $ temp <dbl> 8.318573, 9.309468, 14.676125, 10.211525, 10.387863, 15.145195, 1…

Now we can summarize the data to daily means using summarize_temp():

fake_data_sum <- summarize_temp(data = fake_data,
                                temperature = temp,
                                dates = date)

nrow(fake_data) #17568 records
#> [1] 17568
nrow(fake_data_sum) #366 records; 2000 was a leap year :)
#> [1] 366

We again recommend, at a minimum, visually checking your data once it has been summarized.

# note we use fake_data_sum instead of fake_data
plot_check_temp(data = fake_data_sum, 
                dates = date, 
                temperature = daily_temp, 
                temp_min = 5,
                temp_max = 15)

Check for continuous data

At present, hatchR only uses continuous data. Therefore, your data is expected to be continuous and complete.

You can check whether your data is complete and continuous using the check_continuous() function. This function will return a message if the data is or is not continuous or complete. Below, the calls return all clear:

check_continuous(data = crooked_river, dates = date)
#>  No breaks were found. All clear!
check_continuous(data = woody_island, dates = date)
#>  No breaks were found. All clear!

But if you had breaks in your data:

check_continuous(data = crooked_river[-5,], dates = date)
#> Warning: ! Data not continuous
#>  Breaks found at rows:
#>  5

You’ll get a note providing the row index where breaks are found for inspection.

If you have days of missing data, you could impute them using rolling means or other approaches.

Workflow

After importing, wrangling, and checking your temperature data, there are a number of different actions you can take. These steps are laid out in subsequent vignettes:

The generalized workflow for hatchR, including the above and subsequent steps, is shown below.

References

Beacham, Terry D., and Clyde B. Murray. 1990. “Temperature, Egg Size, and Development of Embryos and Alevins of Five Species of Pacific Salmon: A Comparative Analysis.” Transactions of the American Fisheries Society 119 (6): 927–45. https://doi.org/10.1577/1548-8659(1990)119<0927:TESADO>2.3.CO;2.
Siegel, Jared E., Aimee H. Fullerton, Alyssa M. FitzGerald, Damon Holzer, and Chris E. Jordan. 2023. “Daily Stream Temperature Predictions for Free-Flowing Streams in the Pacific Northwest, USA.” PLOS Water 2 (8): e0000119. https://doi.org/10.1371/journal.pwat.0000119.
Sparks, Morgan M., Jeffrey A. Falke, Thomas P. Quinn, Milo D. Adkison, Daniel E. Schindler, Krista Bartz, Dan Young, and Peter A. H. Westley. 2019. “Influences of Spawning Timing, Water Temperature, and Climatic Warming on Early Life History Phenology in Western Alaska Sockeye Salmon.” Canadian Journal of Fisheries and Aquatic Sciences 76 (1): 123–35. https://doi.org/10.1139/cjfas-2017-0468.
Wickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. 2023. R for Data Science. "O’Reilly Media, Inc.".