All effort-based eBird checklists record the start time and the duration of the observation period. The ‘TIME OBSERVATIONS STARTED’ column of the EBD gives the time at which the birding event began, whereas the ‘DURATION MINUTES’ column gives how long observers were collecting data. Duration is assumed to be continuous. At a minimum, filter checklists by duration, for example by removing very long observation periods, according to the needs of your particular analysis. This can be done using the ‘auk’ package for R (https://github.com/CornellLabofOrnithology/auk). In analyses conducted by staff of the Lab of Ornithology, we routinely control statistically for variation in TIME OBSERVATIONS STARTED.
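The duration filter described above can be sketched in a few lines. This is only an illustration in Python using the standard library, not the ‘auk’ workflow itself: the sample rows are made up, the function name `filter_by_duration` is hypothetical, and the 300-minute cap is an assumed example value rather than a recommendation.

```python
import csv
import io

# Made-up sample in a tab-separated layout, containing only the two
# columns named in the text. Real EBD files have many more columns.
SAMPLE_EBD = (
    "TIME OBSERVATIONS STARTED\tDURATION MINUTES\n"
    "06:15:00\t45\n"
    "07:30:00\t600\n"
    "12:00:00\t\n"
    "18:05:00\t120\n"
)


def filter_by_duration(rows, min_minutes=0, max_minutes=300):
    """Keep rows whose DURATION MINUTES lies in [min_minutes, max_minutes].

    Rows with a missing duration are dropped. The default cap of 300
    minutes is an assumption for illustration only.
    """
    kept = []
    for row in rows:
        raw = (row.get("DURATION MINUTES") or "").strip()
        if not raw:
            continue  # no effort information, so the row cannot be used
        if min_minutes <= float(raw) <= max_minutes:
            kept.append(row)
    return kept


rows = list(csv.DictReader(io.StringIO(SAMPLE_EBD), delimiter="\t"))
kept = filter_by_duration(rows, max_minutes=300)
print([r["DURATION MINUTES"] for r in kept])  # ['45', '120']
```

The appropriate bounds depend on the analysis, so it is worth treating `min_minutes` and `max_minutes` as explicit, documented choices rather than fixed constants.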
Our preferred method for dealing with variation in the duration of observation periods is to include DURATION MINUTES as a predictor variable, and to treat it as a continuous predictor with a non-linear effect. The reasons for this treatment of DURATION MINUTES are:
- The durations of observation periods vary essentially continuously, from 1 minute up to 24 hours (the maximum allowed for submission of data into eBird). Whenever continuous variation such as this exists, breaking the continuum into discrete groups is counter-productive because (1) one loses degrees of freedom, since the continuous-variable (regression) version of an analysis needs fewer variables to describe the changes through time, and (2) there is no logical basis for creating separate groups using specific dividing points.
- Non-linearity of the effect is the result of saturation: while the rate of detection initially increases rapidly with each additional minute of observation time, at some point increasing the observation time results in little or no increase in the likelihood of observing a species (or an additional individual of a species). If one treats DURATION MINUTES as a linear predictor variable, one is literally assuming that no matter how long the observation period becomes, each added minute of observation results in the same increase in the likelihood of detecting a bird of the species of interest. This is just wrong!
- A corollary of the above is that it is wrong to correct for variation in observation duration by dividing the number of birds observed by DURATION MINUTES, to produce a birds-per-minute response variable. By doing so, one is implicitly, and erroneously, assuming that there is a linear increase in the number of birds detected with increased duration of the observation period.
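The saturation argument in the points above can be made concrete with a small numerical sketch. It assumes, purely for illustration, that detections of a species arrive at a constant per-minute rate, so that the probability of at least one detection after t minutes is 1 - exp(-rate * t). Both this functional form and the value of `rate` are assumptions and are not taken from eBird data.

```python
import math


def detection_probability(minutes, rate=0.05):
    """Probability of at least one detection after `minutes` of observation.

    Assumes a constant per-minute detection rate (an illustrative
    assumption), which gives a saturating curve.
    """
    return 1.0 - math.exp(-rate * minutes)


durations = [10, 30, 60, 120, 240]

# Probability of detection at each duration.
probs = [detection_probability(t) for t in durations]
# Extra probability gained from one additional minute at each duration.
gains = [detection_probability(t + 1) - detection_probability(t) for t in durations]
# The "per-minute" ratio that dividing by DURATION MINUTES would produce.
per_minute = [p / t for p, t in zip(probs, durations)]

for t, p, g, r in zip(durations, probs, gains, per_minute):
    print(f"{t:4d} min  p={p:.4f}  gain from next minute={g:.6f}  p per minute={r:.6f}")
```

Under this assumption the gain from each added minute shrinks as the checklist gets longer, and the per-minute ratio falls steadily with duration. A linear predictor assumes the first quantity is constant, and a birds-per-minute response assumes the second is constant.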
We suggest that one useful way to describe the effect of variation in observation effort in a statistical model is to treat this effect as a smooth term in a generalized additive model (GAM), i.e. as a spline. By doing so, you let the model-fitting process determine the best description of how the likelihood of an additional observation of a species changes as effort increases, rather than arbitrarily assuming that you already know the best way to describe this relationship. Alternatively, the relationship can be described using any of a number of non-linear transformations, for example a log transformation [i.e. using log(DURATION MINUTES) as your predictor variable]. However, if you choose this approach, you will need to justify that it is an appropriate way of describing how the accumulation of new observations slows with each additional increase in the length of an observation period. A reasonable way of producing this justification is to compare the accuracy of your approach with the effect described by a spline, so you would have to fit a GAM to your data anyway.
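As a rough illustration of why the functional form needs checking, the sketch below fits a straight line in duration and a straight line in log(duration) to a synthetic, noise-free saturating response and compares their residual sums of squares. The saturating curve, its rate constant, and the helper name `ols_rss` are assumptions made for this example. It does not fit a spline; in a real analysis the reference fit would be the GAM smooth described above.

```python
import math


def ols_rss(x, y):
    """Fit y = a + b * x by ordinary least squares and return the
    residual sum of squares of that fit."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


# Synthetic durations from 5 to 300 minutes in 5-minute steps.
minutes = list(range(5, 305, 5))
# Assumed saturating response (illustrative only, no noise added).
response = [1.0 - math.exp(-0.02 * t) for t in minutes]

rss_linear = ols_rss(minutes, response)
rss_log = ols_rss([math.log(t) for t in minutes], response)

print(f"RSS with duration as predictor:      {rss_linear:.4f}")
print(f"RSS with log(duration) as predictor: {rss_log:.4f}")
```

On this synthetic curve the log-transformed predictor leaves a smaller residual than the untransformed one, but neither is exact. That gap is what a comparison against a fitted smooth is meant to expose.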
To fit GAMs in R, staff at the Lab of Ornithology typically use either the ‘mgcv’ package or, if there are random effects in the model, the ‘gamm4’ package. In our experience, the online documentation for these packages is far better than average, and the creator of these packages, Simon Wood, has written an extremely useful book, Generalized Additive Models: An Introduction with R, that is now in its second edition.
If you are using a machine-learning method, such as a Random Forest or Gradient Boosted Model, then deciding on the best description for the effect of changing observation effort is a moot point. As long as you have specified that DURATION MINUTES is to be treated as a continuous variable, the analysis will determine the best description of the effect of observation period on the accumulation of new observations, in much the same way as a GAM does in a statistical model.