1. Introduction
Technological advances in recent decades have revolutionized the way we monitor habitats and species. Among emerging biomonitoring techniques, several automated and non-invasive methods, such as camera trapping, remote sensing, and passive acoustic monitoring, have rapidly become standard tools in ecology (Lahoz-Monfort and Magrath 2021). These techniques allow researchers to expand the spatial and temporal scales of their studies and facilitate the collection of large amounts of data. However, datasets obtained through automated techniques often pose problems for investigators because manual processing is time-consuming, tedious, and subject to human bias. Machine learning algorithms can effectively process such large datasets and thereby overcome these issues (e.g., Priyadarshani et al. 2018; Stowell 2022; Xie et al. 2022).
Passive acoustic monitoring is increasingly used to detect a range of taxa, including anurans, bats, birds, and insects (Sugai et al. 2019; Hoefer et al. 2023). Surveys relying on passive acoustic monitoring easily generate volumes of recordings so large that visually inspecting or listening to every file is impossible (e.g., Pérez-Granados and Schuchmann 2020). The development of machine learning algorithms has therefore become crucial for handling these large numbers of files (Stowell 2022; Xie et al. 2022). Unfortunately, implementing some state-of-the-art machine learning models can be complex and intimidating for ecologists and managers without an engineering or computing background. Indeed, the difficulty of using sound detection tools is a limiting factor for passive acoustic monitoring surveys (Wood et al. 2023a). However, a new generation of user-friendly, ready-to-use machine learning tools, such as BirdNET and Kaleidoscope Pro, has recently emerged and may further improve the effectiveness of automated audio recognition (e.g., Manzano-Rubio et al. 2022; Bota et al. 2023; Wood et al. 2023a).
BirdNET is a free, open-source automated sound classifier based on a convolutional neural network architecture that identifies over 6000 wildlife species, including birds, anurans, and mammals (Kahl et al. 2021; Pérez-Granados 2023; Wood et al. 2023a, 2023b). For each 3 s fragment of an audio file, BirdNET provides a species identification accompanied by a confidence score, allowing researchers to filter the output according to a desired confidence level. Although BirdNET is a promising tool, its effectiveness for wildlife monitoring has yet to be extensively assessed (reviewed by Pérez-Granados 2023), with only a single case study testing its capabilities for anuran monitoring (Wood et al. 2023a). BirdNET can be run through a user-friendly interface or the command line and requires no expertise in machine learning (Wood et al. 2023a). Another ready-to-use, user-friendly machine learning tool for audio recognition is Kaleidoscope Pro (Manzano-Rubio et al. 2022), which requires a paid annual license. Unlike BirdNET, the Kaleidoscope Pro workflow relies on the automated detection of candidate sounds based on user-specified signal parameters, followed by their classification into clusters of species vocalizations through unsupervised machine learning (hidden Markov models; Pérez-Granados and Schuchmann 2020). Both BirdNET and Kaleidoscope Pro can easily be trained to develop species-specific algorithms without the need for technical expertise. Despite this potential, current knowledge of the ability of such tools to classify anuran vocalizations in large acoustic datasets remains limited (e.g., Huang et al. 2014; Wood et al. 2023a).
In this paper, we evaluate two user-friendly machine learning tools, BirdNET and Kaleidoscope Pro, for detecting the American toad (Anaxyrus americanus (Holbrook, 1836)) in recordings. Specifically, we evaluated the efficacy of each approach in detecting the species relative to a human observer, and then evaluated the efficacy of both methods combined. Additionally, we measured the computing time required to scan the validation acoustic dataset (n = 371 3 min recordings) and the amount of human time needed to verify the output. We then evaluated the effectiveness and speed of a two-step approach for detecting the presence of the species in a large field acoustic dataset collected in northern Canada (n = 6194 3 min recordings). By sharing our assessments and insights, we hope to provide valuable guidance for applying automated detection and machine learning approaches in wildlife monitoring. For clarity, we use the term “detections” to denote the potentially multiple predictions made by BirdNET or Kaleidoscope Pro in a given recording, whereas we use the term “presence” to indicate that the species was confirmed at least once by a human in a 3 min recording.
2. Materials and methods
2.1. Study area and pond selection
We collected acoustic data in the Eeyou Istchee James Bay region of northwestern Quebec, Canada, situated between latitudes 49° and 53°N and longitudes 71° and 79°W. The study area spans approximately 400 000 km² and lies within the traditional territory of the Cree and Abitibiwinni First Nations. The landscape consists of a mosaic of forests dominated by black spruce, rocky hills bordering coniferous forests, and ombrotrophic to minerotrophic peatlands. The region experiences a subpolar, subhumid climate, with mean temperatures between −4 and −0.5 °C and annual precipitation between 700 and 900 mm. Most of the snow generally falls from August to May.
We selected 50 ponds smaller than 2 ha, maintaining a distance of at least 800 m between ponds to ensure independence. These ponds represented the two main pond types in the study area: 12 beaver ponds and 38 peatland ponds, in proportion to the availability of each type. For more comprehensive details regarding our methodology and pond selection process, we refer interested readers to Feldman et al. (2023).
2.2. Study species
The American toad is a widespread North American species occurring in a variety of breeding and foraging habitats (Dodd 2013). However, the habitat requirements of the species in the northern part of its range, including the study region, are not well documented (Fortin et al. 2012; Feldman et al. 2023). The calling activity of the American toad spans May to July, with choruses occurring between mid-May and early June, predominantly at night between 10 pm and 2 am (Taylor 2006). The male call is a prolonged, musical, whistled trill at a constant pitch (Fig. 1; Hunter et al. 1999), which makes the species a good candidate for evaluating acoustic recognition algorithms. Although encountering toads in small groups is common, full choruses are rare (Taylor 2006). Hence, investigating the acoustic activity of breeding American toads, particularly at the less-explored northern borders of their distribution, can provide valuable insights into the adaptive responses of anurans to climate change and increasing anthropogenic activities.
2.3. Acoustic monitoring protocol
We deployed an automated acoustic recorder (Song Meter SM4, Wildlife Acoustics Inc., Maynard, MA, USA) at each of the 50 ponds. The SM4 recorders were placed 2–10 m from the water’s edge and 1.5 m above ground. Recorders were programmed to record 3 min segments in .wav format every hour from 19h00 to 23h00 over seven consecutive days, at a sampling rate of 44.1 kHz and 16-bit resolution. Each pond was sampled twice a year in 2018 and 2019, with visits spaced 5–7 weeks apart from May to July. Despite our diligent maintenance and battery replacement, 11.5% of the scheduled recordings were unusable for unknown technical reasons, leaving a total of 6194 files across the 2 years (309.7 h of recording). Data were retrieved at the end of each 7-day recording period. This study exclusively employed acoustic recording units for passive acoustic monitoring, without direct interaction with live animals; as such, no formal animal ethics approval was required for data collection.
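As a quick consistency check, the recording effort reported above can be reproduced from the protocol parameters. The following Python sketch is purely illustrative (the calculation is ours and was not part of the field workflow):

```python
# Illustrative arithmetic only: reproducing the recording effort reported
# above from the protocol parameters.
ponds = 50
recordings_per_day = 5      # hourly 3 min recordings from 19h00 to 23h00
days_per_visit = 7
visits_per_year = 2
years = 2

scheduled = ponds * recordings_per_day * days_per_visit * visits_per_year * years
usable = 6194               # files retained across the 2 years

print(scheduled)                                         # 7000 scheduled recordings
print(usable * 3 / 60)                                   # 309.7 h of usable recording
print(round(100 * (scheduled - usable) / scheduled, 1))  # 11.5 (% unusable)
```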
2.4. Acoustic recording analyses
2.4.1. BirdNET
The acoustic dataset was analyzed with BirdNET-Analyzer (version 2.2.0; Kahl et al. 2021) using the default values for overlap (0 s) and sensitivity (1.0) and the minimum threshold for the confidence score (0.01). We applied the “American toad” filter so that only sounds of the target species were classified (Manzano-Rubio et al. 2022; see extended descriptions of BirdNET settings in Kahl et al. 2021; Pérez-Granados et al. 2023; and Supplementary image S1). For every 3 min recording with BirdNET predictions, a human listened to and inspected the recording at the timestamps of the 3 s spectrograms annotated by BirdNET to verify whether the American toad was present. If the first BirdNET prediction did not confirm the presence of the American toad, subsequent predictions of the species within the same 3 min recording were reviewed. If the presence of the American toad was never confirmed, the species was marked as non-detected in the file, and the recording was considered mislabelled by BirdNET (i.e., a false positive).
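For illustration, an analysis with these settings corresponds roughly to the following BirdNET-Analyzer call. This is a minimal sketch, not the exact command we ran: the paths and species-list file are hypothetical, and flag names should be verified against the documentation of the installed version.

```python
# Hedged sketch: invoking BirdNET-Analyzer on a folder of recordings with the
# settings reported above. Paths and the species list file are hypothetical.
import subprocess

subprocess.run([
    "python", "analyze.py",           # BirdNET-Analyzer entry point
    "--i", "recordings/",             # input folder of 3 min .wav files
    "--o", "birdnet_output/",         # destination folder for result tables
    "--slist", "american_toad.txt",   # custom species list restricting output to the target species
    "--min_conf", "0.01",             # minimum confidence score threshold
    "--sensitivity", "1.0",           # default detection sensitivity
    "--overlap", "0.0",               # default overlap between consecutive 3 s windows
], check=True)
```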
2.4.2. Kaleidoscope Pro
We used Kaleidoscope Pro (version 5.4.7, Wildlife Acoustics) to analyze the same acoustic dataset described above for BirdNET. To define species-specific parameters for the American toad, we measured the duration and the minimum and maximum frequencies of 39 American toad calls from the study area using Raven Pro 1.6 (Cornell Lab of Ornithology 2023; see Supplementary Table S1). The signal parameter inputs included the minimum and maximum detection length (3–30 s), the frequency range (1.3–2.1 kHz), and a maximum intersyllable gap of 0.5 s. Kaleidoscope Pro reported a series of candidate sounds meeting these criteria, which were automatically grouped through unsupervised machine learning using K-means clustering with hidden Markov models (default values; see Pérez-Granados and Schuchmann 2020; settings in Pérez-Granados et al. 2023; and Supplementary image S2). Within a cluster, Kaleidoscope Pro sorts candidate sounds by similarity; thus, most signals of a given cluster belong to the same type of vocalization of a given species, and the first sounds of each cluster are the most similar to and representative of the cluster. We reviewed each cluster and labelled it as “American toad cluster” or “Other” based on the detection of an American toad call within the first 50 (i.e., most representative) sounds of the cluster. Previous work showed that this procedure can identify over 99% of the candidate sounds of two bird species while reducing the time required to verify the output by over 95% (Pérez-Granados and Schuchmann 2020). Candidate sounds within the “American toad cluster” were acoustically and visually checked by a human until the presence of the species was confirmed. Once a detection was confirmed within a 3 min recording, we did not review other detections from the same recording; otherwise, we reviewed all detections within the “American toad cluster”. We considered a 3 min recording mislabelled by Kaleidoscope Pro (i.e., a false positive) when the candidate sounds checked by a human did not reveal the species’ presence. Candidate sounds in the “Other” clusters were not checked and therefore not considered in subsequent analyses (Pérez-Granados and Schuchmann 2020).
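These detector settings are entered through the Kaleidoscope Pro interface rather than through code; the sketch below simply restates the criteria in executable form to make explicit what a candidate American toad sound had to satisfy (the function and example values are ours, for illustration only).

```python
# Illustrative sketch only: Kaleidoscope Pro receives these values through
# its graphical interface, not through code.
SIGNAL_PARAMS = {
    "min_duration_s": 3.0,   # minimum length of detection
    "max_duration_s": 30.0,  # maximum length of detection
    "min_freq_khz": 1.3,     # lower bound of the call's frequency range
    "max_freq_khz": 2.1,     # upper bound of the call's frequency range
    "max_gap_s": 0.5,        # maximum intersyllable gap (applies at the
                             # syllable level; omitted from the check below)
}

def is_candidate(duration_s: float, low_khz: float, high_khz: float) -> bool:
    """Return True if a signal meets the American toad detection criteria."""
    p = SIGNAL_PARAMS
    return (p["min_duration_s"] <= duration_s <= p["max_duration_s"]
            and low_khz >= p["min_freq_khz"]
            and high_khz <= p["max_freq_khz"])

# Example: a 12 s trill spanning 1.4-1.9 kHz qualifies as a candidate sound.
print(is_candidate(12.0, 1.4, 1.9))  # True
```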
2.5. Automated software comparison
To assess the ability of each automated tool to detect the presence of the American toad, we created a validation dataset of reference recordings (see Pérez-Granados et al. 2023). The validation dataset comprised 371 (3 min) recordings randomly selected from 34 ponds with known presence of the species. For each recording, we reported whether the species was detected after checking spectrograms in Raven Pro 1.6 (Cornell Lab of Ornithology 2023). Recordings were reviewed blind to site location, date, and hour of recording. To evaluate the effectiveness of the two machine learning approaches for detecting the American toad, we estimated the percentage of presences detected by Kaleidoscope Pro, by BirdNET, and by both methods combined, relative to the total number of recordings with confirmed presence in the validation dataset.
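A minimal sketch of this metric, assuming a hypothetical table with one row per validation recording and boolean columns for the human review and each tool:

```python
# Hedged sketch of the validation metric; the CSV file and column names
# ("human_presence", "kaleidoscope", "birdnet") are hypothetical.
import pandas as pd

validation = pd.read_csv("validation_dataset.csv")    # hypothetical file
confirmed = validation[validation["human_presence"]]  # recordings with confirmed presence

for method in ["kaleidoscope", "birdnet"]:
    print(f"{method}: {100 * confirmed[method].mean():.1f}% of presences detected")

either = confirmed["kaleidoscope"] | confirmed["birdnet"]
print(f"combined: {100 * either.mean():.1f}% of presences detected")
```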
We then assessed the effectiveness and speed of a two-step approach to scan and detect the American toad in a large acoustic dataset (“field acoustic dataset”) of 6194 files of 3 min recordings. The two-step approach aimed to maximize the number of recordings with confirmed presence while minimizing the time required for a human to verify the output. It consisted of (1) scanning the entire dataset using Kaleidoscope Pro, which is faster than BirdNET, and then (2) using BirdNET to scan only the files in which Kaleidoscope Pro had not detected the species (n = 5778), as sketched below. We estimated the percentage of total files with confirmed presence by summing the number of files where the species had been detected by Kaleidoscope Pro and those where it was detected by BirdNET. We compared the time required for acoustic analysis when manually inspecting files, when processing files with each machine learning approach separately, and when using the two-step approach. The time required for acoustic analyses was divided into (i) the computing time required by each machine learning tool to scan the entire acoustic dataset and (ii) the time needed for a human to verify the output, including the creation of a final database free of misidentifications in a manageable format (an Excel file in our case).
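Under the same hypothetical table layout as above, the two-step workflow can be sketched as follows:

```python
# Sketch of the two-step approach: scan everything with Kaleidoscope Pro
# first, then pass only the files without a Kaleidoscope Pro detection to
# BirdNET. File and column names are hypothetical.
import pandas as pd

files = pd.read_csv("field_dataset.csv")   # hypothetical: one row per 3 min file
step1 = files[files["kaleidoscope"]]       # presences confirmed in step 1
remainder = files[~files["kaleidoscope"]]  # files passed to BirdNET (n = 5778 in our case)
step2 = remainder[remainder["birdnet"]]    # additional presences found in step 2

total = len(step1) + len(step2)
print(f"{100 * total / len(files):.1f}% of files with confirmed presence")
```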
4. Discussion
The inherent challenges of automated acoustic recognition of target species in large acoustic datasets have limited the widespread use of passive acoustic monitoring (Wood et al. 2023a). However, recent advances in algorithms (e.g., Kahl et al. 2021; Wood et al. 2023a), together with the development of user-friendly software for automated recognition, have opened new avenues for automated wildlife recognition from sound recordings (Manzano-Rubio et al. 2022; Bota et al. 2023). Here, we demonstrated the ability of two user-friendly machine learning approaches (Kaleidoscope Pro and BirdNET) to detect the presence of the American toad.
Kaleidoscope Pro was faster and detected the target species in a larger number of recordings than BirdNET. Our findings align with a recent study comparing Kaleidoscope Pro and BirdNET for detecting a threatened bird species (Botaurus stellaris (Linnaeus, 1758); see Manzano-Rubio et al. 2022). The target species considered by Manzano-Rubio et al. (2022) and the American toad both have simple vocalizations (Fig. 1), which may partly explain the high ability of Kaleidoscope Pro to detect them. However, convolutional neural networks such as BirdNET may outperform Kaleidoscope Pro at detecting more complex vocalizations. Further research should investigate the ability of Kaleidoscope Pro and BirdNET to detect a wider range of species. In our dataset, Kaleidoscope Pro produced a higher number of false positives than BirdNET, but its computing time and the human time required to verify its output were considerably lower than for BirdNET. This speed can be attributed to the ease of verifying the output directly within Kaleidoscope Pro and to the workflow we employed. Users can easily navigate through the Kaleidoscope Pro output while checking the sonograms, confirm a classification by pressing a key, and move on to the next candidate sound. In contrast, verifying the BirdNET output required individually opening the file associated with each recording and locating the annotated 3 s spectrogram. However, the most recent version of BirdNET (version 2.4, released in June 2023) allows users to extract all segments (as .wav files) in which the target species was predicted. Although our study was conducted before this addition, we expect this feature to substantially accelerate output verification. It is worth highlighting that we used an Intel(R) Core(TM) i7 laptop (8th Gen, CPU 1.80 GHz, 1.99 GHz, 8 GB RAM), with the acoustic recordings stored on and analyzed from an external hard drive. The speed of data transfer from the hard drive to the laptop was therefore the main limiting factor on computing time (i.e., computing time would be lower if files were processed from the internal memory of the laptop). However, many monitoring programs typically store their acoustic data on hard drives, so the timing values provided in our study can serve managers and researchers as an estimate of the time required for automated detection using Kaleidoscope Pro or BirdNET.
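As an illustration of this newer feature, recent BirdNET-Analyzer releases include a segment-extraction script; a hedged sketch of its use follows (the paths are hypothetical, and the script and flag names should be checked against the installed version).

```python
# Hedged sketch of the segment-extraction feature mentioned above: recent
# BirdNET-Analyzer releases ship a segments.py script that cuts out the
# predicted segments as .wav files. Paths are hypothetical.
import subprocess

subprocess.run([
    "python", "segments.py",
    "--audio", "recordings/",        # original audio files
    "--results", "birdnet_output/",  # result tables produced by analyze.py
    "--o", "segments/",              # destination folder for extracted .wav segments
    "--min_conf", "0.01",            # extract segments above this confidence threshold
], check=True)
```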
The workflow we applied greatly expedited the review process. First, we leveraged the unsupervised clustering performed by Kaleidoscope Pro, focusing our attention on candidate sounds from clusters with a high probability of containing the target species. This approach reduced the number of candidate sounds to be verified by up to 80% of the original output (see also Pérez-Granados and Schuchmann 2020). We acknowledge that a few American toad vocalizations may have been missed with this approach; nonetheless, the number of missed presences (false negatives) at the recording level is expected to be minimal. To further expedite the review, we focused solely on confirming the presence of the species in each recording: once the American toad was confirmed within a recording, we did not check subsequent candidate sounds within that recording.
BirdNET was originally developed for automated bird song recognition (>6000 bird species included), but it has expanded to include dozens of anurans and a few mammals (e.g., Wood et al. 2023a, 2023b). To the best of our knowledge, our assessment is only the second study evaluating the performance of BirdNET for monitoring anurans (see Wood et al. 2023a). We hope that our results will encourage investigators seeking automated anuran detection to consider BirdNET. The ability of BirdNET to detect the American toad was low compared to that of Kaleidoscope Pro. Nonetheless, BirdNET detected the species in over half of the recordings using the default values, despite the short (3 min) duration of the recordings. We encourage researchers wishing to use BirdNET to further assess the influence of the sensitivity, overlap, and confidence score parameters on wildlife detection. Moreover, further research assessing BirdNET’s effectiveness in detecting a wider range of anurans would be particularly valuable, as current knowledge is limited to three North American species (Wood et al. 2023a and our study).
Surprisingly, BirdNET detected a large number of presences that went undetected by Kaleidoscope Pro (8% in the validation dataset and 10% in the field acoustic dataset). Upon inspecting the spectrograms and received sound levels, it became evident that most of these BirdNET detections corresponded to American toad vocalizations emitted far from the recorder. This pattern suggests that BirdNET is more sensitive than Kaleidoscope Pro at detecting the species when it vocalizes at greater distances. However, the reason why BirdNET did not detect the species more frequently than Kaleidoscope Pro overall remains unclear. Differences in the training dataset used to create the American toad classifier in BirdNET, such as the inclusion of weak vocalizations, may have contributed to these discrepancies. Unfortunately, because the contents of the BirdNET training dataset are not publicly released, it is not possible to determine the number, type, and quality of the vocalizations used to train the algorithm. Further research should investigate the underlying reasons for the differences observed between the two machine learning approaches we compared, although their distinct structures (e.g., the convolutional neural network architecture of BirdNET and the hidden Markov models of Kaleidoscope Pro) may complicate direct comparisons. Moreover, the unsupervised nature of Kaleidoscope Pro’s machine learning makes its classification process more opaque than that of BirdNET.
Our two-step approach consisted of using the speed of Kaleidoscope Pro to scan and verify the whole field acoustic dataset and then running BirdNET exclusively on the recordings in which Kaleidoscope Pro had not detected the species. This sequential workflow substantially reduced the overall time required for data scanning and output verification with BirdNET. The two-step approach improved the detection probability of the American toad by approximately 10%, with only 45 additional minutes dedicated to BirdNET output verification. Importantly, this approach enabled us to remove all false positives from the final dataset; these false positives amounted to a mere 0.5% of the total recording time. Removing false positives is essential for the efficient use of acoustic algorithms on large acoustic datasets and for providing reliable inputs to downstream analyses that are highly sensitive to false positives, such as occupancy analyses (Guillera-Arroita et al. 2017; Wood et al. 2023a).
Our acoustic monitoring protocol was set to record from 7 pm to 11 pm, aiming to monitor the assemblage of pond-breeding anuran species in the study area (see Feldman et al. 2023). However, the maximum vocal activity of the American toad occurs between 10 pm and 2 am (Taylor 2006). Our protocol therefore did not align with the peak calling activity of the American toad and may have underestimated species occurrence. Although this does not affect our conclusions, future research aiming to acoustically monitor the American toad should extend the recording schedule to cover the hours of maximum calling activity of the species.