Leave-one-batch-out cross-validation reveals strong batch effects in Raman spectroscopy of yeast cultures

Keith Cheveralls; Ilya Kolb; David G. Mets

doi:10.57844/arcadia-xdmk-yq0w

Purpose

We’re interested in using spontaneous Raman spectroscopy as a readout for the chemical state of biological systems. Previous efforts have reported that Raman spectroscopy can be used to distinguish between different strains 1, cell types 2, and physiological states 3–4. In an effort to obtain analogous results relevant to our own work, we acquired spontaneous spectra of two small collections of yeast strains, then trained conventional machine learning models to predict both strain and species identity from the Raman spectra alone. We found that, in this dataset, experimental batch effects dominated strain-level biological signals but not species-level signals. As similar batch effects may be present in other studies, careful experimental design is warranted, especially when machine learning is used for analysis.

This work is primarily for researchers using Raman spectroscopy to detect meaningful differences between biological variables of interest (e.g., species, strain, growth condition, cell type, cell state, etc). We're sharing these results to call attention to the existence of experimental batch effects in spontaneous Raman spectra of biological samples, and of the consequent importance of rigorous experimental design and cross-validated analysis when using this technique.

Background and goals

Raman spectroscopy offers a promising approach for distinguishing between biological samples, as it's label-free, requires minimal sample preparation, and may detect subtle chemical-compositional differences 5–6. Commonly, it's used as an initial screening step to identify samples or conditions of potential interest in a fast, non-destructive way, as a complement to rather than a replacement for high-resolution analytical methods 7–9. More recently, an alternative analytical framework has emerged that treats Raman spectra as feature vectors and trains machine learning models to predict biological labels of interest (such as sample identity, cell identity, or cell state) directly from held-out spectra 1–2, 10–12, potentially expanding the range of applications. Our goal here was to explore the utility of such data-driven approaches by evaluating whether Raman spectra can be used to distinguish between different strains and species of yeast and to understand the extent to which batch effects from experimental variation confound these predictions.

The approach

Experimental design

We collected Raman spectra from nine yeast strains, including wild-type and mutant strains from both Saccharomyces cerevisiae and Schizosaccharomyces pombe (Table 1). We generated three "end-to-end" biological replicates by repeating the sample preparation and imaging protocols in triplicate, each with separate cell cultures, physical plates, and imaging dates.

Species and strains

We chose to work with several strains of budding and fission yeast (Saccharomyces cerevisiae and Schizosaccharomyces pombe, respectively). Our dataset included three standard lab strains (BY4741, SP286, and ED666) along with six knockout strains (see Table 1) that were of interest to us because they encode orthologs of human proteins involved in disease.

Species	Strain or gene ID	Genotype	Description	Source
S. cerevisiae	BY4741		Standard haploid lab strain	ATCC
S. cerevisiae	YGL058w	rad6Δ	Null mutant of rad6, an ortholog of human UBE2A (a ubiquitin-conjugating enzyme)	EUROSCARF
S. cerevisiae	YNL141w	aah1Δ	Null mutant of aah1, an ortholog of human ADA (an adenosine deaminase)	EUROSCARF
S. pombe	SP286		Standard diploid lab strain	Bioneer
S. pombe	ED666		Standard haploid lab strain	Bioneer
S. pombe	SPAC18B11.07c	rhp6Δ	Null mutant of rhp6, an ortholog of human UBE2A	Bioneer
S. pombe	SPBC1198.02	dea2Δ/+	Heterozygous null mutant of dea2 (an ortholog of human ADA)	Bioneer
S. pombe	SPBPB10D8.02c	asgΔ	Null mutant of asg (an ortholog of human ARSG, a lysosomal sulfatase)	Bioneer
S. pombe	SPBC530.12c	pdf1Δ/+	Heterozygous null mutant of pdf1 (an ortholog of human PPT1, a lysosomal hydrolase)	Bioneer

Table 1. List of species and strains used in this study.

The three standard lab strains are identified by their strain IDs, while the six mutant strains are identified by their gene IDs. The two heterozygous mutant strains are homozygous lethal. To find the strains from Bioneer, use the link provided and search the strain/gene ID in the “Search keyword” field.

Sample preparation and Raman spectroscopy

We inoculated each strain in 5 mL of rich media in a 15 mL culture tube. We grew S. cerevisiae strains in YPD media and grew the S. pombe strains in YES media. We grew cultures overnight in a shaking incubator (Infors HT) set to 30 °C, shaking at 200 rpm. We measured the OD₆₀₀ for each strain using a spectrophotometer (Thermo Fisher Scientific, NanoDrop One/One^C). We diluted all strains to an OD₆₀₀ of five in a final volume of 1 mL. We pelleted cells in 1.5 mL microcentrifuge tubes using a tabletop centrifuge (Thermo Scientific, Sorvall Legend Micro 21) at 3,000 × g for 3 min. We removed the supernatant and washed cells by resuspending in 1 mL of 0.9% saline solution. We pelleted cells again under the same conditions and removed the supernatant. We then washed cells by resuspending in 1 mL of deionized water, pelleting, and removing the supernatant. We spotted wet cell pellets onto stainless steel plates (stamped with a 96-well plate layout) and allowed them to desiccate at room temperature and pressure for approximately 3 h.

We acquired spontaneous Raman spectra of the desiccated samples using a probe-based Raman spectrometer with a 785 nm excitation (model WP-785X-F13-R-IL, Wasatch Photonics). We integrated the spectrometer with a motorized XY stage and controlled the hardware using Pycro-Manager 13 and custom Python scripts to enable fully automated acquisitions. We acquired spectra at a 5 × 5 grid of positions in each “well” on each of the three replicate plates, resulting in a total of 75 spectra for each strain. We acquired spectra with a laser power of 93 mW (measured at the sample plane), an integration time of five seconds, and 10 averages per position.

Data processing

The raw Raman spectra were processed using a standard pipeline based on the ramanspy Python package 14.

Cosmic ray removal. We applied the Whitaker–Hayes despiking algorithm with default parameters to remove cosmic ray artifacts.
Background subtraction. For each plate (i.e., for each experimental replicate), we acquired a dark spectrum from an empty region of the stainless steel plate. We computed a "consensus" background spectrum by averaging the dark spectra from all three plates, then subtracted this same consensus background spectrum from all sample spectra in the dataset. The use of a single background spectrum (rather than one per plate) is necessary to avoid injecting additional batch effects into the spectra from small per-plate differences in the background spectra.
Quality control to remove low-intensity spectra. The unnormalized intensity of some spectra was too low to yield detectable Raman peaks. These spectra likely corresponded to regions of the well without sufficient desiccated material. We identified and removed these spectra by applying an empirically determined threshold of 1,000 intensity units to the mean intensity of the background-subtracted but otherwise unprocessed spectra.
Smoothing and baseline correction. After background subtraction, we applied a Savitzky-Golay filter to smooth the spectra, then subtracted the autofluorescence background using the ModPoly algorithm with a polynomial order of five.
Cropping and normalization. After baseline correction, we cropped the spectra to the "fingerprint region" (300–1,800 cm⁻¹) and normalized the cropped spectra using area-under-the-curve (AUC) normalization.
Quality control to remove outlier spectra. Within each well of each plate, we identified outlier spectra by computing the standard deviation of each spectrum's difference from the group median spectrum, then used elbow detection on the sorted distances to identify outliers. This quality control removed spectra that likely resulted from heterogeneous desiccation or imaging artifacts.
Batch correction. To correct for plate-level batch effects, we used two independent approaches. In the first approach, we applied a linear mixed model to each wavenumber independently, treating plate identity as a random effect. We then corrected for plate-specific scale factors by normalizing the residuals by their within-batch standard deviations. This approach assumes that all batches are sufficiently large and that the random effect (plate identity) isn't confounded with the biological variable of interest (strain identity). In the second approach, we used an established third-party method, ComBat 15–16, with plate identity as the batch variable and no fixed effects. Note that, because all strains were present on all plates, we didn't need to include strain identity as a fixed effect in either model.

Analysis strategy

We treated each processed spectrum as a feature vector and the collection of all spectra as a feature matrix in which rows correspond to samples and columns to wavenumbers. We trained classifiers to predict strain identity and evaluated performance using two different cross-validation strategies:

Standard k-fold cross-validation in which all spectra are randomly partitioned into training and test sets. This is the standard approach to cross-validation, and is often used in the literature for datasets that are known or assumed to be homogeneous and well-balanced.
Leave-one-plate-out cross-validation in which spectra are partitioned into training and test sets according to the plate from which they came. This cross-validation strategy explicitly controls for experimental batch effects, as each of our three physical plates corresponds to an end-to-end experimental replicate.

After the QC steps described above, we were left with 24–33 spectra per strain (median 27). These spectra were distributed roughly evenly across the three physical plates. Because of these small sample sizes, we chose to use two conventional nonlinear classification models: a random forest and a support vector machine with an RBF kernel. We used the standard implementations from scikit-learn 17, with class weighting set to "balanced" to correct for class imbalances (which were minor, given our balanced experimental design). In the random forest model, we used 300 trees, and in the SVM, we used a value of 100 for the regularization parameter C. These values were chosen empirically. All other hyperparameters were set to their default values.

To evaluate classification performance, we plotted confusion matrices with counts normalized by the number of true members in each class. For a single overall performance metric, we also calculated the multiclass version of the Matthews correlation coefficient (MCC) as implemented in scikit-learn 17. This is a bounded metric in [−1, +1] that considers all elements of the confusion matrix and is therefore robust to class imbalances. A value of +1 corresponds to perfect predictions, a value of zero corresponds to random guessing, and a negative value indicates systemic misclassification. We computed this metric for the test set from each cross-validation fold, then computed the median across folds.

All code and data are available in our GitHub repo (DOI: 10.5281/zenodo.19226627).

Additional methods

We used Claude (Opus 4, 4.5, and 4.6) to help write, clean up, comment, and review our code, and we selectively incorporated its feedback. We also used Claude to suggest wording ideas and then chose which small phrases or sentence structure ideas to use, rearrange text to fit the structure of one of our pub templates, expand on summary text that we provided, and help clarify and streamline text that we wrote. Additionally, we chatted with Claude throughout the project to support ideation and analysis. We used Gemini (2.5 Pro) to review the publication to catch typos and discrepancies, and Cursor for tab completion while editing code.

We used arcadia-pycolor (v0.6.4) 18 to generate figures before manual adjustment.

The results

After processing, we found that the spectra contained a few clear, sharp peaks and many broad, diffuse features. In general, the mean spectra for all strains and species looked similar (Figure 1). We therefore reasoned that carefully cross-validated modeling (rather than exploratory analysis) would be necessary to determine whether the spectra contained genuine biological signals.

Line plots of mean spectra as intensity (y-axis; a.u.) versus wavenumber (x-axis; cm<sup>−1</sup>), with one plot for each strain in a different (and arbitrary) color. The line plots for all strains from each species are overlaid. The line plots for all strains are generally very similar to one another, with minor differences in relative intensity in regions around 300, 500, and 900 cm<sup>−1</sup>. — Figure 1. **The overall mean spectra for each strain, grouped by species.**

Standard cross-validation gives misleadingly strong results

We first evaluated a strain classification task under standard k-fold cross-validation using five folds. The model performed well, with a median MCC of 0.79 (range 0.71–0.93). The confusion matrix showed a strong diagonal (Figure 2), indicating that the model correctly predicted strain identity across most samples. However, as we discuss below, the strain identities are confounded with plate identities in this analysis.

A visualization of the confusion matrix as a two-dimensional heatmap. True labels are shown as rows, and predicted labels as columns. A strong diagonal is present in the matrix, indicating that the model can make correct predictions for all nine strains under k-fold cross-validation. — Figure 2. **The confusion matrix for strain prediction under standard k-fold cross-validation with five folds.**

Leave-one-plate-out cross-validation reveals batch effects

We then evaluated the same task under leave-one-plate-out cross-validation. The results were substantially worse, with a median MCC of 0.32 (range 0.19–0.42). The confusion matrix revealed that only a few strains remained distinguishable (Figure 3). This result implies that plate-level batch effects dominate whatever strain-level signal exists in the spectra, and that the standard cross-validation approach was effectively overfitting to these batch-specific features. This is possible because each fold in the k-fold cross-validation procedure includes spectra from all three plates, so the model can "see" batch-specific features and use them to help predict strain identity.

A visualization of the confusion matrix as a two-dimensional heatmap. True labels are shown as rows, and predicted labels as columns. There is little diagonal structure in the matrix, with multiple scattered off-diagonal cells occupied, indicating that the model cannot make correct predictions for most strains, and instead tends to confuse each strain for one of several other strains. — Figure 3. **The confusion matrix for strain prediction under leave-one-plate-out cross-validation.**

Adversarial prediction of plate identity confirms a plate-level batch effect

We confirmed the existence of a plate-level batch effect by inverting the prediction and cross-validation dimensions: we trained a classifier to predict plate identity instead of strain identity, using leave-one-strain-out cross-validation. We found that the model could very reliably predict the plate from which each spectrum came (Figure 4, A), with a median MCC of 1.0 (range 0.50–1.0). Since the plates correspond to end-to-end replicates of the same experimental protocol, there should be no "true" biological differences between the plates, implying the presence of strong plate-level experimental batch effects that the model is able to exploit.

A visualization of the two 3 × 3 confusion matrices as two-dimensional heatmaps. The confusion matrix in panel A (corresponding to uncorrected data) has a very strong diagonal, indicating that the model can almost perfectly predict the correct plate identity for each spectrum from among the three plates in the dataset. The matrix in panel B (corresponding to batch-corrected data) is mostly uniformly occupied, indicating the model cannot predict plate identity (and indeed tends to make random predictions for each plate) after batch effects are removed. — Figure 4. **Confusion matrices for predicting plate identity under leave-one-strain-out cross-validation, before (A) and after (B) correcting for batch effects.**

Batch correction removed plate-level batch effects but did not improve strain prediction

We applied a linear mixed model to correct for plate-level effects on a per-wavenumber basis (implementation here). After correction, the "adversarial" classifier trained to predict plate identity no longer performed well (Figure 4, B), with a median MCC of 0.10 (range −0.49–0.64), confirming that the plate-level batch effect had been removed. However, strain-level classification wasn’t meaningfully improved (Figure 5), with a median MCC of 0.39 (compared to 0.32 before batch correction). This likely reflects some combination of stochastic sample-level batch effects and genuinely subtle differences between the strains in our dataset that may not result in detectable Raman signatures.

A visualization of the confusion matrix as a two-dimensional heatmap. There is little diagonal structure in the matrix, with multiple scattered off-diagonal cells occupied, indicating that the model cannot make correct predictions for most strains, and instead tends to confuse each strain for one of several other strains. — Figure 5. **The confusion matrix for strain prediction under leave-one-plate-out cross-validation after correcting for plate-level batch effects.**

Species-level classification works well with or without batch correction

When we shifted from strain-level to species-level classification (S. cerevisiae versus S. pombe), the model performance was nearly perfect under leave-one-plate-out cross-validation (Figure 6). This was true with or without correcting for plate-level batch effects (median MCC of 0.97 without batch correction versus 0.93 with batch correction). This suggests that the spectra contain genuine species-level biological differences that are stronger than the experimental batch effects. Indeed, there was a hint that this was the case in our original strain-level confusion matrix: the misclassifications between strains were predominantly between strains of the same species (Figure 3).

A visualization of the two 2 × 2 confusion matrices for predicting S. cerevisiae vs. S. pombe as heatmaps. In both matrices, almost all density lies on the two diagonal elements, indicating strong classification performance. — Figure 6. **The confusion matrices for predicting species under leave-one-plate-out cross-validation, before (A) and after (B) correcting for batch effects.**

As an additional sanity check, we plotted the mean impurity-based feature importances from the trained random forest. We observed that the most important features aligned with wavenumbers at which the mean spectra visibly differed between the two species (Figure 7). This again suggests that the model is leveraging true differences between species that are stronger than the experimental batch effects.

Panel A shows line plots of the mean spectra for S. cerevisiae and S. pombe. The two line plots are generally very similar and often overlaid, but differ from one another in regions around 500, 700, and 900 wavenumbers. Panel B shows a line plot of feature importance. The line is mostly at or near zero, except for two prominent spikes at around 700 and 900 wavenumbers. These spikes are within the regions in which the mean spectra themselves differ. — Figure 7. Mean spectra and feature importances for distinguishing between *S. cerevisiae* and *S. pombe.*

Results aren't sensitive to model type or batch correction method

Finally, to check that our results weren't strongly dependent on our choice of a random forest classifier or our use of a linear mixed model (LMM) for batch correction, we repeated our analysis with both a support vector machine — a conventional nonlinear classifier with a very different architecture from a random forest — and with an established third-party batch-correction method, ComBat. We found that the median MCC values weren't meaningfully different for any combination of batch correction method and classification model (Figure 8).

A two-dimensional heatmap in which cell color corresponds to the MCC value for each of the prediction tasks discussed in the main text. The values in all but one column are very similar to one another (see text for values), indicating that classification performance for a given task and cross-validation strategy does not depend on the dataset type or model type. The one exception to this is the plate-prediction task; here, as expected, the heatmap shows that the MCC is close to one with uncorrected data and falls to near zero after correction (for both batch-correction algorithms and for both model types). — Figure 8. **The median MCC values for all combinations of dataset type, model type, prediction task, and cross-validation strategy.**

Limitations and caveats

Several limitations may impact the interpretation of these results. Our dataset itself is small; after QC, we were left with only 248 spectra across nine strains and three plates, yielding noisy cross-validated performance metrics. In addition, the strains in our dataset may be too genetically similar to produce detectable differences in the Raman spectra, so our negative result for strain classification here may not generalize to collections of strains with greater diversity. Finally, the desiccation step in our sample preparation protocol — while necessary to yield a sufficiently strong Raman signal from our spectrometer — is inherently variable and likely contributes to the plate-level batch effect we report here. Other sample preparation techniques may result in weaker batch effects.

We also note that our batch correction approach introduces an implicit train-test leakage, as it applies batch correction once to the full dataset, rather than to each training fold separately. For leave-one-plate-out cross-validation, this effect is severe, as the batch-effect dimension and the cross-validation dimension are identical. This means that our batch-corrected leave-one-plate-out cross-validation estimates are not valid estimates of generalization to fully unseen plates (i.e., plates held out from both batch correction and cross-validation). We instead intend our results to represent a description of how much information about a variable of interest is contained in the spectra, given full knowledge of experimental confounders. In other words, much like PCA or clustering applied to an entire dataset, our batch-corrected cross-validation results describe the batch-corrected structure in the data we have, not the performance we'd expect on new data.

Conclusions

Raman spectroscopy is an extremely sensitive technique. It can detect real, relevant biological signals — and sometimes very subtle ones — but it also readily picks up irrelevant signals associated with experimental conditions. In practice, we have found that Raman spectra invariably contain a mixture of both relevant and irrelevant signals. Because experimental batch effects are a kind of structured noise, standard k-fold cross-validation can give misleadingly good results by allowing models to learn batch-specific features. To distinguish authentic biological signals from batch effects, experiments must therefore include end-to-end replicates, and analysis must be cross-validated on experimentally meaningful batch dimensions. We recommend that experiments incorporate at least three end-to-end replicates from the outset, and that analysis workflows include both leave-one-replicate-out cross-validation and "adversarial" prediction tasks in which a model is trained to predict biologically meaningless variables like replicate or plate identifiers.

Leave-one-batch-out cross-validation reveals strong batch effects in Raman spectroscopy of yeast cultures

Leave-one-batch-out cross-validation reveals strong batch effects in Raman spectroscopy of yeast cultures

We acquired spontaneous Raman spectra of several strains and species of yeast. We found that experimental batch effects dominate strain-level biological signals in our dataset, yielding misleadingly high classification accuracy under standard cross-validation.

Additional assets:

Purpose

Background and goals

The approach

Experimental design

Species and strains

Sample preparation and Raman spectroscopy

Data processing

Analysis strategy

Additional methods

The results

Standard cross-validation gives misleadingly strong results

Leave-one-plate-out cross-validation reveals batch effects

Adversarial prediction of plate identity confirms a plate-level batch effect

Batch correction removed plate-level batch effects but did not improve strain prediction

Species-level classification works well with or without batch correction

Results aren't sensitive to model type or batch correction method

Limitations and caveats

Conclusions