Spatial host–microbiome sequencing reveals niches in the mouse gut

SHM-seq data generationMice

Adult C57BL/6 SPF mice were purchased from The Jackson Laboratory and maintained in accordance with ethical guidelines monitored by the Institutional Animal Care and Use Committee (IACUC), established by the Division of Comparative Medicine at the Broad Institute of MIT and Harvard, and consistent with the Guide for Care and Use of Laboratory Animals, National Research Council, 1996 (institutional animal welfare assurance no. A4711-01), with protocol 0122-10-16. Adult C57BL/6 GF mice were obtained from Taconic Biosciences and maintained in a gnotobiotic environment. Some of these mice were randomly selected and inoculated with ASF33 over several generations and used when >6 weeks of age. After colonization, ASF mice were housed in sterile conditions and tested with polymerase chain reaction (PCR) to ensure that sterility was maintained63. Animal housing room temperatures were monitored and always maintained according to species-specific needs. Humidity was maintained at 30–70%. Light intensity and light cycle timing were carefully regulated by Broad Institute animal facilities. To capture material from multiple sections per colonic tube, as well as to maximize the use of a single spatial array (1,007 spatial spots spread over ~42 µm2), we placed 2–3 tissue cross-sections onto one spatial capture area. We sampled ~20 sections from each mouse by sectioning in the aforementioned fashion across one spatial capture slide containing six active capture areas.

Tissue collection

Colonic tubes from the mid part of the colon were dissected within minutes of killing mice, and tissues were dried from excess fluids and embedded in Optimal Cutting Temperature (O.C.T., Fisher Healthcare) in large molds (VWR) pre-filled with O.C.T. The molds were then laid onto a metal plate pre-chilled and set on top of dry ice for 2 min or until complete freezing. Samples were transferred to −80 °C until sectioning.

Generation of slides with customized surfaces

Customized surface primers were immobilized to an amine-activated surface area (~40 mm2 each) using covalent bioconjugation25,27, as recommended by the manufacturer (Surmodics). Three distinct surfaces were generated for validations: 16S, poly(d)T and a mixed poly(d)T/16S surface. The oligonucleotides immobilization in each case were:

5′-[AmC6]UUUUUGACTCGTAATACGACTCACTATAGGGACACGACGCTCTTCCGATCTNNNNNNNNATCTCGACGACTACHVGGGTATCTAATCC-3′

5′-[AmC6]UUUUUGACTCGTAATACGACTCACTATAGGGACACGACGCTCTTCCGATCTNNNNNNNNTTTTTTTTTTTTTTTTTTTVN-3′ (both Integrated DNA Technologies (IDT)).

All slide incubations took place on a thermal incubator (Eppendorf Thermomixer Option C) with slides mounted into a hybridization chamber (ArrayIt). All in situ reactions performed on spatial arrays were carried out in a class II biosafety cabinet.

Generation of spatial arrays with customized surfaces

All spatial arrays were produced as previously described for the original spatial transcriptomics method25,27. In brief, six spatial microarrays per slide were created using amine-activated CodeLink slides (Surmodics). To ensure covalent binding chemistry to the amine-activated surface, DNA oligonucleotides (IDT) were constructed as follows:

5′-[AmC6]UUUUUGACTCGTAATACGACTCACTATAGGGACACGACGCTCTTCCGA TCT-[18mer spatial barcode]-[7mer random UMI]-[20T]-VN.

Printing was performed by ArrayJet LTD by spotting 100 pL of spatially barcoded DNA oligonucleotides (33 µM diluted in 2× CodeLink printing buffer) using inkjet technology to form 100-μm spots with a 200-μm spot-to-spot pitch, resulting in a total of 1,007 different spatially addressable spots printed in a 6.2-mm × 6.6-mm capture area. A complete list of all spatially barcoded DNA oligonucleotides used in this study is available at https://github.com/nygctech/shmseq. After printing the spatial arrays, slides were blocked using a pre-warmed blocking solution (50 mM ethanolamine, 0.1 M Tris, pH 9) at 50 °C for 30 min and washed with 4× saline sodium citrate (SSC) and 0.1% SDS (pre-warmed to 50 °C) for 30 min before rinsing the slides with deionized water and drying.

Next, capture areas were modified to create a customized surface containing a mixture of poly(d)T and 16S capture sequences. To hybridize the 16S probe onto the spatially barcoded poly(d)T surface probes, 75 µl of the 16S (V4) probe (IDT) with the sequence 5′-GGATTAGATACCCBDGTAGTCGAGATNBAAAAAAAAAAAAAAAAAAAA-3′ (sequence28 modified to enable attachment to the spatial arrays) at 0.8 nM concentration in 2× SSC (Sigma-Aldrich), 20% fresh formamide (Thermo Fisher Scientific) and 0.1% Tween (Sigma-Aldrich) was added to each spatial capture area and incubated for 30 min at room temperature. The probe mix was then removed, and capture areas were washed with 100 µl of 0.1× SSC (Sigma-Aldrich). To covalently attach the hybridized 16S probes onto the spatially barcoded poly(d)T surface probes, an extension reaction was performed with 75 µl of 1× M-MuLV buffer, 2 U µl−1 RNaseOUT, 20 U µl−1 M-MuLV and 0.5 mM dNTPs (all from Thermo Fisher Scientific) and 0.20 µg µl−1 BSA (New England Biolabs (NEB)) added to the wells and incubated at 42 °C for 30 min. The M-Mulv solution was then removed, followed by a wash with 100 µl of 0.1× SSC. To strip the 16S probes used in the hybridization and extension reaction, and make the covalently attached 16S surface probes single stranded, surface capture areas were incubated 3× with 75 µl of 100% formamide for 3 min at room temperature. Capture areas were then washed twice with 100 µl of 0.1× SSC before washing the entire slide for 10 min at 50 °C in 2× SSC/0.1% SDS (Sigma-Aldrich), followed by 1-min wash with 0.2× SSC and finally 0.1× SSC, both at 37 °C. This resulted in spatially barcoded capture areas containing ~1:1 ratio of poly(d)T and 16S capture sequences.

Cryosectioning

The entire cryo chamber, including all surfaces and tools used during cryosectioning, were wiped with 70% ethanol before the start of sectioning to avoid bacterial contamination. Both spatial arrays and O.C.T.-embedded gut tissue blocks were allowed to reach the temperature of the cryo chamber before 10-µm-thick cross-sections of gut tissue were placed on customized spatial arrays. Tissue fixation followed immediately as described below.

Tissue fixation, H&E staining and imaging

The spatial array was warmed at 37 °C for 2.5 min. Then, the entire area of the glass slide was covered in a methacarn solution (60% absolute methanol, 30% chloroform stabilized with ethanol and 10% glacial acetic acid (all from Sigma-Aldrich)) for 10 min at room temperature in a closed space to avoid evaporation. Methacarn was then removed, and the slide was allowed to dry before ~300 µl of isopropanol (Sigma-Aldrich) was added to the slide and incubated for 1 min at room temperature. When the slide was completely dry again, it was stained using H&E in an EasyDip Slide Jar Staining system (Weber Scientific). The system included containers separately filled with ~80 ml of Dako Mayer’s hematoxylin and Dako Blueing Buffer (both from Agilent Technologies), 5% Eosin Y in 0.45 M Tris acetate (both from Sigma-Aldrich) buffer at pH 6 and nuclease-free water (Thermo Fisher Scientific). The slide was put in a slide holder and completely dipped in hematoxylin for 6 min, followed by five dips in nuclease-free water and then 10 dips in a beaker filled with ~800 ml of nuclease-free water. The slide holder was then dipped in Dako Blueing Buffer for 5 s, followed by another five dips in nuclease-free water. Finally, the slide holder was put in the eosin solution for 1 min and washed by five dips in nuclease-free water. The slide was removed from the holder and air dried before being mounted with 85% glycerol and covered with a coverslip (VWR) before imaging. Imaging of stained H&E tissue sections on glass arrays was performed on a Metafer VSlide scanning system (MetaSystems) installed on an Axio Imager Z2 microscope (Carl Zeiss) with an LED transmitted light source and a CCD camera. Using an A-P ×10/0.25 Ph1 objective lens (Carl Zeiss) and a configuration program26, focusing and scanning of each tissue section on the glass array was done automatically. Image stitching was done using VSlide (version 1.0.0) with 60-µm overlap and linear blending between fields of view. Images were extracted using jpg compression.

In situ reactions: permeabilization and reverse transcription

Before start, the hybridization chamber was cleaned with RNaseZap (Thermo Fisher Scientific) and 70% ethanol, followed by at least 30 min in a UV light chamber. After section imaging, the slide was again attached to the hybridization chamber to proceed with the following permeabilization reactions (referred to as ‘bacterial treatment’ below). First, 100 µl of a lysozyme solution with 0.05 M EDTA (pH 8.0, Thermo Fisher Scientific), 0.1 M Tris HCl, pH 8 (Thermo Fisher Scientific) and 10 µg µl−1 lysozyme (from chicken egg white, lyophilized powder, Sigma-Aldrich) were added to each well and incubated for 30 min at 37 °C, followed by wash with 100 µl of 0.1× SSC. Second, 75 µl of 10% Triton X-100 (Sigma-Aldrich) was added and incubated for 5 min at 37 °C, followed by a 100-µl wash of 0.1× SSC. Third, a solution with 0.05% SDS and 5 mM DTT (Thermo Fisher Scientific) was added and incubated for 5 min at 37 °C, followed by a 100-µl wash of 0.1× SSC. Fourth, 100 µl of collagenase I (200 U) in 1× HBSS (both from Thermo Fisher Scientific) were added to each well and incubated for 20 min at 37 °C, again followed by a 100-µl wash of 0.1× SSC. Lastly, 75 µl per well of 0.1% pepsin (pH 1, Sigma-Aldrich) was incubated for 10 min at 37 °C, followed by a final wash of 100 µl of 0.1× SSC. In situ cDNA synthesis was performed as previously described26. In brief, 75 µl of 50 ng µl−1 actinomycin D (Sigma-Aldrich) and 0.5 mM dNTPs (Thermo Fisher Scientific, 0.20 µg µl−1 BSA and 1 U µl−1 USER enzyme (both from NEB), 6% v/v Lymphoprep (STEMCELL Technologies), 1 M betaine (B0300-1VL, Sigma-Aldrich), 1× first-strand buffer, 5 mM DTT, 2 U µl−1 RNaseOUT and 20 U µl−1 Superscript III (all from Thermo Fisher Scientific)) were added to each well. The reaction was sealed with Microseal ‘B’ PCR Plate Seals (Bio-Rad) and incubated for at least 6 h. After incubation, 70 µl of the released cDNA material from each hybridization chamber well was collected and stored in a 96-well PCR plate (Eppendorf).

Library preparation

Library preparation was performed using the SM-Omics automated library preparation protocol, as previously described26. In brief, released cDNA material was first made double stranded using the nicked RNA template strands as primers for copying the cDNA strand with DNA polymerase I. To avoid overdigestion, the reaction was terminated with EDTA, and ends were blunted using T4 DNA polymerase before linear amplification by in vitro transcription. Amplified material was again transcribed into cDNA, resulting in material ready for PCR indexing as described in the next subsection.

Quantification, indexing and sequencing

qPCR quantification and indexing were performed as previously described64 using TruSeq LT Illumina indexing and a KAPA HotStart HiFi ReadyMix (Roche). Indexed cDNA libraries were cleaned using a 0.7:1 ratio with AMPure XP beads (Beckman Coulter) to PCR product, according to the manufacturer’s protocol, and eluted in 12 µl of elution buffer (Qiagen). Each sample’s concentration was measured using the DNA HS Qubit assay (Thermo Fisher Scientific), and average fragment length was determined using either Bioanalyzer HS or DNA1000 TapeStation (both from Agilent Technologies). Each sample was then diluted to the desired concentration for sequencing (1.08 pM on a NextSeq and 10 pM on a MiSeq, both with ~10% PhiX). Pooled libraries were sequenced with 25 nucleotides (nt) in the forward read and 55 nt and 150 nt in the reverse read on NextSeq and MiSeq (Illumina), respectively.

Generation of bacterial validation dataMechanical extraction of bacterial RNA

An approximately 1-mm-thick tissue section with pellet was sectioned from SPF colons in O.C.T. and put in a dry ice-cold Lysis Matrix D tube (MP Biomedicals). Then, 400 µl of RLT buffer (Qiagen) with 1% 2-mercaptoethanol (Sigma-Aldrich) was added to the tube, and the solution was homogenized in a FastPrep-24 instrument (MP Biomedicals) at speed 6 for 40 s. Tubes were then centrifuged for 5 min at 12,000 r.p.m. Supernatant was transferred to a new tube, and RNA extraction was done using the RNeasy Mini Kit (Qiagen), according to the manufacturer’s instructions. Extracted RNA was fragmented using the NEBNext Magnesium RNA Fragmentation Module Kit (NEB), heating for 2 min. Fragmented RNA was cleaned with the MinElute Cleanup Kit (Qiagen), according to the manufacturer’s instructions. Quality of the fragmented RNA was evaluated by the Bioanalyzer Pico Kit (Agilent Technologies). Next, ~20 ng µl−1 mechanical extracted RNA was added on a 16S surface probe coated quality control (QC) array in an in situ cDNA reaction, as described in the ‘In situ reactions: permeabilization and reverse transcription’ subsection. After at least 6-h incubation at 42 °C, 70 µl of the released material from each well was collected and stored in a new 96-well PCR plate (Eppendorf). Library preparation, quantification, indexing and sequencing on the MiSeq were performed as described in the ‘Library preparation’ and ‘Quantification, indexing and sequencing’ subsections.

Extraction and metagenomic sequencing of fecal DNA

Pellet was collected from the colon of SPF mice by perforating the colon wall and scraping the pellet and mucus into a 1.5-ml collection tube (Eppendorf). Collected pellet was stored at −80 °C until further processed. DNA was extracted from the pellet using a Lysing Matrix Y tube (MP Biomedicals), according to the manufacturer’s instructions. Extracted DNA concentration was determined using the DNA HS Qubit assay. DNA was made into libraries using Nextera XT (15031942 v05). Concentration and average fragment length of each sample were evaluated using the DNA HS Qubit assay (Thermo Fisher Scientific) and Bioanalyzer HS (Agilent Technologies), respectively. Each sample was diluted to the desired concentration for sequencing (9 pM, ~10% PhiX), and pooled samples were sequenced on a MiSeq (2 × 150 bp, lllumina). Each sample was sequenced to ~5–10 million reads.

FISH

FISH was performed on the same fresh-frozen gut tissue samples from ASF mice. All sections were 10-µm-thick cross-sections and consecutively collected. First sections were placed on the spatial array, followed by placing consecutive sections on a CodeLink amine-activated slide (Surmodics); the following two sections were then again placed on the spatial array. Sections on the spatial array were used for SHM-seq, and sections on the amine-activated CodeLink slide (Surmodics) were prepared for FISH as further described. Slides were warmed at 37 °C for 2.5 min on a thermal incubator, before tissue sections were fixed using freshly prepared methacarn, as described in the ‘Tissue fixation, H&E staining and imaging’ subsection. Slides were then placed in a hybridization chamber, and 75 µl of preheated FISH solution (0.9 M NaCl and 20 mM Tris, pH 7 (both Thermo Fisher Scientific), 0.1% SDS (Sigma-Aldrich) and a FISH oligonucleotide detection probe (0.06 µg ul−1)) was added to each well and incubated for 2 h at 25 °C. Oligonucleotide detection FISH probes (IDT) were used depending on the target of interest: probe EUB338 (5′-/Cy5/GCTGCCTCCCGTAGGAGT-3′) for all bacteria; probe non-338 (5′-/Cy5/ACTCCTACGGGAGGCAGC-3′) as a negative control; probe Lab158 (5′-/Cy5/GGTATTAGCAYCTGTTTCCA-3′)65,66,67 to target ASF360; probe Lac435 (5′-/Cy5/TCTTCCCTGCTGATAGA-3′)68,69 to target ASF502; and probe Bac303 (5′-/Cy5/CCAATGTGGGGGACCTT-3′)8,67 to target ASF519. After the 2-h incubation, FISH solution was removed, and wells were washed with 100 µl of 1× PBS before the hybridization chamber was removed and slides were dipped 12 times in 50 ml of 1× PBS before being air dried. Slides were mounted with 85% glycerol (Sigma-Aldrich) and a coverslip (VWR). Epifluorescent images were acquired on an Axio Imager Z2 microscope using a PhotoFluor LM-75 light source (89North) in combination with a Plan-APOCHROMAT ×63/1.4 oil DIC objective (Carl Zeiss). Images were processed using VSlide (version 1.0.0, MetaSystems).

Processing on H&E imaging dataImage registration and annotation

Image processing and registration of barcoded spots was done using SpoTteR26. H&E images (collected in RGB channels) were downscaled to approximately 500 × 500 pixels. For efficient grid spot detection, tissues were masked from the images using quantile thresholding in the red channel. Centroids of spatial array spots were detected by computing the image Hessian. Centroid coordinates were used as probable grid points, and a rectangular grid was then fitted to these probable points using a local optimizer (nlminb, R package stats (R version 3.6.3)). With iterations and removing 10% of the probable spots that did not fit the perfect grid structure, a new grid was fitted until the target number of grid points per row (here, 35) and column (here, 33) was reached. Final grid points were overlapped with the previously masked tissue section to select spatial points present only under the detected tissue section area. These points were used in further analysis.

H&E images were annotated using a graphical cloud-based interface24 by manually assigning each spatial coordinate (x,y) resulting from the grid fitting process with one or more morphological region tags. The tags used were epithelium (E), epithelium and muscle and submucosa (ALL), epithelium and mucosae and submucosa (EMMSUB), epithelium and mucosae (EMM), muscle and submucosa (MSUB), crypt base (BASE), externa and interna (MEI), externa (ME), interna (MI), mucosae and interna (MMI), mucosa and pellet (MUPE), crypt mid (MID), crypt apex and mucosa (APEXMU), crypt apex and crypt mid (UPPERMID), Peyer’s patch (PP) and pellet (PE). E, EMMSUB, EMM, BASE, MEI, ME, MI, MUPE, MID, MMI, APEXMU, UPPERMID, PP and PE were visualized in tissue vector representations.

Processing of host readsRaw reads processing and mapping of host reads

Reads were generated with bcl2fastq2 (version 2.20.0) and trimmed to remove adaptor sequences and the 16S surface probe sequence using BBDuk70 (version 38.33). ST Pipeline (version 1.7.6)29 was used to generate gene-by-barcode matrices. The reverse quality-filtered reads were mapped with STAR (version 2.6.0)71 to the mouse genome reference (GRCm38 primary assembly), and mitochondrial sequences were removed. Mapped reads were annotated using HTseq-count (version 0.11.4)72 and the mm11 mouse annotation reference (https://www.gencodegenes.org/mouse/release_M11.html). Annotated reads were demultiplexed with TagGD29,73 (version 0.3.6) with a Hamming distance clustering approach (k-mer 6, mismatches 2). This connected transcript information to spatial barcodes. Finally, UMI collapsing per transcript and spatial barcode was performed with a naive clustering approach (mismatches 1) similar to that described in UMI-tools74.

Processing of bacterial readsGeneration of gold standard mouse gut bacterial reference

FASTQ reads were generated with bcl2fastq2, and reads were quality filtered using KneadData (version 0.7.4) (https://huttenhower.sph.harvard.edu/kneaddata/) (mouse database mouse_C57BL). MEGAHIT75 (version 1.2.9) was used for assembly of the filtered reads, and bowtie2 (ref.76) (version 2.3.4.3) was used for mapping reads to the assembly. MetaBAT2 (ref.77) (version 2.15) was used for binning the assembly, and the command-line version of NCBI BLAST78 (version 2.9.0+) was used to assign taxonomy to contigs with blastn and database ‘nt’. MEGAHIT, bowtie2 and MetaBAT2 were all run using default settings. Assignments were filtered (E-value ≤ 10E−6) and sorted (by E-value and percent identity), and each contig was then assigned the top taxonomy assignment. Contigs belonging to an assigned taxonomy on species level at various cutoffs (>0.1%, >0.05% and >0.01% corresponding to 65, 121 and 419 species, respectively) were retained. For each cutoff, reference genomic sequences (complete genomes, chromosomes or scaffolds, depending on availability for these species) were downloaded from the NCBI RefSeq database31 (release 205), resulting in FASTA sequence databases (one for each cutoff) of the taxa found in SPF mice (n = 6) and used as input to build custom databases in Kraken2 (version 2.0.9)30 according to Kraken2 default instructions, including masking of low-complexity regions. Reference genomes for six species were not found in the RefSeq database (Supplementary Table 5) and were not included in the FASTA sequence databases. The mouse gut bacterial references were also filtered for genera that have previously been found in mice and/or the intestine79,80,81. A phylogenetic tree of the reference taxa was built using NCBI’s Common Tree and visualized using iTOL (version 6.4.3)82. When analyzing mouse gut tissue with defined flora (ASF), genome sequences according to ref. 83 were downloaded from the NCBI and used as input to build a custom ASF database in Kraken2.

Generation of simulated data

Two simulated datasets were generated based on the abundance of taxa using cutoffs 0.1% and 0.01% (as described in the ‘Generation of gold standard mouse gut bacterial reference’ subsection): 16S rRNA FASTA sequences for the taxa found in SPF mice were downloaded from the NCBI (downloaded 24 July 2021), except two taxa where the 16S rRNA FASTA sequence were missing (Sodaliphilus pleomorphus and Anaerocolumna sedimenticola). Command-line NCBI BLAST78 (version 2.9.0+) was used to align every possible sequence version of the 16S surface probe to the 16S rRNA FASTA sequences to find the best possible alignment for the 16S surface probe per taxa. To mimic spatially captured reads from a real SHM-seq, 2 million paired reads from a real SHM-seq experiment were used as a template for FASTQ headers, sequence and quality scores for the forward read and FASTQ headers and quality scores for the reverse read. The sequences in Read 2 were replaced by 150-bp-long fragments of the 16S rRNA sequences from randomly selected taxa. Fragments were created by selecting a region upstream of the best possible alignment of the 16S surface probe per randomly selected taxa. Each region was then randomly selected a length based on a normal length distribution with parameters characteristic to a spatial array (400 ± 44 bp) and trimmed to 150 bp. This resulted in a simulated dataset with 2 million randomly selected 16S rRNA gene sequences, generated from where the 16S surface probe was expected to capture, from the taxa in our mouse gut bacterial references but with known exact taxa and both reverse and forward reads.

Deep learning model: data pre-processing

A total of 500,000 DNA sequences were randomly selected from the simulated dataset based on a 0.1% abundance cutoff (described in the ‘Generation of simulated data’ subsection) and uniformly sampled, and single-point mutations with 0.1% rate were introduced. This was followed by random shortening based on a normal distribution of fragment lengths from a true SHM-seq experiment (143 ±13 bp, truncated at 150 bp). Reads from each taxon in the mouse gut bacterial reference were represented at least 100 times per genus. Sequences were one-hot encoded, such that each nucleotide (A, C, T, G and N) was represented by a five-dimensional binary vector, followed by sequence padding up to the maximum length (150 bp). Taxa labels were one-hot encoded into one of N genera. The encoded sequences and taxa labels were provided as input for training the model.

Deep learning model: architecture

A taxonomic classifier of short reads was implemented using Keras84 with TensorFlow85 back end (version 2.2.0) in Python (version 3.8.10) (Supplementary Fig. 4a). The model takes as input one-hot encoded DNA sequences of varying lengths and provides a genus label as output. First, a masking layer was used to ignore padded entries, followed by four layers of a one-dimensional convolutional layer with kernel sizes of 15, 17, 19 and 23 to extract short motifs, followed by a concatenation and a dropout (50% rate) module and two bidirectional long short-term memory network layers, which processed the sequences in both directions. This was followed by another dropout layer (20% rate), a dense layer (reLU activation), a dropout layer (10% rate), another dense layer (reLU activation) and, finally, a fully connected layer (softmax activation) to reduce the final output size to the number of distinct genera in the input data. In total, the model consisted of 298,760 trainable parameters. Cross-entropy loss was used to train a multi-class classifier with Adam as the optimization algorithm86. The model architecture was visualized using Netron87.

Deep learning model: training details

Model parameters were optimized by using 80% of sequences for training and 20% for testing. Each epoch started with shuffling the training data and computing the gradient update once for each training data point to obtain unbiased gradient estimates88. During training, categorical accuracy and cross-entropy loss were used to monitor progress. Training was terminated after a maximum of 15 epochs or when the training loss did not decrease in five consecutive epochs. The area under the receiver operating characteristic (ROC) curve and the F1 score were calculated using Scikit-learn (version 0.24.2)89 and used to report the final performance on test data.

Deep learning model: evaluation

One million simulated sequences with corresponding taxa (as in the ‘Generation of simulated data’ subsection) were modified with a sequencing error rate of 1%90 and random shortening as described above. Sequences were classified either by Kraken2 alone or by Kraken2 followed by the deep learning model. Performance was evaluated compared to the ground truth taxa labels by calculating Bray–Curtis dissimilarities and Pearson correlation coefficients of the bacterial relative abundances per spot using Scipy (version 1.1.0)91 spatial.distance.braycurtis and Scikit-learn (version 0.24.2)89 stats.pearsonr, respectively. A higher similarity of the relative abundances between classifications and the ground truth resulted in lower Bray–Curtis dissimilarities and higher Pearson correlations. Accuracy and F1 score were calculated on the whole dataset using Scikit-learn (version 0.24.2)89 metrics.classification_report.

Comparison of taxonomy assignments

To compare how well Kraken2 performs when using different RefSeq databases (whole genome versus 16S rRNA) of different sizes (restricted versus unrestricted), taxonomy assignments were made by the taxonomy assignment pipeline but without using the deep learning model (as described in the ‘Raw reads processing and mapping of bacterial data’ subsection). The four databases used in the comparisons were: RefSeq Bacteria whole genome database (downloaded from Kraken2 GitHub version 2.1.2) and adding to it the whole genomes from all eight ASF species in Kraken2 (ref.83) (‘RefSeq whole genomes’); the custom gold standard restricted whole genome database (‘65 species whole genome’, as described in the ‘Generation of gold standard mouse gut bacterial reference’ subsection) and the RefSeq Bacteria 16S rRNA database, derived from those RefSeq bacterial taxa that had available 16S rRNA sequences in the NCBI (‘RefSeq 16S rRNA’, ~3,000 taxa, downloaded on 24 July 2021); and finally, we restricted the RefSeq 16S rRNA database to the 65 species detected in the gold standard restricted whole genome database (‘65 species 16S rRNA’). For comparing the impact of read lengths, simulated datasets were prepared as described in the ‘Generation of simulated data’ subsection by using cutoff 0.1% but with longer length distribution (650 ± 44 bp) and trimmed to 150 bp, 300 bp, 450 bp and 600 bp.

Raw reads processing and mapping of bacterial data

FASTQ reads were generated with bcl2fastq2 and trimmed to remove adaptor sequences using BBDuk70. Trimmed reads were quality filtered using the same quality-filtering step as in the ST Pipeline (version 1.7.6)29, but only reads longer than 100 nt were kept. TagGD73 was used to connect the spatial barcode to each forward read (k-mer 6, mismatches 2, Hamming distance clustering algorithm), and BWA-MEM (version 0.7.17)92 with reference mouse genome (GRCm39) was used to remove host mapping sequences. Remaining reverse reads were mapped to the mouse gut bacterial reference (created as described in the ‘Generation of gold standard mouse gut bacterial reference’ subsection) using Kraken2 (version 2.0.9)30 (confidence 0.01). Reads originated from GF and SPF mice were mapped to the mouse gut bacterial reference, whereas reads originated from ASF mice were mapped to the ASF reference. Taxonomy assignments made by Kraken2 were improved using the deep learning model. UMIs with identical spatial barcodes and taxonomical assignments were collapsed using UMI-tools (version 1.0.0)74 (UMIClusterer, threshold 1), resulting in a bacteria-by-barcode matrix.

Analysis of bacterial validation dataSpatial analysis of bacterial fluorescence

Bacterial presence in scanned fluorescence images was detected using ilastik (version 1.3.3)93. After training and testing each bacterial fluorescence print separately in ilastik, the resulting bacterial detection mask was aligned with the fluorescent image to detect mean fluorescence intensity per spatial coordinate and stored as a matrix. This matrix was then run in Splotch (as described in the ‘Hierarchical probabilistic modeling using Splotch’ subsection). Resulting normalized fluorescence intensity was compared to the normalized bacterial presence by randomly selecting, at most, three spatial coordinates from each annotated region per sample (only annotated regions that were shared between the normalized fluorescence intensity and the normalized bacterial presence were considered) and scaling them within each sample, before matching them to a spatial coordinate in the same region and comparing them to each other (normalized fluorescence intensity versus normalized bacterial presence per spatial coordinate). To limit the region annotated as pellet, spatial coordinates annotated as pellet were selected if they were spatially adjacent to coordinates annotated as mouse tissue. This procedure was repeated 1,000 times to generate an average spatial correlation measurement between normalized bacterial FISH intensity and normalized sequenced bacterial presence, expressed as Spearman correlation.

16S surface probe sensitivity

To evaluate 16S surface probe sensitivity, reference DNA sequence and gene annotation files were downloaded from Ensembl Bacteria94 for the ASF bacteria available in the database (version 104.1) (ASF356, ASF360, ASF457, ASF492, ASF500 and ASF519 (taxonomy ID 1235789)). Reads captured from ASF tissue sections on a spatial transcriptomics QC array with only 16S surface probes on the array surface were separately mapped against each ASF bacteria genome using BWA-MEM (version 0.7.17)92. Gene body coverage over the 16S rRNA genes in respective reference genomes was generated using RSeQC (version 4.0.0)95. Genome binning was done by summarizing the aligned reads in separate bins, each bin representing a hundredth of the respective ASF genome.

16S surface probe specificity

Specificity was first evaluated by proportion of bacteria versus mouse read alignment. Tissue sections from SPF, ASF and GF mice were placed on QC arrays with 16S surface probes, and finished libraries were prepared using either bacterial treatment or colon treatment. Each finished library was sequenced to approximately 660,000 reads. Reads were taxonomically annotated by using the taxonomy assignment pipeline without the deep learning model. The proportion of reads mapping to the respective bacterial reference (mouse gut bacterial reference for SPF and GF tissue samples and ASF reference for ASF tissue samples) was calculated by using the number of trimmed reads.

Protocol specificity was also evaluated by comparing the bacterial treatment with a mechanical treatment (see the ‘Mechanical extraction of bacterial RNA’ subsection). Spearman rank and Pearson correlation coefficients were calculated using Scipy’s (version 1.1.0)91 stats.spearmanr and stats.pearsonr.

Bacterial treatment was compared to a bulk 16S rRNA sequencing dataset62 where the 16S libraries were made from material originating from feces of C57BL/6J mice (Sequence Read Archive (SRA) sample references: SRR9212951, SRR9213178 and SRR9213335). The correlation was calculated using Scipy’s (version 1.1.0) Pearson correlation coefficient91.

Comparison of the taxonomy assignment pipeline with QIIME 2

The taxonomy assignment pipeline (as described in the ‘Raw reads processing and mapping of bacterial data’ subsection) was compared to QIIME 2 (ref.32) (version 2022.2) by using the simulated dataset (generated as described in the ‘Generation of simulated data’ subsection). QIIME 2 was run with default settings for single-end sequences, and the Silva 138 99% OTUs full-length sequences classifier was used for taxonomic profiling.

Effect of bacterial treatment on mouse gene expression

To evaluate the effect of the bacterial treatment on measured host (mouse) gene expression, we normalized96 gene counts from samples with and without bacterial treatment (reads downsampled to the same saturation levels) and from samples prepared on a spatial array with customized surface or a standard spatial array (reads downsampled to the same saturation levels). Pearson correlation coefficient was calculated using Scipy’s (version 1.1.0)91 stats.pearsonr.

Spatial modeling and visualization of host–microbiome dataHierarchical probabilistic modeling using Splotch

Splotch

View original article

NATURE BIOTECHNOLOGY

Share Bookmark

0 0 0 0 0 0 0

More from this channel

Spatial host–microbiome sequencing reveals niches in the mouse gut

Comments (0)