A 3D map of the adult mouse brain, containing Single-cell DNA methylome and 3D maps of multiple genes


Clustering of Hindbrain Neuronal Clusters in the Astro-Epen Class and the Few Highly Heterogeneous Clusters

Note that the precision of this analysis is highly dependent on the accuracy of CCFv3 registration. There might be confusion from other CCFv3 regions due to registration inaccuracies.

Some of the clusters in the Astro–Epen class did not fully capture the rich spatial gradients present in the data, and so we made a note of it. We also identified some hindbrain neuronal clusters that had high within-cluster heterogeneity. There could be factors that contribute to presence of these heterogeneous clusters. Sampling the cell types wasn’t comprehensive enough. There are a significant number of non-neuronal cells present in these areas, and it can be very difficult to collect them. The cell types that are vulnerable to tissue dissociation are those in the hindbrain, which make it difficult to profile and cause the cells that survive to leak more transcripts. This is what we observed in Purkinje neuron population, which was very small in our dataset. After initial clustering, we identified a pair of Purkinje neuron clusters, with high or low gene count, respectively. The cluster was removed because it was a low-quality one. Finally, transcriptomic differences in hindbrain neuronal types appear to be subtler compared to neuronal types in other brain regions. The hindbrain neuronal types are more difficult to categorize as they have subtler differences. To address the problems stated above, we re-clustered cells from the Astro–Epen class and the few highly heterogeneous hindbrain neuronal clusters with more relaxed threshold: de.score.th = 80, min.cells = 8. The MERFISH dataset has more distinct distribution and expression of marker genes when mapped to the more refined clusters.

We predicted the cell-type identity at class, subclass and supertype levels of each cell using its 3D spatial coordinates and the majority votes of the cell’s KNNs. The confusion matrix between the actual cell-type identity and the predicted cell type (Extended Data Fig. 15) indicates the cell types that colocalized in the same spatial location.

We first compared the joint clustering result with the independent clustering result from each dataset. We then sorted the data by marker genes for each dataset. For each marker gene, we computed the Pearson correlation between its average expression for each cluster across two different datasets to quantify the consistency of its expression at the cluster level between datasets (Extended Data Fig. 7d. We performed a similar analysis between 10xv3 and MERFISH datasets.

The Brain Map helps to define brain regions for profiling and boundaries for dissection. We covered all regions of the brain using sampling at top-ontology level with judicious joining of neighbouring regions (Extended Data Fig. The Supplementary Table 3 is the 1d and e table. These choices were guided by the fact that microdissections of small regions were difficult. Joint dissection of neighboring regions is sometimes necessary to get enough cells for profiling. The MERFISH data showed that our CCFv3-based microdissections were mostly accurate at cell and brain region levels.

After registration to CCFv3, we found that out of 554 terminal regions (grey matter only, Supplementary Table 1), there were only 7 small subregions completely missed in the MERFISH dataset: frontal pole, layer 1 (FRP1), FRP2/3, FRP5, accessory olfactory bulb, glomerular layer (AOBgl), accessory olfactory bulb, granular layer (AOBgr), accessory olfactory bulb, mitral layer (AOBmi) and accessory supraoptic group (ASO).

The registration was done at 10m in the sky. A reference image was created for each section by the number of spots found in a 10 10 m grid. A single image was created across all probes by taking the maximum count for each grid unit. The midline was manually determined by annotating the most dorsal and most ventral point. These points were then used to compute a rigid transform to rotate the section upright and centre in the middle. This set of images was placed in order to create a configuration for registration.

Following clearing, sections were washed twice for 5 min in Sample prep wash buffer (VIZGEN, 20300001). VIZGEN DAPI and PolyT Stain (VIZGEN, 20300021) was applied to each section for 15 min followed by a 10 min wash in formamide wash buffer. Formamide wash buffer was removed and replaced with sample prep wash buffer during MERSCOPE set up. The added 100 l of New England Bio Lab’s M0314L was put to use in a pre-thawed and mixed MERSCOPE, and the 250 l that was added was used for the Imaging Buffer Activator. The MERSCOPE fluidics system was primed with 15 litres of mineral oil and the activation port was added. The flow chamber was assembled with the hybridized and cleared section coverslip according to VIZGEN specifications and the imaging session was initiated after collection of a 10× mosaic DAPI image and selection of the imaging area. VIZGEN proprietary protocol was used when processing specimens that passed the minimum count threshold.

To allow for the gel to fully polymerize the sections were incubated at room temperature for 1.5 h. To clear the tissue the section was incubated in 5 ml of VIZGEN Clearing Solution (VIZGEN 20300003) with Proteinase K (NEB P8107S) according to the Manufacturer’s instructions for at least 24 h or until it was clear in a humidified incubation oven at 37 °C.

The fresh mouse brains were sliced at 1,200-m intervals, and each slice was embedded in OCT, rapidly frozen in dry ice and stored at 80C until ready for slicing. The coral section was obtained from the tissue using a cryostat which was fixed in 4% formalin for 30 minutes, and then permeabilized in 70% alcohol. Sample preparation, including probe hybridization and gel embedding, was performed using a sample preparation kit from Vizgen (10400012) following the manufacturer’s protocol. Each section was imaged using a MERSCOPE 500 Gene Imaging kit (Vizgen, 10400006) on a MERSCOPE (Vizgen).

Of note, a reported unconventional mechanism118 underlying co-transmission of dopamine and GABA by SNc dopaminergic neurons in the STR does not depend on cell-autonomous GABA synthesis but instead on presynaptic uptake from the extracellular space through the GABA transporter Slc6a1 (Gat1) while still relying on Slc18a2 for synaptic GABA release, which could make Aldh1a1 unnecessary in these cells. Because Slc6a1 is expressed in all of the GABAergic neurons as well as astrocytes, it is unclear to us how widely applicable this unconventional mechanism is. The criteria we used didn’t consider Slc6a1 at all due to the risk of having some false negatives.

We assigned the neurotransmitter identity to each cell cluster based on its expression of the neurotransmitter transporter genes. Supplementary table 7 is included in extended data fig. 9 Criteria used are:

Exploring the UMAP dataset with a scratch-mapping method: Application to the multiscale neuronal network (MERFISH)

The imputation results for 10xv2, 10xMulti and MERFISH datasets, along with the 10xv3 dataset as the anchor, were used to generate the integrated UMAP shown in Extended Data Fig. 7a.

We have tested many different methods in parallel. To make it work for this large dataset with high complexity, we down-sampled the datasets and increased the size of neuronal network model dramatically to achieve reasonable performance. The performance needs to be improved because of a huge space of parameter space that needs to be explored.

An alternative simpler strategy is to simply compute the KNNs of each query cell among the reference cells within the same cluster and/or the same subclass and impute the expression of all marker genes. However, the imputation values using this strategy could not preserve the transitions between clusters or subclasses, exaggerating the separations between cell types, especially at the finer level.

The first thing we did was build a tree with root, classes, subclasses, and clusters. We picked markers that best discriminate the clusters from one child to another and then assign query cells to the child with the nearest cluster centroids based on the selected markers. The process was repeated at each level of the tree till the query cells were mapped to the leaf-level clusters. The package for implementing the scrattch-mapping method is publicly available at the Allen Institute.

The global relatedness between cell types was visualized using a constellation plot (Extended Data Fig. 6). The number of cells in a transcriptomic subclass is reflected in a circle on the surface of thenode, which is how the plot was generated. The position of the nodes was determined by the positions of the subclasses in UMAP coordinates. The relationships were shown by the edges that were calculated. For each cell, 15 nearest neighbours in reduced dimension space were determined and summarized by subclass. We subtracted the amount of nearest neighbours that were assigned to other subclasses. In the connected nodes, at least one of the others had a high proportion of nearest neighbours. The width of the edge at the node reflected the fraction of nearest neighbours that were assigned to the connecting node and was scaled to node size. For all nodes in the plot, we then determined the maximum fraction of “outside” neighbours and set this as edge width = 100% of node width. scrattch.bigcat has the function for plot_constellation.

Clustering with HVF and calculating cluster enriched features (CEFs) of the HVF clusters. The framework was adapted from the main package. We did a clustering based on variable features, then used these clusters to pick CEFs with stronger marker genes. The concept of CEF was introduced in ref. 77. The CEF calling and permutation-based statistical tests were implemented in ALLCools.clustering.cluster_enriched_features, for which we selected for hypomethylated genes (corresponding to highly expressed genes) in methylome clustering.

MERFISH data analysis, including imaging, spot detection, cell segmentation and cell-by-gene matrix generation, was conducted using MERSCOPE instrument software (v.2023-01). We removed abnormal cells (artificial segmentation and doublets) from the cell-by-gene matrix in each experiment based on the following criteria: (1) cell volume <30 μm3 or > 2,000 μm3; (2) total RNA count <10 or >4,000; (3) total RNA counts normalized by cell volume <0.05 or >5; (6) total gene detected <3; and (5) cells with >5 blank probes detected (negative control probe included in the gene panel). We then integrated the pass-QC MERFISH cells with the scRNA-seq datasets6 to annotate the MERFISH cells with transcriptome nomenclature using the ALLCools integration functions described above.

Computation of KNN using BiocNeighbors and Annoy Index using a Parallelized Algorithm to Identify Noise Clusters in Cells

Compared to the previous version, the key improvement is step 3 for computing KNN. We used BiocNeighbor package (v1.16.0, https://github.com/LTLA/BiocNeighbors) for computing KNN using Euclidean distance within modality and cosine distance across modality using the Annoy algorithm (v1.17.1, https://github.com/spotify/annoy). The Annoy index was built based on anchor cells for the reference dataset, and KNNs were computed in parallel for all the query cells. Due to increased dataset sizes, it is impossible to fit a similarity graph in a memory. The method down-sampled the datasets based on a user specified parameter, and if the cluster membership of each modality was provided as input for integration algorithm, we down-sampled cells by within-modality clusters, ensuring preservation of rare cell types. The anchor cells were included in the down-sampled datasets. On the down sampled dataset, the cluster membership of other cells were computed based on the KNNs found in step 3.

We identified 933 noise clusters with 153,598 cells in 10xv3, and 201 noise clusters with 38,073 cells in 10xv2. 10xv3 noise clusters were removed from integration analysis but 10xv2 noise clusters were included accidentally. Most of the 10xv2 noise clusters were excluded from the next steps after integration.

Before proceeding with integration between 10xv2, 10xv3, and 10xMulti datasets, we first needed to remove noise clusters. The presence of such clusters can confuse the integration algorithm and reduce the cell-type resolution. There are two main categories of noise clusters: clusters with significantly lower gene detection due to extensive drop out, and clusters due to doublets or contamination.

We developed a strict process for removing low-quality cells. Cells were first classified into broad cell classes after they were mapped to an existing, preliminary version of taxonomy. The log-transformed expression of some genes is compared to their expression levels in poor quality cells to calculate the QC score. The housekeeping genes are strongly expressed in almost all cells and are an anti-correlated with Malat1 (Supplementary Table 4). There are 30 genes which are annotated as mitochondrial inner membrane category based on GO component, although they aren’t located on the chromosomes. Some evidence suggests the mRNAs of some of these genes or their homologues are translocated to the mitochondrial surface130,131. We used this QC score to quantify the integrity of cytoplasmic mRNA content, which tended to show bimodal distribution. The cells in the low end were very similar to the ones at the single nucleus. Doublets were identified using a modified version of the DoubletFinder algorithm132 (available in scrattch.hicat, https://github.com/AllenInstitute/scrattch.hicat, v1.0.9) and removed when doublet score >0.3. The data we kept was tailored to different cell classes and it included a number of different cell types. Threshold parameters and number of cells filtered are summarized in Supplementary Table 4. For example, for neurons (excluding granule cells) We used cutoffs of 200 and 2,000 for the genes.

10x Genomics Multiome (10xMulti) libraries were sequenced on the Illumina NovaSeq6000, and sequencing reads were aligned to the mouse references downloaded from 10x Genomics, which includes ensembl GRCm38 (v98) fasta and gencode (vM23) gtf file, using the 10x Genomics CellRanger Arc (v2.0) workflow with default parameters.

We used the TenX Multiome processing for single cell multiome andgene expression reagent bundle. We followed the manufacturers instructions for nucleus capture, transposition, and barcoding. For the snRNA-seq libraries, we loaded 16,007 ± 692 nuclei per port and targeted a sequencing depth of 120,000 reads per nucleus. The average for the 1,687 nuclei included in the study was 157,023 reads per nucleus.

Some neuronal types are difficult to isolate using a cell-isolation procedure. We have collected another 10 single-nucleus. Technical limitations cause the data in midbrain and hindbrain regions to be lost.

Cells were collected using a 130-m nozzle using a FACS protocol developed at AIBS122. Cells were prepared to sort by passing thesuspension through a 70 m filter and adding Hoechst or DAPI. Sorting strategy with example images122 was as previously described23, and most cells were collected using the tdTomato-positive label. Around 30,000 cells were sorted within 10 min into a tube containing 500 µl of quenching buffer. We found that sorting more cells into one tube diluted the ACSF in the collection buffer, causing cell death. We also observed decreased cell viability for longer sorts. Each of the 30,000 cells that were sorted were gently stacked on top of 200 l of high-BSA buffer and immediately subjected to an instument for 10 minutes in a reactor with a swinging-bucket rotor. No pellet could be seen with this small number of cells, so we removed the supernatant and left behind 35 µl of buffer, in which we resuspended the cells. The cells were stored in a high-BSA buffer without ACSF getting into them. The resuspended cells were stored at 4 °C until all samples were collected, usually within 30 min. The samples from the same ROI were pooled, quantified, and loaded onto the 10x Genomics controller.

Dissected tissue pieces were digested with 30 U ml−1 papain (Worthington PAP2) in ACSF for 30 min at 30 °C. The indirect heat exchange during the short incubation period made it necessary to put the oven temperature up to 35 C, which is a target solution temperature. The papain solution was exchanged three times for the quenching buffer. The samples were put on ice for 5 minutes before trituration. The tissue pieces in the quenching buffer were triturated through a fire-polished pipette with 600-µm diameter opening approximately 20 times. The tissue pieces were allowed to settle and the supernatant, which now contained suspended single cells, was transferred to a new tube. Fresh quenching buffer was added to the settled tissue pieces, and trituration and supernatant transfer were repeated using 300-µm and 150-µm fire-polished pipettes. A high-BSA buffer was put at the bottom of the conical tube to help cushion the cells during centrifugation at 100g. The supernatant was discarded, and the cell pellet was resuspended in the quenching buffer. We collected 1,508,284 cells without performing FACS. The concentration of the resuspended cells was quantified, and cells were immediately loaded onto the 10x Genomics Chromium controller.

Following the cell-isolation protocol developed at AIBS23, single cells were isolated. The brain was dissected, submerged in artificial cerebrospinal fluid (ACSF), embedded in 2% agarose, and sliced into 350-μm coronal sections on a compresstome (Precisionary Instruments). Block-face images were taken. The slices were microdissected from the ground and put into single cells. Fluorescent images of each slice before and after ROI dissection were taken at the dissection microscope. These images were used to document the precise location of the ROIs using annotated coronal plates of CCFv3 as reference.

For cell collection during the dark phase of the circadian cycle, mice were randomly assigned to circadian time groups at time of weaning and housed on the reversed 12:12 h light:dark cycle. Brain dissections were done in the morning. From 267 donors, 5,836,825 cells were collected during the light phase of the light:dark cycle. The brain cells of 50 donors were collected during the dark phase of the light cycle.

There are between 2 and 266 mice involved in each cluster with an average of 19 and a median of 14. There are 23 clusters that have fewer than 4 donor animals each. Individual mouse variability should not have an impact on cell-type identities.

We used 95 mice to collect more than 2.5 million cells for 10xv2 and 222 for 10xv3. Animals were euthanized at postnatal day (P)53–59 (n = 141), P50–52 (n = 3), or P60–71 (n = 173). No statistical methods were used to predetermine sample size. The donor animals are listed in the Supplementary Table 2.

All experimental procedures related to the use of mice were approved by the Institutional Animal Care and Use Committee of the AIBS, in accordance with NIH guidelines. Mice were housed in a room with temperature (21–22 °C) and humidity (40–51%) control within the vivarium of the AIBS at no more than five adult animals of the same sex per cage. On a dark cycle of 13:10 h or 12:12 h, mice were provided food and water and maintained with ad libitum. Mice were maintained on the C57BL/6 J background. All mice with anophthalmia and microphthalmia were excluded.

Training Random Forest Regressor Using Highly Variable Proton and Exon Interactions with Exon and Transcript Transcriptions

After gathering all the features, we took genes with variable transcripts and exons to training for model training. Highly variable transcripts were selected on the basis of the following criteria: (1) mean TPM across cell groups of >0.2; (2) TPM standard deviation of >0.3; and (3) transcript body (TSS to TTS) length of >30 kb. Highly variable exons were selected based on PSI standard deviation of >0.02 and PSI 90% quantile and 10% quantile difference of >0.05. We used mC or 3C features to train the models that predicted transcript TPMs and PSI exons. The training contains two steps. Using the score function f_regression, we selected the top 100 features for each transcript. All the features that were chosen at least once were used. Fivefold cross-validation is performed to train random forest models. RandomForestRegressor is a random forest. In each cross-validation run, we calculated the PCC between predicted values and true values as the model performance. We also shuffled the selected features within each sample (Fig. 6c) to train the model and calculate PCC values again as the shuffled model performance.

We quantified mC and m3C intragenic features first for predicting exon and alternative usage. We used exon and exon-flanking region to determine the mC features of the genes. The upstream region of exon-flanking was 300 bp of each exon. We combined features with >90 percent overlap in their genomes, removing features with no difference in their coordinates. For 3C features, we used all the intragenic highly variable interactions (F statistics > 3) from the above section as features.

We derived a cell-by-5-kb bin matrix to represent the data. In a single cell i, we modelled its mCG base call ({M}{{ij}}) for a 5-kb bin j using a binomial distribution ({M}{{ij}} \sim {Bi}({\mathrm{cov}}{{ij}},{p}{i})), where p represented the global mCG level of the cell (and ‘~’ indicates ‘distributed as’). We then computed (P({M}{{ij}} > {{mc}}{{ij}})) as the hypomethylation score of cell i at bin j. The less likely to observe smaller or equal methylated base calls, the more hypomethylated the bin was. The hypomethylation score matrix was binarized with a sparse matrix A and we computed the columns with more than five non-zero values. One read in a 5kb bin is represented in a cell-by-5-kann matrix with the snAtac-seq data. The features were filtered in the same way as the mC matrix, and the bins remaining in both datasets were used for further analysis.

The Taiji framework94 was used to perform the analysis on a weighted GRN. The framework uses a personalized PageRank method to calculate the importance of each TF and spread the information across the network. To add subclass information as network weights, we simplified the network by including only TF and target gene nodes and weighting the gene node by inverted gene-body mCH value in the subclass. Specifically, we first performed quantile normalization across all subclasses. We then performed a robust scale of the matrix using sklearn.preprocessing. RobustScaler with quantile_range = (0.1, 0.9). We then inverted the scaled mCH fraction by

We also added DMR mCG fraction into the edge weights. Specifically, we performed the same quantile normalization and robust scale on all the mCG fractions of DMRs involved in the network and calculated the inverted DMR mCG value by

Source: Single-cell DNA methylome and 3D multi-omic atlas of the adult mouse brain

Motif clustering and DMR calling from SCENIC+ (ref. 43) using allcools merged-allc and generated-base-ds

We used an ensemble motif database from SCENIC+ (ref. 43), which contained 49,504 motif position weight matrices (PWMs) collected from 29 sources. Redundant motifs (highly similar PWMs) were combined into 8,045 motif clusters through clustering based on PWM distances calculated using TOMTOM92 by the authors of SCENIC+ (ref. 43). The mouse TF genes were annotated with one or more motif clusters. To calculate motif occurrence on DMRs, we used the Cluster-Buster93 implementation in SCENIC+, which scanned motifs in the same cluster together with hidden Markov models.

After clustering analysis, we merged the single-cell ALLC files into pseudo-bulk level using the allcools merge-allc command. Next, we used allcools generate-base-ds to generate the BaseDS from multiple ALLC files. The BaseDS was a Zarr dataset storing sample-by-base tensors for the entire dataset and allowed querying cytosines by genome position and methylation context (CpG and CpH). Next, we performed DMR calling as previously described9,23,84 using the ALLCools.dmr.call_dms_from_base_ds and ALLCools.dmr. DMSAggregate functions that were reimplemented for BaseDS. In brief, we first calculated CpG differential methylated sites using a permutation-based root mean square test84. The base calls of each pair of CpG sites were combined before analysis. We then merged the differential methylated site into a DMR if they were within 250 bp and had PCC > 0.3 across samples. We were able to downsample the coverage at each base in each sample to its base call coverage of 50 because the genome coverage was unbalanced.

To investigate the relationship between gene status and the surrounding chromatin conformation diversity, we first performed quantile normalization along subclasses using the Python package qnorm (v.0.8.0)88 to normalize the transcript body mCH fractions and contact strengths of highly variable interactions. If any of the interactions with the gene body overlap with the transcript body, we calculated the PCC between the two. To calculate the null distribution and estimate the FDR, we used the shuffled matrix and shuffled the contact strengths and mCH fractions within each sample. FDR is selected as a significant correlation.

We used the imputed cell-level contact at the 10-kb resolution to perform the highly variable interaction analysis, for which the interaction represented one 10 kb-by-10 kb pixel in the conformation matrix. We looked at interactions that had one of the anchor with the ENCODE blacklist. We then performed ANOVA for each interaction to see if there was any significant difference in contact strength between subclasses. The overall variability of interaction across the brain was represented by the F statistics. To select highly variable contacts, we used F > 3 as the cut-off value, which was decided by visually inspecting the contact maps as well as fulfilling the FDR < 0.001 criteria. ANOVA was only performed on interactions for which anchor distance was between 50 kb and 5 Mb, given that increasing the distance only led to a limited increase in the number of significantly variable and gene-correlated interactions (Extended Data Fig. 10b).

We used the imputed cell-level contact matrices at 25-kb resolution to identify domain boundaries within each cell using the TopDom algorithm89. We first filtered out the boundaries that overlapped with ENCODE blacklist (v.2)75. The domain boundary of a bin is the proportion of cells having the bin in a group or subclass. We used a contingency table for each bin to identify the differential domain boundaries between n cell subclasses. The values in each row represent the number of cells from the group that has the bin called a boundary or not as a boundary. We computed the Chi-square statistic and P value for each bin and used FDR < 1 × 10–3 as the cut-off for calling 25-kb bins with differential boundary probability.

We used the imputed subclass-level contact matrices at the 100-kb resolution to analyse the compartment. The bins that were screened were the ones that overlap with the blacklist or had abnormal coverage. Specifically, the coverage of bin i on chromosome c (denoted as Rc,i) was defined as the sum of the i-th row of the contact matrix of chromosome c. We only kept the bins that had coverage between the 99th percentile of Rc and the median of Rc. The PCA34 was performed using the PCA of the normalized matrices. The model was fit with the IncrementalPCA class using all of the cell subclass matrices. We changed all of the cell subclasses with the fitted model so that the PCs for each subclass were changed from the same loading. The correlation was calculated between PC1 or PC2 and 100-byte bin CpG. We used the component with higher absolute correlation to assign the compartment with higher CpG density with positive scores.

CCA was also performed with the downsampling framework using 100,000 cells from each dataset as a reference and the others as query, but taking the TF–IDF transformed matrices as input. Using the reference cells’ CCV and IDF, the query cells were projected to the same space. Specifically, ({{B}{{\rm{m}}}}{{\rm{ref}}}) and ({{B}{{\rm{a}}}}{{\rm{ref}}}) were converted to ({{X}{{\rm{m}}}}{{\rm{ref}}}) and ({{X}{{\rm{a}}}}{{\rm{ref}}}), respectively, with TF–IDF, and the CCVs (denoted as ({U}{{\rm{ref}}}) and ({V}{{\rm{ref}}})) were computed using ({U}{{\rm{ref}}}S{V}{{\rm{ref}}}^{T}={{X}{{\rm{m}}}}{{\rm{ref}}}{{X}{{\rm{a}}}}{{\rm{ref}}}^{T}). Then ({{B}{{\rm{m}}}}{{\rm{qry}}}) and ({{B}{{\rm{a}}}}{{\rm{qry}}}) were converted to ({{X}{{\rm{m}}}}{{\rm{qry}}}) and ({{X}{{\rm{a}}}}{{\rm{qry}}}), respectively, with TF–IDF using the IDF of reference cells, and the CCVs (denoted as ({U}{{\rm{qry}}}) and ({V}{{\rm{qry}}})) were computed using ({U}{{\rm{qry}}}={X}{{{\rm{m}}}{{\rm{qry}}}}({X}{{{\rm{a}}}{{\rm{ref}}}}^{T}{V}{{\rm{ref}}})/S) and ({V}{{\rm{qry}}}={X}{{{\rm{a}}}{{\rm{qry}}}}({X}{{{\rm{m}}}{{\rm{ref}}}}^{T}{U}{{\rm{ref}}})/S). The steps to find anchors and align. Um and Ua were the same as integrating the mC and RNA data.

Source: Single-cell DNA methylome and 3D multi-omic atlas of the adult mouse brain

Embedding of single cells into $rm T$-matrices and $mathrmIDF_ij$

$${{X}{{\rm{a}}}}{ij}=\log ({{{\rm{T}}{\rm{F}}{\rm{r}}{\rm{e}}{\rm{q}}}{{\rm{a}}}}{ij}\times \mathrm{100,000}+1)\times {{\rm{I}}{\rm{D}}{\rm{F}}}_{j}$$

$${{{\rm{T}}{\rm{F}}{\rm{r}}{\rm{e}}{\rm{q}}}{{\rm{a}}}}{ij}=\frac{{{B}{{\rm{a}}}}{ij}}{\mathop{\sum }\limits_{{j}^{{\prime} }\,=1}^{{\rm{No.}}\,{\rm{bins}}}{{B}{{\rm{a}}}}{{ij}^{{\prime} }}}$$

({X}{ij}=\log ({{\rm{T}}{\rm{F}}{\rm{r}}{\rm{e}}{\rm{q}}}{ij}\times \mathrm{100,000}+1)\times {{\rm{I}}{\rm{D}}{\rm{F}}}{j}), where ({{\rm{T}}{\rm{F}}{\rm{r}}{\rm{e}}{\rm{q}}}{ij}={B}{ij}/{\sum }{{j}^{{\prime} }\,=1}^{{\rm{N}}{\rm{o}}.\,{\rm{b}}{\rm{i}}{\rm{n}}{\rm{s}}}{B}{i{j}^{{\prime} }}) and ({{\rm{IDF}}}{j}=\log (1+{\rm{no.}}\,{\rm{cells}}/\mathop{\sum }\limits_{{i}^{{\prime} }=1}^{{\rm{No.}}\,{\rm{cells}}}{B}_{{i}^{{\prime} }j})). The embedding of single cells U was then computed using singular value decomposition of X, where (X=US{V}^{T}). We fit the LSI model with mCG data Bm to derive Um. The intermediate matrices S and V and vector IDF were used to transform the ATAC data Ba to Ua, by

LSI with log term frequencies were used to calculate the embedded. B was normalized by dividing the row sum of the matrix into two equal parts, and then converted into X by taking the inverse document Frequency and converting it to term Frequency.

PCA on gene-body signals was insufficient to capture the open chromatin heterogeneity in snATAC-seq data10,30. Latent semantic indexing (LSI) applied to binarized cell-by-5-kb bin matrices had demonstrated promising results for snATAC-seq data embedding and clustering30. Therefore, to align snATAC-seq data with snmC-seq data at a high resolution, we developed an extended framework based on the previously described approach30 to utilize binary sparse cell-by-5-kb bin matrices as input.

The memory crunch made it difficult to scale up the original framework of Seurat up to millions of cells. To handle this limitation, we randomly selected 100,000 cells from each dataset (({X}{{\rm{ref}}}) and ({Y}{{\rm{ref}}})) as a reference to fit the CCA and transformed the other cells (({X}{{\rm{qry}}}) and ({Y}{{\rm{qry}}})) onto the same CC space. The correlation between X_rmref and (Y_rmRef) was denoted. Then the CCV of ({X}{{\rm{qry}}}) and ({Y}{{\rm{qry}}}) (denoted as ({U}{{\rm{qry}}}) and ({V}{{\rm{qry}}}), respectively) were computed using ({U}{{\rm{qry}}}={X}{{\rm{qry}}}({Y}{{\rm{ref}}}^{T}{V}{{\rm{ref}}})/S) and ({V}{{\rm{qry}}}={Y}{{\rm{qry}}}({X}{{\rm{ref}}}^{T}{U}{{\rm{ref}}})/S), respectively. The anchor identification was made using the embedded data from the reference and query cells.

We applied the DMR calling framework across subclasses of the whole mouse brain and cell clusters within each major region. The two sources of DMRs were combined to capture the CpG fraction diversity in different cell-type granularities. There were around 10 million unique DMRs after the combination. The final non-overlapping DMR list included 2.56 million DMRs. The Data available section shows all the overlap and non-overlapping DMRs. In the subsequent analysis, when DMR was used to calculate correlation or scan motif occurrence, we started with the 10 million overlapping DMRs. We selected the DMR with the strongest value (that is, most significant PCC or highest motif score) among the overlapping ones. The DMRs in the final results were non-overlapping.

For each clustering round, we assessed whether a cluster required additional clustering in the next iteration based on two criteria: (1) the final prediction model accuracy exceeded 0.9, and (2) the cluster size surpassed 20. In total, we executed four iterative clustering rounds, which produced the following cluster numbers: 61 (L1), 411 (L2), 1,346 (L3) and 2,573 (L4). We separated the cells in L4 clusters using the brain dissection region information. We first separated dissection regions that had more than 20 cells into different groups based on the average distance between their nearest and furthest parts, and then we combined other dissection regions with fewer than 20 cells into the same group. The groups combined L4 clusters and dissection regions. Incorporating dissection region data, which provided insights into the physical location of a cell, enhanced the flexibility of the analysis. Furthermore, we acknowledged that generating pseudo-bulk profiles from cell-level data demanded substantial computational resources. Aggregating cells at a higher granularity initially facilitated more straightforward merging later, such as combining them at the subclass level during subsequent analyses.

There is consensus clustering. We first performed Leiden clustering82 200 times, using different random seeds. We then combined these result labels to establish preliminary cluster labels, which were typically larger than those derived from a single Leiden clustering owing to its inherent randomness82. We trained models in the PC space to predict labels and make sense of the confusion matrix. The clusters merged to make them less confusing. The confusion matrix is applied to the cluster selection by R1 and R2 according to the package83.

Source: Single-cell DNA methylome and 3D multi-omic atlas of the adult mouse brain

t-SNE and UMAP79 principal component embeddings for visualization. I. Data and Code availability of all Cools9, Xarray67, scHiCluster70 and ALLCools

Calculate principal components (PCs) in the selected cell-by-CEF matrices and generate the t-SNE78 and UMAP79 embeddings for visualization. t-SNE was performed using the openTSNE80 package using previously described procedures81.

Both snmC-seq3 and snm3C-seq samples followed the same library preparation protocol (detailed in Supplementary Methods). This protocol was automated using a Beckman Biomek i7 and Tecan Freedom Evo instrument to facilitate large-scale applications. The snmC-seq3 and snm3C-seq libraries were sequenced on an Illumina NovaSeq 6000 instrument, using one S4 flow cell per 16 384-well plates and using 150 bp paired-end mode. The following software were used to do the process:BD Influx, Freedom EVOware, Illumina MiSeq control and NovaSeq 6000 control.

The large tensor dataset was hosted within the object storage of the cloud platform, which was chunked and compressed. All Cools9, Xarray67, and dask68 were used for the analysis. The SnakeMake package65 and SkyPilot package69 were used to set up cloud environments. Additionally, the ALLCools package (v.1.0.8) was updated to perform methylation-based cellular and genomic analyses, and the scHiCluster70 package (v.1.3.2) was updated for chromatin conformation analyses. In the Data and Code availability section, we provide information on the package version, analysis notebook and pipeline files. For simplicity, the description below focused mainly on key analysis steps and parameters.

Primary QC for DNA methylome cells included the following criteria: (1) overall mCCC level of <0.05; (2) overall mCH level of <0.2; (3) overall mCG level of >0.5; (4) total final reads of >500,000 and <10,000,000; and (5) bismark mapping rate of >0.5. Note that the mCCC level estimates the upper bound of the cell-level bisulfite non-conversion rate. Additionally, we calculated lambda DNA spike-in methylation levels to estimate the non-conversion rate for each sample. The samples had a low non-conversion rate. We chose loose cut-off values for the primary filtering step to prevent potential cell or cluster loss. There are some potential doublets and low-quality cells that can be accessed by clustering-based QC. In snm3C-seq cells, we required cis-long-range contacts (two anchors between2,500 and50,000) for 3C.

All experimental procedures using live animals were approved by the Salk Institute Animal Care and Use Committee under protocol number 18-00006. The mice were bought from Jackson Laboratory at 7 weeks of age and were kept in the animal barrier facility for up to 10 days with food and light. Brains were cut in between P56 and P63 and then sliced and reassembled as described. For snmC-seq3 samples, brains were sliced coronally at 600-μm intervals from the frontal pole across the whole brain, producing 18 slices, and dissected brain regions were obtained according to the Allen Brain Reference Atlas CCF (version 3)59 The data fig is called (CCFv3) 1a) The brain regions were sliced at a speed of 1,200 m, resulting in 9 slices and 6 combined brain areas according to the CCFv3. 1b. For nuclei isolation, each dissected region was pooled from 3–30 animals, and 2–3 biological replicas were processed per region. Supplementary Table 1 has a comprehensive brain dissection Metadata. Statistics were not used to determine sample sizes. We empirically determined the use of two to three biological experiments for all single-cell epigenomic experiments to achieve minimum reproducibility for this large-scale project. Blinding and randomization were not done during the handling of tissue samples. Additionally, all dissected regions were digitally registered into CCFv3 using ITK-SNAP60 (v.4.0.0) at 25 μm resolution (details of the annotated voxel file available in the Data availability section).