Fig. 4: Distribution of beta values after SeSaMe normalization.
From: A mammalian methylation array for profiling methylation levels at conserved sequences

a–c Distribution of beta values (relative intensity) of all probes on the array after SeSaMe normalization for (a) human samples, (b) mouse samples, and (c) rat samples. These cytosines are based on the CMAPS design criteria, i.e., (a) n = 35,453 human cytosines, (b) n = 21,900 mouse cytosines, and (c) n = 18,157 rat cytosines. d–f Analogous to a–c but based on mappable cytosines from QuasR and after using calibration data to identify and remove severely outlying cytosines. Specifically, the lower panels use the respective subsets of cytosines whose Pearson correlation with Percent methylated exceeds 0.8: n = 37,152 CpGs for human, n = 27,966 for mouse, and n = 25,669 for rat. Distributions of beta values are heteroscedastic: values near a fractional methylation of 0.5 are expected to have a higher variance than those near 0 or 1. Based on the binomial distribution, one would expect the variance and mean of the SeSaMe-normalized beta values across designed CpGs to follow the relationship variance = constant*mean*(1 − mean). Indeed, in a separate analysis, we find that the variance is highly correlated with mean*(1 − mean) in mice (Pearson correlation r = 0.92), rats (r = 0.95), and humans (r = 0.86). It can therefore be advisable to use statistical models and distributions that account for the over-dispersion inherent in these data. Both array and sequencing methods that use bisulfite conversion followed by amplification can lead to biases in the ratio of converted to unconverted strands (beta values; ref. 67), which could explain the broad peaks we see in the estimate of calibration data. Each boxplot visualizes the median value and the upper and lower quartiles. The whiskers extend to the most extreme data point that is no more than 1.5 times the interquartile range from the box.
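
The relationship variance = constant*mean*(1 − mean) stated in the legend can be illustrated with a small simulation. The sketch below is not the authors' analysis and uses no array data: it assumes each beta value is a binomial proportion out of a fixed number of molecules, and every function name and parameter value (`n_cpgs`, `n_samples`, `n_molecules`, `seed`) is hypothetical. It then computes the Pearson correlation between the per-CpG variance and mean*(1 − mean), the same kind of check described above.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate_beta_values(true_level, n_samples, n_molecules, rng):
    """Simulate beta values as binomial proportions (an assumption of this sketch)."""
    return [
        sum(rng.random() < true_level for _ in range(n_molecules)) / n_molecules
        for _ in range(n_samples)
    ]


def mean_variance_check(n_cpgs=300, n_samples=40, n_molecules=50, seed=0):
    """Correlate per-CpG variance with mean*(1 - mean) on simulated data."""
    rng = random.Random(seed)
    means, variances = [], []
    for _ in range(n_cpgs):
        # Hypothetical true methylation level for this CpG.
        true_level = rng.uniform(0.02, 0.98)
        betas = simulate_beta_values(true_level, n_samples, n_molecules, rng)
        means.append(statistics.fmean(betas))
        variances.append(statistics.variance(betas))
    predictor = [m * (1 - m) for m in means]
    return pearson(variances, predictor)


if __name__ == "__main__":
    r = mean_variance_check()
    print(f"Pearson r between variance and mean*(1 - mean): {r:.2f}")
```

Under pure binomial sampling the constant equals 1/`n_molecules`; real data that are over-dispersed, as the legend notes, would show a larger constant and additional scatter around this relationship.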