Introduction

Cannabis sativa L. is a versatile crop that has been cultivated for thousands of years for its bast fiber (stem) and hurd fiber (stem pith), edible grain (seed) and cannabinoids (flowers). Cannabis plants produce a substantial number of chemical compounds known as cannabinoids. These cannabinoids comprise C21 terpenophenolic compounds responsible for the medicinal and psychotropic effects obtained from Cannabis use1,2,3. There are over a hundred cannabinoids4 but the two main molecules of interest have always been delta-9-tetrahydrocannabinol (THC) and cannabidiol (CBD), both derived from the same precursor, cannabigerolic acid (CBGA)5. Consumption of high THC (drug-type Cannabis) results in intoxicating and/or analgesic effects6while hemp-type Cannabis contains predominantly non-intoxicating cannabinoids (CBD). De Meijer et al. (2003) and Staginnus et al. (2014) identified a co-dominant locus (B) with two alleles, BT and BD, where BT/BT individuals produce predominantly THC, BD/BD produce predominantly CBD, and BT/BD display an intermediate chemotype7,8. Building on this, Weiblen et al. (2015) and Grassa et al. (2017, 2021) proposed a multi-locus inheritance model in which drug-type cannabis carries a functional THCAS gene linked to inactive CBDAS copies, whereas hemp possess a functional CBDAS gene linked to non-functional THCAS homologs, explaining the segregation of cannabinoid profiles across cultivars9,10,11. In this model, functional CBDAS is classified by the B1 allele and assumed non-functional/inactive CBDAS as the B2 allele where B1/B1 individuals produce predominantly CBD (Type III), B2/B2 individuals produce predominately THC (Type I), and B1/B2 produce both in similar quantities (Type II)12. Classification of Cannabis as hemp or drug-type, is based on the concentration of THC, with 0.3% THC as the allowable upper threshold of industrial hemp (hemp-type Cannabis) as defined by the 2014 and 2018 Farm Bills13,14.

Hemp was introduced to North America in the early 17th century15and was used to make ropes, grain bags, Conestoga wagon covers, and clothing. In the early 20th century, a USDA agriculturalist Lyster Dewey, began researching industrial hemp production to help meet the US demand for a domestic fiber source with initial breeding efforts aimed at improving fiber quality and yield16,17. During this period, industrial hemp production was primarily concentrated in the Midwest region of the US due to favorable growing conditions and accessible hemp processing mills. Production peaked in 1943 when approximately 176,000 acres were cultivated to supply fiber for the World War II war effort18. After the war, all Cannabis effectively became impractical to grow because of the resumed enforcement of the 1937 Marihuana Tax Act and production rapidly halted. Despite the cessation of cultivation, feralized industrial hemp populations were established after escaping production and continued reproducing in the wild across the major prior production regions in the US19. Geographically, these remnants of feral Cannabis populations are widely dispersed and grow in a wide range of climatic conditions. Feral Cannabis thrives under full exposure to sunlight and in nitrogen-rich, well-drained soils where its extensive and deep root structure absorbs enough water for the shoots to withstand hot and dry weather20. As of April 2025, iNaturalist and Global Biodiversity Information Facility (GBIF) online databases currently recorded, 1,799 and 1,216 human observations of feral Cannabis in the US, respectively. Many of these observations occurred in the Midwest and Northeast areas of the US where they were previously cultivated in large acreages. Feral hemp is commonly found growing in disturbed, nutrient-rich lands including pastures, feed lots, farmyards, cultivated field borders, open woods, flood plains, dumpsites, roadsides, and railroad rights-of-way.

The recent legalization of hemp, with its many potential industrial and medicinal applications, has reinvigorated the scientific investigation of Cannabis. Significant scientific advancements have been made in Cannabis genetics and genomics over the last decade. This includes the completion of a several high-quality genome assemblies9,21,22understanding genetic relatedness of major market classes, and studies on the genetic underpinnings of cannabinoid inheritance, sex determination, and many agronomic and morphological traits23,24,25. Cannabis research and breeding efforts lag behind that of many modern crops. The urgency of creating and disseminating industrial hemp genetic resources cannot be overemphasized, with these tasks falling on genebanks to ensure adequate germplasm supply. Despite many germplasm repositories across the globe, only a handful of them show records of deliberate conservation of Cannabis accessions. Governments and universities have been the custodians of large-scale ex-situ germplasm collections for centuries. One such government-owned collection is the USDA-ARS National Plant Germplasm System (NPGS), the world’s largest supplier of germplasm, holding 617,467 accessions representing 17,482 species26. As a result of federal prohibition starting in 1937, hemp germplasm has been excluded from such conservation until the 2021 establishment of the USDA-ARS hemp germplasm repository in Geneva, New York. The current collection holds 581 Cannabis sativa accessions of which 57 are available for distribution.

Escaped, naturalized, and regionally adapted feral cannabis populations that exist across the US represent a potential source of untapped genetic diversity. These populations potentially harbor a wide range of genetic variation lost in cultivated crops, providing opportunities to introduce new traits for improved crop performance. As crops become feral, traits suited to local conditions like drought, extreme temperatures, poor soil, could be selected for. These adaptations are valuable for developing climate-resilient crops by incorporating genetic material from feral populations via breeding. Moreover, studying feral populations offers insights into the evolutionary processes of crop domestication and artificial selection. Several studies have described the population structure and chemical profiles in diverse C. sativa collections27,28,29,30,31,32,33 by using next-generation sequencing methods on mostly outsourced drug type Cannabis from Canada, Europe or Asian origin. These studies have identified well defined groups including basal cannabis, including mostly Chinese genotypes, hemp types further divided into grain and fiber, and both feral and cultivated drug varieties distributed worldwide30,34. Previous studies from four US states (Minnesota, Nebraska, Kansas and Colorado) have explored feral Cannabis diversity and found feral plants to be a rich source of genetic diversity35,36,37. The goal of this study is to populate the recently established USDA-ARS hemp germplasm repository in Geneva, NY, with genetically characterized, diverse, and compliant plant material to provide a valuable resource for the scientific community. To complete this goal, we executed a large-scale collection effort to collect, genotype, and chemotype unique feral Cannabis germplasm from across the US to support research efforts, conserve biodiversity and develop regionally adapted, compliant hemp cultivars.

Results

Sample collection

Seeds and floral tissues from 1,821 feral plants were collected across 12 states and 91 unique populations, in the late summer or early fall of 2022 and 2023 (Fig. 1a and b; Supplemental Table 1). The largest collection of samples came from Wisconsin (n = 589) and the smallest from Colorado (n = 11).

Fig. 1
figure 1

Feral Cannabis collection. (a) Bar plot showing total number of individual feral Cannabis accessions collected by state alongside the number of populations sampled. (b) Midwestern map showing geographical distribution of sampled locations (Colorado and New York excluded).

SNP distribution

Among the 91 sampled populations, 5–20 individual plants were selected for Genotyping-By-Sequencing (GBS). A total of 2,739,241,114 barcoded reads were aligned with the Cannabis genome cs10 v29 resulting in identification of 159,382 polymorphic SNPs. After filtering, the final dataset included 21,037 high quality SNPs. SNP counts ranged from 1,386 on chromosome 7 to 2,829 on chromosome 1 (Supplemental Table 2), with well distributed coverage across all chromosomes (Supplementary Fig. 1).

Population structure

The cross-validation method by ADMIXTURE was unable to resolve the optimum number of clusters (Supplemental Fig. 2) for the samples. However, the Bayesian Information Criterion (BIC) value indicated a rapid decline until K = 5 clusters, after which the value began to rise, suggesting that five groups should be retained (Fig. 2a). Following ADMIXTURE analysis, individual feral Cannabis accessions were assigned into a group based on their maximum proportion of membership (Q) value (Fig. 2b). To avoid groups with recent admixture between populations or cultivated Cannabis, a > 70% membership threshold was used to assign subpopulations identity, retaining 346 samples (Supplementary Table 3). Group 1 (Mississippi-River) had the largest number of samples with a total of 177 accessions largely composed of feral collections from eight states (MN, WI, IL, MO, SD, IA, NE and KS) and dominated by accessions from WI (57), MN (52), and IL (36) (Supplementary Table 4). Group 2 (West North Central-a) comprised 96 accessions from SD, NE, IA, ND, CO, KS, MO, IL and MN, however SD (40 accessions), NE (17 accessions), and ND (12 accessions) were the most represented in this group. Group 3 (New York) comprised 14 samples from New York. Group 4 (West North Central-b) comprised 37 samples, with accessions from SD (11), IA (8) and NE (7), similar to Group 2. Finally, group 5 (Indiana) comprised 22 accessions from Indiana. Principal component analysis (PCA) of the ~ 21 K SNPs sorted the feral Cannabis accessions into the same five major groups (Fig. 3a). The majority of the variation explained came from New York and Indiana, respectively, while the other groups clustered tightly. Samples mostly grouped by states with a few minor exceptions (Fig. 3b and c).

Fig. 2
figure 2

Population ADMIXTURE. (a) Graph of Bayesian Information Criterion versus number of clusters. (b) Admixture ancestry of 346 feral Cannabis accessions representing Group 1 (Mississippi-River), Group 2 (West North Central-a), Group 3 (New York), Group 4 (West North Central-b), and Group 5 (Indiana), respectively (from left to right).

Fig. 3
figure 3

Principle Component Analysis of (a) 346 feral Cannabis plants colored by ADMIXTURE clusters, (b) collection states, and (c) collection states after removing Indiana and New York samples.

Genetic differentiation

The FST values between the inferred groups ranged from 0.07 to 0.39 (Table 1). Pairwise genetic differentiation was highest between the New York and Indiana groups (FST = 0.39), with very low values (FST = 0.07–0.09) between Mississippi-River, West North Central-a and West North Central-b. Sliding window FST analysis revealed extremely high FST values on chromosome 4 and the X chromosome between groups (Fig. 4a and b, Supplementary Fig. 3). The SNPs associated with elevated FST (65–90 Mb) were analyzed separately by PCA showing three discrete clusters (Supplementary Fig. 4a and b) explaining 63.9% of the observed variation. Inspection of this region indicated a large genome rearrangement with respect to the CS10 v2 genome that has been observed previously in feral samples22,36. The inversion genotype (+/+) was characterized for each of the five groups and was found at 93% and 91% in the West North Central-a (Group 2) and Indiana (Group 5) populations, respectively (Table 2).

Table 1 Pairwise comparison of genetic differentiation (FST values) among the five observed groups.
Fig. 4
figure 4

Genetic differentiation between all groups as compared to the “West North Central - a” group as determined by sliding window FST analysis of 100 kb bins across (a) the entire genome and (b) a highly differentiated region (65–90 Mb) on Chromosome 4. The red dashed lines indicates the top 5% of FST values for each comparision.

Table 2 Frequency of chromosome 4 inversion across subpopulations.

Genetic diversity analysis

Minor allele frequency (MAF) observed and expected heterozygosity (Ho and He), polymorphism information content (PIC), and the inbreeding coefficient (FIS) are used as indicators of the genetic diversity within a population38. The average Ho, He, MAF, PIC, FIS of the 346 accessions were 0.20, 0.20, 0.19, 0.26 and − 0.03, respectively (Table 3). Analysis of the genetic diversity of the five Cannabis groups showed MAF ranged between 0.12 and 0.22, while polymorphic information content (PIC) was between 0.15 and 0.30. The Ho, MAF and PIC of the New York population were lower than those of the Midwestern populations. Hierarchical Analysis of Molecular Variance (AMOVA) based on ADMIXTURE groupings found approximately 9% variation between groups, 20% between accessions that were clustered within groups and extensive variation (71%) among all samples (Supplementary Table 5).

Table 3 Diversity indices for the entire feral Cannabis collection and by subpopulation.

Linkage disequilibrium

LD decay occurred rapidly with increasing physical distance. The average maximum r2 value was 0.47. As LD decayed to its half maximum (r2 = 0.24), the corresponding physical distance was approximately 6.7 kb (Supplementary Fig. 5).

CBDAS genotyping

Among US feral Cannabis collections, CBDAS genotypes were resolved for approximately 1,400 accessions from 90 populations located among 12 states into one of three diploid classes (Table 4; Supplementary Table 6). Among all accession genotypes, 1120 were homozygous for functional (B1), 91 were homozygous for non-functional (B2), and 208 were heterozygous (B1/B2) comprising study wide allele frequencies of 0.86 and 0.14 for functional (B1) and non-functional (B2) CBDAS, respectively (Supplementary Table 7). Among a subset of 53 populations with > = 10 resolved genotypes, functional CBDAS (B1) allele frequency ranged from 0.99 to as low as 0.4 among 36 populations and was fixed at 1.0 among 17 populations (Supplementary Table 7). The CBDAS allele frequency among this population subset varied geographically across a region comprising 10 contiguous upper midwestern states (Fig. 5). Populations from which accession genotypes were exclusively homozygous for functional CBDAS were located in all but two states, North Dakota and Indiana, at the northeastern and southeastern margins of the upper Midwest sampling region. A total of five populations with exclusively functional CBDAS alleles were also composed of individuals belonging entirely to the Mississippi River group. In addition, three populations outside of the upper Midwest, two in Colorado and one in New York, were completely homozygous for functional CBDAS (Supplementary Table 8).

Table 4 Cannabinoid content and genotype frequencies among feral Cannabis populations.
Fig. 5
figure 5

CBDAS allele frequency variation among populations with >/= 10 resolved genotypes. Population localities are indicated by small black circles. Pie-charts display the proportion of functional (B1; white) and non-functional (B2; black) CBDAS alleles observed within populations among individuals for which a CBDAS genotype was resolved. Populations not linked to a chart are fixed where fB1 = 1.

Cannabinoid analysis

Among US feral Cannabis collections across the 10 states comprising the upper Midwest sampling region (Fig. 1b), female floral tissues (pistillate inflorescences) from 523 accessions were analyzed by gas chromatography for determination of the % dry weight content of eight cannabinoid compounds except for 30 accessions for which determinations included seven compounds (omitting ∆8-THC; Supplementary Table 6). When categorized by the ratio of ∆9-THC: CBD content, these accessions included 91 THC-type, 233 intermediate-type, and 244 CBD-type feral Cannabis plants selected from 61 populations (Supplementary Table 8). Total cannabinoid content ranged 0.23–2.96%, 0.29–4.09%, and 0.21–4.73% among accessions classified as THC-type, intermediate-type, and CBD-type, averaging (± SE) 0.98% (0.07), 1.38%, and 1.50%, respectively (Table 4; Supplementary Table 8). Histograms of total cannabinoid content among 523 feral Cannabis accessions are shown in Fig. 6, along with the histograms for a subset of accessions meeting criteria for assignment to one of five ADMIXTURE groups.

Fig. 6
figure 6

Distribution of total cannabinoid content (% dry weight of “female flower” or pistillate inflorescence) among plants differing in the ratio of THC: CBD. Left, center, and right columns include plants with either an excess of THC (THC-type or type-I), similar amounts of THC and CBD (intermediate or type-II), or an excess of CBD (CBD-type or type-III), respectively. Upper row histograms include all accessions for which cannabinoid content was measured by gas chromatography where N indicates the total number of plants. Lower row histograms include a subset of genotyped plants assigned to each of the three types using the PACE assay. Histograms are proportional to the number of samples sharing ± 0.5% cannabinoid content.

Discussion

US hemp breeding programs have been legislatively impeded for nearly a century due to federal prohibition under the 1937 Marihuana Tax Act and the Controlled Substances Act of 1970. Federal drug law policies prohibited the cultivation and use of Cannabis in any form, crippling potential industrial and medicinal use. Interests in industrial hemp production increased with the passage and expansion of the 201413 and 2018 Farm Bill14. Still, the limited availability of complaint and diverse Cannabis germplasm represents a major bottleneck for breeding and research efforts.

Although the last legal US hemp crops were planted in 195839,40, remnants of domesticated industrial hemp have escaped to form naturalized, thriving feral populations across regions of the US where it had been previously cultivated. Additional dispersal of seed by birds, animals, and anthropogenic activities might have resulted in further establishment in regions where hemp had not previously been cultivated. Sites suggestive of such secondary range expansion include roadsides, railroad rights-of-way, livestock feedlots, and agricultural boundaries especially as associated with trees and utility lines. We relied on university collaborators, farmers, and citizen scientists to identify and/or collect feral Cannabis plants across the US based on sites recorded in iNaturalist or received by word of mouth. Whereas site identification was essentially straightforward, collection of samples was occasionally challenging because of residual stigma associated with Cannabis. Additionally, occurrences within interstate highway corridors, on private property, and accompanied by dense vegetation made collection difficult. Wider identification of feral Cannabis occurrence and collection is particularly needed in the West, South, and Northeast geographical regions of the US. Meeting this need will require effective communication about the potential economic benefits of adopting industrial hemp production among existing agricultural crops.

Five subpopulations were identified among US feral Cannabis collections. Generally, samples clustered according to state of origin showing that geographical ___location explains genetic variation. Most notable were two clusters, one containing accessions primarily from Indiana and the other containing accessions from New York. The remaining three clusters were less variable and contained samples from states along the Mississippi River basin and within West North Central states.

Pairwise comparison of fixation indices (FST) between subpopulations found genetic distance to be in accordance with geographical ___location with the most divergent subpopulations from Indiana (group 5) and New York (group 3). The FST values from this study are higher than those previously recorded in numerous studies focused on drug-type Cannabis27,28,30,36which is likely attributed the broader genetic background of hemp as compared to drug-type Cannabis. The least divergent subpopulations (FST = 0.07–0.09) may have resulted from a single introgression to the Upper Midwest during the WWII “hemp for victory” era followed by dispersal by humans or animals across the sampled regions. Interestingly, many of the collection sites in these regions were either along railroad rights-of-way or livestock facilities. High FST between the Indiana and New York groups indicate separate introgressions. Indeed, hemp arrived at different geographical regions at distinct periods throughout history with samples arriving in the Northeast as early as the 16th century, Kentucky in the 18th century and Wisconsin during the 20th century20,41,42. The three primary subpopulations identified in the current study may have originated from distinct introgression and feralization events and therefore harbor unique alleles based on origin.

High pairwise FST scores were found for all comparisons indicating regions of the genome that may be under selective pressures across different regions. The highest FST values were associated with genomic regions on chromosomes 4 and X (Fig. 4; Supplementary Fig. 4). Previous work identifies similar peaks and hypothesizes that the regions may be the result of chromosomal inversions maintained at high frequency in the feral US populations36. The Cannabis pangenome has confirmed a large-scale inversion on chromosome 4 in a feral collection from Boone County, Iowa as compared to high cannabinoid cultivars22. We find this same elevated FST primarily in samples from the West North Central-a and Indiana groups suggesting it is segregating within US feral populations. Low global FST observed between West North Central-a and -b groups, indicates these accessions are mostly differentiated by this inversion. Analysis of the 15 Mb region surrounding the chromosome 4 inversion explains 65% of the variation among samples in the PCA (Supplementary Fig. 4). Because this inversion is found in some frequency across all subpopulations, it is likely this inversion occurred prior to the introgression of hemp to the United States. Fixation in several populations suggests that it may confer a selective advantage in certain environments.

The expected heterozygosity level (0.20) for the accessions in our study was similar to the observed heterozygosity (0.20) indicating the populations are not experiencing significant selective forces such as inbreeding or genetic draft (Table 3). Our observed heterozygosity is more similar to what has been previously described in hemp (0.27 − 0.16) as compared to drug-type cannabis (0.12–0.14)27,29. A lower heterozygosity was observed in the New York subpopulation which may be attributed to its small population size and geographic isolation. The negative value of inbreeding coefficients recorded in the MS-River, WNC-a and WNC-b subpopulations, imply increased heterozygosity and higher genetic diversity when compared to the Indiana and New York feral values reflecting random mating and genetic stability among the populations. Polymorphism Information Content (PIC) is a measurement of how well a genetic marker can distinguish different individuals in a population. It is a key indicator of marker quality in genetic studies and is used to predict the likelihood of a marker’s heterozygous genotype being passed on to offspring43. The PIC marker values per population ranged between 0.15 and 0.30, indicating they are very informative at discriminatory analysis between feral populations.

High genetic variability from AMOVA as observed within accessions followed by between accessions within groups and lowest between groups suggests potential for trait discovery in breeding programs. The moderate differentiation between groups reiterates the high heterozygosity often recorded in Cannabis associated with wind pollination. Within accession AMOVA values among US feral collections was higher than that of the Spanish and Moroccan Cannabis germplasm studies36,44. Discrepancies in variation contributed by these studies are consistent with other studies and may be due to differences in sample size, sample composition and types of markers used.

Linkage disequilibrium decayed rapidly (6.7 kb) in the feral germplasm. This was similar to a recent report using a mixture of hemp and drug-type ferals (LD = 6.0 kb)30 and faster than in a recent study of drug-type Cannabis averaging between 22.6 and 89.0 kb across chromosomes25. Historic breeding efforts during the 18th and 19th centuries that entailed intentional hybridization as well as wind dispersal of hemp pollen over great distances may account for the plunge in LD decay of feral Cannabis plants. This finding will be important for finding marker-trait associations and developing genotyping resources to adequately cover haplotype blocks.

Total cannabinoid content in Midwestern feral Cannabis was rather low compared to drug cultivars although somewhat higher than hemp cultivars45 with population averages (± SE) ranging from 0.62 (0.05) to 4.04 (0.26) (Supplementary Table 8). Most populations contained at least some individual plants with THC content above the US statutory threshold of 0.3%. Such plants were either of intermediate (B1/B2) or THC-type (B2/B2) genotypes as indicated by the presence of non-functional CBDAS alleles at low frequency. Approximately one third of 53 populations with at least 10 resolved accession genotypes were fixed for functional CBDAS, with mixed populations varying in non-functional allele frequency from > 0.1 to 0.6. These observations are consistent with evidence from Minnesota37 suggesting that midwestern “feral hemp” does not neatly fit the current statutory definition of hemp but is more accurately described as feral Cannabis. Genotype frequencies in mixed populations were generally skewed in favor of CBD-type, with a lower frequency of intermediates and the occasional THC-type plant. There is little reason to expect that the original Cannabis introductions to the region in the 19th century for fiber production were pure hemp, considering that the chemical nature of intoxicating versus non-intoxicating Cannabis (THC versus CBD) was not discovered until decades later46,47. Populations at or near Hardy-Weinberg equilibrium with relatively low yet varying non-functional CBDAS allele frequencies might be interpreted as subject to local genetic drift with little or no selection on cannabinoid content and/or little immigration of THC-type genetics from elsewhere. However, several populations deviated from this pattern with a high proportion of intermediate-type plants or a non-equilibrium excess of THC-type plants, which might be interpreted as the result of introgression from more recent drug-type cultivation. Genotyping of historical specimens from herbarium collections might shed light on how and whether cannabinoid genetics has changed over time48.

CBDAS genotypes as measured by the PACE assay49 accurately predicted each THC: CBD ratio class in most cases (87%). In three cases, cannabinoid content was so close to the limit of detection by gas chromatography (0.01%) that the THC: CBD ratio lacked meaning, and we do not consider them to challenge the accuracy of the genetic assay. However, 33 of 465 accessions (13%) had CBDAS genotypes that did not correlate with the THC: CBD ratio even after independent technical replication. Such plants merit further investigation of the possibility that allelic variation at the primer binding sites could skew the assay. This seems more likely than a departure from the simple genetic mechanism of codominant inheritance11 as others have suggested50. Another limitation of the PACE assay that we observed when applied to feral Cannabis was a relatively high failure rate (20%) in which some accessions failed to amplify (18%) while others (2%) yielded ambiguous fluorescence patterns where it was not possible to distinguish between the intermediate genotype and a homozygous state.

Feral accessions are important genetic reservoirs of useful alleles and superior traits for the improvement of Cannabis plants for fiber, grain or cannabinoids51,52. The persistence of these feral Cannabis populations in selected habitations and even spread to new localities after decades of naturalization attest to their resilience and importance as new tools for breeding improved cultivars with increased adaptation to biotic and abiotic stress factors. The conservation of these Cannabis collections in the USDA-ARS Hemp Germplasm Repository will ensure the long-term security of these genetic resources against any future political or societal pressures and their genetic integrity will be continually maintained through thorough regeneration practices. Conservation of these diverse feral Cannabis materials in the genebank promises in contrast to historic absence under decades-long prohibition, to provide access to diverse and compliant Cannabis germplasm.

Materials and methods

Germplasm collection

To identify feral Cannabis sites, we used the iNaturalist application (a joint initiative by the California Academy of Sciences and the National Geographic Society), an American nonprofit social network of citizen scientists and biologists who identify and share information about biodiversity across the globe. Additional locations were shared by collaborators via word of mouth. Seeds and floral tissue from 1,815 female hemp plants from 88 unique populations were collected across twelve states in 2022 and 2023 (Fig. 1a; Supplementary Table S1; Supplementary Material). An additional feral population from New York was obtained from USDA-ARS (Geneva, NY) and seeds from two Colorado populations were contributed from Colorado State University. All feral cannabis was collected with permissions under USDA Hemp Producer License USDA_55_R0091.

Sample preparation and genotyping

A subset of 760 samples, representing twelve states and ecogeographic US locations, were selected for genotyping. Leaf tissue from individual plants was lyophilized (Labconco, Kansas City, MO, USA), then approximately 20 mg dry weight was homogenized (SPEX SamplePrep, Metuchen, NJ, USA), and genomic DNA was extracted using NucleoSpin 96 Plant II core kit (Macherey-Nagel GmbH & Co., Duren, Germany). DNA quality was checked by 1% agarose electrophoresis and quantified with NanoDrop ND-1000 Spectrophotometer (Thermo Fisher Scientific, USA). The feral hemp DNA quantification, library preparation, and sequencing was constructed using the genotyping-by-sequencing (GBS) protocol53 at the University of Wisconsin-Madison Biotechnology Center. In summary, Illumina adapters and sample-specific barcodes were annealed after restriction enzyme digestion using ApeKI. Multiplexed samples were then sequenced on Illumina NovaSeq 6000 (Illumina, Inc., San Diego, CA, USA), generating on average 3.5 million, 150 bp paired end reads per sample.

Variant discovery and data filtering

The Tassel GBS Discovery Pipeline v254 was used for variant detection, using the CS10 reference genome v29 (Genbank assembly accession ID = GCF_900626175.2) with BWA-MEM55. Raw VCF files were filtered for SNPs > 10% mising data, < 5% mior allele frequency, > 10% mising individuals, and finally pruned to reduce redundancy by removing SNPs that are highly correlated with others (--geno 0.1, --maf 0.05, --mind − 0.1, --indep-pairwise 50 2 0.2) using PLINK V.1.90b6.2156. The final VCF file contained 21,037 SNPs (~ 21 K) used in downstream analyses. Chromosome-wide SNP count and density plots were performed (https://github.com/YinLiLin/R-CMplot) in the R environment (version 4.3.2).

Population structure and principal component analysis

Population structure was determined using ADMIXTURE57 to estimate the genetic ancestries of each feral accession, and cross-validation plots were observed for the optimum number of clusters. ADMIXTURE identifies K genetic clusters, where K is specified by the user (K2-K40), from the provided SNP data. For each individual, the ADMIXTURE method estimates the probability of membership to each cluster using Bayesian Information Criterion. Low admixture samples were assigned by setting the membership coefficient to 0.7. Eigenvalues and eigenvectors for genome-wide principal component analysis (PCA) were generated using PLINK open-source analysis toolset and then plotted in the R-software58package “ggplot59.

Genetic differentiation and diversity indices

To measure genome-wide genetic differentiation (FST) and molecular variance (AMOVA) among identified subgroups, poppr60 and Hierfstat61 R packages were employed. Diversity indices (Ho = Observed heterozygosity, He = Expected heterozygosity, MAF = Minor allele frequency, PIC = Polymorphic information content, and FIS = Inbreeding coefficient) were estimated with PLINK. To estimate localized genetic differentiation across the genome, pairwise FST was calculated using VCFtools with a 100,000 bp, non-overlapping sliding window and plotted in R using package “ggplot”.

Linkage disequilibrium

The LD squared allele frequency correlation (r2) estimate for each pairwise SNP comparison was generated in TASSEL 5.0 and visualized by plotting r2 values against physical distance in R package “ggplot”. A nonlinear regression curve was used in the estimation of LD decay. The LD decay rate was determined by the point of intersection with the LD curve and half its maximum value (r2 = 0.24).

CBDAS locus genotyping

To resolve CBDAS locus genotypes of accessions, we used the CCP1 primer set49 that has been validated to detect and distinguish presence of functional (B1) and non-functional (B2) psuedogenous alleles of CBDAS that are tightly and opposingly linked to non-functional pseudogenes and a functional allele of the THCAS gene9,11. PACE™ 2.0 2X master mix with standard ROX was used according to the manufacturer’s protocol (3CR Bioscience Limited) to produce fluorescently labeled CBDAS locus real-time PCR fragments on Bio-Rad CFX96, Roche 980, or Thermo-Fisher QuantStudio qPCR instruments following respective PACE™ protocols for bi-allelic genotyping. Genomic DNA from feral Cannabis accessions were prepared as assay templates from dried leaf or floral tissues using either Extract-N-Amp™ Plant PCR (Millipore Sigma) extraction reagents or using a Nucleospin 96 Plant II core kit (Macherey-Nagel GmbH & Co., Duren, Germany) kit according to the respective manufacturer’s protocol.

Chemical cannabinoid diversity

Dried floral tissue samples were analyzed using gas chromatography62 to measure the percentage of total inflorescence dry mass of: cannabichromene (CBC), CBD, cannabigerol (CBG), cannabinol (CBN), delta-8-tetrahydrocannabinol (d8-THC), THC, and tetrahydrocannabivarin (THCV). We report total cannabinoid content as the sum of these individual compounds as well as the percent CBD and THC content. Cannabinoid profiling was conducted on up to five plants of each genotype, as determined by the PACE assay, per population.